Alerting System Setup

The essential playbook for setting up alerting in your SaaS.

This page shows a minimal, practical alerting setup for a small SaaS. The target is simple: detect user-facing failures fast, route alerts to the right place, and avoid noisy rules that get ignored.

Use a layered setup:

  • uptime monitoring for availability
  • error tracking for application exceptions
  • host metrics for CPU, memory, disk, and process health
  • optional business-flow alerts for signups, auth, and payments

This is enough for most MVPs and early production systems.

Quick Fix / Quick Setup

bash
# Minimal alerting stack for a VPS-hosted SaaS
# 1) Add uptime checks with Uptime Kuma or an external monitor
# 2) Add Sentry for application exceptions
# 3) Add basic host alerts with Prometheus node_exporter + Grafana alerts
# 4) Send alerts to Slack/email

# Example: install node_exporter on Ubuntu
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xvf node_exporter-1.8.1.linux-amd64.tar.gz
sudo cp node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/

sudo useradd --no-create-home --shell /bin/false node_exporter || true
cat <<'EOF' | sudo tee /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter

# Quick check
curl http://127.0.0.1:9100/metrics | head

# Example alert conditions to start with:
# - site down for 2 minutes
# - 5xx error rate > 5% for 5 minutes
# - CPU > 90% for 10 minutes
# - memory available < 10% for 10 minutes
# - disk usage > 85%
# - no successful payment webhook events for 15 minutes (if applicable)

Start with 5–7 high-signal alerts only. Add routing and escalation after the baseline works. Good first destinations: Slack for warnings, email or SMS for critical alerts.

What’s happening

Monitoring collects signals. Alerting decides when those signals require action.

For a small SaaS, the minimum useful alert surface is:

  • availability
  • app exceptions
  • infrastructure saturation
  • worker and queue health
  • business-critical flows like auth and payments

Bad setups usually fail in one of two ways:

  • nothing alerts during a real incident
  • everything alerts constantly and gets muted

A practical setup combines multiple sources:

  • uptime monitor for endpoint reachability
  • application error tracking for exceptions
  • metrics for resource thresholds and latency
  • synthetic checks for login or payment flows if needed

An alert pipeline has four stages:

  • signal source
  • rule
  • notification channel
  • acknowledgement/escalation
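
These stages can be sketched in a few lines of Python. All names here are illustrative, not tied to any specific tool:

```python
# Minimal sketch of the alert pipeline: signal -> rule -> channel.
# Every name below is hypothetical and for illustration only.

def read_signal():
    """Signal source: in practice this is a scrape, a check, or a log query."""
    return {"metric": "cpu_percent", "value": 93.0}

def rule_fires(signal, threshold=90.0):
    """Rule: a predicate over the signal."""
    return signal["value"] > threshold

def notify(signal, channel="slack"):
    """Notification channel: here just a formatted message."""
    return f"[{channel}] {signal['metric']} = {signal['value']} breached threshold"

signal = read_signal()
if rule_fires(signal):
    # Acknowledgement/escalation would track this message in a real system.
    print(notify(signal))
```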

Process Flow

Step-by-step implementation

1) Define critical services

List the components that must work for the product to function:

  • web app
  • API
  • database
  • Redis or queue backend
  • background worker
  • cron/scheduler
  • email delivery
  • payment webhooks
  • object storage access

If a component can fail without users noticing, it should usually be warning-level only.

2) Choose alert destinations

Use simple routing first:

  • Slack: warnings and non-urgent issues
  • email: high severity and fallback
  • SMS or phone: only critical alerts

Do not page yourself for every exception.

Example policy:

  • warning: Slack only
  • high: Slack + email
  • critical: Slack + email + SMS
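
The policy above is just a severity-to-channel table; a minimal sketch (channel names are placeholders):

```python
# Hypothetical routing table mirroring the example policy above.
ROUTING = {
    "warning": ["slack"],
    "high": ["slack", "email"],
    "critical": ["slack", "email", "sms"],
}

def destinations(severity):
    """Return channels for a severity; unknown severities fall back to Slack."""
    return ROUTING.get(severity, ["slack"])
```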

3) Add uptime monitoring

Monitor these endpoints at minimum:

  • homepage
  • login page
  • API health endpoint
  • webhook endpoint health if exposed

Example health endpoint in FastAPI:

python
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    return {"status": "ok"}

Example health endpoint in Flask:

python
from flask import Flask, jsonify

app = Flask(__name__)

@app.get("/health")
def health():
    return jsonify({"status": "ok"})

Suggested starter rules:

  • main site down for 2 consecutive checks
  • API health endpoint down for 2 consecutive checks
  • TLS certificate expiry warning at 14 days, critical at 7 days

For deeper setup, also see Uptime Monitoring Setup.
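
The 14-day/7-day certificate rule is easy to prototype. A sketch that classifies the notAfter date printed by `openssl x509 -noout -enddate` (thresholds match the rules above; the function names are hypothetical):

```python
# Sketch: classify TLS certificate expiry against 14-day/7-day thresholds.
from datetime import datetime, timezone

def days_until_expiry(not_after, now=None):
    """not_after like 'Jun  1 12:00:00 2025 GMT' (openssl notAfter format)."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days

def cert_severity(days_left):
    """Warning at 14 days, critical at 7, as suggested above."""
    if days_left < 7:
        return "critical"
    if days_left < 14:
        return "warning"
    return "ok"
```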

4) Add application error tracking

Use Sentry or equivalent. Alert on:

  • new unhandled exception in production
  • exception rate spike
  • regression of a resolved issue

Basic Python example:

bash
pip install "sentry-sdk[flask]"

python
import sentry_sdk
from sentry_sdk.integrations.flask import FlaskIntegration

sentry_sdk.init(
    dsn="https://YOUR_DSN@sentry.io/PROJECT_ID",
    integrations=[FlaskIntegration()],
    traces_sample_rate=0.0,
    environment="production",
)

For FastAPI:

bash
pip install "sentry-sdk[fastapi]"

python
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration

sentry_sdk.init(
    dsn="https://YOUR_DSN@sentry.io/PROJECT_ID",
    integrations=[FastApiIntegration()],
    traces_sample_rate=0.0,
    environment="production",
)

Set alert rules inside Sentry for:

  • issue frequency spike
  • new issue in production
  • crash-free rate drop if relevant

Related: Error Tracking with Sentry

5) Add host metrics with node_exporter

Install node_exporter on each Linux host.

bash
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xvf node_exporter-1.8.1.linux-amd64.tar.gz
sudo cp node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/

sudo useradd --no-create-home --shell /bin/false node_exporter || true
sudo chown root:root /usr/local/bin/node_exporter
sudo chmod 755 /usr/local/bin/node_exporter

Create a systemd unit:

ini
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Enable it:

bash
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter
curl -s http://127.0.0.1:9100/metrics | head

If Prometheus runs remotely, allow scraping only from the collector IP. Do not expose metrics publicly.

Example UFW rule:

bash
sudo ufw allow from <PROMETHEUS_IP> to any port 9100 proto tcp

6) Scrape metrics in Prometheus

Example Prometheus config:

yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'app1.example.internal:9100'
          - 'worker1.example.internal:9100'

Reload Prometheus after config changes.

bash
curl -X POST http://127.0.0.1:9090/-/reload

7) Create starter alert rules

Example Prometheus rules:

yaml
# /etc/prometheus/alerts/node.yml
groups:
  - name: node-alerts
    rules:
      - alert: HostHighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: high
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage > 90% for 10m"

      - alert: HostLowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 10m
        labels:
          severity: high
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Available memory < 10% for 10m"

      - alert: HostDiskUsageHigh
        expr: (100 - ((node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} * 100) / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage high on {{ $labels.instance }}"
          description: "Disk usage > 85% for 10m"

      - alert: HostDiskUsageCritical
        expr: (100 - ((node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} * 100) / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk usage critical on {{ $labels.instance }}"
          description: "Disk usage > 95% for 5m"

      - alert: NodeExporterDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "node_exporter down on {{ $labels.instance }}"
          description: "Metrics target not reachable"
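
Before trusting the disk rules, you can sanity-check what the 85%/95% thresholds mean on a real filesystem with a short script. This mirrors the rule's used-vs-total calculation, with the caveat that `shutil` reports raw free space while node_exporter's `avail` excludes root-reserved blocks:

```python
# Sketch: compute disk usage percent roughly the way the Prometheus rule does
# (used = total - free) and classify it against the 85%/95% thresholds above.
import shutil

def disk_usage_percent(path="/"):
    usage = shutil.disk_usage(path)
    return 100.0 * (usage.total - usage.free) / usage.total

def disk_severity(percent):
    if percent > 95:
        return "critical"
    if percent > 85:
        return "warning"
    return "ok"

print(f"{disk_usage_percent():.1f}% used -> {disk_severity(disk_usage_percent())}")
```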

8) Configure alert delivery

If using Grafana or Alertmanager, route by severity.

Example Alertmanager config:

yaml
# /etc/alertmanager/alertmanager.yml
route:
  receiver: 'slack-default'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  routes:
    - matchers:
        - severity = "critical"
      receiver: 'email-critical'

receivers:
  - name: 'slack-default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts'
        send_resolved: true

  - name: 'email-critical'
    email_configs:
      - to: 'ops@example.com'
        from: 'alerts@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@example.com'
        auth_password: 'SMTP_PASSWORD'
        require_tls: true

Test your channels before relying on them.

Slack webhook test:

bash
curl -X POST https://hooks.slack.com/services/XXX/YYY/ZZZ \
  -H 'Content-type: application/json' \
  --data '{"text":"test alert"}'

9) Add app and web-server alerts

Useful rules:

  • 5xx rate above 3–5% for 5 minutes
  • p95 latency above threshold for 10 minutes
  • process restarts above baseline
  • Gunicorn worker timeout spikes
  • Nginx upstream failures

If you already ship logs, this pairs well with Logging Setup (Application + Server).
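
The sustained 5xx-rate rule can be prototyped against request data before wiring it into the stack. A minimal sliding-window sketch (class and parameter names are hypothetical):

```python
# Sketch: 5xx error-rate check over a sliding window of recent requests.
from collections import deque

class ErrorRateWindow:
    def __init__(self, window_size=100, threshold_percent=5.0):
        # Keep only the most recent window_size status codes.
        self.statuses = deque(maxlen=window_size)
        self.threshold = threshold_percent

    def record(self, status_code):
        self.statuses.append(status_code)

    def error_rate(self):
        """Percent of requests in the window that returned 5xx."""
        if not self.statuses:
            return 0.0
        errors = sum(1 for s in self.statuses if s >= 500)
        return 100.0 * errors / len(self.statuses)

    def should_alert(self):
        return self.error_rate() > self.threshold
```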

10) Add database and queue alerts

Add only if you run these components.

Database alerts:

  • too many active connections
  • failed connections
  • slow query spike
  • replication lag
  • backup missing or failed

Queue and worker alerts:

  • worker process down
  • queue length too high
  • oldest job too old
  • repeated failures or retry storm

This is critical for background email, webhook processing, and billing jobs.
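
The two core queue rules (backlog length, oldest-job age) reduce to a simple check over pending-job timestamps. A sketch with hypothetical names and thresholds:

```python
# Sketch: queue health checks for backlog length and oldest-job age.
import time

def queue_alerts(job_enqueue_times, max_length=1000, max_age_seconds=900, now=None):
    """job_enqueue_times: unix timestamps of jobs still waiting, oldest first."""
    now = now if now is not None else time.time()
    alerts = []
    if len(job_enqueue_times) > max_length:
        alerts.append("queue length too high")
    if job_enqueue_times and now - job_enqueue_times[0] > max_age_seconds:
        alerts.append("oldest job too old")
    return alerts
```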

11) Add business-flow alerts

These catch failures that system metrics miss.

Good examples:

  • no successful payment webhooks in 15 minutes during active billing periods
  • signup failure rate above threshold
  • login failure spike
  • zero successful checkout events over expected traffic window
  • webhook backlog growing continuously

Do not add these if traffic is too low to make the signal meaningful.
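
Business-flow alerts are mostly absence checks: fire when an expected event has not happened recently. A sketch of the payment-webhook example, including the active-billing caveat (all names hypothetical):

```python
# Sketch: "no successful payment webhooks in 15 minutes" as an absence check.
def webhook_silence_alert(last_success_ts, now, window_seconds=900,
                          billing_active=True):
    """Alert only during active billing periods, per the caveat above."""
    if not billing_active:
        return False
    if last_success_ts is None:
        return True  # never seen a success: treat as alerting
    return now - last_success_ts > window_seconds
```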

12) Make alerts actionable

Each alert should include:

  • service name
  • environment
  • metric name
  • current value
  • threshold
  • likely impact
  • first debugging step
  • dashboard or logs link
  • runbook link

Example annotation style:

yaml
annotations:
  summary: "API 5xx rate high on production"
  description: "5xx rate is {{ $value }}% over 5m"
  runbook: "https://internal.example.com/runbooks/api-5xx"

13) Test every alert intentionally

Do not trust untested alert rules.

Test examples:

  • stop a worker service
  • force a test exception
  • block a metrics port temporarily
  • fill a test filesystem
  • trigger a known 500 on staging
  • send a test alert to Slack and email

Keep a short validation checklist after every alert change.

Severity matrix for warning, high, and critical thresholds:

  Metric          Warning     High       Critical
  Uptime          < 99.9%     < 99.5%    < 99%
  Error rate      > 1%        > 3%       > 5%
  Response time   > 500 ms    > 1 s      > 3 s
  CPU usage       > 60%       > 80%      > 95%
  Memory usage    > 70%       > 85%      > 95%

Common causes

Most alerting failures come from setup gaps, not tooling limits.

Common causes:

  • no uptime checks for user-facing endpoints
  • alert rules created but notification channels never tested
  • thresholds too sensitive, causing alert fatigue
  • only infrastructure alerts exist; no application or business-flow alerts
  • exporters or agents not running after reboot
  • metrics collector cannot scrape targets because of firewall or bind-address issues
  • error tracking installed in development but misconfigured in production
  • critical background workers have no health or backlog alerts
  • alerts lack context, dashboard links, or runbook references
  • no silencing during maintenance, causing noisy deploy-time alerts

Debugging tips

Check each layer separately: source signal, alert rule, delivery channel.

Useful commands:

bash
curl -I https://yourdomain.com
curl -s http://127.0.0.1:9100/metrics | head
systemctl status node_exporter
systemctl status nginx
systemctl status gunicorn
journalctl -u nginx -n 100 --no-pager
journalctl -u gunicorn -n 100 --no-pager
df -h
free -m
top
ss -tulpn
curl -X POST https://hooks.slack.com/services/XXX/YYY/ZZZ -H 'Content-type: application/json' --data '{"text":"test alert"}'
dig yourdomain.com
openssl s_client -connect yourdomain.com:443 -servername yourdomain.com </dev/null | openssl x509 -noout -dates
docker ps
docker logs <container_name> --tail 100

Targeted debugging guidance:

  • If alerts never fire, verify metric collection, rule evaluation, and notification delivery independently.
  • If alerts are noisy, increase for: durations and use rate-based thresholds.
  • If Slack or email delivery fails, test credentials and outbound firewall rules.
  • If uptime checks are flaky, compare external monitor failures with local curl.
  • If host metrics are wrong, confirm exporter uptime and collector reachability.
  • If app error alerts are missing, trigger a controlled production-safe exception.
  • If queue alerts fire constantly, inspect concurrency, retries, and poison jobs.
  • If alerts spike after deploys, correlate failures with deploy timestamps and migrations.

If alerts point to active incidents, continue with Incident Response Playbook or Debugging Production Issues.

Checklist

  • Uptime monitoring configured for primary user-facing endpoints
  • Application exception tracking enabled in production
  • CPU, memory, disk, and process alerts configured
  • Database and queue alerts configured if those services exist
  • Alert destinations tested: Slack, email, SMS if used
  • Alert severities documented
  • Every alert includes environment, service, threshold, current value, and next debugging step
  • Maintenance silence process documented
  • At least one test incident performed to validate end-to-end delivery
  • Runbook links attached to common alerts

For broader release readiness, review SaaS Production Checklist.

FAQ

What alerts should every MVP have first?

Start with endpoint uptime, application exception spikes, server CPU/memory/disk, and process health for web and worker services.

Should alerts be based on logs or metrics?

Use both. Metrics are better for sustained threshold alerts. Logs and error tracking are better for exceptions and event-based failures.

How do I reduce false positives?

Use sustained time windows, rate-based conditions, separate warning from critical, and exclude maintenance periods.
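
A sustained time window is just a debounce: only fire after N consecutive breaching checks. A minimal sketch (names hypothetical):

```python
# Sketch: fire only after N consecutive breaching checks (a simple "for" duration).
class SustainedAlert:
    def __init__(self, required_consecutive=3):
        self.required = required_consecutive
        self.streak = 0

    def observe(self, breached):
        """Feed one check result; True only once the breach is sustained."""
        self.streak = self.streak + 1 if breached else 0
        return self.streak >= self.required
```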

Do I need separate alerts for background jobs?

Yes. Queue backlog, worker offline state, and repeated job failures should be monitored separately from the web app.

How often should I review alert rules?

Review monthly and after every real incident. Remove noisy alerts and add alerts for failures that were missed.

Do I need PagerDuty for a small SaaS?

Usually no. Slack plus email is enough until incidents are frequent or multiple people rotate on-call.

What should page me immediately?

Full downtime, payment processing failures, auth failures, database outage, and worker failures for critical jobs.

How many alerts should I start with?

Keep it small: 5–7 high-signal alerts.

Should I alert on every exception?

No. Alert on spikes, regressions, or high-severity exceptions.

Should staging send alerts?

Usually only to a lower-noise channel, not the main production channel unless you are actively testing.

Final takeaway

A good alerting system for a small SaaS is narrow, tested, and actionable.

Start with:

  • uptime checks
  • exception tracking
  • host resource alerts
  • queue and worker alerts
  • payment and auth flow alerts where relevant

If an alert does not help you act faster, tune it or remove it.