Alerting System Setup

The essential playbook for setting up alerting in your SaaS.

This page shows a minimal, practical alerting setup for a small SaaS. The target is simple: detect user-facing failures fast, route alerts to the right place, and avoid noisy rules that get ignored.

Use a layered setup:

  • uptime monitoring for availability
  • error tracking for application exceptions
  • host metrics for CPU, memory, disk, and process health
  • optional business-flow alerts for signups, auth, and payments

This is enough for most MVPs and early production systems.

Quick Fix / Quick Setup

bash
# Minimal alerting stack for a VPS-hosted SaaS
# 1) Add uptime checks with Uptime Kuma or an external monitor
# 2) Add Sentry for application exceptions
# 3) Add basic host alerts with Prometheus node_exporter + Grafana alerts
# 4) Send alerts to Slack/email

# Example: install node_exporter on Ubuntu
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xvf node_exporter-1.8.1.linux-amd64.tar.gz
sudo cp node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/

sudo useradd --no-create-home --shell /bin/false node_exporter || true
cat <<'EOF' | sudo tee /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter

# Quick check
curl http://127.0.0.1:9100/metrics | head

# Example alert conditions to start with:
# - site down for 2 minutes
# - 5xx error rate > 5% for 5 minutes
# - CPU > 90% for 10 minutes
# - memory available < 10% for 10 minutes
# - disk usage > 85%
# - no successful payment webhook events for 15 minutes (if applicable)

Start with 5–7 high-signal alerts only. Add routing and escalation after the baseline works. Good first destinations: Slack for warnings, email or SMS for critical alerts.

What’s happening

Monitoring collects signals. Alerting decides when those signals require action.

For a small SaaS, the minimum useful alert surface is:

  • availability
  • app exceptions
  • infrastructure saturation
  • worker and queue health
  • business-critical flows like auth and payments

Bad setups usually fail in one of two ways:

  • nothing alerts during a real incident
  • everything alerts constantly and gets muted

A practical setup combines multiple sources:

  • uptime monitor for endpoint reachability
  • application error tracking for exceptions
  • metrics for resource thresholds and latency
  • synthetic checks for login or payment flows if needed

An alert pipeline has four stages:

  • signal source
  • rule
  • notification channel
  • acknowledgement/escalation
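
These stages can be sketched in a few lines of Python. All names here are illustrative, not tied to any specific tool:

```python
# Minimal sketch of the alert pipeline: signal -> rule -> channel.
# Every name below is hypothetical and for illustration only.

def read_signal():
    """Signal source: in practice this is a scrape, a check, or a log query."""
    return {"metric": "cpu_percent", "value": 93.0}

def rule_fires(signal, threshold=90.0):
    """Rule: a predicate over the signal."""
    return signal["value"] > threshold

def notify(signal, channel="slack"):
    """Notification channel: here just a formatted message."""
    return f"[{channel}] {signal['metric']} = {signal['value']} breached threshold"

signal = read_signal()
if rule_fires(signal):
    # Acknowledgement/escalation would track this message in a real system.
    print(notify(signal))
```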

Process Flow

Step-by-step implementation

1) Define critical services

List the components that must work for the product to function:

  • web app
  • API
  • database
  • Redis or queue backend
  • background worker
  • cron/scheduler
  • email delivery
  • payment webhooks
  • object storage access

If a component can fail without users noticing, it should usually be warning-level only.

2) Choose alert destinations

Use simple routing first:

  • Slack: warnings and non-urgent issues
  • email: high severity and fallback
  • SMS or phone: only critical alerts

Do not page yourself for every exception.

Example policy:

  • warning: Slack only
  • high: Slack + email
  • critical: Slack + email + SMS
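
The policy above is just a severity-to-channel table; a minimal sketch (channel names are placeholders):

```python
# Hypothetical routing table mirroring the example policy above.
ROUTING = {
    "warning": ["slack"],
    "high": ["slack", "email"],
    "critical": ["slack", "email", "sms"],
}

def destinations(severity):
    """Return channels for a severity; unknown severities fall back to Slack."""
    return ROUTING.get(severity, ["slack"])
```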

3) Add uptime monitoring

Monitor these endpoints at minimum:

  • homepage
  • login page
  • API health endpoint
  • webhook endpoint health if exposed

Example health endpoint in FastAPI:

python
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    return {"status": "ok"}

Example health endpoint in Flask:

python
from flask import Flask, jsonify

app = Flask(__name__)

@app.get("/health")
def health():
    return jsonify({"status": "ok"})

Suggested starter rules:

  • main site down for 2 consecutive checks
  • API health endpoint down for 2 consecutive checks
  • TLS certificate expiry warning at 14 days, critical at 7 days

For deeper setup, also see Uptime Monitoring Setup.
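
The 14-day/7-day certificate rule is easy to prototype. A sketch that classifies the notAfter date printed by `openssl x509 -noout -enddate` (thresholds match the rules above; the function names are hypothetical):

```python
# Sketch: classify TLS certificate expiry against 14-day/7-day thresholds.
from datetime import datetime, timezone

def days_until_expiry(not_after, now=None):
    """not_after like 'Jun  1 12:00:00 2025 GMT' (openssl notAfter format)."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days

def cert_severity(days_left):
    """Warning at 14 days, critical at 7, as suggested above."""
    if days_left < 7:
        return "critical"
    if days_left < 14:
        return "warning"
    return "ok"
```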

4) Add application error tracking

Use Sentry or equivalent. Alert on:

  • new unhandled exception in production
  • exception rate spike
  • regression of a resolved issue

Basic Python example:

bash
pip install "sentry-sdk[flask]"

python
import sentry_sdk
from sentry_sdk.integrations.flask import FlaskIntegration

sentry_sdk.init(
    dsn="https://YOUR_DSN@sentry.io/PROJECT_ID",
    integrations=[FlaskIntegration()],
    traces_sample_rate=0.0,
    environment="production",
)

For FastAPI:

bash
pip install "sentry-sdk[fastapi]"

python
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration

sentry_sdk.init(
    dsn="https://YOUR_DSN@sentry.io/PROJECT_ID",
    integrations=[FastApiIntegration()],
    traces_sample_rate=0.0,
    environment="production",
)

Set alert rules inside Sentry for:

  • issue frequency spike
  • new issue in production
  • crash-free rate drop if relevant

Related: Error Tracking with Sentry

5) Add host metrics with node_exporter

Install node_exporter on each Linux host.

bash
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xvf node_exporter-1.8.1.linux-amd64.tar.gz
sudo cp node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/

sudo useradd --no-create-home --shell /bin/false node_exporter || true
sudo chown root:root /usr/local/bin/node_exporter
sudo chmod 755 /usr/local/bin/node_exporter

Create a systemd unit:

ini
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Enable it:

bash
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter
curl -s http://127.0.0.1:9100/metrics | head

If Prometheus runs remotely, allow scraping only from the collector IP. Do not expose metrics publicly.

Example UFW rule:

bash
sudo ufw allow from <PROMETHEUS_IP> to any port 9100 proto tcp

6) Scrape metrics in Prometheus

Example Prometheus config:

yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'app1.example.internal:9100'
          - 'worker1.example.internal:9100'

Reload Prometheus after config changes.

bash
curl -X POST http://127.0.0.1:9090/-/reload

7) Create starter alert rules

Example Prometheus rules:

yaml
# /etc/prometheus/alerts/node.yml
groups:
  - name: node-alerts
    rules:
      - alert: HostHighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: high
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage > 90% for 10m"

      - alert: HostLowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 10m
        labels:
          severity: high
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Available memory < 10% for 10m"

      - alert: HostDiskUsageHigh
        expr: (100 - ((node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} * 100) / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage high on {{ $labels.instance }}"
          description: "Disk usage > 85% for 10m"

      - alert: HostDiskUsageCritical
        expr: (100 - ((node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} * 100) / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk usage critical on {{ $labels.instance }}"
          description: "Disk usage > 95% for 5m"

      - alert: NodeExporterDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "node_exporter down on {{ $labels.instance }}"
          description: "Metrics target not reachable"
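
Before trusting the disk rules, you can sanity-check what the 85%/95% thresholds mean on a real filesystem with a short script. This mirrors the rule's used-vs-total calculation, with the caveat that `shutil` reports raw free space while node_exporter's `avail` excludes root-reserved blocks:

```python
# Sketch: compute disk usage percent roughly the way the Prometheus rule does
# (used = total - free) and classify it against the 85%/95% thresholds above.
import shutil

def disk_usage_percent(path="/"):
    usage = shutil.disk_usage(path)
    return 100.0 * (usage.total - usage.free) / usage.total

def disk_severity(percent):
    if percent > 95:
        return "critical"
    if percent > 85:
        return "warning"
    return "ok"

print(f"{disk_usage_percent():.1f}% used -> {disk_severity(disk_usage_percent())}")
```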

8) Configure alert delivery

If using Grafana or Alertmanager, route by severity.

Example Alertmanager config:

yaml
# /etc/alertmanager/alertmanager.yml
route:
  receiver: 'slack-default'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  routes:
    - matchers:
        - severity = "critical"
      receiver: 'email-critical'

receivers:
  - name: 'slack-default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts'
        send_resolved: true

  - name: 'email-critical'
    email_configs:
      - to: 'ops@example.com'
        from: 'alerts@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@example.com'
        auth_password: 'SMTP_PASSWORD'
        require_tls: true

Test your channels before relying on them.

Slack webhook test:

bash
curl -X POST https://hooks.slack.com/services/XXX/YYY/ZZZ \
  -H 'Content-type: application/json' \
  --data '{"text":"test alert"}'

9) Add app and web-server alerts

Useful rules:

  • 5xx rate above 3–5% for 5 minutes
  • p95 latency above threshold for 10 minutes
  • process restarts above baseline
  • Gunicorn worker timeout spikes
  • Nginx upstream failures

If you already ship logs, this pairs well with Logging Setup (Application + Server).
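
The sustained 5xx-rate rule can be prototyped against request data before wiring it into the stack. A minimal sliding-window sketch (class and parameter names are hypothetical):

```python
# Sketch: 5xx error-rate check over a sliding window of recent requests.
from collections import deque

class ErrorRateWindow:
    def __init__(self, window_size=100, threshold_percent=5.0):
        # Keep only the most recent window_size status codes.
        self.statuses = deque(maxlen=window_size)
        self.threshold = threshold_percent

    def record(self, status_code):
        self.statuses.append(status_code)

    def error_rate(self):
        """Percent of requests in the window that returned 5xx."""
        if not self.statuses:
            return 0.0
        errors = sum(1 for s in self.statuses if s >= 500)
        return 100.0 * errors / len(self.statuses)

    def should_alert(self):
        return self.error_rate() > self.threshold
```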

10) Add database and queue alerts

Add only if you run these components.

Database alerts:

  • too many active connections
  • failed connections
  • slow query spike
  • replication lag
  • backup missing or failed

Queue and worker alerts:

  • worker process down
  • queue length too high
  • oldest job too old
  • repeated failures or retry storm

This is critical for background email, webhook processing, and billing jobs.
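
The two core queue rules (backlog length, oldest-job age) reduce to a simple check over pending-job timestamps. A sketch with hypothetical names and thresholds:

```python
# Sketch: queue health checks for backlog length and oldest-job age.
import time

def queue_alerts(job_enqueue_times, max_length=1000, max_age_seconds=900, now=None):
    """job_enqueue_times: unix timestamps of jobs still waiting, oldest first."""
    now = now if now is not None else time.time()
    alerts = []
    if len(job_enqueue_times) > max_length:
        alerts.append("queue length too high")
    if job_enqueue_times and now - job_enqueue_times[0] > max_age_seconds:
        alerts.append("oldest job too old")
    return alerts
```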

11) Add business-flow alerts

These catch failures that system metrics miss.

Good examples:

  • no successful payment webhooks in 15 minutes during active billing periods
  • signup failure rate above threshold
  • login failure spike
  • zero successful checkout events over expected traffic window
  • webhook backlog growing continuously

Do not add these if traffic is too low to make the signal meaningful.
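
Business-flow alerts are mostly absence checks: fire when an expected event has not happened recently. A sketch of the payment-webhook example, including the active-billing caveat (all names hypothetical):

```python
# Sketch: "no successful payment webhooks in 15 minutes" as an absence check.
def webhook_silence_alert(last_success_ts, now, window_seconds=900,
                          billing_active=True):
    """Alert only during active billing periods, per the caveat above."""
    if not billing_active:
        return False
    if last_success_ts is None:
        return True  # never seen a success: treat as alerting
    return now - last_success_ts > window_seconds
```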

12) Make alerts actionable

Each alert should include:

  • service name
  • environment
  • metric name
  • current value
  • threshold
  • likely impact
  • first debugging step
  • dashboard or logs link
  • runbook link

Example annotation style:

yaml
annotations:
  summary: "API 5xx rate high on production"
  description: "5xx rate is {{ $value }}% over 5m"
  runbook: "https://internal.example.com/runbooks/api-5xx"

13) Test every alert intentionally

Do not trust untested alert rules.

Test examples:

  • stop a worker service
  • force a test exception
  • block a metrics port temporarily
  • fill a test filesystem
  • trigger a known 500 on staging
  • send a test alert to Slack and email

Keep a short validation checklist after every alert change.

Severity matrix for warning, high, and critical thresholds:

  Metric          Warning     High       Critical
  Uptime          < 99.9%     < 99.5%    < 99%
  Error rate      > 1%        > 3%       > 5%
  Response time   > 500 ms    > 1 s      > 3 s
  CPU usage       > 60%       > 80%      > 95%
  Memory usage    > 70%       > 85%      > 95%

Common causes

Most alerting failures come from setup gaps, not tooling limits.

Common causes:

  • no uptime checks for user-facing endpoints
  • alert rules created but notification channels never tested
  • thresholds too sensitive, causing alert fatigue
  • only infrastructure alerts exist; no application or business-flow alerts
  • exporters or agents not running after reboot
  • metrics collector cannot scrape targets because of firewall or bind-address issues
  • error tracking installed in development but misconfigured in production
  • critical background workers have no health or backlog alerts
  • alerts lack context, dashboard links, or runbook references
  • no silencing during maintenance, causing noisy deploy-time alerts

Debugging tips

Check each layer separately: source signal, alert rule, delivery channel.

Useful commands:

bash
curl -I https://yourdomain.com
curl -s http://127.0.0.1:9100/metrics | head
systemctl status node_exporter
systemctl status nginx
systemctl status gunicorn
journalctl -u nginx -n 100 --no-pager
journalctl -u gunicorn -n 100 --no-pager
df -h
free -m
top
ss -tulpn
curl -X POST https://hooks.slack.com/services/XXX/YYY/ZZZ -H 'Content-type: application/json' --data '{"text":"test alert"}'
dig yourdomain.com
openssl s_client -connect yourdomain.com:443 -servername yourdomain.com </dev/null | openssl x509 -noout -dates
docker ps
docker logs <container_name> --tail 100

Targeted debugging guidance:

  • If alerts never fire, verify metric collection, rule evaluation, and notification delivery independently.
  • If alerts are noisy, increase for: durations and use rate-based thresholds.
  • If Slack or email delivery fails, test credentials and outbound firewall rules.
  • If uptime checks are flaky, compare external monitor failures with local curl.
  • If host metrics are wrong, confirm exporter uptime and collector reachability.
  • If app error alerts are missing, trigger a controlled production-safe exception.
  • If queue alerts fire constantly, inspect concurrency, retries, and poison jobs.
  • If alerts spike after deploys, correlate failures with deploy timestamps and migrations.

If alerts point to active incidents, continue with Incident Response Playbook or Debugging Production Issues.

Checklist

  • Uptime monitoring configured for primary user-facing endpoints
  • Application exception tracking enabled in production
  • CPU, memory, disk, and process alerts configured
  • Database and queue alerts configured if those services exist
  • Alert destinations tested: Slack, email, SMS if used
  • Alert severities documented
  • Every alert includes environment, service, threshold, current value, and next debugging step
  • Maintenance silence process documented
  • At least one test incident performed to validate end-to-end delivery
  • Runbook links attached to common alerts

For broader release readiness, review SaaS Production Checklist.

FAQ

What alerts should every MVP have first?

Start with endpoint uptime, application exception spikes, server CPU/memory/disk, and process health for web and worker services.

Should alerts be based on logs or metrics?

Use both. Metrics are better for sustained threshold alerts. Logs and error tracking are better for exceptions and event-based failures.

How do I reduce false positives?

Use sustained time windows, rate-based conditions, separate warning from critical, and exclude maintenance periods.
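
A sustained time window is just a debounce: only fire after N consecutive breaching checks. A minimal sketch (names hypothetical):

```python
# Sketch: fire only after N consecutive breaching checks (a simple "for" duration).
class SustainedAlert:
    def __init__(self, required_consecutive=3):
        self.required = required_consecutive
        self.streak = 0

    def observe(self, breached):
        """Feed one check result; True only once the breach is sustained."""
        self.streak = self.streak + 1 if breached else 0
        return self.streak >= self.required
```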

Do I need separate alerts for background jobs?

Yes. Queue backlog, worker offline state, and repeated job failures should be monitored separately from the web app.

How often should I review alert rules?

Review monthly and after every real incident. Remove noisy alerts and add alerts for failures that were missed.

Do I need PagerDuty for a small SaaS?

Usually no. Slack plus email is enough until incidents are frequent or multiple people rotate on-call.

What should page me immediately?

Full downtime, payment processing failures, auth failures, database outage, and worker failures for critical jobs.

How many alerts should I start with?

Keep it small: 5–7 high-signal alerts.

Should I alert on every exception?

No. Alert on spikes, regressions, or high-severity exceptions.

Should staging send alerts?

Usually only to a lower-noise channel, not the main production channel unless you are actively testing.

Final takeaway

A good alerting system for a small SaaS is narrow, tested, and actionable.

Start with:

  • uptime checks
  • exception tracking
  • host resource alerts
  • queue and worker alerts
  • payment and auth flow alerts where relevant

If an alert does not help you act faster, tune it or remove it.