Alerting System Setup
The essential playbook for setting up alerting in your SaaS.
This page shows a minimal, practical alerting setup for a small SaaS. The target is simple: detect user-facing failures fast, route alerts to the right place, and avoid noisy rules that get ignored.
Use a layered setup:
- uptime monitoring for availability
- error tracking for application exceptions
- host metrics for CPU, memory, disk, and process health
- optional business-flow alerts for signups, auth, and payments
This is enough for most MVPs and early production systems.
Quick Fix / Quick Setup
# Minimal alerting stack for a VPS-hosted SaaS
# 1) Add uptime checks with Uptime Kuma or an external monitor
# 2) Add Sentry for application exceptions
# 3) Add basic host alerts with Prometheus node_exporter + Grafana alerts
# 4) Send alerts to Slack/email
# Example: install node_exporter on Ubuntu
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xvf node_exporter-1.8.1.linux-amd64.tar.gz
sudo cp node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter || true
cat <<'EOF' | sudo tee /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter
# Quick check
curl http://127.0.0.1:9100/metrics | head
# Example alert conditions to start with:
# - site down for 2 minutes
# - 5xx error rate > 5% for 5 minutes
# - CPU > 90% for 10 minutes
# - memory available < 10% for 10 minutes
# - disk usage > 85%
# - no successful payment webhook events for 15 minutes (if applicable)

Start with 5–7 high-signal alerts only. Add routing and escalation after the baseline works. Good first destinations: Slack for warnings, email or SMS for critical alerts.
What’s happening
Monitoring collects signals. Alerting decides when those signals require action.
For a small SaaS, the minimum useful alert surface is:
- availability
- app exceptions
- infrastructure saturation
- worker and queue health
- business-critical flows like auth and payments
Bad setups usually fail in one of two ways:
- nothing alerts during a real incident
- everything alerts constantly and gets muted
A practical setup combines multiple sources:
- uptime monitor for endpoint reachability
- application error tracking for exceptions
- metrics for resource thresholds and latency
- synthetic checks for login or payment flows if needed
Process Flow
Step-by-step implementation
1) Define critical services
List the components that must work for the product to function:
- web app
- API
- database
- Redis or queue backend
- background worker
- cron/scheduler
- email delivery
- payment webhooks
- object storage access
If a component can fail without users noticing, it should usually be warning-level only.
2) Choose alert destinations
Use simple routing first:
- Slack: warnings and non-urgent issues
- email: high severity and fallback
- SMS or phone: only critical alerts
Do not page yourself for every exception.
Example policy:
- warning: Slack only
- high: Slack + email
- critical: Slack + email + SMS
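The routing policy above can be sketched as a small lookup table. This is a hypothetical illustration, not a real notification API; SEVERITY_CHANNELS and route_alert are names invented here.

```python
# Hypothetical sketch of severity-based routing, mirroring the policy above.
# Channel names are placeholders for whatever delivery code you wire up.
SEVERITY_CHANNELS = {
    "warning": ["slack"],
    "high": ["slack", "email"],
    "critical": ["slack", "email", "sms"],
}

def route_alert(severity: str) -> list:
    """Return the destinations an alert of this severity should go to.

    Unknown severities fall back to Slack only, so a typo in a rule's
    label never silently drops an alert.
    """
    return SEVERITY_CHANNELS.get(severity, ["slack"])
```

Keeping the mapping in one place makes the escalation policy easy to audit and change.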
3) Add uptime monitoring
Monitor these endpoints at minimum:
- homepage
- login page
- API health endpoint
- webhook endpoint health if exposed
Example health endpoint in FastAPI:
from fastapi import FastAPI
app = FastAPI()
@app.get("/health")
def health():
    return {"status": "ok"}

Example health endpoint in Flask:
from flask import Flask, jsonify
app = Flask(__name__)
@app.get("/health")
def health():
    return jsonify({"status": "ok"})

Suggested starter rules:
- main site down for 2 consecutive checks
- API health endpoint down for 2 consecutive checks
- TLS certificate expiry warning at 14 days, critical at 7 days
For deeper setup, also see Uptime Monitoring Setup.
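The TLS expiry rule above (warning at 14 days, critical at 7) reduces to a small pure function. This is a sketch: cert_expiry_severity is a name invented here, and fetching the actual certificate (for example via the ssl module) is deliberately left out so the logic stays testable.

```python
from datetime import datetime
from typing import Optional

def cert_expiry_severity(not_after: datetime, now: datetime) -> Optional[str]:
    """Classify certificate expiry: 'critical' at <= 7 days remaining,
    'warning' at <= 14 days, otherwise None (no alert)."""
    days_left = (not_after - now).days
    if days_left <= 7:
        return "critical"
    if days_left <= 14:
        return "warning"
    return None
```

Your uptime monitor likely has this check built in; a script like this is only needed for internal endpoints it cannot reach.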
4) Add application error tracking
Use Sentry or equivalent. Alert on:
- new unhandled exception in production
- exception rate spike
- regression of a resolved issue
Basic Python example:
pip install "sentry-sdk[flask]"

import sentry_sdk
from sentry_sdk.integrations.flask import FlaskIntegration
sentry_sdk.init(
    dsn="https://YOUR_DSN@sentry.io/PROJECT_ID",
    integrations=[FlaskIntegration()],
    traces_sample_rate=0.0,
    environment="production",
)

For FastAPI:
pip install "sentry-sdk[fastapi]"

import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration
sentry_sdk.init(
    dsn="https://YOUR_DSN@sentry.io/PROJECT_ID",
    integrations=[FastApiIntegration()],
    traces_sample_rate=0.0,
    environment="production",
)

Set alert rules inside Sentry for:
- issue frequency spike
- new issue in production
- crash-free rate drop if relevant
Related: Error Tracking with Sentry
5) Add host metrics with node_exporter
Install node_exporter on each Linux host.
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xvf node_exporter-1.8.1.linux-amd64.tar.gz
sudo cp node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter || true
sudo chown root:root /usr/local/bin/node_exporter
sudo chmod 755 /usr/local/bin/node_exporter

Create a systemd unit:
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target

Enable it:
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter
curl -s http://127.0.0.1:9100/metrics | head

If Prometheus runs remotely, allow scraping only from the collector IP. Do not expose metrics publicly.
Example UFW rule:
sudo ufw allow from <PROMETHEUS_IP> to any port 9100 proto tcp

6) Scrape metrics in Prometheus
Example Prometheus config:
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'app1.example.internal:9100'
          - 'worker1.example.internal:9100'

Reload Prometheus after config changes.
# Requires Prometheus to be started with --web.enable-lifecycle
curl -X POST http://127.0.0.1:9090/-/reload

7) Create starter alert rules
Example Prometheus rules:
# /etc/prometheus/alerts/node.yml
groups:
  - name: node-alerts
    rules:
      - alert: HostHighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: high
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage > 90% for 10m"
      - alert: HostLowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 10m
        labels:
          severity: high
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Available memory < 10% for 10m"
      - alert: HostDiskUsageHigh
        expr: (100 - ((node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} * 100) / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage high on {{ $labels.instance }}"
          description: "Disk usage > 85% for 10m"
      - alert: HostDiskUsageCritical
        expr: (100 - ((node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} * 100) / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk usage critical on {{ $labels.instance }}"
          description: "Disk usage > 95% for 5m"
      - alert: NodeExporterDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "node_exporter down on {{ $labels.instance }}"
          description: "Metrics target not reachable"

8) Configure alert delivery
If using Grafana or Alertmanager, route by severity.
Example Alertmanager config:
# /etc/alertmanager/alertmanager.yml
route:
  receiver: 'slack-default'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  routes:
    - match:
        severity: critical
      receiver: 'email-critical'

receivers:
  - name: 'slack-default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts'
        send_resolved: true
  - name: 'email-critical'
    email_configs:
      - to: 'ops@example.com'
        from: 'alerts@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@example.com'
        auth_password: 'SMTP_PASSWORD'
        require_tls: true

Test your channels before relying on them.
Slack webhook test:
curl -X POST https://hooks.slack.com/services/XXX/YYY/ZZZ \
-H 'Content-type: application/json' \
--data '{"text":"test alert"}'

9) Add app and web-server alerts
Useful rules:
- 5xx rate above 3–5% for 5 minutes
- p95 latency above threshold for 10 minutes
- process restarts above baseline
- Gunicorn worker timeout spikes
- Nginx upstream failures
If you already ship logs, this pairs well with Logging Setup (Application + Server).
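The 5xx-rate rule above can be expressed as two small functions: one computing the error percentage from status-code counts, one requiring the threshold to hold across the whole window. These names (error_rate, should_alert) are invented for illustration; the per-minute counts are assumed to come from your log pipeline or web-server metrics.

```python
def error_rate(status_counts: dict) -> float:
    """Percentage of responses with a 5xx status code.

    status_counts maps HTTP status code -> number of responses.
    """
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for code, n in status_counts.items() if 500 <= code <= 599)
    return 100.0 * errors / total

def should_alert(per_minute_rates: list, threshold: float = 5.0) -> bool:
    """Alert only if every sample in the window exceeds the threshold,
    so a single bad minute does not page anyone."""
    return bool(per_minute_rates) and all(r > threshold for r in per_minute_rates)
```

In practice the same logic lives in a Prometheus expression over your web-server metrics; the Python form just makes the rule easy to unit test.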
10) Add database and queue alerts
Add only if you run these components.
Database alerts:
- too many active connections
- failed connections
- slow query spike
- replication lag
- backup missing or failed
Queue and worker alerts:
- worker process down
- queue length too high
- oldest job too old
- repeated failures or retry storm
This is critical for background email, webhook processing, and billing jobs.
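A minimal sketch of the queue checks above, assuming you can read the backlog length and the enqueue timestamp of the oldest job (with Redis lists, for example, that might come from LLEN plus a timestamp stored on each job). The function name and thresholds are assumptions, not a real library API.

```python
from typing import List, Optional

def queue_alerts(queue_length: int, oldest_job_ts: Optional[float], now: float,
                 max_length: int = 1000, max_age_s: float = 600.0) -> List[str]:
    """Return human-readable alerts for an unhealthy queue.

    Fires on a large backlog or on an oldest job stuck longer than
    max_age_s seconds; an empty list means the queue looks healthy.
    """
    alerts = []
    if queue_length > max_length:
        alerts.append("queue backlog high: %d jobs" % queue_length)
    if oldest_job_ts is not None and now - oldest_job_ts > max_age_s:
        alerts.append("oldest job too old")
    return alerts
```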
11) Add business-flow alerts
These catch failures that system metrics miss.
Good examples:
- no successful payment webhooks in 15 minutes during active billing periods
- signup failure rate above threshold
- login failure spike
- zero successful checkout events over expected traffic window
- webhook backlog growing continuously
Do not add these if traffic is too low to make the signal meaningful.
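The first example above ("no successful payment webhooks in 15 minutes") reduces to a one-line check over event timestamps. This is a sketch; the function name is invented here, and the timestamps are assumed to come from your own webhook log or database.

```python
from datetime import datetime, timedelta

def payment_webhooks_stalled(success_timestamps: list, now: datetime,
                             window: timedelta = timedelta(minutes=15)) -> bool:
    """True if no successful webhook event landed inside the window."""
    return not any(now - ts <= window for ts in success_timestamps)
```

Run a check like this on a schedule only during active billing periods; outside them the absence of events is expected and the signal is meaningless.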
12) Make alerts actionable
Each alert should include:
- service name
- environment
- metric name
- current value
- threshold
- likely impact
- first debugging step
- dashboard or logs link
- runbook link
Example annotation style:
annotations:
  summary: "API 5xx rate high on production"
  description: "5xx rate is {{ $value }}% over 5m"
  runbook: "https://internal.example.com/runbooks/api-5xx"

13) Test every alert intentionally
Do not trust untested alert rules.
Test examples:
- stop a worker service
- force a test exception
- block a metrics port temporarily
- fill a test filesystem
- trigger a known 500 on staging
- send a test alert to Slack and email
Keep a short validation checklist after every alert change.
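The validation checklist can be scripted as a tiny harness where each check is a named callable returning True on success. This is a sketch with invented names; real checks would hit your health endpoint, Slack webhook, and so on.

```python
def run_checks(checks: dict) -> dict:
    """Run each named check; a raised exception counts as a failure,
    so one broken check never aborts the rest of the list."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results
```

Running the same script after every alert change keeps "did we break delivery?" a thirty-second question instead of a surprise during the next incident.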
| Metric | Warning | High | Critical |
|---|---|---|---|
| Uptime | < 99.9% | < 99.5% | < 99% |
| Error rate | > 1% | > 3% | > 5% |
| Response time | > 500 ms | > 1 s | > 3 s |
| CPU usage | > 60% | > 80% | > 95% |
| Memory usage | > 70% | > 85% | > 95% |
Severity matrix for warning, high, and critical thresholds across uptime, app errors, resources, and business events.
Common causes
Most alerting failures come from setup gaps, not tooling limits.
Common causes:
- no uptime checks for user-facing endpoints
- alert rules created but notification channels never tested
- thresholds too sensitive, causing alert fatigue
- only infrastructure alerts exist; no application or business-flow alerts
- exporters or agents not running after reboot
- metrics collector cannot scrape targets because of firewall or bind-address issues
- error tracking installed in development but misconfigured in production
- critical background workers have no health or backlog alerts
- alerts lack context, dashboard links, or runbook references
- no silencing during maintenance, causing noisy deploy-time alerts
Debugging tips
Check each layer separately: source signal, alert rule, delivery channel.
Useful commands:
curl -I https://yourdomain.com
curl -s http://127.0.0.1:9100/metrics | head
systemctl status node_exporter
systemctl status nginx
systemctl status gunicorn
journalctl -u nginx -n 100 --no-pager
journalctl -u gunicorn -n 100 --no-pager
df -h
free -m
top
ss -tulpn
curl -X POST https://hooks.slack.com/services/XXX/YYY/ZZZ -H 'Content-type: application/json' --data '{"text":"test alert"}'
dig yourdomain.com
openssl s_client -connect yourdomain.com:443 -servername yourdomain.com </dev/null | openssl x509 -noout -dates
docker ps
docker logs <container_name> --tail 100

Targeted debugging guidance:
- If alerts never fire, verify metric collection, rule evaluation, and notification delivery independently.
- If alerts are noisy, increase the "for:" durations on rules and use rate-based thresholds.
- If Slack or email delivery fails, test credentials and outbound firewall rules.
- If uptime checks are flaky, compare external monitor failures with local curl.
- If host metrics are wrong, confirm exporter uptime and collector reachability.
- If app error alerts are missing, trigger a controlled production-safe exception.
- If queue alerts fire constantly, inspect concurrency, retries, and poison jobs.
- If alerts spike after deploys, correlate failures with deploy timestamps and migrations.
If alerts point to active incidents, continue with Incident Response Playbook or Debugging Production Issues.
Checklist
- ✓ Uptime monitoring configured for primary user-facing endpoints
- ✓ Application exception tracking enabled in production
- ✓ CPU, memory, disk, and process alerts configured
- ✓ Database and queue alerts configured if those services exist
- ✓ Alert destinations tested: Slack, email, SMS if used
- ✓ Alert severities documented
- ✓ Every alert includes environment, service, threshold, current value, and next debugging step
- ✓ Maintenance silence process documented
- ✓ At least one test incident performed to validate end-to-end delivery
- ✓ Runbook links attached to common alerts
For broader release readiness, review SaaS Production Checklist.
Related guides
- Uptime Monitoring Setup
- Error Tracking with Sentry
- Logging Setup (Application + Server)
- Incident Response Playbook
- Debugging Production Issues
- SaaS Production Checklist
FAQ
What alerts should every MVP have first?
Start with endpoint uptime, application exception spikes, server CPU/memory/disk, and process health for web and worker services.
Should alerts be based on logs or metrics?
Use both. Metrics are better for sustained threshold alerts. Logs and error tracking are better for exceptions and event-based failures.
How do I reduce false positives?
Use sustained time windows, rate-based conditions, separate warning from critical, and exclude maintenance periods.
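The sustained-window idea is the same mechanism as the "for:" clause in Prometheus rules: an alert fires only after the condition has held for N consecutive evaluations, which filters out short spikes. A minimal sketch (the class name is invented here):

```python
class SustainedCondition:
    """Fire only after the condition holds for N consecutive observations."""

    def __init__(self, required_consecutive: int):
        self.required = required_consecutive
        self.streak = 0

    def observe(self, condition_true: bool) -> bool:
        """Feed one evaluation; return True when the alert should fire.

        Any false observation resets the streak to zero.
        """
        self.streak = self.streak + 1 if condition_true else 0
        return self.streak >= self.required
```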
Do I need separate alerts for background jobs?
Yes. Queue backlog, worker offline state, and repeated job failures should be monitored separately from the web app.
How often should I review alert rules?
Review monthly and after every real incident. Remove noisy alerts and add alerts for failures that were missed.
Do I need PagerDuty for a small SaaS?
Usually no. Slack plus email is enough until incidents are frequent or multiple people rotate on-call.
What should page me immediately?
Full downtime, payment processing failures, auth failures, database outage, and worker failures for critical jobs.
How many alerts should I start with?
Keep it small: 5–7 high-signal alerts.
Should I alert on every exception?
No. Alert on spikes, regressions, or high-severity exceptions.
Should staging send alerts?
Usually only to a lower-noise channel, not the main production channel unless you are actively testing.
Final takeaway
A good alerting system for a small SaaS is narrow, tested, and actionable.
Start with:
- uptime checks
- exception tracking
- host resource alerts
- queue and worker alerts
- payment and auth flow alerts where relevant
If an alert does not help you act faster, tune it or remove it.