Monitoring Checklist
The essential playbook for implementing baseline monitoring in your SaaS.
Use this checklist to verify your SaaS has the minimum monitoring needed to detect failures, debug issues fast, and respond before users report problems. This page is built for small deployments on VPS, Docker, or simple cloud setups.
Quick Fix / Quick Setup
Monitoring baseline quick setup
1. Error tracking
- Install Sentry in app backend
- Set SENTRY_DSN in production
- Verify one test exception reaches dashboard
2. Uptime checks
- Add external checks for:
- homepage or app URL
- /health endpoint
- API base endpoint
- Alert to email or Slack
3. Structured logs
- Send app logs to stdout or journald
- Capture Nginx/Gunicorn logs
- Include request_id, user_id, path, status_code
4. Metrics
- Track at minimum:
- CPU
- memory
- disk
- restart count
- request latency
- error rate
- DB connections
5. Alerts
- Alert on:
- uptime failure
- 5xx spike
- app crash/restart loop
- disk > 85%
- memory > 90%
- queue backlog growth
6. Runbook
- Document where to check:
- app logs
- web server logs
- Sentry
- uptime monitor
- database status
- recent deploys
If you only do five things today: enable Sentry, create a /health endpoint, add uptime checks, centralize logs, and configure alerts for 5xx errors and server resource exhaustion.
What’s happening
Monitoring is your minimum production visibility layer.
Without it, failures are usually discovered by users first. That creates slower incident response, longer outages, and poor debugging after deploys.
For a small SaaS, the baseline is not full observability. It is enough signal to answer these questions fast:
- Is the app up?
- Is the app healthy?
- Are users getting errors?
- Did the latest deploy break something?
- Is the server running out of memory, disk, or CPU?
- Are background jobs stuck?
- Where do I look first during an incident?
This checklist covers:
- Application error tracking
- Server and process health
- HTTP uptime and endpoint checks
- Performance and latency baselines
- Database visibility
- Background job visibility
- Actionable alerts
- Incident response readiness
Step-by-step implementation
1. Create a health endpoint
Create a fast endpoint that reports basic app readiness and dependency status.
Minimum requirements:
- returns 200 when healthy
- checks database connectivity
- includes version or commit SHA
- avoids expensive queries
Example JSON response:
{
"status": "ok",
"service": "app",
"version": "git-sha-or-release-id",
"database": "ok",
"timestamp": "2026-04-20T12:00:00Z"
}
Basic test:
curl -i https://yourdomain.com/health
curl -sS https://yourdomain.com/health | jq .
If your health endpoint includes dependency checks, keep them lightweight. Do not turn it into a slow diagnostic page.
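A framework-agnostic sketch of the payload builder behind that JSON response. `check_database` is a placeholder for whatever cheap connectivity check your stack provides (for example, `SELECT 1`):

```python
from datetime import datetime, timezone

def build_health_payload(check_database, version="git-sha-or-release-id"):
    """Build the /health response body. check_database is any cheap
    callable that returns truthy when the DB answers a trivial query."""
    try:
        db_ok = bool(check_database())
    except Exception:
        db_ok = False
    return {
        "status": "ok" if db_ok else "degraded",
        "service": "app",
        "version": version,
        "database": "ok" if db_ok else "error",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Wire this into any framework: return HTTP 200 when status is "ok",
# 503 otherwise, with the dict serialized as JSON.
```

Catching the DB exception inside the endpoint matters: a health check that itself throws a 500 on DB failure tells your uptime monitor less than a clean "degraded" response.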
2. Add external uptime checks
Monitor from outside your infrastructure.
At minimum, check:
- main app URL
- /health endpoint
- API base endpoint if separate
Example targets:
https://yourdomain.com/
https://yourdomain.com/health
https://api.yourdomain.com/health
Recommended alert routing:
- email for solo operators
- Slack for active team visibility
- pager tool if this is revenue-critical
Use consecutive failure thresholds to avoid noisy alerts.
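The consecutive-failure logic is what keeps checks quiet; most hosted uptime tools implement something like this sketch (the probe uses stdlib `urllib`, and the alert decision is a pure function you can tune):

```python
from collections import deque
from urllib.request import urlopen

def probe(url, timeout=5):
    """Return True if the endpoint answers with a 2xx status."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def should_alert(history, threshold=3):
    """Alert only after `threshold` consecutive failures, so a single
    dropped check or network blip does not page anyone."""
    recent = list(history)[-threshold:]
    return len(recent) == threshold and not any(recent)

# Rolling window of probe results per target:
history = deque(maxlen=10)
# each check interval:
#   history.append(probe("https://yourdomain.com/health"))
#   if should_alert(history): send the email/Slack notification
```

A threshold of 3 checks at a 1-minute interval means roughly 3 minutes to first alert, which is a reasonable trade-off for a small SaaS.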
3. Install error tracking
Use app-level exception tracking such as Sentry.
Backend checklist:
- install SDK
- set SENTRY_DSN in production
- set environment to production
- set release/version tag
- send one test exception
Example env:
SENTRY_DSN=https://examplePublicKey@o0.ingest.sentry.io/0
SENTRY_ENVIRONMENT=production
SENTRY_RELEASE=git-sha-or-version
Verify with a deliberate test error after deploy.
If you have a browser app, add frontend error tracking too.
4. Standardize logs
Use one consistent log format across app, proxy, and workers.
Include:
- timestamp
- severity
- request_id
- user_id if available
- path
- method
- status_code
- latency
- exception name and stack trace
Preferred outputs:
- stdout for containers
- journald for systemd-managed services on VPS
- reverse proxy access and error logs retained separately
Example JSON log line:
{
"ts": "2026-04-20T12:00:00Z",
"level": "error",
"request_id": "req_123",
"user_id": "user_456",
"method": "POST",
"path": "/api/orders",
"status_code": 500,
"message": "database timeout"
}
You need access to all of these log sources:
- application logs
- Nginx logs
- Gunicorn or app server logs
- worker logs
- scheduler or cron logs
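The JSON log line above can be produced with a stdlib-only formatter; a minimal sketch, assuming the per-request fields are passed via `extra=`:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line. Request-scoped fields
    (request_id, user_id, ...) are attached via logger's `extra=`."""
    def format(self, record):
        line = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        for key in ("request_id", "user_id", "method", "path", "status_code"):
            if hasattr(record, key):
                line[key] = getattr(record, key)
        if record.exc_info:
            line["exception"] = self.formatException(record.exc_info)
        return json.dumps(line)

handler = logging.StreamHandler()  # stdout/stderr for containers/journald
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("database timeout",
             extra={"request_id": "req_123", "method": "POST",
                    "path": "/api/orders", "status_code": 500})
```

One JSON object per line keeps the output greppable and trivially parseable by any log shipper later.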
5. Make logs easy to inspect
For systemd-based VPS:
systemctl status nginx
systemctl status gunicorn
systemctl status celery
journalctl -u nginx -n 200 --no-pager
journalctl -u gunicorn -n 200 --no-pager
journalctl -u celery -n 200 --no-pager
For Docker:
docker ps
docker logs --tail=200 <container_name>
docker stats --no-stream
For Nginx file logs:
grep ' 5[0-9][0-9] ' /var/log/nginx/access.log | tail -n 50
tail -n 200 /var/log/nginx/error.log
nginx -t
6. Track infrastructure metrics
At minimum, capture:
- CPU
- memory
- disk
- load
- network
- process restarts
- container restarts
Useful commands for direct inspection:
df -h
free -m
top
htop
uptime
ss -tulpn
ps aux --sort=-%mem | head
ps aux --sort=-%cpu | head
For a small SaaS, lightweight options are enough:
- Netdata
- node_exporter with a basic dashboard
- platform-provided metrics on managed services
7. Track app metrics
Add app-level metrics where possible:
- request count
- p95 latency
- 4xx rate
- 5xx rate
- queue depth
- failed jobs
- DB pool usage
- cache hit rate if used
If you cannot instrument everything now, prioritize:
- request latency
- 5xx rate
- worker failures
- DB connections
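Even without a metrics stack, an in-process rolling counter gives you the 5xx rate; a sketch, where the middleware hook is whatever your framework provides:

```python
from collections import deque

class ErrorRateWindow:
    """Track the 5xx rate over the last `size` requests."""
    def __init__(self, size=1000):
        self.statuses = deque(maxlen=size)

    def record(self, status_code):
        self.statuses.append(status_code)

    def error_rate(self):
        """Fraction of recent requests that returned 5xx."""
        if not self.statuses:
            return 0.0
        errors = sum(1 for s in self.statuses if s >= 500)
        return errors / len(self.statuses)

window = ErrorRateWindow(size=1000)
# call window.record(response.status_code) from middleware,
# and alert when window.error_rate() exceeds your baseline
```

A fixed-size window over recent requests is deliberately simple: it reacts to bursts and forgets old traffic without any time bookkeeping.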
8. Monitor database health
You do not need full DBA tooling for an MVP, but you need to know if the database is unavailable or exhausted.
Track:
- database reachable/unreachable
- connection usage
- slow queries if supported
- storage growth
- replication status if relevant
Health endpoint should verify basic DB connectivity, but alerts should also exist outside the app if possible.
9. Monitor background jobs and workers
If you use queues, cron jobs, or workers, monitor them separately.
Track:
- worker process count
- failed jobs
- queue depth
- oldest job age
- restart loops
If web requests are healthy but async jobs are failing, users still experience production issues.
10. Configure alerts
Set threshold-based alerts for the first set of real failure conditions.
Alert on:
- main site down
- health endpoint failing multiple times
- 5xx burst
- app crash or restart loop
- disk usage above 85%
- memory above 90%
- DB unavailable
- DB connection exhaustion
- queue backlog growth
- webhook failures for critical integrations
Avoid low-value noisy alerts until you know your baseline.
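The disk and memory thresholds above can be checked from a small cron script; a sketch using stdlib `shutil` for disk (the memory percentage is platform-specific, so it is left as an input you would fill from `free -m` or `/proc/meminfo`):

```python
import shutil

def disk_usage_percent(path="/"):
    """Percentage of the filesystem used at `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def check_thresholds(disk_pct, mem_pct, disk_limit=85.0, mem_limit=90.0):
    """Return alert messages for any breached threshold
    (85% disk / 90% memory, matching the alert list above)."""
    alerts = []
    if disk_pct > disk_limit:
        alerts.append(f"disk usage {disk_pct:.0f}% > {disk_limit:.0f}%")
    if mem_pct > mem_limit:
        alerts.append(f"memory usage {mem_pct:.0f}% > {mem_limit:.0f}%")
    return alerts

if __name__ == "__main__":
    # run from cron; pipe any output to mail or a Slack webhook
    for alert in check_thresholds(disk_usage_percent(), mem_pct=0.0):
        print(alert)
```

Printing only on breach pairs well with cron's default behavior of mailing any output, so silence means healthy.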
11. Add deploy markers
Tag monitoring data with version or release metadata.
Add:
- commit SHA in health endpoint
- release tag in Sentry
- deploy timestamp in logs
- release note marker in monitoring dashboards if supported
This reduces time-to-root-cause after regressions.
12. Create an incident runbook
Document exactly where to look first.
Minimum runbook sections:
- uptime monitor URL
- Sentry project URL
- dashboard URL
- app log command
- Nginx log command
- worker log command
- rollback command or deploy command
- database status page or admin access path
- owner contact details
13. Test the full monitoring chain
Force one event for each path:
- one application exception
- one failed health check
- one alert delivery test
If possible, verify:
- event reaches Sentry
- uptime tool marks endpoint failed
- alert arrives in email/Slack
- logs show the event
- version tag is visible
14. Review alert noise weekly
For small deployments, alert quality matters more than alert quantity.
Each week, review:
- duplicate alerts
- false positives
- alerts with no clear owner
- thresholds too strict or too loose
- missing checks discovered during recent incidents
Common causes
Common monitoring gaps:
- No external uptime check configured, so outages are discovered by users first.
- Health endpoint exists but does not verify database or dependency health.
- Error tracking installed only in development or missing DSN in production.
- Logs are split across app, proxy, and workers with no consistent access path.
- Alerts are too noisy, so important notifications are ignored.
- No monitoring for background workers, scheduled jobs, or queue backlog.
- No visibility into disk usage, causing failures when logs or uploads fill storage.
- Deploys are not tagged in monitoring tools, making regressions hard to trace.
- Metrics are collected but no thresholds or alerts are defined.
- Monitoring is configured once and never tested after infrastructure changes.
Debugging tips
Use these commands during setup and incident response.
curl -i https://yourdomain.com/health
curl -sS https://yourdomain.com/health | jq .
systemctl status nginx
systemctl status gunicorn
systemctl status celery
journalctl -u nginx -n 200 --no-pager
journalctl -u gunicorn -n 200 --no-pager
journalctl -u celery -n 200 --no-pager
docker ps
docker logs --tail=200 <container_name>
docker stats --no-stream
df -h
free -m
top
htop
uptime
ss -tulpn
ps aux --sort=-%mem | head
ps aux --sort=-%cpu | head
nginx -t
curl -I https://yourdomain.com
grep ' 5[0-9][0-9] ' /var/log/nginx/access.log | tail -n 50
tail -n 200 /var/log/nginx/error.log
Fast triage order:
- Check external uptime status
- Hit /health
- Inspect recent deploys/releases
- Check app logs
- Check error tracker
- Check Nginx and worker logs
- Check CPU, memory, disk
- Check DB connectivity and queue backlog
Checklist
Use this before launch and after infrastructure changes.
- ✓ Health endpoint exists and is reachable without authentication or with controlled access.
- ✓ External uptime checks are configured for the main app and API.
- ✓ Error tracking is installed and a test event has been verified.
- ✓ Application logs are structured and retained.
- ✓ Nginx, Gunicorn, Docker, systemd, or platform logs are accessible.
- ✓ Server metrics are visible in one dashboard.
- ✓ Request latency and 5xx rates are tracked.
- ✓ Database health and connection usage are monitored.
- ✓ Background jobs and queue backlog are monitored.
- ✓ Alert channels are configured and tested.
- ✓ Recent deploy version or commit SHA is visible in logs or monitoring tools.
- ✓ A documented incident response path exists.
- ✓ On-call or owner contact details are current.
- ✓ Log retention and privacy rules are defined.
- ✓ Monitoring has been tested after the latest deployment.
Common setup patterns for small SaaS
- VPS setup: UptimeRobot or Better Stack for uptime, Sentry for errors, journald plus Nginx logs for logs, and a lightweight metrics agent such as Netdata or node_exporter.
- Docker setup: container stdout logs, Docker restart monitoring, cAdvisor or platform metrics, Sentry for app exceptions, and external uptime checks.
- Managed platform setup: use platform logs and metrics, still add external uptime checks, and keep app-level error tracking enabled.
- Background worker setup: monitor worker process count, failed jobs, queue depth, and oldest queued job age.
What to alert on first
- Main site down or health endpoint failing for multiple consecutive checks.
- Burst of 5xx responses over a short window.
- App process crashing or restarting repeatedly.
- Disk usage approaching full capacity.
- Memory exhaustion or swap thrashing.
- Database connection exhaustion or DB unavailable.
- Job queue backlog increasing without recovery.
- Webhook endpoint failure spikes for payments or integrations.
Diagram: request flow from user to Nginx to app to DB with monitoring touchpoints marked.
Related guides
- Logging Setup (Application + Server)
- Error Tracking with Sentry
- Uptime Monitoring Setup
- Alerting System Setup
- Incident Response Playbook
- SaaS Production Checklist
- Security Checklist
- Auth System Checklist
FAQ
What is the minimum monitoring stack for an MVP SaaS?
At minimum: external uptime checks, a health endpoint, structured app and server logs, error tracking such as Sentry, and basic server metrics with alerts for downtime, 5xx spikes, memory, and disk usage.
Should I monitor both the homepage and a health endpoint?
Yes. The homepage confirms user-facing availability. The health endpoint confirms application readiness and can include dependency checks such as database connectivity.
Do I need full observability tooling for a small SaaS?
No. Start with simple monitoring that you will actually check and maintain. Add tracing and advanced metrics only when traffic, complexity, or team size justifies it.
How often should I test alerts?
Test after initial setup, after major deployment or infrastructure changes, and on a recurring schedule such as monthly. Untested alerts are unreliable.
What should a health endpoint return?
A 200 response for healthy state, optional JSON with service name, version, timestamp, and dependency checks. Keep it fast and avoid expensive operations.
How long should logs be retained?
Enough to investigate incidents and regressions. For small SaaS products, start with at least 7 to 30 days depending on cost, legal requirements, and traffic volume.
Final takeaway
Monitoring is not one tool. You need coverage across uptime, logs, exceptions, metrics, and alerts.
For a small SaaS, the minimum production standard is simple:
- health endpoint
- uptime checks
- structured logs
- error tracking
- server metrics
- tested alerts
If an issue happens and you do not know where to look in the first two minutes, the monitoring setup is incomplete.