Uptime Monitoring Setup
The essential playbook for setting up uptime monitoring for your SaaS.
Uptime monitoring is the minimum production safety net. For a small SaaS, you need an external monitor that checks your app from outside your infrastructure, hits a stable health endpoint, and alerts you fast when the app is down or degraded.
This setup is enough for most MVPs and small production apps:
- one public health endpoint
- one external uptime provider
- one alert channel you actually watch
- checks for the main domain, the API domain, and SSL/TLS
If you do nothing else for production visibility, do this first.
Quick Fix / Quick Setup
Add a minimal public health endpoint and monitor it externally.
```python
# 1) Add a minimal health endpoint
# FastAPI
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/healthz")
def healthz():
    return JSONResponse({"status": "ok"}, status_code=200)
```
```python
# Flask
from flask import Flask, jsonify

app = Flask(__name__)

@app.get("/healthz")
def healthz():
    return jsonify({"status": "ok"}), 200
```
Verify it from outside:
```bash
curl -i https://yourdomain.com/healthz
```
Monitor these externally every 1 minute if possible:
- https://yourdomain.com/healthz
- https://www.yourdomain.com/healthz
- https://api.yourdomain.com/healthz
Alert on:
- non-200 response
- timeout greater than 10s
- SSL certificate problems
- 2+ consecutive failures
Use an external provider, not a cron job on the same VPS. Start with one public health endpoint, one alert channel, and multi-region checks if available.
What’s happening
Uptime monitoring checks whether your app is reachable over the public internet.
It catches issues that app logs alone may miss:
- DNS failures
- expired or invalid SSL certificates
- reverse proxy misconfiguration
- crashed app processes
- bad deploys
- broken routing
- partial regional outages
A useful uptime setup should:
- run outside your infrastructure
- hit a lightweight endpoint
- alert quickly
- avoid false positives from one-off network blips
For a small SaaS, monitor at minimum:
- main app domain
- API domain
- critical user-facing paths if they matter to revenue or auth flows
Step-by-step implementation
1) Define what "up" means
For most small SaaS deployments, basic uptime means:
- DNS resolves
- HTTPS works
- reverse proxy accepts traffic
- app responds successfully
- routing is correct
This is what /healthz should confirm.
If you also want dependency checks like database access, expose a separate readiness endpoint such as /readyz.
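A minimal, framework-agnostic sketch of what such a /readyz handler could do, assuming you want it to return 503 (not 200) whenever a dependency check fails, so a monitor pointed at it actually sees the failure. The helper names `run_readiness_checks` and `check_sqlite` are hypothetical, and the SQLite `SELECT 1` only stands in for whatever database your app really uses.

```python
import sqlite3
from typing import Callable, Dict, Tuple


def run_readiness_checks(checks: Dict[str, Callable[[], None]]) -> Tuple[dict, int]:
    """Run each dependency check; a check signals failure by raising."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            # Report only the error class so connection strings never leak.
            results[name] = f"fail: {type(exc).__name__}"
    all_ok = all(value == "ok" for value in results.values())
    body = {"status": "ready" if all_ok else "not_ready", "checks": results}
    # 503 Service Unavailable tells monitors and load balancers "not ready".
    return body, (200 if all_ok else 503)


def check_sqlite(path: str = ":memory:") -> None:
    # Stand-in for a real DB ping: one cheap query, short timeout.
    conn = sqlite3.connect(path, timeout=2)
    try:
        conn.execute("SELECT 1").fetchone()
    finally:
        conn.close()
```

In Flask the route could end with `return jsonify(body), status`; in FastAPI, with `JSONResponse(body, status_code=status)`. Keeping this logic out of /healthz means a transient DB issue shows up as "not ready" without marking the whole site down.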
2) Add a public /healthz endpoint
Keep it cheap and stable.
Good response:
```json
{"status":"ok"}
```
Avoid in /healthz:
- expensive queries
- third-party API calls
- full-page rendering
- auth requirements
- redirects
Example FastAPI:
```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/healthz")
def healthz():
    return JSONResponse({"status": "ok"}, status_code=200)

@app.get("/readyz")
def readyz():
    # Optional deeper check
    # Verify DB or queue only if needed
    return JSONResponse({"status": "ready"}, status_code=200)
```
Example Flask:
```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.get("/healthz")
def healthz():
    return jsonify({"status": "ok"}), 200

@app.get("/readyz")
def readyz():
    # Optional deeper check: verify DB or queue only if needed
    return jsonify({"status": "ready"}), 200
```
3) Expose the endpoint through your production stack
If you use Nginx in front of Gunicorn/Uvicorn, make sure /healthz is reachable through the public domain.
Example Nginx server block:
```nginx
server {
    listen 80;
    server_name yourdomain.com www.yourdomain.com api.yourdomain.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl http2;
    server_name yourdomain.com www.yourdomain.com api.yourdomain.com;

    # The certificate must cover all three names, or checks on www/api fail TLS validation
    ssl_certificate /etc/letsencrypt/live/yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/yourdomain.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```
Test it externally:
```bash
curl -i https://yourdomain.com/healthz
curl -i https://www.yourdomain.com/healthz
curl -i https://api.yourdomain.com/healthz
```
If your deployment stack is not stable yet, see Deploy SaaS with Nginx + Gunicorn.
4) Create external uptime checks
Use any external uptime provider that supports:
- HTTPS checks
- custom interval
- SSL monitoring
- multi-region checks
- email or chat alerts
Recommended initial settings:
- interval: 60 seconds
- timeout: 10 seconds
- expected status: 200
- failure threshold: 2 consecutive failures
- recovery notifications: enabled
- region checks: multiple if available
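To make these settings concrete, here is a rough sketch of what a single check from a provider evaluates: a GET with a 10-second timeout that counts as up only on an exact 200. It deliberately does not follow redirects, since /healthz should never redirect. `check_url` and `CheckResult` are hypothetical names; this illustrates the logic and is not a substitute for an external provider running outside your infrastructure.

```python
import http.client
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    url: str
    ok: bool
    status: Optional[int]  # None when no HTTP response was received
    elapsed_s: float
    error: Optional[str] = None


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None makes urllib raise HTTPError on 3xx instead of following it,
    # so a redirecting /healthz is reported as a failure.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


# ProxyHandler({}) ignores proxy environment variables and connects directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}), _NoRedirect())


def check_url(url: str, timeout: float = 10.0, expected_status: int = 200) -> CheckResult:
    """One probe: GET url and count it as up only on the exact expected status."""
    start = time.monotonic()
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            resp.read()  # include the body transfer in the measured time
            status = resp.status
        return CheckResult(url, status == expected_status, status, time.monotonic() - start)
    except urllib.error.HTTPError as err:  # 3xx (not followed), 4xx, 5xx
        err.close()
        return CheckResult(url, False, err.code, time.monotonic() - start, f"HTTP {err.code}")
    except (OSError, http.client.HTTPException) as err:  # DNS, refused, TLS, timeout
        return CheckResult(url, False, None, time.monotonic() - start, repr(err))
```

For example, `check_url("https://api.yourdomain.com/healthz")` returns a result with `ok=False` for a timeout, a 502 from Nginx, or a redirect to a login page. The "2 consecutive failures" rule is then applied on top of a sequence of these results.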
Monitor these separately:
- https://yourdomain.com/healthz
- https://www.yourdomain.com/healthz
- https://api.yourdomain.com/healthz
Do not combine all services into one health endpoint if they are independently exposed.
5) Add SSL certificate monitoring
Enable certificate expiry and TLS validation alerts if your provider supports them.
This catches:
- expired certs
- broken renewal
- invalid chain
- hostname mismatch
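If your provider lacks certificate alerts, the same signal can be approximated with Python's standard `ssl` module, as a rough sketch. A verified handshake fails outright on an expired certificate, a broken chain, or a hostname mismatch, and `getpeercert()` exposes `notAfter` for early expiry warnings. The function names `fetch_not_after`, `days_remaining`, and `cert_expiring_soon`, and the 14-day warning window, are assumptions.

```python
import socket
import ssl
import time


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    # create_default_context() verifies the chain and the hostname, so an expired,
    # untrusted, or mismatched certificate raises ssl.SSLCertVerificationError here.
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]  # e.g. "Jan 31 00:00:00 2030 GMT"


def days_remaining(not_after: str, now: float) -> float:
    # cert_time_to_seconds parses the certificate timestamp as UTC epoch seconds.
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def cert_expiring_soon(host: str, warn_days: float = 14) -> bool:
    return days_remaining(fetch_not_after(host), time.time()) < warn_days
```

Run `cert_expiring_soon("yourdomain.com")` from a scheduled job outside your server; an exception from it is itself an alert-worthy TLS failure, not just a script error.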
Manual verification:
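```bash
openssl s_client -connect yourdomain.com:443 -servername yourdomain.com
# Print only the certificate validity dates (notBefore / notAfter)
echo | openssl s_client -connect yourdomain.com:443 -servername yourdomain.com 2>/dev/null | openssl x509 -noout -dates
```
6) Configure alerting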
Start simple.
Recommended alert design:
- one real-time channel such as Slack, Discord, PagerDuty, or SMS
- alert after 2 failures
- send recovery notification
- repeat every 15 to 30 minutes during an ongoing incident
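This alert policy (fire after 2 consecutive failures, remind every 15 to 30 minutes, notify on recovery) is small enough to sketch as a state machine. It is roughly what providers implement for you; `AlertState` and its event strings are hypothetical, and timestamps are plain seconds so the logic is easy to test.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    failure_threshold: int = 2       # consecutive failures before alerting
    repeat_every_s: float = 20 * 60  # reminder interval while still down
    consecutive_failures: int = 0
    alerting: bool = False
    last_notified_at: Optional[float] = None

    def record(self, ok: bool, now: float) -> Optional[str]:
        """Feed one check result; return the notification to send, if any."""
        if ok:
            self.consecutive_failures = 0
            if self.alerting:
                self.alerting = False
                self.last_notified_at = None
                return "RECOVERED"
            return None  # a blip below the threshold never alerted, so stay quiet
        self.consecutive_failures += 1
        if not self.alerting:
            if self.consecutive_failures >= self.failure_threshold:
                self.alerting = True
                self.last_notified_at = now
                return "DOWN"
            return None
        if now - self.last_notified_at >= self.repeat_every_s:
            self.last_notified_at = now
            return "STILL_DOWN"
        return None
```

With 60-second checks, one failed probe followed by a success produces no notification at all, which is how the threshold filters one-off network blips.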
Each alert should include:
- failing URL
- status code
- response time
- failing region
- first failure timestamp
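A hedged sketch of packing those fields into a chat notification. It assumes a Slack-style incoming webhook that accepts a JSON body with a single `text` field; other destinations such as Discord or PagerDuty expect different payloads. `Incident`, `build_alert_payload`, `send_webhook`, and any webhook URL you pass in are placeholders.

```python
import json
import urllib.request
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Incident:
    url: str
    status_code: Optional[int]  # None for timeouts and connection errors
    response_time_ms: float
    region: str
    first_failure: datetime  # ideally timezone-aware UTC


def build_alert_payload(inc: Incident) -> dict:
    """Slack-style body: one "text" field carrying every required field."""
    status = str(inc.status_code) if inc.status_code is not None else "no response"
    text = (
        f"DOWN: {inc.url}\n"
        f"status: {status}\n"
        f"response time: {inc.response_time_ms:.0f} ms\n"
        f"region: {inc.region}\n"
        f"first failure: {inc.first_failure.isoformat()}"
    )
    return {"text": text}


def send_webhook(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    # POST the JSON payload and return the HTTP status from the chat service.
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```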
Link the alert destination or runbook to:
7) Add one optional synthetic check
Basic uptime checks only verify reachability.
If a user-facing path is critical, add one synthetic check for:
- login page load
- pricing or checkout page availability
- API auth callback landing page
Do not replace /healthz with synthetic monitoring. Use synthetic checks as a second layer.
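One way to sketch such a synthetic check without a browser, assuming the page is server-rendered: fetch it and require both a 200 and a marker string that only the working page contains, such as the login form's submit label. The `synthetic_page_check` name, the URL, and the marker are hypothetical; pages rendered by JavaScript need a real browser-based check instead.

```python
import http.client
import urllib.request

# Direct connection (proxy env vars ignored); redirects such as /login -> /login/ are followed.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def synthetic_page_check(url: str, must_contain: str, timeout: float = 10.0) -> bool:
    """True only if the page returns 200 and its HTML contains the marker text."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            if resp.status != 200:
                return False
            html = resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):  # 4xx/5xx, DNS, TLS, timeout
        return False
    # A 200 that serves the wrong content (e.g. a generic error shell) still fails.
    return must_contain in html
```

For example, `synthetic_page_check("https://yourdomain.com/login", "Sign in")` catches a login page that returns 200 but renders an empty error template.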
8) Document ownership and response path
For each monitor, document:
- who gets alerted
- where alerts go
- where the runbook lives
- which logs to check first
- who can roll back a deploy
This should live in the same place as your production runbooks and checklist. See SaaS Production Checklist.
9) Test the monitor before launch
Do not assume alerting works.
Test by doing one of these in staging:
- point the monitor at a known failing URL
- stop the app process
- block the endpoint temporarily
- break DNS on a temporary test domain
Confirm:
- alert fires
- correct person/channel receives it
- recovery notification arrives after fix
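For the "known failing URL" option, a throwaway stub in staging is enough. This sketch uses only Python's standard library `http.server` to answer every GET with 503; the `AlwaysDown` and `make_server` names, the port 8081 default, and exposing it on a temporary staging hostname are all assumptions, and it should never run in production.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class AlwaysDown(BaseHTTPRequestHandler):
    """Answers every GET with 503 so an uptime monitor must report a failure."""

    def do_GET(self):
        body = b'{"status":"down"}'
        self.send_response(503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # One line per request, so you can watch the monitor's probes arrive.
        print(f"probe from {self.client_address[0]}: {format % args}")


def make_server(host: str = "0.0.0.0", port: int = 8081) -> ThreadingHTTPServer:
    return ThreadingHTTPServer((host, port), AlwaysDown)
```

Start it with `make_server().serve_forever()`, point a test monitor at it, and wait for the alert. Then point the monitor back at a healthy /healthz to confirm the recovery notification arrives; simply stopping the stub would still look like an outage.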
Common causes
Typical reasons uptime monitoring fails or reports a real outage:
- no public health endpoint
- health endpoint returns a redirect instead of 200
- DNS record points to the wrong server
- Nginx or reverse proxy returns 502 or 503
- Gunicorn, Uvicorn, or app process is down
- firewall or security group blocks traffic
- SSL certificate expired or chain invalid
- health endpoint depends on database and fails during transient DB issues
- CDN or WAF blocks probes
- recent deploy broke env vars, startup command, or route registration
- provider outage affects only some regions
Debugging tips
Start from outside, then work inward.
External checks
```bash
curl -i https://yourdomain.com/healthz
# -k skips certificate verification: use it only to tell TLS errors apart from app errors
curl -vk https://yourdomain.com/healthz
dig yourdomain.com +short
dig api.yourdomain.com +short
nslookup yourdomain.com
openssl s_client -connect yourdomain.com:443 -servername yourdomain.com
ping yourdomain.com
```
Server checks
```bash
systemctl status nginx
systemctl status gunicorn
journalctl -u nginx -n 100 --no-pager
journalctl -u gunicorn -n 100 --no-pager
tail -n 100 /var/log/nginx/access.log
tail -n 100 /var/log/nginx/error.log
ss -tulpn | grep -E ':80|:443|:8000'
# GET rather than HEAD (-I): FastAPI GET routes may answer HEAD with 405
curl -i http://127.0.0.1:8000/healthz
```
What to compare
When a monitor reports downtime:
- compare failure timestamp to deploy timestamp
- check Nginx access/error logs
- check app logs
- verify DNS resolution
- verify TLS validity
- verify upstream app process
- test from another region or network
If monitor fails but local machine works, suspect:
- DNS propagation
- CDN/WAF rules
- firewall
- regional provider routing issue
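When DNS is a suspect, it helps to check what a name resolves to from the machine running the probe and compare that with the server IPs you expect. A small sketch using `socket.getaddrinfo`; the expected IP set is something you supply, and `resolve_ips` and `dns_points_to` are hypothetical names. Behind a CDN the resolved addresses are the CDN's edges, not your server, so compare against those instead.

```python
import socket
from typing import Iterable, Set


def resolve_ips(host: str, port: int = 443) -> Set[str]:
    """All IPv4/IPv6 addresses the local resolver returns for host."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}  # sockaddr[0] is the IP string


def dns_points_to(host: str, expected: Iterable[str]) -> bool:
    """True if every resolved address is one you expect (no stale records)."""
    ips = resolve_ips(host)
    return bool(ips) and ips <= set(expected)
```

A lookup failure raises `socket.gaierror`, which is itself a useful signal: the monitor is probably failing at the DNS step, before TLS or the app are involved.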
If health checks pass but users still report broken flows, add synthetic checks and error tracking via Error Tracking with Sentry.
If the public endpoint is failing with proxy errors, see Debugging Production Issues.
Checklist
- ✓ public /healthz endpoint exists
- ✓ /healthz returns 200 quickly
- ✓ endpoint is reachable over HTTPS from outside your infrastructure
- ✓ main domain is monitored
- ✓ API domain is monitored
- ✓ www domain is monitored if users access it
- ✓ alert channel is configured
- ✓ alert channel has been tested
- ✓ failure threshold is set to 2 or more consecutive failures
- ✓ recovery notifications are enabled
- ✓ SSL certificate monitoring is enabled
- ✓ monitors are labeled by environment
- ✓ staging and production alerts are separated
- ✓ runbook for responding to alerts is documented
- ✓ at least one staged alert test has been completed
Related guides
- Incident Response Playbook — incident handling after an alert fires
- Debugging Production Issues — step-by-step production triage
- Error Tracking with Sentry — capture exceptions when the app is technically up but broken
- SaaS Production Checklist — production launch and monitoring checklist
FAQ
What should my health endpoint return?
Return HTTP 200 with a small static JSON payload such as:
```json
{"status":"ok"}
```
Keep it fast and avoid heavy dependency checks.
Should uptime monitors check authenticated pages?
Not for the primary uptime check. Use a public health endpoint first. Add synthetic authenticated flows only if they are business-critical.
Why is my app marked down even though the server is running?
Uptime monitors validate the full public path:
- DNS
- TLS
- reverse proxy
- routing
- application response
A running process alone does not mean the service is reachable.
How many endpoints should I monitor?
At minimum:
- main app domain
- API domain
Optionally add:
- www domain
- status page
- webhook receiver path
- login or billing synthetic path
What is the difference between uptime monitoring and error tracking?
Uptime monitoring tells you whether the service is reachable.
Error tracking captures application exceptions and stack traces when requests fail.
Should /healthz check the database?
Usually no. Keep /healthz cheap and stable. If you need dependency-aware checks, use a separate /readyz endpoint.
What interval should I use?
Use 1 minute for production if possible. If budget is tight, 5 minutes is an acceptable starting point.
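A rough way to compare the two intervals: in the worst case the outage starts just after a passing check, so the alert fires only after `threshold` further checks, the last of which may fail only when its timeout expires. This back-of-the-envelope sketch assumes fixed-interval checks from a single region and ignores notification delivery delay; `worst_case_detection_s` is a hypothetical name.

```python
def worst_case_detection_s(interval_s: float, threshold: int, timeout_s: float) -> float:
    # The first failing check runs up to one interval after the outage begins,
    # each further required failure adds another interval, and the final check
    # may only be marked failed once its timeout expires.
    return interval_s * threshold + timeout_s
```

With the recommended settings (60 s interval, 2 failures, 10 s timeout) that is 130 seconds, just over 2 minutes; at a 5-minute interval it grows to 610 seconds, roughly 10 minutes.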
Should I monitor the homepage or a health endpoint?
Prefer a dedicated health endpoint for reliable uptime checks. Add homepage or browser-based synthetic checks separately if the UI itself is critical.
Do I need multi-region checks?
Yes, if your provider supports them. Multi-region checks reduce false positives and catch regional routing failures.
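One common way to use multiple regions against false positives is a quorum rule: alert only when a minimum number of regions fail at the same time. A hypothetical sketch; the region names and the quorum of 2 are assumptions. A quorum above 1 can hide a genuinely region-specific failure, so you may want single-region failures reported separately at a lower severity.

```python
from typing import Dict, List


def failing_regions(results: Dict[str, bool]) -> List[str]:
    """results maps region name -> True if that region's check passed."""
    return sorted(region for region, ok in results.items() if not ok)


def should_alert(results: Dict[str, bool], min_failing: int = 2) -> bool:
    # A single failing region is more likely a network blip near that probe,
    # so require agreement from several regions before paging anyone.
    return len(failing_regions(results)) >= min_failing
```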
Can I rely only on my cloud provider status page?
No. You need monitoring against your own domain, DNS, TLS, and app path.
Final takeaway
For a small SaaS, uptime monitoring should be simple, external, and tested.
Start with:
- one fast public /healthz endpoint
- one-minute checks
- real alerting
- SSL monitoring
- tested response path
Then add:
- incident handling via Incident Response Playbook
- production debugging via Debugging Production Issues
- error tracking via Error Tracking with Sentry
- launch validation via SaaS Production Checklist
That gives you the minimum reliable monitoring stack for an MVP or small SaaS in production.