Uptime Monitoring Setup

The essential playbook for setting up uptime monitoring in your SaaS.

Uptime monitoring is the minimum production safety net. For a small SaaS, you need an external monitor that checks your app from outside your infrastructure, hits a stable health endpoint, and alerts you fast when the app is down or degraded.

This setup is enough for most MVPs and small production apps:

  • one public health endpoint
  • one external uptime provider
  • one alert channel you actually watch
  • checks for main domain, API domain, and SSL/TLS

If you do nothing else for production visibility, do this first.

Quick Fix / Quick Setup

Add a minimal public health endpoint and monitor it externally.

python
# 1) Add a minimal health endpoint

# FastAPI
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/healthz")
def healthz():
    return JSONResponse({"status": "ok"}, status_code=200)

python
# Flask (alternative to the FastAPI version above)
from flask import Flask, jsonify

app = Flask(__name__)

@app.get("/healthz")
def healthz():
    return jsonify({"status": "ok"}), 200

Verify it from outside:

bash
curl -i https://yourdomain.com/healthz

Monitor these externally every 1 minute if possible:

text
https://yourdomain.com/healthz
https://www.yourdomain.com/healthz
https://api.yourdomain.com/healthz

Alert on:

  • non-200 response
  • timeout greater than 10s
  • SSL certificate problems
  • 2+ consecutive failures

Use an external provider, not a cron job on the same VPS. Start with one public health endpoint, one alert channel, and multi-region checks if available.
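
To make the alert conditions above concrete, here is a minimal sketch of what a single external check does, using only Python's standard library. The `probe` function and its return shape are assumptions for illustration, not any provider's API. It treats redirects as failures, since /healthz should answer 200 directly.

```python
import time
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Treat redirects as failures instead of following them."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx code


def probe(url: str, timeout: float = 10.0) -> dict:
    """Run one health check and report whether it passed.

    Note: urllib applies `timeout` per socket operation, not to the whole request.
    """
    opener = urllib.request.build_opener(_NoRedirect)
    started = time.monotonic()
    status, error = None, None
    try:
        with opener.open(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:  # any non-2xx response, including 3xx
        status = exc.code
    except (urllib.error.URLError, OSError) as exc:  # DNS, TLS, refused, timeout
        error = str(exc)
    elapsed = time.monotonic() - started
    return {"url": url, "ok": status == 200, "status": status,
            "seconds": round(elapsed, 3), "error": error}
```

Calling `probe("https://yourdomain.com/healthz")` from a machine outside your infrastructure gives roughly the same signal as the curl command above. It does not replace an external provider.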

What’s happening

Uptime monitoring checks whether your app is reachable over the public internet.

It catches issues that app logs alone may miss:

  • DNS failures
  • expired or invalid SSL certificates
  • reverse proxy misconfiguration
  • crashed app processes
  • bad deploys
  • broken routing
  • partial regional outages

A useful uptime setup should:

  • run outside your infrastructure
  • hit a lightweight endpoint
  • alert quickly
  • avoid false positives from one-off network blips

For a small SaaS, monitor at minimum:

  • main app domain
  • API domain
  • critical user-facing paths if they matter to revenue or auth flows

Process flow (diagram): monitor → DNS → CDN/WAF → Nginx/load balancer → app → database

Step-by-step implementation

1) Define what "up" means

For most small SaaS deployments, basic uptime means:

  • DNS resolves
  • HTTPS works
  • reverse proxy accepts traffic
  • app responds successfully
  • routing is correct

This is what /healthz should confirm.

If you also want dependency checks like database access, expose a separate readiness endpoint such as /readyz.

2) Add a public /healthz endpoint

Keep it cheap and stable.

Good response:

json
{"status":"ok"}

Avoid in /healthz:

  • expensive queries
  • third-party API calls
  • full-page rendering
  • auth requirements
  • redirects

Example FastAPI:

python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/healthz")
def healthz():
    return JSONResponse({"status": "ok"}, status_code=200)

@app.get("/readyz")
def readyz():
    # Optional deeper check
    # Verify DB or queue only if needed
    return JSONResponse({"status": "ready"}, status_code=200)

Example Flask:

python
from flask import Flask, jsonify

app = Flask(__name__)

@app.get("/healthz")
def healthz():
    return jsonify({"status": "ok"}), 200

@app.get("/readyz")
def readyz():
    return jsonify({"status": "ready"}), 200
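
The /readyz handlers above are placeholders. If you do want a dependency-aware check, a framework-neutral sketch could look like the following. The helper names are hypothetical, and SQLite stands in for whatever database you actually use.

```python
import sqlite3
from typing import Callable, Dict, Tuple


def check_sqlite(path: str) -> bool:
    """Stand-in dependency check: can we open the DB and run a trivial query?"""
    try:
        conn = sqlite3.connect(path, timeout=1.0)
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
        return True
    except sqlite3.Error:
        return False


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[dict, int]:
    """Run each named check; return 200 only if all pass, otherwise 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency.
            results[name] = False
    ready = all(results.values())
    body = {"status": "ready" if ready else "not ready", "checks": results}
    return body, 200 if ready else 503
```

In Flask the handler can return `jsonify(body), status`. In FastAPI it can return `JSONResponse(body, status_code=status)`. Keep /healthz independent of this so a transient database issue does not page you as a full outage.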

3) Expose the endpoint through your production stack

If you use Nginx in front of Gunicorn/Uvicorn, make sure /healthz is reachable through the public domain.

Example Nginx server block:

nginx
server {
    listen 80;
    server_name yourdomain.com www.yourdomain.com api.yourdomain.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl http2;
    server_name yourdomain.com www.yourdomain.com api.yourdomain.com;

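    # The certificate must cover every hostname in server_name,
    # otherwise monitors will report a hostname mismatch.
    ssl_certificate /etc/letsencrypt/live/yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/yourdomain.com/privkey.pem;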

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Test it externally:

bash
curl -i https://yourdomain.com/healthz
curl -i https://www.yourdomain.com/healthz
curl -i https://api.yourdomain.com/healthz

If your deployment stack is not stable yet, see Deploy SaaS with Nginx + Gunicorn.

4) Create external uptime checks

Use any external uptime provider that supports:

  • HTTPS checks
  • custom interval
  • SSL monitoring
  • multi-region checks
  • email or chat alerts

Recommended initial settings:

  • interval: 60 seconds
  • timeout: 10 seconds
  • expected status: 200
  • failure threshold: 2 consecutive failures
  • recovery notifications: enabled
  • region checks: multiple if available
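
The failure threshold and recovery notification settings above amount to a small state machine. Your provider implements this for you. The sketch below only illustrates the behavior, and the class and event names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureTracker:
    threshold: int = 2       # consecutive failures required before alerting
    failures: int = 0        # current run of consecutive failures
    alerting: bool = False   # True while an incident is open

    def record(self, ok: bool) -> Optional[str]:
        """Feed one check result; return 'down', 'recovered', or None."""
        if ok:
            self.failures = 0
            if self.alerting:
                self.alerting = False
                return "recovered"
            return None
        self.failures += 1
        if not self.alerting and self.failures >= self.threshold:
            self.alerting = True
            return "down"
        return None
```

A single failed check followed by a success produces no event, which is how the threshold filters out one-off network blips.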

Monitor these separately:

  • https://yourdomain.com/healthz
  • https://www.yourdomain.com/healthz
  • https://api.yourdomain.com/healthz

Do not combine all services into one health endpoint if they are independently exposed.

5) Add SSL certificate monitoring

Enable certificate expiry and TLS validation alerts if your provider supports them.

This catches:

  • expired certs
  • broken renewal
  • invalid chain
  • hostname mismatch

Manual verification:

bash
openssl s_client -connect yourdomain.com:443 -servername yourdomain.com </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -dates
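
If you want a scriptable version of the manual check, a minimal sketch with Python's standard library might look like this. The function names are hypothetical. `ssl.create_default_context()` verifies the chain and hostname, so an invalid certificate raises instead of returning a date.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the certificate's notAfter string; raises on TLS validation errors."""
    context = ssl.create_default_context()  # verifies chain and hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after: str, now: Optional[float] = None) -> float:
    """Days from `now` (epoch seconds) until the notAfter timestamp."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400
```

For example, you could warn when `days_remaining(fetch_not_after("yourdomain.com"))` drops below 14. The 14-day threshold is an assumption; pick one that leaves time to fix a broken renewal.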

6) Configure alerting

Start simple.

Recommended alert design:

  • email
  • one real-time channel such as Slack, Discord, PagerDuty, or SMS
  • alert after 2 failures
  • send recovery notification
  • repeat every 15 to 30 minutes during ongoing incident

Each alert should include:

  • failing URL
  • status code
  • response time
  • failing region
  • first failure timestamp
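
As an illustration of the fields above, here is a sketch of an alert message built from one failed check. The `CheckFailure` type and the message layout are assumptions. Most providers let you template something similar.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CheckFailure:
    url: str
    status_code: Optional[int]   # None when no HTTP response was received
    response_seconds: float
    region: str
    first_failure: datetime      # timezone-aware timestamp


def format_alert(f: CheckFailure) -> str:
    """Render the failure as a short plain-text alert."""
    status = f.status_code if f.status_code is not None else "no response"
    first = f.first_failure.astimezone(timezone.utc).isoformat()
    return (
        f"DOWN {f.url}\n"
        f"status: {status}\n"
        f"response time: {f.response_seconds:.2f}s\n"
        f"region: {f.region}\n"
        f"first failure: {first}"
    )
```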

Link the alert destination to the runbook for that monitor.

7) Add one optional synthetic check

Basic uptime checks only verify reachability.

If a user-facing path is critical, add one synthetic check for:

  • login page load
  • pricing or checkout page availability
  • API auth callback landing page

Do not replace /healthz with synthetic monitoring. Use synthetic checks as a second layer.
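
A synthetic check goes one step further than a status code: it also confirms the page contains what a user needs. The sketch below shows that evaluation step only. The function name and the marker strings are hypothetical, and fetching the page is left to your provider or HTTP client.

```python
from typing import Iterable, List


def evaluate_page(status: int, body: str, required_markers: Iterable[str]) -> List[str]:
    """Return a list of problems; an empty list means the page check passed."""
    problems = []
    if status != 200:
        problems.append(f"unexpected status {status}")
    for marker in required_markers:
        if marker not in body:
            problems.append(f"missing marker: {marker!r}")
    return problems
```

For a login page, a marker could be the form's `id` attribute. A page that returns 200 but renders an error template would then fail the check.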

8) Document ownership and response path

For each monitor, document:

  • who gets alerted
  • where alerts go
  • where the runbook lives
  • which logs to check first
  • who can roll back a deploy

This should live in the same place as your production runbooks and checklist. See SaaS Production Checklist.

9) Test the monitor before launch

Do not assume alerting works.

Test by doing one of these in staging:

  • point the monitor at a known failing URL
  • stop the app process
  • block the endpoint temporarily
  • break DNS on a temporary test domain

Confirm:

  • alert fires
  • correct person/channel receives it
  • recovery notification arrives after fix

Process flow (diagram): endpoint → provider → alert channel → failure test → recovery

Common causes

Typical reasons a monitor fails, whether from misconfiguration or a real outage:

  • no public health endpoint
  • health endpoint returns redirect instead of 200
  • DNS record points to wrong server
  • Nginx or reverse proxy returns 502 or 503
  • Gunicorn, Uvicorn, or app process is down
  • firewall or security group blocks traffic
  • SSL certificate expired or chain invalid
  • health endpoint depends on database and fails during transient DB issues
  • CDN or WAF blocks probes
  • recent deploy broke env vars, startup command, or route registration
  • provider outage affects only some regions

Debugging tips

Start from outside, then work inward.

External checks

bash
curl -i https://yourdomain.com/healthz
curl -vk https://yourdomain.com/healthz
dig yourdomain.com +short
dig api.yourdomain.com +short
nslookup yourdomain.com
openssl s_client -connect yourdomain.com:443 -servername yourdomain.com
ping yourdomain.com

Server checks

bash
systemctl status nginx
systemctl status gunicorn
journalctl -u nginx -n 100 --no-pager
journalctl -u gunicorn -n 100 --no-pager
tail -n 100 /var/log/nginx/access.log
tail -n 100 /var/log/nginx/error.log
ss -tulpn | grep -E ':80|:443|:8000'
curl -i http://127.0.0.1:8000/healthz

What to compare

When a monitor reports downtime:

  • compare failure timestamp to deploy timestamp
  • check Nginx access/error logs
  • check app logs
  • verify DNS resolution
  • verify TLS validity
  • verify upstream app process
  • test from another region or network
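
The first comparison, failure timestamp against deploy timestamp, is easy to script if you keep a deploy log. A minimal sketch follows. The function name and the 30-minute window are assumptions.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def suspect_deploy(failure_at: datetime, deploys: Iterable[datetime],
                   window: timedelta = timedelta(minutes=30)) -> Optional[datetime]:
    """Return the latest deploy that happened within `window` before the failure."""
    candidates = [d for d in deploys if timedelta(0) <= failure_at - d <= window]
    return max(candidates, default=None)
```

If this returns a deploy, rolling it back is usually the first thing to try. If it returns `None`, move on to DNS, TLS, and proxy checks.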

If monitor fails but local machine works, suspect:

  • DNS propagation
  • CDN/WAF rules
  • firewall
  • regional provider routing issue
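
When checks run from several regions, comparing the per-region results helps separate a regional routing problem from a full outage. A sketch of that classification, with hypothetical names and region labels:

```python
from typing import Dict


def classify_outage(results: Dict[str, bool]) -> str:
    """Map {region: check passed} to 'up', 'regional', or 'down'."""
    if not results:
        raise ValueError("no region results")
    failed = [region for region, ok in results.items() if not ok]
    if not failed:
        return "up"
    if len(failed) == len(results):
        return "down"
    return "regional"
```

A "regional" result points toward DNS propagation, CDN/WAF rules, or provider routing. A "down" result points toward your own stack.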

If health checks pass but users still report broken flows, add synthetic checks and error tracking via Error Tracking with Sentry.

If the public endpoint is failing with proxy errors, see Debugging Production Issues.

Checklist

  • public /healthz endpoint exists
  • /healthz returns 200 quickly
  • endpoint is reachable over HTTPS from outside your infrastructure
  • main domain is monitored
  • API domain is monitored
  • www domain is monitored if users access it
  • alert channel is configured
  • alert channel has been tested
  • failure threshold is set to 2 or more consecutive failures
  • recovery notifications are enabled
  • SSL certificate monitoring is enabled
  • monitors are labeled by environment
  • staging and production alerts are separated
  • runbook for responding to alerts is documented
  • at least one staged alert test has been completed

FAQ

What should my health endpoint return?

Return HTTP 200 with a small static JSON payload such as:

json
{"status":"ok"}

Keep it fast and avoid heavy dependency checks.

Should uptime monitors check authenticated pages?

Not for the primary uptime check. Use a public health endpoint first. Add synthetic authenticated flows only if they are business-critical.

Why is my app marked down even though the server is running?

Uptime monitors validate the full public path:

  • DNS
  • TLS
  • reverse proxy
  • routing
  • application response

A running process alone does not mean the service is reachable.

How many endpoints should I monitor?

At minimum:

  • main app domain
  • API domain

Optionally add:

  • www domain
  • status page
  • webhook receiver path
  • login or billing synthetic path

What is the difference between uptime monitoring and error tracking?

Uptime monitoring tells you whether the service is reachable.

Error tracking captures application exceptions and stack traces when requests fail.

Should /healthz check the database?

Usually no. Keep /healthz cheap and stable. If you need dependency-aware checks, use a separate /readyz endpoint.

What interval should I use?

Use 1 minute for production if possible. If budget is tight, 5 minutes is an acceptable starting point.

Should I monitor the homepage or a health endpoint?

Prefer a dedicated health endpoint for reliable uptime checks. Add homepage or browser-based synthetic checks separately if the UI itself is critical.

Do I need multi-region checks?

Yes, if your provider supports them. Multi-region checks reduce false positives and catch regional routing failures.

Can I rely only on my cloud provider status page?

No. You need monitoring against your own domain, DNS, TLS, and app path.

Final takeaway

For a small SaaS, uptime monitoring should be simple, external, and tested.

Start with:

  • one fast public /healthz endpoint
  • one-minute checks
  • real alerting
  • SSL monitoring
  • tested response path

That gives you the minimum reliable monitoring stack for an MVP or small SaaS in production.