Debugging Production Issues

The essential playbook for debugging production issues in your SaaS.

Use this page when something is broken in production and you need a deterministic triage workflow. The goal is to reduce time-to-diagnosis: confirm scope, check recent changes, inspect logs and metrics, isolate the failing layer, and either fix forward or roll back safely.

Quick Fix

Start with scope and recency: what broke, when it started, and what changed just before it started. Most production issues are caused by deploys, config drift, expired credentials, DNS/TLS problems, database saturation, or background worker failures.

bash
# 1) Check what changed
git log --oneline -n 5
systemctl status myapp nginx --no-pager

# 2) Check app + reverse proxy logs
journalctl -u myapp -n 200 --no-pager
journalctl -u nginx -n 200 --no-pager

# 3) Verify process, port, and health endpoint
ss -ltnp | grep -E ':80|:443|:8000'
curl -I https://yourdomain.com/health
curl -sS https://yourdomain.com/health

# 4) Check resource pressure
free -h
df -h
top -o %CPU

# 5) Check database reachability
pg_isready -h $DB_HOST -p $DB_PORT -U $DB_USER

# 6) If the issue started after a deploy, roll back
# example: switch symlink / redeploy previous image / restart service

What’s happening

Production failures usually happen in one of these layers:

  • DNS / TLS
  • reverse proxy
  • app process
  • database
  • cache
  • background workers
  • external APIs
  • file storage

The fastest path is to classify the symptom first:

  • app down
  • high error rate
  • slow requests
  • login failures
  • payment failures
  • webhook failures
  • missing background jobs

Then compare current behavior to the last known good state and identify the first failing dependency in the request path.

Process flow: user request → DNS → HTTPS → Nginx → app → DB/cache/queue → third-party API.

Step-by-step implementation

1. Confirm scope

Determine blast radius before changing anything.

Check:

  • all users or one tenant
  • one route or all routes
  • web only or workers too
  • one region, domain, or environment

Capture:

  • exact failing URL
  • timestamp in UTC
  • HTTP status code
  • response latency
  • recent deploy or config change time

Minimal incident notes template:

text
Incident start (UTC):
Affected routes:
Affected users/tenants:
Error code:
Recent deploy/config/migration:
Rollback available: yes/no

2. Reproduce safely

Use direct requests instead of relying only on browser behavior.

bash
curl -I https://yourdomain.com/health
curl -sS https://yourdomain.com/health
curl -v https://yourdomain.com/failing-route
curl -v http://127.0.0.1:8000/health

If the public domain fails but localhost works, investigate:

  • Nginx
  • TLS
  • DNS
  • firewall
  • upstream config

If localhost also fails, investigate:

  • app process
  • env vars
  • DB/cache dependencies
  • schema mismatch
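
That branching can be made explicit in a small triage helper. A minimal sketch in bash, reusing the placeholder domain and port from the curl examples above:

```bash
# Hypothetical triage helper: probe both endpoints, then report which
# layer to investigate. yourdomain.com and port 8000 are the same
# placeholders used in the curl examples above.

probe() {
  # Succeeds (exit 0) if the endpoint answers at all within the timeout.
  curl -sS -o /dev/null --connect-timeout 5 --max-time 10 "$1"
}

classify() {
  # $1 = exit code of the public probe, $2 = exit code of the local probe.
  local public_rc=$1 local_rc=$2
  if [ "$local_rc" -ne 0 ]; then
    echo "investigate: app process, env vars, DB/cache dependencies, schema"
  elif [ "$public_rc" -ne 0 ]; then
    echo "investigate: Nginx, TLS, DNS, firewall, upstream config"
  else
    echo "both endpoints answer; compare status codes and latency"
  fi
}

# Usage during an incident:
#   probe https://yourdomain.com/health; public=$?
#   probe http://127.0.0.1:8000/health; local_rc=$?
#   classify "$public" "$local_rc"
```

The point is to record the decision, not just make it mentally: the two exit codes plus the printed verdict go straight into the incident notes.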

3. Check recent changes first

Most incidents trace back to a recent change.

Review:

  • code deploy
  • environment variable update
  • secret rotation
  • migration
  • dependency update
  • DNS change
  • TLS renewal
  • cron or worker change

Commands:

bash
git log --oneline -n 10
systemctl status myapp nginx --no-pager
journalctl -u myapp --since "30 minutes ago" --no-pager
journalctl -u nginx --since "30 minutes ago" --no-pager

If using Docker:

bash
docker ps
docker logs --tail 200 <container_name>

Roll back early if:

  • issue started immediately after deploy
  • impact is customer-facing
  • no low-risk fix is obvious within minutes

4. Identify the failing layer

Use symptoms to narrow down where to look.

502 Bad Gateway

Usually means the reverse proxy cannot reach the app, or the upstream request fails or times out.

Check:

bash
journalctl -u nginx -n 200 --no-pager
journalctl -u myapp -n 200 --no-pager
ss -ltnp | grep -E ':80|:443|:8000'
nginx -t

Then review upstream config, app bind port, socket permissions, and timeout settings.

For deeper 502-specific checks, see 502 Bad Gateway Fix Guide.

500 Internal Server Error

Usually means the request reached the app process, but the app failed while handling it.

Check:

  • stack traces
  • missing env vars
  • runtime import errors
  • DB failures
  • schema mismatch

bash
journalctl -u myapp -n 200 --no-pager
journalctl -xe --no-pager
gunicorn --check-config app:app

Slow requests / timeouts

Look for:

  • slow queries
  • blocked workers
  • exhausted DB pool
  • external API waits
  • CPU or memory pressure
  • disk I/O saturation

bash
top -o %CPU
free -h
df -h
ps aux --sort=-%mem | head

5. Check service health

Verify process state, ports, restart loops, and OOM events.

bash
systemctl status myapp nginx --no-pager
ss -ltnp | grep -E ':80|:443|:8000|:5432|:6379'
dmesg | tail -n 50

Check for:

  • frequent service restarts
  • OOMKilled
  • closed expected ports
  • unhealthy containers
  • crashed workers

If using systemd, restart counts and failure reasons often point directly to the issue.
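
As a sketch, those two signals can be read with systemctl show (myapp is a placeholder unit name, and the NRestarts property needs a reasonably recent systemd):

```bash
# Sketch: read restart count and last exit status from systemd.
restart_summary() {
  # $1 = unit name; prints e.g. "myapp: 4 restarts, last exit 1"
  local unit=$1
  local props n status
  props=$(systemctl show "$unit" -p NRestarts -p ExecMainStatus 2>/dev/null)
  n=$(printf '%s\n' "$props" | sed -n 's/^NRestarts=//p')
  status=$(printf '%s\n' "$props" | sed -n 's/^ExecMainStatus=//p')
  echo "$unit: ${n:-?} restarts, last exit ${status:-?}"
}

# Usage: restart_summary myapp
```

A nonzero last exit with a climbing restart count usually means a crash loop, not a one-off failure.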

6. Review logs around the incident window

Correlate logs by timestamp.

Core commands:

bash
journalctl -u myapp -n 200 --no-pager
journalctl -u nginx -n 200 --no-pager
journalctl -xe --no-pager
docker logs --tail 200 <container_name>

Look for:

  • first error after deploy
  • repeated exceptions
  • request path + status code
  • upstream connection refused
  • database connection timeouts
  • auth/session errors
  • webhook signature failures

Best practice:

  • use UTC timestamps
  • include request IDs or correlation IDs
  • tail logs while reproducing
Layer  | Log file / sink           | Key fields
Nginx  | /var/log/nginx/access.log | request_id, status, upstream_time
App    | stdout / app.log          | request_id, user_id, endpoint, duration
Worker | stdout / worker.log       | request_id, job_id, queue, retries
DB     | pg_log / slow_query.log   | query, duration, lock_waits

[Figure: side-by-side log correlation view for Nginx, app, worker, and DB.]
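
As a rough sketch, that correlation can be done with plain grep across the sinks listed above (the file paths and the req- ID format are placeholders for your own setup):

```bash
# Sketch of request-ID correlation with plain grep.
correlate() {
  # Print every line mentioning one request ID across the given files,
  # sorted, so ISO-8601 timestamp prefixes yield chronological order.
  local rid=$1; shift
  grep -h -- "$rid" "$@" 2>/dev/null | sort
}

# Usage: correlate req-abc123 /var/log/nginx/access.log app.log worker.log
```

The sort only produces a correct timeline if every sink logs a UTC timestamp prefix in the same format, which is exactly why the best-practice list above insists on UTC.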

7. Check metrics and monitoring

Open these at the same time:

  • app logs
  • server logs
  • metrics dashboard
  • error tracking

Review:

  • CPU
  • memory
  • disk
  • network
  • request latency
  • error rate
  • restart count
  • DB connections
  • queue depth
  • external API latency

If you use Sentry, review grouped exceptions and release markers in Error Tracking with Sentry.

8. Validate configuration drift

Compare current production config to the last known good version.

Check for mismatches in:

  • environment variables
  • secrets
  • callback URLs
  • webhook secrets
  • CORS
  • cookie domain
  • ALLOWED_HOSTS
  • CSRF settings
  • JWT secret
  • proxy headers

Example checklist:

bash
printenv | sort > /tmp/current-env.txt
# compare against your last known-good env snapshot or secret manager values
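
One hedged way to do that comparison, assuming you keep a sorted printenv snapshot from the last healthy deploy (the snapshot path is a placeholder):

```bash
# Hypothetical drift check against a known-good snapshot taken with:
#   printenv | sort > snapshot-file

env_drift() {
  # Unified diff between a known-good env snapshot ($1) and the live
  # environment; added vars show as +NAME=..., removed as -NAME=...
  local snapshot=$1
  diff -u "$snapshot" <(printenv | sort) || true
}

# Usage: env_drift /tmp/known-good-env.txt
```

Be careful not to commit or paste the diff anywhere public, since it contains secret values; diffing only the variable names is a safer default for incident notes.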

Common production-only auth failures often come from cookie, domain, and proxy issues. See Common Auth Bugs and Fixes.

9. Test dependencies directly

Do not assume a dependency is healthy because the process is running.

PostgreSQL

bash
pg_isready -h $DB_HOST -p $DB_PORT -U $DB_USER
psql "$DATABASE_URL" -c 'select now();'

Redis

bash
redis-cli -u "$REDIS_URL" ping

DNS

bash
dig yourdomain.com
nslookup yourdomain.com

TLS

bash
openssl s_client -connect yourdomain.com:443 -servername yourdomain.com

Test outbound connectivity from the app host to:

  • DB
  • Redis
  • SMTP
  • S3
  • Stripe
  • OAuth provider

If third-party calls fail, check:

  • secret rotation
  • egress firewall rules
  • certificate validation
  • timeout changes
  • rate limits
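
A minimal connectivity sweep can be sketched with bash's /dev/tcp redirection (every host and port below is a placeholder for your real endpoints):

```bash
# Hypothetical outbound-connectivity sweep from the app host.
tcp_check() {
  # Attempt a TCP connect with a short timeout; report reachable/unreachable.
  local host=$1 port=$2
  if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "reachable: $host:$port"
  else
    echo "unreachable: $host:$port"
    return 1
  fi
}

# Usage (placeholder hosts):
#   tcp_check db.internal 5432      # PostgreSQL
#   tcp_check cache.internal 6379   # Redis
#   tcp_check smtp.mailhost 587     # SMTP
#   tcp_check api.stripe.com 443    # Stripe
```

A TCP connect only proves the network path; a reachable port with failing API calls points at credentials, TLS validation, or rate limits rather than the firewall.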

10. Check migrations and schema drift

After a deploy, verify that the app version matches the DB schema version.

Look for:

  • migration not applied
  • partial migration
  • app code expecting missing column or table
  • old workers running against new schema

Typical failure indicators:

  • column does not exist
  • relation does not exist
  • serialization / lock errors after migration
  • app works locally but fails only in production data paths

If the issue began after deployment, compare:

  • release version
  • migration version
  • worker version
  • background job payload format
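
A sketch of that comparison, assuming Alembic-style migrations (the commands in the comments are the usual ways to read each side; the revision values are illustrative):

```bash
# Sketch of a post-deploy schema check, assuming Alembic-style revisions.
schema_in_sync() {
  # Compare the migration revision the release expects ($1) with the
  # revision actually applied in the database ($2).
  local expected=$1 applied=$2
  if [ "$expected" = "$applied" ]; then
    echo "schema in sync at $applied"
  else
    echo "DRIFT: release expects $expected, database is at $applied"
    return 1
  fi
}

# Expected revision ships with the release, e.g.:  alembic heads
# Applied revision lives in the database, e.g.:  alembic current
#   or: psql "$DATABASE_URL" -c 'select version_num from alembic_version;'
```

Run the same check against worker hosts too; old workers on a new schema fail in exactly the silent way described in the next step.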

11. Inspect workers and async jobs

Background jobs often fail silently while web requests still appear healthy.

Check workers:

bash
celery -A app inspect ping
rq info

Also verify:

  • queue depth rising
  • retry loops flooding dependencies
  • cron/scheduler still active
  • dead-lettered jobs
  • webhook processing failures

Symptoms linked to worker problems:

  • emails not sending
  • reports not generating
  • payment webhooks delayed
  • import/export stuck
  • user-facing state never updates
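
A small sketch of a backlog probe follows; the redis-cli depth command in the comment is one option for a Celery-on-Redis setup and is an assumption, not a requirement:

```bash
# Hypothetical backlog probe: sample queue depth twice, report the trend.
# Example depth command for Celery's default queue on Redis:
#   redis-cli -u "$REDIS_URL" llen celery

queue_trend() {
  # $1 = command printing the current depth, $2 = seconds between samples.
  local depth_cmd=$1 interval=${2:-10}
  local a b
  a=$($depth_cmd)
  sleep "$interval"
  b=$($depth_cmd)
  if [ "$b" -gt "$a" ]; then
    echo "rising: $a -> $b (workers likely not keeping up)"
  else
    echo "stable: $a -> $b"
  fi
}

# Usage: queue_trend 'redis-cli -u redis://localhost:6379 llen celery' 30
```

A rising depth with healthy web requests is the classic signature of dead or blocked workers.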

12. Decide: fix forward or rollback

Do not debug indefinitely on a broken release.

Roll back when:

  • issue is active and customer-facing
  • incident started right after deploy
  • rollback path is safe and known
  • fix is uncertain or touches multiple systems

Fix forward when:

  • root cause is isolated
  • change is small and reversible
  • rollback would worsen state or data compatibility
  • you can verify the fix quickly

Typical rollback actions:

bash
# symlink-based deploy
ln -sfn /var/www/releases/<previous_release> /var/www/current
systemctl restart myapp nginx

# container-based deploy
# redeploy previous image tag, then verify health endpoint
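
After either rollback path, it helps to poll the health endpoint rather than check it once. A sketch, with placeholder URL and timings:

```bash
# Hypothetical post-rollback check: poll the health endpoint until it
# returns 200 or the attempt budget runs out.

wait_healthy() {
  local url=$1 attempts=${2:-30} delay=${3:-2}
  local i code
  for i in $(seq 1 "$attempts"); do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || code=000
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "still unhealthy after $attempts attempt(s), last status $code"
  return 1
}

# Usage: wait_healthy https://yourdomain.com/health 30 2
```

Polling avoids declaring recovery during the window when the old process is still starting up behind the proxy.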

13. Add a guardrail after recovery

After service is restored, add one prevention mechanism.

Examples:

  • /health/live and /health/ready
  • deploy smoke test
  • migration compatibility check
  • alert on restart loops
  • alert on queue backlog
  • alert on 5xx spike
  • config validation before deploy
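
A deploy smoke test can be as small as a loop over route=expected-status pairs. A sketch with placeholder routes:

```bash
# Minimal post-deploy smoke test sketch; routes and expected codes
# are placeholders for your own critical paths.

smoke() {
  # $1 = base URL; remaining args are route=expected_status pairs.
  local base=$1; shift
  local failed=0 spec route want got
  for spec in "$@"; do
    route=${spec%%=*}
    want=${spec##*=}
    got=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$base$route") || got=000
    if [ "$got" = "$want" ]; then
      echo "ok   $route ($got)"
    else
      echo "FAIL $route: want $want, got $got"
      failed=1
    fi
  done
  return "$failed"
}

# Usage: smoke https://yourdomain.com /health=200 /login=200
```

Wire its exit code into the deploy pipeline so a failing smoke test blocks or reverts the release automatically.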

Use a release checklist like SaaS Production Checklist to reduce repeat incidents.

Process flow: deploy → first error spike → mitigation → full recovery → permanent fix.

Common causes

Most production issues fall into these buckets:

  • bad deploy introduced a runtime error or incompatible dependency
  • required environment variables or secrets are missing or rotated
  • database migrations were skipped, partially applied, or incompatible
  • Nginx, Gunicorn, or container config points to the wrong socket, port, or upstream
  • app is crashing due to memory pressure, OOM kills, or restart loops
  • database, Redis, SMTP, or third-party APIs are unavailable or timing out
  • TLS certificate expired or DNS records changed incorrectly
  • disk is full, breaking logs, uploads, temp files, or DB operations
  • background workers are stopped or blocked by queue backlog
  • session, cookie, CSRF, or proxy header settings differ in production

Debugging tips

Use these practices during active incidents:

  • use UTC timestamps everywhere
  • tail logs live during reproduction
  • prefer health endpoints that split liveness and readiness
  • inspect disk space early
  • check restart loops before reading only app-level stack traces
  • use synthetic requests with correlation IDs
  • compare tenant-specific config if only one tenant is affected
  • make one reversible change at a time
  • record each action and result during the incident

Useful command set:

bash
systemctl status myapp nginx --no-pager
journalctl -u myapp -n 200 --no-pager
journalctl -u nginx -n 200 --no-pager
journalctl -xe --no-pager
docker ps
docker logs --tail 200 <container_name>
ss -ltnp | grep -E ':80|:443|:8000|:5432|:6379'
curl -I https://yourdomain.com/health
curl -sS https://yourdomain.com/health
curl -v http://127.0.0.1:8000/health
free -h
df -h
top -o %CPU
ps aux --sort=-%mem | head
dmesg | tail -n 50
pg_isready -h $DB_HOST -p $DB_PORT -U $DB_USER
psql "$DATABASE_URL" -c 'select now();'
redis-cli -u "$REDIS_URL" ping
dig yourdomain.com
nslookup yourdomain.com
openssl s_client -connect yourdomain.com:443 -servername yourdomain.com
nginx -t
gunicorn --check-config app:app
celery -A app inspect ping
rq info

Checklist

  • issue scope confirmed: all users, subset, or single tenant
  • recent changes reviewed: deploy, env vars, migrations, DNS, certs, workers
  • app, proxy, database, cache, and worker health verified
  • logs and metrics correlated by timestamp
  • root cause isolated to one layer or dependency
  • rollback decision made if impact is high
  • fix validated with health check and real request path
  • post-incident action added: alert, test, runbook, or config validation

Related guides

  • 502 Bad Gateway Fix Guide
  • Error Tracking with Sentry
  • Common Auth Bugs and Fixes
  • SaaS Production Checklist

FAQ

What is the fastest way to narrow down a production issue?

Start with the symptom and incident start time, then check recent changes, service status, logs, and metrics together. Identify the first failing layer before changing anything.

When should I roll back instead of continuing to debug?

Roll back if the issue began right after a deploy, affects users now, and you cannot verify a low-risk fix quickly. Restore service first, investigate second.

How do I debug an issue that only happens in production?

Compare production-only variables:

  • secrets
  • domains
  • proxy headers
  • cookie settings
  • TLS
  • DB size
  • queue load
  • third-party credentials

Reproduce with production-like config in a safe environment if needed.

What should every small SaaS log in production?

At minimum:

  • request path
  • status code
  • latency
  • request ID
  • user or tenant ID where safe
  • exception stack traces
  • deploy version
  • worker job failures
  • outbound API errors
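
As one illustrative shape (field names follow the list above; the values are invented), each request can emit a single JSON line:

```bash
# Sketch: one structured JSON log line per request.
log_request() {
  # $1 request_id, $2 path, $3 status, $4 duration_ms, $5 release
  printf '{"ts":"%s","request_id":"%s","path":"%s","status":%s,"duration_ms":%s,"release":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" "$4" "$5"
}

# Example: log_request req-abc123 /api/orders 500 842 v1.4.2
```

One line per request with a stable field set is what makes the grep-by-request-ID correlation in step 6 possible at all.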

Final takeaway

Production debugging is faster when you follow a fixed order:

  1. confirm scope
  2. check recent changes
  3. inspect logs
  4. inspect metrics
  5. test dependencies
  6. decide fix forward or rollback
  7. add guardrails

Most outages are not mysterious. They are usually a broken dependency, bad config, failed migration, worker issue, or bad release. A repeatable runbook prevents wasted time.