Debugging Production Issues
The essential playbook for debugging production issues in your SaaS.
Use this page when something is broken in production and you need a deterministic triage workflow. The goal is to reduce time-to-diagnosis: confirm scope, check recent changes, inspect logs and metrics, isolate the failing layer, and either fix forward or roll back safely.
Quick Fix / Quick Setup
Start with scope and recency: what broke, when it started, and what changed just before it started. A large percentage of production issues are caused by deploys, config drift, expired credentials, DNS/TLS issues, database saturation, or background worker failures.
# 1) Check what changed
git log --oneline -n 5
systemctl status myapp nginx --no-pager
# 2) Check app + reverse proxy logs
journalctl -u myapp -n 200 --no-pager
journalctl -u nginx -n 200 --no-pager
# 3) Verify process, port, and health endpoint
ss -ltnp | grep -E ':80|:443|:8000'
curl -I https://yourdomain.com/health
curl -sS https://yourdomain.com/health
# 4) Check resource pressure
free -h
df -h
top -o %CPU
# 5) Check database reachability
pg_isready -h $DB_HOST -p $DB_PORT -U $DB_USER
# 6) If the issue started after a deploy, roll back
# example: switch symlink / redeploy previous image / restart service
What’s happening
Production failures usually happen in one of these layers:
- DNS / TLS
- reverse proxy
- app process
- database
- cache
- background workers
- external APIs
- file storage
The fastest path is to classify the symptom first:
- app down
- high error rate
- slow requests
- login failures
- payment failures
- webhook failures
- missing background jobs
Then compare current behavior to the last known good state and identify the first failing dependency in the request path.
Process Flow
Step-by-step implementation
1. Confirm scope
Determine blast radius before changing anything.
Check:
- all users or one tenant
- one route or all routes
- web only or workers too
- one region, domain, or environment
Capture:
- exact failing URL
- timestamp in UTC
- HTTP status code
- response latency
- recent deploy or config change time
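If Nginx fronts the app, a quick pass over the access log can estimate blast radius. A sketch, assuming the default access log format where field 7 is the request path and field 9 the status code; the printf sample stands in for the real log:

```shell
# Count 5xx responses per route. Against a real server, replace the
# printf sample with: tail -n 10000 /var/log/nginx/access.log
printf '%s\n' \
  '1.2.3.4 - - [10/Oct/2025:12:00:01 +0000] "GET /api/orders HTTP/1.1" 502 0' \
  '1.2.3.4 - - [10/Oct/2025:12:00:02 +0000] "GET /api/orders HTTP/1.1" 500 0' \
  '5.6.7.8 - - [10/Oct/2025:12:00:03 +0000] "GET /health HTTP/1.1" 200 0' |
awk '$9 ~ /^5/ { n[$7]++ } END { for (r in n) print n[r], r }'
# prints: 2 /api/orders
```

If one route dominates the 5xx count, you have a route-scoped incident; an even spread points at a shared layer like the app process or the database.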
Minimal incident notes template:
Incident start (UTC):
Affected routes:
Affected users/tenants:
Error code:
Recent deploy/config/migration:
Rollback available: yes/no
2. Reproduce safely
Use direct requests instead of relying only on browser behavior.
curl -I https://yourdomain.com/health
curl -sS https://yourdomain.com/health
curl -v https://yourdomain.com/failing-route
curl -v http://127.0.0.1:8000/health
If the public domain fails but localhost works, investigate:
- Nginx
- TLS
- DNS
- firewall
- upstream config
If localhost also fails, investigate:
- app process
- env vars
- DB/cache dependencies
- schema mismatch
3. Check recent changes first
A large percentage of incidents are recent-change related.
Review:
- code deploy
- environment variable update
- secret rotation
- migration
- dependency update
- DNS change
- TLS renewal
- cron or worker change
Commands:
git log --oneline -n 10
systemctl status myapp nginx --no-pager
journalctl -u myapp --since "30 minutes ago" --no-pager
journalctl -u nginx --since "30 minutes ago" --no-pager
If using Docker:
docker ps
docker logs --tail 200 <container_name>
Roll back early if:
- issue started immediately after deploy
- impact is customer-facing
- no low-risk fix is obvious within minutes
4. Identify the failing layer
Use symptoms to narrow down where to look.
502 Bad Gateway
Usually means reverse proxy cannot reach the app or upstream fails.
Check:
journalctl -u nginx -n 200 --no-pager
journalctl -u myapp -n 200 --no-pager
ss -ltnp | grep -E ':80|:443|:8000'
nginx -t
Then review upstream config, app bind port, socket permissions, and timeout settings.
For deeper 502-specific checks, see 502 Bad Gateway Fix Guide.
500 Internal Server Error
Usually means the app process handled the request but failed.
Check:
- stack traces
- missing env vars
- runtime import errors
- DB failures
- schema mismatch
journalctl -u myapp -n 200 --no-pager
journalctl -xe --no-pager
gunicorn --check-config app:app
Slow requests / timeouts
Look for:
- slow queries
- blocked workers
- exhausted DB pool
- external API waits
- CPU or memory pressure
- disk I/O saturation
top -o %CPU
free -h
df -h
ps aux --sort=-%mem | head
5. Check service health
Verify process state, ports, restart loops, and OOM events.
systemctl status myapp nginx --no-pager
ss -ltnp | grep -E ':80|:443|:8000|:5432|:6379'
dmesg | tail -n 50
Check for:
- frequent service restarts
- OOMKilled events
- closed expected ports
- unhealthy containers
- crashed workers
If using systemd, restart counts and failure reasons often point directly to the issue.
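A quick way to confirm OOM kills is to scan the kernel log for the kernel's "Killed process" message. A sketch, with printf lines standing in for real `dmesg` output:

```shell
# Scan kernel log output for OOM kills. Against a real host, replace the
# printf sample with: dmesg | tail -n 200
printf '%s\n' \
  '[12345.678] Out of memory: Killed process 4321 (gunicorn)' \
  '[12350.123] eth0: link up' |
grep -i 'killed process'
# prints the OOM line, naming the killed process
```

If this matches, the fix is usually a memory leak, an undersized host, or worker counts tuned too high, not application logic.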
6. Review logs around the incident window
Correlate logs by timestamp.
Core commands:
journalctl -u myapp -n 200 --no-pager
journalctl -u nginx -n 200 --no-pager
journalctl -xe --no-pager
docker logs --tail 200 <container_name>
Look for:
- first error after deploy
- repeated exceptions
- request path + status code
- upstream connection refused
- database connection timeouts
- auth/session errors
- webhook signature failures
Best practice:
- use UTC timestamps
- include request IDs or correlation IDs
- tail logs while reproducing
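With UTC timestamps in place, "first error after deploy" becomes a simple filter. A sketch, assuming log lines start with an ISO-8601 timestamp (lexicographic comparison then matches chronological order); the printf sample stands in for the real app log:

```shell
# Print the first ERROR line at or after the deploy time.
# Replace the printf sample with your real app log file.
deploy="2025-10-10T12:00:00"
printf '%s\n' \
  '2025-10-10T11:59:58 INFO request ok' \
  '2025-10-10T12:00:05 ERROR DB connection refused' \
  '2025-10-10T12:00:09 ERROR DB connection refused' |
awk -v d="$deploy" '$1 >= d && /ERROR/ { print; exit }'
# prints: 2025-10-10T12:00:05 ERROR DB connection refused
```

The first error after the deploy timestamp is usually far more informative than the most recent one, which is often a downstream symptom.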
| Layer | Log file / sink | Key fields |
|---|---|---|
| Nginx | /var/log/nginx/access.log | request_id, status, upstream_time |
| App | stdout / app.log | request_id, user_id, endpoint, duration |
| Worker | stdout / worker.log | request_id, job_id, queue, retries |
| DB | pg_log / slow_query.log | query, duration, lock_waits |
Figure: side-by-side log correlation view for Nginx, app, worker, and DB.
7. Check metrics and monitoring
Open these at the same time:
- app logs
- server logs
- metrics dashboard
- error tracking
Review:
- CPU
- memory
- disk
- network
- request latency
- error rate
- restart count
- DB connections
- queue depth
- external API latency
If you use Sentry, review grouped exceptions and release markers in Error Tracking with Sentry.
8. Validate configuration drift
Compare current production config to the last known good version.
Check for mismatches in:
- environment variables
- secrets
- callback URLs
- webhook secrets
- CORS
- cookie domain
- ALLOWED_HOSTS
- CSRF settings
- JWT secret
- proxy headers
Example checklist:
printenv | sort > /tmp/current-env.txt
# compare against your last known-good env snapshot or secret manager values
Common production-only auth failures often come from cookie, domain, and proxy issues. See Common Auth Bugs and Fixes.
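To spot drift mechanically, diff the sorted snapshot against the last known-good one; `comm -3` prints only lines unique to either file. A sketch with sample files; in practice, compare /tmp/current-env.txt against your stored snapshot:

```shell
# Compare a current env snapshot to a known-good one (sample data here).
# comm requires both inputs to be sorted, as printenv | sort produces.
now=$(mktemp); good=$(mktemp)
printf '%s\n' 'ALLOWED_HOSTS=app.example.com' 'DEBUG=true'  > "$now"
printf '%s\n' 'ALLOWED_HOSTS=app.example.com' 'DEBUG=false' > "$good"
comm -3 "$now" "$good"   # column 1: only in current, column 2: only in known-good
rm -f "$now" "$good"
```

Any line that appears in either column is a config value that changed, appeared, or disappeared since the last good state.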
9. Test dependencies directly
Do not assume a dependency is healthy because the process is running.
PostgreSQL
pg_isready -h $DB_HOST -p $DB_PORT -U $DB_USER
psql "$DATABASE_URL" -c 'select now();'
Redis
redis-cli -u "$REDIS_URL" ping
DNS
dig yourdomain.com
nslookup yourdomain.com
TLS
openssl s_client -connect yourdomain.com:443 -servername yourdomain.com
Test outbound connectivity from the app host to:
- DB
- Redis
- SMTP
- S3
- Stripe
- OAuth provider
If third-party calls fail, check:
- secret rotation
- egress firewall rules
- certificate validation
- timeout changes
- rate limits
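When client tools like psql or redis-cli are not installed on the app host, bash's /dev/tcp pseudo-device gives a bare TCP reachability check with no extra packages. A sketch (bash-specific, not POSIX sh; host and port values are examples):

```shell
# Return 0 if host:port accepts TCP connections, non-zero otherwise.
# /dev/tcp is a bash feature, so the check runs inside bash -c.
check_tcp() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# usage against your real dependencies, e.g.:
# check_tcp "$DB_HOST" 5432  || echo "DB unreachable"
# check_tcp "$REDIS_HOST" 6379 || echo "Redis unreachable"
```

This distinguishes "network path is broken" from "service is up but rejecting the application", which points you at firewalls versus credentials.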
10. Check migrations and schema drift
After deploy, verify app version matches DB schema.
Look for:
- migration not applied
- partial migration
- app code expecting missing column or table
- old workers running against new schema
Typical failure indicators:
- "column does not exist"
- "relation does not exist"
- serialization / lock errors after migration
- app works locally but fails only in production data paths
If the issue began after deployment, compare:
- release version
- migration version
- worker version
- background job payload format
11. Inspect workers and async jobs
Background jobs often fail silently while web requests still appear healthy.
Check workers:
celery -A app inspect ping
rq info
Also verify:
- queue depth rising
- retry loops flooding dependencies
- cron/scheduler still active
- dead-lettered jobs
- webhook processing failures
Symptoms linked to worker problems:
- emails not sending
- reports not generating
- payment webhooks delayed
- import/export stuck
- user-facing state never updates
12. Decide: fix forward or roll back
Do not debug indefinitely on a broken release.
Roll back when:
- issue is active and customer-facing
- incident started right after deploy
- rollback path is safe and known
- fix is uncertain or touches multiple systems
Fix forward when:
- root cause is isolated
- change is small and reversible
- rollback would worsen state or data compatibility
- you can verify the fix quickly
Typical rollback actions:
# symlink-based deploy
ln -sfn /var/www/releases/<previous_release> /var/www/current
systemctl restart myapp nginx
# container-based deploy
# redeploy previous image tag, then verify health endpoint
13. Add a guardrail after recovery
After service is restored, add one prevention mechanism.
Examples:
- /health/live and /health/ready endpoints
- deploy smoke test
- migration compatibility check
- alert on restart loops
- alert on queue backlog
- alert on 5xx spike
- config validation before deploy
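One cheap guardrail is validating required configuration before the app starts, so a rotated or missing secret fails the deploy instead of the first user request. A sketch; the variable names in the usage line are examples, not a required set:

```shell
# Fail fast when required env vars are missing or empty.
check_env() {
  missing=""
  for var in "$@"; do
    # indirect lookup; :- avoids errors on unset vars under set -u
    eval "val=\${$var:-}"
    [ -n "$val" ] || missing="$missing $var"
  done
  [ -z "$missing" ] || { echo "missing env vars:$missing" >&2; return 1; }
}

# in a deploy or start script:
# check_env DATABASE_URL SECRET_KEY STRIPE_API_KEY || exit 1
```

Run this in the deploy pipeline and in the service start script, so both paths refuse to launch with incomplete config.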
Use a release checklist like SaaS Production Checklist to reduce repeat incidents.
Common causes
Most production issues fall into these buckets:
- bad deploy introduced a runtime error or incompatible dependency
- required environment variables or secrets are missing or rotated
- database migrations were skipped, partially applied, or incompatible
- Nginx, Gunicorn, or container config points to the wrong socket, port, or upstream
- app is crashing due to memory pressure, OOM kills, or restart loops
- database, Redis, SMTP, or third-party APIs are unavailable or timing out
- TLS certificate expired or DNS records changed incorrectly
- disk is full, breaking logs, uploads, temp files, or DB operations
- background workers are stopped or blocked by queue backlog
- session, cookie, CSRF, or proxy header settings differ in production
Debugging tips
Use these practices during active incidents:
- use UTC timestamps everywhere
- tail logs live during reproduction
- prefer health endpoints that split liveness and readiness
- inspect disk space early
- check restart loops before reading only app-level stack traces
- use synthetic requests with correlation IDs
- compare tenant-specific config if only one tenant is affected
- make one reversible change at a time
- record each action and result during the incident
Useful command set:
systemctl status myapp nginx --no-pager
journalctl -u myapp -n 200 --no-pager
journalctl -u nginx -n 200 --no-pager
journalctl -xe --no-pager
docker ps
docker logs --tail 200 <container_name>
ss -ltnp | grep -E ':80|:443|:8000|:5432|:6379'
curl -I https://yourdomain.com/health
curl -sS https://yourdomain.com/health
curl -v http://127.0.0.1:8000/health
free -h
df -h
top -o %CPU
ps aux --sort=-%mem | head
dmesg | tail -n 50
pg_isready -h $DB_HOST -p $DB_PORT -U $DB_USER
psql "$DATABASE_URL" -c 'select now();'
redis-cli -u "$REDIS_URL" ping
dig yourdomain.com
nslookup yourdomain.com
openssl s_client -connect yourdomain.com:443 -servername yourdomain.com
nginx -t
gunicorn --check-config app:app
celery -A app inspect ping
rq info
Checklist
- ✓ issue scope confirmed: all users, subset, or single tenant
- ✓ recent changes reviewed: deploy, env vars, migrations, DNS, certs, workers
- ✓ app, proxy, database, cache, and worker health verified
- ✓ logs and metrics correlated by timestamp
- ✓ root cause isolated to one layer or dependency
- ✓ rollback decision made if impact is high
- ✓ fix validated with health check and real request path
- ✓ post-incident action added: alert, test, runbook, or config validation
Related guides
- Error Tracking with Sentry
- 502 Bad Gateway Fix Guide
- Common Auth Bugs and Fixes
- SaaS Production Checklist
FAQ
What is the fastest way to narrow down a production issue?
Start with the symptom and incident start time, then check recent changes, service status, logs, and metrics together. Identify the first failing layer before changing anything.
When should I roll back instead of continuing to debug?
Roll back if the issue began right after deploy, affects users now, and you do not have a verified low-risk fix quickly. Restore service first, investigate second.
How do I debug an issue that only happens in production?
Compare production-only variables:
- secrets
- domains
- proxy headers
- cookie settings
- TLS
- DB size
- queue load
- third-party credentials
Reproduce with production-like config in a safe environment if needed.
What should every small SaaS log in production?
At minimum:
- request path
- status code
- latency
- request ID
- user or tenant ID where safe
- exception stack traces
- deploy version
- worker job failures
- outbound API errors
Final takeaway
Production debugging is faster when you follow a fixed order:
- confirm scope
- check recent changes
- inspect logs
- inspect metrics
- test dependencies
- decide whether to fix forward or roll back
- add guardrails
Most outages are not mysterious. They are usually a broken dependency, bad config, failed migration, worker issue, or bad release. A repeatable runbook prevents wasted time.