Debugging Production Issues

The essential playbook for debugging production issues in your SaaS.

Use this page when something is broken in production and you need a deterministic triage workflow. The goal is to reduce time-to-diagnosis: confirm scope, check recent changes, inspect logs and metrics, isolate the failing layer, and either fix forward or roll back safely.

Quick Fix

Start with scope and recency: what broke, when it started, and what changed just before it started. Most production issues are caused by deploys, config drift, expired credentials, DNS/TLS problems, database saturation, or background worker failures.

bash
# 1) Check what changed
git log --oneline -n 5
systemctl status myapp nginx --no-pager

# 2) Check app + reverse proxy logs
journalctl -u myapp -n 200 --no-pager
journalctl -u nginx -n 200 --no-pager

# 3) Verify process, port, and health endpoint
ss -ltnp | grep -E ':80|:443|:8000'
curl -I https://yourdomain.com/health
curl -sS https://yourdomain.com/health

# 4) Check resource pressure
free -h
df -h
top -o %CPU

# 5) Check database reachability
pg_isready -h $DB_HOST -p $DB_PORT -U $DB_USER

# 6) If the issue started after a deploy, roll back
# example: switch symlink / redeploy previous image / restart service

What’s happening

Production failures usually happen in one of these layers:

  • DNS / TLS
  • reverse proxy
  • app process
  • database
  • cache
  • background workers
  • external APIs
  • file storage

The fastest path is to classify the symptom first:

  • app down
  • high error rate
  • slow requests
  • login failures
  • payment failures
  • webhook failures
  • missing background jobs

Then compare current behavior to the last known good state and identify the first failing dependency in the request path.

Process flow: user request → DNS → HTTPS → Nginx → app → DB/cache/queue → third-party API.

Step-by-step implementation

1. Confirm scope

Determine blast radius before changing anything.

Check:

  • all users or one tenant
  • one route or all routes
  • web only or workers too
  • one region, domain, or environment

Capture:

  • exact failing URL
  • timestamp in UTC
  • HTTP status code
  • response latency
  • recent deploy or config change time

Minimal incident notes template:

text
Incident start (UTC):
Affected routes:
Affected users/tenants:
Error code:
Recent deploy/config/migration:
Rollback available: yes/no

2. Reproduce safely

Use direct requests instead of relying only on browser behavior.

bash
curl -I https://yourdomain.com/health
curl -sS https://yourdomain.com/health
curl -v https://yourdomain.com/failing-route
curl -v http://127.0.0.1:8000/health

If the public domain fails but localhost works, investigate:

  • Nginx
  • TLS
  • DNS
  • firewall
  • upstream config

If localhost also fails, investigate:

  • app process
  • env vars
  • DB/cache dependencies
  • schema mismatch
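
That branching can be made explicit in a small triage helper. A minimal sketch in bash, reusing the placeholder domain and port from the curl examples above:

```bash
# Hypothetical triage helper: probe both endpoints, then report which
# layer to investigate. yourdomain.com and port 8000 are the same
# placeholders used in the curl examples above.

probe() {
  # Succeeds (exit 0) if the endpoint answers at all within the timeout.
  curl -sS -o /dev/null --connect-timeout 5 --max-time 10 "$1"
}

classify() {
  # $1 = exit code of the public probe, $2 = exit code of the local probe.
  local public_rc=$1 local_rc=$2
  if [ "$local_rc" -ne 0 ]; then
    echo "investigate: app process, env vars, DB/cache dependencies, schema"
  elif [ "$public_rc" -ne 0 ]; then
    echo "investigate: Nginx, TLS, DNS, firewall, upstream config"
  else
    echo "both endpoints answer; compare status codes and latency"
  fi
}

# Usage during an incident:
#   probe https://yourdomain.com/health; public=$?
#   probe http://127.0.0.1:8000/health; local_rc=$?
#   classify "$public" "$local_rc"
```

The point is to record the decision, not just make it mentally: the two exit codes plus the printed verdict go straight into the incident notes.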

3. Check recent changes first

Most incidents trace back to a recent change.

Review:

  • code deploy
  • environment variable update
  • secret rotation
  • migration
  • dependency update
  • DNS change
  • TLS renewal
  • cron or worker change

Commands:

bash
git log --oneline -n 10
systemctl status myapp nginx --no-pager
journalctl -u myapp --since "30 minutes ago" --no-pager
journalctl -u nginx --since "30 minutes ago" --no-pager

If using Docker:

bash
docker ps
docker logs --tail 200 <container_name>

Roll back early if:

  • issue started immediately after deploy
  • impact is customer-facing
  • no low-risk fix is obvious within minutes

4. Identify the failing layer

Use symptoms to narrow down where to look.

502 Bad Gateway

Usually means the reverse proxy cannot reach the app, or the upstream request fails or times out.

Check:

bash
journalctl -u nginx -n 200 --no-pager
journalctl -u myapp -n 200 --no-pager
ss -ltnp | grep -E ':80|:443|:8000'
nginx -t

Then review upstream config, app bind port, socket permissions, and timeout settings.

For deeper 502-specific checks, see 502 Bad Gateway Fix Guide.

500 Internal Server Error

Usually means the request reached the app process, but the app failed while handling it.

Check:

  • stack traces
  • missing env vars
  • runtime import errors
  • DB failures
  • schema mismatch

bash
journalctl -u myapp -n 200 --no-pager
journalctl -xe --no-pager
gunicorn --check-config app:app

Slow requests / timeouts

Look for:

  • slow queries
  • blocked workers
  • exhausted DB pool
  • external API waits
  • CPU or memory pressure
  • disk I/O saturation

bash
top -o %CPU
free -h
df -h
ps aux --sort=-%mem | head

5. Check service health

Verify process state, ports, restart loops, and OOM events.

bash
systemctl status myapp nginx --no-pager
ss -ltnp | grep -E ':80|:443|:8000|:5432|:6379'
dmesg | tail -n 50

Check for:

  • frequent service restarts
  • OOMKilled
  • closed expected ports
  • unhealthy containers
  • crashed workers

If using systemd, restart counts and failure reasons often point directly to the issue.
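
As a sketch, those two signals can be read with systemctl show (myapp is a placeholder unit name, and the NRestarts property needs a reasonably recent systemd):

```bash
# Sketch: read restart count and last exit status from systemd.
restart_summary() {
  # $1 = unit name; prints e.g. "myapp: 4 restarts, last exit 1"
  local unit=$1
  local props n status
  props=$(systemctl show "$unit" -p NRestarts -p ExecMainStatus 2>/dev/null)
  n=$(printf '%s\n' "$props" | sed -n 's/^NRestarts=//p')
  status=$(printf '%s\n' "$props" | sed -n 's/^ExecMainStatus=//p')
  echo "$unit: ${n:-?} restarts, last exit ${status:-?}"
}

# Usage: restart_summary myapp
```

A nonzero last exit with a climbing restart count usually means a crash loop, not a one-off failure.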

6. Review logs around the incident window

Correlate logs by timestamp.

Core commands:

bash
journalctl -u myapp -n 200 --no-pager
journalctl -u nginx -n 200 --no-pager
journalctl -xe --no-pager
docker logs --tail 200 <container_name>

Look for:

  • first error after deploy
  • repeated exceptions
  • request path + status code
  • upstream connection refused
  • database connection timeouts
  • auth/session errors
  • webhook signature failures

Best practice:

  • use UTC timestamps
  • include request IDs or correlation IDs
  • tail logs while reproducing
Layer  | Log file / sink           | Key fields
Nginx  | /var/log/nginx/access.log | request_id, status, upstream_time
App    | stdout / app.log          | request_id, user_id, endpoint, duration
Worker | stdout / worker.log       | request_id, job_id, queue, retries
DB     | pg_log / slow_query.log   | query, duration, lock_waits

[Figure: side-by-side log correlation view for Nginx, app, worker, and DB.]
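
As a rough sketch, that correlation can be done with plain grep across the sinks listed above (the file paths and the req- ID format are placeholders for your own setup):

```bash
# Sketch of request-ID correlation with plain grep.
correlate() {
  # Print every line mentioning one request ID across the given files,
  # sorted, so ISO-8601 timestamp prefixes yield chronological order.
  local rid=$1; shift
  grep -h -- "$rid" "$@" 2>/dev/null | sort
}

# Usage: correlate req-abc123 /var/log/nginx/access.log app.log worker.log
```

The sort only produces a correct timeline if every sink logs a UTC timestamp prefix in the same format, which is exactly why the best-practice list above insists on UTC.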

7. Check metrics and monitoring

Open these at the same time:

  • app logs
  • server logs
  • metrics dashboard
  • error tracking

Review:

  • CPU
  • memory
  • disk
  • network
  • request latency
  • error rate
  • restart count
  • DB connections
  • queue depth
  • external API latency

If you use Sentry, review grouped exceptions and release markers in Error Tracking with Sentry.

8. Validate configuration drift

Compare current production config to the last known good version.

Check for mismatches in:

  • environment variables
  • secrets
  • callback URLs
  • webhook secrets
  • CORS
  • cookie domain
  • ALLOWED_HOSTS
  • CSRF settings
  • JWT secret
  • proxy headers

Example checklist:

bash
printenv | sort > /tmp/current-env.txt
# compare against your last known-good env snapshot or secret manager values
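
One hedged way to do that comparison, assuming you keep a sorted printenv snapshot from the last healthy deploy (the snapshot path is a placeholder):

```bash
# Hypothetical drift check against a known-good snapshot taken with:
#   printenv | sort > snapshot-file

env_drift() {
  # Unified diff between a known-good env snapshot ($1) and the live
  # environment; added vars show as +NAME=..., removed as -NAME=...
  local snapshot=$1
  diff -u "$snapshot" <(printenv | sort) || true
}

# Usage: env_drift /tmp/known-good-env.txt
```

Be careful not to commit or paste the diff anywhere public, since it contains secret values; diffing only the variable names is a safer default for incident notes.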

Common production-only auth failures often come from cookie, domain, and proxy issues. See Common Auth Bugs and Fixes.

9. Test dependencies directly

Do not assume a dependency is healthy because the process is running.

PostgreSQL

bash
pg_isready -h $DB_HOST -p $DB_PORT -U $DB_USER
psql "$DATABASE_URL" -c 'select now();'

Redis

bash
redis-cli -u "$REDIS_URL" ping

DNS

bash
dig yourdomain.com
nslookup yourdomain.com

TLS

bash
openssl s_client -connect yourdomain.com:443 -servername yourdomain.com

Test outbound connectivity from the app host to:

  • DB
  • Redis
  • SMTP
  • S3
  • Stripe
  • OAuth provider

If third-party calls fail, check:

  • secret rotation
  • egress firewall rules
  • certificate validation
  • timeout changes
  • rate limits
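
A minimal connectivity sweep can be sketched with bash's /dev/tcp redirection (every host and port below is a placeholder for your real endpoints):

```bash
# Hypothetical outbound-connectivity sweep from the app host.
tcp_check() {
  # Attempt a TCP connect with a short timeout; report reachable/unreachable.
  local host=$1 port=$2
  if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "reachable: $host:$port"
  else
    echo "unreachable: $host:$port"
    return 1
  fi
}

# Usage (placeholder hosts):
#   tcp_check db.internal 5432      # PostgreSQL
#   tcp_check cache.internal 6379   # Redis
#   tcp_check smtp.mailhost 587     # SMTP
#   tcp_check api.stripe.com 443    # Stripe
```

A TCP connect only proves the network path; a reachable port with failing API calls points at credentials, TLS validation, or rate limits rather than the firewall.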

10. Check migrations and schema drift

After a deploy, verify that the app version matches the DB schema version.

Look for:

  • migration not applied
  • partial migration
  • app code expecting missing column or table
  • old workers running against new schema

Typical failure indicators:

  • column does not exist
  • relation does not exist
  • serialization / lock errors after migration
  • app works locally but fails only in production data paths

If the issue began after deployment, compare:

  • release version
  • migration version
  • worker version
  • background job payload format
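
A sketch of that comparison, assuming Alembic-style migrations (the commands in the comments are the usual ways to read each side; the revision values are illustrative):

```bash
# Sketch of a post-deploy schema check, assuming Alembic-style revisions.
schema_in_sync() {
  # Compare the migration revision the release expects ($1) with the
  # revision actually applied in the database ($2).
  local expected=$1 applied=$2
  if [ "$expected" = "$applied" ]; then
    echo "schema in sync at $applied"
  else
    echo "DRIFT: release expects $expected, database is at $applied"
    return 1
  fi
}

# Expected revision ships with the release, e.g.:  alembic heads
# Applied revision lives in the database, e.g.:  alembic current
#   or: psql "$DATABASE_URL" -c 'select version_num from alembic_version;'
```

Run the same check against worker hosts too; old workers on a new schema fail in exactly the silent way described in the next step.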

11. Inspect workers and async jobs

Background jobs often fail silently while web requests still appear healthy.

Check workers:

bash
celery -A app inspect ping
rq info

Also verify:

  • queue depth rising
  • retry loops flooding dependencies
  • cron/scheduler still active
  • dead-lettered jobs
  • webhook processing failures

Symptoms linked to worker problems:

  • emails not sending
  • reports not generating
  • payment webhooks delayed
  • import/export stuck
  • user-facing state never updates
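
A small sketch of a backlog probe follows; the redis-cli depth command in the comment is one option for a Celery-on-Redis setup and is an assumption, not a requirement:

```bash
# Hypothetical backlog probe: sample queue depth twice, report the trend.
# Example depth command for Celery's default queue on Redis:
#   redis-cli -u "$REDIS_URL" llen celery

queue_trend() {
  # $1 = command printing the current depth, $2 = seconds between samples.
  local depth_cmd=$1 interval=${2:-10}
  local a b
  a=$($depth_cmd)
  sleep "$interval"
  b=$($depth_cmd)
  if [ "$b" -gt "$a" ]; then
    echo "rising: $a -> $b (workers likely not keeping up)"
  else
    echo "stable: $a -> $b"
  fi
}

# Usage: queue_trend 'redis-cli -u redis://localhost:6379 llen celery' 30
```

A rising depth with healthy web requests is the classic signature of dead or blocked workers.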

12. Decide: fix forward or rollback

Do not debug indefinitely on a broken release.

Roll back when:

  • issue is active and customer-facing
  • incident started right after deploy
  • rollback path is safe and known
  • fix is uncertain or touches multiple systems

Fix forward when:

  • root cause is isolated
  • change is small and reversible
  • rollback would worsen state or data compatibility
  • you can verify the fix quickly

Typical rollback actions:

bash
# symlink-based deploy
ln -sfn /var/www/releases/<previous_release> /var/www/current
systemctl restart myapp nginx

# container-based deploy
# redeploy previous image tag, then verify health endpoint
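
After either rollback path, it helps to poll the health endpoint rather than check it once. A sketch, with placeholder URL and timings:

```bash
# Hypothetical post-rollback check: poll the health endpoint until it
# returns 200 or the attempt budget runs out.

wait_healthy() {
  local url=$1 attempts=${2:-30} delay=${3:-2}
  local i code
  for i in $(seq 1 "$attempts"); do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || code=000
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "still unhealthy after $attempts attempt(s), last status $code"
  return 1
}

# Usage: wait_healthy https://yourdomain.com/health 30 2
```

Polling avoids declaring recovery during the window when the old process is still starting up behind the proxy.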

13. Add a guardrail after recovery

After service is restored, add one prevention mechanism.

Examples:

  • /health/live and /health/ready
  • deploy smoke test
  • migration compatibility check
  • alert on restart loops
  • alert on queue backlog
  • alert on 5xx spike
  • config validation before deploy
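
A deploy smoke test can be as small as a loop over route=expected-status pairs. A sketch with placeholder routes:

```bash
# Minimal post-deploy smoke test sketch; routes and expected codes
# are placeholders for your own critical paths.

smoke() {
  # $1 = base URL; remaining args are route=expected_status pairs.
  local base=$1; shift
  local failed=0 spec route want got
  for spec in "$@"; do
    route=${spec%%=*}
    want=${spec##*=}
    got=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$base$route") || got=000
    if [ "$got" = "$want" ]; then
      echo "ok   $route ($got)"
    else
      echo "FAIL $route: want $want, got $got"
      failed=1
    fi
  done
  return "$failed"
}

# Usage: smoke https://yourdomain.com /health=200 /login=200
```

Wire its exit code into the deploy pipeline so a failing smoke test blocks or reverts the release automatically.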

Use a release checklist like SaaS Production Checklist to reduce repeat incidents.

Process flow: deploy → first error spike → mitigation → full recovery → permanent fix.

Common causes

Most production issues fall into these buckets:

  • bad deploy introduced a runtime error or incompatible dependency
  • required environment variables or secrets are missing or rotated
  • database migrations were skipped, partially applied, or incompatible
  • Nginx, Gunicorn, or container config points to the wrong socket, port, or upstream
  • app is crashing due to memory pressure, OOM kills, or restart loops
  • database, Redis, SMTP, or third-party APIs are unavailable or timing out
  • TLS certificate expired or DNS records changed incorrectly
  • disk is full, breaking logs, uploads, temp files, or DB operations
  • background workers are stopped or blocked by queue backlog
  • session, cookie, CSRF, or proxy header settings differ in production

Debugging tips

Use these practices during active incidents:

  • use UTC timestamps everywhere
  • tail logs live during reproduction
  • prefer health endpoints that split liveness and readiness
  • inspect disk space early
  • check restart loops before reading only app-level stack traces
  • use synthetic requests with correlation IDs
  • compare tenant-specific config if only one tenant is affected
  • make one reversible change at a time
  • record each action and result during the incident

Useful command set:

bash
systemctl status myapp nginx --no-pager
journalctl -u myapp -n 200 --no-pager
journalctl -u nginx -n 200 --no-pager
journalctl -xe --no-pager
docker ps
docker logs --tail 200 <container_name>
ss -ltnp | grep -E ':80|:443|:8000|:5432|:6379'
curl -I https://yourdomain.com/health
curl -sS https://yourdomain.com/health
curl -v http://127.0.0.1:8000/health
free -h
df -h
top -o %CPU
ps aux --sort=-%mem | head
dmesg | tail -n 50
pg_isready -h $DB_HOST -p $DB_PORT -U $DB_USER
psql "$DATABASE_URL" -c 'select now();'
redis-cli -u "$REDIS_URL" ping
dig yourdomain.com
nslookup yourdomain.com
openssl s_client -connect yourdomain.com:443 -servername yourdomain.com
nginx -t
gunicorn --check-config app:app
celery -A app inspect ping
rq info

Checklist

  • issue scope confirmed: all users, subset, or single tenant
  • recent changes reviewed: deploy, env vars, migrations, DNS, certs, workers
  • app, proxy, database, cache, and worker health verified
  • logs and metrics correlated by timestamp
  • root cause isolated to one layer or dependency
  • rollback decision made if impact is high
  • fix validated with health check and real request path
  • post-incident action added: alert, test, runbook, or config validation

Related guides

  • 502 Bad Gateway Fix Guide
  • Error Tracking with Sentry
  • Common Auth Bugs and Fixes
  • SaaS Production Checklist

FAQ

What is the fastest way to narrow down a production issue?

Start with the symptom and incident start time, then check recent changes, service status, logs, and metrics together. Identify the first failing layer before changing anything.

When should I roll back instead of continuing to debug?

Roll back if the issue began right after a deploy, affects users now, and you cannot verify a low-risk fix quickly. Restore service first, investigate second.

How do I debug an issue that only happens in production?

Compare production-only variables:

  • secrets
  • domains
  • proxy headers
  • cookie settings
  • TLS
  • DB size
  • queue load
  • third-party credentials

Reproduce with production-like config in a safe environment if needed.

What should every small SaaS log in production?

At minimum:

  • request path
  • status code
  • latency
  • request ID
  • user or tenant ID where safe
  • exception stack traces
  • deploy version
  • worker job failures
  • outbound API errors
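
As one illustrative shape (field names follow the list above; the values are invented), each request can emit a single JSON line:

```bash
# Sketch: one structured JSON log line per request.
log_request() {
  # $1 request_id, $2 path, $3 status, $4 duration_ms, $5 release
  printf '{"ts":"%s","request_id":"%s","path":"%s","status":%s,"duration_ms":%s,"release":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" "$4" "$5"
}

# Example: log_request req-abc123 /api/orders 500 842 v1.4.2
```

One line per request with a stable field set is what makes the grep-by-request-ID correlation in step 6 possible at all.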

Final takeaway

Production debugging is faster when you follow a fixed order:

  1. confirm scope
  2. check recent changes
  3. inspect logs
  4. inspect metrics
  5. test dependencies
  6. decide fix forward or rollback
  7. add guardrails

Most outages are not mysterious. They are usually a broken dependency, bad config, failed migration, worker issue, or bad release. A repeatable runbook prevents wasted time.