Incident Response Playbook
A practical, lightweight playbook for implementing incident response in your SaaS.
A small SaaS needs a repeatable way to handle outages, performance regressions, failed deploys, payment failures, and auth incidents. This playbook gives a lightweight incident response process for indie teams: one source of truth, clear severity levels, fast triage, rollback rules, communication templates, and post-incident follow-up.
Internal references for setup and follow-up:
- Alerting System Setup
- Error Tracking with Sentry
- Debugging Production Issues
- App Crashes on Deployment
- SaaS Production Checklist
Quick Fix / Quick Setup
Create a repo-based incident checklist and severity matrix:
mkdir -p ops/runbooks ops/postmortems && cat > ops/runbooks/incident-checklist.md <<'EOF'
# Incident Checklist
## 1. Acknowledge
- Create incident channel/thread
- Assign incident commander
- Record start time
- Freeze non-essential deploys
## 2. Triage
- What is broken?
- Who is affected?
- Is this ongoing?
- What changed in last 60 minutes?
- Severity: SEV1 / SEV2 / SEV3
## 3. Stabilize
- Roll back latest deploy if suspicious
- Enable maintenance mode if needed
- Scale app/workers if saturation issue
- Disable broken integration/feature flag
## 4. Investigate
- Check app logs
- Check server logs
- Check DB health
- Check queue health
- Check third-party status pages
## 5. Communicate
- Internal update every 15-30 min
- Customer status page update if user impact exists
## 6. Resolve
- Verify recovery with metrics + manual test
- Record resolution time
## 7. Postmortem
- Root cause
- Trigger
- Detection gap
- Action items
EOF
cat > ops/runbooks/severity-matrix.md <<'EOF'
# Severity Matrix
- SEV1: Full outage, login/payment/core API broken, major data risk
- SEV2: Major feature degraded, subset of users blocked, delayed jobs/payments
- SEV3: Minor issue, workaround exists, limited impact
EOF

Start with:
- one incident runbook
- one severity matrix
- one communication path
- one rollback policy
- one postmortem template
Quick setup
- Create a single incident runbook in your repo or internal docs.
- Define severity levels: SEV1, SEV2, SEV3 with clear examples.
- Assign default roles: incident commander, investigator, communications owner, scribe.
- Freeze unrelated deploys during active incidents.
- Set a rule for updates: internal every 15 minutes for SEV1, every 30 minutes for SEV2.
- Prepare rollback commands, status page templates, and dashboards before the first real incident.
Suggested visual: incident lifecycle flowchart (detect, acknowledge, triage, mitigate, communicate, verify, document).
Example responder roles:
# Incident Roles
- Incident Commander: owns decisions and prioritization
- Investigator: runs technical checks and tests hypotheses
- Communications Owner: posts status updates internally and externally
- Scribe: records timeline, actions, and evidence

Example severity matrix:
# Severity Guide
## SEV1
- Full outage
- Login unavailable
- Payment flow broken for most users
- Data corruption or security risk
- Core API unavailable
## SEV2
- Major degradation
- Background jobs heavily delayed
- One major feature unavailable
- Subset of users blocked
## SEV3
- Minor degradation
- Workaround exists
- Limited user impact
- No immediate revenue or data risk

What’s happening
An incident is a production issue with user, revenue, security, or data impact.
Common failure in small teams:
- no single source of truth
- no owner during the incident
- no clear severity level
- no rollback decision rule
- no timeline of what changed
The main problem is usually process, not tooling.
Without a playbook:
- responders jump between logs and guesses
- multiple people make conflicting changes
- deploys continue during active impact
- customer communication becomes delayed or inaccurate
- root cause gets lost after recovery
The goal is to reduce:
- time to acknowledge
- time to mitigate
- time to resolve
You do not need enterprise incident tooling for an MVP. You need repeatable steps, evidence capture, and links to your actual systems.
Step-by-step implementation
1. Define what counts as an incident
Use business impact, not annoyance.
Treat these as incidents:
- login broken
- checkout or subscription flow broken
- API returning elevated 5xx errors
- worker backlog causing delayed customer actions
- database outage
- auth provider outage
- deploy failure causing production crash
- data exposure or integrity risk
Do not treat these as formal incidents unless impact is real:
- isolated dev-only bugs
- cosmetic UI bugs with no business impact
- alerts with no user-facing symptoms
2. Define severity by impact
Map severity to customer effect.
SEV1
- Full or near-full outage
- Critical path broken: auth, payments, core API
- Security or major data risk
- Immediate revenue impact
SEV2
- Major degradation
- Affected subset of users or one important workflow
- Delayed jobs, integrations, or billing sync
- Workaround may exist
SEV3
- Minor issue
- Limited scope
- Workaround exists
- Low urgency but still production-impacting

3. Document responder roles
Even if one person fills all roles, keep the structure.
Incident Commander
- Declares incident
- Assigns severity
- Freezes non-essential deploys
- Approves mitigation choice
- Decides when incident is resolved
Investigator
- Checks logs, dashboards, infra, and recent changes
- Tests hypotheses
- Proposes mitigation
Communications Owner
- Posts updates to status page/support/internal channels
- Sets next update time
- Avoids speculation
Scribe
- Records timeline
- Captures commands run, screenshots, links, and outcomes

4. Choose communication channels
Minimum setup:
- one incident channel or thread in Slack/Discord
- one shared incident doc or issue
- one customer-facing status page
- one issue tracker for follow-up work
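With those channels in place, the shared incident doc can be one command away. A sketch of a scaffold script; the `ops/incidents` path and `INCIDENT_DIR` variable are assumptions, so adapt them to your repo:

```shell
#!/usr/bin/env bash
# Scaffold a dated incident doc (hypothetical path: ops/incidents/).
set -euo pipefail

dir="${INCIDENT_DIR:-ops/incidents}"
mkdir -p "$dir"
file="$dir/$(date -u +%Y%m%d-%H%M)-incident.md"

# Pre-fill the fields responders always forget under pressure.
cat > "$file" <<EOF
# Incident: <title>
- Start time: $(date -u +"%Y-%m-%d %H:%M UTC")
- Severity:
- Incident commander:
EOF

echo "Created $file"
```

Run it the moment an incident is declared, then fill in the remaining fields from the template below in the doc itself.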
Example incident template:
# Incident: <title>
- Start time:
- Severity:
- Incident commander:
- Investigator:
- Communications owner:
- Scribe:
## Impact
- Affected users:
- Affected systems:
- Symptoms:
## Suspected trigger
- Latest deploy / config / migration / provider issue / unknown
## Mitigations attempted
- <timestamp> action -> result
## Next update
- <time>
## Resolution
- Resolved at:
- Recovery verification:

5. Route alerts into one path
Alerts should converge into one responder workflow.
Sources:
- uptime monitor
- error tracking
- app metrics
- infrastructure metrics
- queue depth alerts
- payment/auth webhook failure alerts
If alerts are scattered, responders lose time.
Route every source into that single path: one channel, one webhook, one responder workflow.
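One lightweight convergence pattern is to wrap every source in a tiny relay that posts the same JSON shape to one webhook. A sketch; the webhook URL, variable name, and field names are all assumptions:

```shell
# Hypothetical relay: every alert source calls send_alert with the same shape.
ALERT_WEBHOOK="${ALERT_WEBHOOK:-https://hooks.example.com/incident}"  # assumed URL

build_alert_payload() {
  # $1 = source (uptime, sentry, queue, ...), $2 = message
  printf '{"source":"%s","text":"%s"}' "$1" "$2"
}

send_alert() {
  # POST the payload to the single incident webhook.
  curl -fsS -X POST -H 'Content-Type: application/json' \
    -d "$(build_alert_payload "$1" "$2")" "$ALERT_WEBHOOK"
}

# Example: send_alert uptime "site returning 502"
```

Whatever tooling you use, the point is the same: every source lands in one stream a responder already watches.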
6. Prepare investigation links in advance
Store all responder links in one file.
# ops/runbooks/service-links.md
## App
- Production URL:
- Health endpoint:
- Sentry:
- App logs:
- APM dashboard:
## Infra
- Server metrics:
- Cloud dashboard:
- Container dashboard:
- Nginx logs:
- Process supervisor:
## Data
- DB dashboard:
- Queue dashboard:
- Redis dashboard:
## Third parties
- Stripe status:
- Email provider status:
- Auth provider status:
- Cloud provider status:

7. Define mitigation options before the incident
Do not invent mitigation during active impact.
Typical mitigations:
- rollback latest deploy
- disable feature flag
- enable maintenance mode
- scale app instances
- scale workers
- restart failed processes
- pause queue consumers
- pause broken webhook processing
- disable a failing integration
- switch traffic away from a bad node
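Most of these mitigations reduce to a toggle or a restart that should exist before the incident. As one sketch, a flag-file maintenance mode, assuming your app or proxy checks for the flag:

```shell
# Hypothetical maintenance-mode toggle; your app or proxy must check this flag file.
MAINT_FLAG="${MAINT_FLAG:-/tmp/maintenance.on}"

maintenance() {
  case "$1" in
    on)     touch "$MAINT_FLAG" ;;                 # flip into maintenance mode
    off)    rm -f "$MAINT_FLAG" ;;                 # restore normal serving
    status) if [ -f "$MAINT_FLAG" ]; then
              echo "maintenance: ON"
            else
              echo "maintenance: OFF"
            fi ;;
  esac
}
```

In nginx, for example, the check could be an `if (-f ...)` block returning a 503 page. The mechanism matters less than having it rehearsed.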
Example rollback notes:
# Rollback Rules
- If issue started immediately after deploy and rollback is safe, rollback first.
- If migration is destructive or non-reversible, do not blindly rollback app only.
- If provider outage is external, disable dependency path or show degraded-mode UI.
- If issue is resource exhaustion, stabilize capacity before deep debugging.

8. Create service-specific runbooks
At minimum, write runbooks for:
- 502 errors
- app crash after deployment
- DB connection failures
- background worker backlog
- auth/session failures
- payment webhook failures
Related fix guides: 502 Bad Gateway Fix Guide, App Crashes on Deployment, and Debugging Production Issues.
9. Use a strict triage sequence
During triage, always answer these questions first:
- What is broken?
- Who is affected?
- Is it still ongoing?
- What changed in the last 60 minutes?
- Is this app, infra, DB, queue, or third-party?
- What is the safest fast mitigation?
Useful recent-change checklist:
- deploy
- config change
- secrets rotation
- migration
- dependency update
- DNS change
- TLS renewal
- traffic spike
- cron or worker change
- provider outage
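For deploys, config, and migrations, the "what changed" question is usually answerable from version control. A sketch, run from the app repo; the `config/` and `migrations/` paths are assumptions about your layout:

```shell
# Hypothetical "what changed" sweep for triage; run inside the app repo.
recent_changes() {
  # Commits in the window you care about (default: last 60 minutes).
  git log --since="${1:-60 minutes ago}" --oneline
}

recent_config_changes() {
  # Same window, narrowed to config and migration paths (assumed layout).
  git log --since="${1:-60 minutes ago}" --oneline -- config/ migrations/
}

# Example: recent_changes "2 hours ago"
```

Pair this with your deploy log and provider status pages to cover the non-git change sources in the checklist above.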
10. Record a timeline during the incident
Every action needs:
- timestamp
- actor
- action
- result
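A tiny append-only helper can enforce that shape without any tooling; the file path and function name here are hypothetical:

```shell
# Hypothetical timeline logger: appends "HH:MM UTC - actor - action -> result".
TIMELINE="${TIMELINE:-/tmp/incident-timeline.md}"

log_action() {
  # $1 = actor, $2 = action, $3 = result
  printf '%s UTC - %s - %s -> %s\n' "$(date -u +%H:%M)" "$1" "$2" "$3" >> "$TIMELINE"
}

# Example: log_action alice "rollback started" "in progress"
```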
Example:
14:02 UTC - Alert fired for 65% 5xx on /api/login
14:04 UTC - Incident declared SEV1
14:05 UTC - Deploy freeze announced
14:08 UTC - Latest deploy identified as likely trigger
14:10 UTC - Rollback started
14:14 UTC - Error rate dropping
14:18 UTC - Manual login test passed
14:20 UTC - Status page updated: recovering
14:28 UTC - Metrics stable for 10 minutes
14:30 UTC - Incident resolved

11. Communicate with short factual updates
Use:
- current impact
- current action
- confidence level
- next update time
Avoid:
- guesses
- unverified root cause statements
- optimistic recovery estimates without evidence
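A small formatter function (hypothetical) makes it hard to post an update that is missing a required field:

```shell
# Hypothetical update formatter: forces impact, action, and next-update fields.
format_update() {
  # $1 = severity, $2 = impact, $3 = current action, $4 = next update (HH:MM)
  printf '%s update %s UTC\n- Impact: %s\n- Current action: %s\n- Next update: %s UTC\n' \
    "$1" "$(date -u +%H:%M)" "$2" "$3" "$4"
}

# Example:
# format_update SEV2 "delayed background jobs" "scaling workers" "13:45"
```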
Example internal update:
SEV2 update 13:30 UTC
- Impact: delayed background jobs for new uploads
- Scope: all users, backlog growing
- Current action: scaling workers and checking Redis connectivity
- Recent trigger: deploy at 13:05 UTC under review
- Next update: 13:45 UTC

Example customer update:
We are investigating elevated errors affecting file uploads. Some uploads may be delayed or fail. Next update in 30 minutes.

12. Verify recovery before closing
Do not mark resolved because one request succeeded.
Verify with:
- error rate returned to baseline
- latency normalized
- queue depth draining
- DB health normal
- manual test of key user flows
- no active alerts
- customer-facing path working
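The metric checks can be collapsed into a single gate that must pass before anyone says "resolved". A sketch with made-up thresholds; tune them to your own baselines:

```shell
# Hypothetical recovery gate: every input must be back to baseline.
recovery_ok() {
  # $1 = health-endpoint HTTP code, $2 = error rate %, $3 = queue depth
  [ "$1" = "200" ] || return 1
  awk -v r="$2" 'BEGIN { exit !(r + 0 < 1.0) }' || return 1  # assumed 1% baseline
  [ "$3" -lt 100 ] || return 1                               # assumed queue threshold
}
```

You would feed it live values, e.g. the HTTP code from `curl -s -o /dev/null -w '%{http_code}' https://yourdomain.com/health`, plus error rate and queue depth from your dashboards.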
Example manual verification checklist:
- Login works
- Signup works
- Billing checkout works
- API health endpoint returns 200
- Background jobs processing
- Webhooks received and acknowledged

13. Run a postmortem within 24–72 hours
Do this for all SEV1 and SEV2 incidents.
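A scaffold keeps postmortems dated and versioned; the `ops/postmortems` path matches the Quick Setup directories, and the headings mirror the template used in this playbook:

```shell
# Scaffold a dated postmortem doc (path matches the Quick Setup dirs).
pm_dir="${PM_DIR:-ops/postmortems}"
mkdir -p "$pm_dir"
pm_file="$pm_dir/$(date -u +%Y-%m-%d)-postmortem.md"

cat > "$pm_file" <<'EOF'
# Postmortem: <incident title>
## Summary
## Customer impact
## Timeline
## Root cause
## Corrective actions
- [ ] Action item / owner / due date
EOF

echo "Created $pm_file"
```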
Store the document in version control:
# Postmortem: <incident title>
## Summary
## Customer impact
## Timeline
## Root cause
## Trigger
## Detection gap
## What worked
## What failed
## Corrective actions
- [ ] Action item / owner / due date

14. Track follow-up work
A postmortem without tracked actions is only documentation.
Common follow-up categories:
- missing alert
- missing dashboard
- poor log context
- rollback too slow
- no feature flag
- unsafe migration process
- missing runbook
- missing owner
- poor customer communication path
| Metric | Warning | High | Critical |
|---|---|---|---|
| Uptime | < 99.9% | < 99.5% | < 99% |
| Error rate | > 1% | > 3% | > 5% |
| Response time | > 500 ms | > 1 s | > 3 s |
| CPU usage | > 60% | > 80% | > 95% |
| Memory usage | > 70% | > 85% | > 95% |
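A row of the table can be encoded directly into an alert script. A sketch for the error-rate row, using the thresholds above; the function name is hypothetical:

```shell
# Classify an error-rate percentage against the table's thresholds.
classify_error_rate() {
  awk -v r="$1" 'BEGIN {
    if (r + 0 > 5)      print "critical"
    else if (r + 0 > 3) print "high"
    else if (r + 0 > 1) print "warning"
    else                print "ok"
  }'
}
```

The same shape works for latency or saturation rows; the value is that thresholds live in code, not in someone's memory.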
Suggested visual: severity matrix table with examples for auth outage, billing outage, slow API, worker delay, and partial feature degradation.
Common causes
- Bad deploy or untested hotfix introduced errors or startup failures.
- Environment variable or secret changed, missing, or loaded incorrectly.
- Database migration caused lockups, schema mismatch, or application incompatibility.
- Infrastructure saturation: CPU, memory, disk, file descriptors, DB connections, worker backlog.
- Reverse proxy or app server misconfiguration causing 502 or timeout errors.
- TLS, DNS, or domain configuration issues after renewal or provider changes.
- Third-party outage affecting auth, payments, email, storage, or webhooks.
- Background job workers stopped, crashed, or cannot reach Redis or RabbitMQ.
- Feature flag or config toggle enabled a broken code path.
- Insufficient monitoring delayed detection, making impact larger.
Debugging tips
Start with fast environment checks:
date && uptime
free -h
df -h
top -o %CPU
ps aux --sort=-%mem | head
ss -ltnp

Check reverse proxy and app service health:
journalctl -u nginx -n 200 --no-pager
journalctl -u gunicorn -n 200 --no-pager
sudo nginx -t
systemctl status nginx
systemctl status gunicorn
tail -n 200 /var/log/nginx/error.log
tail -n 200 /var/log/nginx/access.log
curl -I https://yourdomain.com
curl -sS https://yourdomain.com/health

If you use containers:
docker ps
docker logs --tail 200 <container_name>
docker stats --no-stream

Check runtime config and app state:
env | sort
printenv | sort
python -m flask routes
alembic current
alembic heads

Check data services:
psql "$DATABASE_URL" -c 'select now();'
redis-cli ping
celery -A app inspect ping

Check external dependency reachability:
curl -v https://api.stripe.com
dig yourdomain.com
nslookup yourdomain.com

Debugging sequence:
- confirm user-visible symptom
- inspect recent changes
- check logs and error spikes
- check resource saturation
- isolate whether issue is app, infra, DB, queue, or provider
- mitigate safely first
- verify recovery with metrics and manual checks
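The fast environment checks above can also be wrapped into one evidence-capture command, so the scribe gets a timestamped snapshot instead of scrollback. A sketch; the default file path is an assumption:

```shell
# Hypothetical snapshot: dump basic system state into one timestamped file.
snapshot() {
  local out="${1:-/tmp/incident-snapshot-$(date -u +%Y%m%dT%H%M%SZ).txt}"
  {
    echo "== snapshot $(date -u) =="
    uptime
    df -h
    free -h 2>/dev/null || true   # `free` is not present on all systems
  } > "$out"
  echo "$out"                     # print the path for the incident doc
}
```

Attach the printed path to the incident timeline so the state at mitigation time survives the incident.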
If deploy-related, start with App Crashes on Deployment.
If the issue is broad and unclear, use Debugging Production Issues.
If the issue looks like a gateway/proxy failure, use the 502 Bad Gateway Fix Guide.
Checklist
- ✓ Severity levels documented.
- ✓ Incident commander role defined.
- ✓ Rollback procedure documented and tested.
- ✓ Status page or customer communication path ready.
- ✓ Dashboards and logs linked from one runbook.
- ✓ Recent deploy, config, and migration history accessible.
- ✓ Freeze deploy rule defined for active incidents.
- ✓ Postmortem template available.
- ✓ Common incident runbooks created for top 5 failure modes.
- ✓ Alert routing configured to one response path.
- ✓ Manual verification checklist prepared for core user journeys.
- ✓ Postmortem action items assigned with owner and due date.
Suggested visual: one-page incident response checklist printable for on-call or deploy operators.
Related guides
- Debugging Production Issues
- Alerting System Setup
- Error Tracking with Sentry
- 502 Bad Gateway Fix Guide
- App Crashes on Deployment
- SaaS Production Checklist
FAQ
What is the minimum incident process for a solo founder?
Use one severity matrix, one incident checklist, one communication channel, rollback steps, and a postmortem template. That is enough to respond consistently.
Should every alert create an incident?
No. Only issues with active user, revenue, security, or data impact should become formal incidents. Lower-priority alerts can stay as operational tasks.
How do I decide between restart and rollback?
Restart for transient resource or process failures. Roll back when the issue likely started with a recent deploy, migration, or config change.
Do I need a public status page?
If customers depend on your app for production work or payments, yes. For early MVPs, even a simple hosted status page is better than ad-hoc support replies.
What should I measure after introducing a playbook?
Track:
- time to acknowledge
- time to mitigate
- time to resolve
- alert quality
- number of incidents by cause
- completion rate of postmortem action items
Do I need on-call software for a small SaaS?
No. Start with alerts, one incident channel, a runbook, and clear severity rules.
When should I declare an incident?
As soon as there is real user impact, revenue risk, or data/security concern.
Should I roll back immediately?
Roll back quickly if the latest deploy is the likely trigger and rollback is safe.
How often should I update stakeholders?
Every 15–30 minutes during active impact, even if there is no new resolution yet.
Who owns the postmortem?
The incident commander or service owner, but action items must have individual owners.
Final takeaway
Small SaaS teams do not need a heavy incident management stack. They need a simple, repeatable response process.
Core sequence:
- detect
- acknowledge
- triage
- mitigate
- communicate
- verify
- document
- improve
Most incident recovery time is lost in confusion, not debugging. A written playbook removes that confusion.