Incident Response Playbook

The essential playbook for implementing incident response in your SaaS.

A small SaaS needs a repeatable way to handle outages, performance regressions, failed deploys, payment failures, and auth incidents. This playbook gives a lightweight incident response process for indie teams: one source of truth, clear severity levels, fast triage, rollback rules, communication templates, and post-incident follow-up.

Quick Fix / Quick Setup

Create a repo-based incident checklist and severity matrix:

```bash
mkdir -p ops/runbooks ops/postmortems && cat > ops/runbooks/incident-checklist.md <<'EOF'
# Incident Checklist

## 1. Acknowledge
- Create incident channel/thread
- Assign incident commander
- Record start time
- Freeze non-essential deploys

## 2. Triage
- What is broken?
- Who is affected?
- Is this ongoing?
- What changed in last 60 minutes?
- Severity: SEV1 / SEV2 / SEV3

## 3. Stabilize
- Roll back latest deploy if suspicious
- Enable maintenance mode if needed
- Scale app/workers if saturation issue
- Disable broken integration/feature flag

## 4. Investigate
- Check app logs
- Check server logs
- Check DB health
- Check queue health
- Check third-party status pages

## 5. Communicate
- Internal update every 15-30 min
- Customer status page update if user impact exists

## 6. Resolve
- Verify recovery with metrics + manual test
- Record resolution time

## 7. Postmortem
- Root cause
- Trigger
- Detection gap
- Action items
EOF

cat > ops/runbooks/severity-matrix.md <<'EOF'
# Severity Matrix
- SEV1: Full outage, login/payment/core API broken, major data risk
- SEV2: Major feature degraded, subset of users blocked, delayed jobs/payments
- SEV3: Minor issue, workaround exists, limited impact
EOF
```

Start with:

  • one incident runbook
  • one severity matrix
  • one communication path
  • one rollback policy
  • one postmortem template

Quick setup

  • Create a single incident runbook in your repo or internal docs.
  • Define severity levels: SEV1, SEV2, SEV3 with clear examples.
  • Assign default roles: incident commander, investigator, communications owner, scribe.
  • Freeze unrelated deploys during active incidents.
  • Set a rule for updates: internal every 15 minutes for SEV1, every 30 minutes for SEV2.
  • Prepare rollback commands, status page templates, and dashboards before the first real incident.
Incident lifecycle: detection → triage → mitigation → resolution → postmortem.

Example responder roles:

```md
# Incident Roles
- Incident Commander: owns decisions and prioritization
- Investigator: runs technical checks and tests hypotheses
- Communications Owner: posts status updates internally and externally
- Scribe: records timeline, actions, and evidence
```

Example severity matrix:

```md
# Severity Guide

## SEV1
- Full outage
- Login unavailable
- Payment flow broken for most users
- Data corruption or security risk
- Core API unavailable

## SEV2
- Major degradation
- Background jobs heavily delayed
- One major feature unavailable
- Subset of users blocked

## SEV3
- Minor degradation
- Workaround exists
- Limited user impact
- No immediate revenue or data risk
```

What’s happening

An incident is a production issue with user, revenue, security, or data impact.

Common failure in small teams:

  • no single source of truth
  • no owner during the incident
  • no clear severity level
  • no rollback decision rule
  • no timeline of what changed

The main problem is usually process, not tooling.

Without a playbook:

  • responders jump between logs and guesses
  • multiple people make conflicting changes
  • deploys continue during active impact
  • customer communication becomes delayed or inaccurate
  • root cause gets lost after recovery

The goal is to reduce:

  • time to acknowledge
  • time to mitigate
  • time to resolve

You do not need enterprise incident tooling for an MVP. You need repeatable steps, evidence capture, and links to your actual systems.

Step-by-step implementation

1. Define what counts as an incident

Use business impact, not annoyance.

Treat these as incidents:

  • login broken
  • checkout or subscription flow broken
  • API returning elevated 5xx errors
  • worker backlog causing delayed customer actions
  • database outage
  • auth provider outage
  • deploy failure causing production crash
  • data exposure or integrity risk

Do not treat these as formal incidents unless impact is real:

  • isolated dev-only bugs
  • cosmetic UI bugs with no business impact
  • alerts with no user-facing symptoms

2. Define severity by impact

Map severity to customer effect.

```md
SEV1
- Full or near-full outage
- Critical path broken: auth, payments, core API
- Security or major data risk
- Immediate revenue impact

SEV2
- Major degradation
- Affected subset of users or one important workflow
- Delayed jobs, integrations, or billing sync
- Workaround may exist

SEV3
- Minor issue
- Limited scope
- Workaround exists
- Low urgency but still production-impacting
```

3. Document responder roles

Even if one person fills all roles, keep the structure.

```md
Incident Commander
- Declares incident
- Assigns severity
- Freezes non-essential deploys
- Approves mitigation choice
- Decides when incident is resolved

Investigator
- Checks logs, dashboards, infra, and recent changes
- Tests hypotheses
- Proposes mitigation

Communications Owner
- Posts updates to status page/support/internal channels
- Sets next update time
- Avoids speculation

Scribe
- Records timeline
- Captures commands run, screenshots, links, and outcomes
```

4. Choose communication channels

Minimum setup:

  • one incident channel or thread in Slack/Discord
  • one shared incident doc or issue
  • one customer-facing status page
  • one issue tracker for follow-up work

Example incident template:

```md
# Incident: <title>

- Start time:
- Severity:
- Incident commander:
- Investigator:
- Communications owner:
- Scribe:

## Impact
- Affected users:
- Affected systems:
- Symptoms:

## Suspected trigger
- Latest deploy / config / migration / provider issue / unknown

## Mitigations attempted
- <timestamp> action -> result

## Next update
- <time>

## Resolution
- Resolved at:
- Recovery verification:
```
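
A template like the one above can be opened from the command line with a small helper; the `ops/incidents` path and the slug below are illustrative, not a convention this playbook requires:

```bash
# Sketch: start a new incident record from the template.
mkdir -p ops/incidents
ts=$(date -u +"%Y-%m-%dT%H:%MZ")
slug="api-5xx-spike"                                  # short incident name
file="ops/incidents/$(date -u +%Y%m%d)-${slug}.md"

# Pre-fill the fields responders always need first
{
  echo "# Incident: ${slug}"
  echo ""
  echo "- Start time: ${ts}"
  echo "- Severity:"
  echo "- Incident commander:"
} > "$file"

echo "created ${file}"
```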

5. Route alerts into one path

Alerts should converge into one responder workflow.

Sources:

  • uptime monitor
  • error tracking
  • app metrics
  • infrastructure metrics
  • queue depth alerts
  • payment/auth webhook failure alerts

If alerts are scattered, responders lose time.

6. Prepare investigation links in advance

Store all responder links in one file.

```md
# ops/runbooks/service-links.md

## App
- Production URL:
- Health endpoint:
- Sentry:
- App logs:
- APM dashboard:

## Infra
- Server metrics:
- Cloud dashboard:
- Container dashboard:
- Nginx logs:
- Process supervisor:

## Data
- DB dashboard:
- Queue dashboard:
- Redis dashboard:

## Third parties
- Stripe status:
- Email provider status:
- Auth provider status:
- Cloud provider status:
```

7. Define mitigation options before the incident

Do not invent mitigation during active impact.

Typical mitigations:

  • rollback latest deploy
  • disable feature flag
  • enable maintenance mode
  • scale app instances
  • scale workers
  • restart failed processes
  • pause queue consumers
  • pause broken webhook processing
  • disable a failing integration
  • switch traffic away from a bad node
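
As one concrete sketch, a kill switch can be as simple as an env-file flag the app reads at startup. `FLAG_FILE`, `WEBHOOKS_ENABLED`, and the restart step are assumptions about your stack, not requirements:

```bash
# Sketch: flip an env-file kill switch off and log the action.
# Your app must actually read this file for the toggle to take effect.
FLAG_FILE="${FLAG_FILE:-/tmp/myapp-flags.env}"

# Seed the file if missing (normally it lives in your config repo)
[ -f "$FLAG_FILE" ] || echo "WEBHOOKS_ENABLED=true" > "$FLAG_FILE"

# Disable the failing integration and record the action for the timeline
sed -i 's/^WEBHOOKS_ENABLED=.*/WEBHOOKS_ENABLED=false/' "$FLAG_FILE"
echo "$(date -u +%H:%M:%SZ) set WEBHOOKS_ENABLED=false" >> /tmp/incident-actions.log

# Then restart or signal the app, e.g.:
# systemctl restart myapp
```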

Example rollback notes:

```md
# Rollback Rules
- If the issue started immediately after a deploy and rollback is safe, roll back first.
- If a migration is destructive or non-reversible, do not blindly roll back the app alone.
- If the cause is an external provider outage, disable the dependency path or show degraded-mode UI.
- If the issue is resource exhaustion, stabilize capacity before deep debugging.
```

8. Create service-specific runbooks

At minimum, write runbooks for:

  • 502 errors
  • app crash after deployment
  • DB connection failures
  • background worker backlog
  • auth/session failures
  • payment webhook failures

9. Use a strict triage sequence

During triage, always answer these questions first:

  1. What is broken?
  2. Who is affected?
  3. Is it still ongoing?
  4. What changed in the last 60 minutes?
  5. Is this app, infra, DB, queue, or third-party?
  6. What is the safest fast mitigation?

Useful recent-change checklist:

  • deploy
  • config change
  • secrets rotation
  • migration
  • dependency update
  • DNS change
  • TLS renewal
  • traffic spike
  • cron or worker change
  • provider outage

10. Record a timeline during the incident

Every action needs:

  • timestamp
  • actor
  • action
  • result

Example:

```md
14:02 UTC - Alert fired for 65% 5xx on /api/login
14:04 UTC - Incident declared SEV1
14:05 UTC - Deploy freeze announced
14:08 UTC - Latest deploy identified as likely trigger
14:10 UTC - Rollback started
14:14 UTC - Error rate dropping
14:18 UTC - Manual login test passed
14:20 UTC - Status page updated: recovering
14:28 UTC - Metrics stable for 10 minutes
14:30 UTC - Incident resolved
```
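
A timeline like the one above can be captured with a tiny shell helper the scribe keeps open; the file path is illustrative:

```bash
# Sketch: append timestamped entries to the incident timeline.
TIMELINE="${TIMELINE:-/tmp/incident-timeline.md}"

note() {
  # Prefix every entry with the current UTC time
  echo "$(date -u +%H:%M) UTC - $*" >> "$TIMELINE"
}

note "Rollback started"
note "Error rate dropping"
tail -n 2 "$TIMELINE"
```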

11. Communicate with short factual updates

Use:

  • current impact
  • current action
  • confidence level
  • next update time

Avoid:

  • guesses
  • unverified root cause statements
  • optimistic recovery estimates without evidence

Example internal update:

```md
SEV2 update 13:30 UTC
- Impact: delayed background jobs for new uploads
- Scope: all users, backlog growing
- Current action: scaling workers and checking Redis connectivity
- Recent trigger: deploy at 13:05 UTC under review
- Next update: 13:45 UTC
```

Example customer update:

```md
We are investigating elevated errors affecting file uploads. Some uploads may be delayed or fail. Next update in 30 minutes.
```

12. Verify recovery before closing

Do not mark an incident as resolved just because one request succeeded.

Verify with:

  • error rate returned to baseline
  • latency normalized
  • queue depth draining
  • DB health normal
  • manual test of key user flows
  • no active alerts
  • customer-facing path working

Example manual verification checklist:

```md
- Login works
- Signup works
- Billing checkout works
- API health endpoint returns 200
- Background jobs processing
- Webhooks received and acknowledged
```
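
The manual checks can be backed by a small probe script. `BASE_URL` and the paths are placeholders; the `|| true` keeps the demo from aborting if the placeholder host is unreachable:

```bash
# Sketch: probe key endpoints and report any non-2xx responses.
BASE_URL="${BASE_URL:-https://example.com}"

check() {
  local path="$1" code
  # Fetch only the status code, with a short timeout
  code=$(curl -s --max-time 5 -o /dev/null -w '%{http_code}' "${BASE_URL}${path}")
  if [ "$code" -ge 200 ] && [ "$code" -lt 300 ]; then
    echo "OK   ${path} (${code})"
  else
    echo "FAIL ${path} (${code})"
    return 1
  fi
}

check /health || true
check /login  || true
```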

13. Run a postmortem within 24–72 hours

Do this for all SEV1 and SEV2 incidents.

Store the document in version control:

```md
# Postmortem: <incident title>

## Summary
## Customer impact
## Timeline
## Root cause
## Trigger
## Detection gap
## What worked
## What failed
## Corrective actions
- [ ] Action item / owner / due date
```

14. Track follow-up work

A postmortem without tracked actions is only documentation.

Common follow-up categories:

  • missing alert
  • missing dashboard
  • poor log context
  • rollback too slow
  • no feature flag
  • unsafe migration process
  • missing runbook
  • missing owner
  • poor customer communication path
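
One lightweight way to keep follow-up visible is grepping for open checkboxes across stored postmortems, assuming the `- [ ]` convention from the template above:

```bash
# Sketch: list open postmortem action items across all postmortem docs.
mkdir -p ops/postmortems
grep -rn -- "- \[ \]" ops/postmortems || echo "no open action items"
```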
Example monitoring alert thresholds:

| Metric | Warning | High | Critical |
| --- | --- | --- | --- |
| Uptime | < 99.9% | < 99.5% | < 99% |
| Error rate | > 1% | > 3% | > 5% |
| Response time | > 500 ms | > 1 s | > 3 s |
| CPU usage | > 60% | > 80% | > 95% |
| Memory usage | > 70% | > 85% | > 95% |

Suggested visual: severity matrix table with examples for auth outage, billing outage, slow API, worker delay, and partial feature degradation.

Common causes

  • Bad deploy or untested hotfix introduced errors or startup failures.
  • Environment variable or secret changed, missing, or loaded incorrectly.
  • Database migration caused lockups, schema mismatch, or application incompatibility.
  • Infrastructure saturation: CPU, memory, disk, file descriptors, DB connections, worker backlog.
  • Reverse proxy or app server misconfiguration causing 502 or timeout errors.
  • TLS, DNS, or domain configuration issues after renewal or provider changes.
  • Third-party outage affecting auth, payments, email, storage, or webhooks.
  • Background job workers stopped, crashed, or cannot reach Redis or RabbitMQ.
  • Feature flag or config toggle enabled a broken code path.
  • Insufficient monitoring delayed detection, making impact larger.

Debugging tips

Start with fast environment checks:

```bash
date && uptime
free -h
df -h
top -o %CPU
ps aux --sort=-%mem | head
ss -ltnp
```

Check reverse proxy and app service health:

```bash
journalctl -u nginx -n 200 --no-pager
journalctl -u gunicorn -n 200 --no-pager
sudo nginx -t
systemctl status nginx
systemctl status gunicorn
tail -n 200 /var/log/nginx/error.log
tail -n 200 /var/log/nginx/access.log
curl -I https://yourdomain.com
curl -sS https://yourdomain.com/health
```

If you use containers:

```bash
docker ps
docker logs --tail 200 <container_name>
docker stats --no-stream
```

Check runtime config and app state:

```bash
env | sort
printenv | sort
python -m flask routes
alembic current
alembic heads
```

Check data services:

```bash
psql "$DATABASE_URL" -c 'select now();'
redis-cli ping
celery -A app inspect ping
```

Check external dependency reachability:

```bash
curl -v https://api.stripe.com
dig yourdomain.com
nslookup yourdomain.com
```

Debugging sequence:

  1. confirm user-visible symptom
  2. inspect recent changes
  3. check logs and error spikes
  4. check resource saturation
  5. isolate whether issue is app, infra, DB, queue, or provider
  6. mitigate safely first
  7. verify recovery with metrics and manual checks

Checklist

  • Severity levels documented.
  • Incident commander role defined.
  • Rollback procedure documented and tested.
  • Status page or customer communication path ready.
  • Dashboards and logs linked from one runbook.
  • Recent deploy, config, and migration history accessible.
  • Freeze deploy rule defined for active incidents.
  • Postmortem template available.
  • Common incident runbooks created for top 5 failure modes.
  • Alert routing configured to one response path.
  • Manual verification checklist prepared for core user journeys.
  • Postmortem action items assigned with owner and due date.

Suggested visual: a one-page incident response checklist printable for on-call or deploy operators.

FAQ

What is the minimum incident process for a solo founder?

Use one severity matrix, one incident checklist, one communication channel, rollback steps, and a postmortem template. That is enough to respond consistently.

Should every alert create an incident?

No. Only issues with active user, revenue, security, or data impact should become formal incidents. Lower-priority alerts can stay as operational tasks.

How do I decide between restart and rollback?

Restart for transient resource or process failures. Roll back when the issue likely started with a recent deploy, migration, or config change.

Do I need a public status page?

If customers depend on your app for production work or payments, yes. For early MVPs, even a simple hosted status page is better than ad-hoc support replies.

What should I measure after introducing a playbook?

Track:

  • time to acknowledge
  • time to mitigate
  • time to resolve
  • alert quality
  • number of incidents by cause
  • completion rate of postmortem action items
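
With start, acknowledge, and resolve timestamps recorded in the timeline, the first three metrics fall out of simple date arithmetic. This sketch assumes GNU `date`; the timestamps are illustrative:

```bash
# Sketch: compute time-to-acknowledge and time-to-resolve in minutes.
started="2024-05-01T14:02:00Z"    # alert fired
acked="2024-05-01T14:04:00Z"      # incident declared
resolved="2024-05-01T14:30:00Z"   # recovery verified

# Convert an ISO 8601 timestamp to Unix epoch seconds
to_epoch() { date -u -d "$1" +%s; }

tta=$(( ($(to_epoch "$acked") - $(to_epoch "$started")) / 60 ))
ttr=$(( ($(to_epoch "$resolved") - $(to_epoch "$started")) / 60 ))

echo "time to acknowledge: ${tta} min"
echo "time to resolve: ${ttr} min"
```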

Do I need on-call software for a small SaaS?

No. Start with alerts, one incident channel, a runbook, and clear severity rules.

When should I declare an incident?

As soon as there is real user impact, revenue risk, or data/security concern.

Should I roll back immediately?

Roll back quickly if the latest deploy is the likely trigger and rollback is safe.

How often should I update stakeholders?

Every 15–30 minutes during active impact, even if there is no new resolution yet.

Who owns the postmortem?

The incident commander or service owner, but action items must have individual owners.

Final takeaway

Small SaaS teams do not need a heavy incident management stack. They need a simple, repeatable response process.

Core sequence:

  • detect
  • acknowledge
  • triage
  • mitigate
  • communicate
  • verify
  • document
  • improve

Most incident recovery time is lost in confusion, not debugging. A written playbook removes that confusion.