Incident Response Playbook

The essential playbook for implementing incident response in your SaaS.

A small SaaS needs a repeatable way to handle outages, performance regressions, failed deploys, payment failures, and auth incidents. This playbook gives a lightweight incident response process for indie teams: one source of truth, clear severity levels, fast triage, rollback rules, communication templates, and post-incident follow-up.

Quick Fix / Quick Setup

Create a repo-based incident checklist and severity matrix:

```bash
mkdir -p ops/runbooks ops/postmortems && cat > ops/runbooks/incident-checklist.md <<'EOF'
# Incident Checklist

## 1. Acknowledge
- Create incident channel/thread
- Assign incident commander
- Record start time
- Freeze non-essential deploys

## 2. Triage
- What is broken?
- Who is affected?
- Is this ongoing?
- What changed in last 60 minutes?
- Severity: SEV1 / SEV2 / SEV3

## 3. Stabilize
- Roll back latest deploy if suspicious
- Enable maintenance mode if needed
- Scale app/workers if saturation issue
- Disable broken integration/feature flag

## 4. Investigate
- Check app logs
- Check server logs
- Check DB health
- Check queue health
- Check third-party status pages

## 5. Communicate
- Internal update every 15-30 min
- Customer status page update if user impact exists

## 6. Resolve
- Verify recovery with metrics + manual test
- Record resolution time

## 7. Postmortem
- Root cause
- Trigger
- Detection gap
- Action items
EOF

cat > ops/runbooks/severity-matrix.md <<'EOF'
# Severity Matrix
- SEV1: Full outage, login/payment/core API broken, major data risk
- SEV2: Major feature degraded, subset of users blocked, delayed jobs/payments
- SEV3: Minor issue, workaround exists, limited impact
EOF
```

Start with:

  • one incident runbook
  • one severity matrix
  • one communication path
  • one rollback policy
  • one postmortem template

Quick setup

  • Create a single incident runbook in your repo or internal docs.
  • Define severity levels: SEV1, SEV2, SEV3 with clear examples.
  • Assign default roles: incident commander, investigator, communications owner, scribe.
  • Freeze unrelated deploys during active incidents.
  • Set a rule for updates: internal every 15 minutes for SEV1, every 30 minutes for SEV2.
  • Prepare rollback commands, status page templates, and dashboards before the first real incident.
Incident lifecycle: detection → triage → mitigation → resolution → postmortem.

Example responder roles:

```md
# Incident Roles
- Incident Commander: owns decisions and prioritization
- Investigator: runs technical checks and tests hypotheses
- Communications Owner: posts status updates internally and externally
- Scribe: records timeline, actions, and evidence
```

Example severity matrix:

```md
# Severity Guide

## SEV1
- Full outage
- Login unavailable
- Payment flow broken for most users
- Data corruption or security risk
- Core API unavailable

## SEV2
- Major degradation
- Background jobs heavily delayed
- One major feature unavailable
- Subset of users blocked

## SEV3
- Minor degradation
- Workaround exists
- Limited user impact
- No immediate revenue or data risk
```

What’s happening

An incident is a production issue with user, revenue, security, or data impact.

Common failure in small teams:

  • no single source of truth
  • no owner during the incident
  • no clear severity level
  • no rollback decision rule
  • no timeline of what changed

The main problem is usually process, not tooling.

Without a playbook:

  • responders jump between logs and guesses
  • multiple people make conflicting changes
  • deploys continue during active impact
  • customer communication becomes delayed or inaccurate
  • root cause gets lost after recovery

The goal is to reduce:

  • time to acknowledge
  • time to mitigate
  • time to resolve

You do not need enterprise incident tooling for an MVP. You need repeatable steps, evidence capture, and links to your actual systems.

Step-by-step implementation

1. Define what counts as an incident

Use business impact, not annoyance.

Treat these as incidents:

  • login broken
  • checkout or subscription flow broken
  • API returning elevated 5xx errors
  • worker backlog causing delayed customer actions
  • database outage
  • auth provider outage
  • deploy failure causing production crash
  • data exposure or integrity risk

Do not treat these as formal incidents unless impact is real:

  • isolated dev-only bugs
  • cosmetic UI bugs with no business impact
  • alerts with no user-facing symptoms

2. Define severity by impact

Map severity to customer effect.

```md
SEV1
- Full or near-full outage
- Critical path broken: auth, payments, core API
- Security or major data risk
- Immediate revenue impact

SEV2
- Major degradation
- Affected subset of users or one important workflow
- Delayed jobs, integrations, or billing sync
- Workaround may exist

SEV3
- Minor issue
- Limited scope
- Workaround exists
- Low urgency but still production-impacting
```

3. Document responder roles

Even if one person fills all roles, keep the structure.

```md
Incident Commander
- Declares incident
- Assigns severity
- Freezes non-essential deploys
- Approves mitigation choice
- Decides when incident is resolved

Investigator
- Checks logs, dashboards, infra, and recent changes
- Tests hypotheses
- Proposes mitigation

Communications Owner
- Posts updates to status page/support/internal channels
- Sets next update time
- Avoids speculation

Scribe
- Records timeline
- Captures commands run, screenshots, links, and outcomes
```

4. Choose communication channels

Minimum setup:

  • one incident channel or thread in Slack/Discord
  • one shared incident doc or issue
  • one customer-facing status page
  • one issue tracker for follow-up work

Example incident template:

```md
# Incident: <title>

- Start time:
- Severity:
- Incident commander:
- Investigator:
- Communications owner:
- Scribe:

## Impact
- Affected users:
- Affected systems:
- Symptoms:

## Suspected trigger
- Latest deploy / config / migration / provider issue / unknown

## Mitigations attempted
- <timestamp> action -> result

## Next update
- <time>

## Resolution
- Resolved at:
- Recovery verification:
```
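
A template like the one above can be opened from the command line with a small helper; the `ops/incidents` path and the slug below are illustrative, not a convention this playbook requires:

```bash
# Sketch: start a new incident record from the template.
mkdir -p ops/incidents
ts=$(date -u +"%Y-%m-%dT%H:%MZ")
slug="api-5xx-spike"                                  # short incident name
file="ops/incidents/$(date -u +%Y%m%d)-${slug}.md"

# Pre-fill the fields responders always need first
{
  echo "# Incident: ${slug}"
  echo ""
  echo "- Start time: ${ts}"
  echo "- Severity:"
  echo "- Incident commander:"
} > "$file"

echo "created ${file}"
```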

5. Route alerts into one path

Alerts should converge into one responder workflow.

Sources:

  • uptime monitor
  • error tracking
  • app metrics
  • infrastructure metrics
  • queue depth alerts
  • payment/auth webhook failure alerts

If alerts are scattered, responders lose time.

6. Prepare investigation links in advance

Store all responder links in one file.

```md
# ops/runbooks/service-links.md

## App
- Production URL:
- Health endpoint:
- Sentry:
- App logs:
- APM dashboard:

## Infra
- Server metrics:
- Cloud dashboard:
- Container dashboard:
- Nginx logs:
- Process supervisor:

## Data
- DB dashboard:
- Queue dashboard:
- Redis dashboard:

## Third parties
- Stripe status:
- Email provider status:
- Auth provider status:
- Cloud provider status:
```

7. Define mitigation options before the incident

Do not invent mitigation during active impact.

Typical mitigations:

  • rollback latest deploy
  • disable feature flag
  • enable maintenance mode
  • scale app instances
  • scale workers
  • restart failed processes
  • pause queue consumers
  • pause broken webhook processing
  • disable a failing integration
  • switch traffic away from a bad node
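
As one concrete sketch, a kill switch can be as simple as an env-file flag the app reads at startup. `FLAG_FILE`, `WEBHOOKS_ENABLED`, and the restart step are assumptions about your stack, not requirements:

```bash
# Sketch: flip an env-file kill switch off and log the action.
# Your app must actually read this file for the toggle to take effect.
FLAG_FILE="${FLAG_FILE:-/tmp/myapp-flags.env}"

# Seed the file if missing (normally it lives in your config repo)
[ -f "$FLAG_FILE" ] || echo "WEBHOOKS_ENABLED=true" > "$FLAG_FILE"

# Disable the failing integration and record the action for the timeline
sed -i 's/^WEBHOOKS_ENABLED=.*/WEBHOOKS_ENABLED=false/' "$FLAG_FILE"
echo "$(date -u +%H:%M:%SZ) set WEBHOOKS_ENABLED=false" >> /tmp/incident-actions.log

# Then restart or signal the app, e.g.:
# systemctl restart myapp
```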

Example rollback notes:

```md
# Rollback Rules
- If the issue started immediately after a deploy and rollback is safe, roll back first.
- If a migration is destructive or non-reversible, do not blindly roll back the app alone.
- If the cause is an external provider outage, disable the dependency path or show degraded-mode UI.
- If the issue is resource exhaustion, stabilize capacity before deep debugging.
```

8. Create service-specific runbooks

At minimum, write runbooks for:

  • 502 errors
  • app crash after deployment
  • DB connection failures
  • background worker backlog
  • auth/session failures
  • payment webhook failures

9. Use a strict triage sequence

During triage, always answer these questions first:

  1. What is broken?
  2. Who is affected?
  3. Is it still ongoing?
  4. What changed in the last 60 minutes?
  5. Is this app, infra, DB, queue, or third-party?
  6. What is the safest fast mitigation?

Useful recent-change checklist:

  • deploy
  • config change
  • secrets rotation
  • migration
  • dependency update
  • DNS change
  • TLS renewal
  • traffic spike
  • cron or worker change
  • provider outage

10. Record a timeline during the incident

Every action needs:

  • timestamp
  • actor
  • action
  • result

Example:

```md
14:02 UTC - Alert fired for 65% 5xx on /api/login
14:04 UTC - Incident declared SEV1
14:05 UTC - Deploy freeze announced
14:08 UTC - Latest deploy identified as likely trigger
14:10 UTC - Rollback started
14:14 UTC - Error rate dropping
14:18 UTC - Manual login test passed
14:20 UTC - Status page updated: recovering
14:28 UTC - Metrics stable for 10 minutes
14:30 UTC - Incident resolved
```
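
A timeline like the one above can be captured with a tiny shell helper the scribe keeps open; the file path is illustrative:

```bash
# Sketch: append timestamped entries to the incident timeline.
TIMELINE="${TIMELINE:-/tmp/incident-timeline.md}"

note() {
  # Prefix every entry with the current UTC time
  echo "$(date -u +%H:%M) UTC - $*" >> "$TIMELINE"
}

note "Rollback started"
note "Error rate dropping"
tail -n 2 "$TIMELINE"
```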

11. Communicate with short factual updates

Use:

  • current impact
  • current action
  • confidence level
  • next update time

Avoid:

  • guesses
  • unverified root cause statements
  • optimistic recovery estimates without evidence

Example internal update:

```md
SEV2 update 13:30 UTC
- Impact: delayed background jobs for new uploads
- Scope: all users, backlog growing
- Current action: scaling workers and checking Redis connectivity
- Recent trigger: deploy at 13:05 UTC under review
- Next update: 13:45 UTC
```

Example customer update:

```md
We are investigating elevated errors affecting file uploads. Some uploads may be delayed or fail. Next update in 30 minutes.
```

12. Verify recovery before closing

Do not mark an incident as resolved just because one request succeeded.

Verify with:

  • error rate returned to baseline
  • latency normalized
  • queue depth draining
  • DB health normal
  • manual test of key user flows
  • no active alerts
  • customer-facing path working

Example manual verification checklist:

```md
- Login works
- Signup works
- Billing checkout works
- API health endpoint returns 200
- Background jobs processing
- Webhooks received and acknowledged
```
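
The manual checks can be backed by a small probe script. `BASE_URL` and the paths are placeholders; the `|| true` keeps the demo from aborting if the placeholder host is unreachable:

```bash
# Sketch: probe key endpoints and report any non-2xx responses.
BASE_URL="${BASE_URL:-https://example.com}"

check() {
  local path="$1" code
  # Fetch only the status code, with a short timeout
  code=$(curl -s --max-time 5 -o /dev/null -w '%{http_code}' "${BASE_URL}${path}")
  if [ "$code" -ge 200 ] && [ "$code" -lt 300 ]; then
    echo "OK   ${path} (${code})"
  else
    echo "FAIL ${path} (${code})"
    return 1
  fi
}

check /health || true
check /login  || true
```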

13. Run a postmortem within 24–72 hours

Do this for all SEV1 and SEV2 incidents.

Store the document in version control:

```md
# Postmortem: <incident title>

## Summary
## Customer impact
## Timeline
## Root cause
## Trigger
## Detection gap
## What worked
## What failed
## Corrective actions
- [ ] Action item / owner / due date
```

14. Track follow-up work

A postmortem without tracked actions is only documentation.

Common follow-up categories:

  • missing alert
  • missing dashboard
  • poor log context
  • rollback too slow
  • no feature flag
  • unsafe migration process
  • missing runbook
  • missing owner
  • poor customer communication path
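
One lightweight way to keep follow-up visible is grepping for open checkboxes across stored postmortems, assuming the `- [ ]` convention from the template above:

```bash
# Sketch: list open postmortem action items across all postmortem docs.
mkdir -p ops/postmortems
grep -rn -- "- \[ \]" ops/postmortems || echo "no open action items"
```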
Example monitoring alert thresholds:

| Metric | Warning | High | Critical |
| --- | --- | --- | --- |
| Uptime | < 99.9% | < 99.5% | < 99% |
| Error rate | > 1% | > 3% | > 5% |
| Response time | > 500 ms | > 1 s | > 3 s |
| CPU usage | > 60% | > 80% | > 95% |
| Memory usage | > 70% | > 85% | > 95% |

Suggested visual: severity matrix table with examples for auth outage, billing outage, slow API, worker delay, and partial feature degradation.

Common causes

  • Bad deploy or untested hotfix introduced errors or startup failures.
  • Environment variable or secret changed, missing, or loaded incorrectly.
  • Database migration caused lockups, schema mismatch, or application incompatibility.
  • Infrastructure saturation: CPU, memory, disk, file descriptors, DB connections, worker backlog.
  • Reverse proxy or app server misconfiguration causing 502 or timeout errors.
  • TLS, DNS, or domain configuration issues after renewal or provider changes.
  • Third-party outage affecting auth, payments, email, storage, or webhooks.
  • Background job workers stopped, crashed, or cannot reach Redis or RabbitMQ.
  • Feature flag or config toggle enabled a broken code path.
  • Insufficient monitoring delayed detection, making impact larger.

Debugging tips

Start with fast environment checks:

```bash
date && uptime
free -h
df -h
top -o %CPU
ps aux --sort=-%mem | head
ss -ltnp
```

Check reverse proxy and app service health:

```bash
journalctl -u nginx -n 200 --no-pager
journalctl -u gunicorn -n 200 --no-pager
sudo nginx -t
systemctl status nginx
systemctl status gunicorn
tail -n 200 /var/log/nginx/error.log
tail -n 200 /var/log/nginx/access.log
curl -I https://yourdomain.com
curl -sS https://yourdomain.com/health
```

If you use containers:

```bash
docker ps
docker logs --tail 200 <container_name>
docker stats --no-stream
```

Check runtime config and app state:

```bash
env | sort
printenv | sort
python -m flask routes
alembic current
alembic heads
```

Check data services:

```bash
psql "$DATABASE_URL" -c 'select now();'
redis-cli ping
celery -A app inspect ping
```

Check external dependency reachability:

```bash
curl -v https://api.stripe.com
dig yourdomain.com
nslookup yourdomain.com
```

Debugging sequence:

  1. confirm user-visible symptom
  2. inspect recent changes
  3. check logs and error spikes
  4. check resource saturation
  5. isolate whether issue is app, infra, DB, queue, or provider
  6. mitigate safely first
  7. verify recovery with metrics and manual checks

Checklist

  • Severity levels documented.
  • Incident commander role defined.
  • Rollback procedure documented and tested.
  • Status page or customer communication path ready.
  • Dashboards and logs linked from one runbook.
  • Recent deploy, config, and migration history accessible.
  • Freeze deploy rule defined for active incidents.
  • Postmortem template available.
  • Common incident runbooks created for top 5 failure modes.
  • Alert routing configured to one response path.
  • Manual verification checklist prepared for core user journeys.
  • Postmortem action items assigned with owner and due date.

Suggested visual: a one-page incident response checklist printable for on-call or deploy operators.

FAQ

What is the minimum incident process for a solo founder?

Use one severity matrix, one incident checklist, one communication channel, rollback steps, and a postmortem template. That is enough to respond consistently.

Should every alert create an incident?

No. Only issues with active user, revenue, security, or data impact should become formal incidents. Lower-priority alerts can stay as operational tasks.

How do I decide between restart and rollback?

Restart for transient resource or process failures. Roll back when the issue likely started with a recent deploy, migration, or config change.

Do I need a public status page?

If customers depend on your app for production work or payments, yes. For early MVPs, even a simple hosted status page is better than ad-hoc support replies.

What should I measure after introducing a playbook?

Track:

  • time to acknowledge
  • time to mitigate
  • time to resolve
  • alert quality
  • number of incidents by cause
  • completion rate of postmortem action items
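
With start, acknowledge, and resolve timestamps recorded in the timeline, the first three metrics fall out of simple date arithmetic. This sketch assumes GNU `date`; the timestamps are illustrative:

```bash
# Sketch: compute time-to-acknowledge and time-to-resolve in minutes.
started="2024-05-01T14:02:00Z"    # alert fired
acked="2024-05-01T14:04:00Z"      # incident declared
resolved="2024-05-01T14:30:00Z"   # recovery verified

# Convert an ISO 8601 timestamp to Unix epoch seconds
to_epoch() { date -u -d "$1" +%s; }

tta=$(( ($(to_epoch "$acked") - $(to_epoch "$started")) / 60 ))
ttr=$(( ($(to_epoch "$resolved") - $(to_epoch "$started")) / 60 ))

echo "time to acknowledge: ${tta} min"
echo "time to resolve: ${ttr} min"
```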

Do I need on-call software for a small SaaS?

No. Start with alerts, one incident channel, a runbook, and clear severity rules.

When should I declare an incident?

As soon as there is real user impact, revenue risk, or data/security concern.

Should I roll back immediately?

Roll back quickly if the latest deploy is the likely trigger and rollback is safe.

How often should I update stakeholders?

Every 15–30 minutes during active impact, even if there is no new resolution yet.

Who owns the postmortem?

The incident commander or service owner, but action items must have individual owners.

Final takeaway

Small SaaS teams do not need a heavy incident management stack. They need a simple, repeatable response process.

Core sequence:

  • detect
  • acknowledge
  • triage
  • mitigate
  • communicate
  • verify
  • document
  • improve

Most incident recovery time is lost in confusion, not debugging. A written playbook removes that confusion.