Database Performance Monitoring
The essential playbook for implementing database performance monitoring in your SaaS.
Monitor your database like a production dependency, not a black box.
For a small SaaS, the minimum useful setup is:
- query latency
- error rate
- active connections
- slow query logging
- lock and wait visibility
- disk usage
- backup health
This setup works for PostgreSQL and MySQL-backed apps and gives enough signal to catch regressions before they become user-facing incidents.
Quick Fix / Quick Setup
Start with native database stats and logs before adding a full observability stack.
If you only do three things:
- enable slow query logging
- collect connection metrics
- alert on latency and error spikes
PostgreSQL quick checks
# 1) Enable slow statement visibility
psql "$DATABASE_URL" -c "ALTER SYSTEM SET log_min_duration_statement = '500ms';"
psql "$DATABASE_URL" -c "ALTER SYSTEM SET log_lock_waits = on;"
psql "$DATABASE_URL" -c "ALTER SYSTEM SET deadlock_timeout = '1s';"
psql "$DATABASE_URL" -c "SELECT pg_reload_conf();"
# 2) Find slow queries right now (requires the pg_stat_statements extension; see step 2 below)
psql "$DATABASE_URL" -c "SELECT query, calls, total_exec_time, mean_exec_time, rows FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"
# 3) Check active connections and blockers
psql "$DATABASE_URL" -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
psql "$DATABASE_URL" -c "SELECT pid, usename, wait_event_type, wait_event, state, query FROM pg_stat_activity WHERE state <> 'idle';"
# 4) Basic size + table hotspots
psql "$DATABASE_URL" -c "SELECT pg_size_pretty(pg_database_size(current_database()));"
psql "$DATABASE_URL" -c "SELECT relname, seq_scan, idx_scan, n_live_tup FROM pg_stat_user_tables ORDER BY seq_scan DESC LIMIT 10;"
MySQL quick checks
mysql -e "SET GLOBAL slow_query_log = 'ON';"
mysql -e "SET GLOBAL long_query_time = 0.5;"
mysql -e "SHOW PROCESSLIST;"
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';"
What’s happening
Database issues usually show up first in the app layer:
- slow requests
- queue buildup
- timeouts
- intermittent 500s
The database failure mode is usually not just “the DB is slow.” It is one of these:
- missing indexes
- connection pool exhaustion
- lock contention
- long-running transactions
- inefficient queries
- disk saturation
- under-provisioned CPU, RAM, or IOPS
- replication lag
You need both:
- real-time metrics for current incidents
- historical trends for recurring regressions
For small SaaS products, focus on actionable signals that help you decide whether to:
- optimize a query
- add an index
- change the pool size
- scale the instance
- investigate a recent deploy or migration
Process Flow
Step-by-step implementation
1) Define the minimum metrics
Track these first:
- p50, p95, p99 query latency
- slow query count
- active and idle connections
- max connections usage percent
- transaction duration
- lock waits
- deadlocks
- replication lag
- CPU
- memory
- disk usage
- disk throughput / IOPS
- backup success/failure
If you use a managed database, also enable provider metrics immediately.
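As a concrete reference, the percentile targets above can be computed from raw latency samples with the nearest-rank method; a minimal Python sketch (the sample values are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of samples are <= it. Assumes a non-empty list."""
    s = sorted(samples)
    rank = math.ceil(p / 100 * len(s))
    return s[max(rank - 1, 0)]

latencies_ms = [12, 15, 18, 20, 25, 30, 45, 60, 120, 400]
print(percentile(latencies_ms, 50))  # 25
print(percentile(latencies_ms, 95))  # 400
print(percentile(latencies_ms, 99))  # 400
```

Note how a single 400ms outlier dominates p95 and p99 while leaving p50 untouched, which is exactly why you track all three.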
2) Turn on database-native observability
PostgreSQL
Enable pg_stat_statements and useful logging.
postgresql.conf:
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all
log_min_duration_statement = 500ms
log_lock_waits = on
deadlock_timeout = 1s
log_checkpoints = on
log_temp_files = 0
log_autovacuum_min_duration = 1000ms
Apply and create the extension:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
Useful checks:
psql "$DATABASE_URL" -c "SHOW log_min_duration_statement;"
psql "$DATABASE_URL" -c "SHOW log_lock_waits;"
psql "$DATABASE_URL" -c "SELECT * FROM pg_extension WHERE extname = 'pg_stat_statements';"
MySQL
Enable slow query logging and performance schema.
Example my.cnf:
[mysqld]
slow_query_log = ON
long_query_time = 0.5
log_queries_not_using_indexes = OFF
performance_schema = ON
innodb_print_all_deadlocks = ON
Validate:
mysql -e "SHOW VARIABLES LIKE 'slow_query_log';"
mysql -e "SHOW VARIABLES LIKE 'long_query_time';"
mysql -e "SHOW VARIABLES LIKE 'performance_schema';"
3) Export metrics
Use one of these paths:
- managed provider metrics: RDS, Cloud SQL, Supabase, Neon, PlanetScale
- Prometheus exporters: postgres_exporter, mysqld_exporter
- hosted monitoring: Datadog, Grafana Cloud, New Relic
Example Prometheus scrape config:
scrape_configs:
  - job_name: postgres
    static_configs:
      - targets: ["postgres-exporter:9187"]
  - job_name: mysql
    static_configs:
      - targets: ["mysqld-exporter:9104"]
If you use PgBouncer, export pool metrics too.
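A PgBouncer scrape job can sit alongside the jobs above; this sketch assumes the prometheus-community pgbouncer_exporter and its default port 9127 — verify both against the exporter you actually deploy:

```yaml
scrape_configs:
  - job_name: pgbouncer
    static_configs:
      - targets: ["pgbouncer-exporter:9127"]  # default pgbouncer_exporter port (assumption; check your exporter's docs)
```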
4) Build a minimal dashboard
First row:
- p95 query latency
- query throughput
- error rate
- active connections / max connections
Second row:
- slow queries per minute
- lock waits
- deadlocks
- replication lag
Third row:
- CPU
- memory
- disk usage
- disk throughput / IOPS
Fourth row:
- top queries by total time
- top queries by mean time
- table scans vs index scans
- storage growth
5) Add slow query analysis
Review top queries by:
- total execution time
- mean execution time
- call count
PostgreSQL:
psql "$DATABASE_URL" -c "SELECT query, calls, total_exec_time, mean_exec_time, rows FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20;"
MySQL:
SELECT *
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;
Run EXPLAIN or EXPLAIN ANALYZE on the worst queries.
PostgreSQL:
psql "$DATABASE_URL" -c "EXPLAIN (ANALYZE, BUFFERS) SELECT ...;"
MySQL:
mysql -e "EXPLAIN ANALYZE SELECT ...;"
6) Monitor connection pooling
Many incidents are pool problems, not raw database capacity problems.
Track:
- pool size
- waiting clients
- checkout wait time
- connection churn
- active vs idle sessions
If you use PgBouncer:
psql "postgres://USER:PASSWORD@PGBOUNCER_HOST:6432/pgbouncer" -c "SHOW POOLS;"
psql "postgres://USER:PASSWORD@PGBOUNCER_HOST:6432/pgbouncer" -c "SHOW STATS;"
Common rule:
- keep app-side concurrency aligned with pool size
- do not let web workers and background workers open unlimited connections
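The alignment rule above can be sanity-checked with simple arithmetic; every number here is a hypothetical placeholder for your own deployment:

```python
# Hypothetical deployment numbers; substitute your own.
web_processes = 4          # e.g. application server worker processes
web_pool_per_process = 5   # per-process connection pool size
background_workers = 2     # e.g. job queue workers
bg_pool_per_worker = 3

# Worst case: every pool fully checked out at once.
app_side_max = (web_processes * web_pool_per_process
                + background_workers * bg_pool_per_worker)

max_connections = 100      # PostgreSQL default
reserved = 10              # headroom for superuser/maintenance sessions (assumption)

print(app_side_max)  # 26
print(app_side_max <= max_connections - reserved)  # True: pools fit with headroom
```

If the check fails, shrink pools or add PgBouncer rather than raising max_connections first.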
7) Track maintenance signals
PostgreSQL
Watch for:
- autovacuum activity
- long-running transactions
- dead tuples
- table bloat
- checkpoints
- vacuum lag
Commands:
psql "$DATABASE_URL" -c "SELECT relname, n_live_tup, n_dead_tup, seq_scan, idx_scan FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;"
psql "$DATABASE_URL" -c "SELECT pid, now() - xact_start AS txn_age, state, query FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY txn_age DESC LIMIT 20;"
MySQL
Watch for:
- InnoDB buffer pool pressure
- row lock waits
- transaction backlog
- temp table spill
- deadlocks
Commands:
mysql -e "SHOW ENGINE INNODB STATUS\G"
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool%';"
8) Set alerts
Use sustained thresholds, not single spikes.
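A sustained-threshold check can be sketched as: fire only when every sample in the window breaches, so a single spike stays silent (window and threshold values are illustrative):

```python
def sustained_breach(samples, threshold, window):
    """True only if the most recent `window` samples are ALL above
    threshold, so one transient spike does not page anyone."""
    if len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])

# One 310ms spike in an otherwise healthy series: no alert.
print(sustained_breach([120, 310, 140, 150], threshold=300, window=3))  # False
# Three consecutive minutes of p95 above 300ms: alert.
print(sustained_breach([310, 320, 340], threshold=300, window=3))       # True
```

Most alerting systems (Prometheus `for:`, Datadog evaluation windows) implement this idea for you; the sketch just shows the semantics.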
Recommended starter alerts:
- p95 query latency above threshold for 5–10 minutes
- active connections above 80% of max
- lock waits spike above baseline
- deadlocks > 0 in sustained windows
- replication lag above app tolerance
- free disk below safe threshold
- backup failure
- restore test failure
Example alert thresholds:
p95 query latency > 300ms for 10m
active_connections / max_connections > 0.80 for 5m
replication_lag > 30s for 5m
free_disk_percent < 15% for 10m
deadlocks > 0 for 10m
9) Tie alerts to a runbook
Every alert should map to a first action.
Examples:
| Alert | First action |
|---|---|
| p95 latency high | inspect top queries and recent deploys |
| connections high | inspect pool saturation and long-lived sessions |
| lock waits high | identify blocked and blocking sessions |
| replication lag high | inspect replica health and write volume |
| disk low | inspect table growth, temp file usage, retention |
This should connect directly with your incident workflow in Debugging Production Issues.
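One lightweight way to keep the mapping enforceable is a lookup table in the alert handler; the alert names here are hypothetical:

```python
# Hypothetical alert names mapped to the first runbook action.
RUNBOOK = {
    "p95_latency_high": "inspect top queries and recent deploys",
    "connections_high": "inspect pool saturation and long-lived sessions",
    "lock_waits_high": "identify blocked and blocking sessions",
    "replication_lag_high": "inspect replica health and write volume",
    "disk_low": "inspect table growth, temp file usage, retention",
}

def first_action(alert_name):
    # Surface unmapped alerts loudly so every new alert gets a runbook entry.
    return RUNBOOK.get(alert_name, "UNMAPPED ALERT: add a runbook entry")

print(first_action("lock_waits_high"))  # identify blocked and blocking sessions
```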
10) Review weekly
Database degradation is often gradual.
Weekly review:
- top queries by total time
- top tables by size growth
- scan patterns
- dead tuples / bloat indicators
- connection trends
- recent migrations
- background job impact
Also review schema changes using Database Migration Strategy.
Common causes
- Missing or incorrect indexes on filtered, joined, or sorted columns
- N+1 queries from ORM usage
- Connection pool misconfiguration
- Long-running transactions blocking vacuum or holding locks
- Lock contention from bulk updates or hot rows
- Full table scans from stale stats or poor query patterns
- Database instance too small for current traffic
- Disk saturation or low IOPS
- Storage nearing capacity
- Replication lag
- PostgreSQL autovacuum not keeping up
- Expensive background jobs competing with user traffic
- Unbounded growth in audit, event, or log tables
- Background workers sharing the same database without pool limits
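The N+1 pattern from the list above is easy to demonstrate with an in-memory SQLite database; this sketch counts round trips for the per-row loop versus a single JOIN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# N+1 pattern: one query for the users, then one query PER user for orders.
queries_run = 0
users = conn.execute("SELECT id, name FROM users").fetchall()
queries_run += 1
for user_id, _name in users:
    conn.execute("SELECT total FROM orders WHERE user_id = ?", (user_id,)).fetchall()
    queries_run += 1
print(queries_run)  # 3 = 1 + number of users; grows linearly with rows

# Batched alternative: one JOIN fetches the same data in a single round trip.
rows = conn.execute("""
    SELECT u.id, u.name, o.total
    FROM users u JOIN orders o ON o.user_id = u.id
""").fetchall()
print(len(rows))  # 3 order rows, one query
```

With an ORM, the fix is usually eager loading (e.g. a join or prefetch option) rather than hand-written SQL.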
Debugging tips
Start by identifying which class of problem you have:
- query latency
- connection saturation
- lock contention
- replication lag
- storage / IO bottleneck
Do not scale blindly before isolating the bottleneck.
PostgreSQL commands
psql "$DATABASE_URL" -c "SELECT now(), state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY state;"
psql "$DATABASE_URL" -c "SELECT pid, usename, application_name, client_addr, state, wait_event_type, wait_event, now() - query_start AS runtime, query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC;"
psql "$DATABASE_URL" -c "SELECT query, calls, total_exec_time, mean_exec_time, rows FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20;"
psql "$DATABASE_URL" -c "SELECT blocked.pid AS blocked_pid, blocker.pid AS blocker_pid, blocked.query AS blocked_query, blocker.query AS blocker_query FROM pg_stat_activity blocked JOIN pg_locks bl ON blocked.pid = bl.pid AND NOT bl.granted JOIN pg_locks kl ON bl.locktype = kl.locktype AND bl.database IS NOT DISTINCT FROM kl.database AND bl.relation IS NOT DISTINCT FROM kl.relation AND bl.page IS NOT DISTINCT FROM kl.page AND bl.tuple IS NOT DISTINCT FROM kl.tuple AND bl.virtualxid IS NOT DISTINCT FROM kl.virtualxid AND bl.transactionid IS NOT DISTINCT FROM kl.transactionid AND bl.classid IS NOT DISTINCT FROM kl.classid AND bl.objid IS NOT DISTINCT FROM kl.objid AND bl.objsubid IS NOT DISTINCT FROM kl.objsubid AND kl.granted JOIN pg_stat_activity blocker ON kl.pid = blocker.pid;"
psql "$DATABASE_URL" -c "SELECT relname, n_live_tup, n_dead_tup, seq_scan, idx_scan FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;"
psql "$DATABASE_URL" -c "EXPLAIN (ANALYZE, BUFFERS) SELECT ...;"
MySQL commands
mysql -e "SHOW FULL PROCESSLIST;"
mysql -e "SHOW ENGINE INNODB STATUS\G"
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';"
mysql -e "EXPLAIN ANALYZE SELECT ...;"
Host-level checks
df -h
iostat -xz 1 5
vmstat 1 5
Practical debugging rules
- Compare current behavior against the last known good deploy, migration, or traffic increase
- Look for a few expensive queries dominating total DB time
- Identify blocker sessions before killing random connections
- Verify indexes match actual WHERE, JOIN, ORDER BY, and LIMIT patterns
- Check ORM-generated SQL, not only handwritten SQL
- If CPU is low but latency is high, suspect locks, IO waits, or pool waiting
- Separate write-path and read-path metrics if using replicas
- During incidents, capture snapshots immediately because transient issues disappear fast
- If the app is also under pressure, compare with High CPU / Memory Usage
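A snapshot helper along the lines of the last rule might look like this; it assumes psql is on PATH and DATABASE_URL is set, and the query set is a starting point, not exhaustive:

```python
import datetime
import os
import subprocess

# Queries worth capturing the moment a PostgreSQL incident starts,
# because this state disappears as soon as the incident clears.
SNAPSHOT_QUERIES = {
    "activity": "SELECT * FROM pg_stat_activity;",
    "statements": ("SELECT query, calls, total_exec_time, mean_exec_time "
                   "FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 50;"),
    "locks": "SELECT * FROM pg_locks;",
}

def snapshot_paths(out_dir, now=None):
    """Map each snapshot name to a UTC-timestamped output file path."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return {name: os.path.join(out_dir, f"{stamp}_{name}.txt")
            for name in SNAPSHOT_QUERIES}

def capture(out_dir):
    """Run each query via psql and save the output to a timestamped file."""
    for name, path in snapshot_paths(out_dir).items():
        with open(path, "w") as fh:
            subprocess.run(
                ["psql", os.environ["DATABASE_URL"], "-c", SNAPSHOT_QUERIES[name]],
                stdout=fh, check=False)
```

Wiring `capture()` into the start of your incident checklist means you still have evidence after a transient lock storm resolves itself.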
Checklist
- ✓ Slow query logging enabled
- ✓ Query statistics enabled and reviewed regularly
- ✓ Dashboard includes latency, connections, locks, storage, and replication
- ✓ Alerts configured for p95 latency, connection usage, disk, deadlocks, and replication lag
- ✓ Connection pool metrics visible
- ✓ Backup success tracked
- ✓ Restore tests tracked
- ✓ Recent migrations reviewed for performance regressions
- ✓ Runbook exists for slow queries, lock contention, and connection exhaustion
- ✓ Retention policy defined for logs and metrics
- ✓ Provider-specific metrics enabled for managed databases
- ✓ Monitoring stack reviewed against Monitoring Checklist
- ✓ Production readiness reviewed in SaaS Production Checklist
Related guides
- Debugging Production Issues
- Database Migration Strategy
- High CPU / Memory Usage
- Monitoring Checklist
- SaaS Production Checklist
FAQ
What should I monitor first for a small SaaS database?
Start with:
- query latency
- slow query count
- active connections
- lock waits
- CPU
- disk space
- backup status
These cover the most common production failures.
How do I know if the problem is the app or the database?
Compare app request latency with database query latency and connection wait time.
- If request latency rises while DB latency and connection usage stay flat, the app may be the bottleneck.
- If DB latency, locks, or connections spike, start with the database.
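The comparison above can be encoded as a rough triage helper (a sketch, not a substitute for looking at real dashboards):

```python
def likely_bottleneck(app_latency_up, db_latency_up, conn_usage_up):
    """Rough first-pass triage following the rule above: app-only symptoms
    point at the app; any database-side signal means start with the database."""
    if app_latency_up and not (db_latency_up or conn_usage_up):
        return "app"
    if db_latency_up or conn_usage_up:
        return "database"
    return "unclear"

print(likely_bottleneck(True, False, False))  # app
print(likely_bottleneck(True, True, False))   # database
```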
Should I use a read replica to fix slow queries?
Not first.
Replicas help with read scaling, but they do not fix:
- bad query plans
- missing indexes
- write lock contention
- connection pool issues
What is a safe slow query threshold?
Use a threshold that matches your SLA.
For many small SaaS apps, 200ms to 500ms is a practical starting point. Lower it for latency-sensitive APIs.
Do managed databases remove the need for monitoring?
No.
Managed services reduce infrastructure work, but these are still your responsibility:
- query quality
- schema growth
- lock contention
- pool usage
- replication behavior
- backup validation
Is pg_stat_statements safe in production?
Yes. It is commonly used in production PostgreSQL environments.
Enable it intentionally and monitor overhead if your environment is very constrained.
Final takeaway
Good database monitoring is not about collecting every metric.
It is about quickly answering:
- are queries slow?
- are connections exhausted?
- are locks blocking progress?
- is storage or IO the bottleneck?
- did a recent deploy or migration cause the change?
For small SaaS teams, the highest-value setup is:
- slow query visibility
- connection monitoring
- lock inspection
- storage tracking
- alert thresholds tied to a clear runbook
Start there, keep the dashboard small, and review it every week.