Database Performance Monitoring

The essential playbook for implementing database performance monitoring in your SaaS.

Monitor your database like a production dependency, not a black box.

For a small SaaS, the minimum useful setup is:

  • query latency
  • error rate
  • active connections
  • slow query logging
  • lock and wait visibility
  • disk usage
  • backup health

This setup works for PostgreSQL and MySQL-backed apps and gives enough signal to catch regressions before they become user-facing incidents.

Quick Fix / Quick Setup

Start with native database stats and logs before adding a full observability stack.

If you only do three things:

  1. enable slow query logging
  2. collect connection metrics
  3. alert on latency and error spikes

PostgreSQL quick checks

bash
# 1) Enable slow statement visibility (requires superuser; on managed databases, set these via the provider's parameter settings instead)
psql "$DATABASE_URL" -c "ALTER SYSTEM SET log_min_duration_statement = '500ms';"
psql "$DATABASE_URL" -c "ALTER SYSTEM SET log_lock_waits = on;"
psql "$DATABASE_URL" -c "ALTER SYSTEM SET deadlock_timeout = '1s';"
psql "$DATABASE_URL" -c "SELECT pg_reload_conf();"

# 2) Find slow queries right now
psql "$DATABASE_URL" -c "SELECT query, calls, total_exec_time, mean_exec_time, rows FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"

# 3) Check active connections and blockers
psql "$DATABASE_URL" -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
psql "$DATABASE_URL" -c "SELECT pid, usename, wait_event_type, wait_event, state, query FROM pg_stat_activity WHERE state <> 'idle';"

# 4) Basic size + table hotspots
psql "$DATABASE_URL" -c "SELECT pg_size_pretty(pg_database_size(current_database()));"
psql "$DATABASE_URL" -c "SELECT relname, seq_scan, idx_scan, n_live_tup FROM pg_stat_user_tables ORDER BY seq_scan DESC LIMIT 10;"

MySQL quick checks

bash
mysql -e "SET GLOBAL slow_query_log = 'ON';"   # requires SUPER or SYSTEM_VARIABLES_ADMIN
mysql -e "SET GLOBAL long_query_time = 0.5;"   # seconds; applies to new connections
mysql -e "SHOW PROCESSLIST;"                                # current sessions
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';"     # open connections
mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';"          # cumulative slow query count

What’s happening

Database issues usually show up first in the app layer:

  • slow requests
  • queue buildup
  • timeouts
  • intermittent 500s

The database failure mode is usually not just “the DB is slow.” It is one of these:

  • missing indexes
  • connection pool exhaustion
  • lock contention
  • long-running transactions
  • inefficient queries
  • disk saturation
  • under-provisioned CPU, RAM, or IOPS
  • replication lag

You need both:

  • real-time metrics for current incidents
  • historical trends for recurring regressions

For small SaaS products, focus on actionable signals that help you decide whether to:

  • optimize a query
  • add an index
  • change the pool size
  • scale the instance
  • investigate a recent deploy or migration

Flow: app requests → connection pool → database → slow queries / locks / storage / replication → app symptoms

Process Flow

Step-by-step implementation

1) Define the minimum metrics

Track these first:

  • p50, p95, p99 query latency
  • slow query count
  • active and idle connections
  • max connections usage percent
  • transaction duration
  • lock waits
  • deadlocks
  • replication lag
  • CPU
  • memory
  • disk usage
  • disk throughput / IOPS
  • backup success/failure

If you use a managed database, also enable provider metrics immediately.
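
If you scrape these metrics with Prometheus, a recording rule can precompute the connection usage percent so dashboards and alerts share one definition. This sketch assumes the metric names exposed by postgres_exporter (pg_stat_activity_count, pg_settings_max_connections); adjust to whatever your exporter actually emits.

```yaml
groups:
  - name: db_connections
    rules:
      # Fraction of max_connections currently in use
      # (assumes postgres_exporter metric names)
      - record: db:connection_usage:ratio
        expr: sum(pg_stat_activity_count) / pg_settings_max_connections
```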

2) Turn on database-native observability

PostgreSQL

Enable pg_stat_statements and useful logging.

postgresql.conf:

conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all

log_min_duration_statement = 500ms
log_lock_waits = on
deadlock_timeout = 1s
log_checkpoints = on
log_temp_files = 0
log_autovacuum_min_duration = 1000ms

Note that changing shared_preload_libraries requires a server restart; the logging settings only need a reload. After restarting, create the extension:

sql
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

Useful checks:

bash
psql "$DATABASE_URL" -c "SHOW log_min_duration_statement;"
psql "$DATABASE_URL" -c "SHOW log_lock_waits;"
psql "$DATABASE_URL" -c "SELECT * FROM pg_extension WHERE extname = 'pg_stat_statements';"
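
pg_stat_statements accumulates since its last reset, so totals can be dominated by old traffic. One option is to reset the view right after a deploy and compare fresh numbers:

```sql
-- Reset accumulated statistics (e.g. right after a deploy)
SELECT pg_stat_statements_reset();

-- A short while later, see what dominates under the new code
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```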

MySQL

Enable slow query logging and performance schema.

Example my.cnf:

conf
[mysqld]
slow_query_log = ON
long_query_time = 0.5
log_queries_not_using_indexes = OFF
performance_schema = ON
innodb_print_all_deadlocks = ON

Validate:

bash
mysql -e "SHOW VARIABLES LIKE 'slow_query_log';"
mysql -e "SHOW VARIABLES LIKE 'long_query_time';"
mysql -e "SHOW VARIABLES LIKE 'performance_schema';"

3) Export metrics

Use one of these paths:

  • managed provider metrics: RDS, Cloud SQL, Supabase, Neon, PlanetScale
  • Prometheus exporters: postgres_exporter, mysqld_exporter
  • hosted monitoring: Datadog, Grafana Cloud, New Relic

Example Prometheus scrape config:

yaml
scrape_configs:
  - job_name: postgres
    static_configs:
      - targets: ["postgres-exporter:9187"]

  - job_name: mysql
    static_configs:
      - targets: ["mysqld-exporter:9104"]

If you use PgBouncer, export pool metrics too.
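
One way to wire the exporters up, sketched as a docker-compose fragment. The image names, flags, ports, and credentials here are assumptions taken from the prometheus-community exporters; check the docs for your versions before copying.

```yaml
services:
  postgres-exporter:
    image: quay.io/prometheuscommunity/postgres-exporter   # assumed image name
    environment:
      # read-only monitoring user is enough
      DATA_SOURCE_NAME: "postgresql://monitor:secret@db:5432/app?sslmode=disable"
    ports:
      - "9187:9187"

  pgbouncer-exporter:
    image: prometheuscommunity/pgbouncer-exporter          # assumed image name
    command: ["--pgBouncer.connectionString=postgres://monitor:secret@pgbouncer:6432/pgbouncer?sslmode=disable"]
    ports:
      - "9127:9127"
```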

4) Build a minimal dashboard

First row:

  • p95 query latency
  • query throughput
  • error rate
  • active connections / max connections

Second row:

  • slow queries per minute
  • lock waits
  • deadlocks
  • replication lag

Third row:

  • CPU
  • memory
  • disk usage
  • disk throughput / IOPS

Fourth row:

  • top queries by total time
  • top queries by mean time
  • table scans vs index scans
  • storage growth


5) Add slow query analysis

Review top queries by:

  • total execution time
  • mean execution time
  • call count

PostgreSQL:

bash
psql "$DATABASE_URL" -c "SELECT query, calls, total_exec_time, mean_exec_time, rows FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20;"

MySQL:

sql
SELECT *
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;
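
performance_schema timers are in picoseconds, so the raw sums are hard to read. A variant that converts to milliseconds and shows the normalized statement text:

```sql
SELECT DIGEST_TEXT,
       COUNT_STAR AS calls,
       SUM_TIMER_WAIT / 1e9 AS total_ms,   -- picoseconds -> milliseconds
       AVG_TIMER_WAIT / 1e9 AS mean_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;
```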

Run EXPLAIN or EXPLAIN ANALYZE on the worst queries.

PostgreSQL:

bash
psql "$DATABASE_URL" -c "EXPLAIN (ANALYZE, BUFFERS) SELECT ...;"

MySQL:

bash
mysql -e "EXPLAIN ANALYZE SELECT ...;"

6) Monitor connection pooling

Many incidents are pool problems, not raw database capacity problems.

Track:

  • pool size
  • waiting clients
  • checkout wait time
  • connection churn
  • active vs idle sessions

If you use PgBouncer:

bash
psql "postgres://USER:PASSWORD@PGBOUNCER_HOST:6432/pgbouncer" -c "SHOW POOLS;"
psql "postgres://USER:PASSWORD@PGBOUNCER_HOST:6432/pgbouncer" -c "SHOW STATS;"

Common rule:

  • keep app-side concurrency aligned with pool size
  • do not let web workers and background workers open unlimited connections
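
A quick way to sanity-check that rule: add up the worst-case connections every process class can open and compare against max_connections. The numbers below are hypothetical placeholders; substitute your own.

```shell
# Hypothetical deployment shape -- substitute your own numbers
WEB_PROCESSES=4      # e.g. gunicorn/puma workers
WEB_POOL=5           # DB pool size per web process
JOB_PROCESSES=2      # background workers
JOB_POOL=5
MAX_CONNECTIONS=100  # from: SHOW max_connections;

# Worst case: every process fills its pool at once
TOTAL=$(( WEB_PROCESSES * WEB_POOL + JOB_PROCESSES * JOB_POOL ))
USAGE_PCT=$(( TOTAL * 100 / MAX_CONNECTIONS ))
echo "worst-case app connections: $TOTAL (${USAGE_PCT}% of max_connections)"
```

If the worst case lands above roughly 80% of max_connections, shrink the pools or put PgBouncer in front before traffic finds the ceiling for you.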

7) Track maintenance signals

PostgreSQL

Watch for:

  • autovacuum activity
  • long-running transactions
  • dead tuples
  • table bloat
  • checkpoints
  • vacuum lag

Commands:

bash
psql "$DATABASE_URL" -c "SELECT relname, n_live_tup, n_dead_tup, seq_scan, idx_scan FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;"
psql "$DATABASE_URL" -c "SELECT pid, now() - xact_start AS txn_age, state, query FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY txn_age DESC LIMIT 20;"
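
Once a long transaction turns out to be blocking others, pg_blocking_pids() (PostgreSQL 9.6+) confirms the relationship, and pg_cancel_backend() / pg_terminate_backend() end it. Cancel first; terminate only if the cancel is ignored. The pid below is a placeholder.

```sql
-- Who is blocked, and by whom
SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

-- Ask the blocking statement to stop (gentler)
SELECT pg_cancel_backend(12345);      -- 12345 is a placeholder pid

-- Kill the session if cancel does not work
SELECT pg_terminate_backend(12345);
```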

MySQL

Watch for:

  • InnoDB buffer pool pressure
  • row lock waits
  • transaction backlog
  • temp table spill
  • deadlocks

Commands:

bash
mysql -e "SHOW ENGINE INNODB STATUS\G"
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool%';"

8) Set alerts

Use sustained thresholds, not single spikes.

Recommended starter alerts:

  • p95 query latency above threshold for 5–10 minutes
  • active connections above 80% of max
  • lock waits spike above baseline
  • deadlocks > 0 in sustained windows
  • replication lag above app tolerance
  • free disk below safe threshold
  • backup failure
  • restore test failure

Example alert thresholds:

text
p95 query latency > 300ms for 10m
active_connections / max_connections > 0.80 for 5m
replication_lag > 30s for 5m
free_disk_percent < 15% for 10m
deadlocks > 0 for 10m
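
The same thresholds can be written as Prometheus alerting rules. The connection and deadlock expressions assume postgres_exporter metric names, and the replication lag metric name is exporter-specific; all are placeholders to adapt.

```yaml
groups:
  - name: database-alerts
    rules:
      - alert: DBConnectionsHigh
        expr: sum(pg_stat_activity_count) / pg_settings_max_connections > 0.80
        for: 5m
        labels: {severity: page}
      - alert: DBReplicationLagHigh
        expr: pg_replication_lag > 30   # exporter-specific metric, seconds
        for: 5m
        labels: {severity: page}
      - alert: DBDeadlocks
        expr: increase(pg_stat_database_deadlocks[10m]) > 0
        labels: {severity: ticket}
```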

9) Tie alerts to a runbook

Every alert should map to a first action.

Examples:

  • p95 latency high → inspect top queries and recent deploys
  • connections high → inspect pool saturation and long-lived sessions
  • lock waits high → identify blocked and blocking sessions
  • replication lag high → inspect replica health and write volume
  • disk low → inspect table growth, temp file usage, retention

This should connect directly with your incident workflow in Debugging Production Issues.

10) Review weekly

Database degradation is often gradual.

Weekly review:

  • top queries by total time
  • top tables by size growth
  • scan patterns
  • dead tuples / bloat indicators
  • connection trends
  • recent migrations
  • background job impact

Also review schema changes using Database Migration Strategy.
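
One low-effort way to make the weekly review happen is to snapshot the top queries on a schedule. This crontab sketch assumes DATABASE_URL is defined in the crontab environment and that a writable /var/log/db-review directory exists; both are placeholders.

```crontab
# m h dom mon dow  command  (every Monday 09:00)
0 9 * * 1  psql "$DATABASE_URL" -c "\copy (SELECT query, calls, total_exec_time, mean_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 50) to '/var/log/db-review/top-queries.csv' csv header"
```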

Common causes

  • Missing or incorrect indexes on filtered, joined, or sorted columns
  • N+1 queries from ORM usage
  • Connection pool misconfiguration
  • Long-running transactions blocking vacuum or holding locks
  • Lock contention from bulk updates or hot rows
  • Full table scans from stale stats or poor query patterns
  • Database instance too small for current traffic
  • Disk saturation or low IOPS
  • Storage nearing capacity
  • Replication lag
  • PostgreSQL autovacuum not keeping up
  • Expensive background jobs competing with user traffic
  • Unbounded growth in audit, event, or log tables
  • Background workers sharing the same database without pool limits

Debugging tips

Start by identifying which class of problem you have:

  • query latency
  • connection saturation
  • lock contention
  • replication lag
  • storage / IO bottleneck

Do not scale blindly before isolating the bottleneck.

PostgreSQL commands

bash
# Session states at a glance
psql "$DATABASE_URL" -c "SELECT now(), state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY state;"

# Longest-running non-idle sessions
psql "$DATABASE_URL" -c "SELECT pid, usename, application_name, client_addr, state, wait_event_type, wait_event, now() - query_start AS runtime, query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC;"

# Top queries by total time
psql "$DATABASE_URL" -c "SELECT query, calls, total_exec_time, mean_exec_time, rows FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20;"

# Blocked and blocking sessions
psql "$DATABASE_URL" -c "SELECT blocked.pid AS blocked_pid, blocker.pid AS blocker_pid, blocked.query AS blocked_query, blocker.query AS blocker_query FROM pg_stat_activity blocked JOIN pg_locks bl ON blocked.pid = bl.pid AND NOT bl.granted JOIN pg_locks kl ON bl.locktype = kl.locktype AND bl.database IS NOT DISTINCT FROM kl.database AND bl.relation IS NOT DISTINCT FROM kl.relation AND bl.page IS NOT DISTINCT FROM kl.page AND bl.tuple IS NOT DISTINCT FROM kl.tuple AND bl.virtualxid IS NOT DISTINCT FROM kl.virtualxid AND bl.transactionid IS NOT DISTINCT FROM kl.transactionid AND bl.classid IS NOT DISTINCT FROM kl.classid AND bl.objid IS NOT DISTINCT FROM kl.objid AND bl.objsubid IS NOT DISTINCT FROM kl.objsubid AND kl.granted JOIN pg_stat_activity blocker ON kl.pid = blocker.pid;"

# Dead tuples and scan patterns
psql "$DATABASE_URL" -c "SELECT relname, n_live_tup, n_dead_tup, seq_scan, idx_scan FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;"

# Plan for a suspect query
psql "$DATABASE_URL" -c "EXPLAIN (ANALYZE, BUFFERS) SELECT ...;"

MySQL commands

bash
# Current sessions, including long-running statements
mysql -e "SHOW FULL PROCESSLIST;"

# Lock waits, deadlocks, buffer pool detail
mysql -e "SHOW ENGINE INNODB STATUS\G"

# Connection and slow query counters
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';"

# Plan for a suspect query
mysql -e "EXPLAIN ANALYZE SELECT ...;"

Host-level checks

bash
df -h            # disk usage per filesystem
iostat -xz 1 5   # per-device utilization, await, and IOPS
vmstat 1 5       # run queue, swapping, and CPU wait

Practical debugging rules

  • Compare current behavior against the last known good deploy, migration, or traffic increase
  • Look for a few expensive queries dominating total DB time
  • Identify blocker sessions before killing random connections
  • Verify indexes match actual WHERE, JOIN, ORDER BY, and LIMIT patterns
  • Check ORM-generated SQL, not only handwritten SQL
  • If CPU is low but latency is high, suspect locks, IO waits, or pool waiting
  • Separate write-path and read-path metrics if using replicas
  • During incidents, capture snapshots immediately because transient issues disappear fast
  • If the app is also under pressure, compare with High CPU / Memory Usage

Checklist

  • Slow query logging enabled
  • Query statistics enabled and reviewed regularly
  • Dashboard includes latency, connections, locks, storage, and replication
  • Alerts configured for p95 latency, connection usage, disk, deadlocks, and replication lag
  • Connection pool metrics visible
  • Backup success tracked
  • Restore tests tracked
  • Recent migrations reviewed for performance regressions
  • Runbook exists for slow queries, lock contention, and connection exhaustion
  • Retention policy defined for logs and metrics
  • Provider-specific metrics enabled for managed databases
  • Monitoring stack reviewed against Monitoring Checklist
  • Production readiness reviewed in SaaS Production Checklist

FAQ

What should I monitor first for a small SaaS database?

Start with:

  • query latency
  • slow query count
  • active connections
  • lock waits
  • CPU
  • disk space
  • backup status

These cover the most common production failures.

How do I know if the problem is the app or the database?

Compare app request latency with database query latency and connection wait time.

  • If request latency rises while DB latency and connection usage stay flat, the app may be the bottleneck.
  • If DB latency, locks, or connections spike, start with the database.

Should I use a read replica to fix slow queries?

Not first.

Replicas help with read scaling, but they do not fix:

  • bad query plans
  • missing indexes
  • write lock contention
  • connection pool issues

What is a safe slow query threshold?

Use a threshold that matches your SLA.

For many small SaaS apps, 200ms to 500ms is a practical starting point. Lower it for latency-sensitive APIs.

Do managed databases remove the need for monitoring?

No.

Managed services reduce infrastructure work, but these are still your responsibility:

  • query quality
  • schema growth
  • lock contention
  • pool usage
  • replication behavior
  • backup validation

Is pg_stat_statements safe in production?

Yes. It is commonly used in production PostgreSQL environments.

Enable it intentionally and monitor overhead if your environment is very constrained.

Final takeaway

Good database monitoring is not about collecting every metric.

It is about quickly answering:

  • are queries slow?
  • are connections exhausted?
  • are locks blocking progress?
  • is storage or IO the bottleneck?
  • did a recent deploy or migration cause the change?

For small SaaS teams, the highest-value setup is:

  • slow query visibility
  • connection monitoring
  • lock inspection
  • storage tracking
  • alert thresholds tied to a clear runbook

Start there, keep the dashboard small, and review it every week.