Database Performance Monitoring
The essential playbook for implementing database performance monitoring in your SaaS.
Monitor your database like a production dependency, not a black box.
For a small SaaS, the minimum useful setup is:
- query latency
- error rate
- active connections
- slow query logging
- lock and wait visibility
- disk usage
- backup health
This setup works for PostgreSQL and MySQL-backed apps and gives enough signal to catch regressions before they become user-facing incidents.
Quick Fix / Quick Setup
Start with native database stats and logs before adding a full observability stack.
If you only do three things:
- enable slow query logging
- collect connection metrics
- alert on latency and error spikes
PostgreSQL quick checks
# 1) Enable slow statement visibility
psql "$DATABASE_URL" -c "ALTER SYSTEM SET log_min_duration_statement = '500ms';"
psql "$DATABASE_URL" -c "ALTER SYSTEM SET log_lock_waits = on;"
psql "$DATABASE_URL" -c "ALTER SYSTEM SET deadlock_timeout = '1s';"
psql "$DATABASE_URL" -c "SELECT pg_reload_conf();"
# 2) Find slow queries right now (requires the pg_stat_statements extension; see step 2 below)
psql "$DATABASE_URL" -c "SELECT query, calls, total_exec_time, mean_exec_time, rows FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"
# 3) Check active connections and blockers
psql "$DATABASE_URL" -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
psql "$DATABASE_URL" -c "SELECT pid, usename, wait_event_type, wait_event, state, query FROM pg_stat_activity WHERE state <> 'idle';"
# 4) Basic size + table hotspots
psql "$DATABASE_URL" -c "SELECT pg_size_pretty(pg_database_size(current_database()));"
psql "$DATABASE_URL" -c "SELECT relname, seq_scan, idx_scan, n_live_tup FROM pg_stat_user_tables ORDER BY seq_scan DESC LIMIT 10;"
MySQL quick checks
mysql -e "SET GLOBAL slow_query_log = 'ON';"
mysql -e "SET GLOBAL long_query_time = 0.5;"
mysql -e "SHOW PROCESSLIST;"
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';"
What’s happening
Database issues usually show up first in the app layer:
- slow requests
- queue buildup
- timeouts
- intermittent 500s
The database failure mode is usually not just “the DB is slow.” It is one of these:
- missing indexes
- connection pool exhaustion
- lock contention
- long-running transactions
- inefficient queries
- disk saturation
- under-provisioned CPU, RAM, or IOPS
- replication lag
You need both:
- real-time metrics for current incidents
- historical trends for recurring regressions
For small SaaS products, focus on actionable signals that help you decide whether to:
- optimize a query
- add an index
- change the pool size
- scale the instance
- investigate a recent deploy or migration
Process Flow
Step-by-step implementation
1) Define the minimum metrics
Track these first:
- p50, p95, p99 query latency
- slow query count
- active and idle connections
- max connections usage percent
- transaction duration
- lock waits
- deadlocks
- replication lag
- CPU
- memory
- disk usage
- disk throughput / IOPS
- backup success/failure
If you use a managed database, also enable provider metrics immediately.
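As a concrete reference, the percentile targets above can be computed from raw latency samples with the nearest-rank method; a minimal Python sketch (the sample values are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of samples are <= it. Assumes a non-empty list."""
    s = sorted(samples)
    rank = math.ceil(p / 100 * len(s))
    return s[max(rank - 1, 0)]

latencies_ms = [12, 15, 18, 20, 25, 30, 45, 60, 120, 400]
print(percentile(latencies_ms, 50))  # 25
print(percentile(latencies_ms, 95))  # 400
print(percentile(latencies_ms, 99))  # 400
```

Note how a single 400ms outlier dominates p95 and p99 while leaving p50 untouched, which is exactly why you track all three.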
2) Turn on database-native observability
PostgreSQL
Enable pg_stat_statements and useful logging.
postgresql.conf:
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all
log_min_duration_statement = 500ms
log_lock_waits = on
deadlock_timeout = 1s
log_checkpoints = on
log_temp_files = 0
log_autovacuum_min_duration = 1000ms
Apply and create the extension:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
Useful checks:
psql "$DATABASE_URL" -c "SHOW log_min_duration_statement;"
psql "$DATABASE_URL" -c "SHOW log_lock_waits;"
psql "$DATABASE_URL" -c "SELECT * FROM pg_extension WHERE extname = 'pg_stat_statements';"
MySQL
Enable slow query logging and performance schema.
Example my.cnf:
[mysqld]
slow_query_log = ON
long_query_time = 0.5
log_queries_not_using_indexes = OFF
performance_schema = ON
innodb_print_all_deadlocks = ON
Validate:
mysql -e "SHOW VARIABLES LIKE 'slow_query_log';"
mysql -e "SHOW VARIABLES LIKE 'long_query_time';"
mysql -e "SHOW VARIABLES LIKE 'performance_schema';"
3) Export metrics
Use one of these paths:
- managed provider metrics: RDS, Cloud SQL, Supabase, Neon, PlanetScale
- Prometheus exporters: postgres_exporter, mysqld_exporter
- hosted monitoring: Datadog, Grafana Cloud, New Relic
Example Prometheus scrape config:
scrape_configs:
  - job_name: postgres
    static_configs:
      - targets: ["postgres-exporter:9187"]
  - job_name: mysql
    static_configs:
      - targets: ["mysqld-exporter:9104"]
If you use PgBouncer, export pool metrics too.
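A PgBouncer scrape job can sit alongside the jobs above; this sketch assumes the prometheus-community pgbouncer_exporter and its default port 9127 — verify both against the exporter you actually deploy:

```yaml
scrape_configs:
  - job_name: pgbouncer
    static_configs:
      - targets: ["pgbouncer-exporter:9127"]  # default pgbouncer_exporter port (assumption; check your exporter's docs)
```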
4) Build a minimal dashboard
First row:
- p95 query latency
- query throughput
- error rate
- active connections / max connections
Second row:
- slow queries per minute
- lock waits
- deadlocks
- replication lag
Third row:
- CPU
- memory
- disk usage
- disk throughput / IOPS
Fourth row:
- top queries by total time
- top queries by mean time
- table scans vs index scans
- storage growth
5) Add slow query analysis
Review top queries by:
- total execution time
- mean execution time
- call count
PostgreSQL:
psql "$DATABASE_URL" -c "SELECT query, calls, total_exec_time, mean_exec_time, rows FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20;"
MySQL:
SELECT *
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;
Run EXPLAIN or EXPLAIN ANALYZE on the worst queries.
PostgreSQL:
psql "$DATABASE_URL" -c "EXPLAIN (ANALYZE, BUFFERS) SELECT ...;"
MySQL:
mysql -e "EXPLAIN ANALYZE SELECT ...;"
6) Monitor connection pooling
Many incidents are pool problems, not raw database capacity problems.
Track:
- pool size
- waiting clients
- checkout wait time
- connection churn
- active vs idle sessions
If you use PgBouncer:
psql "postgres://USER:PASSWORD@PGBOUNCER_HOST:6432/pgbouncer" -c "SHOW POOLS;"
psql "postgres://USER:PASSWORD@PGBOUNCER_HOST:6432/pgbouncer" -c "SHOW STATS;"
Common rule:
- keep app-side concurrency aligned with pool size
- do not let web workers and background workers open unlimited connections
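The alignment rule above can be sanity-checked with simple arithmetic; every number here is a hypothetical placeholder for your own deployment:

```python
# Hypothetical deployment numbers; substitute your own.
web_processes = 4          # e.g. application server worker processes
web_pool_per_process = 5   # per-process connection pool size
background_workers = 2     # e.g. job queue workers
bg_pool_per_worker = 3

# Worst case: every pool fully checked out at once.
app_side_max = (web_processes * web_pool_per_process
                + background_workers * bg_pool_per_worker)

max_connections = 100      # PostgreSQL default
reserved = 10              # headroom for superuser/maintenance sessions (assumption)

print(app_side_max)  # 26
print(app_side_max <= max_connections - reserved)  # True: pools fit with headroom
```

If the check fails, shrink pools or add PgBouncer rather than raising max_connections first.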
7) Track maintenance signals
PostgreSQL
Watch for:
- autovacuum activity
- long-running transactions
- dead tuples
- table bloat
- checkpoints
- vacuum lag
Commands:
psql "$DATABASE_URL" -c "SELECT relname, n_live_tup, n_dead_tup, seq_scan, idx_scan FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;"
psql "$DATABASE_URL" -c "SELECT pid, now() - xact_start AS txn_age, state, query FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY txn_age DESC LIMIT 20;"
MySQL
Watch for:
- InnoDB buffer pool pressure
- row lock waits
- transaction backlog
- temp table spill
- deadlocks
Commands:
mysql -e "SHOW ENGINE INNODB STATUS\G"
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool%';"
8) Set alerts
Use sustained thresholds, not single spikes.
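A sustained-threshold check can be sketched as: fire only when every sample in the window breaches, so a single spike stays silent (window and threshold values are illustrative):

```python
def sustained_breach(samples, threshold, window):
    """True only if the most recent `window` samples are ALL above
    threshold, so one transient spike does not page anyone."""
    if len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])

# One 310ms spike in an otherwise healthy series: no alert.
print(sustained_breach([120, 310, 140, 150], threshold=300, window=3))  # False
# Three consecutive minutes of p95 above 300ms: alert.
print(sustained_breach([310, 320, 340], threshold=300, window=3))       # True
```

Most alerting systems (Prometheus `for:`, Datadog evaluation windows) implement this idea for you; the sketch just shows the semantics.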
Recommended starter alerts:
- p95 query latency above threshold for 5–10 minutes
- active connections above 80% of max
- lock waits spike above baseline
- deadlocks > 0 in sustained windows
- replication lag above app tolerance
- free disk below safe threshold
- backup failure
- restore test failure
Example alert thresholds:
p95 query latency > 300ms for 10m
active_connections / max_connections > 0.80 for 5m
replication_lag > 30s for 5m
free_disk_percent < 15% for 10m
deadlocks > 0 for 10m
9) Tie alerts to a runbook
Every alert should map to a first action.
Examples:
| Alert | First action |
|---|---|
| p95 latency high | inspect top queries and recent deploys |
| connections high | inspect pool saturation and long-lived sessions |
| lock waits high | identify blocked and blocking sessions |
| replication lag high | inspect replica health and write volume |
| disk low | inspect table growth, temp file usage, retention |
This should connect directly with your incident workflow in Debugging Production Issues.
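One lightweight way to keep the mapping enforceable is a lookup table in the alert handler; the alert names here are hypothetical:

```python
# Hypothetical alert names mapped to the first runbook action.
RUNBOOK = {
    "p95_latency_high": "inspect top queries and recent deploys",
    "connections_high": "inspect pool saturation and long-lived sessions",
    "lock_waits_high": "identify blocked and blocking sessions",
    "replication_lag_high": "inspect replica health and write volume",
    "disk_low": "inspect table growth, temp file usage, retention",
}

def first_action(alert_name):
    # Surface unmapped alerts loudly so every new alert gets a runbook entry.
    return RUNBOOK.get(alert_name, "UNMAPPED ALERT: add a runbook entry")

print(first_action("lock_waits_high"))  # identify blocked and blocking sessions
```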
10) Review weekly
Database degradation is often gradual.
Weekly review:
- top queries by total time
- top tables by size growth
- scan patterns
- dead tuples / bloat indicators
- connection trends
- recent migrations
- background job impact
Also review schema changes using Database Migration Strategy.
Common causes
- Missing or incorrect indexes on filtered, joined, or sorted columns
- N+1 queries from ORM usage
- Connection pool misconfiguration
- Long-running transactions blocking vacuum or holding locks
- Lock contention from bulk updates or hot rows
- Full table scans from stale stats or poor query patterns
- Database instance too small for current traffic
- Disk saturation or low IOPS
- Storage nearing capacity
- Replication lag
- PostgreSQL autovacuum not keeping up
- Expensive background jobs competing with user traffic
- Unbounded growth in audit, event, or log tables
- Background workers sharing the same database without pool limits
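The N+1 pattern from the list above is easy to demonstrate with an in-memory SQLite database; this sketch counts round trips for the per-row loop versus a single JOIN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# N+1 pattern: one query for the users, then one query PER user for orders.
queries_run = 0
users = conn.execute("SELECT id, name FROM users").fetchall()
queries_run += 1
for user_id, _name in users:
    conn.execute("SELECT total FROM orders WHERE user_id = ?", (user_id,)).fetchall()
    queries_run += 1
print(queries_run)  # 3 = 1 + number of users; grows linearly with rows

# Batched alternative: one JOIN fetches the same data in a single round trip.
rows = conn.execute("""
    SELECT u.id, u.name, o.total
    FROM users u JOIN orders o ON o.user_id = u.id
""").fetchall()
print(len(rows))  # 3 order rows, one query
```

With an ORM, the fix is usually eager loading (e.g. a join or prefetch option) rather than hand-written SQL.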
Debugging tips
Start by identifying which class of problem you have:
- query latency
- connection saturation
- lock contention
- replication lag
- storage / IO bottleneck
Do not scale blindly before isolating the bottleneck.
PostgreSQL commands
psql "$DATABASE_URL" -c "SELECT now(), state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY state;"
psql "$DATABASE_URL" -c "SELECT pid, usename, application_name, client_addr, state, wait_event_type, wait_event, now() - query_start AS runtime, query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC;"
psql "$DATABASE_URL" -c "SELECT query, calls, total_exec_time, mean_exec_time, rows FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20;"
psql "$DATABASE_URL" -c "SELECT blocked.pid AS blocked_pid, blocker.pid AS blocker_pid, blocked.query AS blocked_query, blocker.query AS blocker_query FROM pg_stat_activity blocked JOIN pg_locks bl ON blocked.pid = bl.pid AND NOT bl.granted JOIN pg_locks kl ON bl.locktype = kl.locktype AND bl.database IS NOT DISTINCT FROM kl.database AND bl.relation IS NOT DISTINCT FROM kl.relation AND bl.page IS NOT DISTINCT FROM kl.page AND bl.tuple IS NOT DISTINCT FROM kl.tuple AND bl.virtualxid IS NOT DISTINCT FROM kl.virtualxid AND bl.transactionid IS NOT DISTINCT FROM kl.transactionid AND bl.classid IS NOT DISTINCT FROM kl.classid AND bl.objid IS NOT DISTINCT FROM kl.objid AND bl.objsubid IS NOT DISTINCT FROM kl.objsubid AND kl.granted JOIN pg_stat_activity blocker ON kl.pid = blocker.pid;"
psql "$DATABASE_URL" -c "SELECT relname, n_live_tup, n_dead_tup, seq_scan, idx_scan FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;"
psql "$DATABASE_URL" -c "EXPLAIN (ANALYZE, BUFFERS) SELECT ...;"
MySQL commands
mysql -e "SHOW FULL PROCESSLIST;"
mysql -e "SHOW ENGINE INNODB STATUS\G"
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';"
mysql -e "EXPLAIN ANALYZE SELECT ...;"
Host-level checks
df -h
iostat -xz 1 5
vmstat 1 5
Practical debugging rules
- Compare current behavior against the last known good deploy, migration, or traffic increase
- Look for a few expensive queries dominating total DB time
- Identify blocker sessions before killing random connections
- Verify indexes match actual WHERE, JOIN, ORDER BY, and LIMIT patterns
- Check ORM-generated SQL, not only handwritten SQL
- If CPU is low but latency is high, suspect locks, IO waits, or pool waiting
- Separate write-path and read-path metrics if using replicas
- During incidents, capture snapshots immediately because transient issues disappear fast
- If the app is also under pressure, compare with High CPU / Memory Usage
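A snapshot helper along the lines of the last rule might look like this; it assumes psql is on PATH and DATABASE_URL is set, and the query set is a starting point, not exhaustive:

```python
import datetime
import os
import subprocess

# Queries worth capturing the moment a PostgreSQL incident starts,
# because this state disappears as soon as the incident clears.
SNAPSHOT_QUERIES = {
    "activity": "SELECT * FROM pg_stat_activity;",
    "statements": ("SELECT query, calls, total_exec_time, mean_exec_time "
                   "FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 50;"),
    "locks": "SELECT * FROM pg_locks;",
}

def snapshot_paths(out_dir, now=None):
    """Map each snapshot name to a UTC-timestamped output file path."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return {name: os.path.join(out_dir, f"{stamp}_{name}.txt")
            for name in SNAPSHOT_QUERIES}

def capture(out_dir):
    """Run each query via psql and save the output to a timestamped file."""
    for name, path in snapshot_paths(out_dir).items():
        with open(path, "w") as fh:
            subprocess.run(
                ["psql", os.environ["DATABASE_URL"], "-c", SNAPSHOT_QUERIES[name]],
                stdout=fh, check=False)
```

Wiring `capture()` into the start of your incident checklist means you still have evidence after a transient lock storm resolves itself.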
Checklist
- ✓ Slow query logging enabled
- ✓ Query statistics enabled and reviewed regularly
- ✓ Dashboard includes latency, connections, locks, storage, and replication
- ✓ Alerts configured for p95 latency, connection usage, disk, deadlocks, and replication lag
- ✓ Connection pool metrics visible
- ✓ Backup success tracked
- ✓ Restore tests tracked
- ✓ Recent migrations reviewed for performance regressions
- ✓ Runbook exists for slow queries, lock contention, and connection exhaustion
- ✓ Retention policy defined for logs and metrics
- ✓ Provider-specific metrics enabled for managed databases
- ✓ Monitoring stack reviewed against Monitoring Checklist
- ✓ Production readiness reviewed in SaaS Production Checklist
Related guides
- Debugging Production Issues
- Database Migration Strategy
- High CPU / Memory Usage
- Monitoring Checklist
- SaaS Production Checklist
FAQ
What should I monitor first for a small SaaS database?
Start with:
- query latency
- slow query count
- active connections
- lock waits
- CPU
- disk space
- backup status
These cover the most common production failures.
How do I know if the problem is the app or the database?
Compare app request latency with database query latency and connection wait time.
- If request latency rises while DB latency and connection usage stay flat, the app may be the bottleneck.
- If DB latency, locks, or connections spike, start with the database.
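The comparison above can be encoded as a rough triage helper (a sketch, not a substitute for looking at real dashboards):

```python
def likely_bottleneck(app_latency_up, db_latency_up, conn_usage_up):
    """Rough first-pass triage following the rule above: app-only symptoms
    point at the app; any database-side signal means start with the database."""
    if app_latency_up and not (db_latency_up or conn_usage_up):
        return "app"
    if db_latency_up or conn_usage_up:
        return "database"
    return "unclear"

print(likely_bottleneck(True, False, False))  # app
print(likely_bottleneck(True, True, False))   # database
```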
Should I use a read replica to fix slow queries?
Not first.
Replicas help with read scaling, but they do not fix:
- bad query plans
- missing indexes
- write lock contention
- connection pool issues
What is a safe slow query threshold?
Use a threshold that matches your SLA.
For many small SaaS apps, 200ms to 500ms is a practical starting point. Lower it for latency-sensitive APIs.
Do managed databases remove the need for monitoring?
No.
Managed services reduce infrastructure work, but these are still your responsibility:
- query quality
- schema growth
- lock contention
- pool usage
- replication behavior
- backup validation
Is pg_stat_statements safe in production?
Yes. It is commonly used in production PostgreSQL environments.
Enable it intentionally and monitor overhead if your environment is very constrained.
Final takeaway
Good database monitoring is not about collecting every metric.
It is about quickly answering:
- are queries slow?
- are connections exhausted?
- are locks blocking progress?
- is storage or IO the bottleneck?
- did a recent deploy or migration cause the change?
For small SaaS teams, the highest-value setup is:
- slow query visibility
- connection monitoring
- lock inspection
- storage tracking
- alert thresholds tied to a clear runbook
Start there, keep the dashboard small, and review it every week.