Scaling Basics (Vertical & Horizontal)
The essential playbook for scaling a small SaaS: grow one server first, then run multiple app instances behind a load balancer.
A small SaaS usually scales in two phases: first by giving one server more CPU and RAM, then by running multiple app instances behind a load balancer. This page gives a practical path from single-server MVP deployment to a more resilient multi-instance setup without overengineering early.
Quick Fix / Quick Setup
# 1) Check current resource pressure
uptime
free -h
df -h
nproc
# 2) Inspect top CPU and memory consumers
ps aux --sort=-%mem | head
ps aux --sort=-%cpu | head
# 3) If using Gunicorn, increase workers based on CPU
# common starting point: workers = (2 * CPU) + 1
gunicorn app.main:app -w 5 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:8000
# 4) Put Nginx in front of multiple local app instances
upstream app_servers {
server 127.0.0.1:8001;
server 127.0.0.1:8002;
server 127.0.0.1:8003;
}
server {
listen 80;
server_name example.com;
location / {
proxy_pass http://app_servers;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
# 5) Verify the upstream and reload Nginx
sudo nginx -t && sudo systemctl reload nginx
# 6) If scaling across servers, move shared state out of app nodes:
# - sessions -> Redis / database
# - uploads -> S3-compatible storage
# - background jobs -> Redis/RabbitMQ-backed workers
# - cache -> Redis
For most MVPs: scale vertically first, optimize obvious bottlenecks, then make the app stateless before adding more servers. Horizontal scaling breaks quickly if sessions, uploads, or background jobs depend on local disk or in-memory state.
What’s happening
- Vertical scaling means increasing resources on one machine: more CPU, RAM, disk IOPS, or a better instance class.
- Horizontal scaling means running multiple app instances and distributing traffic across them with Nginx, a cloud load balancer, or a container orchestrator.
- Vertical scaling is simpler and usually the first move for early SaaS products.
- Horizontal scaling improves capacity and availability, but only works cleanly if the app is mostly stateless.
- Your real bottleneck may not be the web app. It may be the database, background jobs, disk, external APIs, or missing caching.
- Scaling app servers without fixing database contention, slow queries, or blocking tasks often gives little improvement.
[Figure: decision flowchart for choosing vertical vs horizontal scaling based on CPU, RAM, response time, and single-point-of-failure requirements.]
Step-by-step implementation
1. Measure the bottleneck first
Check whether the bottleneck is CPU, memory, disk, database, or slow requests.
uptime
top
htop
free -h
vmstat 1 5
df -h
iostat -xz 1 5
nproc
ps aux --sort=-%cpu | head -20
ps aux --sort=-%mem | head -20
ss -s
Check app and proxy status:
sudo systemctl status nginx
sudo systemctl status gunicorn
sudo journalctl -u gunicorn -n 100 --no-pager
sudo journalctl -u nginx -n 100 --no-pager
curl -I http://127.0.0.1:8000/health
Check database pressure:
psql "$DATABASE_URL" -c "select now();"
psql "$DATABASE_URL" -c "select * from pg_stat_activity;"
psql "$DATABASE_URL" -c "select query, calls, total_exec_time, mean_exec_time from pg_stat_statements order by total_exec_time desc limit 10;"
If you do not already have monitoring, set it up before scaling. See Metrics and Performance Monitoring.
2. Scale vertically first if one machine is close
If CPU or RAM is consistently saturated, one larger instance is usually the simplest next step.
Common Gunicorn starting point:
gunicorn app.main:app \
-w 5 \
-k uvicorn.workers.UvicornWorker \
-b 0.0.0.0:8000 \
--timeout 60 \
--keep-alive 5 \
--max-requests 1000 \
--max-requests-jitter 100
Worker rule of thumb:
workers = (2 * CPU cores) + 1
This is a starting point, not a final answer. Validate against memory usage and latency.
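That rule of thumb can also live in a Gunicorn config file, so the worker count follows whatever machine you resize to. A minimal sketch mirroring the flags above (recent Gunicorn versions pick up `./gunicorn.conf.py` automatically); the values are starting points, not tuned numbers:

```python
# gunicorn.conf.py -- settings here mirror the CLI flags above
import multiprocessing

# (2 * CPU cores) + 1: a starting point, validate against RAM and latency
workers = multiprocessing.cpu_count() * 2 + 1
worker_class = "uvicorn.workers.UvicornWorker"
bind = "0.0.0.0:8000"
timeout = 60
keepalive = 5
max_requests = 1000        # recycle workers to cap slow memory leaks
max_requests_jitter = 100  # stagger recycling so workers don't restart together
```

With this file in the working directory, `gunicorn app.main:app` is enough, and resizing the instance changes the worker count on the next restart instead of requiring a flag edit.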
Example systemd service:
# /etc/systemd/system/gunicorn.service
[Unit]
Description=Gunicorn
After=network.target
[Service]
User=www-data
Group=www-data
WorkingDirectory=/srv/app
Environment="PATH=/srv/app/.venv/bin"
ExecStart=/srv/app/.venv/bin/gunicorn app.main:app \
-w 5 \
-k uvicorn.workers.UvicornWorker \
-b 127.0.0.1:8000 \
--timeout 60
Restart=always
[Install]
WantedBy=multi-user.target
Reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart gunicorn
sudo systemctl enable gunicorn
If the same host is still running the app, database, worker, and scheduler, separate them before adding more complexity. See Deploy SaaS with Nginx + Gunicorn.
3. Make the app stateless
Before horizontal scaling, remove dependencies on local instance state.
Move these out of app memory or local disk:
- sessions
- cache
- rate-limit counters
- uploads
- generated files
- background jobs
- scheduled jobs
Typical target architecture:
- sessions/cache -> Redis
- uploads/assets -> S3-compatible object storage
- jobs -> Redis or RabbitMQ queue + worker process
- scheduler -> dedicated process, not every web node
If requests depend on a specific node, horizontal scaling will fail under deploys and restarts.
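As a concrete sketch of shared sessions, the helpers below keep session state in Redis instead of process memory. The key scheme, TTL, and function names are illustrative assumptions; `client` stands for any Redis-style client, e.g. one created with `redis.Redis.from_url(os.environ["REDIS_URL"])` from the redis-py package:

```python
import json

SESSION_TTL = 60 * 60 * 24  # expire sessions after 24 hours (illustrative)

def session_key(session_id: str) -> str:
    return f"session:{session_id}"

def save_session(client, session_id: str, data: dict) -> None:
    # Written to Redis, so any instance behind the load balancer can read it.
    client.setex(session_key(session_id), SESSION_TTL, json.dumps(data))

def load_session(client, session_id: str):
    raw = client.get(session_key(session_id))
    return json.loads(raw) if raw is not None else None
```

Once sessions round-trip through shared storage like this, restarting or removing any single app node no longer logs users out.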
4. Externalize shared components
Redis for sessions/cache
Example environment:
export REDIS_URL=redis://redis.internal:6379/0
Validate Redis:
redis-cli ping
redis-cli info memory
Object storage for uploads
Use S3-compatible storage instead of local paths like /tmp/uploads or /srv/app/media.
export S3_BUCKET=my-saas-uploads
export S3_REGION=us-east-1
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
Background jobs
Move long-running tasks out of the request path:
- email sending
- webhook retries
- file processing
- report generation
- image/video transformations
- imports/exports
Do not let web workers block on these.
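The enqueue/consume split can be sketched with a Redis list as the queue. The queue key and job format here are assumptions for illustration, and `client` is any Redis-style client; in practice a library such as RQ or Celery adds retries, scheduling, and monitoring on top of this pattern:

```python
import json

QUEUE_KEY = "jobs"  # illustrative key name

def enqueue(client, task: str, payload: dict) -> None:
    # Called from the request handler: push the job and return immediately.
    client.lpush(QUEUE_KEY, json.dumps({"task": task, "payload": payload}))

def work_one(client, handlers: dict) -> bool:
    # Called in a loop by the worker process, never inside a web request.
    raw = client.rpop(QUEUE_KEY)
    if raw is None:
        return False
    job = json.loads(raw)
    handlers[job["task"]](job["payload"])
    return True
```

The web process only ever calls `enqueue`; a separate worker process (its own systemd unit) runs `work_one` in a loop, so slow tasks never hold a web worker.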
5. Separate responsibilities
A practical small SaaS production split:
- web server: Nginx
- app server: Gunicorn/Uvicorn
- worker: queue consumer
- scheduler: one process only
- database: managed Postgres if possible
- cache/session store: Redis
- uploads: object storage
This reduces resource contention and makes bottlenecks visible.
6. Add multiple app instances
Start multiple app processes on separate ports.
gunicorn app.main:app -w 3 -k uvicorn.workers.UvicornWorker -b 127.0.0.1:8001
gunicorn app.main:app -w 3 -k uvicorn.workers.UvicornWorker -b 127.0.0.1:8002
gunicorn app.main:app -w 3 -k uvicorn.workers.UvicornWorker -b 127.0.0.1:8003
If using systemd, define separate units or templates for each instance.
Example template:
# /etc/systemd/system/gunicorn@.service
[Unit]
Description=Gunicorn instance %i
After=network.target
[Service]
User=www-data
Group=www-data
WorkingDirectory=/srv/app
Environment="PATH=/srv/app/.venv/bin"
ExecStart=/srv/app/.venv/bin/gunicorn app.main:app \
-w 3 \
-k uvicorn.workers.UvicornWorker \
-b 127.0.0.1:%i
Restart=always
[Install]
WantedBy=multi-user.target
Start instances:
sudo systemctl daemon-reload
sudo systemctl enable --now gunicorn@8001
sudo systemctl enable --now gunicorn@8002
sudo systemctl enable --now gunicorn@8003
7. Add load balancing
Example Nginx upstream:
upstream app_servers {
least_conn;
server 127.0.0.1:8001 max_fails=3 fail_timeout=10s;
server 127.0.0.1:8002 max_fails=3 fail_timeout=10s;
server 127.0.0.1:8003 max_fails=3 fail_timeout=10s;
}
server {
listen 80;
server_name example.com;
location /health {
proxy_pass http://app_servers/health;
proxy_set_header Host $host;
}
location / {
proxy_pass http://app_servers;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 60s;
}
}
Validate and reload:
sudo nginx -t
sudo systemctl reload nginx
Check listening ports:
ss -ltnp
For multi-server setups, use a cloud load balancer or Nginx on a dedicated edge node.
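Each instance also needs something answering the /health location that the Nginx config above proxies to. A stdlib-only sketch of the idea; in a real app this would be a route in your web framework, and it should stay cheap (no database or external calls) so it reflects process liveness only:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        # Keep frequent health checks out of the access-log noise.
        pass

def serve(port: int = 8000) -> None:
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

With max_fails/fail_timeout set on the upstream, Nginx stops routing to an instance whose health endpoint stops responding, which is what makes the failure test in step 10 pass.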
8. Avoid sticky sessions if possible
Preferred setup: any request can hit any instance.
If sessions are shared through Redis or DB-backed sessions, no stickiness is needed.
Only use sticky sessions if:
- you cannot refactor session handling yet
- you understand failover limitations
- you accept uneven traffic distribution
If sticky sessions are required temporarily, treat it as migration debt, not the final state.
9. Plan database scaling separately
App scaling often exposes database limits.
Check for:
- missing indexes
- N+1 queries
- lock contention
- connection exhaustion
- expensive sorts/joins
- long-running transactions
Useful Postgres checks:
psql "$DATABASE_URL" -c "select * from pg_stat_activity;"
psql "$DATABASE_URL" -c "select query, calls, total_exec_time, mean_exec_time from pg_stat_statements order by total_exec_time desc limit 10;"
Potential database actions:
- add indexes
- reduce query count
- introduce connection pooling
- move reads to replicas
- upgrade managed DB tier
- cache repeated reads in Redis
Scaling web nodes without addressing DB limits usually moves the failure point, not the capacity ceiling.
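Connection exhaustion in particular scales with instances x workers, since each worker process usually holds its own pool. A quick budgeting sketch; the numbers are illustrative, so substitute your real Postgres max_connections:

```python
# Connection budget: total DB connections grow with instances x workers.
def pool_size_per_process(max_connections: int, reserved: int,
                          instances: int, workers_per_instance: int) -> int:
    # Leave headroom for migrations, cron jobs, psql sessions, etc.
    usable = max_connections - reserved
    processes = instances * workers_per_instance
    return max(1, usable // processes)

# e.g. Postgres max_connections=100, 10 reserved, 3 instances x 5 workers
print(pool_size_per_process(100, 10, 3, 5))  # -> 6
```

Feed the result into your driver or ORM pool settings (for example SQLAlchemy's `pool_size`), or put PgBouncer in front of Postgres so the per-process math stops mattering.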
10. Test failure scenarios
Stop one instance and verify traffic still works.
sudo systemctl stop gunicorn@8002
curl -I http://127.0.0.1/
ab -n 1000 -c 50 http://127.0.0.1/
wrk -t4 -c100 -d30s http://127.0.0.1/
Verify:
- requests still complete
- login/session flow still works
- uploads still work
- jobs still process
- error rate does not spike badly
This is the minimum validation for horizontal readiness.
11. Automate deployments
Manual per-node deploys create drift.
Minimum requirement:
- same app version on every node
- same env vars/secrets
- migrations applied safely
- controlled restart order
- rollback path documented
If using containers, standardize image build and rollout. See Docker Production Setup for SaaS.
12. Recheck after rollout
After scaling, verify these improved:
- lower p95/p99 latency
- fewer 502/504/5xx errors
- stable memory usage
- no swap growth
- stable DB connections
- lower queue depth
- healthy instance replacement behavior
Common causes
- CPU saturation from too few app workers or expensive request handling
- Memory pressure causing swapping or OOM kills
- Database bottlenecks: slow queries, missing indexes, lock contention, connection exhaustion
- Background jobs running inside web requests instead of a queue worker
- Sessions stored in memory or local process state, breaking multi-instance requests
- Uploads or generated files stored on local disk, unavailable on other instances
- Nginx/load balancer misconfiguration sending traffic to unhealthy nodes
- Uneven traffic distribution or no health checks
- Too many Gunicorn workers for available RAM
- No caching for repeated expensive reads
- Blocking external API calls inside request handlers
- Single server running app, worker, scheduler, and database competing for the same resources
Debugging tips
Use these commands during scale planning and incidents.
Host and process pressure
uptime
top
htop
free -h
vmstat 1 5
df -h
iostat -xz 1 5
nproc
ps aux --sort=-%cpu | head -20
ps aux --sort=-%mem | head -20
Network and listeners
ss -ltnp
ss -s
curl -I http://127.0.0.1:8000/health
Load testing
ab -n 1000 -c 50 http://127.0.0.1/
wrk -t4 -c100 -d30s http://127.0.0.1/
Nginx and Gunicorn
sudo nginx -t
sudo systemctl status nginx
sudo systemctl status gunicorn
sudo journalctl -u gunicorn -n 100 --no-pager
sudo journalctl -u nginx -n 100 --no-pager
gunicorn --check-config app.main:app
Redis
redis-cli ping
redis-cli info memory
Postgres
psql "$DATABASE_URL" -c "select now();"
psql "$DATABASE_URL" -c "select * from pg_stat_activity;"
psql "$DATABASE_URL" -c "select query, calls, total_exec_time, mean_exec_time from pg_stat_statements order by total_exec_time desc limit 10;"
If you are troubleshooting active resource pressure, use High CPU / Memory Usage.
Checklist
- ✓ App does not depend on local filesystem for user uploads or shared generated files.
- ✓ Sessions are shared across instances or sticky sessions are intentionally configured.
- ✓ Background jobs run outside the request-response cycle.
- ✓ Database connection limits are known and app pool sizes are tuned.
- ✓ Nginx or load balancer has health checks and correct proxy headers.
- ✓ Monitoring covers CPU, RAM, latency, 5xx rate, DB load, queue depth, and disk usage.
- ✓ Deploy process updates all instances consistently.
- ✓ Rollback path is documented and tested.
- ✓ One instance can fail without total outage if horizontally scaled.
For broader production readiness, review SaaS Production Checklist.
Related guides
- Deploy SaaS with Nginx + Gunicorn
- Docker Production Setup for SaaS
- Metrics and Performance Monitoring
- High CPU / Memory Usage
- SaaS Production Checklist
FAQ
What is the simplest safe scaling plan for a small SaaS?
Increase server size, tune app workers, move background work out of requests, then externalize sessions and files before adding more app instances.
How many Gunicorn workers should I start with?
A common starting point is:
(2 x CPU cores) + 1
Then adjust based on memory use, latency, and workload type.
Why do logins break after adding a second app server?
Usually because sessions are stored in local memory, local files, or signed cookies are misconfigured across instances. Use shared session storage and consistent secrets.
Why didn’t adding more app servers improve performance?
The real bottleneck is often the database, slow external APIs, disk I/O, or background work still happening inside web requests.
Do I need object storage for horizontal scaling?
If users upload files or your app generates assets that must be available on every instance, yes. Local disk does not scale cleanly across nodes.
Final takeaway
- Scale in this order: measure, optimize, scale vertically, remove shared local state, then scale horizontally.
- Horizontal scaling is not just more servers. It requires stateless app design, shared storage/services, and deployment discipline.
- If you cannot restart any app node without breaking sessions, uploads, or jobs, the app is not ready for horizontal scaling.