Scaling Basics (Vertical & Horizontal)

The essential playbook for scaling a SaaS vertically and horizontally without overbuilding.

A small SaaS usually scales in two phases: first by giving one server more CPU and RAM, then by running multiple app instances behind a load balancer. This page gives a practical path from single-server MVP deployment to a more resilient multi-instance setup without overengineering early.

Quick Fix / Quick Setup

bash
# 1) Check current resource pressure
uptime
free -h
df -h
nproc

# 2) Inspect top CPU and memory consumers
ps aux --sort=-%mem | head
ps aux --sort=-%cpu | head

# 3) If using Gunicorn, increase workers based on CPU
# common starting point: workers = (2 * CPU cores) + 1
gunicorn app.main:app -w 5 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:8000

4) Put Nginx in front of multiple local app instances:

nginx
upstream app_servers {
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
    server 127.0.0.1:8003;
}

server {
    listen 80;
    server_name example.com;

    location / {
        proxy_pass http://app_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

5) Verify the configuration and reload Nginx:

bash
sudo nginx -t && sudo systemctl reload nginx

6) If scaling across servers, move shared state out of app nodes:

  • sessions -> Redis / database
  • uploads -> S3-compatible storage
  • background jobs -> Redis/RabbitMQ-backed workers
  • cache -> Redis

For most MVPs: scale vertically first, optimize obvious bottlenecks, then make the app stateless before adding more servers. Horizontal scaling breaks quickly if sessions, uploads, or background jobs depend on local disk or in-memory state.

What’s happening

  • Vertical scaling means increasing resources on one machine: more CPU, RAM, disk IOPS, or a better instance class.
  • Horizontal scaling means running multiple app instances and distributing traffic across them with Nginx, a cloud load balancer, or a container orchestrator.
  • Vertical scaling is simpler and usually the first move for early SaaS products.
  • Horizontal scaling improves capacity and availability, but only works cleanly if the app is mostly stateless.
  • Your real bottleneck may not be the web app. It may be the database, background jobs, disk, external APIs, or missing caching.
  • Scaling app servers without fixing database contention, slow queries, or blocking tasks often gives little improvement.
  • The decision between vertical and horizontal scaling should weigh CPU and RAM headroom, response-time targets, and whether a single point of failure is acceptable.

Step-by-step implementation

1. Measure the bottleneck first

Check whether the bottleneck is CPU, memory, disk, database, or slow requests.

bash
uptime
top
htop
free -h
vmstat 1 5
df -h
iostat -xz 1 5
nproc
ps aux --sort=-%cpu | head -20
ps aux --sort=-%mem | head -20
ss -s

Check app and proxy status:

bash
sudo systemctl status nginx
sudo systemctl status gunicorn
sudo journalctl -u gunicorn -n 100 --no-pager
sudo journalctl -u nginx -n 100 --no-pager
curl -I http://127.0.0.1:8000/health

Check database pressure:

bash
psql "$DATABASE_URL" -c "select now();"
psql "$DATABASE_URL" -c "select * from pg_stat_activity;"
psql "$DATABASE_URL" -c "select query, calls, total_exec_time, mean_exec_time from pg_stat_statements order by total_exec_time desc limit 10;"

If you do not already have monitoring, set it up before scaling. See Metrics and Performance Monitoring.

2. Scale vertically first if one machine is close

If CPU or RAM is consistently saturated, one larger instance is usually the simplest next step.

Common Gunicorn starting point:

bash
gunicorn app.main:app \
  -w 5 \
  -k uvicorn.workers.UvicornWorker \
  -b 0.0.0.0:8000 \
  --timeout 60 \
  --keep-alive 5 \
  --max-requests 1000 \
  --max-requests-jitter 100

Worker rule of thumb:

text
workers = (2 * CPU cores) + 1

This is a starting point, not a final answer. Validate against memory usage and latency.
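The rule of thumb can be expressed as a tiny helper so you can see the number your host would get. This is an illustrative sketch (the function name is made up, not a Gunicorn API); the result is only a starting point to validate against real memory and latency.

```python
import multiprocessing

# Sketch of the (2 * CPU cores) + 1 rule of thumb. Treat the result
# as a first guess, not a tuned value.
def gunicorn_workers(cpu_cores=None):
    cores = cpu_cores if cpu_cores is not None else multiprocessing.cpu_count()
    return 2 * cores + 1

print(gunicorn_workers(2))  # a 2-core instance -> 5 workers
```

On a 2-core instance this gives the `-w 5` used in the examples above.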

Example systemd service:

ini
# /etc/systemd/system/gunicorn.service
[Unit]
Description=Gunicorn
After=network.target

[Service]
User=www-data
Group=www-data
WorkingDirectory=/srv/app
Environment="PATH=/srv/app/.venv/bin"
ExecStart=/srv/app/.venv/bin/gunicorn app.main:app \
  -w 5 \
  -k uvicorn.workers.UvicornWorker \
  -b 127.0.0.1:8000 \
  --timeout 60
Restart=always

[Install]
WantedBy=multi-user.target

Reload and restart:

bash
sudo systemctl daemon-reload
sudo systemctl restart gunicorn
sudo systemctl enable gunicorn

If you are still running the app, database, worker, and scheduler on the same host, separate them before buying more complexity. See Deploy SaaS with Nginx + Gunicorn.

3. Make the app stateless

Before horizontal scaling, remove dependencies on local instance state.

Move these out of app memory or local disk:

  • sessions
  • cache
  • rate-limit counters
  • uploads
  • generated files
  • background jobs
  • scheduled jobs

Typical target architecture:

  • sessions/cache -> Redis
  • uploads/assets -> S3-compatible object storage
  • jobs -> Redis or RabbitMQ queue + worker process
  • scheduler -> dedicated process, not every web node

If requests depend on a specific node, horizontal scaling will fail under deploys and restarts.
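To make the session part concrete, here is a minimal sketch of shared session storage. The helper names are hypothetical, and a plain dict stands in for a Redis client with `get`/`set` so the example stays self-contained; the point is that any instance holding the same store can load a session created by another.

```python
import json
import secrets

# Hypothetical sketch: sessions live in a shared store keyed by a
# random token, not in per-process memory. `store` stands in for a
# Redis client; a dict keeps the example runnable on its own.

def create_session(store, user_id):
    session_id = secrets.token_urlsafe(32)
    store[f"session:{session_id}"] = json.dumps({"user_id": user_id})
    return session_id

def load_session(store, session_id):
    raw = store.get(f"session:{session_id}")
    return json.loads(raw) if raw else {}

store = {}
sid = create_session(store, 42)
print(load_session(store, sid))  # {'user_id': 42}
```

With a real Redis client the set call would also pass an expiry (for example `ex=3600`) so abandoned sessions clean themselves up.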

4. Externalize shared components

Redis for sessions/cache

Example environment:

bash
export REDIS_URL=redis://redis.internal:6379/0

Validate Redis:

bash
redis-cli ping
redis-cli info memory

Object storage for uploads

Use S3-compatible storage instead of local paths like /tmp/uploads or /srv/app/media.

bash
export S3_BUCKET=my-saas-uploads
export S3_REGION=us-east-1
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
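One practical detail when moving uploads to object storage is choosing a key scheme. A hedged sketch, assuming a made-up `upload_key` helper: prefixing keys by tenant and date groups files per customer and avoids name collisions across instances.

```python
from datetime import datetime, timezone
from uuid import uuid4

# Hypothetical key scheme for S3-compatible storage: tenant prefix,
# date path, and a random component so two instances can never
# write the same key for different files.
def upload_key(tenant_id, filename):
    today = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    return f"uploads/{tenant_id}/{today}/{uuid4().hex}-{filename}"

print(upload_key(7, "report.pdf"))
```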

Background jobs

Move long-running tasks out of the request path:

  • email sending
  • webhook retries
  • file processing
  • report generation
  • image/video transformations
  • imports/exports

Do not let web workers block on these.
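The shape of that split can be sketched in a few lines: the request handler only enqueues and returns immediately, while a separate worker does the slow part. This uses an in-process queue and thread purely for illustration; in production the queue would be Redis or RabbitMQ and the worker a separate process on a separate node.

```python
import queue
import threading

jobs = queue.Queue()
results = []

def handle_request(payload):
    jobs.put(payload)       # fast: just enqueue the work
    return "202 Accepted"   # respond before the work is done

def worker():
    while True:
        job = jobs.get()
        if job is None:     # sentinel: shut the worker down
            break
        # the slow part (email send, report build, ...) happens here
        results.append(f"sent email to {job['email']}")

t = threading.Thread(target=worker)
t.start()
status = handle_request({"email": "user@example.com"})
jobs.put(None)
t.join()
print(status, results)
```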

5. Separate responsibilities

A practical small SaaS production split:

  • web server: Nginx
  • app server: Gunicorn/Uvicorn
  • worker: queue consumer
  • scheduler: one process only
  • database: managed Postgres if possible
  • cache/session store: Redis
  • uploads: object storage

This reduces resource contention and makes bottlenecks visible.

6. Add multiple app instances

Start multiple app processes on separate ports.

bash
gunicorn app.main:app -w 3 -k uvicorn.workers.UvicornWorker -b 127.0.0.1:8001
gunicorn app.main:app -w 3 -k uvicorn.workers.UvicornWorker -b 127.0.0.1:8002
gunicorn app.main:app -w 3 -k uvicorn.workers.UvicornWorker -b 127.0.0.1:8003

If using systemd, define separate units or templates for each instance.

Example template:

ini
# /etc/systemd/system/gunicorn@.service
[Unit]
Description=Gunicorn instance %i
After=network.target

[Service]
User=www-data
Group=www-data
WorkingDirectory=/srv/app
Environment="PATH=/srv/app/.venv/bin"
ExecStart=/srv/app/.venv/bin/gunicorn app.main:app \
  -w 3 \
  -k uvicorn.workers.UvicornWorker \
  -b 127.0.0.1:%i
Restart=always

[Install]
WantedBy=multi-user.target

Start instances:

bash
sudo systemctl daemon-reload
sudo systemctl enable --now gunicorn@8001
sudo systemctl enable --now gunicorn@8002
sudo systemctl enable --now gunicorn@8003

7. Add load balancing

Example Nginx upstream:

nginx
upstream app_servers {
    least_conn;
    server 127.0.0.1:8001 max_fails=3 fail_timeout=10s;
    server 127.0.0.1:8002 max_fails=3 fail_timeout=10s;
    server 127.0.0.1:8003 max_fails=3 fail_timeout=10s;
}

server {
    listen 80;
    server_name example.com;

    location /health {
        proxy_pass http://app_servers/health;
        proxy_set_header Host $host;
    }

    location / {
        proxy_pass http://app_servers;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 60s;
    }
}

Validate and reload:

bash
sudo nginx -t
sudo systemctl reload nginx

Check listening ports:

bash
ss -ltnp

For multi-server setups, use a cloud load balancer or Nginx on a dedicated edge node.

8. Avoid sticky sessions if possible

Preferred setup: any request can hit any instance.

If sessions are stored in Redis or a database-backed session store, no stickiness is needed.

Only use sticky sessions if:

  • you cannot refactor session handling yet
  • you understand failover limitations
  • you accept uneven traffic distribution

If sticky sessions are required temporarily, treat it as migration debt, not the final state.
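To see why stickiness skews traffic, here is a rough sketch of what IP-hash routing does (the function is illustrative, not Nginx's actual algorithm): the same client IP always lands on the same upstream, so a few heavy clients can overload one node while others sit idle.

```python
import hashlib

# Illustrative IP-hash stickiness: deterministic mapping from client
# IP to upstream. Same IP -> same node, every time.
def sticky_upstream(client_ip, upstreams):
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return upstreams[int(digest, 16) % len(upstreams)]

upstreams = ["127.0.0.1:8001", "127.0.0.1:8002", "127.0.0.1:8003"]
print(sticky_upstream("203.0.113.7", upstreams))
```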

9. Plan database scaling separately

App scaling often exposes database limits.

Check for:

  • missing indexes
  • N+1 queries
  • lock contention
  • connection exhaustion
  • expensive sorts/joins
  • long-running transactions

Useful Postgres checks:

bash
psql "$DATABASE_URL" -c "select * from pg_stat_activity;"
psql "$DATABASE_URL" -c "select query, calls, total_exec_time, mean_exec_time from pg_stat_statements order by total_exec_time desc limit 10;"

Potential database actions:

  • add indexes
  • reduce query count
  • introduce connection pooling
  • move reads to replicas
  • upgrade managed DB tier
  • cache repeated reads in Redis

Scaling web nodes without addressing DB limits usually moves the failure point, not the capacity ceiling.
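Connection exhaustion in particular gets worse with every app instance you add, because each Gunicorn worker typically opens its own pool. A back-of-the-envelope check, with made-up numbers and a hypothetical helper, keeps total app connections under the database's `max_connections`:

```python
# Hypothetical sizing check: total app connections must stay below
# the database's max_connections, leaving headroom for admin
# sessions and migrations. Numbers are illustrative.
def pool_size_per_worker(max_connections, instances, workers_per_instance, headroom=10):
    budget = max_connections - headroom
    per_instance = budget // instances
    return max(1, per_instance // workers_per_instance)

# e.g. max_connections=100, 3 instances, 5 workers each
print(pool_size_per_worker(100, 3, 5))  # -> 6 connections per worker pool
```

When this number drops toward 1, a server-side pooler such as PgBouncer is usually the next step.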

10. Test failure scenarios

Stop one instance and verify traffic still works.

bash
sudo systemctl stop gunicorn@8002
curl -I http://127.0.0.1/
ab -n 1000 -c 50 http://127.0.0.1/
wrk -t4 -c100 -d30s http://127.0.0.1/

Verify:

  • requests still complete
  • login/session flow still works
  • uploads still work
  • jobs still process
  • error rate does not spike badly

This is the minimum validation for horizontal readiness.

11. Automate deployments

Manual per-node deploys create drift.

Minimum requirement:

  • same app version on every node
  • same env vars/secrets
  • migrations applied safely
  • controlled restart order
  • rollback path documented
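The "controlled restart order" point can be sketched as a simple batching plan (the helper name is made up): restart instances in small batches so the load balancer always has healthy upstreams left to route to.

```python
# Illustrative rolling-restart plan: split instances into batches so
# at least some upstreams stay up while each batch restarts.
def rolling_batches(instances, batch_size=1):
    return [instances[i:i + batch_size] for i in range(0, len(instances), batch_size)]

print(rolling_batches(["gunicorn@8001", "gunicorn@8002", "gunicorn@8003"]))
```

Each batch would be stopped, updated, restarted, and health-checked before the next one begins.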

If using containers, standardize image build and rollout. See Docker Production Setup for SaaS.

12. Recheck after rollout

After scaling, verify these improved:

  • lower p95/p99 latency
  • fewer 502/504/5xx errors
  • stable memory usage
  • no swap growth
  • stable DB connections
  • lower queue depth
  • healthy instance replacement behavior

Common causes

  • CPU saturation from too few app workers or expensive request handling
  • Memory pressure causing swapping or OOM kills
  • Database bottlenecks: slow queries, missing indexes, lock contention, connection exhaustion
  • Background jobs running inside web requests instead of a queue worker
  • Sessions stored in memory or local process state, breaking multi-instance requests
  • Uploads or generated files stored on local disk, unavailable on other instances
  • Nginx/load balancer misconfiguration sending traffic to unhealthy nodes
  • Uneven traffic distribution or no health checks
  • Too many Gunicorn workers for available RAM
  • No caching for repeated expensive reads
  • Blocking external API calls inside request handlers
  • Single server running app, worker, scheduler, and database competing for the same resources

Debugging tips

Use these commands during scale planning and incidents.

Host and process pressure

bash
uptime
top
htop
free -h
vmstat 1 5
df -h
iostat -xz 1 5
nproc
ps aux --sort=-%cpu | head -20
ps aux --sort=-%mem | head -20

Network and listeners

bash
ss -ltnp
ss -s
curl -I http://127.0.0.1:8000/health

Load testing

bash
ab -n 1000 -c 50 http://127.0.0.1/
wrk -t4 -c100 -d30s http://127.0.0.1/

Nginx and Gunicorn

bash
sudo nginx -t
sudo systemctl status nginx
sudo systemctl status gunicorn
sudo journalctl -u gunicorn -n 100 --no-pager
sudo journalctl -u nginx -n 100 --no-pager
gunicorn --check-config app.main:app

Redis

bash
redis-cli ping
redis-cli info memory

Postgres

bash
psql "$DATABASE_URL" -c "select now();"
psql "$DATABASE_URL" -c "select * from pg_stat_activity;"
psql "$DATABASE_URL" -c "select query, calls, total_exec_time, mean_exec_time from pg_stat_statements order by total_exec_time desc limit 10;"

If you are troubleshooting active resource pressure, use High CPU / Memory Usage.

Checklist

  • App does not depend on local filesystem for user uploads or shared generated files.
  • Sessions are shared across instances or sticky sessions are intentionally configured.
  • Background jobs run outside the request-response cycle.
  • Database connection limits are known and app pool sizes are tuned.
  • Nginx or load balancer has health checks and correct proxy headers.
  • Monitoring covers CPU, RAM, latency, 5xx rate, DB load, queue depth, and disk usage.
  • Deploy process updates all instances consistently.
  • Rollback path is documented and tested.
  • One instance can fail without total outage if horizontally scaled.

For broader production readiness, review SaaS Production Checklist.

FAQ

What is the simplest safe scaling plan for a small SaaS?

Increase server size, tune app workers, move background work out of requests, then externalize sessions and files before adding more app instances.

How many Gunicorn workers should I start with?

A common starting point is:

text
(2 * CPU cores) + 1

Then adjust based on memory use, latency, and workload type.

Why do logins break after adding a second app server?

Usually because sessions are stored in local memory, local files, or signed cookies are misconfigured across instances. Use shared session storage and consistent secrets.

Why didn’t adding more app servers improve performance?

The real bottleneck is often the database, slow external APIs, disk I/O, or background work still happening inside web requests.

Do I need object storage for horizontal scaling?

If users upload files or your app generates assets that must be available on every instance, yes. Local disk does not scale cleanly across nodes.

Final takeaway

  • Scale in this order: measure, optimize, scale vertically, remove shared local state, then scale horizontally.
  • Horizontal scaling is not just more servers. It requires stateless app design, shared storage/services, and deployment discipline.
  • If you cannot restart any app node without breaking sessions, uploads, or jobs, the app is not ready for horizontal scaling.