Scaling Basics (Vertical & Horizontal)
The essential playbook for scaling a small SaaS: grow one server first, then run multiple app instances behind a load balancer.
A small SaaS usually scales in two phases: first by giving one server more CPU and RAM, then by running multiple app instances behind a load balancer. This page gives a practical path from single-server MVP deployment to a more resilient multi-instance setup without overengineering early.
Quick Fix / Quick Setup
# 1) Check current resource pressure
uptime
free -h
df -h
nproc
# 2) Inspect top CPU and memory consumers
ps aux --sort=-%mem | head
ps aux --sort=-%cpu | head
# 3) If using Gunicorn, increase workers based on CPU
# common starting point: workers = (2 * CPU) + 1
gunicorn app.main:app -w 5 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:8000
# 4) Put Nginx in front of multiple local app instances
upstream app_servers {
server 127.0.0.1:8001;
server 127.0.0.1:8002;
server 127.0.0.1:8003;
}
server {
listen 80;
server_name example.com;
location / {
proxy_pass http://app_servers;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
# 5) Verify the upstream and reload Nginx
sudo nginx -t && sudo systemctl reload nginx
# 6) If scaling across servers, move shared state out of app nodes:
# - sessions -> Redis / database
# - uploads -> S3-compatible storage
# - background jobs -> Redis/RabbitMQ-backed workers
# - cache -> Redis
For most MVPs: scale vertically first, optimize obvious bottlenecks, then make the app stateless before adding more servers. Horizontal scaling breaks quickly if sessions, uploads, or background jobs depend on local disk or in-memory state.
What’s happening
- Vertical scaling means increasing resources on one machine: more CPU, RAM, disk IOPS, or a better instance class.
- Horizontal scaling means running multiple app instances and distributing traffic across them with Nginx, a cloud load balancer, or a container orchestrator.
- Vertical scaling is simpler and usually the first move for early SaaS products.
- Horizontal scaling improves capacity and availability, but only works cleanly if the app is mostly stateless.
- Your real bottleneck may not be the web app. It may be the database, background jobs, disk, external APIs, or missing caching.
- Scaling app servers without fixing database contention, slow queries, or blocking tasks often gives little improvement.
[Figure: decision flowchart for choosing vertical vs horizontal scaling based on CPU, RAM, response time, and single-point-of-failure requirements.]
Step-by-step implementation
1. Measure the bottleneck first
Check whether the bottleneck is CPU, memory, disk, database, or slow requests.
uptime
top
htop
free -h
vmstat 1 5
df -h
iostat -xz 1 5
nproc
ps aux --sort=-%cpu | head -20
ps aux --sort=-%mem | head -20
ss -s
Check app and proxy status:
sudo systemctl status nginx
sudo systemctl status gunicorn
sudo journalctl -u gunicorn -n 100 --no-pager
sudo journalctl -u nginx -n 100 --no-pager
curl -I http://127.0.0.1:8000/health
Check database pressure:
psql "$DATABASE_URL" -c "select now();"
psql "$DATABASE_URL" -c "select * from pg_stat_activity;"
psql "$DATABASE_URL" -c "select query, calls, total_exec_time, mean_exec_time from pg_stat_statements order by total_exec_time desc limit 10;"
If you do not already have monitoring, set it up before scaling. See Metrics and Performance Monitoring.
2. Scale vertically first if one machine is close
If CPU or RAM is consistently saturated, one larger instance is usually the simplest next step.
Common Gunicorn starting point:
gunicorn app.main:app \
-w 5 \
-k uvicorn.workers.UvicornWorker \
-b 0.0.0.0:8000 \
--timeout 60 \
--keep-alive 5 \
--max-requests 1000 \
--max-requests-jitter 100
Worker rule of thumb:
workers = (2 * CPU cores) + 1
This is a starting point, not a final answer. Validate against memory usage and latency.
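That rule of thumb can also live in a Gunicorn config file, so the worker count follows whatever machine you resize to. A minimal sketch mirroring the flags above (recent Gunicorn versions pick up `./gunicorn.conf.py` automatically); the values are starting points, not tuned numbers:

```python
# gunicorn.conf.py -- settings here mirror the CLI flags above
import multiprocessing

# (2 * CPU cores) + 1: a starting point, validate against RAM and latency
workers = multiprocessing.cpu_count() * 2 + 1
worker_class = "uvicorn.workers.UvicornWorker"
bind = "0.0.0.0:8000"
timeout = 60
keepalive = 5
max_requests = 1000        # recycle workers to cap slow memory leaks
max_requests_jitter = 100  # stagger recycling so workers don't restart together
```

With this file in the working directory, `gunicorn app.main:app` is enough, and resizing the instance changes the worker count on the next restart instead of requiring a flag edit.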
Example systemd service:
# /etc/systemd/system/gunicorn.service
[Unit]
Description=Gunicorn
After=network.target
[Service]
User=www-data
Group=www-data
WorkingDirectory=/srv/app
Environment="PATH=/srv/app/.venv/bin"
ExecStart=/srv/app/.venv/bin/gunicorn app.main:app \
-w 5 \
-k uvicorn.workers.UvicornWorker \
-b 127.0.0.1:8000 \
--timeout 60
Restart=always
[Install]
WantedBy=multi-user.target
Reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart gunicorn
sudo systemctl enable gunicorn
If the same host is still running the app, database, worker, and scheduler, separate them before adding more complexity. See Deploy SaaS with Nginx + Gunicorn.
3. Make the app stateless
Before horizontal scaling, remove dependencies on local instance state.
Move these out of app memory or local disk:
- sessions
- cache
- rate-limit counters
- uploads
- generated files
- background jobs
- scheduled jobs
Typical target architecture:
- sessions/cache -> Redis
- uploads/assets -> S3-compatible object storage
- jobs -> Redis or RabbitMQ queue + worker process
- scheduler -> dedicated process, not every web node
If requests depend on a specific node, horizontal scaling will fail under deploys and restarts.
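As a concrete sketch of shared sessions, the helpers below keep session state in Redis instead of process memory. The key scheme, TTL, and function names are illustrative assumptions; `client` stands for any Redis-style client, e.g. one created with `redis.Redis.from_url(os.environ["REDIS_URL"])` from the redis-py package:

```python
import json

SESSION_TTL = 60 * 60 * 24  # expire sessions after 24 hours (illustrative)

def session_key(session_id: str) -> str:
    return f"session:{session_id}"

def save_session(client, session_id: str, data: dict) -> None:
    # Written to Redis, so any instance behind the load balancer can read it.
    client.setex(session_key(session_id), SESSION_TTL, json.dumps(data))

def load_session(client, session_id: str):
    raw = client.get(session_key(session_id))
    return json.loads(raw) if raw is not None else None
```

Once sessions round-trip through shared storage like this, restarting or removing any single app node no longer logs users out.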
4. Externalize shared components
Redis for sessions/cache
Example environment:
export REDIS_URL=redis://redis.internal:6379/0
Validate Redis:
redis-cli ping
redis-cli info memory
Object storage for uploads
Use S3-compatible storage instead of local paths like /tmp/uploads or /srv/app/media.
export S3_BUCKET=my-saas-uploads
export S3_REGION=us-east-1
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
Background jobs
Move long-running tasks out of the request path:
- email sending
- webhook retries
- file processing
- report generation
- image/video transformations
- imports/exports
Do not let web workers block on these.
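The enqueue/consume split can be sketched with a Redis list as the queue. The queue key and job format here are assumptions for illustration, and `client` is any Redis-style client; in practice a library such as RQ or Celery adds retries, scheduling, and monitoring on top of this pattern:

```python
import json

QUEUE_KEY = "jobs"  # illustrative key name

def enqueue(client, task: str, payload: dict) -> None:
    # Called from the request handler: push the job and return immediately.
    client.lpush(QUEUE_KEY, json.dumps({"task": task, "payload": payload}))

def work_one(client, handlers: dict) -> bool:
    # Called in a loop by the worker process, never inside a web request.
    raw = client.rpop(QUEUE_KEY)
    if raw is None:
        return False
    job = json.loads(raw)
    handlers[job["task"]](job["payload"])
    return True
```

The web process only ever calls `enqueue`; a separate worker process (its own systemd unit) runs `work_one` in a loop, so slow tasks never hold a web worker.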
5. Separate responsibilities
A practical small SaaS production split:
- web server: Nginx
- app server: Gunicorn/Uvicorn
- worker: queue consumer
- scheduler: one process only
- database: managed Postgres if possible
- cache/session store: Redis
- uploads: object storage
This reduces resource contention and makes bottlenecks visible.
6. Add multiple app instances
Start multiple app processes on separate ports.
gunicorn app.main:app -w 3 -k uvicorn.workers.UvicornWorker -b 127.0.0.1:8001
gunicorn app.main:app -w 3 -k uvicorn.workers.UvicornWorker -b 127.0.0.1:8002
gunicorn app.main:app -w 3 -k uvicorn.workers.UvicornWorker -b 127.0.0.1:8003
If using systemd, define separate units or templates for each instance.
Example template:
# /etc/systemd/system/gunicorn@.service
[Unit]
Description=Gunicorn instance %i
After=network.target
[Service]
User=www-data
Group=www-data
WorkingDirectory=/srv/app
Environment="PATH=/srv/app/.venv/bin"
ExecStart=/srv/app/.venv/bin/gunicorn app.main:app \
-w 3 \
-k uvicorn.workers.UvicornWorker \
-b 127.0.0.1:%i
Restart=always
[Install]
WantedBy=multi-user.target
Start instances:
sudo systemctl daemon-reload
sudo systemctl enable --now gunicorn@8001
sudo systemctl enable --now gunicorn@8002
sudo systemctl enable --now gunicorn@8003
7. Add load balancing
Example Nginx upstream:
upstream app_servers {
least_conn;
server 127.0.0.1:8001 max_fails=3 fail_timeout=10s;
server 127.0.0.1:8002 max_fails=3 fail_timeout=10s;
server 127.0.0.1:8003 max_fails=3 fail_timeout=10s;
}
server {
listen 80;
server_name example.com;
location /health {
proxy_pass http://app_servers/health;
proxy_set_header Host $host;
}
location / {
proxy_pass http://app_servers;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 60s;
}
}
Validate and reload:
sudo nginx -t
sudo systemctl reload nginx
Check listening ports:
ss -ltnp
For multi-server setups, use a cloud load balancer or Nginx on a dedicated edge node.
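Each instance also needs something answering the /health location that the Nginx config above proxies to. A stdlib-only sketch of the idea; in a real app this would be a route in your web framework, and it should stay cheap (no database or external calls) so it reflects process liveness only:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        # Keep frequent health checks out of the access-log noise.
        pass

def serve(port: int = 8000) -> None:
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

With max_fails/fail_timeout set on the upstream, Nginx stops routing to an instance whose health endpoint stops responding, which is what makes the failure test in step 10 pass.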
8. Avoid sticky sessions if possible
Preferred setup: any request can hit any instance.
If sessions are shared through Redis or DB-backed sessions, no stickiness is needed.
Only use sticky sessions if:
- you cannot refactor session handling yet
- you understand failover limitations
- you accept uneven traffic distribution
If sticky sessions are required temporarily, treat it as migration debt, not the final state.
9. Plan database scaling separately
App scaling often exposes database limits.
Check for:
- missing indexes
- N+1 queries
- lock contention
- connection exhaustion
- expensive sorts/joins
- long-running transactions
Useful Postgres checks:
psql "$DATABASE_URL" -c "select * from pg_stat_activity;"
psql "$DATABASE_URL" -c "select query, calls, total_exec_time, mean_exec_time from pg_stat_statements order by total_exec_time desc limit 10;"
Potential database actions:
- add indexes
- reduce query count
- introduce connection pooling
- move reads to replicas
- upgrade managed DB tier
- cache repeated reads in Redis
Scaling web nodes without addressing DB limits usually moves the failure point, not the capacity ceiling.
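Connection exhaustion in particular scales with instances x workers, since each worker process usually holds its own pool. A quick budgeting sketch; the numbers are illustrative, so substitute your real Postgres max_connections:

```python
# Connection budget: total DB connections grow with instances x workers.
def pool_size_per_process(max_connections: int, reserved: int,
                          instances: int, workers_per_instance: int) -> int:
    # Leave headroom for migrations, cron jobs, psql sessions, etc.
    usable = max_connections - reserved
    processes = instances * workers_per_instance
    return max(1, usable // processes)

# e.g. Postgres max_connections=100, 10 reserved, 3 instances x 5 workers
print(pool_size_per_process(100, 10, 3, 5))  # -> 6
```

Feed the result into your driver or ORM pool settings (for example SQLAlchemy's `pool_size`), or put PgBouncer in front of Postgres so the per-process math stops mattering.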
10. Test failure scenarios
Stop one instance and verify traffic still works.
sudo systemctl stop gunicorn@8002
curl -I http://127.0.0.1/
ab -n 1000 -c 50 http://127.0.0.1/
wrk -t4 -c100 -d30s http://127.0.0.1/
Verify:
- requests still complete
- login/session flow still works
- uploads still work
- jobs still process
- error rate does not spike badly
This is the minimum validation for horizontal readiness.
11. Automate deployments
Manual per-node deploys create drift.
Minimum requirement:
- same app version on every node
- same env vars/secrets
- migrations applied safely
- controlled restart order
- rollback path documented
If using containers, standardize image build and rollout. See Docker Production Setup for SaaS.
12. Recheck after rollout
After scaling, verify these improved:
- lower p95/p99 latency
- fewer 502/504/5xx errors
- stable memory usage
- no swap growth
- stable DB connections
- lower queue depth
- healthy instance replacement behavior
Common causes
- CPU saturation from too few app workers or expensive request handling
- Memory pressure causing swapping or OOM kills
- Database bottlenecks: slow queries, missing indexes, lock contention, connection exhaustion
- Background jobs running inside web requests instead of a queue worker
- Sessions stored in memory or local process state, breaking multi-instance requests
- Uploads or generated files stored on local disk, unavailable on other instances
- Nginx/load balancer misconfiguration sending traffic to unhealthy nodes
- Uneven traffic distribution or no health checks
- Too many Gunicorn workers for available RAM
- No caching for repeated expensive reads
- Blocking external API calls inside request handlers
- Single server running app, worker, scheduler, and database competing for the same resources
Debugging tips
Use these commands during scale planning and incidents.
Host and process pressure
uptime
top
htop
free -h
vmstat 1 5
df -h
iostat -xz 1 5
nproc
ps aux --sort=-%cpu | head -20
ps aux --sort=-%mem | head -20
Network and listeners
ss -ltnp
ss -s
curl -I http://127.0.0.1:8000/health
Load testing
ab -n 1000 -c 50 http://127.0.0.1/
wrk -t4 -c100 -d30s http://127.0.0.1/
Nginx and Gunicorn
sudo nginx -t
sudo systemctl status nginx
sudo systemctl status gunicorn
sudo journalctl -u gunicorn -n 100 --no-pager
sudo journalctl -u nginx -n 100 --no-pager
gunicorn --check-config app.main:app
Redis
redis-cli ping
redis-cli info memory
Postgres
psql "$DATABASE_URL" -c "select now();"
psql "$DATABASE_URL" -c "select * from pg_stat_activity;"
psql "$DATABASE_URL" -c "select query, calls, total_exec_time, mean_exec_time from pg_stat_statements order by total_exec_time desc limit 10;"
If you are troubleshooting active resource pressure, use High CPU / Memory Usage.
Checklist
- ✓ App does not depend on local filesystem for user uploads or shared generated files.
- ✓ Sessions are shared across instances or sticky sessions are intentionally configured.
- ✓ Background jobs run outside the request-response cycle.
- ✓ Database connection limits are known and app pool sizes are tuned.
- ✓ Nginx or load balancer has health checks and correct proxy headers.
- ✓ Monitoring covers CPU, RAM, latency, 5xx rate, DB load, queue depth, and disk usage.
- ✓ Deploy process updates all instances consistently.
- ✓ Rollback path is documented and tested.
- ✓ One instance can fail without total outage if horizontally scaled.
For broader production readiness, review SaaS Production Checklist.
Related guides
- Deploy SaaS with Nginx + Gunicorn
- Docker Production Setup for SaaS
- Metrics and Performance Monitoring
- High CPU / Memory Usage
- SaaS Production Checklist
FAQ
What is the simplest safe scaling plan for a small SaaS?
Increase server size, tune app workers, move background work out of requests, then externalize sessions and files before adding more app instances.
How many Gunicorn workers should I start with?
A common starting point is:
(2 x CPU cores) + 1
Then adjust based on memory use, latency, and workload type.
Why do logins break after adding a second app server?
Usually because sessions are stored in local memory, local files, or signed cookies are misconfigured across instances. Use shared session storage and consistent secrets.
Why didn’t adding more app servers improve performance?
The real bottleneck is often the database, slow external APIs, disk I/O, or background work still happening inside web requests.
Do I need object storage for horizontal scaling?
If users upload files or your app generates assets that must be available on every instance, yes. Local disk does not scale cleanly across nodes.
Final takeaway
- Scale in this order: measure, optimize, scale vertically, remove shared local state, then scale horizontally.
- Horizontal scaling is not just more servers. It requires stateless app design, shared storage/services, and deployment discipline.
- If you cannot restart any app node without breaking sessions, uploads, or jobs, the app is not ready for horizontal scaling.