Zero Downtime Deployment

The essential playbook for implementing zero downtime deployment in your SaaS.

A zero downtime deployment replaces app processes without taking the site offline. For small SaaS products, the goal is simple: new code goes live, in-flight requests finish, health checks stay green, and rollback is fast if the release is bad.

This page focuses on practical deployment patterns for Gunicorn, Nginx, systemd, and container-based setups.

Quick Fix / Quick Setup

Use this if you deploy on a VPS with immutable release directories and a graceful Gunicorn reload:

bash
# Example: blue/green-style release switch with symlink + Gunicorn reload
set -e
APP_DIR=/var/www/myapp
RELEASES=$APP_DIR/releases
CURRENT=$APP_DIR/current
NEW_RELEASE=$RELEASES/$(date +%Y%m%d%H%M%S)

mkdir -p "$NEW_RELEASE"
rsync -a . "$NEW_RELEASE"/
cd "$NEW_RELEASE"
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
alembic upgrade head

ln -sfn "$NEW_RELEASE" "$CURRENT"
sudo systemctl reload gunicorn
curl -f http://127.0.0.1:8000/health

# rollback if health check fails
# ln -sfn /var/www/myapp/releases/<previous_release> /var/www/myapp/current
# sudo systemctl reload gunicorn

Use reload only if your app server supports graceful worker replacement. Run destructive database migrations separately or make them backward-compatible before switching traffic.


What’s happening

Downtime usually happens when old app processes stop before new ones are ready.

A safe deployment keeps at least one healthy app instance serving traffic during code replacement.

For a small SaaS setup, zero downtime usually depends on these controls:

  • immutable release directories
  • health checks
  • graceful worker replacement
  • backward-compatible database migrations
  • fast rollback to the previous release

Nginx should continue routing requests while Gunicorn workers restart gracefully or while traffic shifts between old and new releases.

The database is often the real blocker. App restarts are easy compared to schema changes that lock tables, break compatibility, or invalidate queued jobs.


Step-by-step implementation

1. Pick a deployment pattern

Use one of these patterns:

  • Graceful reload: best for one VPS with Gunicorn and moderate traffic
  • Blue/green: best when you can run old and new versions briefly at the same time
  • Rolling: best when multiple app instances sit behind a load balancer or reverse proxy

For most small SaaS apps on one server, graceful reload plus immutable releases is the simplest reliable option.

2. Add a health endpoint

Your deploy should never switch traffic before the app proves it can serve requests.

Minimal Flask or FastAPI-style endpoint:

python
@app.get("/health")
def health():
    return {"status": "ok"}

If database availability is critical to request handling, include a lightweight DB check. Keep it fast. Do not run expensive queries.

Example with SQLAlchemy:

python
from sqlalchemy import text

@app.get("/health")
def health():
    db.session.execute(text("SELECT 1"))
    return {"status": "ok"}

Use both local and public checks during deploy:

bash
curl -f http://127.0.0.1:8000/health
curl -f https://yourdomain.com/health

3. Build outside the live path

Do not deploy directly into the active code directory.

Recommended layout:

text
/var/www/myapp/
├── current -> /var/www/myapp/releases/20260420123000
├── releases/
│   ├── 20260419110000
│   └── 20260420123000
└── shared/
    ├── .env
    ├── uploads/
    └── logs/

Keep shared state outside the release directory:

  • secrets
  • uploads
  • runtime sockets
  • logs
  • cache volumes if needed
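One common approach is to symlink shared resources into each new release right after it is built, before switching traffic. A minimal sketch, assuming the layout above; the `link_shared` helper name is illustrative:

```shell
# link_shared: symlink shared state into a freshly built release directory.
# $1 = app root (e.g. /var/www/myapp), $2 = new release directory.
link_shared() {
  ln -sfn "$1/shared/.env"    "$2/.env"
  ln -sfn "$1/shared/uploads" "$2/uploads"
  ln -sfn "$1/shared/logs"    "$2/logs"
}
```

Call it as `link_shared /var/www/myapp "$NEW_RELEASE"` during the build step, so the release is complete before the current symlink moves.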

4. Prepare a systemd service that points to current

Example gunicorn.service:

ini
[Unit]
Description=Gunicorn for myapp
After=network.target

[Service]
User=www-data
Group=www-data
WorkingDirectory=/var/www/myapp/current
EnvironmentFile=/var/www/myapp/shared/.env
ExecStart=/var/www/myapp/current/.venv/bin/gunicorn app:app \
  --workers 3 \
  --bind 127.0.0.1:8000 \
  --timeout 60 \
  --graceful-timeout 30 \
  --access-logfile - \
  --error-logfile -
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
KillSignal=SIGTERM
TimeoutStopSec=60

[Install]
WantedBy=multi-user.target

Reload systemd if you change the unit:

bash
sudo systemctl daemon-reload
sudo systemctl restart gunicorn

For regular deploys, prefer:

bash
sudo systemctl reload gunicorn

Do not use restart unless you accept a stop/start cycle.

5. Put Nginx in front of Gunicorn

Example Nginx server block:

nginx
server {
    listen 80;
    server_name yourdomain.com;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        proxy_connect_timeout 5s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
    }

    location /health {
        proxy_pass http://127.0.0.1:8000/health;
        access_log off;
    }
}

Validate config before reload:

bash
sudo nginx -t
sudo systemctl reload nginx

If you use Unix sockets, keep the socket path stable across releases or let systemd manage socket creation consistently.
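One way to keep the socket path stable is to place it under shared/, outside the release directories, and reference it through a named upstream. A sketch, not a drop-in config; the upstream name and socket path are assumptions:

```nginx
# Hypothetical upstream using a socket kept outside the release dirs,
# so the path survives release switches.
upstream myapp_backend {
    server unix:/var/www/myapp/shared/gunicorn.sock fail_timeout=0;
}
```

Gunicorn would then bind with --bind unix:/var/www/myapp/shared/gunicorn.sock, and the server block's proxy_pass would point at http://myapp_backend instead of the TCP address.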

6. Run pre-deploy checks

Before touching live traffic:

bash
set -e
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
pytest
alembic current

If your app has asset compilation:

bash
npm ci
npm run build

If config is environment-driven, validate required variables before switch:

bash
test -n "$DATABASE_URL"
test -n "$SECRET_KEY"
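With more than a couple of variables, a small helper keeps the check readable and reports which variable is missing. A bash sketch; the `require_env` name is illustrative:

```shell
# require_env: fail if any named shell variable is unset or empty.
# Uses bash indirect expansion (${!var}), so run under bash.
require_env() {
  local var
  for var in "$@"; do
    if [ -z "${!var}" ]; then
      echo "missing required env var: $var" >&2
      return 1
    fi
  done
}
```

Run it before the traffic switch, e.g. `require_env DATABASE_URL SECRET_KEY` under `set -e`, so a misconfigured release never goes live.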

7. Use safe database migrations

Zero downtime app deploys fail most often because of schema changes.

Use an expand/contract pattern:

  1. add new nullable columns, tables, or indexes
  2. deploy code that can handle old and new schema
  3. backfill data separately
  4. remove old columns only after old code is gone

Safe examples:

  • add nullable column
  • add new table
  • add index concurrently where supported
  • write code that reads both old and new fields temporarily

Unsafe examples during live traffic:

  • dropping columns old code still reads
  • renaming columns without compatibility layer
  • blocking table rewrites during peak traffic
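The expand step and the transitional read can be illustrated end to end. A sketch using SQLite as a stand-in for your real database; the `users` table and its column names are hypothetical:

```python
import sqlite3

# Expand/contract walk-through. SQLite stands in for the real database;
# table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('ada')")

# Expand: add the new column as nullable, so old code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Transitional code reads the new field and falls back to the old one,
# so it handles rows written by either schema version.
row = conn.execute(
    "SELECT COALESCE(display_name, name) FROM users WHERE id = 1"
).fetchone()

# Backfill separately, after the deploy; contract (drop 'name') only
# once no running code reads the old column.
conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")
```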

Run migration status checks:

bash
alembic current
alembic history

If you need a full migration strategy, see Database Migration Strategy.

8. Switch traffic gracefully

For a symlink-based deploy:

bash
APP_DIR=/var/www/myapp
PREVIOUS=$(readlink -f "$APP_DIR/current")
NEW_RELEASE=$APP_DIR/releases/$(date +%Y%m%d%H%M%S)

mkdir -p "$NEW_RELEASE"
rsync -a . "$NEW_RELEASE"/
cd "$NEW_RELEASE"

python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
alembic upgrade head

ln -sfn "$NEW_RELEASE" "$APP_DIR/current"
sudo systemctl reload gunicorn

curl -f http://127.0.0.1:8000/health
curl -f https://yourdomain.com/health

If the health check fails:

bash
ln -sfn "$PREVIOUS" /var/www/myapp/current
sudo systemctl reload gunicorn

9. Verify post-deploy state

Check these immediately after release:

bash
systemctl status gunicorn
journalctl -u gunicorn -n 200 --no-pager
journalctl -u nginx -n 200 --no-pager
curl -I http://127.0.0.1:8000/health
curl -I https://yourdomain.com/health
readlink -f /var/www/myapp/current
tail -n 200 /var/log/nginx/error.log
tail -n 200 /var/log/nginx/access.log

Look for:

  • 502 or 503 responses
  • worker boot errors
  • missing env vars
  • import errors
  • static asset 404s
  • DB connection failures
  • background workers using old code

For production triage workflows, see Debugging Production Issues.

10. Keep rollback immediate

Rollback should be a traffic switch, not a restore operation.

Good rollback:

  • switch current symlink back
  • revert image tag
  • point Nginx upstream back to old app
  • reload services gracefully

Bad rollback:

  • rebuild app from scratch under pressure
  • restore a full backup for a bad code push
  • manually edit files in the live directory

Capture release metadata in every deploy:

bash
echo "release=$(date +%Y%m%d%H%M%S)"
echo "git_sha=$(git rev-parse --short HEAD)"
echo "deployed_at=$(date -Iseconds)"
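It helps to persist that metadata inside the release directory rather than only echoing it, so you can tell releases apart when rolling back. A sketch; the `RELEASE_INFO` filename and helper name are assumptions:

```shell
# write_release_info: record what was deployed into the release dir,
# so rollback targets are identifiable later.
write_release_info() {
  local release_dir=$1
  {
    echo "release=$(basename "$release_dir")"
    echo "git_sha=$(git rev-parse --short HEAD 2>/dev/null || echo unknown)"
    echo "deployed_at=$(date -Iseconds)"
  } > "$release_dir/RELEASE_INFO"
}
```

Called once per deploy, e.g. `write_release_info "$NEW_RELEASE"`, it gives every directory under releases/ a self-describing marker.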

Process Flow

  1. build
  2. migrate
  3. warm
  4. health check
  5. switch traffic
  6. verify
  7. rollback (if needed)


Common causes

These are the most common reasons “zero downtime” deployments still cause outages:

  • using systemctl restart instead of graceful reload
  • running breaking database migrations with incompatible app code
  • deploying directly into the live directory
  • no health endpoint, so traffic shifts too early
  • single Gunicorn worker configuration
  • non-versioned static assets
  • background workers running old code against new payloads
  • changing Nginx upstream or socket path without a clean handoff
  • not enough RAM to run old and new processes briefly
  • rollback requiring a backup restore instead of a fast release switch

Debugging tips

Use these commands during or after a failed deploy:

bash
systemctl status gunicorn
journalctl -u gunicorn -n 200 --no-pager
journalctl -u nginx -n 200 --no-pager
nginx -t
ps aux | grep gunicorn
ss -ltnp | grep 8000
curl -I http://127.0.0.1:8000/health
curl -I https://yourdomain.com/health
readlink -f /var/www/myapp/current
ls -lah /var/www/myapp/releases
tail -n 200 /var/log/nginx/error.log
tail -n 200 /var/log/nginx/access.log
alembic current
alembic history
docker ps
docker logs <container_name> --tail 200

Additional checks:

Confirm Gunicorn is actually reloading gracefully

Watch worker PIDs before and after reload:

bash
ps -ef | grep gunicorn
sudo systemctl reload gunicorn
sleep 2
ps -ef | grep gunicorn

You want to see new workers appear before old ones fully disappear.

Check for Nginx upstream failures

Search for common upstream errors:

bash
grep -i "upstream\|connect() failed\|502\|503" /var/log/nginx/error.log | tail -n 50

Check release pointer state

bash
readlink -f /var/www/myapp/current
ls -lah /var/www/myapp/releases

If current points to the wrong release, rollback may be a symlink issue, not an app issue.

Check worker and web deploy sync

If you use Celery or RQ, verify worker version and queue state. Incompatible job payloads often look like partial deploy failures.
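One lightweight guard is to tag each job payload with the version of the code that enqueued it, so a mismatched worker fails loudly instead of silently mishandling the job. A sketch with hypothetical names; APP_VERSION would come from your release metadata:

```python
# Tag job payloads with the enqueuing code version so a worker running
# a different release can detect the mismatch. Names are hypothetical.
APP_VERSION = "20260420123000"

def make_job(payload: dict) -> dict:
    """Wrap a payload with the version of the code that enqueued it."""
    return {"version": APP_VERSION, "payload": payload}

def handle_job(job: dict) -> dict:
    """Process a job only if it was enqueued by the same code version."""
    if job.get("version") != APP_VERSION:
        raise RuntimeError(
            f"job version {job.get('version')} != worker version {APP_VERSION}"
        )
    return job["payload"]
```

In practice the worker might requeue or dead-letter mismatched jobs instead of raising, but the version tag is what makes the mismatch visible at all.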


Checklist

  • Health endpoint exists and is used during deploy
  • New release builds outside the live path
  • Database migrations are backward-compatible
  • Gunicorn reload is graceful, not stop/start
  • Nginx continues serving during process replacement
  • Static files are versioned or atomically switched
  • Background workers are updated safely
  • Rollback path is tested
  • Logs and metrics are checked after release
  • Old release is retained until deploy is confirmed stable

For broader release hardening, review Deployment Checklist and SaaS Production Checklist.



FAQ

What is the minimum setup for zero downtime on a VPS?

Use Nginx in front of Gunicorn, run multiple Gunicorn workers, deploy to a new release directory, run compatible migrations, switch the current symlink, and reload Gunicorn gracefully.

Can I use zero downtime deployment with Flask or FastAPI?

Yes. The framework matters less than the process manager, reverse proxy, health checks, and migration strategy.

When should I avoid automatic migrations during deploy?

Avoid automatic migrations when they are large, blocking, or destructive. Run those in a planned step with compatibility checks and rollback planning.

How do I know if my reload is graceful?

Watch active requests during deploy, confirm old workers exit after finishing work, and verify Nginx does not show a spike in 502 or 503 responses.

What breaks zero downtime most often?

Schema incompatibility, direct in-place file deployment, and restarting the entire app stack at once are the most common causes.

Can a single VPS do zero downtime deployment?

Yes, if you use graceful reloads or briefly run old and new app processes side by side within available CPU and RAM.

Are database migrations the main risk?

Usually yes. Process replacement is manageable. Schema changes are where most deployment failures happen.

Should I use blue/green or rolling?

Blue/green is simpler on one host if resources allow two versions at once. Rolling is better with multiple instances.

Is Docker required?

No. Release directories plus systemd and Nginx are enough for many small SaaS products.

Can I guarantee zero dropped requests?

Not completely. You can reduce the risk significantly, but long-running requests, forced kills, bad health checks, and resource exhaustion can still interrupt traffic.


Final takeaway

Zero downtime deployment is mostly operational discipline:

  • build separately
  • migrate safely
  • warm the new version
  • switch traffic gracefully
  • verify health
  • keep rollback immediate

For a small SaaS product, the simplest reliable setup is usually:

  • immutable release directories
  • health checks
  • graceful Gunicorn reloads
  • backward-compatible database changes
  • a tested rollback path

If your current deploy still uses in-place file changes or restart, fix that first.