Background Jobs Not Running
The essential playbook for diagnosing and fixing background jobs that are not running in your SaaS.
Background jobs fail in production for a small set of repeatable reasons: the worker is not running, the queue backend is unreachable, the worker cannot import your app, or the job was never enqueued correctly.
This page gives a fast fix path for Celery and RQ in small SaaS deployments.
Quick Fix / Quick Setup
Run these checks in order:
# 1) Verify worker process exists
ps aux | egrep 'celery|rq|worker'
# 2) Check queue backend connectivity
redis-cli ping
# or for RabbitMQ
rabbitmqctl status
# 3) Start a worker manually from the app environment
# Celery
celery -A app.celery_app worker -l info
# RQ
rq worker default
# 4) Enqueue a known test job from app shell
python -c "from app.jobs import test_job; r=test_job.delay() if hasattr(test_job,'delay') else None; print(r)"
# 5) Inspect logs immediately
journalctl -u celery -n 100 --no-pager
journalctl -u rq-worker -n 100 --no-pager
journalctl -u gunicorn -n 100 --no-pager

If the worker starts manually but not via systemd, the problem is usually the service unit:
- wrong WorkingDirectory
- wrong virtualenv path in ExecStart
- missing environment variables
- wrong service user permissions
What’s happening
Your app accepts the request, but the async job is never executed or is delayed indefinitely.
The failure point is usually one of four places:
- enqueue
- broker / queue
- worker process
- job execution
A healthy setup requires all of these:
- app can enqueue jobs
- broker is reachable
- worker is running with the correct code and environment
- failures are visible in logs
Quick triage flow
- Confirm the job is actually being queued.
- Add a log line right after enqueue and capture the returned task or job ID.
- Check broker health first.
- Verify a worker process is running on the target host or container.
- Start one worker manually in the same runtime environment.
- Inspect failed-job registries, dead-letter queues, or Celery task state.
Step-by-step implementation
1) Verify the enqueue path
Make sure the production code path that should queue the job is actually executing.
Example:
job = send_email.delay(user_id)
logger.info("job_enqueued", extra={"task": "send_email", "user_id": user_id, "job_id": str(job)})

For RQ:
job = queue.enqueue(send_email, user_id)
logger.info("job_enqueued", extra={"task": "send_email", "user_id": user_id, "job_id": job.id})

If this log never appears, the problem is before the worker.
2) Add explicit enqueue logging
Log these fields every time:
- queue name
- task/job name
- argument summary
- returned task/job ID
- request or user ID if available
Do not log secrets or full payloads.
3) Check broker connectivity from the worker environment
For Redis:
redis-cli ping
redis-cli llen default
redis-cli keys '*'

(Avoid KEYS on large production instances; prefer SCAN.)

For RabbitMQ:
rabbitmqctl status
rabbitmqctl list_queues

If these fail from the worker host or container, jobs will not run.
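You can also test reachability from Python using only the standard library. This checks TCP connectivity to the broker host and port, not authentication; the URL parsing and default ports are assumptions for typical Redis/AMQP setups:

```python
import os
import socket
from urllib.parse import urlparse

def broker_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the broker's host/port succeeds."""
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    # Default ports: 6379 for redis://, 5672 for amqp://
    port = parsed.port or (6379 if parsed.scheme.startswith("redis") else 5672)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Check the same URL the worker reads from its environment
print(broker_reachable(os.environ.get("REDIS_URL", "redis://localhost:6379/0")))
```

Run this from the worker host or container so it sees the same network routes and environment variables the worker does.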
4) Validate worker startup command
Make sure the command matches the app structure and queue names.
Celery examples:
celery -A app.celery_app worker -l info
celery -A app.celery_app worker -l info -Q default
celery -A app.celery_app inspect registered
celery -A app.celery_app inspect active
celery -A app.celery_app status

RQ examples:
rq info
rq worker default
rq worker high default low

Check:
- correct app module
- correct queue names
- correct concurrency settings
- correct environment variables
- correct working directory
5) Confirm web and worker use the same code version
A common production issue is:
- web container/server deployed with new code
- worker still running old code
Verify the deployed revision, image tag, or release path for both processes.
Useful checks:
docker ps
docker logs <worker-container> --tail 200
docker exec -it <worker-container> sh
env | sort
python -c "import app; print('app import ok')"

6) Inspect process manager config
systemd example: Celery
[Unit]
Description=Celery Worker
After=network.target
[Service]
User=deploy
Group=deploy
WorkingDirectory=/var/www/myapp/current
EnvironmentFile=/var/www/myapp/shared/.env
ExecStart=/var/www/myapp/venv/bin/celery -A app.celery_app worker -l info -Q default
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target

systemd example: RQ
[Unit]
Description=RQ Worker
After=network.target
[Service]
User=deploy
Group=deploy
WorkingDirectory=/var/www/myapp/current
EnvironmentFile=/var/www/myapp/shared/.env
ExecStart=/var/www/myapp/venv/bin/rq worker default
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target

After editing:
sudo systemctl daemon-reload
sudo systemctl restart celery
sudo systemctl restart rq-worker
sudo systemctl status celery
sudo systemctl status rq-worker

Check for:
- wrong ExecStart
- wrong WorkingDirectory
- missing EnvironmentFile
- wrong user
- no restart policy
7) Run a minimal smoke-test job
Use a job that only logs a line.
Celery:
from celery import shared_task
import logging

logger = logging.getLogger(__name__)

@shared_task
def smoke_test():
    logger.info("celery_smoke_test_ok")
    return "ok"

RQ:

import logging

logger = logging.getLogger(__name__)

def smoke_test():
    logger.info("rq_smoke_test_ok")
    return "ok"

Enqueue it manually and verify the log appears.
If the smoke test works, the worker is healthy and the problem is inside the real job code.
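To separate import and logic errors from queue errors, first call the smoke test synchronously in the app shell, then enqueue it. A sketch (the inline definition stands in for importing your real job, e.g. `from app.jobs import smoke_test`; that module path is an assumption about your layout):

```python
# Run inside the app environment (same venv and working directory as the worker).
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("smoke")

def smoke_test():
    logger.info("smoke_test_ok")
    return "ok"

# 1) Direct call: proves the function imports and runs at all.
assert smoke_test() == "ok"

# 2) Then enqueue through the queue library:
#    Celery: smoke_test.delay()
#    RQ:     Queue("default", connection=Redis()).enqueue(smoke_test)
# If (1) works but (2) never logs, the problem is the broker or worker,
# not the job code.
```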
8) Check imports and dependencies
Production-only worker failures often come from:
- missing package in production image
- circular import
- incorrect module path
- environment-specific import side effects
Test imports directly:
python -c "import app; print('app import ok')"
python -c "from app.jobs import smoke_test; print('job import ok')"

If manual worker startup fails with import errors, fix imports before anything else.
9) Check queue selection
If jobs are enqueued to high but the worker only listens to default, nothing will process.
Celery worker on specific queues:
celery -A app.celery_app worker -l info -Q high,default

RQ worker on specific queues:

rq worker high default

Make queue names explicit in both producer and worker configuration.
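One way to keep producer and worker aligned is to name queues in one place. The task path `app.jobs.send_email` and the queue names here are assumptions for illustration:

```python
# Producer side:
# Celery, per call:   send_email.apply_async(args=[user_id], queue="high")
# Celery, via config (task_routes in your Celery settings):
task_routes = {
    "app.jobs.send_email": {"queue": "high"},
}
# RQ: the queue name is chosen when you build the Queue object:
#   Queue("high", connection=Redis()).enqueue(send_email, user_id)

# Worker side must listen on the same names:
#   celery -A app.celery_app worker -Q high,default
#   rq worker high default
WORKER_QUEUES = ["high", "default"]

# Sanity check: every routed queue has a worker listening
assert task_routes["app.jobs.send_email"]["queue"] in WORKER_QUEUES
```

A check like the final assertion can live in a deploy-time test so a queue-name mismatch fails loudly instead of silently stalling jobs.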
10) Check serialization and payload size
Do not pass:
- ORM model instances
- request objects
- open file handles
- non-JSON-serializable objects
Prefer:
send_email.delay(user_id=user.id)

Instead of:

send_email.delay(user)

Fetch state inside the job.
11) Enqueue after transaction commit
If the job depends on a database row created in the request, enqueue after commit. Otherwise the worker may run before data is visible.
Pattern:
from django.db import transaction
transaction.on_commit(lambda: send_email.delay(user.id))

Equivalent patterns exist in other frameworks using post-commit hooks.
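If your framework lacks a built-in hook, the same idea can be hand-rolled: collect callbacks during the transaction and fire them only after a successful commit. A minimal sketch, not a real transaction manager:

```python
class PostCommitHooks:
    """Collect callbacks during a transaction; run them only after commit."""

    def __init__(self):
        self._callbacks = []

    def on_commit(self, fn):
        self._callbacks.append(fn)

    def commit(self):
        # Commit your real DB transaction first, then fire the hooks.
        callbacks, self._callbacks = self._callbacks, []
        for fn in callbacks:
            fn()

    def rollback(self):
        self._callbacks.clear()  # never enqueue work for a rolled-back row

# Usage sketch:
hooks = PostCommitHooks()
enqueued = []
hooks.on_commit(lambda: enqueued.append("send_email"))
assert enqueued == []   # nothing enqueued while the transaction is open
hooks.commit()
assert enqueued == ["send_email"]
```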
12) Make worker logging persistent
Worker logs should include:
- timestamp
- queue
- task/job name
- task/job ID
- traceback
- retry count if relevant
If you only log to stdout without collection, failures may disappear after restart.
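One way to get these fields onto every worker log line is a formatter that appends them when present. The field names mirror the list above; adapt them to your log pipeline:

```python
import logging

class JobLogFormatter(logging.Formatter):
    """Append queue/task/job_id/retries fields to each record when present."""

    def format(self, record):
        base = super().format(record)
        extras = []
        for field in ("queue", "task", "job_id", "retries"):
            value = getattr(record, field, None)
            if value is not None:
                extras.append(f"{field}={value}")
        return f"{base} {' '.join(extras)}".rstrip()

handler = logging.StreamHandler()
handler.setFormatter(JobLogFormatter("%(asctime)s %(levelname)s %(message)s"))
logger = logging.getLogger("worker")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("job_started", extra={"queue": "default", "task": "send_email", "job_id": "abc123"})
```

Point the handler at a file or your log collector so the records survive worker restarts, and tracebacks logged with `logger.exception` keep the same fields.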
For broader production debugging, see Debugging Production Issues and Error Tracking with Sentry.
Common causes
Most production failures come from one of these:
- worker service is not running or is crash-looping
- wrong broker URL or broker credentials
- Redis or RabbitMQ unavailable from worker host
- worker listening to the wrong queue
- incorrect app module path in worker startup command
- missing environment variables in systemd or Docker
- web app and worker running different code versions
- task/job import errors or circular imports
- non-serializable job arguments
- jobs enqueued before DB transaction commits
- insufficient permissions for logs, files, sockets, or temp paths
- container exits immediately because command or entrypoint is wrong
- Redis memory eviction or broker-side resource exhaustion
- failed jobs accumulating silently without alerts
Systemd and service configuration checks
- Ensure ExecStart points to the virtualenv binary, not a global binary.
- Set WorkingDirectory to the live release directory.
- Load env vars with Environment= or EnvironmentFile=.
- Run the worker under a user with access to project files and logs.
- Enable restart policy with Restart=always or Restart=on-failure.
- Run systemctl daemon-reload after unit changes.
Container and Docker checks
- Confirm the worker container exists in production deployment manifests.
- Verify the worker command is not overridden by a script that exits immediately.
- Ensure app and worker containers share the same REDIS_URL, DATABASE_URL, and secrets.
- Add health checks if possible.
- Check image tags so web and worker run the same release.
If your deployment setup is unstable, review Docker Production Setup for SaaS and Environment Setup on VPS.
Common queue-specific checks
Celery
- verify broker_url
- verify result_backend
- inspect registered tasks
- ensure autodiscovery works
- confirm worker listens to the intended queues when using -Q
Commands:
celery -A app.celery_app status
celery -A app.celery_app inspect active
celery -A app.celery_app inspect registered
celery -A app.celery_app worker -l info

RQ
- verify exact queue name
- inspect failed job registries
- confirm worker can import job modules
Commands:
rq info
rq worker default

Redis-backed systems
Check:
- memory pressure
- eviction policy
- network connectivity from both web and worker
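These can be checked from the worker host with standard redis-cli commands:

```shell
# Memory usage vs limit and the active eviction policy
redis-cli info memory | grep -E 'used_memory_human|maxmemory_human|maxmemory_policy'

# If maxmemory_policy is an allkeys-* policy, queued jobs can be evicted
# silently; prefer noeviction for a Redis instance used as a job broker.

# Round-trip latency from this host (Ctrl-C to stop)
redis-cli --latency
```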
Debugging tips
Run one worker in foreground with verbose logging first.
Useful commands:
ps aux | egrep 'celery|rq|worker'
systemctl status celery
systemctl status rq-worker
journalctl -u celery -n 200 --no-pager
journalctl -u rq-worker -n 200 --no-pager
redis-cli ping
redis-cli llen default
redis-cli keys '*'
rabbitmqctl status
rabbitmqctl list_queues
celery -A app.celery_app status
celery -A app.celery_app inspect active
celery -A app.celery_app inspect registered
celery -A app.celery_app worker -l info
rq info
rq worker default
docker ps
docker logs <worker-container> --tail 200
docker exec -it <worker-container> sh
env | sort
python -c "import app; print('app import ok')"

Additional rules:
- use a dedicated smoke-test job in every environment
- do not pass model instances or request objects into jobs
- log queue names explicitly
- enqueue after transaction commit if the job depends on DB writes
- verify outbound network access and DNS if jobs call external APIs
- if auth-related jobs fail during signup or login flows, also review Common Auth Bugs and Fixes
Use this troubleshooting matrix to quickly isolate background-job failures:
| Symptom | Likely cause | Check | Fix |
|---|---|---|---|
| Jobs stay queued and never start | Worker is down or listening to the wrong queue | `ps aux \| egrep 'celery\|rq'`, then `rq info` or `celery -A app.celery_app inspect active` | Start the worker and align queue names between producer and worker |
| Worker starts then exits immediately | Bad ExecStart, wrong venv path, or import error | systemctl status celery --no-pager -l, journalctl -u celery -n 200 --no-pager | Fix systemd unit paths/module target, reload daemon, restart service |
| Worker runs but cannot consume jobs | Broker URL/credentials wrong or broker unreachable | redis-cli ping or rabbitmqctl status from worker host/container | Correct broker env vars and network routing, then restart worker |
| Only some jobs fail | Job payload is not serializable or job logic depends on missing state | Inspect traceback in worker logs; verify arguments are IDs, not ORM objects | Pass primitive IDs only, load state inside job, add guardrails and retries |
| Job fails after enqueue during create flow | Job runs before DB transaction commit | Compare enqueue timing with DB commit in logs | Enqueue in post-commit hook (transaction.on_commit or equivalent) |
| Works locally but fails in production | Web and worker run different release/env configuration | Compare image tags/revision/env across web and worker | Deploy web and worker together with shared env management |
Checklist
- ✓ worker process running
- ✓ broker reachable from app and worker
- ✓ correct queue names configured
- ✓ same environment variables on web and worker
- ✓ same release version on web and worker
- ✓ test job executes successfully
- ✓ tracebacks visible in logs
- ✓ failed jobs can be inspected
- ✓ worker restarts automatically on failure
- ✓ monitoring and alerts configured for queue backlog and worker downtime
For final hardening, use the SaaS Production Checklist.
Related guides
- Handling Background Jobs (Celery / RQ)
- Docker Production Setup for SaaS
- Environment Setup on VPS
- Debugging Production Issues
- Logging Setup (Application + Server)
FAQ
How do I know if the problem is enqueue vs worker execution?
Log immediately after enqueue and capture the task or job ID. If no ID is returned or enqueue raises an error, the problem is before the worker. If the ID exists but nothing processes it, inspect the queue and worker.
What is the most common production mistake with Celery or RQ?
The worker service is missing production environment variables or starts from the wrong working directory, so it cannot import the app or connect to Redis or RabbitMQ.
Should I run multiple workers?
Yes if queue latency matters, but first make one worker stable and observable. Add more workers only after queue names, retries, logging, and resource limits are correct.
Why are only some jobs failing?
Those jobs usually depend on unavailable files, external APIs, database state, or non-serializable arguments. Test with a minimal job and isolate the failing dependency.
Why do jobs run locally but not in production?
Usually:
- missing env vars
- wrong broker URL
- wrong working directory
- queue mismatch
- worker process not running
Why are jobs queued but never processed?
The worker may:
- listen to a different queue
- lack broker access
- crash-loop on startup
Why do jobs disappear?
Possible causes:
- Redis eviction
- TTL or expiration settings
- result backend confusion
- jobs moved to failed registries
Should the worker run on the same server as the app?
For a small SaaS, yes if resources allow. Use separate processes or services and monitor CPU and memory contention.
Final takeaway
Treat background jobs as a separate production service.
Most issues are process management or configuration problems, not queue-library bugs. Debug in this order:
- enqueue
- broker
- worker
- job execution
Make worker startup explicit, log job IDs, keep web and worker on the same release, and monitor queue backlog so failures are visible early.