Background Jobs Not Running
The essential playbook for diagnosing and fixing background jobs that are not running in your SaaS.
Background jobs fail in production for a small set of repeatable reasons: the worker is not running, the queue backend is unreachable, the worker cannot import your app, or the job was never enqueued correctly.
This page gives a fast fix path for Celery and RQ in small SaaS deployments.
Quick Fix / Quick Setup
Run these checks in order:
# 1) Verify worker process exists
ps aux | egrep 'celery|rq|worker'
# 2) Check queue backend connectivity
redis-cli ping
# or for RabbitMQ
rabbitmqctl status
# 3) Start a worker manually from the app environment
# Celery
celery -A app.celery_app worker -l info
# RQ
rq worker default
# 4) Enqueue a known test job from app shell
python -c "from app.jobs import test_job; r=test_job.delay() if hasattr(test_job,'delay') else None; print(r)"
# 5) Inspect logs immediately
journalctl -u celery -n 100 --no-pager
journalctl -u rq-worker -n 100 --no-pager
journalctl -u gunicorn -n 100 --no-pager

If the worker starts manually but not via systemd, the problem is usually the service unit:
- wrong WorkingDirectory
- wrong virtualenv path in ExecStart
- missing environment variables
- wrong service user permissions
What’s happening
Your app accepts the request, but the async job is never executed or is delayed indefinitely.
The failure point is usually one of four places:
- enqueue
- broker / queue
- worker process
- job execution
A healthy setup requires all of these:
- app can enqueue jobs
- broker is reachable
- worker is running with the correct code and environment
- failures are visible in logs
Quick triage flow
- Confirm the job is actually being queued.
- Add a log line right after enqueue and capture the returned task or job ID.
- Check broker health first.
- Verify a worker process is running on the target host or container.
- Start one worker manually in the same runtime environment.
- Inspect failed-job registries, dead-letter queues, or Celery task state.
Step-by-step implementation
1) Verify the enqueue path
Make sure the production code path that should queue the job is actually executing.
Example:
job = send_email.delay(user_id)
logger.info("job_enqueued", extra={"task": "send_email", "user_id": user_id, "job_id": str(job)})

For RQ:
job = queue.enqueue(send_email, user_id)
logger.info("job_enqueued", extra={"task": "send_email", "user_id": user_id, "job_id": job.id})

If this log never appears, the problem is before the worker.
2) Add explicit enqueue logging
Log these fields every time:
- queue name
- task/job name
- argument summary
- returned task/job ID
- request or user ID if available
Do not log secrets or full payloads.
3) Check broker connectivity from the worker environment
For Redis:
redis-cli ping
redis-cli llen default
redis-cli keys '*'

(Avoid KEYS on large production instances; prefer SCAN.)

For RabbitMQ:
rabbitmqctl status
rabbitmqctl list_queues

If these fail from the worker host or container, jobs will not run.
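You can also test reachability from Python using only the standard library. This checks TCP connectivity to the broker host and port, not authentication; the URL parsing and default ports are assumptions for typical Redis/AMQP setups:

```python
import os
import socket
from urllib.parse import urlparse

def broker_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the broker's host/port succeeds."""
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    # Default ports: 6379 for redis://, 5672 for amqp://
    port = parsed.port or (6379 if parsed.scheme.startswith("redis") else 5672)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Check the same URL the worker reads from its environment
print(broker_reachable(os.environ.get("REDIS_URL", "redis://localhost:6379/0")))
```

Run this from the worker host or container so it sees the same network routes and environment variables the worker does.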
4) Validate worker startup command
Make sure the command matches the app structure and queue names.
Celery examples:
celery -A app.celery_app worker -l info
celery -A app.celery_app worker -l info -Q default
celery -A app.celery_app inspect registered
celery -A app.celery_app inspect active
celery -A app.celery_app status

RQ examples:
rq info
rq worker default
rq worker high default low

Check:
- correct app module
- correct queue names
- correct concurrency settings
- correct environment variables
- correct working directory
5) Confirm web and worker use the same code version
A common production issue is:
- web container/server deployed with new code
- worker still running old code
Verify the deployed revision, image tag, or release path for both processes.
Useful checks:
docker ps
docker logs <worker-container> --tail 200
docker exec -it <worker-container> sh
env | sort
python -c "import app; print('app import ok')"

6) Inspect process manager config
systemd example: Celery
[Unit]
Description=Celery Worker
After=network.target
[Service]
User=deploy
Group=deploy
WorkingDirectory=/var/www/myapp/current
EnvironmentFile=/var/www/myapp/shared/.env
ExecStart=/var/www/myapp/venv/bin/celery -A app.celery_app worker -l info -Q default
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target

systemd example: RQ
[Unit]
Description=RQ Worker
After=network.target
[Service]
User=deploy
Group=deploy
WorkingDirectory=/var/www/myapp/current
EnvironmentFile=/var/www/myapp/shared/.env
ExecStart=/var/www/myapp/venv/bin/rq worker default
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target

After editing:
sudo systemctl daemon-reload
sudo systemctl restart celery
sudo systemctl restart rq-worker
sudo systemctl status celery
sudo systemctl status rq-worker

Check for:
- wrong ExecStart
- wrong WorkingDirectory
- missing EnvironmentFile
- wrong user
- no restart policy
7) Run a minimal smoke-test job
Use a job that only logs a line.
Celery:
from celery import shared_task
import logging

logger = logging.getLogger(__name__)

@shared_task
def smoke_test():
    logger.info("celery_smoke_test_ok")
    return "ok"

RQ:

import logging

logger = logging.getLogger(__name__)

def smoke_test():
    logger.info("rq_smoke_test_ok")
    return "ok"

Enqueue it manually and verify the log appears.
If the smoke test works, the worker is healthy and the problem is inside the real job code.
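To separate import and logic errors from queue errors, first call the smoke test synchronously in the app shell, then enqueue it. A sketch (the inline definition stands in for importing your real job, e.g. `from app.jobs import smoke_test`; that module path is an assumption about your layout):

```python
# Run inside the app environment (same venv and working directory as the worker).
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("smoke")

def smoke_test():
    logger.info("smoke_test_ok")
    return "ok"

# 1) Direct call: proves the function imports and runs at all.
assert smoke_test() == "ok"

# 2) Then enqueue through the queue library:
#    Celery: smoke_test.delay()
#    RQ:     Queue("default", connection=Redis()).enqueue(smoke_test)
# If (1) works but (2) never logs, the problem is the broker or worker,
# not the job code.
```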
8) Check imports and dependencies
Production-only worker failures often come from:
- missing package in production image
- circular import
- incorrect module path
- environment-specific import side effects
Test imports directly:
python -c "import app; print('app import ok')"
python -c "from app.jobs import smoke_test; print('job import ok')"

If manual worker startup fails with import errors, fix imports before anything else.
9) Check queue selection
If jobs are enqueued to high but the worker only listens to default, nothing will process.
Celery worker on specific queues:
celery -A app.celery_app worker -l info -Q high,default

RQ worker on specific queues:

rq worker high default

Make queue names explicit in both producer and worker configuration.
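One way to keep producer and worker aligned is to name queues in one place. The task path `app.jobs.send_email` and the queue names here are assumptions for illustration:

```python
# Producer side:
# Celery, per call:   send_email.apply_async(args=[user_id], queue="high")
# Celery, via config (task_routes in your Celery settings):
task_routes = {
    "app.jobs.send_email": {"queue": "high"},
}
# RQ: the queue name is chosen when you build the Queue object:
#   Queue("high", connection=Redis()).enqueue(send_email, user_id)

# Worker side must listen on the same names:
#   celery -A app.celery_app worker -Q high,default
#   rq worker high default
WORKER_QUEUES = ["high", "default"]

# Sanity check: every routed queue has a worker listening
assert task_routes["app.jobs.send_email"]["queue"] in WORKER_QUEUES
```

A check like the final assertion can live in a deploy-time test so a queue-name mismatch fails loudly instead of silently stalling jobs.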
10) Check serialization and payload size
Do not pass:
- ORM model instances
- request objects
- open file handles
- non-JSON-serializable objects
Prefer:
send_email.delay(user_id=user.id)

Instead of:

send_email.delay(user)

Fetch state inside the job.
11) Enqueue after transaction commit
If the job depends on a database row created in the request, enqueue after commit. Otherwise the worker may run before data is visible.
Pattern:
from django.db import transaction
transaction.on_commit(lambda: send_email.delay(user.id))

Equivalent patterns exist in other frameworks using post-commit hooks.
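If your framework lacks a built-in hook, the same idea can be hand-rolled: collect callbacks during the transaction and fire them only after a successful commit. A minimal sketch, not a real transaction manager:

```python
class PostCommitHooks:
    """Collect callbacks during a transaction; run them only after commit."""

    def __init__(self):
        self._callbacks = []

    def on_commit(self, fn):
        self._callbacks.append(fn)

    def commit(self):
        # Commit your real DB transaction first, then fire the hooks.
        callbacks, self._callbacks = self._callbacks, []
        for fn in callbacks:
            fn()

    def rollback(self):
        self._callbacks.clear()  # never enqueue work for a rolled-back row

# Usage sketch:
hooks = PostCommitHooks()
enqueued = []
hooks.on_commit(lambda: enqueued.append("send_email"))
assert enqueued == []   # nothing enqueued while the transaction is open
hooks.commit()
assert enqueued == ["send_email"]
```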
12) Make worker logging persistent
Worker logs should include:
- timestamp
- queue
- task/job name
- task/job ID
- traceback
- retry count if relevant
If you only log to stdout without collection, failures may disappear after restart.
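One way to get these fields onto every worker log line is a formatter that appends them when present. The field names mirror the list above; adapt them to your log pipeline:

```python
import logging

class JobLogFormatter(logging.Formatter):
    """Append queue/task/job_id/retries fields to each record when present."""

    def format(self, record):
        base = super().format(record)
        extras = []
        for field in ("queue", "task", "job_id", "retries"):
            value = getattr(record, field, None)
            if value is not None:
                extras.append(f"{field}={value}")
        return f"{base} {' '.join(extras)}".rstrip()

handler = logging.StreamHandler()
handler.setFormatter(JobLogFormatter("%(asctime)s %(levelname)s %(message)s"))
logger = logging.getLogger("worker")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("job_started", extra={"queue": "default", "task": "send_email", "job_id": "abc123"})
```

Point the handler at a file or your log collector so the records survive worker restarts, and tracebacks logged with `logger.exception` keep the same fields.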
For broader production debugging, see Debugging Production Issues and Error Tracking with Sentry.
Common causes
Most production failures come from one of these:
- worker service is not running or is crash-looping
- wrong broker URL or broker credentials
- Redis or RabbitMQ unavailable from worker host
- worker listening to the wrong queue
- incorrect app module path in worker startup command
- missing environment variables in systemd or Docker
- web app and worker running different code versions
- task/job import errors or circular imports
- non-serializable job arguments
- jobs enqueued before DB transaction commits
- insufficient permissions for logs, files, sockets, or temp paths
- container exits immediately because command or entrypoint is wrong
- Redis memory eviction or broker-side resource exhaustion
- failed jobs accumulating silently without alerts
Systemd and service configuration checks
- Ensure ExecStart points to the virtualenv binary, not a global binary.
- Set WorkingDirectory to the live release directory.
- Load env vars with Environment= or EnvironmentFile=.
- Run the worker under a user with access to project files and logs.
- Enable restart policy with Restart=always or Restart=on-failure.
- Run systemctl daemon-reload after unit changes.
Container and Docker checks
- Confirm the worker container exists in production deployment manifests.
- Verify the worker command is not overridden by a script that exits immediately.
- Ensure app and worker containers share the same REDIS_URL, DATABASE_URL, and secrets.
- Add health checks if possible.
- Check image tags so web and worker run the same release.
If your deployment setup is unstable, review Docker Production Setup for SaaS and Environment Setup on VPS.
Common queue-specific checks
Celery
- verify broker_url
- verify result_backend
- inspect registered tasks
- ensure autodiscovery works
- confirm worker listens to the intended queues when using -Q
Commands:
celery -A app.celery_app status
celery -A app.celery_app inspect active
celery -A app.celery_app inspect registered
celery -A app.celery_app worker -l info

RQ
- verify exact queue name
- inspect failed job registries
- confirm worker can import job modules
Commands:
rq info
rq worker default

Redis-backed systems
Check:
- memory pressure
- eviction policy
- network connectivity from both web and worker
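These can be checked from the worker host with standard redis-cli commands:

```shell
# Memory usage vs limit and the active eviction policy
redis-cli info memory | grep -E 'used_memory_human|maxmemory_human|maxmemory_policy'

# If maxmemory_policy is an allkeys-* policy, queued jobs can be evicted
# silently; prefer noeviction for a Redis instance used as a job broker.

# Round-trip latency from this host (Ctrl-C to stop)
redis-cli --latency
```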
Debugging tips
Run one worker in foreground with verbose logging first.
Useful commands:
ps aux | egrep 'celery|rq|worker'
systemctl status celery
systemctl status rq-worker
journalctl -u celery -n 200 --no-pager
journalctl -u rq-worker -n 200 --no-pager
redis-cli ping
redis-cli llen default
redis-cli keys '*'
rabbitmqctl status
rabbitmqctl list_queues
celery -A app.celery_app status
celery -A app.celery_app inspect active
celery -A app.celery_app inspect registered
celery -A app.celery_app worker -l info
rq info
rq worker default
docker ps
docker logs <worker-container> --tail 200
docker exec -it <worker-container> sh
env | sort
python -c "import app; print('app import ok')"

Additional rules:
- use a dedicated smoke-test job in every environment
- do not pass model instances or request objects into jobs
- log queue names explicitly
- enqueue after transaction commit if the job depends on DB writes
- verify outbound network access and DNS if jobs call external APIs
- if auth-related jobs fail during signup or login flows, also review Common Auth Bugs and Fixes
Use this troubleshooting matrix to quickly isolate background-job failures:
| Symptom | Likely cause | Check | Fix |
|---|---|---|---|
| Jobs stay queued and never start | Worker is down or listening to the wrong queue | `ps aux \| egrep 'celery\|rq'`, then `rq info` or `celery -A app.celery_app inspect active` | Start the worker and align queue names between producer and worker |
| Worker starts then exits immediately | Bad ExecStart, wrong venv path, or import error | systemctl status celery --no-pager -l, journalctl -u celery -n 200 --no-pager | Fix systemd unit paths/module target, reload daemon, restart service |
| Worker runs but cannot consume jobs | Broker URL/credentials wrong or broker unreachable | redis-cli ping or rabbitmqctl status from worker host/container | Correct broker env vars and network routing, then restart worker |
| Only some jobs fail | Job payload is not serializable or job logic depends on missing state | Inspect traceback in worker logs; verify arguments are IDs, not ORM objects | Pass primitive IDs only, load state inside job, add guardrails and retries |
| Job fails after enqueue during create flow | Job runs before DB transaction commit | Compare enqueue timing with DB commit in logs | Enqueue in post-commit hook (transaction.on_commit or equivalent) |
| Works locally but fails in production | Web and worker run different release/env configuration | Compare image tags/revision/env across web and worker | Deploy web and worker together with shared env management |
Checklist
- ✓ worker process running
- ✓ broker reachable from app and worker
- ✓ correct queue names configured
- ✓ same environment variables on web and worker
- ✓ same release version on web and worker
- ✓ test job executes successfully
- ✓ tracebacks visible in logs
- ✓ failed jobs can be inspected
- ✓ worker restarts automatically on failure
- ✓ monitoring and alerts configured for queue backlog and worker downtime
For final hardening, use the SaaS Production Checklist.
Related guides
- Handling Background Jobs (Celery / RQ)
- Docker Production Setup for SaaS
- Environment Setup on VPS
- Debugging Production Issues
- Logging Setup (Application + Server)
FAQ
How do I know if the problem is enqueue vs worker execution?
Log immediately after enqueue and capture the task or job ID. If no ID is returned or enqueue raises an error, the problem is before the worker. If the ID exists but nothing processes it, inspect the queue and worker.
What is the most common production mistake with Celery or RQ?
The worker service is missing production environment variables or starts from the wrong working directory, so it cannot import the app or connect to Redis or RabbitMQ.
Should I run multiple workers?
Yes if queue latency matters, but first make one worker stable and observable. Add more workers only after queue names, retries, logging, and resource limits are correct.
Why are only some jobs failing?
Those jobs usually depend on unavailable files, external APIs, database state, or non-serializable arguments. Test with a minimal job and isolate the failing dependency.
Why do jobs run locally but not in production?
Usually:
- missing env vars
- wrong broker URL
- wrong working directory
- queue mismatch
- worker process not running
Why are jobs queued but never processed?
The worker may:
- listen to a different queue
- lack broker access
- crash-loop on startup
Why do jobs disappear?
Possible causes:
- Redis eviction
- TTL or expiration settings
- result backend confusion
- jobs moved to failed registries
Should the worker run on the same server as the app?
For a small SaaS, yes if resources allow. Use separate processes or services and monitor CPU and memory contention.
Final takeaway
Treat background jobs as a separate production service.
Most issues are process management or configuration problems, not queue-library bugs. Debug in this order:
- enqueue
- broker
- worker
- job execution
Make worker startup explicit, log job IDs, keep web and worker on the same release, and monitor queue backlog so failures are visible early.