Background Jobs Not Running

The essential playbook for diagnosing and fixing background jobs that are not running in your SaaS.

Background jobs fail in production for a small set of repeatable reasons: the worker is not running, the queue backend is unreachable, the worker cannot import your app, or the job was never enqueued correctly.

This page gives a fast fix path for Celery and RQ in small SaaS deployments.

Process Flow

request → enqueue → broker → worker → job execution → logs/results

Quick Fix / Quick Setup

Run these checks in order:

```bash
# 1) Verify worker process exists
ps aux | egrep 'celery|rq|worker'

# 2) Check queue backend connectivity
redis-cli ping
# or for RabbitMQ
rabbitmqctl status

# 3) Start a worker manually from the app environment
# Celery
celery -A app.celery_app worker -l info

# RQ
rq worker default

# 4) Enqueue a known test job from the app environment
# Celery
python -c "from app.jobs import test_job; print(test_job.delay())"
# RQ
python -c "from redis import Redis; from rq import Queue; from app.jobs import test_job; print(Queue('default', connection=Redis()).enqueue(test_job))"

# 5) Inspect logs immediately
journalctl -u celery -n 100 --no-pager
journalctl -u rq-worker -n 100 --no-pager
journalctl -u gunicorn -n 100 --no-pager
```

If the worker starts manually but not via systemd, the problem is usually the service unit:

  • wrong WorkingDirectory
  • wrong virtualenv path in ExecStart
  • missing environment variables
  • wrong service user permissions
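
These unit problems can be caught before restarting the service. The sketch below is a hedged helper, not part of systemd: it parses a unit file with the standard library and flags paths that do not exist on disk (the unit path and key names are the ones used in the examples later on this page).

```python
# Hedged sketch: sanity-check the filesystem paths a systemd unit references
# before restarting the service. Assumes a simple unit file layout.
import configparser
import os

def check_unit_paths(unit_file: str) -> list:
    """Return a list of problems found in a systemd unit's paths."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # systemd keys are case-sensitive
    parser.read(unit_file)
    problems = []
    service = parser["Service"]
    workdir = service.get("WorkingDirectory", "")
    if workdir and not os.path.isdir(workdir):
        problems.append(f"WorkingDirectory missing: {workdir}")
    exec_start = service.get("ExecStart", "")
    binary = exec_start.split()[0] if exec_start else ""
    if binary and not os.path.isfile(binary):
        problems.append(f"ExecStart binary missing: {binary}")
    env_file = service.get("EnvironmentFile", "").lstrip("-")
    if env_file and not os.path.isfile(env_file):
        problems.append(f"EnvironmentFile missing: {env_file}")
    return problems
```

Run it against the worker unit whenever a deploy moves the release directory or virtualenv.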

What’s happening

Your app accepts the request, but the async job is never executed or is delayed indefinitely.

The failure point is usually one of four places:

  1. enqueue
  2. broker / queue
  3. worker process
  4. job execution

A healthy setup requires all of these:

  • app can enqueue jobs
  • broker is reachable
  • worker is running with the correct code and environment
  • failures are visible in logs

Quick triage flow

  • Confirm the job is actually being queued.
  • Add a log line right after enqueue and capture the returned task or job ID.
  • Check broker health first.
  • Verify a worker process is running on the target host or container.
  • Start one worker manually in the same runtime environment.
  • Inspect failed-job registries, dead-letter queues, or Celery task state.

Step-by-step implementation

1) Verify the enqueue path

Make sure the production code path that should queue the job is actually executing.

Example:

```python
job = send_email.delay(user_id)
logger.info("job_enqueued", extra={"task": "send_email", "user_id": user_id, "job_id": str(job)})
```

For RQ:

```python
job = queue.enqueue(send_email, user_id)
logger.info("job_enqueued", extra={"task": "send_email", "user_id": user_id, "job_id": job.id})
```

If this log never appears, the problem is before the worker.

2) Add explicit enqueue logging

Log these fields every time:

  • queue name
  • task/job name
  • argument summary
  • returned task/job ID
  • request or user ID if available

Do not log secrets or full payloads.
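
One way to enforce this consistently is a single helper that every enqueue call goes through. This is a hedged sketch (the field names and truncation rule are assumptions, not a library API):

```python
# Hedged sketch: log the same structured fields for every enqueue so
# "was it queued?" is always answerable from logs.
import logging

logger = logging.getLogger(__name__)

def log_enqueue(queue, task, job_id, args_summary, request_id=None):
    """Log a structured enqueue event and return the logged fields."""
    fields = {
        "queue": queue,
        "task": task,
        "job_id": job_id,
        "arg_summary": str(args_summary)[:200],  # summary only; never full payloads
    }
    if request_id:
        fields["request_id"] = request_id
    logger.info("job_enqueued", extra=fields)
    return fields
```

Call it immediately after `delay()` or `queue.enqueue()` with the returned ID.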

3) Check broker connectivity from the worker environment

For Redis:

```bash
redis-cli ping
redis-cli llen celery             # Celery's default queue key is "celery"
redis-cli llen rq:queue:default   # RQ stores queues under rq:queue:<name>
redis-cli --scan                  # prefer SCAN over KEYS on production instances
```

For RabbitMQ:

```bash
rabbitmqctl status
rabbitmqctl list_queues
```

If these fail from the worker host or container, jobs will not run.
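
A dependency-free reachability probe is useful when the worker image does not ship `redis-cli` or `rabbitmqctl`. This sketch only checks that a TCP connection to the broker succeeds (the URL format is the usual `redis://host:port/db` style; the function names are hypothetical):

```python
# Hedged sketch: confirm the broker is reachable from the worker
# environment using only the standard library.
import socket
from urllib.parse import urlparse

def broker_endpoint(url):
    """Extract (host, port) from a broker URL, defaulting to Redis's 6379."""
    parsed = urlparse(url)
    return parsed.hostname or "localhost", parsed.port or 6379

def broker_reachable(url, timeout=2.0):
    """Return True if a TCP connection to the broker succeeds."""
    host, port = broker_endpoint(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A TCP connect does not prove credentials are correct, but it separates network problems from auth problems quickly.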

4) Validate worker startup command

Make sure the command matches the app structure and queue names.

Celery examples:

```bash
celery -A app.celery_app worker -l info
celery -A app.celery_app worker -l info -Q default
celery -A app.celery_app inspect registered
celery -A app.celery_app inspect active
celery -A app.celery_app status
```

RQ examples:

```bash
rq info
rq worker default
rq worker high default low
```

Check:

  • correct app module
  • correct queue names
  • correct concurrency settings
  • correct environment variables
  • correct working directory

5) Confirm web and worker use the same code version

A common production issue is:

  • web container/server deployed with new code
  • worker still running old code

Verify the deployed revision, image tag, or release path for both processes.

Useful checks:

```bash
docker ps
docker logs <worker-container> --tail 200
docker exec -it <worker-container> sh
# then, inside the container:
env | sort
python -c "import app; print('app import ok')"
```
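
To make version drift visible without shelling into containers, both processes can log the deployed revision at startup. This is a hedged sketch; `RELEASE_SHA` is a hypothetical environment variable your deploy pipeline would set (for example, to the git commit):

```python
# Hedged sketch: log the deployed revision in both web and worker so
# version drift shows up in logs, not just in docker inspect.
import logging
import os

logger = logging.getLogger(__name__)

def log_release(service_name):
    """Log and return the release identifier for this process."""
    release = os.environ.get("RELEASE_SHA", "unknown")
    logger.info("process_started",
                extra={"service": service_name, "release": release})
    return release
```

Call `log_release("web")` and `log_release("worker")` at startup; mismatched values in logs mean the two are on different releases.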

6) Inspect process manager config

systemd example: Celery

```ini
[Unit]
Description=Celery Worker
After=network.target

[Service]
User=deploy
Group=deploy
WorkingDirectory=/var/www/myapp/current
EnvironmentFile=/var/www/myapp/shared/.env
ExecStart=/var/www/myapp/venv/bin/celery -A app.celery_app worker -l info -Q default
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

systemd example: RQ

```ini
[Unit]
Description=RQ Worker
After=network.target

[Service]
User=deploy
Group=deploy
WorkingDirectory=/var/www/myapp/current
EnvironmentFile=/var/www/myapp/shared/.env
ExecStart=/var/www/myapp/venv/bin/rq worker default
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

After editing:

```bash
sudo systemctl daemon-reload
sudo systemctl restart celery
sudo systemctl restart rq-worker
sudo systemctl status celery
sudo systemctl status rq-worker
```

Check for:

  • wrong ExecStart
  • wrong WorkingDirectory
  • missing EnvironmentFile
  • wrong user
  • no restart policy

7) Run a minimal smoke-test job

Use a job that only logs a line.

Celery:

```python
from celery import shared_task
import logging

logger = logging.getLogger(__name__)

@shared_task
def smoke_test():
    logger.info("celery_smoke_test_ok")
    return "ok"
```

RQ:

```python
import logging

logger = logging.getLogger(__name__)

def smoke_test():
    logger.info("rq_smoke_test_ok")
    return "ok"
```

Enqueue it manually and verify the log appears.

If the smoke test works, the worker is healthy and the problem is inside the real job code.
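
When the problem is inside real job code, a thin wrapper guarantees that every failure leaves a traceback in the worker log. This is a hedged pattern sketch (the decorator name is hypothetical); Celery and RQ already log unhandled exceptions, but the wrapper adds your own structured fields:

```python
# Hedged sketch: wrap job functions so every failure is logged with a
# traceback and the job name before the exception propagates.
import functools
import logging

logger = logging.getLogger(__name__)

def logged_job(func):
    """Decorator: log a traceback for any exception raised by the job."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            logger.exception("job_failed", extra={"job": func.__name__})
            raise  # re-raise so retry/failed-registry behavior is unchanged
    return wrapper
```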

8) Check imports and dependencies

Production-only worker failures often come from:

  • missing package in production image
  • circular import
  • incorrect module path
  • environment-specific import side effects

Test imports directly:

```bash
python -c "import app; print('app import ok')"
python -c "from app.jobs import smoke_test; print('job import ok')"
```

If manual worker startup fails with import errors, fix imports before anything else.

9) Check queue selection

If jobs are enqueued to high but the worker only listens to default, nothing will process.

Celery worker on specific queues:

```bash
celery -A app.celery_app worker -l info -Q high,default
```

RQ worker on specific queues:

```bash
rq worker high default
```

Make queue names explicit in both producer and worker configuration.
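
One way to keep producer and worker from drifting is a single routing map that both read. This is a hedged sketch with hypothetical names, not a Celery or RQ API:

```python
# Hedged sketch: declare task-to-queue routing in one module so the
# producer and the worker startup command cannot disagree.
TASK_QUEUES = {
    "send_email": "high",
    "generate_report": "low",
}
DEFAULT_QUEUE = "default"

def queue_for(task_name):
    """Resolve the queue a task should be enqueued to."""
    return TASK_QUEUES.get(task_name, DEFAULT_QUEUE)

# The worker must listen to every queue named here:
WORKER_QUEUES = sorted(set(TASK_QUEUES.values()) | {DEFAULT_QUEUE})
```

Generate the worker's `-Q` argument (or RQ's queue list) from `WORKER_QUEUES` so a new queue name cannot be forgotten on the worker side.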

10) Check serialization and payload size

Do not pass:

  • ORM model instances
  • request objects
  • open file handles
  • non-JSON-serializable objects

Prefer:

```python
send_email.delay(user_id=user.id)
```

Instead of:

```python
send_email.delay(user)
```

Fetch state inside the job.
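
A cheap guardrail is to reject unserializable arguments at enqueue time, where the traceback points at the caller, instead of failing inside the worker. A hedged sketch (the helper name is hypothetical; it assumes a JSON serializer, which is Celery's default):

```python
# Hedged sketch: fail fast at enqueue time if a job argument will not
# survive JSON serialization.
import json

def assert_serializable(**kwargs):
    """Raise TypeError early for any non-JSON-serializable job argument."""
    for name, value in kwargs.items():
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            raise TypeError(
                f"job argument {name!r} is not JSON-serializable: "
                f"{type(value).__name__}")
    return kwargs
```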

11) Enqueue after transaction commit

If the job depends on a database row created in the request, enqueue after commit. Otherwise the worker may run before data is visible.

Pattern:

```python
from django.db import transaction

transaction.on_commit(lambda: send_email.delay(user.id))
```

Equivalent patterns exist in other frameworks using post-commit hooks.
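
For a framework without a built-in hook, the pattern is small enough to sketch directly. All names here are hypothetical; the point is that enqueue callbacks only fire after the commit succeeds:

```python
# Minimal sketch of a post-commit hook: callbacks registered during the
# transaction run only after commit, so the worker sees committed data.
class Transaction:
    def __init__(self):
        self._callbacks = []
        self.committed = False

    def on_commit(self, callback):
        """Defer work (e.g. enqueueing a job) until the commit succeeds."""
        self._callbacks.append(callback)

    def commit(self):
        # ... flush database writes here ...
        self.committed = True
        for callback in self._callbacks:
            callback()  # safe: data is now visible to the worker
```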

12) Make worker logging persistent

Worker logs should include:

  • timestamp
  • queue
  • task/job name
  • task/job ID
  • traceback
  • retry count if relevant

If you only log to stdout without collection, failures may disappear after restart.
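
A minimal worker logging setup that carries these fields might look like the following (a hedged sketch using the standard `logging` module; the format string is an assumption, adapt it to your log collector):

```python
# Hedged sketch: configure worker logging once at startup so every line
# carries a timestamp, level, and logger name, and tracebacks are kept.
import logging
import sys

def configure_worker_logging(level=logging.INFO):
    """Install a single stdout handler with a structured-ish format."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s %(message)s"))
    root = logging.getLogger()
    root.handlers = [handler]  # replace defaults to avoid duplicate lines
    root.setLevel(level)
    return root
```

Ship stdout to a collector (journald, Docker logging driver, or a hosted service) so logs survive restarts.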

For broader production debugging, see Debugging Production Issues and Error Tracking with Sentry.

Common causes

Most production failures come from one of these:

  • worker service is not running or is crash-looping
  • wrong broker URL or broker credentials
  • Redis or RabbitMQ unavailable from worker host
  • worker listening to the wrong queue
  • incorrect app module path in worker startup command
  • missing environment variables in systemd or Docker
  • web app and worker running different code versions
  • task/job import errors or circular imports
  • non-serializable job arguments
  • jobs enqueued before DB transaction commits
  • insufficient permissions for logs, files, sockets, or temp paths
  • container exits immediately because command or entrypoint is wrong
  • Redis memory eviction or broker-side resource exhaustion
  • failed jobs accumulating silently without alerts

Systemd and service configuration checks

  • Ensure ExecStart points to the virtualenv binary, not a global binary.
  • Set WorkingDirectory to the live release directory.
  • Load env vars with Environment= or EnvironmentFile=.
  • Run the worker under a user with access to project files and logs.
  • Enable restart policy with Restart=always or Restart=on-failure.
  • Run systemctl daemon-reload after unit changes.

Container and Docker checks

  • Confirm the worker container exists in production deployment manifests.
  • Verify the worker command is not overridden by a script that exits immediately.
  • Ensure app and worker containers share the same REDIS_URL, DATABASE_URL, and secrets.
  • Add health checks if possible.
  • Check image tags so web and worker run the same release.

If your deployment setup is unstable, review Docker Production Setup for SaaS and Environment Setup on VPS.

Common queue-specific checks

Celery

  • verify broker_url
  • verify result_backend
  • inspect registered tasks
  • ensure autodiscovery works
  • confirm worker listens to the intended queues when using -Q

Commands:

```bash
celery -A app.celery_app status
celery -A app.celery_app inspect active
celery -A app.celery_app inspect registered
celery -A app.celery_app worker -l info
```

RQ

  • verify exact queue name
  • inspect failed job registries
  • confirm worker can import job modules

Commands:

```bash
rq info
rq worker default
```

Redis-backed systems

Check:

  • memory pressure
  • eviction policy
  • network connectivity from both web and worker
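
The output of Redis's `INFO` command is plain `key:value` text, so a small parser can turn it into an eviction-risk check. A hedged sketch (the 90% threshold is an arbitrary assumption; tune it for your instance):

```python
# Hedged sketch: parse `redis-cli info` output and flag likely eviction
# risk when maxmemory is set and nearly exhausted.
def parse_redis_info(raw):
    """Parse Redis INFO text (key:value lines, '#' comments) into a dict."""
    info = {}
    for line in raw.splitlines():
        if line and not line.startswith("#") and ":" in line:
            key, _, value = line.partition(":")
            info[key] = value
    return info

def memory_pressure(info):
    """True when maxmemory is configured and >= 90% used."""
    maxmemory = int(info.get("maxmemory", 0))
    used = int(info.get("used_memory", 0))
    return maxmemory > 0 and used >= 0.9 * maxmemory
```

Also check `maxmemory_policy` in the same output: any `allkeys-*` policy can silently evict queued jobs.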

Debugging tips

Run one worker in foreground with verbose logging first.

Useful commands:

```bash
ps aux | egrep 'celery|rq|worker'
systemctl status celery
systemctl status rq-worker
journalctl -u celery -n 200 --no-pager
journalctl -u rq-worker -n 200 --no-pager
redis-cli ping
redis-cli llen celery             # Celery's default queue key is "celery"
redis-cli llen rq:queue:default   # RQ stores queues under rq:queue:<name>
redis-cli --scan                  # prefer SCAN over KEYS on production instances
rabbitmqctl status
rabbitmqctl list_queues
celery -A app.celery_app status
celery -A app.celery_app inspect active
celery -A app.celery_app inspect registered
celery -A app.celery_app worker -l info
rq info
rq worker default
docker ps
docker logs <worker-container> --tail 200
docker exec -it <worker-container> sh
# then, inside the container:
env | sort
python -c "import app; print('app import ok')"
```

Additional rules:

  • use a dedicated smoke-test job in every environment
  • do not pass model instances or request objects into jobs
  • log queue names explicitly
  • enqueue after transaction commit if the job depends on DB writes
  • verify outbound network access and DNS if jobs call external APIs
  • if auth-related jobs fail during signup or login flows, also review Common Auth Bugs and Fixes

Use this troubleshooting matrix to quickly isolate background-job failures:

| Symptom | Likely cause | Check | Fix |
|---|---|---|---|
| Jobs stay queued and never start | Worker is down or listening to the wrong queue | `ps aux \| egrep 'celery\|rq'`; compare enqueued queue names with worker `-Q`/queue arguments | Start or restart the worker and make it listen to the queues jobs are actually enqueued to |
| Worker starts then exits immediately | Bad ExecStart, wrong venv path, or import error | `systemctl status celery --no-pager -l`, `journalctl -u celery -n 200 --no-pager` | Fix systemd unit paths/module target, reload daemon, restart service |
| Worker runs but cannot consume jobs | Broker URL/credentials wrong or broker unreachable | `redis-cli ping` or `rabbitmqctl status` from worker host/container | Correct broker env vars and network routing, then restart worker |
| Only some jobs fail | Job payload is not serializable or job logic depends on missing state | Inspect traceback in worker logs; verify arguments are IDs, not ORM objects | Pass primitive IDs only, load state inside job, add guardrails and retries |
| Job fails after enqueue during create flow | Job runs before DB transaction commits | Compare enqueue timing with DB commit in logs | Enqueue in a post-commit hook (`transaction.on_commit` or equivalent) |
| Works locally but fails in production | Web and worker run different release/env configuration | Compare image tags/revision/env across web and worker | Deploy web and worker together with shared env management |

Checklist

  • worker process running
  • broker reachable from app and worker
  • correct queue names configured
  • same environment variables on web and worker
  • same release version on web and worker
  • test job executes successfully
  • tracebacks visible in logs
  • failed jobs can be inspected
  • worker restarts automatically on failure
  • monitoring and alerts configured for queue backlog and worker downtime

For final hardening, use the SaaS Production Checklist.

FAQ

How do I know if the problem is enqueue vs worker execution?

Log immediately after enqueue and capture the task or job ID. If no ID is returned or enqueue raises an error, the problem is before the worker. If the ID exists but nothing processes it, inspect the queue and worker.

What is the most common production mistake with Celery or RQ?

The worker service is missing production environment variables or starts from the wrong working directory, so it cannot import the app or connect to Redis or RabbitMQ.

Should I run multiple workers?

Yes if queue latency matters, but first make one worker stable and observable. Add more workers only after queue names, retries, logging, and resource limits are correct.

Why are only some jobs failing?

Those jobs usually depend on unavailable files, external APIs, database state, or non-serializable arguments. Test with a minimal job and isolate the failing dependency.

Why do jobs run locally but not in production?

Usually:

  • missing env vars
  • wrong broker URL
  • wrong working directory
  • queue mismatch
  • worker process not running

Why are jobs queued but never processed?

The worker may:

  • listen to a different queue
  • lack broker access
  • crash-loop on startup

Why do jobs disappear?

Possible causes:

  • Redis eviction
  • TTL or expiration settings
  • result backend confusion
  • jobs moved to failed registries

Should the worker run on the same server as the app?

For a small SaaS, yes if resources allow. Use separate processes or services and monitor CPU and memory contention.

Final takeaway

Treat background jobs as a separate production service.

Most issues are process management or configuration problems, not queue-library bugs. Debug in this order:

  1. enqueue
  2. broker
  3. worker
  4. job execution

Make worker startup explicit, log job IDs, keep web and worker on the same release, and monitor queue backlog so failures are visible early.