API Monitoring and Rate Limits

The essential playbook for implementing API monitoring and rate limits in your SaaS.

This page covers the minimum production setup for monitoring API health and enforcing rate limits. The goal is to track request volume, latency, status codes, and abuse patterns, then apply rate limits that protect your app without blocking normal users.

Quick Fix / Quick Setup

Start with a working baseline:

  • /health endpoint for uptime checks
  • request logging on every API call
  • Prometheus-compatible metrics
  • global rate limit
  • stricter limits on auth and expensive routes
  • Redis-backed counters if you run multiple instances

FastAPI example:

bash
pip install slowapi prometheus-fastapi-instrumentator redis
python
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

limiter = Limiter(
    key_func=get_remote_address,
    default_limits=["60/minute"]
)

app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

Instrumentator().instrument(app).expose(app)

@app.get("/health")
def health():
    return {"ok": True}

@app.get("/api/data")
@limiter.limit("30/minute")
def get_data(request: Request):
    return {"status": "ok"}

Minimum production notes:

  • use per-IP limits first
  • expose /metrics
  • add a dedicated /health
  • switch to Redis-backed shared rate limits if traffic goes through multiple app instances or containers

Basic request logging middleware example:

python
import time
import uuid
import logging
from fastapi import Request

logger = logging.getLogger("api")

@app.middleware("http")
async def log_requests(request: Request, call_next):
    request_id = str(uuid.uuid4())
    start = time.time()

    response = None
    try:
        response = await call_next(request)
        return response
    finally:
        duration_ms = round((time.time() - start) * 1000, 2)
        logger.info(
            "api_request",
            extra={
                "request_id": request_id,
                "method": request.method,
                "path": request.url.path,
                "query": str(request.url.query),
                "status_code": getattr(response, "status_code", 500),
                "duration_ms": duration_ms,
                "client_ip": request.client.host if request.client else None,
                "user_agent": request.headers.get("user-agent"),
            },
        )
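One caveat for the middleware above: stdlib logging formatters ignore `extra` fields unless you configure a formatter that serializes them. A minimal JSON formatter sketch (field names assume the middleware example; libraries like python-json-logger do the same job):

```python
import json
import logging

# Fields the middleware passes via `extra` (names assumed to match the example above).
REQUEST_FIELDS = (
    "request_id", "method", "path", "query", "status_code",
    "duration_ms", "client_ip", "user_agent",
)

class JsonFormatter(logging.Formatter):
    """Serialize the log message plus any request fields as one JSON line."""

    def format(self, record):
        payload = {"event": record.getMessage(), "level": record.levelname}
        for field in REQUEST_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

With this in place, each request becomes one machine-parsable JSON line, which is what log aggregators and dashboards expect.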

Recommended first alerts:

  • 5xx error rate spike
  • p95 latency spike
  • health check failures
  • sudden increase in 429 responses
Request flow: client -> reverse proxy -> app -> Redis limiter -> metrics/logging -> dashboard/alerts

What’s happening

Without API monitoring, you do not see rising error rates, slow endpoints, burst traffic, or abusive clients until users report problems.

Without rate limits, one client, bot, or buggy script can consume worker capacity, increase database load, and degrade service for everyone.

Basic health checks are not enough. You need endpoint-level visibility:

  • request count
  • status code distribution
  • latency percentiles
  • 429 rate-limit responses
  • top endpoints by traffic
  • top clients by request volume

Rate limiting must match your architecture:

  • in-memory counters are acceptable only for a single process or single instance
  • multiple instances require shared state such as Redis or an API gateway
  • proxy-aware client IP extraction must be configured correctly, or all traffic may collapse into one bucket
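The last point can be sketched as a small helper: honor X-Forwarded-For only when the direct peer is a proxy you control (the trusted-proxy address below is a placeholder; substitute your own):

```python
import ipaddress

# Proxies we control (assumption: replace with your real proxy addresses).
TRUSTED_PROXIES = {ipaddress.ip_address("10.0.0.5")}

def client_ip(peer_ip: str, x_forwarded_for: str = "") -> str:
    """Return the rate-limit IP, honoring X-Forwarded-For only from trusted proxies."""
    peer = ipaddress.ip_address(peer_ip)
    if peer in TRUSTED_PROXIES and x_forwarded_for:
        # With one trusted hop, the last entry is the address that connected
        # to the proxy, i.e. the real client.
        return x_forwarded_for.split(",")[-1].strip()
    # Direct connection or untrusted peer: the header may be spoofed,
    # so fall back to the socket address.
    return peer_ip
```

If this logic is wrong in either direction, you get one of the two failure modes above: all traffic collapsing into the proxy's IP bucket, or clients spoofing their way around limits.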

Step-by-step implementation

1. Define what to measure

Track these first:

  • requests per minute
  • status code counts: 2xx, 4xx, 5xx
  • p50, p95, and p99 latency
  • rate-limit hits
  • top endpoints by volume
  • top failing endpoints
  • active workers or process saturation
  • upstream dependency latency if your API calls other services

A practical minimum dashboard should answer:

  • Is the API up?
  • Which endpoints are slow?
  • Are 5xx errors increasing?
  • Are clients hitting rate limits?
  • Is one client causing burst load?

2. Add request logging middleware

Every request log should include:

  • request_id
  • user_id if authenticated
  • api_key_id or token identifier if applicable
  • client_ip
  • method
  • path
  • status_code
  • duration_ms
  • user_agent

Example with user ID support:

python
@app.middleware("http")
async def log_requests(request: Request, call_next):
    start = time.time()
    request_id = str(uuid.uuid4())
    request.state.request_id = request_id

    response = None
    try:
        response = await call_next(request)
        return response
    finally:
        duration_ms = round((time.time() - start) * 1000, 2)
        user_id = getattr(getattr(request, "state", None), "user_id", None)

        logger.info(
            "api_request",
            extra={
                "request_id": request_id,
                "user_id": user_id,
                "client_ip": request.client.host if request.client else None,
                "method": request.method,
                "path": request.url.path,
                "status_code": getattr(response, "status_code", 500),
                "duration_ms": duration_ms,
                "user_agent": request.headers.get("user-agent"),
            },
        )

If you are behind Nginx or a load balancer, make sure forwarded IP headers are set correctly.

Nginx example:

nginx
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;

If you use trusted-proxy handling, trust only known proxies. Clients can set X-Forwarded-For themselves, so never trust the header from arbitrary public traffic.

3. Add API metrics collection

Expose metrics via /metrics or send them to your APM.

FastAPI + Prometheus:

python
from prometheus_fastapi_instrumentator import Instrumentator

Instrumentator().instrument(app).expose(app, endpoint="/metrics")

Metrics to capture:

  • request total by method/path/status
  • latency histogram by route
  • 429 responses
  • exception count
  • worker/process stats if available

If route labels are too granular, normalize paths. Avoid high-cardinality labels like raw user IDs or full URLs with IDs.
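A minimal normalization sketch (the ID and UUID patterns are assumptions; frameworks that expose route templates make this unnecessary):

```python
import re

# Replace path segments that look like UUIDs or numeric IDs with placeholders,
# so per-resource URLs share one metric label.
_UUID = re.compile(
    r"/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
_NUM = re.compile(r"/\d+(?=/|$)")

def normalize_path(path: str) -> str:
    """Collapse IDs so /users/42/orders/9 and /users/7/orders/1 share one label."""
    path = _UUID.sub("/{uuid}", path)
    return _NUM.sub("/{id}", path)
```

Without this, every user or order ID becomes its own time series, which bloats Prometheus memory and makes dashboards useless.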

4. Create dashboards

Build separate dashboards for:

Throughput

  • requests per minute
  • requests by endpoint
  • requests by status family

Errors

  • 5xx rate
  • 4xx rate
  • top failing routes
  • exception counts

Latency

  • p50/p95/p99 by endpoint
  • slowest endpoints
  • latency after deploys

Abuse and limits

  • total 429s
  • 429s by route
  • top IPs or API keys by request volume
  • burst traffic patterns


5. Add health and readiness checks

Use at least one lightweight health endpoint:

python
@app.get("/health")
def health():
    return {"ok": True}

If needed, add a deeper readiness check:

python
@app.get("/ready")
def ready():
    # verify DB, Redis, queue, or upstream dependencies
    return {"ready": True}

Use /health for simple uptime checks. Use /ready when you want deployment systems or orchestrators to verify dependencies.

Do not make your public uptime probe depend on every downstream service unless you intentionally want full dependency failure to mark the app unhealthy.
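The readiness logic above can be sketched as an aggregator over named dependency checks; the check callables are placeholders you wire to your own DB, Redis, or queue clients:

```python
def readiness(checks: dict) -> dict:
    """Run each named dependency check; a check passes unless it raises."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    ready = all(v == "ok" for v in results.values())
    return {"ready": ready, "checks": results}
```

In the /ready route, pass real checks (e.g. a DB `SELECT 1` or a Redis `PING`) and return a 503 status when `ready` is false so orchestrators stop routing traffic to the instance.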

6. Implement baseline rate limits

Start with a global limit, then override sensitive routes.

Example:

  • global: 60/minute per IP
  • login: 5/minute per IP
  • signup: 5/minute per IP
  • password reset: 3/minute per IP
  • expensive search/report route: 10/minute per user or API key

FastAPI route-specific example:

python
@app.post("/auth/login")
@limiter.limit("5/minute")
def login(request: Request):
    return {"ok": True}

@app.post("/auth/password-reset")
@limiter.limit("3/minute")
def password_reset(request: Request):
    return {"ok": True}

@app.get("/api/search")
@limiter.limit("10/minute")
def search(request: Request):
    return {"results": []}

Protect these more aggressively:

  • login
  • signup
  • password reset
  • magic link
  • OTP verification
  • search
  • export/report generation
  • AI or compute-heavy routes
  • public API endpoints

7. Use Redis for shared counters in multi-instance deployments

If you run:

  • multiple containers
  • multiple app replicas
  • autoscaling workers
  • load-balanced traffic

do not use local in-memory counters.

Use Redis-backed rate limiting or your API gateway.
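Under the hood, most Redis limiters use a fixed-window counter: one atomic INCR per (key, window), plus an EXPIRE so buckets clean themselves up. A sketch of that pattern with the client injected, so the same logic works against redis-py's `incr`/`expire` (the key format and injectable clock are assumptions for illustration):

```python
import time

def allow_request(client, key: str, limit: int, window_s: int, now=time.time) -> bool:
    """Fixed-window counter: shared across instances when `client` is Redis."""
    bucket = f"rate:{key}:{int(now() // window_s)}"
    count = client.incr(bucket)          # atomic increment, shared by all instances
    if count == 1:
        client.expire(bucket, window_s)  # first hit in the window starts its TTL
    return count <= limit
```

In practice you rarely hand-roll this; slowapi and similar libraries can typically be pointed at Redis via a storage URI (check the library's docs), but knowing the pattern makes the `redis-cli KEYS '*rate*'` output below readable.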

Redis checks:

bash
redis-cli PING
redis-cli KEYS '*rate*'
redis-cli MONITOR

Operational rule:

  • one instance: in-process can work temporarily
  • more than one instance: shared backend required

8. Return proper 429 responses

Clients need a predictable error response.

Example response:

http
HTTP/1.1 429 Too Many Requests
Retry-After: 60
Content-Type: application/json
json
{
  "error": "rate_limit_exceeded",
  "message": "Too many requests. Retry later."
}

If supported by your stack, include:

  • X-RateLimit-Limit
  • X-RateLimit-Remaining
  • Retry-After

This helps frontend and API consumers back off correctly.
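A sketch of assembling that response in one place, so every limited route returns the same shape (the X-RateLimit-* header names follow a common convention but are not standardized; Retry-After is):

```python
def too_many_requests(limit: int, retry_after_s: int) -> tuple:
    """Build a consistent 429 response: status, headers, and JSON body."""
    headers = {
        "Retry-After": str(retry_after_s),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "Content-Type": "application/json",
    }
    body = {
        "error": "rate_limit_exceeded",
        "message": "Too many requests. Retry later.",
    }
    return 429, headers, body
```

Centralizing this means a client can implement one backoff handler instead of parsing a different error shape per endpoint.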

9. Handle exemptions deliberately

Do not accidentally exempt routes.

Common cases:

  • health checks
  • internal admin routes
  • webhook endpoints
  • trusted internal automation

Webhook endpoints can be exempted, but only if you also have:

  • signature verification
  • idempotency handling
  • logging
  • replay protection where relevant
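Signature verification is usually an HMAC over the raw request body, compared in constant time. A generic sketch (the secret, hash algorithm, and header format vary by provider, so check their documentation):

```python
import hashlib
import hmac

def verify_signature(secret: bytes, payload: bytes, signature_hex: str) -> bool:
    """Compare the expected HMAC-SHA256 of the raw body against the header value."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking match position through timing.
    return hmac.compare_digest(expected, signature_hex)
```

Always verify against the raw bytes as received; re-serializing parsed JSON before hashing is a common cause of false signature failures.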

If you need help debugging webhook behavior, see Payment Webhooks Failing.

10. Add alerts

Start with these:

  • health check failures
  • 5xx error rate spike
  • p95 latency above threshold
  • sudden jump in 429s
  • sudden traffic spike from a single route or token

Practical alert examples:

  • 5xx > 2% for 5 minutes
  • p95 latency > 1000ms for 10 minutes
  • 429 count > baseline for 10 minutes
  • /health failing from 2 regions

Use alerts that map to action. Avoid noisy alerts with no owner.

For alert setup patterns, see Alerting System Setup.
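The "5xx > 2% for 5 minutes" rule reduces to a windowed ratio check. A sketch, assuming you already have request and 5xx counts for the window (the zero-traffic behavior is a design choice, not a rule):

```python
def should_alert_5xx(total_requests: int, server_errors: int,
                     threshold: float = 0.02) -> bool:
    """Fire when the 5xx share of traffic in the window exceeds the threshold."""
    if total_requests == 0:
        # No traffic: let the health-check alert cover a total outage instead.
        return False
    return server_errors / total_requests > threshold
```

Evaluating the ratio over a sustained window (rather than per scrape) is what keeps a single failed request on a quiet night from paging anyone.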

11. Review and tune limits with real data

Do not guess forever. Review weekly:

  • top routes by traffic
  • top 429-producing routes
  • clients frequently near limit
  • shared-IP false positives
  • frontend retry loops or polling

If real users behind NAT or office networks hit limits, move some rules from IP-only to layered keys:

  • per IP
  • per user ID
  • per API key
  • per route
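A layered key function can simply prefer the most specific stable identity available (attribute names here are assumptions; combine the result with the route path for per-route buckets):

```python
def rate_limit_key(api_key_id=None, user_id=None, client_ip=None) -> str:
    """Prefer API key, then user ID, then IP, so shared IPs aren't punished."""
    if api_key_id:
        return f"key:{api_key_id}"
    if user_id:
        return f"user:{user_id}"
    return f"ip:{client_ip or 'unknown'}"
```

This way an office full of users behind one NAT each get their own bucket once authenticated, while anonymous traffic still falls back to per-IP limits.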

12. Document your policy

Document:

  • global default limits
  • route-specific overrides
  • 429 response behavior
  • retry guidance
  • authenticated vs anonymous policy
  • webhook exemptions
  • support path for higher API limits

If you expose auth endpoints publicly, combine this page with your auth hardening work.

Common causes

  • No request-level metrics or logs, so API failures are only noticed through support tickets
  • Rate limiting stored in local memory while running multiple app instances, causing inconsistent enforcement
  • Reverse proxy headers not configured correctly, so all requests appear from one IP
  • Frontend polling, retries, or websocket fallback logic causing accidental request floods
  • No per-route overrides, so login or search endpoints are either unprotected or overly restricted
  • 429 responses returned without Retry-After or clear body, making clients retry incorrectly
  • Webhook endpoints accidentally rate-limited, causing delivery failures from payment providers
  • Metrics middleware registered incorrectly or after exception handlers, so failed requests are not counted

Debugging tips

Run these first:

bash
curl -i https://yourapp.com/health
curl -i https://yourapp.com/metrics
curl -I https://yourapp.com/api/data

Trigger the limiter:

bash
for i in {1..70}; do
  curl -s -o /dev/null -w "%{http_code}\n" https://yourapp.com/api/data
done

Inspect Redis-backed counters:

bash
redis-cli KEYS '*rate*'
redis-cli MONITOR

Check app logs:

bash
journalctl -u gunicorn -n 200 --no-pager
docker logs <app_container> --tail 200

Check proxy forwarding config:

bash
nginx -T | grep -i 'real_ip\|proxy_set_header\|x-forwarded-for'

Test forwarded IP behavior carefully:

bash
curl -H 'X-Forwarded-For: 1.2.3.4' -i https://yourapp.com/api/data

Generate load:

bash
ab -n 200 -c 20 https://yourapp.com/api/data
hey -n 500 -c 50 https://yourapp.com/api/data

What to look for:

  • all requests using the same IP bucket
  • 429s not appearing in metrics
  • failed requests missing from logs
  • high latency with low request volume
  • route-specific decorators not being applied
  • Redis not being reached by all app instances

If metrics look low but users report slowness, check:

  • database latency
  • upstream API latency
  • worker saturation
  • queue depth
  • connection pool exhaustion

If you need broader logging coverage, see Logging Setup (Application + Server).

If you need dashboard and latency guidance, see Metrics and Performance Monitoring.

If you need incident handling after alerts fire, see Incident Response Playbook.

Checklist

  • Health endpoint exists and is monitored
  • Request logs include request ID, IP, user ID, path, status, and duration
  • Metrics endpoint is exposed and scraped
  • Dashboard shows request rate, error rate, and latency percentiles
  • Alerts exist for 5xx spikes, latency spikes, and 429 spikes
  • Default global rate limit is configured
  • Sensitive endpoints have stricter custom limits
  • Shared backend like Redis is used for multi-instance deployments
  • Proxy headers are trusted only from known proxies
  • 429 responses include useful headers or error details
  • Webhook and internal routes are handled deliberately, not accidentally exempted
  • Rate limits are documented for frontend and API clients

Product CTA

Use a reusable production setup to standardize:

  • health checks
  • request logging
  • metrics
  • alerting
  • shared rate limiting

This avoids ad hoc middleware and inconsistent abuse protection across endpoints. The goal is one repeatable API baseline for every MVP and small SaaS service.

FAQ

Should I rate-limit authenticated and anonymous traffic differently?

Yes. Anonymous traffic is usually best limited by IP. Authenticated traffic should also use user ID or API key to avoid punishing shared IPs and to stop one account from abusing the API.

Do I need Redis for rate limiting?

If you run a single instance, in-process limits can work temporarily. If you run multiple instances, containers, or autoscaling, use Redis or a gateway-level limiter so counters are shared.

What should I alert on first?

Start with 5xx error rate, p95 latency, uptime failures, and sudden increases in 429 responses. These catch most API incidents early.

Can I exclude webhook endpoints from rate limits?

Yes, but do it intentionally. Payment or third-party webhooks should be protected with signature verification and logging if they are exempted from normal limits.

Should rate limiting happen in the app or at the reverse proxy?

For small SaaS, app-level is easier to start. For higher traffic or multi-service APIs, use gateway or proxy limits plus app-level limits on sensitive routes.

What is a safe default limit?

Start with conservative defaults like 60 requests/minute per IP for general endpoints, then tune from production traffic patterns.

Why am I seeing too many 429s from legitimate users?

Common causes are frontend retry loops, polling too frequently, office/shared IPs, and limits applied to static or health routes by mistake.

Final takeaway

Minimum viable API monitoring is:

  • request logs
  • latency and error dashboards
  • health checks
  • actionable alerts

Minimum viable rate limiting is:

  • one global default
  • stricter limits on auth and expensive endpoints
  • shared storage or gateway enforcement for multi-instance deployments

Before trusting your numbers, verify:

  • proxy/IP handling
  • metrics coverage
  • failed request logging
  • 429 response behavior

Treat 429 spikes and latency increases as production signals. Tune both client behavior and server capacity.