Production ML systems
Scaling a Recommendation API to ~1,000 Requests per Second
Real-Time ML Recommendation Engine · FastAPI, Redis, PostgreSQL, Locust, Python
The problem
Serving personalized recommendations by recomputing heavy collaborative and content scores on every request does not survive bursty traffic. The goal was a read path that stays fast and stable while training, ingestion, and retrain jobs run on their own schedule—closer to how product teams ship recommenders in practice.
Architecture
Ratings and catalog data live in PostgreSQL. Ingest and offline/nightly retrain jobs read that ground truth, fit models (the hybrid collaborative + content scoring that powers the online stack), and materialize ranked lists into Redis. The FastAPI app serves GET /recommendations/{user_id} from precomputed lists and cache-friendly paths, so the hot read path never recomputes full model scores per request.
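To make the split concrete, here is a minimal sketch of both halves: the retrain job writes a ranked list into Redis, and the FastAPI endpoint serves it with a single in-memory lookup. The key scheme (recs:{user_id}), the TTL, and the function names are assumptions for illustration, not the repo's exact code.

```python
import json

import redis.asyncio as redis
from fastapi import FastAPI, HTTPException

app = FastAPI()
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)


# Offline side (runs inside the nightly retrain job): write the ranked list once.
async def materialize_recommendations(user_id: int, ranked_item_ids: list[int]) -> None:
    # Store the precomputed ranking as a JSON blob with a TTL so stale lists
    # age out between retrains.
    await cache.set(f"recs:{user_id}", json.dumps(ranked_item_ids), ex=24 * 3600)


# Online side: the hot read path is a single Redis lookup, no model math.
@app.get("/recommendations/{user_id}")
async def get_recommendations(user_id: int):
    cached = await cache.get(f"recs:{user_id}")
    if cached is None:
        # Cold users or expired keys fall back elsewhere; here we just signal a miss.
        raise HTTPException(status_code=404, detail="no precomputed recommendations")
    return {"user_id": user_id, "items": json.loads(cached)}
```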
What I shipped
- Two-stage pattern: offline training / retrain decoupled from the always-on read API.
- Redis-backed materialization so recommendation retrieval stays an in-memory lookup on the hot path.
- Read-heavy Locust runs: ~1,000 RPS, p95 ~56 ms, 0 failures over more than a million requests (local benchmark; see repo docs).
- A/B harness with deterministic hash bucketing and a two-proportion z-test for CTR significance (see app/services/statsig.py in the repo); a sketch of the approach follows this list.
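The snippet below shows the general shape of that harness: hash a (experiment, user) pair so assignments are stable without storing state, then run a pooled two-proportion z-test on per-variant click and impression counts. The function names, variant labels, and key format are illustrative assumptions, not the contents of app/services/statsig.py.

```python
import hashlib
import math


def assign_bucket(user_id: int, experiment: str, variants=("control", "treatment")) -> str:
    # Hash (experiment, user) so a user always lands in the same variant
    # without any stored assignment state.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def two_proportion_z(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> tuple[float, float]:
    # Pooled-proportion z-test for a difference in CTR between two variants.
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

Hashing keeps bucket assignment deterministic across requests and replicas with no lookup table, and the z-test gives a quick significance read once clicks and impressions are tallied per variant.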
Proof (load test)
A cropped Locust stats screenshot (assets/img/case-studies/locust-stats.png) mirroring the README benchmark figure will sit here; until then, the table below reflects the same recorded run documented in the repository.
Read path benchmark (documented run)
Source: docs/benchmark_results.md in the GitHub repo. Re-run with make loadtest-read.
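For reference, a read-heavy run like this one can be driven by a locustfile along these lines. This is a minimal sketch with assumed user-id ranges and wait times, not the exact configuration behind make loadtest-read.

```python
import random

from locust import HttpUser, task, between


class ReadOnlyRecommendationUser(HttpUser):
    # Short, jittered waits keep the load read-heavy and bursty.
    wait_time = between(0.01, 0.1)

    @task
    def fetch_recommendations(self):
        user_id = random.randint(1, 10_000)
        # name= collapses all user ids into one row of the Locust stats table.
        self.client.get(f"/recommendations/{user_id}", name="/recommendations/{user_id}")
```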