Overview
Designed and built a RAG system to handle real-time event queries with low latency and high concurrency. Served as the primary interface for 400+ attendees and supported dynamic updates from 80+ event coordinators via a Google Sheets CMS.
Performance: Processed 3,690+ queries over 15 days with 100% uptime and 1.2s average latency. Reduced response time from 4.2s to 18ms (for cached responses) using multi-tier caching (in-memory, Redis, semantic).
Key Engineering Decisions
- Multi-Tier Caching (L1 + L2 + Semantic): Combined exact-match caching (in-memory L1, Redis L2) with embedding-based similarity caching to reduce LLM calls by 36% and achieve sub-20ms responses for repeated queries.
- Cross-Encoder Reranking: Used ms-marco-MiniLM-L-6-v2 to rerank retrieval results, improving answer precision at a ~50–100ms latency cost.
- Dynamic CMS (Google Sheets): Enabled non-technical coordinators to update event data with auto-sync every 5 minutes and zero-downtime updates.
- Async Backend Design: Built an async FastAPI pipeline supporting high concurrency with only 2 workers while maintaining stable latency.
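The tiered lookup described above can be sketched as follows. This is a minimal illustration, not the project's code: `TieredCache`, `toy_embed`, and the 0.85 similarity threshold are hypothetical, a plain dict stands in for Redis, and word counts stand in for a real embedding model.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TieredCache:
    """Exact-match L1 and L2 tiers, then an embedding-similarity tier."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.85):
        self.l1: Dict[str, str] = {}  # in-process memory
        self.l2: Dict[str, str] = {}  # stand-in for Redis (assumption)
        self.semantic: List[Tuple[Vector, str]] = []
        self.embed = embed
        self.threshold = threshold  # assumed value, not from the source

    def get(self, query: str) -> Optional[Tuple[str, str]]:
        """Return (tier, answer) for the first tier that hits, else None."""
        if query in self.l1:
            return "l1", self.l1[query]
        if query in self.l2:
            self.l1[query] = self.l2[query]  # promote to the faster tier
            return "l2", self.l2[query]
        vec = self.embed(query)
        best = max(self.semantic, key=lambda item: cosine(vec, item[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return "semantic", best[1]
        return None  # full miss: caller falls through to retrieval + LLM

    def put(self, query: str, answer: str) -> None:
        """Store an answer in every tier."""
        self.l1[query] = answer
        self.l2[query] = answer
        self.semantic.append((self.embed(query), answer))
```

A caller would check `get()` before retrieval and call `put()` after a successful LLM response. The semantic tier here scans linearly, which is only reasonable for a small number of cached entries.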
Screenshots
Chat interface maintaining 663ms average response time with custom RAG pipeline.
Real-time analytics dashboard showing 372 queries processed, with 'General' and 'Schedule' as highest-traffic intents.
User telemetry revealing 21% mobile traffic, leading to late-stage mobile-first UI optimization.
Security layer intercepting 16+ prompt injection attempts during stress testing.
Operational observability via real-time analytics (latency, intent distribution, abuse logs).
Architecture Highlights
Request Lifecycle: Security Gate → Intent Classification (8 categories) → Multi-Tier Cache (L1 → L2 → Semantic) → Vector Retrieval (top-k=50) → Cross-Encoder Reranking → LLM Generation (Groq LLaMA 3 70B) → Background Tasks.
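The retrieve-then-rerank step in this lifecycle can be sketched as below. The `rerank` helper, `overlap_scorer`, and `top_n=5` are hypothetical; the scorer is a toy word-overlap stand-in so the example runs without a model. In a real setup, a cross-encoder's batch scoring function (for example, the `predict` method of a `sentence_transformers.CrossEncoder`) could be passed as `score_pairs` instead.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[List[Pair]], Sequence[float]],
    top_n: int = 5,  # assumed cutoff; the source only gives top-k=50 for retrieval
) -> List[str]:
    """Score each (query, candidate) pair and keep the top_n candidates."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]


def overlap_scorer(pairs: List[Pair]) -> List[float]:
    """Toy stand-in for a cross-encoder: fraction of query words found in the doc."""
    scores = []
    for query, doc in pairs:
        q_words = set(query.lower().split())
        d_words = set(doc.lower().split())
        scores.append(len(q_words & d_words) / max(len(q_words), 1))
    return scores
```

Retrieving a wide candidate set and reranking a narrow one trades a small fixed latency cost for better precision in the context passed to the LLM.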
Scaling Approach: Async I/O, hard timeouts (10s vector, 30s LLM), and query normalization. Key rotation across multiple Groq API keys bypasses rate limits during peak usage (413+ queries/day).
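A minimal sketch of the hard timeouts and key rotation, assuming `asyncio.wait_for` for the deadline and simple round-robin rotation. `call_with_timeout`, `KeyRotator`, and the fallback value are hypothetical names; the source does not say how rotation is scheduled or what is returned on timeout.

```python
import asyncio
import itertools
from typing import Any, Awaitable, Sequence

VECTOR_TIMEOUT_S = 10  # from the write-up
LLM_TIMEOUT_S = 30     # from the write-up


async def call_with_timeout(awaitable: Awaitable[Any], seconds: float, fallback: Any) -> Any:
    """Await a call, returning `fallback` if it exceeds the deadline."""
    try:
        return await asyncio.wait_for(awaitable, timeout=seconds)
    except asyncio.TimeoutError:
        return fallback


class KeyRotator:
    """Round-robin over API keys (rotation policy is an assumption)."""

    def __init__(self, keys: Sequence[str]):
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(keys)

    def next_key(self) -> str:
        return next(self._cycle)
```

Each LLM request would take `next_key()` and be wrapped in `call_with_timeout(..., LLM_TIMEOUT_S, fallback)`, so a slow upstream call cannot hold a worker indefinitely.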
Tech Stack Rationale:
- FastAPI: Async-first design prevents blocking I/O from LLM calls and vector search
- ChromaDB: SQLite-based vector store, no separate server required
- Groq: Low-latency inference for real-time chat UX
- Redis (optional): Graceful degradation to in-memory cache when unavailable
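The Redis fallback could look roughly like this. `FallbackCache` is a hypothetical wrapper, and `remote` is any object with `get`/`set` methods (an assumption standing in for a Redis client). The sketch catches `Exception` broadly for brevity; a real version would catch the client library's connection error type.

```python
from typing import Any, Dict, Optional


class FallbackCache:
    """Try a remote cache first; degrade to an in-memory dict on failure."""

    def __init__(self, remote: Optional[Any] = None):
        self.remote = remote            # hypothetical Redis-like client, may be None
        self.local: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        if self.remote is not None:
            try:
                value = self.remote.get(key)
                if value is not None:
                    return value
            except Exception:
                pass  # remote unavailable: fall through to the local cache
        return self.local.get(key)

    def set(self, key: str, value: str) -> None:
        self.local[key] = value  # always keep a local copy
        if self.remote is not None:
            try:
                self.remote.set(key, value)
            except Exception:
                pass  # remote unavailable: local copy still serves reads
```

Because every write also lands in the local dict, reads keep working if the remote store goes down mid-session, at the cost of per-process rather than shared cache state.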
Performance Stats
- Avg latency: 1.2s
- Cache latency: 18ms (L1/L2 hits)
- p95 latency: <2.5s
- Peak load: 413 queries/day
System Design Highlights
- Query normalization to improve cache reuse
- API key rotation to handle LLM rate limits
- Hard timeouts (vector: 10s, LLM: 30s)
- Rate limiting (60 req/min per IP)
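Query normalization and per-IP rate limiting from the list above can be sketched as follows. The exact normalization rules and the sliding-window algorithm are assumptions; the source only states that queries are normalized and that the limit is 60 requests per minute per IP.

```python
import re
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


def normalize_query(query: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so that
    trivially different phrasings map to the same cache key."""
    text = query.lower()
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window_s` seconds."""

    def __init__(self, limit: int = 60, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

The `now` parameter exists only to make the limiter easy to test. Like the in-memory abuse detection, this state is per process and would need a shared store to work across multiple workers.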
Security & Observability
- AES-256 encryption for PII, SHA-256 for IP hashing
- Real-time monitoring with Prometheus and Grafana
- Abuse detection and logging with automatic cleanup
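The IP hashing can be illustrated with the standard library. The `hash_ip` name and the use of a secret salt are assumptions; the source states only that SHA-256 is used. A salt matters here because the IPv4 address space is small enough to brute-force against an unsalted hash.

```python
import hashlib


def hash_ip(ip: str, salt: str) -> str:
    """Return a hex SHA-256 digest of the salted IP, so logs and rate-limit
    keys never store the raw address. `salt` is a hypothetical server-side secret."""
    return hashlib.sha256(f"{salt}:{ip}".encode("utf-8")).hexdigest()
```

The same digest can serve as the client identifier for abuse logs, keeping analytics joinable without retaining raw IPs.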
Leadership
Led a team of 6 developers.
- Owned system architecture and core implementation (RAG pipeline, caching, deployment).
- Coordinated work across data preparation, embedding experiments, and testing.
Bottlenecks & Improvements
- ChromaDB linear scan limits scaling beyond ~10K chunks → planned migration to FAISS/ANN
- In-memory abuse detection → move to Redis for distributed scaling
- Add streaming responses (SSE/WebSockets) to improve perceived latency
Summary
Built a real-time event assistant using RAG, combining multi-layer caching, dynamic data ingestion, and async system design to deliver low-latency responses under production constraints.