Aurora RAG Chatbot

Overview

Designed and built a retrieval-augmented generation (RAG) chatbot that answers real-time event queries with low latency under high concurrency. It served as the primary interface for 400+ attendees and let event coordinators update content dynamically through a Google Sheets CMS.

Performance: Handled thousands of attendee queries during a 15-day live deployment at a stable 1.2s average latency. Multi-tier caching (in-memory, Redis, semantic) cut response time for repeated queries from 4.2s to sub-20ms.

Key Engineering Decisions

Multi-Tier Caching (L1 + L2 + Semantic): Combined exact-match caching (in-memory L1, Redis L2) with embedding-based similarity caching to reduce LLM calls and achieve sub-20ms responses for repeated queries.
Cross-Encoder Reranking: Used ms-marco-MiniLM-L-6-v2 to rerank retrieval results, improving answer precision at a ~50–100ms latency cost.
Dynamic CMS (Google Sheets): Built a Google Sheets CMS with background embedding refresh to support zero-downtime content updates by non-technical event coordinators.
Async Backend Design: Built an async FastAPI pipeline supporting high concurrency with only 2 workers while maintaining stable latency.
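
The multi-tier lookup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `TieredCache`, `normalize`, and `toy_embed` are hypothetical names, the L2 tier is a plain dict standing in for Redis, the bag-of-words "embedding" stands in for a real embedding model, and the 0.8 similarity threshold is an assumed value.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def normalize(query: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace so near-identical queries share a key."""
    return " ".join(re.sub(r"[^\w\s]", " ", query.lower()).split())


def toy_embed(text: str) -> Counter:
    """Stand-in embedding (bag of words); a real system would call an embedding model."""
    return Counter(text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TieredCache:
    def __init__(self, threshold: float = 0.8):
        self.l1: dict[str, str] = {}  # in-process exact-match tier
        self.l2: dict[str, str] = {}  # stand-in for the Redis exact-match tier
        self.semantic: list[tuple[Counter, str]] = []  # (embedding, answer) pairs
        self.threshold = threshold    # assumed similarity cutoff

    def get(self, query: str) -> tuple[str, str] | None:
        """Return (tier, answer) on a hit, or None when the LLM must be called."""
        key = normalize(query)
        if key in self.l1:
            return "l1", self.l1[key]
        if key in self.l2:
            self.l1[key] = self.l2[key]  # promote to the faster tier
            return "l2", self.l2[key]
        vec = toy_embed(key)
        best_score, best_answer = 0.0, None
        for cached_vec, answer in self.semantic:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            return "semantic", best_answer
        return None

    def put(self, query: str, answer: str) -> None:
        """Store an answer in every tier after a fresh LLM response."""
        key = normalize(query)
        self.l1[key] = answer
        self.l2[key] = answer
        self.semantic.append((toy_embed(key), answer))
```

Checking the cheap exact-match tiers first means the embedding comparison only runs for queries that are worded differently from anything seen before.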

Screenshots

Chat interface maintaining stable response times with the custom RAG pipeline.

Real-time analytics dashboard showing intent distribution, with 'General' and 'Schedule' as the highest-traffic intents.

User telemetry revealing the share of mobile traffic, which led to late-stage mobile UI optimization.

Security layer intercepting prompt injection attempts during stress testing.

Operational observability via real-time analytics (latency, intent distribution, abuse logs).

Architecture Highlights

Request Lifecycle: Security Gate → Intent Classification (8 categories) → Multi-Tier Cache (L1 → L2 → Semantic) → Vector Retrieval (top-k=50) → Cross-Encoder Reranking → LLM Generation (Groq LLaMA 3 70B) → Background Tasks.

Scaling Approach: Async I/O, hard timeouts (10s for vector search, 30s for LLM calls), and query normalization. Requests rotate across multiple Groq API keys so that no single key hits its rate limit during peak usage.
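
As a rough illustration of the timeout and key-rotation ideas above, the sketch below uses only the standard library. `KeyRotator` and `with_timeout` are hypothetical helper names, the round-robin policy is an assumption (the text does not say how keys are chosen), and the fallback value returned on timeout is likewise illustrative.

```python
import asyncio
import itertools

VECTOR_TIMEOUT_S = 10  # hard timeout for vector search, as stated in the text
LLM_TIMEOUT_S = 30     # hard timeout for LLM generation, as stated in the text


class KeyRotator:
    """Hands out API keys in round-robin order (assumed rotation policy)."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(keys)

    def next_key(self):
        return next(self._cycle)


async def with_timeout(coro, seconds, fallback):
    """Await `coro`, returning `fallback` if it does not finish within `seconds`."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return fallback
```

A call site might look like `await with_timeout(call_llm(rotator.next_key(), prompt), LLM_TIMEOUT_S, "Sorry, please try again.")`, where `call_llm` is a placeholder for whatever client function issues the request.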

Tech Stack Rationale:

FastAPI: Async-first design prevents blocking I/O from LLM calls and vector search
ChromaDB: SQLite-backed embedded vector store, no separate server required
Groq: Low-latency inference for real-time chat UX
Redis (optional): Graceful degradation to in-memory cache when unavailable
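
The graceful-degradation behaviour for Redis could look something like the sketch below. `FallbackCache` is a hypothetical wrapper; `remote` is any object exposing `get(key)` and `set(key, value)`. The built-in `ConnectionError` is caught here purely for illustration; with a real Redis client, that library's own connection exception type would be caught in its place.

```python
class FallbackCache:
    """Prefer a remote cache, but keep serving from memory if it is unreachable."""

    def __init__(self, remote=None):
        self._remote = remote  # optional remote backend (e.g. a Redis client)
        self._local = {}       # in-memory fallback, always kept up to date

    def get(self, key):
        if self._remote is not None:
            try:
                value = self._remote.get(key)
                if value is not None:
                    return value
            except ConnectionError:
                pass  # remote is down: fall through to the local copy
        return self._local.get(key)

    def set(self, key, value):
        self._local[key] = value  # write locally first so reads survive an outage
        if self._remote is not None:
            try:
                self._remote.set(key, value)
            except ConnectionError:
                pass  # best effort: the local copy is still valid
```

Writing to the local dict on every `set` is what lets reads continue unchanged when the remote backend disappears mid-session.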


Performance Stats

Avg latency: 1.2s
Cache latency: sub-20ms (L1/L2/semantic hits)
p95 latency: <2.5s

System Design Highlights

Query normalization to improve cache reuse
API key rotation to handle LLM rate limits
Hard timeouts (vector: 10s, LLM: 30s)
Rate limiting (60 req/min per IP)
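
One plausible way to enforce the 60 requests per minute per IP limit is a sliding-window counter, sketched below. The source does not say which algorithm was used, so the sliding window is an assumption, and `SlidingWindowLimiter` is a hypothetical name. The injectable `clock` parameter exists only to make the behaviour easy to test.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within any `window_s`-second span."""

    def __init__(self, limit=60, window_s=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self._clock = clock
        self._hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id):
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: caller should reject (e.g. HTTP 429)
        hits.append(now)
        return True
```

Because this state lives in process memory, each worker would count independently; sharing the counters across workers would need an external store.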

Security & Observability

AES-256 encryption for PII, SHA-256 for IP hashing
Real-time monitoring with Prometheus and Grafana
Abuse detection and logging with automatic cleanup
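
For the IP-hashing item above, a minimal sketch with the standard library is shown below. `hash_ip` is a hypothetical helper, and the secret salt is an assumption not mentioned in the source; it is included because the IPv4 address space is small enough that unsalted hashes can be reversed by brute force.

```python
import hashlib


def hash_ip(ip: str, salt: bytes) -> str:
    """Return a hex SHA-256 digest of the salted IP so logs never store the raw address."""
    return hashlib.sha256(salt + ip.encode("utf-8")).hexdigest()
```

The same IP and salt always produce the same digest, so per-client analytics and abuse tracking still work on the hashed value.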

Leadership

Led a 6-member development team across retrieval pipeline development, deployment, and testing.

Owned system architecture and core implementation (RAG pipeline, caching, deployment).
Coordinated work across data preparation, embedding experiments, and testing.

Bottlenecks & Improvements

ChromaDB linear scan limits scaling beyond ~10K chunks → planned migration to FAISS/ANN
In-memory abuse detection → move to Redis for distributed scaling
Add streaming responses (SSE/WebSockets) to improve perceived latency

Summary

Built a real-time event assistant using RAG, combining multi-layer caching, dynamic data ingestion, and async system design to deliver low-latency responses under production constraints.
