Aurora RAG Chatbot

Technical Lead (POC) — RAG Assistant for ISTE Aurora Fest

Tech Stack

Python 3.11, FastAPI (Async), Redis, ChromaDB, Groq (LLaMA 3 70B), Docker, Nginx

Links

GitHub

Overview

Designed and built a RAG system to handle real-time event queries with low latency and high concurrency. Served as the primary interface for 400+ attendees and supported dynamic updates from event coordinators via a Google Sheets CMS.

Performance: Handled thousands of attendee queries during a 15-day live deployment at a stable 1.2s average latency. Multi-tier caching (in-memory, Redis, semantic) cut response time for repeated queries from 4.2s to under 20ms.

Key Engineering Decisions

  • Multi-Tier Caching (L1 + L2 + Semantic): Combined exact-match caching (in-memory L1, Redis L2) with embedding-based similarity caching to reduce LLM calls and serve repeated queries in under 20ms.
  • Cross-Encoder Reranking: Used ms-marco-MiniLM-L-6-v2 to rerank retrieval results, improving answer precision at a ~50–100ms latency cost.
  • Dynamic CMS (Google Sheets): Built a Google Sheets CMS with background embedding refresh to support zero-downtime content updates by non-technical event coordinators.
  • Async Backend Design: Built an async FastAPI pipeline supporting high concurrency with only 2 workers while maintaining stable latency.
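
The multi-tier lookup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: TieredCache, toy_embed, and the 0.85 similarity threshold are hypothetical, a plain dict stands in for Redis, and the toy embedding exists only so the example runs without a model.

```python
import math
from typing import Callable, Optional

Vector = list[float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def toy_embed(text: str, dim: int = 32) -> Vector:
    """Hypothetical stand-in for a real embedding model: hashed bag of words."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0
    return vec

class TieredCache:
    """Checks L1 (in-process), then L2 (Redis stand-in), then a semantic tier."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.85):
        self.l1: dict[str, str] = {}   # exact match, in-memory
        self.l2: dict[str, str] = {}   # exact match; a dict standing in for Redis
        self.semantic: list[tuple[Vector, str]] = []  # (query embedding, answer)
        self.embed = embed
        self.threshold = threshold     # assumed value, would need tuning

    def get(self, query: str) -> Optional[tuple[str, str]]:
        """Return (tier, answer) for the first tier that hits, else None."""
        if query in self.l1:
            return "l1", self.l1[query]
        if query in self.l2:
            self.l1[query] = self.l2[query]  # promote to the faster tier
            return "l2", self.l2[query]
        qv = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.semantic:
            score = cosine(qv, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            return "semantic", best_answer
        return None  # full miss: caller falls through to retrieval + LLM

    def put(self, query: str, answer: str) -> None:
        """Store an answer in every tier."""
        self.l1[query] = answer
        self.l2[query] = answer
        self.semantic.append((self.embed(query), answer))
```

Only a full miss reaches retrieval and the LLM, which is where the saving in LLM calls comes from. A real semantic tier would likely use the vector store rather than a linear scan over a list.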

Screenshots

Aurora Chat Interface

Chat interface maintaining stable response times with a custom RAG pipeline.

Aurora Analytics Dashboard

Real-time analytics dashboard showing intent distribution, with 'General' and 'Schedule' as the highest-traffic intents.

Geographic Analytics

User telemetry revealing mobile traffic distribution, which led to late-stage mobile UI optimization.

Security Logs

Security layer intercepting prompt injection attempts during stress testing.

Operational observability via real-time analytics (latency, intent distribution, abuse logs).

Architecture Highlights

Request Lifecycle: Security Gate → Intent Classification (8 categories) → Multi-Tier Cache (L1 → L2 → Semantic) → Vector Retrieval (top-k=50) → Cross-Encoder Reranking → LLM Generation (Groq LLaMA 3 70B) → Background Tasks.
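
The retrieve-then-rerank step in this lifecycle might look like the sketch below. It is an assumption-laden illustration: rerank, overlap_scorer, and the top_n value are hypothetical names and settings, and the word-overlap scorer is only a toy stand-in so the example runs without a model. In a real deployment the scorer could plausibly be the predict method of a sentence-transformers CrossEncoder loaded with the ms-marco-MiniLM-L-6-v2 checkpoint.

```python
from typing import Callable, Sequence

# A scorer takes (query, document) pairs and returns one relevance score per pair.
Scorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]

def rerank(query: str, candidates: Sequence[str],
           score_pairs: Scorer, top_n: int = 5) -> list[str]:
    """Score every retrieved candidate against the query and keep the best top_n."""
    if not candidates:
        return []
    scores = score_pairs([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

def overlap_scorer(pairs: Sequence[tuple[str, str]]) -> list[float]:
    """Toy stand-in for a cross-encoder: fraction of query words found in the doc."""
    scores = []
    for query, doc in pairs:
        q_words = set(query.lower().split())
        d_words = set(doc.lower().split())
        scores.append(len(q_words & d_words) / len(q_words) if q_words else 0.0)
    return scores
```

The vector store returns a wide candidate set (top-k=50 here) cheaply, and the slower pairwise scorer only has to rank those candidates, which is what keeps the reranking cost bounded.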

Scaling Approach: Async I/O, hard timeouts (10s vector, 30s LLM), and query normalization. Key rotation across multiple Groq API keys spreads requests so that no single key exhausts its rate limit during peak usage.
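
One way to express the hard timeouts and key rotation, assuming plain asyncio and round-robin selection. KeyRotator and with_timeout are hypothetical helpers written for this illustration, and the key strings used with them would be placeholders, not real credentials.

```python
import asyncio
import itertools
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")

class KeyRotator:
    """Hands out API keys round-robin so load is spread across all of them."""

    def __init__(self, keys: Sequence[str]):
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(keys)

    def next_key(self) -> str:
        return next(self._cycle)

async def with_timeout(make_call: Callable[[], Awaitable[T]],
                       *, timeout_s: float, fallback: T) -> T:
    """Await make_call() but give up after timeout_s seconds and return fallback."""
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The awaited task is cancelled by wait_for; the caller gets a safe default.
        return fallback
```

Under these assumptions the vector search would be wrapped with timeout_s=10 and the LLM call with timeout_s=30, each with a fallback suited to that stage, so one slow upstream call cannot hold a worker indefinitely.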

Tech Stack Rationale:

  • FastAPI: Async-first design prevents blocking I/O from LLM calls and vector search
  • ChromaDB: SQLite-based vector store, no separate server required
  • Groq: Low-latency inference for real-time chat UX
  • Redis (optional): Graceful degradation to in-memory cache when unavailable
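
The graceful degradation for Redis could be sketched like this. FallbackCache and DownRemote are hypothetical: the remote is any object with get and set methods (for example a Redis client), and DownRemote simulates an unreachable server for demonstration purposes.

```python
from typing import Any, Optional

class FallbackCache:
    """Tries a remote cache first and silently falls back to an in-memory dict."""

    def __init__(self, remote: Optional[Any] = None):
        self.remote = remote             # e.g. a Redis client; None means "not configured"
        self.local: dict[str, str] = {}  # always-available in-process copy

    def get(self, key: str) -> Optional[str]:
        if self.remote is not None:
            try:
                value = self.remote.get(key)
                if value is not None:
                    return value
            except Exception:
                # Broad catch for brevity; real code would narrow this to the
                # client library's own connection/timeout error types.
                pass
        return self.local.get(key)

    def set(self, key: str, value: str) -> None:
        self.local[key] = value          # write locally first so reads never fail
        if self.remote is not None:
            try:
                self.remote.set(key, value)
            except Exception:
                pass                     # remote is down; the local copy still serves

class DownRemote:
    """Simulates an unreachable cache server."""

    def get(self, key: str) -> str:
        raise ConnectionError("cache server unreachable")

    def set(self, key: str, value: str) -> None:
        raise ConnectionError("cache server unreachable")
```

With this shape the request path never needs to know whether Redis is up; the cost of an outage is a lower hit rate and per-process caches, not errors.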

Performance Stats

  • Avg latency: 1.2s
  • Cache latency: sub-20ms (L1/L2/semantic hits)
  • p95 latency: <2.5s

System Design Highlights

  • Query normalization to improve cache reuse
  • API key rotation to handle LLM rate limits
  • Hard timeouts (vector: 10s, LLM: 30s)
  • Rate limiting (60 req/min per IP)
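
Two of the items above, query normalization and per-IP rate limiting, can be illustrated with the standard library alone. The exact normalization rules and the sliding-window algorithm are assumptions; normalize_query and SlidingWindowLimiter are hypothetical names, and only the 60 requests per minute figure comes from the list above.

```python
import re
import time
from collections import defaultdict, deque
from typing import Callable

def normalize_query(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so that
    trivially different phrasings map to the same cache key."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)       # strip punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

class SlidingWindowLimiter:
    """Allows at most `limit` requests per client within the last `window_s` seconds."""

    def __init__(self, limit: int = 60, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self._hits[client_id]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Normalizing before the cache lookup is what lets "When is the KEYNOTE??" and "when is the keynote" share one cache entry. The limiter keeps its state in process memory, so it counts per worker rather than globally.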

Security & Observability

  • AES-256 encryption for PII, SHA-256 for IP hashing
  • Real-time monitoring with Prometheus and Grafana
  • Abuse detection and logging with automatic cleanup
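
A minimal sketch of the IP hashing, assuming a server-side salt is mixed in before SHA-256 is applied. The salt and the hash_ip helper are assumptions made for illustration; the list above states only that SHA-256 is used.

```python
import hashlib

def hash_ip(ip: str, salt: str) -> str:
    """Return a hex SHA-256 digest of the salted IP so logs never hold the raw address."""
    return hashlib.sha256(f"{salt}:{ip}".encode("utf-8")).hexdigest()
```

The same IP and salt always produce the same digest, so per-client analytics and rate limiting can key on the digest. Without a secret salt, the small IPv4 address space would make unsalted digests easy to reverse by enumeration.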

Leadership

Led a 6-member development team across retrieval pipeline development, deployment, and testing.

  • Owned system architecture and core implementation (RAG pipeline, caching, deployment).
  • Coordinated work across data preparation, embedding experiments, and testing.

Bottlenecks & Improvements

  • ChromaDB linear scan limits scaling beyond ~10K chunks → planned migration to FAISS/ANN
  • In-memory abuse detection → move to Redis for distributed scaling
  • Add streaming responses (SSE/WebSockets) to improve perceived latency
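
For the planned streaming improvement, the Server-Sent Events wire format is simple enough to sketch with the standard library. format_sse, stream_tokens, and the "[DONE]" end marker are hypothetical choices made for this illustration, not part of the existing system.

```python
import asyncio
from typing import AsyncIterator, Iterable, Optional

def format_sse(data: str, event: Optional[str] = None) -> str:
    """Encode one Server-Sent Events frame; multi-line data gets one 'data:' line each."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"  # a blank line terminates the frame

async def stream_tokens(tokens: Iterable[str]) -> AsyncIterator[str]:
    """Yield each generated token as its own SSE frame, then a final 'done' event."""
    for token in tokens:
        yield format_sse(token)
        await asyncio.sleep(0)  # hand control back to the event loop between frames
    yield format_sse("[DONE]", event="done")
```

In a FastAPI app a generator like this could be returned through a StreamingResponse with media type text/event-stream, so the first tokens reach the user while the rest of the answer is still being generated.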

Summary

Built a real-time event assistant using RAG, combining multi-layer caching, dynamic data ingestion, and async system design to deliver low-latency responses under production constraints.