GIC Insurance Analytics RAG

Grounded Analytics RAG for General Insurance Data

Role

Analytics Intern, Star Health Insurance

Timeline

December 2025

Tech Stack

Python, Streamlit, Groq (Llama 3.3), ChromaDB, Jupyter

Project Overview

Objective: Build a grounded RAG system with citation enforcement for insurance premium analytics.

Data Coverage: FY24 and FY25 (April–October), 34 insurance companies, 9 segments.

Core Value: The system focuses on domain-specific analytics and document generation, with RAG used as a semantic interface over computed insights rather than raw data.

Implementation Details

  • LLM: Groq (Llama 3.3 70B)
  • Vector Database: ChromaDB - Persistent semantic search
  • Embeddings: sentence-transformers (all-MiniLM-L6-v2)
  • Frontend: Streamlit - Rapid analytics interface (the core complexity is in the data engineering, not the UI)
  • Data Processing: Pandas, NumPy

Grounding & Guardrails

  • Closed knowledge base (49 documents)
  • Low-temperature generation (0.1)
  • Mandatory citations for numerical claims
  • Template-based fallback when retrieval confidence is low
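
The mandatory-citation rule can be checked mechanically after generation. The following is a minimal sketch, not the project's actual validator: it assumes citations appear inline as a bracketed document ID such as [SEGMENT_003] (a hypothetical format) and that the citation sits inside the sentence it supports. The function name is illustrative.

```python
import re

# Assumed citation format: an upper-case document ID in square brackets.
CITATION = re.compile(r"\[[A-Z]+_\d+\]")
DIGIT = re.compile(r"\d")


def uncited_numeric_sentences(answer: str) -> list[str]:
    """Return the sentences that contain a number but no citation marker."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    flagged = []
    for sentence in sentences:
        has_citation = bool(CITATION.search(sentence))
        # Strip citation markers first so the digits inside a document ID
        # are not mistaken for a numerical claim.
        has_number = bool(DIGIT.search(CITATION.sub("", sentence)))
        if has_number and not has_citation:
            flagged.append(sentence)
    return flagged
```

An answer that returns a non-empty list here could be rejected or replaced by the template fallback; which of the two the system does is not specified above.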

Step-by-Step Process

Phase 1: Data Collection & Preprocessing

Collected 7 monthly Excel reports (Apr–Oct 2025) with segment-wise premium data. Handled inconsistent formatting, multi-sheet files, and naming issues. Consolidated the reports into a master CSV with 2,380 records. Because the reports give cumulative year-to-date (YTD) figures, monthly premiums were derived as the difference between consecutive YTD values.
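
Deriving monthly premiums from YTD deltas can be sketched as below. This is an illustration in plain Python to stay dependency-free (the project itself lists Pandas); the tuple layout, the function name, and the choice to treat the first observed YTD value as that month's premium are assumptions.

```python
from collections import defaultdict


def monthly_from_ytd(rows):
    """Convert cumulative YTD premiums into monthly premiums.

    rows: iterable of (company, segment, month_index, ytd_premium),
          where month_index orders the months within one fiscal year.
    Returns {(company, segment, month_index): monthly_premium}.
    """
    series = defaultdict(list)
    for company, segment, month, ytd in rows:
        series[(company, segment)].append((month, ytd))

    monthly = {}
    for key, points in series.items():
        points.sort()  # chronological order within the fiscal year
        previous_ytd = None
        for month, ytd in points:
            if previous_ytd is None:
                # Assumption: the first observed month is the baseline,
                # which also covers companies that enter mid-year.
                monthly[(*key, month)] = ytd
            else:
                # A negative delta is kept as-is rather than clipped to zero.
                monthly[(*key, month)] = ytd - previous_ytd
            previous_ytd = ytd
    return monthly
```

Calling `monthly_from_ytd` on made-up YTD values 100, 250, 240 for one company and segment yields monthly values 100, 150, and -10.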

Phase 2: Analytics & Insights Generation

Built an analytics module covering growth metrics, volatility analysis, portfolio concentration, and risk classifications (Health strategy, Misc segment risk). Performed exploratory data analysis (EDA) to validate trends.

Phase 3: RAG Document Generation

Generated 49 semantic documents from structured data:

  • Company Summaries (34): Premium, growth, top segment, concentration
  • Segment Summaries (9): Market share, growth, trends
  • Segment Rankings (9): Top 10 companies per segment
  • Risk Classifications (4): Crop-risk, health strategy types
  • Industry Overview, Growth Insights, Segment Comparison, Company Rankings

All documents are citation-ready: each carries a document ID that answers can cite.
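
Turning one row of computed metrics into a citable text document might look like the sketch below. The ID scheme (COMPANY_001), the sentence template, and the metadata keys are hypothetical; only the listed fields (premium, growth, top segment, concentration) come from the description above.

```python
from dataclasses import dataclass


@dataclass
class SummaryDoc:
    doc_id: str
    text: str
    metadata: dict


def company_summary_doc(index: int, company: str, total_premium: float,
                        growth_pct: float, top_segment: str,
                        top_share: float) -> SummaryDoc:
    """Render one company's computed metrics as a short, citable document."""
    doc_id = f"COMPANY_{index:03d}"  # hypothetical ID scheme
    text = (
        f"[{doc_id}] {company}: total premium {total_premium:,.2f}, "
        f"growth {growth_pct:+.1f}%, "
        f"top segment {top_segment} ({top_share:.0%} of portfolio)."
    )
    metadata = {"type": "company_summary", "company": company}
    return SummaryDoc(doc_id, text, metadata)
```

Embedding the ID in the text itself means the model sees it in the retrieved context, so it can be quoted back as a citation.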

Phase 4: RAG Engine Development

Components:

  • Retrieval: Cosine similarity search (k=3) over 49 documents
  • Generation: Groq (Llama 3.3 70B), temperature 0.1
  • Guardrails: Similarity threshold check, template fallback, citation validation
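
The retrieval step and the threshold guardrail can be sketched without any external service. In the project, embedding and search are handled by sentence-transformers and ChromaDB; the plain-Python version below only illustrates the logic. The threshold value, the fallback wording, and the `generate` callable (a stand-in for the Groq call) are assumptions.

```python
import math

MIN_SIMILARITY = 0.3  # hypothetical threshold, not the project's value
FALLBACK = "This question is not covered by the knowledge base."


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, doc_vecs, k=3):
    """Return the k (doc_id, similarity) pairs closest to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def answer(query_vec, doc_vecs, generate):
    """Generate only when the best match clears the similarity threshold."""
    hits = retrieve(query_vec, doc_vecs)
    if not hits or hits[0][1] < MIN_SIMILARITY:
        return FALLBACK  # template response instead of an LLM call
    context_ids = [doc_id for doc_id, score in hits if score >= MIN_SIMILARITY]
    return generate(context_ids)
```

With a corpus of only 49 documents, an exhaustive scan like this is cheap; a vector database mainly adds persistence and convenience at this scale.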

Phase 5: Streamlit Interface

Built a query interface with an auto-generated knowledge base, chat history, and error handling.

Testing

  • Analytics and rankings validated against source data
  • RAG responses manually reviewed for citation correctness
  • End-to-end queries tested via Streamlit interface

Technical Architecture

System Flow

The system follows a clear data flow from user query to final answer:

  • User Query → Streamlit Interface
  • RAG Copilot → Retrieval Phase (encode query, ChromaDB search)
  • Retrieved Documents → Generation Phase (build context, Groq inference)
  • Final Answer → Structured response with citations

Key Design Decisions

  • Monthly premiums derived from YTD deltas
  • Negative growth preserved for volatility analysis
  • Mid-year company entries handled as new baselines
  • Standard deviation used for volatility (interpretable units)

Results

  • 2,380 records consolidated from 7 monthly reports
  • 34 companies across 9 insurance segments
  • 49 semantic documents generated
  • 1–3s query latency
  • All tested queries returned citation-backed answers

Future Enhancements

  • Evaluation harness (precision@k, citation correctness)
  • Hybrid retrieval (BM25 + vector with RRF)
  • Incremental monthly ingestion
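
For the hybrid retrieval item, reciprocal rank fusion (RRF) combines ranked lists using only rank positions, so BM25 and vector scores never need to be put on a common scale. The sketch below is a generic RRF implementation, not existing project code; the constant k=60 is the commonly used default, and the alphabetical tie-break is an arbitrary choice.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Higher fused score ranks first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Here `rankings` would hold one list from a BM25 search and one from the vector search, each ordered best-first.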