News Sentiment Analyzer | Mithil Portfolio

Project Overview

Designed a robust news intelligence platform that ingests live articles from any source URL to detect underlying sentiment and topic clusters. Unlike standard scrapers, this system employs a fault-tolerant dual-strategy pipeline (Newspaper3k + Custom DOM Parsing) to ensure high data availability even on complex modern websites.

The system processes articles in bulk, categorizing them into 8+ domains (Politics, Tech, Finance, etc.) and providing granular sentiment scores paired with "example sentences" that justify each rating, bridging the gap between raw metrics and explainable AI.

Visual Tour

1. Landing & Configuration

Clean, minimalist interface allowing users to input any URL or select a preset. Users can configure batch size (up to 20 articles) for deep processing.

2. Source Selection

Quick-access dashboard for major international outlets (BBC, Reuters, Al Jazeera), enabling one-click sentiment auditing of global news.

3. Analysis Dashboard

Real-time visualization layer showing the "Sentiment Distribution" (Positive/Neutral/Negative) and "Topic Categorization" across the analyzed batch.

4. Granular Results

Detailed cards for each article featuring the calculated sentiment score, auto-detected tag (e.g., "Health", "World"), and the specific "driver statement" that influenced the rating.

Core Features

Dual-Engine Scraping: Automatic fallback from the Newspaper3k library to custom BeautifulSoup parsers when structural extraction fails, achieving a >95% extraction success rate.
Context-Aware Sentiment: Uses VADER with custom heuristic rules to extract "representative sentences", automatically identifying the specific text segments that drive the positive or negative score.
Intelligent Categorization: A keyword-density algorithm classifies content into predefined buckets such as Politics, Business, Technology, and Health without external API dependencies.
Bulk Processing Pipeline: Concurrent processing of batches of 5-20 articles per request with real-time progress feedback.
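
The keyword-density categorization listed above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the CATEGORY_KEYWORDS vocabulary, the categorize function, and the "General" default bucket are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical keyword lists; the project's real vocabulary is not shown here.
CATEGORY_KEYWORDS = {
    "Politics": {"election", "parliament", "senate", "minister", "vote"},
    "Business": {"market", "shares", "profit", "investor", "merger"},
    "Technology": {"software", "ai", "chip", "startup", "app"},
    "Health": {"hospital", "vaccine", "patients", "disease", "doctor"},
}


def categorize(text, default="General"):
    """Return the category whose keywords form the largest share of the words."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return default
    counts = Counter(words)
    # Density = keyword occurrences divided by total word count.
    density = {
        category: sum(counts[k] for k in keywords) / len(words)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(density, key=density.get)
    # Fall back to the default bucket when no keyword matched at all.
    return best if density[best] > 0 else default
```

Because the lookup is a plain dictionary scan, it needs no network calls, which matches the "no external API dependencies" point above.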

Technical Implementation

Sentiment & Explanation Logic

The core engine doesn't just return a score; it explains why. By tokenizing full articles into sentences and scoring them individually, the system isolates the most polarized statements to present as evidence.
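
A minimal sketch of that sentence-level approach is shown below. To keep it dependency-free, it scores sentences with a tiny toy lexicon (TOY_LEXICON and toy_score are stand-ins invented for this example); the naive regex splitter and the choice to average sentence scores into an overall score are also assumptions, not necessarily what the project does.

```python
import re

# Toy stand-in lexicon so the sketch runs without dependencies (an assumption);
# the project itself uses VADER for scoring.
TOY_LEXICON = {"praised": 0.6, "success": 0.7, "growth": 0.5,
               "crisis": -0.7, "failed": -0.6, "fear": -0.5}


def toy_score(sentence):
    """Sum word polarities and clamp to [-1, 1], mimicking a compound score."""
    words = re.findall(r"[a-z']+", sentence.lower())
    total = sum(TOY_LEXICON.get(w, 0.0) for w in words)
    return max(-1.0, min(1.0, total))


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def explain_sentiment(text, score_fn=toy_score):
    """Score each sentence and return the overall score plus its driver sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return {"score": 0.0, "driver": None}
    scored = [(score_fn(s), s) for s in sentences]
    overall = sum(score for score, _ in scored) / len(scored)
    # The "driver" is the most polarized sentence, positive or negative.
    driver = max(scored, key=lambda pair: abs(pair[0]))[1]
    return {"score": overall, "driver": driver}
```

With the vaderSentiment package installed, score_fn could instead be `lambda s: SentimentIntensityAnalyzer().polarity_scores(s)["compound"]`, which returns VADER's compound score for each sentence.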

Challenges & Solutions

Challenge: Modern news sites often block standard scrapers or use heavy JavaScript.
Solution: Implemented a rotating "User-Agent" header strategy and a direct DOM traversal fallback that looks for semantic tag patterns (e.g., <article>, meta[name="description"]) when standard extraction fails.
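
The fallback idea can be illustrated with the sketch below. It is an assumption-laden stand-in: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without dependencies, the User-Agent strings are placeholders, and the names next_headers, SemanticExtractor, dom_fallback, and extract_text are hypothetical.

```python
import itertools
from html.parser import HTMLParser

# Placeholder User-Agent strings (illustrative, not the project's actual list).
USER_AGENTS = itertools.cycle([
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
])


def next_headers():
    """Return request headers carrying the next User-Agent in the rotation."""
    return {"User-Agent": next(USER_AGENTS)}


class SemanticExtractor(HTMLParser):
    """Collect text inside <article> plus the meta description, if present."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting level inside <article>
        self.chunks = []        # text fragments found inside <article>
        self.description = ""   # content of <meta name="description">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            self.depth += 1
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "article" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def dom_fallback(html):
    """Prefer <article> text; otherwise use the meta description."""
    parser = SemanticExtractor()
    parser.feed(html)
    return " ".join(parser.chunks) or parser.description


def extract_text(html, primary, fallback=dom_fallback):
    """Try the primary extractor; fall back if it raises or returns nothing."""
    try:
        text = primary(html)
    except Exception:
        text = ""
    return text or fallback(html)
```

Here primary would wrap the Newspaper3k-based extractor, and next_headers() would supply the headers for each outgoing request.
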
Challenge: Generic sentiment libraries struggle with news nuance (e.g., a report on a crime carries negative content, but that says nothing negative about the quality of the article itself).
Solution: Tuned VADER thresholds specifically for long-form text (a compound score > 0.05 counts as positive) rather than relying on rules designed for short social media posts.
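
As a rough sketch, a threshold like the one above maps a compound score to one of the three dashboard labels. The symmetric negative cutoff (-0.05) and the name label_sentiment are assumptions for illustration; only the 0.05 positive cutoff comes from the text.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to Positive, Neutral, or Negative."""
    if compound > threshold:
        return "Positive"
    # Assumption: the negative cutoff mirrors the positive one.
    if compound < -threshold:
        return "Negative"
    return "Neutral"
```
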
Constraint: Deployment is currently limited to batches of 5 articles due to Render Free Tier execution timeouts (30s limit for synchronous requests).
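
One possible way to stay inside a hard request limit like this is to run the batch concurrently under an overall time budget and return whatever finished in time. The sketch below is illustrative only: analyze_batch, the 25-second default budget, and the worker count are assumptions, and analyze stands for whatever per-article function the pipeline uses.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def analyze_batch(urls, analyze, budget_seconds=25.0, max_workers=5):
    """Run analyze(url) concurrently; return (results, skipped) within the budget."""
    results, skipped = {}, []
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = {pool.submit(analyze, url): url for url in urls}
    # Block until every task finishes or the time budget runs out.
    done, not_done = wait(futures, timeout=budget_seconds)
    for future in done:
        url = futures[future]
        try:
            results[url] = future.result()
        except Exception as exc:
            # A failed article should not sink the whole batch.
            results[url] = {"error": str(exc)}
    for future in not_done:
        future.cancel()  # only stops tasks that have not started yet
        skipped.append(futures[future])
    # Do not wait for stragglers; requires Python 3.9+ for cancel_futures.
    pool.shutdown(wait=False, cancel_futures=True)
    return results, skipped
```

Threads that are already running cannot be interrupted this way, so each request inside analyze would still need its own network timeout.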

Automated News Ingestion and Analysis Pipeline
