Automated News Ingestion and Analysis Pipeline

Real-Time News Analysis with NLP

Timeline

December 2024 - May 2025

Tech Stack

Python, Flask, NLP, VADER

Links

GitHub

Project Overview

Designed a news intelligence platform that ingests live articles from a user-supplied source URL and detects their sentiment and topic category. Unlike a single-strategy scraper, the system uses a fault-tolerant dual-strategy pipeline (Newspaper3k with a custom DOM-parsing fallback) to keep extraction working even on complex modern websites.

The system processes articles in bulk, categorizing them into 8+ domains (Politics, Tech, Finance, etc.) and reporting granular sentiment scores. Each score is paired with "example sentences" that justify the rating, bridging the gap between raw metrics and explainable AI.

Visual Tour

1. Landing & Configuration

News Analyzer Landing Page

Clean, minimalist interface allowing users to input any URL or select a preset. Users can configure batch size (up to 20 articles) for deep processing.

2. Source Selection

Popular News Sources

Quick-access dashboard for major international outlets (BBC, Reuters, Al Jazeera), enabling one-click sentiment auditing of global news.

3. Analysis Dashboard

Sentiment Statistics

Real-time visualization layer showing the "Sentiment Distribution" (Positive/Neutral/Negative) and "Topic Categorization" across the analyzed batch.
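As a rough illustration of how the "Sentiment Distribution" figures could be derived, the sketch below labels each article's compound score and computes the share of each label in a batch. The function names are hypothetical, and the symmetric negative cutoff at -0.05 is an assumption based on VADER's usual convention; the project itself only states the positive cutoff.

```python
from collections import Counter
from typing import Dict, Iterable

LABELS = ("Positive", "Neutral", "Negative")


def label(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score in [-1, 1] to a sentiment label.

    The negative cutoff (-threshold) is an assumption, not taken from the project.
    """
    if compound > threshold:
        return "Positive"
    if compound < -threshold:
        return "Negative"
    return "Neutral"


def distribution(scores: Iterable[float]) -> Dict[str, float]:
    """Return the fraction of articles that fall under each label."""
    counts = Counter(label(score) for score in scores)
    total = sum(counts.values())
    return {name: (counts[name] / total if total else 0.0) for name in LABELS}
```

For a batch with compound scores 0.6, 0.0, -0.4 and 0.2, this yields half Positive, a quarter Neutral and a quarter Negative.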

4. Granular Results

Detailed Article Results

Detailed cards for each article featuring the calculated sentiment score, auto-detected tag (e.g., "Health", "World"), and the specific "driver statement" that influenced the rating.

Core Features

  • Dual-Engine Scraping: Automatic fallback from the Newspaper3k library to custom BeautifulSoup parsers when structural extraction fails, achieving a >95% extraction success rate.
  • Context-Aware Sentiment: Uses VADER with custom heuristic rules to extract "representative sentences", the specific text segments that drive the positive or negative score.
  • Intelligent Categorization: Keyword-density algorithm classifies content into predefined buckets like Politics, Business, Technology, and Health without external API dependencies.
  • Bulk Processing Pipeline: Analyzes batches of 5-20 articles per request concurrently, with real-time progress feedback.
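The keyword-density categorization listed above can be sketched roughly as follows. The keyword lists, the "General" fallback label, and the function name are illustrative assumptions; the project's actual vocabulary and scoring rules are not shown here.

```python
import re
from collections import Counter
from typing import Dict, Set

# Hypothetical keyword lists; the real buckets and vocabularies may differ.
CATEGORY_KEYWORDS: Dict[str, Set[str]] = {
    "Politics": {"election", "parliament", "senate", "minister", "policy"},
    "Business": {"market", "shares", "profit", "revenue", "investors"},
    "Technology": {"software", "chip", "startup", "cybersecurity", "app"},
    "Health": {"hospital", "vaccine", "patients", "disease", "doctors"},
}


def categorize(text: str, default: str = "General") -> str:
    """Pick the category whose keywords make up the largest share of tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return default
    counts = Counter(tokens)
    density = {
        category: sum(counts[word] for word in keywords) / len(tokens)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(density, key=density.get)
    # Fall back to the default label when no keyword matched at all.
    return best if density[best] > 0 else default
```

Because the lookup is a plain dictionary of word sets, it runs locally with no external API call, which matches the stated design goal.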

Technical Implementation

Sentiment & Explanation Logic

The core engine doesn't just return a score; it explains why. By tokenizing full articles into sentences and scoring them individually, the system isolates the most polarized statements to present as evidence.
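A minimal sketch of this sentence-level approach is shown below. To stay runnable without third-party packages it uses a tiny stand-in scorer (`toy_compound`) and a naive regex sentence splitter; both are assumptions for illustration, as are all function names.

```python
import re
from typing import Callable, List, Tuple

# Tiny stand-in lexicon so the sketch runs without third-party packages.
POSITIVE = {"excellent", "praised", "success"}
NEGATIVE = {"crisis", "failed", "terrible"}


def toy_compound(sentence: str) -> float:
    """Hypothetical scorer in [-1, 1]; a real pipeline would call VADER here."""
    words = re.findall(r"[a-z']+", sentence.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    return (pos - neg) / max(pos + neg, 1)


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def most_polarized(
    text: str,
    score: Callable[[str], float] = toy_compound,
    top_n: int = 1,
) -> List[Tuple[float, str]]:
    """Score each sentence and return the top_n with the largest |score|."""
    scored = [(score(sentence), sentence) for sentence in split_sentences(text)]
    scored.sort(key=lambda pair: abs(pair[0]), reverse=True)
    return scored[:top_n]
```

To use VADER instead of the toy scorer, one option is to pass `lambda s: SentimentIntensityAnalyzer().polarity_scores(s)["compound"]` as `score`, with `SentimentIntensityAnalyzer` imported from `vaderSentiment.vaderSentiment` (creating the analyzer once and reusing it would be cheaper).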

Challenges & Solutions

  • Challenge: Modern news sites often block standard scrapers or use heavy JavaScript.
  • Solution: Implemented a rotating User-Agent header strategy and a direct DOM traversal fallback that looks for semantic tag patterns (e.g., <article>, meta[name="description"]) when standard extraction fails.
  • Challenge: Generic sentiment libraries struggle with news nuance (e.g., a crime report scores as "negative" because of its subject matter, not because of the quality of the article).
  • Solution: Tuned VADER thresholds specifically for long-form text (compound score > 0.05) rather than relying on rules designed for short social media posts.
  • Constraint: Deployment is currently limited to batches of 5 articles due to Render Free Tier execution timeouts (30s limit for synchronous requests).
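The scraping fallback described above might look roughly like the sketch below. It is not the project's code: it uses the standard-library `html.parser` in place of BeautifulSoup so that it has no dependencies, the primary extractor (Newspaper3k in the project) is passed in as a plain callable, and the User-Agent strings, the `min_chars` cutoff, and all function names are assumptions.

```python
import random
import urllib.request
from html.parser import HTMLParser
from typing import Callable, List

# Example header values to rotate through (assumed, not from the project).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a page using a randomly chosen User-Agent header."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


class ArticleFallbackParser(HTMLParser):
    """Collect <p> text inside <article> plus the meta description."""

    def __init__(self) -> None:
        super().__init__()
        self.article_depth = 0
        self.in_paragraph = False
        self.paragraphs: List[str] = []
        self.description = ""
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "article":
            self.article_depth += 1
        elif tag == "p" and self.article_depth > 0:
            self.in_paragraph = True
            self._buffer = []
        elif tag == "meta" and (attr.get("name") or "").lower() == "description":
            self.description = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1
        elif tag == "p" and self.in_paragraph:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self._buffer.append(data)


def fallback_extract(html: str) -> str:
    """Prefer <article> paragraphs; otherwise use the meta description."""
    parser = ArticleFallbackParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs) or parser.description


def extract_text(html: str, primary: Callable[[str], str], min_chars: int = 200) -> str:
    """Try the primary extractor; fall back if it fails or returns too little."""
    try:
        text = (primary(html) or "").strip()
    except Exception:
        text = ""  # treat any primary failure as "no result"
    return text if len(text) >= min_chars else fallback_extract(html)
```

A caller would fetch the page with `fetch_html(url)` and pass the result to `extract_text` together with a function wrapping the primary library. Treating a very short primary result as a failure is a design choice of this sketch, not a documented behavior of the project.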