📜 Political Analytics RAG Learning Portal

This project is an advanced Retrieval-Augmented Generation (RAG) system designed as an interactive learning portal for political analytics. It leverages the Socratic method, using a Q&A interface to help users explore complex topics at the intersection of data science and political science.
The system is built with two distinct architectures: a foundational RAG pipeline and an enhanced version that incorporates structured, step-by-step reasoning for more complex analytical queries.


✨ Key Features

Dual RAG Architecture

  • Simple RAG: A robust baseline system for answering factual questions by retrieving relevant text chunks from a knowledge base.
  • Enhanced Reasoning RAG: Utilizes the llm-reasoners framework to introduce a multi-step reasoning process for handling complex, analytical questions.

Interactive Learning Portal

  • A user-friendly web interface built with Streamlit that allows users to ask questions, view answers, and explore source documents.
  • Includes features like modular viewing and progress tracking to guide the user's educational journey.
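
A minimal sketch of the portal's question-and-answer loop, assuming standard Streamlit widgets; the real interface (including the modular viewing and progress tracking mentioned above) lives in streamlit_base.py, and answer_question here is a hypothetical stand-in for the project's RAG call.

import streamlit as st

# Hypothetical stand-in for the RAG pipeline in rag_base.py / enhanced_rag_base.py.
def answer_question(question: str):
    return "Stub answer for illustration.", ["example_source.pdf"]

st.title("Political Analytics Learning Portal")
question = st.text_input("Ask a question about political analytics")

if question:
    answer, sources = answer_question(question)
    st.write(answer)                       # display the generated answer
    with st.expander("Source documents"):  # let the user inspect retrieved sources
        for source in sources:
            st.markdown(f"- {source}")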

Comprehensive Data Ingestion

  • Builds a knowledge base from a combination of dynamic web pages (17 URLs) and static academic PDFs (7 documents).
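
A minimal ingestion sketch, assuming LangChain's WebBaseLoader and PyPDFLoader; the project's actual loader choices live in rag_run.py and may differ, and the URL below is only a placeholder for the 17 course URLs.

from pathlib import Path
from langchain_community.document_loaders import WebBaseLoader, PyPDFLoader

urls = ["https://example.com/political-analytics-reading"]  # placeholder for the real URL list
pdf_dir = Path("src/datapdfs")

documents = []
for url in urls:
    documents.extend(WebBaseLoader(url).load())             # fetch and parse each web page
for pdf_path in sorted(pdf_dir.glob("*.pdf")):
    documents.extend(PyPDFLoader(str(pdf_path)).load())     # one Document per PDF page

print(f"Loaded {len(documents)} raw documents")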

Robust Evaluation Framework

  • Integrated with the RAGAs framework to quantitatively evaluate the performance of both the simple and enhanced RAG systems.

⚙️ System Architecture

The project follows a modular RAG pipeline that progresses from data ingestion to user interaction:

  1. Data Ingestion: Content from specified URLs and local PDFs is loaded into the system.
  2. Text Processing & Vectorization: Documents are chunked and then converted into numerical embeddings.
  3. Vector Store Creation: The embeddings are stored in a FAISS vector database, creating a searchable knowledge index for efficient retrieval.
  4. Core RAG Logic:
    • When a user asks a question, the system performs a semantic search on the vector store to retrieve relevant document chunks.
    • For complex questions, the llm-reasoners framework adds a structured reasoning layer via prompt engineering before the final answer is generated by an OpenAI model.
  5. User Interface: A Streamlit application provides the front-end for users to interact with the RAG system, displaying the answer, sources, and learning progress.
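
The following is a minimal sketch of steps 2-4, assuming common LangChain components (RecursiveCharacterTextSplitter, HuggingFaceEmbeddings, FAISS) and a plain step-by-step prompt standing in for the llm-reasoners integration; the actual wiring lives in rag_base.py and enhanced_rag_base.py and may differ.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from openai import OpenAI

# Stand-in corpus; in the project this comes from the ingestion step (rag_run.py).
documents = [Document(page_content="Polling data helps campaigns allocate field resources across districts.")]

# Step 2: chunk the documents and embed them with a Sentence-Transformers model.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(documents)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Step 3: build a searchable FAISS index.
vector_store = FAISS.from_documents(chunks, embeddings)

# Step 4: retrieve relevant chunks, add a reasoning instruction, and generate an answer.
question = "How do campaigns use polling data to allocate resources?"
retrieved = vector_store.similarity_search(question, k=4)
context = "\n\n".join(doc.page_content for doc in retrieved)

prompt = (
    "Use the context to answer. Reason step by step before giving a final answer.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)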

📊 Evaluation Results

The system was evaluated using the RAGAs framework to compare the baseline RAG against the version enhanced with llm-reasoners.
The results showed a general decline in performance after the enhancements, likely due to "over-reasoning": adding an explicit reasoning framework on top of an already capable model like gpt-4o-mini appears to interfere with its natural generation abilities.

Metric              Basic RAG   Enhanced RAG (llm-reasoners)   Change
Faithfulness        0.821       0.536                          -0.285
Answer Relevancy    0.957       0.830                          -0.127
Context Precision   1.000       0.806                          -0.194
Context Recall      0.819       0.769                          -0.050

🚀 Getting Started

1. Prerequisites

  • Python 3.8+
  • Git

2. Installation

Clone the repository:

git clone https://github.com/vashishthdoshi/data-analytics-in-politics-learningportal.git
cd data-analytics-in-politics-learningportal

Create a virtual environment and activate it:

python -m venv venv
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate

Install the required dependencies:

pip install -r requirements.txt

Key packages include: streamlit, langchain, ragas, faiss-cpu, sentence-transformers, openai, beautifulsoup4, pypdf, python-dotenv.

Set up environment variables: Create a file named .env in the root directory of the project and add your OpenAI API key:

OPENAI_API_KEY="sk-..."
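
The scripts read this key at runtime; below is a minimal sketch of the usual python-dotenv pattern (the project's actual loading happens in config.py or the individual scripts and may differ).

import os
from dotenv import load_dotenv

load_dotenv()                              # reads .env from the project root
api_key = os.getenv("OPENAI_API_KEY")      # None if the key is missing
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")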

3. Usage

Build the Knowledge Base
Run the ingestion script to process URLs and PDFs, create embeddings, and save the FAISS index.
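
Based on the file structure below, rag_run.py appears to be the ingestion entry point; adjust the command if your checkout differs.

python rag_run.py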

Launch the Web Application

streamlit run streamlit_base.py

Run Evaluations
To reproduce the evaluation results, run the RAGAs data-creation and results scripts for each pipeline (see the file structure below).
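
Based on the script names in the file structure, the evaluation likely runs along these lines (confirm the exact commands in your checkout):

python rag_base_ragas_data_creation.py
python rag_base_ragas_results.py
python enhanced_ragas_data_creation.py
python enhanced_ragas_results.py

For reference, this is a minimal sketch of how the four metrics in the table above are typically computed with the classic ragas API, using a tiny hand-built dataset in place of the project's test questions (src/test_questions_lp.txt):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Tiny illustrative dataset; the project builds its own from src/test_questions_lp.txt.
eval_data = Dataset.from_dict({
    "question": ["How do campaigns use polling data?"],
    "answer": ["Campaigns use polls to target persuadable voters and allocate resources."],
    "contexts": [["Polling data guides resource allocation across districts."]],
    "ground_truth": ["Polls inform voter targeting and resource allocation."],
})

scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # per-metric averages, e.g. faithfulness, answer_relevancy, ...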


📁 File Structure

.
├── src/
│   ├── datapdfs/               # Local PDF documents
│   └── test_questions_lp.txt   # Test questions for evaluation
├── config.py                   # Central configuration
├── rag_base.py                 # Simple RAG logic
├── rag_run.py                  # Ingestion and vector store creation
├── enhanced_rag_base.py        # Reasoning-enhanced RAG logic
├── enhanced_rag_run.py         # Example script for enhanced RAG
├── streamlit_base.py           # Streamlit front-end
├── rag_base_ragas_data_creation.py # Evaluation data prep (simple RAG)
├── rag_base_ragas_results.py   # Run RAGAs on simple RAG
├── enhanced_ragas_data_creation.py # Evaluation data prep (enhanced RAG)
└── enhanced_ragas_results.py   # Run RAGAs on enhanced RAG

🛠️ Technologies Used

  • Core Framework: LangChain
  • Reasoning Framework: llm-reasoners
  • Web UI: Streamlit
  • Vector Database: FAISS
  • Embeddings: Hugging Face Sentence-Transformers
  • LLMs: OpenAI
  • Evaluation: RAGAs
