This project is an advanced Retrieval-Augmented Generation (RAG) system designed as an interactive learning portal for political analytics. It leverages the Socratic method, using a Q&A interface to help users explore complex topics at the intersection of data science and political science.
The system is built with two distinct architectures: a foundational RAG pipeline and an enhanced version that incorporates structured, step-by-step reasoning for more complex analytical queries.
- Simple RAG: A robust baseline system for answering factual questions by retrieving relevant text chunks from a knowledge base.
- Enhanced Reasoning RAG: Utilizes the llm-reasoners framework to introduce a multi-step reasoning process for handling complex, analytical questions.
- A user-friendly web interface built with Streamlit that allows users to ask questions, view answers, and explore source documents.
- Includes features like modular viewing and progress tracking to guide the user's educational journey.
- Builds a knowledge base from a combination of dynamic web pages (17 URLs) and static academic PDFs (7 documents).
- Integrated with the RAGAs framework to quantitatively evaluate the performance of both the simple and enhanced RAG systems.
The project follows a modular RAG pipeline that progresses from data ingestion to user interaction:
- Data Ingestion: Content from specified URLs and local PDFs is loaded into the system.
- Text Processing & Vectorization: Documents are chunked and then converted into numerical embeddings.
- Vector Store Creation: The embeddings are stored in a FAISS vector database, creating a searchable knowledge index for efficient retrieval.
- Core RAG Logic:
  - When a user asks a question, the system performs a semantic search on the vector store to retrieve relevant document chunks.
  - For complex questions, the llm-reasoners framework adds a structured reasoning layer via prompt engineering before the final answer is generated by an OpenAI model (a sketch of this flow follows the list below).
- User Interface: A Streamlit application provides the front-end for users to interact with the RAG system, displaying the answer, sources, and learning progress.
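For orientation, here is a minimal sketch of the ingestion-to-retrieval flow using classic LangChain APIs. The URLs, file paths, chunk sizes, embedding model, and retriever settings are illustrative assumptions rather than the project's actual configuration, and the exact import paths depend on the pinned LangChain version (newer releases move these classes into `langchain_community` / `langchain_openai`).

```python
# Minimal sketch of the ingestion -> retrieval flow (classic LangChain API).
# URLs, paths, chunk sizes, and model names are illustrative, not the project's actual config.
from langchain.document_loaders import WebBaseLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# 1. Data ingestion: load web pages and local PDFs.
docs = WebBaseLoader(["https://example.com/political-analytics-article"]).load()
docs += PyPDFLoader("src/datapdfs/example_paper.pdf").load()

# 2. Text processing: split documents into overlapping chunks.
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# 3. Vectorization + vector store: embed chunks and index them with FAISS.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(chunks, embeddings)
vector_store.save_local("faiss_index")

# 4. Core RAG logic: retrieve relevant chunks and generate an answer.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-4o-mini", temperature=0),
    retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)
result = qa({"query": "How is polling data used to forecast elections?"})
print(result["result"])
```

The structured reasoning layer in the actual project is wired through llm-reasoners; the underlying idea can be approximated with plain prompt engineering, as in this illustrative template (not the llm-reasoners API itself):

```python
# Hypothetical prompt template approximating the structured-reasoning layer;
# the real project routes this step through the llm-reasoners framework.
REASONING_TEMPLATE = """You are a tutor for political analytics.
Use only the retrieved context to answer.

Context:
{context}

Question: {question}

Step 1 - Decompose the question into sub-questions.
Step 2 - Answer each sub-question from the context, citing the relevant chunk.
Step 3 - Synthesize a final, concise answer."""
```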
The system was evaluated using the RAGAs framework to compare the baseline RAG against the version enhanced with llm-reasoners.
The results showed a general decline in performance after the enhancements, likely due to "over-reasoning": imposing an explicit reasoning scaffold on an already capable model like gpt-4o-mini interferes with its natural generation abilities.
| Metric | Simple RAG | Enhanced RAG (llm-reasoners) | Change |
|---|---|---|---|
| Faithfulness | 0.821 | 0.536 | -0.285 |
| Answer Relevancy | 0.957 | 0.830 | -0.127 |
| Context Precision | 1.000 | 0.806 | -0.194 |
| Context Recall | 0.819 | 0.769 | -0.050 |
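For reference, a minimal sketch of how such scores can be produced with RAGAs, following the 0.1-style `ragas` interface; the sample record is invented for illustration, and the default judge models require an OPENAI_API_KEY to be set:

```python
# Minimal RAGAs scoring sketch (ragas 0.1-style API); the sample record is illustrative only.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_data = {
    "question": ["How do pollsters weight survey samples?"],
    "answer": ["Pollsters apply post-stratification weights so the sample matches census demographics."],
    "contexts": [["Weighting adjusts each respondent's influence so the sample mirrors the population."]],
    "ground_truth": ["Samples are weighted to match known population demographics."],
}

results = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)  # dict-like scores, e.g. faithfulness, answer_relevancy, ...
```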
- Python 3.8+
- Git
Clone the repository:
```bash
git clone <your-repo-url>
cd <your-repo-folder>
```
Create a virtual environment and activate it:
```bash
python -m venv venv
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
```
Install the required dependencies:
```bash
pip install -r requirements.txt
```
Key packages include: streamlit, langchain, ragas, faiss-cpu, sentence-transformers, openai, beautifulsoup4, pypdf, python-dotenv.
Set up environment variables:
Create a file named .env in the root directory of the project and add your OpenAI API key:
```
OPENAI_API_KEY="sk-..."
```
Build the Knowledge Base
Run the ingestion script to process URLs and PDFs, create embeddings, and save the FAISS index.
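Based on the file descriptions in the project structure below, rag_run.py is the ingestion entry point, so the command is presumably:

```bash
python rag_run.py
```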
Launch the Web Application
```bash
streamlit run streamlit_base.py
```
Run Evaluations
To reproduce evaluation results, run the scripts generating RAGAs scores for both pipelines.
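The exact invocations are not documented here, but the script names in the project structure suggest a sequence like the following (data creation first, then scoring):

```bash
# Simple RAG pipeline
python rag_base_ragas_data_creation.py
python rag_base_ragas_results.py

# Enhanced (llm-reasoners) pipeline
python enhanced_ragas_data_creation.py
python enhanced_ragas_results.py
```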
```
.
├── src/
│   ├── datapdfs/                        # Local PDF documents
│   └── test_questions_lp.txt            # Test questions for evaluation
├── config.py                            # Central configuration
├── rag_base.py                          # Simple RAG logic
├── rag_run.py                           # Ingestion and vector store creation
├── enhanced_rag_base.py                 # Reasoning-enhanced RAG logic
├── enhanced_rag_run.py                  # Example script for enhanced RAG
├── streamlit_base.py                    # Streamlit front-end
├── rag_base_ragas_data_creation.py      # Evaluation data prep (simple RAG)
├── rag_base_ragas_results.py            # Run RAGAs on simple RAG
├── enhanced_ragas_data_creation.py      # Evaluation data prep (enhanced RAG)
└── enhanced_ragas_results.py            # Run RAGAs on enhanced RAG
```
- Core Framework: LangChain
- Reasoning Framework: llm-reasoners
- Web UI: Streamlit
- Vector Database: FAISS
- Embeddings: Hugging Face Sentence-Transformers
- LLMs: OpenAI
- Evaluation: RAGAs