This repository provides a fully modular implementation of a Retrieval-Augmented Generation (RAG) pipeline tailored for Italian legal-domain documents. The system handles the complete workflow: extracting and preprocessing raw text data, transforming it into dense vector representations, and storing embeddings efficiently in Milvus for retrieval. Beyond storage, it integrates a hybrid retrieval approach that combines BM25 with dense vector similarity, followed by reranking, to achieve high-quality, contextually relevant document retrieval.
The pipeline is optimized for modern machine learning hardware accelerators (e.g., GPUs or specialized inference hardware), and parameters are configurable to adapt to different workloads.
Designed with modularity in mind, each component can be run independently as a Python module, while the entire pipeline can be orchestrated through a central entry point. This makes experimentation, debugging, and production deployment more flexible.
The project is licensed under the Apache License 2.0, which permits both academic and commercial usage with proper attribution.
This component implements hierarchical parent-child chunking using LangChain's RecursiveCharacterTextSplitter to divide documents into larger parent chunks for broader context and smaller child chunks for precise semantic retrieval and embedding. It generates unique MD5-hashed IDs for chunks, computes metadata including word/char/token counts and validity checks, and links children to parents for efficient storage in Milvus, enabling retrieval of fine-grained matches while accessing full contextual parents during generation. This strategy optimizes RAG for legal texts by preserving structural coherence and reducing information loss in long documents.
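The ID generation and parent-child linkage described above can be sketched in plain Python. This is a minimal illustration, not the repository's actual code: the helper names (`make_chunk_id`, `link_parent_children`) are hypothetical, and the real implementation splits text with LangChain's RecursiveCharacterTextSplitter.

```python
import hashlib

def make_chunk_id(text: str, source: str) -> str:
    # Hypothetical helper: derive a deterministic chunk ID by MD5-hashing
    # the chunk text together with its source document identifier.
    return hashlib.md5(f"{source}:{text}".encode("utf-8")).hexdigest()

def link_parent_children(parent_text: str, child_texts: list, source: str) -> dict:
    # Build the parent record and attach each child's ID and parent_id,
    # mirroring the parent-child linkage stored in Milvus.
    parent_id = make_chunk_id(parent_text, source)
    children = [
        {
            "chunk_id": make_chunk_id(child, source),
            "parent_id": parent_id,
            "text": child,
            "word_count": len(child.split()),
        }
        for child in child_texts
    ]
    return {"parent_id": parent_id, "text": parent_text, "children": children}
```

Because the IDs are content-derived, re-running the chunker over unchanged documents produces the same IDs, which keeps upserts into the vector store idempotent.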
This component handles duplicate text detection to ensure storage efficiency and avoid redundant data in the vector database. The current implementation uses a signature-based deduplication strategy.
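A signature-based strategy of this kind might look like the following sketch, where each chunk is reduced to an MD5 signature over normalized text; the function names are illustrative, not the module's actual API.

```python
import hashlib
import re

def text_signature(text: str) -> str:
    # Collapse whitespace and lowercase before hashing so trivially
    # different copies of the same passage map to one signature.
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def deduplicate(chunks):
    # Keep the first chunk seen for each signature and drop the rest.
    seen, unique = set(), []
    for chunk in chunks:
        sig = text_signature(chunk)
        if sig not in seen:
            seen.add(sig)
            unique.append(chunk)
    return unique
```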
This component generates dense vector embeddings for chunked texts using the SentenceTransformer library with a selected model such as dlicari/Italian-Legal-BERT-SC. It supports parent-child chunking strategies for hierarchical retrieval, processes directories or individual files with configurable max/min chunk lengths, and saves normalized, truncated embeddings while handling metadata like word counts and parent IDs. Optimized for hardware accelerators (prioritizing XPU or GPU with CPU fallback), the module includes comprehensive logging for device usage, embedding generation, and error recovery to ensure robust, modular operation in RAG pipelines.
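The normalization step mentioned above amounts to scaling each vector to unit length so that cosine similarity in Milvus reduces to a dot product. A dependency-free sketch (the helper names are illustrative):

```python
import math

def l2_normalize(vector):
    # Scale a vector to unit length; after this, cosine similarity
    # between two vectors is just their dot product.
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

def truncate_tokens(tokens, max_len):
    # Drop tokens beyond the model's maximum sequence length
    # before embedding, matching the "truncated embeddings" above.
    return tokens[:max_len]
```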
This component implements hybrid retrieval by combining BM25 sparse retrieval for keyword matching with dense vector similarity search using Milvus embeddings to capture semantic relevance in texts. Retrieved candidates from both methods are fused using a weighted score.
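One common way to fuse the two candidate lists, sketched here under the assumption that scores are min-max normalized before a weighted sum (the repository may use a different fusion scheme):

```python
def min_max(scores):
    # Scale raw scores to [0, 1] so BM25 and cosine scores are comparable.
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def fuse(bm25_scores, dense_scores, alpha=0.5):
    # Weighted sum of normalized sparse and dense scores; a document
    # missing from one candidate list contributes 0 for that component.
    bm25_n, dense_n = min_max(bm25_scores), min_max(dense_scores)
    docs = set(bm25_n) | set(dense_n)
    fused = {d: alpha * bm25_n.get(d, 0.0) + (1 - alpha) * dense_n.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

The weight `alpha` shifts emphasis between exact keyword matching (useful for statute numbers and legal citations) and semantic similarity.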
This component refines fused candidates from hybrid retrieval using a cross-encoder model, such as dlicari/Italian-Legal-BERT, which jointly embeds query and chunk pairs to compute nuanced relevance scores beyond initial similarity metrics.
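The reranking step can be abstracted as scoring each (query, chunk) pair and keeping the best candidates. In this sketch, `score_fn` stands in for the cross-encoder's prediction call (e.g., `CrossEncoder.predict` from sentence-transformers); a toy lexical scorer is used below so the sketch stays self-contained.

```python
def rerank(query, candidates, score_fn, top_k=5):
    # Score every (query, chunk) pair with a cross-encoder-style callable
    # and return the top_k candidates ordered by relevance score.
    scored = [(chunk, score_fn(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy stand-in scorer: word-overlap count instead of a real cross-encoder.
overlap = lambda q, c: len(set(q.split()) & set(c.split()))
```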
Before passing prompts to the generator, an anonymization step is applied to mask private names in the input text. This step employs the DeepMount00/universal_ner_ita model, a zero-shot named entity recognition model tailored specifically for Italian language texts. It detects and replaces personal names with a generic placeholder to ensure privacy and compliance with data protection requirements.
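Once the NER model has produced character spans for personal names, the masking itself is a simple substitution. This sketch assumes spans are given as (start, end) character offsets, which is a common NER output format; the function name is hypothetical.

```python
def mask_entities(text, spans, placeholder="[NOME]"):
    # Replace each detected (start, end) character span with a placeholder,
    # working right-to-left so earlier offsets stay valid after replacement.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text
```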
To support provenance and responsible AI deployment, the generator module employs watermarking functionality. Each generated text is embedded with a statistical or algorithmic watermark using adapted scripts from lm-watermarking. This watermarking allows for subsequent identification and verification of text provenance, making it possible to establish attribution, monitor outputs in downstream systems, and prevent unauthorized reuse. Incorporating watermarking is critical in legal and compliance contexts, as it helps to reliably track which texts were generated by this pipeline.
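The lm-watermarking scheme partitions the vocabulary into "green" and "red" lists seeded by the preceding token and biases generation toward green tokens; detection then tests whether the green fraction is improbably high. The following is a toy pure-Python illustration of that idea, not the actual lm-watermarking code (which operates on model logits via a logits processor).

```python
import math
import random

def green_list(prev_token_id, vocab_size, gamma=0.5):
    # Seed a PRNG with the previous token id and take the first gamma
    # fraction of a shuffled vocabulary as the "green" list.
    rng = random.Random(prev_token_id)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def detect(token_ids, vocab_size, gamma=0.5):
    # Count tokens falling in their green list and compute a z-score;
    # watermarked text shows a green fraction well above gamma.
    hits = sum(
        1 for prev, tok in zip(token_ids, token_ids[1:])
        if tok in green_list(prev, vocab_size, gamma)
    )
    n = len(token_ids) - 1
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

In the real pipeline the bias is applied softly during sampling, so fluency is preserved while the statistical signal remains detectable.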
The repository supports flexible model loading for the generation stage, integrating Hugging Face Transformers with optional PEFT adapters for fine-tuned causal, seq2seq, and encoder-only models, alongside Ollama for local LLM inference, enabling hardware-optimized deployment across diverse architectures.
Logging is embedded in each module using Python's built-in logging library, with loggers named by module to enable granular control and avoid root logger conflicts. Messages are categorized by levels (DEBUG for diagnostics, INFO for workflow tracking, WARNING/ERROR for issues) and output to console for real-time monitoring while persisting to timestamped files in logs/ for auditing and debugging in the RAG pipeline.
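The per-module pattern described above can be sketched as follows; the helper name `get_module_logger` and the file-naming convention are illustrative assumptions, not the repository's exact code.

```python
import logging
import os
from datetime import datetime

def get_module_logger(name, log_dir="logs", level=logging.INFO):
    # Named logger writing to the console and to a timestamped file,
    # avoiding root-logger configuration entirely.
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated imports
        os.makedirs(log_dir, exist_ok=True)
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        file_handler = logging.FileHandler(
            os.path.join(log_dir, f"{name}_{stamp}.log")
        )
        fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        for handler in (logging.StreamHandler(), file_handler):
            handler.setFormatter(fmt)
            logger.addHandler(handler)
    return logger
```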
This project follows a modular structure, and each script can be executed as a module using Python's -m flag from the repository's root directory. This approach ensures that relative imports are resolved correctly.
To run the complete RAG pipeline (after installing all required dependencies), execute:
python -m main
Note: This project is under active development. Expect changes in structure and functionality in the near future.
This project incorporates watermarking functionality from the lm-watermarking repository, developed by John Kirchenbauer et al. and licensed under the Apache License, Version 2.0. The watermark processor script (extended_watermark_processor.py) is included in src/generation/, while the scripts alternative_prf_schemes.py, homoglyphs.py, and normalizers.py are included in src/utils/watermarking/. Their associated LICENSE file is included at third_party/LICENSE.md. The scripts are adapted to integrate with the LLMGenerator class for watermarking AI-generated legal text outputs in the RAG pipeline.
This project also uses the DeepMount00/universal_ner_ita anonymization model from Hugging Face to mask personal names in Italian texts as part of the prompt preprocessing step.
This project is licensed under the Apache License, Version 2.0 (the "License"). You may not use, copy, modify, or distribute this project except in compliance with the License. A copy of the License is included in the LICENSE file in this repository.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations.