This repository provides a fully modular implementation of a Retrieval-Augmented Generation (RAG) pipeline tailored for Italian legal-domain documents. The system handles the complete workflow: extracting and preprocessing raw text data, transforming it into dense vector representations, and storing embeddings efficiently in Milvus for retrieval. Beyond storage, it integrates a hybrid retrieval approach that combines BM25 with dense vector similarity, followed by reranking, to achieve high-quality, contextually relevant document retrieval.
The pipeline is optimized for modern machine learning hardware accelerators (e.g., GPUs or specialized inference hardware), and parameters are configurable to adapt to different workloads.
Designed with modularity in mind, each component can be run independently as a Python module, while the entire pipeline can be orchestrated through a central entry point. This makes experimentation, debugging, and production deployment more flexible.
The project is licensed under the Apache License 2.0, which permits both academic and commercial usage with proper attribution.
This component implements hierarchical parent-child chunking using LangChain's RecursiveCharacterTextSplitter to divide documents into larger parent chunks for broader context and smaller child chunks for precise semantic retrieval and embedding. It generates unique MD5-hashed IDs for chunks, computes metadata including word/char/token counts and validity checks, and links children to parents for efficient storage in Milvus, enabling retrieval of fine-grained matches while accessing full contextual parents during generation. This strategy optimizes RAG for legal texts by preserving structural coherence and reducing information loss in long documents.
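The ID generation and parent-child linkage described above can be sketched in plain Python. This is a minimal illustration, not the repository's actual code: the helper names (`make_chunk_id`, `link_parent_children`) are hypothetical, and the real implementation splits text with LangChain's RecursiveCharacterTextSplitter.

```python
import hashlib

def make_chunk_id(text: str, source: str) -> str:
    # Hypothetical helper: derive a deterministic chunk ID by MD5-hashing
    # the chunk text together with its source document identifier.
    return hashlib.md5(f"{source}:{text}".encode("utf-8")).hexdigest()

def link_parent_children(parent_text: str, child_texts: list, source: str) -> dict:
    # Build the parent record and attach each child's ID and parent_id,
    # mirroring the parent-child linkage stored in Milvus.
    parent_id = make_chunk_id(parent_text, source)
    children = [
        {
            "chunk_id": make_chunk_id(child, source),
            "parent_id": parent_id,
            "text": child,
            "word_count": len(child.split()),
        }
        for child in child_texts
    ]
    return {"parent_id": parent_id, "text": parent_text, "children": children}
```

Because the IDs are content-derived, re-running the chunker over unchanged documents produces the same IDs, which keeps upserts into the vector store idempotent.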
This component handles duplicate text detection to ensure storage efficiency and avoid redundant data in the vector database. The current implementation uses a signature-based deduplication strategy.
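A signature-based strategy of this kind might look like the following sketch, where each chunk is reduced to an MD5 signature over normalized text; the function names are illustrative, not the module's actual API.

```python
import hashlib
import re

def text_signature(text: str) -> str:
    # Collapse whitespace and lowercase before hashing so trivially
    # different copies of the same passage map to one signature.
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def deduplicate(chunks):
    # Keep the first chunk seen for each signature and drop the rest.
    seen, unique = set(), []
    for chunk in chunks:
        sig = text_signature(chunk)
        if sig not in seen:
            seen.add(sig)
            unique.append(chunk)
    return unique
```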
This component generates dense vector embeddings for chunked texts using the SentenceTransformer library with a selected model such as dlicari/Italian-Legal-BERT-SC. It supports parent-child chunking strategies for hierarchical retrieval, processes directories or individual files with configurable max/min chunk lengths, and saves normalized, truncated embeddings while handling metadata like word counts and parent IDs. Optimized for hardware accelerators (prioritizing XPU or GPU with CPU fallback), the module includes comprehensive logging for device usage, embedding generation, and error recovery to ensure robust, modular operation in RAG pipelines.
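The normalization step mentioned above amounts to scaling each vector to unit length so that cosine similarity in Milvus reduces to a dot product. A dependency-free sketch (the helper names are illustrative):

```python
import math

def l2_normalize(vector):
    # Scale a vector to unit length; after this, cosine similarity
    # between two vectors is just their dot product.
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

def truncate_tokens(tokens, max_len):
    # Drop tokens beyond the model's maximum sequence length
    # before embedding, matching the "truncated embeddings" above.
    return tokens[:max_len]
```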
This component implements hybrid retrieval by combining BM25 sparse retrieval for keyword matching with dense vector similarity search using Milvus embeddings to capture semantic relevance in texts. Retrieved candidates from both methods are fused using a weighted score.
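One common way to fuse the two candidate lists, sketched here under the assumption that scores are min-max normalized before a weighted sum (the repository may use a different fusion scheme):

```python
def min_max(scores):
    # Scale raw scores to [0, 1] so BM25 and cosine scores are comparable.
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def fuse(bm25_scores, dense_scores, alpha=0.5):
    # Weighted sum of normalized sparse and dense scores; a document
    # missing from one candidate list contributes 0 for that component.
    bm25_n, dense_n = min_max(bm25_scores), min_max(dense_scores)
    docs = set(bm25_n) | set(dense_n)
    fused = {d: alpha * bm25_n.get(d, 0.0) + (1 - alpha) * dense_n.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

The weight `alpha` shifts emphasis between exact keyword matching (useful for statute numbers and legal citations) and semantic similarity.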
This component refines fused candidates from hybrid retrieval using a cross-encoder model, such as dlicari/Italian-Legal-BERT, which jointly embeds query and chunk pairs to compute nuanced relevance scores beyond initial similarity metrics.
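The reranking step can be abstracted as scoring each (query, chunk) pair and keeping the best candidates. In this sketch, `score_fn` stands in for the cross-encoder's prediction call (e.g., `CrossEncoder.predict` from sentence-transformers); a toy lexical scorer is used below so the sketch stays self-contained.

```python
def rerank(query, candidates, score_fn, top_k=5):
    # Score every (query, chunk) pair with a cross-encoder-style callable
    # and return the top_k candidates ordered by relevance score.
    scored = [(chunk, score_fn(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy stand-in scorer: word-overlap count instead of a real cross-encoder.
overlap = lambda q, c: len(set(q.split()) & set(c.split()))
```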
Before passing prompts to the generator, an anonymization step is applied to mask private names in the input text. This step employs the DeepMount00/universal_ner_ita model, a zero-shot named entity recognition model tailored specifically for Italian language texts. It detects and replaces personal names with a generic placeholder to ensure privacy and compliance with data protection requirements.
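Once the NER model has produced character spans for personal names, the masking itself is a simple substitution. This sketch assumes spans are given as (start, end) character offsets, which is a common NER output format; the function name is hypothetical.

```python
def mask_entities(text, spans, placeholder="[NOME]"):
    # Replace each detected (start, end) character span with a placeholder,
    # working right-to-left so earlier offsets stay valid after replacement.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text
```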
To support provenance and responsible AI deployment, the generator module employs watermarking functionality. Each generated text is embedded with a statistical or algorithmic watermark using adapted scripts from lm-watermarking. This watermarking allows for subsequent identification and verification of text provenance, making it possible to establish attribution, monitor outputs in downstream systems, and prevent unauthorized reuse. Incorporating watermarking is critical in legal and compliance contexts, as it helps to reliably track which texts were generated by this pipeline.
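The lm-watermarking scheme partitions the vocabulary into "green" and "red" lists seeded by the preceding token and biases generation toward green tokens; detection then tests whether the green fraction is improbably high. The following is a toy pure-Python illustration of that idea, not the actual lm-watermarking code (which operates on model logits via a logits processor).

```python
import math
import random

def green_list(prev_token_id, vocab_size, gamma=0.5):
    # Seed a PRNG with the previous token id and take the first gamma
    # fraction of a shuffled vocabulary as the "green" list.
    rng = random.Random(prev_token_id)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def detect(token_ids, vocab_size, gamma=0.5):
    # Count tokens falling in their green list and compute a z-score;
    # watermarked text shows a green fraction well above gamma.
    hits = sum(
        1 for prev, tok in zip(token_ids, token_ids[1:])
        if tok in green_list(prev, vocab_size, gamma)
    )
    n = len(token_ids) - 1
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

In the real pipeline the bias is applied softly during sampling, so fluency is preserved while the statistical signal remains detectable.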
The repository supports flexible model loading for the generation stage, integrating Hugging Face Transformers with optional PEFT adapters for fine-tuned causal, seq2seq, and encoder-only models, alongside Ollama for local LLM inference, enabling hardware-optimized deployment across diverse architectures.
Logging is embedded in each module using Python's built-in logging library, with loggers named by module to enable granular control and avoid root logger conflicts. Messages are categorized by levels (DEBUG for diagnostics, INFO for workflow tracking, WARNING/ERROR for issues) and output to console for real-time monitoring while persisting to timestamped files in logs/ for auditing and debugging in the RAG pipeline.
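The per-module pattern described above can be sketched as follows; the helper name `get_module_logger` and the file-naming convention are illustrative assumptions, not the repository's exact code.

```python
import logging
import os
from datetime import datetime

def get_module_logger(name, log_dir="logs", level=logging.INFO):
    # Named logger writing to the console and to a timestamped file,
    # avoiding root-logger configuration entirely.
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated imports
        os.makedirs(log_dir, exist_ok=True)
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        file_handler = logging.FileHandler(
            os.path.join(log_dir, f"{name}_{stamp}.log")
        )
        fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        for handler in (logging.StreamHandler(), file_handler):
            handler.setFormatter(fmt)
            logger.addHandler(handler)
    return logger
```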
This project follows a modular structure, and each script can be executed as a module using Python's -m flag from the repository's root directory. This approach ensures that relative imports are resolved correctly.
To run the complete RAG pipeline (after installing all required dependencies), execute:
python -m main
Note: This project is under active development. Expect changes in structure and functionality in the near future.
This project incorporates watermarking functionality from the lm-watermarking repository, developed by John Kirchenbauer et al. and licensed under the Apache License, Version 2.0. The watermark processor script (extended_watermark_processor.py) is included in src/generation/, while the scripts alternative_prf_schemes.py, homoglyphs.py, and normalizers.py are included in src/utils/watermarking/. Their associated LICENSE file is included at third_party/LICENSE.md. The scripts are adapted to integrate with the LLMGenerator class for watermarking AI-generated legal text outputs in the RAG pipeline.
This project also uses the DeepMount00/universal_ner_ita anonymization model from Hugging Face to mask personal names in Italian texts as part of the prompt preprocessing step.
This project is licensed under the Apache License, Version 2.0 (the "License"). You may not use, copy, modify, or distribute this project except in compliance with the License. A copy of the License is included in the LICENSE file in this repository.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations.