Automating Clinical Trial Eligibility: Building a Smart Pipeline with LangGraph, ChromaDB, and Gemini
Clinical trials are vital for advancing medical research, yet identifying eligible patients is a slow and error-prone process. This project introduces an AI-driven pipeline that automates patient eligibility screening using NLP, vector databases, and LLMs to bridge the gap between clinical data and trial criteria.
The Clinical Trial Eligibility Pipeline processes clinical discharge notes from the MIMIC-IV dataset, summarizes them, matches them against trial inclusion/exclusion criteria, and evaluates patient eligibility.
Built with Python, LangGraph, ChromaDB, and Google’s Gemini API, the system streamlines the entire eligibility workflow through a modular, agent-based design and an interactive Gradio UI.
- Summarization — Condenses lengthy clinical notes into concise, structured summaries.
- Criteria Processing — Converts trial conditions into structured LLM prompts.
- Chunking & Embedding — Stores summary vectors in a ChromaDB collection for retrieval.
- Eligibility Evaluation — Uses Gemini and chain-of-thought (CoT) reasoning to assess patient-trial fit.
- Evaluation Metrics — ROUGE, cosine similarity, and fuzzy matching validate results.
- Interactive UI — A Gradio dashboard visualizes results and database contents.
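The three metric families named above can be sketched with the standard library alone. This is a simplified illustration, not the project's implementation: the function names are hypothetical, ROUGE is reduced to unigram-overlap F1 (ROUGE-1), and cosine similarity is computed over word counts as a stand-in for embedding vectors.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 (ROUGE-1) between a reference and a candidate."""
    ref, cand = Counter(tokenize(reference)), Counter(tokenize(candidate))
    overlap = sum((ref & cand).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts (assumed stand-in for embeddings)."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuzzy_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] using difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```

In practice the cosine score would be computed on the stored embedding vectors, and a dedicated ROUGE package would likely replace the hand-rolled version.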
The workflow is orchestrated using LangGraph, a stateful framework managing data flow between each pipeline component. Each node represents a distinct processing step—from summarization to evaluation—within a controlled execution graph.
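To make the node-and-state idea concrete, here is a minimal plain-Python stand-in for the pattern. It deliberately does not use LangGraph's API (LangGraph's `StateGraph` plays this role in the real pipeline); the state fields, node names, and placeholder logic are all assumptions for illustration.

```python
from typing import Callable, TypedDict


class PipelineState(TypedDict, total=False):
    """Shared state passed between nodes (field names are hypothetical)."""
    note: str
    summary: str
    chunks: list[str]
    decision: str


Node = Callable[[PipelineState], PipelineState]


def summarize(state: PipelineState) -> PipelineState:
    # Placeholder: a real node would call the summarization model here.
    return {"summary": state["note"][:60]}


def chunk(state: PipelineState) -> PipelineState:
    # Placeholder: a real node would split and embed the summary.
    return {"chunks": [state["summary"]]}


def evaluate(state: PipelineState) -> PipelineState:
    # Placeholder: a real node would ask the LLM for an eligibility decision.
    return {"decision": "needs_review"}


PIPELINE: list[tuple[str, Node]] = [
    ("summarize", summarize),
    ("chunk", chunk),
    ("evaluate", evaluate),
]


def run_graph(nodes: list[tuple[str, Node]], state: PipelineState) -> PipelineState:
    """Run nodes in order, merging each node's partial update into the state."""
    for _name, node in nodes:
        state = {**state, **node(state)}
    return state
```

Each node returns only the keys it changes, so earlier fields such as `note` stay available to later steps. Calling `run_graph(PIPELINE, {"note": "..."})` returns the fully populated state.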
- Summarize Clinical Data: Fine-tune a Gemini model to summarize MIMIC-IV discharge notes into concise summaries capturing key diagnoses and outcomes.
- Database Storage: Save raw and summarized texts into a local SQLite database for downstream tasks.
- Criteria Processing: Read inclusion/exclusion criteria from CSVs, convert to JSON, and generate structured CoT prompts.
- Chunking & Embedding: Split summaries into 450-character chunks, embed them using Google’s `text-embedding-004`, and store vectors in ChromaDB.
- Eligibility Evaluation: Retrieve relevant chunks and evaluate eligibility using Gemini reasoning (avoiding inference from symptoms).
- Result Evaluation: Validate Gemini’s decisions via ROUGE and similarity-based metrics; save CSV reports for each trial.
- Visualization (Gradio UI): Display eligibility results, database content, embeddings, and logs in an interactive web interface.
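Two of the steps above, criteria processing and chunking, can be sketched without any external services. The CSV column names (`type`, `criterion`), the prompt wording, and the function names are assumptions. The source specifies only the 450-character chunk size, so the fixed-width, non-overlapping split shown here is one possible reading. The embedding and ChromaDB calls are omitted.

```python
import csv
import io
import json


def criteria_csv_to_json(csv_text: str) -> str:
    """Group rows with hypothetical columns `type` and `criterion` into JSON."""
    grouped: dict[str, list[str]] = {"inclusion": [], "exclusion": []}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row["type"].strip().lower()
        grouped.setdefault(key, []).append(row["criterion"].strip())
    return json.dumps(grouped, indent=2)


def build_cot_prompt(criteria_json: str, context: str) -> str:
    """Assemble a chain-of-thought prompt (wording is illustrative only)."""
    return (
        "You are screening a patient for a clinical trial.\n"
        f"Criteria (JSON):\n{criteria_json}\n\n"
        f"Patient summary excerpts:\n{context}\n\n"
        "Reason step by step through each criterion, using only facts stated "
        "in the excerpts; do not infer diagnoses from symptoms. "
        "Finish with one line: ELIGIBLE or NOT ELIGIBLE."
    )


def chunk_text(text: str, size: int = 450) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]
```

A fixed-width split can cut a sentence in half; splitting on sentence boundaries or adding overlap between chunks are common refinements if retrieval quality suffers.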
| Component | Description |
|---|---|
| LangGraph Agent | Orchestrates pipeline execution and manages state between nodes. |
| Gemini Models | Summarization and reasoning (Gemini 1.5 Flash). |
| ChromaDB | Vector database for chunk retrieval and similarity search. |
| SQLite | Stores summaries and trial metadata. |
| Gradio UI | Frontend interface for results visualization and exploration. |
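As an illustration of the SQLite component, the following sketch stores a raw note next to its summary using Python's built-in `sqlite3` module. The table name, column names, and helper functions are hypothetical; the project's actual schema is not described here.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) notes table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS notes (
               note_id  TEXT PRIMARY KEY,
               raw_text TEXT NOT NULL,
               summary  TEXT
           )"""
    )


def save_note(conn: sqlite3.Connection, note_id: str, raw_text: str, summary: str) -> None:
    """Insert or overwrite a note; the `with` block commits the transaction."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO notes (note_id, raw_text, summary) VALUES (?, ?, ?)",
            (note_id, raw_text, summary),
        )


def load_summary(conn: sqlite3.Connection, note_id: str):
    """Return the stored summary, or None if the note is unknown."""
    row = conn.execute(
        "SELECT summary FROM notes WHERE note_id = ?", (note_id,)
    ).fetchone()
    return row[0] if row else None
```

Using `INSERT OR REPLACE` with a primary key keeps reruns of the pipeline idempotent: summarizing the same note twice updates the row instead of duplicating it.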
The pipeline demonstrates how AI can reduce manual screening workload and accelerate patient-trial matching, increasing recruitment efficiency while maintaining interpretability and compliance.
Future Directions:
- Integrate real-time eligibility visualization.
- Expand support to multimodal data (e.g., imaging and lab reports).
- Deploy as a clinical decision-support tool for hospital systems.
This repository showcases how LangGraph, ChromaDB, and LLMs like Gemini can collaborate in a modular agent architecture to make clinical trial screening faster, more accurate, and scalable.
