This is the Master's Thesis project of Francesco Congiu (@wakaflocka17, repository owner), a student at the University of Cagliari, in collaboration with XFERENCE.
This repository offers a Retrieval-Augmented Generation (RAG) pipeline based on an Expectation-Maximisation algorithm, inspired by the Iterative Self-Incentivization framework by Shi et al. (2025). After an initial pre-processing step, we use Qdrant for semantic retrieval, apply dynamic filters on the metadata, and finally run a contextual reranking module that selects the most relevant textual chunks.
The RAG Pipeline integrates multiple stages, from data pre-processing to query generation and document retrieval using LLMs and VectorDB. This pipeline is designed to provide efficient and accurate responses from large documents, leveraging state-of-the-art AI models.
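The retrieve-filter-rerank flow described above can be sketched in pure Python. All names, vectors, and metadata below are illustrative assumptions; the actual pipeline uses Qdrant for retrieval and a dedicated contextual reranking module rather than this toy cosine scorer:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, chunks, metadata_filter, top_k=2):
    """Semantic retrieval with a dynamic metadata filter, then rerank by score."""
    candidates = [c for c in chunks if metadata_filter(c["meta"])]   # metadata filtering
    scored = [(cosine(query_vec, c["vec"]), c) for c in candidates]  # semantic scoring
    scored.sort(key=lambda pair: pair[0], reverse=True)              # rerank stand-in
    return [c["text"] for _, c in scored[:top_k]]

# Toy corpus: three chunks with (2-d) embeddings and document metadata.
chunks = [
    {"text": "TCP handshake", "vec": [1.0, 0.0], "meta": {"doc": "rfc793"}},
    {"text": "TLS overview",  "vec": [0.9, 0.1], "meta": {"doc": "rfc8446"}},
    {"text": "SMTP basics",   "vec": [0.0, 1.0], "meta": {"doc": "rfc5321"}},
]

hits = retrieve([1.0, 0.0], chunks, lambda m: m["doc"] != "rfc5321")
print(hits)  # → ['TCP handshake', 'TLS overview']
```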
Thanks to all the contributors who made this project possible!
| Name | Role | Contact |
|------|------|---------|
| Francesco Congiu | MSc Student, UniCa | f.congiu38@studenti.unica.it |
| Ludovico Boratto | Professor, UniCa | ludovico.boratto@unica.it |
| Gianni Fenu | Professor, UniCa | gianni.fenu@unica.it |
| Gionathan Desogus | SWE/AI, XFERENCE | gionathan.desogus@xference.ai |
| Michele Fadda | CTO, XFERENCE | michele.fadda@xference.ai |
The pipeline is divided into several stages, each of which can be run by following these instructions:
1. **Clone the repository** to your local machine:

   ```bash
   git clone https://github.com/wakaflocka17/RAG_PIPELINE.git
   ```

2. **Create and activate a Python virtual environment:**

   - macOS/Linux:

     ```bash
     python3.11 -m venv myenv
     source myenv/bin/activate
     ```

   - Windows (PowerShell):

     ```powershell
     python3.11 -m venv myenv
     .\myenv\Scripts\Activate.ps1
     ```

   - Windows (Command Prompt):

     ```bat
     python3.11 -m venv myenv
     .\myenv\Scripts\activate.bat
     ```
3. **Install `pip-tools`, compile `requirements.txt` from `requirements.in`, and install the dependencies:**

   First, install the `pip-tools` library:

   ```bash
   pip install pip-tools
   ```

   Next, generate the `requirements.txt` file from `requirements.in`:

   ```bash
   pip-compile requirements.in
   ```

   Finally, install the dependencies required for RetrievEM to work:

   ```bash
   pip install -r requirements.txt
   ```
4. **Download and run Qdrant (using Docker):**

   First, pull the latest Qdrant image from Docker Hub:

   ```bash
   docker pull qdrant/qdrant
   ```

   Then, run the Qdrant service, mounting a local `qdrant_storage` directory for persistent storage:

   ```bash
   docker run -p 6333:6333 -p 6334:6334 \
     -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
     qdrant/qdrant
   ```

   Under this configuration, all Qdrant data will be stored in `./qdrant_storage` on your host machine. Qdrant will now be accessible at:

   - REST API: http://localhost:6333
   - Web UI: http://localhost:6333/dashboard
   - gRPC API: localhost:6334
5. **Execute the pipeline:**

   The framework can be launched in two main modes: interactive and experimental evaluation.

   To start a multi-turn RAG session with document download, chunking, enrichment, embedding, and ingestion into Qdrant:

   ```bash
   python main.py pipeline
   ```

   This runs the following steps (each only if not already completed): downloading the required documents, generating chunks, enriching them, creating embeddings, and ingesting the embeddings into Qdrant, then starting the interactive session (for demo or quick-testing purposes only; not used in the thesis experiments).

   All experiments described in the thesis are managed through `src/evaluation/eval.py`. For each Research Question (RQ), the corresponding evaluation can be run:

   ```bash
   python src/evaluation/eval.py --rq rq1 --dataset fiqa
   python src/evaluation/eval.py --rq rq2 --dataset fiqa
   python src/evaluation/eval.py --rq rq3 --dataset fiqa
   python src/evaluation/eval.py --rq rq4 --dataset fiqa
   python src/evaluation/eval.py --rq rq5 --dataset fiqa
   ```
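The "only if not already completed" behaviour of the `pipeline` command can be sketched with completion markers. The stage names and marker-file scheme below are illustrative assumptions, not the project's actual implementation:

```python
import tempfile
from pathlib import Path

def run_stage(name, work, done_dir, log):
    """Run a pipeline stage unless its completion marker already exists."""
    marker = done_dir / f"{name}.done"
    if marker.exists():
        log.append(f"skip {name}")   # stage already completed on a previous run
        return
    work()                           # e.g. download, chunk, enrich, embed, ingest
    marker.touch()                   # record completion so later runs skip this stage
    log.append(f"run {name}")

done_dir = Path(tempfile.mkdtemp())
log = []
stages = ["download", "chunk", "enrich", "embed", "ingest"]

for s in stages:                     # first run: every stage executes
    run_stage(s, lambda: None, done_dir, log)
for s in stages:                     # second run: every stage is skipped
    run_stage(s, lambda: None, done_dir, log)

print(log)
```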
**Note**

The experiments have been conducted so far using a MacBook Pro (2020) with an Intel Core i5 processor, relying solely on CPU computation. The system specifications are as follows:
- Operating system: macOS Sonoma
- Processor: Intel Core i5 (4 cores)
- GPU: Integrated Intel UHD Graphics
- RAM: 32 GB (DDR4)
**Warning**

While this configuration has been sufficient for development and testing on small-scale datasets, later stages of the project, particularly those involving large models or datasets, may require more powerful hardware to ensure faster training times and better performance.
We relied on Iterative Self-Incentivization by Shi et al. (2025), adopting their mechanism for evaluating search trajectories as the foundation for our multi-stage retrieval.
```bibtex
@misc{shi2025iterativeselfincentivizationempowerslarge,
  title         = {Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers},
  author        = {Zhengliang Shi and Lingyong Yan and Dawei Yin and Suzan Verberne and Maarten de Rijke and Zhaochun Ren},
  year          = {2025},
  eprint        = {2505.20128},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2505.20128},
}
```

For the embedding model used in this project, please refer to the following citation:
```bibtex
@misc{nussbaum2024nomic,
  title         = {Nomic Embed: Training a Reproducible Long Context Text Embedder},
  author        = {Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
  year          = {2024},
  eprint        = {2402.01613},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
}
```

For the model used to extract metadata from the pre-processed RFC documents in this project, please refer to the following citation:
```bibtex
@misc{qwen3technicalreport,
  title         = {Qwen3 Technical Report},
  author        = {Qwen Team},
  year          = {2025},
  eprint        = {2505.09388},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2505.09388},
}
```

This project represents the Master's Thesis of Francesco Congiu (@wakaflocka17, repository owner), a student at the University of Cagliari, in collaboration with XFERENCE. If you wish to cite this project, please use the following reference:
Congiu, F. (2025). RetrievEM: Confidentiality-Preserving RAG via Expectation-Maximization. Master's Thesis, University of Cagliari, in collaboration with XFERENCE.

