🤖 RetrievEM: Confidentiality-Preserving RAG via Expectation-Maximization


This is the Master's Thesis project of Francesco Congiu (@wakaflocka17, repository owner), a student at the University of Cagliari, in collaboration with XFERENCE.

This repository offers a Retrieval-Augmented Generation (RAG) pipeline based on an Expectation-Maximization algorithm, inspired by the Iterative Self-Incentivization framework of Shi et al. (2025). After an initial pre-processing step, we exploit Qdrant for semantic retrieval, apply dynamic filters on the metadata, and finally run a contextual reranking module that selects the most relevant text chunks.
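For readers unfamiliar with expectation-maximization, the general shape of the algorithm is an alternation between scoring data under the current model parameters (E-step) and re-estimating the parameters from those scores (M-step). Below is a purely illustrative toy that fits the means of two 1-D Gaussian components; it is not the thesis algorithm, just the classic EM alternation in miniature:

```python
# Purely illustrative EM toy (NOT the thesis algorithm): fit the means of
# two 1-D Gaussian components to a list of scores.
import math

def em_two_means(xs, mu=(0.0, 1.0), sigma=1.0, iters=50):
    mu0, mu1 = mu
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point,
        # assuming equal priors and a shared, fixed sigma.
        resp = []
        for x in xs:
            p0 = math.exp(-((x - mu0) ** 2) / (2 * sigma ** 2))
            p1 = math.exp(-((x - mu1) ** 2) / (2 * sigma ** 2))
            resp.append(p1 / (p0 + p1))
        # M-step: re-estimate each mean as a responsibility-weighted average.
        w1 = sum(resp)
        w0 = len(xs) - w1
        mu1 = sum(r * x for r, x in zip(resp, xs)) / w1
        mu0 = sum((1 - r) * x for r, x in zip(resp, xs)) / w0
    return mu0, mu1

lo, hi = em_two_means([0.1, 0.2, 0.15, 4.8, 5.1, 5.0])
print(round(lo, 2), round(hi, 2))  # the two means settle on the two clusters
```

In the pipeline, the same alternation operates over retrieval quantities rather than Gaussian means, but the E/M structure is the same.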


📋 Summary

  1. Project Overview
  2. Contributors
  3. How to Use the Pipeline
  4. Hardware Limitations
  5. Bibliography
  6. Citations

🔧 Project Overview

The RAG Pipeline integrates multiple stages, from data pre-processing to query generation and document retrieval using LLMs and VectorDB. This pipeline is designed to provide efficient and accurate responses from large documents, leveraging state-of-the-art AI models.
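The stages described above, semantic retrieval, dynamic metadata filtering, and reranking, can be sketched end to end in miniature. Everything below (the documents, embeddings, metadata fields, and scoring) is invented for illustration; the real pipeline uses Qdrant and learned embedding models:

```python
# Toy sketch of the three retrieval stages: vector search, metadata
# filtering, and reranking. All data here is invented for illustration.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

docs = [
    {"id": 1, "vec": [0.9, 0.1], "meta": {"source": "rfc"},  "text": "TLS handshake"},
    {"id": 2, "vec": [0.2, 0.8], "meta": {"source": "blog"}, "text": "cookie recipes"},
    {"id": 3, "vec": [0.8, 0.3], "meta": {"source": "rfc"},  "text": "TLS record layer"},
]

def retrieve(query_vec, source, top_k=2):
    # Stages 1+2: score every document, keeping only those whose
    # metadata matches the dynamic filter (here, the "source" field).
    hits = [(cosine(query_vec, d["vec"]), d) for d in docs
            if d["meta"]["source"] == source]
    # Stage 3 stand-in: rerank the filtered hits by score and truncate.
    hits.sort(key=lambda h: h[0], reverse=True)
    return [d["id"] for _, d in hits[:top_k]]

print(retrieve([1.0, 0.2], source="rfc"))  # doc 2 is filtered out by metadata
```

In the actual pipeline the filter is applied inside Qdrant at query time and the reranking is done by a contextual module rather than by raw cosine score, but the data flow is the same.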

👥 Contributors

Thanks to all the contributors who made this project possible! 🚀

  • Francesco Congiu, MSc Student, UniCa (f.congiu38@studenti.unica.it)
  • Ludovico Boratto, Professor, UniCa (ludovico.boratto@unica.it)
  • Gianni Fenu, Professor, UniCa (gianni.fenu@unica.it)
  • Gionathan Desogus, SWE/AI, XFERENCE (gionathan.desogus@xference.ai)
  • Michele Fadda, CTO, XFERENCE (michele.fadda@xference.ai)

⚙️ How to Use the Pipeline

The pipeline is divided into several stages, each of which can be run by following the instructions below:

  1. Clone the repository to your local machine:

    git clone https://github.com/wakaflocka17/RetrievEM.git
  2. Create and activate a Python virtual environment:

    • macOS/Linux:
    python3.11 -m venv myenv
    source myenv/bin/activate
    • Windows (PowerShell):
    python3.11 -m venv myenv
    .\myenv\Scripts\Activate.ps1
    • Windows (Command Prompt):
    python3.11 -m venv myenv
    .\myenv\Scripts\activate.bat
  3. Install pip-tools, compile requirements.txt from requirements.in and install the dependencies:

    3.1 Install pip-tools

    First, install the pip-tools package:

      pip install pip-tools

    3.2 Generate requirements.txt

    Next, generate requirements.txt from requirements.in:

      pip-compile requirements.in

    3.3 Install all dependencies

    Finally, install the dependencies listed in requirements.txt, which RetrievEM needs to run:

      pip install -r requirements.txt
  4. Download and run Qdrant (using Docker):

    First, pull the latest Qdrant image from Docker Hub:

    docker pull qdrant/qdrant

    Then, run the Qdrant service, mounting a local qdrant_storage directory for persistent storage:

    docker run -p 6333:6333 -p 6334:6334 \
    -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
    qdrant/qdrant

    Under this configuration, all Qdrant data will be stored in ./qdrant_storage on your host machine.

    Qdrant will now be accessible at:

    • REST API: http://localhost:6333
    • Web dashboard: http://localhost:6333/dashboard
    • gRPC API: localhost:6334

  5. Execute the pipeline:

    The framework can be launched in two main modes: interactive and experimental evaluation.

    5.1 🧠 Run the Full RAG Pipeline (Interactive Mode)

    To start a multi-turn RAG session with document download, chunking, enrichment, embeddings, and ingestion into Qdrant:

    python main.py pipeline

    This runs the following steps, each only if not already completed: downloading the required documents, generating chunks, enriching them, creating embeddings, and ingesting the embeddings into Qdrant. It then starts the interactive pipeline (intended for demos and quick testing only; not used in the thesis experiments).
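    The skip-if-already-completed behavior can be pictured as an idempotent step runner: each finished stage is recorded, and re-runs skip it. A minimal sketch, with hypothetical stage names mirroring the list above and an invented completion-marker mechanism:

```python
# Minimal sketch of an idempotent step runner: each completed stage is
# recorded, and re-running the pipeline skips stages already marked done.
STAGES = ["download", "chunk", "enrich", "embed", "ingest"]

def run_pipeline(done, log):
    for stage in STAGES:
        if stage in done:
            log.append(f"skip {stage}")
            continue
        # ... the real work for this stage would happen here ...
        done.add(stage)
        log.append(f"run {stage}")

done, log = set(), []
run_pipeline(done, log)   # first run executes every stage
run_pipeline(done, log)   # second run skips every stage
print(log[0], log[-1])
```

    In the real pipeline, completion would be detected from artifacts on disk (downloaded files, stored embeddings) rather than an in-memory set.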

    5.2 📊 Experimental Evaluation Mode

    All experiments described in the thesis are managed through src/evaluation/eval.py. For each Research Question (RQ), it is possible to run the corresponding evaluation:

    RQ1 – Impact of Embeddings
    python src/evaluation/eval.py --rq rq1 --dataset fiqa

    RQ2 – Cross-Encoder Reranking
    python src/evaluation/eval.py --rq rq2 --dataset fiqa

    RQ3 – Fusion Strategies
    python src/evaluation/eval.py --rq rq3 --dataset fiqa

    RQ4 – Subquery Decomposition
    python src/evaluation/eval.py --rq rq4 --dataset fiqa

    RQ5 – Confidentiality-Aware Retrieval
    python src/evaluation/eval.py --rq rq5 --dataset fiqa
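    The command-line interface above can be pictured as a small argparse dispatcher. This is a hypothetical sketch, not the actual src/evaluation/eval.py:

```python
# Hypothetical sketch of the eval.py command-line interface shown above.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Run one RQ evaluation")
    parser.add_argument("--rq", choices=["rq1", "rq2", "rq3", "rq4", "rq5"],
                        required=True, help="research question to evaluate")
    parser.add_argument("--dataset", default="fiqa",
                        help="dataset name (e.g. fiqa)")
    return parser

# One evaluation routine per research question (names from the list above).
EVALUATIONS = {
    "rq1": "impact of embeddings",
    "rq2": "cross-encoder reranking",
    "rq3": "fusion strategies",
    "rq4": "subquery decomposition",
    "rq5": "confidentiality-aware retrieval",
}

args = build_parser().parse_args(["--rq", "rq3", "--dataset", "fiqa"])
print(f"Evaluating {EVALUATIONS[args.rq]} on {args.dataset}")
```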
    

🖥️ Hardware Limitations

Note

So far, the experiments have been conducted on a MacBook Pro (2020) with an Intel Core i5 processor, relying solely on CPU computation. The system specifications are as follows:

  • Operating system: macOS Sonoma
  • Processor: Intel Core i5 (4 cores)
  • GPU: Integrated Intel UHD Graphics
  • RAM: 32 GB (DDR4)

Warning

While this configuration has been sufficient for development and testing on small-scale datasets, later stages of the project, particularly those involving large models or datasets, may require more powerful hardware to ensure faster training times and better performance.


📚 Bibliography

We relied on the Iterative Self-Incentivization framework by Shi et al. (2025), adopting its mechanism for evaluating search trajectories as the foundation for our multi-stage retrieval.

@misc{shi2025iterativeselfincentivizationempowerslarge,
  title        = {Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers},
  author       = {Zhengliang Shi and Lingyong Yan and Dawei Yin and Suzan Verberne and Maarten de Rijke and Zhaochun Ren},
  year         = {2025},
  eprint       = {2505.20128},
  archivePrefix= {arXiv},
  primaryClass = {cs.CL},
  url          = {https://arxiv.org/abs/2505.20128},
}

For the embedding model used in this project, please refer to the following citation:

@misc{nussbaum2024nomic,
  title        = {Nomic Embed: Training a Reproducible Long Context Text Embedder},
  author       = {Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
  year         = {2024},
  eprint       = {2402.01613},
  archivePrefix= {arXiv},
  primaryClass = {cs.CL},
}

For the model used to extract metadata from the pre-processed RFC documents in this project, please refer to the following citation:

@misc{qwen3technicalreport,
  title        = {Qwen3 Technical Report},
  author       = {Qwen Team},
  year         = {2025},
  eprint       = {2505.09388},
  archivePrefix= {arXiv},
  primaryClass = {cs.CL},
  url          = {https://arxiv.org/abs/2505.09388},
}

📜 Citations

This project represents the Master's Thesis of Francesco Congiu (@wakaflocka17, repository owner), a student at the University of Cagliari, in collaboration with XFERENCE. If you wish to cite this project, please use the following reference:

Congiu, F. (2025). RetrievEM: Confidentiality-Preserving RAG via Expectation-Maximization. Master's Thesis, University of Cagliari, in collaboration with XFERENCE.
