🤖 RetrievEM: Confidentiality-Preserving RAG via Expectation-Maximization


This is the Master's Thesis project of Francesco Congiu (@wakaflocka17, repository owner), a student at the University of Cagliari, in collaboration with XFERENCE.

This repository offers a Retrieval-Augmented Generation (RAG) pipeline based on an Expectation-Maximization algorithm, inspired by the Iterative Self-Incentivization framework of Shi et al. (2025). After an initial pre-processing step, we exploit Qdrant for semantic retrieval, apply dynamic filters on the metadata, and finally run a contextual reranking module that selects the most relevant text chunks.
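For readers unfamiliar with expectation-maximization, the general shape of the algorithm is an alternation between scoring data under the current model parameters (E-step) and re-estimating the parameters from those scores (M-step). Below is a purely illustrative toy that fits the means of two 1-D Gaussian components; it is not the thesis algorithm, just the classic EM alternation in miniature:

```python
# Purely illustrative EM toy (NOT the thesis algorithm): fit the means of
# two 1-D Gaussian components to a list of scores.
import math

def em_two_means(xs, mu=(0.0, 1.0), sigma=1.0, iters=50):
    mu0, mu1 = mu
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point,
        # assuming equal priors and a shared, fixed sigma.
        resp = []
        for x in xs:
            p0 = math.exp(-((x - mu0) ** 2) / (2 * sigma ** 2))
            p1 = math.exp(-((x - mu1) ** 2) / (2 * sigma ** 2))
            resp.append(p1 / (p0 + p1))
        # M-step: re-estimate each mean as a responsibility-weighted average.
        w1 = sum(resp)
        w0 = len(xs) - w1
        mu1 = sum(r * x for r, x in zip(resp, xs)) / w1
        mu0 = sum((1 - r) * x for r, x in zip(resp, xs)) / w0
    return mu0, mu1

lo, hi = em_two_means([0.1, 0.2, 0.15, 4.8, 5.1, 5.0])
print(round(lo, 2), round(hi, 2))  # the two means settle on the two clusters
```

In the pipeline, the same alternation operates over retrieval quantities rather than Gaussian means, but the E/M structure is the same.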


📋 Summary

  1. Project Overview
  2. Contributors
  3. How to Use the Pipeline
  4. Hardware Limitations
  5. Bibliography
  6. Citations

🔧 Project Overview

The RAG Pipeline integrates multiple stages, from data pre-processing to query generation and document retrieval using LLMs and VectorDB. This pipeline is designed to provide efficient and accurate responses from large documents, leveraging state-of-the-art AI models.
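The stages described above, semantic retrieval, dynamic metadata filtering, and reranking, can be sketched end to end in miniature. Everything below (the documents, embeddings, metadata fields, and scoring) is invented for illustration; the real pipeline uses Qdrant and learned embedding models:

```python
# Toy sketch of the three retrieval stages: vector search, metadata
# filtering, and reranking. All data here is invented for illustration.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

docs = [
    {"id": 1, "vec": [0.9, 0.1], "meta": {"source": "rfc"},  "text": "TLS handshake"},
    {"id": 2, "vec": [0.2, 0.8], "meta": {"source": "blog"}, "text": "cookie recipes"},
    {"id": 3, "vec": [0.8, 0.3], "meta": {"source": "rfc"},  "text": "TLS record layer"},
]

def retrieve(query_vec, source, top_k=2):
    # Stages 1+2: score every document, keeping only those whose
    # metadata matches the dynamic filter (here, the "source" field).
    hits = [(cosine(query_vec, d["vec"]), d) for d in docs
            if d["meta"]["source"] == source]
    # Stage 3 stand-in: rerank the filtered hits by score and truncate.
    hits.sort(key=lambda h: h[0], reverse=True)
    return [d["id"] for _, d in hits[:top_k]]

print(retrieve([1.0, 0.2], source="rfc"))  # doc 2 is filtered out by metadata
```

In the actual pipeline the filter is applied inside Qdrant at query time and the reranking is done by a contextual module rather than by raw cosine score, but the data flow is the same.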

👥 Contributors

Thanks to all the contributors who made this project possible! 🚀

  • Francesco Congiu, MSc Student, UniCa (f.congiu38@studenti.unica.it)
  • Ludovico Boratto, Professor, UniCa (ludovico.boratto@unica.it)
  • Gianni Fenu, Professor, UniCa (gianni.fenu@unica.it)
  • Gionathan Desogus, SWE/AI, XFERENCE (gionathan.desogus@xference.ai)
  • Michele Fadda, CTO, XFERENCE (michele.fadda@xference.ai)

⚙️ How to Use the Pipeline

The pipeline is divided into several stages, each of which can be run by following the instructions below:

  1. Clone the repository to your local machine:

    git clone https://github.com/wakaflocka17/RetrievEM.git
  2. Create and activate a Python virtual environment:

    • macOS/Linux:
    python3.11 -m venv myenv
    source myenv/bin/activate
    • Windows (PowerShell):
    python3.11 -m venv myenv
    .\myenv\Scripts\Activate.ps1
    • Windows (Command Prompt):
    python3.11 -m venv myenv
    .\myenv\Scripts\activate.bat
  3. Install pip-tools, compile requirements.txt from requirements.in and install the dependencies:

    3.1 Install pip-tools

    First, install the pip-tools package:

      pip install pip-tools

    3.2 Generate requirements.txt

    Next, generate requirements.txt from requirements.in:

      pip-compile requirements.in

    3.3 Install all dependencies

    Finally, install the dependencies listed in requirements.txt, which RetrievEM needs to run:

      pip install -r requirements.txt
  4. Download and run Qdrant (using Docker):

    First, pull the latest Qdrant image from Docker Hub:

    docker pull qdrant/qdrant

    Then, run the Qdrant service, mounting a local qdrant_storage directory for persistent storage:

    docker run -p 6333:6333 -p 6334:6334 \
    -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
    qdrant/qdrant

    Under this configuration, all Qdrant data will be stored in ./qdrant_storage on your host machine.

    Qdrant will now be accessible at:

    • REST API: http://localhost:6333
    • Web dashboard: http://localhost:6333/dashboard
    • gRPC API: localhost:6334

  5. Execute the pipeline:

    The framework can be launched in two main modes: interactive and experimental evaluation.

    5.1 🧠 Run the Full RAG Pipeline (Interactive Mode)

    To start a multi-turn RAG session with document download, chunking, enrichment, embeddings, and ingestion into Qdrant:

    python main.py pipeline

    This runs the following steps, each only if not already completed: downloading the required documents, generating chunks, enriching them, creating embeddings, and ingesting the embeddings into Qdrant. It then starts the interactive pipeline (intended for demos and quick testing only; not used in the thesis experiments).
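    The skip-if-already-completed behavior can be pictured as an idempotent step runner: each finished stage is recorded, and re-runs skip it. A minimal sketch, with hypothetical stage names mirroring the list above and an invented completion-marker mechanism:

```python
# Minimal sketch of an idempotent step runner: each completed stage is
# recorded, and re-running the pipeline skips stages already marked done.
STAGES = ["download", "chunk", "enrich", "embed", "ingest"]

def run_pipeline(done, log):
    for stage in STAGES:
        if stage in done:
            log.append(f"skip {stage}")
            continue
        # ... the real work for this stage would happen here ...
        done.add(stage)
        log.append(f"run {stage}")

done, log = set(), []
run_pipeline(done, log)   # first run executes every stage
run_pipeline(done, log)   # second run skips every stage
print(log[0], log[-1])
```

    In the real pipeline, completion would be detected from artifacts on disk (downloaded files, stored embeddings) rather than an in-memory set.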

    5.2 📊 Experimental Evaluation Mode

    All experiments described in the thesis are managed through src/evaluation/eval.py. For each Research Question (RQ), it is possible to run the corresponding evaluation:

    RQ1 – Impact of Embeddings
    python src/evaluation/eval.py --rq rq1 --dataset fiqa

    RQ2 – Cross-Encoder Reranking
    python src/evaluation/eval.py --rq rq2 --dataset fiqa

    RQ3 – Fusion Strategies
    python src/evaluation/eval.py --rq rq3 --dataset fiqa

    RQ4 – Subquery Decomposition
    python src/evaluation/eval.py --rq rq4 --dataset fiqa

    RQ5 – Confidentiality-Aware Retrieval
    python src/evaluation/eval.py --rq rq5 --dataset fiqa
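    The command-line interface above can be pictured as a small argparse dispatcher. This is a hypothetical sketch, not the actual src/evaluation/eval.py:

```python
# Hypothetical sketch of the eval.py command-line interface shown above.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Run one RQ evaluation")
    parser.add_argument("--rq", choices=["rq1", "rq2", "rq3", "rq4", "rq5"],
                        required=True, help="research question to evaluate")
    parser.add_argument("--dataset", default="fiqa",
                        help="dataset name (e.g. fiqa)")
    return parser

# One evaluation routine per research question (names from the list above).
EVALUATIONS = {
    "rq1": "impact of embeddings",
    "rq2": "cross-encoder reranking",
    "rq3": "fusion strategies",
    "rq4": "subquery decomposition",
    "rq5": "confidentiality-aware retrieval",
}

args = build_parser().parse_args(["--rq", "rq3", "--dataset", "fiqa"])
print(f"Evaluating {EVALUATIONS[args.rq]} on {args.dataset}")
```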
    

🖥️ Hardware Limitations

Note

So far, the experiments have been conducted on a MacBook Pro (2020) with an Intel Core i5 processor, relying solely on CPU computation. The system specifications are as follows:

  • Operating system: macOS Sonoma
  • Processor: Intel Core i5 (4 cores)
  • GPU: Integrated Intel UHD Graphics
  • RAM: 32 GB (DDR4)

Warning

While this configuration has been sufficient for development and testing on small-scale datasets, later stages of the project, particularly those involving large models or datasets, may require more powerful hardware to ensure faster training times and better performance.


📚 Bibliography

We relied on the Iterative Self-Incentivization framework by Shi et al. (2025), adopting its mechanism for evaluating search trajectories as the foundation for our multi-stage retrieval.

@misc{shi2025iterativeselfincentivizationempowerslarge,
  title        = {Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers},
  author       = {Zhengliang Shi and Lingyong Yan and Dawei Yin and Suzan Verberne and Maarten de Rijke and Zhaochun Ren},
  year         = {2025},
  eprint       = {2505.20128},
  archivePrefix= {arXiv},
  primaryClass = {cs.CL},
  url          = {https://arxiv.org/abs/2505.20128},
}

For the embedding model used in this project, please refer to the following citation:

@misc{nussbaum2024nomic,
  title        = {Nomic Embed: Training a Reproducible Long Context Text Embedder},
  author       = {Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
  year         = {2024},
  eprint       = {2402.01613},
  archivePrefix= {arXiv},
  primaryClass = {cs.CL},
}

For the model used to extract metadata from the pre-processed RFC documents in this project, please refer to the following citation:

@misc{qwen3technicalreport,
  title        = {Qwen3 Technical Report},
  author       = {Qwen Team},
  year         = {2025},
  eprint       = {2505.09388},
  archivePrefix= {arXiv},
  primaryClass = {cs.CL},
  url          = {https://arxiv.org/abs/2505.09388},
}

📜 Citations

This project represents the Master's Thesis of Francesco Congiu (@wakaflocka17, repository owner), a student at the University of Cagliari, in collaboration with XFERENCE. If you wish to cite this project, please use the following reference:

Congiu, F. (2025). RetrievEM: Confidentiality-Preserving RAG via Expectation-Maximization. Master's Thesis, University of Cagliari, in collaboration with XFERENCE.
