This is the repository for our summer school project 2024 at Karlsruhe University of Applied Sciences. The goal of this project is to create a retrieval-augmented generation (RAG) chatbot that answers questions about the Point Cloud Library (PCL) by retrieving relevant information from the PCL documentation and generating an answer based on the retrieved passages.
- Web Scraping: Uses BeautifulSoup to parse the HTML content of the documentation and extract the relevant information (a minimal sketch of the scraping-to-CSV flow follows this list).
- Data Processing: Employs pandas for data manipulation and storage.
- Document Analysis: Analyzes different types of documentation elements such as classes, functions, and descriptions.
- CSV Export: Outputs the processed data into a CSV file for easy access and further analysis.
- Streamlit Integration: Provides a user-friendly interface to interact with the processed data.
- Retrieval-Augmented Generation (RAG): Implements a RAG pipeline with Haystack using HyDE (with HyQE available as an alternative) to generate answers to user questions from the processed data.
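A minimal sketch of the scraping-and-export flow described above. The URL, selectors, and output path are illustrative placeholders rather than the project's actual values, and `requests` is assumed as the HTTP client even though it is not part of the dependency list below:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Example PCL documentation page; the real scraper covers the full documentation.
url = "https://pointclouds.org/documentation/"
soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

# Collect (title, text) pairs; the actual selectors depend on the docs' HTML layout.
records = [
    {
        "title": heading.get_text(strip=True),
        "text": heading.find_next("p").get_text(strip=True),
    }
    for heading in soup.find_all("h2")
    if heading.find_next("p") is not None
]

# Store the processed elements as CSV for later indexing.
pd.DataFrame(records).to_csv("pcl_docs.csv", index=False)
```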
The project requires Python 3.10 or later and depends on the following Python packages:

- beautifulsoup4
- pandas
- streamlit
- haystack-ai
- qdrant-haystack
- pypdf
- markdown-it-py
- sentence-transformers
- cryptography
- langfuse-haystack
- langdetect
The project also requires the following tools:

- ollama
- docker
While this app can run on any operating system that supports the above dependencies and tools, it has been tested on Ubuntu 22.04, and the instructions below are written for that platform.
Install Poetry via pipx:

```bash
pip3 install pipx
pipx install poetry
```
Install Ollama:

```bash
cd ~
curl -fsSL https://ollama.com/install.sh | sh
```
Clone the repository:

```bash
git clone https://github.com/your-repo/rag-project.git
cd rag-project
```
Set up your virtual environment:

```bash
python3 -m venv .venv
```
Install the dependencies from the repository root:

```bash
source .venv/bin/activate
poetry install
```
Set up the environment variables for tracing via Langfuse:

```bash
echo "export LANGFUSE_SECRET_KEY=<your-secret-key>" >> ~/.bashrc
echo "export LANGFUSE_PUBLIC_KEY=<your-public-key>" >> ~/.bashrc
```
Pull the latest version of llama3.1:

```bash
ollama pull llama3.1
```
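To verify that the model responds, you can send a test prompt to Ollama's local REST API (default port 11434). This sketch uses only the standard library, and the prompt is just an arbitrary example:

```python
import json
import urllib.request

payload = json.dumps({
    "model": "llama3.1",
    "prompt": "What is a point cloud?",  # arbitrary test prompt
    "stream": False,
}).encode("utf-8")

request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"])
```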
Start your local Qdrant instance:

```bash
docker run -p 6333:6333 -p 6334:6334 \
    -v ~/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant
```
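Once the container is running, you can check that the document store is reachable from Python. This is a sketch assuming the `qdrant-haystack` integration; the collection name and embedding dimension are placeholders, not the app's actual configuration:

```python
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# Connect to the local Qdrant instance started above.
store = QdrantDocumentStore(
    url="http://localhost:6333",
    index="pcl_docs",    # placeholder collection name
    embedding_dim=384,   # placeholder; must match the embedding model in use
)
print(store.count_documents())
```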
Activate the virtual environment:

```bash
source .venv/bin/activate
```
From the `src` folder of the repository, run the app:

```bash
cd src
streamlit run main.py
```
An instance of your browser should open and show the app's Streamlit interface.