KevinWittenberg/appl-docchat
Chat with your docs!

A RAG (Retrieval Augmented Generation) setup for further exploration of chatting with company documents


(Figure: pipeline)

How to use this repo

Note: this repo is tested on a Windows platform

Preparation

  1. Clone this repo to a folder of your choice
  2. In a folder of your choice, create a file named ".env"
  3. When using the OpenAI API, enter your OpenAI API key in the first line of this file:
    OPENAI_API_KEY="sk-....."
  4. When using Azure OpenAI Services, enter the variable AZURE_OPENAI_API_KEY="....." in the .env file.
    The value of this variable can be found in your Azure OpenAI Services subscription
  5. In case you want to use one of the open source model APIs that are available on Huggingface, enter your Huggingface API key in the ".env" file:
    HUGGINGFACEHUB_API_TOKEN="hf_....."
  • If you don't have a Huggingface API key yet, you can register at https://huggingface.co/join
  • Once registered and logged in, you can get your API key in your Huggingface profile settings
  6. This repository also allows for using one of the Ollama open source models on-premise. You can do this by following the steps below:
  • In Windows, go to "Turn Windows features on or off" and check the features "Virtual Machine Platform" and "Windows Subsystem for Linux"
  • Download and install the Ubuntu Windows Subsystem for Linux (WSL) by opening a terminal window and typing wsl --install
  • Start WSL by opening a terminal and typing wsl, then install Ollama with curl -fsSL https://ollama.com/install.sh | sh
  • When you decide to use a local LLM and/or embedding model, make sure that the Ollama server is running by:
    • opening a terminal and typing wsl
    • starting the Ollama server with ollama serve. This makes any downloaded models accessible through the Ollama API
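
Putting the API keys together, a complete .env file might look like this (a sketch; include only the lines for the services you actually use, with your own key values):

    OPENAI_API_KEY="sk-....."
    AZURE_OPENAI_API_KEY="....."
    HUGGINGFACEHUB_API_TOKEN="hf_....."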

Conda virtual environment setup

  1. Open an Anaconda prompt or other command prompt
  2. Go to the root folder of the project and create a conda environment with conda env create -f appl-docchat.yml
    NB: The name of the environment is appl-docchat by default. It can be changed to a name of your choice in the first line of the file appl-docchat.yml
  3. Activate this environment with conda activate appl-docchat
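
In full, the sequence is:

    conda env create -f appl-docchat.yml
    conda activate appl-docchat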

Pip virtual environment setup

  1. Open an Anaconda prompt or other command prompt
  2. Go to the root folder of the project and create a Python virtual environment with python -m venv venv
    This will create a basic virtual environment folder named venv in the root of your project folder
    NB: The chosen name of the environment folder here is venv. It can be changed to a name of your choice
  3. Activate this environment with venv\Scripts\activate
  4. All required packages can now be installed with pip install -r requirements.txt
  5. If you would like to run unit tests, you also need to run pip install -e appl-docchat
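
In full, the sequence is (the last command is only needed for running unit tests):

    python -m venv venv
    venv\Scripts\activate
    pip install -r requirements.txt
    pip install -e appl-docchat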

Setting parameters

The file settings_template.py contains all parameters that can be used and needs to be copied to settings.py. In settings.py, fill in the parameter values you want to use for your use case. Examples and restrictions for parameter values are given in the comment lines.
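
For example, on Windows the copy can be made from a command prompt in the project root with:

    copy settings_template.py settings.py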

NLTK

When the NLTKTextSplitter is used for chunking the documents, it is necessary to download the punkt and punkt_tab modules of NLTK.
This can be done in the activated environment by starting a Python interactive session: type python.
Once in the Python session, type import nltk + Enter.
Then nltk.download('punkt') + Enter.
And finally, nltk.download('punkt_tab') + Enter.
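
Equivalently, both downloads can be run non-interactively from the activated environment with a single command:

    python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"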

FlashRank reranker

This repo allows reranking of the documents retrieved from the vector store by using the FlashRank reranker. The very first use will download and unzip the required model (as indicated in settings.py) from the HuggingFace platform. For more information on the FlashRank reranker, see https://github.com/PrithivirajDamodaran/FlashRank

Ingesting documents

The file ingest.py can be used to vectorize all documents in a chosen folder and store the vectors and texts in a vector database for later use.
Execution is done in the activated virtual environment with python ingest.py

Querying documents

The file query.py can be used to query any folder with documents, provided that the associated vector database exists.
Execution is done in the activated virtual environment with python query.py

Summarizing documents

The file summarize.py can be used to summarize every file individually in a document folder. Two options for summarization are implemented:

  • Map Reduce: this will create a summary in a fast way
  • Refine: this will create a more refined summary, but can take a long time to run, especially for larger documents

Execution is done in the activated virtual environment with python summarize.py. The user will be prompted for the summarization method

Ingesting and querying documents through a Streamlit User Interface

The functionalities described above can also be used through a GUI.
In the activated virtual environment, the GUI can be started with streamlit run streamlit_app.py
When this command is used, a browser session will open automatically

Querying multiple documents with multiple questions in batch

The file review.py uses the standard question-answer technique but allows you to ask multiple questions of each document in a folder sequentially, enabling the user to gather comparable data from a range of documents. It is aimed at conducting a systematic review of multiple sources. To use the review functionality, the following steps need to be executed:

  1. Creation of a docs/your_docs/review folder
  2. Creation of a docs/your_docs/review/questions.csv file (see the folder review for an example; a hypothetical sketch is also given after this list)
  3. Filling in the questions that shall be posed to the documents:
    1. Question_Type - the question type, either Initial or Follow Up (Follow Up will retain context from the previous question)
    2. Question - the actual question you would like to ask
    3. Question_Template (optional) - gives instructions for how the large language model should behave. If provided, it needs to include the terms "{context}" & "{question}" (with the brackets). If not provided, the template defined in settings.RETRIEVER_PROMPT_TEMPLATE will be used
    4. Summary_Template (optional) - gives instructions for the creation of a summary of all the documents' answers to the question (if not defined, no summary will be created). If provided, it needs to include the terms "{question}" & "{answer_string}" (with the brackets)
    5. Classification (optional) - indicator (y/n or blank) of whether the question is considered a classification question
    6. Classes (optional) - if Classification is "y", this field should contain the names of the classes, each class on a new line
  4. Execution is done in the activated virtual environment with python review.py
  5. Specify the document folder when asked
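
A hypothetical questions.csv might look like this (purely illustrative; consult the example in the review folder for the exact column order and separator):

    Question_Type,Question,Question_Template,Summary_Template,Classification,Classes
    Initial,"What is the main topic of the document?",,,,
    Follow Up,"Which methods does it describe?",,,,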

All the results, including the answers and the sources used to create the answers, are stored in a file answers.tsv in the review subfolder.
If the summary template was used, an additional file answers_summary.tsv is generated in the same location

For developers: Monitoring the results of the chunking process through a Streamlit User Interface

When parsing files, the raw text is chunked. To see and compare the results of different chunking methods, use the chunks analysis GUI.
In the activated virtual environment, the chunks analysis GUI can be started with streamlit run streamlit_chunks.py
When this command is used, a browser session will open automatically

For developers: Evaluation of Question Answer results

The file evaluate.py can be used to evaluate the generated answers for a list of questions, provided that the file eval.json exists, containing not only the list of questions but also the related list of desired answers (ground truth).
Evaluation is done at folder level (one or multiple folders) in the activated virtual environment with python evaluate.py
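
The exact schema of eval.json is defined by the repo; as a purely hypothetical illustration of the general shape, it pairs each question with its desired answer, roughly along these lines:

    {
      "question": ["What is the main topic?", "..."],
      "ground_truth": ["The main topic is ...", "..."]
    }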

For developers: Monitoring the evaluation results through a Streamlit User Interface

All evaluation results can be viewed by using a dedicated evaluation GUI.
In the activated virtual environment, this evaluation GUI can be started with streamlit run streamlit_evaluate.py
When this command is used, a browser session will open automatically

About

SCP implementation of the document chatting tool, originally developed by Stefan Troost (PBL)
