This project builds an interactive QA (Question-Answering) system that processes PDF documents and answers questions based on the document's content. It uses a combination of BERT for natural language understanding, Pinecone for vector search, and Cohere for text generation. The project includes both a Python API and a Jupyter notebook for embedding generation and index management.
The solution includes two main components:
- App Interface: An interactive Gradio interface for uploading PDF documents and asking questions based on the PDF content.
- Model Training & Indexing Pipeline: A pipeline to extract text from PDF files, generate embeddings using BERT, and store them in Pinecone for fast retrieval.
- Upload PDF documents.
- Extract text from the PDF and generate a context vector.
- Answer questions based on the content of the uploaded PDF.
- Use BERT-based embeddings for accurate question-answer matching.
- Use Cohere API for text generation based on the retrieved context.
├── app.py # Main application code with Gradio interface
├── GENAI.ipynb # Jupyter notebook for embedding generation & Pinecone indexing
├── model.pth # Saved BERT model weights
├── tokenizer/ # Directory containing saved tokenizer
├── README.md # This README file
Before running the project, install the following dependencies:
- Python 3.7+
- PyTorch
- Transformers (Hugging Face)
- Gradio
- Pinecone
- Requests
- pdfplumber (for extracting text from PDFs)
- Pandas (for handling the text data)
To install the dependencies, run:
python -m venv myenv
myenv\Scripts\activate
pip install torch transformers gradio pinecone-client requests pdfplumber pandas PyPDF2

Note: The trained model cannot be uploaded directly to this repository, so a link to the saved weights is provided instead: https://drive.google.com/file/d/1IAacKCRv5Rp6O-wHp-LaMbO4cpgSFQ-W/view?usp=sharing
- PDF Processing: Extract text from each page of the uploaded PDF using `PyPDF2` or `pdfplumber`.
- Tokenization & Embedding Generation: Tokenize the extracted text with BERT's `AutoTokenizer` and generate embeddings with `AutoModel`.
- Embedding Indexing with Pinecone: Store these embeddings in a Pinecone vector index for fast similarity search.
- Context Retrieval: For a user query, encode the question into an embedding, find the closest matching embeddings in the index, and assemble a context from them.
- Text Generation with Cohere: Generate an answer from the retrieved context using the Cohere API.
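The embedding, retrieval, and generation steps above can be sketched roughly as follows. This is a minimal sketch rather than the exact code in `app.py`: the mean-pooling strategy, the `top_k` value, the prompt format, and the SDK calls (current Pinecone client, Cohere's `generate` endpoint) are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

def load_encoder(model_name: str = "bert-base-uncased"):
    """Load the BERT tokenizer and model used to embed text (768-d vectors)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    return tokenizer, model

def embed(text: str, tokenizer, model) -> list[float]:
    """Encode text into one vector by mean-pooling BERT's last hidden state."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0).tolist()

def answer(question: str, index, cohere_client, tokenizer, model) -> str:
    """Retrieve the closest chunks from the Pinecone index and ask Cohere to answer."""
    result = index.query(vector=embed(question, tokenizer, model),
                         top_k=3, include_metadata=True)
    # Assumes each stored vector carries its source text in metadata["text"].
    context = "\n".join(m["metadata"]["text"] for m in result["matches"])
    response = cohere_client.generate(
        prompt=f"Context:\n{context}\n\nQuestion: {question}\nAnswer:",
        max_tokens=200,
    )
    return response.generations[0].text
```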
The Gradio interface allows users to:
- Upload a PDF file.
- Ask questions based on the content of the PDF.
- View the answers generated by the model.
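A stripped-down version of this interface might look like the following sketch. The component labels and the stubbed `answer_question` function are illustrative; in `app.py` that function runs the full BERT/Pinecone/Cohere pipeline.

```python
import gradio as gr

def answer_question(pdf_file, question):
    # Stubbed for illustration; the real app extracts text, retrieves
    # context from Pinecone, and generates an answer with Cohere.
    if pdf_file is None or not question:
        return "Please upload a PDF and enter a question."
    return f"(answer for: {question})"

demo = gr.Interface(
    fn=answer_question,
    inputs=[gr.File(label="Upload PDF"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="PDF Question Answering",
)

if __name__ == "__main__":
    demo.launch()
```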
- Navigate to the directory where `app.py` is located.
- Run the following command:

  python app.py

This will launch a local Gradio web app that you can open in your browser.
- Download Pretrained Model:
  - Download the `bert-base-uncased` model and tokenizer from Hugging Face.
  - The model weights are saved in `model.pth` and the tokenizer in the `tokenizer/` directory.
- Pinecone Setup:
  - Create an account on Pinecone and retrieve your API key.
  - Replace the API key in `app.py` and `GENAI.ipynb` with your own key.
  - Set up a Pinecone index for vector search:

    pc.create_index(name="genai", dimension=768, metric='euclidean', ...)

- Running the Application:
  - Run the Jupyter notebook `GENAI.ipynb` to generate the embeddings for any new PDF documents.
  - Launch the Gradio app (`app.py`) to interact with the QA bot.
- Colab Setup:
  - To run the notebook in Google Colab, upload `GENAI.ipynb` to Colab and run the cells step by step.
  - You can interact with Pinecone and Cohere directly from Colab without any additional setup.
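The Pinecone index-creation step can be spelled out like this. It is a sketch assuming the current `pinecone` SDK with a serverless index; the cloud and region values are placeholders, and older `pinecone-client` versions use `pinecone.init(...)` instead.

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")  # replace with your own key

# 768 matches bert-base-uncased's hidden size; the project uses euclidean distance.
if "genai" not in pc.list_indexes().names():
    pc.create_index(
        name="genai",
        dimension=768,
        metric="euclidean",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # placeholder values
    )

index = pc.Index("genai")
```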
- PDFs can be quite large, especially when multiple pages are involved. To solve this, the PDF text is split into smaller segments and tokenized individually, making the process more memory-efficient.
- BERT embeddings allow for a precise understanding of the question's meaning, and Pinecone is used to retrieve semantically similar text, ensuring that the context for answering questions is highly relevant.
- Pinecone allows batch insertion of embeddings to reduce the time spent uploading large datasets. The system dynamically handles batch sizes based on the size of the text data.
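The chunking and batching described above could be implemented along these lines. This is a sketch: the window size, overlap, and batch size are illustrative values, not the ones used in the project.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split long text into overlapping windows so each piece fits BERT's input limit."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so neighbouring chunks share context
    return chunks

def batched(items: list, batch_size: int = 100):
    """Yield fixed-size batches, e.g. for successive index.upsert(vectors=batch) calls."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

Each chunk is then embedded and upserted batch by batch, which keeps both tokenization memory and Pinecone request sizes bounded.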
- Upload a PDF file using the "Upload PDF" button.
- Ask any question related to the content of the uploaded PDF.
- The model will retrieve relevant information and provide a response.
- Open the Colab notebook `GENAI.ipynb`.
- Ensure you have access to the necessary API keys (Pinecone, Cohere).
- Run the notebook step by step to generate embeddings and index them in Pinecone.
- Hugging Face for providing pre-trained models and tokenizers.
- Pinecone for the vector database service.
- Cohere for the text generation API.