This project builds an interactive QA (Question-Answering) system that processes PDF documents and answers questions based on the document's content. It uses a combination of BERT for natural language understanding, Pinecone for vector search, and Cohere for text generation. The project includes both a Python API and a Jupyter notebook for embedding generation and index management.
The solution includes two main components:
- App Interface: An interactive Gradio interface for uploading PDF documents and asking questions based on the PDF content.
- Model Training & Indexing Pipeline: A pipeline to extract text from PDF files, generate embeddings using BERT, and store them in Pinecone for fast retrieval.
- Upload PDF documents.
- Extract text from the PDF and generate a context vector.
- Answer questions based on the content of the uploaded PDF.
- Use BERT-based embeddings for accurate question-answer matching.
- Use Cohere API for text generation based on the retrieved context.
├── app.py # Main application code with Gradio interface
├── GENAI.ipynb # Jupyter notebook for embedding generation & Pinecone indexing
├── model.pth # Saved BERT model weights
├── tokenizer/ # Directory containing saved tokenizer
├── README.md # This README file
Before running the project, install the following dependencies:
- Python 3.7+
- PyTorch
- Transformers (Hugging Face)
- Gradio
- Pinecone
- Requests
- pdfplumber (for extracting text from PDFs)
- Pandas (for handling the text data)
To install the dependencies, run:
python -m venv myenv
myenv\Scripts\activate
pip install torch transformers gradio pinecone-client requests pdfplumber pandas PyPDF2

Note: The trained model cannot be uploaded directly to this repository, so a link to the saved weights is provided instead: https://drive.google.com/file/d/1IAacKCRv5Rp6O-wHp-LaMbO4cpgSFQ-W/view?usp=sharing
- PDF Processing: Extract text from each page of the uploaded PDF using `PyPDF2` or `pdfplumber`.
- Tokenization & Embedding Generation: Tokenize the extracted text with BERT's `AutoTokenizer` and generate embeddings with `AutoModel`.
- Embedding Indexing with Pinecone: Store these embeddings in a Pinecone vector index for fast similarity search.
- Context Retrieval: For a user query, encode the question into an embedding, find the closest matching embeddings in the index, and assemble a context from them.
- Text Generation with Cohere: Generate an answer from the retrieved context using the Cohere API.
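The embedding, retrieval, and generation steps above can be sketched roughly as follows. This is a minimal sketch rather than the exact code in `app.py`: the mean-pooling strategy, the `top_k` value, the prompt format, and the SDK calls (current Pinecone client, Cohere's `generate` endpoint) are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

def load_encoder(model_name: str = "bert-base-uncased"):
    """Load the BERT tokenizer and model used to embed text (768-d vectors)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    return tokenizer, model

def embed(text: str, tokenizer, model) -> list[float]:
    """Encode text into one vector by mean-pooling BERT's last hidden state."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0).tolist()

def answer(question: str, index, cohere_client, tokenizer, model) -> str:
    """Retrieve the closest chunks from the Pinecone index and ask Cohere to answer."""
    result = index.query(vector=embed(question, tokenizer, model),
                         top_k=3, include_metadata=True)
    # Assumes each stored vector carries its source text in metadata["text"].
    context = "\n".join(m["metadata"]["text"] for m in result["matches"])
    response = cohere_client.generate(
        prompt=f"Context:\n{context}\n\nQuestion: {question}\nAnswer:",
        max_tokens=200,
    )
    return response.generations[0].text
```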
The Gradio interface allows users to:
- Upload a PDF file.
- Ask questions based on the content of the PDF.
- View the answers generated by the model.
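A stripped-down version of this interface might look like the following sketch. The component labels and the stubbed `answer_question` function are illustrative; in `app.py` that function runs the full BERT/Pinecone/Cohere pipeline.

```python
import gradio as gr

def answer_question(pdf_file, question):
    # Stubbed for illustration; the real app extracts text, retrieves
    # context from Pinecone, and generates an answer with Cohere.
    if pdf_file is None or not question:
        return "Please upload a PDF and enter a question."
    return f"(answer for: {question})"

demo = gr.Interface(
    fn=answer_question,
    inputs=[gr.File(label="Upload PDF"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="PDF Question Answering",
)

if __name__ == "__main__":
    demo.launch()
```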
- Navigate to the directory where `app.py` is located.
- Run the following command:

  python app.py

This will launch a local Gradio web app that you can open in your browser.
- Download Pretrained Model:
  - Download the `bert-base-uncased` model and tokenizer from Hugging Face.
  - The model weights are saved in `model.pth` and the tokenizer in the `tokenizer/` directory.
- Pinecone Setup:
  - Create an account on Pinecone and retrieve your API key.
  - Replace the API key in `app.py` and `GENAI.ipynb` with your own key.
  - Set up a Pinecone index for vector search:

    pc.create_index(name="genai", dimension=768, metric='euclidean', ...)

- Running the Application:
  - Run the Jupyter notebook `GENAI.ipynb` to generate the embeddings for any new PDF documents.
  - Launch the Gradio app (`app.py`) to interact with the QA bot.
- Colab Setup:
  - To run the notebook in Google Colab, upload `GENAI.ipynb` to Colab and run the cells step by step.
  - You can interact with Pinecone and Cohere directly from Colab without any additional setup.
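The Pinecone index-creation step can be spelled out like this. It is a sketch assuming the current `pinecone` SDK with a serverless index; the cloud and region values are placeholders, and older `pinecone-client` versions use `pinecone.init(...)` instead.

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")  # replace with your own key

# 768 matches bert-base-uncased's hidden size; the project uses euclidean distance.
if "genai" not in pc.list_indexes().names():
    pc.create_index(
        name="genai",
        dimension=768,
        metric="euclidean",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # placeholder values
    )

index = pc.Index("genai")
```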
- PDFs can be quite large, especially when multiple pages are involved. To solve this, the PDF text is split into smaller segments and tokenized individually, making the process more memory-efficient.
- BERT embeddings allow for a precise understanding of the question's meaning, and Pinecone is used to retrieve semantically similar text, ensuring that the context for answering questions is highly relevant.
- Pinecone allows batch insertion of embeddings to reduce the time spent uploading large datasets. The system dynamically handles batch sizes based on the size of the text data.
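The chunking and batching described above could be implemented along these lines. This is a sketch: the window size, overlap, and batch size are illustrative values, not the ones used in the project.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split long text into overlapping windows so each piece fits BERT's input limit."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so neighbouring chunks share context
    return chunks

def batched(items: list, batch_size: int = 100):
    """Yield fixed-size batches, e.g. for successive index.upsert(vectors=batch) calls."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

Each chunk is then embedded and upserted batch by batch, which keeps both tokenization memory and Pinecone request sizes bounded.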
- Upload a PDF file using the "Upload PDF" button.
- Ask any question related to the content of the uploaded PDF.
- The model will retrieve relevant information and provide a response.
- Open the Colab notebook `GENAI.ipynb`.
- Ensure you have access to the necessary API keys (Pinecone, Cohere).
- Run the notebook step by step to generate embeddings and index them in Pinecone.
- Hugging Face for providing pre-trained models and tokenizers.
- Pinecone for the vector database service.
- Cohere for the text generation API.