
Interactive QA Bot Using BERT, Pinecone, and Gradio

This project builds an interactive QA (Question-Answering) system that processes PDF documents and answers questions based on the document's content. It uses a combination of BERT for natural language understanding, Pinecone for vector search, and Cohere for text generation. The project includes both a Python API and a Jupyter notebook for embedding generation and index management.

Project Overview

The solution includes two main components:

  1. App Interface: An interactive Gradio interface for uploading PDF documents and asking questions based on the PDF content.
  2. Model Training & Indexing Pipeline: A pipeline to extract text from PDF files, generate embeddings using BERT, and store them in Pinecone for fast retrieval.

Features

  • Upload PDF documents.
  • Extract text from the PDF and generate a context vector.
  • Answer questions based on the content of the uploaded PDF.
  • Use BERT-based embeddings for accurate question-answer matching.
  • Use Cohere API for text generation based on the retrieved context.

Repository Structure

├── app.py         # Main application code with the Gradio interface
├── GENAI.ipynb    # Jupyter notebook for embedding generation & Pinecone indexing
├── model.pth      # Saved BERT model weights
├── tokenizer/     # Directory containing the saved tokenizer
└── README.md      # This README file

Requirements

Before running the project, install the following dependencies:

  • Python 3.7+
  • PyTorch
  • Transformers (Hugging Face)
  • Gradio
  • Pinecone
  • Requests
  • pdfplumber (for extracting text from PDFs)
  • PyPDF2 (alternative PDF text extraction)
  • Pandas (for handling the text data)

To install the dependencies, run:

python -m venv myenv
myenv\Scripts\activate    # Windows; on macOS/Linux use: source myenv/bin/activate
pip install torch transformers gradio pinecone-client requests pdfplumber pandas PyPDF2

Model Pipeline

The model cannot be uploaded directly to this repository, so the saved weights are provided at this link: https://drive.google.com/file/d/1IAacKCRv5Rp6O-wHp-LaMbO4cpgSFQ-W/view?usp=sharing

The pipeline runs in five steps (a minimal end-to-end sketch follows the list):

  1. PDF Processing: Extract text from each page of the uploaded PDF using PyPDF2 or pdfplumber.
  2. Tokenization & Embedding Generation: Tokenize the extracted text using BERT's AutoTokenizer and generate embeddings using AutoModel.
  3. Embedding Indexing with Pinecone: Store these embeddings in a Pinecone vector index for fast similarity search.
  4. Context Retrieval: For a user query, encode the question into an embedding, find the closest matching embeddings from the index, and generate a context vector.
  5. Text Generation with Cohere: Generate an answer based on the context vector using the Cohere API.
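
A minimal end-to-end sketch of these five steps is shown below. It is illustrative rather than the exact code in app.py / GENAI.ipynb: the placeholder API keys, mean-pooling strategy, and per-page chunking are assumptions, and it targets the v3+ pinecone client and Cohere's legacy Generate endpoint.

# Hedged sketch of the five pipeline steps; placeholder keys and the
# per-page chunking are assumptions, not the repository's exact code.
import pdfplumber
import torch
import cohere
from transformers import AutoTokenizer, AutoModel
from pinecone import Pinecone

PINECONE_API_KEY = "YOUR_PINECONE_API_KEY"  # placeholder
COHERE_API_KEY = "YOUR_COHERE_API_KEY"      # placeholder

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index("genai")
co = cohere.Client(COHERE_API_KEY)

def extract_text(pdf_path):
    # Step 1: extract text page by page with pdfplumber.
    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]

@torch.no_grad()
def embed(text):
    # Step 2: tokenize, then mean-pool BERT's last hidden state (768-dim).
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0).tolist()

def index_pdf(pdf_path):
    # Step 3: store one embedding per non-empty page in the Pinecone index.
    vectors = [
        (f"page-{i}", embed(text), {"text": text})
        for i, text in enumerate(extract_text(pdf_path))
        if text.strip()
    ]
    index.upsert(vectors=vectors)

def answer(question, top_k=3):
    # Steps 4-5: retrieve the closest pages, then generate with Cohere.
    result = index.query(vector=embed(question), top_k=top_k, include_metadata=True)
    context = "\n".join(m.metadata["text"] for m in result.matches)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    response = co.generate(prompt=prompt, max_tokens=200)
    return response.generations[0].text.strip()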

Gradio Interface

The Gradio interface allows users to:

  • Upload a PDF file.
  • Ask questions based on the content of the PDF.
  • View the answers generated by the model.

To run the interface:

  1. Navigate to the directory where app.py is located.
  2. Run the following command:
python app.py

This will launch a local Gradio web app that you can open in your browser.
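
For reference, a stripped-down version of such an interface might look like the sketch below; qa_pipeline is a hypothetical stand-in for the PDF-indexing and answering logic in app.py.

# Minimal Gradio sketch; qa_pipeline stands in for the real QA logic.
import gradio as gr

def qa_pipeline(pdf_file, question):
    # In app.py this would index the uploaded PDF and run retrieval +
    # generation; here it just returns a placeholder string.
    return f"Answer to: {question}"

demo = gr.Interface(
    fn=qa_pipeline,
    inputs=[
        gr.File(label="Upload PDF", file_types=[".pdf"]),
        gr.Textbox(label="Question"),
    ],
    outputs=gr.Textbox(label="Answer"),
    title="Interactive QA Bot",
)

if __name__ == "__main__":
    demo.launch()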


Deployment Instructions

  1. Download Pretrained Model: Fetch the bert-base-uncased model and tokenizer from Hugging Face.

    • The model weights are saved in model.pth and the tokenizer is saved in the tokenizer/ directory.
  2. Pinecone Setup:

    • Create an account on Pinecone and retrieve your API key.
    • Replace the API key in app.py and GENAI.ipynb with your own key.
    • Set up a Pinecone index for vector search (expanded into a fuller sketch after this list):
      pc.create_index(name="genai", dimension=768, metric='euclidean', ...)
  3. Running the Application:

    • Run the Jupyter notebook GENAI.ipynb to generate the embeddings for any new PDF documents.
    • Launch the Gradio app (app.py) to interact with the QA bot.
  4. Colab Setup:

    • To run the notebook in Google Colab, upload GENAI.ipynb to Colab and run the cells step-by-step.
    • Once the required packages are installed and your API keys are set, you can interact with Pinecone and Cohere directly from Colab.
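
As a hedged sketch of steps 1 and 2 (it assumes model.pth stores a state_dict for AutoModel, and the ServerlessSpec cloud/region stand in for the arguments elided above):

# Deployment sketch for steps 1-2; the ServerlessSpec values are assumptions.
import torch
from transformers import AutoTokenizer, AutoModel
from pinecone import Pinecone, ServerlessSpec

# Step 1: load the saved tokenizer and weights from the repository layout.
tokenizer = AutoTokenizer.from_pretrained("tokenizer/")
model = AutoModel.from_pretrained("bert-base-uncased")
model.load_state_dict(torch.load("model.pth", map_location="cpu"))

# Step 2: create the index if needed; 768 = bert-base-uncased hidden size.
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")  # placeholder key
if "genai" not in pc.list_indexes().names():
    pc.create_index(
        name="genai",
        dimension=768,
        metric="euclidean",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )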

Challenges and Solutions

1. Handling Large Text from PDFs:

  • PDFs can be quite large, especially when multiple pages are involved. To handle this, the PDF text is split into smaller segments that are tokenized individually, keeping the process memory-efficient (see the sketch below).
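
A simple illustration of that splitting step (the 200-word window is an assumption; the notebook's actual segment size may differ):

# Fixed-size word windows keep each segment inside BERT's 512-token limit.
def split_into_segments(text, words_per_segment=200):
    words = text.split()
    return [
        " ".join(words[i:i + words_per_segment])
        for i in range(0, len(words), words_per_segment)
    ]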

2. Accurate Retrieval:

  • BERT embeddings allow for a precise understanding of the question's meaning, and Pinecone is used to retrieve semantically similar text, ensuring that the context for answering questions is highly relevant.

3. Time Efficiency in Embedding Upload:

  • Pinecone supports batch insertion of embeddings, which reduces the time spent uploading large datasets. The system sizes batches dynamically based on the amount of text data (a simplified sketch follows).
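
A simplified version of batched uploading (the fixed batch size of 100 is used here for illustration, whereas the project sizes batches dynamically):

# Upload (id, values, metadata) tuples in batches rather than one at a time.
def upsert_in_batches(index, vectors, batch_size=100):
    for start in range(0, len(vectors), batch_size):
        index.upsert(vectors=vectors[start:start + batch_size])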

Usage Instructions

1. Using the Gradio Interface:

  • Upload a PDF file using the "Upload PDF" button.
  • Ask any question related to the content of the uploaded PDF.
  • The model will retrieve relevant information and provide a response.

2. Using the Colab Notebook:

  • Open the Colab notebook GENAI.ipynb.
  • Ensure you have access to the necessary API keys (Pinecone, Cohere).
  • Run the notebook step by step to generate embeddings and index them in Pinecone.
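
A typical Colab setup cell might look like the sketch below (the package list and key handling are assumptions, not cells copied from the notebook):

# Colab setup sketch: install dependencies and collect API keys.
!pip install -q torch transformers gradio pinecone-client cohere pdfplumber pandas PyPDF2

import os
from getpass import getpass

os.environ["PINECONE_API_KEY"] = getpass("Pinecone API key: ")
os.environ["COHERE_API_KEY"] = getpass("Cohere API key: ")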

Acknowledgments

  • Hugging Face for providing pre-trained models and tokenizers.
  • Pinecone for the vector database service.
  • Cohere for the text generation API.

