Chat with PDFs

This repository contains a Streamlit application that lets you chat with your PDF documents. You ask a question about a document's contents in natural language, and the application uses a language model to generate an answer grounded in that content.

Architecture

[Figure: Architecture diagram]

The application follows these steps to provide responses to your questions:

  • Text Extraction: Extracts text from the uploaded PDF documents.
  • Text Chunking: The extracted text is divided into manageable chunks.
  • Embedding: The app uses an embedding model (Instructor) to generate vector representations (embeddings) of the text chunks.
  • Semantic Matching: When the user asks a question, the app compares it with the text chunks and identifies the most semantically similar ones.
  • Response: The most similar chunks are passed to the language model, which generates a response based on the content of the PDFs.
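The chunking and semantic matching steps above can be sketched in plain Python. This is a minimal illustration, not the code in app.py: the function names (`chunk_text`, `embed`, `top_k`), the chunk sizes, and the word-count "embedding" are all assumptions made for the example. The real app uses Instructor embeddings and a FAISS index instead of the toy vectors shown here.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows (hypothetical sizes)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

Overlapping windows reduce the chance that a sentence needed to answer a question is split across two chunks; the overlap size is a tuning choice rather than a fixed rule.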

The language model used in this project is Google's flan-t5-xxl.
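To make the response step concrete, here is one way the retrieved chunks and the question might be combined into a single prompt for the language model. The `build_prompt` helper, the prompt wording, and the `max_chars` budget are hypothetical; the application itself may rely on LangChain's own prompt templates.

```python
def build_prompt(question, chunks, max_chars=2000):
    """Assemble a prompt from retrieved chunks, within a character budget."""
    context = ""
    for chunk in chunks:
        piece = chunk.strip()
        # Stop adding chunks once the (assumed) context budget is exhausted.
        if len(context) + len(piece) + 2 > max_chars:
            break
        context += piece + "\n\n"
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context.strip()}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

A budget like this matters because the model accepts only a limited amount of input, so only the best-matching chunks should be included.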

Requirements

  • langchain
  • PyPDF2
  • python-dotenv
  • streamlit
  • faiss-cpu
  • altair
  • tiktoken
  • huggingface-hub
  • InstructorEmbedding
  • sentence-transformers
  • torch
  • torchvision
  • setuptools
pip install langchain PyPDF2 python-dotenv streamlit faiss-cpu altair tiktoken huggingface-hub InstructorEmbedding sentence-transformers torch torchvision setuptools

Obtain an API token from Hugging Face and add it to a .env file in the project directory:

HUGGINGFACEHUB_API_TOKEN = your_api_key
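A quick way to confirm the token is picked up before launching the app is sketched below. The `get_hf_token` helper is hypothetical and not part of this repository; `load_dotenv` is the standard python-dotenv call that copies values from .env into the process environment.

```python
import os

try:
    # python-dotenv reads key/value pairs from .env into os.environ.
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    # Without python-dotenv, fall back to variables already in the environment.
    pass


def get_hf_token(environ=None):
    """Return the Hugging Face token, or raise if it is missing or blank."""
    env = os.environ if environ is None else environ
    token = env.get("HUGGINGFACEHUB_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HUGGINGFACEHUB_API_TOKEN is not set; add it to .env")
    return token
```

Failing early with a clear message is usually easier to debug than an authentication error raised later, when the first question is asked.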

Usage

  • Install the required dependencies.
  • Run app.py to start the Streamlit application:
streamlit run app.py

User Interface

  • Upload your PDF documents using the file uploader.
  • Click the Process button to process the uploaded documents.
  • Ask questions through the chat.

[Figure: Ask a Question]

Note

This project uses Instructor to generate text embeddings. The embedding model runs locally on your machine, so processing time will vary with your hardware.