SaniaNemade/University-Rag-Assistant


University-Rag-Assistant


🎓 University RAG Assistant

An AI-powered document intelligence system for Stevens Institute of Technology.
Ask questions about admissions, tuition, deadlines, courses, and international student policies,
and get cited, conflict-aware answers grounded strictly in official university documents.


Built with Python, Streamlit, ChromaDB, and Groq. Licensed under MIT.




✨ What It Does

This project turns five official Stevens Institute PDF documents into a queryable knowledge base. You ask a question in natural language; the system retrieves the most relevant chunks, sends them to an LLM, and returns a sourced, structured answer.
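The retrieval step can be pictured as ranking chunks by vector similarity and keeping the best few. The sketch below is only an illustration in plain Python: in the project, ChromaDB performs this search internally, and the use of cosine similarity, the `(text, embedding)` pair shape, and the toy 2-d vectors are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k=8):
    """Return the texts of the k chunks most similar to the query.

    `chunks` is a list of (text, embedding) pairs; this shape is hypothetical.
    """
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Made-up sample chunks with toy 2-d embeddings (the real model emits 384-d vectors).
chunks = [
    ("Tuition is billed per semester.", [0.9, 0.1]),
    ("The priority deadline is February 1.", [0.1, 0.9]),
    ("Housing forms are due in May.", [0.5, 0.5]),
]

# A query vector pointing toward the "deadline" direction.
print(top_k([0.0, 1.0], chunks, k=2))
```

The default `k=8` mirrors the top-8 retrieval used by the query pipeline; everything else here is a stand-in.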

The standout feature is conflict detection: if two documents disagree on the same fact (e.g. two different application deadlines), the app doesn't silently pick one. It:

  1. 🟢 Highlights the preferred answer from the more recently published source
  2. 🔴 Flags the conflict with both values and their sources side-by-side
  3. Recommends the student verify directly with the relevant office

📂 Project Structure

University-Rag-Assistant/
│
├── app.py                    # Streamlit UI: two-column layout, conflict cards
│
├── src/
│   ├── __init__.py
│   ├── rag.py                # Retrieval + LLM answering + conflict-aware prompt
│   └── ingest.py             # PDF loading, sentence-aware chunking, ChromaDB upsert
│
├── data/                     # Drop your PDF documents here
│   ├── 01_Stevens_Admissions_Guide.pdf
│   ├── 02_Stevens_Course_Catalogue.pdf
│   ├── 03_Stevens_Tuition_Fees.pdf
│   ├── 04_Stevens_Academic_Calendar.pdf
│   └── 05_Stevens_International_FAQ.pdf
│
├── chroma_db/                # Auto-created by ingest.py (git-ignored)
├── .env                      # Your API keys (git-ignored)
├── .gitignore
├── requirements.txt
└── README.md

🧠 Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         INGEST PIPELINE                         │
│                                                                 │
│  PDF files  →  PyPDF page extract  →  Sentence-aware chunker    │
│            →  all-MiniLM-L6-v2 embeddings  →  ChromaDB upsert   │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                          QUERY PIPELINE                         │
│                                                                 │
│  User question  →  Embed query  →  ChromaDB top-8 retrieval     │
│               →  Build context (with chunk + page labels)       │
│               →  Conflict-aware prompt  →  Groq LLaMA 3.3 70B   │
│               →  Parse ✓ / ⚠ markers  →  Render styled cards    │
└─────────────────────────────────────────────────────────────────┘

| Component | Technology | Why |
| --- | --- | --- |
| Vector store | ChromaDB (persistent) | Local, fast, no infra needed |
| Embeddings | all-MiniLM-L6-v2 | 384-d semantic vectors, free |
| LLM | LLaMA 3.3 70B via Groq | Strong multi-source reasoning, free tier |
| Chunking | Sentence-boundary aware | Dates & deadlines never split mid-sentence |
| UI | Streamlit + custom CSS | Two-column, dark academic theme |
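
Sentence-boundary-aware chunking can be sketched as follows. This is not the code in `src/ingest.py`: the function names, the regex-based splitter, and the `max_chars` budget are assumptions chosen to show the idea of packing whole sentences into a chunk so that a date is never cut in half.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_sentences(text, max_chars=500):
    """Pack whole sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars becomes its own oversized chunk
    rather than being cut in the middle.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Illustrative sample text, not taken from the PDFs.
text = (
    "The priority deadline is February 1, 2025. "
    "Applications received later are reviewed on a rolling basis. "
    "Contact the Office of Admissions with questions."
)
for chunk in chunk_sentences(text, max_chars=80):
    print(repr(chunk))
```

A regex splitter like this one will also break after abbreviations such as "Feb. 1", so a production chunker would likely need a more careful sentence tokenizer.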

🚀 Getting Started

1. Clone the repo

git clone https://github.com/SaniaNemade/University-Rag-Assistant.git
cd University-Rag-Assistant

2. Create and activate a virtual environment

python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS / Linux
source .venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

4. Add your API key

Create a .env file in the project root:

GROQ_API_KEY=your_groq_api_key_here

Get a free key at console.groq.com.

5. Add your PDFs

Drop your university PDF documents into the data/ folder.

6. Ingest documents

python src/ingest.py

This extracts the text, splits it into chunks at sentence boundaries, generates semantic embeddings, and stores everything in ChromaDB. Re-run it any time you add or update documents; upsert makes repeated runs safe.
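
Re-running is only safe if each chunk gets the same ID on every run, because an upsert keys on the ID. The sketch below shows one way that could work. The `chunk_id` scheme and the dict-backed `ToyStore` are hypothetical stand-ins, not the project's code or ChromaDB itself.

```python
import hashlib

def chunk_id(filename, page, index):
    """Derive a stable ID from where the chunk lives, not from when it was ingested."""
    key = f"{filename}|p{page}|c{index}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]

class ToyStore:
    """Dict-backed stand-in for a vector store collection (not ChromaDB)."""

    def __init__(self):
        self.rows = {}

    def upsert(self, ids, documents):
        # Insert new IDs and overwrite existing ones, so repeated runs do not duplicate.
        for id_, doc in zip(ids, documents):
            self.rows[id_] = doc

store = ToyStore()
# Made-up sample chunks for illustration.
docs = ["Fall classes begin in late August.", "Tuition is billed per semester."]
ids = [chunk_id("04_Stevens_Academic_Calendar.pdf", page=1, index=i) for i in range(len(docs))]

store.upsert(ids, docs)
store.upsert(ids, docs)  # simulate a second ingest run
print(len(store.rows))
```

One caveat with any position-based ID scheme: if a document shrinks or is re-chunked differently, IDs from the earlier run can linger in the store until they are deleted explicitly.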

7. Launch the app

streamlit run app.py

Open http://localhost:8501 in your browser.


💡 Example Questions to Try

| Question | Expected behaviour |
| --- | --- |
| What is the priority deadline for graduate master's programs? | 🔴 Conflict detected: two different dates found across document sections |
| What is the minimum TOEFL score for international students? | 🟢 Clean answer: 79 iBT, cited from International FAQ |
| How much is undergraduate tuition for 2025–2026? | 🟢 Clean answer: $62,428, cited from Tuition & Fees doc |
| When does the fall 2025 semester start? | 🟢 Clean answer: August 28, 2025, from Academic Calendar |
| Can F-1 students work off-campus in their first year? | 🟢 Clean answer with policy details from International FAQ |

⚠️ Conflict Detection in Action

When two sources disagree, the UI renders two distinct cards:

┌─────────────────────────────────────────────────────┐
│ ✓  PREFERRED ANSWER (most recent source)            │
│                                                     │
│  Based on Source 1 (Admissions Guide, Aug 2024):    │
│  The priority deadline is February 1, 2025.         │
└─────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────┐
│ ⚠  CONFLICT DETECTED                                │
│                                                     │
│  Source 3 (Admissions Guide, page 2, chunk 2)       │
│  states: January 15, 2025.                          │
│                                                     │
│  Please confirm with the Office of Admissions.      │
└─────────────────────────────────────────────────────┘

The LLM is prompted to follow a strict 5-rule resolution protocol: it must not silently pick a value when sources disagree.
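
Turning the model's reply into the two cards requires splitting it on the ✓ and ⚠ markers. The sketch below assumes, hypothetically, that each section starts on a line beginning with its marker. The function name `parse_answer` and the returned dict shape are illustrative and may differ from what `app.py` does.

```python
def parse_answer(text):
    """Split an LLM reply into 'preferred' and 'conflict' sections.

    Assumes each section starts on a line beginning with the marker
    (a hypothetical output format); lines before any marker are ignored.
    """
    sections = {"preferred": [], "conflict": []}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("✓"):
            current = "preferred"
            stripped = stripped.lstrip("✓").strip()
        elif stripped.startswith("⚠"):
            current = "conflict"
            # Also drop the optional emoji variation selector after the sign.
            stripped = stripped.lstrip("⚠\ufe0f").strip()
        if current and stripped:
            sections[current].append(stripped)
    return {name: " ".join(lines) for name, lines in sections.items()}

reply = (
    "✓ Based on Source 1: The priority deadline is February 1, 2025.\n"
    "⚠ Source 3 states: January 15, 2025.\n"
    "Please confirm with the Office of Admissions."
)
parsed = parse_answer(reply)
print(parsed["preferred"])
print(parsed["conflict"])
```

An empty `conflict` string would then mean a clean answer, so the UI could render only the first card.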


🛠 Troubleshooting

OpenBLAS memory error on Windows

$env:OPENBLAS_NUM_THREADS = "1"
$env:OMP_NUM_THREADS = "1"
python src/ingest.py

src module not found: Make sure src/__init__.py exists and you're running commands from the project root, not inside src/.

Conflict not detected: Ensure build_context() in rag.py is NOT deduplicating chunks by page. Two chunks from the same page can still conflict and must appear as separate sources.

Re-ingesting after doc changes: Just re-run python src/ingest.py; upsert() handles duplicates safely.
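
For the "Conflict not detected" case, it may help to see what a non-deduplicating context builder looks like. The version below is a minimal sketch that reuses the `build_context` name from `rag.py`, but the body, the dict keys, and the `[Source N]` label format are assumptions rather than the project's implementation.

```python
def build_context(chunks):
    """Label every retrieved chunk as its own numbered source.

    `chunks` is a list of dicts with 'text', 'source', 'page', and 'chunk'
    keys; this shape is an assumption, not the project's actual schema.
    Chunks are deliberately NOT deduplicated by page.
    """
    blocks = []
    for n, c in enumerate(chunks, start=1):
        header = f"[Source {n}] {c['source']}, page {c['page']}, chunk {c['chunk']}"
        blocks.append(f"{header}\n{c['text']}")
    return "\n\n".join(blocks)

# Two chunks from the same page that disagree (illustrative values).
retrieved = [
    {"text": "The priority deadline is February 1, 2025.",
     "source": "Admissions Guide", "page": 2, "chunk": 1},
    {"text": "The priority deadline is January 15, 2025.",
     "source": "Admissions Guide", "page": 2, "chunk": 2},
]
print(build_context(retrieved))
```

If the builder collapsed these two entries into one because they share a page, the LLM would see only one date and would have no conflict to report.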


📦 Requirements

streamlit
chromadb
pypdf
sentence-transformers
groq
python-dotenv

Install all at once:

pip install streamlit chromadb pypdf sentence-transformers groq python-dotenv

🗺 Roadmap

  • Multi-university support (swap document sets via sidebar)
  • Source confidence scores displayed per chunk
  • PDF viewer panel: click a source chip to see the original page
  • Chat history: multi-turn conversation with memory
  • Export answers as PDF report
  • Admin panel: drag-and-drop document upload + re-ingest trigger

🤝 Contributing

Pull requests are welcome. For major changes please open an issue first to discuss what you'd like to change.

  1. Fork the repo
  2. Create your branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

Distributed under the MIT License. See LICENSE for more information.

