An AI-powered document intelligence system for Stevens Institute of Technology
Ask questions about admissions, tuition, deadlines, courses, and international student policies,
and get cited, conflict-aware answers grounded strictly in official university documents.
This project turns five official Stevens Institute PDF documents into a queryable knowledge base. You ask a question in natural language; the system retrieves the most relevant chunks, sends them to an LLM, and returns a sourced, structured answer.
The standout feature is conflict detection: if two documents disagree on the same fact (e.g. two different application deadlines), the app doesn't silently pick one. It:
- 🟢 Highlights the preferred answer from the more recently published source
- 🔴 Flags the conflict with both values and their sources side-by-side
- Recommends the student verify directly with the relevant office
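The behaviour in the list above can be illustrated in plain Python. This is only a sketch of the rule, not the project's code: in the app the LLM applies it through the prompt, and the `Claim` class, the `resolve` function, the document names and the publication dates below are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Claim:
    """One value for a fact, as stated by one source (hypothetical structure)."""
    value: str
    source: str
    published: date


def resolve(claims: list[Claim]) -> dict:
    """Prefer the claim from the most recently published source and
    list every claim whose value disagrees with it."""
    if not claims:
        raise ValueError("at least one claim is required")
    # On equal dates, max() keeps the first claim in the list.
    preferred = max(claims, key=lambda c: c.published)
    conflicts = [c for c in claims if c.value != preferred.value]
    return {
        "preferred": preferred,
        "conflicts": conflicts,
        # When anything disagrees, the student should verify with the office.
        "verify_with_office": bool(conflicts),
    }


# Made-up example data, for illustration only.
claims = [
    Claim("January 15, 2025", "Document A", date(2023, 8, 1)),
    Claim("February 1, 2025", "Document B", date(2024, 8, 1)),
]
result = resolve(claims)
```

Here `result["preferred"]` is the Document B claim, and the Document A claim is reported as a conflict rather than dropped.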
University-Rag-Assistant/
│
├── app.py                  # Streamlit UI: two-column layout, conflict cards
│
├── src/
│   ├── __init__.py
│   ├── rag.py              # Retrieval + LLM answering + conflict-aware prompt
│   └── ingest.py           # PDF loading, sentence-aware chunking, ChromaDB upsert
│
├── data/                   # Drop your PDF documents here
│   ├── 01_Stevens_Admissions_Guide.pdf
│   ├── 02_Stevens_Course_Catalogue.pdf
│   ├── 03_Stevens_Tuition_Fees.pdf
│   ├── 04_Stevens_Academic_Calendar.pdf
│   └── 05_Stevens_International_FAQ.pdf
│
├── chroma_db/              # Auto-created by ingest.py (git-ignored)
├── .env                    # Your API keys (git-ignored)
├── .gitignore
├── requirements.txt
└── README.md
┌─────────────────────────────────────────────────────────────────┐
│                         INGEST PIPELINE                         │
│                                                                 │
│  PDF files → PyPDF page extract → Sentence-aware chunker        │
│           → all-MiniLM-L6-v2 embeddings → ChromaDB upsert       │
└─────────────────────────────────────────────────────────────────┘
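The sentence-aware chunking step in the ingest pipeline can be sketched as follows. This is a minimal illustration, not the code in `src/ingest.py`: the function names, the regex-based sentence splitter and the `max_chars` limit are assumptions.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.
    Abbreviations such as 'e.g.' will be split too; a real splitter needs more care."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.
    A single sentence longer than max_chars becomes its own oversized chunk
    instead of being cut in the middle."""
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the chunk first.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because a chunk only ever ends at a sentence boundary, a sentence such as "The priority deadline is February 1, 2025." always stays intact inside one chunk.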
┌─────────────────────────────────────────────────────────────────┐
│                         QUERY PIPELINE                          │
│                                                                 │
│  User question → Embed query → ChromaDB top-8 retrieval         │
│           → Build context (with chunk + page labels)            │
│           → Conflict-aware prompt → Groq LLaMA 3.3 70B          │
│           → Parse ✅ / ⚠️ markers → Render styled cards           │
└─────────────────────────────────────────────────────────────────┘
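The "build context" and "conflict-aware prompt" steps can be sketched as below. This is an illustration under assumptions, not the code in `src/rag.py`: the dictionary keys (`doc`, `page`, `chunk`, `text`), the label format and the prompt wording are all hypothetical.

```python
def build_context(hits: list[dict]) -> str:
    """Format retrieved chunks as numbered, labeled sources.
    Every chunk keeps its own entry, even when two chunks share a page."""
    blocks = []
    for i, hit in enumerate(hits, start=1):
        label = f"[Source {i}: {hit['doc']}, page {hit['page']}, chunk {hit['chunk']}]"
        blocks.append(f"{label}\n{hit['text']}")
    return "\n\n".join(blocks)


def build_prompt(question: str, hits: list[dict]) -> str:
    """Assemble a prompt that asks the model to cite sources and surface disagreements.
    The instruction text here is illustrative, not the project's actual prompt."""
    return (
        "Answer using only the sources below and cite them by number.\n"
        "If sources disagree on a fact, report every value with its source.\n\n"
        f"{build_context(hits)}\n\n"
        f"Question: {question}"
    )


# Two chunks from the same page stay separate, so a disagreement remains visible.
hits = [
    {"doc": "Admissions Guide", "page": 2, "chunk": 1, "text": "Priority deadline: February 1, 2025."},
    {"doc": "Admissions Guide", "page": 2, "chunk": 2, "text": "Priority deadline: January 15, 2025."},
]
prompt = build_prompt("What is the priority deadline?", hits)
```

Labeling each chunk with its page and chunk number is what lets the answer point back to a specific location in a document.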
| Component | Technology | Why |
|---|---|---|
| Vector store | ChromaDB (persistent) | Local, fast, no infra needed |
| Embeddings | all-MiniLM-L6-v2 | 384-d semantic vectors, free |
| LLM | LLaMA 3.3 70B via Groq | Strong multi-source reasoning, free tier |
| Chunking | Sentence-boundary aware | Dates & deadlines never split mid-sentence |
| UI | Streamlit + custom CSS | Two-column, dark academic theme |
git clone https://github.com/yourusername/University-Rag-Assistant.git
cd University-Rag-Assistant

python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate

pip install -r requirements.txt

Create a .env file in the project root:
GROQ_API_KEY=your_groq_api_key_here
Get a free key at console.groq.com.
Drop your university PDF documents into the data/ folder.
python src/ingest.py
This extracts text, chunks it at sentence boundaries, generates semantic embeddings, and stores everything in ChromaDB. Re-run it any time you add or update documents; upsert makes re-runs safe.
streamlit run app.py
Open http://localhost:8501 in your browser.
| Question | Expected behaviour |
|---|---|
| What is the priority deadline for graduate master's programs? | 🔴 Conflict detected: two different dates found across document sections |
| What is the minimum TOEFL score for international students? | 🟢 Clean answer: 79 iBT, cited from International FAQ |
| How much is undergraduate tuition for 2025–2026? | 🟢 Clean answer: $62,428, cited from Tuition & Fees doc |
| When does the fall 2025 semester start? | 🟢 Clean answer: August 28, 2025, from Academic Calendar |
| Can F-1 students work off-campus in their first year? | 🟢 Clean answer with policy details from International FAQ |
When two sources disagree, the UI renders two distinct cards:
┌─────────────────────────────────────────────────────┐
│  ✅ PREFERRED ANSWER (most recent source)            │
│                                                     │
│  Based on Source 1 (Admissions Guide, Aug 2024):    │
│  The priority deadline is February 1, 2025.         │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│  ⚠️ CONFLICT DETECTED                                │
│                                                     │
│  Source 3 (Admissions Guide, page 2, chunk 2)       │
│  states: January 15, 2025.                          │
│                                                     │
│  Please confirm with the Office of Admissions.      │
└─────────────────────────────────────────────────────┘
The LLM follows a strict 5-rule resolution protocol: it never silently picks a value when sources disagree.
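One way to turn the model's text into the two cards is to split it on the card headings. The sketch below assumes the model is prompted to emit the literal headings "PREFERRED ANSWER" and "CONFLICT DETECTED" on their own lines; the actual marker format used by `app.py` may differ, and `split_cards` is a hypothetical name.

```python
import re

# Any line containing one of the two headings starts a new card.
HEADING = re.compile(r"^.*?(PREFERRED ANSWER|CONFLICT DETECTED).*$", re.MULTILINE)


def split_cards(answer: str) -> dict[str, str]:
    """Return the body text under each heading found in the answer.
    An answer without headings yields an empty dict (a clean, conflict-free reply)."""
    matches = list(HEADING.finditer(answer))
    cards: dict[str, str] = {}
    for i, match in enumerate(matches):
        # A card body runs from the end of its heading line to the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(answer)
        cards[match.group(1)] = answer[match.end():end].strip()
    return cards


answer = (
    "PREFERRED ANSWER (most recent source)\n"
    "The priority deadline is February 1, 2025.\n\n"
    "CONFLICT DETECTED\n"
    "Another section states January 15, 2025.\n"
)
cards = split_cards(answer)
```

The UI can then render `cards["PREFERRED ANSWER"]` and, when present, `cards["CONFLICT DETECTED"]` as separate styled cards. A body that itself mentions one of the headings would be split incorrectly, so a production parser would want a less ambiguous marker.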
OpenBLAS memory error on Windows
$env:OPENBLAS_NUM_THREADS = "1"
$env:OMP_NUM_THREADS = "1"
python src/ingest.py

src module not found
Make sure src/__init__.py exists and you're running commands from the project root, not inside src/.
Conflict not detected
Ensure build_context() in rag.py is NOT deduplicating chunks by page: two chunks from the same page can still conflict and must appear as separate sources.
Re-ingesting after doc changes
Just re-run python src/ingest.py; upsert() handles duplicates safely.
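Re-running ingestion is only safe if each chunk gets the same ID every time. The sketch below shows one way to derive such IDs. It is an assumption about how the IDs could be built, not the scheme `src/ingest.py` actually uses, and a plain dict stands in for the ChromaDB collection so the example runs without extra dependencies.

```python
import hashlib


def chunk_id(filename: str, page: int, chunk_index: int) -> str:
    """Derive a stable ID from where the chunk lives, not from when it was ingested."""
    key = f"{filename}|{page}|{chunk_index}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]


def ingest(store: dict[str, str], filename: str, pages: list[list[str]]) -> None:
    """Write every chunk under its stable ID; an existing ID is overwritten.
    With ChromaDB, the same IDs would be passed as ids= to collection.upsert(...)."""
    for page_no, chunks in enumerate(pages, start=1):
        for idx, text in enumerate(chunks):
            store[chunk_id(filename, page_no, idx)] = text


store: dict[str, str] = {}
pages = [["Chunk one.", "Chunk two."], ["Chunk three."]]
ingest(store, "03_Stevens_Tuition_Fees.pdf", pages)
ingest(store, "03_Stevens_Tuition_Fees.pdf", pages)  # second run adds nothing new
print(len(store))  # 3
```

One caveat with position-based IDs: if an updated document produces fewer chunks than before, the leftover chunks from the old version stay in the store until they are deleted explicitly.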
streamlit
chromadb
pypdf
sentence-transformers
groq
python-dotenv

Install all at once:
pip install streamlit chromadb pypdf sentence-transformers groq python-dotenv

- Multi-university support (swap document sets via sidebar)
- Source confidence scores displayed per chunk
- PDF viewer panel: click a source chip to see the original page
- Chat history: multi-turn conversation with memory
- Export answers as PDF report
- Admin panel: drag-and-drop document upload + re-ingest trigger
Pull requests are welcome. For major changes please open an issue first to discuss what you'd like to change.
- Fork the repo
- Create your branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.