GradeLens — AI Grading Assistant

A prototype chat assistant for professors reviewing AI-generated thesis grades. Built with FastAPI + LangGraph.

The assistant is explanation-first: it answers questions about stored grading artifacts inside a single thesis conversation. It does not re-grade, finalise grades, or make pass/fail recommendations. Every response includes a grounding status, version ID, and citations back to the grading data.

Demo: here

Requirements

  • Python 3.11+
  • UV (package manager — see install instructions below)

No API key is required. The app runs fully in deterministic mode without one. Set ANTHROPIC_API_KEY or OPENAI_API_KEY to enable LLM-generated answers.


Setup

Install UV

# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Install dependencies

git clone <repo-url>
cd GradeLens

uv sync               # creates .venv and installs all production dependencies
uv sync --extra dev   # also installs pytest

Configure environment

cp .env.example .env
# Optional: set ANTHROPIC_API_KEY or OPENAI_API_KEY in .env

Running the app

Both the backend and frontend must be running for the full UI experience.

Backend (FastAPI)

uv run uvicorn app.main:app --reload

API available at http://localhost:8000
Interactive docs at http://localhost:8000/docs

Frontend (Streamlit)

uv run streamlit run frontend/app.py

Opens at http://localhost:8501
Three-panel layout: sidebar (thesis selector, stage scores), centre (chat), right (citations/evidence).

Offline evals — no server or API key needed

uv run python -m app.evals.eval_runner

Runs 10 representative professor questions through the full graph and checks 42 properties across all intent types. Exits with code 1 on any failure. Run this after changes to router.py, nodes.py, or llm_client.py.
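The eval-runner pattern can be sketched as a small loop: run each case through the graph, check a list of properties, and return a non-zero exit code on any failure. This is a hypothetical miniature, not the project's actual runner; `run_case` and the case list are stand-ins for the real graph invocation and the 10 stored cases.

```python
# Hypothetical sketch of the eval-runner pattern: run each case, check
# properties, and report failure via the return value (the real runner
# exits with code 1 on any failed check).

def run_case(question: str) -> dict:
    # Stand-in for invoking the full LangGraph pipeline.
    return {"grounding_status": "grounded", "citations": ["c1"]}

CASES = [
    "Why was methodology scored low?",
    "What changed between version 1 and version 2?",
]

def main() -> int:
    failures = 0
    for question in CASES:
        result = run_case(question)
        checks = [
            result["grounding_status"] in {"grounded", "blocked", "allowed_override"},
            bool(result["citations"]),
        ]
        failures += checks.count(False)
    return 1 if failures else 0
```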

Tests

uv run pytest tests/ -v

Project structure

GradeLens/
├── app/
│   ├── agent/
│   │   ├── graph.py          # LangGraph StateGraph definition
│   │   ├── nodes.py          # All graph node functions
│   │   ├── router.py         # Deterministic intent classifier (keyword/regex)
│   │   ├── llm_client.py     # AnthropicClient / OpenAIClient / build_llm_client()
│   │   └── state.py          # AgentState TypedDict + initial_state()
│   ├── api/
│   │   └── routes.py         # FastAPI route handlers
│   ├── data/                 # JSON fixtures (loaded at startup — no DB writes)
│   │   ├── theses.json
│   │   └── grades/
│   │       ├── thesis_001.json   # 2 grade versions + evidence + override event
│   │       └── thesis_002.json
│   ├── evals/
│   │   ├── eval_runner.py    # Offline eval script (10 cases, 42 checks)
│   │   └── deterministic_client.py  # Offline answer generator (evals/tests only)
│   ├── models/
│   │   └── schemas.py        # Pydantic models (GradeVersion, Citation, RoutingDecision, …)
│   ├── storage/
│   │   ├── data_store.py     # In-memory DataStore loaded from JSON fixtures
│   │   └── database.py       # SQLAlchemy async SQLite for conversations + audit log
│   └── main.py               # FastAPI app factory + startup
├── frontend/
│   └── app.py                # Streamlit UI
├── tests/
│   └── test_routing_model.py # Unit tests for intent classification + fetch nodes
├── CLAUDE.md                 # Guidance for Claude Code
├── pyproject.toml
└── .env.example

Architecture

POST /theses/{id}/chat
        │
        ▼
  LangGraph StateGraph
        │
  ┌──────────────────────────────┐
  │ resolve_version              │  loads Thesis + active GradeVersion
  │ classify_intent              │  → grade_explanation | version_comparison |
  │                              │    rubric_lookup | override_request | unsupported
  │ fetch_explain_context        │  focal stage(s) + evidence  (grade_explanation)
  │ fetch_rubric_context         │  focal stage(s) + evidence  (rubric_lookup)
  │ fetch_comparison_context     │  version diff + override metadata
  │ call_override_service        │  stub override handler — bypasses generation
  │ generate_answer              │  LLM synthesizer (deterministic offline fallback)
  │ validate_grounding           │  version ref · score consistency · citations · isolation
  │ safe_fallback                │  if blocked: replace with raw structured data
  │ finalize                     │  promotes to final_answer
  │ persist                      │  writes to SQLite (messages + audit log)
  └──────────────────────────────┘
        │
        ▼
  ChatResponse: answer · grounding_status · citations · warnings · intent
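Stripped of the framework, the flow above is a sequence of state-transforming functions, each taking and returning a state dict (mirroring how LangGraph nodes update AgentState). This is an illustrative, framework-free sketch; the real nodes live in app/agent/nodes.py and do considerably more.

```python
# Framework-free sketch of the node pipeline above. Node behaviour here is
# illustrative, not the project's actual logic.

def resolve_version(state: dict) -> dict:
    # Stand-in for loading the Thesis + active GradeVersion.
    state["grade_version_id"] = state.get("active_version", "gv_002")
    return state

def classify_intent(state: dict) -> dict:
    state["intent"] = ("grade_explanation"
                       if "why" in state["message"].lower() else "unsupported")
    return state

def generate_answer(state: dict) -> dict:
    state["answer"] = f"[{state['grade_version_id']}] stub answer for {state['intent']}"
    return state

def validate_grounding(state: dict) -> dict:
    # Blocked unless the answer references the active version id.
    ok = state["grade_version_id"] in state["answer"]
    state["grounding_status"] = "grounded" if ok else "blocked"
    return state

def run_pipeline(message: str) -> dict:
    state = {"message": message}
    for node in (resolve_version, classify_intent, generate_answer, validate_grounding):
        state = node(state)
    return state
```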

Key design decisions

  • LangGraph StateGraph: routing logic is visible in the graph structure rather than buried in if/else chains
  • Deterministic keyword router: no LLM needed for intent classification; unsupported sub-asks are tracked explicitly on the RoutingDecision
  • Grounding validator as a dedicated node: cannot be skipped; safe_fallback is mandatory on blocked, so the assistant never hallucinates
  • DeterministicClient in app/evals/: offline answer generation used for evals and tests; never imported in production code
  • In-memory DataStore + SQLite split: immutable grading fixtures stay fast and simple; only mutable conversation history hits the DB
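The grounding checks (version reference, score consistency, citations) can be sketched as a pure function. This is a hypothetical illustration of the checks named above; the real logic lives in the validate_grounding node and the field names here are assumptions.

```python
import re

# Hypothetical sketch of the grounding checks: version reference,
# citation presence, and score consistency against stored stage scores.

def check_grounding(answer: str, version_id: str,
                    stage_scores: dict[str, int],
                    citations: list[dict]) -> tuple[str, list[str]]:
    warnings = []
    if version_id not in answer:
        warnings.append("answer does not reference the grade version")
    if not citations:
        warnings.append("no citations attached")
    # Any "N/20" mentioned in the answer must match a stored stage score.
    for match in re.finditer(r"(\d+)/20", answer):
        if int(match.group(1)) not in stage_scores.values():
            warnings.append(f"score {match.group(0)} not found in stored grades")
    status = "grounded" if not warnings else "blocked"
    return status, warnings
```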

Intent types

  • grade_explanation: stage aliases ("methodology"), "why", "explain", "scored", evidence keywords → fetch_explain_context
  • version_comparison: "compare", "what changed", "before/after the override", version refs → fetch_comparison_context
  • rubric_lookup: "rubric", "grading criteria", "how is it graded" → fetch_rubric_context
  • override_request: "want to change the score", "increase/decrease", "override stage" → call_override_service
  • unsupported: pass/fail recommendations, personal grading opinions → generate_answer (refusal)

override_request is evaluated before unsupported so a professor asking to change a score is never treated as an opinion query.
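A keyword router of this kind can be sketched as an ordered list of regex patterns checked first-match-wins. The patterns and ordering below are illustrative only (the real rules live in app/agent/router.py), but they show why override_request must be tested before the opinion/unsupported patterns.

```python
import re

# Illustrative sketch of a deterministic keyword router. Patterns are
# assumptions, not the project's actual rules; first match wins.
INTENT_PATTERNS = [
    # Checked first so "change the score" is never read as an opinion query.
    ("override_request",
     re.compile(r"\b(change|increase|decrease)\b.*\bscore\b|\boverride\b")),
    ("version_comparison",
     re.compile(r"\bcompare\b|\bwhat changed\b|\bversion\b")),
    ("rubric_lookup",
     re.compile(r"\brubric\b|\bgrading criteria\b|\bhow is it graded\b")),
    ("grade_explanation",
     re.compile(r"\bwhy\b|\bexplain\b|\bscored\b|\bmethodology\b|\bevidence\b")),
    ("unsupported",
     re.compile(r"\bpass\b|\bfail\b|\bpersonally\b|\bopinion\b")),
]

def classify_intent(message: str) -> str:
    text = message.lower()
    for intent, pattern in INTENT_PATTERNS:
        if pattern.search(text):
            return intent
    return "unsupported"
```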


Sample data

The primary sample thesis, "Deep Learning Approaches to Climate Prediction" (thesis_001), has two grade versions:

  • gv_001 (superseded) — original pipeline output, Stage 3 scored 9/20
  • gv_002 (active) — after Prof. Chen override, Stage 3 scored 14/20

Stage 3 (Methodology) is the intentional weak point, with linked evidence for realistic retrieval demos.
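The relationship between the two versions can be illustrated with a miniature fixture and a diff helper in the spirit of the /compare-versions endpoint. The field names here are hypothetical; the real fixture is app/data/grades/thesis_001.json.

```python
# Hypothetical miniature of the two grade-version fixtures described above.
# Field names are illustrative, not the real fixture schema.
VERSIONS = {
    "gv_001": {"status": "superseded", "stages": {"stage_3": 9}},
    "gv_002": {"status": "active", "stages": {"stage_3": 14}},
}

def diff_versions(old_id: str, new_id: str) -> dict[str, tuple[int, int]]:
    # Return {stage: (old_score, new_score)} for every stage that changed.
    old, new = VERSIONS[old_id]["stages"], VERSIONS[new_id]["stages"]
    return {s: (old[s], new[s]) for s in old if old[s] != new.get(s)}
```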


Example API calls

# Health check
curl http://localhost:8000/health

# Active grade for a thesis
curl http://localhost:8000/theses/thesis_001/active-grade | python -m json.tool

# List grade versions
curl http://localhost:8000/theses/thesis_001/grade-versions | python -m json.tool

# Chat — score explanation
curl -X POST http://localhost:8000/theses/thesis_001/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Why was methodology scored low?", "professor_id": "prof_chen"}'

# Chat — evidence lookup
curl -X POST http://localhost:8000/theses/thesis_001/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Show me the evidence supporting the methodology assessment.", "professor_id": "prof_chen"}'

# Chat — version comparison
curl -X POST http://localhost:8000/theses/thesis_001/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What changed between version 1 and version 2?", "professor_id": "prof_chen"}'

# Chat — rubric lookup
curl -X POST http://localhost:8000/theses/thesis_001/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What criteria are used to grade the methodology stage?", "professor_id": "prof_chen"}'

# Chat — override request
curl -X POST http://localhost:8000/theses/thesis_001/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "I want to change the methodology score.", "professor_id": "prof_chen"}'

# Chat — unsupported (out-of-scope)
curl -X POST http://localhost:8000/theses/thesis_001/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What final grade would you personally give this thesis?", "professor_id": "prof_chen"}'

# Resume a conversation
curl -X POST http://localhost:8000/theses/thesis_001/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What about stage 4?", "professor_id": "prof_chen", "conversation_id": "conv_001"}'

# Read conversation history
curl http://localhost:8000/conversations/conv_001 | python -m json.tool

# Compare versions directly
curl -X POST http://localhost:8000/theses/thesis_001/compare-versions \
  -H "Content-Type: application/json" \
  -d '{"old_version_id": "gv_001", "new_version_id": "gv_002"}'

Chat response fields

  • answer: the grounded explanation
  • grade_version_id_used: the active version, except for comparison queries
  • grounding_status: grounded · blocked · allowed_override
  • citations: list of {source_type, reference_id, label, excerpt}; source types are stage, evidence, version
  • warnings: grounding issues found (empty if clean)
  • intent: classified question type
  • retrieved_context_summary: short description of what was fetched
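The response shape above can be sketched as plain dataclasses. The real models are Pydantic (app/models/schemas.py); the field types and defaults below are assumptions for illustration.

```python
from dataclasses import dataclass, field

# Dataclass sketch of the chat response fields listed above. The project
# uses Pydantic models; types and defaults here are assumptions.

@dataclass
class Citation:
    source_type: str   # "stage" | "evidence" | "version"
    reference_id: str
    label: str
    excerpt: str

@dataclass
class ChatResponse:
    answer: str
    grade_version_id_used: str
    grounding_status: str  # "grounded" | "blocked" | "allowed_override"
    citations: list[Citation] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)
    intent: str = "unsupported"
    retrieved_context_summary: str = ""
```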

Known limitations

  • No real PDF parsing — thesis text is hardcoded fixture data
  • No authentication — professor_id is a free string in requests
  • The persist node runs DB writes in a background thread (workaround for LangGraph sync nodes in async FastAPI — sufficient for prototype)
  • LangGraph's built-in checkpointer is not used — conversation history is written by a custom persist node directly to SQLite via SQLAlchemy
  • call_override_service is a stub — it reads existing override history but does not write new overrides
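The background-thread workaround mentioned above can be sketched with asyncio.to_thread, which dispatches a blocking call to a worker thread so the async endpoint stays responsive. This is a hypothetical illustration; write_message stands in for the real SQLAlchemy write.

```python
import asyncio

# Sketch of the persist workaround: a synchronous DB write dispatched to a
# worker thread from async code. write_message is a stand-in for the real
# SQLite/SQLAlchemy call.

def write_message(conversation_id: str, text: str) -> str:
    # Blocking write (placeholder).
    return f"{conversation_id}: persisted {len(text)} chars"

async def persist(conversation_id: str, text: str) -> str:
    # asyncio.to_thread keeps the event loop free while the write runs.
    return await asyncio.to_thread(write_message, conversation_id, text)
```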

Potential next steps

  1. Add vector embeddings for semantic evidence retrieval (e.g. sentence-transformers)
  2. Wire LangGraph's built-in checkpointer to the SQLite session factory
  3. Add professor authentication (JWT or institution SSO)
  4. Implement a real PDF parsing pipeline for thesis ingestion
  5. Implement the override service write path (currently stub-only)
  6. Add a "helpful / not grounded" feedback button (writes to audit log)
  7. Expose the LangGraph graph visualisation at /graph for debugging

About

An evidence-grounded grading assistant that helps professors interrogate AI-generated thesis evaluations, inspect supporting passages, and compare rationale across grading versions.
