42 changes: 30 additions & 12 deletions .env.example
@@ -1,16 +1,34 @@
# OpenAI API Key (Required for embeddings and LLM)
OPENAI_API_KEY=sk-your-api-key-here
# =====================================================================
# Multimodal Document Chat — environment configuration
# Copy to `.env` (it is gitignored) before running docker-compose.
# =====================================================================

# OpenAI Models (Optional - defaults provided)
OPENAI_MODEL=gpt-4o-mini
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
# --- LLM (Gemini via OpenAI-compatible API) --------------------
LLM_PROVIDER=gemini
LLM_API_KEY=replace-with-your-gemini-api-key
LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai
LLM_MODEL=gemini-2.5-flash

# Database (Configured in docker-compose.yml)
DATABASE_URL=postgresql://docuser:docpass@localhost:5432/docdb
# Legacy aliases — docker-compose interpolates `${LLM_API_KEY}` into the
# OPENAI_API_KEY env var so the upstream `openai` SDK picks it up; if you
# run the backend outside compose, leave OPENAI_API_KEY blank — the code's
# `resolved_api_key` helper falls back to LLM_API_KEY automatically.

# Redis (Configured in docker-compose.yml)
REDIS_URL=redis://localhost:6379/0
# --- Embeddings (local, free, deterministic) -------------------------
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
EMBEDDING_DIMENSION=384

# Upload Settings
UPLOAD_DIR=./uploads
MAX_FILE_SIZE=50 # MB
# --- Retrieval tuning ------------------------------------------------
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
TOP_K_RESULTS=5

# --- Generation tuning ------------------------------------------------
LLM_TEMPERATURE=0.1
LLM_MAX_TOKENS=1024

# --- Infrastructure (docker-compose defaults) ------------------------
DATABASE_URL=postgresql://docuser:docpass@postgres:5432/docdb
REDIS_URL=redis://redis:6379/0
UPLOAD_DIR=/app/uploads
MAX_FILE_SIZE=52428800
288 changes: 288 additions & 0 deletions DESIGN.md
@@ -0,0 +1,288 @@
# DESIGN

This document is the "why" companion to the code. The README explains
how to run it; this file explains the architectural decisions, where
the trade-offs are, and how I would scale it.

---

## 1. System overview

```
┌──────────────┐     PDF      ┌───────────────────────────┐
│ Next.js UI   │ ──────────►  │ FastAPI / upload          │
│ (port 3000)  │              │  - persists file          │
└──────┬───────┘              │  - schedules processor    │
       │                      └─────────────┬─────────────┘
       │ chat                               │ BackgroundTask
       ▼                                    ▼
┌──────────────┐              ┌───────────────────────────┐
│ /api/chat    │              │ DocumentProcessor         │
│              │              │ Docling → text+img+table  │
│ ChatEngine   │◄──retrieves──┤ PyMuPDF fallback          │
│              │              │ Hierarchical chunking     │
└──────┬───────┘              └─────────────┬─────────────┘
       │ embeds                             │ writes
       ▼                                    ▼
┌──────────────┐              ┌───────────────────────────┐
│ Gemini 2.5   │              │ Postgres + pgvector       │
│ Flash        │              │ chunks(embedding 384d)    │
│ (OpenAI-     │              │ images, tables            │
│ compatible)  │              │                           │
└──────────────┘              └───────────────────────────┘
```

* **Embeddings stay local** (`sentence-transformers`,
`BAAI/bge-small-en-v1.5`, 384-dim). Local embeddings are
deterministic, free, and not subject to provider drift, so the index
survives upgrades cleanly.
* **LLM is Gemini 2.5 Flash** via its OpenAI-compatible endpoint. The
upstream `openai` SDK works unchanged — we just point `base_url`
at `https://generativelanguage.googleapis.com/v1beta/openai`.
Swapping providers is a config change.
* **PDF parsing is Docling-first with a PyMuPDF fallback**. Docling
gives us a structured layout tree (sections, figures, tables) that
makes precise multimodal linkage possible; PyMuPDF is a robust
fallback for environments where Docling's model weights can't be
downloaded, so the demo never bricks on a single dependency.
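
To make the "config change" point concrete, here is a minimal sketch of how the client settings could be resolved from the environment. The function name `resolve_llm_settings` is hypothetical (it is not the repo's `resolved_api_key` helper), and the fallback order is an assumption based on the comments in `.env.example`.

```python
from typing import Dict, Mapping

DEFAULT_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai"


def resolve_llm_settings(env: Mapping[str, str]) -> Dict[str, str]:
    """Hypothetical helper: derive client settings from environment variables."""
    # Assumption: prefer OPENAI_API_KEY when set, otherwise fall back to LLM_API_KEY.
    api_key = env.get("OPENAI_API_KEY") or env.get("LLM_API_KEY") or ""
    return {
        "api_key": api_key,
        "base_url": env.get("LLM_BASE_URL") or DEFAULT_BASE_URL,
        "model": env.get("LLM_MODEL") or "gemini-2.5-flash",
    }


settings = resolve_llm_settings({"LLM_API_KEY": "example-key"})
# With the `openai` package installed, the first two values would be passed as
# OpenAI(api_key=settings["api_key"], base_url=settings["base_url"]).
```

Pointing at a different OpenAI-compatible provider would then only require changing `LLM_BASE_URL`, `LLM_MODEL`, and the key.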

---

## 2. Chunking strategy

**Default size: 1000 chars, overlap 200 chars.**

Rationale:

1. **Model fit.** `bge-small-en-v1.5` has a 512-token context. 1000
characters of English averages ~250 tokens — well within the model's
sweet spot, so the embedding isn't truncated for any normal chunk.
2. **Recall vs. precision.** Smaller chunks (~200 chars) over-index on
short n-grams and miss "this paragraph is about X". Larger chunks
(>1500 chars) dilute the embedding so it averages two topics and
stops ranking either highly. 1000 chars is the empirical Goldilocks
point on dense academic text like the Attention paper.
3. **Overlap of 200 chars (~20%)** preserves continuity across cuts so
a sentence that straddles a chunk boundary is still searchable from
both sides. Going higher would inflate the index size without much
recall benefit.

**Boundary-aware sliding window, not naive char split.** The processor
(`_sliding_paragraphs` in `document_processor.py`) walks paragraph by
paragraph, only emitting a chunk when adding the next paragraph would
exceed `CHUNK_SIZE`. The overlap tail starts at a word boundary, not
mid-token. Single paragraphs longer than `CHUNK_SIZE` get hard-cut as a
last resort.
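
The behaviour described above can be sketched roughly as follows. This is an illustrative reimplementation, not the actual `_sliding_paragraphs` code: the function names, the `"\n\n"` paragraph separator, and the exact overlap handling are assumptions.

```python
from typing import List


def word_aligned_tail(chunk: str, overlap: int) -> str:
    """Return roughly the last `overlap` chars of `chunk`, starting at a word boundary."""
    if overlap <= 0:
        return ""
    if len(chunk) <= overlap:
        return chunk
    start = len(chunk) - overlap
    # If the cut lands inside a word, move forward to the next whitespace.
    if not chunk[start - 1].isspace():
        while start < len(chunk) and not chunk[start].isspace():
            start += 1
    return chunk[start:].strip()


def sliding_paragraphs(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Pack paragraphs into chunks of at most `chunk_size` characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        # Last resort: hard-cut a paragraph that cannot fit into a single chunk.
        pieces = [para[i:i + chunk_size] for i in range(0, len(para), chunk_size)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= chunk_size:
                current = candidate
                continue
            # Adding this piece would overflow, so emit what we have.
            chunks.append(current)
            tail = word_aligned_tail(current, overlap)
            seeded = f"{tail}\n\n{piece}" if tail else piece
            # Drop the overlap if it would push the new chunk over the limit.
            current = seeded if len(seeded) <= chunk_size else piece
    if current:
        chunks.append(current)
    return chunks
```

In this sketch the overlap is dropped whenever keeping it would push the next chunk over `chunk_size`; a real implementation might prefer to shrink the tail instead.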

**Why not semantic chunking?** I considered it. Semantic chunking
(embed every sentence, cluster, then group) gives better recall on
narrative text but is significantly slower at ingest, harder to
debug (a chunk's bounds become opaque), and the gains shrink on
heavily structured documents like research papers where paragraph
boundaries already align with semantic shifts. I left a hook in the
`_chunk_pages` method so a semantic variant can drop in later.

**Each chunk carries metadata:**

```jsonc
{
"char_start": 1023,
"char_end": 2031,
"related_image_ids": [12, 13], // same-page images + explicit Figure N refs
"related_table_ids": [4] // same-page tables + explicit Table N refs
}
```

Page number, section heading path, and char offsets all travel with
the chunk. That's what lets the chat engine cite precisely.

---

## 3. Multimodal linking

Linking text to figures and tables is the hardest part of multimodal
RAG. Two strategies, used together:

### 3a. Ingest-time **explicit** linking
At parse time, each chunk gets:

* every image/table on the **same page** (proximity heuristic), plus
* every image/table whose caption matches an in-text reference
(`"see Figure 1"` → resolves to the picture whose caption starts
"Figure 1: …").

This is high-precision: if the author wrote "see Figure 1", the model
gets Figure 1, even when Figure 1 is on a different page.
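
As a rough illustration of the caption-matching step, the sketch below resolves in-text references against captions with regular expressions. The function names, the regex patterns, and the `{id: caption}` input shape are assumptions made for this example, not the processor's actual code.

```python
import re
from typing import Dict, List, Optional, Tuple

# Matches in-text references such as "see Figure 1" or "Table 2".
REFERENCE_RE = re.compile(r"\b(Figure|Fig\.|Table)\s*(\d+)", re.IGNORECASE)
# Matches captions that start with "Figure 1:" or "Table 2.".
CAPTION_RE = re.compile(r"^\s*(Figure|Fig\.|Table)\s*(\d+)\s*[:.]", re.IGNORECASE)


def _kind(label: str) -> str:
    return "figure" if label.lower().startswith("fig") else "table"


def caption_key(caption: str) -> Optional[Tuple[str, int]]:
    match = CAPTION_RE.match(caption)
    return (_kind(match.group(1)), int(match.group(2))) if match else None


def link_explicit(
    chunk_text: str,
    image_captions: Dict[int, str],
    table_captions: Dict[int, str],
) -> Dict[str, List[int]]:
    """Map "Figure N" / "Table N" mentions in a chunk to media ids."""
    index: Dict[Tuple[str, int], int] = {}
    for media_id, caption in {**image_captions}.items():
        key = caption_key(caption)
        if key and key[0] == "figure":
            index.setdefault(key, media_id)
    for media_id, caption in table_captions.items():
        key = caption_key(caption)
        if key and key[0] == "table":
            index.setdefault(key, media_id)

    linked: Dict[str, List[int]] = {"related_image_ids": [], "related_table_ids": []}
    for match in REFERENCE_RE.finditer(chunk_text):
        kind = _kind(match.group(1))
        media_id = index.get((kind, int(match.group(2))))
        bucket = "related_image_ids" if kind == "figure" else "related_table_ids"
        if media_id is not None and media_id not in linked[bucket]:
            linked[bucket].append(media_id)
    return linked
```

The same-page proximity heuristic would be applied separately and merged with these ids.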

### 3b. Query-time **page co-location**
When the retriever scores a chunk on page 5, we union every
image/table on page 5 into the prompt context. This is high-recall:
even if the chunk doesn't itself mention the figure, a question about
the visualised concept will surface the visual.

### Why both?
* Explicit alone fails when the parser misses a caption (Docling is
good but not perfect).
* Co-location alone fails when the relevant figure is one page over
but referenced in this paragraph.
* Together, recall is robust and precision stays high because
explicit hits get surfaced first in the prompt (see
`VectorStore.get_related_media`).
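
The "explicit hits first" ordering could look something like the sketch below. It is a stand-in written for illustration and is not the body of `VectorStore.get_related_media`; the function name and list-of-ids inputs are assumptions.

```python
from typing import Iterable, List


def order_related_media(explicit_ids: Iterable[int], colocated_ids: Iterable[int]) -> List[int]:
    """Union both link sources, keeping explicit references ahead of page co-location."""
    ordered: List[int] = []
    seen = set()
    for media_id in list(explicit_ids) + list(colocated_ids):
        if media_id not in seen:
            seen.add(media_id)
            ordered.append(media_id)
    return ordered
```

For example, `order_related_media([13], [12, 13, 14])` returns `[13, 12, 14]`: the explicitly referenced item leads, and the co-located items follow without duplicates.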

### Per-image embeddings
I deliberately did **not** embed images directly with a vision model
(e.g. CLIP). For this task the question is always text, and CLIP-style
image embeddings don't beat caption-anchored retrieval on documents
with high-quality captions. Adding a vision encoder doubles infra
cost and complicates re-indexing whenever the encoder changes. If we later want
to answer questions about *visual* content ("what colour is the
encoder block?"), I'd add a parallel `image_embeddings` table and
a hybrid search — but as an additive change, not a rewrite.

---

## 4. Evaluation pipeline

Designed but lightly executed in this 4-hour scope. The shape:

### Metrics

| Axis | Metric | Tooling |
|------|--------|---------|
| Retrieval | recall@5, MRR | offline script over labelled set |
| Answer relevance | embedding cosine(question, answer) | sentence-transformers |
| Faithfulness | LLM-as-judge: claims vs. retrieved context | RAGAS / custom GPT-4-class judge |
| Citation correctness | the chunk cited by each `[chunk N]` marker actually contains the claim | string overlap heuristic |
| Latency | p50/p95 end-to-end | uvicorn middleware + log aggregation |
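
For the retrieval row, recall@k and MRR can be computed with a few lines of plain Python. This is a generic sketch of the standard definitions, with hypothetical function names; it assumes each query comes with a ranked list of retrieved chunk ids and a set of labelled relevant ids.

```python
from typing import Hashable, Iterable, List, Sequence, Tuple


def recall_at_k(retrieved: Sequence[Hashable], relevant: Iterable[Hashable], k: int = 5) -> float:
    """Fraction of relevant items that appear in the top-k retrieved results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)


def reciprocal_rank(retrieved: Sequence[Hashable], relevant: Iterable[Hashable]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant_set:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[Hashable], Iterable[Hashable]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)
```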

### Test set

`backend/app/services/evals/ragas_eval.py` ships three labelled cases
tied to the Attention paper — one each for text retrieval (self-
attention explanation), image retrieval (Figure 1 architecture
diagram), and table retrieval (BLEU score table). Each case lists
expected pages, expected modalities, and expected keywords. The
`Evaluator.run_case` method computes keyword recall, page hit-rate,
modality hit-rate, and an answer-relevance proxy in one pass.
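
A simplified sketch of that per-case scoring is shown below. The `EvalCase` fields and the `score_case` function are assumptions made for this example and may not match the real `Evaluator.run_case`; the answer-relevance proxy is left out because it needs an embedding model.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class EvalCase:
    """Hypothetical shape of one labelled case."""
    question: str
    expected_pages: List[int] = field(default_factory=list)
    expected_modalities: List[str] = field(default_factory=list)
    expected_keywords: List[str] = field(default_factory=list)


def _hit_rate(expected: Iterable, observed: Iterable) -> float:
    expected_set, observed_set = set(expected), set(observed)
    if not expected_set:
        return 1.0  # nothing was expected, so nothing was missed
    return len(expected_set & observed_set) / len(expected_set)


def score_case(
    case: EvalCase,
    answer: str,
    retrieved_pages: Iterable[int],
    retrieved_modalities: Iterable[str],
) -> Dict[str, float]:
    """Compute keyword recall, page hit-rate, and modality hit-rate in one pass."""
    answer_lower = answer.lower()
    found = [kw for kw in case.expected_keywords if kw.lower() in answer_lower]
    keyword_recall = len(found) / len(case.expected_keywords) if case.expected_keywords else 1.0
    return {
        "keyword_recall": keyword_recall,
        "page_hit_rate": _hit_rate(case.expected_pages, retrieved_pages),
        "modality_hit_rate": _hit_rate(case.expected_modalities, retrieved_modalities),
    }
```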

To scale this, the right play is:

1. **Golden set** — 30-50 hand-labelled questions across 5-10
representative documents, with expected evidence (page + snippet).
2. **CI gate** — block PRs that regress recall@5 by more than 5pp on
the golden set.
3. **Online eval** — sample 1% of production conversations weekly,
run them through an LLM-as-judge for faithfulness, alert on drift.
4. **Failure taxonomy** — every regression is tagged
(retrieval miss, citation hallucination, refusal, formatting) so
the team knows where to invest.

### What I would NOT do
Chase metrics on shallow benchmarks like BEIR. They don't reflect
document-grounded QA. A 50-case bespoke golden set beats a 10k-case
generic one for this product.

---

## 5. Prompt versioning strategy

### Today (in this repo)

Prompts live as `.txt` files under
`backend/app/services/prompts/templates/`, named
`<purpose>.<role>.<version>.txt`:

```
qa.system.v1.txt
qa.user.v1.txt
```

A `PromptRegistry` (`prompts/registry.py`) loads them lazily, caches
them, and renders via `str.format_map` with safe fallback for missing
keys. Callers ask for a name+version pair:

```python
registry.render("qa", "system", "v1")
```

`ChatEngine` records the prompt version it used into each chat
response (`prompt_version` field). That's the audit trail.
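
A registry with this behaviour can be quite small. The sketch below is an illustrative reconstruction rather than the contents of `prompts/registry.py`: the `_SafeDict` helper and the choice to leave unknown placeholders untouched are assumptions about what "safe fallback" means.

```python
from pathlib import Path
from typing import Dict, Tuple, Union


class _SafeDict(dict):
    """Leave unknown placeholders in place instead of raising KeyError."""

    def __missing__(self, key: str) -> str:
        return "{" + key + "}"


class PromptRegistry:
    def __init__(self, template_dir: Union[str, Path]) -> None:
        self._dir = Path(template_dir)
        self._cache: Dict[Tuple[str, str, str], str] = {}

    def _load(self, name: str, role: str, version: str) -> str:
        key = (name, role, version)
        if key not in self._cache:
            # Lazy load: the file is read on first use, then served from the cache.
            path = self._dir / f"{name}.{role}.{version}.txt"
            self._cache[key] = path.read_text(encoding="utf-8")
        return self._cache[key]

    def render(self, name: str, role: str, version: str, **variables: object) -> str:
        return self._load(name, role, version).format_map(_SafeDict(variables))
```

With a template file `qa.user.v1.txt` containing `Question: {question}`, calling `registry.render("qa", "user", "v1", question="...")` fills the placeholder, and any placeholder without a value is left as-is in the output.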

### Why this shape, not hardcoded strings?

* `git blame` on a prompt file tells you who changed the system prompt
and why. Inline strings hide that.
* Versions are **additive**, never destructive. `qa.system.v2.txt`
ships alongside v1; callers switch when ready. No "monkey-patch the
prompt in prod" risk.
* Tests pin a specific version so refactors don't silently regress.

### Scaling to a real product

The interface (`PromptRegistry.render(name, role, version, **vars)`)
is intentionally identical to what hosted prompt registries look like
(LangSmith Hub, PromptLayer, internal Postgres-backed registries).
Migration is a one-method swap behind the same API.

Add later, in roughly this order:

1. **A/B routing.** `registry.render(...)` becomes
`registry.render(..., user_bucket=...)` so we can ramp v2 to 10% of
users and compare metrics.
2. **Hot reload.** File watcher invalidates the cache on save.
3. **Hosted registry.** Move the txt files to a Postgres-backed table
with `prompt(name, role, version, body, created_by, created_at)`.
Same API, persistent history, role-based edit.
4. **Evaluator coupling.** Each prompt version gets an automatic
eval run on the golden set; "promote v2 to default" requires it
to beat v1 on at least one metric without regressing the others.
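
For the A/B routing step, one common approach is to hash a stable user id into a bucket and compare it against the rollout percentage. The sketch below assumes that approach; `bucket_for` and `choose_version` are hypothetical names and nothing like them exists in the repo yet.

```python
import hashlib


def bucket_for(user_id: str, buckets: int = 100) -> int:
    """Deterministically map a user id to a bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def choose_version(
    user_id: str,
    candidate: str = "v2",
    default: str = "v1",
    rollout_percent: int = 10,
) -> str:
    """Serve `candidate` to roughly `rollout_percent` percent of users."""
    return candidate if bucket_for(user_id) < rollout_percent else default
```

Because the bucket depends only on the user id, a given user keeps seeing the same prompt version for as long as the rollout percentage is unchanged, which keeps per-version metrics comparable.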

### What I avoided

* **Stringly-typed system prompts inline in `chat_engine.py`.** They
metastasise — small edits creep into business logic, and version
history is the file history of the *engine*, not the prompt.
* **LangChain's prompt templating.** Adds a heavy dependency for a
100-line module. We can always adopt LangChain later if we need
its ecosystem; we lose nothing today by deferring.

---

## 6. Operational concerns

* **Background processing**: ingest runs via FastAPI's
`BackgroundTasks` with a freshly opened DB session so it survives
past the request lifetime. Production-grade would switch to Celery
or RQ on the Redis we already provision.
* **pgvector index**: I deliberately did not add an HNSW or IVFFlat
index. At small document counts the cost of building and maintaining
the index outweighs the cost of an exact sequential scan, and an
exact scan with the cosine-distance operator (`<=>`) is fast enough.
The right point to add `CREATE INDEX ... USING hnsw` (with
`vector_cosine_ops`) is when the chunk table crosses ~100k rows.
* **Streaming responses**: not implemented in this scope. The chat
endpoint returns a single JSON. Adding SSE/streaming is a 30-line
change in the API + a tweak in the React component; design-wise,
`LLMClient.chat()` becomes a generator.
* **Auth, rate limits, multi-tenancy**: out of scope. A real
deployment would gate uploads by user, partition documents by
org_id (added as a column on `documents`), and rate-limit chat
endpoints with a Redis-backed token bucket.
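
To illustrate the token-bucket idea mentioned for rate limiting, here is a minimal in-memory sketch. It is a single-process stand-in: a real deployment, as described above, would keep the bucket state in Redis so that all workers share it. The class name and parameters are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """In-memory token bucket: `capacity` burst size, refilled at a steady rate."""

    def __init__(
        self,
        capacity: float,
        refill_per_second: float,
        now: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self._now = now
        self.tokens = float(capacity)
        self.updated = now()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the request fits, else False."""
        current = self._now()
        elapsed = current - self.updated
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = current
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

One bucket per user or per org id would be created, and the chat endpoint would reject a request when `allow()` returns `False`.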

---

## 7. What I would do next

In priority order:

1. **Reranker.** Add `bge-reranker-base` between retrieval and prompt
construction. A cross-encoder reranker over top-20 → top-5 buys
noticeable accuracy on multi-aspect questions for very little
latency cost.
2. **Streaming responses.** SSE the model output to the UI so the
first token shows up in <500ms instead of waiting for the full
completion.
3. **Citation rendering.** Parse `[chunk N]` markers in the answer
text and turn them into clickable affordances that scroll the
source pane to that snippet.
4. **Eval CI.** Wire `evals/ragas_eval.py` into a GitHub Action that
runs on every PR and posts the deltas as a comment.
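
As a starting point for the citation rendering item, the `[chunk N]` markers can be pulled out of an answer with a regular expression. This sketch assumes the markers appear exactly in the form `[chunk N]` with a decimal integer; the function name is hypothetical.

```python
import re
from typing import List

CITATION_RE = re.compile(r"\[chunk (\d+)\]")


def extract_citations(answer: str) -> List[int]:
    """Return cited chunk numbers in order of first appearance, without duplicates."""
    cited: List[int] = []
    for match in CITATION_RE.finditer(answer):
        chunk_number = int(match.group(1))
        if chunk_number not in cited:
            cited.append(chunk_number)
    return cited
```

The UI could then map each number back to the stored chunk's page and char offsets to scroll the source pane.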