InterOpera-Apps · edwaldo12 · May 15, 2026
diff --git a/.env.example b/.env.example
@@ -1,16 +1,34 @@
-# OpenAI API Key (Required for embeddings and LLM)
-OPENAI_API_KEY=sk-your-api-key-here
+# =====================================================================
+# Multimodal Document Chat — environment configuration
+# Copy to `.env` (it is gitignored) before running docker-compose.
+# =====================================================================
 
-# OpenAI Models (Optional - defaults provided)
-OPENAI_MODEL=gpt-4o-mini
-OPENAI_EMBEDDING_MODEL=text-embedding-3-small
+# --- LLM (Gemini via OpenAI-compatible API) --------------------
+LLM_PROVIDER=gemini
+LLM_API_KEY=replace-with-your-gemini-api-key
+LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai
+LLM_MODEL=gemini-2.5-flash
 
-# Database (Configured in docker-compose.yml)
-DATABASE_URL=postgresql://docuser:docpass@localhost:5432/docdb
+# Legacy aliases — docker-compose interpolates `${LLM_API_KEY}` into the
+# OPENAI_API_KEY env var so the upstream `openai` SDK picks it up; if you
+# run the backend outside compose, leave OPENAI_API_KEY blank — the code's
+# `resolved_api_key` helper falls back to LLM_API_KEY automatically.
 
-# Redis (Configured in docker-compose.yml)
-REDIS_URL=redis://localhost:6379/0
+# --- Embeddings (local, free, deterministic) -------------------------
+EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
+EMBEDDING_DIMENSION=384
 
-# Upload Settings
-UPLOAD_DIR=./uploads
-MAX_FILE_SIZE=50  # MB
+# --- Retrieval tuning ------------------------------------------------
+CHUNK_SIZE=1000
+CHUNK_OVERLAP=200
+TOP_K_RESULTS=5
+
+# --- Generation tuning ------------------------------------------------
+LLM_TEMPERATURE=0.1
+LLM_MAX_TOKENS=1024
+
+# --- Infrastructure (docker-compose defaults) ------------------------
+DATABASE_URL=postgresql://docuser:docpass@postgres:5432/docdb
+REDIS_URL=redis://redis:6379/0
+UPLOAD_DIR=/app/uploads
+MAX_FILE_SIZE=52428800
diff --git a/DESIGN.md b/DESIGN.md
@@ -0,0 +1,288 @@
+# DESIGN
+
+This document is the "why" companion to the code. The README explains
+how to run it; this file explains the architectural decisions, where
+the trade-offs are, and how I would scale it.
+
+---
+
+## 1. System overview
+
+```
+   ┌──────────────┐    PDF      ┌──────────────────────────┐
+   │  Next.js UI  │ ──────────► │   FastAPI / upload       │
+   │  (port 3000) │             │   - persists file        │
+   └──────┬───────┘             │   - schedules processor  │
+          │                     └─────────────┬────────────┘
+          │ chat                               │ BackgroundTask
+          ▼                                   ▼
+   ┌──────────────┐               ┌───────────────────────────┐
+   │  /api/chat   │               │  DocumentProcessor        │
+   │              │               │  Docling → text+img+table │
+   │  ChatEngine  │◄──retrieves──┤  PyMuPDF fallback         │
+   │              │               │  Hierarchical chunking    │
+   └──────┬───────┘               └────────────┬──────────────┘
+          │ prompts                             │ writes
+          ▼                                     ▼
+   ┌──────────────┐               ┌───────────────────────────┐
+   │ Gemini 2.5   │               │  Postgres + pgvector      │
+   │ Flash        │               │  chunks(embedding 384d)   │
+   │ (OpenAI-     │               │  images, tables           │
+   │  compatible) │               │                           │
+   └──────────────┘               └───────────────────────────┘
+```
+
+* **Embeddings stay local** (`sentence-transformers`,
+  `BAAI/bge-small-en-v1.5`, 384-dim). Local embeddings are
+  deterministic, free, and not subject to provider drift, so the index
+  stays valid when a hosted model is upgraded or retired.
+* **LLM is Gemini 2.5 Flash** via its OpenAI-compatible endpoint. The
+  upstream `openai` SDK works unchanged — we just point `base_url`
+  at `https://generativelanguage.googleapis.com/v1beta/openai`.
+  Swapping providers is a config change.
+* **PDF parsing is Docling-first with a PyMuPDF fallback**. Docling
+  gives us a structured layout tree (sections, figures, tables) that
+  makes precise multimodal linkage possible. PyMuPDF is a robust
+  fallback for environments where Docling's model weights can't be
+  downloaded, so the demo never bricks on a single dependency.
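+
+The "config change" claim can be made concrete with a small settings
+object. This is a minimal sketch, not the repo's actual config module:
+`LLMSettings` and `from_env` are hypothetical names, and the fallback
+order (legacy `OPENAI_API_KEY` first, then `LLM_API_KEY`) is my reading
+of the comment in `.env.example`.
+
+```python
+from dataclasses import dataclass
+from typing import Mapping
+
+GEMINI_OPENAI_URL = "https://generativelanguage.googleapis.com/v1beta/openai"
+
+
+@dataclass(frozen=True)
+class LLMSettings:
+    """Hypothetical settings holder; field names mirror .env.example."""
+
+    llm_api_key: str = ""
+    openai_api_key: str = ""
+    base_url: str = GEMINI_OPENAI_URL
+    model: str = "gemini-2.5-flash"
+
+    @property
+    def resolved_api_key(self) -> str:
+        # Assumed order: use OPENAI_API_KEY when set, else fall back to LLM_API_KEY.
+        return self.openai_api_key or self.llm_api_key
+
+    @classmethod
+    def from_env(cls, env: Mapping[str, str]) -> "LLMSettings":
+        return cls(
+            llm_api_key=env.get("LLM_API_KEY", ""),
+            openai_api_key=env.get("OPENAI_API_KEY", ""),
+            base_url=env.get("LLM_BASE_URL", GEMINI_OPENAI_URL),
+            model=env.get("LLM_MODEL", "gemini-2.5-flash"),
+        )
+```
+
+The resolved key and `base_url` are what would be handed to the
+`openai` client, so switching providers only touches the environment.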
+
+---
+
+## 2. Chunking strategy
+
+**Default size: 1000 chars, overlap 200 chars.**
+
+Rationale:
+
+1. **Model fit.** `bge-small-en-v1.5` has a 512-token context. 1000
+   characters of English averages ~250 tokens — well within the model's
+   sweet spot, so the embedding isn't truncated for any normal chunk.
+2. **Recall vs. precision.** Smaller chunks (~200 chars) over-index on
+   short n-grams and miss "this paragraph is about X". Larger chunks
+   (>1500 chars) dilute the embedding so it averages two topics and
+   stops ranking either highly. On dense academic text like the
+   Attention paper ("Attention Is All You Need"), 1000 chars was the
+   best balance I found between the two.
+3. **Overlap of 200 chars (~20%)** preserves continuity across cuts so
+   a sentence that straddles a chunk boundary is still searchable from
+   both sides. Going higher would inflate the index size without much
+   recall benefit.
+
+**Boundary-aware sliding window, not naive char split.** The processor
+(`_sliding_paragraphs` in `document_processor.py`) walks paragraph by
+paragraph, only emitting a chunk when adding the next paragraph would
+exceed `CHUNK_SIZE`. The overlap tail starts at a word boundary, not
+mid-token. Single paragraphs longer than `CHUNK_SIZE` get hard-cut as a
+last resort.
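+
+A minimal sketch of that boundary-aware window, under stated
+assumptions: paragraphs are separated by blank lines, and
+`sliding_paragraph_chunks` is an illustrative name, not the repo's
+`_sliding_paragraphs`. Because the overlap tail is prepended to the next
+paragraph, a chunk in this sketch can exceed `chunk_size` by up to
+`overlap + 1` characters.
+
+```python
+from typing import List
+
+
+def sliding_paragraph_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
+    """Illustrative paragraph-aware chunker with a word-aligned overlap tail."""
+    chunks: List[str] = []
+    current = ""
+    for para in (p.strip() for p in text.split("\n\n")):
+        if not para:
+            continue
+        # Last resort: hard-cut a single paragraph longer than chunk_size.
+        pieces = [para[i:i + chunk_size] for i in range(0, len(para), chunk_size)]
+        for piece in pieces:
+            candidate = f"{current}\n\n{piece}" if current else piece
+            if len(candidate) <= chunk_size:
+                current = candidate
+                continue
+            # Adding this piece would overflow, so emit what we have.
+            chunks.append(current)
+            # Carry the last `overlap` chars forward, trimmed to a word boundary.
+            tail = current[-overlap:]
+            if len(current) > overlap and " " in tail:
+                tail = tail.split(" ", 1)[1]
+            current = f"{tail} {piece}"
+    if current:
+        chunks.append(current)
+    return chunks
+```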
+
+**Why not semantic chunking?** I considered it. Semantic chunking
+(embed every sentence, cluster, then group) gives better recall on
+narrative text but is significantly slower at ingest, harder to
+debug (a chunk's bounds become opaque), and the gains shrink on
+heavily structured documents like research papers where paragraph
+boundaries already align with semantic shifts. I left a hook in the
+`_chunk_pages` method so a semantic variant can drop in later.
+
+**Each chunk carries metadata:**
+
+```jsonc
+{
+  "char_start": 1023,
+  "char_end": 2031,
+  "related_image_ids": [12, 13],   // same-page images + explicit Figure N refs
+  "related_table_ids": [4]          // same-page tables + explicit Table N refs
+}
+```
+
+Page number, section heading path, and char offsets all travel with
+the chunk. That's what lets the chat engine cite precisely.
+
+---
+
+## 3. Multimodal linking
+
+Linking text to figures and tables is the hardest part of multimodal
+RAG. Two strategies, used together:
+
+### 3a. Ingest-time **explicit** linking
+At parse time, each chunk gets:
+
+* every image/table on the **same page** (proximity heuristic), plus
+* every image/table whose caption matches an in-text reference
+  (`"see Figure 1"` → resolves to the picture whose caption starts
+  "Figure 1: …").
+
+This is high-precision: if the author wrote "see Figure 1", the model
+gets Figure 1, even when Figure 1 is on a different page.
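+
+The two ingest-time rules can be sketched as follows. This is a
+hypothetical illustration: the `media` dict shape (`id`, `page`,
+`caption`), the regex, and the `link_media` name are my assumptions,
+not the processor's actual code.
+
+```python
+import re
+from typing import List
+
+# Matches in-text references such as "Figure 1", "Fig. 2" or "Table 3".
+REF_PATTERN = re.compile(r"\b(Figure|Fig\.|Table)\s*(\d+)", re.IGNORECASE)
+
+
+def normalise(kind: str, number: str) -> str:
+    """Map 'Fig.', 'Figure', 'Table' plus a number to a comparable key."""
+    kind = "table" if kind.lower().startswith("tab") else "figure"
+    return f"{kind}:{int(number)}"
+
+
+def link_media(chunk_text: str, chunk_page: int, media: List[dict]) -> List[int]:
+    """Return ids of media on the chunk's page plus explicitly referenced ones."""
+    referenced = {normalise(k, n) for k, n in REF_PATTERN.findall(chunk_text)}
+    linked: List[int] = []
+    for item in media:
+        # A caption such as "Figure 1: ..." identifies the item it labels.
+        match = REF_PATTERN.match(item["caption"].strip())
+        explicit = bool(match) and normalise(*match.groups()) in referenced
+        same_page = item["page"] == chunk_page
+        if explicit or same_page:
+            linked.append(item["id"])
+    return linked
+```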
+
+### 3b. Query-time **page co-location**
+When the retriever returns a chunk on page 5, we add every
+image/table on page 5 to the prompt context. This is high-recall:
+even if the chunk doesn't itself mention the figure, a question about
+the visualised concept will surface the visual.
+
+### Why both?
+* Explicit alone fails when the parser misses a caption (Docling is
+  good but not perfect).
+* Co-location alone fails when the relevant figure is one page over
+  but referenced in this paragraph.
+* Together, recall is robust and precision stays high because
+  explicit hits get surfaced first in the prompt (see
+  `VectorStore.get_related_media`).
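+
+The "explicit hits first" ordering could look like the sketch below.
+`order_related_media` is a hypothetical helper written for this
+document; I am not claiming it matches the body of
+`VectorStore.get_related_media`.
+
+```python
+from typing import Iterable, List
+
+
+def order_related_media(explicit_ids: Iterable[int], page_ids: Iterable[int]) -> List[int]:
+    """Explicit links first, then same-page media, without duplicates."""
+    ordered: List[int] = []
+    seen = set()
+    for media_id in [*explicit_ids, *page_ids]:
+        if media_id not in seen:
+            seen.add(media_id)
+            ordered.append(media_id)
+    return ordered
+```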
+
+### Per-image embeddings
+I deliberately did **not** embed images directly with a vision model
+(e.g. CLIP). For this task the question is always text, and CLIP-style
+image embeddings don't beat caption-anchored retrieval on documents
+with high-quality captions. Adding a vision encoder doubles infra
+cost and complicates re-indexing on prompt changes. If we later want
+to answer questions about *visual* content ("what colour is the
+encoder block?"), I'd add a parallel `image_embeddings` table and
+a hybrid search — but as an additive change, not a rewrite.
+
+---
+
+## 4. Evaluation pipeline
+
+Designed but lightly executed in this 4-hour scope. The shape:
+
+### Metrics
+
+| Axis | Metric | Tooling |
+|------|--------|---------|
+| Retrieval | recall@5, MRR | offline script over labelled set |
+| Answer relevance | embedding cosine(question, answer) | sentence-transformers |
+| Faithfulness | LLM-as-judge: claims vs. retrieved context | RAGAS / custom GPT-4-class judge |
+| Citation correctness | the chunk cited by each `[chunk N]` marker actually contains the claim | string overlap heuristic |
+| Latency | p50/p95 end-to-end | uvicorn middleware + log aggregation |
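+
+For the retrieval row, recall@5 and MRR are small enough to sketch
+inline. The function names are illustrative, and I assume each query
+comes with a ranked list of retrieved chunk ids and a labelled set of
+relevant ids.
+
+```python
+from typing import List, Sequence, Set, Tuple
+
+
+def recall_at_k(retrieved: Sequence[int], relevant: Set[int], k: int = 5) -> float:
+    """Fraction of relevant ids that appear in the top-k results."""
+    if not relevant:
+        return 0.0
+    return len(set(retrieved[:k]) & relevant) / len(relevant)
+
+
+def reciprocal_rank(retrieved: Sequence[int], relevant: Set[int]) -> float:
+    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
+    for rank, chunk_id in enumerate(retrieved, start=1):
+        if chunk_id in relevant:
+            return 1.0 / rank
+    return 0.0
+
+
+def mean_reciprocal_rank(runs: List[Tuple[Sequence[int], Set[int]]]) -> float:
+    """Average reciprocal rank over (retrieved, relevant) pairs."""
+    if not runs:
+        return 0.0
+    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
+```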
+
+### Test set
+
+`backend/app/services/evals/ragas_eval.py` ships three labelled cases
+tied to the Attention paper, one each for text retrieval
+(self-attention explanation), image retrieval (Figure 1 architecture
+diagram), and table retrieval (BLEU score table). Each case lists
+expected pages, expected modalities, and expected keywords. The
+`Evaluator.run_case` method computes keyword recall, page hit-rate,
+modality hit-rate, and an answer-relevance proxy in one pass.
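+
+A hedged sketch of the per-case scoring, to show how cheap these
+proxies are. The `case` dict keys and the `score_case` function are
+assumptions for illustration, not the signature of
+`Evaluator.run_case`; the answer-relevance proxy is left out because it
+needs the embedding model.
+
+```python
+from typing import Dict, Iterable
+
+
+def keyword_recall(answer: str, expected_keywords: Iterable[str]) -> float:
+    """Share of expected keywords found in the answer (case-insensitive)."""
+    keywords = [k.lower() for k in expected_keywords]
+    if not keywords:
+        return 1.0
+    text = answer.lower()
+    return sum(k in text for k in keywords) / len(keywords)
+
+
+def hit_rate(found: Iterable, expected: Iterable) -> float:
+    """Share of expected items (pages or modalities) present in `found`."""
+    expected_set = set(expected)
+    if not expected_set:
+        return 1.0
+    return len(expected_set & set(found)) / len(expected_set)
+
+
+def score_case(answer: str, pages: Iterable[int], modalities: Iterable[str], case: dict) -> Dict[str, float]:
+    """Compute the three cheap proxies for one labelled case."""
+    return {
+        "keyword_recall": keyword_recall(answer, case["expected_keywords"]),
+        "page_hit_rate": hit_rate(pages, case["expected_pages"]),
+        "modality_hit_rate": hit_rate(modalities, case["expected_modalities"]),
+    }
+```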
+
+To scale this, the right play is:
+
+1. **Golden set** — 30-50 hand-labelled questions across 5-10
+   representative documents, with expected evidence (page + snippet).
+2. **CI gate** — block PRs that regress recall@5 by more than 5pp on
+   the golden set.
+3. **Online eval** — sample 1% of production conversations weekly,
+   run them through an LLM-as-judge for faithfulness, alert on drift.
+4. **Failure taxonomy** — every regression is tagged
+   (retrieval miss, citation hallucination, refusal, formatting) so
+   the team knows where to invest.
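+
+The CI gate in step 2 reduces to one comparison. A minimal sketch,
+assuming recall@5 is reported as a fraction in [0, 1]; `recall_gate`
+and its small float tolerance are my own illustrative choices.
+
+```python
+def recall_gate(baseline: float, candidate: float, max_drop_pp: float = 5.0) -> bool:
+    """Return True when the PR may merge: recall@5 dropped by at most max_drop_pp."""
+    drop_pp = (baseline - candidate) * 100.0
+    # Small tolerance so a drop of exactly 5pp is not rejected by float rounding.
+    return drop_pp <= max_drop_pp + 1e-9
+```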
+
+### What I would NOT do
+Chase scores on generic retrieval benchmarks like BEIR. They
+don't reflect document-grounded QA, and a 50-case bespoke golden set
+beats a 10k-case generic one for this product.
+
+---
+
+## 5. Prompt versioning strategy
+
+### Today (in this repo)
+
+Prompts live as `.txt` files under
+`backend/app/services/prompts/templates/`, named
+`<purpose>.<role>.<version>.txt`:
+
+```
+qa.system.v1.txt
+qa.user.v1.txt
+```
+
+A `PromptRegistry` (`prompts/registry.py`) loads them lazily, caches
+them, and renders via `str.format_map` with a safe fallback for missing
+keys. Callers ask for a name, role, and version:
+
+```python
+registry.render("qa", "system", "v1")
+```
+
+`ChatEngine` records the prompt version it used into each chat
+response (`prompt_version` field). That's the audit trail.
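+
+For readers who have not opened `prompts/registry.py`, the behaviour
+described above fits in a few lines. This is a simplified sketch under
+my own assumptions (the `_SafeDict` helper and constructor argument are
+hypothetical), not a copy of the shipped module.
+
+```python
+from pathlib import Path
+from typing import Dict, Tuple
+
+
+class _SafeDict(dict):
+    """Leaves unknown placeholders untouched instead of raising KeyError."""
+
+    def __missing__(self, key: str) -> str:
+        return "{" + key + "}"
+
+
+class PromptRegistry:
+    """Illustrative registry for <purpose>.<role>.<version>.txt templates."""
+
+    def __init__(self, template_dir: Path) -> None:
+        self._dir = Path(template_dir)
+        self._cache: Dict[Tuple[str, str, str], str] = {}
+
+    def _load(self, name: str, role: str, version: str) -> str:
+        key = (name, role, version)
+        if key not in self._cache:  # lazy: read from disk on first use only
+            path = self._dir / f"{name}.{role}.{version}.txt"
+            self._cache[key] = path.read_text(encoding="utf-8")
+        return self._cache[key]
+
+    def render(self, name: str, role: str, version: str, **variables: str) -> str:
+        # format_map consults _SafeDict.__missing__ for absent keys.
+        return self._load(name, role, version).format_map(_SafeDict(variables))
+```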
+
+### Why this shape, not hardcoded strings?
+
+* `git blame` on a prompt file tells you who changed the system prompt
+  and why. Inline strings hide that.
+* Versions are **additive**, never destructive. `qa.system.v2.txt`
+  ships alongside v1; callers switch when ready. No "monkey-patch the
+  prompt in prod" risk.
+* Tests pin a specific version so refactors don't silently regress.
+
+### Scaling to a real product
+
+The interface (`PromptRegistry.render(name, role, version, **vars)`)
+is intentionally identical to what hosted prompt registries look like
+(LangSmith Hub, PromptLayer, internal Postgres-backed registries).
+Migration is a one-method swap behind the same API.
+
+Add later, in roughly this order:
+
+1. **A/B routing.** `registry.render(...)` becomes
+   `registry.render(..., user_bucket=...)` so we can ramp v2 to 10% of
+   users and compare metrics.
+2. **Hot reload.** File watcher invalidates the cache on save.
+3. **Hosted registry.** Move the txt files to a Postgres-backed table
+   with `prompt(name, role, version, body, created_by, created_at)`.
+   Same API, persistent history, role-based edit.
+4. **Evaluator coupling.** Each prompt version gets an automatic
+   eval run on the golden set; "promote v2 to default" requires it
+   to beat v1 on at least one metric without regressing the others.
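+
+The A/B routing in step 1 mostly needs a stable way to bucket users. A
+sketch, assuming a string `user_id` is available at render time;
+`pick_version` is a hypothetical helper, not part of the current
+registry.
+
+```python
+import hashlib
+
+
+def pick_version(user_id: str, rollout_percent: int, control: str = "v1", treatment: str = "v2") -> str:
+    """Deterministically route `rollout_percent`% of users to the treatment version."""
+    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
+    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
+    return treatment if bucket < rollout_percent else control
+```
+
+Hashing rather than random sampling keeps a user on the same prompt
+version across requests, which makes per-user metrics comparable.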
+
+### What I avoided
+
+* **Stringly-typed system prompts inline in `chat_engine.py`.** They
+  metastasise — small edits creep into business logic, and version
+  history is the file history of the *engine*, not the prompt.
+* **LangChain's prompt templating.** Adds a heavy dependency for a
+  100-line module. We can always adopt LangChain later if we need
+  its ecosystem; we lose nothing today by deferring.
+
+---
+
+## 6. Operational concerns
+
+* **Background processing**: ingest runs via FastAPI's
+  `BackgroundTasks` with a freshly opened DB session so it survives
+  past the request lifetime. Production-grade would switch to Celery
+  or RQ on the Redis we already provision.
+* **pgvector index**: I deliberately did not add an HNSW or IVFFlat
+  index. At small document counts the cost of building and maintaining
+  the index outweighs the scan cost, and an exact sequential scan with
+  the cosine-distance operator (`<=>`) is fast enough. The right point
+  to add `CREATE INDEX ... USING hnsw (embedding vector_cosine_ops)`
+  is when the chunk table crosses ~100k rows.
+* **Streaming responses**: not implemented in this scope. The chat
+  endpoint returns a single JSON. Adding SSE/streaming is a 30-line
+  change in the API + a tweak in the React component; design-wise,
+  `LLMClient.chat()` becomes a generator.
+* **Auth, rate limits, multi-tenancy**: out of scope. A real
+  deployment would gate uploads by user, partition documents by
+  org_id (added as a column on `documents`), and rate-limit chat
+  endpoints with a Redis-backed token bucket.
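+
+To make the rate-limit idea concrete, here is the token-bucket logic in
+isolation. It is an in-memory stand-in written for illustration: a real
+deployment would keep the token count and timestamp in Redis, and the
+`TokenBucket` class and injectable `clock` are my assumptions.
+
+```python
+import time
+from typing import Callable
+
+
+class TokenBucket:
+    """In-memory token bucket; state would live in Redis in production."""
+
+    def __init__(self, capacity: int, refill_per_sec: float,
+                 clock: Callable[[], float] = time.monotonic) -> None:
+        self.capacity = capacity
+        self.refill_per_sec = refill_per_sec
+        self._clock = clock
+        self._tokens = float(capacity)
+        self._updated = clock()
+
+    def allow(self, cost: float = 1.0) -> bool:
+        """Refill based on elapsed time, then spend `cost` tokens if available."""
+        now = self._clock()
+        elapsed = now - self._updated
+        self._updated = now
+        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
+        if self._tokens >= cost:
+            self._tokens -= cost
+            return True
+        return False
+```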
+
+---
+
+## 7. What I would improve next
+
+In priority order:
+
+1. **Reranker.** Add `bge-reranker-base` between retrieval and prompt
+   construction. A cross-encoder reranker over top-20 → top-5 buys
+   noticeable accuracy on multi-aspect questions for very little
+   latency cost.
+2. **Streaming responses.** SSE the model output to the UI so the
+   first token shows up in <500ms instead of waiting for the full
+   completion.
+3. **Citation rendering.** Parse `[chunk N]` markers in the answer
+   text and turn them into clickable affordances that scroll the
+   source pane to that snippet.
+4. **Eval CI.** Wire `evals/ragas_eval.py` into a GitHub Action that
+   runs on every PR and posts the deltas as a comment.
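+
+For item 3, the parsing half is straightforward. A sketch, assuming
+markers look exactly like `[chunk N]`; the `#chunk-N` anchor format and
+both function names are hypothetical, and the real UI would render
+React components rather than Markdown links.
+
+```python
+import re
+from typing import List
+
+CITATION = re.compile(r"\[chunk\s+(\d+)\]", re.IGNORECASE)
+
+
+def extract_citations(answer: str) -> List[int]:
+    """Return cited chunk numbers in first-appearance order, without duplicates."""
+    seen: List[int] = []
+    for match in CITATION.finditer(answer):
+        number = int(match.group(1))
+        if number not in seen:
+            seen.append(number)
+    return seen
+
+
+def to_links(answer: str) -> str:
+    """Replace each marker with a Markdown link to a hypothetical #chunk-N anchor."""
+    return CITATION.sub(lambda m: f"[chunk {m.group(1)}](#chunk-{m.group(1)})", answer)
+```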