Example RAG

A Retrieval-Augmented Generation (RAG) system for documents using LanceDB as the vector database. Designed to mirror AWS serverless architecture locally using Docker.

Architecture

┌─────────────────┐
│    Web App      │
│   (Next.js)     │
└────────┬────────┘
         │ WebSocket
         ▼
┌─────────────────┐         SSE (streaming)        ┌─────────────────┐
│    Gateway      │◄──────────────────────────────►│      LLM        │
│   (Node.js)     │                                │    (Python)     │
└────────┬────────┘                                └─────────────────┘
         │ AWS SDK (Lambda Invoke)
         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Ingestion     │     │      Chat       │     │    Embedding    │
│   (Lambda)      │     │    (Lambda)     │     │    (Python)     │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         │                       │                       │
         ▼                       ▼                       │
┌─────────────────────────────────────────┐              │
│              Shared Module              │◄─────────────┘
│  (connection, operations, embedding)    │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│            LanceDB (embedded)           │
│         stored in MinIO (S3)            │
└─────────────────────────────────────────┘

Gateway

The gateway handles WebSocket connections from the web app and invokes Lambda functions using the AWS SDK. This mirrors AWS API Gateway WebSocket APIs:

Local: Gateway uses AWS SDK pointing at Lambda RIE endpoints
Production: Gateway uses AWS SDK pointing at real Lambda functions

The web app communicates entirely over WebSocket - no direct HTTP calls to Lambda services.

Data Flow

Ingestion (adding documents):

Web App
    │
    │  WebSocket: { action: "ingest", text: "...", entry_date: "...", topics: [...] }
    │
    ▼
Gateway
    │
    │  AWS SDK Lambda Invoke
    │
    ▼
Ingestion Service (Lambda)
    │
    ├──► Embedding Service (get vector for text)
    │         │
    │         ▼
    │    Returns 384-dimensional vector
    │
    ▼
LanceDB ──► S3/MinIO (internal storage)
    │
    ▼
Gateway
    │
    │  WebSocket: { action: "ingest", statusCode: 200, data: { id, message } }
    │
    ▼
Web App

Query (searching documents):

Web App
    │
    │  WebSocket: { action: "query", query: "...", limit: 5 }
    │
    ▼
Gateway
    │
    │  AWS SDK Lambda Invoke
    │
    ▼
Chat Service (Lambda)
    │
    ├──► Embedding Service (get vector for query)
    │         │
    │         ▼
    │    Returns 384-dimensional vector
    │
    ▼
LanceDB (vector similarity search)
    │
    ▼
Gateway
    │
    │  WebSocket: { action: "query", statusCode: 200, data: { results: [...] } }
    │
    ▼
Web App

LanceDB uses S3 (MinIO) as its storage backend internally - the services make one call to LanceDB, which handles S3 storage under the hood.

Chat (streaming flow):

Web App
    │
    │  WebSocket: { action: "chat", content: "...", stream: true }
    │
    ▼
Gateway
    │
    │  AWS SDK Lambda Invoke
    │
    ▼
Chat Service (Lambda)
    │
    ├──► Embedding Service (get vector for user message)
    │         │
    │         ▼
    │    Returns 1024-dimensional vector
    │
    ├──► LanceDB (vector similarity search for RAG context)
    │
    ▼
Gateway (receives RAG context)
    │
    │  HTTP POST with stream=true
    │
    ▼
LLM Service
    │
    │  SSE: data: {"choices":[{"delta":{"content":"token"}}]}
    │  ... (tokens stream)
    │  SSE: data: [DONE]
    │
    ▼
Gateway (proxies SSE to WebSocket)
    │
    │  WebSocket: { action: "chat_stream_start", rag_context: [...] }
    │  WebSocket: { action: "chat_stream_token", token: "Hello" }
    │  WebSocket: { action: "chat_stream_token", token: " there" }
    │  ... (tokens stream)
    │  WebSocket: { action: "chat_stream_end", content: "Hello there...", role: "assistant" }
    │
    ▼
Web App (displays tokens as they arrive for "typing" effect)

Detailed Chat Flow:

1. Web App → Gateway: "what's the price of gold?"

2. Gateway → Chat Service (action: "rag"):
   - Chat Service internally:
     - Calls Embedding Service → gets vector [0.12, -0.45, ...]
     - Searches LanceDB with that vector → finds matching entries
     - Saves user message to MongoDB
   - Returns to Gateway:
     - system_prompt: "You are a journal assistant... Relevant entries: [2026-01-27] gold is $5,220..."
     - rag_context: [{entry_date, text_snippet, score}]

3. Gateway → LLM:
   - Sends: system_prompt + user message
   - Receives: streaming tokens

4. Gateway → Web App:
   - Forwards each token via WebSocket

5. Gateway → Chat Service (action: "save_message"):
   - Saves assistant's complete response to MongoDB

Key point: The embedding vector stays inside the Chat Service. The Gateway only receives:

The system prompt (text with RAG context baked in)
The rag_context metadata (for showing "Sources" in UI)

The Gateway never sees the actual vector—it just passes text to the LLM. The LLM has no knowledge of the Chat Service, embeddings, or databases. It simply receives a system prompt (which happens to contain retrieved documents) and a user message, then generates a response.

Services

Service	Description
web	Next.js frontend for chat and document upload
gateway	WebSocket server that invokes Lambda functions via AWS SDK
ingestion	TypeScript Lambda for adding documents to the database
chat	TypeScript Lambda for semantic search over documents
embedding	Python service using sentence-transformers for vector generation
llm	Python service using Qwen2.5-3B-Instruct for chat completions (SSE streaming)
minio	S3-compatible object storage (mimics AWS S3 locally)
minio-setup	One-time container that creates the `lancedb` bucket

Shared Module

Common code lives in /shared and is imported via TypeScript path aliases (@shared/*). This avoids npm package dependencies while allowing code reuse.

Contents:

config.ts - Environment configuration with defaults
types.ts - TypeScript types for entries, events, and API bodies (includes aws-lambda dependency)
chat-types.ts - Chat-related types (browser-safe, no Node.js dependencies)
db/connection.ts - LanceDB connection management with caching
db/operations.ts - LanceDB operations (addEntry, searchSimilar)
db/mongo-connection.ts - MongoDB connection management
db/mongo-operations.ts - MongoDB CRUD operations for chat sessions and messages
services/embedding.ts - Embedding service client

Web App Chat Feature

The chat page (/chat) provides a real-time chat interface that communicates with the gateway over WebSocket.

Components (web/components/Chat/):

Chat.tsx - Main container with header, error banner, message list, and input
MessageList.tsx - Displays messages with user/assistant styling, RAG context sources, typing indicator
MessageInput.tsx - Text input with Enter-to-send support
ConnectionStatus.tsx - Shows WebSocket connection state (connecting/connected/disconnected/error)

Hooks (web/components/Chat/hooks/):

useWebSocket.ts - WebSocket connection management with auto-reconnect
useChat.ts - Chat state management combining WebSocket with message handling

Types:

Shared types imported from @shared/chat-types (ChatMessage, ChatSession, RagContext, WebSocketChatMessage)
UI-specific types in web/components/Chat/types.ts (ConnectionStatus)

Database Schema

LanceDB (Vector Database)

Single table document_entries with the following columns:

Column	Type	Description
`id`	string	Unique identifier (UUID)
`entry_id`	string \| null	Groups chunks from the same entry
`entry_date`	string	Date of the document
`chunk_index`	number	Position when entry is split into chunks
`text`	string	The document text
`vector`	number[]	384-dimensional embedding vector
`topics`	string[]	Array of mood tags
`word_count`	number	Word count of the text

MongoDB (Chat History)

Two collections for storing chat sessions and messages.

`sessions` Collection

Field	Type	Description
`_id`	ObjectId	MongoDB auto-generated
`session_id`	string	UUID, unique index
`user_id`	string \| null	null for anonymous, user UUID when auth added
`title`	string \| null	Optional session title (auto-generated from first message)
`created_at`	Date	Session creation timestamp
`updated_at`	Date	Last message timestamp
`metadata`	object	Extensible metadata (e.g., client_info)

Indexes: session_id (unique), user_id (sparse), updated_at (descending)

`messages` Collection

Field	Type	Description
`_id`	ObjectId	MongoDB auto-generated
`message_id`	string	UUID, unique index
`session_id`	string	Foreign key to sessions
`role`	"user" \| "assistant"	Message sender
`content`	string	Message text
`created_at`	Date	Message timestamp
`rag_context`	array \| null	Retrieved documents used for response (null for user messages)

Each rag_context item contains: entry_id, entry_date, text_snippet (first 200 chars), score (similarity score)

Indexes: message_id (unique), session_id + created_at (compound)

Design Decisions & Tradeoffs

Naming Conventions (snake_case vs camelCase)

Decision: Different conventions for different layers.

Layer	Convention	Example
WebSocket messages (client ↔ gateway)	snake_case	`session_id`
Internal TypeScript / Lambda payloads	camelCase	`sessionId`
MongoDB documents	snake_case	`session_id`

Why:

Database fields use snake_case (MongoDB convention, matches JSON APIs)
WebSocket messages use snake_case to align with database schema and external API conventions
Internal TypeScript code uses camelCase (JavaScript/TypeScript convention)
Gateway converts between conventions at the boundary

Tradeoff:

Requires conversion at gateway layer
Could have used camelCase everywhere, but snake_case for wire formats is a common REST/JSON API convention and matches MongoDB's typical style

Single Table vs Multiple Tables

Decision: Single table with all entries.

Why:

Simpler queries and maintenance
Scales well as knowledge base grows
No need for cross-table joins
entry_id and chunk_index handle multi-chunk entries without separate tables

LanceDB Embedded vs Server Mode

Decision: Embedded mode with S3 storage.

Why:

LanceDB is designed like SQLite - embedded, not a server
Multiple services connect directly to the same S3 bucket
Mirrors AWS serverless pattern (Lambda + S3)
No separate database server to manage
LanceDB handles concurrency internally

MinIO for Local Development

Decision: Use MinIO to simulate AWS S3 locally.

Why:

Same S3 connection logic works in dev and production
Just change the endpoint URL for AWS deployment
Web console (port 9001) for debugging/browsing data

TypeScript over Rust for Lambda Services

Decision: TypeScript with Node.js 24 Lambda runtime.

Why:

Faster development iteration
LanceDB has a mature TypeScript SDK
Better alignment with team skills
Rust would provide better performance but slower development

Shared Module via Docker COPY (not npm)

Decision: Copy shared code in Docker build, use tsconfig paths.

Why:

No npm link or package.json dependencies
Works with multi-stage Docker builds
TypeScript path aliases (@shared/*) provide clean imports
Simpler than publishing to private npm registry

No Barrel Files

Decision: Direct imports instead of index.ts re-exports.

Why:

Follows project conventions (CLAUDE.md)
Avoids circular dependency issues
Clearer import paths show exact source

Topics as Array

Decision: Store topics as string[] instead of single value.

Why:

One document can describe multiple events/feelings
More flexible for filtering and analysis
Supports "happy and anxious" type entries

Connection Caching for Lambda

Decision: Module-level connection state with lazy initialization.

Why:

Lambda cold starts are expensive
Connection persists across invocations
initConnection() on module load, awaited in handler
resetConnection() for testing

Separate Sessions and Messages Collections

Decision: Two MongoDB collections instead of embedding messages in sessions.

Why:

Allows efficient session listing without loading all messages
Messages can grow large; separate collection avoids document size limits
Easier to query and paginate messages independently
Better index performance for message retrieval

UUID-Based IDs for Chat

Decision: Use session_id and message_id (UUIDs) alongside MongoDB's _id.

Why:

Matches existing pattern in LanceDB entries
Allows client-generated IDs for optimistic updates
Portable identifiers that don't depend on MongoDB ObjectId format
Easier to reference across services

RAG Context on Messages

Decision: Store retrieved documents in rag_context field on assistant messages.

Why:

Records which documents were used to generate each response
Useful for debugging retrieval quality
Enables "show sources" UI feature
Stores snippet + score, not full text (keeps documents small)

Separate Chat Types File (chat-types.ts)

Decision: Split chat-related types into a separate file from types.ts.

Why:

types.ts imports aws-lambda types which aren't available in browser environments
The web app needs to import chat types (ChatMessage, RagContext, etc.) without pulling in Node.js dependencies
Backend services import from @shared/types (with aws-lambda) or @shared/chat-types as needed
Web app imports only from @shared/chat-types

Tradeoff:

Two type files instead of one
Could have used conditional exports or a build step to tree-shake, but separate files are simpler and more explicit

Direct Imports Over Re-exports

Decision: Services import directly from source files rather than through barrel files or re-export layers.

Why:

Initially considered re-exporting chat types from types.ts for convenience
Removed because it adds unnecessary indirection
Direct imports (@shared/chat-types) are clearer about where types come from
Follows project convention of no barrel files (no index.ts re-exports)

UI-Specific Types Stay in Components

Decision: Keep ConnectionStatus type in web/components/Chat/types.ts, not in shared.

Why:

ConnectionStatus ('connecting' | 'connected' | 'disconnected' | 'error') is purely a UI concern
Backend services don't need to know about WebSocket connection states
Only types used across multiple services belong in shared
Keeps shared module focused on domain types (messages, sessions, RAG context)

Docker COPY for Shared Module in Web App

Decision: Copy shared module into Docker container at /shared/, use tsconfig paths for resolution.

Why:

Web app's Dockerfile builds from project root context to access ../shared
tsconfig.json maps @shared/* to ../shared/src/*
Relative path ../shared from /app resolves to /shared in container
Volume mount ./shared:/shared in docker-compose for dev hot-reload
No webpack aliases needed - tsconfig paths handle both dev and build

Tradeoff:

Initially tried webpack alias in next.config.ts, but unnecessary complexity
Build context must be project root (.) not just ./web

Sparse Index for user_id

Decision: Use sparse index on user_id field (null for anonymous users).

Why:

Efficient queries when user auth is added later
Sparse index excludes null values, saving space
Anonymous sessions still work without placeholder values
Future-proofs schema for multi-user support

MongoDB Schema Validation

Decision: Use JSON Schema validation on collections.

Why:

Catches malformed documents at insert time
Documents required fields and types
Acts as lightweight contract between services
Validation errors are explicit, not silent data corruption

JavaScript Init Script over Shell

Decision: Use init-mongo.js instead of init-mongo.sh for MongoDB initialization.

Why:

MongoDB Docker entrypoint runs .js files directly with mongosh
No shell heredoc syntax or $ escaping needed
Cleaner, more readable initialization code
Runs automatically on first startup when data directory is empty

Dependency Injection for Testing

Decision: Functions accept optional dependency parameters.

Why:

initConnection(connectDep = connect) allows mock injection
getTable(createIfMissing, connectDep) for testability
No need for complex mocking libraries

WebSocket Gateway vs Direct Lambda Calls

Decision: Web app connects via WebSocket to a gateway, which invokes Lambda functions.

Why:

Lambda functions cannot hold persistent WebSocket connections
AWS API Gateway WebSocket APIs work by: (1) holding the WebSocket connection, (2) invoking Lambda per message via the internal Lambda Invoke API, (3) Lambda pushes responses via the API Gateway Management API
Our gateway mirrors this pattern locally
Client communicates entirely over WebSocket - cleaner than REST for a chat application

Tradeoff:

Additional service to maintain (gateway)
Could have used direct REST calls from web app to Lambda RIE, but that doesn't match production architecture

AWS SDK for Lambda Invocation

Decision: Gateway uses @aws-sdk/client-lambda to invoke Lambda functions, not raw HTTP.

Why:

In production AWS, API Gateway uses the Lambda Invoke API (internal AWS mechanism), not HTTP
The Lambda RIE's HTTP endpoint (/2015-03-31/functions/function/invocations) is just an emulation for local testing
Using AWS SDK keeps gateway code identical between local and production - only the endpoint URL changes
Same code works locally (pointing at RIE) and in production (pointing at real Lambda)

Local:

new LambdaClient({ endpoint: "http://ingestion:8080" })

Production:

new LambdaClient({ region: "us-east-1" }) // Uses real Lambda

LLM Service: SSE over HTTP (not WebSocket)

Decision: LLM service exposes HTTP/SSE endpoints, not WebSocket. Gateway proxies SSE to WebSocket.

Architecture:

┌──────────┐     WS      ┌─────────┐     SSE      ┌─────────────┐
│ Frontend │◄───────────►│ Gateway │◄────────────►│ LLM Service │
└──────────┘             └─────────┘              └─────────────┘

Why not give LLM direct WebSocket access?

Separation of concerns - The LLM service should be a pure inference engine: messages in, tokens out. It shouldn't know about connection management, WebSocket protocols, or client sessions.
Testability - HTTP/SSE is easy to test with curl, httpx, pytest. WebSocket testing requires more setup and stateful connections.
Flexibility - The LLM can be called from:
- The gateway (for chat)
- CLI tools for debugging
- Batch jobs for bulk processing
- Other backend services
- HTTP-only clients
If it were WebSocket-coupled, all those would become harder.
GPU resources are precious - The LLM service is GPU-bound. Don't burden it with connection management. Let it focus on inference.
Proxy overhead is negligible - For localhost, proxying SSE→WebSocket adds microseconds. LLM inference takes seconds. The overhead is unnoticeable.

How SSE→WebSocket proxy works:

Client sends { action: "chat", content: "...", stream: true } over WebSocket
Gateway fetches RAG context from Chat service
Gateway calls LLM service with stream: true, receives SSE response

Gateway consumes SSE events and forwards each token to WebSocket:

LLM SSE:  data: {"choices":[{"delta":{"content":"Hello"}}]}
                                 ↓
Gateway:  Parse JSON, extract token
                                 ↓
WebSocket: { action: "chat_stream_token", token: "Hello" }

On stream end, Gateway sends { action: "chat_stream_end", content: "..." }

Stream events sent to client:

chat_stream_start - Stream beginning, includes RAG context
chat_stream_token - Individual token
chat_stream_end - Stream complete, includes full content
chat_stream_error - Error occurred, includes partial content

Tradeoff:

Extra hop adds complexity to gateway
Could have simplified by giving LLM direct WebSocket access
But the architectural benefits (testability, flexibility, separation of concerns) outweigh the complexity cost

Lambda RIE vs Custom HTTP Wrapper

Decision: Use AWS Lambda base image with Runtime Interface Emulator (RIE), not custom HTTP wrappers.

Why:

Initially considered creating main.ts files that wrap Lambda handlers in Express/HTTP servers for local development
This approach diverges from production behavior and adds unnecessary code
Lambda base images include RIE which provides the HTTP layer automatically
Lambda services have a single handler function, no HTTP routes - the RIE handles HTTP-to-event translation
Keeps Lambda code focused on business logic, not HTTP concerns

Tradeoff:

No hot reloading for Lambda services (must rebuild container on code changes)
Web app and gateway have hot reloading via volume mounts; Lambda services do not

Action-Based Routing in Lambda

Decision: Single Lambda handler with action field for routing, not multiple HTTP endpoints.

Why:

Lambda functions receive events, not HTTP requests
In production, API Gateway maps routes to Lambda invocations with event payloads
Single handler with switch statement on action field mirrors this pattern
Keeps Lambda code portable - same handler works with API Gateway, direct invocation, or RIE

Example:

export const handler = async (event: IngestEvent) => {
  switch (event.action) {
    case "ingest": return await ingest(event.body);
    case "health": return { statusCode: 200, body: "ok" };
  }
};

Hot Reloading Strategy

Decision: Different hot reloading approaches for different service types.

Service	Hot Reload	Method
Web (Next.js)	Yes	Volume mounts + `next dev`
Gateway (Node.js)	Yes	Volume mounts + `tsx watch`
Lambda services	No	Rebuild container on changes

Why:

Web and gateway are stateless Node.js servers - easy to reload
Lambda services use AWS base images with RIE - designed for container-based deployment, not file watching
Could add hot reloading to Lambda with custom setup, but diverges from production pattern
For rapid Lambda iteration, run tests locally with npm test instead of full container rebuild

Testing

Jest with ESM support. Run tests:

# Shared module
cd shared && npm test

# Ingestion service
cd ingestion && npm test

# Chat service
cd chat && npm test

# Gateway service
cd gateway && npm test

# Web app
cd web && npm test

Test patterns used:

jest.unstable_mockModule() for ESM module mocking
Import jest from @jest/globals (required for ESM)
Dynamic imports after mocking: const { fn } = await import("./module")
jest.fn<any>() for typed mocks
Tests live in tests/ subdirectories alongside source code

Development

Prerequisites

Docker and Docker Compose
Node.js 20+

Setup

# Install dependencies
cd shared && npm install
cd ../ingestion && npm install
cd ../chat && npm install

# Start services
docker-compose up

Data Persistence

Data is persisted to local directories via Docker bind mounts:

Directory	Contents
`./minio-data/`	MinIO/S3 data (LanceDB vector database)
`./db/data/`	MongoDB data (chat history)

These directories are created automatically and excluded from git. Data survives docker compose down and container rebuilds.

MongoDB Initialization: The db/init-mongo.js script runs automatically on first startup (when ./db/data/ is empty), creating the sessions and messages collections with schema validation and indexes.

To reset all data:

rm -rf minio-data/ db/data/
docker compose down -v
docker compose up -d

Environment Variables

Variable	Default	Description
`LANCEDB_URI`	`s3://lancedb/documents`	LanceDB storage location
`S3_ENDPOINT`	`http://localhost:9000`	S3/MinIO endpoint
`AWS_ACCESS_KEY_ID`	`minioadmin`	S3 access key
`AWS_SECRET_ACCESS_KEY`	`minioadmin`	S3 secret key
`AWS_REGION`	`us-east-1`	AWS region
`EMBEDDING_SERVICE_URL`	`http://localhost:8001`	Embedding service URL
`MONGO_URI`	`mongodb://root:example@mongo:27017`	MongoDB connection string
`MONGO_DB_NAME`	`example_rag`	MongoDB database name
`INGESTION_FUNCTION_NAME`	`function`	Lambda function name for ingestion (use actual name in prod)
`CHAT_FUNCTION_NAME`	`function`	Lambda function name for chat (use actual name in prod)
`INGESTION_LAMBDA_ENDPOINT`	`http://localhost:8002`	Ingestion Lambda endpoint (omit in prod for real AWS)
`CHAT_LAMBDA_ENDPOINT`	`http://localhost:8003`	Chat Lambda endpoint (omit in prod for real AWS)
`LLM_SERVICE_URL`	`http://localhost:8004`	LLM service URL for chat completions
`NEXT_PUBLIC_WS_URL`	`ws://localhost:8080/ws`	WebSocket gateway URL (web app)

API

Ingestion Service

Ingest Entry:

{
  "action": "ingest",
  "entry_date": "2024-01-15",
  "text": "Today was a good day...",
  "topics": ["happy", "calm"],
  "entry_id": "optional-id",
  "chunk_index": 0
}

Health Check:

{
  "action": "health"
}

Chat Service

Search:

{
  "action": "query",
  "query": "How was my day?",
  "limit": 5
}

RAG Context:

{
  "action": "rag",
  "body": {
    "message": "What did I do yesterday?",
    "sessionId": "session-uuid"
  }
}

Save Message:

{
  "action": "save_message",
  "body": {
    "sessionId": "session-uuid",
    "content": "Assistant response...",
    "ragContext": [...]
  }
}

Health Check:

{
  "action": "health"
}

RAG Configuration

How Vector Search Works

When you store text in a RAG system, you don't store the raw words. An embedding model (like sentence-transformers) converts text into a list of numbers called a vector.

"the price of gold is $5,220.50" → [0.12, -0.45, 0.78, ..., 0.33]  (384 numbers)
"what's going on with gold?"     → [0.15, -0.41, 0.72, ..., 0.29]  (384 numbers)

These numbers encode the semantic meaning of the text—not the exact words, but the concepts. Texts with similar meanings produce similar vectors.

Cosine Distance

To find relevant results, we measure how "close" two vectors are. Cosine similarity measures the angle between two vectors:

similarity = 1.0  → identical meaning (vectors point same direction)
similarity = 0.0  → unrelated (vectors perpendicular)
similarity = -1.0 → opposite meaning

Cosine distance is 1 - similarity:

distance = 0.0 → identical
distance = 1.0 → unrelated
distance = 2.0 → opposite

Visual intuition (imagine vectors as arrows in space):

                    "price of gold today"
                   ↗
                 /
               /  ← small angle = low distance = similar
             /
"gold market" →

                              "my breakfast" →  ← large angle = high distance = unrelated

Why Similar Phrases Can Have High Distance

Even when two texts are about the same topic, the embedding model encodes more than just keywords:

Text	Concepts encoded
"what's going on with gold in the market?"	question, market trends, general inquiry
"the price of gold is $5,220.50 at 4:50pm"	statement, specific price, specific time

The vectors are related but not close. Distance might be 0.9-1.0 even though both mention gold.

Similarity Threshold

The RAG search uses a similarity score threshold to filter out irrelevant results. This prevents the system from returning unrelated documents when the user's query doesn't match any content.

Configuration (shared/src/db/operations.ts called by chat/src/services/chat.service.ts):

searchSimilar(table, queryVector, limit, maxDistance = 1.2)

Parameter	Default	Description
`maxDistance`	1.2	Maximum cosine distance allowed (lower = more similar)

How it works:

LanceDB returns results with a _distance field (cosine distance)
Distance ranges from 0 (identical) to 2 (opposite vectors)
Results with distance > maxDistance are filtered out
If no results pass the threshold, RAG context is empty

Tuning guidance:

Threshold	Behavior
0.5	Strict - only highly relevant results
0.8	Moderate - may miss semantically related but differently phrased content
1.2	Balanced (default) - catches related content with different phrasing
1.5+	Very permissive - rarely filters anything

Example:

User asks "hey" → no documents about greetings → high distance scores → filtered out
User asks "how was my trip to Paris?" → document about Paris trip → low distance → included

Why this matters: Without a threshold, vector search always returns the top N results regardless of relevance. A query like "hey" would return whatever entries happen to be least dissimilar, even if they're completely unrelated (e.g., an entry about gold). The threshold ensures only genuinely relevant context is passed to the LLM.

System Prompt

The system prompt is the instruction set that tells the LLM who it is, how to behave, and what context it has available. It's the primary mechanism for customizing LLM behavior without retraining the model.

Location: chat/src/services/chat.service.ts → buildSystemPrompt()

Structure:

function buildSystemPrompt(ragContext: RagContext[]): string {
  if (ragContext.length === 0) {
    return `You are a helpful assistant for a personal document knowledge base.
The user is asking a question, but no relevant documents were found.
Respond helpfully and suggest they might want to add more documents or rephrase their question.`;
  }

  const contextEntries = ragContext
    .map((ctx) => `[${ctx.entry_date}] ${ctx.text_snippet}`)
    .join("\n\n");

  return `You are a helpful assistant for a personal document knowledge base.
Use the following documents to answer the user's question.
Be conversational and reference specific details from the entries when relevant.
If the entries don't contain enough information to answer, say so honestly.

Relevant documents:
${contextEntries}`;
}

How it works:

The system prompt is sent to the LLM as the first message in the conversation, before the user's message:

Messages sent to LLM:
┌─────────────────────────────────────────────────────────────┐
│ role: "system"                                              │
│ content: "You are a helpful assistant for a personal        │
│          document knowledge base. Use the following journal     │
│          entries to answer the user's question...           │
│                                                             │
│          Relevant documents:                          │
│          [2026-01-27] the price of gold is $5,220.50..."    │
├─────────────────────────────────────────────────────────────┤
│ role: "user"                                                │
│ content: "what's the current price of gold?"                │
└─────────────────────────────────────────────────────────────┘

Why it matters:

Aspect	Effect
Identity	"You are a helpful assistant for a personal document knowledge base" tells the LLM its role and domain
Behavior	"Be conversational and reference specific details" shapes response style
Boundaries	"If the entries don't contain enough information, say so honestly" prevents hallucination
Context injection	RAG results are embedded directly in the prompt, giving the LLM access to user's data

Without a system prompt, the LLM would be a generic assistant with no knowledge of:

Its purpose (documents)
The user's data (documents)
How to respond (conversational, honest about limitations)

Customization examples:

Use Case	System Prompt Modification
More formal tone	"Respond in a professional, formal tone"
Therapy-style	"You are a supportive listener. Ask reflective questions about the user's feelings"
Data analysis	"Analyze patterns across documents. Look for trends in mood, topics, and frequency"
Strict factual	"Only answer questions that can be directly answered from the documents. Never speculate"

The RAG + System Prompt pattern:

This is the core of how RAG applications work:

1. User asks a question
2. System searches vector database for relevant content
3. Relevant content is injected into the system prompt
4. LLM receives: system prompt (with context) + user message
5. LLM generates response grounded in the provided context

The LLM doesn't have direct database access—it only sees what's included in the prompt. This is both a limitation (context window size) and a feature (you control exactly what the LLM knows).

Architectural Patterns

CQRS: Command Query Responsibility Segregation

Core idea: Separate the code that reads data from the code that writes data.

Traditional approach (current implementation):

┌─────────────────────────────────────┐
│           Chat Service             │
│                                     │
│  • searchSimilar() ← READ           │
│  • getHistory()    ← READ           │
│  • saveMessage()   ← WRITE          │
│  • createSession() ← WRITE          │
└─────────────────────────────────────┘

One service does everything. Simple, but responsibilities are mixed.

CQRS approach:

┌─────────────────────────────────────┐      ┌─────────────────────────────────┐
│           Chat Service             │      │         Command Service         │
│           (READ side)               │      │          (WRITE side)           │
│                                     │      │                                 │
│  • searchSimilar()                  │      │  • saveMessage()                │
│  • getHistory()                     │      │  • createSession()              │
│  • getSession()                     │      │  • updateSession()              │
└─────────────────────────────────────┘      └─────────────────────────────────┘

Terminology:

Term	Meaning	Example
Query	Request that returns data, doesn't change state	"Get chat history"
Command	Request that changes state, may not return data	"Save this message"

Why separate them?

Different optimization needs:

Reads (Queries)	Writes (Commands)
Need to be FAST	Need to be RELIABLE
Can use caching	Need validation
Can use read replicas	Need consistency
Can be eventually consistent	Often need transactions

Different scaling patterns:

Typical app: 90% reads, 10% writes

Without CQRS:
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Service │ │ Service │ │ Service │   ← Scale everything together
└─────────┘ └─────────┘ └─────────┘

With CQRS:
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│ Query │ │ Query │ │ Query │ │ Query │ │ Query │   ← Scale reads heavily
└───────┘ └───────┘ └───────┘ └───────┘ └───────┘
              ┌─────────┐
              │ Command │   ← Fewer write instances needed
              └─────────┘

Different data models:

Write model (normalized):           Read model (denormalized):
┌──────────┐  ┌──────────┐         ┌─────────────────────────────┐
│ sessions │  │ messages │         │     chat_history_view       │
│──────────│  │──────────│         │─────────────────────────────│
│ id       │  │ id       │   →     │ session_id                  │
│ title    │  │ session_id│        │ session_title               │
│ created  │  │ content   │        │ messages[] (embedded)       │
└──────────┘  │ role      │        │ last_message_preview        │
              └──────────┘         └─────────────────────────────┘

Optimized for integrity           Optimized for fast reads

Applied to this app:

Current flow (mixed read/write):

Gateway → Chat Service (rag)         ← READ (search)
                                      ← WRITE (save user message) ❌ mixed
Gateway → LLM (stream)
Gateway → Chat Service (save_message) ← WRITE

CQRS flow (separated):

Gateway → Command Service (save user message)  ← WRITE
Gateway → Chat Service (rag)                  ← READ only
Gateway → LLM (stream)
Gateway → Command Service (save assistant)     ← WRITE

When CQRS is overkill:

Small apps with low traffic
Simple CRUD with no complex queries
Team is small and doesn't need separation
Read/write patterns are similar

When CQRS shines:

High-scale systems (millions of reads)
Complex domains with different read/write needs
Event-sourced systems
Microservices where teams own different concerns

Related patterns:

Pattern	Description
CQRS	Separate read/write code paths
Event Sourcing	Store events, not state. Rebuild state from events.
CQRS + Event Sourcing	Commands emit events, queries read from projections

Current decision: This app uses a mixed approach (Chat Service handles both reads and writes) because the scale doesn't justify the added complexity. CQRS would be considered if scaling requirements change.

Known Limitations

File Upload Types

Current Limitation: Only .txt and .md files are supported for upload.

Why: File uploads are processed client-side in the browser using the JavaScript File.text() API, which only works with plain text files. The extracted text is then sent to the ingestion service as JSON. No files are stored separately—only the text content is embedded and stored in LanceDB.

To add support for .pdf and .docx files:

Format	Client-side Option	Server-side Option
`.pdf`	`pdfjs-dist` (Mozilla's PDF.js)	`pdf-parse` (Node.js), `PyMuPDF` (Python)
`.docx`	`mammoth.js`	`mammoth` (Node.js), `python-docx` (Python)

Implementation approaches:

Client-side parsing (current architecture):
- Add parsing libraries to the web app
- Extract text in the browser before sending to ingestion
- Pros: No changes to backend, files never leave the browser
- Cons: Larger frontend bundle, limited by browser capabilities
Server-side parsing:
- Send raw file bytes to ingestion service (would require multipart/form-data)
- Parse files in the ingestion Lambda or a dedicated parsing service
- Pros: More powerful parsing, can handle complex documents
- Cons: Requires backend changes, files must be transmitted/stored temporarily

Files to modify:

web/components/DocumentUpload/DropZone.tsx - Update accept attribute to include new file types
web/components/DocumentUpload/api.ts - Add parsing logic in submitFileContent() or create new parsing functions
If server-side: ingestion/src/services/ingestion.service.ts - Add file parsing before embedding

Development Log

Session: Chat Action Implementation (2026-01-27)

This session implemented the Gateway and Chat service chat action, connecting the existing web app chat UI to the backend.

Changes Made

1. Gateway Chat Action (gateway/src/handlers/messageHandler.ts)

Added ChatMessage interface with content and session_id fields
Added "chat" case to the message handler switch statement
Created handleChat() function that forwards messages to the chat service

2. Chat Service Chat Handler (chat/src/services/chat.service.ts)

New file implementing chat logic with RAG search
Performs vector similarity search on user message
Returns RAG context with placeholder response (LLM integration pending)
Added ChatBody type to shared types

3. Lambda Client Fix (gateway/src/services/lambdaClient.ts)

Changed functionName from service names ("ingestion", "chat") to "function"
Problem: AWS Lambda RIE expects /2015-03-31/functions/function/invocations path, but the SDK was generating /2015-03-31/functions/chat/invocations based on the function name
Symptom: 404 errors with "Unexpected non-whitespace character after JSON" (HTML error page being parsed as JSON)

4. WebSocket URL Configuration (docker-compose.yml, web/.env.local)

Changed NEXT_PUBLIC_WS_URL from ws://localhost:8080/ws to ws://192.168.2.17:8080/ws
Problem: When accessing the web app from a different machine, localhost resolves to the client machine, not the Docker host
Context: Environment variables in Next.js are baked in at build time for client-side code (NEXT_PUBLIC_* prefix), so changes require container rebuild
Tradeoff: Hardcoding IP works for local network access but isn't portable. Production would use a proper hostname.

5. Field Name Alignment (gateway/src/handlers/messageHandler.ts)

Changed gateway's ChatMessage interface from { message, sessionId } to { content, session_id }
Problem: Web client sends content and session_id (matching @shared/chat-types), but gateway expected different field names
Lesson: When adding new message types, check existing type definitions in shared module first

6. React Strict Mode WebSocket Fix (web/components/Chat/hooks/useWebSocket.ts)

Added isDisconnectingRef to track intentional disconnects
Skip error/close handlers when disconnecting intentionally
Problem: React Strict Mode in development mounts, unmounts, then remounts components. The first WebSocket connection would be closed during cleanup, triggering error state before the second connection established.
Symptom: "WebSocket connection failed" error flash on page load, even though connection ultimately succeeds
Tradeoff: Added complexity to the hook, but only ~5 lines. Alternative was disabling Strict Mode, which would hide other potential issues.

7. Chat Input UX Improvements (web/components/Chat/Chat.tsx, web/components/Chat/MessageInput.tsx)

Removed isLoading from disabled condition - input stays enabled while waiting for response
Added inputRef to maintain focus after sending message
Problem: Input was disabled during loading, preventing users from typing their next message. Focus was lost after submit.
Tradeoff: Users can now queue messages while waiting, which the backend doesn't currently handle (messages process sequentially). Acceptable for MVP.

Architecture Decisions

Why route chat through Chat service (not a new Chat service)?

Chat needs RAG search, which Query already has
Keeps service count minimal for MVP
Can extract to dedicated service later if complexity grows

Why placeholder LLM response?

LLM service (in llm/ directory) isn't implemented yet
Chat flow works end-to-end with mock response
Unblocks frontend development and integration testing

Files Modified

gateway/src/handlers/messageHandler.ts  # Chat action routing
gateway/src/services/lambdaClient.ts    # Lambda function name fix
chat/src/handler.ts                     # Chat case in switch
chat/src/services/chat.service.ts       # New file - chat logic
shared/src/types.ts                     # ChatBody type
docker-compose.yml                      # WebSocket URL
web/.env.local                          # WebSocket URL (local override)
web/components/Chat/hooks/useWebSocket.ts  # Strict Mode fix
web/components/Chat/Chat.tsx            # Input disabled state
web/components/Chat/MessageInput.tsx    # Focus retention

Session: MongoDB Integration & Chat Persistence (2026-01-27)

This session implemented MongoDB integration for chat history persistence and fixed a WebSocket connection error on page load.

Changes Made

1. MongoDB Connection Module (shared/src/db/mongo-connection.ts)

New file implementing MongoDB connection management
Singleton pattern matching existing LanceDB connection approach
Functions: initMongoConnection(), getMongoDb(), closeMongoConnection(), resetMongoConnection()
Dependency injection support for testing

2. MongoDB Operations Module (shared/src/db/mongo-operations.ts)

New file implementing session/message CRUD operations
Session operations: createSession(), getSession(), updateSessionTitle(), updateSessionTimestamp(), listSessions(), deleteSession(), getOrCreateSession()
Message operations: createMessage(), getMessage(), getSessionMessages(), deleteMessage()
All operations take Db as first parameter for testability

3. Config Updates (shared/src/config.ts)

Added mongoUri and mongoDbName to AppConfig interface
Default values: mongodb://root:example@mongo:27017 and example_rag
Matches docker-compose MongoDB configuration

4. Chat Service MongoDB Integration (chat/src/services/chat.service.ts)

Integrated MongoDB operations into chat handler
User messages saved before RAG search
Assistant messages saved after response generation (with RAG context)
Session created automatically if not exists

5. Connection Error Fix (web/components/Chat/hooks/useChat.ts)

Added onOpen handler that clears error state when WebSocket connects
Problem: "Connection error" banner appeared briefly on page load due to transient WebSocket errors during initial connection or React Strict Mode's double-mount
Symptom: Error message flashed then disappeared as connection established
Fix: Clear any existing error when onopen fires

Trade-offs

Singleton MongoDB Connection

Decision: Module-level connection state, same as LanceDB
Pro: Simple, connection reused across Lambda invocations (warm starts)
Con: Connection shared across all requests; no per-request isolation
Why acceptable: Matches existing pattern, MongoDB driver handles connection pooling internally

Synchronous Message Saving

Decision: Save messages inline during chat request (not async/fire-and-forget)
Pro: Data consistency guaranteed, errors surface immediately
Con: Adds latency to chat response (~5-10ms per insert)
Why acceptable: Latency is negligible compared to embedding + RAG search + future LLM call

Clear All Errors on Connect

Decision: onOpen clears any existing error, not just connection errors
Pro: Simple, handles all transient error cases
Con: Could hide a real error that occurred right before successful connection
Why acceptable: If there's a persistent issue, error will resurface on next operation; better UX than showing scary errors on initial load

UUIDs for Session/Message IDs

Decision: Use crypto.randomUUID() for IDs, not MongoDB ObjectId
Pro: Client can generate IDs for optimistic updates, portable across databases
Con: Slightly larger than ObjectId (36 chars vs 24)
Why acceptable: Matches existing LanceDB ID pattern, enables future client-side ID generation

User Message ID Not Returned

Current: Chat response only includes assistant message ID, not user message ID
Why: User message is saved to MongoDB for history, but return value isn't used
Future option: Could return user_message_id in response for:
- Client confirmation that message was saved
- Message linking (in_reply_to field)
- Server timestamp sync
- Request tracing/debugging
To implement: Capture return value from createMessage() and add to response

Files Modified

shared/src/config.ts                    # Added MongoDB config
shared/src/db/mongo-connection.ts       # New - connection management
shared/src/db/mongo-operations.ts       # New - CRUD operations
shared/src/db/mongo-operations.test.ts  # New - unit tests
shared/package.json                     # Added mongodb dependency
chat/src/services/chat.service.ts       # Integrated MongoDB saves
web/components/Chat/hooks/useChat.ts    # Fixed connection error

Testing

MongoDB operations verified with:

Unit tests (npm test in shared directory)
Integration test against live MongoDB container
End-to-end test via chat UI - messages now persist in sessions and messages collections

To verify data persistence:

docker exec example-rag-mongodb mongosh --quiet -u root -p example \
  --authenticationDatabase admin example_rag \
  --eval "db.sessions.find().toArray(); db.messages.find().toArray();"

Session: Code Quality & Production Readiness (2026-01-27)

This session addressed production compatibility, code consistency, and test maintenance.

Changes Made

1. Lambda Function Names Configurable (gateway/src/services/lambdaClient.ts)

Added INGESTION_FUNCTION_NAME and QUERY_FUNCTION_NAME environment variables
Default to "function" for local Lambda RIE compatibility
Problem: Hardcoded "function" name works with RIE but not production AWS Lambda
Fix: Environment variables allow setting real function names in production

2. Destructured CSS Module Styles (all web components)

Refactored from styles.className to destructured const { className } = styles
Improves readability and reduces repetition
Dynamic style access (e.g., styles[status]) still uses the styles object

3. Updated Outdated Test (web/components/Chat/tests/Chat.test.tsx)

Test expected input to be disabled during loading (old behavior)
Updated to expect input enabled during loading (current intentional behavior)
Context: Previous session intentionally kept input enabled to allow message queuing

Files Modified

gateway/src/services/lambdaClient.ts           # Configurable function names
web/components/Chat/Chat.tsx                   # Destructured styles
web/components/Chat/MessageInput.tsx           # Destructured styles
web/components/Chat/MessageList.tsx            # Destructured styles
web/components/Chat/ConnectionStatus.tsx       # Destructured styles
web/app/chat/page.tsx                          # Destructured styles
web/components/DocumentUpload/DocumentUpload.tsx    # Destructured styles
web/components/DocumentUpload/FileUploadForm.tsx    # Destructured styles
web/components/DocumentUpload/TextEntryForm.tsx     # Destructured styles
web/components/DocumentUpload/DateInput.tsx         # Destructured styles
web/components/DocumentUpload/FileList.tsx          # Destructured styles
web/components/DocumentUpload/DropZone.tsx          # Destructured styles
web/components/DocumentUpload/StatusMessage.tsx     # Destructured styles
web/components/DocumentUpload/MoodSelector.tsx      # Destructured styles
web/components/Home/Home.tsx                        # Destructured styles
web/components/ThemeToggle/ThemeToggle.tsx          # Destructured styles
web/components/Chat/tests/Chat.test.tsx             # Fixed outdated test

Session: LLM Streaming & Chat Service Enhancement (2026-01-27)

This session implemented true token streaming from the LLM service to the web UI, restructured the chat flow for proper separation of concerns, and added RAG relevance filtering.

Architecture Decision: Why Gateway Calls LLM (Not Chat Service)

Problem: The initial implementation had the Chat service calling the LLM and returning the complete response. This meant no streaming—the entire response appeared at once in the UI.

Why Chat Service Can't Stream:

Chat service runs as a Lambda function
Lambda functions return a single response—they cannot stream data incrementally
Even if the LLM streams tokens to the Chat service, Lambda must buffer the entire response before returning

Why Gateway Can Stream:

Gateway maintains persistent WebSocket connections with clients
Gateway can consume SSE (Server-Sent Events) from the LLM service
Gateway can forward each token to WebSocket clients as it arrives

New Architecture:

1. Client → Gateway: { action: "chat", content: "..." }
2. Gateway → Query (rag): Get RAG context + save user message
3. Gateway → LLM (SSE stream): Stream tokens
4. Gateway → Client: Forward each token via WebSocket
5. Gateway → Query (save_message): Save complete assistant message

Trade-offs:

Gateway now has more responsibility (LLM orchestration)
Chat service is simpler (just RAG + MongoDB operations)
Streaming works end-to-end
Messages are saved to MongoDB after streaming completes (not during)

Changes Made

1. Chat Service Restructured (chat/src/services/chat.service.ts, chat/src/handler.ts)

Removed chat action (did everything including LLM call)
Added rag action: RAG search + save user message, returns context + system prompt
Added save_message action: Save assistant message after streaming completes
Chat service no longer imports or calls LLM

2. Gateway Streaming Implementation (gateway/src/handlers/messageHandler.ts)

handleChat now orchestrates the full flow:
1. Calls Chat service rag action
2. Streams from LLM service using existing llmClient.ts
3. Forwards tokens to WebSocket as chat_stream_token events
4. Calls Chat service save_message to persist
5. Sends chat_stream_end with complete content

3. WebSocket Stream Events

Event	Purpose
`chat_stream_start`	Stream beginning, includes RAG context
`chat_stream_token`	Individual token from LLM
`chat_stream_end`	Stream complete, includes full content
`chat_stream_error`	Error with partial content if any

4. React Streaming Smoothness (web/components/Chat/hooks/useChat.ts)

Problem: Initial streaming implementation was "jumpy"—UI flickered with each token.

Root causes:

flushSync forced synchronous re-render for every token
Scroll-to-bottom triggered on every streamingContent change
React 18's automatic batching wasn't helping because flushSync bypassed it

Solution: requestAnimationFrame batching

// Buffer tokens in a ref
streamingBufferRef.current += token

// Schedule single RAF update (coalesces rapid tokens)
if (rafIdRef.current === null) {
  rafIdRef.current = requestAnimationFrame(() => {
    setStreamingContent(streamingBufferRef.current)
    rafIdRef.current = null
  })
}

Benefits:

Tokens batch naturally at 60fps
No forced synchronous renders
Smooth visual streaming effect

5. Scroll Behavior Fix (web/components/Chat/MessageList.tsx)

Only auto-scroll when user is near bottom (within 100px)
Don't scroll on every streaming token
Scroll once when isLoading changes (stream starts)

6. RAG Similarity Threshold (shared/src/db/operations.ts)

Problem: User says "hey" → system returns document about gold (irrelevant).

Root cause: Vector search always returns top N results, even if they're dissimilar.

Solution: Added maxDistance parameter (default 0.8) to filter results:

.filter((result) => result.score <= maxDistance)

How cosine distance works:

0 = identical vectors
2 = opposite vectors
0.8 threshold filters out results that are only vaguely similar

7. Logging Configuration (llm/src/server.py)

Added logging.basicConfig() to show model loading progress
Logs: "Loading model...", "Model loaded and ready"

Files Modified

chat/src/handler.ts                     # Added rag, save_message actions
chat/src/services/chat.service.ts       # Split into rag() and saveMessage()
shared/src/types.ts                     # Added RagBody, SaveMessageBody types
shared/src/db/operations.ts             # Added maxDistance threshold
gateway/src/handlers/messageHandler.ts  # Full streaming orchestration
gateway/src/services/llmClient.ts       # Debug logging for streaming
web/components/Chat/hooks/useChat.ts    # RAF-based smooth streaming
web/components/Chat/MessageList.tsx     # Smarter scroll behavior
web/components/Chat/Chat.module.scss    # Blinking cursor for streaming
llm/src/server.py                       # Logging configuration
docker-compose.yml                      # Chat service depends on llm, mongo

Debugging Notes

Checking if tokens are streaming:

docker compose logs -f gateway
# Look for: [LLM] Starting stream request...
#           [LLM] Stream done, total tokens: X
#           [chat] Streamed X tokens

If streaming looks instant (no typing effect):

Tokens may arrive faster than 60fps updates
This is actually working correctly—LLM is just fast
The RAF batching ensures smooth rendering regardless of token speed

Project Structure

example-rag/
├── docker-compose.yml
├── web/                      # Next.js frontend
│   ├── app/
│   │   ├── layout.tsx
│   │   ├── page.tsx
│   │   ├── chat/
│   │   │   └── page.tsx      # Chat interface
│   │   └── upload/
│   │       └── page.tsx
│   ├── components/
│   │   ├── Home/
│   │   ├── DocumentUpload/
│   │   ├── Chat/
│   │   │   ├── Chat.tsx
│   │   │   ├── MessageList.tsx
│   │   │   ├── MessageInput.tsx
│   │   │   ├── ConnectionStatus.tsx
│   │   │   ├── Chat.module.scss
│   │   │   ├── types.ts
│   │   │   ├── hooks/
│   │   │   │   ├── useWebSocket.ts
│   │   │   │   └── useChat.ts
│   │   │   └── tests/
│   │   │       ├── Chat.test.tsx
│   │   │       ├── MessageList.test.tsx
│   │   │       ├── MessageInput.test.tsx
│   │   │       ├── ConnectionStatus.test.tsx
│   │   │       ├── useWebSocket.test.ts
│   │   │       └── useChat.test.ts
│   │   └── ThemeToggle/
│   ├── contexts/
│   │   └── ThemeContext.tsx
│   ├── Dockerfile.dev
│   └── jest.config.js
├── gateway/                  # WebSocket gateway
│   ├── src/
│   │   ├── index.ts
│   │   ├── tests/
│   │   │   └── index.test.ts
│   │   ├── handlers/
│   │   │   ├── connectionHandler.ts
│   │   │   ├── messageHandler.ts
│   │   │   └── tests/
│   │   │       ├── connectionHandler.test.ts
│   │   │       └── messageHandler.test.ts
│   │   └── services/
│   │       ├── lambdaClient.ts
│   │       └── tests/
│   │           └── lambdaClient.test.ts
│   ├── Dockerfile.dev
│   ├── jest.config.js
│   └── tsconfig.json
├── shared/
│   ├── src/
│   │   ├── config.ts
│   │   ├── types.ts          # Backend types (aws-lambda dependency)
│   │   ├── chat-types.ts     # Chat types (browser-safe)
│   │   ├── db/
│   │   │   ├── connection.ts       # LanceDB connection
│   │   │   ├── operations.ts       # LanceDB operations
│   │   │   ├── mongo-connection.ts # MongoDB connection
│   │   │   └── mongo-operations.ts # MongoDB CRUD operations
│   │   └── services/
│   │       └── embedding.ts
│   └── jest.config.js
├── ingestion/
│   ├── src/
│   │   ├── handler.ts
│   │   └── services/
│   │       └── ingestion.service.ts
│   ├── Dockerfile
│   ├── Dockerfile.dev
│   └── jest.config.js
├── chat/
│   ├── src/
│   │   ├── handler.ts
│   │   └── services/
│   │       ├── query.service.ts
│   │       └── chat.service.ts
│   ├── Dockerfile
│   ├── Dockerfile.dev
│   └── jest.config.js
├── embedding/
│   └── (Python sentence-transformers service)
└── llm/
    ├── src/
    │   ├── __init__.py
    │   ├── __main__.py         # Entry point with dev auto-restart
    │   ├── config.py           # Environment configuration
    │   ├── dependencies.py     # FastAPI dependency injection
    │   ├── model.py            # LLMService class with streaming
    │   └── server.py           # FastAPI server with SSE streaming
    ├── tests/
    │   ├── conftest.py         # MockLLMService for testing
    │   ├── test_model.py
    │   └── test_server.py
    ├── models/
    │   └── qwen2.5-3b-instruct/  # Model files
    ├── Dockerfile
    └── pyproject.toml

FilesExpand file tree

NOTES.md

Latest commit

History

NOTES.md

File metadata and controls

Example RAG

Architecture

Gateway

Data Flow

Services

Shared Module

Web App Chat Feature

Database Schema

LanceDB (Vector Database)

MongoDB (Chat History)

sessions Collection

messages Collection

Design Decisions & Tradeoffs

Naming Conventions (snake_case vs camelCase)

Single Table vs Multiple Tables

LanceDB Embedded vs Server Mode

MinIO for Local Development

TypeScript over Rust for Lambda Services

Shared Module via Docker COPY (not npm)

No Barrel Files

Topics as Array

Connection Caching for Lambda

Separate Sessions and Messages Collections

UUID-Based IDs for Chat

RAG Context on Messages

Separate Chat Types File (chat-types.ts)

Direct Imports Over Re-exports

UI-Specific Types Stay in Components

Docker COPY for Shared Module in Web App

Sparse Index for user_id

MongoDB Schema Validation

JavaScript Init Script over Shell

Dependency Injection for Testing

WebSocket Gateway vs Direct Lambda Calls

AWS SDK for Lambda Invocation

LLM Service: SSE over HTTP (not WebSocket)

Lambda RIE vs Custom HTTP Wrapper

Action-Based Routing in Lambda

Hot Reloading Strategy

Testing

Development

Prerequisites

Setup

Data Persistence

Environment Variables

API

Ingestion Service

Chat Service

RAG Configuration

How Vector Search Works

Cosine Distance

Why Similar Phrases Can Have High Distance

Similarity Threshold

System Prompt

Architectural Patterns

CQRS: Command Query Responsibility Segregation

Known Limitations

File Upload Types

Development Log

Session: Chat Action Implementation (2026-01-27)

Changes Made

Architecture Decisions

Files Modified

Session: MongoDB Integration & Chat Persistence (2026-01-27)

Changes Made

Trade-offs

Files Modified

Testing

Session: Code Quality & Production Readiness (2026-01-27)

Changes Made

Files Modified

Session: LLM Streaming & Chat Service Enhancement (2026-01-27)

Architecture Decision: Why Gateway Calls LLM (Not Chat Service)

Changes Made

`sessions` Collection

`messages` Collection