A Retrieval-Augmented Generation (RAG) system for documents using LanceDB as the vector database. Designed to mirror AWS serverless architecture locally using Docker.
┌─────────────────┐
│ Web App │
│ (Next.js) │
└────────┬────────┘
│ WebSocket
▼
┌─────────────────┐ SSE (streaming) ┌─────────────────┐
│ Gateway │◄──────────────────────────────►│ LLM │
│ (Node.js) │ │ (Python) │
└────────┬────────┘ └─────────────────┘
│ AWS SDK (Lambda Invoke)
▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Ingestion │ │ Chat │ │ Embedding │
│ (Lambda) │ │ (Lambda) │ │ (Python) │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
│ │ │
▼ ▼ │
┌─────────────────────────────────────────┐ │
│ Shared Module │◄─────────────┘
│ (connection, operations, embedding) │
└────────────────────┬────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ LanceDB (embedded) │
│ stored in MinIO (S3) │
└─────────────────────────────────────────┘
The gateway handles WebSocket connections from the web app and invokes Lambda functions using the AWS SDK. This mirrors AWS API Gateway WebSocket APIs:
- Local: Gateway uses AWS SDK pointing at Lambda RIE endpoints
- Production: Gateway uses AWS SDK pointing at real Lambda functions
The web app communicates entirely over WebSocket - no direct HTTP calls to Lambda services.
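For illustration, a minimal sketch of a gateway-side invoke using `@aws-sdk/client-lambda` (the helper name and env-var handling here are simplified assumptions; the real client lives in `gateway/src/services/lambdaClient.ts`):

```typescript
import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";

// Local: point at the Lambda RIE container; in production, omit `endpoint`
// so the SDK targets real AWS Lambda.
const client = new LambdaClient({
  region: process.env.AWS_REGION ?? "us-east-1",
  endpoint: process.env.INGESTION_LAMBDA_ENDPOINT, // e.g. http://ingestion:8080 locally
});

// Hypothetical helper: send an event payload and parse the Lambda response.
export async function invokeIngestion(event: unknown) {
  const response = await client.send(
    new InvokeCommand({
      FunctionName: process.env.INGESTION_FUNCTION_NAME ?? "function",
      Payload: Buffer.from(JSON.stringify(event)),
    })
  );
  return JSON.parse(Buffer.from(response.Payload ?? new Uint8Array()).toString());
}
```

Only the `endpoint` option differs between local (RIE) and production use.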
Ingestion (adding documents):
Web App
│
│ WebSocket: { action: "ingest", text: "...", entry_date: "...", topics: [...] }
│
▼
Gateway
│
│ AWS SDK Lambda Invoke
│
▼
Ingestion Service (Lambda)
│
├──► Embedding Service (get vector for text)
│ │
│ ▼
│ Returns 384-dimensional vector
│
▼
LanceDB ──► S3/MinIO (internal storage)
│
▼
Gateway
│
│ WebSocket: { action: "ingest", statusCode: 200, data: { id, message } }
│
▼
Web App
Query (searching documents):
Web App
│
│ WebSocket: { action: "query", query: "...", limit: 5 }
│
▼
Gateway
│
│ AWS SDK Lambda Invoke
│
▼
Chat Service (Lambda)
│
├──► Embedding Service (get vector for query)
│ │
│ ▼
│ Returns 384-dimensional vector
│
▼
LanceDB (vector similarity search)
│
▼
Gateway
│
│ WebSocket: { action: "query", statusCode: 200, data: { results: [...] } }
│
▼
Web App
LanceDB uses S3 (MinIO) as its storage backend internally - the services make one call to LanceDB, which handles S3 storage under the hood.
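As a rough sketch of what that single call looks like from a service's perspective (the `storageOptions` keys are assumptions and may differ between LanceDB SDK versions; the real connection logic lives in `shared/src/db/connection.ts` and `shared/src/db/operations.ts`):

```typescript
import * as lancedb from "@lancedb/lancedb";

// Hypothetical helper: the service only talks to LanceDB; LanceDB reads and
// writes the underlying S3/MinIO objects itself.
export async function searchDocuments(queryVector: number[], limit = 5) {
  const db = await lancedb.connect("s3://lancedb/documents", {
    storageOptions: {
      endpoint: "http://minio:9000", // MinIO locally; omit for real S3
      allowHttp: "true",
    },
  });
  const table = await db.openTable("document_entries");
  return table.search(queryVector).limit(limit).toArray();
}
```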
Chat (streaming flow):
Web App
│
│ WebSocket: { action: "chat", content: "...", stream: true }
│
▼
Gateway
│
│ AWS SDK Lambda Invoke
│
▼
Chat Service (Lambda)
│
├──► Embedding Service (get vector for user message)
│ │
│ ▼
│ Returns 384-dimensional vector
│
├──► LanceDB (vector similarity search for RAG context)
│
▼
Gateway (receives RAG context)
│
│ HTTP POST with stream=true
│
▼
LLM Service
│
│ SSE: data: {"choices":[{"delta":{"content":"token"}}]}
│ ... (tokens stream)
│ SSE: data: [DONE]
│
▼
Gateway (proxies SSE to WebSocket)
│
│ WebSocket: { action: "chat_stream_start", rag_context: [...] }
│ WebSocket: { action: "chat_stream_token", token: "Hello" }
│ WebSocket: { action: "chat_stream_token", token: " there" }
│ ... (tokens stream)
│ WebSocket: { action: "chat_stream_end", content: "Hello there...", role: "assistant" }
│
▼
Web App (displays tokens as they arrive for "typing" effect)
Detailed Chat Flow:
1. Web App → Gateway: "what's the price of gold?"
2. Gateway → Chat Service (action: "rag"):
- Chat Service internally:
- Calls Embedding Service → gets vector [0.12, -0.45, ...]
- Searches LanceDB with that vector → finds matching entries
- Saves user message to MongoDB
- Returns to Gateway:
- system_prompt: "You are a helpful assistant for a personal document knowledge base... Relevant documents: [2026-01-27] gold is $5,220..."
- rag_context: [{entry_date, text_snippet, score}]
3. Gateway → LLM:
- Sends: system_prompt + user message
- Receives: streaming tokens
4. Gateway → Web App:
- Forwards each token via WebSocket
5. Gateway → Chat Service (action: "save_message"):
- Saves assistant's complete response to MongoDB
Key point: The embedding vector stays inside the Chat Service. The Gateway only receives:
- The system prompt (text with RAG context baked in)
- The rag_context metadata (for showing "Sources" in UI)
The Gateway never sees the actual vector—it just passes text to the LLM. The LLM has no knowledge of the Chat Service, embeddings, or databases. It simply receives a system prompt (which happens to contain retrieved documents) and a user message, then generates a response.
| Service | Description |
|---|---|
| web | Next.js frontend for chat and document upload |
| gateway | WebSocket server that invokes Lambda functions via AWS SDK |
| ingestion | TypeScript Lambda for adding documents to the database |
| chat | TypeScript Lambda for semantic search (RAG) and chat history over documents |
| embedding | Python service using sentence-transformers for vector generation |
| llm | Python service using Qwen2.5-3B-Instruct for chat completions (SSE streaming) |
| minio | S3-compatible object storage (mimics AWS S3 locally) |
| minio-setup | One-time container that creates the lancedb bucket |
Common code lives in /shared and is imported via TypeScript path aliases (@shared/*). This avoids npm package dependencies while allowing code reuse.
Contents:
- `config.ts` - Environment configuration with defaults
- `types.ts` - TypeScript types for entries, events, and API bodies (includes `aws-lambda` dependency)
- `chat-types.ts` - Chat-related types (browser-safe, no Node.js dependencies)
- `db/connection.ts` - LanceDB connection management with caching
- `db/operations.ts` - LanceDB operations (addEntry, searchSimilar)
- `db/mongo-connection.ts` - MongoDB connection management
- `db/mongo-operations.ts` - MongoDB CRUD operations for chat sessions and messages
- `services/embedding.ts` - Embedding service client
The chat page (/chat) provides a real-time chat interface that communicates with the gateway over WebSocket.
Components (web/components/Chat/):
- `Chat.tsx` - Main container with header, error banner, message list, and input
- `MessageList.tsx` - Displays messages with user/assistant styling, RAG context sources, typing indicator
- `MessageInput.tsx` - Text input with Enter-to-send support
- `ConnectionStatus.tsx` - Shows WebSocket connection state (connecting/connected/disconnected/error)
Hooks (web/components/Chat/hooks/):
- `useWebSocket.ts` - WebSocket connection management with auto-reconnect
- `useChat.ts` - Chat state management combining WebSocket with message handling
Types:
- Shared types imported from `@shared/chat-types` (ChatMessage, ChatSession, RagContext, WebSocketChatMessage)
- UI-specific types in `web/components/Chat/types.ts` (ConnectionStatus)
Single table document_entries with the following columns:
| Column | Type | Description |
|---|---|---|
| `id` | string | Unique identifier (UUID) |
| `entry_id` | string \| null | Groups chunks from the same entry |
| `entry_date` | string | Date of the document |
| `chunk_index` | number | Position when entry is split into chunks |
| `text` | string | The document text |
| `vector` | number[] | 384-dimensional embedding vector |
| `topics` | string[] | Array of mood tags |
| `word_count` | number | Word count of the text |
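For reference, that row shape corresponds roughly to a TypeScript type like this (a sketch; the actual definition lives in `shared/src/types.ts` and may differ in detail):

```typescript
// Sketch of a document_entries row; the real type lives in shared/src/types.ts.
interface DocumentEntry {
  id: string;                 // UUID
  entry_id: string | null;    // groups chunks from the same entry
  entry_date: string;         // e.g. "2024-01-15"
  chunk_index: number;        // position when an entry is split into chunks
  text: string;               // the document text
  vector: number[];           // 384-dimensional embedding
  topics: string[];           // mood/topic tags
  word_count: number;         // word count of the text
}
```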
Two collections for storing chat sessions and messages.
| Field | Type | Description |
|---|---|---|
| `_id` | ObjectId | MongoDB auto-generated |
| `session_id` | string | UUID, unique index |
| `user_id` | string \| null | null for anonymous, user UUID when auth added |
| `title` | string \| null | Optional session title (auto-generated from first message) |
| `created_at` | Date | Session creation timestamp |
| `updated_at` | Date | Last message timestamp |
| `metadata` | object | Extensible metadata (e.g., client_info) |
Indexes: session_id (unique), user_id (sparse), updated_at (descending)
| Field | Type | Description |
|---|---|---|
| `_id` | ObjectId | MongoDB auto-generated |
| `message_id` | string | UUID, unique index |
| `session_id` | string | Foreign key to sessions |
| `role` | "user" \| "assistant" | Message sender |
| `content` | string | Message text |
| `created_at` | Date | Message timestamp |
| `rag_context` | array \| null | Retrieved documents used for response (null for user messages) |
Each rag_context item contains: entry_id, entry_date, text_snippet (first 200 chars), score (similarity score)
Indexes: message_id (unique), session_id + created_at (compound)
Decision: Different conventions for different layers.
| Layer | Convention | Example |
|---|---|---|
| WebSocket messages (client ↔ gateway) | snake_case | session_id |
| Internal TypeScript / Lambda payloads | camelCase | sessionId |
| MongoDB documents | snake_case | session_id |
Why:
- Database fields use snake_case (MongoDB convention, matches JSON APIs)
- WebSocket messages use snake_case to align with database schema and external API conventions
- Internal TypeScript code uses camelCase (JavaScript/TypeScript convention)
- Gateway converts between conventions at the boundary
Tradeoff:
- Requires conversion at gateway layer
- Could have used camelCase everywhere, but snake_case for wire formats is a common REST/JSON API convention and matches MongoDB's typical style
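As an illustration, the boundary conversion amounts to a small mapping in the gateway (field names come from the tables above; the helper functions themselves are hypothetical):

```typescript
// Hypothetical boundary mapping in the gateway: wire format (snake_case) in,
// internal TypeScript payload (camelCase) out, and back again.
interface ChatWireMessage {
  action: "chat";
  content: string;
  session_id?: string;
}

interface ChatPayload {
  content: string;
  sessionId?: string;
}

function toInternal(msg: ChatWireMessage): ChatPayload {
  return { content: msg.content, sessionId: msg.session_id };
}

function toWire(payload: { sessionId: string; content: string }) {
  return { session_id: payload.sessionId, content: payload.content };
}
```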
Decision: Single table with all entries.
Why:
- Simpler queries and maintenance
- Scales well as knowledge base grows
- No need for cross-table joins
- `entry_id` and `chunk_index` handle multi-chunk entries without separate tables
Decision: Embedded mode with S3 storage.
Why:
- LanceDB is designed like SQLite - embedded, not a server
- Multiple services connect directly to the same S3 bucket
- Mirrors AWS serverless pattern (Lambda + S3)
- No separate database server to manage
- LanceDB handles concurrency internally
Decision: Use MinIO to simulate AWS S3 locally.
Why:
- Same S3 connection logic works in dev and production
- Just change the endpoint URL for AWS deployment
- Web console (port 9001) for debugging/browsing data
Decision: TypeScript with Node.js 24 Lambda runtime.
Why:
- Faster development iteration
- LanceDB has a mature TypeScript SDK
- Better alignment with team skills
- Rust would provide better performance but slower development
Decision: Copy shared code in Docker build, use tsconfig paths.
Why:
- No npm link or package.json dependencies
- Works with multi-stage Docker builds
- TypeScript path aliases (`@shared/*`) provide clean imports
- Simpler than publishing to private npm registry
Decision: Direct imports instead of index.ts re-exports.
Why:
- Follows project conventions (CLAUDE.md)
- Avoids circular dependency issues
- Clearer import paths show exact source
Decision: Store topics as string[] instead of single value.
Why:
- One document can describe multiple events/feelings
- More flexible for filtering and analysis
- Supports "happy and anxious" type entries
Decision: Module-level connection state with lazy initialization.
Why:
- Lambda cold starts are expensive
- Connection persists across invocations
- `initConnection()` on module load, awaited in handler
- `resetConnection()` for testing
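A minimal sketch of the pattern (simplified; the real code in `shared/src/db/connection.ts` also supports the dependency injection described further below):

```typescript
import * as lancedb from "@lancedb/lancedb";

// Module-level cache: reused across warm Lambda invocations.
let dbPromise: ReturnType<typeof lancedb.connect> | null = null;

export function initConnection() {
  // Lazy: connect on first use, then return the same in-flight/settled promise.
  if (!dbPromise) {
    dbPromise = lancedb.connect(process.env.LANCEDB_URI ?? "s3://lancedb/documents");
  }
  return dbPromise;
}

export function resetConnection(): void {
  // For tests: force a fresh connection on the next call.
  dbPromise = null;
}
```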
Decision: Two MongoDB collections instead of embedding messages in sessions.
Why:
- Allows efficient session listing without loading all messages
- Messages can grow large; separate collection avoids document size limits
- Easier to query and paginate messages independently
- Better index performance for message retrieval
Decision: Use session_id and message_id (UUIDs) alongside MongoDB's _id.
Why:
- Matches existing pattern in LanceDB entries
- Allows client-generated IDs for optimistic updates
- Portable identifiers that don't depend on MongoDB ObjectId format
- Easier to reference across services
Decision: Store retrieved documents in rag_context field on assistant messages.
Why:
- Records which documents were used to generate each response
- Useful for debugging retrieval quality
- Enables "show sources" UI feature
- Stores snippet + score, not full text (keeps documents small)
Decision: Split chat-related types into a separate file from types.ts.
Why:
- `types.ts` imports `aws-lambda` types which aren't available in browser environments
- The web app needs to import chat types (ChatMessage, RagContext, etc.) without pulling in Node.js dependencies
- Backend services import from `@shared/types` (with aws-lambda) or `@shared/chat-types` as needed
- Web app imports only from `@shared/chat-types`
Tradeoff:
- Two type files instead of one
- Could have used conditional exports or a build step to tree-shake, but separate files are simpler and more explicit
Decision: Services import directly from source files rather than through barrel files or re-export layers.
Why:
- Initially considered re-exporting chat types from `types.ts` for convenience
- Removed because it adds unnecessary indirection
- Direct imports (`@shared/chat-types`) are clearer about where types come from
- Follows project convention of no barrel files (no `index.ts` re-exports)
Decision: Keep ConnectionStatus type in web/components/Chat/types.ts, not in shared.
Why:
- `ConnectionStatus` ('connecting' | 'connected' | 'disconnected' | 'error') is purely a UI concern
- Backend services don't need to know about WebSocket connection states
- Only types used across multiple services belong in shared
- Keeps shared module focused on domain types (messages, sessions, RAG context)
Decision: Copy shared module into Docker container at /shared/, use tsconfig paths for resolution.
Why:
- Web app's Dockerfile builds from project root context to access `../shared`
- `tsconfig.json` maps `@shared/*` to `../shared/src/*`
- Relative path `../shared` from `/app` resolves to `/shared` in container
- Volume mount `./shared:/shared` in docker-compose for dev hot-reload
- No webpack aliases needed - tsconfig paths handle both dev and build
Tradeoff:
- Initially tried webpack alias in `next.config.ts`, but unnecessary complexity
- Build context must be project root (`.`), not just `./web`
Decision: Use sparse index on user_id field (null for anonymous users).
Why:
- Efficient queries when user auth is added later
- Sparse index excludes null values, saving space
- Anonymous sessions still work without placeholder values
- Future-proofs schema for multi-user support
Decision: Use JSON Schema validation on collections.
Why:
- Catches malformed documents at insert time
- Documents required fields and types
- Acts as lightweight contract between services
- Validation errors are explicit, not silent data corruption
Decision: Use init-mongo.js instead of init-mongo.sh for MongoDB initialization.
Why:
- MongoDB Docker entrypoint runs `.js` files directly with mongosh
- No shell heredoc syntax or `$` escaping needed
- Cleaner, more readable initialization code
- Runs automatically on first startup when data directory is empty
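A rough sketch of what such an init script can contain, combining the schema validation and indexes described above (field names follow the schema tables; the actual script is `db/init-mongo.js`):

```javascript
// Sketch of db/init-mongo.js: executed by mongosh on first container start.
db = db.getSiblingDB("example_rag");

db.createCollection("sessions", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["session_id", "created_at", "updated_at"],
      properties: {
        session_id: { bsonType: "string" },
        user_id: { bsonType: ["string", "null"] },
        title: { bsonType: ["string", "null"] },
      },
    },
  },
});
// The messages collection is created the same way with its own $jsonSchema.

db.sessions.createIndex({ session_id: 1 }, { unique: true });
db.sessions.createIndex({ user_id: 1 }, { sparse: true });
db.sessions.createIndex({ updated_at: -1 });

db.messages.createIndex({ message_id: 1 }, { unique: true });
db.messages.createIndex({ session_id: 1, created_at: 1 });
```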
Decision: Functions accept optional dependency parameters.
Why:
- `initConnection(connectDep = connect)` allows mock injection
- `getTable(createIfMissing, connectDep)` for testability
- No need for complex mocking libraries
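For example, a test can pass a fake connect function instead of mocking the LanceDB module (a hypothetical test sketch; import path and assertions are illustrative):

```typescript
import { jest } from "@jest/globals";
// Hypothetical test: the import path is illustrative.
import { initConnection, resetConnection } from "../src/db/connection.js";

test("initConnection uses the injected connect dependency", async () => {
  resetConnection();
  const fakeDb = { openTable: jest.fn() };
  const fakeConnect = jest.fn<any>().mockResolvedValue(fakeDb);

  const db = await initConnection(fakeConnect as any);

  expect(fakeConnect).toHaveBeenCalledTimes(1);
  expect(db).toBe(fakeDb);
});
```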
Decision: Web app connects via WebSocket to a gateway, which invokes Lambda functions.
Why:
- Lambda functions cannot hold persistent WebSocket connections
- AWS API Gateway WebSocket APIs work by: (1) holding the WebSocket connection, (2) invoking Lambda per message via the internal Lambda Invoke API, (3) Lambda pushes responses via the API Gateway Management API
- Our gateway mirrors this pattern locally
- Client communicates entirely over WebSocket - cleaner than REST for a chat application
Tradeoff:
- Additional service to maintain (gateway)
- Could have used direct REST calls from web app to Lambda RIE, but that doesn't match production architecture
Decision: Gateway uses @aws-sdk/client-lambda to invoke Lambda functions, not raw HTTP.
Why:
- In production AWS, API Gateway uses the Lambda Invoke API (internal AWS mechanism), not HTTP
- The Lambda RIE's HTTP endpoint (`/2015-03-31/functions/function/invocations`) is just an emulation for local testing
- Using AWS SDK keeps gateway code identical between local and production - only the endpoint URL changes
- Same code works locally (pointing at RIE) and in production (pointing at real Lambda)
Local:

```typescript
new LambdaClient({ endpoint: "http://ingestion:8080" })
```

Production:

```typescript
new LambdaClient({ region: "us-east-1" }) // Uses real Lambda
```

Decision: LLM service exposes HTTP/SSE endpoints, not WebSocket. Gateway proxies SSE to WebSocket.
Architecture:
┌──────────┐ WS ┌─────────┐ SSE ┌─────────────┐
│ Frontend │◄───────────►│ Gateway │◄────────────►│ LLM Service │
└──────────┘ └─────────┘ └─────────────┘
Why not give LLM direct WebSocket access?
- Separation of concerns - The LLM service should be a pure inference engine: messages in, tokens out. It shouldn't know about connection management, WebSocket protocols, or client sessions.
- Testability - HTTP/SSE is easy to test with curl, httpx, pytest. WebSocket testing requires more setup and stateful connections.
- Flexibility - The LLM can be called from:
  - The gateway (for chat)
  - CLI tools for debugging
  - Batch jobs for bulk processing
  - Other backend services
  - HTTP-only clients

  If it were WebSocket-coupled, all those would become harder.
- GPU resources are precious - The LLM service is GPU-bound. Don't burden it with connection management. Let it focus on inference.
- Proxy overhead is negligible - For localhost, proxying SSE→WebSocket adds microseconds. LLM inference takes seconds. The overhead is unnoticeable.
How SSE→WebSocket proxy works (see the sketch below):
1. Client sends `{ action: "chat", content: "...", stream: true }` over WebSocket
2. Gateway fetches RAG context from Chat service
3. Gateway calls LLM service with `stream: true`, receives SSE response
4. Gateway consumes SSE events and forwards each token to WebSocket:

   ```
   LLM SSE:    data: {"choices":[{"delta":{"content":"Hello"}}]}
                 ↓
   Gateway:    Parse JSON, extract token
                 ↓
   WebSocket:  { action: "chat_stream_token", token: "Hello" }
   ```

5. On stream end, Gateway sends `{ action: "chat_stream_end", content: "..." }`
Stream events sent to client:
- `chat_stream_start` - Stream beginning, includes RAG context
- `chat_stream_token` - Individual token
- `chat_stream_end` - Stream complete, includes full content
- `chat_stream_error` - Error occurred, includes partial content
Tradeoff:
- Extra hop adds complexity to gateway
- Could have simplified by giving LLM direct WebSocket access
- But the architectural benefits (testability, flexibility, separation of concerns) outweigh the complexity cost
Decision: Use AWS Lambda base image with Runtime Interface Emulator (RIE), not custom HTTP wrappers.
Why:
- Initially considered creating `main.ts` files that wrap Lambda handlers in Express/HTTP servers for local development
- This approach diverges from production behavior and adds unnecessary code
- Lambda base images include RIE which provides the HTTP layer automatically
- Lambda services have a single handler function, no HTTP routes - the RIE handles HTTP-to-event translation
- Keeps Lambda code focused on business logic, not HTTP concerns
Tradeoff:
- No hot reloading for Lambda services (must rebuild container on code changes)
- Web app and gateway have hot reloading via volume mounts; Lambda services do not
Decision: Single Lambda handler with action field for routing, not multiple HTTP endpoints.
Why:
- Lambda functions receive events, not HTTP requests
- In production, API Gateway maps routes to Lambda invocations with event payloads
- Single handler with switch statement on `action` field mirrors this pattern
- Keeps Lambda code portable - same handler works with API Gateway, direct invocation, or RIE
Example:

```typescript
export const handler = async (event: IngestEvent) => {
  switch (event.action) {
    case "ingest": return await ingest(event.body);
    case "health": return { statusCode: 200, body: "ok" };
  }
};
```

Decision: Different hot reloading approaches for different service types.
| Service | Hot Reload | Method |
|---|---|---|
| Web (Next.js) | Yes | Volume mounts + next dev |
| Gateway (Node.js) | Yes | Volume mounts + tsx watch |
| Lambda services | No | Rebuild container on changes |
Why:
- Web and gateway are stateless Node.js servers - easy to reload
- Lambda services use AWS base images with RIE - designed for container-based deployment, not file watching
- Could add hot reloading to Lambda with custom setup, but diverges from production pattern
- For rapid Lambda iteration, run tests locally with `npm test` instead of full container rebuild
Jest with ESM support. Run tests:

```bash
# Shared module
cd shared && npm test

# Ingestion service
cd ingestion && npm test

# Chat service
cd chat && npm test

# Gateway service
cd gateway && npm test

# Web app
cd web && npm test
```

Test patterns used (a mocking sketch follows the list):
- `jest.unstable_mockModule()` for ESM module mocking
- Import `jest` from `@jest/globals` (required for ESM)
- Dynamic imports after mocking: `const { fn } = await import("./module")`
- `jest.fn<any>()` for typed mocks
- Tests live in `tests/` subdirectories alongside source code
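Putting those patterns together, a unit test might look roughly like this (the mocked module path and function are hypothetical):

```typescript
import { jest } from "@jest/globals";

// Mock an ESM dependency before importing the module under test.
jest.unstable_mockModule("../src/services/embedding.js", () => ({
  getEmbedding: jest.fn<any>().mockResolvedValue(new Array(384).fill(0)),
}));

// Dynamic import after mocking so the mock is picked up.
const { getEmbedding } = await import("../src/services/embedding.js");

test("returns a 384-dimensional vector", async () => {
  const vector = await getEmbedding("hello");
  expect(vector).toHaveLength(384);
});
```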
Prerequisites:
- Docker and Docker Compose
- Node.js 20+

```bash
# Install dependencies
cd shared && npm install
cd ../ingestion && npm install
cd ../chat && npm install

# Start services
docker-compose up
```

Data is persisted to local directories via Docker bind mounts:
| Directory | Contents |
|---|---|
| `./minio-data/` | MinIO/S3 data (LanceDB vector database) |
| `./db/data/` | MongoDB data (chat history) |
These directories are created automatically and excluded from git. Data survives docker compose down and container rebuilds.
MongoDB Initialization: The db/init-mongo.js script runs automatically on first startup (when ./db/data/ is empty), creating the sessions and messages collections with schema validation and indexes.
To reset all data:

```bash
rm -rf minio-data/ db/data/
docker compose down -v
docker compose up -d
```

| Variable | Default | Description |
|---|---|---|
| `LANCEDB_URI` | `s3://lancedb/documents` | LanceDB storage location |
| `S3_ENDPOINT` | `http://localhost:9000` | S3/MinIO endpoint |
| `AWS_ACCESS_KEY_ID` | `minioadmin` | S3 access key |
| `AWS_SECRET_ACCESS_KEY` | `minioadmin` | S3 secret key |
| `AWS_REGION` | `us-east-1` | AWS region |
| `EMBEDDING_SERVICE_URL` | `http://localhost:8001` | Embedding service URL |
| `MONGO_URI` | `mongodb://root:example@mongo:27017` | MongoDB connection string |
| `MONGO_DB_NAME` | `example_rag` | MongoDB database name |
| `INGESTION_FUNCTION_NAME` | `function` | Lambda function name for ingestion (use actual name in prod) |
| `CHAT_FUNCTION_NAME` | `function` | Lambda function name for chat (use actual name in prod) |
| `INGESTION_LAMBDA_ENDPOINT` | `http://localhost:8002` | Ingestion Lambda endpoint (omit in prod for real AWS) |
| `CHAT_LAMBDA_ENDPOINT` | `http://localhost:8003` | Chat Lambda endpoint (omit in prod for real AWS) |
| `LLM_SERVICE_URL` | `http://localhost:8004` | LLM service URL for chat completions |
| `NEXT_PUBLIC_WS_URL` | `ws://localhost:8080/ws` | WebSocket gateway URL (web app) |
Ingest Entry:

```json
{
  "action": "ingest",
  "entry_date": "2024-01-15",
  "text": "Today was a good day...",
  "topics": ["happy", "calm"],
  "entry_id": "optional-id",
  "chunk_index": 0
}
```

Health Check:

```json
{
  "action": "health"
}
```

Search:

```json
{
  "action": "query",
  "query": "How was my day?",
  "limit": 5
}
```

RAG Context:

```json
{
  "action": "rag",
  "body": {
    "message": "What did I do yesterday?",
    "sessionId": "session-uuid"
  }
}
```

Save Message:

```
{
  "action": "save_message",
  "body": {
    "sessionId": "session-uuid",
    "content": "Assistant response...",
    "ragContext": [...]
  }
}
```

Health Check:

```json
{
  "action": "health"
}
```

When you store text in a RAG system, you don't store the raw words. An embedding model (like sentence-transformers) converts text into a list of numbers called a vector.
"the price of gold is $5,220.50" → [0.12, -0.45, 0.78, ..., 0.33] (384 numbers)
"what's going on with gold?" → [0.15, -0.41, 0.72, ..., 0.29] (384 numbers)
These numbers encode the semantic meaning of the text—not the exact words, but the concepts. Texts with similar meanings produce similar vectors.
To find relevant results, we measure how "close" two vectors are. Cosine similarity measures the angle between two vectors:
similarity = 1.0 → identical meaning (vectors point same direction)
similarity = 0.0 → unrelated (vectors perpendicular)
similarity = -1.0 → opposite meaning
Cosine distance is 1 - similarity:
distance = 0.0 → identical
distance = 1.0 → unrelated
distance = 2.0 → opposite
Visual intuition (imagine vectors as arrows in space):
"price of gold today"
↗
/
/ ← small angle = low distance = similar
/
"gold market" →
"my breakfast" → ← large angle = high distance = unrelated
Even when two texts are about the same topic, the embedding model encodes more than just keywords:
| Text | Concepts encoded |
|---|---|
| "what's going on with gold in the market?" | question, market trends, general inquiry |
| "the price of gold is $5,220.50 at 4:50pm" | statement, specific price, specific time |
The vectors are related but not close. Distance might be 0.9-1.0 even though both mention gold.
The RAG search uses a similarity score threshold to filter out irrelevant results. This prevents the system from returning unrelated documents when the user's query doesn't match any content.
Configuration (`shared/src/db/operations.ts`, called by `chat/src/services/chat.service.ts`):

```typescript
searchSimilar(table, queryVector, limit, maxDistance = 1.2)
```

| Parameter | Default | Description |
|---|---|---|
| `maxDistance` | 1.2 | Maximum cosine distance allowed (lower = more similar) |
How it works (a filtering sketch follows the list):
- LanceDB returns results with a `_distance` field (cosine distance)
- Distance ranges from 0 (identical) to 2 (opposite vectors)
- Results with distance > `maxDistance` are filtered out
- If no results pass the threshold, RAG context is empty
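A simplified sketch of the threshold filter (the real implementation is `searchSimilar()` in `shared/src/db/operations.ts`; the query-builder calls shown are assumptions that may differ by SDK version):

```typescript
// Simplified sketch of the relevance threshold applied after vector search.
async function searchSimilar(
  table: any,              // LanceDB table handle
  queryVector: number[],
  limit = 5,
  maxDistance = 1.2
) {
  const results = await table.search(queryVector).limit(limit).toArray();
  // LanceDB attaches a _distance field (cosine distance, 0 = identical).
  return results.filter((r: any) => r._distance <= maxDistance);
}
```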
Tuning guidance:
| Threshold | Behavior |
|---|---|
| 0.5 | Strict - only highly relevant results |
| 0.8 | Moderate - may miss semantically related but differently phrased content |
| 1.2 | Balanced (default) - catches related content with different phrasing |
| 1.5+ | Very permissive - rarely filters anything |
Example:
- User asks "hey" → no documents about greetings → high distance scores → filtered out
- User asks "how was my trip to Paris?" → document about Paris trip → low distance → included
Why this matters: Without a threshold, vector search always returns the top N results regardless of relevance. A query like "hey" would return whatever entries happen to be least dissimilar, even if they're completely unrelated (e.g., an entry about gold). The threshold ensures only genuinely relevant context is passed to the LLM.
The system prompt is the instruction set that tells the LLM who it is, how to behave, and what context it has available. It's the primary mechanism for customizing LLM behavior without retraining the model.
Location: chat/src/services/chat.service.ts → buildSystemPrompt()
Structure:

```typescript
function buildSystemPrompt(ragContext: RagContext[]): string {
  if (ragContext.length === 0) {
    return `You are a helpful assistant for a personal document knowledge base.
The user is asking a question, but no relevant documents were found.
Respond helpfully and suggest they might want to add more documents or rephrase their question.`;
  }

  const contextEntries = ragContext
    .map((ctx) => `[${ctx.entry_date}] ${ctx.text_snippet}`)
    .join("\n\n");

  return `You are a helpful assistant for a personal document knowledge base.
Use the following documents to answer the user's question.
Be conversational and reference specific details from the entries when relevant.
If the entries don't contain enough information to answer, say so honestly.

Relevant documents:
${contextEntries}`;
}
```

How it works:
The system prompt is sent to the LLM as the first message in the conversation, before the user's message:
Messages sent to LLM:
┌─────────────────────────────────────────────────────────────┐
│ role: "system" │
│ content: "You are a helpful assistant for a personal │
│ document knowledge base. Use the following journal │
│ entries to answer the user's question... │
│ │
│ Relevant documents: │
│ [2026-01-27] the price of gold is $5,220.50..." │
├─────────────────────────────────────────────────────────────┤
│ role: "user" │
│ content: "what's the current price of gold?" │
└─────────────────────────────────────────────────────────────┘
Why it matters:
| Aspect | Effect |
|---|---|
| Identity | "You are a helpful assistant for a personal document knowledge base" tells the LLM its role and domain |
| Behavior | "Be conversational and reference specific details" shapes response style |
| Boundaries | "If the entries don't contain enough information, say so honestly" prevents hallucination |
| Context injection | RAG results are embedded directly in the prompt, giving the LLM access to user's data |
Without a system prompt, the LLM would be a generic assistant with no knowledge of:
- Its purpose (answering questions about a personal document knowledge base)
- The user's data (the stored documents)
- How to respond (conversational, honest about limitations)
Customization examples:
| Use Case | System Prompt Modification |
|---|---|
| More formal tone | "Respond in a professional, formal tone" |
| Therapy-style | "You are a supportive listener. Ask reflective questions about the user's feelings" |
| Data analysis | "Analyze patterns across documents. Look for trends in mood, topics, and frequency" |
| Strict factual | "Only answer questions that can be directly answered from the documents. Never speculate" |
The RAG + System Prompt pattern:
This is the core of how RAG applications work:
1. User asks a question
2. System searches vector database for relevant content
3. Relevant content is injected into the system prompt
4. LLM receives: system prompt (with context) + user message
5. LLM generates response grounded in the provided context
The LLM doesn't have direct database access—it only sees what's included in the prompt. This is both a limitation (context window size) and a feature (you control exactly what the LLM knows).
Core idea: Separate the code that reads data from the code that writes data.
Traditional approach (current implementation):
┌─────────────────────────────────────┐
│ Chat Service │
│ │
│ • searchSimilar() ← READ │
│ • getHistory() ← READ │
│ • saveMessage() ← WRITE │
│ • createSession() ← WRITE │
└─────────────────────────────────────┘
One service does everything. Simple, but responsibilities are mixed.
CQRS approach:
┌─────────────────────────────────────┐ ┌─────────────────────────────────┐
│ Chat Service │ │ Command Service │
│ (READ side) │ │ (WRITE side) │
│ │ │ │
│ • searchSimilar() │ │ • saveMessage() │
│ • getHistory() │ │ • createSession() │
│ • getSession() │ │ • updateSession() │
└─────────────────────────────────────┘ └─────────────────────────────────┘
Terminology:
| Term | Meaning | Example |
|---|---|---|
| Query | Request that returns data, doesn't change state | "Get chat history" |
| Command | Request that changes state, may not return data | "Save this message" |
Why separate them?
- Different optimization needs:
| Reads (Queries) | Writes (Commands) |
|---|---|
| Need to be FAST | Need to be RELIABLE |
| Can use caching | Need validation |
| Can use read replicas | Need consistency |
| Can be eventually consistent | Often need transactions |
- Different scaling patterns:
Typical app: 90% reads, 10% writes
Without CQRS:
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Service │ │ Service │ │ Service │ ← Scale everything together
└─────────┘ └─────────┘ └─────────┘
With CQRS:
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│ Query │ │ Query │ │ Query │ │ Query │ │ Query │ ← Scale reads heavily
└───────┘ └───────┘ └───────┘ └───────┘ └───────┘
┌─────────┐
│ Command │ ← Fewer write instances needed
└─────────┘
- Different data models:
Write model (normalized): Read model (denormalized):
┌──────────┐ ┌──────────┐ ┌─────────────────────────────┐
│ sessions │ │ messages │ │ chat_history_view │
│──────────│ │──────────│ │─────────────────────────────│
│ id │ │ id │ → │ session_id │
│ title │ │ session_id│ │ session_title │
│ created │ │ content │ │ messages[] (embedded) │
└──────────┘ │ role │ │ last_message_preview │
└──────────┘ └─────────────────────────────┘
Optimized for integrity Optimized for fast reads
Applied to this app:
Current flow (mixed read/write):
Gateway → Chat Service (rag) ← READ (search)
← WRITE (save user message) ❌ mixed
Gateway → LLM (stream)
Gateway → Chat Service (save_message) ← WRITE
CQRS flow (separated):
Gateway → Command Service (save user message) ← WRITE
Gateway → Chat Service (rag) ← READ only
Gateway → LLM (stream)
Gateway → Command Service (save assistant) ← WRITE
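For illustration only, the split could be expressed as two narrow interfaces (hypothetical names; the current codebase deliberately keeps both sides in the Chat service):

```typescript
// Hypothetical CQRS-style split; the current Chat service implements both sides.
interface ChatQueries {
  searchSimilar(queryVector: number[], limit: number): Promise<unknown[]>;
  getHistory(sessionId: string): Promise<unknown[]>;
  getSession(sessionId: string): Promise<unknown | null>;
}

interface ChatCommands {
  createSession(userId?: string): Promise<string>;
  saveMessage(sessionId: string, role: "user" | "assistant", content: string): Promise<string>;
  updateSession(sessionId: string, title: string): Promise<void>;
}
```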
When CQRS is overkill:
- Small apps with low traffic
- Simple CRUD with no complex queries
- Team is small and doesn't need separation
- Read/write patterns are similar
When CQRS shines:
- High-scale systems (millions of reads)
- Complex domains with different read/write needs
- Event-sourced systems
- Microservices where teams own different concerns
Related patterns:
| Pattern | Description |
|---|---|
| CQRS | Separate read/write code paths |
| Event Sourcing | Store events, not state. Rebuild state from events. |
| CQRS + Event Sourcing | Commands emit events, queries read from projections |
Current decision: This app uses a mixed approach (Chat Service handles both reads and writes) because the scale doesn't justify the added complexity. CQRS would be considered if scaling requirements change.
Current Limitation: Only .txt and .md files are supported for upload.
Why: File uploads are processed client-side in the browser using the JavaScript File.text() API, which only works with plain text files. The extracted text is then sent to the ingestion service as JSON. No files are stored separately—only the text content is embedded and stored in LanceDB.
To add support for .pdf and .docx files:
| Format | Client-side Option | Server-side Option |
|---|---|---|
| `.pdf` | `pdfjs-dist` (Mozilla's PDF.js) | `pdf-parse` (Node.js), PyMuPDF (Python) |
| `.docx` | `mammoth.js` | `mammoth` (Node.js), `python-docx` (Python) |
Implementation approaches:

1. Client-side parsing (current architecture):
   - Add parsing libraries to the web app
   - Extract text in the browser before sending to ingestion
   - Pros: No changes to backend, files never leave the browser
   - Cons: Larger frontend bundle, limited by browser capabilities

2. Server-side parsing:
   - Send raw file bytes to ingestion service (would require multipart/form-data)
   - Parse files in the ingestion Lambda or a dedicated parsing service
   - Pros: More powerful parsing, can handle complex documents
   - Cons: Requires backend changes, files must be transmitted/stored temporarily

Files to modify (a client-side parsing sketch follows the list):
- `web/components/DocumentUpload/DropZone.tsx` - Update `accept` attribute to include new file types
- `web/components/DocumentUpload/api.ts` - Add parsing logic in `submitFileContent()` or create new parsing functions
- If server-side: `ingestion/src/services/ingestion.service.ts` - Add file parsing before embedding
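As one possible direction for client-side parsing, a PDF text extractor using `pdfjs-dist` might look roughly like this (an untested sketch; worker setup and exact APIs should be checked against the installed `pdfjs-dist` version):

```typescript
import * as pdfjs from "pdfjs-dist";

// Rough sketch: extract plain text from a PDF File in the browser, producing
// the same kind of string that File.text() returns for .txt/.md uploads.
export async function extractPdfText(file: File): Promise<string> {
  // pdfjs needs a worker script configured in the browser, e.g.:
  // pdfjs.GlobalWorkerOptions.workerSrc = "/pdf.worker.min.mjs";
  const data = await file.arrayBuffer();
  const pdf = await pdfjs.getDocument({ data }).promise;

  const pages: string[] = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    pages.push(content.items.map((item: any) => item.str ?? "").join(" "));
  }
  return pages.join("\n\n");
}
```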
This session implemented the Gateway and Chat service chat action, connecting the existing web app chat UI to the backend.
1. Gateway Chat Action (gateway/src/handlers/messageHandler.ts)
- Added `ChatMessage` interface with `content` and `session_id` fields
- Added `"chat"` case to the message handler switch statement
- Created `handleChat()` function that forwards messages to the chat service
2. Chat Service Chat Handler (chat/src/services/chat.service.ts)
- New file implementing chat logic with RAG search
- Performs vector similarity search on user message
- Returns RAG context with placeholder response (LLM integration pending)
- Added `ChatBody` type to shared types
3. Lambda Client Fix (gateway/src/services/lambdaClient.ts)
- Changed `functionName` from service names (`"ingestion"`, `"chat"`) to `"function"`
- Problem: AWS Lambda RIE expects the `/2015-03-31/functions/function/invocations` path, but the SDK was generating `/2015-03-31/functions/chat/invocations` based on the function name
- Symptom: 404 errors with "Unexpected non-whitespace character after JSON" (HTML error page being parsed as JSON)
4. WebSocket URL Configuration (docker-compose.yml, web/.env.local)
- Changed `NEXT_PUBLIC_WS_URL` from `ws://localhost:8080/ws` to `ws://192.168.2.17:8080/ws`
- Problem: When accessing the web app from a different machine, `localhost` resolves to the client machine, not the Docker host
- Context: Environment variables in Next.js are baked in at build time for client-side code (`NEXT_PUBLIC_*` prefix), so changes require container rebuild
- Tradeoff: Hardcoding IP works for local network access but isn't portable. Production would use a proper hostname.
5. Field Name Alignment (gateway/src/handlers/messageHandler.ts)
- Changed gateway's `ChatMessage` interface from `{ message, sessionId }` to `{ content, session_id }`
- Problem: Web client sends `content` and `session_id` (matching `@shared/chat-types`), but gateway expected different field names
- Lesson: When adding new message types, check existing type definitions in shared module first
6. React Strict Mode WebSocket Fix (web/components/Chat/hooks/useWebSocket.ts)
- Added `isDisconnectingRef` to track intentional disconnects
- Skip error/close handlers when disconnecting intentionally
- Problem: React Strict Mode in development mounts, unmounts, then remounts components. The first WebSocket connection would be closed during cleanup, triggering error state before the second connection established.
- Symptom: "WebSocket connection failed" error flash on page load, even though connection ultimately succeeds
- Tradeoff: Added complexity to the hook, but only ~5 lines. Alternative was disabling Strict Mode, which would hide other potential issues.
7. Chat Input UX Improvements (web/components/Chat/Chat.tsx, web/components/Chat/MessageInput.tsx)
- Removed `isLoading` from disabled condition - input stays enabled while waiting for response
- Added `inputRef` to maintain focus after sending message
- Tradeoff: Users can now queue messages while waiting, which the backend doesn't currently handle (messages process sequentially). Acceptable for MVP.
Why route chat through the existing Query service (not a new Chat service)?
- Chat needs RAG search, which Query already has
- Keeps service count minimal for MVP
- Can extract to dedicated service later if complexity grows
Why placeholder LLM response?
- LLM service (in `llm/` directory) isn't implemented yet
- Unblocks frontend development and integration testing
gateway/src/handlers/messageHandler.ts # Chat action routing
gateway/src/services/lambdaClient.ts # Lambda function name fix
chat/src/handler.ts # Chat case in switch
chat/src/services/chat.service.ts # New file - chat logic
shared/src/types.ts # ChatBody type
docker-compose.yml # WebSocket URL
web/.env.local # WebSocket URL (local override)
web/components/Chat/hooks/useWebSocket.ts # Strict Mode fix
web/components/Chat/Chat.tsx # Input disabled state
web/components/Chat/MessageInput.tsx # Focus retention
This session implemented MongoDB integration for chat history persistence and fixed a WebSocket connection error on page load.
1. MongoDB Connection Module (shared/src/db/mongo-connection.ts)
- New file implementing MongoDB connection management
- Singleton pattern matching existing LanceDB connection approach
- Functions: `initMongoConnection()`, `getMongoDb()`, `closeMongoConnection()`, `resetMongoConnection()`
- Dependency injection support for testing
2. MongoDB Operations Module (shared/src/db/mongo-operations.ts)
- New file implementing session/message CRUD operations
- Session operations: `createSession()`, `getSession()`, `updateSessionTitle()`, `updateSessionTimestamp()`, `listSessions()`, `deleteSession()`, `getOrCreateSession()`
- Message operations: `createMessage()`, `getMessage()`, `getSessionMessages()`, `deleteMessage()`
- All operations take `Db` as first parameter for testability
3. Config Updates (shared/src/config.ts)
- Added `mongoUri` and `mongoDbName` to `AppConfig` interface
- Default values: `mongodb://root:example@mongo:27017` and `example_rag`
- Matches docker-compose MongoDB configuration
4. Chat Service MongoDB Integration (chat/src/services/chat.service.ts)
- Integrated MongoDB operations into chat handler
- User messages saved before RAG search
- Assistant messages saved after response generation (with RAG context)
- Session created automatically if not exists
5. Connection Error Fix (web/components/Chat/hooks/useChat.ts)
- Added `onOpen` handler that clears error state when WebSocket connects
- Problem: "Connection error" banner appeared briefly on page load due to transient WebSocket errors during initial connection or React Strict Mode's double-mount
- Symptom: Error message flashed then disappeared as connection established
- Fix: Clear any existing error when `onopen` fires
Singleton MongoDB Connection
- Decision: Module-level connection state, same as LanceDB
- Pro: Simple, connection reused across Lambda invocations (warm starts)
- Con: Connection shared across all requests; no per-request isolation
- Why acceptable: Matches existing pattern, MongoDB driver handles connection pooling internally
Synchronous Message Saving
- Decision: Save messages inline during chat request (not async/fire-and-forget)
- Pro: Data consistency guaranteed, errors surface immediately
- Con: Adds latency to chat response (~5-10ms per insert)
- Why acceptable: Latency is negligible compared to embedding + RAG search + future LLM call
Clear All Errors on Connect
- Decision: `onOpen` clears any existing error, not just connection errors
- Pro: Simple, handles all transient error cases
- Con: Could hide a real error that occurred right before successful connection
- Why acceptable: If there's a persistent issue, error will resurface on next operation; better UX than showing scary errors on initial load
UUIDs for Session/Message IDs
- Decision: Use `crypto.randomUUID()` for IDs, not MongoDB ObjectId
- Pro: Client can generate IDs for optimistic updates, portable across databases
- Con: Slightly larger than ObjectId (36 chars vs 24)
- Why acceptable: Matches existing LanceDB ID pattern, enables future client-side ID generation
User Message ID Not Returned
- Current: Chat response only includes assistant message ID, not user message ID
- Why: User message is saved to MongoDB for history, but return value isn't used
- Future option: Could return `user_message_id` in response for:
  - Client confirmation that message was saved
  - Message linking (`in_reply_to` field)
  - Server timestamp sync
  - Request tracing/debugging
- To implement: Capture return value from `createMessage()` and add to response
shared/src/config.ts # Added MongoDB config
shared/src/db/mongo-connection.ts # New - connection management
shared/src/db/mongo-operations.ts # New - CRUD operations
shared/src/db/mongo-operations.test.ts # New - unit tests
shared/package.json # Added mongodb dependency
chat/src/services/chat.service.ts # Integrated MongoDB saves
web/components/Chat/hooks/useChat.ts # Fixed connection error
MongoDB operations verified with:
- Unit tests (`npm test` in shared directory)
- Integration test against live MongoDB container
- End-to-end test via chat UI - messages now persist in `sessions` and `messages` collections
To verify data persistence:
```bash
docker exec example-rag-mongodb mongosh --quiet -u root -p example \
  --authenticationDatabase admin example_rag \
  --eval "db.sessions.find().toArray(); db.messages.find().toArray();"
```

This session addressed production compatibility, code consistency, and test maintenance.
1. Lambda Function Names Configurable (gateway/src/services/lambdaClient.ts)
- Added `INGESTION_FUNCTION_NAME` and `QUERY_FUNCTION_NAME` environment variables
- Default to `"function"` for local Lambda RIE compatibility
- Problem: Hardcoded `"function"` name works with RIE but not production AWS Lambda
2. Destructured CSS Module Styles (all web components)
- Refactored from `styles.className` to destructured `const { className } = styles`
- Improves readability and reduces repetition
- Dynamic style access (e.g., `styles[status]`) still uses the `styles` object
3. Updated Outdated Test (web/components/Chat/tests/Chat.test.tsx)
- Test expected input to be disabled during loading (old behavior)
- Updated to expect input enabled during loading (current intentional behavior)
- Context: Previous session intentionally kept input enabled to allow message queuing
gateway/src/services/lambdaClient.ts # Configurable function names
web/components/Chat/Chat.tsx # Destructured styles
web/components/Chat/MessageInput.tsx # Destructured styles
web/components/Chat/MessageList.tsx # Destructured styles
web/components/Chat/ConnectionStatus.tsx # Destructured styles
web/app/chat/page.tsx # Destructured styles
web/components/DocumentUpload/DocumentUpload.tsx # Destructured styles
web/components/DocumentUpload/FileUploadForm.tsx # Destructured styles
web/components/DocumentUpload/TextEntryForm.tsx # Destructured styles
web/components/DocumentUpload/DateInput.tsx # Destructured styles
web/components/DocumentUpload/FileList.tsx # Destructured styles
web/components/DocumentUpload/DropZone.tsx # Destructured styles
web/components/DocumentUpload/StatusMessage.tsx # Destructured styles
web/components/DocumentUpload/MoodSelector.tsx # Destructured styles
web/components/Home/Home.tsx # Destructured styles
web/components/ThemeToggle/ThemeToggle.tsx # Destructured styles
web/components/Chat/tests/Chat.test.tsx # Fixed outdated test
This session implemented true token streaming from the LLM service to the web UI, restructured the chat flow for proper separation of concerns, and added RAG relevance filtering.
Problem: The initial implementation had the Chat service calling the LLM and returning the complete response. This meant no streaming—the entire response appeared at once in the UI.
Why Chat Service Can't Stream:
- Chat service runs as a Lambda function
- Lambda functions return a single response—they cannot stream data incrementally
- Even if the LLM streams tokens to the Chat service, Lambda must buffer the entire response before returning
Why Gateway Can Stream:
- Gateway maintains persistent WebSocket connections with clients
- Gateway can consume SSE (Server-Sent Events) from the LLM service
- Gateway can forward each token to WebSocket clients as it arrives
New Architecture:
1. Client → Gateway: { action: "chat", content: "..." }
2. Gateway → Query (rag): Get RAG context + save user message
3. Gateway → LLM (SSE stream): Stream tokens
4. Gateway → Client: Forward each token via WebSocket
5. Gateway → Query (save_message): Save complete assistant message
Trade-offs:
- Gateway now has more responsibility (LLM orchestration)
- Chat service is simpler (just RAG + MongoDB operations)
- Streaming works end-to-end
- Messages are saved to MongoDB after streaming completes (not during)
1. Chat Service Restructured (chat/src/services/chat.service.ts, chat/src/handler.ts)
- Removed `chat` action (did everything including LLM call)
- Added `rag` action: RAG search + save user message, returns context + system prompt
- Added `save_message` action: Save assistant message after streaming completes
- Chat service no longer imports or calls LLM
2. Gateway Streaming Implementation (gateway/src/handlers/messageHandler.ts)
- `handleChat` now orchestrates the full flow:
  - Calls Chat service `rag` action
  - Streams from LLM service using existing `llmClient.ts`
  - Forwards tokens to WebSocket as `chat_stream_token` events
  - Calls Chat service `save_message` to persist
  - Sends `chat_stream_end` with complete content
3. WebSocket Stream Events
| Event | Purpose |
|---|---|
| `chat_stream_start` | Stream beginning, includes RAG context |
| `chat_stream_token` | Individual token from LLM |
| `chat_stream_end` | Stream complete, includes full content |
| `chat_stream_error` | Error with partial content if any |
4. React Streaming Smoothness (web/components/Chat/hooks/useChat.ts)
Problem: Initial streaming implementation was "jumpy"—UI flickered with each token.
Root causes:
- `flushSync` forced synchronous re-render for every token
- Scroll-to-bottom triggered on every `streamingContent` change
- React 18's automatic batching wasn't helping because `flushSync` bypassed it
Solution: requestAnimationFrame batching

```typescript
// Buffer tokens in a ref
streamingBufferRef.current += token

// Schedule single RAF update (coalesces rapid tokens)
if (rafIdRef.current === null) {
  rafIdRef.current = requestAnimationFrame(() => {
    setStreamingContent(streamingBufferRef.current)
    rafIdRef.current = null
  })
}
```

Benefits:
- Tokens batch naturally at 60fps
- No forced synchronous renders
- Smooth visual streaming effect
5. Scroll Behavior Fix (web/components/Chat/MessageList.tsx)
- Only auto-scroll when user is near bottom (within 100px)
- Don't scroll on every streaming token
- Scroll once when `isLoading` changes (stream starts)
6. RAG Similarity Threshold (shared/src/db/operations.ts)
Problem: User says "hey" → system returns document about gold (irrelevant).
Root cause: Vector search always returns top N results, even if they're dissimilar.
Solution: Added `maxDistance` parameter (default 0.8) to filter results:

```typescript
.filter((result) => result.score <= maxDistance)
```

How cosine distance works:
- 0 = identical vectors
- 2 = opposite vectors
- 0.8 threshold filters out results that are only vaguely similar
7. Logging Configuration (llm/src/server.py)
- Added `logging.basicConfig()` to show model loading progress
- Logs: "Loading model...", "Model loaded and ready"
chat/src/handler.ts # Added rag, save_message actions
chat/src/services/chat.service.ts # Split into rag() and saveMessage()
shared/src/types.ts # Added RagBody, SaveMessageBody types
shared/src/db/operations.ts # Added maxDistance threshold
gateway/src/handlers/messageHandler.ts # Full streaming orchestration
gateway/src/services/llmClient.ts # Debug logging for streaming
web/components/Chat/hooks/useChat.ts # RAF-based smooth streaming
web/components/Chat/MessageList.tsx # Smarter scroll behavior
web/components/Chat/Chat.module.scss # Blinking cursor for streaming
llm/src/server.py # Logging configuration
docker-compose.yml # Chat service depends on llm, mongo
Checking if tokens are streaming:
```bash
docker compose logs -f gateway
# Look for: [LLM] Starting stream request...
#           [LLM] Stream done, total tokens: X
#           [chat] Streamed X tokens
```

If streaming looks instant (no typing effect):
- Tokens may arrive faster than 60fps updates
- This is actually working correctly—LLM is just fast
- The RAF batching ensures smooth rendering regardless of token speed
example-rag/
├── docker-compose.yml
├── web/ # Next.js frontend
│ ├── app/
│ │ ├── layout.tsx
│ │ ├── page.tsx
│ │ ├── chat/
│ │ │ └── page.tsx # Chat interface
│ │ └── upload/
│ │ └── page.tsx
│ ├── components/
│ │ ├── Home/
│ │ ├── DocumentUpload/
│ │ ├── Chat/
│ │ │ ├── Chat.tsx
│ │ │ ├── MessageList.tsx
│ │ │ ├── MessageInput.tsx
│ │ │ ├── ConnectionStatus.tsx
│ │ │ ├── Chat.module.scss
│ │ │ ├── types.ts
│ │ │ ├── hooks/
│ │ │ │ ├── useWebSocket.ts
│ │ │ │ └── useChat.ts
│ │ │ └── tests/
│ │ │ ├── Chat.test.tsx
│ │ │ ├── MessageList.test.tsx
│ │ │ ├── MessageInput.test.tsx
│ │ │ ├── ConnectionStatus.test.tsx
│ │ │ ├── useWebSocket.test.ts
│ │ │ └── useChat.test.ts
│ │ └── ThemeToggle/
│ ├── contexts/
│ │ └── ThemeContext.tsx
│ ├── Dockerfile.dev
│ └── jest.config.js
├── gateway/ # WebSocket gateway
│ ├── src/
│ │ ├── index.ts
│ │ ├── tests/
│ │ │ └── index.test.ts
│ │ ├── handlers/
│ │ │ ├── connectionHandler.ts
│ │ │ ├── messageHandler.ts
│ │ │ └── tests/
│ │ │ ├── connectionHandler.test.ts
│ │ │ └── messageHandler.test.ts
│ │ └── services/
│ │ ├── lambdaClient.ts
│ │ └── tests/
│ │ └── lambdaClient.test.ts
│ ├── Dockerfile.dev
│ ├── jest.config.js
│ └── tsconfig.json
├── shared/
│ ├── src/
│ │ ├── config.ts
│ │ ├── types.ts # Backend types (aws-lambda dependency)
│ │ ├── chat-types.ts # Chat types (browser-safe)
│ │ ├── db/
│ │ │ ├── connection.ts # LanceDB connection
│ │ │ ├── operations.ts # LanceDB operations
│ │ │ ├── mongo-connection.ts # MongoDB connection
│ │ │ └── mongo-operations.ts # MongoDB CRUD operations
│ │ └── services/
│ │ └── embedding.ts
│ └── jest.config.js
├── ingestion/
│ ├── src/
│ │ ├── handler.ts
│ │ └── services/
│ │ └── ingestion.service.ts
│ ├── Dockerfile
│ ├── Dockerfile.dev
│ └── jest.config.js
├── chat/
│ ├── src/
│ │ ├── handler.ts
│ │ └── services/
│ │ ├── query.service.ts
│ │ └── chat.service.ts
│ ├── Dockerfile
│ ├── Dockerfile.dev
│ └── jest.config.js
├── embedding/
│ └── (Python sentence-transformers service)
└── llm/
├── src/
│ ├── __init__.py
│ ├── __main__.py # Entry point with dev auto-restart
│ ├── config.py # Environment configuration
│ ├── dependencies.py # FastAPI dependency injection
│ ├── model.py # LLMService class with streaming
│ └── server.py # FastAPI server with SSE streaming
├── tests/
│ ├── conftest.py # MockLLMService for testing
│ ├── test_model.py
│ └── test_server.py
├── models/
│ └── qwen2.5-3b-instruct/ # Model files
├── Dockerfile
└── pyproject.toml
