A RAG pipeline for ingesting newsletters and querying them with natural language. Supports production-style features: pdfplumber preprocessing, token/semantic chunking, hybrid retrieval (vector + BM25 + RRF + Cross-Encoder rerank), semantic cache (Redis), LLM answer generation (Anthropic Claude), and monitoring.
Install dependencies:

```shell
uv sync
```

Optional extras (see `pyproject.toml`):

- `uv sync --extra anthropic` — LLM answer generation (Anthropic Claude)
- `uv sync --extra hybrid` — Cross-Encoder rerank + SemanticChunker (sentence-transformers, scikit-learn)
- `uv sync --extra cache` — semantic cache (Redis)
- `uv sync --extra production` — anthropic + hybrid + cache
Copy `.env.example` to `.env` and set `ANTHROPIC_API_KEY` for generation. Embeddings use FastEmbed by default.
1. Add your data to the `data/` folder:
   - Supported formats: `.pdf`, `.txt`, `.md`, `.eml` (emails)
   - Example: copy your newsletter emails or PDFs into `data/`
2. Convert `.eml` → PDF (only if you have `.eml` files):

   ```shell
   uv run python -m src.eml_to_pdf
   ```

   This converts each `.eml` file to a `.pdf` in the same folder.
3. Build the index (preprocess, chunk, embed, store):

   ```shell
   uv run python -m src.build_index
   ```

   Env options: `CHUNKER=semantic` (needs `--extra hybrid`), `CHUNK_SIZE`, `CHUNK_OVERLAP`, `INDEX_BATCH_SIZE`.
4. Query your newsletters:

   ```shell
   # Retrieval only (returns relevant chunks)
   uv run python -m src.query "What did the newsletter say about X?"

   # With LLM answer generation (requires ANTHROPIC_API_KEY in .env)
   USE_GENERATOR=1 uv run python -m src.query "What did the newsletter say about X?"

   # Interactive mode
   uv run python -m src.query
   ```

   Env options: `USE_HYBRID=1` (vector + BM25 + RRF + rerank), `USE_CACHE=1` (Redis), `USE_GENERATOR=1` (LLM answer with citations).
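The `CHUNK_SIZE` and `CHUNK_OVERLAP` options in the indexing step control how documents are split into overlapping windows before embedding. A minimal sketch of that idea (illustrative only — the function name and defaults below are assumptions, not this project's chunker):

```python
def chunk_tokens(tokens, chunk_size=200, chunk_overlap=50):
    """Split a token sequence into overlapping chunks.

    Illustrative sketch only -- not this project's actual chunker.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks

# 10 tokens, window of 4, overlap of 2 -> windows advance by 2
print(chunk_tokens(list(range(10)), chunk_size=4, chunk_overlap=2))
# -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from both neighbors, at the cost of some duplicated tokens in the index.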
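With `USE_HYBRID=1`, the vector and BM25 result lists are merged via Reciprocal Rank Fusion (RRF) before reranking. A minimal sketch of RRF scoring (illustrative only — `rrf_fuse` and `k=60` are assumptions, not this pipeline's code):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank)
    to a document's fused score.

    Illustrative sketch only -- not this pipeline's actual code.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["a", "b", "c"]  # e.g. top hits from the vector index
bm25_hits = ["b", "d", "a"]    # e.g. top hits from BM25
print(rrf_fuse([vector_hits, bm25_hits]))
# -> ['b', 'a', 'd', 'c']  ("b" ranks high in both lists)
```

Because RRF only uses ranks, it needs no score normalization between the two retrievers, which is why it is a common fusion choice for hybrid setups like this one.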
A Streamlit-based web interface for interacting with your newsletter archive without using the command line.
```shell
uv run streamlit run src/app.py
```

Opens at http://localhost:8501 by default.
```
┌─────────────────────────────────────────────────┐
│ Newsletter RAG │
├──────────────┬──────────────────────────────────┤
│ SIDEBAR │ MAIN AREA │
│ │ │
│ [Upload .eml]│ 🔍 [Search bar...............] │
│ │ ☑ Generate answer with LLM │
│ [Convert PDF]│ [Search] │
│ │ │
│ [Rebuild │ Answer: │
│ Index] │ ───────────────────────────── │
│ │ <LLM generated answer> │
│ ─────────────│ │
│ Data Status: │ Sources: │
│ • PDFs: 5 │ - source1.pdf │
│ • EMLs: 3 │ - source2.pdf │
│ │ │
│ │ Retrieved Chunks: │
│ │ [1] source.pdf - chunk text... │
│ │ [2] source.pdf - chunk text... │
└──────────────┴──────────────────────────────────┘
```
| Feature | Description |
|---|---|
| Upload .eml | Drag-and-drop or browse to upload email files (multiple files supported) |
| Convert to PDF | Converts uploaded .eml files to PDFs in data/ for indexing |
| Rebuild Index | Re-indexes all documents in data/ (run after adding new files) |
| Data Status | Shows count of PDFs and EML files in the data directory |
| Search | Enter natural language queries to search your newsletters |
| LLM Toggle | Enable/disable LLM-generated answers (requires ANTHROPIC_API_KEY) |
| Results | Displays answer, sources, and expandable retrieved chunks |
- Upload: Use the sidebar file uploader to add `.eml` newsletter files
- Convert: Click "Convert to PDF" to process uploaded emails
- Index: Click "Rebuild Index" to add new documents to the search index
- Search: Enter a question in the main area and click "Search"
- Review: View the generated answer and expand chunks for details
The web UI respects the same environment variables as the CLI:
- `ANTHROPIC_API_KEY` — required for LLM answer generation
- `USE_HYBRID` — enable hybrid retrieval (vector + BM25)
- `USE_CACHE` — enable Redis semantic cache
- `CHUNK_SIZE`, `CHUNK_OVERLAP` — chunking parameters for indexing
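A semantic cache like the one `USE_CACHE` enables returns a stored answer when a new query's embedding is close enough to a previously answered one, rather than requiring an exact text match. A minimal in-memory sketch of that idea (Redis-free and illustrative only — the class, threshold, and linear scan below are assumptions, not this project's cache):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

class SemanticCache:
    """In-memory stand-in for a Redis-backed semantic cache.

    Illustrative sketch only -- not this project's implementation.
    """
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def get(self, embedding):
        # Linear scan: return the first cached answer whose stored query
        # embedding is similar enough to the incoming one.
        for cached_emb, answer in self.entries:
            if cosine(embedding, cached_emb) >= self.threshold:
                return answer
        return None  # cache miss

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

cache = SemanticCache()
cache.put([1.0, 0.0], "cached answer")
print(cache.get([0.99, 0.05]))  # near-duplicate query -> cache hit
print(cache.get([0.0, 1.0]))    # unrelated query -> None
```

The payoff is that paraphrased repeats of earlier questions skip retrieval and generation entirely; the threshold trades hit rate against the risk of serving a stale or mismatched answer.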