A RAG pipeline for ingesting newsletters and querying them with natural language. Supports production-style features: pdfplumber preprocessing, token/semantic chunking, hybrid retrieval (vector + BM25 + RRF + Cross-Encoder rerank), semantic cache (Redis), LLM answer generation (Anthropic Claude), and monitoring.
Install dependencies:

```shell
uv sync
```

Optional extras (see `pyproject.toml`):

- `uv sync --extra anthropic` — LLM answer generation (Anthropic Claude)
- `uv sync --extra hybrid` — Cross-Encoder rerank + SemanticChunker (sentence-transformers, scikit-learn)
- `uv sync --extra cache` — semantic cache (Redis)
- `uv sync --extra production` — anthropic + hybrid + cache
Copy `.env.example` to `.env` and set `ANTHROPIC_API_KEY` for generation. Embeddings use FastEmbed by default.
1. Add your data to the `data/` folder:
   - Supported formats: `.pdf`, `.txt`, `.md`, `.eml` (emails)
   - Example: copy your newsletter emails or PDFs into `data/`
2. Convert `.eml` → PDF (only if you have `.eml` files):

   ```shell
   uv run python -m src.eml_to_pdf
   ```

   This converts each `.eml` file to a `.pdf` in the same folder.
3. Build the index (preprocess, chunk, embed, store):

   ```shell
   uv run python -m src.build_index
   ```

   Env options: `CHUNKER=semantic` (needs `--extra hybrid`), `CHUNK_SIZE`, `CHUNK_OVERLAP`, `INDEX_BATCH_SIZE`.
4. Query your newsletters:

   ```shell
   # Retrieval only (returns relevant chunks)
   uv run python -m src.query "What did the newsletter say about X?"

   # With LLM answer generation (requires ANTHROPIC_API_KEY in .env)
   USE_GENERATOR=1 uv run python -m src.query "What did the newsletter say about X?"

   # Interactive mode
   uv run python -m src.query
   ```

   Env options: `USE_HYBRID=1` (vector + BM25 + RRF + rerank), `USE_CACHE=1` (Redis), `USE_GENERATOR=1` (LLM answer with citations).
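The `CHUNK_SIZE` and `CHUNK_OVERLAP` options in the indexing step control how documents are split into overlapping windows before embedding. A minimal sketch of that idea (illustrative only — the function name and defaults below are assumptions, not this project's chunker):

```python
def chunk_tokens(tokens, chunk_size=200, chunk_overlap=50):
    """Split a token sequence into overlapping chunks.

    Illustrative sketch only -- not this project's actual chunker.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks

# 10 tokens, window of 4, overlap of 2 -> windows advance by 2
print(chunk_tokens(list(range(10)), chunk_size=4, chunk_overlap=2))
# -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from both neighbors, at the cost of some duplicated tokens in the index.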
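With `USE_HYBRID=1`, the vector and BM25 result lists are merged via Reciprocal Rank Fusion (RRF) before reranking. A minimal sketch of RRF scoring (illustrative only — `rrf_fuse` and `k=60` are assumptions, not this pipeline's code):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank)
    to a document's fused score.

    Illustrative sketch only -- not this pipeline's actual code.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["a", "b", "c"]  # e.g. top hits from the vector index
bm25_hits = ["b", "d", "a"]    # e.g. top hits from BM25
print(rrf_fuse([vector_hits, bm25_hits]))
# -> ['b', 'a', 'd', 'c']  ("b" ranks high in both lists)
```

Because RRF only uses ranks, it needs no score normalization between the two retrievers, which is why it is a common fusion choice for hybrid setups like this one.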
A Streamlit-based web interface for interacting with your newsletter archive without using the command line.
```shell
uv run streamlit run src/app.py
```

Opens at http://localhost:8501 by default.
```
┌─────────────────────────────────────────────────┐
│ Newsletter RAG │
├──────────────┬──────────────────────────────────┤
│ SIDEBAR │ MAIN AREA │
│ │ │
│ [Upload .eml]│ 🔍 [Search bar...............] │
│ │ ☑ Generate answer with LLM │
│ [Convert PDF]│ [Search] │
│ │ │
│ [Rebuild │ Answer: │
│ Index] │ ───────────────────────────── │
│ │ <LLM generated answer> │
│ ─────────────│ │
│ Data Status: │ Sources: │
│ • PDFs: 5 │ - source1.pdf │
│ • EMLs: 3 │ - source2.pdf │
│ │ │
│ │ Retrieved Chunks: │
│ │ [1] source.pdf - chunk text... │
│ │ [2] source.pdf - chunk text... │
└──────────────┴──────────────────────────────────┘
```
| Feature | Description |
|---|---|
| Upload .eml | Drag-and-drop or browse to upload email files (multiple files supported) |
| Convert to PDF | Converts uploaded .eml files to PDFs in data/ for indexing |
| Rebuild Index | Re-indexes all documents in data/ (run after adding new files) |
| Data Status | Shows count of PDFs and EML files in the data directory |
| Search | Enter natural language queries to search your newsletters |
| LLM Toggle | Enable/disable LLM-generated answers (requires ANTHROPIC_API_KEY) |
| Results | Displays answer, sources, and expandable retrieved chunks |
- Upload: Use the sidebar file uploader to add `.eml` newsletter files
- Convert: Click "Convert to PDF" to process uploaded emails
- Index: Click "Rebuild Index" to add new documents to the search index
- Search: Enter a question in the main area and click "Search"
- Review: View the generated answer and expand chunks for details
The web UI respects the same environment variables as the CLI:
- `ANTHROPIC_API_KEY` — required for LLM answer generation
- `USE_HYBRID` — enable hybrid retrieval (vector + BM25)
- `USE_CACHE` — enable Redis semantic cache
- `CHUNK_SIZE`, `CHUNK_OVERLAP` — chunking parameters for indexing
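A semantic cache like the one `USE_CACHE` enables returns a stored answer when a new query's embedding is close enough to a previously answered one, rather than requiring an exact text match. A minimal in-memory sketch of that idea (Redis-free and illustrative only — the class, threshold, and linear scan below are assumptions, not this project's cache):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

class SemanticCache:
    """In-memory stand-in for a Redis-backed semantic cache.

    Illustrative sketch only -- not this project's implementation.
    """
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def get(self, embedding):
        # Linear scan: return the first cached answer whose stored query
        # embedding is similar enough to the incoming one.
        for cached_emb, answer in self.entries:
            if cosine(embedding, cached_emb) >= self.threshold:
                return answer
        return None  # cache miss

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

cache = SemanticCache()
cache.put([1.0, 0.0], "cached answer")
print(cache.get([0.99, 0.05]))  # near-duplicate query -> cache hit
print(cache.get([0.0, 1.0]))    # unrelated query -> None
```

The payoff is that paraphrased repeats of earlier questions skip retrieval and generation entirely; the threshold trades hit rate against the risk of serving a stale or mismatched answer.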