annthespy/newsletter-rag
newsletter-rag

A RAG pipeline for ingesting newsletters and querying them with natural language. Supports production-style features: pdfplumber preprocessing, token/semantic chunking, hybrid retrieval (vector + BM25 + RRF + Cross-Encoder rerank), semantic cache (Redis), LLM answer generation (Anthropic Claude), and monitoring.
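To illustrate the semantic-cache idea: a new query reuses a stored answer when its embedding is close enough to one seen before. The real pipeline uses Redis and FastEmbed embeddings; the sketch below stands in with plain Python lists and a hypothetical 0.9 similarity threshold.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cache_lookup(query_vec, cache, threshold=0.9):
    """Return the cached answer for the most similar past query, or None."""
    best_score, best_answer = -1.0, None
    for cached_vec, answer in cache:
        score = cosine(query_vec, cached_vec)
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= threshold else None

# Toy 2-d "embeddings"; real ones come from an embedding model.
cache = [([1.0, 0.0], "cached answer about X")]
print(cache_lookup([0.99, 0.05], cache))  # near-duplicate query -> hit
print(cache_lookup([0.0, 1.0], cache))    # unrelated query -> None
```

A cache hit skips retrieval and generation entirely, which is where the latency and cost savings come from.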

Setup

uv sync

Optional extras (see pyproject.toml):

  • uv sync --extra anthropic — LLM answer generation (Anthropic Claude)
  • uv sync --extra hybrid — Cross-Encoder rerank + SemanticChunker (sentence-transformers, scikit-learn)
  • uv sync --extra cache — Semantic cache (Redis)
  • uv sync --extra production — anthropic + hybrid + cache

Copy .env.example to .env and set ANTHROPIC_API_KEY for generation. Embeddings use FastEmbed by default.

Usage

1. Add your data to the data/ folder:

  • Supported formats: .pdf, .txt, .md, .eml (emails)
  • Example: Copy your newsletter emails or PDFs into data/
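For context, discovering indexable files by extension might look like the sketch below. The extensions come from the supported-formats list above; the function name and layout are illustrative, not the repo's actual code.

```python
from pathlib import Path

# Extensions taken from the supported-formats list above.
SUPPORTED = {".pdf", ".txt", ".md", ".eml"}

def find_documents(data_dir="data"):
    """Return supported files under data_dir, sorted for stable ordering."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.suffix.lower() in SUPPORTED)
```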

2. Convert .eml → PDF (only if you have .eml files):

uv run python -m src.eml_to_pdf

This converts each .eml file to a .pdf in the same folder.
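The parsing half of that conversion can be sketched with the stdlib email module; the PDF-rendering half is omitted, and eml_to_text is a hypothetical helper, not the module's actual API.

```python
from email import policy
from email.parser import BytesParser

def eml_to_text(raw_bytes):
    """Extract the subject and preferred text body from raw .eml bytes."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    # Prefer the plain-text part; fall back to HTML if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    text = body.get_content() if body is not None else ""
    return f"Subject: {msg['subject']}\n\n{text}"

sample = (b"From: news@example.com\r\n"
          b"Subject: Weekly digest\r\n"
          b"Content-Type: text/plain\r\n\r\n"
          b"Hello readers.\r\n")
print(eml_to_text(sample))
```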

3. Build the index (preprocess, chunk, embed, store):

uv run python -m src.build_index

Env options: CHUNKER=semantic (needs --extra hybrid), CHUNK_SIZE, CHUNK_OVERLAP, INDEX_BATCH_SIZE.
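CHUNK_SIZE and CHUNK_OVERLAP control fixed-size chunking with overlapping windows. A minimal sketch of that behaviour, using whitespace-split words in place of model tokens:

```python
def chunk_words(text, chunk_size=5, overlap=2):
    """Split text into windows of chunk_size words, overlapping by overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    # Each window starts chunk_size - overlap words after the previous one.
    for start in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # last window reached the end
            break
    return chunks

for chunk in chunk_words("one two three four five six seven eight nine"):
    print(chunk)
```

The overlap repeats a few words at each boundary so a sentence split across two chunks still appears whole in at least one of them.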

4. Query your newsletters:

# Retrieval only (returns relevant chunks)
uv run python -m src.query "What did the newsletter say about X?"

# With LLM answer generation (requires ANTHROPIC_API_KEY in .env)
USE_GENERATOR=1 uv run python -m src.query "What did the newsletter say about X?"

# Interactive mode
uv run python -m src.query

Env options: USE_HYBRID=1 (vector + BM25 + RRF + rerank), USE_CACHE=1 (Redis), USE_GENERATOR=1 (LLM answer with citations).
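The RRF step in hybrid retrieval merges the vector and BM25 ranked lists by summing reciprocal ranks, so documents that rank well in both lists rise to the top. A minimal sketch (the document IDs are made up, and k=60 is the conventional RRF constant, not necessarily this repo's setting):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector-search order
bm25_hits = ["doc_b", "doc_d", "doc_a"]    # hypothetical BM25 order
print(rrf([vector_hits, bm25_hits]))       # doc_b ranks well in both lists
```

The fused list is what the Cross-Encoder then reranks before the top chunks go to the LLM.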

Web UI

A Streamlit-based web interface for interacting with your newsletter archive without using the command line.

Running the Web UI

uv run streamlit run src/app.py

Opens at http://localhost:8501 by default.

Interface Layout

┌────────────────────────────────────────────────┐
│  Newsletter RAG                                │
├──────────────┬─────────────────────────────────┤
│   SIDEBAR    │            MAIN AREA            │
│              │                                 │
│ [Upload .eml]│  🔍 [Search bar...............]  │
│              │  ☑ Generate answer with LLM     │
│ [Convert PDF]│  [Search]                       │
│              │                                 │
│ [Rebuild     │  Answer:                        │
│  Index]      │  ─────────────────────────────  │
│              │  <LLM generated answer>         │
│ ─────────────│                                 │
│ Data Status: │  Sources:                       │
│ • PDFs: 5    │  - source1.pdf                  │
│ • EMLs: 3    │  - source2.pdf                  │
│              │                                 │
│              │  Retrieved Chunks:              │
│              │  [1] source.pdf - chunk text... │
│              │  [2] source.pdf - chunk text... │
└──────────────┴─────────────────────────────────┘

Features

Feature         Description
Upload .eml     Drag-and-drop or browse to upload email files (multiple files supported)
Convert to PDF  Converts uploaded .eml files to PDFs in data/ for indexing
Rebuild Index   Re-indexes all documents in data/ (run after adding new files)
Data Status     Shows count of PDFs and EML files in the data directory
Search          Enter natural-language queries to search your newsletters
LLM Toggle      Enable/disable LLM-generated answers (requires ANTHROPIC_API_KEY)
Results         Displays answer, sources, and expandable retrieved chunks

Workflow

  1. Upload: Use the sidebar file uploader to add .eml newsletter files
  2. Convert: Click "Convert to PDF" to process uploaded emails
  3. Index: Click "Rebuild Index" to add new documents to the search index
  4. Search: Enter a question in the main area and click "Search"
  5. Review: View the generated answer and expand chunks for details

Configuration

The web UI respects the same environment variables as the CLI:

  • ANTHROPIC_API_KEY — Required for LLM answer generation
  • USE_HYBRID — Enable hybrid retrieval (vector + BM25)
  • USE_CACHE — Enable Redis semantic cache
  • CHUNK_SIZE, CHUNK_OVERLAP — Chunking parameters for indexing
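A sketch of how such flags are commonly read in Python. This is an assumed pattern, not the repo's actual parsing, and the defaults shown are illustrative.

```python
import os

def env_flag(name, default="0"):
    """Treat the string "1" as enabled; anything else is disabled."""
    return os.getenv(name, default) == "1"

def env_int(name, default):
    """Read an integer setting, falling back to default when unset."""
    return int(os.getenv(name, str(default)))

os.environ["USE_HYBRID"] = "1"     # as if exported in the shell or .env
print(env_flag("USE_HYBRID"))      # True
print(env_int("CHUNK_SIZE", 512))  # 512 is an assumed default, not the repo's
```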
