Intelligent Document Processing with AI-Powered Analytics
Local, zero-cost alternative to AWS Textract + Bedrock
DocSage is a local, zero-cost platform for AI-powered document intelligence built on top of an Intelligent Document Processing (IDP) pipeline. At its core is an AI agent that processes documents and answers questions in natural language.
It ingests PDF documents (e.g., invoices, bank statements, forms), extracts structured data, and lets you ask natural language questions like:
- "What did I spend on rent in the last 3 months?"
- "Which vendors are above $5,000 this quarter?"
- "Show me anomalies in monthly spend and the supporting documents."
The implementation is designed to mirror how this would run on AWS (Textract, Bedrock, S3, RDS, OpenSearch) but uses 100% free, local tools instead.
I built this to practice end-to-end architecture for an intelligent document processing system. My learning focus was:
- Designing a tool-using LLM agent wired into SQL, metrics, and RAG
- Building an IDP pipeline (OCR, classification, field extraction) for financial docs
- Structuring a FastAPI + Streamlit system that's easy to "lift-and-shift" to AWS
- Ingestion & IDP
  - PDFs are stored in `data/raw_docs/`.
  - A Python-based IDP pipeline:
    - Uses OCR (Tesseract) and PDF parsing (`pdfplumber`) to extract text.
    - Classifies document types (invoice, statement, etc.).
    - Extracts key fields (dates, amounts, vendors, categories).
  - Structured outputs are saved as JSON/CSV and loaded into a relational database.
- Storage & Analytics
  - All structured data is stored in PostgreSQL (via Docker Compose) by default.
  - SQLite is available as an optional alternative for development.
  - Derived metrics (e.g., monthly totals, category breakdowns, vendor stats) are computed and exposed as reusable "metrics functions".
- RAG + Vector Search
  - Document chunks and summaries are embedded with a free `sentence-transformers` model (a minimal embedding/FAISS sketch follows this architecture overview).
  - Embeddings are stored in a local FAISS index (no external vector DB).
  - This enables the agent to retrieve supporting documents for its answers.
- AI Agent (LLM + Tools)
  - The agent is the centerpiece of the system: a local LLM (via Ollama) provides reasoning and natural language generation.
  - The agent is wired with tools (using LangChain/LlamaIndex-style patterns):
    - `sql_tool`: run parameterized SQL queries on the transactional DB.
    - `metrics_tool`: call pre-defined Python functions for KPIs.
    - `rag_tool`: search FAISS for relevant document snippets.
  - The agent decides which tools to call based on the user's query, aggregates the results, and explains the insight in plain language, referencing the underlying data and documents (a simplified tool-routing sketch appears after the AWS mapping table below).
- API & UI
  - Backend: FastAPI application exposing:
    - `POST /chat/insights` – main endpoint for DocSage's AI agent.
    - `GET /health` – health check endpoint.
    - `GET /docs` – interactive API documentation.
  - Frontend: Streamlit app with 8 comprehensive pages:
    - 📊 Analytics Dashboard – Time-series analytics, spending trends, vendor analysis, and forecasting.
    - 💬 Chat – Natural language interface to interact with DocSage.
    - 📄 Documents – Document management with visual overlays, interactive corrections, and real-time upload.
    - ⚠️ Anomalies – Automated anomaly detection (duplicates, unusual amounts, missing fields).
    - 🔍 Document Comparison – Side-by-side document comparison and price change tracking.
    - 📈 Insights Report – AI-generated natural language insights and recommendations.
    - 🔗 Receipt Matching – Automatic receipt-to-invoice matching with fuzzy matching.
    - 📤 Export – Export data to Excel and Markdown formats.
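To make the RAG layer concrete, here is a minimal sketch of the embed-and-index step, assuming `sentence-transformers` and `faiss-cpu` are installed; the chunk texts are placeholders, and the real `app/vectorstore/faiss_store.py` may organize this differently.

```python
# Minimal sketch: embed document chunks and store them in a FAISS index.
# Assumes `pip install sentence-transformers faiss-cpu`; chunk texts are illustrative.
import faiss
from sentence_transformers import SentenceTransformer

chunks = [
    "Invoice #1042 from Acme Corp, total $1,250.00, due 2024-03-01.",
    "Monthly statement: rent payment of $2,000 posted on 2024-02-05.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # matches the EMBEDDING_MODEL default
embeddings = model.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])   # inner product == cosine on normalized vectors
index.add(embeddings)

# Retrieve the most relevant chunk for a question.
query = model.encode(["What did I pay for rent?"], normalize_embeddings=True)
scores, ids = index.search(query, 1)
print(chunks[ids[0][0]], float(scores[0][0]))
```

Normalizing the embeddings lets the inner-product index behave like cosine similarity, which is enough for snippet retrieval at this scale.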
This project intentionally mirrors an AWS-native design:
| AWS Service (Target) | Local / Free Equivalent |
|---|---|
| S3 (document storage) | data/raw_docs/ on local disk |
| Textract (OCR) | Tesseract + pytesseract |
| Comprehend / Bedrock NLU | Local LLM + sentence-transformers |
| RDS / Aurora | PostgreSQL (Docker) - SQLite optional |
| OpenSearch / Kendra | FAISS vector index |
| Bedrock LLM (agents) | Ollama + LangChain/LlamaIndex |
| Lambda / Step Functions | Python services + scripts |
| QuickSight | Streamlit charts + notebooks |
This makes it easy to lift and shift the architecture to AWS later by replacing the local components with managed services.
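To illustrate the tool-routing idea described in the architecture above (this is not the actual `DocSageAgent` implementation, and the tool bodies are stand-ins for the real `sql_tool`, `metrics_tool`, and `rag_tool`), a stripped-down routing loop might look like this:

```python
# Simplified sketch of tool routing; the real agent lets the LLM choose tools,
# whereas this stand-in uses keyword heuristics purely for illustration.
from typing import Callable, Dict

def sql_tool(query: str) -> str:      # placeholder for a parameterized SQL lookup
    return "SQL result for: " + query

def metrics_tool(query: str) -> str:  # placeholder for a pre-defined KPI function
    return "Metrics result for: " + query

def rag_tool(query: str) -> str:      # placeholder for a FAISS similarity search
    return "Relevant document snippets for: " + query

TOOLS: Dict[str, Callable[[str], str]] = {
    "sql": sql_tool,
    "metrics": metrics_tool,
    "rag": rag_tool,
}

def route(query: str) -> str:
    """Pick a tool, call it, and return a grounded answer string."""
    q = query.lower()
    if any(word in q for word in ("trend", "monthly", "breakdown")):
        name = "metrics"
    elif any(word in q for word in ("show", "document", "supporting")):
        name = "rag"
    else:
        name = "sql"
    evidence = TOOLS[name](query)
    return f"[{name}] {evidence}"

print(route("Which vendors are above $5,000 this quarter?"))
```

In the real agent the LLM itself decides which tool to call; the keyword heuristic here only stands in for that decision so the control flow is visible.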
- ✅ End-to-end IDP pipeline:
- OCR + text extraction from PDFs and images (Tesseract + pdfplumber).
- Document classification (invoices, receipts, statements).
- Field extraction into structured tables with confidence scores.
- Real-time document upload with drag-and-drop support.
- ✅ RAG-enabled AI agent:
- DocSage's agent combines SQL analytics with document retrieval.
- Answers questions in natural language and cites source docs.
- Intelligent tool-using agent that chooses between SQL, metrics, and RAG.
- ✅ Time-series analytics:
- Monthly spending trends with interactive charts.
- Daily spending visualization (last 30 days).
- Vendor trends over time.
- Spending forecast using linear regression (3-month prediction).
- ✅ Smart expense categorization:
- LLM-based automatic categorization into 12+ categories.
- Categories: Office Supplies, Software, Travel, Meals, Services, etc.
- ✅ Visual document overlay:
- Highlight extracted fields on document images.
- Color-coded fields with confidence scores.
- Annotated document viewer.
- ✅ Interactive document correction:
- Edit extracted data directly in the UI.
- Track corrections with confidence scores.
- Real-time updates after corrections.
- ✅ Document comparison:
- Side-by-side comparison of documents.
- Similar document finder.
- Price change detection for recurring vendors.
- Price trend charts.
- ✅ Automated anomaly detection:
- Duplicate transaction detection.
- Unusual amount flags (>2 standard deviations; see the sketch below).
- Missing field detection.
- Date anomaly identification.
- Severity levels (High, Medium, Low).
- ✅ Natural language insights generator:
- AI-generated reports using Ollama LLM.
- Spending pattern analysis.
- Cost optimization recommendations.
- Markdown format with downloadable reports.
- ✅ Receipt-to-invoice matching:
- Automatic matching using fuzzy string comparison.
- Vendor name similarity scoring.
- Amount and date tolerance matching.
- Confidence scores for matches.
- ✅ Export functionality:
- Excel export with multiple sheets (Transactions, Vendors, Categories, Anomalies, Documents).
- Summary reports in Markdown format.
- Downloadable files with timestamps.
📖 See FEATURES.md for detailed documentation of all features.
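As a concrete example of the unusual-amount check listed above, here is a minimal pandas sketch that flags transactions more than two standard deviations from the mean; the column names are assumptions for illustration, not the project's actual schema.

```python
# Minimal sketch of the ">2 standard deviations" amount check using pandas.
# Column names (vendor, amount) are assumptions; the 900.00 entry is the planted outlier.
import pandas as pd

df = pd.DataFrame({
    "vendor": ["Acme"] * 10,
    "amount": [100.0, 105.0, 98.0, 102.0, 99.0, 101.0, 103.0, 97.0, 100.0, 900.0],
})

mean, std = df["amount"].mean(), df["amount"].std()
df["z_score"] = (df["amount"] - mean) / std
df["is_unusual"] = df["z_score"].abs() > 2

print(df[df["is_unusual"]][["vendor", "amount", "z_score"]])
```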
- Python 3.9+
- PostgreSQL (via Docker Compose) - Primary database
  - SQLite is available as an optional alternative (set `USE_SQLITE=True` in `.env`)
- Docker (required for PostgreSQL via Docker Compose)
  - Install from: https://www.docker.com/get-started
- Ollama installed and running locally
  - Install from: https://ollama.com/
  - Pull a model: `ollama pull llama3` (or `mistral`, `codellama`, etc.)
- Tesseract OCR (for image OCR)
  - macOS: `brew install tesseract`
  - Linux: `sudo apt-get install tesseract-ocr`
  - Windows: Download from GitHub
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd docsage
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables: create a `.env` file (optional, defaults are provided):

  ```env
  # PostgreSQL (default database)
  POSTGRES_USER=postgres
  POSTGRES_PASSWORD=postgres
  POSTGRES_DB=docsage
  POSTGRES_HOST=localhost
  POSTGRES_PORT=5432
  USE_SQLITE=False  # Set to True to use SQLite instead (not recommended for production)

  # Ollama LLM
  OLLAMA_BASE_URL=http://localhost:11434
  OLLAMA_MODEL=llama3
  ```

- Start PostgreSQL (via Docker Compose):

  ```bash
  docker-compose up -d postgres
  ```

  Note: PostgreSQL is the default and recommended database. To use SQLite instead (not recommended for production), set `USE_SQLITE=True` in your `.env` file and skip this step.

- Create the database (if it doesn't exist):

  ```bash
  docker-compose exec postgres psql -U postgres -c "CREATE DATABASE docsage;"
  ```

  Or, if you prefer to create it manually, connect to PostgreSQL and run:

  ```sql
  CREATE DATABASE docsage;
  ```

- Verify Ollama is running:

  ```bash
  curl http://localhost:11434/api/tags
  ```
Important: Make sure Docker is running and PostgreSQL is started before running these commands.
First, run the database migration (if upgrading from an older version):

```bash
# Activate virtual environment first
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Run migration
python scripts/migrate_database.py
```

Then create tables and optionally seed with sample data:

```bash
# Create tables only
python scripts/seed_db.py 0

# Create tables + seed with 50 sample transactions
python scripts/seed_db.py
```

You can download a free invoice/receipt dataset from Hugging Face:

```bash
# Install dataset library
pip install datasets pillow

# Download sample (first 20 images for testing)
python3 scripts/download_huggingface_dataset.py --split train --max-images 20

# Or download full training set (2,040 images)
python3 scripts/download_huggingface_dataset.py --split train
```

See DATASET_GUIDE.md for detailed instructions.
Place PDF/image files in `data/raw_docs/` and run:

```bash
python3 scripts/ingest_docs.py
```

This will (a minimal extraction sketch follows below):
- Extract text from PDFs/images (using OCR for images)
- Classify document types
- Extract structured fields
- Create transaction records
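Under the hood, text extraction combines `pdfplumber` for digital PDFs with Tesseract OCR for images. A minimal sketch of that combination, assuming both libraries plus `Pillow` and `pytesseract` are installed (the real `app/services/idp_pipeline.py` may differ):

```python
# Minimal sketch of PDF/image text extraction; the real idp_pipeline.py may differ.
from pathlib import Path

import pdfplumber
import pytesseract
from PIL import Image

def extract_text(path: Path) -> str:
    """Return raw text from a PDF (pdfplumber) or an image (Tesseract OCR)."""
    if path.suffix.lower() == ".pdf":
        with pdfplumber.open(path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    return pytesseract.image_to_string(Image.open(path))

# Hypothetical sample file, purely for illustration.
print(extract_text(Path("data/raw_docs/sample_invoice.pdf"))[:200])
```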
After ingesting documents, build the FAISS index:

```bash
python3 scripts/build_embeddings.py
```

Start the FastAPI backend:

```bash
python3 -m app.main
# Or: uvicorn app.main:app --reload
```

The API will be available at http://localhost:8000

In a new terminal, start the Streamlit frontend:

```bash
streamlit run frontend/streamlit_app.py
```

Navigate to http://localhost:8501 in your browser.
Example API request:

```bash
curl -X POST "http://localhost:8000/chat/insights" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What did I spend on rent in the last 3 months?",
    "use_rag": true,
    "use_sql": true
  }'
```

Response:

```json
{
  "answer": "Based on the transaction data...",
  "sources": [...],
  "sql_query": "SELECT SUM(amount) FROM transactions WHERE..."
}
```
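The same request from Python, assuming the API is running locally and the `requests` package is installed:

```python
# Equivalent of the curl example above, using the requests library.
import requests

resp = requests.post(
    "http://localhost:8000/chat/insights",
    json={
        "query": "What did I spend on rent in the last 3 months?",
        "use_rag": True,
        "use_sql": True,
    },
    timeout=120,  # local LLM responses can take a while
)
resp.raise_for_status()
print(resp.json()["answer"])
```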
Project structure:

```
app/
  main.py                      # FastAPI entrypoint
  config.py                    # App settings
  db.py                        # Database connection (SQLAlchemy)
  models.py                    # ORM models (Documents, Transactions, etc.)
  schemas.py                   # Pydantic schemas for API requests/responses
  services/
    idp_pipeline.py            # OCR + extraction pipeline
    rag.py                     # Embedding + FAISS vector search helpers
    sql_tools.py               # Safe SQL wrappers used by the agent
    insights.py                # Metrics / KPI computation functions
    anomaly_detection.py       # Anomaly detection and alerting
    categorization.py          # LLM-based expense categorization
    document_comparison.py     # Document comparison and price tracking
    document_visualization.py  # Visual document overlay with annotations
    export_service.py          # Excel and Markdown export functionality
    insights_generator.py      # AI-generated natural language insights
    receipt_matching.py        # Receipt-to-invoice matching
  agents/
    insight_agent.py           # DocSageAgent class - core AI agent orchestration logic
    tools.py                   # Tool definitions exposed to the LLM
  vectorstore/
    faiss_store.py             # FAISS index management
data/
  raw_docs/                    # Input PDFs
  processed/                   # Extracted JSON/CSV
  embeddings/                  # FAISS indexes, metadata
frontend/
  streamlit_app.py             # Comprehensive UI with 8 pages: Analytics, Chat, Documents, Anomalies, Comparison, Insights, Receipt Matching, Export
scripts/
  ingest_docs.py               # CLI: load PDFs into the system
  build_embeddings.py          # Build FAISS vector index
  seed_db.py                   # Initialize database and seed sample data
  migrate_database.py          # Database migration script (adds new tables/columns)
  download_huggingface_dataset.py   # Download invoice/receipt datasets
  add_documents_from_folder.py      # Batch document ingestion
  diagnose_and_fix_transactions.py  # Diagnostic and repair utilities
notebooks/
  exploratory_idp.ipynb
  analytics_demo.ipynb
```
Key configuration options in `app/config.py`:

- Database (PostgreSQL is default):
  - `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`, `POSTGRES_HOST`, `POSTGRES_PORT`: PostgreSQL connection settings
  - `USE_SQLITE`: Set to `True` to use SQLite instead (default: `False`; PostgreSQL is recommended)
- LLM:
  - `OLLAMA_MODEL`: LLM model to use (default: `llama3`)
  - `OLLAMA_BASE_URL`: Ollama API endpoint (default: `http://localhost:11434`)
- Vector Store:
  - `EMBEDDING_MODEL`: Embedding model (default: `all-MiniLM-L6-v2`)
  - `FAISS_INDEX_PATH`: Path to FAISS index file
- API:
  - `API_HOST`: API host (default: `0.0.0.0`)
  - `API_PORT`: API port (default: `8000`)
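These settings are driven by environment variables; a minimal sketch of how such a settings object can be wired with plain `os.getenv` (the actual `app/config.py` may use a different mechanism, e.g. Pydantic settings):

```python
# Minimal sketch of environment-driven settings; the real app/config.py may differ.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    postgres_user: str = os.getenv("POSTGRES_USER", "postgres")
    postgres_password: str = os.getenv("POSTGRES_PASSWORD", "postgres")
    postgres_db: str = os.getenv("POSTGRES_DB", "docsage")
    postgres_host: str = os.getenv("POSTGRES_HOST", "localhost")
    postgres_port: int = int(os.getenv("POSTGRES_PORT", "5432"))
    use_sqlite: bool = os.getenv("USE_SQLITE", "False").lower() == "true"
    ollama_model: str = os.getenv("OLLAMA_MODEL", "llama3")
    ollama_base_url: str = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")

settings = Settings()
print(settings.postgres_host, settings.ollama_model)
```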
Ollama connection errors:
- Ensure Ollama is running: `ollama serve`
- Check the model is available: `ollama list`
- Pull the model if needed: `ollama pull llama3`

Tesseract not found:
- Install Tesseract (see Prerequisites)
- On macOS, ensure it's in PATH: `which tesseract`

Database connection errors:
- Ensure PostgreSQL is running: `docker-compose ps`
- Check connection settings in `.env` or `app/config.py`

No documents ingested:
- Place PDF/image files in `data/raw_docs/`
- Run `python scripts/ingest_docs.py`
- Supported formats: PDF, PNG, JPG, JPEG

Database schema errors:
- If you see errors about missing columns or tables, run `python3 scripts/migrate_database.py`
- This adds new tables (`document_corrections`) and columns (`confidence_score`, `is_corrected`)
Adding a new document type (a hedged sketch follows this list):
- Update `classify_document()` in `app/services/idp_pipeline.py`
- Add an extraction function (e.g., `extract_form_fields()`)
- Update `parse_document()` to handle the new type
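A hedged sketch of that flow; the real `classify_document()` and `parse_document()` in `app/services/idp_pipeline.py` may have different signatures, and the "form" keywords and regex below are purely illustrative:

```python
# Illustrative-only sketch of adding a new document type; the real functions in
# app/services/idp_pipeline.py may have different signatures and logic.
import re

def classify_document(text: str) -> str:
    """Very rough keyword-based classification (placeholder logic)."""
    lowered = text.lower()
    if "invoice" in lowered:
        return "invoice"
    if "statement" in lowered:
        return "statement"
    if "application form" in lowered or "please fill" in lowered:
        return "form"  # <-- new document type
    return "unknown"

def extract_form_fields(text: str) -> dict:
    """Hypothetical extractor for the new 'form' type."""
    name = re.search(r"Name:\s*(.+)", text)
    return {"applicant_name": name.group(1).strip() if name else None}

def parse_document(text: str) -> dict:
    doc_type = classify_document(text)
    fields = extract_form_fields(text) if doc_type == "form" else {}
    return {"doc_type": doc_type, "fields": fields}

print(parse_document("Application Form\nName: Jane Doe\nPlease fill all fields."))
```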
Adding a new metric (a hedged sketch follows this list):
- Add a function to `app/services/insights.py`
- Update `create_metrics_tool()` in `app/agents/tools.py`
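For example, an "average amount per vendor" metric could be added along these lines; the SQL, table and column names, and the psycopg2 connection URL are assumptions about the schema, and the actual `create_metrics_tool()` wiring may differ:

```python
# Illustrative-only metrics function; table/column names are assumed, and the
# real app/services/insights.py and app/agents/tools.py may differ.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/docsage")

def average_amount_per_vendor(limit: int = 10) -> list[dict]:
    """Return the top vendors by average transaction amount."""
    query = text("""
        SELECT vendor, AVG(amount) AS avg_amount
        FROM transactions
        GROUP BY vendor
        ORDER BY avg_amount DESC
        LIMIT :limit
    """)
    with engine.connect() as conn:
        rows = conn.execute(query, {"limit": limit})
        return [dict(row._mapping) for row in rows]

# The new function would then be registered in create_metrics_tool(), e.g.
# {"avg_per_vendor": average_amount_per_vendor}, so the agent can call it by name.
```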
DocSage can be deployed for free using Railway or Render with free LLM APIs (Groq or Hugging Face).
- Get a free Groq API key: console.groq.com
- Deploy to Railway: Connect your GitHub repo at railway.app
- Set environment variables:
  - `LLM_PROVIDER=groq`
  - `GROQ_API_KEY=your_key`
  - Database credentials (Railway provides these automatically)
See DEPLOYMENT.md for detailed deployment instructions including:
- Railway deployment (recommended)
- Render deployment
- Docker Compose production setup
- Free LLM API setup (Groq, Hugging Face)
- Environment variable configuration
Planned enhancements:
- Advanced text chunking strategies
- Multi-turn conversation support
- PDF report generation (using reportlab)
- Real-time document processing webhooks
- AWS deployment guide
- Email integration for automatic document processing
- Multi-language support
- Advanced ML models for better extraction accuracy
- Budget tracking and alerts
- Approval workflows
What I learned:
- How to design tools and guardrails so an LLM can safely query a SQL DB
- How to combine RAG + analytics (FAISS + metrics functions + SQL) for grounded insights
- How to mirror a managed-cloud architecture with local components first
Next steps:
- Add GitHub Actions to run linting on each push
- Swap local components for AWS services (Textract, Bedrock, RDS) in a branch
MIT License
Contributions welcome! Please open an issue or submit a pull request.