Intelligent Document Processing with AI-Powered Analytics
Local, zero-cost alternative to AWS Textract + Bedrock
DocSage is a local, zero-cost platform for AI-powered document intelligence built on top of an Intelligent Document Processing (IDP) pipeline. At its core is an AI agent that processes documents and answers questions in natural language.
It ingests PDF documents (e.g., invoices, bank statements, forms), extracts structured data, and lets you ask natural language questions like:
- "What did I spend on rent in the last 3 months?"
- "Which vendors are above $5,000 this quarter?"
- "Show me anomalies in monthly spend and the supporting documents."
The implementation is designed to mirror how this would run on AWS (Textract, Bedrock, S3, RDS, OpenSearch) but uses 100% free, local tools instead.
I built this to practice end-to-end architecture for an intelligent document processing system. My learning focus was:
- Designing a tool-using LLM agent wired into SQL, metrics, and RAG
- Building an IDP pipeline (OCR, classification, field extraction) for financial docs
- Structuring a FastAPI + Streamlit system that's easy to "lift-and-shift" to AWS
- Ingestion & IDP
  - PDFs are stored in `data/raw_docs/`.
  - A Python-based IDP pipeline:
    - Uses OCR (Tesseract) and PDF parsing (`pdfplumber`) to extract text.
    - Classifies document types (invoice, statement, etc.).
    - Extracts key fields (dates, amounts, vendors, categories).
  - Structured outputs are saved as JSON/CSV and loaded into a relational database.
- Storage & Analytics
  - All structured data is stored in PostgreSQL (via Docker Compose) by default.
  - SQLite is available as an optional alternative for development.
  - Derived metrics (e.g., monthly totals, category breakdowns, vendor stats) are computed and exposed as reusable "metrics functions".
- RAG + Vector Search
  - Document chunks and summaries are embedded with a free `sentence-transformers` model (a minimal embedding/FAISS sketch follows this architecture overview).
  - Embeddings are stored in a local FAISS index (no external vector DB).
  - This enables the agent to retrieve supporting documents for its answers.
- AI Agent (LLM + Tools)
  - The agent is the centerpiece of the system: a local LLM (via Ollama) provides reasoning and natural language generation.
  - The agent is wired with tools (using LangChain/LlamaIndex-style patterns):
    - `sql_tool`: run parameterized SQL queries on the transactional DB.
    - `metrics_tool`: call pre-defined Python functions for KPIs.
    - `rag_tool`: search FAISS for relevant document snippets.
  - The agent decides which tools to call based on the user's query, aggregates the results, and explains the insight in plain language, referencing the underlying data and documents (a simplified tool-routing sketch appears after the AWS mapping table below).
- API & UI
  - Backend: FastAPI application exposing:
    - `POST /chat/insights` – main endpoint for DocSage's AI agent.
    - `GET /health` – health check endpoint.
    - `GET /docs` – interactive API documentation.
  - Frontend: Streamlit app with 8 comprehensive pages:
    - 📊 Analytics Dashboard – Time-series analytics, spending trends, vendor analysis, and forecasting.
    - 💬 Chat – Natural language interface to interact with DocSage.
    - 📄 Documents – Document management with visual overlays, interactive corrections, and real-time upload.
    - ⚠️ Anomalies – Automated anomaly detection (duplicates, unusual amounts, missing fields).
    - 🔍 Document Comparison – Side-by-side document comparison and price change tracking.
    - 📈 Insights Report – AI-generated natural language insights and recommendations.
    - 🔗 Receipt Matching – Automatic receipt-to-invoice matching with fuzzy matching.
    - 📤 Export – Export data to Excel and Markdown formats.
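To make the RAG layer concrete, here is a minimal sketch of the embed-and-index step, assuming `sentence-transformers` and `faiss-cpu` are installed; the chunk texts are placeholders, and the real `app/vectorstore/faiss_store.py` may organize this differently.

```python
# Minimal sketch: embed document chunks and store them in a FAISS index.
# Assumes `pip install sentence-transformers faiss-cpu`; chunk texts are illustrative.
import faiss
from sentence_transformers import SentenceTransformer

chunks = [
    "Invoice #1042 from Acme Corp, total $1,250.00, due 2024-03-01.",
    "Monthly statement: rent payment of $2,000 posted on 2024-02-05.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # matches the EMBEDDING_MODEL default
embeddings = model.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])   # inner product == cosine on normalized vectors
index.add(embeddings)

# Retrieve the most relevant chunk for a question.
query = model.encode(["What did I pay for rent?"], normalize_embeddings=True)
scores, ids = index.search(query, 1)
print(chunks[ids[0][0]], float(scores[0][0]))
```

Normalizing the embeddings lets the inner-product index behave like cosine similarity, which is enough for snippet retrieval at this scale.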
This project intentionally mirrors an AWS-native design:
| AWS Service (Target) | Local / Free Equivalent |
|---|---|
| S3 (document storage) | data/raw_docs/ on local disk |
| Textract (OCR) | Tesseract + pytesseract |
| Comprehend / Bedrock NLU | Local LLM + sentence-transformers |
| RDS / Aurora | PostgreSQL (Docker) - SQLite optional |
| OpenSearch / Kendra | FAISS vector index |
| Bedrock LLM (agents) | Ollama + LangChain/LlamaIndex |
| Lambda / Step Functions | Python services + scripts |
| QuickSight | Streamlit charts + notebooks |
This makes it easy to lift and shift the architecture to AWS later by replacing the local components with managed services.
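To illustrate the tool-routing idea described in the architecture above (this is not the actual `DocSageAgent` implementation, and the tool bodies are stand-ins for the real `sql_tool`, `metrics_tool`, and `rag_tool`), a stripped-down routing loop might look like this:

```python
# Simplified sketch of tool routing; the real agent lets the LLM choose tools,
# whereas this stand-in uses keyword heuristics purely for illustration.
from typing import Callable, Dict

def sql_tool(query: str) -> str:      # placeholder for a parameterized SQL lookup
    return "SQL result for: " + query

def metrics_tool(query: str) -> str:  # placeholder for a pre-defined KPI function
    return "Metrics result for: " + query

def rag_tool(query: str) -> str:      # placeholder for a FAISS similarity search
    return "Relevant document snippets for: " + query

TOOLS: Dict[str, Callable[[str], str]] = {
    "sql": sql_tool,
    "metrics": metrics_tool,
    "rag": rag_tool,
}

def route(query: str) -> str:
    """Pick a tool, call it, and return a grounded answer string."""
    q = query.lower()
    if any(word in q for word in ("trend", "monthly", "breakdown")):
        name = "metrics"
    elif any(word in q for word in ("show", "document", "supporting")):
        name = "rag"
    else:
        name = "sql"
    evidence = TOOLS[name](query)
    return f"[{name}] {evidence}"

print(route("Which vendors are above $5,000 this quarter?"))
```

In the real agent the LLM itself decides which tool to call; the keyword heuristic here only stands in for that decision so the control flow is visible.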
- ✅ End-to-end IDP pipeline:
- OCR + text extraction from PDFs and images (Tesseract + pdfplumber).
- Document classification (invoices, receipts, statements).
- Field extraction into structured tables with confidence scores.
- Real-time document upload with drag-and-drop support.
- ✅ RAG-enabled AI agent:
- DocSage's agent combines SQL analytics with document retrieval.
- Answers questions in natural language and cites source docs.
- Intelligent tool-using agent that chooses between SQL, metrics, and RAG.
- ✅ Time-series analytics:
- Monthly spending trends with interactive charts.
- Daily spending visualization (last 30 days).
- Vendor trends over time.
- Spending forecast using linear regression (3-month prediction).
- ✅ Smart expense categorization:
- LLM-based automatic categorization into 12+ categories.
- Categories: Office Supplies, Software, Travel, Meals, Services, etc.
- ✅ Visual document overlay:
- Highlight extracted fields on document images.
- Color-coded fields with confidence scores.
- Annotated document viewer.
- ✅ Interactive document correction:
- Edit extracted data directly in the UI.
- Track corrections with confidence scores.
- Real-time updates after corrections.
- ✅ Document comparison:
- Side-by-side comparison of documents.
- Similar document finder.
- Price change detection for recurring vendors.
- Price trend charts.
- ✅ Automated anomaly detection:
- Duplicate transaction detection.
- Unusual amount flags (>2 standard deviations; see the sketch below).
- Missing field detection.
- Date anomaly identification.
- Severity levels (High, Medium, Low).
- ✅ Natural language insights generator:
- AI-generated reports using Ollama LLM.
- Spending pattern analysis.
- Cost optimization recommendations.
- Markdown format with downloadable reports.
- ✅ Receipt-to-invoice matching:
- Automatic matching using fuzzy string comparison.
- Vendor name similarity scoring.
- Amount and date tolerance matching.
- Confidence scores for matches.
- ✅ Export functionality:
- Excel export with multiple sheets (Transactions, Vendors, Categories, Anomalies, Documents).
- Summary reports in Markdown format.
- Downloadable files with timestamps.
📖 See FEATURES.md for detailed documentation of all features.
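As a concrete example of the unusual-amount check listed above, here is a minimal pandas sketch that flags transactions more than two standard deviations from the mean; the column names are assumptions for illustration, not the project's actual schema.

```python
# Minimal sketch of the ">2 standard deviations" amount check using pandas.
# Column names (vendor, amount) are assumptions; the 900.00 entry is the planted outlier.
import pandas as pd

df = pd.DataFrame({
    "vendor": ["Acme"] * 10,
    "amount": [100.0, 105.0, 98.0, 102.0, 99.0, 101.0, 103.0, 97.0, 100.0, 900.0],
})

mean, std = df["amount"].mean(), df["amount"].std()
df["z_score"] = (df["amount"] - mean) / std
df["is_unusual"] = df["z_score"].abs() > 2

print(df[df["is_unusual"]][["vendor", "amount", "z_score"]])
```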
- Python 3.9+
- PostgreSQL (via Docker Compose) - Primary database
  - SQLite is available as an optional alternative (set `USE_SQLITE=True` in `.env`)
- Docker (required for PostgreSQL via Docker Compose)
  - Install from: https://www.docker.com/get-started
- Ollama installed and running locally
  - Install from: https://ollama.com/
  - Pull a model: `ollama pull llama3` (or `mistral`, `codellama`, etc.)
- Tesseract OCR (for image OCR)
  - macOS: `brew install tesseract`
  - Linux: `sudo apt-get install tesseract-ocr`
  - Windows: Download from GitHub
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd docsage
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables: create a `.env` file (optional, defaults are provided):

  ```env
  # PostgreSQL (default database)
  POSTGRES_USER=postgres
  POSTGRES_PASSWORD=postgres
  POSTGRES_DB=docsage
  POSTGRES_HOST=localhost
  POSTGRES_PORT=5432
  USE_SQLITE=False  # Set to True to use SQLite instead (not recommended for production)

  # Ollama LLM
  OLLAMA_BASE_URL=http://localhost:11434
  OLLAMA_MODEL=llama3
  ```

- Start PostgreSQL (via Docker Compose):

  ```bash
  docker-compose up -d postgres
  ```

  Note: PostgreSQL is the default and recommended database. To use SQLite instead (not recommended for production), set `USE_SQLITE=True` in your `.env` file and skip this step.

- Create the database (if it doesn't exist):

  ```bash
  docker-compose exec postgres psql -U postgres -c "CREATE DATABASE docsage;"
  ```

  Or, if you prefer to create it manually, connect to PostgreSQL and run:

  ```sql
  CREATE DATABASE docsage;
  ```

- Verify Ollama is running:

  ```bash
  curl http://localhost:11434/api/tags
  ```
Important: Make sure Docker is running and PostgreSQL is started before running these commands.
First, run the database migration (if upgrading from an older version):

```bash
# Activate virtual environment first
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Run migration
python scripts/migrate_database.py
```

Then create tables and optionally seed with sample data:

```bash
# Create tables only
python scripts/seed_db.py 0

# Create tables + seed with 50 sample transactions
python scripts/seed_db.py
```

You can download a free invoice/receipt dataset from Hugging Face:

```bash
# Install dataset library
pip install datasets pillow

# Download sample (first 20 images for testing)
python3 scripts/download_huggingface_dataset.py --split train --max-images 20

# Or download full training set (2,040 images)
python3 scripts/download_huggingface_dataset.py --split train
```

See DATASET_GUIDE.md for detailed instructions.
Place PDF/image files in `data/raw_docs/` and run:

```bash
python3 scripts/ingest_docs.py
```

This will (a minimal extraction sketch follows below):
- Extract text from PDFs/images (using OCR for images)
- Classify document types
- Extract structured fields
- Create transaction records
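Under the hood, text extraction combines `pdfplumber` for digital PDFs with Tesseract OCR for images. A minimal sketch of that combination, assuming both libraries plus `Pillow` and `pytesseract` are installed (the real `app/services/idp_pipeline.py` may differ):

```python
# Minimal sketch of PDF/image text extraction; the real idp_pipeline.py may differ.
from pathlib import Path

import pdfplumber
import pytesseract
from PIL import Image

def extract_text(path: Path) -> str:
    """Return raw text from a PDF (pdfplumber) or an image (Tesseract OCR)."""
    if path.suffix.lower() == ".pdf":
        with pdfplumber.open(path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    return pytesseract.image_to_string(Image.open(path))

# Hypothetical sample file, purely for illustration.
print(extract_text(Path("data/raw_docs/sample_invoice.pdf"))[:200])
```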
After ingesting documents, build the FAISS index:

```bash
python3 scripts/build_embeddings.py
```

Start the FastAPI backend:

```bash
python3 -m app.main
# Or: uvicorn app.main:app --reload
```

The API will be available at http://localhost:8000

In a new terminal, start the Streamlit frontend:

```bash
streamlit run frontend/streamlit_app.py
```

Navigate to http://localhost:8501 in your browser.
Example API request:

```bash
curl -X POST "http://localhost:8000/chat/insights" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What did I spend on rent in the last 3 months?",
    "use_rag": true,
    "use_sql": true
  }'
```

Response:

```json
{
  "answer": "Based on the transaction data...",
  "sources": [...],
  "sql_query": "SELECT SUM(amount) FROM transactions WHERE..."
}
```
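The same request from Python, assuming the API is running locally and the `requests` package is installed:

```python
# Equivalent of the curl example above, using the requests library.
import requests

resp = requests.post(
    "http://localhost:8000/chat/insights",
    json={
        "query": "What did I spend on rent in the last 3 months?",
        "use_rag": True,
        "use_sql": True,
    },
    timeout=120,  # local LLM responses can take a while
)
resp.raise_for_status()
print(resp.json()["answer"])
```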
Project structure:

```
app/
  main.py                      # FastAPI entrypoint
  config.py                    # App settings
  db.py                        # Database connection (SQLAlchemy)
  models.py                    # ORM models (Documents, Transactions, etc.)
  schemas.py                   # Pydantic schemas for API requests/responses
  services/
    idp_pipeline.py            # OCR + extraction pipeline
    rag.py                     # Embedding + FAISS vector search helpers
    sql_tools.py               # Safe SQL wrappers used by the agent
    insights.py                # Metrics / KPI computation functions
    anomaly_detection.py       # Anomaly detection and alerting
    categorization.py          # LLM-based expense categorization
    document_comparison.py     # Document comparison and price tracking
    document_visualization.py  # Visual document overlay with annotations
    export_service.py          # Excel and Markdown export functionality
    insights_generator.py      # AI-generated natural language insights
    receipt_matching.py        # Receipt-to-invoice matching
  agents/
    insight_agent.py           # DocSageAgent class - core AI agent orchestration logic
    tools.py                   # Tool definitions exposed to the LLM
  vectorstore/
    faiss_store.py             # FAISS index management
data/
  raw_docs/                    # Input PDFs
  processed/                   # Extracted JSON/CSV
  embeddings/                  # FAISS indexes, metadata
frontend/
  streamlit_app.py             # Comprehensive UI with 8 pages: Analytics, Chat, Documents, Anomalies, Comparison, Insights, Receipt Matching, Export
scripts/
  ingest_docs.py               # CLI: load PDFs into the system
  build_embeddings.py          # Build FAISS vector index
  seed_db.py                   # Initialize database and seed sample data
  migrate_database.py          # Database migration script (adds new tables/columns)
  download_huggingface_dataset.py   # Download invoice/receipt datasets
  add_documents_from_folder.py      # Batch document ingestion
  diagnose_and_fix_transactions.py  # Diagnostic and repair utilities
notebooks/
  exploratory_idp.ipynb
  analytics_demo.ipynb
```
Key configuration options in `app/config.py`:

- Database (PostgreSQL is default):
  - `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`, `POSTGRES_HOST`, `POSTGRES_PORT`: PostgreSQL connection settings
  - `USE_SQLITE`: Set to `True` to use SQLite instead (default: `False`; PostgreSQL is recommended)
- LLM:
  - `OLLAMA_MODEL`: LLM model to use (default: `llama3`)
  - `OLLAMA_BASE_URL`: Ollama API endpoint (default: `http://localhost:11434`)
- Vector Store:
  - `EMBEDDING_MODEL`: Embedding model (default: `all-MiniLM-L6-v2`)
  - `FAISS_INDEX_PATH`: Path to FAISS index file
- API:
  - `API_HOST`: API host (default: `0.0.0.0`)
  - `API_PORT`: API port (default: `8000`)
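These settings are driven by environment variables; a minimal sketch of how such a settings object can be wired with plain `os.getenv` (the actual `app/config.py` may use a different mechanism, e.g. Pydantic settings):

```python
# Minimal sketch of environment-driven settings; the real app/config.py may differ.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    postgres_user: str = os.getenv("POSTGRES_USER", "postgres")
    postgres_password: str = os.getenv("POSTGRES_PASSWORD", "postgres")
    postgres_db: str = os.getenv("POSTGRES_DB", "docsage")
    postgres_host: str = os.getenv("POSTGRES_HOST", "localhost")
    postgres_port: int = int(os.getenv("POSTGRES_PORT", "5432"))
    use_sqlite: bool = os.getenv("USE_SQLITE", "False").lower() == "true"
    ollama_model: str = os.getenv("OLLAMA_MODEL", "llama3")
    ollama_base_url: str = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")

settings = Settings()
print(settings.postgres_host, settings.ollama_model)
```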
Ollama connection errors:
- Ensure Ollama is running: `ollama serve`
- Check the model is available: `ollama list`
- Pull the model if needed: `ollama pull llama3`

Tesseract not found:
- Install Tesseract (see Prerequisites)
- On macOS, ensure it's in PATH: `which tesseract`

Database connection errors:
- Ensure PostgreSQL is running: `docker-compose ps`
- Check connection settings in `.env` or `app/config.py`

No documents ingested:
- Place PDF/image files in `data/raw_docs/`
- Run `python scripts/ingest_docs.py`
- Supported formats: PDF, PNG, JPG, JPEG

Database schema errors:
- If you see errors about missing columns or tables, run `python3 scripts/migrate_database.py`
- This adds new tables (`document_corrections`) and columns (`confidence_score`, `is_corrected`)
Adding a new document type (a hedged sketch follows this list):
- Update `classify_document()` in `app/services/idp_pipeline.py`
- Add an extraction function (e.g., `extract_form_fields()`)
- Update `parse_document()` to handle the new type
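A hedged sketch of that flow; the real `classify_document()` and `parse_document()` in `app/services/idp_pipeline.py` may have different signatures, and the "form" keywords and regex below are purely illustrative:

```python
# Illustrative-only sketch of adding a new document type; the real functions in
# app/services/idp_pipeline.py may have different signatures and logic.
import re

def classify_document(text: str) -> str:
    """Very rough keyword-based classification (placeholder logic)."""
    lowered = text.lower()
    if "invoice" in lowered:
        return "invoice"
    if "statement" in lowered:
        return "statement"
    if "application form" in lowered or "please fill" in lowered:
        return "form"  # <-- new document type
    return "unknown"

def extract_form_fields(text: str) -> dict:
    """Hypothetical extractor for the new 'form' type."""
    name = re.search(r"Name:\s*(.+)", text)
    return {"applicant_name": name.group(1).strip() if name else None}

def parse_document(text: str) -> dict:
    doc_type = classify_document(text)
    fields = extract_form_fields(text) if doc_type == "form" else {}
    return {"doc_type": doc_type, "fields": fields}

print(parse_document("Application Form\nName: Jane Doe\nPlease fill all fields."))
```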
Adding a new metric (a hedged sketch follows this list):
- Add a function to `app/services/insights.py`
- Update `create_metrics_tool()` in `app/agents/tools.py`
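For example, an "average amount per vendor" metric could be added along these lines; the SQL, table and column names, and the psycopg2 connection URL are assumptions about the schema, and the actual `create_metrics_tool()` wiring may differ:

```python
# Illustrative-only metrics function; table/column names are assumed, and the
# real app/services/insights.py and app/agents/tools.py may differ.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/docsage")

def average_amount_per_vendor(limit: int = 10) -> list[dict]:
    """Return the top vendors by average transaction amount."""
    query = text("""
        SELECT vendor, AVG(amount) AS avg_amount
        FROM transactions
        GROUP BY vendor
        ORDER BY avg_amount DESC
        LIMIT :limit
    """)
    with engine.connect() as conn:
        rows = conn.execute(query, {"limit": limit})
        return [dict(row._mapping) for row in rows]

# The new function would then be registered in create_metrics_tool(), e.g.
# {"avg_per_vendor": average_amount_per_vendor}, so the agent can call it by name.
```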
DocSage can be deployed for free using Railway or Render with free LLM APIs (Groq or Hugging Face).
- Get a free Groq API key: console.groq.com
- Deploy to Railway: Connect your GitHub repo at railway.app
- Set environment variables:
  - `LLM_PROVIDER=groq`
  - `GROQ_API_KEY=your_key`
  - Database credentials (Railway provides these automatically)
See DEPLOYMENT.md for detailed deployment instructions including:
- Railway deployment (recommended)
- Render deployment
- Docker Compose production setup
- Free LLM API setup (Groq, Hugging Face)
- Environment variable configuration
Planned enhancements:
- Advanced text chunking strategies
- Multi-turn conversation support
- PDF report generation (using reportlab)
- Real-time document processing webhooks
- AWS deployment guide
- Email integration for automatic document processing
- Multi-language support
- Advanced ML models for better extraction accuracy
- Budget tracking and alerts
- Approval workflows
What I learned:
- How to design tools and guardrails so an LLM can safely query a SQL DB
- How to combine RAG + analytics (FAISS + metrics functions + SQL) for grounded insights
- How to mirror a managed-cloud architecture with local components first
Next steps:
- Add GitHub Actions to run linting on each push
- Swap local components for AWS services (Textract, Bedrock, RDS) in a branch
MIT License
Contributions welcome! Please open an issue or submit a pull request.