This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
# Install with dev dependencies
uv pip install -e ".[dev]"
# Run all tests
pytest
# Run a single test file
pytest tests/test_facade.py
# Run a single test by name
pytest tests/test_config.py::test_parse_config_defaults -v
# Build the package
uv build
# CLI usage
docproc --file input.pdf -o output.md
docproc --file input.pdf -o output.md --config docproc.yaml
docproc init-config --env .env
docproc completions [bash|zsh]

docproc is a document extraction library + CLI. It reads PDF/DOCX/PPTX/XLSX files and outputs markdown via a three-stage pipeline:
- Load: docproc/doc/loaders/ contains format-specific loaders (PyMuPDF for PDF, python-docx, python-pptx, openpyxl). All loaders implement the same base interface in loaders/base.py and are selected by loaders/factory.py.
- Extract/Vision: docproc/extractors/vision_llm.py sends PDF page images to a vision-capable LLM (enabled when ingest.use_vision: true). Falls back to native text on any connection/provider error.
- Refine: docproc/refiners/llm_refine.py passes extracted text through an LLM to clean markdown and format LaTeX (enabled when ingest.use_llm_refine: true).
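
The control flow of the three stages can be sketched as follows. This is a minimal, self-contained illustration and not the code in docproc/pipeline.py: `IngestOptions`, `ProviderError`, `run_pipeline`, and the callables passed in are all hypothetical stand-ins, and the exact exception types the real pipeline treats as "connection/provider errors" are an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ProviderError(Exception):
    """Hypothetical error type standing in for a provider-side failure."""


@dataclass
class IngestOptions:
    # Hypothetical mirror of ingest.use_vision / ingest.use_llm_refine.
    use_vision: bool = False
    use_llm_refine: bool = False


def run_pipeline(
    native_text: str,
    options: IngestOptions,
    vision_extract: Optional[Callable[[], str]] = None,
    refine: Optional[Callable[[str], str]] = None,
) -> str:
    """Take loader output, optionally try vision, optionally refine."""
    text = native_text
    if options.use_vision and vision_extract is not None:
        try:
            text = vision_extract()
        except (ConnectionError, TimeoutError, ProviderError):
            # Vision failed: fall back to the natively extracted text.
            text = native_text
    if options.use_llm_refine and refine is not None:
        text = refine(text)
    return text


def offline_vision() -> str:
    raise ConnectionError("provider unreachable")


# Vision is enabled but fails, so the native text is returned unchanged.
fallback = run_pipeline("raw text", IngestOptions(use_vision=True), offline_vision)
```

Passing the vision and refine steps in as callables keeps the sketch testable without network access; the real modules presumably call a provider directly.
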
Entry points:
- docproc/pipeline.py: extract_document_to_text() is the core function used by both the CLI and library. It orchestrates vision extraction → text fallback → LLM refine.
- docproc/facade.py: Docproc class wraps the pipeline with instance-scoped config. Factory classmethods: with_openai(), from_config_path(), from_env().
- docproc/bin/cli.py: CLI entry point registered as docproc in pyproject.toml.
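
To illustrate what "instance-scoped config" means for a facade, here is a hedged sketch of the pattern. `FacadeSketch` and `SketchConfig` are hypothetical and are not the real Docproc class; the classmethod signatures, the default provider of "openai", and the placeholder `extract()` body are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SketchConfig:
    # Hypothetical, much smaller than the real config schema.
    provider: str
    api_key: str = ""


class FacadeSketch:
    """Each instance carries its own config instead of a global singleton."""

    def __init__(self, config: SketchConfig) -> None:
        self.config = config

    @classmethod
    def with_openai(cls, api_key: str) -> "FacadeSketch":
        # Assumed signature; the real with_openai() may take other arguments.
        return cls(SketchConfig(provider="openai", api_key=api_key))

    @classmethod
    def from_env(cls, environ: Mapping[str, str]) -> "FacadeSketch":
        # Reads the env vars named in this file; the default is an assumption.
        provider = environ.get("DOCPROC_PRIMARY_AI", "openai")
        return cls(SketchConfig(provider, environ.get("OPENAI_API_KEY", "")))

    def extract(self, native_text: str) -> str:
        # Placeholder for the pipeline call; tags output with the provider.
        return f"[{self.config.provider}] {native_text}"


a = FacadeSketch.with_openai("key-a")
b = FacadeSketch.from_env({"DOCPROC_PRIMARY_AI": "ollama"})
```

Because `a` and `b` hold separate configs, two facades with different providers can coexist in one process, which is the property that makes this style convenient in tests.
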
Config system (docproc/config/):
- schema.py defines docprocConfig (dataclass) with sub-configs: DatabaseConfig, AIProviderConfig, IngestConfig, RAGConfig, AIConfig.
- loader.py has two functions: parse_config() (pure, returns a new config, does NOT update global state) and load_config() (sets the process-wide singleton used by get_config()). Use parse_config() in tests and library code; use load_config() in the CLI.
- Config is resolved from (in order): explicit path → DOCPROC_CONFIG env → ./docproc.yaml → ./docproc.yml → ~/.config/docproc/docproc.yml.
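
The resolution order above can be sketched as a first-match search. `resolve_config_path` is a hypothetical helper, not a function in docproc/config/; in particular, silently skipping candidates that do not exist is an assumption (the real loader may instead raise when an explicit path is missing).

```python
from pathlib import Path
from typing import List, Mapping, Optional


def resolve_config_path(
    explicit: Optional[str],
    environ: Mapping[str, str],
    cwd: Path,
    home: Path,
) -> Optional[Path]:
    """Return the first existing candidate in the documented order."""
    candidates: List[Path] = []
    if explicit:
        candidates.append(Path(explicit))
    env_value = environ.get("DOCPROC_CONFIG")
    if env_value:
        candidates.append(Path(env_value))
    candidates.append(cwd / "docproc.yaml")
    candidates.append(cwd / "docproc.yml")
    candidates.append(home / ".config" / "docproc" / "docproc.yml")
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Taking `environ`, `cwd`, and `home` as parameters rather than reading `os.environ` and `Path.home()` directly keeps the function pure, in the same spirit as the parse_config() versus load_config() split.
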
Provider system (docproc/providers/):
- factory.py: get_provider() creates and caches provider instances. Bypasses cache when a config argument is passed (use this in tests).
- Supported providers: openai, azure, anthropic, ollama, litellm.
- All providers implement the ModelProvider base class in base.py.
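
The caching behaviour described for get_provider() can be illustrated with a small sketch. `ProviderSketch` and `get_provider_sketch` are hypothetical names, and keying the cache by provider name is an assumption; the real factory.py may key or construct instances differently.

```python
from typing import Dict, Optional


class ProviderSketch:
    """Hypothetical stand-in for a ModelProvider implementation."""

    def __init__(self, name: str, settings: dict) -> None:
        self.name = name
        self.settings = settings


# Process-wide cache, keyed by provider name (an assumption).
_CACHE: Dict[str, ProviderSketch] = {}


def get_provider_sketch(name: str, config: Optional[dict] = None) -> ProviderSketch:
    if config is not None:
        # Explicit config: build a fresh instance and leave the cache untouched.
        return ProviderSketch(name, config)
    if name not in _CACHE:
        _CACHE[name] = ProviderSketch(name, {})
    return _CACHE[name]


cached = get_provider_sketch("openai")
isolated = get_provider_sketch("openai", config={"model": "test-model"})
```

Here `isolated` is a different object from `cached`, so a test that passes its own config neither sees nor pollutes state left behind by other tests.
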
Demo (demo/) is a separate full-stack application (Go API + React UI + PostgreSQL/PgVector + RabbitMQ) that invokes the docproc CLI as a subprocess when documents are uploaded. It is not part of the Python library.
Key env vars: DOCPROC_CONFIG, DOCPROC_PRIMARY_AI, OPENAI_API_KEY, AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT, ANTHROPIC_API_KEY, OLLAMA_BASE_URL, DATABASE_URL, AI_DISABLED.
For local dev without a config file, copy .env.example → .env and run docproc init-config to generate ~/.config/docproc/docproc.yml.