felipechang/pdf-to-blog

pdf-to-blog

Turn PDFs into a Hugo static blog with optional companion audio from Podcast Generator. The pipeline watches for PDFs deposited under hugo_site/static/pdfs/, runs OCR with Tesseract, writes Markdown posts (with links to the PDF and optional podcast audio), and then runs hugo.

Pipeline

  1. Ingest — Drop PDFs into hugo_site/static/pdfs/ (watched continuously). Each file is served at /pdfs/<filename>.
  2. OCR — Tesseract reads each page (via Poppler-rendered images).
  3. Article — OCR text becomes the post body (title from the filename). Front matter includes pdf: for the original file. You can edit the Markdown afterward.
  4. Podcast — If PODCAST_ENABLE is true and the API is reachable, POST /podcast/generate stores hugo_site/static/podcasts/<slug>.wav and front matter links it.
  5. Hugo — hugo --gc --minify writes hugo_site/public/.
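
As a rough illustration of step 3, the sketch below derives a title from the filename and emits front matter with a pdf: link. The helper names (slugify, build_post), the podcast: key, and the exact title and slug rules are assumptions made for this example, not the package's actual code.

```python
import re
from pathlib import Path


def slugify(stem: str) -> str:
    """Lowercase the filename stem and collapse non-alphanumerics to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", stem.lower()).strip("-")


def build_post(pdf_name: str, body: str, has_podcast: bool = False) -> str:
    """Return a Markdown post whose front matter links the PDF (and audio, if any)."""
    stem = Path(pdf_name).stem
    # Title from the filename: underscores and hyphens become spaces
    title = re.sub(r"[_-]+", " ", stem).strip().title()
    lines = ["---", f'title: "{title}"', f'pdf: "/pdfs/{pdf_name}"']
    if has_podcast:
        # "podcast" is an assumed key name for the audio link
        lines.append(f'podcast: "/podcasts/{slugify(stem)}.wav"')
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body.strip() + "\n"
```

A filename containing double quotes would need escaping before it is safe inside the quoted YAML values; the sketch leaves that out.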

python -m pdf_to_blog process /path/to/file.pdf copies the PDF into hugo_site/static/pdfs/ if it is not already there, then builds the post the same way.
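
The copy-if-missing behaviour could look roughly like this. It is a minimal sketch: ensure_in_pdfs_dir is a hypothetical helper, and it assumes "already there" means the source file already lives inside the target folder.

```python
import shutil
from pathlib import Path


def ensure_in_pdfs_dir(src: Path, pdfs_dir: Path) -> Path:
    """Copy src into pdfs_dir unless it already lives there; return the target path."""
    pdfs_dir.mkdir(parents=True, exist_ok=True)
    target = pdfs_dir / src.name
    # Only copy when the source is not already the file inside pdfs_dir
    if src.resolve() != target.resolve():
        shutil.copy2(src, target)
    return target
```

Under this assumption, a different file with the same name would be overwritten; the real tool may handle that case differently.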

flowchart LR
  pdfs[static/pdfs]
  ocr[Tesseract]
  post[Hugo content/posts]
  pod[Podcast API]
  site[hugo public]

  pdfs --> ocr --> post
  ocr --> pod
  pod --> post
  post --> site

Layout

| Path | Role |
| --- | --- |
| hugo_site/static/pdfs/ | Incoming PDFs (watched); published at /pdfs/* |
| hugo_site/ | Hugo site: content/posts/, static/pdfs/, static/podcasts/, layouts/ |
| hugo_site/public/ | Build output (gitignored) |
| pdf_to_blog/podcast_client.py | HTTP client for the podcast FastAPI |

Prerequisites

  • Python 3.11+ for local runs.
  • Tesseract + Poppler (pdftoppm) on the host if not using Docker.
  • Hugo (Extended not required for this theme) on the host for local hugo-build / preview.
  • Ollama server reachable at OLLAMA_API_URL. Required if OLLAMA_ENABLE is true.
  • Podcast Generator service (Ollama + Chatterbox, etc.) reachable at PODCAST_API_URL. Required if PODCAST_ENABLE is true. Check the felipechang/podcast-generator repo for details.
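
For host runs, a quick check that the command-line tools are installed can save a failed first run. The sketch below uses only the standard library; missing_tools is a hypothetical helper, and the executable names assume default installs of Tesseract, Poppler, and Hugo.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm", "hugo")) -> list[str]:
    """Return the names of required executables that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("Missing tools:", ", ".join(missing) if missing else "none")
```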

Install (local)

pip install -e .

Optional: create a .env in the project root; python-dotenv loads it (see variables below).

Configuration

| Variable | Description |
| --- | --- |
| PDF_BLOG_ROOT | Workspace root (default: current working directory) |
| INGEST_DIR | Folder to watch (default: <hugo_site>/static/pdfs) |
| HUGO_SITE_DIR | Hugo site path (default: <root>/hugo_site) |
| STATE_PATH | JSON file tracking processed file hashes |
| PODCAST_API_URL | Podcast Generator base URL (default: http://127.0.0.1:8000) |
| PODCAST_TIMEOUT_S | HTTP timeout for generate, in seconds (default: 3600) |
| PODCAST_POLL_INTERVAL_S | Polling interval for podcast status, in seconds (default: 30) |
| PODCAST_ASSISTANT_PROMPT | Optional assistant_prompt for podcast configuration |
| PODCAST_ENABLE | true/false; TTS calls are skipped when false |
| OLLAMA_API_URL | Ollama API base URL (default: http://127.0.0.1:11434) |
| OLLAMA_MODEL | Model used for rewriting/summarizing (default: glm-4.7-flash:latest) |
| OLLAMA_TITLE_PROMPT | Prompt for generating a title (default: Summarize...) |
| OLLAMA_BODY_PROMPT | Prompt for transforming OCR text to body (default: Transform...) |
| OLLAMA_TIMEOUT_S | HTTP timeout for Ollama calls, in seconds (default: 600) |
| OLLAMA_ENABLE | true/false; Ollama rewriting is skipped when false |
| TESSERACT_CMD | Explicit path to tesseract if not on PATH |
| TESSERACT_LANG | Tesseract language codes (default: eng) |
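
To show how the path defaults cascade (root, then Hugo site, then ingest folder), here is a minimal sketch that resolves a few of these variables. The Settings class and load_settings function are hypothetical. The false default for PODCAST_ENABLE is an assumption, because this README does not state one.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    root: Path
    hugo_site_dir: Path
    ingest_dir: Path
    podcast_api_url: str
    podcast_timeout_s: float
    podcast_poll_interval_s: float
    podcast_enable: bool
    tesseract_lang: str


def load_settings(env=None) -> Settings:
    """Resolve settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Each path default builds on the one before it
    root = Path(env.get("PDF_BLOG_ROOT") or Path.cwd())
    hugo_site = Path(env.get("HUGO_SITE_DIR") or root / "hugo_site")
    ingest = Path(env.get("INGEST_DIR") or hugo_site / "static" / "pdfs")
    return Settings(
        root=root,
        hugo_site_dir=hugo_site,
        ingest_dir=ingest,
        podcast_api_url=env.get("PODCAST_API_URL", "http://127.0.0.1:8000"),
        podcast_timeout_s=float(env.get("PODCAST_TIMEOUT_S", 3600)),
        podcast_poll_interval_s=float(env.get("PODCAST_POLL_INTERVAL_S", 30)),
        # Assumption: off unless explicitly "true"; the real default is not documented here
        podcast_enable=env.get("PODCAST_ENABLE", "false").strip().lower() == "true",
        tesseract_lang=env.get("TESSERACT_LANG", "eng"),
    )
```

Passing a plain dict instead of os.environ makes the resolution easy to exercise in tests.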

The podcast server uses its own env (OLLAMA_*, SPEAKER_*_VOICE, TTS_*, etc.) as documented in the podcast-generator project.

CLI

# Watch hugo_site/static/pdfs/ for PDFs (defaults apply when run from project root)
python -m pdf_to_blog watch

# One-shot PDF
python -m pdf_to_blog process path/to/file.pdf

# Hugo build only
python -m pdf_to_blog hugo-build
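
The three subcommands map naturally onto argparse subparsers. The following is a sketch of how such a CLI could be wired, not the package's actual entry point; build_parser and the help strings are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the watch, process, and hugo-build subcommands."""
    parser = argparse.ArgumentParser(prog="pdf_to_blog")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("watch", help="watch the ingest folder for PDFs")
    process = sub.add_parser("process", help="process a single PDF")
    process.add_argument("pdf", help="path to the PDF file")
    sub.add_parser("hugo-build", help="run the Hugo build only")
    return parser
```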

Docker (Tesseract + pipeline)

Builds an image with Tesseract, Poppler, Hugo, and the Python package. Use klakegg/hugo for live preview (docker compose profile preview) or a one-off build (see below).

docker compose up --build pipeline
  • Mounts ./hugo_site at /work/hugo_site. PDFs are watched under hugo_site/static/pdfs/ (same mount).
  • The pipeline service sets PDF_BLOG_ROOT, INGEST_DIR, HUGO_SITE_DIR, and STATE_PATH to /work/... paths that match this layout so a host-only .env cannot write outside the bind mount (which would 404 /pdfs/* or /podcasts/* in preview).
  • Persists processing state in a named volume (pdf_blog_state) at /work/data/state.json.
  • Default PODCAST_API_URL is http://host.docker.internal:8000 so a podcast service on the host is reachable (Linux uses extra_hosts: host-gateway).

Hugo live preview (optional profile) uses the minimal klakegg/hugo image (ext-alpine = Hugo Extended on Alpine); the site is mounted at /src as in their docs.

docker compose --profile preview up hugo

Open http://localhost:1313/. To use a specific Hugo version, override the tag through the Compose image key or pin klakegg/hugo:<version>-ext-alpine in docker-compose.yml.

The pipeline image includes hugo, so each ingested PDF runs hugo --gc --minify into hugo_site/public/. For a one-off build without the pipeline container, use klakegg, for example:

docker run --rm -v "%cd%/hugo_site:/src" klakegg/hugo:ext-alpine --gc --minify

(On PowerShell, use "${PWD}/hugo_site:/src" or an absolute path instead of %cd%.)

Podcast client (Python)

PodcastClient in pdf_to_blog/podcast_client.py wraps:

| Method | API |
| --- | --- |
| health() | GET /health |
| generate(content, assistant_prompt="") | POST /podcast/generate → str (task_id) |
| get_task_status(task_id) | GET /podcast/task/{task_id}/status |
| download_task_result(task_id) | GET /podcast/task/{task_id} → bytes |
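
For orientation, the request behind generate() might be assembled along these lines. This standard-library sketch assumes a JSON body with content and an optional assistant_prompt field; build_generate_request is a hypothetical helper and not part of PodcastClient.

```python
import json
import urllib.request


def build_generate_request(
    base_url: str, content: str, assistant_prompt: str = ""
) -> urllib.request.Request:
    """Build (but do not send) a POST /podcast/generate request with a JSON body."""
    payload = {"content": content}
    if assistant_prompt:
        # Only include the optional field when a prompt was given
        payload["assistant_prompt"] = assistant_prompt
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/podcast/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it would be a matter of passing the request to urllib.request.urlopen with a suitable timeout.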

Example:

import time
from pathlib import Path

from pdf_to_blog.podcast_client import PodcastClient

c = PodcastClient(base_url="http://127.0.0.1:8000", timeout_s=600.0)
print(c.health())
task_id = c.generate("Same content as above.")
print(f"Task started: {task_id}")

# Poll status (simplified): stop once the task completes or fails
while True:
    status = c.get_task_status(task_id)
    if status["status"] == "completed":
        # Download the finished audio and save it locally
        wav = c.download_task_result(task_id)
        Path("episode.wav").write_bytes(wav)
        break
    elif status["status"] == "failed":
        print(f"Failed: {status['error']}")
        break
    time.sleep(5)

This matches the curl-style usage of the Podcast Generator service: a Content-Type: application/json body with content and an optional assistant_prompt.
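
The simplified loop in the example runs until the task completes or fails. A bounded variant might look like the sketch below. It assumes the get_task_status and download_task_result methods listed above and the "completed" / "failed" status values from the example. The defaults mirror PODCAST_TIMEOUT_S and PODCAST_POLL_INTERVAL_S, and wait_for_podcast itself is a hypothetical helper.

```python
import time


def wait_for_podcast(client, task_id, timeout_s=3600.0, poll_interval_s=30.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until the task completes, fails, or the deadline passes; return WAV bytes."""
    deadline = clock() + timeout_s
    while True:
        status = client.get_task_status(task_id)
        state = status.get("status")
        if state == "completed":
            return client.download_task_result(task_id)
        if state == "failed":
            raise RuntimeError(f"Podcast task {task_id} failed: {status.get('error')}")
        if clock() >= deadline:
            raise TimeoutError(f"Podcast task {task_id} not finished after {timeout_s}s")
        sleep(poll_interval_s)
```

The injectable sleep and clock parameters exist only so the loop can be tested without waiting.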

Development notes

  • Re-processing the same file is skipped while the SHA-256 matches the last successful run (see STATE_PATH).
  • For curated articles instead of raw OCR text, enable LLM rewriting between OCR and the post body (see OLLAMA_ENABLE and the OLLAMA_*_PROMPT variables).
  • For non-English PDFs, install extra Tesseract language packs in the image or on the host and extend extract_text_pdf to pass lang.
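
The hash-based skip from the first note can be sketched as follows. The JSON layout (filename mapped to its SHA-256 hex digest) and the helper names are assumptions made for illustration; the real state file may be structured differently.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the file in 1 MiB chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _load_state(state_path: Path) -> dict:
    return json.loads(state_path.read_text()) if state_path.exists() else {}


def needs_processing(pdf: Path, state_path: Path) -> bool:
    """True when the file's hash differs from the one recorded for its last successful run."""
    return _load_state(state_path).get(pdf.name) != sha256_of(pdf)


def mark_processed(pdf: Path, state_path: Path) -> None:
    """Record the current hash after a successful run."""
    state = _load_state(state_path)
    state[pdf.name] = sha256_of(pdf)
    state_path.parent.mkdir(parents=True, exist_ok=True)
    state_path.write_text(json.dumps(state, indent=2))
```

Calling mark_processed only after a post is written successfully keeps failed runs eligible for a retry.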

License

Specify your license here once chosen.

About

Fully local AI blog post generation with optional podcast-style commentary
