Turn PDFs into a Hugo static blog with optional companion audio from
Podcast Generator. The pipeline watches `hugo_site/static/pdfs/` for new PDFs,
runs OCR with Tesseract, writes Markdown posts (with links to the PDF and
optional podcast audio), then runs `hugo`.
- Ingest: Drop PDFs into `hugo_site/static/pdfs/` (watched continuously). Each file is served at `/pdfs/<filename>`.
- OCR: Tesseract reads each page (via Poppler-rendered images).
- Article: OCR text becomes the post body (title from the filename). Front matter includes `pdf:` for the original file. You can edit the Markdown afterward.
- Podcast: If `PODCAST_ENABLE` is true and the API is reachable, `POST /podcast/generate` stores `hugo_site/static/podcasts/<slug>.wav` and front matter links it.
- Hugo: `hugo --gc --minify` writes `hugo_site/public/`.

`python -m pdf_to_blog process /path/to/file.pdf` copies the PDF into `hugo_site/static/pdfs/` if it is not already
there, then builds the post the same way.

    flowchart LR
      pdfs[static/pdfs]
      ocr[Tesseract]
      post[Hugo content/posts]
      pod[Podcast API]
      site[hugo public]
      pdfs --> ocr --> post
      ocr --> pod
      pod --> post
      post --> site
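
The Article and Podcast steps above amount to rendering a Markdown file with front matter. The following is a minimal sketch under stated assumptions, not the project's actual code: `slugify`, `render_post`, and the `podcast:` key are hypothetical names chosen for illustration. Only the `pdf:` key, the `/pdfs/<filename>` URL, and the `podcasts/<slug>.wav` location come from the description above.

```python
import json
import re
from pathlib import Path


def slugify(name: str) -> str:
    """Lowercase the name and collapse runs of other characters into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def render_post(pdf_path: Path, ocr_text: str, has_podcast: bool = False) -> str:
    """Return Markdown with YAML front matter for one PDF (hypothetical layout)."""
    front_matter = {
        "title": pdf_path.stem,           # title taken from the filename
        "pdf": f"/pdfs/{pdf_path.name}",  # link to the original file
    }
    if has_podcast:
        # Assumed key name: the source only says the front matter links the audio.
        front_matter["podcast"] = f"/podcasts/{slugify(pdf_path.stem)}.wav"
    # JSON string literals are valid YAML scalars, so json.dumps handles quoting.
    header = [f"{key}: {json.dumps(value)}" for key, value in front_matter.items()]
    return "\n".join(["---", *header, "---", "", ocr_text.strip(), ""])


if __name__ == "__main__":
    print(render_post(Path("Annual Report 2023.pdf"), "Raw OCR text.", has_podcast=True))
```

Quoting values through `json.dumps` keeps filenames containing quotes or colons from breaking the front matter, which is the main reason the sketch avoids plain string formatting.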
| Path | Role |
|---|---|
| `hugo_site/static/pdfs/` | Incoming PDFs (watched); published at `/pdfs/*` |
| `hugo_site/` | Hugo site: `content/posts/`, `static/pdfs/`, `static/podcasts/`, `layouts/` |
| `hugo_site/public/` | Build output (gitignored) |
| `pdf_to_blog/podcast_client.py` | HTTP client for the podcast FastAPI service |
- Python 3.11+ for local runs.
- Tesseract + Poppler (`pdftoppm`) on the host if not using Docker.
- Hugo (Extended not required for this theme) on the host for local `hugo-build` / preview.
- Ollama server reachable at `OLLAMA_API_URL`. Required if `OLLAMA_ENABLE` is true.
- Podcast Generator service (Ollama + Chatterbox, etc.) reachable at `PODCAST_API_URL`. Required if `PODCAST_ENABLE` is true. Check the felipechang/podcast-generator repo for details.
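
For local (non-Docker) runs, it can help to confirm that the host tools listed above are on `PATH` before starting the watcher. The helper below is a hypothetical sketch and not part of `pdf_to_blog`; the name `missing_tools` and the injectable `which` parameter are assumptions made so the check is easy to test.

```python
import shutil
from typing import Callable, Iterable, Optional

# Executables named in the requirements list above.
REQUIRED_TOOLS = ("tesseract", "pdftoppm", "hugo")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All host tools found.")
```

A Tesseract binary that lives outside `PATH` would be reported as missing here even though `TESSERACT_CMD` can point at it, so treat the result as a hint rather than a hard failure.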

Install the package in editable mode:

    pip install -e .

Optional: create a `.env` in the project root; python-dotenv loads it (see variables below).
| Variable | Description |
|---|---|
| `PDF_BLOG_ROOT` | Workspace root (default: current working directory) |
| `INGEST_DIR` | Folder to watch (default: `<hugo_site>/static/pdfs`) |
| `HUGO_SITE_DIR` | Hugo site path (default: `<root>/hugo_site`) |
| `STATE_PATH` | JSON file tracking processed file hashes |
| `PODCAST_API_URL` | Podcast Generator base URL (default `http://127.0.0.1:8000`) |
| `PODCAST_TIMEOUT_S` | HTTP timeout for generate (default `3600`) |
| `PODCAST_POLL_INTERVAL_S` | Polling interval for podcast status (default `30`) |
| `PODCAST_ASSISTANT_PROMPT` | Optional `assistant_prompt` for podcast configuration |
| `PODCAST_ENABLE` | `true`/`false`; TTS calls are skipped when false |
| `OLLAMA_API_URL` | Ollama API base URL (default `http://127.0.0.1:11434`) |
| `OLLAMA_MODEL` | Model used for rewriting/summarizing (default `glm-4.7-flash:latest`) |
| `OLLAMA_TITLE_PROMPT` | Prompt for generating a title (default: Summarize...) |
| `OLLAMA_BODY_PROMPT` | Prompt for transforming OCR text to body (default: Transform...) |
| `OLLAMA_TIMEOUT_S` | HTTP timeout for Ollama calls (default `600`) |
| `OLLAMA_ENABLE` | `true`/`false`; Ollama rewriting is skipped when false |
| `TESSERACT_CMD` | Explicit path to `tesseract` if not on `PATH` |
| `TESSERACT_LANG` | Tesseract language codes (default `eng`) |

The podcast server uses its own environment variables (`OLLAMA_*`, `SPEAKER_*_VOICE`, `TTS_*`, etc.) as documented in the
podcast-generator project.
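
To make the default-resolution order in the table concrete, here is a minimal sketch of reading a few of these variables. It is an illustration only: `Settings`, `load_settings`, and `_as_bool` are hypothetical names, the project's real loader may differ, and the `false` default for `PODCAST_ENABLE` is an assumption because the table does not state one.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Treat common truthy spellings as True; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    root: Path
    hugo_site_dir: Path
    ingest_dir: Path
    podcast_api_url: str
    podcast_timeout_s: float
    podcast_poll_interval_s: float
    tesseract_lang: str
    podcast_enable: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, applying the defaults listed in the table."""
    root = Path(env.get("PDF_BLOG_ROOT") or Path.cwd())
    # Later paths default relative to earlier ones: root -> hugo_site -> static/pdfs.
    hugo_site = Path(env.get("HUGO_SITE_DIR") or root / "hugo_site")
    ingest = Path(env.get("INGEST_DIR") or hugo_site / "static" / "pdfs")
    return Settings(
        root=root,
        hugo_site_dir=hugo_site,
        ingest_dir=ingest,
        podcast_api_url=env.get("PODCAST_API_URL", "http://127.0.0.1:8000"),
        podcast_timeout_s=float(env.get("PODCAST_TIMEOUT_S", "3600")),
        podcast_poll_interval_s=float(env.get("PODCAST_POLL_INTERVAL_S", "30")),
        tesseract_lang=env.get("TESSERACT_LANG", "eng"),
        # Assumption: the table gives no default, so this sketch defaults to off.
        podcast_enable=_as_bool(env.get("PODCAST_ENABLE", "false")),
    )
```

Passing a plain dict instead of `os.environ` makes the resolution easy to test. With python-dotenv installed, calling `load_dotenv()` before `load_settings()` would populate `os.environ` from the `.env` file.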

Command-line usage:

    # Watch hugo_site/static/pdfs/ for PDFs (defaults apply when run from project root)
    python -m pdf_to_blog watch

    # One-shot PDF
    python -m pdf_to_blog process path/to/file.pdf

    # Hugo build only
    python -m pdf_to_blog hugo-build

The Docker setup builds an image with Tesseract, Poppler, Hugo, and the Python package. Use *klakegg/hugo* for live preview (`docker compose` profile `preview`) or a one-off build (see below).

Start the pipeline service:

    docker compose up --build pipeline

- Mounts `./hugo_site` at `/work/hugo_site`. PDFs are watched under `hugo_site/static/pdfs/` (same mount).
- The `pipeline` service sets `PDF_BLOG_ROOT`, `INGEST_DIR`, `HUGO_SITE_DIR`, and `STATE_PATH` to `/work/...` paths that match this layout, so a host-only `.env` cannot redirect output outside the bind mount (which would cause 404s for `/pdfs/*` or `/podcasts/*` in preview).
- Persists processing state in a named volume (`pdf_blog_state`) at `/work/data/state.json`.
- Default `PODCAST_API_URL` is `http://host.docker.internal:8000` so a podcast service on the host is reachable (Linux uses `extra_hosts: host-gateway`).

Hugo live preview (optional profile) uses the minimal klakegg/hugo image (`ext-alpine` = Hugo Extended on Alpine); the site is mounted at `/src` as in their docs.

    docker compose --profile preview up hugo

Open http://localhost:1313/. Override the tag with the Compose `image` key, or pin `klakegg/hugo:<version>-ext-alpine` in
`docker-compose.yml` if you need a specific Hugo version.
The pipeline image includes `hugo`, so the pipeline runs `hugo --gc --minify` into `hugo_site/public/` after each ingested PDF. For a
one-off build without the pipeline container, use klakegg, for example (Windows `cmd` syntax):

    docker run --rm -v "%cd%/hugo_site:/src" klakegg/hugo:ext-alpine --gc --minify

(On PowerShell, use `"${PWD}/hugo_site:/src"` or an absolute path instead of `%cd%`.)
`PodcastClient` in `pdf_to_blog/podcast_client.py` wraps:

| Method | API |
|---|---|
| `health()` | `GET /health` |
| `generate(content, assistant_prompt="")` | `POST /podcast/generate` → `str` (task_id) |
| `get_task_status(task_id)` | `GET /podcast/task/{task_id}/status` |
| `download_task_result(task_id)` | `GET /podcast/task/{task_id}` → `bytes` |

Example:

    import time
    from pathlib import Path

    from pdf_to_blog.podcast_client import PodcastClient

    c = PodcastClient(base_url="http://127.0.0.1:8000", timeout_s=600.0)
    print(c.health())

    task_id = c.generate("Same content as above.")
    print(f"Task started: {task_id}")

    # Poll status (simplified: no overall deadline)
    while True:
        status = c.get_task_status(task_id)
        if status["status"] == "completed":
            # Download the finished audio and save it locally
            wav = c.download_task_result(task_id)
            Path("episode.wav").write_bytes(wav)
            break
        elif status["status"] == "failed":
            print(f"Failed: {status['error']}")
            break
        time.sleep(5)

This matches the curl-style usage of the Podcast Generator service (`Content-Type: application/json` body with `content` and
optional `assistant_prompt`).
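
The simplified loop above polls forever if the task never finishes. A bounded variant might look like the sketch below. It is hypothetical rather than the project's implementation: `wait_for_audio` and `TaskClient` are illustrative names, the `status` and `error` keys are taken from the example above, and using an overall deadline (here defaulting to the same values as `PODCAST_TIMEOUT_S` and `PODCAST_POLL_INTERVAL_S`) is an assumption about how those settings could be applied.

```python
import time
from typing import Any, Callable, Protocol


class TaskClient(Protocol):
    """The two PodcastClient methods this helper relies on."""

    def get_task_status(self, task_id: str) -> dict[str, Any]: ...
    def download_task_result(self, task_id: str) -> bytes: ...


def wait_for_audio(
    client: TaskClient,
    task_id: str,
    poll_interval_s: float = 30.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bytes:
    """Poll until the task completes, fails, or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = client.get_task_status(task_id)
        state = status.get("status")
        if state == "completed":
            return client.download_task_result(task_id)
        if state == "failed":
            raise RuntimeError(f"Podcast task {task_id} failed: {status.get('error')}")
        if clock() >= deadline:
            raise TimeoutError(f"Podcast task {task_id} not finished after {timeout_s}s")
        sleep(poll_interval_s)
```

Raising on failure or timeout, instead of printing, lets the caller decide whether to publish the post without audio. The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real waiting.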
- Re-processing the same file is skipped while the SHA-256 matches the last successful run (see `STATE_PATH`).
- Add LLM rewriting between OCR and the post body if you want curated articles instead of raw OCR text.
- For non-English PDFs, install extra Tesseract language packs in the image or on the host and extend `extract_text_pdf` to pass `lang`.
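
The SHA-256 skip rule in the first note can be sketched as follows. This is an assumption-laden illustration, not the project's code: the function names are hypothetical, and the state layout (a JSON object mapping filename to hex digest) is a guess, since the source only says the file at `STATE_PATH` tracks processed file hashes.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_state(state_path: Path) -> dict[str, str]:
    """Read the filename -> digest map, or start empty if no state exists yet."""
    if state_path.exists():
        return json.loads(state_path.read_text(encoding="utf-8"))
    return {}


def needs_processing(pdf: Path, state: dict[str, str]) -> bool:
    """True when the file is new or its content changed since the last success."""
    return state.get(pdf.name) != sha256_of(pdf)


def mark_processed(pdf: Path, state: dict[str, str], state_path: Path) -> None:
    """Record the current digest; call this only after a successful run."""
    state[pdf.name] = sha256_of(pdf)
    state_path.parent.mkdir(parents=True, exist_ok=True)
    state_path.write_text(json.dumps(state, indent=2), encoding="utf-8")
```

Recording the digest only after a successful run means a failed OCR or podcast step is retried the next time the file is seen, while an unchanged file is skipped.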
Specify your license here once chosen.