pdf-to-blog

Turn PDFs into a Hugo static blog with optional companion audio from Podcast Generator. The pipeline watches for PDFs deposited under hugo_site/static/pdfs/, runs OCR on them with Tesseract, writes Markdown posts (with links to the PDF and optional podcast audio), then runs hugo.

Pipeline

Ingest — Drop PDFs into hugo_site/static/pdfs/ (watched continuously). Each file is served at /pdfs/<filename>.
OCR — Tesseract reads each page (via Poppler-rendered images).
Article — OCR text becomes the post body (title from the filename). Front matter includes pdf: for the original file. You can edit the Markdown afterward.
Podcast — If PODCAST_ENABLE is true and the API is reachable, POST /podcast/generate stores hugo_site/static/podcasts/<slug>.wav and front matter links it.
Hugo — hugo --gc --minify writes hugo_site/public/.
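The Article step above can be sketched roughly as follows. This is a minimal illustration rather than the package's actual code: the helper names (slugify, title_from_filename, build_post) and the podcast front matter key are assumptions; only the pdf: key and the /pdfs/ and /podcasts/<slug>.wav paths come from the description above.

```python
import re
from pathlib import Path


def slugify(stem: str) -> str:
    """Lowercase the filename stem and collapse other characters into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", stem.lower()).strip("-")
    return slug or "untitled"


def title_from_filename(pdf_name: str) -> str:
    """Derive a readable title from the PDF filename (hypothetical rule)."""
    stem = Path(pdf_name).stem
    return re.sub(r"[_\-]+", " ", stem).strip().title()


def build_post(pdf_name: str, ocr_text: str, has_podcast: bool = False) -> str:
    """Return Markdown with front matter linking the PDF and, optionally, the audio."""
    slug = slugify(Path(pdf_name).stem)
    lines = [
        "---",
        f'title: "{title_from_filename(pdf_name)}"',
        f'pdf: "/pdfs/{pdf_name}"',
    ]
    if has_podcast:
        # Assumed key name; the README only says front matter links the audio.
        lines.append(f'podcast: "/podcasts/{slug}.wav"')
    lines += ["---", "", ocr_text.strip(), ""]
    return "\n".join(lines)
```

Calling build_post("annual_report.pdf", text) would yield a post whose pdf: entry points at /pdfs/annual_report.pdf, which you can then edit by hand like any other Markdown file.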

python -m pdf_to_blog process /path/to/file.pdf copies the PDF into hugo_site/static/pdfs/ if it is not already there, then builds the post the same way.
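The copy-if-missing behavior of the process command might look something like the sketch below. The function name ensure_in_ingest_dir is hypothetical, and treating an existing file with the same name as "already there" is an assumption about how the check works.

```python
import shutil
from pathlib import Path


def ensure_in_ingest_dir(pdf_path: Path, ingest_dir: Path) -> Path:
    """Copy pdf_path into ingest_dir unless it is already there; return the final path."""
    ingest_dir.mkdir(parents=True, exist_ok=True)
    target = ingest_dir / pdf_path.name
    if pdf_path.resolve() == target.resolve():
        return target  # the file already lives in the ingest directory
    if not target.exists():
        # copy2 also preserves timestamps, which keeps the copy close to the original
        shutil.copy2(pdf_path, target)
    return target
```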

flowchart LR
  pdfs[static/pdfs]
  ocr[Tesseract]
  post[Hugo content/posts]
  pod[Podcast API]
  site[hugo public]

  pdfs --> ocr --> post
  ocr --> pod
  pod --> post
  post --> site

Layout

Path	Role
`hugo_site/static/pdfs/`	Incoming PDFs (watched); published at `/pdfs/*`
`hugo_site/`	Hugo site: `content/posts/`, `static/pdfs/`, `static/podcasts/`, `layouts/`
`hugo_site/public/`	Build output (gitignored)
`pdf_to_blog/podcast_client.py`	HTTP client for the podcast FastAPI

Prerequisites

Python 3.11+ for local runs.
Tesseract + Poppler (pdftoppm) on the host if not using Docker.
Hugo (Extended not required for this theme) on the host for local hugo-build / preview.
Ollama server reachable at OLLAMA_API_URL. Required if OLLAMA_ENABLE is true.
Podcast Generator service (Ollama + Chatterbox, etc.) reachable at PODCAST_API_URL. Required if PODCAST_ENABLE is true. Check the felipechang/podcast-generator repo for details.
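Before a local run, it can help to confirm that the host tools listed above are on PATH. The snippet below is a small sketch using only the standard library; find_tools is a hypothetical helper, not part of pdf_to_blog, and it honors TESSERACT_CMD only for the tesseract entry.

```python
import os
import shutil
from typing import Dict, Optional, Sequence


def find_tools(
    names: Sequence[str] = ("tesseract", "pdftoppm", "hugo"),
) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None when it cannot be found."""
    found: Dict[str, Optional[str]] = {}
    for name in names:
        # TESSERACT_CMD may point at an explicit tesseract binary.
        override = os.environ.get("TESSERACT_CMD") if name == "tesseract" else None
        found[name] = shutil.which(override or name)
    return found


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```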

Install (local)

pip install -e .

Optional: create a .env in the project root; python-dotenv loads it (see variables below).

Configuration

Variable	Description
`PDF_BLOG_ROOT`	Workspace root (default: current working directory)
`INGEST_DIR`	Folder to watch (default: `<hugo_site>/static/pdfs`)
`HUGO_SITE_DIR`	Hugo site path (default: `<root>/hugo_site`)
`STATE_PATH`	JSON file tracking processed file hashes
`PODCAST_API_URL`	Podcast Generator base URL (default `http://127.0.0.1:8000`)
`PODCAST_TIMEOUT_S`	HTTP timeout for generate (default `3600`)
`PODCAST_POLL_INTERVAL_S`	Polling interval for podcast status (default `30`)
`PODCAST_ASSISTANT_PROMPT`	Optional `assistant_prompt` for podcast configuration
`PODCAST_ENABLE`	`true`/`false` — skip TTS calls when `false`
`OLLAMA_API_URL`	Ollama API base URL (default `http://127.0.0.1:11434`)
`OLLAMA_MODEL`	Model used for rewriting/summarizing (default `glm-4.7-flash:latest`)
`OLLAMA_TITLE_PROMPT`	Prompt for generating a title (default: Summarize...)
`OLLAMA_BODY_PROMPT`	Prompt for transforming OCR text to body (default: Transform...)
`OLLAMA_TIMEOUT_S`	HTTP timeout for Ollama calls (default `600`)
`OLLAMA_ENABLE`	`true`/`false` — skip Ollama rewriting when `false`
`TESSERACT_CMD`	Explicit path to `tesseract` if not on `PATH`
`TESSERACT_LANG`	Tesseract language codes (default `eng`)

The podcast server uses its own env (OLLAMA_*, SPEAKER_*_VOICE, TTS_*, etc.) as documented in the podcast-generator project.
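To make the defaults in the table concrete, here is a sketch of how a subset of these variables could be read. Settings, env_bool, and load_settings are hypothetical names, not the package's API; the accepted truthy spellings and the fallback for PODCAST_ENABLE (whose default the table does not state) are assumptions.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


def env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a true/false style variable; unset or empty falls back to default."""
    raw = env.get(name, "").strip().lower()
    if not raw:
        return default
    return raw in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    root: Path
    hugo_site_dir: Path
    ingest_dir: Path
    podcast_api_url: str
    podcast_timeout_s: float
    podcast_poll_interval_s: float
    podcast_enable: bool
    ollama_api_url: str
    ollama_timeout_s: float
    tesseract_lang: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from env (os.environ by default) using the documented defaults."""
    env = os.environ if env is None else env
    root = Path(env.get("PDF_BLOG_ROOT") or Path.cwd())
    hugo_site_dir = Path(env.get("HUGO_SITE_DIR") or root / "hugo_site")
    ingest_dir = Path(env.get("INGEST_DIR") or hugo_site_dir / "static" / "pdfs")
    return Settings(
        root=root,
        hugo_site_dir=hugo_site_dir,
        ingest_dir=ingest_dir,
        podcast_api_url=env.get("PODCAST_API_URL", "http://127.0.0.1:8000"),
        podcast_timeout_s=float(env.get("PODCAST_TIMEOUT_S", "3600")),
        podcast_poll_interval_s=float(env.get("PODCAST_POLL_INTERVAL_S", "30")),
        # Assumption: the table does not state the default for PODCAST_ENABLE.
        podcast_enable=env_bool(env, "PODCAST_ENABLE", default=False),
        ollama_api_url=env.get("OLLAMA_API_URL", "http://127.0.0.1:11434"),
        ollama_timeout_s=float(env.get("OLLAMA_TIMEOUT_S", "600")),
        tesseract_lang=env.get("TESSERACT_LANG", "eng"),
    )
```

Passing a plain dict instead of os.environ keeps the function easy to exercise in tests without touching the real environment.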

CLI

# Watch hugo_site/static/pdfs/ for PDFs (defaults apply when run from project root)
python -m pdf_to_blog watch

# One-shot PDF
python -m pdf_to_blog process path/to/file.pdf

# Hugo build only
python -m pdf_to_blog hugo-build

Docker (Tesseract + pipeline)

Builds an image with Tesseract, Poppler, Hugo, and the Python package. Use klakegg/hugo for live preview (docker compose profile preview) or a one-off build (see below).

docker compose up --build pipeline

Mounts ./hugo_site at /work/hugo_site. PDFs are watched under hugo_site/static/pdfs/ (same mount).
The pipeline service sets PDF_BLOG_ROOT, INGEST_DIR, HUGO_SITE_DIR, and STATE_PATH to /work/... paths that match this layout so a host-only .env cannot write outside the bind mount (which would 404 /pdfs/* or /podcasts/* in preview).
Persists processing state in a named volume (pdf_blog_state) at /work/data/state.json.
Default PODCAST_API_URL is http://host.docker.internal:8000 so a podcast service on the host is reachable (Linux uses extra_hosts: host-gateway).

Hugo live preview (optional profile) uses the minimal klakegg/hugo image (ext-alpine = Hugo Extended on Alpine); the site is mounted at /src as in their docs.

docker compose --profile preview up hugo

Open http://localhost:1313/. If you need a specific Hugo version, override the tag via the Compose image key, or pin klakegg/hugo:<version>-ext-alpine in docker-compose.yml.

The pipeline image includes hugo, so each ingested PDF triggers hugo --gc --minify, which writes to hugo_site/public/. For a one-off build without the pipeline container, use the klakegg/hugo image, for example (Windows cmd syntax):

docker run --rm -v "%cd%/hugo_site:/src" klakegg/hugo:ext-alpine --gc --minify

(On PowerShell, use "${PWD}/hugo_site:/src" or an absolute path instead of %cd%.)

Podcast client (Python)

PodcastClient in pdf_to_blog/podcast_client.py wraps:

Method	API
`health()`	`GET /health`
`generate(content, assistant_prompt="")`	`POST /podcast/generate` → `str` (task_id)
`get_task_status(task_id)`	`GET /podcast/task/{task_id}/status`
`download_task_result(task_id)`	`GET /podcast/task/{task_id}` → `bytes`

Example:

import time
from pathlib import Path

from pdf_to_blog.podcast_client import PodcastClient

c = PodcastClient(base_url="http://127.0.0.1:8000", timeout_s=600.0)
print(c.health())
task_id = c.generate("Text to turn into a podcast episode.")
print(f"Task started: {task_id}")

# Poll status (simplified: no overall deadline)
while True:
    status = c.get_task_status(task_id)
    if status["status"] == "completed":
        # Download the finished audio and save it next to the script
        wav = c.download_task_result(task_id)
        Path("episode.wav").write_bytes(wav)
        break
    elif status["status"] == "failed":
        print(f"Failed: {status['error']}")
        break
    time.sleep(5)

This matches the curl-style usage of the Podcast Generator service: a JSON body (Content-Type: application/json) with content and an optional assistant_prompt.
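For longer jobs, a polling loop with an overall deadline may be preferable to an unbounded while True. The helper below is a sketch, not part of PodcastClient: wait_for_task is a hypothetical name, its defaults mirror PODCAST_POLL_INTERVAL_S and PODCAST_TIMEOUT_S, and it assumes the status payload is a mapping with a "status" key that ends up as "completed" or "failed".

```python
import time
from typing import Any, Callable, Mapping


def wait_for_task(
    get_status: Callable[[], Mapping[str, Any]],
    poll_interval_s: float = 30.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Mapping[str, Any]:
    """Poll until the task is completed or failed; raise TimeoutError past the deadline."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status.get("status") in ("completed", "failed"):
            return status
        # Give up if the next poll would land after the deadline.
        if clock() + poll_interval_s > deadline:
            raise TimeoutError(
                f"task still {status.get('status')!r} after {timeout_s} seconds"
            )
        sleep(poll_interval_s)
```

With a client c and a task_id in hand, this could be called as wait_for_task(lambda: c.get_task_status(task_id)); the injectable sleep and clock arguments exist only to make the loop testable without real delays.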

Development notes

Re-processing the same file is skipped while the SHA-256 matches the last successful run (see STATE_PATH).
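Set OLLAMA_ENABLE to true to have Ollama rewrite the OCR text before it becomes the post body if you want curated articles instead of raw OCR text; adjust OLLAMA_TITLE_PROMPT and OLLAMA_BODY_PROMPT to control the result.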
For non-English PDFs, install extra Tesseract language packs in the image or on the host and extend extract_text_pdf to pass lang.
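The hash-based skip described in the first note above can be illustrated with a short sketch. The JSON layout (filename mapped to hex digest) and the helper names are assumptions; the actual format of the file at STATE_PATH may differ.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_state(state_path: Path) -> dict:
    """Return the saved {filename: sha256} mapping, or an empty one on first run."""
    if state_path.exists():
        return json.loads(state_path.read_text(encoding="utf-8"))
    return {}


def needs_processing(pdf: Path, state: dict) -> bool:
    """True when the file is new or its content changed since the last success."""
    return state.get(pdf.name) != sha256_of(pdf)


def mark_processed(pdf: Path, state: dict, state_path: Path) -> None:
    """Record the current hash after a successful run and persist the state file."""
    state[pdf.name] = sha256_of(pdf)
    state_path.parent.mkdir(parents=True, exist_ok=True)
    state_path.write_text(json.dumps(state, indent=2), encoding="utf-8")
```

Recording the hash only after a successful run means a failed OCR or podcast step is retried the next time the file is seen.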

License

Specify your license here once chosen.
