
feat(io): implement Apple Pages IWA import with heading structure#46

Merged
taylorarndt merged 3 commits into main from feat/pages-import-iwa
May 31, 2026

payown commented May 31, 2026

Summary

  • Rewrites Route A in quill/io/pages.py to use keynote-parser's real API (IWAFile + zip_file_reader) — the previous implementation called a non-existent load_presentation() function and would always fall through to the error message
  • Patches ID_NAME_MAP at runtime to skip Pages-specific protobuf message types (e.g. type 10000 = WP.DocumentArchive) that keynote-parser doesn't ship definitions for — previously these crashed the entire parse before any text was reached
  • Fixes text field handling: MessageToDict returns it as a list (repeated protobuf string), not a plain string
  • Follows the super.parent reference chain to resolve paragraph style names — Pages stores base styles with super.name directly, but variation styles only carry super.parent pointing back to the base
  • Fixes the markitdown version pin in pyproject.toml (>=0.3.0 → >=0.1.0); the latest release is 0.1.6, so the old constraint blocked installation entirely
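
The super.parent lookup described in the summary can be sketched roughly as follows. This is not the code from quill/io/pages.py: the dict layout (a `super` key holding either `name` or a `parent` with an `identifier`) and the function name `resolve_style_name` are assumptions chosen for illustration.

```python
from typing import Optional


def resolve_style_name(style_id: str, styles: dict[str, dict]) -> Optional[str]:
    """Walk super.parent links until a style carrying super.name is found.

    `styles` maps an object identifier to a style dict (hypothetical layout).
    Returns None when the chain ends or loops without reaching a name.
    """
    seen: set[str] = set()
    current: Optional[str] = style_id
    while current in styles and current not in seen:
        seen.add(current)  # guard against reference cycles
        sup = styles[current].get("super", {})
        name = sup.get("name")
        if name:
            return name  # base style: the name is stored directly
        # variation style: only a pointer back towards the base
        current = sup.get("parent", {}).get("identifier")
    return None


styles = {
    "1": {"super": {"name": "Heading 2"}},
    "2": {"super": {"parent": {"identifier": "1"}}},
    "3": {"super": {"parent": {"identifier": "2"}}},
}
print(resolve_style_name("3", styles))
```

The `seen` set keeps a malformed file with a circular parent chain from looping forever.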

Result

Opening a .pages file now produces structured Markdown with real heading levels (H2 Articles → H3 Sections → H4 Subsections), extracted via pure Python with no LibreOffice required. Tested against a real multi-section ACBO constitution document.
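
As a rough illustration of how resolved style names could become Markdown heading prefixes, here is a minimal sketch. The style names in `HEADING_LEVELS` and the function `to_markdown` are hypothetical; the actual names and mapping used by the PR are not shown in this conversation.

```python
from typing import Iterable, Optional

# Hypothetical style-name -> heading-level table; real Pages documents
# may use different paragraph style names.
HEADING_LEVELS = {"Heading": 2, "Heading 2": 3, "Heading 3": 4}


def to_markdown(paragraphs: Iterable[tuple[Optional[str], str]]) -> str:
    """Render (style_name, text) pairs as Markdown, prefixing headings."""
    blocks = []
    for style_name, text in paragraphs:
        text = text.strip()
        if not text:
            continue  # drop empty paragraphs
        level = HEADING_LEVELS.get(style_name or "")
        blocks.append(f"{'#' * level} {text}" if level else text)
    return "\n\n".join(blocks)


print(to_markdown([
    ("Heading", "Article I"),
    ("Heading 2", "Section 1"),
    ("Body", "Members shall meet annually."),
]))
```

Paragraphs whose style is unknown or unresolved fall through as plain body text, so a failed style lookup costs heading structure but not content.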

Test plan

  • python3 -m pytest tests/unit/io/test_structured.py -q — all 17 tests pass
  • Open a .pages file via python3 -m quill yourfile.pages — document loads with heading outline intact
  • Confirm Route B (LibreOffice + MarkItDown) fallback still works when keynote-parser is absent
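
One way to exercise the Route A / Route B split without uninstalling anything is to separate detection from the decision. The sketch below assumes the import name `keynote_parser` (as used in the review snippet) and the executable names `soffice` / `libreoffice`; `detect_capabilities` and `choose_route` are hypothetical helpers, not functions from quill.

```python
import importlib.util
import shutil


def detect_capabilities() -> tuple[bool, bool]:
    """Report whether keynote-parser and a LibreOffice binary are available."""
    has_parser = importlib.util.find_spec("keynote_parser") is not None
    has_soffice = any(shutil.which(exe) for exe in ("soffice", "libreoffice"))
    return has_parser, has_soffice


def choose_route(has_parser: bool, has_soffice: bool) -> str:
    """Prefer the pure-Python IWA route; fall back to LibreOffice."""
    if has_parser:
        return "A"
    if has_soffice:
        return "B"
    raise RuntimeError("No .pages import route available")


# Force the fallback decision regardless of what is installed locally.
print(choose_route(has_parser=False, has_soffice=True))
```

Keeping `choose_route` free of side effects lets a unit test cover the fallback path on a machine where keynote-parser is installed.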

Closes #38

🤖 Generated with Claude Code

Route A now correctly uses keynote-parser's real API (IWAFile +
zip_file_reader) instead of the non-existent load_presentation().

Key fixes:
- Patch ID_NAME_MAP at runtime to skip Pages-specific message types
  (e.g. type 10000 = WP.DocumentArchive) that keynote-parser doesn't
  know — these were crashing the entire parse before any text was read
- Join the repeated-string `text` field (list) from MessageToDict output
- Follow super.parent reference chain to resolve variation style names
  (Pages stores base styles with super.name; variations only carry
  super.parent pointing to the base)

Tested against a real .pages document — full text + H2/H3/H4 heading
hierarchy extracted correctly via pure Python, no LibreOffice needed.

Also fixes markitdown version pin (>=0.3.0 → >=0.1.0); latest release
is 0.1.6 and the old constraint blocked installation entirely.

Closes #38

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 31, 2026 04:21

payown commented May 31, 2026

@tayarndt would you mind taking a look at this when you get a chance? This implements the Pages import from issue #38 — tested against a real .pages document and getting full text + heading structure (H2/H3/H4) via pure Python. Would love your feedback before we merge.

Copilot AI left a comment

Pull request overview

This PR adds a pure-Python Apple Pages (.pages) importer that parses iWork Archive (.iwa) content via keynote-parser to extract text and derive Markdown heading structure, with a LibreOffice + MarkItDown fallback path. It also adjusts the optional dependency pin so the Pages import extra can be installed.

Changes:

  • Reworks .pages Route A import to use keynote-parser’s IWAFile + zip_file_reader, including a fallback-safe ID_NAME_MAP wrapper for unknown Pages protobuf types.
  • Adds paragraph-style → Markdown heading prefixing by resolving paragraph style names (including following super.parent chains).
  • Fixes the markitdown optional dependency constraint to a valid released version range.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| quill/io/pages.py | Implements IWA parsing path, patches unknown proto types handling, and applies paragraph style-based heading prefixes; keeps LibreOffice fallback. |
| pyproject.toml | Updates pages optional dependency pin for markitdown[all] to allow installation. |

Comment thread quill/io/pages.py
Comment on lines +145 to +148
storages = [s for s in all_storage if s.get("inDocument") or s.get("in_document")]
if not storages:
    storages = [s for s in all_storage if s.get("text", "").strip()]
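
The PR summary notes that MessageToDict returns `text` as a list. If the dicts in `all_storage` have that shape, calling `.strip()` on the value would fail, since lists have no `strip` method. A minimal sketch of a filter that tolerates both shapes follows; `storage_text` and `select_storages` are hypothetical names, and whether the merged code handles this the same way is not visible here.

```python
def storage_text(storage: dict) -> str:
    """Return the storage's text as one string, whether stored as str or list."""
    raw = storage.get("text", "")
    if isinstance(raw, (list, tuple)):
        return "".join(str(part) for part in raw)
    return str(raw)


def select_storages(all_storage: list[dict]) -> list[dict]:
    """Prefer storages flagged as in-document; else any with non-blank text."""
    in_doc = [s for s in all_storage if s.get("inDocument") or s.get("in_document")]
    if in_doc:
        return in_doc
    return [s for s in all_storage if storage_text(s).strip()]


sample = [{"text": ["   "]}, {"text": ["Article I", " Name"]}, {"text": "plain"}]
print(select_storages(sample))
```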

Comment thread quill/io/pages.py
Comment on lines 3 to 5
import subprocess
import tempfile
from pathlib import Path
Comment thread quill/io/pages.py
Comment on lines +73 to +89
def _patched_id_name_map():
    """Return a wrapper around keynote-parser's ID_NAME_MAP that never raises KeyError."""
    import keynote_parser.codec as _codec

    class _FallbackMap(dict):
        def __missing__(self, type_id: int):
            class _Factory:
                _tid = type_id

                @classmethod
                def FromString(cls, data: bytes) -> "_UnknownIWAMessage":
                    return _UnknownIWAMessage(cls._tid, data)

            return _Factory

    return _FallbackMap(_codec.ID_NAME_MAP)
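
The `__missing__` technique in this snippet can be tried in isolation, without keynote-parser installed. In the sketch below, `UnknownMessage` is a stand-in for the PR's `_UnknownIWAMessage` (whose definition is not shown in this thread), and the one-entry dict stands in for the real ID_NAME_MAP.

```python
class UnknownMessage:
    """Stand-in placeholder that keeps the raw bytes of an unrecognised type."""

    def __init__(self, type_id: int, data: bytes):
        self.type_id = type_id
        self.data = data


class FallbackMap(dict):
    """dict whose subscript lookup yields a factory instead of raising KeyError."""

    def __missing__(self, type_id: int):
        class _Factory:
            _tid = type_id

            @classmethod
            def FromString(cls, data: bytes) -> UnknownMessage:
                return UnknownMessage(cls._tid, data)

        return _Factory


id_map = FallbackMap({1: "KnownType"})  # stand-in for the real type map
msg = id_map[10000].FromString(b"\x08\x01")
print(msg.type_id, msg.data)
print(id_map.get(10000))  # .get() bypasses __missing__
```

`__missing__` is only consulted for `mapping[key]` lookups. `.get()` and `in` do not trigger it, so the wrapper helps only where the consuming code subscripts the map.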

Comment thread quill/io/pages.py
    return _FallbackMap(_codec.ID_NAME_MAP)


def _parse_iwa_bundle(path: Path, zip_file_reader) -> dict[str, list[dict]]:

taylorarndt merged commit 1e5da13 into main May 31, 2026
1 of 3 checks passed
taylorarndt deleted the feat/pages-import-iwa branch May 31, 2026 19:51

Development

Successfully merging this pull request may close these issues.

Import Apple Pages (.pages) documents with headings/structure (IWA parse or LibreOffice fallback)

4 participants