| 1 | +# Plan: Pandoc Lua Constructor Type Coercion |
| 2 | + |
| 3 | +## Status: Complete |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## Overview |
| 8 | + |
| 9 | +Pandoc's Lua API performs automatic type coercion ("fuzzy peeking") when |
| 10 | +constructors receive arguments. q2's constructors are strict — they only |
| 11 | +accept tables of the exact userdata type. This causes real-world extensions |
| 12 | +(e.g., lipsum) to fail because they rely on coercion behaviors like |
| 13 | +`pandoc.Para("text")` or `pandoc.Para(pandoc.Str("x"))`. |
| 14 | + |
| 15 | +This plan brings q2's coercion in line with real Pandoc's `pandoc-lua-marshal` |
| 16 | +package, specifically the `peekInlinesFuzzy`, `peekInlineFuzzy`, |
| 17 | +`peekBlocksFuzzy`, and `peekBlockFuzzy` functions. |
| 18 | + |
| 19 | +## Codebase Context |
| 20 | + |
| 21 | +### Where coercion happens in real Pandoc |
| 22 | + |
| 23 | +Source: `pandoc-lua-marshal` Haskell package |
| 24 | +(`~/src/pandoc-lua-marshal/src/Text/Pandoc/Lua/Marshal/`). |
| 25 | + |
| 26 | +**`peekInlinesFuzzy`** (Inline.hs:138-147) — dispatches on Lua type: |
| 27 | +1. `TypeString` → word-split via `B.text` into `Str`/`Space`/`SoftBreak` list |
| 28 | +2. `TypeTable` → try `__toinline` metamethod (→ singleton), else `peekList peekInlineFuzzy` |
| 29 | +3. `TypeUserdata` → singleton via `peekInlineFuzzy` |
| 30 | +4. Otherwise → error |
| 31 | + |
| 32 | +**`peekInlineFuzzy`** (Inline.hs:127-134) — dispatches on Lua type: |
| 33 | +1. `TypeString` → `Str(text)` (NO word splitting) |
| 34 | +2. `TypeTable` → try `__toinline` metamethod, else `peekInline` |
| 35 | +3. `TypeUserdata` → `peekInline` or `__toinline` metamethod |
| 36 | +4. Otherwise → error |
| 37 | + |
| 38 | +**`peekBlocksFuzzy`** (Block.hs:145-153) — tries in order: |
| 39 | +1. `__toblock` metamethod → singleton list |
| 40 | +2. `peekList peekBlockFuzzy` (each element via `peekBlockFuzzy`) |
| 41 | +3. Single `peekBlockFuzzy` → singleton list |
| 42 | +4. Otherwise → error |
| 43 | + |
| 44 | +**`peekBlockFuzzy`** (Block.hs:133-141) — tries in order: |
| 45 | +1. `peekBlock` (exact Block userdata) |
| 46 | +2. `__toblock` metamethod |
| 47 | +3. `Plain <$!> peekInlinesFuzzy` (any inlines-like value → wrap in Plain) |
| 48 | +4. Otherwise → error |
| 49 | + |
| 50 | +**`B.text`** (pandoc-types Builder.hs:334-350) — word-splitting: |
| 51 | +- Groups consecutive characters by space/non-space category |
| 52 | +- Space chars: ` `, `\r`, `\n`, `\t` |
| 53 | +- Non-space runs → `Str` |
| 54 | +- Space-only runs → `Space`, unless the run contains `\n` or `\r` → `SoftBreak` |
| 55 | +- Multiple consecutive spaces collapse to a single `Space`/`SoftBreak` |
| 56 | +- Empty string → empty list |
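
The grouping rule above can be sketched in Rust roughly as follows. This is a minimal illustration, not q2's actual code: the `Inline` enum here is a stand-in that carries only the three variants the rule produces, and `is_break_space` is a hypothetical helper name.

```rust
/// Stand-in for the inline AST nodes this sketch needs (assumption: the
/// real q2 type has more variants and possibly different payloads).
#[derive(Debug, Clone, PartialEq)]
pub enum Inline {
    Str(String),
    Space,
    SoftBreak,
}

/// The four characters the rule treats as breaking whitespace.
fn is_break_space(c: char) -> bool {
    matches!(c, ' ' | '\r' | '\n' | '\t')
}

/// Split `s` into `Str`/`Space`/`SoftBreak` by grouping consecutive
/// characters into space runs and non-space runs.
pub fn split_string_to_inlines(s: &str) -> Vec<Inline> {
    let mut out = Vec::new();
    let mut chars = s.chars().peekable();
    while let Some(&first) = chars.peek() {
        let in_space_run = is_break_space(first);
        let mut run = String::new();
        // Consume the maximal run of characters in the same category.
        while let Some(&c) = chars.peek() {
            if is_break_space(c) != in_space_run {
                break;
            }
            run.push(c);
            chars.next();
        }
        if !in_space_run {
            out.push(Inline::Str(run));
        } else if run.contains('\n') || run.contains('\r') {
            // A whitespace run containing a line break becomes one SoftBreak.
            out.push(Inline::SoftBreak);
        } else {
            // Any other whitespace run collapses to one Space.
            out.push(Inline::Space);
        }
    }
    out
}
```

Because a whole run is consumed before anything is emitted, collapsing of repeated whitespace falls out of the loop structure rather than needing a separate pass.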
| 57 | + |
| 58 | +### Where coercion happens in q2 |
| 59 | + |
| 60 | +- `crates/pampa/src/lua/types.rs` — `lua_table_to_inlines()` (line ~1343) |
| 61 | + and `lua_table_to_blocks()` (line ~1367). Both only accept `Value::Table` |
| 62 | + containing the exact userdata type. |
| 63 | +- `crates/pampa/src/lua/constructors.rs` — all constructors call one of |
| 64 | + these two functions. The `pandoc.Inlines()` and `pandoc.Blocks()` |
| 65 | + constructors have their own coercion logic that is partially correct. |
| 66 | + |
| 67 | +### Which constructors use which peek functions in real Pandoc |
| 68 | + |
| 69 | +Every constructor in Pandoc that takes inlines or blocks uses the fuzzy |
| 70 | +variants — no exceptions. Full mapping from `pandoc-lua-marshal`: |
| 71 | + |
| 72 | +| Constructor | Parameter | Pandoc peek function | |
| 73 | +|---|---|---| |
| 74 | +| Para, Plain | content | `peekInlinesFuzzy` | |
| 75 | +| Header | content | `peekInlinesFuzzy` | |
| 76 | +| Emph, Strong, Underline, Strikeout, Superscript, Subscript, SmallCaps | content | `peekInlinesFuzzy` | |
| 77 | +| Quoted | content | `peekInlinesFuzzy` | |
| 78 | +| Link, Image | content | `peekInlinesFuzzy` | |
| 79 | +| Span | content | `peekInlinesFuzzy` | |
| 80 | +| Cite | content | `peekInlinesFuzzy` | |
| 81 | +| Note, BlockQuote, Div | content | `peekBlocksFuzzy` | |
| 82 | +| Figure | content | `peekBlocksFuzzy` | |
| 83 | +| BulletList, OrderedList | items | `peekItemsFuzzy` = `peekList peekBlocksFuzzy \|\| singleton peekBlocksFuzzy` | |
| 84 | +| DefinitionList | items | `peekList peekDefinitionItem` where term = `peekInlinesFuzzy`, defs = `peekList peekBlocksFuzzy \|\| singleton peekBlocksFuzzy` | |
| 85 | +| LineBlock | lines | `peekList peekInlinesFuzzy` | |
| 86 | +| Caption | long | `peekBlocksFuzzy` | |
| 87 | +| Caption | short | `peekInlinesFuzzy` | |
| 88 | +| Caption (fuzzy peek) | fallback | tries Caption, then table, then `peekBlocksFuzzy` | |
| 89 | +| Citation | prefix | `peekInlinesFuzzy` | |
| 90 | +| Citation | suffix | `peekInlinesFuzzy` | |
| 91 | +| pandoc.Inlines | content | `peekInlinesFuzzy` (delegates entirely) | |
| 92 | +| pandoc.Blocks | content | `peekBlocksFuzzy` (delegates entirely) | |
| 93 | + |
| 94 | +## Current q2 behavior vs expected |
| 95 | + |
| 96 | +### Inlines constructors (Para, Emph, Strong, etc.) |
| 97 | + |
| 98 | +| Input | Real Pandoc | q2 now | Gap | |
| 99 | +|---|---|---|---| |
| 100 | +| `{pandoc.Str("x"), pandoc.Space()}` | works | works | — | |
| 101 | +| `pandoc.Str("x")` (single userdata) | `{Str("x")}` | **error** | fix | |
| 102 | +| `"hello"` (string) | `{Str("hello")}` | **error** | fix | |
| 103 | +| `"hello world"` (multi-word string) | `{Str("hello"), Space, Str("world")}` | **error** | fix | |
| 104 | +| `{"hello", pandoc.Space(), "world"}` (mixed) | `{Str("hello"), Space, Str("world")}` | **error** | fix | |
| 105 | + |
| 106 | +### Blocks constructors (Div, BlockQuote, Figure, Note) |
| 107 | + |
| 108 | +| Input | Real Pandoc | q2 now | Gap | |
| 109 | +|---|---|---|---| |
| 110 | +| `{pandoc.Para(...)}` (table of blocks) | works | works | — | |
| 111 | +| `pandoc.Para(...)` (single userdata) | `{Para(...)}` | **error** | fix | |
| 112 | +| `"text"` (string) | `{Plain({Str("text")})}` | **error** | fix | |
| 113 | +| `{pandoc.Str("x")}` (inlines-like) | `{Plain({Str("x")})}` | **error** | fix | |
| 114 | +| `{pandoc.Str("x"), pandoc.Str("y")}` (multiple inlines) | `{Plain({Str("x")}), Plain({Str("y")})}` | **error** | fix | |
| 115 | + |
| 116 | +Note: A table of inlines passed to a blocks constructor produces **one Plain |
| 117 | +block per element**, NOT one Plain wrapping all inlines. This is because |
| 118 | +`peekBlockFuzzy` is applied per-element, and each inline individually becomes |
| 119 | +`Plain([that_inline])`. |
| 120 | + |
| 121 | +### `pandoc.Inlines()` constructor |
| 122 | + |
| 123 | +| Input | Real Pandoc | q2 now | Gap | |
| 124 | +|---|---|---|---| |
| 125 | +| `"hello world"` | `{Str("hello"), Space, Str("world")}` | `{Str("hello world")}` | fix | |
| 126 | +| `{"hello", pandoc.Str("!")}` (mixed) | `{Str("hello"), Str("!")}` | `{Str("hello"), Str("!")}` | — | |
| 127 | +| Single Inline userdata | wraps in list | wraps in list | — | |
| 128 | +| `nil` | empty list | empty list | — | |
| 129 | + |
| 130 | +### `pandoc.Blocks()` constructor |
| 131 | + |
| 132 | +| Input | Real Pandoc | q2 now | Gap | |
| 133 | +|---|---|---|---| |
| 134 | +| Single Block userdata | wraps in list | wraps in list | — | |
| 135 | +| `nil` | empty list | empty list | — | |
| 136 | +| String | `{Plain(word-split inlines)}` | **error** | fix | |
| 137 | +| Inlines-like (single Inline; a table of inlines yields one `Plain` per element) | `{Plain(inlines)}` | **error** | fix |
| 138 | + |
| 139 | +### Helper constructors |
| 140 | + |
| 141 | +| Constructor | Parameter | Real Pandoc | q2 now | Gap | |
| 142 | +|---|---|---|---|---| |
| 143 | +| BulletList | each item | `peekBlocksFuzzy` (string → `[Plain(word-split)]`) | strict blocks only | fix | |
| 144 | +| BulletList | items arg | list of items OR single item → singleton | list only | fix | |
| 145 | +| OrderedList | each item | same as BulletList | strict blocks only | fix | |
| 146 | +| DefinitionList | term | `peekInlinesFuzzy` (string → word-split) | strict inlines only | fix | |
| 147 | +| DefinitionList | definitions | `peekList peekBlocksFuzzy` or single → singleton | strict blocks only | fix | |
| 148 | +| LineBlock | each line | `peekInlinesFuzzy` (string → word-split) | strict inlines only | fix | |
| 149 | +| Caption | long | `peekBlocksFuzzy` | strict blocks only | fix | |
| 150 | +| Caption | short | `peekInlinesFuzzy` | strict inlines only | fix | |
| 151 | +| Citation | prefix | `peekInlinesFuzzy` | strict inlines only | fix | |
| 152 | +| Citation | suffix | `peekInlinesFuzzy` | strict inlines only | fix | |
| 153 | + |
| 154 | +--- |
| 155 | + |
| 156 | +## Work Items |
| 157 | + |
| 158 | +### Phase 1: Core coercion functions (types.rs) |
| 159 | + |
| 160 | +- [x] **1.1** Add `split_string_to_inlines(s: &str) -> Vec<Inline>` utility |
| 161 | + that splits a string on whitespace, producing `Str`/`Space`/`SoftBreak` |
| 162 | + elements matching Pandoc's `B.text` behavior: |
| 163 | + - Group consecutive characters by space vs non-space |
| 164 | + - Space characters: ` `, `\r`, `\n`, `\t` |
| 165 | + - Non-space runs → `Str(text)` |
| 166 | + - Space-only runs → `SoftBreak` if run contains `\n` or `\r`, else `Space` |
| 167 | + - Multiple consecutive spaces collapse into a single `Space`/`SoftBreak` |
| 168 | + - Empty string → empty vec |
| 169 | + |
| 170 | +- [x] **1.2** Rewrite `lua_table_to_inlines()` as `peek_inlines_fuzzy()`: |
| 171 | + Accept (in this priority order, matching Pandoc's type dispatch): |
| 172 | + 1. `Value::String` — word-split via `split_string_to_inlines()` |
| 173 | + 2. `Value::Table` — iterate sequence values, each via `peek_inline_fuzzy()` |
| 174 | + 3. `Value::UserData` containing `LuaInline` — wrap in singleton vec |
| 175 | + 4. Otherwise → error |
| 176 | + |
| 177 | + Add helper `peek_inline_fuzzy(val: Value) -> Result<Inline>`: |
| 178 | + 1. `Value::String` — wrap in single `Str` (NO word splitting) |
| 179 | + 2. `Value::UserData` containing `LuaInline` — extract |
| 180 | + 3. Otherwise → error |
| 181 | + |
| 182 | +- [x] **1.3** Rewrite `lua_table_to_blocks()` as `peek_blocks_fuzzy()`: |
| 183 | + Accept (in this priority order, matching Pandoc): |
| 184 | + 1. `Value::Table` — iterate sequence values, each via `peek_block_fuzzy()` |
| 185 | + 2. `Value::UserData` containing `LuaBlock` — wrap in singleton vec |
| 186 | + 3. Any value that `peek_inlines_fuzzy()` accepts — wrap in |
| 187 | + `Plain(inlines)` as singleton vec |
| 188 | + 4. Otherwise → error |
| 189 | + |
| 190 | + Add helper `peek_block_fuzzy(val: Value) -> Result<Block>`: |
| 191 | + 1. `Value::UserData` containing `LuaBlock` — extract |
| 192 | + 2. Any value that `peek_inlines_fuzzy()` accepts — wrap in `Plain(inlines)` |
| 193 | + 3. Otherwise → error |
| 194 | + |
| 195 | +- [x] **1.4** Write tests for each coercion path: |
| 196 | + - `split_string_to_inlines`: empty, single word, multi-word, newlines, |
| 197 | + tabs, multiple consecutive spaces, mixed space/newline runs, |
| 198 | + leading/trailing whitespace |
| 199 | + - `peek_inlines_fuzzy`: table of inlines, table with mixed strings, |
| 200 | + single inline, single string, multi-word string |
| 201 | + - `peek_blocks_fuzzy`: table of blocks, single block, string→Plain, |
| 202 | + inlines-like→Plain, table of inlines→multiple Plains |
| 203 | + |
| 204 | +### Phase 2: Update all constructors (constructors.rs) |
| 205 | + |
| 206 | +- [x] **2.1** Replace all calls to `lua_table_to_inlines()` with |
| 207 | + `peek_inlines_fuzzy()` in constructors: Para, Plain, Header, Emph, |
| 208 | + Strong, Underline, Strikeout, Superscript, Subscript, SmallCaps, |
| 209 | + Quoted, Link, Image, Span, Cite. |
| 210 | + |
| 211 | +- [x] **2.2** Replace all calls to `lua_table_to_blocks()` with |
| 212 | + `peek_blocks_fuzzy()` in constructors: Note, BlockQuote, Div, Figure. |
| 213 | + |
| 214 | +- [x] **2.3** Update helper parsing functions that call the old functions: |
| 215 | + - `parse_list_items()` → use `peek_blocks_fuzzy()` for each item, |
| 216 | + AND accept a single blocks-like value (not just a table of items), |
| 217 | + matching Pandoc's `peekItemsFuzzy`. |
| 218 | + - `parse_definition_list_items()` → use `peek_inlines_fuzzy()` for terms, |
| 219 | + `peek_blocks_fuzzy()` for definitions (already via parse_list_items). |
| 220 | + - `parse_line_block_content()` → use `peek_inlines_fuzzy()` for each line. |
| 221 | + - `parse_caption()` → use `peek_inlines_fuzzy()` for short, |
| 222 | + `peek_blocks_fuzzy()` for long. |
| 223 | + - `parse_single_citation()` → use `peek_inlines_fuzzy()` for prefix |
| 224 | + and suffix. |
| 225 | + |
| 226 | +- [x] **2.4** Update `pandoc.Inlines()` constructor: delegate entirely to |
| 227 | + `peek_inlines_fuzzy()` for the content argument, then wrap results as |
| 228 | + LuaInline userdata in a Lua table with the Inlines metatable. This |
| 229 | + replaces the current inline coercion logic and adds word-splitting |
| 230 | + for top-level strings. |
| 231 | + |
| 232 | +- [x] **2.5** Update `pandoc.Blocks()` constructor: delegate to |
| 233 | + `peek_blocks_fuzzy()` for the content argument, then wrap results as |
| 234 | + LuaBlock userdata in a Lua table with the Blocks metatable. This adds |
| 235 | + support for strings and inlines-like values (wrapped in Plain). |
| 236 | + |
| 237 | +### Phase 3: Constructor-level tests |
| 238 | + |
| 239 | +- [x] **3.1** Add tests for inlines constructors with coerced input types: |
| 240 | + - `pandoc.Para("hello world")` → `Para([Str("hello"), Space, Str("world")])` |
| 241 | + - `pandoc.Para(pandoc.Str("x"))` → `Para([Str("x")])` |
| 242 | + - `pandoc.Emph("text")` → `Emph([Str("text")])` |
| 243 | + - `pandoc.Header(1, "title")` → `Header(1, [Str("title")])` |
| 244 | + |
| 245 | +- [x] **3.2** Add tests for blocks constructors with coerced input types: |
| 246 | + - `pandoc.Div(pandoc.Para(...))` → `Div([Para(...)])` |
| 247 | + - `pandoc.Div("text")` → `Div([Plain([Str("text")])])` |
| 248 | + - `pandoc.BlockQuote("text")` → `BlockQuote([Plain([Str("text")])])` |
| 249 | + |
| 250 | +- [x] **3.3** Add tests for Inlines/Blocks constructors: |
| 251 | + - `pandoc.Inlines("hello world")` → word-split |
| 252 | + - `pandoc.Blocks("text")` → `[Plain([Str("text")])]` |
| 253 | + - `pandoc.Blocks(pandoc.Str("x"))` → `[Plain([Str("x")])]` |
| 254 | + |
| 255 | +- [x] **3.4** Add tests for helper constructors: |
| 256 | + - `pandoc.BulletList({"text", "more"})` — each string becomes blocks |
| 257 | + - `pandoc.BulletList(pandoc.Para(...))` — single item wrapping |
| 258 | + - `pandoc.LineBlock({"line one", "line two"})` — string lines |
| 259 | + - `pandoc.Citation("id", mode, "prefix")` — string prefix/suffix |
| 260 | + - Caption with string long/short |
| 261 | + |
| 262 | +- [x] **3.5** Add a test reproducing the lipsum pattern: |
| 263 | + ```lua |
| 264 | + local json = quarto.json.decode('["Lorem ipsum dolor sit amet"]') |
| 265 | + return pandoc.Para(json[1]) |
| 266 | + ``` |
| 267 | + Verify it produces `Para([Str("Lorem"), Space, Str("ipsum"), ...])`. |
| 268 | + |
| 269 | +### Phase 4: Verify |
| 270 | + |
| 271 | +- [x] **4.1** Run `cargo nextest run -p pampa` — all constructor and |
| 272 | + shortcode tests pass |
| 273 | +- [x] **4.2** Run `cargo nextest run --workspace` — no regressions |
| 274 | +- [x] **4.3** Verify the lipsum smoke test still works (it uses the |
| 275 | + `pandoc.Para({pandoc.Str(...)})` explicit form, which must keep working) |
| 276 | + |
| 277 | +## Design Notes |
| 278 | + |
| 279 | +### Why word-splitting matters |
| 280 | + |
| 281 | +Real Pandoc's `peekInlinesFuzzy` doesn't just wrap a string in `Str` — it |
| 282 | +splits on whitespace. This is because `pandoc.Para("hello world")` should |
| 283 | +produce the same AST as Pandoc would from parsing markdown `hello world`: |
| 284 | +multiple `Str` nodes separated by `Space`. |
| 285 | + |
| 286 | +This distinction matters downstream: filters that match on individual `Str`
| 287 | +nodes, and writers that wrap lines at `Space` elements, can treat a single
| 288 | +`Str("hello world")` differently from `Str("hello") Space Str("world")`.
| 289 | + |
| 290 | +### `peekInlineFuzzy` vs `peekInlinesFuzzy` string handling |
| 291 | + |
| 292 | +These behave differently for strings: |
| 293 | +- `peekInlinesFuzzy("hello world")` → `{Str("hello"), Space, Str("world")}` |
| 294 | + (word split — used when a string is the ENTIRE content argument) |
| 295 | +- `peekInlineFuzzy("hello world")` → `Str("hello world")` |
| 296 | + (no split — used when a string is ONE ELEMENT in a table) |
| 297 | + |
| 298 | +In `{"hello", pandoc.Space(), "world"}`, each string element therefore
| 299 | +becomes a single `Str` node. Word-splitting applies only when the string
| 300 | +is itself the whole content argument.
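
A small Rust sketch of this two-level dispatch is shown below. It is an illustration under assumptions, not q2's code: `Value` is a hypothetical stand-in for the Lua value type (with `Value::Inline` playing the role of `LuaInline` userdata), and `split_words` is a compact hypothetical helper for the whitespace grouping rule.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Inline {
    Str(String),
    Space,
    SoftBreak,
}

/// Hypothetical stand-in for the Lua values a constructor can receive.
#[derive(Debug, Clone)]
pub enum Value {
    Boolean(bool),
    String(String),
    Inline(Inline),    // plays the role of Inline userdata
    Table(Vec<Value>), // sequence part of a Lua table
}

/// Group `s` into whitespace and non-whitespace runs.
fn split_words(s: &str) -> Vec<Inline> {
    let is_ws = |c: char| matches!(c, ' ' | '\t' | '\r' | '\n');
    let mut out = Vec::new();
    let mut rest = s;
    while let Some(first) = rest.chars().next() {
        let in_ws = is_ws(first);
        // End of the current run: first char in the other category.
        let end = rest.find(|c: char| is_ws(c) != in_ws).unwrap_or(rest.len());
        let (run, tail) = rest.split_at(end);
        if !in_ws {
            out.push(Inline::Str(run.to_string()));
        } else if run.contains('\n') || run.contains('\r') {
            out.push(Inline::SoftBreak);
        } else {
            out.push(Inline::Space);
        }
        rest = tail;
    }
    out
}

/// Element-level coercion: a string becomes ONE `Str`, with no splitting.
pub fn peek_inline_fuzzy(v: &Value) -> Result<Inline, String> {
    match v {
        Value::String(s) => Ok(Inline::Str(s.clone())),
        Value::Inline(i) => Ok(i.clone()),
        other => Err(format!("expected Inline, got {:?}", other)),
    }
}

/// Argument-level coercion: a top-level string is word-split, while table
/// elements go through `peek_inline_fuzzy`.
pub fn peek_inlines_fuzzy(v: &Value) -> Result<Vec<Inline>, String> {
    match v {
        Value::String(s) => Ok(split_words(s)),
        Value::Table(items) => items.iter().map(peek_inline_fuzzy).collect(),
        Value::Inline(i) => Ok(vec![i.clone()]),
        other => Err(format!("expected Inlines, got {:?}", other)),
    }
}
```

Keeping the two functions separate means the string rule is decided purely by where the string appears, so a table element such as `"hello world"` keeps its embedded space.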
| 301 | + |
| 302 | +### Per-element block coercion |
| 303 | + |
| 304 | +When a table of inlines is passed to a blocks constructor, each element |
| 305 | +is independently coerced via `peek_block_fuzzy`. This means |
| 306 | +`pandoc.Div({pandoc.Str("x"), pandoc.Str("y")})` produces |
| 307 | +`Div([Plain([Str("x")]), Plain([Str("y")])])` — two separate Plain blocks, |
| 308 | +NOT one Plain containing both inlines. |
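
The per-element behavior can be sketched in Rust as below. This is a simplified illustration with assumed types: `Value` is a hypothetical stand-in for Lua values, string handling is left out to keep the sketch short, and `peek_inlines` is a reduced placeholder for the fuzzy inlines coercion.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Inline {
    Str(String),
}

#[derive(Debug, Clone, PartialEq)]
pub enum Block {
    Plain(Vec<Inline>),
    Para(Vec<Inline>),
}

/// Hypothetical stand-in for Lua values (strings omitted for brevity).
#[derive(Debug, Clone)]
pub enum Value {
    Inline(Inline),    // plays the role of Inline userdata
    Block(Block),      // plays the role of Block userdata
    Table(Vec<Value>), // sequence part of a Lua table
}

/// Reduced inlines coercion: a single inline or a table of inlines.
fn peek_inlines(v: &Value) -> Result<Vec<Inline>, String> {
    match v {
        Value::Inline(i) => Ok(vec![i.clone()]),
        Value::Table(items) => items
            .iter()
            .map(|item| match item {
                Value::Inline(i) => Ok(i.clone()),
                other => Err(format!("expected Inline, got {:?}", other)),
            })
            .collect(),
        other => Err(format!("expected Inlines, got {:?}", other)),
    }
}

/// One value to one block: an exact block, else inlines wrapped in `Plain`.
pub fn peek_block_fuzzy(v: &Value) -> Result<Block, String> {
    match v {
        Value::Block(b) => Ok(b.clone()),
        other => peek_inlines(other).map(Block::Plain),
    }
}

/// A table is coerced element by element; anything else becomes a
/// singleton list via `peek_block_fuzzy`.
pub fn peek_blocks_fuzzy(v: &Value) -> Result<Vec<Block>, String> {
    match v {
        Value::Table(items) => items.iter().map(peek_block_fuzzy).collect(),
        single => peek_block_fuzzy(single).map(|b| vec![b]),
    }
}
```

Since the table branch runs first and maps each element on its own, two sibling inlines never see each other, which is why they end up in separate `Plain` blocks.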
| 309 | + |
| 310 | +### Metamethods (`__toinline`, `__toblock`) — deferred |
| 311 | + |
| 312 | +Real Pandoc supports `__toinline` and `__toblock` metamethods for custom
| 313 | +type coercion. q2 does not implement these yet, and no current extension
| 314 | +needs them. They can be added later if that changes.
| 315 | + |
| 316 | +### Migration: old function names |
| 317 | + |
| 318 | +After renaming `lua_table_to_inlines` → `peek_inlines_fuzzy` and
| 319 | +`lua_table_to_blocks` → `peek_blocks_fuzzy`, grep for remaining callers of the
| 320 | +old names. The rename makes the behavior change visible and matches Pandoc's terminology.
| 321 | + |
| 322 | +## Pandoc Source References |
| 323 | + |
| 324 | +| File | Content | |
| 325 | +|---|---| |
| 326 | +| `~/src/pandoc-lua-marshal/src/Text/Pandoc/Lua/Marshal/Inline.hs` | `peekInlineFuzzy` (L127), `peekInlinesFuzzy` (L138), `mkInlines` (L444), all inline constructors | |
| 327 | +| `~/src/pandoc-lua-marshal/src/Text/Pandoc/Lua/Marshal/Block.hs` | `peekBlockFuzzy` (L133), `peekBlocksFuzzy` (L145), `mkBlocks` (L477), `peekItemsFuzzy` (L469), all block constructors | |
| 328 | +| `~/src/pandoc-lua-marshal/src/Text/Pandoc/Lua/Marshal/Content.hs` | `peekDefinitionItem` (L73) | |
| 329 | +| `~/src/pandoc-lua-marshal/src/Text/Pandoc/Lua/Marshal/Caption.hs` | `peekCaptionFuzzy` (L74), `mkCaption` (L83) | |
| 330 | +| `~/src/pandoc-lua-marshal/src/Text/Pandoc/Lua/Marshal/Citation.hs` | `mkCitation` (L83) — prefix/suffix use `peekInlinesFuzzy` | |
| 331 | +| `~/src/pandoc-types/src/Text/Pandoc/Builder.hs` | `B.text` (L334) — word-splitting algorithm | |
| 332 | + |
| 333 | +## Files Touched |
| 334 | + |
| 335 | +| File | Change | |
| 336 | +|---|---| |
| 337 | +| `crates/pampa/src/lua/types.rs` | Rewrite `lua_table_to_inlines/blocks` as fuzzy variants, add `split_string_to_inlines` | |
| 338 | +| `crates/pampa/src/lua/constructors.rs` | Update all constructor calls, all helper functions, update `Inlines`/`Blocks` constructors | |