Commit 72d5d7a

Implement Pandoc-compatible fuzzy type coercion for Lua constructors
Pandoc's Lua API automatically coerces arguments passed to constructors (e.g. pandoc.Para("text") or pandoc.Div(pandoc.Para(...))). Our constructors were strict, only accepting tables of exact userdata types, causing real-world extensions like lipsum to fail.

Add peek_inlines_fuzzy/peek_blocks_fuzzy matching pandoc-lua-marshal's peekInlinesFuzzy/peekBlocksFuzzy behavior:

- Strings word-split into Str/Space/SoftBreak (matching B.text)
- Single userdata wrapped in singleton lists
- Mixed tables of strings and userdata accepted
- Inlines-like values in block context wrapped in Plain

Update all constructors (Para, Emph, Div, etc.), helper functions (parse_list_items, parse_caption, parse_single_citation, etc.), and pandoc.Inlines()/pandoc.Blocks() to use fuzzy coercion uniformly.
1 parent 7e77f71 commit 72d5d7a

5 files changed

Lines changed: 1207 additions & 149 deletions

Lines changed: 338 additions & 0 deletions
@@ -0,0 +1,338 @@
1+
# Plan: Pandoc Lua Constructor Type Coercion
2+
3+
## Status: Complete
4+
5+
---
6+
7+
## Overview
8+
9+
Pandoc's Lua API performs automatic type coercion ("fuzzy peeking") when
10+
constructors receive arguments. q2's constructors are strict — they only
11+
accept tables of the exact userdata type. This causes real-world extensions
12+
(e.g., lipsum) to fail because they rely on coercion behaviors like
13+
`pandoc.Para("text")` or `pandoc.Para(pandoc.Str("x"))`.
14+
15+
This plan brings q2's coercion in line with real Pandoc's `pandoc-lua-marshal`
16+
package, specifically the `peekInlinesFuzzy`, `peekInlineFuzzy`,
17+
`peekBlocksFuzzy`, and `peekBlockFuzzy` functions.
18+
19+
## Codebase Context
20+
21+
### Where coercion happens in real Pandoc
22+
23+
Source: `pandoc-lua-marshal` Haskell package
24+
(`~/src/pandoc-lua-marshal/src/Text/Pandoc/Lua/Marshal/`).
25+
26+
**`peekInlinesFuzzy`** (Inline.hs:138-147) — dispatches on Lua type:
27+
1. `TypeString` → word-split via `B.text` into `Str`/`Space`/`SoftBreak` list
28+
2. `TypeTable` → try `__toinline` metamethod (→ singleton), else `peekList peekInlineFuzzy`
29+
3. `TypeUserdata` → singleton via `peekInlineFuzzy`
30+
4. Otherwise → error
31+
32+
**`peekInlineFuzzy`** (Inline.hs:127-134) — dispatches on Lua type:
33+
1. `TypeString` → `Str(text)` (NO word splitting)
34+
2. `TypeTable` → try `__toinline` metamethod, else `peekInline`
35+
3. `TypeUserdata` → `peekInline` or `__toinline` metamethod
36+
4. Otherwise → error
37+
38+
**`peekBlocksFuzzy`** (Block.hs:145-153) — tries in order:
39+
1. `__toblock` metamethod → singleton list
40+
2. `peekList peekBlockFuzzy` (each element via `peekBlockFuzzy`)
41+
3. Single `peekBlockFuzzy` → singleton list
42+
4. Otherwise → error
43+
44+
**`peekBlockFuzzy`** (Block.hs:133-141) — tries in order:
45+
1. `peekBlock` (exact Block userdata)
46+
2. `__toblock` metamethod
47+
3. `Plain <$!> peekInlinesFuzzy` (any inlines-like value → wrap in Plain)
48+
4. Otherwise → error
49+
50+
**`B.text`** (pandoc-types Builder.hs:334-350) — word-splitting:
51+
- Groups consecutive characters by space/non-space category
52+
- Space chars: ` `, `\r`, `\n`, `\t`
53+
- Non-space runs → `Str`
54+
- Space-only runs → `Space`, unless the run contains `\n` or `\r` → `SoftBreak`
55+
- Multiple consecutive spaces collapse to a single `Space`/`SoftBreak`
56+
- Empty string → empty list
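
The word-splitting rules above can be sketched in Rust as follows. This is a minimal illustration, not the code in `types.rs`: the `Inline` enum here is a simplified stand-in for the real AST type, and because the rules listed above say nothing about trimming, the sketch assumes leading or trailing whitespace also yields a `Space`/`SoftBreak` element.

```rust
// Simplified stand-in for the real inline AST type (assumption for illustration).
#[derive(Debug, Clone, PartialEq)]
enum Inline {
    Str(String),
    Space,
    SoftBreak,
}

// The four characters that count as "space" in the rules above.
fn is_space(c: char) -> bool {
    matches!(c, ' ' | '\r' | '\n' | '\t')
}

// Group consecutive characters by space / non-space category and
// convert each run into one inline element.
fn split_string_to_inlines(s: &str) -> Vec<Inline> {
    let mut out = Vec::new();
    let mut chars = s.chars().peekable();
    while let Some(&first) = chars.peek() {
        let in_space = is_space(first);
        let mut run = String::new();
        // Consume the whole run of characters in the same category.
        while let Some(&c) = chars.peek() {
            if is_space(c) != in_space {
                break;
            }
            run.push(c);
            chars.next();
        }
        if !in_space {
            out.push(Inline::Str(run));
        } else if run.contains('\n') || run.contains('\r') {
            // A whitespace run containing a newline becomes one SoftBreak.
            out.push(Inline::SoftBreak);
        } else {
            // Any other whitespace run collapses to one Space.
            out.push(Inline::Space);
        }
    }
    out
}
```

Because each whitespace run maps to exactly one element, collapsing of repeated spaces falls out of the grouping step rather than needing a separate pass.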
57+
58+
### Where coercion happens in q2
59+
60+
- `crates/pampa/src/lua/types.rs`: `lua_table_to_inlines()` (line ~1343)
61+
and `lua_table_to_blocks()` (line ~1367). Both only accept `Value::Table`
62+
containing the exact userdata type.
63+
- `crates/pampa/src/lua/constructors.rs` — all constructors call one of
64+
these two functions. The `pandoc.Inlines()` and `pandoc.Blocks()`
65+
constructors have their own coercion logic that is partially correct.
66+
67+
### Which constructors use which peek functions in real Pandoc
68+
69+
Every constructor in Pandoc that takes inlines or blocks uses the fuzzy
70+
variants — no exceptions. Full mapping from `pandoc-lua-marshal`:
71+
72+
| Constructor | Parameter | Pandoc peek function |
73+
|---|---|---|
74+
| Para, Plain | content | `peekInlinesFuzzy` |
75+
| Header | content | `peekInlinesFuzzy` |
76+
| Emph, Strong, Underline, Strikeout, Superscript, Subscript, SmallCaps | content | `peekInlinesFuzzy` |
77+
| Quoted | content | `peekInlinesFuzzy` |
78+
| Link, Image | content | `peekInlinesFuzzy` |
79+
| Span | content | `peekInlinesFuzzy` |
80+
| Cite | content | `peekInlinesFuzzy` |
81+
| Note, BlockQuote, Div | content | `peekBlocksFuzzy` |
82+
| Figure | content | `peekBlocksFuzzy` |
83+
| BulletList, OrderedList | items | `peekItemsFuzzy` = `peekList peekBlocksFuzzy \|\| singleton peekBlocksFuzzy` |
84+
| DefinitionList | items | `peekList peekDefinitionItem` where term = `peekInlinesFuzzy`, defs = `peekList peekBlocksFuzzy \|\| singleton peekBlocksFuzzy` |
85+
| LineBlock | lines | `peekList peekInlinesFuzzy` |
86+
| Caption | long | `peekBlocksFuzzy` |
87+
| Caption | short | `peekInlinesFuzzy` |
88+
| Caption (fuzzy peek) | fallback | tries Caption, then table, then `peekBlocksFuzzy` |
89+
| Citation | prefix | `peekInlinesFuzzy` |
90+
| Citation | suffix | `peekInlinesFuzzy` |
91+
| pandoc.Inlines | content | `peekInlinesFuzzy` (delegates entirely) |
92+
| pandoc.Blocks | content | `peekBlocksFuzzy` (delegates entirely) |
93+
94+
## Current q2 behavior vs expected
95+
96+
### Inlines constructors (Para, Emph, Strong, etc.)
97+
98+
| Input | Real Pandoc | q2 now | Gap |
99+
|---|---|---|---|
100+
| `{pandoc.Str("x"), pandoc.Space()}` | works | works | none |
101+
| `pandoc.Str("x")` (single userdata) | `{Str("x")}` | **error** | fix |
102+
| `"hello"` (string) | `{Str("hello")}` | **error** | fix |
103+
| `"hello world"` (multi-word string) | `{Str("hello"), Space, Str("world")}` | **error** | fix |
104+
| `{"hello", pandoc.Space(), "world"}` (mixed) | `{Str("hello"), Space, Str("world")}` | **error** | fix |
105+
106+
### Blocks constructors (Div, BlockQuote, Figure, Note)
107+
108+
| Input | Real Pandoc | q2 now | Gap |
109+
|---|---|---|---|
110+
| `{pandoc.Para(...)}` (table of blocks) | works | works | none |
111+
| `pandoc.Para(...)` (single userdata) | `{Para(...)}` | **error** | fix |
112+
| `"text"` (string) | `{Plain({Str("text")})}` | **error** | fix |
113+
| `{pandoc.Str("x")}` (inlines-like) | `{Plain({Str("x")})}` | **error** | fix |
114+
| `{pandoc.Str("x"), pandoc.Str("y")}` (multiple inlines) | `{Plain({Str("x")}), Plain({Str("y")})}` | **error** | fix |
115+
116+
Note: A table of inlines passed to a blocks constructor produces **one Plain
117+
block per element**, NOT one Plain wrapping all inlines. This is because
118+
`peekBlockFuzzy` is applied per-element, and each inline individually becomes
119+
`Plain([that_inline])`.
120+
121+
### `pandoc.Inlines()` constructor
122+
123+
| Input | Real Pandoc | q2 now | Gap |
124+
|---|---|---|---|
125+
| `"hello world"` | `{Str("hello"), Space, Str("world")}` | `{Str("hello world")}` | fix |
126+
| `{"hello", pandoc.Str("!")}` (mixed) | `{Str("hello"), Str("!")}` | `{Str("hello"), Str("!")}` | none |
127+
| Single Inline userdata | wraps in list | wraps in list | none |
128+
| `nil` | empty list | empty list | none |
129+
130+
### `pandoc.Blocks()` constructor
131+
132+
| Input | Real Pandoc | q2 now | Gap |
133+
|---|---|---|---|
134+
| Single Block userdata | wraps in list | wraps in list | none |
135+
| `nil` | empty list | empty list | none |
136+
| String | `{Plain(word-split inlines)}` | **error** | fix |
137+
| Inlines-like | `{Plain(inlines)}` | **error** | fix |
138+
139+
### Helper constructors
140+
141+
| Constructor | Parameter | Real Pandoc | q2 now | Gap |
142+
|---|---|---|---|---|
143+
| BulletList | each item | `peekBlocksFuzzy` (string → `[Plain(word-split)]`) | strict blocks only | fix |
144+
| BulletList | items arg | list of items OR single item → singleton | list only | fix |
145+
| OrderedList | each item | same as BulletList | strict blocks only | fix |
146+
| DefinitionList | term | `peekInlinesFuzzy` (string → word-split) | strict inlines only | fix |
147+
| DefinitionList | definitions | `peekList peekBlocksFuzzy` or single → singleton | strict blocks only | fix |
148+
| LineBlock | each line | `peekInlinesFuzzy` (string → word-split) | strict inlines only | fix |
149+
| Caption | long | `peekBlocksFuzzy` | strict blocks only | fix |
150+
| Caption | short | `peekInlinesFuzzy` | strict inlines only | fix |
151+
| Citation | prefix | `peekInlinesFuzzy` | strict inlines only | fix |
152+
| Citation | suffix | `peekInlinesFuzzy` | strict inlines only | fix |
153+
154+
---
155+
156+
## Work Items
157+
158+
### Phase 1: Core coercion functions (types.rs)
159+
160+
- [x] **1.1** Add `split_string_to_inlines(s: &str) -> Vec<Inline>` utility
161+
that splits a string on whitespace, producing `Str`/`Space`/`SoftBreak`
162+
elements matching Pandoc's `B.text` behavior:
163+
- Group consecutive characters by space vs non-space
164+
- Space characters: ` `, `\r`, `\n`, `\t`
165+
- Non-space runs → `Str(text)`
166+
- Space-only runs → `SoftBreak` if run contains `\n` or `\r`, else `Space`
167+
- Multiple consecutive spaces collapse into a single `Space`/`SoftBreak`
168+
- Empty string → empty vec
169+
170+
- [x] **1.2** Rewrite `lua_table_to_inlines()` as `peek_inlines_fuzzy()`:
171+
Accept (in this priority order, matching Pandoc's type dispatch):
172+
1. `Value::String` — word-split via `split_string_to_inlines()`
173+
2. `Value::Table` — iterate sequence values, each via `peek_inline_fuzzy()`
174+
3. `Value::UserData` containing `LuaInline` — wrap in singleton vec
175+
4. Otherwise → error
176+
177+
Add helper `peek_inline_fuzzy(val: Value) -> Result<Inline>`:
178+
1. `Value::String` — wrap in single `Str` (NO word splitting)
179+
2. `Value::UserData` containing `LuaInline` — extract
180+
3. Otherwise → error
181+
182+
- [x] **1.3** Rewrite `lua_table_to_blocks()` as `peek_blocks_fuzzy()`:
183+
Accept (in this priority order, matching Pandoc):
184+
1. `Value::Table` — iterate sequence values, each via `peek_block_fuzzy()`
185+
2. `Value::UserData` containing `LuaBlock` — wrap in singleton vec
186+
3. Any value that `peek_inlines_fuzzy()` accepts — wrap in
187+
`Plain(inlines)` as singleton vec
188+
4. Otherwise → error
189+
190+
Add helper `peek_block_fuzzy(val: Value) -> Result<Block>`:
191+
1. `Value::UserData` containing `LuaBlock` — extract
192+
2. Any value that `peek_inlines_fuzzy()` accepts — wrap in `Plain(inlines)`
193+
3. Otherwise → error
194+
195+
- [x] **1.4** Write tests for each coercion path:
196+
- `split_string_to_inlines`: empty, single word, multi-word, newlines,
197+
tabs, multiple consecutive spaces, mixed space/newline runs,
198+
leading/trailing whitespace
199+
- `peek_inlines_fuzzy`: table of inlines, table with mixed strings,
200+
single inline, single string, multi-word string
201+
- `peek_blocks_fuzzy`: table of blocks, single block, string→Plain,
202+
inlines-like→Plain, table of inlines→multiple Plains
203+
204+
### Phase 2: Update all constructors (constructors.rs)
205+
206+
- [x] **2.1** Replace all calls to `lua_table_to_inlines()` with
207+
`peek_inlines_fuzzy()` in constructors: Para, Plain, Header, Emph,
208+
Strong, Underline, Strikeout, Superscript, Subscript, SmallCaps,
209+
Quoted, Link, Image, Span, Cite.
210+
211+
- [x] **2.2** Replace all calls to `lua_table_to_blocks()` with
212+
`peek_blocks_fuzzy()` in constructors: Note, BlockQuote, Div, Figure.
213+
214+
- [x] **2.3** Update helper parsing functions that call the old functions:
215+
- `parse_list_items()` → use `peek_blocks_fuzzy()` for each item,
216+
AND accept a single blocks-like value (not just a table of items),
217+
matching Pandoc's `peekItemsFuzzy`.
218+
- `parse_definition_list_items()` → use `peek_inlines_fuzzy()` for terms,
219+
`peek_blocks_fuzzy()` for definitions (already via parse_list_items).
220+
- `parse_line_block_content()` → use `peek_inlines_fuzzy()` for each line.
221+
- `parse_caption()` → use `peek_inlines_fuzzy()` for short,
222+
`peek_blocks_fuzzy()` for long.
223+
- `parse_single_citation()` → use `peek_inlines_fuzzy()` for prefix
224+
and suffix.
225+
226+
- [x] **2.4** Update `pandoc.Inlines()` constructor: delegate entirely to
227+
`peek_inlines_fuzzy()` for the content argument, then wrap results as
228+
LuaInline userdata in a Lua table with the Inlines metatable. This
229+
replaces the current inline coercion logic and adds word-splitting
230+
for top-level strings.
231+
232+
- [x] **2.5** Update `pandoc.Blocks()` constructor: delegate to
233+
`peek_blocks_fuzzy()` for the content argument, then wrap results as
234+
LuaBlock userdata in a Lua table with the Blocks metatable. This adds
235+
support for strings and inlines-like values (wrapped in Plain).
236+
237+
### Phase 3: Constructor-level tests
238+
239+
- [x] **3.1** Add tests for inlines constructors with coerced input types:
240+
- `pandoc.Para("hello world")` → `Para([Str("hello"), Space, Str("world")])`
241+
- `pandoc.Para(pandoc.Str("x"))` → `Para([Str("x")])`
242+
- `pandoc.Emph("text")` → `Emph([Str("text")])`
243+
- `pandoc.Header(1, "title")` → `Header(1, [Str("title")])`
244+
245+
- [x] **3.2** Add tests for blocks constructors with coerced input types:
246+
- `pandoc.Div(pandoc.Para(...))` → `Div([Para(...)])`
247+
- `pandoc.Div("text")` → `Div([Plain([Str("text")])])`
248+
- `pandoc.BlockQuote("text")` → `BlockQuote([Plain([Str("text")])])`
249+
250+
- [x] **3.3** Add tests for Inlines/Blocks constructors:
251+
- `pandoc.Inlines("hello world")` → word-split
252+
- `pandoc.Blocks("text")` → `[Plain([Str("text")])]`
253+
- `pandoc.Blocks(pandoc.Str("x"))` → `[Plain([Str("x")])]`
254+
255+
- [x] **3.4** Add tests for helper constructors:
256+
- `pandoc.BulletList({"text", "more"})` — each string becomes blocks
257+
- `pandoc.BulletList(pandoc.Para(...))` — single item wrapping
258+
- `pandoc.LineBlock({"line one", "line two"})` — string lines
259+
- `pandoc.Citation("id", mode, "prefix")` — string prefix/suffix
260+
- Caption with string long/short
261+
262+
- [x] **3.5** Add a test reproducing the lipsum pattern:
263+
```lua
264+
local json = quarto.json.decode('["Lorem ipsum dolor sit amet"]')
265+
return pandoc.Para(json[1])
266+
```
267+
Verify it produces `Para([Str("Lorem"), Space, Str("ipsum"), ...])`.
268+
269+
### Phase 4: Verify
270+
271+
- [x] **4.1** Run `cargo nextest run -p pampa` — all constructor and
272+
shortcode tests pass
273+
- [x] **4.2** Run `cargo nextest run --workspace` — no regressions
274+
- [x] **4.3** Verify the lipsum smoke test still works (it uses the
275+
`pandoc.Para({pandoc.Str(...)})` explicit form, which must keep working)
276+
277+
## Design Notes
278+
279+
### Why word-splitting matters
280+
281+
Real Pandoc's `peekInlinesFuzzy` doesn't just wrap a string in `Str` — it
282+
splits on whitespace. This is because `pandoc.Para("hello world")` should
283+
produce the same AST as Pandoc would from parsing markdown `hello world`:
284+
multiple `Str` nodes separated by `Space`.
285+
286+
This distinction matters for rendering: a single `Str("hello world")` with
287+
an embedded space may render differently than `Str("hello") Space Str("world")`
288+
in some output formats.
289+
290+
### `peekInlineFuzzy` vs `peekInlinesFuzzy` string handling
291+
292+
These behave differently for strings:
293+
- `peekInlinesFuzzy("hello world")` → `{Str("hello"), Space, Str("world")}`
294+
(word split — used when a string is the ENTIRE content argument)
295+
- `peekInlineFuzzy("hello world")` → `Str("hello world")`
296+
(no split — used when a string is ONE ELEMENT in a table)
297+
298+
This is because in `{"hello", pandoc.Space(), "world"}`, each string
299+
element is treated as a single `Str` node. Word-splitting only applies
300+
at the top level.
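
The difference between the two string paths can be sketched in Rust as below. The `Value` enum is a hypothetical stand-in for a Lua value (the real code dispatches on the Lua binding's value type and on `LuaInline` userdata), and `Inline` is a simplified AST type; both are assumptions made so the example is self-contained.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Inline {
    Str(String),
    Space,
    SoftBreak,
}

// Hypothetical stand-in for a Lua value; `Inline` models userdata holding an inline.
#[derive(Debug, Clone)]
enum Value {
    Nil,
    String(String),
    Table(Vec<Value>),
    Inline(Inline),
}

// Word-split a string by grouping space / non-space runs.
fn split_string_to_inlines(s: &str) -> Vec<Inline> {
    let is_space = |c: char| matches!(c, ' ' | '\t' | '\r' | '\n');
    let mut out = Vec::new();
    let mut rest = s;
    while let Some(first) = rest.chars().next() {
        let in_space = is_space(first);
        // End of the current run: first character of the other category.
        let end = rest
            .find(|c: char| is_space(c) != in_space)
            .unwrap_or(rest.len());
        let (run, tail) = rest.split_at(end);
        out.push(if !in_space {
            Inline::Str(run.to_string())
        } else if run.contains('\n') || run.contains('\r') {
            Inline::SoftBreak
        } else {
            Inline::Space
        });
        rest = tail;
    }
    out
}

// Element-level coercion: a string becomes ONE Str, with no splitting.
fn peek_inline_fuzzy(v: &Value) -> Result<Inline, String> {
    match v {
        Value::String(s) => Ok(Inline::Str(s.clone())),
        Value::Inline(i) => Ok(i.clone()),
        other => Err(format!("expected an inline, got {:?}", other)),
    }
}

// Argument-level coercion: a top-level string is word-split.
fn peek_inlines_fuzzy(v: &Value) -> Result<Vec<Inline>, String> {
    match v {
        Value::String(s) => Ok(split_string_to_inlines(s)),
        Value::Table(items) => items.iter().map(peek_inline_fuzzy).collect(),
        Value::Inline(i) => Ok(vec![i.clone()]),
        other => Err(format!("expected inlines, got {:?}", other)),
    }
}
```

In this sketch, passing `"hello world"` directly takes the `Value::String` arm of `peek_inlines_fuzzy` and is split, while the same string inside a table goes through `peek_inline_fuzzy` and stays a single `Str`.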
301+
302+
### Per-element block coercion
303+
304+
When a table of inlines is passed to a blocks constructor, each element
305+
is independently coerced via `peek_block_fuzzy`. This means
306+
`pandoc.Div({pandoc.Str("x"), pandoc.Str("y")})` produces
307+
`Div([Plain([Str("x")]), Plain([Str("y")])])` — two separate Plain blocks,
308+
NOT one Plain containing both inlines.
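
A rough Rust sketch of this per-element behavior follows. As before, `Value`, `Inline`, and `Block` are simplified stand-ins introduced for illustration, and the string splitting uses `split_whitespace` as a shortcut (it trims and never yields `SoftBreak`), so it is not a faithful `B.text` port.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Inline {
    Str(String),
    Space,
}

#[derive(Debug, Clone, PartialEq)]
enum Block {
    Para(Vec<Inline>),
    Plain(Vec<Inline>),
}

// Hypothetical stand-in for a Lua value; `Inline`/`Block` model userdata.
#[derive(Debug, Clone)]
enum Value {
    Nil,
    String(String),
    Table(Vec<Value>),
    Inline(Inline),
    Block(Block),
}

// Simplified inlines coercion (assumption: whitespace handling is abbreviated).
fn peek_inlines_fuzzy(v: &Value) -> Result<Vec<Inline>, String> {
    match v {
        Value::String(s) => {
            let mut out = Vec::new();
            for (i, word) in s.split_whitespace().enumerate() {
                if i > 0 {
                    out.push(Inline::Space);
                }
                out.push(Inline::Str(word.to_string()));
            }
            Ok(out)
        }
        Value::Table(items) => items
            .iter()
            .map(|item| match item {
                Value::String(s) => Ok(Inline::Str(s.clone())),
                Value::Inline(i) => Ok(i.clone()),
                other => Err(format!("expected an inline, got {:?}", other)),
            })
            .collect(),
        Value::Inline(i) => Ok(vec![i.clone()]),
        other => Err(format!("expected inlines, got {:?}", other)),
    }
}

// One value -> one block: exact block first, else wrap inlines in Plain.
fn peek_block_fuzzy(v: &Value) -> Result<Block, String> {
    match v {
        Value::Block(b) => Ok(b.clone()),
        other => peek_inlines_fuzzy(other).map(Block::Plain),
    }
}

// A table is coerced element by element; anything else becomes a singleton.
fn peek_blocks_fuzzy(v: &Value) -> Result<Vec<Block>, String> {
    match v {
        Value::Table(items) => items.iter().map(peek_block_fuzzy).collect(),
        other => peek_block_fuzzy(other).map(|b| vec![b]),
    }
}
```

Since the table arm maps `peek_block_fuzzy` over each element, a table of two inlines yields two `Plain` blocks, whereas a single inline or string passed directly yields one.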
309+
310+
### Metamethods (`__toinline`, `__toblock`) — deferred
311+
312+
Real Pandoc supports `__toinline` and `__toblock` metamethods for custom
313+
type coercion. We don't implement these yet and they're not needed for
314+
any current extension. Support can be added later if an extension requires it.
315+
316+
### Migration: old function names
317+
318+
After renaming `lua_table_to_inlines` → `peek_inlines_fuzzy` (and blocks),
319+
grep for any remaining callers. The rename makes the behavior change
320+
visible and matches Pandoc's terminology.
321+
322+
## Pandoc Source References
323+
324+
| File | Content |
325+
|---|---|
326+
| `~/src/pandoc-lua-marshal/src/Text/Pandoc/Lua/Marshal/Inline.hs` | `peekInlineFuzzy` (L127), `peekInlinesFuzzy` (L138), `mkInlines` (L444), all inline constructors |
327+
| `~/src/pandoc-lua-marshal/src/Text/Pandoc/Lua/Marshal/Block.hs` | `peekBlockFuzzy` (L133), `peekBlocksFuzzy` (L145), `mkBlocks` (L477), `peekItemsFuzzy` (L469), all block constructors |
328+
| `~/src/pandoc-lua-marshal/src/Text/Pandoc/Lua/Marshal/Content.hs` | `peekDefinitionItem` (L73) |
329+
| `~/src/pandoc-lua-marshal/src/Text/Pandoc/Lua/Marshal/Caption.hs` | `peekCaptionFuzzy` (L74), `mkCaption` (L83) |
330+
| `~/src/pandoc-lua-marshal/src/Text/Pandoc/Lua/Marshal/Citation.hs` | `mkCitation` (L83) — prefix/suffix use `peekInlinesFuzzy` |
331+
| `~/src/pandoc-types/src/Text/Pandoc/Builder.hs` | `B.text` (L334) — word-splitting algorithm |
332+
333+
## Files Touched
334+
335+
| File | Change |
336+
|---|---|
337+
| `crates/pampa/src/lua/types.rs` | Rewrite `lua_table_to_inlines/blocks` as fuzzy variants, add `split_string_to_inlines` |
338+
| `crates/pampa/src/lua/constructors.rs` | Update all constructor calls, all helper functions, update `Inlines`/`Blocks` constructors |
