Phase 3: Make script rewriters fragment-safe for full streaming #584

@aram356

Parent: #563

Context

lol_html fragments text nodes across input chunk boundaries when processing HTML incrementally. Script rewriters (NextJsNextDataRewriter, GoogleTagManagerIntegration) currently expect complete text content — if a domain string like "googletagmanager.com" is split across chunks, the rewrite silently fails.
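A minimal sketch of the failure mode, using plain string functions rather than the real rewriters. The helper names and the replacement domain `proxy.example` are hypothetical; the point is only that a per-fragment substring replace cannot see a match that straddles two fragments.

```rust
/// Hypothetical per-fragment rewrite: replaces `from` with `to`,
/// looking at one fragment in isolation.
fn rewrite_fragment(fragment: &str, from: &str, to: &str) -> String {
    fragment.replace(from, to)
}

/// Applies the per-fragment rewrite to every fragment and concatenates
/// the results, mimicking a handler that never sees the whole text node.
fn rewrite_each(fragments: &[&str], from: &str, to: &str) -> String {
    fragments
        .iter()
        .map(|fragment| rewrite_fragment(fragment, from, to))
        .collect()
}
```

With a single fragment the domain is rewritten. When the same text arrives as `"...googletag"` followed by `"manager.com..."`, neither fragment contains the full domain, so the output is identical to the input and no error is raised.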

Phase 1 works around this with a dual-mode HtmlRewriterAdapter: streaming mode when no script rewriters are registered, buffered mode when they are. This means streaming only benefits configs without GTM/NextJS script rewriters.

Phase 3 makes the rewriters themselves fragment-safe, enabling streaming for ALL configurations.

Approach

Each script rewriter accumulates text fragments internally via is_last_in_text_node, then operates on the complete text. Key considerations:

  • Intermediate fragments must return Replace("") (not Keep) to suppress output, since the full accumulated text is emitted on the final fragment
  • When the rewriter returns Keep on the full text but earlier fragments were suppressed, the final fragment must emit Replace(full_text) to restore the content
  • When text is NOT fragmented (single fragment), return Keep as before — no unnecessary replacement
  • Multiple rewriters on the same selector (e.g., NextJsNextDataRewriter on script#__NEXT_DATA__ + NextJsRscPlaceholderRewriter on script) each accumulate independently — last text.replace() wins, same as current behavior
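The rules above can be sketched as a small standalone accumulator. This is an illustration under assumptions, not the actual rewriter code: `Action`, `FragmentAccumulator`, `rewrite_gtm`, `emit`, and the domain `proxy.example` are hypothetical stand-ins, and the `is_last` flag plays the role of `is_last_in_text_node`.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the rewriter's result type.
#[derive(Debug, PartialEq)]
enum Action {
    Keep,
    Replace(String),
}

/// Accumulates text fragments until the last one arrives.
struct FragmentAccumulator {
    buffer: Mutex<String>,
}

impl FragmentAccumulator {
    fn new() -> Self {
        Self { buffer: Mutex::new(String::new()) }
    }

    /// `rewrite` returns Some(new_text) when it changes the text, None for "keep".
    fn on_fragment<F>(&self, fragment: &str, is_last: bool, rewrite: F) -> Action
    where
        F: Fn(&str) -> Option<String>,
    {
        let mut buffer = self.buffer.lock().expect("accumulator mutex poisoned");
        if !is_last {
            // Intermediate fragment: store it and suppress its output.
            buffer.push_str(fragment);
            return Action::Replace(String::new());
        }
        if buffer.is_empty() {
            // Nothing was suppressed: behave as before.
            return match rewrite(fragment) {
                Some(new_text) => Action::Replace(new_text),
                None => Action::Keep,
            };
        }
        // Fragmented text: rewrite the whole text, or restore it unchanged.
        buffer.push_str(fragment);
        let full_text = std::mem::take(&mut *buffer);
        match rewrite(&full_text) {
            Some(new_text) => Action::Replace(new_text),
            None => Action::Replace(full_text),
        }
    }
}

/// Hypothetical rewrite: swap the GTM domain for a placeholder domain.
fn rewrite_gtm(text: &str) -> Option<String> {
    let domain = "googletagmanager.com";
    text.contains(domain).then(|| text.replace(domain, "proxy.example"))
}

/// Feeds fragments through one accumulator and returns the emitted output.
fn emit(fragments: &[&str], rewrite: impl Fn(&str) -> Option<String>) -> String {
    let acc = FragmentAccumulator::new();
    let mut out = String::new();
    for (i, fragment) in fragments.iter().enumerate() {
        let is_last = i + 1 == fragments.len();
        match acc.on_fragment(fragment, is_last, &rewrite) {
            Action::Keep => out.push_str(fragment),
            Action::Replace(text) => out.push_str(&text),
        }
    }
    out
}
```

Taking the buffer with `std::mem::take` on the final fragment leaves the accumulator empty, so the same instance can handle the next text node without carrying over stale content.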

Tasks

  • Add Mutex<String> accumulation to NextJsNextDataRewriter
  • Add Mutex<String> accumulation to GoogleTagManagerIntegration
  • Remove new_buffered() from HtmlRewriterAdapter — always stream
  • Remove has_script_rewriters gate from create_html_processor
  • Add small-chunk-size regression tests:
    • __NEXT_DATA__ rewrite with text split across chunk boundaries
    • GTM inline script rewrite with domain split across chunk boundaries
  • Full verification
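One possible shape for the small-chunk-size regression tests is a harness that replays the same input at every chunk size and reports the first size whose output differs from the expected result. The sketch below is self-contained and does not call the real pipeline: `chunk_str`, `first_failing_chunk_size`, the two stand-in processors, and `proxy.example` are all hypothetical names. In the real tests the `process` argument would presumably drive the streaming processor instead.

```rust
const FROM: &str = "googletagmanager.com";
const TO: &str = "proxy.example"; // hypothetical replacement domain

/// Splits `input` into pieces of at most `size` bytes, extending a piece
/// to the next char boundary so every piece is valid UTF-8.
fn chunk_str(input: &str, size: usize) -> Vec<&str> {
    assert!(size > 0, "chunk size must be positive");
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < input.len() {
        let mut end = (start + size).min(input.len());
        while !input.is_char_boundary(end) {
            end += 1;
        }
        chunks.push(&input[start..end]);
        start = end;
    }
    chunks
}

/// Returns the smallest chunk size for which `process` does not produce
/// `expected`, or None if every size from 1 to the input length passes.
fn first_failing_chunk_size<F>(input: &str, expected: &str, process: F) -> Option<usize>
where
    F: Fn(&[&str]) -> String,
{
    (1..=input.len().max(1)).find(|&size| process(&chunk_str(input, size)) != expected)
}

/// Stand-in for a rewriter that is not fragment-safe.
fn rewrite_per_chunk(chunks: &[&str]) -> String {
    chunks.iter().map(|chunk| chunk.replace(FROM, TO)).collect()
}

/// Stand-in for a fragment-safe rewriter that sees the full text.
fn rewrite_accumulated(chunks: &[&str]) -> String {
    chunks.concat().replace(FROM, TO)
}
```

Sweeping every chunk size, rather than picking one, avoids depending on where a particular split happens to land relative to the domain string.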

Acceptance Criteria

  • All script rewriters produce correct output regardless of chunk boundaries
  • HtmlRewriterAdapter always streams (no buffered mode)
  • Streaming benefits all configurations, not just those without script rewriters
  • All existing tests pass
