fix: allow startxref offset to include leading whitespace #797

Closed
vitormattos wants to merge 4 commits into smalot:master from vitormattos:fix/unable-find-xref-pass-c

@vitormattos vitormattos commented Apr 24, 2026

Summary

  • Accept startxref offsets that point to leading whitespace before the xref keyword.
  • Use the whitespace-adjusted offset for classic xref decoding and Unix line-ending detection.
  • Keep trailer key parsing strict for dictionary entries.

Problem

  • Some PDFs set startxref to the whitespace immediately before xref.
  • The parser required an exact xref position and incorrectly fell back to xref stream decoding.
  • This can lead to Invalid object reference failures for files that are otherwise parseable.
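
The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the library's code: `classify_exact` is a hypothetical stand-in for a strict position check, and the byte string is a toy fragment rather than a valid PDF.

```python
def classify_exact(pdf: bytes, offset: int) -> str:
    """Hypothetical strict check: treat the target as a classic xref table
    only if the keyword 'xref' starts exactly at the given offset."""
    return "classic" if pdf.startswith(b"xref", offset) else "stream"


# Toy fragment (not a complete PDF): a blank line sits between the last
# object and the xref keyword.
body = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
table = b"xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\n"
pdf = body + b"\n" + table

exact_offset = len(body) + 1   # first letter of "xref"
whitespace_offset = len(body)  # the newline just before "xref"

print(classify_exact(pdf, exact_offset))       # classic
print(classify_exact(pdf, whitespace_offset))  # stream (misclassified)
```

With the offset one byte early, the strict check routes a classic table to the stream decoder, which is the misclassification this PR addresses.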

Fix

  • Skip PDF whitespace before testing xref position.
  • Decode classic xref from the adjusted offset.
  • Apply the same adjustment in the Unix line-ending check path.
  • Keep strict trailer key matching to avoid accidental value matches.
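
The first two steps of the fix can be sketched as follows. This is an illustration under assumptions, not the RawDataParser implementation: the helper names are hypothetical, and the whitespace set is the six characters the PDF specification defines as white space (NUL, tab, line feed, form feed, carriage return, space).

```python
# The six PDF white-space bytes: NUL, HT, LF, FF, CR, SP.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def skip_pdf_whitespace(data: bytes, offset: int) -> int:
    """Return the first position at or after offset that is not PDF whitespace."""
    while offset < len(data) and data[offset] in PDF_WHITESPACE:
        offset += 1
    return offset


def points_at_classic_xref(data: bytes, offset: int) -> tuple[bool, int]:
    """Hypothetical helper: test for the xref keyword at the whitespace-adjusted
    offset and return that offset so the decoder can start from it."""
    adjusted = skip_pdf_whitespace(data, offset)
    return data.startswith(b"xref", adjusted), adjusted


data = b"endobj\n \r\nxref\n0 1\n"
print(points_at_classic_xref(data, 6))   # (True, 10): offset pointed at whitespace
print(points_at_classic_xref(data, 10))  # (True, 10): offset was already exact
print(points_at_classic_xref(data, 0))   # (False, 0): not an xref table
```

Returning the adjusted offset alongside the boolean reflects the second bullet: the decoder should start from the position of the keyword, not from the raw startxref value.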

Regression coverage

  • Added fixtures:
    • samples/bugs/PullRequest797-vera.pdf
    • samples/bugs/PullRequest797-pdf.js.pdf
  • Added tests:
    • DocumentIssueFocusTest::testParseFileWhenStartxrefPointsToLeadingWhitespaceInVeraPdfFixture
    • DocumentIssueFocusTest::testParseFileWithCompressedXrefObjectFromPdfJsCorpus

Validation

  • make run-phpunit ARGS="tests/PHPUnit/Integration/DocumentIssueFocusTest.php"
  • Result: OK

PDF Sources

Some PDFs set startxref to the whitespace immediately before the
xref keyword instead of the first letter of xref.

The parser required an exact match and incorrectly switched to xref
stream decoding, which then failed with Invalid object reference.

Changes:
- Skip PDF whitespace before checking startxref position
- Use adjusted offset when decoding classic xref
- Apply same whitespace tolerance for Unix line-ending detection
- Tighten trailer key regexes to match /Size /Root /Encrypt /Info /Prev
- Add regression fixture and integration test
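
As an illustration of what strict trailer key matching might look like, the sketch below requires each key to appear as a name (leading slash) followed by a value of the expected shape. The patterns are assumptions made for this example, not the regexes in the PR, and they simplify by treating /Root, /Encrypt, and /Info as indirect references only.

```python
import re

# Hypothetical value shapes: a bare integer, or an indirect reference "n g R".
INT = rb"\s+(\d+)\b"
REF = rb"\s+(\d+)\s+(\d+)\s+R\b"

PATTERNS = {
    "Size": re.compile(rb"/Size" + INT),
    "Prev": re.compile(rb"/Prev" + INT),
    "Root": re.compile(rb"/Root" + REF),
    "Encrypt": re.compile(rb"/Encrypt" + REF),
    "Info": re.compile(rb"/Info" + REF),
}


def parse_trailer_keys(trailer: bytes) -> dict:
    """Collect only keys that are followed by a value of the expected shape."""
    found = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(trailer)
        if match:
            found[key] = tuple(int(group) for group in match.groups())
    return found


# "/Prev" appears here as a name *value*, not as a key with an integer offset.
trailer = b"<< /Size 22 /Root 1 0 R /Info 2 0 R /Label /Prev >>"
print(parse_trailer_keys(trailer))
# {'Size': (22,), 'Root': (1, 0), 'Info': (2, 0)}
```

Because /Prev is not followed by an integer in this trailer, it is not reported as a key, which is the kind of accidental value match that strict patterns are meant to rule out.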

Regression fixture:
- samples/bugs/PullRequestXrefWhitespaceStart.pdf

Test:
- DocumentIssueFocusTest::testParseFileWhenStartxrefPointsToLeadingWhitespace

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>
@vitormattos (Author) commented:

Added one more regression sample from the pdf.js corpus: pdfkit_compressed.pdf (copied as samples/bugs/PullRequest797.pdf). It exercises the same startxref/xref recovery path and passes with the current fix.

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>
@vitormattos (Author) commented:

Renamed the two regression fixtures to make the source corpus explicit: PullRequest797-vera.pdf and PullRequest797-pdf.js.pdf.

@vitormattos (Author) commented:

This PR has been restacked into a per-file consolidation flow for RawDataParser changes.

Superseded-by chain:
- base (upstream): #796
- stacked continuation (fork): https://github.com/vitormattos/pdfparser/pull/26

The stacked branch keeps equivalent PR797 fix intent while reducing cross-PR conflicts in shared test files. Closing this standalone PR to avoid duplicate merge paths.

@vitormattos vitormattos deleted the fix/unable-find-xref-pass-c branch April 27, 2026 17:10