fix: allow startxref offset to include leading whitespace by vitormattos · Pull Request #797 · smalot/pdfparser

vitormattos · 2026-04-24T03:01:48Z

Summary

Accept startxref offsets that point to leading whitespace before the xref keyword.
Use the whitespace-adjusted offset for classic xref decoding and Unix line-ending detection.
Keep trailer key parsing strict for dictionary entries.

Problem

Some PDFs set startxref to the whitespace immediately before xref.
The parser required an exact xref position and incorrectly fell back to xref stream decoding.
This can lead to Invalid object reference failures for files that are otherwise parseable.
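
The failure mode can be reproduced in miniature. The following is a minimal sketch in Python, not the pdfparser (PHP) code; the toy byte string and the function name `looks_like_classic_xref_strict` are assumptions made up for illustration. It shows how an exact-position check rejects an offset that is off by a single whitespace byte.

```python
# Illustrative only: a toy PDF fragment, not a complete or valid PDF file.
pdf = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n"
    b"0 2\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
)

xref_pos = pdf.index(b"xref")  # offset of the 'x' in the keyword
bad_offset = xref_pos - 1      # offset of the newline just before it


def looks_like_classic_xref_strict(data: bytes, offset: int) -> bool:
    """Hypothetical exact-match check: the keyword must start at `offset`."""
    return data.startswith(b"xref", offset)


print(looks_like_classic_xref_strict(pdf, xref_pos))    # True
print(looks_like_classic_xref_strict(pdf, bad_offset))  # False
```

With the second offset, a parser relying on this kind of check would conclude that no classic xref table is present and try another decoding path, which is the situation described above.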

Fix

Skip PDF whitespace before testing xref position.
Decode classic xref from the adjusted offset.
Apply the same adjustment in the Unix line-ending check path.
Keep strict trailer key matching to avoid accidental value matches.
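
The steps above can be sketched as follows. This is a hedged illustration in Python rather than the actual RawDataParser change: the helper names `skip_pdf_whitespace`, `classic_xref_offset`, and `trailer_int` are hypothetical, and the regular expression is only one plausible way to keep key matching strict. The Unix line-ending check is not modeled here.

```python
import re
from typing import Optional

# The six whitespace bytes defined by the PDF specification:
# NUL, TAB, LF, FF, CR, SPACE.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def skip_pdf_whitespace(data: bytes, offset: int) -> int:
    """Advance `offset` past any PDF whitespace bytes."""
    while offset < len(data) and data[offset] in PDF_WHITESPACE:
        offset += 1
    return offset


def classic_xref_offset(data: bytes, startxref: int) -> Optional[int]:
    """Return the adjusted offset if a classic xref table starts there.

    Returns None when the bytes after the whitespace are not the `xref`
    keyword, so the caller can fall back to xref stream decoding.
    """
    adjusted = skip_pdf_whitespace(data, startxref)
    return adjusted if data.startswith(b"xref", adjusted) else None


def trailer_int(trailer: bytes, key: bytes) -> Optional[int]:
    """Read an integer trailer entry, matching `key` only as a /Name key.

    Requiring the leading slash and whitespace right after the name keeps
    longer names such as /PageSize from matching when looking up /Size.
    """
    match = re.search(rb"/" + re.escape(key) + rb"\s+(\d+)", trailer)
    return int(match.group(1)) if match else None


data = b"junk\r\n xref\n0 1\n"
print(classic_xref_offset(data, 4))            # 7: offset of the 'x'
print(classic_xref_offset(b"  12 0 obj", 0))   # None: not a classic xref
print(trailer_int(b"<< /PageSize 99 /Size 2 >>", b"Size"))  # 2
```

Decoding from the adjusted offset, rather than the raw `startxref` value, is what lets the classic table parser see the keyword at the position it expects.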

Regression coverage

Added fixtures:
- samples/bugs/PullRequest797-vera.pdf
- samples/bugs/PullRequest797-pdf.js.pdf
Added tests:
- DocumentIssueFocusTest::testParseFileWhenStartxrefPointsToLeadingWhitespaceInVeraPdfFixture
- DocumentIssueFocusTest::testParseFileWithCompressedXrefObjectFromPdfJsCorpus

Validation

make run-phpunit ARGS="tests/PHPUnit/Integration/DocumentIssueFocusTest.php"
Result: OK

PDF Sources

veraPDF corpus
- Fixture origin: PDF_A-2b/6.6 Metadata/6.6.2 Metadata streams/6.6.2.3 Schemas/6.6.2.3.2 Extension schemas/veraPDF test suite 6-6-2-3-2-t01-pass-c.pdf
- Source URL: https://github.com/veraPDF/veraPDF-corpus/blob/staging/PDF_A-2b/6.6%20Metadata/6.6.2%20Metadata%20streams/6.6.2.3%20Schemas/6.6.2.3.2%20Extension%20schemas/veraPDF%20test%20suite%206-6-2-3-2-t01-pass-c.pdf
- Local copy used in tests: samples/bugs/PullRequest797-vera.pdf
pdf.js corpus
- Fixture origin: test/pdfs/pdfkit_compressed.pdf
- Source URL: https://github.com/mozilla/pdf.js/blob/master/test/pdfs/pdfkit_compressed.pdf
- Local copy used in tests: samples/bugs/PullRequest797-pdf.js.pdf

Some PDFs set startxref to the whitespace immediately before the xref keyword instead of the first letter of xref. The parser required an exact match and incorrectly switched to xref stream decoding, which then failed with Invalid object reference.

Changes:
- Skip PDF whitespace before checking startxref position
- Use adjusted offset when decoding classic xref
- Apply same whitespace tolerance for Unix line-ending detection
- Tighten trailer key regexes to match /Size /Root /Encrypt /Info /Prev
- Add regression fixture and integration test

Regression fixture:
- samples/bugs/PullRequestXrefWhitespaceStart.pdf

Test:
- DocumentIssueFocusTest::testParseFileWhenStartxrefPointsToLeadingWhitespace

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

vitormattos · 2026-04-24T14:59:21Z

Added one more regression sample from the pdf.js corpus: (copied as ). It exercises the same startxref/xref recovery path and passes with the current fix.

vitormattos · 2026-04-24T14:59:37Z

Added one more regression sample from the pdf.js corpus: pdfkit_compressed.pdf (copied as samples/bugs/PullRequest797.pdf). It exercises the same startxref/xref recovery path and passes with the current fix.

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

vitormattos · 2026-04-24T15:05:51Z

Renamed the two regression fixtures to make the source corpus explicit: PullRequest797-vera.pdf and PullRequest797-pdf.js.pdf.

vitormattos · 2026-04-25T21:36:26Z

This PR has been restacked into a per-file consolidation flow for RawDataParser changes.

Superseded-by chain:
- base (upstream): #796
- stacked continuation (fork): https://github.com/vitormattos/pdfparser/pull/26

The stacked branch keeps equivalent PR797 fix intent while reducing cross-PR conflicts in shared test files. Closing this standalone PR to avoid duplicate merge paths.

vitormattos mentioned this pull request Apr 24, 2026

fix: aggregate startxref whitespace tolerance vitormattos/pdfparser#3

Merged

vitormattos added 2 commits April 24, 2026 00:05

test: use assertCount for page count assertion

583526e

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

test: add pdf.js compressed xref regression

a3d9df0

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

test: clarify pull request fixture provenance

d4c26e9

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

vitormattos mentioned this pull request Apr 25, 2026

sync: include missing PR806 follow-up commit in integration vitormattos/pdfparser#24

Closed

vitormattos closed this Apr 25, 2026

vitormattos deleted the fix/unable-find-xref-pass-c branch April 27, 2026 17:10

fix: allow startxref offset to include leading whitespace #797
vitormattos wants to merge 4 commits into smalot:master from vitormattos:fix/unable-find-xref-pass-c
