fix: validate input is a real PDF before encryption (closes #24)#53
fix: validate input is a real PDF before encryption (closes #24)#53ambicuity wants to merge 1 commit intoSUPAIDEAS:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds early validation to ensure inputs are real PDFs before attempting encryption, improving user-facing errors for invalid/non-PDF content (per #24) and extending test coverage to match the new behavior.
Changes:
- Added a PDF “magic bytes” validation step and normalized
PdfReaderparse failures into a friendlyValueError. - Updated unit tests to cover magic-byte rejection and
PdfReaderparse-failure handling. - Added integration tests to ensure the CLI rejects fake PDF content (with and without
.pdfextension).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
passifypdf/encryptpdf.py |
Adds magic-byte validation and targeted PdfReadError handling to produce consistent invalid-PDF errors. |
tests/unittests/test_encryptpdf.py |
Updates mocks for new validation flow; adds unit tests for invalid inputs. |
tests/integrationtests/test_encryptpdf_integration.py |
Adds CLI subprocess tests that assert friendly error output and non-creation of outputs for fake inputs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| invalid_pdf_error = ValueError(f"Input file '{input_pdf}' is not a valid PDF file.") | ||
| with input_path.open("rb") as source_file: | ||
| if source_file.read(5) != b"%PDF-": |
There was a problem hiding this comment.
The magic-byte check is overly strict: some valid PDFs can include leading whitespace/bytes before the "%PDF-" header (the header is typically expected near the start, not necessarily at byte 0). Reading and comparing only the first 5 bytes may incorrectly reject such files. Consider scanning the first ~1KB for the "%PDF-" marker (or skipping leading whitespace) before failing, and keep the error message the same.
| if source_file.read(5) != b"%PDF-": | |
| # Read the first chunk of the file and look for the PDF header anywhere within it, | |
| # since some valid PDFs may contain leading bytes before the "%PDF-" marker. | |
| header_chunk = source_file.read(1024) | |
| if b"%PDF-" not in header_chunk: |
Summary
This PR fixes issue #24 by validating that input files are real PDFs before encryption.
Reproduction (before)
.pdfextension containing plain text.passifypdfagainst it.Stream has ended unexpectedly).Root cause
The implementation relied on
PdfReaderinitialization for file-format validation and wrapped errors generically, which produced unclear user-facing output for invalid/non-PDF content.Fix
%PDF-before reading withPdfReader.PdfReadErrorhandling to normalize parser failures into one friendly error.Input file '<path>' is not a valid PDF file.Behavior change
Tests
Unit
Path.open("rb")magic-byte flow.PdfReadErrorfallback rejection test.Integration
.pdfcontent.Local verification
uv sync --all-extrasuv run pytest tests/unittests/ -v --cov=passifypdf --cov-report=xml --cov-report=termuv run pytest tests/integrationtests/ -v.pdfand extensionless files both return exit code1with the friendly invalid-PDF error.