Commit ff32f7b

Add VideoSanitizer and AudioSanitizer for metadata cleanup and embedded payload detection
1 parent b8db377 commit ff32f7b

7 files changed

Lines changed: 794 additions & 57 deletions

README.md

Lines changed: 176 additions & 34 deletions
@@ -2,26 +2,29 @@
22

33
[![MIT Licensed](https://img.shields.io/badge/License-MIT-brightgreen.svg?style=flat-square)](LICENSE)
44
[![Check code style](https://github.com/SytxLabs/FileSanitizer/actions/workflows/code-style.yml/badge.svg?style=flat-square)](https://github.com/SytxLabs/FileSanitizer/actions/workflows/code-style.yml)
5-
[![Tests](https://github.com/SytxLabs/FileSanitizer/actions/workflows/tests.yml/badge.svg?style=flat-square)](https://github.com/SytxLabs/FileSanitizer/actions/workflows/code-style.yml)
5+
[![Tests](https://github.com/SytxLabs/FileSanitizer/actions/workflows/tests.yml/badge.svg?style=flat-square)](https://github.com/SytxLabs/FileSanitizer/actions/workflows/tests.yml)
66
[![Latest Version on Packagist](https://poser.pugx.org/sytxlabs/filesanitizer/v/stable?format=flat-square)](https://packagist.org/packages/sytxlabs/filesanitizer)
77
[![Total Downloads](https://poser.pugx.org/sytxlabs/filesanitizer/downloads?format=flat-square)](https://packagist.org/packages/sytxlabs/filesanitizer)
88

9-
10-
Pure PHP file sanitizer and scanner for uploaded files. It strips metadata where practical, rewrites selected file types into safer forms, and blocks files that look malicious.
9+
Pure PHP file sanitizer and scanner for uploaded files. It strips metadata where practical, rewrites selected file types into safer forms, and detects suspicious or malicious content such as XSS-style payloads, risky embedded markup, active PDF content, and dangerous archive entries.
1110

1211
## Features
1312

14-
- Re-encodes JPEG, PNG, GIF, and WebP images to strip EXIF and ancillary metadata
15-
- Scans HTML, SVG, PDF, text, Office OOXML, ZIP, and nested ZIP content for risky payloads
16-
- Recursively scans ZIP archives with depth, entry-count, and expanded-size guards
17-
- Detects common XSS-style payloads such as `<script>`, inline handlers, `javascript:` URLs, hostile CSS, and dangerous PDF actions
18-
- Applies strict allowlist-based cleanup for HTML and strict removal rules for SVG
13+
- Re-encodes supported image formats to remove metadata and ancillary chunks
14+
- Sanitizes HTML and SVG using strict policy-based cleanup
15+
- Scans PDFs for active content and applies best-effort cleanup
16+
- Scans OOXML documents for risky content such as macros, ActiveX, and external relationships
17+
- Recursively scans ZIP archives, including nested archives, with configurable safety limits
18+
- Scans audio files for suspicious embedded payloads and removes metadata where practical
19+
- Scans video files for suspicious embedded payloads and applies best-effort metadata cleanup
20+
- Supports sanitize-always mode for best-effort cleaning even when risky content is detected
21+
- Pure PHP implementation with no shell access, SSH, or external binaries required
1922

20-
## Install
23+
## Installation
2124

2225
```bash
2326
composer require sytxlabs/filesanitizer
24-
```
27+
````
2528

2629
For development and tests:
2730

@@ -30,9 +33,7 @@ composer install
3033
composer test
3134
```
3235

33-
PHPUnit is added as a dev dependency
34-
35-
## Usage
36+
## Quick start
3637

3738
```php
3839
<?php
@@ -42,53 +43,194 @@ require __DIR__ . '/vendor/autoload.php';
4243
use SytxLabs\FileSanitizer\FileSanitizer;
4344
4445
$sanitizer = new FileSanitizer();
46+
4547
$result = $sanitizer->process(__DIR__ . '/upload.svg');
4648
4749
if (!$result['scan']->safe) {
4850
foreach ($result['scan']->issues as $issue) {
4951
echo $issue->code . ': ' . $issue->message . PHP_EOL;
5052
}
53+
5154
exit(1);
5255
}
5356
5457
echo 'Sanitized file written to: ' . $result['sanitize']->outputPath . PHP_EOL;
5558
```
5659
57-
## Archive scanning notes
60+
## sanitizeAlways mode
5861
59-
The recursive archive scanner uses PHP's `ZipArchive` extension and does not extract archives to a shell. PHP documents `ZipArchive` for reading archive entries and notes extraction and open behavior through the zip extension API.
62+
When `sanitizeAlways` is enabled, FileSanitizer will attempt best-effort sanitization even if risky content is detected during scanning.
63+
64+
This is useful when you want to:
65+
66+
* always strip metadata where possible
67+
* always rewrite supported files where possible
68+
* keep findings for review without immediately rejecting the upload
69+
70+
```php
71+
<?php
72+
73+
use SytxLabs\FileSanitizer\FileSanitizer;
74+
75+
$sanitizer = new FileSanitizer();
76+
77+
$result = $sanitizer->process(__DIR__ . '/upload.pdf', null, true);
78+
```
79+
80+
Best-effort sanitization does not guarantee a full structural rebuild for complex formats such as PDF, audio, or video containers.
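
A minimal sketch of how a caller might interpret the result in this mode. The context keys `skipped`, `copied_original`, and `sanitized_despite_scan_issues` are the ones set in `src/FileSanitizer.php` in this commit; the helper name `describeOutcome()` and the labels it returns are hypothetical. The function takes plain values so it runs without the library installed.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: maps a scan verdict and a SanitizeReport context
 * array to a short label the calling code can act on.
 *
 * @param array<string, mixed> $context
 */
function describeOutcome(bool $scanSafe, array $context): string
{
    if (!empty($context['skipped'])) {
        // The scan found issues and sanitizeAlways was off,
        // so the sanitizer was not run.
        return 'rejected';
    }

    if (!empty($context['copied_original'])) {
        // No specialized sanitizer matched; the input was copied as-is.
        return 'copied-unmodified';
    }

    if (!empty($context['sanitized_despite_scan_issues'])) {
        // sanitizeAlways was on: output exists, but findings remain.
        return 'sanitized-needs-review';
    }

    return $scanSafe ? 'sanitized' : 'needs-review';
}
```

With the library loaded, this would presumably be called as `describeOutcome($result['scan']->safe, $result['sanitize']->context)`. The source also exposes `$sanitizer->sanitizeAlways($path)` as a shorthand for passing `true` as the third argument to `process()`.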
81+
82+
## Supported file types
83+
84+
FileSanitizer currently supports scanning and/or sanitizing the following file types.
85+
86+
* Images
87+
* JPEG
88+
* PNG
89+
* GIF
90+
* WebP
91+
* Documents and markup
92+
* HTML
93+
* SVG
94+
* PDF
95+
* TXT and text-like files
96+
* DOCX
97+
* XLSX
98+
* PPTX
99+
* Archives
100+
* ZIP
101+
* Nested ZIP archives
102+
* Audio
103+
* MP3
104+
* WAV
105+
* OGG
106+
* FLAC
107+
* M4A
108+
* AAC
109+
* Video
110+
* MP4
111+
* MOV
112+
* WebM
113+
* MKV
114+
* AVI
115+
116+
## How it works
117+
118+
FileSanitizer combines format-aware scanning with best-effort sanitization.
119+
120+
### Scanning
121+
122+
The scanner looks for suspicious patterns and risky structures such as:
123+
124+
* inline JavaScript-style payloads
125+
* dangerous HTML or SVG constructs
126+
* active PDF actions
127+
* suspicious archive paths and nested archive abuse
128+
* risky embedded strings in audio and video containers
129+
* macros, ActiveX, and external relationships in OOXML files
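
As a rough illustration of pattern-based scanning, the sketch below searches raw bytes for a handful of payload markers. It is not the package's `PatternScanner`: the function name, the labels, and the exact regular expressions are assumptions chosen for the example, and a heuristic like this can produce false positives on binary data.

```php
<?php

declare(strict_types=1);

/**
 * Returns the labels of suspicious patterns found in raw file bytes.
 * Illustrative pattern list only; a real scanner would be format-aware.
 *
 * @return list<string>
 */
function findSuspiciousPatterns(string $bytes): array
{
    $patterns = [
        'script_tag'     => '/<script\b/i',
        'iframe_tag'     => '/<iframe\b/i',
        'javascript_url' => '/javascript\s*:/i',
        'event_handler'  => '/\bon[a-z]+\s*=/i',
        'data_html_url'  => '/data\s*:\s*text\/html/i',
        'php_open_tag'   => '/<\?php/i',
    ];

    $found = [];
    foreach ($patterns as $label => $regex) {
        // preg_match returns 1 on a match, 0 on no match, false on error.
        if (preg_match($regex, $bytes) === 1) {
            $found[] = $label;
        }
    }

    return $found;
}
```

Calling `findSuspiciousPatterns(file_get_contents($path))` on an upload would then give a list of labels to report or log.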
130+
131+
### Sanitizing
132+
133+
Supported sanitizers attempt to reduce risk by:
134+
135+
* re-encoding images
136+
* removing unsafe HTML and SVG elements and attributes
137+
* stripping metadata where practical
138+
* rewriting selected file formats into safer forms
139+
* applying best-effort cleanup to complex containers
140+
141+
## Archive scanning
142+
143+
ZIP scanning is recursive and designed to detect suspicious content without using shell extraction.
60144
61145
Current guards:
62146
63-
- Maximum nesting depth: 3
64-
- Maximum scanned entries per archive: 1000
65-
- Maximum expanded bytes scanned: 25 MB
66-
- Flags suspicious paths such as `../evil.txt` or absolute-path entries
147+
* maximum nesting depth: 3
148+
* maximum scanned entries per archive: 1000
149+
* maximum expanded bytes scanned: 25 MB
150+
* suspicious path detection for entries such as `../evil.txt` or absolute paths
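
The suspicious-path guard can be sketched as below. This is an assumption about one reasonable way to do it, not the package's actual check; `isSuspiciousEntryName()` and `listSuspiciousEntries()` are hypothetical names. The second function uses only `ZipArchive::open()`, `numFiles`, `getNameIndex()`, and `close()` from PHP's zip extension and reads entry names without extracting anything.

```php
<?php

declare(strict_types=1);

/**
 * Flags ZIP entry names that could escape the extraction directory.
 */
function isSuspiciousEntryName(string $name): bool
{
    // Normalize Windows separators so both styles are checked alike.
    $normalized = str_replace('\\', '/', $name);

    // Absolute paths: a leading slash or a drive letter such as "C:".
    if (str_starts_with($normalized, '/') || preg_match('/^[A-Za-z]:/', $normalized) === 1) {
        return true;
    }

    // Any ".." path segment is treated as traversal.
    return in_array('..', explode('/', $normalized), true);
}

/**
 * Lists flagged entry names, inspecting at most $maxEntries entries.
 * Requires the zip extension.
 *
 * @return list<string>
 */
function listSuspiciousEntries(string $zipPath, int $maxEntries = 1000): array
{
    $zip = new ZipArchive();
    if ($zip->open($zipPath) !== true) {
        throw new RuntimeException('Could not open archive.');
    }

    $flagged = [];
    $limit = min($zip->numFiles, $maxEntries);
    for ($i = 0; $i < $limit; $i++) {
        $name = $zip->getNameIndex($i);
        if ($name !== false && isSuspiciousEntryName($name)) {
            $flagged[] = $name;
        }
    }
    $zip->close();

    return $flagged;
}
```

Checking path segments rather than searching for the substring `..` avoids flagging harmless names such as `notes..txt`.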
67151
68152
## HTML and SVG policy
69153
70-
HTML cleanup uses PHP's DOM support to parse and rewrite content, removing disallowed tags and risky attributes instead of relying on `strip_tags()`, which PHP user notes caution is not enough for safe attribute handling. PHP's DOM APIs support HTML parsing and tree editing.
154+
HTML and SVG sanitization is policy-based and removes risky constructs instead of relying on simple tag stripping.
71155
72156
Highlights:
73157
74-
- Removes `script`, `iframe`, `object`, `embed`, `form`, and other non-allowlisted elements
75-
- Removes all `on*` event handlers
76-
- Removes `javascript:`, `vbscript:`, `file:`, and unsafe `data:` URLs
77-
- Drops hostile CSS such as `expression()`, `@import`, `url()`, `behavior:`, and `-moz-binding`
78-
- Removes SVG active content elements such as `script`, `foreignObject`, animation elements, external media, `image`, and `use`
158+
* removes `script`, `iframe`, `object`, `embed`, `form`, and other disallowed elements
159+
* removes all `on*` event handlers
160+
* removes `javascript:`, `vbscript:`, `file:`, and unsafe `data:` URLs
161+
* removes hostile CSS such as `expression()`, `@import`, `url()`, `behavior:`, and `-moz-binding`
162+
* removes SVG active content such as `script`, `foreignObject`, animation elements, external media, `image`, and `use`
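
A simplified sketch of DOM-based cleanup, assuming PHP's `dom` extension. It covers only part of the policy above (a short element denylist, `on*` attributes, and three URL schemes) and is not the package's `HtmlSanitizer`; `sanitizeFragment()` and the `fragment-root` wrapper id are made up for the example, and a real policy would allowlist elements rather than denylist them.

```php
<?php

declare(strict_types=1);

/**
 * Removes a few active elements, all on* attributes, and script-like
 * URL schemes from an HTML fragment. Illustrative only.
 */
function sanitizeFragment(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // Silence parser warnings.
    $dom->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
        . '<body><div id="fragment-root">' . $html . '</div></body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    if (!$root instanceof DOMElement) {
        return '';
    }

    // Copy the node list first: removing nodes while iterating a live list skips items.
    $blocked = $xpath->query('.//script|.//iframe|.//object|.//embed|.//form', $root);
    foreach (iterator_to_array($blocked) as $node) {
        $node->parentNode?->removeChild($node);
    }

    foreach ($xpath->query('.//*', $root) as $element) {
        if (!$element instanceof DOMElement) {
            continue;
        }
        foreach (iterator_to_array($element->attributes) as $attr) {
            $name = strtolower($attr->nodeName);
            // Strip whitespace and control characters before checking the scheme.
            $value = strtolower((string) preg_replace('/[\s\x00-\x1f]+/', '', $attr->value));
            $isHandler = str_starts_with($name, 'on');
            $isBadUrl = preg_match('/^(javascript|vbscript|file):/', $value) === 1;
            if ($isHandler || $isBadUrl) {
                $element->removeAttribute($attr->nodeName);
            }
        }
    }

    // Serialize only the children of the wrapper element.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }

    return $out;
}
```

Parsing into a tree and editing nodes means attribute values are inspected after the parser has decoded them, which simple tag stripping cannot do.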
163+
164+
## Audio support
165+
166+
FileSanitizer includes best-effort support for common audio formats.
79167
80-
## Tests
168+
### What it does
169+
170+
* Detects suspicious embedded payloads such as:
171+
172+
* `<script`
173+
* `javascript:`
174+
* inline event handler patterns like `onclick=`
175+
* `<iframe`
176+
* `data:text/html`
177+
* embedded PHP tags
178+
179+
* Removes metadata where practical:
180+
181+
* MP3: ID3v1 and ID3v2 tags
182+
* WAV: selected metadata chunks such as `LIST`, `INFO`, and `ID3`
183+
* OGG, FLAC, M4A, and AAC: conservative best-effort textual payload cleanup
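
To make the MP3 case concrete, here is a minimal sketch of ID3 removal on an in-memory byte string. It relies on the published ID3 layouts (a 10-byte ID3v2 header with a synchsafe size, and a fixed 128-byte ID3v1 trailer starting with `TAG`). The function name `stripId3Tags()` is hypothetical, and this is not necessarily how `AudioSanitizer` implements it.

```php
<?php

declare(strict_types=1);

/**
 * Removes a leading ID3v2 tag and a trailing ID3v1 tag, if present.
 */
function stripId3Tags(string $bytes): string
{
    // ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4 size bytes.
    if (strlen($bytes) >= 10 && str_starts_with($bytes, 'ID3')) {
        // The size is "synchsafe": only the low 7 bits of each byte count.
        $size = ((ord($bytes[6]) & 0x7F) << 21)
            | ((ord($bytes[7]) & 0x7F) << 14)
            | ((ord($bytes[8]) & 0x7F) << 7)
            | (ord($bytes[9]) & 0x7F);

        // Flag bit 0x10 signals an extra 10-byte footer after the tag body.
        $hasFooter = (ord($bytes[5]) & 0x10) !== 0;
        $total = 10 + $size + ($hasFooter ? 10 : 0);

        // Only cut when the declared size fits inside the data.
        if ($total <= strlen($bytes)) {
            $bytes = substr($bytes, $total);
        }
    }

    // ID3v1: fixed 128-byte trailer that begins with "TAG".
    if (strlen($bytes) >= 128 && substr($bytes, -128, 3) === 'TAG') {
        $bytes = substr($bytes, 0, -128);
    }

    return $bytes;
}
```

The audio frames themselves are left untouched, which matches the note below that sanitization does not transcode.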
184+
185+
### Notes
186+
187+
Audio sanitization is best-effort and does not transcode or fully rebuild complex media containers. No shell tools, SSH access, or external binaries are required.
188+
189+
## Video support
190+
191+
FileSanitizer includes best-effort support for common video containers.
192+
193+
### What it does
194+
195+
* Detects suspicious embedded payloads such as:
196+
197+
* `<script`
198+
* `javascript:`
199+
* inline event handler patterns like `onload=`
200+
* `<iframe`
201+
* `data:text/html`
202+
* embedded PHP tags
203+
204+
* Applies conservative container cleanup where practical:
205+
206+
* MP4 and MOV: attempts to remove selected metadata atoms such as `udta`, `meta`, and `ilst`
207+
* AVI: removes selected metadata chunks such as `INFO`, `JUNK`, and `IDIT`
208+
* WebM and MKV: applies conservative best-effort textual payload cleanup
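
For MP4 and MOV, any cleanup starts with walking the box (atom) structure. The sketch below lists only top-level boxes, using the standard header layout of a 4-byte big-endian size followed by a 4-byte type. Atoms such as `udta` usually sit inside `moov`, so actually removing them would also mean descending into container boxes and rewriting parent sizes, which this example does not attempt. `listTopLevelBoxes()` is a hypothetical name and is not taken from `VideoSanitizer`.

```php
<?php

declare(strict_types=1);

/**
 * Lists the top-level boxes of an MP4/MOV byte string.
 *
 * @return list<array{type: string, offset: int, size: int}>
 */
function listTopLevelBoxes(string $bytes): array
{
    $boxes = [];
    $offset = 0;
    $length = strlen($bytes);

    while ($offset + 8 <= $length) {
        $size = unpack('N', substr($bytes, $offset, 4))[1];
        $type = substr($bytes, $offset + 4, 4);
        $headerSize = 8;

        if ($size === 1) {
            // A size of 1 means a 64-bit size follows the type field.
            if ($offset + 16 > $length) {
                break;
            }
            $size = unpack('J', substr($bytes, $offset + 8, 8))[1];
            $headerSize = 16;
        } elseif ($size === 0) {
            // A size of 0 means the box runs to the end of the file.
            $size = $length - $offset;
        }

        // Malformed or truncated box: stop rather than guess.
        if ($size < $headerSize || $offset + $size > $length) {
            break;
        }

        $boxes[] = ['type' => $type, 'offset' => $offset, 'size' => $size];
        $offset += $size;
    }

    return $boxes;
}
```

Stopping at the first inconsistent size keeps the walker from reading past the data when a file is truncated or deliberately malformed.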
209+
210+
### Notes
211+
212+
Video sanitization is best-effort and does not transcode or fully rebuild media containers. Without external tools such as FFmpeg, full structural video rewriting is intentionally out of scope.
213+
214+
## Test coverage
81215
82216
Included PHPUnit coverage exercises:
83217
84-
- nested ZIP detection
85-
- path traversal detection inside ZIPs
86-
- HTML sanitization rules
87-
- SVG sanitization rules
88-
- PDF action detection
218+
* nested ZIP detection
219+
* path traversal detection inside ZIPs
220+
* HTML sanitization rules
221+
* SVG sanitization rules
222+
* PDF action detection
223+
* audio metadata stripping
224+
* video file scanning for embedded payloads and metadata stripping
89225
90226
## Limitations
91227
92-
- PDF cleanup is still best-effort rather than a full structural rewrite
93-
- OOXML files are scanned for risky content and external references but not fully rewritten yet
94-
- This package is a sanitizer and heuristic scanner, not a substitute for sandboxing or AV scanning
228+
FileSanitizer is a pure PHP package focused on safe, practical, best-effort sanitization.
229+
230+
### Important limitations
231+
232+
* PDF sanitization is best-effort and is not a full PDF rebuild
233+
* OOXML files are scanned for risky content but are not fully rewritten
234+
* audio sanitization removes metadata where practical but does not transcode files
235+
* video sanitization is best-effort and does not perform full re-encoding or container rebuilding
236+
* complex media formats may still require deeper inspection in high-security environments

examples/web.php

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@
1818

1919
$sanitizer = new FileSanitizer();
2020
$outputPath = $file . '.sanitized.' . pathinfo($file, PATHINFO_EXTENSION);
21-
$result = $sanitizer->process($file, $outputPath);
21+
$result = $sanitizer->process($file, $outputPath, true);
2222

2323
echo '<pre>';
2424
print_r($result);

src/FileSanitizer.php

Lines changed: 28 additions & 22 deletions
@@ -9,11 +9,13 @@
99
use SytxLabs\FileSanitizer\Dto\SanitizeReport;
1010
use SytxLabs\FileSanitizer\Dto\ScanReport;
1111
use SytxLabs\FileSanitizer\Enums\IssueSeverity;
12+
use SytxLabs\FileSanitizer\Sanitizer\AudioSanitizer;
1213
use SytxLabs\FileSanitizer\Sanitizer\HtmlSanitizer;
1314
use SytxLabs\FileSanitizer\Sanitizer\ImageSanitizer;
1415
use SytxLabs\FileSanitizer\Sanitizer\PdfSanitizer;
1516
use SytxLabs\FileSanitizer\Sanitizer\SvgSanitizer;
1617
use SytxLabs\FileSanitizer\Sanitizer\TextLikeSanitizer;
18+
use SytxLabs\FileSanitizer\Sanitizer\VideoSanitizer;
1719
use SytxLabs\FileSanitizer\Scanner\PatternScanner;
1820
use SytxLabs\FileSanitizer\Support\MimeDetector;
1921

@@ -24,13 +26,13 @@ final class FileSanitizer
2426

2527
public function __construct(private readonly ?MimeDetector $mimeDetector = null, private readonly ?ScannerInterface $scanner = null, ?array $sanitizers = null)
2628
{
27-
$this->sanitizers = $sanitizers ?? [new SvgSanitizer(), new HtmlSanitizer(), new ImageSanitizer(), new PdfSanitizer(), new TextLikeSanitizer()];
29+
$this->sanitizers = $sanitizers ?? [new SvgSanitizer(), new HtmlSanitizer(), new ImageSanitizer(), new PdfSanitizer(), new TextLikeSanitizer(), new AudioSanitizer(), new VideoSanitizer()];
2830
}
2931

3032
/**
3133
* @return array{mimeType:string, scan:ScanReport, sanitize:SanitizeReport}
3234
*/
33-
public function process(string $inputPath, bool|string|null $outputPath = null, bool $sanitizeAlways = true): array
35+
public function process(string $inputPath, bool|string|null $outputPath = null, bool $sanitizeAlways = false): array
3436
{
3537
if (is_bool($outputPath)) {
3638
$sanitizeAlways = $outputPath;
@@ -44,36 +46,40 @@ public function process(string $inputPath, bool|string|null $outputPath = null,
4446
$mimeType = ($this->mimeDetector ?? new MimeDetector())->detect($inputPath);
4547
$scan = ($this->scanner ?? new PatternScanner())->scan($inputPath, $mimeType);
4648
$outputPath ??= $this->defaultOutputPath($inputPath);
47-
48-
foreach ($this->sanitizers as $sanitizer) {
49-
if (!$sanitizer->supports($mimeType, $inputPath)) {
50-
continue;
51-
}
52-
if (!$scan->safe && !$sanitizeAlways) {
53-
return ['mimeType' => $mimeType, 'scan' => $scan, 'sanitize' => new SanitizeReport($outputPath, false, $scan->issues, ['skipped' => true])];
54-
}
55-
$sanitize = $sanitizer->sanitize($inputPath, $outputPath, $sanitizeAlways);
56-
if (!$scan->safe) {
57-
$sanitize = new SanitizeReport($sanitize->outputPath, $sanitize->metadataRemoved, [...$scan->issues, ...$sanitize->issues], [...$sanitize->context, 'sanitized_despite_scan_issues' => true]);
49+
$sanitizer = $this->resolveSanitizer($mimeType, $inputPath);
50+
if ($sanitizer === null) {
51+
if (!copy($inputPath, $outputPath)) {
52+
throw new RuntimeException('Could not copy unsupported file to output path.');
5853
}
59-
return ['mimeType' => $mimeType, 'scan' => $scan, 'sanitize' => $sanitize];
54+
$issues = $scan->issues;
55+
$issues[] = new Issue('no_sanitizer', 'No specialized sanitizer exists for this file type; original file was copied after scanning.', IssueSeverity::Warning);
56+
return ['mimeType' => $mimeType, 'scan' => $scan, 'sanitize' => new SanitizeReport($outputPath, false, $issues, ['copied_original' => true])];
6057
}
61-
62-
if (!copy($inputPath, $outputPath)) {
63-
throw new RuntimeException('Could not copy unsupported file to output path.');
58+
if (!$scan->safe && !$sanitizeAlways) {
59+
return ['mimeType' => $mimeType, 'scan' => $scan, 'sanitize' => new SanitizeReport($outputPath, false, $scan->issues, ['skipped' => true])];
6460
}
65-
66-
$issues = $scan->issues;
67-
$issues[] = new Issue('no_sanitizer', 'No specialized sanitizer exists for this file type; original file was copied after scanning.', IssueSeverity::Warning);
68-
69-
return ['mimeType' => $mimeType, 'scan' => $scan, 'sanitize' => new SanitizeReport($outputPath, false, $issues, ['copied_original' => true])];
61+
$sanitize = $sanitizer->sanitize($inputPath, $outputPath, $sanitizeAlways);
62+
if (!$scan->safe) {
63+
$sanitize = new SanitizeReport($sanitize->outputPath, $sanitize->metadataRemoved, [...$scan->issues, ...$sanitize->issues], [...$sanitize->context, 'sanitized_despite_scan_issues' => true]);
64+
}
65+
return ['mimeType' => $mimeType, 'scan' => $scan, 'sanitize' => $sanitize];
7066
}
7167

7268
public function sanitizeAlways(string $inputPath, ?string $outputPath = null): array
7369
{
7470
return $this->process($inputPath, $outputPath, true);
7571
}
7672

73+
private function resolveSanitizer(string $mimeType, string $path): ?SanitizerInterface
74+
{
75+
foreach ($this->sanitizers as $sanitizer) {
76+
if ($sanitizer->supports($mimeType, $path)) {
77+
return $sanitizer;
78+
}
79+
}
80+
return null;
81+
}
82+
7783
private function defaultOutputPath(string $inputPath): string
7884
{
7985
$extension = pathinfo($inputPath, PATHINFO_EXTENSION);
