[](https://packagist.org/packages/sytxlabs/filesanitizer)
- Pure PHP file sanitizer and scanner for uploaded files. It strips metadata where practical, rewrites selected file types into safer forms, and blocks files that look malicious.
+ Pure PHP file sanitizer and scanner for uploaded files. It strips metadata where practical, rewrites selected file types into safer forms, and detects suspicious or malicious content such as XSS-style payloads, risky embedded markup, active PDF content, and dangerous archive entries.

## Features

- - Re-encodes JPEG, PNG, GIF, and WebP images to strip EXIF and ancillary metadata
- - Scans HTML, SVG, PDF, text, Office OOXML, ZIP, and nested ZIP content for risky payloads
- - Recursively scans ZIP archives with depth, entry-count, and expanded-size guards
- - Detects common XSS-style payloads such as `<script>`, inline handlers, `javascript:` URLs, hostile CSS, and dangerous PDF actions
- - Applies strict allowlist-based cleanup for HTML and strict removal rules for SVG
+ - Re-encodes supported image formats to remove metadata and ancillary chunks
+ - Sanitizes HTML and SVG using strict policy-based cleanup
+ - Scans PDFs for active content and applies best-effort cleanup
+ - Scans OOXML documents for risky content such as macros, ActiveX, and external relationships
+ - Recursively scans ZIP archives, including nested archives, with configurable safety limits
+ - Scans audio files for suspicious embedded payloads and removes metadata where practical
+ - Scans video files for suspicious embedded payloads and applies best-effort metadata cleanup
+ - Supports sanitize-always mode for best-effort cleaning even when risky content is detected
+ - Pure PHP implementation with no shell access, SSH, or external binaries required
echo 'Sanitized file written to: ' . $result['sanitize']->outputPath . PHP_EOL;
```

- ## Archive scanning notes
+ ## sanitizeAlways mode

- The recursive archive scanner uses PHP's `ZipArchive` extension and does not extract archives to a shell. PHP documents `ZipArchive` for reading archive entries and notes extraction and open behavior through the zip extension API.
+ When `sanitizeAlways` is enabled, FileSanitizer will attempt best-effort sanitization even if risky content is detected during scanning.
+
+ This is useful when you want to:
+
+ * always strip metadata where possible
+ * always rewrite supported files where possible
+ * keep findings for review without immediately rejecting the upload
+ Best-effort sanitization does not guarantee a full structural rebuild for complex formats such as PDF, audio, or video containers.
+
+ ## Supported file types
+
+ FileSanitizer currently supports scanning and/or sanitizing the following file types.
+
+ * Images
+   * JPEG
+   * PNG
+   * GIF
+   * WebP
+ * Documents and markup
+   * HTML
+   * SVG
+   * PDF
+   * TXT and text-like files
+   * DOCX
+   * XLSX
+   * PPTX
+ * Archives
+   * ZIP
+   * Nested ZIP archives
+ * Audio
+   * MP3
+   * WAV
+   * OGG
+   * FLAC
+   * M4A
+   * AAC
+ * Video
+   * MP4
+   * MOV
+   * WebM
+   * MKV
+   * AVI
+
+ ## How it works
+
+ FileSanitizer combines format-aware scanning with best-effort sanitization.
+
+ ### Scanning
+
+ The scanner looks for suspicious patterns and risky structures such as:
+
+ * inline JavaScript-style payloads
+ * dangerous HTML or SVG constructs
+ * active PDF actions
+ * suspicious archive paths and nested archive abuse
+ * risky embedded strings in audio and video containers
+ * macros, ActiveX, and external relationships in OOXML files
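
The pattern-based part of the scanning step can be illustrated with a small byte-level matcher. This is only a sketch: `findSuspiciousMarkers()` and its label names are hypothetical, and the regular expressions are rough approximations of the payload examples this README lists, not the package's actual rule set.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns labels for suspicious markers found in raw bytes.
 * The patterns approximate the README's examples and will produce false positives.
 */
function findSuspiciousMarkers(string $bytes): array
{
    $patterns = [
        'script_tag'     => '/<script\b/i',
        'javascript_url' => '/javascript\s*:/i',
        'event_handler'  => '/\bon[a-z]+\s*=/i',   // e.g. onclick=, onload=
        'iframe_tag'     => '/<iframe\b/i',
        'data_html_url'  => '/data\s*:\s*text\/html/i',
        'php_tag'        => '/<\?php/i',
    ];

    $found = [];
    foreach ($patterns as $label => $regex) {
        // No /u modifier: the input may be binary, so match byte-wise.
        if (preg_match($regex, $bytes) === 1) {
            $found[] = $label;
        }
    }

    return $found;
}
```

Because the matcher works on raw bytes, the same function could be pointed at text files or at the contents of a media container; a real scanner would need format-aware context to keep false positives down.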
+
+ ### Sanitizing
+
+ Supported sanitizers attempt to reduce risk by:
+
+ * re-encoding images
+ * removing unsafe HTML and SVG elements and attributes
+ * stripping metadata where practical
+ * rewriting selected file formats into safer forms
+ * applying best-effort cleanup to complex containers
+
+ ## Archive scanning
+
+ ZIP scanning is recursive and designed to detect suspicious content without using shell extraction.

Current guards:

- - Maximum nesting depth: 3
- - Maximum scanned entries per archive: 1000
- - Maximum expanded bytes scanned: 25 MB
- - Flags suspicious paths such as `../evil.txt` or absolute-path entries
+ * maximum nesting depth: 3
+ * maximum scanned entries per archive: 1000
+ * maximum expanded bytes scanned: 25 MB
+ * suspicious path detection for entries such as `../evil.txt` or absolute paths
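
A guarded recursive walk along these lines can be sketched with PHP's `ZipArchive`. The constants reuse the limits listed above; `isSuspiciousEntryPath()` and `scanZip()` are hypothetical names, and the package's real scanner may be structured quite differently.

```php
<?php
declare(strict_types=1);

// Limits copied from the guard list; the constant names are illustrative.
const ZIP_MAX_DEPTH = 3;
const ZIP_MAX_ENTRIES = 1000;
const ZIP_MAX_EXPANDED_BYTES = 25 * 1024 * 1024;

/** Flags traversal segments, absolute paths, and Windows drive prefixes. */
function isSuspiciousEntryPath(string $name): bool
{
    $normalized = str_replace('\\', '/', $name);
    if ($normalized === '' || $normalized[0] === '/') {
        return true;
    }
    if (preg_match('/^[A-Za-z]:/', $normalized) === 1) {
        return true;
    }

    return in_array('..', explode('/', $normalized), true);
}

/** Returns a list of human-readable findings for one archive and its nested archives. */
function scanZip(string $path, int $depth = 1, int &$expandedBytes = 0): array
{
    if ($depth > ZIP_MAX_DEPTH) {
        return ['nesting depth exceeds ' . ZIP_MAX_DEPTH];
    }
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return ["cannot open archive: $path"];
    }

    $findings = [];
    if ($zip->numFiles > ZIP_MAX_ENTRIES) {
        $findings[] = 'entry count exceeds ' . ZIP_MAX_ENTRIES;
    }
    $limit = min($zip->numFiles, ZIP_MAX_ENTRIES);
    for ($i = 0; $i < $limit; $i++) {
        $stat = $zip->statIndex($i);
        if ($stat === false) {
            continue;
        }
        if (isSuspiciousEntryPath($stat['name'])) {
            $findings[] = 'suspicious path: ' . $stat['name'];
        }
        // Uses the declared uncompressed size; a hostile archive can misreport it.
        $expandedBytes += $stat['size'];
        if ($expandedBytes > ZIP_MAX_EXPANDED_BYTES) {
            $findings[] = 'expanded size exceeds limit';
            break;
        }
        if (strtolower(pathinfo($stat['name'], PATHINFO_EXTENSION)) === 'zip') {
            // Nested archive: copy to a temp file so ZipArchive::open() can read it.
            $data = $zip->getFromIndex($i);
            $tmp = tempnam(sys_get_temp_dir(), 'zip');
            if ($data === false || $tmp === false) {
                continue;
            }
            file_put_contents($tmp, $data);
            $findings = array_merge($findings, scanZip($tmp, $depth + 1, $expandedBytes));
            unlink($tmp);
        }
    }
    $zip->close();

    return $findings;
}
```

Calling `scanZip('/path/to/upload.zip')` would return an empty array for an archive that trips none of the guards. Nothing is extracted to its original path, which is what keeps traversal entries harmless during the scan.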

## HTML and SVG policy

- HTML cleanup uses PHP's DOM support to parse and rewrite content, removing disallowed tags and risky attributes instead of relying on `strip_tags()`, which PHP user notes caution is not enough for safe attribute handling. PHP's DOM APIs support HTML parsing and tree editing.
+ HTML and SVG sanitization is policy-based and removes risky constructs instead of relying on simple tag stripping.

Highlights:

- - Removes `script`, `iframe`, `object`, `embed`, `form`, and other non-allowlisted elements
- - Removes all `on*` event handlers
- - Removes `javascript:`, `vbscript:`, `file:`, and unsafe `data:` URLs
- - Drops hostile CSS such as `expression()`, `@import`, `url()`, `behavior:`, and `-moz-binding`
- - Removes SVG active content elements such as `script`, `foreignObject`, animation elements, external media, `image`, and `use`
+ * removes `script`, `iframe`, `object`, `embed`, `form`, and other disallowed elements
+ * removes all `on*` event handlers
+ * removes `javascript:`, `vbscript:`, `file:`, and unsafe `data:` URLs
+ * removes hostile CSS such as `expression()`, `@import`, `url()`, `behavior:`, and `-moz-binding`
+ * removes SVG active content such as `script`, `foreignObject`, animation elements, external media, `image`, and `use`
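
To make the difference from simple tag stripping concrete, here is a reduced DOM-based sketch covering three of the highlights: element removal, `on*` handler removal, and blocked URL schemes. `stripActiveHtml()` is a hypothetical function, the element list is a blocklist subset rather than the package's policy, and CSS filtering and `data:` URL handling are left out.

```php
<?php
declare(strict_types=1);

/** Illustrative only: a small subset of the policy, applied with PHP's DOM extension. */
function stripActiveHtml(string $html): string
{
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $doc = new DOMDocument();
    // The meta tag hints UTF-8 to libxml; the wrapper div gives a known root to serialize.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '<div id="fs-root">' . $html . '</div>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="fs-root"]')->item(0);
    if ($root === null) {
        return '';
    }

    // Remove active elements entirely, including their children.
    foreach (['script', 'iframe', 'object', 'embed', 'form'] as $tag) {
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Remove event handlers and href/src values that use a blocked scheme.
    foreach ($xpath->query('//*[@*]') as $element) {
        foreach (iterator_to_array($element->attributes) as $attr) {
            $name = strtolower($attr->nodeName);
            // Strip whitespace and control bytes so "java\tscript:" is still caught.
            $value = strtolower((string) preg_replace('/[\s\x00-\x1f]+/', '', (string) $attr->nodeValue));
            $isHandler = strncmp($name, 'on', 2) === 0;
            $isBlockedUrl = in_array($name, ['href', 'src'], true)
                && preg_match('/^(javascript|vbscript|file):/', $value) === 1;
            if ($isHandler || $isBlockedUrl) {
                $element->removeAttribute($attr->nodeName);
            }
        }
    }

    // Serialize only the children of the wrapper, not the html/body scaffolding.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}
```

Working on the parsed tree means attributes are inspected individually, which `strip_tags()` cannot do. An allowlist of permitted elements and attributes would be a stricter design than the blocklist used in this sketch.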
+
+ ## Audio support
+
+ FileSanitizer includes best-effort support for common audio formats.

- ## Tests
+ ### What it does
+
+ * Detects suspicious embedded payloads such as:
+
+   * `<script`
+   * `javascript:`
+   * inline event handler patterns like `onclick=`
+   * `<iframe`
+   * `data:text/html`
+   * embedded PHP tags
+
+ * Removes metadata where practical:
+
+   * MP3: ID3v1 and ID3v2 tags
+   * WAV: selected metadata chunks such as `LIST`, `INFO`, and `ID3`
+ Audio sanitization is best-effort and does not transcode or fully rebuild complex media containers. No shell tools, SSH access, or external binaries are required.
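
As an illustration of metadata removal without transcoding, the sketch below strips a leading ID3v2 tag and a trailing ID3v1 tag from raw MP3 bytes. `stripId3()` is a hypothetical helper written from the ID3 tag layouts, not code taken from the package, and it ignores less common cases such as tags appended at the end of the file.

```php
<?php
declare(strict_types=1);

/** Removes a leading ID3v2 tag and a trailing ID3v1 tag; audio frames are left untouched. */
function stripId3(string $mp3): string
{
    // ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4-byte synchsafe size (7 bits per byte).
    if (strlen($mp3) >= 10 && substr($mp3, 0, 3) === 'ID3') {
        $size = ((ord($mp3[6]) & 0x7F) << 21)
              | ((ord($mp3[7]) & 0x7F) << 14)
              | ((ord($mp3[8]) & 0x7F) << 7)
              |  (ord($mp3[9]) & 0x7F);
        $total = 10 + $size;
        if ((ord($mp3[5]) & 0x10) !== 0) {
            $total += 10; // footer-present flag adds a 10-byte footer
        }
        // Only cut when the declared size fits inside the file.
        if ($total <= strlen($mp3)) {
            $mp3 = substr($mp3, $total);
        }
    }

    // ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if (strlen($mp3) >= 128 && substr($mp3, -128, 3) === 'TAG') {
        $mp3 = substr($mp3, 0, -128);
    }

    return $mp3;
}
```

A caller would read the upload with `file_get_contents()`, pass the bytes through `stripId3()`, and write the result to the sanitized output path.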
+
+ ## Video support
+
+ FileSanitizer includes best-effort support for common video containers.
+
+ ### What it does
+
+ * Detects suspicious embedded payloads such as:
+
+   * `<script`
+   * `javascript:`
+   * inline event handler patterns like `onload=`
+   * `<iframe`
+   * `data:text/html`
+   * embedded PHP tags
+
+ * Applies conservative container cleanup where practical:
+
+   * MP4 and MOV: attempts to remove selected metadata atoms such as `udta`, `meta`, and `ilst`
+   * AVI: removes selected metadata chunks such as `INFO`, `JUNK`, and `IDIT`
+   * WebM and MKV: applies conservative best-effort textual payload cleanup
+
+ ### Notes
+
+ Video sanitization is best-effort and does not transcode or fully rebuild media containers. Without external tools such as FFmpeg, full structural video rewriting is intentionally out of scope.
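
For RIFF-based containers such as AVI, conservative chunk removal can be sketched as a single pass over the top-level chunks. `stripRiffChunks()` is a hypothetical function. It only inspects top-level chunks, so metadata nested inside other `LIST` chunks (for example an `IDIT` chunk inside the header list) would survive, and index chunks that store absolute file offsets are not adjusted.

```php
<?php
declare(strict_types=1);

/**
 * Drops selected top-level chunks from a RIFF file and rewrites the RIFF size field.
 * Layout assumed: "RIFF" + uint32 LE size + 4-byte form type, then chunks of
 * 4-byte id + uint32 LE size + data (+ 1 pad byte when the size is odd).
 */
function stripRiffChunks(
    string $riff,
    array $dropIds = ['JUNK', 'IDIT'],
    array $dropListTypes = ['INFO']
): string {
    if (strlen($riff) < 12 || substr($riff, 0, 4) !== 'RIFF') {
        return $riff; // not a RIFF container; leave untouched
    }

    $formType = substr($riff, 8, 4);
    $length = strlen($riff);
    $offset = 12;
    $body = '';

    while ($offset + 8 <= $length) {
        $id = substr($riff, $offset, 4);
        $size = unpack('V', substr($riff, $offset + 4, 4))[1];
        if ($offset + 8 + $size > $length) {
            break; // truncated or malformed chunk: stop and keep the remainder as-is
        }
        $padded = $size + ($size % 2);
        $isDroppedList = $id === 'LIST'
            && $size >= 4
            && in_array(substr($riff, $offset + 8, 4), $dropListTypes, true);
        if (!in_array($id, $dropIds, true) && !$isDroppedList) {
            $body .= substr($riff, $offset, 8 + $padded);
        }
        $offset += 8 + $padded;
    }
    if ($offset < $length) {
        $body .= substr($riff, $offset); // preserve anything that was not parsed
    }

    // The RIFF size counts the form type plus all remaining chunk bytes.
    return 'RIFF' . pack('V', 4 + strlen($body)) . $formType . $body;
}
```

The same walk applies to WAV files, which are also RIFF containers, by passing different chunk ids. Stopping at the first malformed chunk and keeping the remaining bytes is a deliberately cautious choice for a best-effort cleaner.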
+
+ ## Test coverage

Included PHPUnit coverage exercises:

- - nested ZIP detection
- - path traversal detection inside ZIPs
- - HTML sanitization rules
- - SVG sanitization rules
- - PDF action detection
+ * nested ZIP detection
+ * path traversal detection inside ZIPs
+ * HTML sanitization rules
+ * SVG sanitization rules
+ * PDF action detection
+ * audio metadata stripping
+ * video file scanning for embedded payloads and metadata stripping

## Limitations

- - PDF cleanup is still best-effort rather than a full structural rewrite
- - OOXML files are scanned for risky content and external references but not fully rewritten yet
- - This package is a sanitizer and heuristic scanner, not a substitute for sandboxing or AV scanning
+ FileSanitizer is a pure PHP package focused on safe, practical, best-effort sanitization.
+
+ ### Important limitations
+
+ * PDF sanitization is best-effort and not a full PDF rebuild
+ * OOXML files are scanned for risky content but are not fully rewritten
+ * audio sanitization removes metadata where practical but does not transcode files
+ * video sanitization is best-effort and does not perform full re-encoding or container rebuilding
+ * complex media formats may still require deeper inspection in high-security environments
$issues[] = new Issue('no_sanitizer', 'No specialized sanitizer exists for this file type; original file was copied after scanning.', IssueSeverity::Warning);
$issues[] = new Issue('no_sanitizer', 'No specialized sanitizer exists for this file type; original file was copied after scanning.', IssueSeverity::Warning);