fix: handle DOCX files with inconsistent ZIP filename casing (#1812) #2016

Open
lyydsheep wants to merge 1 commit into microsoft:main from lyydsheep:fix/docx-zip-filename-casing-mismatch
@lyydsheep
Summary

Fixes #1812 — DOCX files with inconsistent ZIP filename casing between central directory and local file headers cause BadZipFile exceptions.

Some document generators (e.g. certain Microsoft Word versions, legal document systems) produce .docx files where the central directory records one casing (e.g. customXml/item2.xml) but the local file headers record another (e.g. customXML/item2.xml). Python's zipfile module raises BadZipFile when reading such files.

Root Cause

In pre_process_docx(), line 140 calls zip_input.read(name) for each file in the ZIP. Python's zipfile validates that the local file header filename matches the central directory entry at read time — a casing mismatch triggers:

BadZipFile: File name in directory 'customXml/item2.xml' and header b'customXML/item2.xml' differ.
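
The mismatch can be reproduced without a real .docx. The following is a minimal sketch, not code from this PR: `build_zip` and `corrupt_local_header_casing` are hypothetical helpers that build a small archive in memory and then overwrite the filename in one local file header while leaving the central directory untouched. It assumes ASCII entry names and no data prepended to the archive.

```python
import io
import zipfile

# Size of the fixed part of a ZIP local file header; the filename follows it.
LOCAL_HEADER_FIXED_SIZE = 30


def build_zip(entries):
    """Return the bytes of an in-memory ZIP built from {name: data}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return buf.getvalue()


def corrupt_local_header_casing(zip_bytes, name, new_name):
    """Overwrite the local header filename of `name` with `new_name`.

    Hypothetical helper for illustration: the central directory keeps `name`,
    so the two records disagree afterwards.
    """
    if len(name) != len(new_name):
        raise ValueError("names must have the same length")
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        offset = zf.getinfo(name).header_offset
    data = bytearray(zip_bytes)
    if data[offset:offset + 4] != b"PK\x03\x04":
        raise ValueError("local file header signature not found")
    start = offset + LOCAL_HEADER_FIXED_SIZE
    encoded = new_name.encode("ascii")
    data[start:start + len(encoded)] = encoded
    return bytes(data)


if __name__ == "__main__":
    good = build_zip({"customXml/item2.xml": b"<a/>"})
    bad = corrupt_local_header_casing(
        good, "customXml/item2.xml", "customXML/item2.xml"
    )
    try:
        # Opening succeeds (only the central directory is parsed);
        # the header comparison happens when the member is read.
        zipfile.ZipFile(io.BytesIO(bad)).read("customXml/item2.xml")
    except zipfile.BadZipFile as exc:
        print(f"BadZipFile: {exc}")
```

Opening the archive works because only the central directory is parsed at that point; the exception appears on the first read of the affected member.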

Fix

Add _fix_zip_filename_casing() in pre_process.py that:

  1. Reads the ZIP's central directory (authoritative source) to get correct filenames
  2. Patches any local file header filenames that differ in casing
  3. Only patches when lengths match (safe — only casing differs)
  4. Returns a clean ZIP stream before any further processing
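
The steps above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the implementation in this PR: the function name `fix_zip_filename_casing_sketch` is hypothetical, the whole stream is read into memory, the archive is assumed to have no prepended data, and the local header is assumed to use the same name encoding as the central directory entry.

```python
import io
import zipfile

LOCAL_HEADER_SIGNATURE = b"PK\x03\x04"
LOCAL_HEADER_FIXED_SIZE = 30  # filename starts right after the fixed fields
NAME_LENGTH_OFFSET = 26       # 2-byte little-endian filename length
UTF8_FLAG = 0x800             # general-purpose bit 11: name is UTF-8


def fix_zip_filename_casing_sketch(stream):
    """Return a new BytesIO whose local header names match the central directory.

    Hypothetical sketch: only names of equal byte length that differ purely
    in ASCII casing are patched; everything else is left untouched.
    """
    data = bytearray(stream.read())

    # Step 1: the central directory is treated as the authoritative source.
    with zipfile.ZipFile(io.BytesIO(bytes(data))) as zf:
        infos = zf.infolist()

    for info in infos:
        offset = info.header_offset
        if data[offset:offset + 4] != LOCAL_HEADER_SIGNATURE:
            continue  # unexpected layout; do not guess
        name_len = int.from_bytes(
            data[offset + NAME_LENGTH_OFFSET:offset + NAME_LENGTH_OFFSET + 2],
            "little",
        )
        start = offset + LOCAL_HEADER_FIXED_SIZE
        local_name = bytes(data[start:start + name_len])

        encoding = "utf-8" if info.flag_bits & UTF8_FLAG else "cp437"
        try:
            expected = info.orig_filename.encode(encoding)
        except UnicodeEncodeError:
            continue

        # Steps 2 and 3: patch only when lengths match and just casing differs.
        # bytes.lower() folds ASCII letters only, which keeps the check narrow.
        if (
            local_name != expected
            and len(local_name) == len(expected)
            and local_name.lower() == expected.lower()
        ):
            data[start:start + name_len] = expected

    # Step 4: hand back a fresh stream positioned at the start.
    return io.BytesIO(bytes(data))
```

Because the replacement name has the same byte length as the original, no offsets, sizes, or checksums elsewhere in the archive need to change, which is why the length check makes the patch safe.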

Called at the top of pre_process_docx(), so the fix applies to all downstream consumers (DocxConverter, DocxConverterWithOCR).

Test Plan

  • New test test_docx_zip_filename_casing_mismatch — corrupts a valid .docx by uppercasing local header names, verifies zipfile.BadZipFile is raised without the fix, and confirms MarkItDown produces identical output with the fix
  • Existing docx tests (test_docx_comments, test_docx_equations) still pass
  • Full local test suite passes (10/10)

…ft#1812)

Some document generators (e.g. certain Microsoft Word versions, legal
document systems) produce .docx files where the central directory records
one casing (e.g. 'customXml/item2.xml') but the local file headers record
another (e.g. 'customXML/item2.xml'). Python's zipfile module raises
BadZipFile when reading such files.

Add _fix_zip_filename_casing() to patch local file header filenames to
match the central directory before any ZIP processing occurs.
@lyydsheep
Author

@microsoft-github-policy-service agree

Development

Successfully merging this pull request may close these issues.

BadZipFile crash on .docx files with case-mismatched zip entry names