
fix(ocr): emit valid Markdown newlines for PPTX conversion #2011

Open
MontesanoDev wants to merge 1 commit into microsoft:main from MontesanoDev:fix-ocr-pptx-markdown-newlines

Conversation

@MontesanoDev

Closes #2010

Summary

PptxConverterWithOCR emitted literal \n sequences instead of actual line breaks in Markdown output.
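To illustrate the symptom, here is a minimal sketch of one plausible way such output can arise in Python: an escaped `"\\n"` in a string literal yields a backslash followed by `n`, not a line break. The functions `render_slide_buggy` and `render_slide_fixed` are hypothetical names made up for this example and are not taken from the converter's code.

```python
def render_slide_buggy(title: str, body: str) -> str:
    # Hypothetical helper: "\\n" is two characters (backslash + 'n'),
    # so the Markdown ends up on a single line with visible "\n" text.
    return f"# {title}\\n\\n{body}\\n"


def render_slide_fixed(title: str, body: str) -> str:
    # Hypothetical helper: "\n" is a real newline character, so the
    # heading and body land on separate lines as Markdown expects.
    return f"# {title}\n\n{body}\n"


if __name__ == "__main__":
    print(repr(render_slide_buggy("Title", "Body")))
    print(repr(render_slide_fixed("Title", "Body")))
```

A Markdown renderer treats the buggy string as a single heading line containing the text `\n`, while the fixed string splits into a heading, a blank line, and a paragraph.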

The optional LLM caption path also imported llm_caption from a module that does not exist in the markitdown_ocr package.

Changes

  • Replace literal \n sequences with actual newlines in PPTX Markdown output.
  • Import llm_caption from markitdown.converters._llm_caption.
  • Update PPTX OCR snapshots to expect valid Markdown line breaks.
  • Add regression tests for newline output and the LLM caption import path.
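A regression check for the import path could look roughly like the sketch below, which uses only the standard library to ask whether a dotted module path resolves. The helper name `module_exists` is an assumption for this example; the PR's actual tests may be written differently.

```python
import importlib.util


def module_exists(dotted_path: str) -> bool:
    """Return True if the dotted module path can be resolved for import."""
    try:
        # find_spec imports parent packages and returns None when the
        # final component cannot be found.
        return importlib.util.find_spec(dotted_path) is not None
    except (ModuleNotFoundError, ValueError):
        # Raised when a parent package is missing or is not a package.
        return False


if __name__ == "__main__":
    # With markitdown installed, this is the path the PR switches to.
    print(module_exists("markitdown.converters._llm_caption"))
```

A test built on this idea would fail as soon as the caption import pointed at a module that does not exist, rather than only when the optional LLM path is exercised at runtime.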

Testing

pytest packages/markitdown-ocr/tests/test_pptx_converter.py -v

Result: 8 passed.

@MontesanoDev
Author

@microsoft-github-policy-service agree



Development

Successfully merging this pull request may close these issues.

markitdown-ocr: PPTX converter emits literal \n sequences in Markdown output

1 participant