FIX:: 1056 hardened xml parser by vdstrizhkova · Pull Request #1695 · microsoft/hve-core

vdstrizhkova · 2026-05-29T10:28:49Z

Pull Request

Description

Extends the XML parser hardening from PR #1053 to the remaining unprotected call

Related Issue(s)

"Closes #1056"

Type of Change

Select all that apply:

Code & Documentation:

Bug fix (non-breaking change fixing an issue)
New feature (non-breaking change adding functionality)
Breaking change (fix or feature causing existing functionality to change)
Documentation update

Infrastructure & Configuration:

AI Artifacts:

Reviewed contribution with prompt-builder agent and addressed all feedback
Copilot instructions (.github/instructions/*.instructions.md)
Copilot prompt (.github/prompts/*.prompt.md)
Copilot agent (.github/agents/*.agent.md)
Copilot skill (.github/skills/*/SKILL.md)

Note for AI Artifact Contributors:

Agents: Research, indexing/referencing other project (using standard VS Code GitHub Copilot/MCP tools), planning, and general implementation agents likely already exist. Review .github/agents/ before creating new ones.

Skills: Must include both bash and PowerShell scripts. See Skills.

Model Versions: Only contributions targeting the latest Anthropic and OpenAI models will be accepted. Older model versions (e.g., GPT-3.5, Claude 3) will be rejected.

See Agents Not Accepted and Model Version Requirements.

Other:

Script/automation (.ps1, .sh, .py)
Other (please describe):

Testing

npm run test:py passes with exit code 0 before and after the change.
Added a new unit test, test_timing_template_parsed_with_hardened_parser_blocks_entity_expansion.

Checklist

Required Checks

Documentation is updated (if applicable)
Files follow existing naming conventions
Changes are backwards compatible (if applicable)
Tests added for new functionality (if applicable)

Required Automated Checks

The following validation commands must pass before merging:

Markdown linting: npm run lint:md
Spell checking: npm run spell-check
Frontmatter validation: npm run lint:frontmatter
Skill structure validation: npm run validate:skills
Link validation: npm run lint:md-links
PowerShell analysis: npm run lint:ps
Plugin freshness: npm run plugin:generate
Docusaurus tests: npm run docs:test

Security Considerations

This PR does not contain any sensitive or NDA information
Any new dependencies have been reviewed for security issues
Security-related scripts follow the principle of least privilege

Additional Notes

Fixed a broken link in CONTRIBUTING.md.

codecov-commenter · 2026-05-29T10:31:17Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.79%. Comparing base (b30a75d) to head (e5a5cce).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1695      +/-   ##
==========================================
+ Coverage   80.55%   82.79%   +2.23%     
==========================================
  Files         112       65      -47     
  Lines       18679     8567   -10112     
==========================================
- Hits        15047     7093    -7954     
+ Misses       3632     1474    -2158

| Flag | Coverage Δ | |
|---|---|---|
| pester | `84.01% <ø> (-0.02%)` | ⬇️ |
| pytest | `52.69% <100.00%> (-25.13%)` | ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ | |
|---|---|---|
| .../experimental/tts-voiceover/scripts/embed_audio.py | `44.02% <100.00%> (+0.42%)` | ⬆️ |

... and 48 files with indirect coverage changes

vdstrizhkova · 2026-05-29T11:06:38Z

@microsoft-github-policy-service agree company="Microsoft"

bindsi · 2026-05-29T12:06:49Z

/review

github-actions · 2026-05-29T12:09:56Z

✅ PR Review completed successfully!

github-actions

Review Summary

This PR correctly addresses the XXE hardening goal — adding XMLParser(resolve_entities=False, no_network=True) to _add_narration_timing is a sound defence-in-depth measure, and the broken byzanz link fix in CONTRIBUTING.md is welcome. However, one required change and one suggestion need attention before merge.

Issue Alignment

Branch name and commit message reference issue #1056 (hardened XML parser). The security fix in embed_audio.py and the accompanying test are directly aligned with that goal. ✅

PR Template Compliance

Unable to verify PR description contents due to an integrity policy filter on the PR metadata. If the description is incomplete or missing required template sections, please fill them in.

Coding Standards — ⚠️ Required Change

CONTRIBUTING.md — list marker inconsistency (see inline comment at line 72)

The diff changes the Table of Contents bullets from * to -. The rest of CONTRIBUTING.md continues to use *, creating a mixed-marker file. This violates the repo's markdown instructions (.github/instructions/hve-core/markdown.instructions.md):

Use unordered list markers consistently across a file; for the same level, do not mix *, +, -. Try to always use * for unordered lists.

Please revert the ToC list markers to *.

Code Quality — 💡 Suggestion

test_embed_audio.py — test validates lxml behaviour, not the production fix (see inline comment at line 125)

The new test exercises lxml's parser options directly without calling _add_narration_timing. A regression in the production function (e.g., the parser argument being dropped) would not be caught. The inline comment includes a suggested complementary test that patches etree.fromstring to assert the hardened parser is actually passed.
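A minimal sketch of the kind of complementary test described above, assuming lxml is installed. `add_timing_sketch` and `TEMPLATE` are hypothetical stand-ins, since the body of `_add_narration_timing` and the contents of `_TIMING_TEMPLATE` are not shown in this thread. The idea is to patch the lxml entry points and assert on how the function under test calls them, so that dropping the parser argument would fail the check.

```python
from unittest import mock

from lxml import etree

# Hypothetical stand-in for the real timing template (contents assumed).
TEMPLATE = b"<timing/>"


def add_timing_sketch():
    """Hypothetical stand-in for _add_narration_timing's parse step."""
    parser = etree.XMLParser(resolve_entities=False, no_network=True)
    return etree.fromstring(TEMPLATE, parser)


def hardened_parser_is_passed() -> bool:
    # Arrange: replace the lxml entry points so the calls can be inspected.
    with mock.patch.object(etree, "XMLParser") as parser_cls, \
            mock.patch.object(etree, "fromstring") as fromstring:
        # Act: run the code under test against the patched module.
        add_timing_sketch()

    # Assert: the parser was built with hardened options and then passed on.
    parser_cls.assert_called_once_with(resolve_entities=False, no_network=True)
    fromstring.assert_called_once_with(TEMPLATE, parser_cls.return_value)
    return True
```

In the real test module the patch targets would point at the names as imported by embed_audio.py rather than at this stand-in.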

What's Good ✅

Security fix in embed_audio.py is correct: XMLParser(resolve_entities=False, no_network=True) passed to etree.fromstring is the right approach for defence-in-depth.
byzanz link updated from the defunct git.gnome.org URL to the correct GitLab archive. ✅
Logger call reformatting improves readability for long argument lists. ✅
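For readers unfamiliar with the idiom, here is a small sketch of what the hardened parser options do, assuming lxml is installed. `PAYLOAD` and `parse_hardened` are illustrative names that do not come from the PR. With `resolve_entities=False`, the entity reference should be left unexpanded in the parsed element instead of being replaced by its declared value.

```python
from lxml import etree

# Hypothetical payload: an internal entity that a permissive parser would expand.
PAYLOAD = (
    b'<?xml version="1.0"?>'
    b'<!DOCTYPE r [<!ENTITY x "expanded-secret">]>'
    b"<r>&x;</r>"
)


def parse_hardened(xml_bytes: bytes):
    """Parse XML without resolving entities or touching the network."""
    parser = etree.XMLParser(resolve_entities=False, no_network=True)
    return etree.fromstring(xml_bytes, parser)


root = parse_hardened(PAYLOAD)
# Serialising the element (not the whole tree) omits the DOCTYPE declaration,
# so the entity's replacement text can only appear here if it was expanded.
serialized = etree.tostring(root).decode()
```

Checking `serialized` for the replacement text is one way to confirm that expansion did not happen.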

Action Items

Required: Revert CONTRIBUTING.md ToC list markers from - back to *.
Suggested: Add a test that calls _add_narration_timing and asserts the hardened parser is passed to etree.fromstring.

Note

🔒 Integrity filter blocked 3 items

The following items were blocked because they don't meet the GitHub integrity level.

#1695 issue_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
#1695 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
#1695 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by PR Review for issue #1695 · sonnet46 2.2M

bindsi · 2026-06-04T12:01:52Z

/review

github-actions · 2026-06-04T12:03:19Z

✅ PR Review completed successfully!

github-actions

Advisory review: this PR is from a maintainer. Findings are informational only.

Overview

This PR applies a hardened lxml.etree.XMLParser(resolve_entities=False, no_network=True) to the _TIMING_TEMPLATE parse call in _add_narration_timing, and adds two new tests to validate the defence. The security intent is correct and the fix itself is minimal and well-targeted.

A codebase-wide grep confirms no other unhardened etree.fromstring / etree.parse call sites exist in .github/skills/ or scripts/ (the xml.sax.saxutils usages in generate_voiceover.py and fuzz_harness.py are for escaping/quoting output, not parsing, so they are not affected).

Issue Alignment

Partially aligned with issue #1056. The issue's acceptance criteria explicitly require:

A documented call site inventory (file path, line number, parser config, trust level)
Evidence that the codebase-wide grep was completed
Follow-up issues for any findings needing larger refactoring

None of these artefacts appear in this PR. Since the grep shows no other vulnerable sites, the remediation appears complete in practice — but the checklist items remain open on the issue. Consider either closing the issue with a note that the audit found only this one site, or updating the issue body with the inventory.
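One way the call site inventory could be produced is sketched below using only the standard library. The function name, the regular expression, and the sample file are assumptions for illustration, not the tooling the reviewer used. The scan records only file path, line number, and source line. Parser configuration and trust level would still need manual review for each hit.

```python
import re
import tempfile
from pathlib import Path

# Matches direct calls such as etree.fromstring(...) or etree.parse(...).
CALL_PATTERN = re.compile(r"\betree\.(fromstring|parse)\s*\(")


def inventory_call_sites(root: Path) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, source line) for each match."""
    sites = []
    for path in sorted(root.rglob("*.py")):
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if CALL_PATTERN.search(line):
                rel = path.relative_to(root).as_posix()
                sites.append((rel, lineno, line.strip()))
    return sites


# Demonstration against a throwaway directory with one hypothetical script.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "scripts" / "sample.py"
    sample.parent.mkdir(parents=True)
    sample.write_text(
        "from lxml import etree\n"
        "root = etree.fromstring(data)\n"
        "text = escape(value)\n",
        encoding="utf-8",
    )
    found = inventory_call_sites(Path(tmp))
```

A text match like this misses aliased imports (for example `from lxml.etree import fromstring`), so it is a starting point for the audit, not proof of completeness.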

PR Template Compliance

No blocking template gaps identified from the available diff context.

Coding Standards

Three test-convention deviations noted via inline comments:

Test naming — both new methods use a descriptive-prose style instead of the test_given_context_when_action_then_expected format that every pre-existing test in this class follows (python-tests.instructions.md).
AAA comment structure — the pre-existing tests all include # Arrange, # Act, # Assert section comments; the new tests omit them.
Dead mock setup — mock_slide.element.find.return_value = None patches the wrong attribute (.element vs ._element).

Code Quality

_parser leading underscore on a local variable is a minor style nit (see inline comment on embed_audio.py:115).
The first new test validates lxml library behaviour rather than _add_narration_timing directly; its value as coverage for the PR's intent is marginal compared with the second test.

Note

🔒 Integrity filter blocked 2 items

The following items were blocked because they don't meet the GitHub integrity level.

#1695 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
#1695 issue_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by PR Review for issue #1695 · sonnet46 3.8M

github-actions · 2026-06-04T12:15:21Z

        assert "old-content" not in xml_str
        assert 'spid="10"' in xml_str

+    def test_timing_template_parsed_with_hardened_parser_blocks_entity_expansion(self):


The test name doesn't follow the existing test_given_context_when_action_then_expected convention used throughout this file (e.g. test_given_slide_xml_when_add_timing_then_timing_element_appended). Also, the test body is missing # Arrange, # Act, and # Assert section comments required by python-tests.instructions.md.

Additionally, this test validates lxml.etree.XMLParser library behavior rather than _add_narration_timing itself — consider whether it adds coverage beyond the second new test, or rename/reframe to make the purpose clear.

Suggested rename:

def test_given_xxe_payload_when_hardened_parser_used_then_entity_not_expanded(self):

bindsi

Thanks for picking this up, @vdstrizhkova — nice job closing out the XML hardening from #1053/#1056. The fix correctly applies the established XMLParser(resolve_entities=False, no_network=True) idiom (matching powerpoint/scripts/extract_content.py), and test_add_narration_timing_uses_hardened_xml_parser is a solid behavioral test that verifies the production code actually passes the hardened parser. Verified locally: all 9 tests pass and ruff check is clean. ✅

Approving. A few NIT recommendations for a future cleanup (all non-blocking):

NIT 1 — Redundant first test. test_timing_template_parsed_with_hardened_parser_blocks_entity_expansion never imports or calls _add_narration_timing/embed_audio; it builds its own parser and asserts lxml's behavior, which is effectively a library test. The second test already proves the production code uses a hardened parser, so this one could be removed (or refactored to assert against the module's actual parser).

NIT 2 — Dead stub line. In test_add_narration_timing_uses_hardened_xml_parser:

mock_slide.element.find.return_value = None

The production code reads slide._element.find(...), not slide.element.find(...), and _element is reassigned to a real etree.Element on the next line — so this stub is a no-op and can be dropped.
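The no-op can be reproduced with the standard library alone. In this sketch, `find_timing` is a hypothetical stand-in for the production lookup and the tag name is assumed. `MagicMock` creates `element` and `_element` as two unrelated child mocks, so stubbing one has no effect on the other.

```python
from unittest.mock import MagicMock

mock_slide = MagicMock()
# The stub from the test: configures the "element" child mock only.
mock_slide.element.find.return_value = None


def find_timing(slide):
    """Hypothetical stand-in for production code reading the private attribute."""
    return slide._element.find("timing")


stubbed = mock_slide.element.find("timing")  # the path the stub configures
actual = find_timing(mock_slide)             # the path production code takes
```

Here `stubbed` is `None`, while `actual` is an auto-created `MagicMock`, which shows the stub never reaches the code under test.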

NIT 3 — Naming consistency. The new local _parser uses a leading underscore (conventionally "private/unused"), whereas the sibling skill uses parser. Renaming to parser keeps it consistent with extract_content.py.

None of these block merge. Thanks again for the contribution! 🙌

bindsi

Automated batch re-review: no actionable findings. The XML parser hardening remains in place and the prior concern is resolved.

Fixes 1056

df7ed12

vdstrizhkova requested a review from a team as a code owner May 29, 2026 10:28

vdstrizhkova marked this pull request as draft May 29, 2026 10:29

vdstrizhkova changed the title ~~hardened xml parser~~ FIX:: 1056 hardened xml parser May 29, 2026

Fixed the broken link

1ea03ee

vdstrizhkova marked this pull request as ready for review May 29, 2026 11:06

github-actions Bot added the needs-revision label May 29, 2026

github-actions Bot requested changes May 29, 2026

Comment thread CONTRIBUTING.md

Comment thread .github/skills/experimental/tts-voiceover/tests/test_embed_audio.py

Varvara Strizhkova and others added 6 commits May 30, 2026 07:08

revert(docs): restore CONTRIBUTING.md to main state

53e40cf

Merge branch 'main' into fix_1056_hardened_xml_parser

7049d84

style(tts-voiceover): apply ruff format to embed_audio.py

e079215

Merge remote-tracking branch 'fork/fix_1056_hardened_xml_parser' into fix_1056_hardened_xml_parser

9a2943a

Merge remote-tracking branch 'origin/main' into fix_1056_hardened_xml_parser

5361240

Fixing test

fb85efe

vdstrizhkova requested a review from WilliamBerryiii June 3, 2026 06:23

Fixing lint

c2e3791

vdstrizhkova self-assigned this Jun 3, 2026

and again

4d94aa6

github-actions Bot reviewed Jun 4, 2026

bindsi approved these changes Jun 4, 2026

WilliamBerryiii and others added 4 commits June 5, 2026 21:24

Merge branch 'main' into fix_1056_hardened_xml_parser

ebda730

rename _parser to parser in embed_audio,

71e3f0a

remove redundant XXE library test, drop dead mock stub line

rename _parser to parser in embed_audio,

7e86baa

remove redundant XXE library test, drop dead mock stub line

Merge branch 'main' into fix_1056_hardened_xml_parser

e5a5cce

bindsi approved these changes Jun 8, 2026
