Fix metadata path leaking outside cache when an ancestor dir contains '.py' #762

Open. Kymi808 wants to merge 1 commit into huggingface:main from Kymi808:fix/meta-path-splits-on-py-substring
Kymi808 commented May 30, 2026

Summary

_copy_script_and_other_resources_in_importable_dir (src/evaluate/loading.py:333) derives the cache metadata JSON sibling of the loaded script:

meta_path = importable_local_file.split(".py")[0] + ".json"

The intent (per the comment two lines below: "the filename is *.py in our case, so better rename to filenam.json instead of filename.py.json") is to swap the trailing .py extension for .json. But str.split(".py") splits on every occurrence of .py, and [0] keeps only the text before the first one.

For any user whose cache path has .py somewhere in an ancestor directory (.pyenv, .pycache, pypy install paths, project directories containing .py), the result is truncated just before the first such occurrence, so everything from that directory onward is lost.

A pyenv user with an importable file at
/home/u/.pyenv/versions/.../evaluate_modules/metrics/accuracy/<hash>/accuracy.py
ends up writing the metadata to /home/u/.json — outside the cache tree — and every subsequent evaluate.load(...) clobbers it.

>>> import os
>>> "/home/u/.pyenv/.../accuracy.py".split(".py")[0] + ".json"
'/home/u/.json'                                                # current
>>> os.path.splitext("/home/u/.pyenv/.../accuracy.py")[0] + ".json"
'/home/u/.pyenv/.../accuracy.json'                             # fixed

Fix

Use os.path.splitext(...)[0] so only the trailing extension is stripped, matching the comment's stated intent.
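
For reviewers who want to compare the two derivations side by side, here is a minimal sketch. The helper names and both example paths are made up for illustration and are not part of the patch. Only the two expressions inside the helpers come from the description above.

```python
import os


def meta_path_split(importable_local_file):
    # Expression currently on main: cuts at the FIRST ".py" anywhere in the path.
    return importable_local_file.split(".py")[0] + ".json"


def meta_path_splitext(importable_local_file):
    # Proposed expression: strips only the extension of the final path component.
    return os.path.splitext(importable_local_file)[0] + ".json"


# Hypothetical cache locations, chosen only to show the difference.
plain = "/home/u/cache/evaluate_modules/accuracy/abc/accuracy.py"
pyenv = "/home/u/.pyenv/cache/evaluate_modules/accuracy/abc/accuracy.py"

for path in (plain, pyenv):
    print(meta_path_split(path), "|", meta_path_splitext(path))
```

The two helpers agree on the first path, where .py appears only as the file extension. They diverge on the second, where .pyenv sits in an ancestor directory.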

Why this isn't a behavior change someone is relying on

grep -rn '"original file path"\|"local file path"\|meta_path' shows nothing in the repo reads the metadata back — it's write-only informational JSON. The only effect of the bug is junk JSON files at unexpected paths.

Test plan

Added test_copy_script_metadata_path_when_ancestor_dir_contains_py in tests/test_load.py. It calls _copy_script_and_other_resources_in_importable_dir with importable_directory_path = <tmp>/.pyenv/evaluate_modules and asserts:

  • the metadata file exists at <tmp>/.pyenv/evaluate_modules/<hash>/accuracy.json (next to the script);
  • nothing was leaked to <tmp>/.json.

The test fails on main with AssertionError: meta file missing at <expected path> and passes with this change. pytest tests/test_load.py::test_copy_script_metadata_path_when_ancestor_dir_contains_py → 1 passed.
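
The shape of that check can be sketched without importing the library. This is not the test added in this PR: write_meta is a hypothetical stand-in for the metadata step (using the fixed expression and the two JSON keys named above), and the directory name hash0 is a placeholder for the real hash.

```python
import json
import os
import tempfile


def write_meta(importable_local_file, original_local_path):
    # Hypothetical stand-in for the metadata step, not the library function.
    meta_path = os.path.splitext(importable_local_file)[0] + ".json"
    meta = {
        "original file path": original_local_path,
        "local file path": importable_local_file,
    }
    with open(meta_path, "w", encoding="utf-8") as meta_file:
        json.dump(meta, meta_file)
    return meta_path


def check_meta_location(root):
    # Build <root>/.pyenv/evaluate_modules/hash0/accuracy.py ("hash0" is a placeholder).
    script_dir = os.path.join(root, ".pyenv", "evaluate_modules", "hash0")
    os.makedirs(script_dir)
    script = os.path.join(script_dir, "accuracy.py")
    with open(script, "w", encoding="utf-8") as script_file:
        script_file.write("# placeholder script\n")

    write_meta(script, "accuracy.py")

    expected = os.path.join(script_dir, "accuracy.json")
    leaked = os.path.join(root, ".json")
    return os.path.exists(expected), os.path.exists(leaked)


with tempfile.TemporaryDirectory() as tmp:
    found_next_to_script, leaked_to_root = check_meta_location(tmp)
```

Swapping the split(".py") expression into write_meta should flip both results, which is the failure mode the regression test is meant to catch.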

`_copy_script_and_other_resources_in_importable_dir` derived the cache
metadata sibling via `importable_local_file.split(".py")[0] + ".json"`. The
intent (per the comment immediately below) is to swap the trailing `.py`
extension for `.json`, but `str.split(".py")` splits on every occurrence and
`[0]` takes everything before the first.

For any cache path with `.py` in an ancestor directory — `.pyenv`,
`.pycache`, pypy paths — the prefix gets truncated to before that directory.
On a pyenv user's machine
`/home/u/.pyenv/.../accuracy/<hash>/accuracy.py` becomes `/home/u/.json`,
so the metadata is written outside the cache tree (and gets clobbered by every
subsequent `evaluate.load()`).

Use `os.path.splitext(...)[0]` to strip only the trailing extension, matching
the comment's intent. Adds a regression test using a tmp dir whose ancestor
contains `.pyenv` and asserts the metadata lands next to the copied script
and not at `<root>/.json`.