use the IOUC (UNTR extension) to speed up directory walks (EXPLORATION)#2503
use the IOUC (UNTR extension) to speed up directory walks (EXPLORATION)#2503Aaron Moat (AaronMoat) wants to merge 5 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b50a0ee231
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "Codex (@codex) review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "Codex (@codex) address that feedback".
|
Codex (@codex) review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8498834406
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "Codex (@codex) review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "Codex (@codex) address that feedback".
| let _dir_flags_hint = match (opts.emit_untracked, opts.emit_empty_directories) { | ||
| (crate::walk::EmissionMode::CollapseDirectory, false) => { | ||
| DIR_SHOW_OTHER_DIRECTORIES | DIR_HIDE_EMPTY_DIRECTORIES | ||
| } | ||
| (crate::walk::EmissionMode::Matching, _) => 0, |
There was a problem hiding this comment.
Reject UNTR caches when dir_flags do not match walk mode
cache_is_applicable() computes the required UNTR flag combination for the current walk, but the value is discarded and never checked against cache.dir_flags(). This can reuse cache entries produced under a different untracked-files mode (e.g. normal vs all), which Git explicitly treats as incompatible because the cached directory shape differs; using mismatched flags can return incomplete or differently-collapsed untracked results.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
This was intentionally left as a no-op. The cache walk here works differently from the git/C implementation. Rather than serving cached directory entries directly (where the shape of those entries would differ between normal (EmissionMode::CollapseDirectory) and all (EmissionMode::Matching) mode), it always recurses into each subdirectory via is_dir_valid() and re-emits entries according to the current walk options. The cache is purely a "skip readdir" optimization for directories whose stat still matches; the actual output shape is reconstructed fresh each time. See matching_mode_with_tracked_intermediate_dirs_matches_uncached.
It would be a larger change to exactly match git's pattern and I think it's unnecessary.
Happy of course to lean on maintainer advice here!
| else { | ||
| return false; | ||
| }; | ||
| expected.stat().matches(&actual, Default::default()) |
There was a problem hiding this comment.
Validate global exclude files by hash, not stat alone
Validation of $GIT_DIR/info/exclude and core.excludesFile relies on Stat::matches() with default options, which do not compare nanoseconds and never compare file content hash. Same-size edits within the same second can therefore pass validation and keep stale ignore decisions alive in the UNTR fast path. Git’s validation compares exclude OIDs for this reason, so this should verify OidStat::id() against the current file hash (or equivalent content check), not just coarse stat data.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
This seems minor, and it broke CI. I fixed the info/exclude but the global excludes file leaving unchanged for now.
aee92b6 to
f7937e0
Compare
## Summary
Closes 9 compat-mode rows in tests/journey/parity/branch.sh, lands three
reusable foundations consumed by future sprints (commit, push, tag, merge),
and folds in 6 upstream syncs that arrived during the sprint.
## Closed parity rows (all bytes-mode, steward-verified, on-disk state asserted)
- `-v` / `--verbose` — column-aligned name + SHA + subject
- `-vv` — adds `[upstream: ahead N, behind N]` tracking annotation
- `--abbrev=N` / `--no-abbrev` — SHA width override
- `--column=always` — column-major packing
- `-u` / `--set-upstream-to` — writes branch.<n>.{remote,merge}
- `--unset-upstream` — clears them; emits `fatal: branch '<n>' has no upstream information` + exit 128 when absent
- `--track` (create-side) — writes upstream config after ref creation
- `--edit-description` — spawns editor, writes branch.<n>.description
Plus three on-disk-state guard rows (chained gix → gix invocations) so
bytes-mode parity tests can no longer mask missing-side-effect regressions
in branch-config writers.
## Foundations introduced
- gix::Repository::{set_branch_upstream, unset_branch_upstream, set_branch_description}
with $GIT_DIR/config persistence via gix_config::File::write_to_filter
on Source::Local sections (matches gix/src/clone/fetch/util.rs::setup_branch_config).
- Branch::DESCRIPTION key constant + SnapshotMut::clear_subsection_value helper.
- gitoxide-core::shared::columns::write_columns — column-major packer (reusable
for status --column).
- gitoxide-core::shared::editor::edit_file — $GIT_EDITOR/core.editor/$VISUAL/$EDITOR/vi
resolver + shell-spawn (reusable for commit -e, tag --edit, merge --edit).
## Upstream syncs folded in
- merge: pull upstream gix-main into gix-brit (security hardening + fuzz expansion)
- merge: pull GitoxideLabs#2508 — allow checkouts into non-empty directories
- merge: pull GitoxideLabs#2375 — filters and partial cloning: initial support
- merge: pull GitoxideLabs#2503 — IOUC (UNTR extension) dirwalk speedup
- merge: pull GitoxideLabs#2457 — gix-blame: Incremental
- merge: pull GitoxideLabs#2465 — barebones server-side upload-pack
## Adjustments
- steward agent: surface idiomatic Rust as an explicit design priority,
reframing vendor/git as the *what* (flag surface, exit codes, output bytes)
while gitoxide idioms anchor *how*.
- Post-merge cleanup: dropped duplicate clone --filter stub (superseded by GitoxideLabs#2375)
and pr-2457's orphaned log helpers (our log_all already handles paths).
Plan: etc/plan/foundations-sprint-branch-render-and-config.md
Pre-squash backup tag: pre-squash-foundations
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sebastian Thiel (Byron)
left a comment
There was a problem hiding this comment.
It will be hard to review this one, as I would always have to weigh in prompting this myself vs reviewing something created with prompts that may have caused it to go off the rails.
In any case, please turn this into a PR that can be rebased, so it has to lose the merge commits.
There was a problem hiding this comment.
Pull request overview
This PR adds an accelerated directory-walk path in gix-dir by consulting the index’s UNTR (IOUC) extension to serve unchanged directory contents from cache (reducing read_dir syscalls), and fixes a long-standing index stat decoding issue where ctime/mtime were swapped.
Changes:
- Implement UNTR-backed fast path in
gix-dir::walk, including cache validation against directory stats and ignore-file identity. - Fix
gix-indexstat decoding to map on-disk ctime/mtime into the correct fields. - Wire up
core.untrackedCache(includingkeep) intogixconfig parsing and expose an API override to force/disable cache usage; add coverage tests.
Reviewed changes
Copilot reviewed 23 out of 24 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/plumbing/progress.rs | Marks core.untrackedCache as actively used/config-relevant. |
| gix/tests/gix/repository/mod.rs | Adds integration tests for core.untrackedCache parsing (keep) and override behavior. |
| gix/src/status/mod.rs | Adjusts status error type to reflect new dirwalk-options error wiring. |
| gix/src/repository/dirwalk.rs | Reads core.untrackedCache into default dirwalk options and refines traversal-root behavior for pathspecs. |
| gix/src/dirwalk/options.rs | Adds untracked_cache option plumbing from gix to gix-dir. |
| gix/src/dirwalk/mod.rs | Introduces UntrackedCache enum and stores it in gix::dirwalk::Options. |
| gix/src/config/tree/sections/core.rs | Adds core.untrackedCache key definition and validation/parsing helper (incl. keep). |
| gix/src/config/mod.rs | Adds config::untracked_cache error module/type alias. |
| gix/Cargo.toml | Ensures gix-dir attributes support is enabled when gix’s dirwalk feature is used. |
| gix-worktree/src/stack/state/ignore.rs | Exposes read-only accessors needed to validate cache applicability vs ignore-stack state. |
| gix-status/tests/status/index_as_worktree_with_renames.rs | Updates expectations to include new Outcome fields via ..Default::default(). |
| gix-index/src/extension/untracked_cache.rs | Refactors UNTR decode to match git’s on-disk layout; adds accessors for decoded fields. |
| gix-index/src/extension/mod.rs | Adds public getters for UNTR metadata (identifier, flags, directories, exclude metadata). |
| gix-index/src/decode/mod.rs | Fixes swapped ctime/mtime decoding in index stat parsing. |
| gix-dir/tests/walk_utils/mod.rs | Enables cache usage in test options where applicable. |
| gix-dir/tests/dir/walk.rs | Adds end-to-end tests comparing cached vs uncached walks and git output; adapts outcomes for new counters. |
| gix-dir/tests/dir_cwd.rs | Updates expected outcomes to include new Outcome fields via ..Default::default(). |
| gix-dir/src/walk/untracked_cache.rs | New module: validates UNTR applicability and per-directory validity; hashes ignore files when needed. |
| gix-dir/src/walk/readdir.rs | Integrates cache-hit/miss accounting and cached traversal emission path into recursion. |
| gix-dir/src/walk/mod.rs | Adds use_untracked_cache option and cache hit/miss counters to walk outcome. |
| gix-dir/src/walk/function.rs | Hooks UNTR validation into walk() entrypoint and passes cache context into recursion. |
| gix-dir/Cargo.toml | Adds unix deps for system identification (uname) and defines an attributes feature. |
| crate-status.md | Marks accelerated UNTR dirwalk support as implemented. |
| Cargo.lock | Locks new dependency graph resulting from added deps/features. |
| pub fn dirwalk_options(&self) -> Result<dirwalk::Options, config::untracked_cache::Error> { | ||
| let use_untracked_cache = config::tree::Core::UNTRACKED_CACHE | ||
| .try_into_untracked_cache(self.config.resolved.boolean(config::tree::Core::UNTRACKED_CACHE))?; | ||
| let fs_caps = self.filesystem_options().map_err(|e| config::key::Error { |
| pub enum Error { | ||
| #[error(transparent)] | ||
| DirwalkOptions(#[from] config::boolean::Error), | ||
| #[error(transparent)] | ||
| ConfigureUntrackedFiles(#[from] config::key::GenericErrorWithValue), | ||
| DirwalkOptions(#[from] config::untracked_cache::Error), | ||
| } |
| let current_path = current.clone(); | ||
|
|
||
| let info = classify::path( | ||
| current, | ||
| current_bstr, | ||
| if prev_len == 0 { 0 } else { prev_len + 1 }, | ||
| None, | ||
| || { | ||
| std::fs::symlink_metadata(¤t_path) | ||
| .ok() | ||
| .map(|ft| ft.file_type().into()) | ||
| }, |
| match (cache.excludes_file(), excludes_file) { | ||
| (Some(expected), Some(path)) if validate_cached_stat(expected, path) => {} |
| let _dir_flags_hint = match (opts.emit_untracked, opts.emit_empty_directories) { | ||
| (crate::walk::EmissionMode::CollapseDirectory, false) => { | ||
| DIR_SHOW_OTHER_DIRECTORIES | DIR_HIDE_EMPTY_DIRECTORIES | ||
| } | ||
| (crate::walk::EmissionMode::Matching, _) => 0, | ||
| (crate::walk::EmissionMode::CollapseDirectory, true) => return Some(false), | ||
| }; |
| assert_eq!( | ||
| out.untracked_cache_hits, | ||
| out.untracked_cache_hits, // at least one hit | ||
| "cache hits should be non-zero" | ||
| ); |
| let tmp = gix_testtools::tempfile::tempdir()?; | ||
| let base = tmp.path().to_path_buf(); | ||
| std::mem::forget(tmp); | ||
| let root = base.join("repo"); |
| fn repo_with_untracked_cache() -> crate::Result<gix::Repository> { | ||
| let tmp = tempfile::tempdir()?; | ||
| let root = tmp.path().join("repo"); | ||
| std::mem::forget(tmp); | ||
| std::fs::create_dir(&root)?; |
9d7b165 to
d822b48
Compare
|
Sebastian Thiel (@Byron) I can start having a look at Copilot's feedback, but if you're unlikely to want to have a look through it's probably a low value activity. More interested if I try and chunk this into stacked PRs? It started big because I didn't really know what changes I needed until I made them, but now that I have something working it shouldn't be too crazy to split up. |
|
Multiple commits would be enough, no stacked PRs please 😅. |
… on-disk layout The `stat()` decoder had ctime and mtime fields swapped. The UNTR extension decoder was also reading fields in the wrong order: the on-disk format stores both stats first, then dir_flags, then the two object hashes, but the old code interleaved them via `decode_oid_stat`. Add read-only accessors to `UntrackedCache`, `Directory`, and `OidStat` so consumers can inspect cached data without accessing public fields directly. Co-authored-by: GPT 5.4 <codex@openai.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Add getters for `overrides`, `globals`, `exclude_file_name_for_directories`, and `source` so callers can inspect the ignore configuration without accessing fields directly. Co-authored-by: GPT 5.4 <codex@openai.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…s to `Outcome` Introduce `Options::use_untracked_cache` (defaults to `true`) to control whether the UNTR index extension is consulted during directory walks. Add `Outcome::untracked_cache_hits` and `untracked_cache_misses` counters so callers can observe how effective the cache is. Replace `#[derive(Default)]` on `Options` with an explicit `impl Default` so the new field's default is visible and auditable. Update all `Outcome` struct literals in tests to use `..Default::default()` for forward compatibility. Co-authored-by: GPT 5.4 <codex@openai.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
When `Options::use_untracked_cache` is set and the index contains a valid UNTR extension, validate its applicability (identifier, global exclude files, per-dir ignore filename) and serve cached untracked entries for directories whose stat hasn't changed, skipping `read_dir` entirely. Cache hits/misses are counted in `Outcome`. On a miss the walk falls back to a real `read_dir` call transparently. Co-authored-by: GPT 5.4 <codex@openai.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…ory walks Wire `core.untrackedCache` into `gix` config parsing and thread it through `DirwalkOptions` down to `gix_dir::walk::Options::use_untracked_cache`. Status operations that invoke a dirwalk now honour the config value so the untracked cache can be enabled, disabled, or left to `keep` semantics as git does. Co-authored-by: GPT 5.4 <codex@openai.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
d822b48 to
8d48d14
Compare
|
I just wanted to make you aware of this comment, if it makes sense to play around with it.
Can we land this separately? Ideally with a test if that's possible. I remember that some strange swapping was going on when decoding, but I always thought it was necessary. Maybe I got it wrong and no test caught it. Let's take a separate look. Can you also tell me more about your workflow? How much time did you spend on this? |
|
I am probably going to go with the recommendation here, which is to redo the implementation myself. Nonetheless, this PR can serve as a spike and a first evaluation of possible performance gains. Those are probably most relevant on Windows. Benchmark
Benchmark resultsDirect Setup:
Counter checks from the same setup:
Conclusion: the PR does eliminate directory reads when the UNTR cache validates, but on this hot-cache local filesystem workload the wall-clock improvement is only about 0-2%. Redo or Review?
Verdict I would not review this as a near-ready implementation. I’d treat it as a spike and recreate the UNTR fast path using Git’s It is partly Git-shaped, but not compatibility-grade. Likely Git Reference? Yes, Git was likely used as a reference, but unevenly. Evidence:
But the implementation does not follow Git closely enough in the critical validation path. Compatibility Problems The biggest gaps against
Recommendation Recreate it around Git’s model before deep review:
Given the correctness gaps and the benchmark showing only small wall-clock wins, I would not spend review time polishing this implementation. Rebuild it against Git’s behavior and tests. |
|
Hey Sebastian Thiel (@Byron) - thanks for poking around. I'm happy to land the ctime/mtime stuff separately. I can work on that soon! My workflow was "works on my machine"-driven development if I'm being totally honest. I'm confident that what I've got here works for my setup, but I am not confident to spec alignment, no bugs, etc. Git source was mainly used after the initial prompted generation, because there was a lot of bugs (e.g. running gix status showed it not reading the untracked cache at all, or incorrectly, etc). Certainly not offended by treating this as a spike :) As for perf
So it's certainly not windows-specific, but it is skewed to being beneficial for repos with huge file counts. (Of course, it only helps when git is used and maintaining the cache, as the PR only reads it) |
|
Thanks so much for sharing your benchmark results. It's super helpful to see the great effect the cache can have with the right input. Let's keep this PR open until it's officially superseded, but put it into DRAFT to spare the reviews in case you keep refreshing it for personal use. |
When
core.untrackedCacheis true,gix-dirnow consults the indexUNTRextension before opening each directory. Directories whose stat and exclude-file OID still match the cache are served from memory, avoidingread_dirsyscalls for unchanged trees.As part of verifying the change, I discovered ctime/mtime decoding was incorrect. The stat decoder assigned
ctime_secs/nsecsto themtimefield and vice versa. The binary format stores ctime first then mtime; correcting the swap fixes IOUC stat comparisons for directories where ctime ≠ mtime.Per CONTRIBUTING.md, disclosing (heavy) AI use in these changes (and have done so in the commit trailers). I went through a number of iterations and have manually tested that things "seem to work on my machine" (installing gix locally and using gix status / playing around with --statistics and adding untracked files and making sure they reflect with minimal directory crawling), but acknowledging I have large gaps in my understanding of this codebase and also git's.
Thanks in advance if you spend the time to review this. I'm very happy to take feedback and fix things up.