Skip to content

Maintenance inception#2146

Open
newren wants to merge 7 commits into
gitgitgadget:masterfrom
newren:maintenance-inception
Open

Maintenance inception#2146
newren wants to merge 7 commits into
gitgitgadget:masterfrom
newren:maintenance-inception

Conversation

@newren

@newren newren commented Jun 11, 2026

Copy link
Copy Markdown

During the time between when a repack starts and when it completes,
large repositories can gain thousands of packfiles and tens of thousands
of loose objects that are not part of that repack. Not only does this
mean that the repository immediately needs maintenance again, but git
operations slow down dramatically during the repack -- sometimes to the
point that the service is effectively considered offline.

This series proposes a simple (optional, off by default) solution:

  • Doing maintenance during maintenance.

or, more specifically:

  • Doing (cheap mere aggregation) maintenance (of new objects and packs)
    during (far more thorough) maintenance (of older objects and packs).

=== Concerns/Objection ===

The description above probably raises several serious concerns; let me
attempt to address them (feedback on whether I've missed any, or
insufficiently addressed some very welcome):

Objection 1: What does "cheap mere aggregation" mean?

It means omitting all of the following:

  • delta compression or even delta searching
  • reachability checks, and anything related to them
    • commit-graph updates
    • bitmap generation
    • cruft pack generation or updates
  • multi-pack index creation or updates

Objection 2: Aggregating packs without delta searches causes bad deltas

Yes, but this series adds a .baddeltas marker file (similar to .keep,
.mtimes, .bitmap, .promisor that already exist) to notify future
pack-objects invocations that two objects within that pack should be
checked for deltas, so the effect does not persist.

(On the flip side, see Trade-off 1)

Objection 3: Repack or its subprocess can fail if packfiles disappear

repack calls pack-objects, which enumerates the existing packs, and
only repacks those (or a subset of those if doing a geometric repack),
and then writes a new pack. The remainder of repack (e.g. multi-pack
index creation, bitmap generation) -- at least up until the
odb_reprepare() call immediately before deleting redundant packs --
only works on this subset of packs. Thus, if we have pack-objects
tell us the packs it is working on and we avoid touching any packs:

  • in an existing multi-pack-index
  • in the list of packs that repack's pack-objects is working on
  • that is written by repack's pack-objects process
  • that are covered by other sidecars (mainly '.keep' and '.mtimes', but
    we throw in '.promisor' and '.bitmap' for good measure)

Then we should be safe; this does require modifying pack-objects to:

  • let us know exactly what packs it is working on
  • create its new pack with a (temporary, see below) .keep file

(We also have pack-objects notify us which loose objects exist at that
time, even though that is not needed for correctness, but does allow
us to generally avoid putting a loose object into two new packs.)

Objection 4: The .keep created by pack-objects might persist

It is removed at the end of the repack process.

Further, if for some reason the repack process is killed or dies, the
.keep file contains a marker making it clear the keep was meant to be
temporary and has the corresponding PID of the repack process, meaning
that if that PID no longer exists, the .keep file can be cleaned up.
See also the next point.

Objection 5: Will this "cheap" maintenance actually be cheap?

There's an unnecessary reachability walk today which makes this kind
of "cheap" maintenance somewhat expensive. Patch 3 removes it and
makes a typical case 500x faster and take 50x less memory.

That's fast enough that I not only run "cheap" maintenance during
"real" maintenance, I also run a "cheap" maintenance step before "real"
maintenance. The reasons for this inserted preparatory step are:

  • it allows other git operations to be faster while waiting for
    the "real" maintenance to complete
  • it reduces the number of packfiles and loose objects that need
    to be tracked for exclusion
  • the preparatory step can look for leftover ".keep" files from a
    previous repack process that died early and remove them

=== Important tradeoffs ===

Trade-off 1: need to repeat some delta checks

Two objects being in the same pack that aren't a delta against each
other usually means a previous process has checked and determined
those two objects are not a good delta. Any such objects placed into
a .baddeltas pack will lose that information and we'll have to
re-check for deltas.

Trade-off 2: mtime delay for expiring objects

When we repack loose objects into a (non-cruft) pack, that effectively
updates their mtime when they could have been ready to expire. If
these objects are unreachable, then the "real" maintenance's
reachability check will determine that and either unpack these objects
or move them to a cruft pack. Either way, they'll be stamped with the
pack's mtime they were removed from. For large repositories where we
essentially continuously run maintenance all day long, this can only
add a delay equal to how long a maintenance run takes and is thus
considered inconsequential.

=== Series overview ===

Patch 1: Add support for a ".baddeltas" marker for packs, in try_delta()
Patch 2: Support --mark-bad-deltas option in pack-objects
Patch 3: Accelerate pack-objects when delta search is off
Patch 4: Add --emit-input-{packs,loose} plumbing options to pack-objects

This patch is critical, because we need to know what packs the
regular repacking will be working with so we avoid them.

Trying to enumerate the packs outside of repack/pack-objects risks a
race with newly pushed packs, and in large repositories which push
multiple times per second, losing that race is quite likely.

Patch 5: New pack-aggregate builtin for cheap object/pack aggregation
Patch 6: Allow repack to spawn pack-aggregate while doing regular repacking

@newren newren force-pushed the maintenance-inception branch from 2e984e4 to d53bd83 Compare June 11, 2026 06:20
newren added 5 commits June 11, 2026 00:45
When considering an object pair for delta compression, try_delta()
short-circuits (skipping a real delta search) when both objects came
from the same on-disk pack and neither is stored there as a delta.
The reasoning is that an earlier pack-objects run already had the
chance to try this pair and chose not to delta either against the
other, so trying again is unlikely to find anything new.

That heuristic is correct only if the source pack was actually built by
something that did a real delta search.  That may not be the case if the
pack was written by `git fast-import` (which only does a basic delta
comparison for objects in the pack it writes), the bulk_checkin
framework when streaming large blobs straight into a pack, or
pack-objects with --window=0.

Add a `.baddeltas` sidecar marker, parallel to `.keep`, `.promisor`,
and `.mtimes`, that lets a producer flag its output pack as one whose
delta layout should not be trusted by this skip.

For context on the underlying delta skip and earlier discussions of a
similar marker, see Jeff King's analysis on the list:

  https://lore.kernel.org/git/20231009202149.GA3281325@coredump.intra.peff.net/

Assisted-by: Claude Opus 4.7
Signed-off-by: Elijah Newren <newren@gmail.com>
The previous patch taught pack-objects to honor a `.baddeltas` sidecar
marker that tells the same-pack delta skip in try_delta() to fall
through to a real delta search.  Add a `--mark-bad-deltas` option that
writes such a marker alongside each output pack.

The flag is incompatible with `--stdout`, since the marker lives next
to the output pack in `$GIT_DIR/objects/pack` and `--stdout` has no
on-disk output to mark.  When pack-objects splits its output across
multiple packs (with `--max-pack-size`), a marker is written for each
resulting pack inside the existing per-pack output loop, between
stage_tmp_packfiles() and rename_tmp_packfile_idx() so the marker is
in place before the `.idx` anchor becomes visible.

Document the option, and extend t5335 to cover both the producer side
(marker is written and disables the same-pack skip for the resulting
pack) and the `--stdout` incompatibility.

Assisted-by: Claude Opus 4.7
Signed-off-by: Elijah Newren <newren@gmail.com>
pack-objects performs a somewhat expensive revision walk (which really
ought to be part of the progress output, but that's an issue for another
time) to:

  * fill in the namehash of objects, to improve the delta selection
    process;

  * with --stdin-packs=follow, pull in objects that live in unknown
    packs to close the set under reachability.

However, if we are not using --stdin-packs=follow and delta searches are
disabled (--window=0 or --depth=0), then this walk is unnecessary.

In a repository with about 35 million objects, running

  git pack-objects --stdin-packs --window=0 --delta-base-offset \
                   --no-write-bitmap-index <out>

over a list of 316 packs took 176s and 8.2 GiB of RSS, almost all
of which was spent in the revision walk.  Skipping the walk drops
the same invocation to 0.32s and 154 MiB of RSS, while producing an
identically sized pack with the same objects.

Assisted-by: Claude Opus 4.7
Signed-off-by: Elijah Newren <newren@gmail.com>
If a process desires to work on packs other than what pack-objects is
working with, there is currently no way to do so.  Callers cannot simply
enumerate the packs before calling pack-objects, because that leaves a
race between their enumeration and pack-objects' enumeration; large
repositories push often enough to make losing that race inevitable.

Address this issue by adding an --emit-input-packs option to
pack-objects to allow it to provide a way to inform the caller which
packs it will be working with.  This will allow us to then operate on
packs other than those ones.

Similarly, add an --emit-input-loose option to allow pack-objects to
provide a way to inform the caller which objects it will be working
with.  This one could be avoided by the external caller just enumerating
objects on its own, because even though doing so has the exact same kind
of race involved, losing the race means the caller adds the loose object
to another pack and deletes it, which doesn't prevent pack-objects from
successfully also packing that same loose object itself.  The only
unfortunate side-effect is that the object will appear in multiple
packs, giving us some duplication.  However, we can cut down on how many
objects get temporarily duplicated using this --emit-input-loose option.

Both options write to <file>.tmp first and then rename, so the file
appears atomically and a reader either sees the previous contents (if
any) or the complete new contents, never a partial write.

Since these options exist only to support the handshake between repack
and the upcoming pack-aggregate builtin, and are not meant for end
users, they are currently undocumented.

Assisted-by: Claude Opus 4.7
Signed-off-by: Elijah Newren <newren@gmail.com>
On busy servers it is possible for the object directory to accumulate
many loose objects and many small push packs while a long-running
maintenance pass is still working.  Until that pass finally finishes,
unrelated git operations end up scanning more files in
`objects/` and `objects/pack/` than is healthy, which shows up as
visible latency on otherwise quick commands.

Introduce a new plumbing builtin, `git pack-aggregate`, intended to
be spawned by `git repack` and run in a loop alongside the normal
repacking process.  Each cycle does two things in order:

  1. Bundle local loose objects (minus any in `--exclude-loose-file`)
     into a new pack with `pack-objects --window=0 --mark-bad-deltas
     --delta-base-offset --no-write-bitmap-index`, and unlink the
     loose copies on success.

  2. Aggregate small local packs (minus any in `--exclude-pack-file`,
     minus packs in the multi-pack-index, minus packs carrying
     `.keep`, `.promisor`, `.mtimes`, or `.bitmap` sidecars) into a
     single new pack the same way, then unlink the source packs.

Both steps copy the existing on-disk representation of each object
through unchanged: no delta search, no bitmap or commit-graph update,
no object recompression.  The output pack lands with a `.baddeltas`
marker so a subsequent thorough repack knows it should reconsider
intra-pack deltas in that pack.  The roll-up pack produced in step 1
is naturally a candidate for step 2 of the same cycle and is normally
folded in.

`git pack-aggregate` takes no locks of its own, much like `git repack`.
Callers that need serialization (typically `git gc`, or a server-side
maintenance driver) must arrange it themselves.

The integration with `git repack` (gated on a new `repack.aggregate`
config and `--aggregate` CLI flag) is left for a follow-on patch.

Assisted-by: Claude Opus 4.7
Signed-off-by: Elijah Newren <newren@gmail.com>
@newren newren force-pushed the maintenance-inception branch from d53bd83 to 9a7521d Compare June 11, 2026 13:54
newren added 2 commits June 11, 2026 07:15
On a busy server hosting a large repository, `git repack` routinely
takes long enough for thousands of packs and tens of thousands of loose
objects to accumulate before it finishes.  Every subsequent git
operation pays the price: each pack lookup walks the longer list, and
each loose-object miss has to consult more sources.

Wire `git repack` to optionally drive `git pack-aggregate` as a
background side process, controlled by `repack.aggregate` config
or the equivalent `--aggregate` / `--no-aggregate` command-line
options.  Default off.

When enabled:

  * Create a small tempdir under `objects/` for two exclude files.
  * Pass `--emit-input-packs=<tmp>/packs --emit-input-loose=<tmp>/loose`
    to the main pack-objects so it records exactly which inputs it
    has committed to.
  * Launch `git pack-aggregate --loop --exclude-pack-file=<tmp>/packs
    --exclude-loose-file=<tmp>/loose` once the emit files exist.
  * Drop a `.keep` marker on every pack written by the main repacking
    work, so the aggregator does not roll it up.

pack-aggregate keeps running through the cruft pack-objects subprocess
and through `write_midx_included_packs()`; it is SIGTERM'd only when it
comes time to delete redundant packs and objects.

Finally, unlink the recorded `.keep` markers after pack-aggregate has
been shut down.

Add a few environment variables to simplify testing:
  * `GIT_TEST_PACK_AGGREGATE_INTERVAL`
  * `GIT_TEST_PACK_AGGREGATE_MIN_PACKS`
  * `GIT_TEST_PACK_AGGREGATE_MIN_LOOSE`

`git repack` itself takes no locks, and neither does the aggregator
we spawn; serialization across maintenance runs is the caller's
responsibility (typically `git gc` or a server-side maintenance
driver).  This matches the prior behavior of `git repack`.

Assisted-by: Claude Opus 4.7
Signed-off-by: Elijah Newren <newren@gmail.com>
A pack-aggregate run rolls many small packs up into one aggregator
output and writes a sibling `.baddeltas` marker, since the new pack
reuses raw delta-chains from its inputs without any cross-input delta
search.  That marker tells future `pack-objects` runs over the pack to
treat its existing deltas as stale and recompute new ones.

The aggregator deliberately does not install a `.keep` marker on its
output: we want the next opportunity to re-delta its contents to roll up
the pack like any other.  But a `git repack --geometric=N` pass walks
the packs sorted ascending by object count and only rolls up packs below
the geometric split; anything already at the heavy end of the
progression sits above the split and is kept as-is.  Aggregated packs
could potentially be large enough to land there, which may leave a
`.baddeltas` pack above the split for many subsequent geometric passes
resulting in the pack not getting repacked and keeping its stale deltas.

Teach `pack_geometry_split()` to do a post-pass demotion of any
`.baddeltas` packs from the kept region into the rollup region.  The
demotion preserves the ascending sort order of the kept region so that
`pack_geometry_preferred_pack()` continues to return the largest local
kept pack.  Demotion is unconditional: a `.baddeltas` marker is an
explicit "these deltas need redoing" tag from a previous repack step,
and the next geometric pass is the cheapest place to honor it.

`.keep`-marked packs are still excluded from the rollup by the existing
`pack_kept_objects` / `remove_redundant_packs` paths, so a user who
really wants to pin a `.baddeltas` pack in place has that escape hatch.

Add two tests to t5336: one that confirms a `.baddeltas` pack above the
natural split is rolled up and its marker disappears from the resulting
pack, and a control test that confirms a non-`.baddeltas` pack above the
split is still preserved.

Assisted-by: Claude Opus 4.7
Signed-off-by: Elijah Newren <newren@gmail.com>
@newren newren force-pushed the maintenance-inception branch from 9a7521d to 24bdb43 Compare June 11, 2026 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant