
chore: release v0.6.0 #665

Merged
xdotli merged 339 commits into main from release/v0.6.0
Jun 13, 2026

Conversation

@xdotli

@xdotli xdotli commented Jun 10, 2026

Member

Release v0.6.0

The task.md standard release. See CHANGELOG.md for the full list.

Headline

  • task.md task standard + authoring CLI (bench tasks init/check/migrate/export) with a leaderboard-grade acceptance gate
  • bench agent adoption router (create/run/verify) — scaffolds, drives conversion, and runs the parity gate
  • ATIF + ADP trajectory artifacts emitted from every scored rollout
  • OpenReward (ORS) reward-format interop — reward export + ors-episode verifier recognition. The hosted-environment episode runner is in progress (draft feat: OpenReward hosted-environment inbound runner #660) and is NOT in this release.
  • Daytona sandbox auto-reap — ownership-scoped: only benchflow-labeled sandboxes are reaped; foreign sandboxes on a shared key are never touched.
  • Quickstart/CLI docs reconciled with observed behavior; 15 dogfood findings closed

Evidence

  • Full suite: 3,406 passed, 0 failed
  • Layer-2 parity (49 tasks × 3 trials, deepseek-v4-flash): mean delta −2.4% ± 3.7% SE, 95% CI includes zero, 0 conversion defects across all 26 triaged divergent cells (independently recomputed from raw rollouts in audit)
  • Eight-lens adversarial ultraaudit run pre-tag: 2 P0 + several P1 found and fixed — destructive-reaper scoping, the false OpenReward claim corrected, the user-loop regression, citation + test guards. Security + privacy clean.

Checklist

  • CI gate passes (ruff format/check, full pytest, uv.lock)
  • e2e live smoke (Docker) — run before merge
  • CHANGELOG updated · CITATION.cff bumped to 0.6.0
  • Version bumped to 0.6.0
  • Ultraaudit blockers resolved
  • Merge and tag after review (do not merge automatically)

Drafts to land separately: #660 (OpenReward episode runner — account credits), #661 (AgentBeats — @Yiminnn review).


Note

Medium Risk
Removes the status dashboard and large internal planning docs while adding a secrets-dependent integration workflow; release notes claim broad product/CLI changes reviewers should cross-check against the full branch beyond this doc-heavy diff.

Overview
Release packaging for 0.6.0: CHANGELOG.md is finalized with the 0.6 feature set (native task.md, bench eval adopt, ATIF/ADP outputs, ORS interop, Daytona auto-reap, CLI/sandbox fixes, and removals such as the legacy CLI tree, experiments/, and unwired OTel). CITATION.cff bumps to 0.6.0 with an updated release date.

Public docs and positioning: README.md is rewritten around the “universal environment framework” story (RC wheel install, quickstart via bench eval create, task.md authoring, hosted --source-env, loop strategies, and clearer separation of bench agent vs bench eval adopt). AGENTS.md updates Bedrock Opus 4.8 guidance to the LiteLLM proxy patch path. benchmarks/CONVERT.md documents bench eval adopt verify parity inputs and switches adoption runs to Evaluation instead of Job.

Repo cleanup — The entire dashboard/ tree and the Claude dashboard launch config are removed; internal 0.3 plan doc (.claude/dev-docs/0.3-plan.md) is deleted. Labs index (.claude/dev-docs/labs.md) now points at archived docs/labs/ instead of inlining runbooks.

Contributor/agent tooling — Claude skills (benchflow, docs-review, branch-review, launch-prep) are aligned with 0.6 CLI (bench eval metrics/view, bench tasks init, jobs/ layout, RC install). launch-prep tightens the live smoke gate so an all-skipped pytest run cannot false-green e2e.

CI — New .github/workflows/integration-eval.yml: nightly/manual one-task Docker rollout, agent-judge gate, integration pytest slice (excluding duplicate agent rollout), artifact upload; fork-guarded and secrets-backed on the canonical repo.

Reviewed by Cursor Bugbot for commit cb2dca9.


@mintlify

mintlify Bot commented Jun 10, 2026


Preview deployment for your docs. Learn more about Mintlify Previews.

Project: benchflow-bff148e7 · Status: 🔴 Failed · Updated (UTC): Jun 10, 2026, 11:06 PM


xdotli added 2 commits June 11, 2026 02:44
…ide state

The pre-agent workspace snapshot /testbed_verify is seeded world-readable
(chmod -R o+rX) so the root verifier can diff against it, but it was missing
from _DEFAULT_LOCKED. That let the non-root agent read grading-side state in
the snapshot — verifier/judge config, rubrics, expected outputs, and judge
credentials — enabling reward forgery.

Add /testbed_verify to _DEFAULT_LOCKED. Seeding runs before lockdown_paths in
install_agent, so lockdown's chown root:root + chmod 700 closes the read window
before the agent starts. The verifier runs as root and still reads the snapshot
regardless of mode, so oracle/verifier execution is unaffected.

Tests: pin /testbed_verify into the default and resolved locked sets
(mutation-killers), assert the emitted lockdown command chowns root and
chmod 700s it, and add a hermetic sandbox-level test proving the seed->lockdown
sequence leaves no group/other read or traverse bits on the snapshot dir.
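
The seed-then-lockdown ordering described in this commit can be illustrated with a small local sketch. It assumes a throwaway temporary directory standing in for /testbed_verify and omits the chown root:root step (which needs root); the helper names are hypothetical and are not benchflow's own functions.

```python
import os
import stat
import tempfile


def seed_world_readable(path: str) -> None:
    """Stand-in for the seeding step: grant 'other' read/traverse on the tree."""
    for root, _dirs, files in os.walk(path):
        os.chmod(root, os.stat(root).st_mode | stat.S_IROTH | stat.S_IXOTH)
        for name in files:
            full = os.path.join(root, name)
            os.chmod(full, os.stat(full).st_mode | stat.S_IROTH)


def lock_down(path: str) -> None:
    """Stand-in for the lockdown step: chmod 700 on the snapshot directory."""
    os.chmod(path, 0o700)


def group_other_bits(path: str) -> int:
    """Return the group/other permission bits still set on path."""
    return stat.S_IMODE(os.stat(path).st_mode) & 0o077


# Temporary directory used in place of the real snapshot location.
snapshot = tempfile.mkdtemp(prefix="testbed_verify_")
with open(os.path.join(snapshot, "rubric.txt"), "w") as fh:
    fh.write("expected output\n")

seed_world_readable(snapshot)
seeded_bits = group_other_bits(snapshot)  # 'other' can read and traverse here
lock_down(snapshot)
locked_bits = group_other_bits(snapshot)  # no group/other bits remain
```

Because the directory itself loses its traverse bit, a non-owner cannot reach the files inside it even though their own mode bits are unchanged.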
Comment thread docs/agent-quickstart.md
xdotli and others added 20 commits June 11, 2026 03:09
…fold edit list

running-any-benchmark.md: Layer 2 conflated two distinct translation flows.
Split them: (a) migrate a task you control -> validate with bench tasks check
(no parity gate); (b) adopt a foreign benchmark -> bench agent create then
prove with bench agent verify. bench agent verify only scores an adopted
benchmarks/<name>/ and errors 'benchmark not adopted ... run bench agent
create first' on a migrated task.md.

agent-quickstart.md:
- STEP 7: add the three placeholder files the task.md scaffold also writes
  (verifier/verifier.md, verifier/rubrics/verifier.md,
  verifier/rubrics/verifier.toml) to the replace-the-placeholders list so a
  literal follower passes 'bench tasks check' first try.
- STEP 1: add a release-candidate install note. 0.6.0 is not yet on PyPI, so
  point testers at the GitHub prerelease wheel for the real 0.6 flow today and
  keep the post-publish 'benchflow==0.6.0' command. Add a one-line trap about
  running 'uv run bench' from inside the project dir.
… full init scaffold

bench agent verify <name> crashed with AttributeError on a top-level JSON
array parity_experiment.json (extract_criterion_comparisons /
extract_reward_samples called .get on a list). Guard a non-Mapping payload
in both parsers so verify returns a documented verdict
(insufficient-evidence) instead of crashing, and never fabricates phantom
zero-delta reward samples from summary records. Widen the related type hints
(load_parity_experiment, build_verify_report) to match reality.
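
A minimal sketch of the non-Mapping guard described above. The function name mirrors the one mentioned in the commit, but the body, the `reward_samples` key, and the `parity-divergent` verdict string are assumptions made for illustration; only `insufficient-evidence`, `parity-confirmed`, and the 0.02 tolerance appear in this PR.

```python
import json
from collections.abc import Mapping
from typing import Any


def extract_reward_samples(payload: Any) -> list[float]:
    """Return reward deltas from an object-shaped payload, or [] otherwise."""
    if not isinstance(payload, Mapping):
        # A top-level list has no .get(); treat it as "no usable evidence"
        # instead of letting AttributeError escape.
        return []
    samples = payload.get("reward_samples", [])  # hypothetical key name
    return [
        float(s)
        for s in samples
        if isinstance(s, (int, float)) and not isinstance(s, bool)
    ]


def verdict_for(raw_json: str, tolerance: float = 0.02) -> str:
    """Map a parity file's text to a verdict without ever raising on shape."""
    samples = extract_reward_samples(json.loads(raw_json))
    if not samples:
        return "insufficient-evidence"
    mean_delta = sum(samples) / len(samples)
    # "parity-divergent" is an assumed name for the failing verdict.
    return "parity-confirmed" if abs(mean_delta) <= tolerance else "parity-divergent"
```

Returning an empty sample list for a list-shaped file, rather than synthesizing zero deltas, is what keeps the gate from reporting a false parity-confirmed.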

bench tasks init's "Created:" summary under-reported the scaffold (omitted
verifier/test_outputs.py and verifier/rubrics/verifier.toml, both validated
by bench tasks check). Add scaffold_task() returning every file actually
written (derived from disk), keep init_task() as a thin wrapper, and have
the CLI list the real file set.

Tests: top-level-array parity file -> verify verdict not exception (unit +
CLI); init reported set == written set across format/no-pytest/no-oracle.
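
The "report what was actually written" idea from the `bench tasks init` fix could look roughly like the sketch below. The file contents and the return shape of `init_task` are assumptions; the point is only that the reported set is derived from disk, so it cannot drift from the scaffold.

```python
import tempfile
from pathlib import Path


def scaffold_task(task_dir: Path) -> list[Path]:
    """Write a toy scaffold and return every file found on disk afterwards."""
    files = {
        "task.md": "# placeholder task\n",
        "verifier/test_outputs.py": "def test_placeholder():\n    assert True\n",
        "verifier/rubrics/verifier.toml": "# placeholder rubric\n",
    }
    for rel, body in files.items():
        target = task_dir / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(body)
    # Derive the summary from the filesystem, not from the dict above.
    return sorted(p.relative_to(task_dir) for p in task_dir.rglob("*") if p.is_file())


def init_task(task_dir: Path) -> Path:
    """Thin wrapper; returning the directory is an assumed legacy shape."""
    scaffold_task(task_dir)
    return task_dir


created = scaffold_task(Path(tempfile.mkdtemp()) / "demo-task")
```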
…e gate

The artifacts section promised `bench agent verify` scores every shipped
parity_experiment.json and that the recorded experiments span per-criterion
verdict agreement. Verified against code, neither holds:

- `build_verify_report`/`extract_criterion_comparisons` require the file to be
  a JSON object; one shipped experiment is a top-level list of runs, which the
  gate cannot score, so `verify` does not work uniformly across the shipped set.
- No shipped experiment records per-criterion data — every `verify` run reports
  0/0 criteria — so "spanning per-criterion verdict agreement" was aspirational.

Rewrite the section to state the object-shape requirement, give a worked
`bench agent verify programbench` example (parity-confirmed, reward delta within
the 0.02 tolerance), note the shipped set is non-uniform (others return
insufficient-evidence; one is a list the gate cannot score), and reword the
Layer 2 lead-in so it no longer implies verify scores every adopted benchmark.
iter2: verifier-home isolation (/testbed_verify locked from agent read).
iter3: bench agent verify no longer crashes on list-shaped parity files
(graceful verdict; guarded both parsers, avoids a false parity-confirmed);
bench tasks init Created-summary now lists the real file set; docs L2
migrate-vs-adopt disambiguation + quickstart placeholder/RC-wheel fixes.
Suite 3,437 green; harvey-lab verify confirmed non-crashing.
…th in authoring example

Final cold-read nits (full flow already passed): the RC install example
pinned an older tag (literal copy fetched rc.2); the authoring example used
a relative hello.txt that could need an oracle iteration to resolve. Both
docs-only; no code/wheel change.
Remove the dashboard/ dev status board (Linear/CI/HF-scores surface, never
shipped in the wheel) and its 10 dashboard-only test modules. Redirect or
reword every reference so nothing dangles:
- pyproject.toml: daytona>=0.184 floor comment now cites sandbox/daytona.py
  (the actual sync-client user) instead of the removed dashboard panel.
- tests/test_paths_symlink_helpers.py: drop the cross-ref to the removed
  symlink-ingestion test.
- experiments/skillsbench-fill/README.md: describe the ledger summary instead
  of the removed Experiments tab path.
- .claude/launch.json: drop the dashboard run config.
- docs/running-any-benchmark.md: neutral wording for downstream consumers.

Collapse labs/ by archiving the two 0.2.x-era research artifacts under
labs/archive/ (git mv, evidence preserved):
- benchjack-sandbox-hardening, reward-hack-matrix
Add labs/archive/README.md and point README, docs/concepts.md,
docs/sandbox-hardening.md, and .claude/dev-docs/labs.md at the archived
location with a clear (historical, 0.2.x-era) note. The security story in
docs/sandbox-hardening.md is unchanged.

Gate: pytest tests/ (excl. integration) green; no dangling dashboard/ paths
or broken labs/ links remain.
…hecks

Pure structural quick wins in sandbox/daytona.py with no behavior change:

- Add _require_sandbox() helper and route the repeated
  'if not self._sandbox: raise' preconditions through it so the type
  checker narrows AsyncSandbox|None at each call site (same falsiness
  check and error message as the inlined guards).
- Define _SDK_RETRY once and apply it to the six idempotent SDK helpers
  that inlined an identical tenacity policy; _create_sandbox and
  _stop_sandbox keep their distinct policies.
- Add _reject_non_main_service() for the three identical single-container
  ValueError guards in the direct strategy.
- Build CreateSandboxFromImageParams once in _DaytonaDirect.start by
  computing the image in an inner if/else (ownership label preserved).
- Drop the no-op 'try/finally: pass' wrapper in _sandbox_exec, keeping
  the explanatory comment.
- Hoist shutil to a single top-level import; drop redundant local
  re/tempfile/shlex/shutil imports.
- Extract _record_agent_timeout() shared by both TimeoutError handlers.
- Add RolloutConfig._primary_role; collapse the three primary_*
  properties to one-liners.
- Initialize _effective_task_path/_task_tmp/_session_traj_count/
  _session_tool_count in __init__ and drop the now-redundant
  getattr guards on the install_agent()/execute() reads.

Guards in _capture_partial_acp_trajectory() and cleanup() are kept:
they intentionally tolerate Rollout.__new__() test stubs that skip
__init__ (documented in-code).
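
For readers unfamiliar with the `_require_sandbox()` pattern from the first bullet, here is a hedged sketch. `AsyncSandbox` below is a hypothetical stand-in class, and the error message is illustrative rather than the one used in sandbox/daytona.py.

```python
from typing import Optional


class AsyncSandbox:
    """Hypothetical stand-in for the SDK's sandbox handle."""

    def exec(self, cmd: str) -> str:
        return f"ran: {cmd}"


class DaytonaLike:
    def __init__(self) -> None:
        self._sandbox: Optional[AsyncSandbox] = None

    def _require_sandbox(self) -> AsyncSandbox:
        """Return the live sandbox or raise; the return type is non-optional."""
        if not self._sandbox:
            raise RuntimeError("sandbox not started")  # illustrative message
        return self._sandbox

    def start(self) -> None:
        self._sandbox = AsyncSandbox()

    def run(self, cmd: str) -> str:
        # No inline 'if not self._sandbox: raise' here: the helper's return
        # annotation lets a type checker treat the value as AsyncSandbox.
        return self._require_sandbox().exec(cmd)
```

Centralizing the check keeps the precondition and its message identical at every call site, which is what makes the refactor behavior-preserving.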
…ult reporter

Pure structural cleanups in the CLI lockdown surface (zero behavior change):

- lockdown: drop the dead _under_path helper (the live copy is inlined in
  the trusted-PATH script string).
- cli: extract _report_eval_result for the duplicated Score/errors summary
  print used by job and eval create.
- cli: collapse the two near-identical EvaluationConfig constructions in
  eval create into one _make_eval_config(source_provenance=...) builder.
- cli: add benchflow/cli/_options.py with reusable Typer option aliases
  (--agent/--model/--sandbox/--concurrency/--jobs-dir/--skill-mode) and
  apply them across commands; per-command defaults stay at the call site so
  --help output is byte-identical.
The daytona module is import-safe without the SDK (#358), so the factory's
import guard never tripped and selecting the Daytona environment leaked a raw
ImportError deep inside DaytonaSandbox.__init__ (after CPU/memory clamping).
Force _load_daytona_sdk() at env selection and route the failure through
_raise_missing_optional_sandbox_dependency, matching the modal branch — a clear
'uv sync --extra sandbox-daytona' / 'pip install benchflow[sandbox-daytona]'
error. Widen the helper's exc type to ImportError. Add a focused unit test.
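
The eager-import guard described in this commit can be sketched as follows. `load_optional_sdk` is a hypothetical helper, not the signature of `_load_daytona_sdk` or `_raise_missing_optional_sandbox_dependency`; only the two install commands in the message come from the commit text.

```python
import importlib
from types import ModuleType


def load_optional_sdk(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency up front and fail with an actionable hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise at selection time so the user sees the fix, not a raw
        # ImportError from deep inside a constructor.
        raise ImportError(
            f"'{module_name}' is not installed; run "
            f"'uv sync --extra {extra}' or 'pip install benchflow[{extra}]'"
        ) from exc
```

Calling such a helper when the environment is selected, rather than on first use, surfaces the missing extra before any resource clamping or setup work runs.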
v0.6 renamed SDK.run(trial_name=...) to rollout_name with no alias, a hard
break for pre-v0.6 callers. Re-add an optional 'trial_name' keyword that maps
to 'rollout_name' and emits a DeprecationWarning (matching the warnings.warn(
..., DeprecationWarning, stacklevel=2) style used in task/config.py). Passing
both is ambiguous and raises TypeError.

Adds tests/test_sdk_run_alias.py covering the warn+map path, the no-warn
rollout_name path, and the TypeError on both.
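
A minimal sketch of the deprecated-alias pattern described above, written as a free function. The real `SDK.run` takes other parameters not shown here, and the returned default name is an assumption made only so the example has something to return.

```python
import warnings
from typing import Optional


def run(*, rollout_name: Optional[str] = None, trial_name: Optional[str] = None) -> str:
    """Accept the old 'trial_name' keyword as a warned alias for 'rollout_name'."""
    if trial_name is not None:
        if rollout_name is not None:
            # Both supplied: there is no safe way to pick one.
            raise TypeError("pass either 'rollout_name' or 'trial_name', not both")
        warnings.warn(
            "'trial_name' is deprecated; use 'rollout_name'",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this function
        )
        rollout_name = trial_name
    return rollout_name or "default-rollout"  # assumed placeholder default
```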
The strict validate_reward_map rejects any unrecognized non-numeric
top-level key (e.g. the Harbor-era {"reward":1.0,"done":true}) and any
non-numeric metric, so the classic test.sh->reward.json flow fails the
whole run. Add an opt-in lenient path aligned with the existing
validation abstraction.

- validate_reward_map(..., lenient=False): new flag. Lenient drops
  unrecognized/non-numeric top-level keys and non-numeric metric entries
  (and malformed recognized-structured keys) with a single warnings.warn
  listing everything dropped, instead of raising. Still requires a usable
  scalar reward: from 'reward', else a numeric 'score'/'rewards' alias,
  else numeric metrics + a declared aggregate policy. Default stays strict
  (no behaviour change).
- reward_lenient_from_env(): reads BENCHFLOW_REWARD_LENIENT (truthy
  1/true/yes/on), matching benchflow's BENCHFLOW_* operator-toggle
  convention. Threaded into Verifier._parse_reward_json and the final
  _ensure_canonical_rewards gate in rollout.py.
- tests/test_reward_lenient.py: strict still raises on {reward,done};
  lenient drops done + non-numeric metrics, keeps reward, warns; alias
  derivation; env parsing.
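
The strict-versus-lenient split can be sketched in a much-simplified form. This is not benchflow's `validate_reward_map`: it treats every numeric top-level key as keepable, handles only the `reward` key and the `score` alias, and skips the structured keys and aggregate policies the commit mentions.

```python
import os
import warnings
from numbers import Real
from typing import Any, Mapping, Optional


def _is_number(value: Any) -> bool:
    # bool is a numeric subclass in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_reward_map(raw: Mapping[str, Any], lenient: bool = False) -> dict[str, float]:
    """Keep numeric entries; raise (strict) or warn once (lenient) on the rest."""
    kept = {k: float(v) for k, v in raw.items() if _is_number(v)}
    dropped = sorted(k for k, v in raw.items() if not _is_number(v))
    if dropped and not lenient:
        raise ValueError(f"non-numeric reward keys: {dropped}")
    if dropped:
        warnings.warn(f"dropped non-numeric reward keys: {dropped}", stacklevel=2)
    if "reward" not in kept:
        if "score" not in kept:
            raise ValueError("no usable scalar reward")
        kept["reward"] = kept["score"]  # alias derivation
    return kept


def reward_lenient_from_env(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the BENCHFLOW_REWARD_LENIENT toggle (truthy: 1/true/yes/on)."""
    source = os.environ if env is None else env
    return source.get("BENCHFLOW_REWARD_LENIENT", "").strip().lower() in {
        "1", "true", "yes", "on",
    }
```

Even in lenient mode the function still refuses a payload with no scalar reward, which matches the commit's note that leniency relaxes what is dropped, not what is required.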
strip_provider_prefix leaves a bare model id (e.g. deepseek-v4-flash) that
find_provider can no longer match, so the openclaw shim's
_infer_provider_prefix defaulted every non-gemini/gpt id to anthropic --
silently running deepseek/glm/qwen/... as anthropic.

Add a registry-owned find_provider_for_bare_model(model): each provider
declares the bare-model family tokens it owns via a new
ProviderConfig.model_prefixes field, and the helper does longest-token
matching (with a non-letter family boundary so glm matches glm-4.6 / glm5
but not glmnext), falling back to an unambiguous declared models[].id.
_infer_provider_prefix now consults it before the native gemini/gpt
heuristics, then anthropic -- existing gemini/gpt/anthropic behavior
unchanged.

Tests: tests/test_bare_model_provider.py.
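
The longest-token match with a non-letter family boundary might be sketched as below. The registry contents and provider names are hypothetical, and the fallback to a declared `models[].id` is omitted; only the `glm-4.6` / `glm5` / `glmnext` behavior is taken from the commit message.

```python
from typing import Optional

# Hypothetical registry: provider name -> bare-model family tokens it owns.
MODEL_PREFIXES: dict[str, tuple[str, ...]] = {
    "deepseek-host": ("deepseek",),
    "glm-host": ("glm",),
    "qwen-host": ("qwen",),
    "coder-host": ("qwen-coder",),
}


def find_provider_for_bare_model(model: str) -> Optional[str]:
    """Return the provider owning the longest matching family token, if any."""
    name = model.lower()
    best_len = 0
    best_provider: Optional[str] = None
    for provider, tokens in MODEL_PREFIXES.items():
        for token in tokens:
            if not name.startswith(token):
                continue
            rest = name[len(token):]
            # Family boundary: the token must end the id or be followed by a
            # non-letter, so "glm" matches "glm-4.6" and "glm5" but not "glmnext".
            if rest and rest[0].isalpha():
                continue
            if len(token) > best_len:
                best_len, best_provider = len(token), provider
    return best_provider
```

Returning `None` for an unknown family lets the caller fall through to its existing heuristics instead of silently assigning a default provider.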
16-probe deterministic sweep (task.md gate, CLI arg validation, provider/agent
wiring, judge routing, ATIF/ADP, Daytona reaper) with adversarial verification,
plus 4 live rollout campaigns (oracle Docker/Daytona, deepseek-v4-flash openhands
Daytona, Docker/Daytona parity isolation, live gemini-3.1-flash-lite judge).

9 confirmed defects (2 P1, 6 P2, 3 P3) + 1 broken example oracle; 4 false alarms
refuted. Engine, trajectory artifacts, secret redaction, reaper ownership-scoping,
and conversion gate all pass under stress (3473/3474 fast tests).
xdotli added 4 commits June 13, 2026 17:17
From a 92-agent dead-code audit of the release HEAD. This PR removes only the
high-confidence, pure-deletion subset — symbols re-verified zero-reference WITH
CLASS CONTEXT (the audit's blanket signal flagged e.g. RolloutPaths.log_path as
"dead" while 14 unrelated `.log_path` attributes existed on other classes; each
item here was disambiguated against its real reference sites):

- task/paths.py: 7 unused `*_path` `@property` accessors, namely TaskPaths.{readme_path,
  gitignore_path, verifier_document_path} and RolloutPaths.{artifacts_manifest_path,
  result_path, exception_message_path, log_path}. The underlying files are still
  written/read via inline `dir / "name"`; only the unused accessors go.
- sandbox/modal_impl.py: ModalSandbox.{supports_gpus, can_disable_internet} —
  vestigial capability properties not on the Sandbox Protocol (GPU config reads
  task_env_config.gpus; internet policy uses SandboxConfig.allow_internet).
- cli/continue_cmd.py: unused module-level `logger` + its `import logging`.
- experimental/mcp/hooks.py: orphaned `mcp_service_hooks_from_config` (not
  exported, no callers); `mcp_reviewer_hook` is kept (referenced in docs).

No public API touched. Full suite 4069 passed; ruff/format/ty clean.

Deferred to a follow-up (need per-site verification or an outward-facing
decision, NOT included here): dataclass fields with positional-construction risk
(AgentLift/JudgeConfig/ServiceConfig/ToolCall/SandboxReplayProxy), back-compat
aliases requiring test repoints (scoring.classify_result_dict/count_result_outcomes,
SDK pass-through wrappers, RuntimeResult.to_run_result), and the public-`__all__`
/ Protocol-surface items (OTelCollector, ACP types models, the Sandbox.read_file/
write_file runtime_checkable isinstance-trap, RoundResult.n_tool_calls).
chore: dead-code purge round 2 (verified zero-reference)
Continues the audit-driven cleanup. The keystone fix is the Sandbox Protocol:

- Removed `read_file`/`write_file` from the `@runtime_checkable` `Sandbox`
  Protocol. No docker/daytona/modal backend implements them (backends expose
  the `upload_file`/`download_file` family) and there were zero call sites on a
  benchflow Sandbox — they were a latent `isinstance` trap on the contract
  surface. (`harvey_lab_acp_shim.DirectSandbox` and the deepagents test
  workspace define their own `read_file`/`write_file`; those are unrelated
  classes, not this Protocol.)
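
To make the `isinstance` trap concrete, here is a small sketch using toy protocols. The class names and the simplified signatures (for example `exec` returning `str`) are assumptions for illustration and do not reproduce sandbox/protocol.py.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class WideSandbox(Protocol):
    """Protocol that still lists a stub method no backend implements."""

    def exec(self, cmd: str) -> str: ...
    def upload_file(self, src: str, dst: str) -> None: ...
    def read_file(self, path: str) -> str: ...


@runtime_checkable
class NarrowSandbox(Protocol):
    """Protocol listing only what backends actually provide."""

    def exec(self, cmd: str) -> str: ...
    def upload_file(self, src: str, dst: str) -> None: ...


class DockerLikeBackend:
    def exec(self, cmd: str) -> str:
        return cmd

    def upload_file(self, src: str, dst: str) -> None:
        return None


backend = DockerLikeBackend()
wide_ok = isinstance(backend, WideSandbox)      # False: read_file is missing
narrow_ok = isinstance(backend, NarrowSandbox)  # True: every member is present
```

A runtime-checkable protocol requires every listed member to exist on the object, so an unimplemented stub on the contract makes otherwise valid backends fail the check.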

Plus round-3 dead code — each symbol re-verified zero-reference WITH CLASS
CONTEXT (the audit's blanket signal mislabeled a couple of items, e.g.
`session_load`, which actually has a test caller and was therefore NOT removed):

- `TaskMetrics.audit_outcome` property + its now-orphaned `classify_audit_outcome`
  import (`TaskMetrics.outcome` deliberately kept — a `result.outcome` ref could
  not be cleanly disambiguated, so it waits for a follow-up).
- `OTelCollector.endpoint`, `ReplayRouter.cursor` (both unused `@property` accessors; the
  look-alike `.endpoint`/`.cursor` reads belong to other classes).
- `RuntimeResult.to_run_result` (legacy SDK-compat converter, only caller was its
  own test — both removed).
- Never-read dataclass fields `ToolCall.output`, `JudgeConfig.{reference,
  prompt_template}` (+ their `_parse_judge` writes + a docs/llm-judge.md row).
- Write-only `ReplayProxy._host`; inert `AgentProtocolError.code` annotation;
  unused `retry_if_exception_type` import + fallback in `sandbox/daytona.py`.

Also renamed the misleading `TestReadFileError` test class (it only exercised
`ExecResult`, never `read_file`).

No public API affected. Full suite 4068 passed; ruff/format/ty clean; an
adversarial second-pass review confirmed all 10 removals are unreferenced.
…ocol

chore: remove dead Sandbox Protocol stubs + dead-code purge round 3

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 6affa1c.

Comment thread .claude/skills/launch-prep/SKILL.md
xdotli added 2 commits June 13, 2026 17:45
…file removal

The `Sandbox` Protocol example still listed `read_file`/`write_file` — the
unimplemented stub methods removed from the Protocol in #745 (no backend
implements them; a custom sandbox would be misled into "must implement" them).
Replace with the real file-transfer contract (`upload_file`/`download_file`) and
correct the `exec` signature to match `sandbox/protocol.py` (`cmd: str` →
`ExecResult`, not `cmd: list[str]` → `SandboxExecResult`).
docs(python-api): fix Sandbox Protocol example after read_file/write_file removal
xdotli added 5 commits June 13, 2026 18:36
`OTelCollector` was a designed-but-never-wired OTLP/HTTP receiver from the v2
rewrite: it would have captured an agent's OpenTelemetry GenAI spans (prompts,
completions, model, token usage) into a benchflow Trajectory. But repo-wide it
is never instantiated, has zero tests, and is not part of any run path — the
default capture goes through ACP session events and the LiteLLM callback path.
Its only references were its own definition + the public `__all__` re-exports.

Per maintainer decision ("remove for now"), drop the `trajectories/otel.py`
module and its exports from `benchflow.__all__` and `benchflow.trajectories`.
This removes a public name, so it is a (minor) breaking change — appropriate in
the RC window. Re-add with a test + real wiring if OTel-based capture is revived.

Full suite 4068 passed; ruff/ty clean; `import benchflow` healthy.
chore: remove the unwired OTelCollector (public-API removal)
Pre-release cleanup of stale notes (per maintainer call):

- `.claude/dev-docs/0.3-plan.md` — the 0.3-era dev plan; unreferenced and
  superseded (we're at 0.6). The actively-referenced dev-docs
  (architecture/harden-sandbox/labs/tested-agents, used by the review skills)
  are kept.
- `docs/architecture-explorer.html` — an unlinked, stale generated explorer
  artifact (still showed the pre-rename 7-group CLI taxonomy); not in the docs
  nav and referenced nowhere.
- `docs/task-standard-roadmap.md` — the roadmap doc; removed from both
  `docs.json` navs and de-linked from `task-standard.md` (the "Open Primitives
  and Roadmap" subsection that only pointed at it is dropped).

Docs-only. Both docs.json files validate; no dangling references remain; the
CLI<->docs drift guard passes.
chore(docs): remove stale dev/old notes
- pyproject: 0.6.0.dev0 -> 0.6.0 (final release version; __version__ derives
  from package metadata, so this is the single source of truth).
- CHANGELOG: fold the accumulated `[Unreleased]` RC-loop entries into a single
  dated `## 0.6.0 — 2026-06-13` section, merging the two duplicate 0.6.0 blocks
  (the 06-10 draft + the [Unreleased] hardening) into one — Added / Changed /
  Renamed / Removed / Fixed, each appearing once. Updated the Added
  adoption-router bullet to the shipped `bench eval adopt` names (was the
  pre-rename `bench agent create|run|verify`).

NOTE: install-doc wheel URLs still point at the rc.6 wheel — switch those to the
final 0.6.0 / PyPI install at publish time (changing now would 404 until the
wheel exists). Full suite 4068 passed; ruff/ty clean.
xdotli added 2 commits June 13, 2026 18:58
chore: cut 0.6.0 — bump version + consolidate CHANGELOG
`test_environment_group_is_hidden_but_still_resolves` matched `│`-anchored rows
in `bench --help` raw output WITHOUT stripping ANSI. On CI, Rich emits color
codes (FORCE_COLOR), so the `^\s*│` anchor matched nothing → `rows == set()` →
`assert 'sandbox' in set()` failed. It passed locally only because there were no
ANSI codes. This was the sole failure on #665's `test` gate (1 failed / 4112
passed), blocking the 0.6.0 release.

Fix: assert against the Click command registry (`typer.main.get_command(app)` +
`.hidden`) instead of a regex over rendered help — authoritative and immune to
ANSI/locale rendering differences. Verified the fix passes under FORCE_COLOR=1
(which reproduces the CI failure with the old code).

The two other help-row parsers (test_cli_docs_drift, test_cli_adopt_aliases)
already strip ANSI, so they were unaffected.
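
The failure mode is easy to reproduce with the standard library alone. The help text below is made up, and the sketch shows the ANSI-stripping approach used by the other two parsers rather than the registry-based fix this commit applies (which needs Typer and is not shown).

```python
import re

ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")  # colour/style escape sequences only
ROW = re.compile(r"^\s*│\s*(\S+)", re.MULTILINE)

# Made-up help rows, plain and with colour codes wrapped around the bar.
plain_help = "│ sandbox   Manage sandboxes │\n│ tasks     Author tasks     │"
colored_help = (
    "\x1b[2m│\x1b[0m \x1b[1;36msandbox\x1b[0m   Manage sandboxes \x1b[2m│\x1b[0m"
)

rows_plain = set(ROW.findall(plain_help))
# The escape byte is not whitespace, so the anchored pattern matches nothing.
rows_colored_raw = set(ROW.findall(colored_help))
rows_colored_stripped = set(ROW.findall(ANSI_SGR.sub("", colored_help)))
```

Stripping escapes restores the match, but asserting against the command registry avoids depending on rendered output at all.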
xdotli added 2 commits June 13, 2026 19:20
test(cli): fix CI-fragile help-parse blocking the 0.6.0 release gate
#750's test rewrite passed `ruff check` (lint) but I didn't run
`ruff format --check` — the CI `test` job runs format-check as a gating step and
failed on `Would reformat: tests/test_cli_hub_env.py`. This applies ruff format
(collapses the set comprehension to ruff's canonical form). No logic change.

Verified: ruff format --check + ruff check + ty + the test file all pass.
xdotli added 3 commits June 13, 2026 19:24
style: ruff-format tests/test_cli_hub_env.py (CI gate fix)
…s-repros/)

A fan-out dead-docs sweep (adversarially verified against release/v0.6.0)
confirmed these as orphaned stray artifacts with ZERO inbound references:

- STRESS-TEST-v0.6.0.md — a one-off, dated (2026-06-11, rc3) stress-test
  campaign report at the repo root. Exhausted: the defects it found were fixed
  and filed in Linear (ENG-248..257). Not in either docs.json nav.
- stress-repros/ — the 12-file scratch repro tree (README, LINEAR-ISSUES, and
  10 p*.sh scripts) for that campaign. Never wired into the test suite or CI;
  references an out-of-repo /tmp env file. Its only internal link was the
  back-link to STRESS-TEST-v0.6.0.md (also removed), so no dangling refs remain.

Substantive content lives in Linear. Docs/scratch-only; no code, test, nav, or
CI reference touched (verified by grep across md/json/py/toml/yml).
@xdotli xdotli merged commit 73fbca5 into main Jun 13, 2026
3 of 4 checks passed
