2026-04-28-harbor#30

Open
joyemang33 wants to merge 7 commits into FrontierCS:main from joyemang33:calico-blog

Conversation

@joyemang33

Collaborator

OpenReview Submission Thread

Checklist before opening a PR

  • I am opening a pull request against the main branch of the 2025 repo.

  • My post and all associated references to it are all lowercase, i.e.

      2025-04-28-Sample-Test.md               -> 2025-04-28-sample-test.md
      assets/img/2025-04-28-Sample-Test/ 	-> assets/img/2025-04-28-sample-test/
    
  • The title of my PR is exactly the name of my markdown file

    • i.e. _posts/2025-04-28-[submission-name].md would require a PR name 2025-04-28-[submission-name]
  • I have anonymized my post: my author list is Anonymous, and there is no
    content that could reveal my or my collaborators' identities.

  • My post matches the formatting requirements, including (but not limited to):

    • I have ONLY MODIFIED files in the following locations (failure to do so will result in
      your PR automatically being closed!):
      • a Markdown (or HTML) file in _posts/ with the format _posts/2025-04-28-[submission-name].md (or .html)
      • static image assets added to assets/img/2025-04-28-[submission-name]/
      • interactive HTML figures added to assets/html/2025-04-28-[submission-name]/
      • citations in a bibtex file in assets/bibliography/2025-04-28-[submission-name].bib
    • I have a short 2-3 sentence abstract in the description field of my front-matter
    • I have a table of contents, formatted using the toc field of my front-matter
    • My bibliography is correctly formatted, using a .bibtex file as per the sample post
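
The path rules in the checklist above can be checked mechanically before opening the PR. The following is a minimal sketch, not the repository's actual CI check: the function name `check_submission_paths`, the reason strings, and the idea of passing in the list of changed paths are all assumptions made for illustration.

```python
import re


def check_submission_paths(slug, paths):
    """Return (path, reason) pairs for paths that break the checklist rules.

    `slug` is the lowercase submission name, e.g. "2025-04-28-sample-test".
    `paths` is the list of files touched by the PR (hypothetical input).
    """
    escaped = re.escape(slug)
    # The four locations the checklist allows a submission to modify.
    allowed = [
        rf"_posts/{escaped}\.(md|html)",
        rf"assets/img/{escaped}/.+",
        rf"assets/html/{escaped}/.+",
        rf"assets/bibliography/{escaped}\.bib",
    ]
    problems = []
    for path in paths:
        # Match case-insensitively first so a mis-cased name is reported
        # as a casing problem rather than as a forbidden location.
        in_allowed = any(re.fullmatch(p, path, re.IGNORECASE) for p in allowed)
        if not in_allowed:
            problems.append((path, "outside the allowed locations"))
        elif slug not in path:
            problems.append((path, "submission name is not all lowercase"))
    return problems
```

Calling it with the slug and the changed files returns an empty list when every path is acceptable, so it can gate a local pre-submission script.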

Any other comments

joyemang33 and others added 2 commits April 29, 2026 20:15
…tion

- Replace em-dashes with colons or sentence breaks throughout body and front-matter
- Remove prose parentheses (Plug in, Two strategies callout, Harbor benchmarks list)
- Sync opening callout to table: up to 456 turns / 405 tool calls / 531K tokens
- Update front-matter description with current data
- Drop the body opener that repeated the lead callout's first two sentences
- Normalize Pointers list bullets to colon-separated descriptions

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wenhaochai changed the title from 1 to 2026-04-28-harbor on Apr 30, 2026
wenhaochai and others added 5 commits April 30, 2026 16:11
- After the design-principle callout, add three short paragraphs covering:
  the structural isolation (read-only mounts, single-problem judge mount,
  HTTP-only channel), the five-step trial flow, and the three judge image
  modes (per-trial build, locally cached, published frozen).
- Add new section "Parity With the Native Eval" with the 10-problem,
  3-run-per-side comparison: 68.92% native vs 53.37% Harbor on
  claude-code@2.1.112 / claude-opus-4-6, plus the oracle-sweep 70.23%
  baseline on every problem with a shipped reference. Notes which pieces
  are intentionally aligned (prompt, CLI, chk.cc) and what changes
  (transport: in-process vs HTTP).
- Update TOC to include the new Parity section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The empirical evidence (105 runs, sustained hundreds of turns) now lands
right after the Polyomino example, while "172 Problems, One Command" +
"Parity With the Native Eval" move to the end and read as one continuous
"here's the adapter, here's parity, here's the command" block leading
into "Try It". TOC reordered to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the placeholder Kimi section with a trace-grounded analysis:

- All 17 Opus crashes share the same 96K-output-token-per-call cap
  (verified across all trajectory.json files); none of Kimi's 2 crashes do.
- Walk frontier-cs-220 (Opus 0 / Kimi 1.000): Opus's trajectory shows
  4 sequential thinking-only assistant chunks, 0 tool calls, empty
  artifacts dir; Kimi runs 207 turns with explicit context compaction.
- Add frontier-cs-0 as a second mode: Opus self-validates its solver
  on random cases, judge TLEs 70/70 (returncode=-9); Kimi ships
  conservative algo for 0.74.
- Drop the 17 zeros from Opus's average and its mean rises 35.1 -> 41.8.

Header renamed from "In-depth Study of How Kimi Achieves Similar Scores"
because the trace says Opus hits a self-imposed ceiling, not that
Kimi catches up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
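
The "35.1 -> 41.8" figure in the commit message above follows from removing zero-score runs from an average: the score total is unchanged and only the divisor shrinks. A minimal sketch of that arithmetic is below. The helper name `mean_without_zeros` is hypothetical, and the total of 105 runs is an assumption borrowed from the "105 runs" mentioned in an earlier commit message; this thread does not confirm that it is the Opus run count.

```python
def mean_without_zeros(mean_all, n_total, n_zeros):
    """Recompute a mean after dropping runs that scored exactly zero.

    Zero-score runs add nothing to the sum, so the total is unchanged
    and only the number of runs in the divisor shrinks.
    """
    if n_zeros >= n_total:
        raise ValueError("cannot drop every run")
    total = mean_all * n_total
    return total / (n_total - n_zeros)


# Assumed inputs: 105 total runs (not confirmed here), 17 zero-score crashes.
adjusted = mean_without_zeros(35.1, 105, 17)
```

Under these assumed inputs the adjusted mean comes out close to the reported 41.8; the small difference is what one would expect from the rounding of 35.1 and from the unverified run count.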

3 participants