Correct content-awareness framing with a direct content-specificity eval #16
A direct retrieval eval (does each AV output recover its own source document?) puts the v0.1 AV at chance across lexical, semantic, and two LLM-judge probes. So the output is format/genre-plausible but NOT per-row content- or theme-discriminative. Corrects the mild 'theme-correct' overclaim in README / model card / calibration doc; the 'detail-confabulated' half stands. Nuance kept: v0.1 output is diverse (45/50 unique strings); the diversity is decoupled from source content. Scope-limit noted (a feature constant across all 13 docs would be invisible to this eval). Bundles all eval scripts, inputs (rl.parquet, AV outputs), per-trial LLM-judge data, and CONTENT_SPECIFICITY_EVAL.md under experiments/v8_nla_local/ so the suite is reproducible from this repo. Also fixes v0.0.1 AV SFT-step count in MODEL_CARD_AV (15 -> 55 total = 15 base + 40 continuation; grad_accum 16 -> 4), restoring the value dropped in the 2026-05-17 retraction rewrite (commit 8dd0a99). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
/gemini review
Code Review
This pull request introduces a comprehensive content-specificity retrieval evaluation suite to determine if the v0.1 Natural Language Autoencoder (NLA) text outputs carry any recoverable content signal from their source documents. It adds several evaluation scripts, documentation, and results demonstrating that the NLA outputs are format-plausible but not content-discriminative. The feedback on the new Python scripts focuses on improving robustness and resource management. Specifically, it is recommended to check subprocess return codes in the LLM judge script to prevent silent failures, use context managers (with statements) for all file read/write operations, and gracefully handle the potential absence of the sentence_transformers package in the refinement script.
| def call_judge_claude(prompt, model=MODEL): | ||
| claude = shutil.which("claude") or "claude" | ||
| proc = subprocess.run( | ||
| [claude, "-p", "--model", model], | ||
| input=prompt, capture_output=True, text=True, cwd=HOME, timeout=300, | ||
| encoding="utf-8", errors="replace", | ||
| ) | ||
| return proc.stdout.strip() | ||
| def call_judge_gemini(prompt): | ||
| # Antigravity (Gemini 3.5 Flash) via the gemini-collab wrapper; prompt passed as argv. | ||
| proc = subprocess.run( | ||
| [sys.executable, GEMINI_WRAPPER, "--prompt", prompt, "--timeout", "300"], | ||
| capture_output=True, text=True, encoding="utf-8", errors="replace", timeout=360, | ||
| ) | ||
| return proc.stdout.strip() |
The subprocess calls to the LLM judges currently fail silently if the command returns a non-zero exit code or fails to produce a valid response. This leads to silent failures where the script continues executing with empty results, eventually writing a report claiming 'no signal above chance' with 0 trials (as seen in the generated content_specificity_judge_gemini.json file). Adding a check for proc.returncode and raising an error with proc.stderr will prevent these silent failures and make debugging much easier.
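The failure mode described in this comment can be reproduced in isolation. The sketch below is illustrative only: `run_checked` and the `failing` command are hypothetical stand-ins rather than names from the PR, and the empty-response guard is an extra assumption that goes slightly beyond the reviewer's request.

```python
import subprocess
import sys


def run_checked(cmd, prompt, timeout=300):
    """Run a judge command; raise instead of returning empty text on failure."""
    proc = subprocess.run(
        cmd, input=prompt, capture_output=True, text=True,
        encoding="utf-8", errors="replace", timeout=timeout,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"judge exited with {proc.returncode}: {proc.stderr.strip()}")
    out = proc.stdout.strip()
    if not out:
        # Hypothetical extra guard: exit code 0 but nothing to parse.
        raise RuntimeError("judge returned an empty response")
    return out


# Stand-in for a failing judge CLI: writes to stderr and exits non-zero.
failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]

# Without a return-code check the caller just sees an empty string.
unchecked = subprocess.run(failing, input="prompt", capture_output=True, text=True)
print(repr(unchecked.stdout.strip()))
```

Calling `run_checked(failing, "prompt")` raises `RuntimeError` with the exit code and stderr text, so a broken judge cannot be mistaken for a batch of unparsed answers.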
def call_judge_claude(prompt, model=MODEL):
    claude = shutil.which("claude") or "claude"
    proc = subprocess.run(
        [claude, "-p", "--model", model],
        input=prompt, capture_output=True, text=True, cwd=HOME, timeout=300,
        encoding="utf-8", errors="replace",
    )
    if proc.returncode != 0:
        raise RuntimeError(f"Claude judge failed with exit code {proc.returncode}: {proc.stderr}")
    return proc.stdout.strip()

def call_judge_gemini(prompt):
    # Antigravity (Gemini 3.5 Flash) via the gemini-collab wrapper; prompt passed as argv.
    proc = subprocess.run(
        [sys.executable, GEMINI_WRAPPER, "--prompt", prompt, "--timeout", "300"],
        capture_output=True, text=True, encoding="utf-8", errors="replace", timeout=360,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"Gemini judge failed with exit code {proc.returncode}: {proc.stderr}")
    return proc.stdout.strip()
| def load(): | ||
| df = pd.read_parquet(RL) | ||
| av = json.load(open(AV, encoding="utf-8"))["our_outputs"] |
The file AV is opened using open() but is never explicitly closed, which can lead to resource leaks. It is recommended to use a with statement (context manager) to ensure the file is properly closed after reading.
| def load(): | |
| df = pd.read_parquet(RL) | |
| av = json.load(open(AV, encoding="utf-8"))["our_outputs"] | |
| def load(): | |
| df = pd.read_parquet(RL) | |
| with open(AV, encoding="utf-8") as f: | |
| av = json.load(f)["our_outputs"] |
| results["config"]["semantic"] = "sentence-transformers not installed; skipped" | ||
| print("[semantic] skipped (sentence-transformers not installed)") | ||
| json.dump(results, open(OUT, "w", encoding="utf-8"), indent=1) |
The output file OUT is opened for writing without a context manager, leaving the file descriptor open. Using a with statement ensures the file is closed properly.
| json.dump(results, open(OUT, "w", encoding="utf-8"), indent=1) | |
| with open(OUT, "w", encoding="utf-8") as f: | |
| json.dump(results, f, indent=1) |
| def build_trials(): | ||
| df = pd.read_parquet(RL) | ||
| av = json.load(open(AV, encoding="utf-8"))["our_outputs"] |
The file AV is opened using open() but is never explicitly closed. Using a with statement ensures the file is properly closed after reading.
| def build_trials(): | |
| df = pd.read_parquet(RL) | |
| av = json.load(open(AV, encoding="utf-8"))["our_outputs"] | |
| def build_trials(): | |
| df = pd.read_parquet(RL) | |
| with open(AV, encoding="utf-8") as f: | |
| av = json.load(f)["our_outputs"] |
| "per_trial": [{"trial_id": tid, "choice": ch, "true_pos": tp, "correct": ch == tp} | ||
| for tid, ch, tp in scored], | ||
| } | ||
| json.dump(result, open(OUT, "w", encoding="utf-8"), indent=1) |
| def main(): | ||
| df = pd.read_parquet(RL) | ||
| av_all = json.load(open(AV, encoding="utf-8"))["our_outputs"] |
The file AV is opened using open() but is never explicitly closed. Using a with statement ensures the file is properly closed after reading.
| def main(): | |
| df = pd.read_parquet(RL) | |
| av_all = json.load(open(AV, encoding="utf-8"))["our_outputs"] | |
| def main(): | |
| df = pd.read_parquet(RL) | |
| with open(AV, encoding="utf-8") as f: | |
| av_all = json.load(f)["our_outputs"] |
| f"nontmpl top1={r['R2_nontemplate_tail']['doc_top1']:.3f}(p{r['R2_nontemplate_tail']['top1_p']:.2f}) " | ||
| f"| gap_p full={r['R1_full_doc']['gap_p']:.2f} tail={r['R1_tail_window']['gap_p']:.2f} nontmpl={r['R2_nontemplate_tail']['gap_p']:.2f}") | ||
| json.dump(out, open(OUT, "w", encoding="utf-8"), indent=1) |
| def make_sim(av, src, kind): | ||
| if kind == "tfidf": | ||
| vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True, | ||
| stop_words="english").fit(src + av) | ||
| return cosine_similarity(vec.transform(av), vec.transform(src)) | ||
| if kind == "semantic": | ||
| from sentence_transformers import SentenceTransformer | ||
| model = SentenceTransformer("all-MiniLM-L6-v2") | ||
| ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False) | ||
| es = model.encode(src, normalize_embeddings=True, show_progress_bar=False) | ||
| return ea @ es.T |
Unlike eval_content_specificity.py, this script does not handle the absence of sentence_transformers gracefully. If the package is not installed, running this script will crash with a ModuleNotFoundError. Wrapping the import in a try-except block and returning None allows the script to skip the semantic evaluation gracefully, matching the behavior of the main evaluation script.
| def make_sim(av, src, kind): | |
| if kind == "tfidf": | |
| vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True, | |
| stop_words="english").fit(src + av) | |
| return cosine_similarity(vec.transform(av), vec.transform(src)) | |
| if kind == "semantic": | |
| from sentence_transformers import SentenceTransformer | |
| model = SentenceTransformer("all-MiniLM-L6-v2") | |
| ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False) | |
| es = model.encode(src, normalize_embeddings=True, show_progress_bar=False) | |
| return ea @ es.T | |
| def make_sim(av, src, kind): | |
| if kind == "tfidf": | |
| vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True, | |
| stop_words="english").fit(src + av) | |
| return cosine_similarity(vec.transform(av), vec.transform(src)) | |
| if kind == "semantic": | |
| try: | |
| from sentence_transformers import SentenceTransformer | |
| except ImportError: | |
| return None | |
| model = SentenceTransformer("all-MiniLM-L6-v2") | |
| ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False) | |
| es = model.encode(src, normalize_embeddings=True, show_progress_bar=False) | |
| return ea @ es.T |
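As an alternative to the try/except in the suggestion above, availability can be checked once up front so that the `kind` loop never requests a similarity it cannot compute. This is a minimal sketch using the standard library's `importlib.util.find_spec`; `has_module` and `kinds_to_run` are hypothetical helper names, not functions from the PR.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


def kinds_to_run():
    # "tfidf" always runs; "semantic" only when sentence_transformers is importable.
    kinds = ["tfidf"]
    if has_module("sentence_transformers"):
        kinds.append("semantic")
    return kinds


print(kinds_to_run())
```

One caveat on this design: `find_spec` only confirms the package can be located. A broken install can still fail at import time, so the try/except form remains the more defensive of the two.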
| for kind in ["tfidf", "semantic"]: | ||
| # R1: full vs tail-window source | ||
| sim_full = make_sim(av_all, full_src, kind) | ||
| sim_tail = make_sim(av_all, tail_src, kind) | ||
| # R2: non-template subset, tail source | ||
| av_k = [av_all[i] for i in keep] | ||
| tail_k = [tail_src[i] for i in keep] | ||
| doc_k = [doc[i] for i in keep] | ||
| sim_k = make_sim(av_k, tail_k, kind) | ||
| out["results"][kind] = { | ||
| "R1_full_doc": perm(sim_full, doc), | ||
| "R1_tail_window": perm(sim_tail, doc), | ||
| "R2_nontemplate_tail": perm(sim_k, doc_k), | ||
| } | ||
| r = out["results"][kind] | ||
| print(f"[{kind}] full top1={r['R1_full_doc']['doc_top1']:.3f}(p{r['R1_full_doc']['top1_p']:.2f}) " | ||
| f"tail top1={r['R1_tail_window']['doc_top1']:.3f}(p{r['R1_tail_window']['top1_p']:.2f}) " | ||
| f"nontmpl top1={r['R2_nontemplate_tail']['doc_top1']:.3f}(p{r['R2_nontemplate_tail']['top1_p']:.2f}) " | ||
| f"| gap_p full={r['R1_full_doc']['gap_p']:.2f} tail={r['R1_tail_window']['gap_p']:.2f} nontmpl={r['R2_nontemplate_tail']['gap_p']:.2f}") |
To support graceful skipping of the semantic evaluation when sentence_transformers is not installed, update the loop in main to check if sim_full is None and skip the current iteration if so.
for kind in ["tfidf", "semantic"]:
    # R1: full vs tail-window source
    sim_full = make_sim(av_all, full_src, kind)
    if sim_full is None:
        print(f"[{kind}] skipped (sentence-transformers not installed)")
        continue
    sim_tail = make_sim(av_all, tail_src, kind)
    # R2: non-template subset, tail source
    av_k = [av_all[i] for i in keep]
    tail_k = [tail_src[i] for i in keep]
    doc_k = [doc[i] for i in keep]
    sim_k = make_sim(av_k, tail_k, kind)
    out["results"][kind] = {
        "R1_full_doc": perm(sim_full, doc),
        "R1_tail_window": perm(sim_tail, doc),
        "R2_nontemplate_tail": perm(sim_k, doc_k),
    }
    r = out["results"][kind]
    print(f"[{kind}] full top1={r['R1_full_doc']['doc_top1']:.3f}(p{r['R1_full_doc']['top1_p']:.2f}) "
          f"tail top1={r['R1_tail_window']['doc_top1']:.3f}(p{r['R1_tail_window']['top1_p']:.2f}) "
          f"nontmpl top1={r['R2_nontemplate_tail']['doc_top1']:.3f}(p{r['R2_nontemplate_tail']['top1_p']:.2f}) "
          f"| gap_p full={r['R1_full_doc']['gap_p']:.2f} tail={r['R1_tail_window']['gap_p']:.2f} nontmpl={r['R2_nontemplate_tail']['gap_p']:.2f}")
Code Review
This pull request introduces a comprehensive content-specificity retrieval evaluation suite for the v0.1 Natural Language Autoencoder (NLA), adding several evaluation scripts, documentation, and benchmark results. The findings show that while the AV outputs are diverse, they perform at chance in recovering source-document content, prompting updates to the README, model card, and calibration notes. The reviewer feedback is highly constructive, recommending robustness and portability improvements such as handling subprocess timeouts gracefully in the LLM-judge script, avoiding hardcoded home directory paths, and utilizing context managers for safer file I/O operations.
| for b in range(0, len(trials), batch_size): | ||
| batch = trials[b:b + batch_size] | ||
| prompt = make_prompt(batch) | ||
| resp = call_judge(prompt) | ||
| parsed = parse(resp) | ||
| choices.update(parsed) | ||
| with open(RAW, "a", encoding="utf-8") as f: | ||
| f.write(json.dumps({"batch_start": b, "prompt_chars": len(prompt), | ||
| "response": resp, "parsed": parsed}) + "\n") | ||
| print(f"batch {b}-{b+len(batch)}: parsed {len(parsed)}/{len(batch)}") |
The LLM judge API call can occasionally time out or fail due to network issues, rate limits, or subprocess hangs (as noted in the Sonnet run crash). Currently, any exception (such as subprocess.TimeoutExpired) raised during call_judge will crash the entire script, causing all previously completed batches in the loop to be lost since the final results are never written to OUT.

Wrapping the call_judge and parsing logic in a try-except block and breaking/continuing gracefully will allow the script to save the completed trials to OUT even if a subsequent batch fails.
for b in range(0, len(trials), batch_size):
    batch = trials[b:b + batch_size]
    prompt = make_prompt(batch)
    try:
        resp = call_judge(prompt)
        parsed = parse(resp)
        choices.update(parsed)
        with open(RAW, "a", encoding="utf-8") as f:
            f.write(json.dumps({"batch_start": b, "prompt_chars": len(prompt),
                                "response": resp, "parsed": parsed}) + "\n")
        print(f"batch {b}-{b+len(batch)}: parsed {len(parsed)}/{len(batch)}")
    except subprocess.TimeoutExpired as e:
        print(f"Timeout on batch {b}-{b+len(batch)}: {e}")
        break
    except Exception as e:
        print(f"Error on batch {b}-{b+len(batch)}: {e}")
        break
| GEMINI_WRAPPER = os.path.join(HOME, ".claude", "skills", "gemini-collab", | ||
| "scripts", "gemini_client.py") |
Hardcoding the path to gemini_client.py inside the user's home directory (~/.claude/skills/...) makes the script non-portable and fragile across different developer environments or CI/CD pipelines. Consider allowing this path to be overridden via an environment variable or command-line argument, with the current path as a default fallback.
GEMINI_WRAPPER = os.environ.get(
    "GEMINI_WRAPPER",
    os.path.join(HOME, ".claude", "skills", "gemini-collab", "scripts", "gemini_client.py")
)
… activation Follow-up to the content-specificity eval. A ceiling test on the RAW L23 activations recovers the source doc well above chance (retrieval 0.24 vs 0.077, p=0.0006; logistic probe 60% on 13-way doc id), so the content the AV fails to surface IS present in the activation. The per-row gap is the verbalizer's, and is fixable with better AV training rather than an intrinsic 2B-L23 ceiling. This also refutes the polysemanticity-at-2B reading for doc-level content, and validates the retrieval method (strong signal in activation, none in AV output). 8-layer sweep: every layer content-rich; L23 middling (0.350), L17 best (0.805), so a future NLA could also retarget the layer. Updates README + RELEASE_CALIBRATION framing (more accurate and more hopeful) and bundles the new scripts + result JSONs (inject probe, activation ceiling, layer sweep). Adds a single-token forced-choice injection probe (also at chance, non-degenerate). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…m is real) The ceiling result reframes the content gap as a verbalizer-training problem, not a model-scale ceiling (60% linear-probe doc id on the raw L23 activation). Update the bottom-line section so the summary carries both halves: the v0.1 AV does not yet read per-row content, AND the information is there to be read. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What this does
Adds a direct content-specificity retrieval eval of the v0.1 AV and corrects the published content-awareness framing based on what it shows.
The eval
Tests whether the AV text recovers the source document it came from (the prior "content-blind" claim only measured the AR round-trip). Doc-level retrieval (chance 1/13) + 5-way LLM-judge forced choice (chance 0.20), 5000-iter permutation nulls.
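The doc-level retrieval score and its permutation null can be sketched as follows. This is an illustration under stated assumptions, not the PR's `perm` function: `doc_top1` and `permutation_p` are hypothetical names, the similarity matrix is a toy, and the null is built by shuffling the document labels of the AV rows.

```python
import random


def doc_top1(sim, row_docs, col_docs):
    """Fraction of rows whose best-scoring column belongs to the row's own document."""
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += col_docs[best] == row_docs[i]
    return hits / len(sim)


def permutation_p(sim, row_docs, col_docs, iters=5000, seed=0):
    """One-sided p-value: how often shuffled row labels match or beat the observed score."""
    rng = random.Random(seed)
    observed = doc_top1(sim, row_docs, col_docs)
    shuffled = list(row_docs)
    at_least = 0
    for _ in range(iters):
        rng.shuffle(shuffled)
        if doc_top1(sim, shuffled, col_docs) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (at_least + 1) / (iters + 1)


# Toy data: 6 rows from 3 documents, each row most similar to its own source.
docs = [0, 0, 1, 1, 2, 2]
signal = [[1.0 if i == j else 0.1 for j in range(6)] for i in range(6)]
flat = [[0.0] * 6 for _ in range(6)]  # no content signal at all

print(permutation_p(signal, docs, docs, iters=2000))
print(permutation_p(flat, docs, docs, iters=2000))
```

In the flat case every shuffle ties the observed score, so the p-value is 1. That is the pattern a format-plausible but content-blind output would be expected to approach.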
All probes at chance: TF-IDF word/char, semantic (MiniLM), tail-window, non-template subset, Claude Haiku judge (0.24, n=50), Claude Sonnet judge (0.267, n=30). Semantic + any-connection judge prompt rule out the "real-but-non-topical feature" rescue (personality / event / tone).
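As a rough cross-check on the judge numbers, an exact binomial tail against the 0.20 chance rate can be computed with the standard library. This swaps in a binomial test for the permutation nulls the eval actually uses, and the correct-answer counts are inferred from the reported accuracies (assumed 12/50 for Haiku and 8/30 for Sonnet), so treat it as an illustration rather than a reproduction.

```python
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumed counts: 0.24 * 50 = 12 and 0.267 * 30 is roughly 8.
for name, k, n in [("Haiku", 12, 50), ("Sonnet", 8, 30)]:
    tail = binom_tail(k, n, 0.20)
    print(f"{name}: {k}/{n} correct, P(X >= {k} | chance 0.20) = {tail:.3f}")
```

Under these assumed counts both tail probabilities come out well above 0.05, consistent with the "at chance" reading.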
Doc corrections
Data
Bundles all eval scripts, inputs (rl.parquet, AV outputs), and per-trial LLM-judge data under experiments/v8_nla_local/ so the suite is reproducible from this repo. Full writeup: experiments/v8_nla_local/CONTENT_SPECIFICITY_EVAL.md.
🤖 Generated with Claude Code