Reframe public tone: verbalizer training gap, not content-blind #17
SolshineCode wants to merge 5 commits into
A direct retrieval eval (does each AV output recover its own source document?) puts the v0.1 AV at chance across lexical, semantic, and two LLM-judge probes. So the output is format/genre-plausible but NOT per-row content- or theme-discriminative. Corrects the mild 'theme-correct' overclaim in README / model card / calibration doc; the 'detail-confabulated' half stands.

Nuance kept: v0.1 output is diverse (45/50 unique strings); the diversity is decoupled from source content. Scope-limit noted (a feature constant across all 13 docs would be invisible to this eval).

Bundles all eval scripts, inputs (rl.parquet, AV outputs), per-trial LLM-judge data, and CONTENT_SPECIFICITY_EVAL.md under experiments/v8_nla_local/ so the suite is reproducible from this repo.

Also fixes v0.0.1 AV SFT-step count in MODEL_CARD_AV (15 -> 55 total = 15 base + 40 continuation; grad_accum 16 -> 4), restoring the value dropped in the 2026-05-17 retraction rewrite (commit 8dd0a99).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… activation

Follow-up to the content-specificity eval. A ceiling test on the RAW L23 activations recovers the source doc well above chance (retrieval 0.24 vs 0.077, p=0.0006; logistic probe 60% on 13-way doc id), so the content the AV fails to surface IS present in the activation. The per-row gap is the verbalizer's, and is fixable with better AV training rather than an intrinsic 2B-L23 ceiling. This also refutes the polysemanticity-at-2B reading for doc-level content, and validates the retrieval method (strong signal in activation, none in AV output).

8-layer sweep: every layer content-rich; L23 middling (0.350), L17 best (0.805), so a future NLA could also retarget the layer.

Updates README + RELEASE_CALIBRATION framing (more accurate and more hopeful) and bundles the new scripts + result JSONs (inject probe, activation ceiling, layer sweep). Adds a single-token forced-choice injection probe (also at chance, non-degenerate).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…m is real)

The ceiling result reframes the content gap as a verbalizer-training problem, not a model-scale ceiling (60% linear-probe doc id on the raw L23 activation). Update the bottom-line section so the summary carries both halves: the v0.1 AV does not yet read per-row content, AND the information is there to be read.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ia Antigravity'

The 5-way Gemini judge was the Antigravity (agy) attempt that auth-failed (no data). 'Gemini 3.5 Flash' was an unverified display name from the agy banner. The actual labeler/judge model identifier is gemini-2.5-flash (verified via CLI telemetry); this row produced no data so it is labeled by tool, not an unverified version.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Lifts the §F76 ceiling finding (already in the deep Limitations) up into the lead surfaces so a skimmer doesn't leave with a 'content-blind' impression -- matching the tone of the Neuronpedia reply to Johnny.

Purely additive (every honest number kept):
- README lead paragraph + Headline numbers: add the verbalizer-not-activation pivot (60% probe, 0.24 doc-retrieval vs 0.077, L17 carries >2x L23 signal).
- MODEL_CARD_AV: counterweight in the content-fidelity limitation + distinctive line.
- MODEL_CARD_AR: note the structural-projection gap is reconstruction-side, content is present in the activation (60% probe).
- lesswrong_post: dated append-only addendum with the same finding.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
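
The retrieval figures quoted in these commit messages compare a top-1 hit rate against a chance level of 1/13 ≈ 0.077 for 13 source documents. A minimal sketch of that kind of check follows, assuming a precomputed row-by-document similarity matrix. The function names, the toy matrix, and the label-shuffling permutation test are illustrative assumptions, not the PR's actual eval scripts.

```python
import random

def top1_accuracy(sim, true_doc):
    """Fraction of rows whose highest-similarity column is the true document."""
    hits = 0
    for row, target in zip(sim, true_doc):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == target)
    return hits / len(sim)

def permutation_p_value(sim, true_doc, n_perm=2000, seed=0):
    """One-sided p-value: how often shuffled labels match or beat the observed accuracy."""
    rng = random.Random(seed)
    observed = top1_accuracy(sim, true_doc)
    labels = list(true_doc)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if top1_accuracy(sim, labels) >= observed:
            at_least += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return observed, (at_least + 1) / (n_perm + 1)

N_DOCS = 13
CHANCE = 1 / N_DOCS  # about 0.077, the baseline quoted above

# Toy data (assumption): 26 rows, two per document, with a perfect signal.
true_doc = [i % N_DOCS for i in range(26)]
sim = [[1.0 if j == t else 0.0 for j in range(N_DOCS)] for t in true_doc]

acc, p = permutation_p_value(sim, true_doc)
print(f"chance={CHANCE:.3f} accuracy={acc:.2f} p={p:.4f}")
```

Shuffling the labels keeps the number of rows per document fixed, so the null distribution reflects the same class balance as the real data.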
/gemini review
Code Review
This pull request introduces a comprehensive content-specificity evaluation suite for the v0.1 Natural Language Autoencoder (NLA) on Gemma-4-E2B, adding several evaluation scripts, results, and updating documentation to reflect that the current AV output is format-plausible but not content-discriminative. Crucially, the added ceiling tests show that the underlying activations do contain the necessary document-level signal, indicating a verbalizer training bottleneck rather than a model ceiling. Feedback on the new code focuses on performance and robustness: optimizing GPU-CPU synchronization overhead in the injection probe script by vectorizing log-probability gathering and pre-computing indices, caching the SentenceTransformer model to avoid redundant reloading, and refactoring the LLM-judge script to use argparse and a non-greedy regex pattern to prevent JSON parsing failures.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
    def hook(module, inp, out):
        if out.shape[1] <= 1:
            return out
        ids, vec = pending["input_ids"], pending["vec"]
        if ids is None or vec is None:
            return out
        h = out.clone()
        for p2 in range(1, ids.shape[1] - 1):
            if (ids[0, p2].item() == INJ_ID and ids[0, p2 - 1].item() == LEFT_ID
                    and ids[0, p2 + 1].item() == RIGHT_ID):
                h[0, p2] = vec[0].to(h.dtype)
                break
        return h

    handle = av.get_input_embeddings().register_forward_hook(hook)

    prefix_ids = tok.encode(AV_TEMPLATE + "\n\n<explanation> This vector is about",
                            return_tensors="pt")
    plen = prefix_ids.shape[1]
Calling .item() on GPU tensors inside a loop triggers synchronous host-device memory copies, which severely degrades performance. Since prefix_ids is constant and always placed at the beginning of full, the injection index inj_idx is static and can be computed once on the CPU during initialization. This completely eliminates the GPU-CPU synchronization overhead during the forward pass hooks.
    prefix_ids = tok.encode(AV_TEMPLATE + "\n\n<explanation> This vector is about",
                            return_tensors="pt")
    plen = prefix_ids.shape[1]

    # Find the injection index once on CPU to avoid GPU-CPU sync in the hook
    prefix_ids_cpu = prefix_ids[0]
    inj_idx = -1
    for i in range(1, len(prefix_ids_cpu) - 1):
        if (prefix_ids_cpu[i].item() == INJ_ID
                and prefix_ids_cpu[i - 1].item() == LEFT_ID
                and prefix_ids_cpu[i + 1].item() == RIGHT_ID):
            inj_idx = i
            break

    def hook(module, inp, out):
        if out.shape[1] <= 1:
            return out
        ids, vec = pending["input_ids"], pending["vec"]
        if ids is None or vec is None:
            return out
        h = out.clone()
        if inj_idx != -1:
            h[0, inj_idx] = vec[0].to(h.dtype)
        return h

    handle = av.get_input_embeddings().register_forward_hook(hook)

    tot = 0.0
    m = cont_ids.shape[1]
    for k in range(m):
        tot += lp[plen + k - 1, full[0, plen + k].item()].item()
Looping over target tokens and calling .item() on GPU tensors inside score causes multiple synchronous host-device transfers per evaluation step. We can vectorize this operation by selecting the relevant slice of lp and using torch.gather to retrieve all target log-probabilities in a single GPU operation, then calling .sum().item() once to copy the scalar to the CPU.
    m = cont_ids.shape[1]
    target_ids = cont_ids.to(av.device).view(m, 1)
    lp_selected = lp[plen - 1 : plen + m - 1]
    target_log_probs = torch.gather(lp_selected, dim=1, index=target_ids)
    tot = target_log_probs.sum().item()

    smoke = "--smoke" in sys.argv
    backend = "gemini" if "--backend" in sys.argv and sys.argv[sys.argv.index("--backend") + 1] == "gemini" else "claude"
    model_override = sys.argv[sys.argv.index("--model") + 1] if "--model" in sys.argv else MODEL
    call_judge = call_judge_gemini if backend == "gemini" else (lambda p: call_judge_claude(p, model_override))
    judge_model = "Gemini 3.5 Flash (Antigravity)" if backend == "gemini" else model_override
    tag = backend if backend == "gemini" else ("claude_" + model_override.split("-")[1] if "-" in model_override else "claude")
    batch_size = 5 if backend == "gemini" else 10
Manual parsing of sys.argv is fragile and can easily raise IndexError if arguments are incomplete or misplaced (e.g., if --backend or --model is passed as the last argument without a value). Refactoring the argument parsing to use Python's standard argparse module provides robust validation, automatic help generation, and safe default values.
    import argparse
    parser = argparse.ArgumentParser(description="LLM-judge forced-choice content-specificity eval")
    parser.add_argument("--smoke", action="store_true", help="Run in smoke test mode")
    parser.add_argument("--backend", choices=["claude", "gemini"], default="claude", help="LLM backend")
    parser.add_argument("--model", default=MODEL, help="Model override")
    args = parser.parse_args()

    smoke = args.smoke
    backend = args.backend
    model_override = args.model
    call_judge = call_judge_gemini if backend == "gemini" else (lambda p: call_judge_claude(p, model_override))
    judge_model = "Gemini 3.5 Flash (Antigravity)" if backend == "gemini" else model_override
    tag = backend if backend == "gemini" else ("claude_" + model_override.split("-")[1] if "-" in model_override else "claude")
    batch_size = 5 if backend == "gemini" else 10

    def parse(resp):
        m = re.search(r"\[.*\]", resp, re.DOTALL)
Using a greedy wildcard .* in re.search with re.DOTALL can over-extend and capture trailing conversational brackets (e.g., [note]) returned by the LLM, which will cause json.loads to fail. Using a non-greedy pattern .*? ensures that only the JSON array itself is captured.
        m = re.search(r"\[\s*\{.*?\}\s*\]", resp, re.DOTALL)
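
To make the difference between the two patterns concrete, here is a small self-contained sketch. The sample response string and the helper name `extract_array` are made up for illustration; only the two regular expressions come from the review.

```python
import json
import re

GREEDY = re.compile(r"\[.*\]", re.DOTALL)
NON_GREEDY = re.compile(r"\[\s*\{.*?\}\s*\]", re.DOTALL)

def extract_array(pattern, resp):
    """Return the JSON array matched by `pattern`, or None if nothing parses."""
    m = pattern.search(resp)
    if not m:
        return None
    try:
        return json.loads(m.group(0))
    except json.JSONDecodeError:
        return None

# Hypothetical judge reply with a trailing bracketed remark after the array.
resp = ('Here you go: [{"trial_id": 1, "choice": 2}, {"trial_id": 2, "choice": 0}] '
        '[note: low confidence]')

greedy_result = extract_array(GREEDY, resp)          # match runs to the last "]"
non_greedy_result = extract_array(NON_GREEDY, resp)  # match stops at the first "}]"
print(greedy_result)
print(non_greedy_result)
```

One caveat on the non-greedy form: it stops at the first `}` followed by `]`, so it would also cut short an array whose objects contain nested arrays of objects. For flat `trial_id`/`choice` records that case should not arise.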
    def make_sim(av, src, kind):
        if kind == "tfidf":
            vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True,
                                  stop_words="english").fit(src + av)
            return cosine_similarity(vec.transform(av), vec.transform(src))
        if kind == "semantic":
            from sentence_transformers import SentenceTransformer
            model = SentenceTransformer("all-MiniLM-L6-v2")
            ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
            es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
            return ea @ es.T
The SentenceTransformer model is re-loaded and re-initialized from scratch every time make_sim is called with kind='semantic'. Since make_sim is called multiple times during the evaluation, this redundant loading of model weights severely slows down the script. Caching the initialized model in a mutable default argument or a local cache dictionary avoids this overhead.
    def make_sim(av, src, kind, _model_cache={}):
        if kind == "tfidf":
            vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True,
                                  stop_words="english").fit(src + av)
            return cosine_similarity(vec.transform(av), vec.transform(src))
        if kind == "semantic":
            if "model" not in _model_cache:
                from sentence_transformers import SentenceTransformer
                _model_cache["model"] = SentenceTransformer("all-MiniLM-L6-v2")
            model = _model_cache["model"]
            ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
            es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
            return ea @ es.T
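
An alternative to the mutable-default-argument cache is `functools.lru_cache` on a small loader function, which some readers may find easier to follow. The sketch below uses a stand-in class so it runs without sentence-transformers; `FakeModel`, `load_model`, and the toy similarity are hypothetical names for illustration only.

```python
from functools import lru_cache

LOAD_COUNT = 0  # counts how many times the expensive constructor runs

class FakeModel:
    """Stand-in for an expensive model such as SentenceTransformer (assumption)."""
    def __init__(self, name):
        global LOAD_COUNT
        LOAD_COUNT += 1
        self.name = name

    def encode(self, texts):
        # Toy "embedding": text length only, enough to exercise the cache.
        return [[float(len(t))] for t in texts]

@lru_cache(maxsize=None)
def load_model(name):
    """Construct the model once per name; later calls return the cached object."""
    return FakeModel(name)

def make_sim(av, src):
    model = load_model("all-MiniLM-L6-v2")
    ea, es = model.encode(av), model.encode(src)
    return [[a[0] * s[0] for s in es] for a in ea]

make_sim(["ab"], ["cde"])
make_sim(["ab"], ["cde"])
print(LOAD_COUNT)  # prints 1: the constructor ran once across both calls
```

Either approach keeps the model alive for the lifetime of the process, which is the intended trade-off for a short eval script.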
Code Review
This pull request introduces a comprehensive suite of evaluation scripts and results (including lexical, semantic, and LLM-judge probes) to assess the content-specificity of the Gemma-4-E2B NLA model, updating the documentation and model cards to reflect that the current AV's content-fidelity gap is a verbalizer training bottleneck rather than an intrinsic model ceiling. The review feedback focuses on improving the robustness of these new scripts, suggesting graceful handling of missing sentence-transformers dependencies, more resilient parsing of LLM-judge JSON outputs, adopting argparse for command-line options, and wrapping LLM API subprocess calls in try-except blocks to handle timeouts and network errors without losing progress.
        if kind == "semantic":
            from sentence_transformers import SentenceTransformer
            model = SentenceTransformer("all-MiniLM-L6-v2")
            ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
            es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
            return ea @ es.T
If sentence-transformers is not installed, importing SentenceTransformer will raise an ImportError and crash the script. Wrapping the import in a try-except block and returning None allows the script to handle the missing dependency gracefully, matching the behavior in eval_content_specificity.py.
        if kind == "semantic":
            try:
                from sentence_transformers import SentenceTransformer
            except ImportError:
                return None
            model = SentenceTransformer("all-MiniLM-L6-v2")
            ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
            es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
            return ea @ es.T
    for kind in ["tfidf", "semantic"]:
        # R1: full vs tail-window source
        sim_full = make_sim(av_all, full_src, kind)
        sim_tail = make_sim(av_all, tail_src, kind)
If make_sim returns None because sentence-transformers is not installed, the script will crash with an AttributeError when calling perm on None. Checking if sim_full is None and skipping the semantic evaluation gracefully prevents this crash.
    for kind in ["tfidf", "semantic"]:
        # R1: full vs tail-window source
        sim_full = make_sim(av_all, full_src, kind)
        if sim_full is None:
            print(f"[{kind}] skipped (sentence-transformers not installed)")
            continue
        sim_tail = make_sim(av_all, tail_src, kind)
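
Another way to handle the optional dependency is to decide up front which probe kinds can run, so the loop never sees a `None` similarity matrix. A minimal sketch, assuming only that the semantic probe needs the `sentence_transformers` package; the helper names `has_module` and `available_kinds` are hypothetical:

```python
import importlib.util

def has_module(name):
    """True if the top-level module `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None

def available_kinds():
    """List the probe kinds whose dependencies are present (illustrative only)."""
    kinds = ["tfidf"]
    if has_module("sentence_transformers"):
        kinds.append("semantic")
    return kinds

print(available_kinds())
```

This only checks that the package is discoverable; an import can still fail later for other reasons, so the `try`/`except ImportError` guard in `make_sim` remains useful.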
    def parse(resp):
        m = re.search(r"\[.*\]", resp, re.DOTALL)
        if not m:
            return {}
        try:
            arr = json.loads(m.group(0))
            return {int(o["trial_id"]): int(o["choice"]) for o in arr if "choice" in o}
        except Exception:
            return {}
If a single item in the JSON array returned by the LLM is malformed or has type conversion issues, the entire batch of choices is currently discarded. Processing items individually inside the loop and catching conversion errors per item makes the parser much more robust against minor LLM formatting inconsistencies.
    def parse(resp):
        m = re.search(r"\[.*\]", resp, re.DOTALL)
        if not m:
            return {}
        try:
            arr = json.loads(m.group(0))
            res = {}
            for o in arr:
                try:
                    if "trial_id" in o and "choice" in o:
                        res[int(o["trial_id"])] = int(o["choice"])
                except (ValueError, TypeError):
                    continue
            return res
        except Exception:
            return {}
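
The effect of per-item error handling is easiest to see on a response where one entry is malformed. The sketch below reimplements both parsers side by side under hypothetical names (`parse_all_or_nothing`, `parse_per_item`); the sample response is invented for illustration.

```python
import json
import re

def parse_all_or_nothing(resp):
    """Original shape: one bad item discards the whole batch."""
    m = re.search(r"\[.*\]", resp, re.DOTALL)
    if not m:
        return {}
    try:
        arr = json.loads(m.group(0))
        return {int(o["trial_id"]): int(o["choice"]) for o in arr if "choice" in o}
    except Exception:
        return {}

def parse_per_item(resp):
    """Suggested shape: skip only the items that fail to convert."""
    m = re.search(r"\[.*\]", resp, re.DOTALL)
    if not m:
        return {}
    try:
        arr = json.loads(m.group(0))
    except json.JSONDecodeError:
        return {}
    res = {}
    for o in arr:
        try:
            if "trial_id" in o and "choice" in o:
                res[int(o["trial_id"])] = int(o["choice"])
        except (ValueError, TypeError):
            continue  # malformed item: keep the rest of the batch
    return res

# Hypothetical reply: the second item has a non-numeric choice.
resp = ('[{"trial_id": 1, "choice": 2}, {"trial_id": 2, "choice": "n/a"}, '
        '{"trial_id": 3, "choice": "0"}]')

print(parse_all_or_nothing(resp))  # prints {}
print(parse_per_item(resp))        # prints {1: 2, 3: 0}
```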
    def main():
        smoke = "--smoke" in sys.argv
        backend = "gemini" if "--backend" in sys.argv and sys.argv[sys.argv.index("--backend") + 1] == "gemini" else "claude"
        model_override = sys.argv[sys.argv.index("--model") + 1] if "--model" in sys.argv else MODEL
        call_judge = call_judge_gemini if backend == "gemini" else (lambda p: call_judge_claude(p, model_override))
        judge_model = "Gemini 3.5 Flash (Antigravity)" if backend == "gemini" else model_override
        tag = backend if backend == "gemini" else ("claude_" + model_override.split("-")[1] if "-" in model_override else "claude")
        batch_size = 5 if backend == "gemini" else 10
Manual command-line argument parsing using sys.argv.index is fragile and can raise IndexError if an option is passed without a value at the end of the command. Refactoring to use Python's standard argparse module provides robust parsing, automatic help generation, and cleaner code.
    def main():
        import argparse
        parser = argparse.ArgumentParser(description="LLM-judge forced-choice content-specificity eval")
        parser.add_argument("--smoke", action="store_true", help="Run a smoke test with 1 batch")
        parser.add_argument("--backend", choices=["claude", "gemini"], default="claude", help="LLM backend to use")
        parser.add_argument("--model", default=MODEL, help="Model override for the Claude backend")
        args = parser.parse_args()
        smoke = args.smoke
        backend = args.backend
        model_override = args.model
        call_judge = call_judge_gemini if backend == "gemini" else (lambda p: call_judge_claude(p, model_override))
        judge_model = "Gemini 3.5 Flash (Antigravity)" if backend == "gemini" else model_override
        tag = backend if backend == "gemini" else ("claude_" + model_override.split("-")[1] if "-" in model_override else "claude")
        batch_size = 5 if backend == "gemini" else 10
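
The failure mode described in this comment can be reproduced in isolation. The sketch below mirrors the three options from the suggestion, with a placeholder default model name (`example-model` is an assumption, standing in for `MODEL`), and shows that argparse rejects an option with a missing value instead of raising `IndexError`.

```python
import argparse

def build_parser(default_model="example-model"):
    """Parser with the same three options as the suggestion (default is a placeholder)."""
    parser = argparse.ArgumentParser(description="argument-parsing sketch")
    parser.add_argument("--smoke", action="store_true")
    parser.add_argument("--backend", choices=["claude", "gemini"], default="claude")
    parser.add_argument("--model", default=default_model)
    return parser

def try_parse(argv):
    """Return parsed args, or None when argparse rejects the command line."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        # argparse prints a usage message to stderr and exits with status 2.
        return None

def manual_model(argv, default="example-model"):
    """The fragile pattern: indexes one past the flag without checking bounds."""
    return argv[argv.index("--model") + 1] if "--model" in argv else default

ok = try_parse(["--backend", "gemini", "--smoke"])
print(ok.backend, ok.smoke, ok.model)
print(try_parse(["--model"]))  # value missing: rejected cleanly, returns None
```

With the manual pattern, `manual_model(["--model"])` raises `IndexError`, which is the crash the review describes.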
    for b in range(0, len(trials), batch_size):
        batch = trials[b:b + batch_size]
        prompt = make_prompt(batch)
        resp = call_judge(prompt)
        parsed = parse(resp)
        choices.update(parsed)
        with open(RAW, "a", encoding="utf-8") as f:
            f.write(json.dumps({"batch_start": b, "prompt_chars": len(prompt),
                                "response": resp, "parsed": parsed}) + "\n")
        print(f"batch {b}-{b+len(batch)}: parsed {len(parsed)}/{len(batch)}")
LLM API calls are prone to transient network errors or timeouts (as noted in the Sonnet run's crash). Wrapping the call_judge and parse calls in a try-except block catching subprocess.SubprocessError and subprocess.TimeoutExpired allows the script to gracefully handle failures, log the error, and break/save the completed trials instead of crashing and losing all progress.
    for b in range(0, len(trials), batch_size):
        batch = trials[b:b + batch_size]
        prompt = make_prompt(batch)
        try:
            resp = call_judge(prompt)
            parsed = parse(resp)
            choices.update(parsed)
            with open(RAW, "a", encoding="utf-8") as f:
                f.write(json.dumps({"batch_start": b, "prompt_chars": len(prompt),
                                    "response": resp, "parsed": parsed}) + "\n")
        except (subprocess.SubprocessError, subprocess.TimeoutExpired) as e:
            print(f"Error or timeout on batch {b}: {e}")
            break
        print(f"batch {b}-{b+len(batch)}: parsed {len(parsed)}/{len(batch)}")
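
A standalone sketch of the stop-but-keep-progress pattern follows. It assumes the judge is invoked through `subprocess.run`, which the review implies but the excerpt does not show. The child commands here are trivial Python one-liners standing in for the real CLI, and `call_with_timeout` / `run_batches` are hypothetical names. Note that `subprocess.TimeoutExpired` and `subprocess.CalledProcessError` are both subclasses of `subprocess.SubprocessError`, so catching the base class covers both.

```python
import subprocess
import sys

def call_with_timeout(code, timeout_s):
    """Run a Python one-liner in a child process; return stdout, or None on failure."""
    try:
        proc = subprocess.run([sys.executable, "-c", code], capture_output=True,
                              text=True, timeout=timeout_s, check=True)
        return proc.stdout
    except subprocess.SubprocessError as e:  # TimeoutExpired, CalledProcessError, ...
        print(f"call failed: {type(e).__name__}")
        return None

def run_batches(snippets, timeout_s=30.0):
    """Process batches in order; on the first failure stop and keep earlier results."""
    done = []
    for i, code in enumerate(snippets):
        out = call_with_timeout(code, timeout_s)
        if out is None:
            break  # later batches are not attempted, earlier ones are preserved
        done.append((i, out.strip()))
    return done

# The second "batch" exits non-zero, standing in for a crashed or timed-out judge call.
results = run_batches(["print('a')", "import sys; sys.exit(3)", "print('c')"])
print(results)
```

Whether to `break` or `continue` after a failed batch is a design choice: `break` avoids hammering a backend that is down, while `continue` would salvage later batches after a one-off error.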
Lifts the content-specificity ceiling finding (already in the deep Limitations) up into the lead surfaces so a skimmer doesn't leave with a "content-blind" impression. Matches the tone of the Neuronpedia reply to Johnny: lead with the gap, then pivot to "the content really is in there to read."
Purely additive — every honest number kept (at-chance retrieval, the n=50 head-to-head, the descope caveats all stay). Diff is 9 insertions / 4 modified lines.
🤖 Generated with Claude Code