Reframe public tone: verbalizer training gap, not content-blind #17

Closed
SolshineCode wants to merge 5 commits into main from content-specificity-eval

@SolshineCode
Owner

Lifts the content-specificity ceiling finding (already in the deep Limitations) up into the lead surfaces so a skimmer doesn't leave with a "content-blind" impression. Matches the tone of the Neuronpedia reply to Johnny: lead with the gap, then pivot to "the content really is in there to read."

Purely additive — every honest number kept (at-chance retrieval, the n=50 head-to-head, the descope caveats all stay). Diff is 9 insertions / 4 modified lines.

  • README — lead paragraph + a new Headline-numbers bullet (60% probe, 0.24 doc-retrieval vs 0.077 chance, L17 carries >2x L23 signal; verbalizer training gap, not a 2B ceiling).
  • MODEL_CARD_AV — counterweight in the content-fidelity limitation + the distinctive-features line.
  • MODEL_CARD_AR — note the structural-projection gap is reconstruction-side; content is present in the activation.
  • lesswrong_post — dated append-only addendum (preserves the narrative voice).

🤖 Generated with Claude Code

SolshineCode and others added 5 commits June 1, 2026 00:04
A direct retrieval eval (does each AV output recover its own source document?)
puts the v0.1 AV at chance across lexical, semantic, and two LLM-judge probes.
So the output is format/genre-plausible but NOT per-row content- or
theme-discriminative. Corrects the mild 'theme-correct' overclaim in README /
model card / calibration doc; the 'detail-confabulated' half stands.

Nuance kept: v0.1 output is diverse (45/50 unique strings); the diversity is
decoupled from source content. Scope-limit noted (a feature constant across all
13 docs would be invisible to this eval).

Bundles all eval scripts, inputs (rl.parquet, AV outputs), per-trial LLM-judge
data, and CONTENT_SPECIFICITY_EVAL.md under experiments/v8_nla_local/ so the
suite is reproducible from this repo.

Also fixes v0.0.1 AV SFT-step count in MODEL_CARD_AV (15 -> 55 total = 15 base +
40 continuation; grad_accum 16 -> 4), restoring the value dropped in the
2026-05-17 retraction rewrite (commit 8dd0a99).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… activation

Follow-up to the content-specificity eval. A ceiling test on the RAW L23
activations recovers the source doc well above chance (retrieval 0.24 vs 0.077,
p=0.0006; logistic probe 60% on 13-way doc id), so the content the AV fails to
surface IS present in the activation. The per-row gap is the verbalizer's, and is
fixable with better AV training rather than an intrinsic 2B-L23 ceiling. This also
refutes the polysemanticity-at-2B reading for doc-level content, and validates the
retrieval method (strong signal in activation, none in AV output).

8-layer sweep: every layer content-rich; L23 middling (0.350), L17 best (0.805),
so a future NLA could also retarget the layer.

Updates README + RELEASE_CALIBRATION framing (more accurate and more hopeful) and
bundles the new scripts + result JSONs (inject probe, activation ceiling, layer
sweep). Adds a single-token forced-choice injection probe (also at chance,
non-degenerate).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…m is real)

The ceiling result reframes the content gap as a verbalizer-training problem, not
a model-scale ceiling (60% linear-probe doc id on the raw L23 activation). Update
the bottom-line section so the summary carries both halves: the v0.1 AV does not
yet read per-row content, AND the information is there to be read.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ia Antigravity'

The 5-way Gemini judge was the Antigravity (agy) attempt that auth-failed (no data).
'Gemini 3.5 Flash' was an unverified display name from the agy banner. The actual
labeler/judge model identifier is gemini-2.5-flash (verified via CLI telemetry); this
row produced no data so it is labeled by tool, not an unverified version.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Lifts the §F76 ceiling finding (already in the deep Limitations) up into the lead
surfaces so a skimmer doesn't leave with a 'content-blind' impression -- matching the
tone of the Neuronpedia reply to Johnny. Purely additive (every honest number kept):
- README lead paragraph + Headline numbers: add the verbalizer-not-activation pivot
  (60% probe, 0.24 doc-retrieval vs 0.077, L17 carries >2x L23 signal).
- MODEL_CARD_AV: counterweight in the content-fidelity limitation + distinctive line.
- MODEL_CARD_AR: note the structural-projection gap is reconstruction-side, content is
  present in the activation (60% probe).
- lesswrong_post: dated append-only addendum with the same finding.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
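
The retrieval-versus-chance comparison in the commit messages above (0.24 against a 0.077 baseline, which is 1/13 for 13 source documents) can be illustrated with a small sketch. The eval scripts themselves are not shown in this thread, so everything below is an assumption: `top1_accuracy`, `permutation_p_value`, and the toy similarity matrix are hypothetical names and data, and the repository may well compute its p-value differently.

```python
import random

N_DOCS = 13
CHANCE = 1 / N_DOCS  # about 0.077, the baseline quoted in the commit message


def top1_accuracy(sim, labels):
    """Fraction of rows whose best-scoring column is the row's true document."""
    hits = 0
    for row, true_doc in zip(sim, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        hits += int(predicted == true_doc)
    return hits / len(labels)


def permutation_p_value(sim, labels, n_perm=2000, seed=0):
    """One-sided p-value: how often shuffled labels score at least as well."""
    rng = random.Random(seed)
    observed = top1_accuracy(sim, labels)
    shuffled = list(labels)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if top1_accuracy(sim, shuffled) >= observed:
            at_least += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return observed, (at_least + 1) / (n_perm + 1)


# Toy data (not the repo's): 26 rows, two per document, perfectly separable.
labels = [i % N_DOCS for i in range(2 * N_DOCS)]
sim = [[1.0 if doc == label else 0.0 for doc in range(N_DOCS)] for label in labels]

accuracy, p_value = permutation_p_value(sim, labels)
print(f"chance={CHANCE:.3f} accuracy={accuracy:.2f} p={p_value:.4f}")
```

Under the same kind of test, an output that is at chance (as reported for the v0.1 AV) would show an accuracy near `CHANCE` and a large p-value.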
@SolshineCode
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive content-specificity evaluation suite for the v0.1 Natural Language Autoencoder (NLA) on Gemma-4-E2B, adding several evaluation scripts, results, and updating documentation to reflect that the current AV output is format-plausible but not content-discriminative. Crucially, the added ceiling tests show that the underlying activations do contain the necessary document-level signal, indicating a verbalizer training bottleneck rather than a model ceiling. Feedback on the new code focuses on performance and robustness: optimizing GPU-CPU synchronization overhead in the injection probe script by vectorizing log-probability gathering and pre-computing indices, caching the SentenceTransformer model to avoid redundant reloading, and refactoring the LLM-judge script to use argparse and a non-greedy regex pattern to prevent JSON parsing failures.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +100 to +118
    def hook(module, inp, out):
        if out.shape[1] <= 1:
            return out
        ids, vec = pending["input_ids"], pending["vec"]
        if ids is None or vec is None:
            return out
        h = out.clone()
        for p2 in range(1, ids.shape[1] - 1):
            if (ids[0, p2].item() == INJ_ID and ids[0, p2 - 1].item() == LEFT_ID
                    and ids[0, p2 + 1].item() == RIGHT_ID):
                h[0, p2] = vec[0].to(h.dtype)
                break
        return h

    handle = av.get_input_embeddings().register_forward_hook(hook)

    prefix_ids = tok.encode(AV_TEMPLATE + "\n\n<explanation> This vector is about",
                            return_tensors="pt")
    plen = prefix_ids.shape[1]

high

Calling .item() on GPU tensors inside a loop triggers synchronous host-device memory copies, which severely degrades performance. Since prefix_ids is constant and always placed at the beginning of full, the injection index inj_idx is static and can be computed once on the CPU during initialization. This completely eliminates the GPU-CPU synchronization overhead during the forward pass hooks.

    prefix_ids = tok.encode(AV_TEMPLATE + "\n\n<explanation> This vector is about",
                            return_tensors="pt")
    plen = prefix_ids.shape[1]

    # Find the injection index once on CPU to avoid GPU-CPU sync in the hook
    prefix_ids_cpu = prefix_ids[0]
    inj_idx = -1
    for i in range(1, len(prefix_ids_cpu) - 1):
        if (prefix_ids_cpu[i].item() == INJ_ID
                and prefix_ids_cpu[i - 1].item() == LEFT_ID
                and prefix_ids_cpu[i + 1].item() == RIGHT_ID):
            inj_idx = i
            break

    def hook(module, inp, out):
        if out.shape[1] <= 1:
            return out
        ids, vec = pending["input_ids"], pending["vec"]
        if ids is None or vec is None:
            return out
        h = out.clone()
        if inj_idx != -1:
            h[0, inj_idx] = vec[0].to(h.dtype)
        return h

    handle = av.get_input_embeddings().register_forward_hook(hook)

Comment on lines +127 to +130
        tot = 0.0
        m = cont_ids.shape[1]
        for k in range(m):
            tot += lp[plen + k - 1, full[0, plen + k].item()].item()

high

Looping over target tokens and calling .item() on GPU tensors inside score causes multiple synchronous host-device transfers per evaluation step. We can vectorize this operation by selecting the relevant slice of lp and using torch.gather to retrieve all target log-probabilities in a single GPU operation, then calling .sum().item() once to copy the scalar to the CPU.

        m = cont_ids.shape[1]
        target_ids = cont_ids.to(av.device).view(m, 1)
        lp_selected = lp[plen - 1 : plen + m - 1]
        target_log_probs = torch.gather(lp_selected, dim=1, index=target_ids)
        tot = target_log_probs.sum().item()

Comment on lines +124 to +130
    smoke = "--smoke" in sys.argv
    backend = "gemini" if "--backend" in sys.argv and sys.argv[sys.argv.index("--backend") + 1] == "gemini" else "claude"
    model_override = sys.argv[sys.argv.index("--model") + 1] if "--model" in sys.argv else MODEL
    call_judge = call_judge_gemini if backend == "gemini" else (lambda p: call_judge_claude(p, model_override))
    judge_model = "Gemini 3.5 Flash (Antigravity)" if backend == "gemini" else model_override
    tag = backend if backend == "gemini" else ("claude_" + model_override.split("-")[1] if "-" in model_override else "claude")
    batch_size = 5 if backend == "gemini" else 10

medium

Manual parsing of sys.argv is fragile and can easily raise IndexError if arguments are incomplete or misplaced (e.g., if --backend or --model is passed as the last argument without a value). Refactoring the argument parsing to use Python's standard argparse module provides robust validation, automatic help generation, and safe default values.

    import argparse
    parser = argparse.ArgumentParser(description="LLM-judge forced-choice content-specificity eval")
    parser.add_argument("--smoke", action="store_true", help="Run in smoke test mode")
    parser.add_argument("--backend", choices=["claude", "gemini"], default="claude", help="LLM backend")
    parser.add_argument("--model", default=MODEL, help="Model override")
    args = parser.parse_args()

    smoke = args.smoke
    backend = args.backend
    model_override = args.model
    call_judge = call_judge_gemini if backend == "gemini" else (lambda p: call_judge_claude(p, model_override))
    judge_model = "Gemini 3.5 Flash (Antigravity)" if backend == "gemini" else model_override
    tag = backend if backend == "gemini" else ("claude_" + model_override.split("-")[1] if "-" in model_override else "claude")
    batch_size = 5 if backend == "gemini" else 10
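
To make the failure mode concrete, here is a small self-contained sketch that contrasts the index-based lookup with `argparse` on a trailing `--backend` flag. `DEFAULT_MODEL`, `build_parser`, and `manual_backend` are hypothetical names introduced for illustration; only the flag names are taken from the script under review.

```python
import argparse

DEFAULT_MODEL = "placeholder-model-id"  # hypothetical stand-in for the script's MODEL constant


def build_parser():
    parser = argparse.ArgumentParser(description="argument-parsing sketch")
    parser.add_argument("--smoke", action="store_true")
    parser.add_argument("--backend", choices=["claude", "gemini"], default="claude")
    parser.add_argument("--model", default=DEFAULT_MODEL)
    return parser


def manual_backend(argv):
    """The index-based lookup from the original script, isolated for comparison."""
    if "--backend" in argv and argv[argv.index("--backend") + 1] == "gemini":
        return "gemini"
    return "claude"


# A trailing flag with no value: manual indexing raises IndexError.
try:
    manual_backend(["--backend"])
    manual_error = None
except IndexError as exc:
    manual_error = type(exc).__name__

# argparse reports the same mistake with a usage message and exits with status 2.
try:
    build_parser().parse_args(["--backend"])
    argparse_exit = None
except SystemExit as exc:
    argparse_exit = exc.code

args = build_parser().parse_args(["--backend", "gemini", "--smoke"])
print(manual_error, argparse_exit, args.backend, args.smoke, args.model)
```

Passing an explicit list to `parse_args` also makes the parser easy to exercise in a test without touching `sys.argv`.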



def parse(resp):
    m = re.search(r"\[.*\]", resp, re.DOTALL)

medium

Using a greedy wildcard .* in re.search with re.DOTALL can over-extend and capture trailing conversational brackets (e.g., [note]) returned by the LLM, which will cause json.loads to fail. Using a non-greedy pattern .*? ensures that only the JSON array itself is captured.

Suggested change
-    m = re.search(r"\[.*\]", resp, re.DOTALL)
+    m = re.search(r"\[\s*\{.*?\}\s*\]", resp, re.DOTALL)
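
The difference between the two patterns is easiest to see on a concrete reply. The sketch below uses a made-up judge response (the text in `resp` is hypothetical, as is the helper `try_parse`) and runs both regexes over it.

```python
import json
import re

GREEDY = re.compile(r"\[.*\]", re.DOTALL)
NON_GREEDY = re.compile(r"\[\s*\{.*?\}\s*\]", re.DOTALL)

# Hypothetical judge reply: a JSON array followed by a bracketed aside.
resp = (
    'Here are my answers:\n'
    '[{"trial_id": 1, "choice": 2}, {"trial_id": 2, "choice": 0}]\n'
    '[note: low confidence on trial 2]'
)


def try_parse(pattern, text):
    """Return the decoded array, or None when there is no match or invalid JSON."""
    m = pattern.search(text)
    if not m:
        return None
    try:
        return json.loads(m.group(0))
    except json.JSONDecodeError:
        return None


greedy_result = try_parse(GREEDY, resp)          # match runs through "[note: ...]"
non_greedy_result = try_parse(NON_GREEDY, resp)  # match ends at the array's closing "]"
print(greedy_result, non_greedy_result)
```

One caveat worth keeping in mind: the non-greedy pattern stops at the first `}` that is followed by `]`, so an object containing a nested array of objects would be cut short. For the flat `trial_id`/`choice` records used here that should not arise.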

Comment on lines +68 to +78
def make_sim(av, src, kind):
    if kind == "tfidf":
        vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True,
                              stop_words="english").fit(src + av)
        return cosine_similarity(vec.transform(av), vec.transform(src))
    if kind == "semantic":
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("all-MiniLM-L6-v2")
        ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
        es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
        return ea @ es.T

medium

The SentenceTransformer model is re-loaded and re-initialized from scratch every time make_sim is called with kind='semantic'. Since make_sim is called multiple times during the evaluation, this redundant loading of model weights severely slows down the script. Caching the initialized model in a mutable default argument or a local cache dictionary avoids this overhead.

def make_sim(av, src, kind, _model_cache={}):
    if kind == "tfidf":
        vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True,
                              stop_words="english").fit(src + av)
        return cosine_similarity(vec.transform(av), vec.transform(src))
    if kind == "semantic":
        if "model" not in _model_cache:
            from sentence_transformers import SentenceTransformer
            _model_cache["model"] = SentenceTransformer("all-MiniLM-L6-v2")
        model = _model_cache["model"]
        ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
        es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
        return ea @ es.T
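
An alternative to the mutable default argument is `functools.lru_cache` on a small loader function, which is a different technique from the one suggested above but has the same effect. The sketch below is illustrative only: `FakeEncoder`, `get_encoder`, and `make_sim_semantic` are hypothetical stand-ins so the example runs without `sentence-transformers` installed.

```python
from functools import lru_cache

load_count = 0


class FakeEncoder:
    """Hypothetical stand-in for a model that is slow to construct."""

    def __init__(self, name):
        self.name = name

    def encode(self, texts):
        return [len(t) for t in texts]


@lru_cache(maxsize=None)
def get_encoder(name):
    """Build the encoder once per name; later calls reuse the cached object."""
    global load_count
    load_count += 1
    return FakeEncoder(name)


def make_sim_semantic(av, src):
    model = get_encoder("all-MiniLM-L6-v2")
    return model.encode(av), model.encode(src)


# Two calls, one construction.
make_sim_semantic(["a", "bb"], ["ccc"])
make_sim_semantic(["dddd"], ["e"])
print(load_count)
```

Keeping the cache in a dedicated loader leaves the signature of `make_sim` unchanged and avoids a hidden extra parameter that a caller could accidentally override.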

@SolshineCode
Owner Author

Superseded by #18 (clean branch off main; #17 conflicted due to the squash-merge of #16 on the reused branch).

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive suite of evaluation scripts and results (including lexical, semantic, and LLM-judge probes) to assess the content-specificity of the Gemma-4-E2B NLA model, updating the documentation and model cards to reflect that the current AV's content-fidelity gap is a verbalizer training bottleneck rather than an intrinsic model ceiling. The review feedback focuses on improving the robustness of these new scripts, suggesting graceful handling of missing sentence-transformers dependencies, more resilient parsing of LLM-judge JSON outputs, adopting argparse for command-line options, and wrapping LLM API subprocess calls in try-except blocks to handle timeouts and network errors without losing progress.

Comment on lines +73 to +78
    if kind == "semantic":
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("all-MiniLM-L6-v2")
        ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
        es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
        return ea @ es.T

medium

If sentence-transformers is not installed, importing SentenceTransformer will raise an ImportError and crash the script. Wrapping the import in a try-except block and returning None allows the script to handle the missing dependency gracefully, matching the behavior in eval_content_specificity.py.

Suggested change
-    if kind == "semantic":
-        from sentence_transformers import SentenceTransformer
-        model = SentenceTransformer("all-MiniLM-L6-v2")
-        ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
-        es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
-        return ea @ es.T
+    if kind == "semantic":
+        try:
+            from sentence_transformers import SentenceTransformer
+        except ImportError:
+            return None
+        model = SentenceTransformer("all-MiniLM-L6-v2")
+        ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
+        es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
+        return ea @ es.T

Comment on lines +93 to +96
    for kind in ["tfidf", "semantic"]:
        # R1: full vs tail-window source
        sim_full = make_sim(av_all, full_src, kind)
        sim_tail = make_sim(av_all, tail_src, kind)

medium

If make_sim returns None because sentence-transformers is not installed, the script will crash with an AttributeError when calling perm on None. Checking if sim_full is None and skipping the semantic evaluation gracefully prevents this crash.

Suggested change
-    for kind in ["tfidf", "semantic"]:
-        # R1: full vs tail-window source
-        sim_full = make_sim(av_all, full_src, kind)
-        sim_tail = make_sim(av_all, tail_src, kind)
+    for kind in ["tfidf", "semantic"]:
+        # R1: full vs tail-window source
+        sim_full = make_sim(av_all, full_src, kind)
+        if sim_full is None:
+            print(f"[{kind}] skipped (sentence-transformers not installed)")
+            continue
+        sim_tail = make_sim(av_all, tail_src, kind)
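
The two suggestions above work as a pair: the helper returns `None` when the optional dependency is missing, and the caller checks for `None` before using the result. A minimal runnable sketch of that pattern follows. The module name `nla_sketch_missing_backend` is deliberately fictitious so the import fails on any machine, and `optional_import` and the toy `make_sim` are hypothetical.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def make_sim(kind):
    """Toy version: returns a score, or None when the backend is unavailable."""
    if kind == "tfidf":
        return 1.0
    # Fictitious module name, chosen so that the import always fails here.
    backend = optional_import("nla_sketch_missing_backend")
    if backend is None:
        return None
    return 2.0


results, skipped = {}, []
for kind in ["tfidf", "semantic"]:
    sim = make_sim(kind)
    if sim is None:
        skipped.append(kind)  # record the skip instead of crashing downstream
        continue
    results[kind] = sim

print(results, skipped)
```

Recording which kinds were skipped, rather than only printing, makes it harder to mistake a missing dependency for a null result in the final report.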

Comment on lines +112 to +120
def parse(resp):
    m = re.search(r"\[.*\]", resp, re.DOTALL)
    if not m:
        return {}
    try:
        arr = json.loads(m.group(0))
        return {int(o["trial_id"]): int(o["choice"]) for o in arr if "choice" in o}
    except Exception:
        return {}

medium

If a single item in the JSON array returned by the LLM is malformed or has type conversion issues, the entire batch of choices is currently discarded. Processing items individually inside the loop and catching conversion errors per item makes the parser much more robust against minor LLM formatting inconsistencies.

Suggested change
-def parse(resp):
-    m = re.search(r"\[.*\]", resp, re.DOTALL)
-    if not m:
-        return {}
-    try:
-        arr = json.loads(m.group(0))
-        return {int(o["trial_id"]): int(o["choice"]) for o in arr if "choice" in o}
-    except Exception:
-        return {}
+def parse(resp):
+    m = re.search(r"\[.*\]", resp, re.DOTALL)
+    if not m:
+        return {}
+    try:
+        arr = json.loads(m.group(0))
+        res = {}
+        for o in arr:
+            try:
+                if "trial_id" in o and "choice" in o:
+                    res[int(o["trial_id"])] = int(o["choice"])
+            except (ValueError, TypeError):
+                continue
+        return res
+    except Exception:
+        return {}

Comment on lines +123 to +130
def main():
    smoke = "--smoke" in sys.argv
    backend = "gemini" if "--backend" in sys.argv and sys.argv[sys.argv.index("--backend") + 1] == "gemini" else "claude"
    model_override = sys.argv[sys.argv.index("--model") + 1] if "--model" in sys.argv else MODEL
    call_judge = call_judge_gemini if backend == "gemini" else (lambda p: call_judge_claude(p, model_override))
    judge_model = "Gemini 3.5 Flash (Antigravity)" if backend == "gemini" else model_override
    tag = backend if backend == "gemini" else ("claude_" + model_override.split("-")[1] if "-" in model_override else "claude")
    batch_size = 5 if backend == "gemini" else 10

medium

Manual command-line argument parsing using sys.argv.index is fragile and can raise IndexError if an option is passed without a value at the end of the command. Refactoring to use Python's standard argparse module provides robust parsing, automatic help generation, and cleaner code.

Suggested change
-def main():
-    smoke = "--smoke" in sys.argv
-    backend = "gemini" if "--backend" in sys.argv and sys.argv[sys.argv.index("--backend") + 1] == "gemini" else "claude"
-    model_override = sys.argv[sys.argv.index("--model") + 1] if "--model" in sys.argv else MODEL
-    call_judge = call_judge_gemini if backend == "gemini" else (lambda p: call_judge_claude(p, model_override))
-    judge_model = "Gemini 3.5 Flash (Antigravity)" if backend == "gemini" else model_override
-    tag = backend if backend == "gemini" else ("claude_" + model_override.split("-")[1] if "-" in model_override else "claude")
-    batch_size = 5 if backend == "gemini" else 10
+def main():
+    import argparse
+    parser = argparse.ArgumentParser(description="LLM-judge forced-choice content-specificity eval")
+    parser.add_argument("--smoke", action="store_true", help="Run a smoke test with 1 batch")
+    parser.add_argument("--backend", choices=["claude", "gemini"], default="claude", help="LLM backend to use")
+    parser.add_argument("--model", default=MODEL, help="Model override for the Claude backend")
+    args = parser.parse_args()
+    smoke = args.smoke
+    backend = args.backend
+    model_override = args.model
+    call_judge = call_judge_gemini if backend == "gemini" else (lambda p: call_judge_claude(p, model_override))
+    judge_model = "Gemini 3.5 Flash (Antigravity)" if backend == "gemini" else model_override
+    tag = backend if backend == "gemini" else ("claude_" + model_override.split("-")[1] if "-" in model_override else "claude")
+    batch_size = 5 if backend == "gemini" else 10

Comment on lines +138 to +147
    for b in range(0, len(trials), batch_size):
        batch = trials[b:b + batch_size]
        prompt = make_prompt(batch)
        resp = call_judge(prompt)
        parsed = parse(resp)
        choices.update(parsed)
        with open(RAW, "a", encoding="utf-8") as f:
            f.write(json.dumps({"batch_start": b, "prompt_chars": len(prompt),
                                "response": resp, "parsed": parsed}) + "\n")
        print(f"batch {b}-{b+len(batch)}: parsed {len(parsed)}/{len(batch)}")

medium

LLM API calls are prone to transient network errors or timeouts (as noted in the Sonnet run's crash). Wrapping the call_judge and parse calls in a try-except block catching subprocess.SubprocessError and subprocess.TimeoutExpired allows the script to gracefully handle failures, log the error, and break/save the completed trials instead of crashing and losing all progress.

Suggested change
-    for b in range(0, len(trials), batch_size):
-        batch = trials[b:b + batch_size]
-        prompt = make_prompt(batch)
-        resp = call_judge(prompt)
-        parsed = parse(resp)
-        choices.update(parsed)
-        with open(RAW, "a", encoding="utf-8") as f:
-            f.write(json.dumps({"batch_start": b, "prompt_chars": len(prompt),
-                                "response": resp, "parsed": parsed}) + "\n")
-        print(f"batch {b}-{b+len(batch)}: parsed {len(parsed)}/{len(batch)}")
+    for b in range(0, len(trials), batch_size):
+        batch = trials[b:b + batch_size]
+        prompt = make_prompt(batch)
+        try:
+            resp = call_judge(prompt)
+            parsed = parse(resp)
+            choices.update(parsed)
+            with open(RAW, "a", encoding="utf-8") as f:
+                f.write(json.dumps({"batch_start": b, "prompt_chars": len(prompt),
+                                    "response": resp, "parsed": parsed}) + "\n")
+        except (subprocess.SubprocessError, subprocess.TimeoutExpired) as e:
+            print(f"Error or timeout on batch {b}: {e}")
+            break
+        print(f"batch {b}-{b+len(batch)}: parsed {len(parsed)}/{len(batch)}")
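
The stop-and-keep-progress behaviour can be tried without a real judge CLI. In the sketch below, `flaky_judge` is a hypothetical stand-in that raises `subprocess.TimeoutExpired` on one batch, and the command name `judge-cli` is invented. Since `TimeoutExpired` is a subclass of `SubprocessError`, catching the parent alone is enough.

```python
import subprocess


def flaky_judge(prompt):
    """Hypothetical judge call that times out on one specific batch."""
    if "batch-2" in prompt:
        raise subprocess.TimeoutExpired(cmd="judge-cli", timeout=30)
    return f"ok:{prompt}"


trials = [f"trial-{i}" for i in range(8)]
batch_size = 2
completed, failed_at = {}, None

for b in range(0, len(trials), batch_size):
    prompt = f"batch-{b // batch_size}"
    try:
        resp = flaky_judge(prompt)
    except subprocess.SubprocessError as exc:
        # TimeoutExpired subclasses SubprocessError, so one clause covers both.
        failed_at = b
        print(f"stopping at batch starting {b}: {exc}")
        break
    completed[b] = resp

print(sorted(completed), failed_at)
```

Remembering the failing offset (`failed_at` here) would also let a later run resume from that batch instead of starting over, though whether the eval script should support resumption is a separate design choice.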
