Reframe public tone: verbalizer training gap, not content-blind by SolshineCode · Pull Request #17 · SolshineCode/nla-gemma-4-e2b

SolshineCode · 2026-06-03T04:29:33Z

Lifts the content-specificity ceiling finding (already in the deep Limitations) up into the lead surfaces so a skimmer doesn't leave with a "content-blind" impression. Matches the tone of the Neuronpedia reply to Johnny: lead with the gap, then pivot to "the content really is in there to read."

Purely additive — every honest number kept (at-chance retrieval, the n=50 head-to-head, the descope caveats all stay). Diff is 9 insertions / 4 modified lines.

README — lead paragraph + a new Headline-numbers bullet (60% probe, 0.24 doc-retrieval vs 0.077 chance, L17 carries >2x L23 signal; verbalizer training gap, not a 2B ceiling).
MODEL_CARD_AV — counterweight in the content-fidelity limitation + the distinctive-features line.
MODEL_CARD_AR — note the structural-projection gap is reconstruction-side; content is present in the activation.
lesswrong_post — dated append-only addendum (preserves the narrative voice).

🤖 Generated with Claude Code

A direct retrieval eval (does each AV output recover its own source document?) puts the v0.1 AV at chance across lexical, semantic, and two LLM-judge probes. So the output is format/genre-plausible but NOT per-row content- or theme-discriminative. Corrects the mild 'theme-correct' overclaim in README / model card / calibration doc; the 'detail-confabulated' half stands.

Nuance kept: v0.1 output is diverse (45/50 unique strings); the diversity is decoupled from source content. Scope-limit noted (a feature constant across all 13 docs would be invisible to this eval).

Bundles all eval scripts, inputs (rl.parquet, AV outputs), per-trial LLM-judge data, and CONTENT_SPECIFICITY_EVAL.md under experiments/v8_nla_local/ so the suite is reproducible from this repo.

Also fixes v0.0.1 AV SFT-step count in MODEL_CARD_AV (15 -> 55 total = 15 base + 40 continuation; grad_accum 16 -> 4), restoring the value dropped in the 2026-05-17 retraction rewrite (commit 8dd0a99).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… activation Follow-up to the content-specificity eval. A ceiling test on the RAW L23 activations recovers the source doc well above chance (retrieval 0.24 vs 0.077, p=0.0006; logistic probe 60% on 13-way doc id), so the content the AV fails to surface IS present in the activation. The per-row gap is the verbalizer's, and is fixable with better AV training rather than an intrinsic 2B-L23 ceiling. This also refutes the polysemanticity-at-2B reading for doc-level content, and validates the retrieval method (strong signal in activation, none in AV output). 8-layer sweep: every layer content-rich; L23 middling (0.350), L17 best (0.805), so a future NLA could also retarget the layer. Updates README + RELEASE_CALIBRATION framing (more accurate and more hopeful) and bundles the new scripts + result JSONs (inject probe, activation ceiling, layer sweep). Adds a single-token forced-choice injection probe (also at chance, non-degenerate). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
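As a sanity check on the figures above: 0.077 is the rate uniform guessing gives over 13 documents (1/13). The sketch below assumes the retrieval score is top-1 accuracy over a row-by-document similarity matrix, which the commit message does not state explicitly; `top1_retrieval_accuracy` and the toy matrix are hypothetical.

```python
def top1_retrieval_accuracy(sim, true_doc):
    """Fraction of rows whose highest-similarity document is the true source."""
    hits = 0
    for row, doc in zip(sim, true_doc):
        predicted = max(range(len(row)), key=row.__getitem__)
        hits += int(predicted == doc)
    return hits / len(sim)

N_DOCS = 13
chance = 1 / N_DOCS  # uniform guessing over 13 candidate documents

# Toy 3-row, 3-document similarity matrix (illustrative values only).
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.7, 0.4],
]
toy_truth = [0, 1, 1]
toy_acc = top1_retrieval_accuracy(toy_sim, toy_truth)
```

On the toy data, rows 0 and 2 retrieve their own document and row 1 does not, so `toy_acc` is 2/3.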

…m is real) The ceiling result reframes the content gap as a verbalizer-training problem, not a model-scale ceiling (60% linear-probe doc id on the raw L23 activation). Update the bottom-line section so the summary carries both halves: the v0.1 AV does not yet read per-row content, AND the information is there to be read. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ia Antigravity' The 5-way Gemini judge was the Antigravity (agy) attempt that auth-failed (no data). 'Gemini 3.5 Flash' was an unverified display name from the agy banner. The actual labeler/judge model identifier is gemini-2.5-flash (verified via CLI telemetry); this row produced no data so it is labeled by tool, not an unverified version. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Lifts the §F76 ceiling finding (already in the deep Limitations) up into the lead surfaces so a skimmer doesn't leave with a 'content-blind' impression -- matching the tone of the Neuronpedia reply to Johnny. Purely additive (every honest number kept):

- README lead paragraph + Headline numbers: add the verbalizer-not-activation pivot (60% probe, 0.24 doc-retrieval vs 0.077, L17 carries >2x L23 signal).
- MODEL_CARD_AV: counterweight in the content-fidelity limitation + distinctive line.
- MODEL_CARD_AR: note the structural-projection gap is reconstruction-side, content is present in the activation (60% probe).
- lesswrong_post: dated append-only addendum with the same finding.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

SolshineCode · 2026-06-03T04:29:38Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a comprehensive content-specificity evaluation suite for the v0.1 Natural Language Autoencoder (NLA) on Gemma-4-E2B, adding several evaluation scripts, results, and updating documentation to reflect that the current AV output is format-plausible but not content-discriminative. Crucially, the added ceiling tests show that the underlying activations do contain the necessary document-level signal, indicating a verbalizer training bottleneck rather than a model ceiling. Feedback on the new code focuses on performance and robustness: optimizing GPU-CPU synchronization overhead in the injection probe script by vectorizing log-probability gathering and pre-computing indices, caching the SentenceTransformer model to avoid redundant reloading, and refactoring the LLM-judge script to use argparse and a non-greedy regex pattern to prevent JSON parsing failures.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-03T04:32:11Z

+    def hook(module, inp, out):
+        if out.shape[1] <= 1:
+            return out
+        ids, vec = pending["input_ids"], pending["vec"]
+        if ids is None or vec is None:
+            return out
+        h = out.clone()
+        for p2 in range(1, ids.shape[1] - 1):
+            if (ids[0, p2].item() == INJ_ID and ids[0, p2 - 1].item() == LEFT_ID
+                    and ids[0, p2 + 1].item() == RIGHT_ID):
+                h[0, p2] = vec[0].to(h.dtype)
+                break
+        return h
+
+    handle = av.get_input_embeddings().register_forward_hook(hook)
+
+    prefix_ids = tok.encode(AV_TEMPLATE + "\n\n<explanation> This vector is about",
+                            return_tensors="pt")
+    plen = prefix_ids.shape[1]


Calling .item() on GPU tensors inside a loop triggers synchronous host-device memory copies, which severely degrades performance. Since prefix_ids is constant and always placed at the beginning of full, the injection index inj_idx is static and can be computed once on the CPU during initialization. This completely eliminates the GPU-CPU synchronization overhead during the forward pass hooks.

    prefix_ids = tok.encode(AV_TEMPLATE + "\n\n<explanation> This vector is about",
                            return_tensors="pt")
    plen = prefix_ids.shape[1]

    # Find the injection index once on CPU to avoid GPU-CPU sync in the hook
    prefix_ids_cpu = prefix_ids[0]
    inj_idx = -1
    for i in range(1, len(prefix_ids_cpu) - 1):
        if (prefix_ids_cpu[i].item() == INJ_ID
                and prefix_ids_cpu[i - 1].item() == LEFT_ID
                and prefix_ids_cpu[i + 1].item() == RIGHT_ID):
            inj_idx = i
            break

    def hook(module, inp, out):
        if out.shape[1] <= 1:
            return out
        ids, vec = pending["input_ids"], pending["vec"]
        if ids is None or vec is None:
            return out
        h = out.clone()
        if inj_idx != -1:
            h[0, inj_idx] = vec[0].to(h.dtype)
        return h

    handle = av.get_input_embeddings().register_forward_hook(hook)
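The one-time search can be illustrated on plain Python lists. This is a minimal sketch, not the script's code: the token ids are made up, and it assumes (as the suggestion does) that the injection marker always sits inside the constant prefix.

```python
def find_injection_index(token_ids, left_id, inj_id, right_id):
    """Return the first position whose token is inj_id flanked by left_id and right_id, or -1."""
    for i in range(1, len(token_ids) - 1):
        if (token_ids[i] == inj_id
                and token_ids[i - 1] == left_id
                and token_ids[i + 1] == right_id):
            return i
    return -1

# Hypothetical token ids, for illustration only.
LEFT_ID, INJ_ID, RIGHT_ID = 101, 999, 102
prefix = [5, 7, 101, 999, 102, 8, 9]

# Computed once, outside the hook. With a real tensor, prefix_ids[0].tolist()
# converts everything in one step instead of one .item() call per position.
inj_idx = find_injection_index(prefix, LEFT_ID, INJ_ID, RIGHT_ID)
```

If the marker could ever appear outside the prefix, or at a different offset per example, the cached index would be wrong and the per-call search would still be needed.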

gemini-code-assist · 2026-06-03T04:32:11Z

+        tot = 0.0
+        m = cont_ids.shape[1]
+        for k in range(m):
+            tot += lp[plen + k - 1, full[0, plen + k].item()].item()


Looping over target tokens and calling .item() on GPU tensors inside score causes multiple synchronous host-device transfers per evaluation step. We can vectorize this operation by selecting the relevant slice of lp and using torch.gather to retrieve all target log-probabilities in a single GPU operation, then calling .sum().item() once to copy the scalar to the CPU.

        m = cont_ids.shape[1]
        target_ids = cont_ids.to(av.device).view(m, 1)
        lp_selected = lp[plen - 1 : plen + m - 1]
        target_log_probs = torch.gather(lp_selected, dim=1, index=target_ids)
        tot = target_log_probs.sum().item()
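The slice bounds in the gather version are easy to get off by one, so it may help to check them on toy data. The sketch below uses nested Python lists as a stand-in for the tensors and assumes the continuation tokens are exactly `full[plen : plen + m]`, which the suggestion relies on; all values are made up.

```python
# Toy sizes: prefix length, continuation length, vocabulary size.
plen, m, vocab = 3, 2, 4

# full: prefix tokens followed by continuation tokens (flattened to one row).
full = [0, 1, 2, 3, 1]
# lp[t][v]: log-probability assigned at position t to token v as the next token.
lp = [[-(t + 1) * (v + 1) / 10 for v in range(vocab)] for t in range(len(full))]

# Original loop: one lookup per continuation token.
loop_total = 0.0
for k in range(m):
    loop_total += lp[plen + k - 1][full[plen + k]]

# Gather form: slice the rows once, then pick one column per row.
rows = lp[plen - 1 : plen + m - 1]
targets = full[plen : plen + m]
gather_total = sum(row[tok] for row, tok in zip(rows, targets))
```

Both forms read rows `plen - 1` through `plen + m - 2` and pair row `plen + k - 1` with token `full[plen + k]`, so the totals agree.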

gemini-code-assist · 2026-06-03T04:32:11Z

+    smoke = "--smoke" in sys.argv
+    backend = "gemini" if "--backend" in sys.argv and sys.argv[sys.argv.index("--backend") + 1] == "gemini" else "claude"
+    model_override = sys.argv[sys.argv.index("--model") + 1] if "--model" in sys.argv else MODEL
+    call_judge = call_judge_gemini if backend == "gemini" else (lambda p: call_judge_claude(p, model_override))
+    judge_model = "Gemini 3.5 Flash (Antigravity)" if backend == "gemini" else model_override
+    tag = backend if backend == "gemini" else ("claude_" + model_override.split("-")[1] if "-" in model_override else "claude")
+    batch_size = 5 if backend == "gemini" else 10


Manual parsing of sys.argv is fragile and can easily raise IndexError if arguments are incomplete or misplaced (e.g., if --backend or --model is passed as the last argument without a value). Refactoring the argument parsing to use Python's standard argparse module provides robust validation, automatic help generation, and safe default values.

    import argparse
    parser = argparse.ArgumentParser(description="LLM-judge forced-choice content-specificity eval")
    parser.add_argument("--smoke", action="store_true", help="Run in smoke test mode")
    parser.add_argument("--backend", choices=["claude", "gemini"], default="claude", help="LLM backend")
    parser.add_argument("--model", default=MODEL, help="Model override")
    args = parser.parse_args()

    smoke = args.smoke
    backend = args.backend
    model_override = args.model
    call_judge = call_judge_gemini if backend == "gemini" else (lambda p: call_judge_claude(p, model_override))
    judge_model = "Gemini 3.5 Flash (Antigravity)" if backend == "gemini" else model_override
    tag = backend if backend == "gemini" else ("claude_" + model_override.split("-")[1] if "-" in model_override else "claude")
    batch_size = 5 if backend == "gemini" else 10
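The failure mode and the argparse behaviour can be reproduced in isolation. In this sketch `MODEL` is a placeholder default and `brittle_model_arg` is a hypothetical helper that mirrors the `sys.argv.index` pattern from the diff.

```python
import argparse

MODEL = "placeholder-model"  # hypothetical default; the script defines its own MODEL

def brittle_model_arg(argv):
    """Mirrors the sys.argv.index pattern from the diff."""
    return argv[argv.index("--model") + 1] if "--model" in argv else MODEL

def build_parser():
    parser = argparse.ArgumentParser(description="argument-parsing sketch")
    parser.add_argument("--smoke", action="store_true")
    parser.add_argument("--backend", choices=["claude", "gemini"], default="claude")
    parser.add_argument("--model", default=MODEL)
    return parser

# A trailing --model with no value: the manual version raises IndexError.
try:
    brittle_model_arg(["--model"])
    brittle_failed = False
except IndexError:
    brittle_failed = True

# argparse rejects the same input with a usage message and SystemExit instead.
try:
    build_parser().parse_args(["--model"])
    argparse_rejected = False
except SystemExit:
    argparse_rejected = True

args = build_parser().parse_args(["--backend", "gemini", "--smoke"])
```

One behavioural difference worth noting: with `choices=[...]`, an unknown `--backend` value is rejected, whereas the manual version silently falls back to `claude`.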

gemini-code-assist · 2026-06-03T04:32:11Z

+
+
+def parse(resp):
+    m = re.search(r"\[.*\]", resp, re.DOTALL)


Using a greedy wildcard .* in re.search with re.DOTALL can over-extend and capture trailing conversational brackets (e.g., [note]) returned by the LLM, which will cause json.loads to fail. Using a non-greedy pattern .*? ensures that only the JSON array itself is captured.

Suggested change

m = re.search(r"\[.*\]", resp, re.DOTALL)

m = re.search(r"\[\s*\{.*?\}\s*\]", resp, re.DOTALL)
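A small check of the two patterns on a made-up judge response with a bracketed aside after the JSON array (the response text is invented for illustration):

```python
import json
import re

GREEDY = re.compile(r"\[.*\]", re.DOTALL)
NARROW = re.compile(r"\[\s*\{.*?\}\s*\]", re.DOTALL)

# Made-up judge response with a bracketed aside after the JSON array.
resp = 'Sure: [{"trial_id": 1, "choice": 2}, {"trial_id": 2, "choice": 0}] [note: done]'

greedy_text = GREEDY.search(resp).group(0)
narrow_text = NARROW.search(resp).group(0)

# The greedy match runs to the last "]" and is no longer valid JSON.
try:
    json.loads(greedy_text)
    greedy_ok = True
except json.JSONDecodeError:
    greedy_ok = False

# The narrower match stops at the first "}" that is followed by "]".
parsed = json.loads(narrow_text)
```

The narrower pattern has its own limits: it does not match an empty array `[]`, and it would stop early if a string value happened to contain `}]`. Decoding from the first `[` with `json.JSONDecoder().raw_decode` is one possible alternative that avoids regex boundaries altogether.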

gemini-code-assist · 2026-06-03T04:32:11Z

+def make_sim(av, src, kind):
+    if kind == "tfidf":
+        vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True,
+                              stop_words="english").fit(src + av)
+        return cosine_similarity(vec.transform(av), vec.transform(src))
+    if kind == "semantic":
+        from sentence_transformers import SentenceTransformer
+        model = SentenceTransformer("all-MiniLM-L6-v2")
+        ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
+        es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
+        return ea @ es.T


The SentenceTransformer model is re-loaded and re-initialized from scratch every time make_sim is called with kind='semantic'. Since make_sim is called multiple times during the evaluation, this redundant loading of model weights severely slows down the script. Caching the initialized model in a mutable default argument or a local cache dictionary avoids this overhead.

def make_sim(av, src, kind, _model_cache={}):
    if kind == "tfidf":
        vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True,
                              stop_words="english").fit(src + av)
        return cosine_similarity(vec.transform(av), vec.transform(src))
    if kind == "semantic":
        if "model" not in _model_cache:
            from sentence_transformers import SentenceTransformer
            _model_cache["model"] = SentenceTransformer("all-MiniLM-L6-v2")
        model = _model_cache["model"]
        ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
        es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
        return ea @ es.T
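An alternative to the mutable default argument is a small cached loader. This sketch swaps in `functools.lru_cache` for the dict cache suggested above; `DummyEncoder` is a hypothetical stand-in for the real model so the example runs without any downloads.

```python
from functools import lru_cache

load_count = 0

class DummyEncoder:
    """Stand-in for an expensive model object (hypothetical)."""
    def __init__(self, name):
        self.name = name

@lru_cache(maxsize=None)
def get_encoder(name):
    """Build the encoder the first time a name is requested, then reuse it."""
    global load_count
    load_count += 1
    return DummyEncoder(name)

first = get_encoder("all-MiniLM-L6-v2")
second = get_encoder("all-MiniLM-L6-v2")
```

Both calls return the same object and the constructor runs once. Keying the cache on the model name also keeps it correct if a second model is ever requested.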

SolshineCode · 2026-06-03T04:32:52Z

Superseded by #18 (clean branch off main; #17 conflicted due to the squash-merge of #16 on the reused branch).

gemini-code-assist

Code Review

This pull request introduces a comprehensive suite of evaluation scripts and results (including lexical, semantic, and LLM-judge probes) to assess the content-specificity of the Gemma-4-E2B NLA model, updating the documentation and model cards to reflect that the current AV's content-fidelity gap is a verbalizer training bottleneck rather than an intrinsic model ceiling. The review feedback focuses on improving the robustness of these new scripts, suggesting graceful handling of missing sentence-transformers dependencies, more resilient parsing of LLM-judge JSON outputs, adopting argparse for command-line options, and wrapping LLM API subprocess calls in try-except blocks to handle timeouts and network errors without losing progress.

gemini-code-assist · 2026-06-03T04:33:58Z

+    if kind == "semantic":
+        from sentence_transformers import SentenceTransformer
+        model = SentenceTransformer("all-MiniLM-L6-v2")
+        ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
+        es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
+        return ea @ es.T


If sentence-transformers is not installed, importing SentenceTransformer will raise an ImportError and crash the script. Wrapping the import in a try-except block and returning None allows the script to handle the missing dependency gracefully, matching the behavior in eval_content_specificity.py.

Suggested change

    if kind == "semantic":
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("all-MiniLM-L6-v2")
        ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
        es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
        return ea @ es.T

    if kind == "semantic":
        try:
            from sentence_transformers import SentenceTransformer
        except ImportError:
            return None
        model = SentenceTransformer("all-MiniLM-L6-v2")
        ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
        es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
        return ea @ es.T

gemini-code-assist · 2026-06-03T04:33:58Z

+    for kind in ["tfidf", "semantic"]:
+        # R1: full vs tail-window source
+        sim_full = make_sim(av_all, full_src, kind)
+        sim_tail = make_sim(av_all, tail_src, kind)


If make_sim returns None because sentence-transformers is not installed, the script will crash with an AttributeError when calling perm on None. Checking if sim_full is None and skipping the semantic evaluation gracefully prevents this crash.

Suggested change

    for kind in ["tfidf", "semantic"]:
        # R1: full vs tail-window source
        sim_full = make_sim(av_all, full_src, kind)
        sim_tail = make_sim(av_all, tail_src, kind)

    for kind in ["tfidf", "semantic"]:
        # R1: full vs tail-window source
        sim_full = make_sim(av_all, full_src, kind)
        if sim_full is None:
            print(f"[{kind}] skipped (sentence-transformers not installed)")
            continue
        sim_tail = make_sim(av_all, tail_src, kind)
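The two halves of this pattern (return `None` when the import fails, skip on `None` in the caller) can be sketched together. `load_optional`, `run_kinds`, and the module names are hypothetical; `json` stands in for an installed dependency and the second name is deliberately nonexistent.

```python
import importlib

def load_optional(module_name):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None

def run_kinds(kinds, backends):
    """Run each kind whose backend module is available; record the rest as skipped."""
    ran, skipped = [], []
    for kind in kinds:
        module = load_optional(backends[kind])
        if module is None:
            skipped.append(kind)
            continue
        ran.append(kind)
    return ran, skipped

# "json" stands in for an installed dependency; the second name is deliberately fake.
ran, skipped = run_kinds(
    ["tfidf", "semantic"],
    {"tfidf": "json", "semantic": "hypothetical_missing_pkg_xyz"},
)
```

Recording the skipped kinds, rather than only printing, makes it harder for a missing dependency to go unnoticed in the saved results.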

gemini-code-assist · 2026-06-03T04:33:58Z

+def parse(resp):
+    m = re.search(r"\[.*\]", resp, re.DOTALL)
+    if not m:
+        return {}
+    try:
+        arr = json.loads(m.group(0))
+        return {int(o["trial_id"]): int(o["choice"]) for o in arr if "choice" in o}
+    except Exception:
+        return {}


If a single item in the JSON array returned by the LLM is malformed or has type conversion issues, the entire batch of choices is currently discarded. Processing items individually inside the loop and catching conversion errors per item makes the parser much more robust against minor LLM formatting inconsistencies.

Suggested change

def parse(resp):
    m = re.search(r"\[.*\]", resp, re.DOTALL)
    if not m:
        return {}
    try:
        arr = json.loads(m.group(0))
        return {int(o["trial_id"]): int(o["choice"]) for o in arr if "choice" in o}
    except Exception:
        return {}

def parse(resp):
    m = re.search(r"\[.*\]", resp, re.DOTALL)
    if not m:
        return {}
    try:
        arr = json.loads(m.group(0))
        res = {}
        for o in arr:
            try:
                if "trial_id" in o and "choice" in o:
                    res[int(o["trial_id"])] = int(o["choice"])
            except (ValueError, TypeError):
                continue
        return res
    except Exception:
        return {}
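A runnable variant of the per-item parser, exercised on a made-up response with one bad item. It differs slightly from the suggestion above: it catches `KeyError` instead of testing for the keys, and narrows the outer handler to `json.JSONDecodeError`.

```python
import json
import re

def parse(resp):
    """Extract {trial_id: choice} from a judge response, skipping bad items."""
    m = re.search(r"\[.*\]", resp, re.DOTALL)
    if not m:
        return {}
    try:
        arr = json.loads(m.group(0))
    except json.JSONDecodeError:
        return {}
    res = {}
    for o in arr:
        try:
            res[int(o["trial_id"])] = int(o["choice"])
        except (KeyError, ValueError, TypeError):
            continue  # drop only this item, keep the rest of the batch
    return res

# Made-up response: the second item has a non-numeric choice, the third lacks one.
resp = '[{"trial_id": 1, "choice": 3}, {"trial_id": 2, "choice": "B"}, {"trial_id": 3}]'
result = parse(resp)
```

Only trial 1 survives here; with the all-or-nothing version the whole batch would have been discarded.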

gemini-code-assist · 2026-06-03T04:33:58Z

+def main():
+    smoke = "--smoke" in sys.argv
+    backend = "gemini" if "--backend" in sys.argv and sys.argv[sys.argv.index("--backend") + 1] == "gemini" else "claude"
+    model_override = sys.argv[sys.argv.index("--model") + 1] if "--model" in sys.argv else MODEL
+    call_judge = call_judge_gemini if backend == "gemini" else (lambda p: call_judge_claude(p, model_override))
+    judge_model = "Gemini 3.5 Flash (Antigravity)" if backend == "gemini" else model_override
+    tag = backend if backend == "gemini" else ("claude_" + model_override.split("-")[1] if "-" in model_override else "claude")
+    batch_size = 5 if backend == "gemini" else 10


Manual command-line argument parsing using sys.argv.index is fragile and can raise IndexError if an option is passed without a value at the end of the command. Refactoring to use Python's standard argparse module provides robust parsing, automatic help generation, and cleaner code.

Suggested change

def main():
    smoke = "--smoke" in sys.argv
    backend = "gemini" if "--backend" in sys.argv and sys.argv[sys.argv.index("--backend") + 1] == "gemini" else "claude"
    model_override = sys.argv[sys.argv.index("--model") + 1] if "--model" in sys.argv else MODEL
    call_judge = call_judge_gemini if backend == "gemini" else (lambda p: call_judge_claude(p, model_override))
    judge_model = "Gemini 3.5 Flash (Antigravity)" if backend == "gemini" else model_override
    tag = backend if backend == "gemini" else ("claude_" + model_override.split("-")[1] if "-" in model_override else "claude")
    batch_size = 5 if backend == "gemini" else 10

def main():
    import argparse
    parser = argparse.ArgumentParser(description="LLM-judge forced-choice content-specificity eval")
    parser.add_argument("--smoke", action="store_true", help="Run a smoke test with 1 batch")
    parser.add_argument("--backend", choices=["claude", "gemini"], default="claude", help="LLM backend to use")
    parser.add_argument("--model", default=MODEL, help="Model override for the Claude backend")
    args = parser.parse_args()
    smoke = args.smoke
    backend = args.backend
    model_override = args.model
    call_judge = call_judge_gemini if backend == "gemini" else (lambda p: call_judge_claude(p, model_override))
    judge_model = "Gemini 3.5 Flash (Antigravity)" if backend == "gemini" else model_override
    tag = backend if backend == "gemini" else ("claude_" + model_override.split("-")[1] if "-" in model_override else "claude")
    batch_size = 5 if backend == "gemini" else 10

gemini-code-assist · 2026-06-03T04:33:58Z

+    for b in range(0, len(trials), batch_size):
+        batch = trials[b:b + batch_size]
+        prompt = make_prompt(batch)
+        resp = call_judge(prompt)
+        parsed = parse(resp)
+        choices.update(parsed)
+        with open(RAW, "a", encoding="utf-8") as f:
+            f.write(json.dumps({"batch_start": b, "prompt_chars": len(prompt),
+                                "response": resp, "parsed": parsed}) + "\n")
+        print(f"batch {b}-{b+len(batch)}: parsed {len(parsed)}/{len(batch)}")


LLM API calls are prone to transient network errors or timeouts (as noted in the Sonnet run's crash). Wrapping the call_judge and parse calls in a try-except block catching subprocess.SubprocessError and subprocess.TimeoutExpired allows the script to gracefully handle failures, log the error, and break/save the completed trials instead of crashing and losing all progress.

Suggested change

    for b in range(0, len(trials), batch_size):
        batch = trials[b:b + batch_size]
        prompt = make_prompt(batch)
        resp = call_judge(prompt)
        parsed = parse(resp)
        choices.update(parsed)
        with open(RAW, "a", encoding="utf-8") as f:
            f.write(json.dumps({"batch_start": b, "prompt_chars": len(prompt),
                                "response": resp, "parsed": parsed}) + "\n")
        print(f"batch {b}-{b+len(batch)}: parsed {len(parsed)}/{len(batch)}")

    for b in range(0, len(trials), batch_size):
        batch = trials[b:b + batch_size]
        prompt = make_prompt(batch)
        try:
            resp = call_judge(prompt)
            parsed = parse(resp)
            choices.update(parsed)
            with open(RAW, "a", encoding="utf-8") as f:
                f.write(json.dumps({"batch_start": b, "prompt_chars": len(prompt),
                                    "response": resp, "parsed": parsed}) + "\n")
        except (subprocess.SubprocessError, subprocess.TimeoutExpired) as e:
            print(f"Error or timeout on batch {b}: {e}")
            break
        print(f"batch {b}-{b+len(batch)}: parsed {len(parsed)}/{len(batch)}")
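The stop-and-keep-progress behaviour can be checked with a fake judge. In this sketch `run_batches` and `fake_judge` are hypothetical; since `subprocess.TimeoutExpired` is a subclass of `subprocess.SubprocessError`, catching the base class covers both.

```python
import subprocess

def run_batches(trials, batch_size, call_judge):
    """Judge trials in batches; on a subprocess failure, stop and keep what is done."""
    responses = {}
    failed_at = None
    for b in range(0, len(trials), batch_size):
        batch = trials[b:b + batch_size]
        try:
            responses[b] = call_judge(batch)
        except subprocess.SubprocessError as e:  # TimeoutExpired is a subclass
            failed_at = b
            print(f"Error or timeout on batch {b}: {e}")
            break
    return responses, failed_at

def fake_judge(batch):
    """Hypothetical judge that times out on the batch containing trial 4."""
    if 4 in batch:
        raise subprocess.TimeoutExpired(cmd="judge", timeout=30)
    return [t * 2 for t in batch]

responses, failed_at = run_batches(list(range(6)), 2, fake_judge)
```

Returning the index of the failed batch makes it possible to resume from that point on a later run instead of starting over.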

SolshineCode and others added 5 commits June 1, 2026 00:04

SolshineCode closed this Jun 3, 2026
