Correct content-awareness framing with a direct content-specificity eval #16

Merged
SolshineCode merged 3 commits into main from content-specificity-eval on Jun 2, 2026
@SolshineCode
Owner

What this does

Adds a direct content-specificity retrieval eval of the v0.1 AV and corrects the published content-awareness framing based on what it shows.

The eval

Tests whether the AV text recovers the source document it came from (the prior "content-blind" claim only measured the AR round-trip). Doc-level retrieval (chance 1/13) + 5-way LLM-judge forced choice (chance 0.20), 5000-iter permutation nulls.
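
The doc-level retrieval score and its permutation null can be sketched as follows. This is a minimal illustration on synthetic similarity matrices, not the repository's implementation: the function names `doc_top1` and `permutation_null`, the 4-rows-per-document layout, and the `+5.0` signal strength are all assumptions made for the example.

```python
import numpy as np

def doc_top1(sim, row_doc, src_doc):
    """Fraction of AV rows whose most-similar source row is from the same document."""
    nearest = sim.argmax(axis=1)
    return float(np.mean(src_doc[nearest] == row_doc))

def permutation_null(sim, row_doc, src_doc, n_iter=5000, seed=0):
    """Compare the observed top-1 rate against a label-shuffled null distribution."""
    rng = np.random.default_rng(seed)
    observed = doc_top1(sim, row_doc, src_doc)
    # Shuffling the AV rows' document labels breaks any row-to-source link.
    null = np.array([doc_top1(sim, rng.permutation(row_doc), src_doc)
                     for _ in range(n_iter)])
    # One-sided p-value with the usual +1 correction.
    p_value = (1 + int(np.sum(null >= observed))) / (1 + n_iter)
    return observed, float(null.mean()), p_value

# Synthetic demo (hypothetical data): 13 documents, 4 rows each.
demo_rng = np.random.default_rng(1)
doc = np.repeat(np.arange(13), 4)
noise = demo_rng.normal(size=(52, 52))                  # no content signal
signal = noise + 5.0 * (doc[:, None] == doc[None, :])   # strong same-doc signal

chance_result = permutation_null(noise, doc, doc)
signal_result = permutation_null(signal, doc, doc)
print("no signal  :", chance_result)
print("with signal:", signal_result)
```

With balanced documents the null mean sits near 1/13, which is the chance level quoted above; a probe "at chance" is one whose observed rate is not distinguishable from that null.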

All probes at chance: TF-IDF word/char, semantic (MiniLM), tail-window, non-template subset, Claude Haiku judge (0.24, n=50), Claude Sonnet judge (0.267, n=30). Semantic + any-connection judge prompt rule out the "real-but-non-topical feature" rescue (personality / event / tone).
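
As a rough sanity check on the judge numbers, an exact one-sided binomial tail against the 0.20 chance rate can be computed with the standard library. This is a sketch, not the permutation test the scripts use; the correct-answer counts (12 of 50 and 8 of 30) are inferred from the reported rates and are an assumption.

```python
from math import comb

def binom_tail(k, n, p):
    """Exact one-sided P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

CHANCE = 0.20  # 5-way forced choice

# Counts inferred from the reported accuracies (assumption):
# 0.24 * 50 = 12 correct, 0.267 * 30 is roughly 8 correct.
haiku_p = binom_tail(12, 50, CHANCE)
sonnet_p = binom_tail(8, 30, CHANCE)

print(f"Haiku  12/50 vs chance: p = {haiku_p:.3f}")
print(f"Sonnet  8/30 vs chance: p = {sonnet_p:.3f}")
```

Both tails are well above conventional significance thresholds, consistent with the "at chance" reading.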

Doc corrections

  • Replaces the mild "theme-correct" overclaim with "format/genre-plausible but not per-row content- or theme-discriminative" in README, MODEL_CARD_AV, and a new RELEASE_CALIBRATION addendum. The "detail-confabulated" half stands.
  • Keeps the genuine nuance: v0.1 output is diverse (45/50 unique strings); the diversity is decoupled from source content.
  • Notes the one honest scope-limit (a feature constant across all 13 docs would be invisible to this eval).
  • Fixes the v0.0.1 AV SFT-step count in MODEL_CARD_AV: 15 → 55 total (15 base + 40 continuation), grad_accum 16 → 4, restoring the value dropped in the 2026-05-17 retraction rewrite (commit 8dd0a99).

Data

Bundles all eval scripts, inputs (rl.parquet, AV outputs), and per-trial LLM-judge data under experiments/v8_nla_local/ so the suite is reproducible from this repo. Full writeup: experiments/v8_nla_local/CONTENT_SPECIFICITY_EVAL.md.

🤖 Generated with Claude Code

A direct retrieval eval (does each AV output recover its own source document?)
puts the v0.1 AV at chance across lexical, semantic, and two LLM-judge probes.
So the output is format/genre-plausible but NOT per-row content- or
theme-discriminative. Corrects the mild 'theme-correct' overclaim in README /
model card / calibration doc; the 'detail-confabulated' half stands.

Nuance kept: v0.1 output is diverse (45/50 unique strings); the diversity is
decoupled from source content. Scope-limit noted (a feature constant across all
13 docs would be invisible to this eval).

Bundles all eval scripts, inputs (rl.parquet, AV outputs), per-trial LLM-judge
data, and CONTENT_SPECIFICITY_EVAL.md under experiments/v8_nla_local/ so the
suite is reproducible from this repo.

Also fixes v0.0.1 AV SFT-step count in MODEL_CARD_AV (15 -> 55 total = 15 base +
40 continuation; grad_accum 16 -> 4), restoring the value dropped in the
2026-05-17 retraction rewrite (commit 8dd0a99).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@SolshineCode
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive content-specificity retrieval evaluation suite to determine if the v0.1 Natural Language Autoencoder (NLA) text outputs carry any recoverable content signal from their source documents. It adds several evaluation scripts, documentation, and results demonstrating that the NLA outputs are format-plausible but not content-discriminative. The feedback on the new Python scripts focuses on improving robustness and resource management. Specifically, it is recommended to check subprocess return codes in the LLM judge script to prevent silent failures, use context managers (with statements) for all file read/write operations, and gracefully handle the potential absence of the sentence_transformers package in the refinement script.

Comment on lines +93 to +109
def call_judge_claude(prompt, model=MODEL):
    claude = shutil.which("claude") or "claude"
    proc = subprocess.run(
        [claude, "-p", "--model", model],
        input=prompt, capture_output=True, text=True, cwd=HOME, timeout=300,
        encoding="utf-8", errors="replace",
    )
    return proc.stdout.strip()


def call_judge_gemini(prompt):
    # Antigravity (Gemini 3.5 Flash) via the gemini-collab wrapper; prompt passed as argv.
    proc = subprocess.run(
        [sys.executable, GEMINI_WRAPPER, "--prompt", prompt, "--timeout", "300"],
        capture_output=True, text=True, encoding="utf-8", errors="replace", timeout=360,
    )
    return proc.stdout.strip()

high

The subprocess calls to the LLM judges fail silently when the command returns a non-zero exit code or produces no valid response. The script then continues with empty results and eventually writes a report claiming 'no signal above chance' with 0 trials (as seen in the generated content_specificity_judge_gemini.json file). Checking proc.returncode and raising an error that includes proc.stderr prevents this and makes debugging much easier.

def call_judge_claude(prompt, model=MODEL):
    claude = shutil.which("claude") or "claude"
    proc = subprocess.run(
        [claude, "-p", "--model", model],
        input=prompt, capture_output=True, text=True, cwd=HOME, timeout=300,
        encoding="utf-8", errors="replace",
    )
    if proc.returncode != 0:
        raise RuntimeError(f"Claude judge failed with exit code {proc.returncode}: {proc.stderr}")
    return proc.stdout.strip()


def call_judge_gemini(prompt):
    # Antigravity (Gemini 3.5 Flash) via the gemini-collab wrapper; prompt passed as argv.
    proc = subprocess.run(
        [sys.executable, GEMINI_WRAPPER, "--prompt", prompt, "--timeout", "300"],
        capture_output=True, text=True, encoding="utf-8", errors="replace", timeout=360,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"Gemini judge failed with exit code {proc.returncode}: {proc.stderr}")
    return proc.stdout.strip()
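
The failure mode is easy to reproduce in isolation. The sketch below uses the Python interpreter itself as a hypothetical stand-in for a broken judge CLI; `run_checked` and the `failing` command are illustrative names, not part of the PR.

```python
import subprocess
import sys

def run_checked(cmd, prompt, timeout=60):
    """Run a judge-like command; raise instead of returning empty output on failure."""
    proc = subprocess.run(cmd, input=prompt, capture_output=True, text=True,
                          timeout=timeout)
    if proc.returncode != 0:
        raise RuntimeError(
            f"judge failed with exit code {proc.returncode}: {proc.stderr.strip()}")
    return proc.stdout.strip()

# Hypothetical broken judge: writes to stderr, prints nothing, exits with code 3.
failing = [sys.executable, "-c",
           "import sys; sys.stderr.write('auth error'); sys.exit(3)"]

# Unchecked call: no exception, and stdout is just an empty string.
unchecked = subprocess.run(failing, input="prompt", capture_output=True, text=True)
silent_output = unchecked.stdout.strip()

# Checked call: the same failure now surfaces with its exit code and stderr.
try:
    run_checked(failing, "prompt")
    raised = False
    message = ""
except RuntimeError as exc:
    raised = True
    message = str(exc)

print(repr(silent_output), raised, message)
```

Passing `check=True` to `subprocess.run` is a standard-library alternative; it raises `subprocess.CalledProcessError` on a non-zero exit code, with the captured stderr available on the exception.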

Comment on lines +44 to +46
def load():
    df = pd.read_parquet(RL)
    av = json.load(open(AV, encoding="utf-8"))["our_outputs"]

medium

The file AV is opened using open() but is never explicitly closed, which can lead to resource leaks. It is recommended to use a with statement (context manager) to ensure the file is properly closed after reading.

Suggested change
-def load():
-    df = pd.read_parquet(RL)
-    av = json.load(open(AV, encoding="utf-8"))["our_outputs"]
+def load():
+    df = pd.read_parquet(RL)
+    with open(AV, encoding="utf-8") as f:
+        av = json.load(f)["our_outputs"]

        results["config"]["semantic"] = "sentence-transformers not installed; skipped"
        print("[semantic] skipped (sentence-transformers not installed)")

    json.dump(results, open(OUT, "w", encoding="utf-8"), indent=1)

medium

The output file OUT is opened for writing without a context manager, leaving the file descriptor open. Using a with statement ensures the file is closed properly.

Suggested change
-    json.dump(results, open(OUT, "w", encoding="utf-8"), indent=1)
+    with open(OUT, "w", encoding="utf-8") as f:
+        json.dump(results, f, indent=1)

Comment on lines +45 to +47
def build_trials():
    df = pd.read_parquet(RL)
    av = json.load(open(AV, encoding="utf-8"))["our_outputs"]

medium

The file AV is opened using open() but is never explicitly closed. Using a with statement ensures the file is properly closed after reading.

Suggested change
-def build_trials():
-    df = pd.read_parquet(RL)
-    av = json.load(open(AV, encoding="utf-8"))["our_outputs"]
+def build_trials():
+    df = pd.read_parquet(RL)
+    with open(AV, encoding="utf-8") as f:
+        av = json.load(f)["our_outputs"]

"per_trial": [{"trial_id": tid, "choice": ch, "true_pos": tp, "correct": ch == tp}
for tid, ch, tp in scored],
}
json.dump(result, open(OUT, "w", encoding="utf-8"), indent=1)

medium

The output file OUT is opened for writing without a context manager. Using a with statement ensures the file is closed properly.

    with open(OUT, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=1)

Comment on lines +81 to +83
def main():
    df = pd.read_parquet(RL)
    av_all = json.load(open(AV, encoding="utf-8"))["our_outputs"]

medium

The file AV is opened using open() but is never explicitly closed. Using a with statement ensures the file is properly closed after reading.

Suggested change
-def main():
-    df = pd.read_parquet(RL)
-    av_all = json.load(open(AV, encoding="utf-8"))["our_outputs"]
+def main():
+    df = pd.read_parquet(RL)
+    with open(AV, encoding="utf-8") as f:
+        av_all = json.load(f)["our_outputs"]

f"nontmpl top1={r['R2_nontemplate_tail']['doc_top1']:.3f}(p{r['R2_nontemplate_tail']['top1_p']:.2f}) "
f"| gap_p full={r['R1_full_doc']['gap_p']:.2f} tail={r['R1_tail_window']['gap_p']:.2f} nontmpl={r['R2_nontemplate_tail']['gap_p']:.2f}")

json.dump(out, open(OUT, "w", encoding="utf-8"), indent=1)

medium

The output file OUT is opened for writing without a context manager. Using a with statement ensures the file is closed properly.

Suggested change
-    json.dump(out, open(OUT, "w", encoding="utf-8"), indent=1)
+    with open(OUT, "w", encoding="utf-8") as f:
+        json.dump(out, f, indent=1)

Comment on lines +68 to +78
def make_sim(av, src, kind):
    if kind == "tfidf":
        vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True,
                              stop_words="english").fit(src + av)
        return cosine_similarity(vec.transform(av), vec.transform(src))
    if kind == "semantic":
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("all-MiniLM-L6-v2")
        ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
        es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
        return ea @ es.T

medium

Unlike eval_content_specificity.py, this script does not handle the absence of sentence_transformers gracefully. If the package is not installed, running this script will crash with a ModuleNotFoundError. Wrapping the import in a try-except block and returning None allows the script to skip the semantic evaluation gracefully, matching the behavior of the main evaluation script.

Suggested change
-def make_sim(av, src, kind):
-    if kind == "tfidf":
-        vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True,
-                              stop_words="english").fit(src + av)
-        return cosine_similarity(vec.transform(av), vec.transform(src))
-    if kind == "semantic":
-        from sentence_transformers import SentenceTransformer
-        model = SentenceTransformer("all-MiniLM-L6-v2")
-        ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
-        es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
-        return ea @ es.T
+def make_sim(av, src, kind):
+    if kind == "tfidf":
+        vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True,
+                              stop_words="english").fit(src + av)
+        return cosine_similarity(vec.transform(av), vec.transform(src))
+    if kind == "semantic":
+        try:
+            from sentence_transformers import SentenceTransformer
+        except ImportError:
+            return None
+        model = SentenceTransformer("all-MiniLM-L6-v2")
+        ea = model.encode(av, normalize_embeddings=True, show_progress_bar=False)
+        es = model.encode(src, normalize_embeddings=True, show_progress_bar=False)
+        return ea @ es.T

Comment on lines +93 to +111
    for kind in ["tfidf", "semantic"]:
        # R1: full vs tail-window source
        sim_full = make_sim(av_all, full_src, kind)
        sim_tail = make_sim(av_all, tail_src, kind)
        # R2: non-template subset, tail source
        av_k = [av_all[i] for i in keep]
        tail_k = [tail_src[i] for i in keep]
        doc_k = [doc[i] for i in keep]
        sim_k = make_sim(av_k, tail_k, kind)
        out["results"][kind] = {
            "R1_full_doc": perm(sim_full, doc),
            "R1_tail_window": perm(sim_tail, doc),
            "R2_nontemplate_tail": perm(sim_k, doc_k),
        }
        r = out["results"][kind]
        print(f"[{kind}] full top1={r['R1_full_doc']['doc_top1']:.3f}(p{r['R1_full_doc']['top1_p']:.2f}) "
              f"tail top1={r['R1_tail_window']['doc_top1']:.3f}(p{r['R1_tail_window']['top1_p']:.2f}) "
              f"nontmpl top1={r['R2_nontemplate_tail']['doc_top1']:.3f}(p{r['R2_nontemplate_tail']['top1_p']:.2f}) "
              f"| gap_p full={r['R1_full_doc']['gap_p']:.2f} tail={r['R1_tail_window']['gap_p']:.2f} nontmpl={r['R2_nontemplate_tail']['gap_p']:.2f}")

medium

To support graceful skipping of the semantic evaluation when sentence_transformers is not installed, update the loop in main to check if sim_full is None and skip the current iteration if so.

    for kind in ["tfidf", "semantic"]:
        # R1: full vs tail-window source
        sim_full = make_sim(av_all, full_src, kind)
        if sim_full is None:
            print(f"[{kind}] skipped (sentence-transformers not installed)")
            continue
        sim_tail = make_sim(av_all, tail_src, kind)
        # R2: non-template subset, tail source
        av_k = [av_all[i] for i in keep]
        tail_k = [tail_src[i] for i in keep]
        doc_k = [doc[i] for i in keep]
        sim_k = make_sim(av_k, tail_k, kind)
        out["results"][kind] = {
            "R1_full_doc": perm(sim_full, doc),
            "R1_tail_window": perm(sim_tail, doc),
            "R2_nontemplate_tail": perm(sim_k, doc_k),
        }
        r = out["results"][kind]
        print(f"[{kind}] full top1={r['R1_full_doc']['doc_top1']:.3f}(p{r['R1_full_doc']['top1_p']:.2f}) "
              f"tail top1={r['R1_tail_window']['doc_top1']:.3f}(p{r['R1_tail_window']['top1_p']:.2f}) "
              f"nontmpl top1={r['R2_nontemplate_tail']['doc_top1']:.3f}(p{r['R2_nontemplate_tail']['top1_p']:.2f}) "
              f"| gap_p full={r['R1_full_doc']['gap_p']:.2f} tail={r['R1_tail_window']['gap_p']:.2f} nontmpl={r['R2_nontemplate_tail']['gap_p']:.2f}")

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive content-specificity retrieval evaluation suite for the v0.1 Natural Language Autoencoder (NLA), adding several evaluation scripts, documentation, and benchmark results. The findings show that while the AV outputs are diverse, they perform at chance in recovering source-document content, prompting updates to the README, model card, and calibration notes. The reviewer feedback is highly constructive, recommending robustness and portability improvements such as handling subprocess timeouts gracefully in the LLM-judge script, avoiding hardcoded home directory paths, and utilizing context managers for safer file I/O operations.

Comment on lines +138 to +147
    for b in range(0, len(trials), batch_size):
        batch = trials[b:b + batch_size]
        prompt = make_prompt(batch)
        resp = call_judge(prompt)
        parsed = parse(resp)
        choices.update(parsed)
        with open(RAW, "a", encoding="utf-8") as f:
            f.write(json.dumps({"batch_start": b, "prompt_chars": len(prompt),
                                "response": resp, "parsed": parsed}) + "\n")
        print(f"batch {b}-{b+len(batch)}: parsed {len(parsed)}/{len(batch)}")

high

The LLM judge API call can occasionally time out or fail due to network issues, rate limits, or subprocess hangs (as noted in the Sonnet run crash). Currently, any exception (such as subprocess.TimeoutExpired) raised during call_judge will crash the entire script, causing all previously completed batches in the loop to be lost since the final results are never written to OUT.

Wrapping the call_judge and parsing logic in a try-except block and breaking/continuing gracefully will allow the script to save the completed trials to OUT even if a subsequent batch fails.

    for b in range(0, len(trials), batch_size):
        batch = trials[b:b + batch_size]
        prompt = make_prompt(batch)
        try:
            resp = call_judge(prompt)
            parsed = parse(resp)
            choices.update(parsed)
            with open(RAW, "a", encoding="utf-8") as f:
                f.write(json.dumps({"batch_start": b, "prompt_chars": len(prompt),
                                    "response": resp, "parsed": parsed}) + "\n")
            print(f"batch {b}-{b+len(batch)}: parsed {len(parsed)}/{len(batch)}")
        except subprocess.TimeoutExpired as e:
            print(f"Timeout on batch {b}-{b+len(batch)}: {e}")
            break
        except Exception as e:
            print(f"Error on batch {b}-{b+len(batch)}: {e}")
            break

Comment on lines +38 to +39
GEMINI_WRAPPER = os.path.join(HOME, ".claude", "skills", "gemini-collab",
                              "scripts", "gemini_client.py")

medium

Hardcoding the path to gemini_client.py inside the user's home directory (~/.claude/skills/...) makes the script non-portable and fragile across different developer environments or CI/CD pipelines. Consider allowing this path to be overridden via an environment variable or command-line argument, with the current path as a default fallback.

GEMINI_WRAPPER = os.environ.get(
    "GEMINI_WRAPPER",
    os.path.join(HOME, ".claude", "skills", "gemini-collab", "scripts", "gemini_client.py")
)


def load():
    df = pd.read_parquet(RL)
    av = json.load(open(AV, encoding="utf-8"))["our_outputs"]

medium

Opening files without a with statement (context manager) leaves the file descriptor open until garbage collection runs. It is safer and more idiomatic to use with open(...) to guarantee the file is closed properly.

    with open(AV, encoding="utf-8") as f:
        av = json.load(f)["our_outputs"]

        results["config"]["semantic"] = "sentence-transformers not installed; skipped"
        print("[semantic] skipped (sentence-transformers not installed)")

    json.dump(results, open(OUT, "w", encoding="utf-8"), indent=1)

medium

Using a context manager (with statement) ensures that the output file is properly closed and flushed, even if an exception occurs during writing.

    with open(OUT, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=1)

SolshineCode and others added 2 commits June 1, 2026 01:29
… activation

Follow-up to the content-specificity eval. A ceiling test on the RAW L23
activations recovers the source doc well above chance (retrieval 0.24 vs 0.077,
p=0.0006; logistic probe 60% on 13-way doc id), so the content the AV fails to
surface IS present in the activation. The per-row gap is the verbalizer's, and is
fixable with better AV training rather than an intrinsic 2B-L23 ceiling. This also
refutes the polysemanticity-at-2B reading for doc-level content, and validates the
retrieval method (strong signal in activation, none in AV output).

8-layer sweep: every layer content-rich; L23 middling (0.350), L17 best (0.805),
so a future NLA could also retarget the layer.

Updates README + RELEASE_CALIBRATION framing (more accurate and more hopeful) and
bundles the new scripts + result JSONs (inject probe, activation ceiling, layer
sweep). Adds a single-token forced-choice injection probe (also at chance,
non-degenerate).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
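
The linear-probe ceiling test described in the commit above can be illustrated with a small sketch. It runs on synthetic "activations" (one random center per document plus Gaussian noise), and uses a plain NumPy multinomial logistic regression to avoid extra dependencies. The data, dimensions, hyperparameters, and the helper name `fit_softmax_probe` are all assumptions for illustration; the repository's probe and the real L23 activations are not reproduced here.

```python
import numpy as np

def fit_softmax_probe(x, y, n_classes, lr=0.1, steps=500, l2=1e-3):
    """Multinomial logistic regression trained by full-batch gradient descent."""
    w = np.zeros((x.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(steps):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(x)              # d(cross-entropy)/d(logits)
        w -= lr * (x.T @ grad + l2 * w)
        b -= lr * grad.sum(axis=0)
    return w, b

# Synthetic stand-in for per-row activations (hypothetical): 13 docs, 20 rows each.
rng = np.random.default_rng(0)
n_docs, rows_per_doc, dim = 13, 20, 32
doc_id = np.repeat(np.arange(n_docs), rows_per_doc)
centers = rng.normal(size=(n_docs, dim))
acts = centers[doc_id] + rng.normal(scale=1.0, size=(len(doc_id), dim))

# Hold out a random quarter of the rows to score the probe.
order = rng.permutation(len(doc_id))
test_idx, train_idx = order[:65], order[65:]
w, b = fit_softmax_probe(acts[train_idx], doc_id[train_idx], n_docs)
pred = (acts[test_idx] @ w + b).argmax(axis=1)

probe_acc = float(np.mean(pred == doc_id[test_idx]))
chance = 1 / n_docs
print(f"13-way probe accuracy: {probe_acc:.2f} (chance {chance:.3f})")
```

The logic of the ceiling test is the comparison itself: if a linear probe on the activation beats 1/13 by a wide margin while the AV text does not, the missing content is in the activation and the gap belongs to the verbalizer.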
…m is real)

The ceiling result reframes the content gap as a verbalizer-training problem, not
a model-scale ceiling (60% linear-probe doc id on the raw L23 activation). Update
the bottom-line section so the summary carries both halves: the v0.1 AV does not
yet read per-row content, AND the information is there to be read.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>