From e9439529813198b303b4ee5c6102512343cb7640 Mon Sep 17 00:00:00 2001
From: Cail Daley <cail.daley@cea.fr>
Date: Sat, 30 May 2026 18:23:30 +0200
Subject: [PATCH] fix(paper-extraction): never fabricate DOIs for no-DOI bib
 entries
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The DOI resolver fuzzy-searches Crossref/ADS by title for any bib entry
lacking a doi:/eprint: field. Two failure modes let it attach *wrong*
DOIs — a direct traceability violation, since a wrong DOI silently
points evidence verification at the wrong paper:

  - "in preparation" companion papers (no DOI exists yet) matched
    unrelated journal articles via fuzzy title search.
  - An ASCL software record (TreeCorr, "Two-point correlation functions")
    matched a 1978 DOE OSTI report subtitled "[Two-point correlation
    functions]" through a permissive 0.55 title gate.

Fix (traceability over coverage — a flagged miss beats a fabricated hit):

  - classify_unresolvable(): short-circuit entries that have no DOI to
    find (in-prep/submitted/to-appear/in-press/forthcoming; ASCL software
    records; no publication metadata) to doi=null + a needs_human tag.
    Never fuzzy-resolve them.
  - Raise the fuzzy title-match gate 0.55 -> 0.80 AND require a
    first-author surname match, on BOTH the Crossref and ADS paths (the
    ADS path previously had no gate at all).
  - Surface unresolved entries via needs_human in the citations: block.

Verified against the UNIONS B-modes bib: the four in-prep companions and
the TreeCorr ASCL entry that previously got phantom DOIs are now flagged;
all published entries still resolve. Adds tests/test_paper_extraction_doi.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../skills/paper-extraction/SKILL.md          |  12 +-
 .../scripts/extract-paper-substrate.py        | 127 +++++++++++++++++-
 tests/test_paper_extraction_doi.py            | 110 +++++++++++++++
 3 files changed, 242 insertions(+), 7 deletions(-)
 create mode 100644 tests/test_paper_extraction_doi.py
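
Reviewer note (not part of the commit): the two-part gate from the commit message can be sketched in isolation as below. This is a minimal illustration under stated assumptions, not the script's code. The names `normalize`, `title_similarity`, `accept_candidate`, and `citation_record` are hypothetical, and the normalization step (lowercase, punctuation to spaces) is an assumption, since the body of the real `_title_similarity` is outside this diff. Only the 0.80 threshold, the `SequenceMatcher` ratio, the substring surname check, and the `needs_human: "unresolved"` shape are taken from the patch.

```python
import re
from difflib import SequenceMatcher

TITLE_MATCH_MIN = 0.80  # threshold introduced by this patch


def normalize(title: str) -> str:
    """Assumed normalization: lowercase, non-alphanumerics become single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def title_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def accept_candidate(query_title, query_surname, cand_title, cand_surname) -> bool:
    """Accept a fuzzy hit only if the title gate AND the surname check both pass."""
    if title_similarity(query_title, cand_title) < TITLE_MATCH_MIN:
        return False
    q, c = query_surname.strip().lower(), cand_surname.strip().lower()
    if q and c and not (q in c or c in q):
        return False  # positive mismatch between two known surnames
    return True  # missing surname on either side: lean on the title gate


def citation_record(doi):
    """Hypothetical helper: unresolved entries carry a needs_human tag."""
    return {"doi": doi, **({} if doi else {"needs_human": "unresolved"})}


software = "TreeCorr: Two-point correlation functions"
report = "Quantum electrodynamics and light rays. [Two-point correlation functions]"
# The shared subtitle alone cannot clear 0.80, even with a matching surname.
print(accept_candidate(software, "Jarvis", report, "Jarvis"))  # False
# Identical titles are still rejected when two known surnames disagree.
print(accept_candidate(software, "Goh", software, "Hawken"))  # False
print(citation_record(None))  # {'doi': None, 'needs_human': 'unresolved'}
```

The point of the sketch is the ordering: a candidate has to clear the title gate before the author check is consulted at all, and a rejected candidate yields a flagged record rather than a guessed DOI.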

diff --git a/claude/lightcone/skills/paper-extraction/SKILL.md b/claude/lightcone/skills/paper-extraction/SKILL.md
index dd4dce3..63163d1 100644
--- a/claude/lightcone/skills/paper-extraction/SKILL.md
+++ b/claude/lightcone/skills/paper-extraction/SKILL.md
@@ -147,7 +147,14 @@ The script detects the path automatically and produces:
 
 The `--arxiv-id` / `--doi` argument populates the `id` and the evidence `doi:` field in `astra.yaml`. If neither is provided, the script writes placeholder text the agent can fix.
 
-The DOI resolver tries, in order: the entry's `doi:` field → an `eprint:`-derived arXiv DOI → Crossref bibliographic query (free, no API key needed) → ADS title search (only if `ADS_API_TOKEN` env var or `~/.ads/dev_key` is present — graceful skip when absent). Title hits from Crossref are gated by a similarity check against the queried title to drop noisy false matches.
+The DOI resolver tries, in order: the entry's `doi:` field → an `eprint:`-derived arXiv DOI → Crossref bibliographic query (free, no API key needed) → ADS title search (only if `ADS_API_TOKEN` env var or `~/.ads/dev_key` is present — graceful skip when absent).
+
+**Fuzzy resolution is gated to avoid wrong DOIs.** A missed resolution (`doi: null`, flagged for human review) is strictly preferable to a fabricated one — a wrong DOI silently points evidence verification at the wrong paper. Two safeguards:
+
+1. **Entries with no DOI to find are never fuzzy-resolved.** `classify_unresolvable()` short-circuits an entry to `doi: null` plus a `needs_human` tag when it is unpublished (`journal`/`note`/`howpublished` says *in preparation*, *submitted*, *to appear*, *in press*, *forthcoming*), is an ASCL software record (`archivePrefix=ascl` or an `ascl:` eprint; software cites verify by existence, not a journal DOI), or carries no `journal`/`booktitle`/`howpublished` at all. Without this guard, fuzzy title search can only return a *false positive* for these entries (e.g. an "in preparation" companion paper matching an unrelated journal article, or `TreeCorr: Two-point correlation functions` matching a 1978 DOE report subtitled `[Two-point correlation functions]`).
+2. **Both Crossref and ADS hits require a title-similarity ≥ 0.80 _and_ a first-author surname match.** The historical Crossref-only gate of 0.55 (with no author check, and no gate at all on the ADS path) was too permissive.
+
+Entries the resolver leaves unresolved carry `needs_human` in the `citations:` block (values: the `classify_unresolvable` tag, or `"unresolved"` when the gated lookups found nothing) so a downstream consumer or human can act on them.
 
 ### Step 4 — Review the script's output and fix structural gaps
 
@@ -156,7 +163,8 @@ The script is purely deterministic. It walks the structural surface but does not
 - **`figure figN: \includegraphics{X} could not resolve`** — the LaTeX referenced a file the script couldn't find. Search the source tree manually (sometimes figures live in non-standard subdirectories with non-standard extensions); copy the file into `figures/` and update the corresponding `index.json` entry's `file` so it's no longer null.
 - **`figure figN: no \caption found`** — composite figures (subfloats) sometimes lack a top-level caption; verify the figure block in source and either record the per-subfigure captions in `caption` or note that the figure is composite.
 - **`table tabN: no \label`** — verify the table is intentional (some `\begin{table}` blocks are non-tabular layout); rename or annotate as needed.
-- **`citation <key>: could not resolve DOI`** — the entry has no `doi:` / `eprint:` field, and neither Crossref nor ADS (when available) returned a match. The entry stays in `citations:` with `doi: null`; a downstream consumer can flag it for human resolution or skip it. If many entries are unresolved, check that the title field is clean (sometimes `.bib` titles carry uncleaned LaTeX commands that drag down the Crossref similarity gate). Delete `.doi-cache.json` to force re-resolution.
+- **`citation <key>: could not resolve DOI`** — the entry has no `doi:` / `eprint:` field, and the gated Crossref/ADS lookups found no match clearing the title-similarity + author check. The entry stays in `citations:` with `doi: null` and `needs_human: "unresolved"`; flag for human resolution or skip. If many published entries are unresolved, check that the title field is clean (uncleaned LaTeX in `.bib` titles drags down the similarity gate). Delete `.doi-cache.json` to force re-resolution.
+- **`citation <key>: NO DOI — in preparation | software ascl | no publication info`** — the entry was deliberately *not* fuzzy-resolved because it has no DOI to find (unpublished, an ASCL software record, or no publication metadata). It carries `doi: null` and a `needs_human` tag. This is correct behavior, not a failure: resolve it by hand once the paper is published (in-prep), or leave it (software cites verify by existence). Re-running after the `.bib` gains a real `doi:`/`eprint:` resolves it automatically.
 - **`citation <key>: cited in source but no matching entry in bibliography-source.{bib,bbl}`** — a `\cite{<key>}` invocation has no corresponding bib record. Usually a typo in the LaTeX source; flag it and move on. The entry stays in `citations:` with `citation: null, doi: null`, locations preserved.
 - **Path B caveat** — outline extraction is not yet implemented for the Docling fallback. Bibliography resolution works on Path B by parsing the references section at the tail of `document.md` and synthesizing keys (`<lastname>_<year>`), but citation *invocations* from rendered prose aren't yet extracted — Path B citations carry empty `locations: []`. The warnings list flags this.
 
diff --git a/claude/lightcone/skills/paper-extraction/scripts/extract-paper-substrate.py b/claude/lightcone/skills/paper-extraction/scripts/extract-paper-substrate.py
index ce2309e..bcbdfef 100755
--- a/claude/lightcone/skills/paper-extraction/scripts/extract-paper-substrate.py
+++ b/claude/lightcone/skills/paper-extraction/scripts/extract-paper-substrate.py
@@ -1128,8 +1128,11 @@ def _resolve_via_crossref(self, title: str, first_author: str) -> tuple[str | No
         if not candidate_doi:
             return None, "unresolved"
         # Title-similarity gate: drop noisy hits where the top result clearly isn't
-        # the paper we asked about.
-        if candidate_titles and _title_similarity(title, candidate_titles[0]) < 0.55:
+        # the paper we asked about. A bare-title match with no author corroboration
+        # is exactly how phantom DOIs slipped in, so require both.
+        if not candidate_titles or _title_similarity(title, candidate_titles[0]) < TITLE_MATCH_MIN:
+            return None, "unresolved"
+        if not _author_matches(first_author, top.get("author")):
             return None, "unresolved"
         return self._normalize_doi(candidate_doi), "crossref"
 
@@ -1139,7 +1142,7 @@ def _resolve_via_ads(self, title: str, first_author: str) -> tuple[str | None, s
         q = f'title:"{title}"'
         if first_author:
             q += f' author:"{first_author}"'
-        params = {"q": q, "fl": "doi,title", "rows": "1"}
+        params = {"q": q, "fl": "doi,title,author", "rows": "1"}
         url = f"{ADS_API}?{urllib.parse.urlencode(params)}"
         try:
             data = self._http_get_json(
@@ -1151,9 +1154,18 @@ def _resolve_via_ads(self, title: str, first_author: str) -> tuple[str | None, s
         docs = ((data or {}).get("response", {}) or {}).get("docs", []) or []
         if not docs:
             return None, "unresolved"
-        doi_list = docs[0].get("doi") or []
+        doc = docs[0]
+        doi_list = doc.get("doi") or []
         if not doi_list:
             return None, "unresolved"
+        # ADS's quoted title/author query is fuzzy too — apply the same
+        # title-similarity + first-author gate the Crossref path uses, so the
+        # ADS fallback can't reintroduce a wrong DOI.
+        candidate_titles = doc.get("title") or []
+        if not candidate_titles or _title_similarity(title, candidate_titles[0]) < TITLE_MATCH_MIN:
+            return None, "unresolved"
+        if not _author_matches(first_author, doc.get("author")):
+            return None, "unresolved"
         return self._normalize_doi(doi_list[0]), "ads"
 
     @staticmethod
@@ -1184,6 +1196,93 @@ def _title_similarity(a: str, b: str) -> float:
     return SequenceMatcher(None, a_norm, b_norm).ratio()
 
 
+# Minimum title-similarity for a fuzzy DOI match to be accepted. Raised from a
+# historical 0.55 (too permissive — it let a 1978 DOE report whose subtitle was
+# "[Two-point correlation functions]" match a "TreeCorr: Two-point correlation
+# functions" software entry) to 0.80, paired with a first-author check. A miss
+# is cheaper than a wrong DOI: an unresolved entry is flagged for human review;
+# a wrong DOI silently points evidence verification at the wrong paper.
+TITLE_MATCH_MIN = 0.80
+
+# Markers that a bibliography entry is not yet published — there is no DOI to
+# find, so fuzzy title resolution can only produce a false positive.
+_IN_PREP_RE = re.compile(
+    r"\b(in[\s_-]*prep(?:aration)?|submitted|to[\s_-]+appear|in[\s_-]+press|forthcoming)\b",
+    re.IGNORECASE,
+)
+
+
+def classify_unresolvable(fields: dict[str, str]) -> str | None:
+    """Return a non-resolvable status tag for a parsed `.bib` entry, or None.
+
+    Entries that carry a real DOI or arXiv eprint are always resolvable (None).
+    Otherwise we refuse to fuzzy-resolve, and flag for human review, when:
+
+      - 'software_ascl'       — an ASCL software record (archivePrefix=ascl or an
+                                `ascl:` eprint). Software cites verify by existence
+                                (ASCL id / bibcode), not a journal DOI.
+      - 'in_preparation'      — journal/note/howpublished marks it unpublished (in
+                                prep, submitted, to appear, in press, forthcoming).
+      - 'no_publication_info' — no doi, no arXiv eprint, and no journal/booktitle/
+                                howpublished, so we can't form a trustworthy query.
+
+    Priority: a missed resolution (doi=None, flagged) is strictly preferable to a
+    fabricated one. Traceability over coverage.
+    """
+    if fields.get("doi"):
+        return None
+    archive = (fields.get("archiveprefix") or "").strip().lower()
+    eprint = (fields.get("eprint") or "").strip()
+    if archive == "ascl" or eprint.lower().startswith("ascl:"):
+        return "software_ascl"
+    # A genuine arXiv eprint (YYMM.NNNNN or old archive/NNNNNNN) is resolvable.
+    if eprint and re.match(
+        r"^(?:\d{4}\.\d{4,5}|[a-z\-]+/\d{7})(?:v\d+)?$", eprint, re.IGNORECASE
+    ):
+        return None
+    blob = " ".join(
+        v
+        for v in (fields.get("journal"), fields.get("note"), fields.get("howpublished"))
+        if v
+    )
+    if _IN_PREP_RE.search(blob):
+        return "in_preparation"
+    if not (fields.get("journal") or fields.get("booktitle") or fields.get("howpublished")):
+        return "no_publication_info"
+    return None
+
+
+def _candidate_surname(author_entry) -> str:
+    """First-author surname from a Crossref author dict or an ADS/bib author string."""
+    if isinstance(author_entry, dict):
+        return (author_entry.get("family") or "").strip().lower()
+    if isinstance(author_entry, str):
+        # ADS / BibTeX comma form "Last, First"; else take the whole token.
+        head = author_entry.split(",")[0].strip()
+        return (head or author_entry).strip().lower()
+    return ""
+
+
+def _author_matches(surname: str, candidate_authors) -> bool:
+    """True if `surname` plausibly matches the candidate's first-author family name.
+
+    Lenient on missing metadata: if no candidate author can be extracted we
+    return True and lean on the title gate. We only *block* on a positive
+    mismatch: two real surnames, neither of which contains the other.
+    """
+    if not surname:
+        return True
+    fam = ""
+    if isinstance(candidate_authors, list) and candidate_authors:
+        fam = _candidate_surname(candidate_authors[0])
+    elif isinstance(candidate_authors, str):
+        fam = _candidate_surname(candidate_authors)
+    if not fam:
+        return True
+    s = surname.strip().lower()
+    return s in fam or fam in s
+
+
 # ---------------------------------------------------------------------------
 # Top-level bibliography pipeline
 
@@ -1224,6 +1323,7 @@ def resolve_bibliography(
                 "title": _clean_bib_text(fields.get("title", "")).strip(),
                 "first_author": _first_author_from_bib_field(fields.get("author", "")).split(",")[0],
                 "source": "bib",
+                "unresolvable": classify_unresolvable(fields),
             }
 
     if bbl_path and bbl_path.is_file() and not bib_entries:
@@ -1282,18 +1382,35 @@ def resolve_bibliography(
             )
             enriched[key] = {"locations": locations, "citation": None, "doi": None}
             continue
+        status = entry.get("unresolvable")
+        if status:
+            # No DOI exists (in-prep / software / no publication info). Refuse to
+            # fuzzy-resolve — that is exactly how a wrong DOI gets attached. Leave
+            # doi=null and surface for human attention.
+            warnings.append(
+                f"citation {key}: NO DOI — {status.replace('_', ' ')}; left unresolved for "
+                f"human review (not fuzzy-matched). [{(entry['citation'] or '')[:70]}]"
+            )
+            enriched[key] = {
+                "locations": locations,
+                "citation": entry["citation"],
+                "doi": None,
+                "needs_human": status,
+            }
+            continue
         doi, _source = resolver.resolve(
             entry["title"], entry["first_author"], entry["doi_hint"], entry["arxiv_hint"]
         )
         if doi is None:
             warnings.append(
                 f"citation {key}: could not resolve DOI; tried doi-field, eprint-field, "
-                f"Crossref{', ADS' if resolver.ads_key else ''}"
+                f"Crossref{', ADS' if resolver.ads_key else ''}. Flagged for human review."
             )
         enriched[key] = {
             "locations": locations,
             "citation": entry["citation"],
             "doi": doi,
+            **({} if doi else {"needs_human": "unresolved"}),
         }
 
     # Path B: every parsed entry lands in the citations block with empty locations.
diff --git a/tests/test_paper_extraction_doi.py b/tests/test_paper_extraction_doi.py
new file mode 100644
index 0000000..c98c952
--- /dev/null
+++ b/tests/test_paper_extraction_doi.py
@@ -0,0 +1,110 @@
+"""Tests for the paper-extraction DOI-resolver guards.
+
+These guard against the phantom-DOI class of bug: a programmatic Crossref/ADS
+title search returning a *wrong* DOI for an entry that has no DOI to find
+(an in-preparation paper, an ASCL software record), or a too-loose fuzzy
+match accepting an unrelated paper. A wrong DOI silently points evidence
+verification at the wrong paper, so the resolver must prefer a flagged miss
+over a fabricated hit.
+
+The resolver lives in a hyphenated script (`extract-paper-substrate.py`) that
+isn't importable as a normal module, so we load it via importlib.
+"""
+
+from __future__ import annotations
+
+import importlib.util
+from pathlib import Path
+
+import pytest
+
+_SCRIPT = (
+    Path(__file__).resolve().parent.parent
+    / "claude/lightcone/skills/paper-extraction/scripts/extract-paper-substrate.py"
+)
+
+
+def _load():
+    spec = importlib.util.spec_from_file_location("extract_paper_substrate", _SCRIPT)
+    assert spec and spec.loader
+    mod = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(mod)
+    return mod
+
+
+eps = _load()
+
+
+# --- classify_unresolvable -------------------------------------------------
+
+
+@pytest.mark.parametrize(
+    "fields,expected",
+    [
+        # Published: a real DOI or arXiv eprint -> resolvable (None).
+        ({"doi": "10.1051/0004-6361/202142479"}, None),
+        ({"eprint": "2212.03257", "archiveprefix": "arXiv"}, None),
+        ({"eprint": "astro-ph/0307393", "archiveprefix": "arXiv"}, None),
+        # In preparation: no DOI to find -> flag, never fuzzy-resolve.
+        ({"journal": "in preparation", "title": "Paper V"}, "in_preparation"),
+        ({"note": "submitted to MNRAS"}, "in_preparation"),
+        ({"journal": "to appear in A&A"}, "in_preparation"),
+        # ASCL software record -> software cite, verify by existence.
+        (
+            {"archiveprefix": "ascl", "eprint": "1508.007", "title": "TreeCorr"},
+            "software_ascl",
+        ),
+        ({"eprint": "ascl:1508.007"}, "software_ascl"),
+        # No publication info at all -> can't form a trustworthy query.
+        ({"title": "Some untethered note", "author": "Doe, J."}, "no_publication_info"),
+        # Has a journal but no doi/eprint -> resolvable (will be gated-resolved).
+        ({"journal": "MNRAS", "title": "A real paper", "author": "Doe, J."}, None),
+    ],
+)
+def test_classify_unresolvable(fields, expected):
+    assert eps.classify_unresolvable(fields) == expected
+
+
+def test_ascl_is_not_treated_as_arxiv():
+    """The treecorr15 regression: ascl:1508.007 must not resolve to a DOI."""
+    fields = {"archiveprefix": "ascl", "eprint": "1508.007"}
+    assert eps.classify_unresolvable(fields) == "software_ascl"
+
+
+# --- _author_matches -------------------------------------------------------
+
+
+def test_author_matches_blocks_real_mismatch():
+    # Crossref dict form.
+    assert eps._author_matches("Goh", [{"family": "Hawken"}]) is False
+    # ADS / bib "Last, First" string form: a genuine match is still accepted.
+    assert eps._author_matches("Jarvis", ["Jarvis, M."]) is True
+
+
+def test_author_matches_accepts_missing_metadata():
+    # No candidate author -> don't block; lean on the title gate.
+    assert eps._author_matches("Goh", []) is True
+    assert eps._author_matches("Goh", None) is True
+    # No queried surname -> nothing to check.
+    assert eps._author_matches("", [{"family": "Anyone"}]) is True
+
+
+# --- title gate ------------------------------------------------------------
+
+
+def test_phantom_title_collision_below_gate():
+    """The exact phantom-DOI collision: a 1978 DOE report whose subtitle matches
+    a software title must fall below the 0.80 gate."""
+    sim = eps._title_similarity(
+        "TreeCorr: Two-point correlation functions",
+        "Quantum electrodynamics and light rays. [Two-point correlation functions]",
+    )
+    assert sim < eps.TITLE_MATCH_MIN
+
+
+def test_true_title_above_gate():
+    sim = eps._title_similarity(
+        "A general framework for removing point-spread function additive systematics",
+        "A general framework for removing point spread function additive systematics",
+    )
+    assert sim >= eps.TITLE_MATCH_MIN