From e9439529813198b303b4ee5c6102512343cb7640 Mon Sep 17 00:00:00 2001
From: Cail Daley
Date: Sat, 30 May 2026 18:23:30 +0200
Subject: [PATCH] fix(paper-extraction): never fabricate DOIs for no-DOI bib entries
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The DOI resolver fuzzy-searches Crossref/ADS by title for any bib entry
lacking a doi:/eprint: field. Two failure modes let it attach *wrong* DOIs —
a direct traceability violation, since a wrong DOI silently points evidence
verification at the wrong paper:

- "in preparation" companion papers (no DOI exists yet) matched unrelated
  journal articles via fuzzy title search.
- An ASCL software record (TreeCorr, "Two-point correlation functions")
  matched a 1978 DOE OSTI report subtitled "[Two-point correlation
  functions]" through a permissive 0.55 title gate.

Fix (traceability over coverage — a flagged miss beats a fabricated hit):

- classify_unresolvable(): short-circuit entries that have no DOI to find
  (in-prep/submitted/to-appear/in-press/forthcoming; ASCL software records;
  no publication metadata) to doi=null + a needs_human tag. Never
  fuzzy-resolve them.
- Raise the fuzzy title-match gate 0.55 -> 0.80 AND require a first-author
  surname match, on BOTH the Crossref and ADS paths (the ADS path previously
  had no gate at all).
- Surface unresolved entries via needs_human in the citations: block.

Verified against the UNIONS B-modes bib: the four in-prep companions and the
TreeCorr ASCL entry that previously got phantom DOIs are now flagged; all
published entries still resolve. Adds tests/test_paper_extraction_doi.py.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../skills/paper-extraction/SKILL.md               |  12 +-
 .../scripts/extract-paper-substrate.py             | 127 +++++++++++++++++-
 tests/test_paper_extraction_doi.py                 | 110 +++++++++++++++
 3 files changed, 242 insertions(+), 7 deletions(-)
 create mode 100644 tests/test_paper_extraction_doi.py

diff --git a/claude/lightcone/skills/paper-extraction/SKILL.md b/claude/lightcone/skills/paper-extraction/SKILL.md
index dd4dce3..63163d1 100644
--- a/claude/lightcone/skills/paper-extraction/SKILL.md
+++ b/claude/lightcone/skills/paper-extraction/SKILL.md
@@ -147,7 +147,14 @@ The script detects the path automatically and produces:

 The `--arxiv-id` / `--doi` argument populates the `id` and the evidence `doi:` field in `astra.yaml`. If neither is provided, the script writes placeholder text the agent can fix.

-The DOI resolver tries, in order: the entry's `doi:` field → an `eprint:`-derived arXiv DOI → Crossref bibliographic query (free, no API key needed) → ADS title search (only if `ADS_API_TOKEN` env var or `~/.ads/dev_key` is present — graceful skip when absent). Title hits from Crossref are gated by a similarity check against the queried title to drop noisy false matches.
+The DOI resolver tries, in order: the entry's `doi:` field → an `eprint:`-derived arXiv DOI → Crossref bibliographic query (free, no API key needed) → ADS title search (only if `ADS_API_TOKEN` env var or `~/.ads/dev_key` is present — graceful skip when absent).
+
+**Fuzzy resolution is gated to avoid wrong DOIs.** A missed resolution (`doi: null`, flagged for human review) is strictly preferable to a fabricated one — a wrong DOI silently points evidence verification at the wrong paper. Two safeguards:
+
+1. **Entries with no DOI to find are never fuzzy-resolved.** `classify_unresolvable()` short-circuits an entry to `doi: null` plus a `needs_human` tag when it is unpublished (`journal`/`note`/`howpublished` says *in preparation*, *submitted*, *to appear*, *in press*, *forthcoming*), an ASCL software record (`archivePrefix=ascl` or an `ascl:` eprint — software cites verify by existence, not a journal DOI), or carries no `journal`/`booktitle` at all. Without this guard, fuzzy title search returns a *false positive* for these (e.g. an "in preparation" companion paper matching an unrelated journal article, or `TreeCorr: Two-point correlation functions` matching a 1978 DOE report subtitled `[Two-point correlation functions]`).
+2. **Both Crossref and ADS hits require a title-similarity ≥ 0.80 _and_ a first-author surname match.** The historical Crossref-only gate of 0.55 (with no author check, and no gate at all on the ADS path) was too permissive.
+
+Entries the resolver leaves unresolved carry `needs_human` in the `citations:` block (values: the `classify_unresolvable` tag, or `"unresolved"` when the gated lookups found nothing) so a downstream consumer or human can act on them.

 ### Step 4 — Review the script's output and fix structural gaps

@@ -156,7 +163,8 @@ The script is purely deterministic. It walks the structural surface but does not
 - **`figure figN: \includegraphics{X} could not resolve`** — the LaTeX referenced a file the script couldn't find. Search the source tree manually (sometimes figures live in non-standard subdirectories with non-standard extensions); copy the file into `figures/` and update the corresponding `index.json` entry's `file` so it's no longer null.
 - **`figure figN: no \caption found`** — composite figures (subfloats) sometimes lack a top-level caption; verify the figure block in source and either record the per-subfigure captions in `caption` or note that the figure is composite.
 - **`table tabN: no \label`** — verify the table is intentional (some `\begin{table}` blocks are non-tabular layout); rename or annotate as needed.
-- **`citation : could not resolve DOI`** — the entry has no `doi:` / `eprint:` field, and neither Crossref nor ADS (when available) returned a match. The entry stays in `citations:` with `doi: null`; a downstream consumer can flag it for human resolution or skip it. If many entries are unresolved, check that the title field is clean (sometimes `.bib` titles carry uncleaned LaTeX commands that drag down the Crossref similarity gate). Delete `.doi-cache.json` to force re-resolution.
+- **`citation : could not resolve DOI`** — the entry has no `doi:` / `eprint:` field, and the gated Crossref/ADS lookups found no match clearing the title-similarity + author check. The entry stays in `citations:` with `doi: null` and `needs_human: "unresolved"`; flag for human resolution or skip. If many published entries are unresolved, check that the title field is clean (uncleaned LaTeX in `.bib` titles drags down the similarity gate). Delete `.doi-cache.json` to force re-resolution.
+- **`citation : NO DOI — in preparation | software ascl | no publication info`** — the entry was deliberately *not* fuzzy-resolved because it has no DOI to find (unpublished, an ASCL software record, or no publication metadata). It carries `doi: null` and a `needs_human` tag. This is correct behavior, not a failure: resolve it by hand once the paper is published (in-prep), or leave it (software cites verify by existence). Re-running after the `.bib` gains a real `doi:`/`eprint:` resolves it automatically.
 - **`citation : cited in source but no matching entry in bibliography-source.{bib,bbl}`** — a `\cite{}` invocation has no corresponding bib record. Usually a typo in the LaTeX source; flag it and move on. The entry stays in `citations:` with `citation: null, doi: null`, locations preserved.
 - **Path B caveat** — outline extraction is not yet implemented for the Docling fallback. Bibliography resolution works on Path B by parsing the references section at the tail of `document.md` and synthesizing keys (`_`), but citation *invocations* from rendered prose aren't yet extracted — Path B citations carry empty `locations: []`. The warnings list flags this.

diff --git a/claude/lightcone/skills/paper-extraction/scripts/extract-paper-substrate.py b/claude/lightcone/skills/paper-extraction/scripts/extract-paper-substrate.py
index ce2309e..bcbdfef 100755
--- a/claude/lightcone/skills/paper-extraction/scripts/extract-paper-substrate.py
+++ b/claude/lightcone/skills/paper-extraction/scripts/extract-paper-substrate.py
@@ -1128,8 +1128,11 @@ def _resolve_via_crossref(self, title: str, first_author: str) -> tuple[str | No
         if not candidate_doi:
             return None, "unresolved"
         # Title-similarity gate: drop noisy hits where the top result clearly isn't
-        # the paper we asked about.
-        if candidate_titles and _title_similarity(title, candidate_titles[0]) < 0.55:
+        # the paper we asked about. A bare-title match with no author corroboration
+        # is exactly how phantom DOIs slipped in, so require both.
+        if not candidate_titles or _title_similarity(title, candidate_titles[0]) < TITLE_MATCH_MIN:
+            return None, "unresolved"
+        if not _author_matches(first_author, top.get("author")):
             return None, "unresolved"
         return self._normalize_doi(candidate_doi), "crossref"

@@ -1139,7 +1142,7 @@ def _resolve_via_ads(self, title: str, first_author: str) -> tuple[str | None, s
         q = f'title:"{title}"'
         if first_author:
             q += f' author:"{first_author}"'
-        params = {"q": q, "fl": "doi,title", "rows": "1"}
+        params = {"q": q, "fl": "doi,title,author", "rows": "1"}
         url = f"{ADS_API}?{urllib.parse.urlencode(params)}"
         try:
             data = self._http_get_json(
@@ -1151,9 +1154,18 @@ def _resolve_via_ads(self, title: str, first_author: str) -> tuple[str | None, s
         docs = ((data or {}).get("response", {}) or {}).get("docs", []) or []
         if not docs:
             return None, "unresolved"
-        doi_list = docs[0].get("doi") or []
+        doc = docs[0]
+        doi_list = doc.get("doi") or []
         if not doi_list:
             return None, "unresolved"
+        # ADS's quoted title/author query is fuzzy too — apply the same
+        # title-similarity + first-author gate the Crossref path uses, so the
+        # ADS fallback can't reintroduce a wrong DOI.
+        candidate_titles = doc.get("title") or []
+        if not candidate_titles or _title_similarity(title, candidate_titles[0]) < TITLE_MATCH_MIN:
+            return None, "unresolved"
+        if not _author_matches(first_author, doc.get("author")):
+            return None, "unresolved"
         return self._normalize_doi(doi_list[0]), "ads"

     @staticmethod
@@ -1184,6 +1196,93 @@ def _title_similarity(a: str, b: str) -> float:
     return SequenceMatcher(None, a_norm, b_norm).ratio()


+# Minimum title-similarity for a fuzzy DOI match to be accepted. Raised from a
+# historical 0.55 (too permissive — it let a 1978 DOE report whose subtitle was
+# "[Two-point correlation functions]" match a "TreeCorr: Two-point correlation
+# functions" software entry) to 0.80, paired with a first-author check. A miss
+# is cheaper than a wrong DOI: an unresolved entry is flagged for human review;
+# a wrong DOI silently points evidence verification at the wrong paper.
+TITLE_MATCH_MIN = 0.80
+
+# Markers that a bibliography entry is not yet published — there is no DOI to
+# find, so fuzzy title resolution can only produce a false positive.
+_IN_PREP_RE = re.compile(
+    r"\b(in[\s_-]*prep(?:aration)?|submitted|to[\s_-]+appear|in[\s_-]+press|forthcoming)\b",
+    re.IGNORECASE,
+)
+
+
+def classify_unresolvable(fields: dict[str, str]) -> str | None:
+    """Return a non-resolvable status tag for a parsed `.bib` entry, or None.
+
+    Entries that carry a real DOI or arXiv eprint are always resolvable (None).
+    Otherwise we refuse to fuzzy-resolve, and flag for human review, when:
+
+    - 'software_ascl' — an ASCL software record (archivePrefix=ascl or an
+                        `ascl:` eprint). Software cites verify by existence
+                        (ASCL id / bibcode), not a journal DOI.
+    - 'in_preparation' — journal/note marks it unpublished (in prep,
+                        submitted, to appear, in press, forthcoming).
+    - 'no_publication_info' — no doi, no eprint, and no journal/booktitle at all,
+                        so we can't even form a trustworthy query.
+
+    Priority: a missed resolution (doi=None, flagged) is strictly preferable to a
+    fabricated one. Traceability over coverage.
+    """
+    if fields.get("doi"):
+        return None
+    archive = (fields.get("archiveprefix") or "").strip().lower()
+    eprint = (fields.get("eprint") or "").strip()
+    if archive == "ascl" or eprint.lower().startswith("ascl:"):
+        return "software_ascl"
+    # A genuine arXiv eprint (YYMM.NNNNN or old archive/NNNNNNN) is resolvable.
+    if eprint and re.match(
+        r"^(?:\d{4}\.\d{4,5}|[a-z\-]+/\d{7})(?:v\d+)?$", eprint, re.IGNORECASE
+    ):
+        return None
+    blob = " ".join(
+        v
+        for v in (fields.get("journal"), fields.get("note"), fields.get("howpublished"))
+        if v
+    )
+    if _IN_PREP_RE.search(blob):
+        return "in_preparation"
+    if not (fields.get("journal") or fields.get("booktitle") or fields.get("howpublished")):
+        return "no_publication_info"
+    return None
+
+
+def _candidate_surname(author_entry) -> str:
+    """First-author surname from a Crossref author dict or an ADS/bib author string."""
+    if isinstance(author_entry, dict):
+        return (author_entry.get("family") or "").strip().lower()
+    if isinstance(author_entry, str):
+        # ADS / BibTeX comma form "Last, First"; else take the whole token.
+        head = author_entry.split(",")[0].strip()
+        return (head or author_entry).strip().lower()
+    return ""
+
+
+def _author_matches(surname: str, candidate_authors) -> bool:
+    """True if `surname` plausibly matches the candidate's first-author family name.
+
+    Conservative on missing metadata: if no candidate author can be extracted we
+    return True and lean on the title gate. We only *block* on a positive
+    mismatch — two real surnames that don't overlap.
+    """
+    if not surname:
+        return True
+    fam = ""
+    if isinstance(candidate_authors, list) and candidate_authors:
+        fam = _candidate_surname(candidate_authors[0])
+    elif isinstance(candidate_authors, str):
+        fam = _candidate_surname(candidate_authors)
+    if not fam:
+        return True
+    s = surname.strip().lower()
+    return s in fam or fam in s
+
+
 # ---------------------------------------------------------------------------
 # Top-level bibliography pipeline

@@ -1224,6 +1323,7 @@ def resolve_bibliography(
             "title": _clean_bib_text(fields.get("title", "")).strip(),
             "first_author": _first_author_from_bib_field(fields.get("author", "")).split(",")[0],
             "source": "bib",
+            "unresolvable": classify_unresolvable(fields),
         }

     if bbl_path and bbl_path.is_file() and not bib_entries:
@@ -1282,18 +1382,35 @@ def resolve_bibliography(
             )
             enriched[key] = {"locations": locations, "citation": None, "doi": None}
             continue
+        status = entry.get("unresolvable")
+        if status:
+            # No DOI exists (in-prep / software / no publication info). Refuse to
+            # fuzzy-resolve — that is exactly how a wrong DOI gets attached. Leave
+            # doi=null and surface for human attention.
+            warnings.append(
+                f"citation {key}: NO DOI — {status.replace('_', ' ')}; left unresolved for "
+                f"human review (not fuzzy-matched). [{(entry['citation'] or '')[:70]}]"
+            )
+            enriched[key] = {
+                "locations": locations,
+                "citation": entry["citation"],
+                "doi": None,
+                "needs_human": status,
+            }
+            continue
         doi, _source = resolver.resolve(
             entry["title"], entry["first_author"], entry["doi_hint"], entry["arxiv_hint"]
         )
         if doi is None:
             warnings.append(
                 f"citation {key}: could not resolve DOI; tried doi-field, eprint-field, "
-                f"Crossref{', ADS' if resolver.ads_key else ''}"
+                f"Crossref{', ADS' if resolver.ads_key else ''}. Flagged for human review."
             )
         enriched[key] = {
             "locations": locations,
             "citation": entry["citation"],
             "doi": doi,
+            **({} if doi else {"needs_human": "unresolved"}),
         }

     # Path B: every parsed entry lands in the citations block with empty locations.
diff --git a/tests/test_paper_extraction_doi.py b/tests/test_paper_extraction_doi.py
new file mode 100644
index 0000000..c98c952
--- /dev/null
+++ b/tests/test_paper_extraction_doi.py
@@ -0,0 +1,110 @@
+"""Tests for the paper-extraction DOI-resolver guards.
+
+These guard against the phantom-DOI class of bug: a programmatic Crossref/ADS
+title search returning a *wrong* DOI for an entry that has no DOI to find
+(an in-preparation paper, an ASCL software record), or a too-loose fuzzy
+match accepting an unrelated paper. A wrong DOI silently points evidence
+verification at the wrong paper, so the resolver must prefer a flagged miss
+over a fabricated hit.
+
+The resolver lives in a hyphenated script (`extract-paper-substrate.py`) that
+isn't importable as a normal module, so we load it via importlib.
+"""
+
+from __future__ import annotations
+
+import importlib.util
+from pathlib import Path
+
+import pytest
+
+_SCRIPT = (
+    Path(__file__).resolve().parent.parent
+    / "claude/lightcone/skills/paper-extraction/scripts/extract-paper-substrate.py"
+)
+
+
+def _load():
+    spec = importlib.util.spec_from_file_location("extract_paper_substrate", _SCRIPT)
+    assert spec and spec.loader
+    mod = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(mod)
+    return mod
+
+
+eps = _load()
+
+
+# --- classify_unresolvable -------------------------------------------------
+
+
+@pytest.mark.parametrize(
+    "fields,expected",
+    [
+        # Published: a real DOI or arXiv eprint -> resolvable (None).
+        ({"doi": "10.1051/0004-6361/202142479"}, None),
+        ({"eprint": "2212.03257", "archiveprefix": "arXiv"}, None),
+        ({"eprint": "astro-ph/0307393", "archiveprefix": "arXiv"}, None),
+        # In preparation: no DOI to find -> flag, never fuzzy-resolve.
+        ({"journal": "in preparation", "title": "Paper V"}, "in_preparation"),
+        ({"note": "submitted to MNRAS"}, "in_preparation"),
+        ({"journal": "to appear in A&A"}, "in_preparation"),
+        # ASCL software record -> software cite, verify by existence.
+        (
+            {"archiveprefix": "ascl", "eprint": "1508.007", "title": "TreeCorr"},
+            "software_ascl",
+        ),
+        ({"eprint": "ascl:1508.007"}, "software_ascl"),
+        # No publication info at all -> can't form a trustworthy query.
+        ({"title": "Some untethered note", "author": "Doe, J."}, "no_publication_info"),
+        # Has a journal but no doi/eprint -> resolvable (will be gated-resolved).
+        ({"journal": "MNRAS", "title": "A real paper", "author": "Doe, J."}, None),
+    ],
+)
+def test_classify_unresolvable(fields, expected):
+    assert eps.classify_unresolvable(fields) == expected
+
+
+def test_ascl_is_not_treated_as_arxiv():
+    """The treecorr15 regression: ascl:1508.007 must not resolve to a DOI."""
+    fields = {"archiveprefix": "ascl", "eprint": "1508.007"}
+    assert eps.classify_unresolvable(fields) == "software_ascl"
+
+
+# --- _author_matches -------------------------------------------------------
+
+
+def test_author_matches_blocks_real_mismatch():
+    # Crossref dict form.
+    assert eps._author_matches("Goh", [{"family": "Hawken"}]) is False
+    # ADS / bib "Last, First" string form.
+    assert eps._author_matches("Jarvis", ["Jarvis, M."]) is True
+
+
+def test_author_matches_accepts_missing_metadata():
+    # No candidate author -> don't block; lean on the title gate.
+    assert eps._author_matches("Goh", []) is True
+    assert eps._author_matches("Goh", None) is True
+    # No queried surname -> nothing to check.
+    assert eps._author_matches("", [{"family": "Anyone"}]) is True
+
+
+# --- title gate ------------------------------------------------------------
+
+
+def test_phantom_title_collision_below_gate():
+    """The exact phantom-DOI collision: a 1978 DOE report whose subtitle matches
+    a software title must fall below the 0.80 gate."""
+    sim = eps._title_similarity(
+        "TreeCorr: Two-point correlation functions",
+        "Quantum electrodynamics and light rays. [Two-point correlation functions]",
+    )
+    assert sim < eps.TITLE_MATCH_MIN
+
+
+def test_true_title_above_gate():
+    sim = eps._title_similarity(
+        "A general framework for removing point-spread function additive systematics",
+        "A general framework for removing point spread function additive systematics",
+    )
+    assert sim >= eps.TITLE_MATCH_MIN
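
Reviewer note: the two-part gate this patch introduces (title similarity of at least 0.80, plus a first-author surname check that blocks only on a positive mismatch) can be tried in isolation with the sketch below. It is an illustration under stated assumptions, not the script's code: `normalize_title`, `surname_matches`, and `accept_hit` are hypothetical names, and the normalization is a guess, since the body of the script's own `_title_similarity` is not visible in this diff. Only the 0.80 threshold and the substring-style surname comparison are taken from the patch.

```python
from __future__ import annotations

import re
from difflib import SequenceMatcher

TITLE_MATCH_MIN = 0.80  # threshold value taken from the patch


def normalize_title(title: str) -> str:
    """Hypothetical normalizer: lowercase, runs of non-alphanumerics become one space."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def title_similarity(a: str, b: str) -> float:
    """SequenceMatcher ratio on normalized titles (normalization is an assumption)."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def surname_matches(queried: str, candidate: str) -> bool:
    """Block only on a positive mismatch; a missing surname on either side passes."""
    q, c = queried.strip().lower(), candidate.strip().lower()
    if not q or not c:
        return True
    return q in c or c in q


def accept_hit(title: str, first_author: str, hit_title: str, hit_author: str) -> bool:
    """Accept a fuzzy search hit only if it clears both the title and author checks."""
    if title_similarity(title, hit_title) < TITLE_MATCH_MIN:
        return False
    return surname_matches(first_author, hit_author)


if __name__ == "__main__":
    # The collision described in the commit message: shared subtitle, different work.
    print(accept_hit(
        "TreeCorr: Two-point correlation functions",
        "Jarvis",
        "Quantum electrodynamics and light rays. [Two-point correlation functions]",
        "Jarvis",
    ))
```

With this normalization the TreeCorr collision is rejected on the title check alone, even when the surnames agree, because the shared subtitle covers less than half of the longer title. A hit whose title differs only in punctuation is accepted unless the two surnames are both present and do not overlap.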