12 changes: 10 additions & 2 deletions claude/lightcone/skills/paper-extraction/SKILL.md
@@ -147,7 +147,14 @@ The script detects the path automatically and produces:

The `--arxiv-id` / `--doi` argument populates the `id` and the evidence `doi:` field in `astra.yaml`. If neither is provided, the script writes placeholder text the agent can fix.

The DOI resolver tries, in order: the entry's `doi:` field → an `eprint:`-derived arXiv DOI → Crossref bibliographic query (free, no API key needed) → ADS title search (only if `ADS_API_TOKEN` env var or `~/.ads/dev_key` is present — graceful skip when absent). Title hits from Crossref are gated by a similarity check against the queried title to drop noisy false matches.
The DOI resolver tries, in order: the entry's `doi:` field → an `eprint:`-derived arXiv DOI → Crossref bibliographic query (free, no API key needed) → ADS title search (only if `ADS_API_TOKEN` env var or `~/.ads/dev_key` is present — graceful skip when absent).
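
As a rough illustration of the second step in that chain (the `eprint:`-derived arXiv DOI), here is a minimal sketch. It assumes the DataCite form `10.48550/arXiv.<id>` that arXiv registers for its preprints. The helper name `arxiv_doi_from_eprint` and the version-stripping behaviour are hypothetical and may differ from what the script actually does.

```python
from __future__ import annotations

import re

# New-style (YYMM.NNNNN) or old-style (archive/NNNNNNN) arXiv identifiers,
# with an optional version suffix such as "v2".
_ARXIV_ID_RE = re.compile(
    r"^(?:\d{4}\.\d{4,5}|[a-z\-]+/\d{7})(?:v\d+)?$", re.IGNORECASE
)


def arxiv_doi_from_eprint(eprint: str) -> str | None:
    """Hypothetical helper: map an arXiv eprint id to its arXiv DOI, else None."""
    eprint = eprint.strip()
    # Tolerate an explicit "arXiv:" prefix (assumption about .bib conventions).
    if eprint.lower().startswith("arxiv:"):
        eprint = eprint[len("arxiv:"):]
    if not _ARXIV_ID_RE.match(eprint):
        # Not an arXiv id (for example an ASCL id such as "ascl:1508.007").
        return None
    # Drop the version suffix so the DOI names the paper, not one revision.
    bare = re.sub(r"v\d+$", "", eprint, flags=re.IGNORECASE)
    return f"10.48550/arXiv.{bare}"
```

Returning `None` for anything that is not shaped like an arXiv identifier keeps this step from guessing, which is the same preference for a miss over a wrong DOI that the rest of the resolver follows.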

**Fuzzy resolution is gated to avoid wrong DOIs.** A missed resolution (`doi: null`, flagged for human review) is strictly preferable to a fabricated one — a wrong DOI silently points evidence verification at the wrong paper. Two safeguards:

1. **Entries with no DOI to find are never fuzzy-resolved.** For an entry that has neither a `doi:` nor a genuine arXiv `eprint:`, `classify_unresolvable()` short-circuits to `doi: null` plus a `needs_human` tag when the entry is unpublished (`journal`/`note`/`howpublished` says *in preparation*, *submitted*, *to appear*, *in press*, *forthcoming*), is an ASCL software record (`archivePrefix=ascl` or an `ascl:` eprint; software cites verify by existence, not a journal DOI), or carries no `journal`/`booktitle`/`howpublished` at all. Without this guard, fuzzy title search can return a *false positive* for these (e.g. an "in preparation" companion paper matching an unrelated journal article, or `TreeCorr: Two-point correlation functions` matching a 1978 DOE report subtitled `[Two-point correlation functions]`).
2. **Both Crossref and ADS hits require a title-similarity ≥ 0.80 _and_ a first-author surname match.** The historical Crossref-only gate of 0.55 (with no author check, and no gate at all on the ADS path) was too permissive.
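
The second safeguard can be sketched roughly as follows. This is an illustration only, not the script's code: `passes_gate` and `_norm` are hypothetical names, and the normalization (lowercase, collapse punctuation to spaces) is an assumption about how titles are compared. The 0.80 threshold and the lenient handling of missing author metadata follow the description above.

```python
import re
from difflib import SequenceMatcher

TITLE_MATCH_MIN = 0.80  # threshold quoted in the text above


def _norm(title: str) -> str:
    # Assumed normalization: lowercase, collapse non-alphanumerics to one space.
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def passes_gate(
    query_title: str, query_surname: str, hit_title: str, hit_surname: str
) -> bool:
    """Hypothetical gate: accept a fuzzy hit only if title AND surname agree."""
    sim = SequenceMatcher(None, _norm(query_title), _norm(hit_title)).ratio()
    if sim < TITLE_MATCH_MIN:
        return False
    q = query_surname.strip().lower()
    h = hit_surname.strip().lower()
    if not q or not h:
        # Missing author metadata on either side: rely on the title gate alone.
        return True
    # Containment rather than equality, so "jarvis" matches "jarvis, m.".
    return q in h or h in q
```

Substring containment is deliberately loose on surname formatting. The trade-off is that very short surnames can match by accident, so the title threshold still carries most of the weight.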

Entries the resolver leaves unresolved carry `needs_human` in the `citations:` block (values: the `classify_unresolvable` tag, or `"unresolved"` when the gated lookups found nothing) so a downstream consumer or human can act on them.
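
A downstream consumer might split the `citations:` block along these lines. This is a sketch under assumptions: the function name `split_for_review` is hypothetical, the block is taken to be already loaded as a plain dict keyed by citation key, and the `"untagged"` label for entries with neither a DOI nor a `needs_human` value is invented for illustration.

```python
from __future__ import annotations


def split_for_review(
    citations: dict[str, dict],
) -> tuple[dict[str, str], dict[str, str]]:
    """Hypothetical consumer: return (resolved key -> DOI, review key -> reason)."""
    resolved: dict[str, str] = {}
    review: dict[str, str] = {}
    for key, entry in citations.items():
        tag = entry.get("needs_human")
        if tag:
            # A classify_unresolvable tag, or "unresolved" from the gated lookups.
            review[key] = tag
        elif entry.get("doi"):
            resolved[key] = entry["doi"]
        else:
            # No DOI and no tag, e.g. a cite with no bibliography entry.
            # "untagged" is an illustrative label, not one the script emits.
            review[key] = "untagged"
    return resolved, review
```

Checking `needs_human` before `doi` means a flagged entry is never treated as verified, even if a later tool fills in a DOI without clearing the tag.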

### Step 4 — Review the script's output and fix structural gaps

@@ -156,7 +163,8 @@ The script is purely deterministic. It walks the structural surface but does not
- **`figure figN: \includegraphics{X} could not resolve`** — the LaTeX referenced a file the script couldn't find. Search the source tree manually (sometimes figures live in non-standard subdirectories with non-standard extensions); copy the file into `figures/` and update the corresponding `index.json` entry's `file` so it's no longer null.
- **`figure figN: no \caption found`** — composite figures (subfloats) sometimes lack a top-level caption; verify the figure block in source and either record the per-subfigure captions in `caption` or note that the figure is composite.
- **`table tabN: no \label`** — verify the table is intentional (some `\begin{table}` blocks are non-tabular layout); rename or annotate as needed.
- **`citation <key>: could not resolve DOI`** — the entry has no `doi:` / `eprint:` field, and neither Crossref nor ADS (when available) returned a match. The entry stays in `citations:` with `doi: null`; a downstream consumer can flag it for human resolution or skip it. If many entries are unresolved, check that the title field is clean (sometimes `.bib` titles carry uncleaned LaTeX commands that drag down the Crossref similarity gate). Delete `.doi-cache.json` to force re-resolution.
- **`citation <key>: could not resolve DOI`** — the entry has no `doi:` / `eprint:` field, and the gated Crossref/ADS lookups found no match clearing the title-similarity + author check. The entry stays in `citations:` with `doi: null` and `needs_human: "unresolved"`; flag for human resolution or skip. If many published entries are unresolved, check that the title field is clean (uncleaned LaTeX in `.bib` titles drags down the similarity gate). Delete `.doi-cache.json` to force re-resolution.
- **`citation <key>: NO DOI — in preparation | software ascl | no publication info`** — the entry was deliberately *not* fuzzy-resolved because it has no DOI to find (unpublished, an ASCL software record, or no publication metadata). It carries `doi: null` and a `needs_human` tag. This is correct behavior, not a failure: resolve it by hand once the paper is published (in-prep), or leave it (software cites verify by existence). Re-running after the `.bib` gains a real `doi:`/`eprint:` resolves it automatically.
- **`citation <key>: cited in source but no matching entry in bibliography-source.{bib,bbl}`** — a `\cite{<key>}` invocation has no corresponding bib record. Usually a typo in the LaTeX source; flag it and move on. The entry stays in `citations:` with `citation: null, doi: null`, locations preserved.
- **Path B caveat** — outline extraction is not yet implemented for the Docling fallback. Bibliography resolution works on Path B by parsing the references section at the tail of `document.md` and synthesizing keys (`<lastname>_<year>`), but citation *invocations* from rendered prose aren't yet extracted — Path B citations carry empty `locations: []`. The warnings list flags this.
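
To see why uncleaned LaTeX in a `.bib` title drags down the similarity gate, the following sketch strips common markup before comparison. It is an assumption-laden illustration: `clean_bib_title` is a hypothetical helper, not the script's own cleaning routine, and it handles only simple commands, accent macros, and braces.

```python
import re


def clean_bib_title(raw: str) -> str:
    """Hypothetical cleaner: remove simple LaTeX markup from a .bib title."""
    text = re.sub(r"\\[^a-zA-Z\s]", "", raw)       # accent macros such as \" or \'
    text = re.sub(r"\\[a-zA-Z]+\*?", " ", text)    # commands such as \textbf
    text = re.sub(r"[{}$]", "", text)              # grouping braces, math delimiters
    text = text.replace("~", " ")                  # non-breaking spaces
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_bib_title(r"A {\textbf{toy}} title with {LaTeX} markup")` yields `A toy title with LaTeX markup`. Titles containing real math (for example `$\Lambda$CDM`) lose the symbol under this sketch, so such entries may still need a manual check.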

@@ -1128,8 +1128,11 @@ def _resolve_via_crossref(self, title: str, first_author: str) -> tuple[str | No
        if not candidate_doi:
            return None, "unresolved"
        # Title-similarity gate: drop noisy hits where the top result clearly isn't
        # the paper we asked about.
        if candidate_titles and _title_similarity(title, candidate_titles[0]) < 0.55:
        # the paper we asked about. A bare-title match with no author corroboration
        # is exactly how phantom DOIs slipped in, so require both.
        if not candidate_titles or _title_similarity(title, candidate_titles[0]) < TITLE_MATCH_MIN:
            return None, "unresolved"
        if not _author_matches(first_author, top.get("author")):
            return None, "unresolved"
        return self._normalize_doi(candidate_doi), "crossref"

@@ -1139,7 +1142,7 @@ def _resolve_via_ads(self, title: str, first_author: str) -> tuple[str | None, s
        q = f'title:"{title}"'
        if first_author:
            q += f' author:"{first_author}"'
        params = {"q": q, "fl": "doi,title", "rows": "1"}
        params = {"q": q, "fl": "doi,title,author", "rows": "1"}
        url = f"{ADS_API}?{urllib.parse.urlencode(params)}"
        try:
            data = self._http_get_json(
@@ -1151,9 +1154,18 @@ def _resolve_via_ads(self, title: str, first_author: str) -> tuple[str | None, s
        docs = ((data or {}).get("response", {}) or {}).get("docs", []) or []
        if not docs:
            return None, "unresolved"
        doi_list = docs[0].get("doi") or []
        doc = docs[0]
        doi_list = doc.get("doi") or []
        if not doi_list:
            return None, "unresolved"
        # ADS's quoted title/author query is fuzzy too; apply the same
        # title-similarity + first-author gate the Crossref path uses, so the
        # ADS fallback can't reintroduce a wrong DOI.
        candidate_titles = doc.get("title") or []
        if not candidate_titles or _title_similarity(title, candidate_titles[0]) < TITLE_MATCH_MIN:
            return None, "unresolved"
        if not _author_matches(first_author, doc.get("author")):
            return None, "unresolved"
        return self._normalize_doi(doi_list[0]), "ads"

    @staticmethod
@@ -1184,6 +1196,93 @@ def _title_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a_norm, b_norm).ratio()


# Minimum title-similarity for a fuzzy DOI match to be accepted. Raised from a
# historical 0.55 (too permissive — it let a 1978 DOE report whose subtitle was
# "[Two-point correlation functions]" match a "TreeCorr: Two-point correlation
# functions" software entry) to 0.80, paired with a first-author check. A miss
# is cheaper than a wrong DOI: an unresolved entry is flagged for human review;
# a wrong DOI silently points evidence verification at the wrong paper.
TITLE_MATCH_MIN = 0.80

# Markers that a bibliography entry is not yet published — there is no DOI to
# find, so fuzzy title resolution can only produce a false positive.
_IN_PREP_RE = re.compile(
    r"\b(in[\s_-]*prep(?:aration)?|submitted|to[\s_-]+appear|in[\s_-]+press|forthcoming)\b",
    re.IGNORECASE,
)


def classify_unresolvable(fields: dict[str, str]) -> str | None:
    """Return a non-resolvable status tag for a parsed `.bib` entry, or None.

    Entries that carry a real DOI or arXiv eprint are always resolvable (None).
    Otherwise we refuse to fuzzy-resolve, and flag for human review, when:

    - 'software_ascl': an ASCL software record (archivePrefix=ascl or an
      `ascl:` eprint). Software cites verify by existence
      (ASCL id / bibcode), not a journal DOI.
    - 'in_preparation': journal/note/howpublished marks it unpublished
      (in prep, submitted, to appear, in press, forthcoming).
    - 'no_publication_info': no doi, no eprint, and no
      journal/booktitle/howpublished at all, so we can't even form a
      trustworthy query.

    Priority: a missed resolution (doi=None, flagged) is strictly preferable to a
    fabricated one. Traceability over coverage.
    """
    if fields.get("doi"):
        return None
    archive = (fields.get("archiveprefix") or "").strip().lower()
    eprint = (fields.get("eprint") or "").strip()
    if archive == "ascl" or eprint.lower().startswith("ascl:"):
        return "software_ascl"
    # A genuine arXiv eprint (YYMM.NNNNN or old archive/NNNNNNN) is resolvable.
    if eprint and re.match(
        r"^(?:\d{4}\.\d{4,5}|[a-z\-]+/\d{7})(?:v\d+)?$", eprint, re.IGNORECASE
    ):
        return None
    # Unpublished markers can appear in any of these three fields.
    blob = " ".join(
        v
        for v in (fields.get("journal"), fields.get("note"), fields.get("howpublished"))
        if v
    )
    if _IN_PREP_RE.search(blob):
        return "in_preparation"
    if not (fields.get("journal") or fields.get("booktitle") or fields.get("howpublished")):
        return "no_publication_info"
    return None


def _candidate_surname(author_entry) -> str:
    """First-author surname from a Crossref author dict or an ADS/bib author string."""
    if isinstance(author_entry, dict):
        return (author_entry.get("family") or "").strip().lower()
    if isinstance(author_entry, str):
        # ADS / BibTeX comma form "Last, First"; else take the whole token.
        head = author_entry.split(",")[0].strip()
        return (head or author_entry).strip().lower()
    return ""


def _author_matches(surname: str, candidate_authors) -> bool:
    """True if `surname` plausibly matches the candidate's first-author family name.

    Conservative on missing metadata: if no candidate author can be extracted we
    return True and lean on the title gate. We only *block* on a positive
    mismatch: two real surnames that don't overlap.
    """
    if not surname:
        return True
    fam = ""
    if isinstance(candidate_authors, list) and candidate_authors:
        fam = _candidate_surname(candidate_authors[0])
    elif isinstance(candidate_authors, str):
        fam = _candidate_surname(candidate_authors)
    if not fam:
        return True
    s = surname.strip().lower()
    return s in fam or fam in s


# ---------------------------------------------------------------------------
# Top-level bibliography pipeline

@@ -1224,6 +1323,7 @@ def resolve_bibliography(
"title": _clean_bib_text(fields.get("title", "")).strip(),
"first_author": _first_author_from_bib_field(fields.get("author", "")).split(",")[0],
"source": "bib",
"unresolvable": classify_unresolvable(fields),
}

if bbl_path and bbl_path.is_file() and not bib_entries:
Expand Down Expand Up @@ -1282,18 +1382,35 @@ def resolve_bibliography(
)
enriched[key] = {"locations": locations, "citation": None, "doi": None}
continue
status = entry.get("unresolvable")
if status:
# No DOI exists (in-prep / software / no publication info). Refuse to
# fuzzy-resolve — that is exactly how a wrong DOI gets attached. Leave
# doi=null and surface for human attention.
warnings.append(
f"citation {key}: NO DOI — {status.replace('_', ' ')}; left unresolved for "
f"human review (not fuzzy-matched). [{(entry['citation'] or '')[:70]}]"
)
enriched[key] = {
"locations": locations,
"citation": entry["citation"],
"doi": None,
"needs_human": status,
}
continue
doi, _source = resolver.resolve(
entry["title"], entry["first_author"], entry["doi_hint"], entry["arxiv_hint"]
)
if doi is None:
warnings.append(
f"citation {key}: could not resolve DOI; tried doi-field, eprint-field, "
f"Crossref{', ADS' if resolver.ads_key else ''}"
f"Crossref{', ADS' if resolver.ads_key else ''}. Flagged for human review."
)
enriched[key] = {
"locations": locations,
"citation": entry["citation"],
"doi": doi,
**({} if doi else {"needs_human": "unresolved"}),
}

# Path B: every parsed entry lands in the citations block with empty locations.
110 changes: 110 additions & 0 deletions tests/test_paper_extraction_doi.py
@@ -0,0 +1,110 @@
"""Tests for the paper-extraction DOI-resolver guards.

These guard against the phantom-DOI class of bug: a programmatic Crossref/ADS
title search returning a *wrong* DOI for an entry that has no DOI to find
(an in-preparation paper, an ASCL software record), or a too-loose fuzzy
match accepting an unrelated paper. A wrong DOI silently points evidence
verification at the wrong paper, so the resolver must prefer a flagged miss
over a fabricated hit.

The resolver lives in a hyphenated script (`extract-paper-substrate.py`) that
isn't importable as a normal module, so we load it via importlib.
"""

from __future__ import annotations

import importlib.util
from pathlib import Path

import pytest

_SCRIPT = (
    Path(__file__).resolve().parent.parent
    / "claude/lightcone/skills/paper-extraction/scripts/extract-paper-substrate.py"
)


def _load():
    spec = importlib.util.spec_from_file_location("extract_paper_substrate", _SCRIPT)
    assert spec and spec.loader
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod


eps = _load()


# --- classify_unresolvable -------------------------------------------------


@pytest.mark.parametrize(
    "fields,expected",
    [
        # Published: a real DOI or arXiv eprint -> resolvable (None).
        ({"doi": "10.1051/0004-6361/202142479"}, None),
        ({"eprint": "2212.03257", "archiveprefix": "arXiv"}, None),
        ({"eprint": "astro-ph/0307393", "archiveprefix": "arXiv"}, None),
        # In preparation: no DOI to find -> flag, never fuzzy-resolve.
        ({"journal": "in preparation", "title": "Paper V"}, "in_preparation"),
        ({"note": "submitted to MNRAS"}, "in_preparation"),
        ({"journal": "to appear in A&A"}, "in_preparation"),
        # ASCL software record -> software cite, verify by existence.
        (
            {"archiveprefix": "ascl", "eprint": "1508.007", "title": "TreeCorr"},
            "software_ascl",
        ),
        ({"eprint": "ascl:1508.007"}, "software_ascl"),
        # No publication info at all -> can't form a trustworthy query.
        ({"title": "Some untethered note", "author": "Doe, J."}, "no_publication_info"),
        # Has a journal but no doi/eprint -> resolvable (will be gated-resolved).
        ({"journal": "MNRAS", "title": "A real paper", "author": "Doe, J."}, None),
    ],
)
def test_classify_unresolvable(fields, expected):
    assert eps.classify_unresolvable(fields) == expected


def test_ascl_is_not_treated_as_arxiv():
    """The treecorr15 regression: ascl:1508.007 must not resolve to a DOI."""
    fields = {"archiveprefix": "ascl", "eprint": "1508.007"}
    assert eps.classify_unresolvable(fields) == "software_ascl"


# --- _author_matches -------------------------------------------------------


def test_author_matches_blocks_real_mismatch():
    # Crossref dict form.
    assert eps._author_matches("Goh", [{"family": "Hawken"}]) is False
    # ADS / bib "Last, First" string form.
    assert eps._author_matches("Jarvis", ["Jarvis, M."]) is True


def test_author_matches_accepts_missing_metadata():
    # No candidate author -> don't block; lean on the title gate.
    assert eps._author_matches("Goh", []) is True
    assert eps._author_matches("Goh", None) is True
    # No queried surname -> nothing to check.
    assert eps._author_matches("", [{"family": "Anyone"}]) is True


# --- title gate ------------------------------------------------------------


def test_phantom_title_collision_below_gate():
    """The exact phantom-DOI collision: a 1978 DOE report whose subtitle matches
    a software title must fall below the 0.80 gate."""
    sim = eps._title_similarity(
        "TreeCorr: Two-point correlation functions",
        "Quantum electrodynamics and light rays. [Two-point correlation functions]",
    )
    assert sim < eps.TITLE_MATCH_MIN


def test_true_title_above_gate():
    sim = eps._title_similarity(
        "A general framework for removing point-spread function additive systematics",
        "A general framework for removing point spread function additive systematics",
    )
    assert sim >= eps.TITLE_MATCH_MIN