feat(capture): identify hashed fonts via OpenType name table #987
ukimsanov wants to merge 2 commits
Modern frameworks (Next.js, Webpack) hash font filenames like
`f9b8e1e8d4c3f0a7-s.woff2`, so the capture pipeline can't tell which
file belongs to which family by reading the filename. Sub-agents
authoring DESIGN.md were guessing or falling back to system fonts.
This adds `fontMetadataExtractor.ts`: reads the binary OpenType `name`
table via `fontkit`, identifies each downloaded font by its real
family name, and writes `capture/extracted/fonts-manifest.json` with
per-file metadata + per-family aggregates (weights, variable-font
axes, file counts).
- Canonicalizes static-weight family-name packaging: "Inter Medium"
resolves to family "Inter" with weight 500, "Semi Bold" normalizes
to "SemiBold", etc. Width modifiers ("Tight", "Condensed") are NOT
stripped — they denote separate typographic families.
- Reads variable-font axes from `fvar` so a single .woff2 carrying a
full weight range is identified as variable (e.g. "Inter (100-900
variable)").
- Uses `@types/fontkit` properly (no `unknown` cast), with a
Font/FontCollection type guard. fontkit API drift surfaces as a
compile error rather than silent undefined.
- Wired into `capture/index.ts` after `downloadAndRewriteFonts` so it
runs after fonts are already on disk. Non-fatal try/catch — capture
succeeds even if extraction fails.
Tested against 9 captures: 132/132 fonts identified by real family
name, including hashed Next.js builds.
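The family-name canonicalization described above can be illustrated with a minimal sketch. It is not the PR's implementation: the token table, the `canonicalizeFamilySketch` name, and the return shape are assumptions made for illustration. It tries the last two words before the last one so that "Semi Bold" is matched before "Bold", and it leaves width modifiers such as "Tight" in the family name.

```typescript
// Hypothetical token table; the real extractor's list may differ.
const WEIGHT_TOKENS: Record<string, number> = {
  thin: 100, extralight: 200, ultralight: 200, light: 300, medium: 500,
  semibold: 600, demibold: 600, bold: 700, extrabold: 800, ultrabold: 800,
  black: 900, heavy: 900, extrablack: 950, ultrablack: 950,
};

interface CanonicalFamily {
  family: string;
  weight: number | null; // null when no trailing weight token was found
}

function canonicalizeFamilySketch(rawFamily: string): CanonicalFamily {
  const words = rawFamily.trim().split(/\s+/).filter(Boolean);
  // Try the last two words first so "Semi Bold" matches before "Bold".
  for (const take of [2, 1]) {
    // Never consume the whole name: a family called "Bold" stays "Bold".
    if (words.length <= take) continue;
    const token = words.slice(-take).join("").replace(/-/g, "").toLowerCase();
    const weight = WEIGHT_TOKENS[token];
    if (weight !== undefined) {
      return { family: words.slice(0, -take).join(" "), weight };
    }
  }
  // No weight token: width modifiers like "Tight" remain part of the family.
  return { family: words.join(" "), weight: null };
}
```

Under these assumptions, "Inter Medium" resolves to family "Inter" with weight 500, while "Inter Tight" is returned unchanged.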
Pull request overview
Adds font binary inspection to the CLI capture pipeline so hashed/downloaded font files can be identified by their real OpenType family names and summarized for downstream consumers (e.g., DESIGN.md authoring).
Changes:
- Add a `fontMetadataExtractor` that uses `fontkit` to read OpenType name/OS/2/fvar metadata and write `extracted/fonts-manifest.json`.
- Wire the extractor into `captureWebsite` after `downloadAndRewriteFonts`, with non-fatal logging.
- Add `fontkit` dependency (and update lockfile).
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| packages/cli/src/capture/index.ts | Invokes font metadata extraction after font download/URL rewrite and logs a concise summary. |
| packages/cli/src/capture/fontMetadataExtractor.ts | New extractor that parses font binaries, canonicalizes family names/weights, aggregates per-family, and writes a manifest. |
| packages/cli/package.json | Adds fontkit (and @types/fontkit), plus bumps sharp. |
| bun.lock | Locks new dependency graph including fontkit and related transitive deps. |
    if (s.includes("thin")) return 100;
    if (s.includes("extralight") || s.includes("ultralight")) return 200;
    if (s.includes("light")) return 300;
    if (s.includes("medium")) return 500;
    if (s.includes("semibold") || s.includes("demibold")) return 600;
    if (s.includes("extrabold") || s.includes("ultrabold")) return 800;
    if (s.includes("black") || s.includes("heavy")) return 900;
    if (s.includes("bold")) return 700;
    /** Raw family name from the OpenType name table (nameID 16 preferred, then nameID 1). Empty if unidentifiable. */
    rawFamily: string;
    /** Subfamily / style name from nameID 17 or 2 (e.g. "Regular", "Bold Italic") */
    subfamily: string;
    /** PostScript name from nameID 6 (e.g. "Inter-Regular") */
    postscript: string;
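The follow-up commit in this thread notes that `rawFamily` actually follows a three-step precedence: nameID 16, then nameID 1, then a family derived from the PostScript name. A minimal sketch of that fallback chain follows. The `NameFields` shape and both function names are hypothetical, and the rule of taking the text before the first hyphen is an assumption, not the confirmed behavior of `deriveFamilyFromPostscript`.

```typescript
// Hypothetical record of the three name-table strings the extractor reads.
interface NameFields {
  preferredFamily?: string; // nameID 16
  family?: string; // nameID 1
  postscript?: string; // nameID 6
}

// Assumption: the PostScript fallback keeps the text before the first hyphen,
// e.g. "Inter-Regular" -> "Inter". The real helper may be more elaborate.
function deriveFamilyFromPostscriptSketch(postscript: string): string {
  const [head = ""] = postscript.split("-");
  return head.trim();
}

function resolveRawFamily(fields: NameFields): string {
  const candidates = [
    fields.preferredFamily,
    fields.family,
    fields.postscript ? deriveFamilyFromPostscriptSketch(fields.postscript) : undefined,
  ];
  for (const candidate of candidates) {
    const trimmed = candidate?.trim();
    if (trimmed) return trimmed; // first non-empty source wins
  }
  return ""; // unidentifiable
}
```

The empty-string result mirrors the "Empty if unidentifiable" wording in the docstring.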
    /** OS/2 usWeightClass (100–900). Approximate for variable fonts — see variationAxes. */
    weight: number;
    unidentified,
    meta: {
      generatedAt: new Date().toISOString(),
      tool: "fontkit@2.0.4",
    /**
     * Read all font files in fontsDir, extract metadata via fontkit, and write
     * the manifest to outputPath. Returns the manifest in case callers want to log it.
     *
     * Failures are non-fatal: if a single font's name table is missing or corrupt,
     * the file is added to `unidentified` and the rest continue. If the fonts
     * directory doesn't exist, returns an empty manifest without throwing.
     */
    export function extractFontMetadata(fontsDir: string, outputPath: string): FontsManifest {
      const files: FontFileMetadata[] = [];
      const unidentified: string[] = [];

      if (existsSync(fontsDir)) {
        const fontFiles = readdirSync(fontsDir).filter((f) => /\.(woff2?|ttf|otf)$/i.test(f));
        for (const filename of fontFiles) {
          const fullPath = join(fontsDir, filename);
          const meta = readSingleFont(fullPath, filename);
          if (meta.identified) {
            files.push(meta);
          } else {
            files.push(meta);
            unidentified.push(filename);
          }
        }
      }

      const families = aggregateFamilies(files);

      const manifest: FontsManifest = {
        files,
        families,
        unidentified,
        meta: {
          generatedAt: new Date().toISOString(),
          tool: "fontkit@2.0.4",
        },
      };

      writeFileSync(outputPath, JSON.stringify(manifest, null, 2), "utf-8");
      return manifest;
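The `aggregateFamilies` call above is not shown in the diff context. As a rough illustration of per-family aggregation (weights, axes, file counts), the sketch below groups file entries by family. The `FileEntry` and `FamilySummary` shapes and the function name are assumptions and may not match the PR's `FontFileMetadata` or manifest types.

```typescript
// Hypothetical, simplified per-file record.
interface FileEntry {
  filename: string;
  family: string;
  weight: number;
  identified: boolean;
  axes: string[]; // variation axis tags, empty for static fonts
}

// Hypothetical per-family aggregate.
interface FamilySummary {
  family: string;
  weights: number[];
  axes: string[];
  fileCount: number;
}

function aggregateFamiliesSketch(files: FileEntry[]): FamilySummary[] {
  const byFamily = new Map<
    string,
    { weights: Set<number>; axes: Set<string>; fileCount: number }
  >();
  for (const file of files) {
    if (!file.identified) continue; // unidentified files carry no usable family
    let entry = byFamily.get(file.family);
    if (!entry) {
      entry = { weights: new Set<number>(), axes: new Set<string>(), fileCount: 0 };
      byFamily.set(file.family, entry);
    }
    entry.weights.add(file.weight);
    for (const tag of file.axes) entry.axes.add(tag);
    entry.fileCount += 1;
  }
  // Sort weights numerically and axis tags alphabetically for stable output.
  return Array.from(byFamily.entries()).map(([family, e]) => ({
    family,
    weights: Array.from(e.weights).sort((a, b) => a - b),
    axes: Array.from(e.axes).sort(),
    fileCount: e.fileCount,
  }));
}
```

Deduplicating weights in a `Set` keeps the summary compact when a family ships several files at the same weight (for example, separate Latin and Cyrillic subsets).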
         "open": "^10.0.0",
         "postcss": "^8.5.8",
         "prettier": "^3.8.1",
         "puppeteer-core": "^24.39.1",
    -    "sharp": "^0.34.0"
    +    "sharp": "^0.34.5"
       },
miguel-heygen
left a comment
Clean extraction from the original #984. Font metadata extractor is solid — proper typed imports from @types/fontkit, weight canonicalization well-documented, non-fatal wiring in capture/index.ts. fontkit + sharp in cli package.json only (not root). Minor: hardcoded tool: "fontkit@2.0.4" will go stale on dep bumps.
jrusso1020
left a comment
Approve at db94b505. Magi covered the headline correctness (typed fontkit, weight canonicalization, non-fatal wiring); my additive read:
- Type guard for `Font | FontCollection` is clean: `isFontCollection` checks `type === "TTC" || "DFont"` against the `@types/fontkit` discriminator. No `any`/`unknown` cast bypasses. This addresses Magi's prior `fontkit.create()` cast concern from hf#984 cleanly.
- Variable-font axes via `font.variationAxes` reads `fvar`-equivalent metadata. Per-family aggregates list axes correctly (e.g., `["wght", "slnt"]`).
- Malformed-binary handling: silent skip with the file marked `identified: false` in the unidentified array. The error doesn't propagate so the pipeline continues. Reasonable for a downstream-consumer-tolerates-missing context, but worth noting that a systematically misparseable font (e.g., a corrupted upstream CDN) won't surface as an error in the manifest summary. Lower-priority: a one-line log per skip would make debugging easier when a font silently disappears.
- Path validation: `readdirSync(fontsDir)` is called with the caller-supplied path. No symlink or traversal canonicalization. Low priority for a local-dev CLI that the user controls, but worth knowing if `fontsDir` ever becomes a user-controlled input (URL, agent prompt, etc.).
- Stack base: confirmed base is `main`. Foundation for the stack.
Solid, isolated, reviewable. Direction is the right replacement for the v1/v2/v3/v4 eval failure mode where Inter Medium and Geist Mono showed up as font_xxxxxx.woff2 in DESIGN.md.
— Rames Jusso
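To make the type-guard and skip-logging points in this review concrete, here is a minimal sketch. `FontLike` and `FontCollectionLike` are local, simplified stand-ins for the `@types/fontkit` shapes (an assumption, not the real declarations), and `identify` with its `warn` callback is a hypothetical helper showing the suggested one-line log per skipped font.

```typescript
// Local stand-ins for the @types/fontkit shapes (assumed and simplified).
interface FontLike {
  type: "TTF" | "WOFF" | "WOFF2";
  familyName: string;
}
interface FontCollectionLike {
  type: "TTC" | "DFont";
  fonts: FontLike[];
}

// Narrow on the `type` discriminator instead of casting through unknown.
function isFontCollection(f: FontLike | FontCollectionLike): f is FontCollectionLike {
  return f.type === "TTC" || f.type === "DFont";
}

// For collections, this sketch simply inspects the first member font.
function firstFont(f: FontLike | FontCollectionLike): FontLike | undefined {
  return isFontCollection(f) ? f.fonts[0] : f;
}

// Hypothetical wrapper: returns the family name, or null after logging one
// line when parsing throws, so a skipped font is visible while debugging.
function identify(
  filename: string,
  parse: () => FontLike | FontCollectionLike,
  warn: (message: string) => void,
): string | null {
  try {
    const font = firstFont(parse());
    return font?.familyName || null;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    warn(`font skipped: ${filename} (${reason})`);
    return null;
  }
}
```

Passing `warn` as a parameter keeps the helper free of any particular logger, so the caller decides whether skips go to the console or into the manifest summary.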
Five fixes from Copilot's inline review + Miguel's note on PR #987:
1. inferWeightFromSubfamily: only matched concatenated forms ("extralight", "semibold"). Spaced ("Extra Light") and hyphenated ("Extra-Light") variants fell through to the 400 default, misreporting 200-weight fonts as 400. Now normalizes `[\s-]+` out of the subfamily before matching.
2. meta.tool: was hardcoded to "fontkit@2.0.4" but `packages/cli/package.json` allows ^2.0.4, so the manifest string would drift on every dep bump. Now records just "fontkit"; the version moves with the dep and can be discovered from package.json at debug-time if needed.
3. FontFileMetadata.rawFamily: docstring said "nameID 16 preferred, then nameID 1" but the code also derives from PostScript via deriveFamilyFromPostscript when both name-table fields are missing. Doc now reflects the actual three-step precedence.
4. FontFileMetadata.weight: docstring said "100-900" but the code emits 0 (when identified: false) and 950 (when canonicalizeFamily picks ExtraBlack/UltraBlack). Doc now documents both edge values explicitly.
5. sharp ^0.34.5: bumped from ^0.34.0 on this PR but font extraction doesn't use sharp; the bump is needed by the contact sheet code in PR #988. Reverted on #987; will re-bump on #988 where it's actually consumed.

Also adds vitest coverage:
- 34 tests in fontMetadataExtractor.test.ts
- Covers inferWeightFromSubfamily for concatenated, spaced, and hyphenated forms (including composite styles like "Bold Italic" and case-insensitivity)
- Covers canonicalizeFamily for unchanged families, stripped weight tokens, preserved width modifiers, and the 950 emit
- Integration tests for extractFontMetadata (non-existent dir, empty dir) verifying the meta.tool / generatedAt shape

Exported `inferWeightFromSubfamily` and `canonicalizeFamily` for testing. Pure functions, internal helpers, but exporting is the clean way to pin their behavior against regressions.
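The first fix can be sketched by combining the checks from the reviewed snippet with the `[\s-]+` normalization described in the commit message. This is a reconstruction for illustration, named `inferWeightSketch` to mark it as such; the merged function may differ in detail.

```typescript
// Reconstruction: checks mirror the reviewed snippet, preceded by the
// whitespace/hyphen normalization the commit message describes.
function inferWeightSketch(subfamily: string): number {
  // "Extra Light" and "Extra-Light" both collapse to "extralight".
  const s = subfamily.toLowerCase().replace(/[\s-]+/g, "");
  if (s.includes("thin")) return 100;
  if (s.includes("extralight") || s.includes("ultralight")) return 200;
  if (s.includes("light")) return 300;
  if (s.includes("medium")) return 500;
  if (s.includes("semibold") || s.includes("demibold")) return 600;
  if (s.includes("extrabold") || s.includes("ultrabold")) return 800;
  if (s.includes("black") || s.includes("heavy")) return 900;
  if (s.includes("bold")) return 700;
  return 400; // default when no weight token is present
}
```

The order of the checks matters: the compound forms ("extralight", "semibold", "extrabold") are tested before the shorter tokens they contain ("light", "bold").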
miguel-heygen
left a comment
Re-reviewed after new commit 5e7a7a89.
Changes address review feedback well:
- `inferWeightFromSubfamily` now normalizes spaces/hyphens (`.replace(/[\s-]+/g, "")`) before matching, which fixes the "Extra Light" / "Extra-Light" bug Copilot flagged
- Both weight and family helpers exported for unit testing
- `meta.tool` changed from hardcoded `"fontkit@2.0.4"` to `"fontkit"` to avoid drift on bumps
- 176 lines of new tests covering concatenated/spaced/hyphenated weight forms, composite styles, case insensitivity, and `canonicalizeFamily` token stripping
- Clean `sharp` revert back to `^0.34.0`
All correct. Ship it.

What
Adds `fontMetadataExtractor.ts` to the capture pipeline. Reads each downloaded font's binary OpenType `name` table via `fontkit`, identifies the real family name, and writes `capture/extracted/fonts-manifest.json` with per-file metadata + per-family aggregates (weights, variable-font axes, file counts).
1 of 5 in the pipeline-quality stack. Replaces the huge PR #984 with a sequence of independent, reviewable PRs.
Stack order (bottom → top):
Why
Modern frameworks (Next.js, Webpack) hash font filenames like `f9b8e1e8d4c3f0a7-s.woff2`. The capture pipeline downloads the font binaries but couldn't tell which file belonged to which family by reading the filename. Sub-agents authoring DESIGN.md were guessing or falling back to system fonts, producing visual fidelity loss in every video the v2/v3/v4 evals shipped.
How
- `packages/cli/src/capture/fontMetadataExtractor.ts` (new): reads the OpenType `name` table via `fontkit`. Canonicalizes static-weight family-name packaging ("Inter Medium" → family "Inter" + weight 500; compound names like "Semi Bold" → "SemiBold"). Reads `fvar` axes for variable fonts. Uses `@types/fontkit` directly (Font / FontCollection types, real fsSelection booleans) with a type guard, so fontkit API drift surfaces as a compile error rather than silent undefined.
- `packages/cli/src/capture/index.ts`: wires `extractFontMetadata` in after `downloadAndRewriteFonts`. Non-fatal try/catch.
- `packages/cli/package.json`: adds `fontkit ^2.0.4`. `@types/fontkit` is already in devDependencies.
Test plan
`npx tsx packages/cli/src/cli.ts capture <url>` writes `fonts-manifest.json` with correct family aggregations. Typecheck clean.
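As an illustration of how a downstream consumer might turn the per-family aggregates into a label such as "Inter (100-900 variable)", here is a minimal sketch. The `Axis` shape is a simplified assumption (fontkit itself exposes variation axes as an object keyed by tag), and `familyLabel` with its static-font format is hypothetical.

```typescript
// Simplified, assumed axis record; not the manifest's confirmed shape.
interface Axis {
  tag: string;
  min: number;
  max: number;
}

// Hypothetical helper: label a family as variable when it carries a `wght`
// axis, otherwise list its static weights in ascending order.
function familyLabel(family: string, weights: number[], axes: Axis[]): string {
  const wght = axes.find((axis) => axis.tag === "wght");
  if (wght) return `${family} (${wght.min}-${wght.max} variable)`;
  const sorted = weights.slice().sort((a, b) => a - b);
  return sorted.length > 0 ? `${family} (${sorted.join(", ")})` : family;
}
```

For a single variable file with a `wght` axis spanning 100 to 900, this produces the "Inter (100-900 variable)" form quoted in the commit message.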