feat(capture): identify hashed fonts via OpenType name table #987

Open
ukimsanov wants to merge 2 commits into main from feat/capture-font-extractor

@ukimsanov
Collaborator

What

Adds fontMetadataExtractor.ts to the capture pipeline. Reads each downloaded font's binary OpenType name table via fontkit, identifies the real family name, and writes capture/extracted/fonts-manifest.json with per-file metadata + per-family aggregates (weights, variable-font axes, file counts).
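
As a rough illustration of the "per-family aggregates" described above, the following sketch groups per-file entries by family and collects weights, axis tags, and file counts. The `FileEntry` and `FamilySummary` shapes and the `aggregateByFamily` name are hypothetical and chosen for this example only; the PR's actual manifest types may differ.

```typescript
// Hypothetical shapes: field names are illustrative, not the PR's actual types.
interface FileEntry {
  filename: string;
  family: string; // canonical family name, "" when the file could not be identified
  weight: number;
  axes: string[]; // variation axis tags, empty for static fonts
}

interface FamilySummary {
  family: string;
  weights: number[];
  axes: string[];
  fileCount: number;
}

function aggregateByFamily(files: FileEntry[]): FamilySummary[] {
  const byFamily = new Map<string, FamilySummary>();
  for (const f of files) {
    if (f.family === "") continue; // unidentified files are assumed to be reported separately
    let entry = byFamily.get(f.family);
    if (!entry) {
      entry = { family: f.family, weights: [], axes: [], fileCount: 0 };
      byFamily.set(f.family, entry);
    }
    entry.fileCount += 1;
    if (entry.weights.indexOf(f.weight) === -1) entry.weights.push(f.weight);
    for (const tag of f.axes) {
      if (entry.axes.indexOf(tag) === -1) entry.axes.push(tag);
    }
  }
  // Sort weights ascending and families alphabetically for a stable manifest.
  const result = Array.from(byFamily.values());
  for (const e of result) e.weights.sort((a, b) => a - b);
  return result.sort((a, b) => a.family.localeCompare(b.family));
}
```

Sorting both the weights and the family list keeps the output deterministic, which makes a generated manifest easier to diff between captures.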

1 of 5 in the pipeline-quality stack. Replaces the huge PR #984 with a sequence of independent, reviewable PRs.

Stack order (bottom → top):

  1. feat/capture-font-extractor ← this PR
  2. feat/capture-pipeline-improvements
  3. feat/lint-rules
  4. feat/skill-website-to-hyperframes
  5. feat/skill-hyperframes-core

Why

Modern frameworks (Next.js, Webpack) hash font filenames like f9b8e1e8d4c3f0a7-s.woff2. The capture pipeline downloads the font binaries but couldn't tell which file belonged to which family by reading the filename. Sub-agents authoring DESIGN.md were guessing or falling back to system fonts, producing visual fidelity loss in every video the v2/v3/v4 evals shipped.

How

  • packages/cli/src/capture/fontMetadataExtractor.ts (new) — reads the OpenType name table via fontkit. Canonicalizes static-weight family-name packaging ("Inter Medium" → family "Inter" + weight 500; compound names like "Semi Bold" → "SemiBold"). Reads fvar axes for variable fonts. Uses @types/fontkit directly (Font / FontCollection types, real fsSelection booleans) with a type guard — fontkit API drift surfaces as a compile error rather than silent undefined.
  • packages/cli/src/capture/index.ts — wires extractFontMetadata in after downloadAndRewriteFonts. Non-fatal try/catch.
  • packages/cli/package.json — adds fontkit ^2.0.4. @types/fontkit is already in devDependencies.
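
The family-name canonicalization in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the token table, the `Canonical` shape, and the `canonicalizeFamilySketch` name are assumptions. It strips a trailing weight token (one or two words) and leaves width modifiers such as "Tight" alone.

```typescript
// Illustrative token table; the PR's real list may differ.
const WEIGHT_TOKENS: Array<[string, number]> = [
  ["thin", 100], ["extralight", 200], ["ultralight", 200], ["light", 300],
  ["regular", 400], ["medium", 500], ["semibold", 600], ["demibold", 600],
  ["bold", 700], ["extrabold", 800], ["ultrabold", 800],
  ["black", 900], ["heavy", 900], ["extrablack", 950], ["ultrablack", 950],
];

interface Canonical {
  family: string;
  weight: number | null; // null when no weight token was found in the name
}

function canonicalizeFamilySketch(raw: string): Canonical {
  const words = raw.trim().split(/\s+/);
  // Try the last two words first so "Semi Bold" is matched as "semibold",
  // then fall back to the last word alone.
  for (const take of [2, 1]) {
    if (words.length <= take) continue; // never strip the entire name
    const candidate = words.slice(-take).join("").replace(/-/g, "").toLowerCase();
    const hit = WEIGHT_TOKENS.find(([token]) => token === candidate);
    if (hit) {
      return { family: words.slice(0, -take).join(" "), weight: hit[1] };
    }
  }
  return { family: raw.trim(), weight: null };
}
```

Under these assumptions, `canonicalizeFamilySketch("Inter Medium")` yields family "Inter" with weight 500, while "Inter Tight" is returned unchanged because "tight" is not a weight token.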

Test plan

  • Unit tests added/updated — manual exercise against 9 captures: 132/132 fonts identified by real family name, including hashed Next.js builds.
  • Manual testing performed — npx tsx packages/cli/src/cli.ts capture <url> writes fonts-manifest.json with correct family aggregations. Typecheck clean.
  • Documentation updated (if applicable): not user-facing; downstream pipeline consumers are documented in the skill PR (PR #4 in the stack).

Modern frameworks (Next.js, Webpack) hash font filenames like
`f9b8e1e8d4c3f0a7-s.woff2`, so the capture pipeline can't tell which
file belongs to which family by reading the filename. Sub-agents
authoring DESIGN.md were guessing or falling back to system fonts.

This adds `fontMetadataExtractor.ts`: reads the binary OpenType `name`
table via `fontkit`, identifies each downloaded font by its real
family name, and writes `capture/extracted/fonts-manifest.json` with
per-file metadata + per-family aggregates (weights, variable-font
axes, file counts).

- Canonicalizes static-weight family-name packaging: "Inter Medium"
  resolves to family "Inter" with weight 500, "Semi Bold" normalizes
  to "SemiBold", etc. Width modifiers ("Tight", "Condensed") are NOT
  stripped — they denote separate typographic families.
- Reads variable-font axes from `fvar` so a single .woff2 carrying a
  full weight range is identified as variable (e.g. "Inter (100-900
  variable)").
- Uses `@types/fontkit` properly (no `unknown` cast), with a
  Font/FontCollection type guard. fontkit API drift surfaces as a
  compile error rather than silent undefined.
- Wired into `capture/index.ts` after `downloadAndRewriteFonts` so it
  runs after fonts are already on disk. Non-fatal try/catch — capture
  succeeds even if extraction fails.
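
To illustrate the variable-font bullet, here is a small sketch of how a label such as "Inter (100-900 variable)" could be derived from axis metadata. The `AxisRange` type is a local stand-in for the per-axis `min` / `default` / `max` shape that fontkit exposes through `font.variationAxes`; the `describeFamily` helper and the static-weight fallback format are assumptions made for this example.

```typescript
// Local stand-in for the per-axis record reported by fontkit's `variationAxes`.
interface AxisRange {
  name: string;
  min: number;
  default: number;
  max: number;
}
type VariationAxes = Record<string, AxisRange>;

function describeFamily(
  family: string,
  axes: VariationAxes,
  staticWeights: number[],
): string {
  const wght = axes["wght"];
  // A weight axis with a real range means one file covers every weight in it.
  if (wght && wght.min < wght.max) {
    return `${family} (${wght.min}-${wght.max} variable)`;
  }
  // Hypothetical fallback format for static fonts: list the discrete weights.
  const sorted = staticWeights.slice().sort((a, b) => a - b);
  return sorted.length > 0 ? `${family} (${sorted.join(", ")})` : family;
}
```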

Tested against 9 captures: 132/132 fonts identified by real family
name, including hashed Next.js builds.

Copilot AI left a comment

Pull request overview

Adds font binary inspection to the CLI capture pipeline so hashed/downloaded font files can be identified by their real OpenType family names and summarized for downstream consumers (e.g., DESIGN.md authoring).

Changes:

  • Add a fontMetadataExtractor that uses fontkit to read OpenType name/OS/2/fvar metadata and write extracted/fonts-manifest.json.
  • Wire the extractor into captureWebsite after downloadAndRewriteFonts, with non-fatal logging.
  • Add fontkit dependency (and update lockfile).
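
The "non-fatal" wiring mentioned above amounts to running the extraction step inside a try/catch and logging instead of rethrowing. A generic sketch of that pattern follows; `runNonFatal` and its parameters are hypothetical names for illustration and do not appear in the PR.

```typescript
// Runs an optional pipeline step; on failure, logs the reason and returns
// undefined so the surrounding pipeline can continue.
function runNonFatal<T>(
  label: string,
  step: () => T,
  log: (msg: string) => void,
): T | undefined {
  try {
    return step();
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    log(`${label} failed (continuing): ${reason}`);
    return undefined;
  }
}
```

A caller would wrap the extractor invocation in `step` and pass its own logger, so a failure in font parsing never aborts the capture itself.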

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| packages/cli/src/capture/index.ts | Invokes font metadata extraction after font download/URL rewrite and logs a concise summary. |
| packages/cli/src/capture/fontMetadataExtractor.ts | New extractor that parses font binaries, canonicalizes family names/weights, aggregates per-family, and writes a manifest. |
| packages/cli/package.json | Adds fontkit (and @types/fontkit), plus bumps sharp. |
| bun.lock | Locks new dependency graph including fontkit and related transitive deps. |

Comment on lines +213 to +220
if (s.includes("thin")) return 100;
if (s.includes("extralight") || s.includes("ultralight")) return 200;
if (s.includes("light")) return 300;
if (s.includes("medium")) return 500;
if (s.includes("semibold") || s.includes("demibold")) return 600;
if (s.includes("extrabold") || s.includes("ultrabold")) return 800;
if (s.includes("black") || s.includes("heavy")) return 900;
if (s.includes("bold")) return 700;
Comment on lines +39 to +44
/** Raw family name from the OpenType name table (nameID 16 preferred, then nameID 1). Empty if unidentifiable. */
rawFamily: string;
/** Subfamily / style name from nameID 17 or 2 (e.g. "Regular", "Bold Italic") */
subfamily: string;
/** PostScript name from nameID 6 (e.g. "Inter-Regular") */
postscript: string;
Comment on lines +45 to +46
/** OS/2 usWeightClass (100–900). Approximate for variable fonts — see variationAxes. */
weight: number;
unidentified,
meta: {
generatedAt: new Date().toISOString(),
tool: "fontkit@2.0.4",
Comment on lines +79 to +118
/**
 * Read all font files in fontsDir, extract metadata via fontkit, and write
 * the manifest to outputPath. Returns the manifest in case callers want to log it.
 *
 * Failures are non-fatal: if a single font's name table is missing or corrupt,
 * the file is added to `unidentified` and the rest continue. If the fonts
 * directory doesn't exist, returns an empty manifest without throwing.
 */
export function extractFontMetadata(fontsDir: string, outputPath: string): FontsManifest {
  const files: FontFileMetadata[] = [];
  const unidentified: string[] = [];

  if (existsSync(fontsDir)) {
    const fontFiles = readdirSync(fontsDir).filter((f) => /\.(woff2?|ttf|otf)$/i.test(f));
    for (const filename of fontFiles) {
      const fullPath = join(fontsDir, filename);
      const meta = readSingleFont(fullPath, filename);
      if (meta.identified) {
        files.push(meta);
      } else {
        files.push(meta);
        unidentified.push(filename);
      }
    }
  }

  const families = aggregateFamilies(files);

  const manifest: FontsManifest = {
    files,
    families,
    unidentified,
    meta: {
      generatedAt: new Date().toISOString(),
      tool: "fontkit@2.0.4",
    },
  };

  writeFileSync(outputPath, JSON.stringify(manifest, null, 2), "utf-8");
  return manifest;
Comment thread packages/cli/package.json
Comment on lines 37 to 42
"open": "^10.0.0",
"postcss": "^8.5.8",
"prettier": "^3.8.1",
"puppeteer-core": "^24.39.1",
- "sharp": "^0.34.0"
+ "sharp": "^0.34.5"
},
Collaborator

@miguel-heygen miguel-heygen left a comment

Clean extraction from the original #984. Font metadata extractor is solid — proper typed imports from @types/fontkit, weight canonicalization well-documented, non-fatal wiring in capture/index.ts. fontkit + sharp in cli package.json only (not root). Minor: hardcoded tool: "fontkit@2.0.4" will go stale on dep bumps.

Collaborator

@jrusso1020 jrusso1020 left a comment

Approve at db94b505. Magi covered the headline correctness (typed fontkit, weight canonicalization, non-fatal wiring); my additive read:

  • Type guard for Font | FontCollection is clean: isFontCollection checks type === "TTC" || type === "DFont" against the @types/fontkit discriminator. No any / unknown cast bypasses. This addresses Magi's prior fontkit.create() cast concern from hf#984 cleanly.
  • Variable-font axes via font.variationAxes reads fvar-equivalent metadata. Per-family aggregates list axes correctly (e.g., ["wght", "slnt"]).
  • Malformed-binary handling — silent skip with the file marked identified: false in the unidentified array. The error doesn't propagate so the pipeline continues. Reasonable for a downstream-consumer-tolerates-missing context, but worth noting that a systematically misparseable font (e.g., a corrupted upstream CDN) won't surface as an error in the manifest summary. Lower-priority: a one-line log per skip would make debugging easier when a font silently disappears.
  • Path validation: readdirSync(fontsDir) is called with the caller-supplied path. No symlink or traversal canonicalization. Low priority for a local-dev CLI that the user controls, but worth knowing if fontsDir ever becomes a user-controlled input (URL, agent prompt, etc.).
  • Stack base: confirmed base is main. Foundation for the stack.
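
For readers unfamiliar with the pattern in the first bullet, a discriminated-union type guard of this kind can be sketched as below. `FontLike` and `FontCollectionLike` are minimal local stand-ins that mirror only the `type` discriminator described in the review; they are not the real `@types/fontkit` declarations, and `firstFont` is a hypothetical helper.

```typescript
// Minimal stand-ins for the two shapes; only the fields used here are modeled.
interface FontLike {
  type: "TTF" | "WOFF" | "WOFF2";
  familyName: string;
}
interface FontCollectionLike {
  type: "TTC" | "DFont";
  fonts: FontLike[];
}

// Narrowing on the `type` discriminator means no `any` / `unknown` cast is needed.
function isFontCollection(
  x: FontLike | FontCollectionLike,
): x is FontCollectionLike {
  return x.type === "TTC" || x.type === "DFont";
}

// Hypothetical helper: pick the single font, or the first face of a collection.
function firstFont(x: FontLike | FontCollectionLike): FontLike | undefined {
  return isFontCollection(x) ? x.fonts[0] : x;
}
```

Because the guard is typed against the union, a change to the discriminator values in the type declarations would fail to compile here rather than silently returning `undefined` at runtime.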

Solid, isolated, reviewable. Direction is the right replacement for the v1/v2/v3/v4 eval failure mode where Inter Medium and Geist Mono showed up as font_xxxxxx.woff2 in DESIGN.md.

— Rames Jusso

Five fixes from Copilot's inline review + Miguel's note on PR #987:

1. inferWeightFromSubfamily — only matched concatenated forms
   ("extralight", "semibold"). Spaced ("Extra Light") and
   hyphenated ("Extra-Light") variants fell through to the 400
   default, misreporting 200-weight fonts as 400. Now normalizes
   `[\s-]+` out of the subfamily before matching.

2. meta.tool — was hardcoded to "fontkit@2.0.4" but
   `packages/cli/package.json` allows ^2.0.4, so the manifest
   string would drift on every dep bump. Now records just
   "fontkit"; the version moves with the dep and can be discovered
   from package.json at debug-time if needed.

3. FontFileMetadata.rawFamily — docstring said "nameID 16 preferred,
   then nameID 1" but the code also derives from PostScript via
   deriveFamilyFromPostscript when both name-table fields are
   missing. Doc now reflects the actual three-step precedence.

4. FontFileMetadata.weight — docstring said "100-900" but the code
   emits 0 (when identified: false) and 950 (when
   canonicalizeFamily picks ExtraBlack/UltraBlack). Doc now
   documents both edge values explicitly.

5. sharp ^0.34.5 — bumped from ^0.34.0 on this PR but font
   extraction doesn't use sharp; the bump is needed by the contact
   sheet code in PR #988. Reverted on #987; will re-bump on #988
   where it's actually consumed.
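
Fix 1 can be illustrated by combining the matching order quoted in Copilot's inline comment with the `[\s-]+` normalization described above. The function name `inferWeightSketch` is hypothetical, and the 400 default is taken from the fix description; the rest is a sketch rather than the committed code.

```typescript
function inferWeightSketch(subfamily: string): number {
  // Collapse "Extra Light" and "Extra-Light" into "extralight" before matching.
  const s = subfamily.toLowerCase().replace(/[\s-]+/g, "");
  if (s.includes("thin")) return 100;
  if (s.includes("extralight") || s.includes("ultralight")) return 200;
  if (s.includes("light")) return 300;
  if (s.includes("medium")) return 500;
  if (s.includes("semibold") || s.includes("demibold")) return 600;
  if (s.includes("extrabold") || s.includes("ultrabold")) return 800;
  if (s.includes("black") || s.includes("heavy")) return 900;
  if (s.includes("bold")) return 700;
  return 400; // default when no weight keyword is present
}
```

The order matters: the compound forms ("extralight", "semibold", "extrabold") are tested before their shorter suffixes ("light", "bold"), so a normalized "Extra Light" is not misread as 300.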

Also adds vitest coverage:
- 34 tests in fontMetadataExtractor.test.ts
- Covers inferWeightFromSubfamily for concatenated, spaced, and
  hyphenated forms (including composite styles like "Bold Italic"
  and case-insensitivity)
- Covers canonicalizeFamily for unchanged families, stripped
  weight tokens, preserved width modifiers, and the 950 emit
- Integration tests for extractFontMetadata (non-existent dir,
  empty dir) verifying the meta.tool / generatedAt shape

Exported `inferWeightFromSubfamily` and `canonicalizeFamily` for
testing. Pure functions, internal helpers, but exporting is the
clean way to pin their behavior against regressions.
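
The three-step `rawFamily` precedence described in fix 3 might look roughly like this. The `NameFields` shape, the `resolveRawFamily` name, and the hyphen-splitting rule in `derivePostscriptFamilySketch` are assumptions for illustration; the PR's `deriveFamilyFromPostscript` may use a different rule.

```typescript
// Hypothetical input: the three name-table strings relevant to family lookup.
interface NameFields {
  preferredFamily?: string; // nameID 16
  fontFamily?: string; // nameID 1
  postscriptName?: string; // nameID 6, e.g. "Inter-Regular"
}

// Assumed rule: the family is whatever precedes the first hyphen.
function derivePostscriptFamilySketch(ps: string): string {
  const idx = ps.indexOf("-");
  return (idx === -1 ? ps : ps.slice(0, idx)).trim();
}

function resolveRawFamily(n: NameFields): string {
  const typographic = (n.preferredFamily ?? "").trim();
  if (typographic) return typographic; // step 1: nameID 16
  const legacy = (n.fontFamily ?? "").trim();
  if (legacy) return legacy; // step 2: nameID 1
  return derivePostscriptFamilySketch(n.postscriptName ?? ""); // step 3, "" if nothing usable
}
```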
@ukimsanov ukimsanov requested a review from miguel-heygen May 21, 2026 03:20
Collaborator

@miguel-heygen miguel-heygen left a comment

Re-reviewed after new commit 5e7a7a89.

Changes address review feedback well:

  • inferWeightFromSubfamily now normalizes spaces/hyphens (.replace(/[\s-]+/g, "")) before matching — fixes the "Extra Light" / "Extra-Light" bug Copilot flagged
  • Both weight and family helpers exported for unit testing
  • meta.tool changed from hardcoded "fontkit@2.0.4" to "fontkit" to avoid drift on bumps
  • 176 lines of new tests covering concatenated/spaced/hyphenated weight forms, composite styles, case insensitivity, and canonicalizeFamily token stripping
  • Clean sharp revert back to ^0.34.0

All correct. Ship it.

4 participants