feat: store browser / HTTP client fingerprint in Session #3646

Draft
barjin wants to merge 6 commits into v4 from feat/session-fingerprint

barjin commented May 11, 2026

Closes #3628.

Adds a `fingerprint` field to `Session` so repeated requests with the same session can stay consistent on user-agent, headers, and TLS profile across HTTP clients and browser crawlers. The format is intentionally two-layered: a small `SessionFingerprint` interface in `@crawlee/types` carries the cross-cutting bits every backend understands (`userAgent`, `headers`, `browser`, `platform`, `device`, `httpVersion`, `locales`), and an opaque `browserFingerprint?: unknown` slot holds the rich `BrowserFingerprintWithHeaders` payload populated by `@crawlee/browser-pool`. The opaque slot keeps `@crawlee/core` free of any dependency on `fingerprint-generator`.
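For reviewers, a rough sketch of what the two-layer shape could look like. The field names come from the description above; the field types, the optionality, and the `buildRequestHeaders` helper are assumptions made for illustration and are not taken from the diff.

```typescript
// Sketch only: field names follow the PR description, types are assumed.
interface SessionFingerprint {
    userAgent?: string;
    headers?: Record<string, string>;
    browser?: string;
    platform?: string;
    device?: string;
    httpVersion?: string;
    locales?: string[];
    // Opaque slot: code that only knows this interface never inspects it,
    // so no fingerprint-generator types are needed here.
    browserFingerprint?: unknown;
}

// Hypothetical helper showing how an HTTP backend could consume the lean
// fields without touching the opaque payload.
function buildRequestHeaders(fp: SessionFingerprint): Record<string, string> {
    const headers: Record<string, string> = {};
    // Normalize header names to lower case so lookups below are predictable.
    for (const [name, value] of Object.entries(fp.headers ?? {})) {
        headers[name.toLowerCase()] = value;
    }
    // Assumption: the dedicated userAgent field wins over a header entry.
    if (fp.userAgent) headers['user-agent'] = fp.userAgent;
    // Only derive accept-language when the stored headers do not set it.
    if (fp.locales && fp.locales.length > 0 && !headers['accept-language']) {
        headers['accept-language'] = fp.locales.join(',');
    }
    return headers;
}
```

The point of the sketch is that the helper compiles and runs with nothing but the lean interface in scope, which is the property the opaque slot is meant to preserve.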

`browser-pool`'s prelaunch hook now treats `session.fingerprint.browserFingerprint` as the source of truth: it reuses what the session already has, falls back to the existing LRU when the session is empty, and writes back both the rich payload and the derived lean fields after generation. Hydration via `SessionPool._maybeLoadSessionPool` already round-trips the field through `SessionOptions`, so persisted fingerprints survive restarts.
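The lookup order described above can be sketched roughly as follows. Everything here is a stand-in: `SessionLike`, `RichFingerprint`, and `resolveFingerprint` are hypothetical names, a plain `Map` plays the role of the LRU, and writing back on a cache hit (not only after fresh generation) is an assumption.

```typescript
// Hypothetical stand-ins; none of these names are the library's real API.
interface SessionLike {
    fingerprint?: { userAgent?: string; browserFingerprint?: unknown };
}
type RichFingerprint = { fingerprint: { navigator: { userAgent: string } } };

function resolveFingerprint(
    session: SessionLike | undefined,
    cache: Map<string, RichFingerprint>, // stands in for the existing LRU
    cacheKey: string,
    generate: () => RichFingerprint,
): RichFingerprint {
    // 1. The session is the source of truth: reuse its payload if present.
    const stored = session?.fingerprint?.browserFingerprint as RichFingerprint | undefined;
    if (stored) return stored;

    // 2. Otherwise fall back to the cache, generating only on a miss.
    let rich = cache.get(cacheKey);
    if (!rich) {
        rich = generate();
        cache.set(cacheKey, rich);
    }

    // 3. Write the rich payload and a derived lean field back to the session.
    if (session) {
        session.fingerprint = {
            ...session.fingerprint,
            userAgent: rich.fingerprint.navigator.userAgent,
            browserFingerprint: rich,
        };
    }
    return rich;
}
```

With this ordering, a second call for the same session returns the stored payload without consulting the cache or the generator, even if the cache key (for example the proxy URL) has changed in the meantime.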

One pre-existing gap surfaced: `launchContext.session` is referenced in `fingerprinting/hooks.ts` but never actually set, because `browser-crawler`'s `preparePage` doesn't thread `session` into `browserPool.newPage`. The new hook logic is forward-compatible (the LRU still keys on `proxyUrl` until a session lands on `launchContext`), and threading `session` through `BrowserPoolNewPageOptions` is a natural follow-up.

barjin added 6 commits May 11, 2026 14:22
…able

Previously, calling retire() bumped errorScore to maxErrorScore but a subsequent markGood() (e.g. the automatic markGood after a successful requestHandler that explicitly retired the session) could decrement the score back below the threshold, making the session usable again. Track retirement in a dedicated _retired flag checked by isUsable() so retire() is a true terminal state.
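A minimal sketch of the terminal-state idea, assuming a much simplified session: `SketchSession` is a hypothetical class, the default scores are illustrative, and the expiry and usage-count checks a real session would also perform are left out.

```typescript
// Hypothetical, simplified session; not the actual Session implementation.
class SketchSession {
    private errorScore = 0;
    private retired = false;

    constructor(
        private readonly maxErrorScore = 3,
        private readonly errorScoreDecrement = 0.5,
    ) {}

    markBad(): void {
        this.errorScore += 1;
    }

    markGood(): void {
        // Lowering the score is what used to revive a retired session.
        this.errorScore = Math.max(0, this.errorScore - this.errorScoreDecrement);
    }

    retire(): void {
        // The dedicated flag makes retirement independent of the score.
        this.retired = true;
        this.errorScore = this.maxErrorScore;
    }

    isUsable(): boolean {
        return !this.retired && this.errorScore < this.maxErrorScore;
    }
}
```

In this sketch a session that merely hit the error threshold can still recover through `markGood()`, while one that was explicitly retired cannot.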

Replace the global EVENT_SESSION_RETIRED listener and the per-controller
browserSessionIds map with a check at the per-request cleanup hook: if
the session ended the request unusable, retire the browser controller.
The previous mechanism tore down browsers eagerly mid-flight; the new
one lets the in-flight request finish on the doomed browser and retires
it once the request is done. Same outcome, no global event subscription
needed.
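Roughly, the per-request check could look like the sketch below. `SessionLike`, `retireBrowserIfSessionUnusable`, `handleRequest`, and the `retireBrowser` callback are hypothetical names; the actual cleanup hook and browser controller API are not shown in this conversation.

```typescript
// Hypothetical shapes for illustration only.
interface SessionLike {
    isUsable(): boolean;
}

// Runs at the end of a request: retires the browser only if the session
// finished the request in an unusable state. Returns whether it retired.
function retireBrowserIfSessionUnusable(
    session: SessionLike,
    retireBrowser: () => void,
): boolean {
    if (session.isUsable()) return false;
    retireBrowser();
    return true;
}

// The in-flight handler always runs to completion (or throws) before the
// cleanup check, so nothing is torn down mid-request.
async function handleRequest(
    session: SessionLike,
    handler: () => Promise<void>,
    retireBrowser: () => void,
): Promise<void> {
    try {
        await handler();
    } finally {
        retireBrowserIfSessionUnusable(session, retireBrowser);
    }
}
```

Because the check lives in the request's own cleanup path, no global listener or session-to-browser map is needed to find the browser that belongs to the retired session.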

SessionPool no longer extends EventEmitter and no longer fires a
sessionRetired event. The Session->SessionPool back-reference, the
sessionPool constructor option on Session, and the EVENT_SESSION_RETIRED
constant are gone with it. The only consumer of that event was the
browser crawler, which now retires browsers via the per-request context
pipeline cleanup. Custom createSessionFunction implementations that
manually constructed Session instances should drop the sessionPool
argument.

Define a small, dependency-free fingerprint contract on the Session
interface so both HTTP clients and browser crawlers can read the same
shape. The rich browser fingerprint payload lives in an opaque
`browserFingerprint` slot to keep `@crawlee/types` free of any
dependency on `fingerprint-generator`.

Expose a `fingerprint` field on `Session` (constructor option, getter,
setter, and persisted in `getState()`) so HTTP clients and browser
crawlers can read and write the session's browser/HTTP client
fingerprint. Hydration via `SessionPool._maybeLoadSessionPool` round-trips
the field through the same option, so persisted fingerprints survive
restarts.
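The round trip can be illustrated with a stripped-down stand-in. `FingerprintedSession`, `FingerprintState`, and `persistAndRestore` are hypothetical, and JSON serialization is used here only as an assumed persistence format.

```typescript
// Hypothetical, reduced shapes; the real Session carries many more fields.
interface FingerprintState {
    userAgent?: string;
    browserFingerprint?: unknown;
}
interface SketchSessionState {
    id: string;
    fingerprint?: FingerprintState;
}

class FingerprintedSession {
    readonly id: string;
    private _fingerprint?: FingerprintState;

    // The constructor option and the persisted state share one shape,
    // which is what lets hydration feed getState() output straight back in.
    constructor(options: SketchSessionState) {
        this.id = options.id;
        this._fingerprint = options.fingerprint;
    }

    get fingerprint(): FingerprintState | undefined {
        return this._fingerprint;
    }

    set fingerprint(value: FingerprintState | undefined) {
        this._fingerprint = value;
    }

    getState(): SketchSessionState {
        return { id: this.id, fingerprint: this._fingerprint };
    }
}

// Simulates a restart: serialize the state, then rebuild from it.
function persistAndRestore(session: FingerprintedSession): FingerprintedSession {
    const persisted = JSON.stringify(session.getState());
    return new FingerprintedSession(JSON.parse(persisted) as SketchSessionState);
}
```

One consequence worth noting under this assumption: whatever ends up in the opaque `browserFingerprint` slot has to survive serialization for the restored session to be equivalent.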

When a session is attached to the launch context, the prelaunch
fingerprint hook now reads from and writes to `session.fingerprint` as
the source of truth, falling back to the existing LRU only when the
session has no fingerprint yet. Generated payloads are written back to
`session.fingerprint.browserFingerprint` together with derived
`userAgent`/`headers`/`browser`/`platform`/`device`/`locales` fields,
so HTTP backends and persisted sessions can consume the same shape.