feat: store browser / HTTP client fingerprint in Session #3646
Draft
barjin wants to merge 6 commits into
…able

Previously, calling retire() bumped errorScore to maxErrorScore, but a subsequent markGood() (e.g. the automatic markGood after a successful requestHandler that explicitly retired the session) could decrement the score back below the threshold, making the session usable again. Track retirement in a dedicated _retired flag checked by isUsable() so retire() is a true terminal state.
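The terminal-state behaviour described above can be illustrated with a minimal sketch. `SketchSession` is a hypothetical, stripped-down stand-in, not the actual Crawlee `Session` class; the default values for `maxErrorScore` and the decrement are assumptions chosen for the example.

```typescript
// Hypothetical, stripped-down model of a session; not the real Crawlee class.
class SketchSession {
    private errorScore = 0;
    private retired = false;
    private readonly maxErrorScore: number;
    private readonly errorScoreDecrement: number;

    constructor(maxErrorScore = 3, errorScoreDecrement = 0.5) {
        this.maxErrorScore = maxErrorScore;
        this.errorScoreDecrement = errorScoreDecrement;
    }

    markBad(): void {
        this.errorScore += 1;
    }

    markGood(): void {
        // Lowers the score, which used to be enough to "revive" a retired session.
        this.errorScore = Math.max(0, this.errorScore - this.errorScoreDecrement);
    }

    retire(): void {
        this.errorScore = this.maxErrorScore;
        // The dedicated flag is what makes retirement irreversible.
        this.retired = true;
    }

    isUsable(): boolean {
        return !this.retired && this.errorScore < this.maxErrorScore;
    }
}

const retiredSession = new SketchSession();
retiredSession.retire();
retiredSession.markGood(); // score drops below the threshold again
const usableAfterRetire = retiredSession.isUsable(); // stays false because of the flag
```

Without the `retired` check in `isUsable()`, the final call would report the session as usable again, which is the bug the commit describes.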
Replace the global EVENT_SESSION_RETIRED listener and the per-controller browserSessionIds map with a check at the per-request cleanup hook: if the session ended the request unusable, retire the browser controller. The previous mechanism tore down browsers eagerly mid-flight; the new one lets the in-flight request finish on the doomed browser and retires it once the request is done. Same outcome, no global event subscription needed.
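A rough sketch of the per-request check, under the assumption that the cleanup hook has access to the session and to some way of retiring the browser controller. `cleanupAfterRequest`, `UsableSession`, and the `retireController` callback are hypothetical names used only for illustration.

```typescript
// Hypothetical minimal shape: only the method the cleanup check needs.
interface UsableSession {
    isUsable(): boolean;
}

// Runs once the request has finished; returns true if the controller was retired.
function cleanupAfterRequest<C>(
    session: UsableSession | undefined,
    controller: C,
    retireController: (controller: C) => void,
): boolean {
    if (session && !session.isUsable()) {
        retireController(controller);
        return true;
    }
    return false;
}

// Usage with string ids standing in for browser controllers.
const retiredControllers: string[] = [];
const retire = (id: string): void => {
    retiredControllers.push(id);
};

cleanupAfterRequest({ isUsable: () => true }, 'browser-1', retire);
cleanupAfterRequest({ isUsable: () => false }, 'browser-2', retire);
```

Because the check runs at cleanup time, the in-flight request always completes before its browser is retired, and no event subscription or session-to-browser map is needed.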
SessionPool no longer extends EventEmitter and no longer fires a sessionRetired event. The Session->SessionPool back-reference, the sessionPool constructor option on Session, and the EVENT_SESSION_RETIRED constant are gone with it. The only consumer of that event was the browser crawler, which now retires browsers via the per-request context pipeline cleanup. Custom createSessionFunction implementations that manually constructed Session instances should drop the sessionPool argument.
Define a small, dependency-free fingerprint contract on the Session interface so both HTTP clients and browser crawlers can read the same shape. The rich browser fingerprint payload lives in an opaque `browserFingerprint` slot to keep `@crawlee/types` free of any dependency on `fingerprint-generator`.
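As a sketch of what such a contract might look like: the field names below are the ones listed in this PR, but the types and optionality are assumptions, and `SessionFingerprintSketch` and `headersFor` are hypothetical names rather than the actual `@crawlee/types` definitions.

```typescript
// Field names come from the PR text; every type here is an assumption.
interface SessionFingerprintSketch {
    userAgent?: string;
    headers?: Record<string, string>;
    browser?: string;
    platform?: string;
    device?: string;
    httpVersion?: string;
    locales?: string[];
    // Opaque slot: typed as unknown so this file needs no fingerprint library.
    browserFingerprint?: unknown;
}

// Hypothetical helper: an HTTP backend reads only the lean fields.
function headersFor(fingerprint: SessionFingerprintSketch): Record<string, string> {
    const headers: Record<string, string> = { ...(fingerprint.headers ?? {}) };
    if (fingerprint.userAgent) {
        headers['user-agent'] = fingerprint.userAgent;
    }
    return headers;
}

const exampleHeaders = headersFor({
    userAgent: 'ExampleAgent/1.0',
    headers: { accept: '*/*' },
    browserFingerprint: { anything: 'the browser layer understands' },
});
```

The HTTP side never inspects `browserFingerprint`; only code that knows the rich payload's shape would narrow it from `unknown`.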
Expose a `fingerprint` field on `Session` (constructor option, getter, setter, and persisted in `getState()`) so HTTP clients and browser crawlers can read and write the session's browser/HTTP client fingerprint. Hydration via `SessionPool._maybeLoadSessionPool` round-trips through the same option, so persisted fingerprints survive restarts.
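The round trip can be sketched as follows. `PersistableSession` and its option type are hypothetical simplifications, and JSON serialization stands in for whatever storage the session pool actually persists to.

```typescript
// Hypothetical simplified shapes; not the real SessionOptions or SessionState.
interface FingerprintSketch {
    userAgent?: string;
    browserFingerprint?: unknown;
}

interface SessionOptionsSketch {
    id: string;
    fingerprint?: FingerprintSketch;
}

class PersistableSession {
    readonly id: string;
    private _fingerprint?: FingerprintSketch;

    constructor(options: SessionOptionsSketch) {
        this.id = options.id;
        this._fingerprint = options.fingerprint;
    }

    get fingerprint(): FingerprintSketch | undefined {
        return this._fingerprint;
    }

    set fingerprint(value: FingerprintSketch | undefined) {
        this._fingerprint = value;
    }

    // The state doubles as constructor options, so hydration is symmetric.
    getState(): SessionOptionsSketch {
        return { id: this.id, fingerprint: this._fingerprint };
    }
}

const original = new PersistableSession({ id: 'session_1' });
original.fingerprint = {
    userAgent: 'ExampleAgent/1.0',
    browserFingerprint: { screen: { width: 1920 } },
};

// Simulate a restart: serialize the state, then rebuild the session from it.
const persisted = JSON.stringify(original.getState());
const restored = new PersistableSession(JSON.parse(persisted) as SessionOptionsSketch);
```

Since the state passes through serialization, the opaque `browserFingerprint` payload has to be plain serializable data for it to survive the restart; this sketch assumes that it is.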
When a session is attached to the launch context, the prelaunch fingerprint hook now reads from and writes to `session.fingerprint` as the source of truth, falling back to the existing LRU only when the session has no fingerprint yet. Generated payloads are written back to `session.fingerprint.browserFingerprint` together with derived `userAgent`/`headers`/`browser`/`platform`/`device`/`locales` fields, so HTTP backends and persisted sessions can consume the same shape.
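The lookup order described above can be sketched like this. Everything here is a hypothetical simplification: `resolveFingerprint`, the `GeneratedPayload` shape, and the plain `Map` standing in for the LRU are illustrative, and only `userAgent` and `headers` are derived to keep the example short. Writing a cache hit back to the session is also an assumption of this sketch.

```typescript
// Hypothetical simplification of the rich generated payload.
interface GeneratedPayload {
    fingerprint: { navigator: { userAgent: string } };
    headers: Record<string, string>;
}

interface HookFingerprint {
    userAgent?: string;
    headers?: Record<string, string>;
    browserFingerprint?: unknown;
}

interface HookSession {
    id: string;
    fingerprint?: HookFingerprint;
}

function resolveFingerprint(
    session: HookSession | undefined,
    cacheKey: string,
    cache: Map<string, GeneratedPayload>, // stand-in for the existing LRU
    generate: () => GeneratedPayload,
): GeneratedPayload {
    // 1. The session is the source of truth when it already has a payload.
    const fromSession = session?.fingerprint?.browserFingerprint as GeneratedPayload | undefined;
    if (fromSession) return fromSession;

    // 2. Fall back to the cache, 3. generate only as a last resort.
    let payload = cache.get(cacheKey);
    if (!payload) {
        payload = generate();
        cache.set(cacheKey, payload);
    }

    // Write back the rich payload plus the derived lean fields.
    if (session) {
        session.fingerprint = {
            ...session.fingerprint,
            browserFingerprint: payload,
            userAgent: payload.fingerprint.navigator.userAgent,
            headers: payload.headers,
        };
    }
    return payload;
}

let generatedCount = 0;
const generateStub = (): GeneratedPayload => {
    generatedCount += 1;
    return {
        fingerprint: { navigator: { userAgent: `ExampleAgent/${generatedCount}` } },
        headers: { 'accept-language': 'en-US' },
    };
};

const lru = new Map<string, GeneratedPayload>();
const hookSession: HookSession = { id: 'session_1' };
const first = resolveFingerprint(hookSession, 'proxy-a', lru, generateStub);
// A different cache key does not matter once the session carries a fingerprint.
const second = resolveFingerprint(hookSession, 'proxy-b', lru, generateStub);
```

In this sketch the second call returns the same payload without generating again, which is the consistency property the session-level fingerprint is meant to provide.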
Closes #3628.
Adds a `fingerprint` field to `Session` so repeated requests with the same session can stay consistent on user-agent, headers, and TLS profile across HTTP clients and browser crawlers. The format is intentionally two-layered: a small `SessionFingerprint` interface in `@crawlee/types` carries the cross-cutting bits every backend understands (`userAgent`, `headers`, `browser`, `platform`, `device`, `httpVersion`, `locales`), and an opaque `browserFingerprint?: unknown` slot holds the rich `BrowserFingerprintWithHeaders` payload populated by `@crawlee/browser-pool`. The opaque slot keeps `@crawlee/core` free of any dependency on `fingerprint-generator`.

`browser-pool`'s prelaunch hook now treats `session.fingerprint.browserFingerprint` as the source of truth: it reuses what the session already has, falls back to the existing LRU when the session is empty, and writes back both the rich payload and the derived lean fields after generation. Hydration via `SessionPool._maybeLoadSessionPool` already round-trips the field through `SessionOptions`, so persisted fingerprints survive restarts.

One pre-existing gap surfaced: `launchContext.session` is referenced in `fingerprinting/hooks.ts` but never actually set; `browser-crawler`'s `preparePage` doesn't thread `session` into `browserPool.newPage`. The new hook logic is forward-compatible (the LRU still keys on `proxyUrl` until a session lands on `launchContext`), and threading `session` through `BrowserPoolNewPageOptions` is a natural follow-up.