fix: handleCloudflareChallenge skips blocked response code#3630
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to restore correct behavior of Playwright’s handleCloudflareChallenge helper in Crawlee v4 by preventing the initial “blocked” navigation status code (commonly 403 from Cloudflare challenge pages) from immediately failing the request/session handling logic in BrowserCrawler.processResponse.
Changes:
- Introduces a new
SKIP_BLOCKED_STATUS_CODE_CHECKsymbol marker in@crawlee/browserto allow post-navigation logic to opt out of blocked-status-code handling inprocessResponse. - Updates
PlaywrightCrawler’shandleCloudflareChallengecontext helper to set that marker after the challenge helper completes. - Removes the previous approach of temporarily mutating
blockedStatusCodeswithin the Playwright Cloudflare challenge helper.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/playwright-crawler/src/internals/utils/playwright-utils.ts | Removes blocked-status-code set mutation from handleCloudflareChallenge and keeps challenge-solving flow self-contained. |
| packages/playwright-crawler/src/internals/playwright-crawler.ts | Sets the new skip marker after running handleCloudflareChallenge so processResponse doesn’t fail on the initial challenge response status code. |
| packages/browser-crawler/src/internals/browser-crawler.ts | Adds the exported SKIP_BLOCKED_STATUS_CODE_CHECK symbol and uses it to conditionally bypass _throwOnBlockedRequest() in processResponse. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
`handleCloudflareChallenge` now returns whether it actually detected and solved a challenge. The wrapper sets `SKIP_BLOCKED_STATUS_CODE_CHECK` only in that case, so calling the helper on every request no longer suppresses session rotation on legitimate 401/429 responses.
…rawlingContext` Adds an optional computed property signature on `BrowserCrawlingContext` keyed by the symbol so call sites can drop the `as any` casts.
Adds a 403 test endpoint and verifies that a post-navigation hook setting the symbol routes the response to the request handler, while omitting it falls back to the default blocked-status-code behavior.
3bb68c9 to
567a9eb
Compare
| if (this.sessionPool && response && session) { | ||
| if (typeof response === 'object' && typeof response.status === 'function') { | ||
| this._throwOnBlockedRequest(response.status()); | ||
| if (!crawlingContext[SKIP_BLOCKED_STATUS_CODE_CHECK]) { |
There was a problem hiding this comment.
So, when a cloudflare challenge is successfully handled, the context helper sets the flag on the context. Here, when checking the navigation result, we check both the original http response status and the value of the flag. That sounds fairly convoluted.
Doesn't handling the challenge result in a redirection? Perhaps we could track those and check the result of the last redirect here instead 🤔
There was a problem hiding this comment.
Tbh, I'm not too fond of passing the flag through the CrawlingContext either, but it is the most convenient place, really.
Doesn't handling the challenge result in a redirection? Perhaps we could track those and check the result of the last redirect here instead 🤔
There is a fresh issue (#3629) that hints at some issues with that. I am also not convinced we should rewrite context.response.status based on the user-based actions in postNavigationHooks (and unfortunately, there is no such API as page.responseStatus() in Playwright).
There was a problem hiding this comment.
Hm... what if we could just repeat the navigation after handling the cloudflare thingy? New page.goto, a legit, green response...
There was a problem hiding this comment.
New
page.goto, a legit, green response
We can definitely issue a new page.goto call, but I'm not sure this should change the contents of context.response.status.
We could technically reassign it (context['response'] = await page.goto(request.url))), but to me it feels this breaks the frontier between "user-land" code (an exported util) and the Crawlee internals. Also, imo it could mess up the context pipeline setup (e.g., if we add a response clean-up step, but the response gets rewritten if a random util is called).
There was a problem hiding this comment.
I guess I should've make myself more clear. What I had in mind was closer to "retry the whole context pipeline transparently" rather than call page.goto and patch up the response. But we'd need to make sure that we use the same browser, proxy, fingerprint, etc, so that Cloudflare doesn't block us again.
There was a problem hiding this comment.
Hmm, fair... I still cannot imagine how a context-bound helper function (called in the postNavigationHook) would do all this. Perhaps we could:
- unlock the Cloudflare challenge
- bind the
Sessioninstance to therequest(request.sessionId = session.id) - throw a
RetryErroronce done, putting the request back to the RQ
...... process a bunch of other requests
- fetch the request again and process it with the right session (which, in v4, means the same proxy, browser, fingerprint, etc.)
The issue with this approach is imo in the time between 2. and 123. - it's possible that the Cloudflare cookie expires, the proxy exit nodes rotate upstream, etc.
There was a problem hiding this comment.
Yeah, I was thinking something similar - hooking into the existing retry mechanism sure sounds appealing. But as you said, we cannot afford to wait for a potentially unbounded time for a retry.
It might be possible to throw a special error from handleCloudlareChallenge (similar to ContextPipelineInterruptedError) and either handle it in the context pipeline itself (by restarting) or let it propagate to the runTaskFunction (or whatever it's called these days) and handle the restarting there instead.
Recent
Sessionrelated changes inv4accidentally broke thehandleCloudflareChallengehelper.This PR adds the
SKIP_BLOCKED_STATUS_CODE_CHECKsymbol for signaling that the current request processing shouldn't fail on "blocking" status code.This is used, e.g., with the
handleCloudflareChallengehelper, since Cloudflare returns the challenge with403.Related to #3502