Skip to content

fix: handleCloudflareChallenge skips blocked response code#3630

Open
barjin wants to merge 4 commits into
v4from
fix/cloudflare-challenge-skip-403
Open

fix: handleCloudflareChallenge skips blocked response code#3630
barjin wants to merge 4 commits into
v4from
fix/cloudflare-challenge-skip-403

Conversation

@barjin
Copy link
Copy Markdown
Member

@barjin barjin commented May 6, 2026

Recent Session related changes in v4 accidentally broke the handleCloudflareChallenge helper.

This PR adds the SKIP_BLOCKED_STATUS_CODE_CHECK symbol for signaling that the current request processing shouldn't fail on "blocking" status code.
This is used, e.g., with the handleCloudflareChallenge helper, since Cloudflare returns the challenge with 403.

Related to #3502

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to restore correct behavior of Playwright’s handleCloudflareChallenge helper in Crawlee v4 by preventing the initial “blocked” navigation status code (commonly 403 from Cloudflare challenge pages) from immediately failing the request/session handling logic in BrowserCrawler.processResponse.

Changes:

  • Introduces a new SKIP_BLOCKED_STATUS_CODE_CHECK symbol marker in @crawlee/browser to allow post-navigation logic to opt out of blocked-status-code handling in processResponse.
  • Updates PlaywrightCrawler’s handleCloudflareChallenge context helper to set that marker after the challenge helper completes.
  • Removes the previous approach of temporarily mutating blockedStatusCodes within the Playwright Cloudflare challenge helper.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
packages/playwright-crawler/src/internals/utils/playwright-utils.ts Removes blocked-status-code set mutation from handleCloudflareChallenge and keeps challenge-solving flow self-contained.
packages/playwright-crawler/src/internals/playwright-crawler.ts Sets the new skip marker after running handleCloudflareChallenge so processResponse doesn’t fail on the initial challenge response status code.
packages/browser-crawler/src/internals/browser-crawler.ts Adds the exported SKIP_BLOCKED_STATUS_CODE_CHECK symbol and uses it to conditionally bypass _throwOnBlockedRequest() in processResponse.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/playwright-crawler/src/internals/playwright-crawler.ts Outdated
Comment thread packages/browser-crawler/src/internals/browser-crawler.ts
Comment thread packages/playwright-crawler/src/internals/playwright-crawler.ts
barjin added 3 commits May 6, 2026 14:05
`handleCloudflareChallenge` now returns whether it actually detected and
solved a challenge. The wrapper sets `SKIP_BLOCKED_STATUS_CODE_CHECK`
only in that case, so calling the helper on every request no longer
suppresses session rotation on legitimate 401/429 responses.
…rawlingContext`

Adds an optional computed property signature on `BrowserCrawlingContext`
keyed by the symbol so call sites can drop the `as any` casts.
Adds a 403 test endpoint and verifies that a post-navigation hook
setting the symbol routes the response to the request handler, while
omitting it falls back to the default blocked-status-code behavior.
@barjin barjin force-pushed the fix/cloudflare-challenge-skip-403 branch from 3bb68c9 to 567a9eb Compare May 6, 2026 12:42
@barjin barjin requested review from B4nan and janbuchar May 6, 2026 12:47
if (this.sessionPool && response && session) {
if (typeof response === 'object' && typeof response.status === 'function') {
this._throwOnBlockedRequest(response.status());
if (!crawlingContext[SKIP_BLOCKED_STATUS_CODE_CHECK]) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, when a cloudflare challenge is successfully handled, the context helper sets the flag on the context. Here, when checking the navigation result, we check both the original http response status and the value of the flag. That sounds fairly convoluted.

Doesn't handling the challenge result in a redirection? Perhaps we could track those and check the result of the last redirect here instead 🤔

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tbh, I'm not too fond of passing the flag through the CrawlingContext either, but it is the most convenient place, really.

Doesn't handling the challenge result in a redirection? Perhaps we could track those and check the result of the last redirect here instead 🤔

There is a fresh issue (#3629) that hints at some issues with that. I am also not convinced we should rewrite context.response.status based on the user-based actions in postNavigationHooks (and unfortunately, there is no such API as page.responseStatus() in Playwright).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... what if we could just repeat the navigation after handling the cloudflare thingy? New page.goto, a legit, green response...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New page.goto, a legit, green response

We can definitely issue a new page.goto call, but I'm not sure this should change the contents of context.response.status.

We could technically reassign it (context['response'] = await page.goto(request.url))), but to me it feels this breaks the frontier between "user-land" code (an exported util) and the Crawlee internals. Also, imo it could mess up the context pipeline setup (e.g., if we add a response clean-up step, but the response gets rewritten if a random util is called).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I should've make myself more clear. What I had in mind was closer to "retry the whole context pipeline transparently" rather than call page.goto and patch up the response. But we'd need to make sure that we use the same browser, proxy, fingerprint, etc, so that Cloudflare doesn't block us again.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, fair... I still cannot imagine how a context-bound helper function (called in the postNavigationHook) would do all this. Perhaps we could:

  1. unlock the Cloudflare challenge
  2. bind the Session instance to the request (request.sessionId = session.id)
  3. throw a RetryError once done, putting the request back to the RQ
    ...... process a bunch of other requests
  1. fetch the request again and process it with the right session (which, in v4, means the same proxy, browser, fingerprint, etc.)

The issue with this approach is imo in the time between 2. and 123. - it's possible that the Cloudflare cookie expires, the proxy exit nodes rotate upstream, etc.

Copy link
Copy Markdown
Contributor

@janbuchar janbuchar May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking something similar - hooking into the existing retry mechanism sure sounds appealing. But as you said, we cannot afford to wait for a potentially unbounded time for a retry.

It might be possible to throw a special error from handleCloudlareChallenge (similar to ContextPipelineInterruptedError) and either handle it in the context pipeline itself (by restarting) or let it propagate to the runTaskFunction (or whatever it's called these days) and handle the restarting there instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants