
fix(ssrf): check every resolved DNS record, not just the first #269

Merged
Jiaaqiliu merged 1 commit into aiming-lab:main from atulya-singh:fix/ssrf-validate-all-dns-records
May 29, 2026

Conversation

@atulya-singh
Contributor

Was poking around researchclaw/web/_ssrf.py after the recent bypass fixes in #254 and noticed the resolution path only looks at info[0]:

info = socket.getaddrinfo(hostname, None, socket.AF_UNSPEC, socket.SOCK_STREAM)
addr = ipaddress.ip_address(info[0][4][0])

If a hostname returns multiple A/AAAA records, the check only sees the first one. So a domain that resolves to e.g. [8.8.8.8, 127.0.0.1] will pass — but urllib / Crawl4AI can connect to either address. Pretty easy bypass for anyone who controls a DNS zone.

Changed the resolution path to loop over every record and block if any of them lands in a private/loopback/link-local/reserved range. Also added is_unspecified and is_multicast, since they were the obvious gaps once I started listing things out, and put the resolved IP in the error message so it's easier to tell from the logs what got blocked. Pulled the predicate out into a small _is_blocked_addr helper so the literal-IP branch and the resolved branch share the same rules.

One unrelated cleanup: removed the _SUSPICIOUS_URL_RE regex at the top of the file. It's defined but nothing imports it; looks like a leftover from an earlier draft of the backslash/userinfo check.

New tests in tests/test_web_crawler.py:

  • multi-record DNS where any record is private → blocked, with the bad IP in the message
  • multi-record DNS where all records are public → allowed
  • 0.0.0.0, [::1], [::ffff:127.0.0.1]
  • the backslash and userinfo cases from #254 (SSRF bypass in check_url_ssrf), which didn't have direct tests on check_url_ssrf
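The multi-record tests can be sketched roughly like this. This is an illustration, not the contents of tests/test_web_crawler.py: `check_host` is a hypothetical stand-in for the resolved-hostname branch of `check_url_ssrf` (whose real signature and return shape are not shown in this PR), and the hostnames are placeholders. Patching `socket.getaddrinfo` keeps the tests off the network.

```python
import ipaddress
import socket
from unittest import mock


def check_host(hostname):
    """Hypothetical stand-in; returns (ok, message). Not the project's API."""
    info = socket.getaddrinfo(hostname, None, socket.AF_UNSPEC, socket.SOCK_STREAM)
    for *_, sockaddr in info:
        addr = ipaddress.ip_address(sockaddr[0])
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            return False, f"blocked: {hostname} resolves to {addr}"
    return True, ""


def _fake_records(*ips):
    """Build getaddrinfo-shaped tuples for the given IPv4 literals."""
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", (ip, 0)) for ip in ips]


def test_any_private_record_blocks():
    records = _fake_records("8.8.8.8", "127.0.0.1")
    with mock.patch("socket.getaddrinfo", return_value=records):
        ok, message = check_host("mixed.example")
    assert not ok
    assert "127.0.0.1" in message  # the bad IP shows up in the message


def test_all_public_records_allowed():
    records = _fake_records("8.8.8.8", "1.1.1.1")
    with mock.patch("socket.getaddrinfo", return_value=records):
        ok, _ = check_host("public.example")
    assert ok
```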

Not trying to fix DNS rebinding here — that needs resolve-once-and-pin-the-IP plumbing through the actual HTTP client, which is a bigger change. This just stops the trivial multi-record case.
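To make "resolve once and pin the IP" concrete, here is a minimal sketch of the idea for plain HTTP only. It is not part of this PR, and `resolve_and_pin` and `fetch_pinned` are hypothetical names. It assumes the client can be pointed at an IP while sending the original `Host` header; HTTPS would additionally need SNI and certificate verification against the hostname, which is the plumbing this PR leaves out.

```python
import http.client
import ipaddress
import socket


def resolve_and_pin(hostname, resolver=socket.getaddrinfo):
    """Resolve once, reject if any record is internal, return one IP to pin."""
    info = resolver(hostname, None, socket.AF_UNSPEC, socket.SOCK_STREAM)
    ips = [sockaddr[0].split("%", 1)[0] for *_, sockaddr in info]
    if not ips:
        raise ValueError(f"{hostname} did not resolve to any address")
    for ip in ips:
        addr = ipaddress.ip_address(ip)
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            raise ValueError(f"{hostname} resolves to blocked address {ip}")
    return ips[0]


def fetch_pinned(hostname, path="/", port=80, timeout=10):
    """Plain-HTTP GET that connects to the validated IP, not the hostname."""
    pinned_ip = resolve_and_pin(hostname)
    conn = http.client.HTTPConnection(pinned_ip, port, timeout=timeout)
    try:
        # Sending Host ourselves keeps virtual hosting working while the
        # socket targets the address that was already checked.
        conn.request("GET", path, headers={"Host": hostname})
        return conn.getresponse().read()
    finally:
        conn.close()
```

Because the connection uses the same address that was validated, a second DNS lookup returning a different record never happens.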

Test plan

  • pytest tests/test_web_crawler.py — 35 passed (29 existing + 6 new)
  • pytest tests/ — 2790 passed, 56 skipped

`check_url_ssrf` resolves the hostname via `getaddrinfo` and then checks
only `info[0]` against the private-IP ranges. If a domain returns
multiple A/AAAA records — for example one public and one private —
the first record passes the check while the underlying HTTP client
remains free to connect to any returned address.

Iterate every record and block the URL if any address is in a
private / loopback / link-local / reserved / unspecified / multicast
range. Also drop the unused `_SUSPICIOUS_URL_RE` regex.

Adds tests for the multi-record case, 0.0.0.0, IPv6 loopback,
IPv4-mapped IPv6 loopback, backslash bypass, and userinfo bypass.
@Jiaaqiliu Jiaaqiliu merged commit d66ef84 into aiming-lab:main May 29, 2026
