collapse URLs, IPv6, and multi-segment hostnames to single placeholders#20
collapse URLs, IPv6, and multi-segment hostnames to single placeholders#20HrachShah wants to merge 2 commits into
Conversation
…se OSError/ValueError/TypeError
The error-normalization pipeline had three related bugs that caused real log lines to split into nonsense patterns: - URLs lost their scheme: GET https://api.example.com/foo?x=1 was becoming 'GET https:<PATH>' because the host pattern required a single hostname segment and the path rule then ate the rest of the line. Replace the URL with a single <URL> placeholder first. - IPv6 addresses were shredded: 2001:db8::1 became '<NUM>:db8::<PORT>' because the IPv4 regex misses IPv6 and the bare-port rule fires on the IPv6 colon-segments. Add an IPv6 match before the IPv4/port rules. - Multi-segment hostnames like api.example.com or internal.svc.cluster.local leaked through to the path rule. Extend the host regex to require a known TLD after one-or-more dotted labels, and add a few more TLDs (co, uk, gov, app, ai, me, info, biz, us, edu). The path rule now only matches paths preceded by whitespace or '(', or paths starting with '/' or './' or '../', so it no longer consumes the trailing slash of an already-collapsed <URL>. Added 11 unit tests covering each case, plus a regression test for the specific 'GET https://api.example.com/foo?x=1' input. The 11 pre-existing CLI test failures in tests/test_cli.py are unrelated: they expect fixtures examples/syslog-sample.log, examples/apache-sample.log, and examples/app-json.log, which have never been checked in (git ls-tree finds no examples/ entries).
Reviewer's GuideUpdates log normalization to collapse entire URLs, IPv6 addresses, and multi-segment hostnames into single placeholders, refines path matching to avoid consuming URL components, narrows CLI exception handling, and adds focused unit tests for the new normalization behavior. Flow diagram for updated log normalization pipelineflowchart TD
A[error_msg input] --> B[normalize_error_pattern]
B --> C[re.sub URL -> <URL>]
C --> D[re.sub IPv6 -> <IPV6>]
D --> E[re.sub IPv4_ip_port -> <IP>]
E --> F[re.sub IPv4_ip -> <IP>]
F --> G[re.sub hostname_multi_segment -> <HOST>]
G --> H[re.sub localhost -> <HOST>]
H --> I[re.sub port_suffix -> :<PORT>]
I --> J[re.sub uuid -> <UUID>]
J --> K[re.sub path_with_prefix -> <PATH>]
K --> L[re.sub remaining_numbers -> <NUM>]
L --> M[normalized pattern output]
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
Warning Review limit reached
More reviews will be available in 36 minutes and 50 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses rolling per-developer review limits. Reviews become available again as older review attempts age out of the rolling limit window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- The new path replacement
re.sublambda only preserves a single leading character (m.group(0)[0]), which will collapse./fooand../footo. <PATH>and potentially lose intended path semantics; consider preserving the full prefix (./,../, or leading whitespace/paren) instead of just the first character. - The IPv6 regex
(?:[0-9a-fA-F]{0,4}:){2,7}[0-9a-fA-F]{0,4}may match non-IP hex-with-colon sequences (e.g. genericabcd:1234tokens); if that matters for your logs, you might want to tighten it (e.g. anchor against surrounding context or require at least one::/multiple segments) to avoid over-normalizing unrelated values.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new path replacement `re.sub` lambda only preserves a single leading character (`m.group(0)[0]`), which will collapse `./foo` and `../foo` to `. <PATH>` and potentially lose intended path semantics; consider preserving the full prefix (`./`, `../`, or leading whitespace/paren) instead of just the first character.
- The IPv6 regex `(?:[0-9a-fA-F]{0,4}:){2,7}[0-9a-fA-F]{0,4}` may match non-IP hex-with-colon sequences (e.g. generic `abcd:1234` tokens); if that matters for your logs, you might want to tighten it (e.g. anchor against surrounding context or require at least one `::`/multiple segments) to avoid over-normalizing unrelated values.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Replaces a real bug where several log-line shapes were being split into nonsensical placeholders:
GET https://api.example.com/foo?x=1becameGET https:<PATH>because the host pattern required a single hostname segment and the path rule then ate the rest of the line.2001:db8::1became<NUM>:db8::<PORT>because the IPv4 regex misses IPv6 and the bare-port rule fires on the IPv6 colon-segments.api.example.comorinternal.svc.cluster.localleaked through to the path rule.The path rule now only matches paths preceded by whitespace or
(, or paths starting with/,./, or../, so it no longer consumes the trailing slash of an already-collapsed<URL>.Added 11 unit tests covering each case, including a regression test for the specific
GET https://api.example.com/foo?x=1input. The 11 pre-existing CLI test failures intests/test_cli.pyare unrelated — they expect fixtures (examples/syslog-sample.log, etc.) that have never been checked in.Summary by Sourcery
Improve log normalization to collapse URLs, IPv6 addresses, and multi-segment hostnames into single placeholders and narrow CLI error handling to expected exceptions.
Bug Fixes:
Enhancements:
Tests: