collapse URLs, IPv6, and multi-segment hostnames to single placeholders #20

Open
HrachShah wants to merge 2 commits into main from fix/normalize-url-and-ipv6

HrachShah (Owner) commented Jun 23, 2026

Fixes a real bug in which several log-line shapes were split into nonsensical placeholders:

  • URLs lost their scheme: GET https://api.example.com/foo?x=1 became GET https:<PATH> because the host pattern required a single hostname segment and the path rule then ate the rest of the line.
  • IPv6 addresses were shredded: 2001:db8::1 became <NUM>:db8::<PORT> because the IPv4 regex misses IPv6 and the bare-port rule fires on the IPv6 colon-segments.
  • Multi-segment hostnames like api.example.com or internal.svc.cluster.local leaked through to the path rule.

The path rule now only matches paths preceded by whitespace or (, or paths starting with /, ./, or ../, so it no longer consumes the trailing slash of an already-collapsed <URL>.
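
The ordering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in src/log_analyzer_cli/utils.py: the names NAIVE_PATH, URL_RE, PATH_RE, and normalize are hypothetical, and the regexes are simplified. It shows a greedy path rule reproducing the https:<PATH> symptom, and a URL-first pass combined with a restricted path rule avoiding it.

```python
import re

# All patterns below are simplified assumptions for illustration only.
NAIVE_PATH = re.compile(r"/\S*")  # greedy: fires on the '//' inside a URL
URL_RE = re.compile(r"\b[a-zA-Z][a-zA-Z0-9+.-]*://[^\s<>]+")
# A path must start the line or follow whitespace or '(',
# and must begin with '/', './' or '../'.
PATH_RE = re.compile(r"(?:^|(?<=\s)|(?<=\())(?:\.{1,2})?/[^\s)]*")


def normalize(msg: str) -> str:
    """Collapse whole URLs first, then filesystem-like paths."""
    msg = URL_RE.sub("<URL>", msg)
    return PATH_RE.sub("<PATH>", msg)


line = "GET https://api.example.com/foo?x=1"
print(NAIVE_PATH.sub("<PATH>", line))             # GET https:<PATH>
print(normalize(line))                            # GET <URL>
print(normalize("open /var/log/app.log failed"))  # open <PATH> failed
```

Because the URL is already a single token by the time the path rule runs, and the path rule refuses to start in the middle of a token, neither rule can split the other's output.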

Added 11 unit tests covering each case, including a regression test for the specific GET https://api.example.com/foo?x=1 input. The 11 pre-existing CLI test failures in tests/test_cli.py are unrelated — they expect fixtures (examples/syslog-sample.log, etc.) that have never been checked in.

Summary by Sourcery

Improve log normalization to collapse URLs, IPv6 addresses, and multi-segment hostnames into single placeholders and narrow CLI error handling to expected exceptions.

Bug Fixes:

  • Normalize full URLs with schemes as single placeholders instead of splitting them into scheme and path components.
  • Correctly collapse IPv6 addresses into a single placeholder rather than partially matching segments as ports or numbers.
  • Handle multi-segment hostnames and common TLD combinations so they are collapsed to <HOST> instead of leaking into path normalization.

Enhancements:

  • Refine path normalization to only match legitimate filesystem-like paths and avoid consuming parts of already-collapsed tokens.
  • Extend hostname TLD support to a broader set of common domains and keep email localparts while normalizing only the host.
  • Tighten CLI exception handling to catch only anticipated I/O and value/type errors when running analysis and loading parsers.
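
The hostname enhancement can be illustrated with a small sketch. The names TLDS, HOST_RE, and collapse_hosts are hypothetical, and the TLD list is an assumption: it combines the ten additions named in the commit message with a few common entries presumed to exist already. The actual regex in utils.py may differ.

```python
import re

# Hypothetical TLD list (assumption): the first five are guessed to be
# pre-existing, the rest are the additions named in the commit message.
TLDS = ("com", "org", "net", "io", "local",
        "co", "uk", "gov", "app", "ai", "me", "info", "biz", "us", "edu")

# One or more dotted labels followed by a known TLD.
HOST_RE = re.compile(r"\b(?:[a-zA-Z0-9-]+\.)+(?:" + "|".join(TLDS) + r")\b")


def collapse_hosts(msg: str) -> str:
    """Replace multi-segment hostnames with a single <HOST> placeholder."""
    return HOST_RE.sub("<HOST>", msg)


print(collapse_hosts("connect to api.example.com failed"))
print(collapse_hosts("lookup internal.svc.cluster.local timed out"))
print(collapse_hosts("mail from user@mail.example.org rejected"))
print(collapse_hosts("upgraded to 1.2.3"))
```

Requiring a known TLD is what keeps version strings such as 1.2.3 untouched, and starting the match at a word boundary leaves the email local part in place while the host after the @ is collapsed.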

Tests:

  • Add unit test coverage for URL, hostname (including multi-segment), IPv4/IPv6, UUID, path, version string, and email normalization edge cases in normalize_error_pattern.

Zo Bot added 2 commits May 14, 2026 21:26
The error-normalization pipeline had three related bugs that caused
real log lines to split into nonsense patterns:

- URLs lost their scheme: GET https://api.example.com/foo?x=1 was
  becoming 'GET https:<PATH>' because the host pattern required a
  single hostname segment and the path rule then ate the rest of the
  line. Replace the URL with a single <URL> placeholder first.

- IPv6 addresses were shredded: 2001:db8::1 became '<NUM>:db8::<PORT>'
  because the IPv4 regex misses IPv6 and the bare-port rule fires on
  the IPv6 colon-segments. Add an IPv6 match before the IPv4/port
  rules.

- Multi-segment hostnames like api.example.com or
  internal.svc.cluster.local leaked through to the path rule. Extend
  the host regex to require a known TLD after one-or-more dotted
  labels, and add a few more TLDs (co, uk, gov, app, ai, me, info,
  biz, us, edu).

The path rule now only matches paths preceded by whitespace or '(', or
paths starting with '/' or './' or '../', so it no longer consumes the
trailing slash of an already-collapsed <URL>.

Added 11 unit tests covering each case, plus a regression test for the
specific 'GET https://api.example.com/foo?x=1' input.

The 11 pre-existing CLI test failures in tests/test_cli.py are
unrelated: they expect fixtures examples/syslog-sample.log,
examples/apache-sample.log, and examples/app-json.log, which have
never been checked in (git ls-tree finds no examples/ entries).

sourcery-ai Bot commented Jun 23, 2026

Reviewer's Guide

Updates log normalization to collapse entire URLs, IPv6 addresses, and multi-segment hostnames into single placeholders, refines path matching to avoid consuming URL components, narrows CLI exception handling, and adds focused unit tests for the new normalization behavior.

Flow diagram for updated log normalization pipeline

flowchart TD
    A[error_msg input] --> B[normalize_error_pattern]
    B --> C[re.sub URL -> <URL>]
    C --> D[re.sub IPv6 -> <IPV6>]
    D --> E[re.sub IPv4_ip_port -> <IP>]
    E --> F[re.sub IPv4_ip -> <IP>]
    F --> G[re.sub hostname_multi_segment -> <HOST>]
    G --> H[re.sub localhost -> <HOST>]
    H --> I[re.sub port_suffix -> :<PORT>]
    I --> J[re.sub uuid -> <UUID>]
    J --> K[re.sub path_with_prefix -> <PATH>]
    K --> L[re.sub remaining_numbers -> <NUM>]
    L --> M[normalized pattern output]
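
The flowchart's key property is that substitutions run in a fixed order, from the most structured token to the most generic. A minimal sketch of that idea, assuming a table of (pattern, placeholder) pairs: RULES and apply_rules are hypothetical names, and the patterns are simplified stand-ins covering only four of the stages shown above.

```python
import re

# Simplified stand-in patterns (assumptions); the ordering is the point.
RULES = [
    (re.compile(r"\b[a-zA-Z][a-zA-Z0-9+.-]*://[^\s<>]+"), "<URL>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<IP>"),
    (re.compile(r"\b[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}\b"), "<UUID>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),  # catch-all, so it must run last
]


def apply_rules(msg, rules):
    """Apply each (pattern, placeholder) pair in list order."""
    for pattern, placeholder in rules:
        msg = pattern.sub(placeholder, msg)
    return msg


line = "conn 10.0.0.5:8080 id 42"
print(apply_rules(line, RULES))                  # conn <IP> id <NUM>
print(apply_rules(line, list(reversed(RULES))))  # conn <NUM>.<NUM>.<NUM>.<NUM>:<NUM> id <NUM>
```

Running the same rules in reverse shreds the address into number placeholders, which is the kind of failure the reordered pipeline is meant to prevent.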

File-Level Changes

Change: Normalize URLs as single placeholders before other substitutions to avoid splitting schemes, hosts, and paths.
Details:
  • Add regex to detect full URLs with schemes and replace them with a <URL> placeholder before any hostname or path substitutions.
  • Ensure URLs with ports and paths (including multi-label subdomains) are fully collapsed into a single token.
Files: src/log_analyzer_cli/utils.py, tests/test_utils.py

Change: Add IPv6-specific normalization and improve hostname matching to support multi-segment domains and broader TLD coverage.
Details:
  • Introduce IPv6 regex that matches typical IPv6 forms, including :: compression, and replaces them with an <IPV6> placeholder before IPv4/port handling.
  • Replace the simple single-label hostname pattern with one that recognizes multi-segment hostnames (e.g., api.example.com, internal.svc.cluster.local) and extend the allowed TLD list.
  • Retain a separate localhost-to-<HOST> replacement for explicit localhost handling.
Files: src/log_analyzer_cli/utils.py, tests/test_utils.py

Change: Refine path detection so it only matches real paths and no longer consumes parts of already-collapsed URLs.
Details:
  • Change the path regex to only match paths starting at line-beginning or preceded by whitespace/"("/"=", and with prefixes "/", "./", or "../".
  • Use a replacement function that preserves the leading separator while substituting the remainder of the path with <PATH>.
  • Add regression tests ensuring paths in non-URL contexts are still normalized and that version strings and emails are not over-normalized.
Files: src/log_analyzer_cli/utils.py, tests/test_utils.py

Change: Tighten CLI exception handling to only catch expected IO/validation errors instead of all exceptions.
Details:
  • Change broad Exception handling in analyze() to catch OSError, ValueError, and TypeError, and continue to print an error and exit with code 1.
  • Update _get_parser() to catch only OSError and ValueError when reading sample files and emit a warning instead of failing generically.
Files: src/log_analyzer_cli/cli.py
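
The narrowed exception handling can be sketched in isolation. The function analyze_file below is a hypothetical stand-in, not the project's analyze() command, and it assumes nothing about the CLI framework in use; it only shows the pattern of catching the expected error types and letting everything else propagate.

```python
import sys


def analyze_file(path: str) -> int:
    """Hypothetical stand-in for the CLI's analyze step; returns an exit code."""
    try:
        with open(path, encoding="utf-8") as fh:
            lines = fh.readlines()
        if not lines:
            raise ValueError(f"{path} contains no log lines")
    except (OSError, ValueError, TypeError) as exc:
        # Expected failures: missing or unreadable file, undecodable or empty input.
        print(f"Error: {exc}", file=sys.stderr)
        return 1
    # Anything else (KeyError, AttributeError, ...) propagates with a traceback,
    # so genuine bugs are no longer reported as ordinary input errors.
    print(f"analyzed {len(lines)} lines")
    return 0
```

A caller would typically pass the result to sys.exit(), so a missing file exits with code 1 while a programming error still surfaces as an unhandled exception.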


coderabbitai Bot commented Jun 23, 2026

Warning

Review limit reached

@HrachShah, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 36 minutes and 50 seconds.

📥 Commits

Reviewing files that changed from the base of the PR and between e93757f and 859e166.

📒 Files selected for processing (3)
  • src/log_analyzer_cli/cli.py
  • src/log_analyzer_cli/utils.py
  • tests/test_utils.py
sourcery-ai Bot left a comment

Hey - I've left some high-level feedback:

  • The new path replacement re.sub lambda only preserves a single leading character (m.group(0)[0]), which will collapse ./foo and ../foo to . <PATH> and potentially lose intended path semantics; consider preserving the full prefix (./, ../, or leading whitespace/paren) instead of just the first character.
  • The IPv6 regex (?:[0-9a-fA-F]{0,4}:){2,7}[0-9a-fA-F]{0,4} may match non-IP hex-with-colon sequences (e.g. generic abcd:1234 tokens); if that matters for your logs, you might want to tighten it (e.g. anchor against surrounding context or require at least one ::/multiple segments) to avoid over-normalizing unrelated values.
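
Both points can be explored with a short sketch. Everything here other than the quoted IPv6 regex is an assumption: IPV6_CANDIDATE, PATH_RE, collapse_ipv6, and collapse_paths are hypothetical names, and validating candidates with the standard-library ipaddress module is one possible tightening, not necessarily the one the project adopts. As quoted, the regex needs at least two colons, so abcd:1234 on its own would not match, but a clock time such as 12:34:56 would.

```python
import ipaddress
import re

# Regex exactly as quoted in the review comment.
IPV6_LOOSE = re.compile(r"(?:[0-9a-fA-F]{0,4}:){2,7}[0-9a-fA-F]{0,4}")
# Hypothetical tightening: same body, but not glued to a neighbouring word or colon.
IPV6_CANDIDATE = re.compile(
    r"(?<![\w:])(?:[0-9a-fA-F]{0,4}:){2,7}[0-9a-fA-F]{0,4}(?![\w:])"
)
# Hypothetical path rule: group 1 is the separator, group 2 the path itself.
PATH_RE = re.compile(r"(^|[\s(=])((?:\.{1,2})?/[^\s)]*)")


def _ipv6_or_original(m: re.Match) -> str:
    """Collapse the token only if the stdlib accepts it as an IPv6 address."""
    try:
        ipaddress.IPv6Address(m.group(0))
    except ValueError:
        return m.group(0)
    return "<IPV6>"


def collapse_ipv6(msg: str) -> str:
    return IPV6_CANDIDATE.sub(_ipv6_or_original, msg)


def collapse_paths(msg: str) -> str:
    # Re-emit the captured separator instead of indexing m.group(0)[0].
    return PATH_RE.sub(lambda m: m.group(1) + "<PATH>", msg)


print(IPV6_LOOSE.search("abcd:1234"))          # None: fewer than two colons
print(IPV6_LOOSE.search("12:34:56").group(0))  # 12:34:56 (a clock time)
print(collapse_ipv6("at 12:34:56 peer 2001:db8::1 closed"))
print(collapse_paths("load ./conf/app.yaml (../x/y) path=/etc/hosts"))
```

Capturing the separator in its own group means the replacement never has to guess how many leading characters belong to the prefix, and validating candidates keeps timestamps out of the <IPV6> bucket at the cost of one extra parse per match.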
