collapse URLs, IPv6, and multi-segment hostnames to single placeholders by HrachShah · Pull Request #20 · HrachShah/log-analyzer-cli

HrachShah · 2026-06-23T11:51:25Z

Fixes a real bug where several log-line shapes were split into nonsensical placeholders:

URLs lost their scheme: GET https://api.example.com/foo?x=1 became GET https:<PATH> because the host pattern required a single hostname segment and the path rule then ate the rest of the line.
IPv6 addresses were shredded: 2001:db8::1 became <NUM>:db8::<PORT> because the IPv4 regex misses IPv6 and the bare-port rule fires on the IPv6 colon-segments.
Multi-segment hostnames like api.example.com or internal.svc.cluster.local leaked through to the path rule.

The path rule now only matches paths preceded by whitespace or (, or paths starting with /, ./, or ../, so it no longer consumes the trailing slash of an already-collapsed <URL>.
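As a rough illustration of the narrowed path rule described above, the sketch below uses a negative lookbehind so that a path may only start at the beginning of the line or right after whitespace or `(`. The names `PATH_RE` and `collapse_paths` are hypothetical, and the pattern is an assumption made for this example, not the regex that ships in `utils.py`.

```python
import re

# Hypothetical pattern (not the PR's actual regex).
# (?<![^\s(])  -> previous character is absent, whitespace, or "("
# (?:\.{1,2})? -> optional "." or ".." so that "/", "./", and "../" all qualify
PATH_RE = re.compile(r"(?<![^\s(])(?:\.{1,2})?/[^\s()]*")


def collapse_paths(text: str) -> str:
    """Replace filesystem-like paths with <PATH>, leaving other slashes alone."""
    return PATH_RE.sub("<PATH>", text)


print(collapse_paths("open /var/log/app.log failed"))
print(collapse_paths("GET <URL>/ 200"))
```

Because the lookbehind does not consume the preceding character, the slash that follows an already-collapsed `<URL>` is left untouched. A bare `/` surrounded by spaces would still match under this sketch, so a real rule may need further tightening.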

Added 11 unit tests covering each case, including a regression test for the specific GET https://api.example.com/foo?x=1 input. The 11 pre-existing CLI test failures in tests/test_cli.py are unrelated — they expect fixtures (examples/syslog-sample.log, etc.) that have never been checked in.

Summary by Sourcery

Improve log normalization to collapse URLs, IPv6 addresses, and multi-segment hostnames into single placeholders and narrow CLI error handling to expected exceptions.

Bug Fixes:

Normalize full URLs with schemes as single placeholders instead of splitting them into scheme and path components.
Correctly collapse IPv6 addresses into a single placeholder rather than partially matching segments as ports or numbers.
Handle multi-segment hostnames and common TLD combinations so they are collapsed to <HOST> instead of leaking into path normalization.

Enhancements:

Refine path normalization to only match legitimate filesystem-like paths and avoid consuming parts of already-collapsed tokens.
Extend hostname TLD support to a broader set of common domains and keep email localparts while normalizing only the host.
Tighten CLI exception handling to catch only anticipated I/O and value/type errors when running analysis and loading parsers.

Tests:

Add unit test coverage for URL, hostname (including multi-segment), IPv4/IPv6, UUID, path, version string, and email normalization edge cases in normalize_error_pattern.

…se OSError/ValueError/TypeError

The error-normalization pipeline had three related bugs that caused real log lines to split into nonsense patterns:

- URLs lost their scheme: GET https://api.example.com/foo?x=1 was becoming 'GET https:<PATH>' because the host pattern required a single hostname segment and the path rule then ate the rest of the line. Replace the URL with a single <URL> placeholder first.
- IPv6 addresses were shredded: 2001:db8::1 became '<NUM>:db8::<PORT>' because the IPv4 regex misses IPv6 and the bare-port rule fires on the IPv6 colon-segments. Add an IPv6 match before the IPv4/port rules.
- Multi-segment hostnames like api.example.com or internal.svc.cluster.local leaked through to the path rule. Extend the host regex to require a known TLD after one-or-more dotted labels, and add a few more TLDs (co, uk, gov, app, ai, me, info, biz, us, edu).

The path rule now only matches paths preceded by whitespace or '(', or paths starting with '/' or './' or '../', so it no longer consumes the trailing slash of an already-collapsed <URL>.

Added 11 unit tests covering each case, plus a regression test for the specific 'GET https://api.example.com/foo?x=1' input.

The 11 pre-existing CLI test failures in tests/test_cli.py are unrelated: they expect fixtures examples/syslog-sample.log, examples/apache-sample.log, and examples/app-json.log, which have never been checked in (git ls-tree finds no examples/ entries).

sourcery-ai · 2026-06-23T11:51:34Z

Reviewer's Guide

Updates log normalization to collapse entire URLs, IPv6 addresses, and multi-segment hostnames into single placeholders, refines path matching to avoid consuming URL components, narrows CLI exception handling, and adds focused unit tests for the new normalization behavior.

Flow diagram for updated log normalization pipeline

flowchart TD
    A[error_msg input] --> B[normalize_error_pattern]
    B --> C[re.sub URL -> <URL>]
    C --> D[re.sub IPv6 -> <IPV6>]
    D --> E[re.sub IPv4_ip_port -> <IP>]
    E --> F[re.sub IPv4_ip -> <IP>]
    F --> G[re.sub hostname_multi_segment -> <HOST>]
    G --> H[re.sub localhost -> <HOST>]
    H --> I[re.sub port_suffix -> :<PORT>]
    I --> J[re.sub uuid -> <UUID>]
    J --> K[re.sub path_with_prefix -> <PATH>]
    K --> L[re.sub remaining_numbers -> <NUM>]
    L --> M[normalized pattern output]
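The ordering in the diagram can be sketched as a list of (pattern, placeholder) pairs applied in sequence. This is a reduced, illustrative subset: the UUID and path steps are omitted, the regexes and the short TLD list are assumptions made for this example, and `normalize_sketch` is a hypothetical name rather than the project's `normalize_error_pattern`.

```python
import re

# Simplified, illustrative rules in the same relative order as the diagram.
# URL and IPv6 run first so later rules never see their colons, dots, or digits.
RULES = [
    (re.compile(r"\b[A-Za-z][A-Za-z0-9+.-]*://[^\s\"'<>]+"), "<URL>"),
    (re.compile(r"(?<![0-9A-Fa-f:])(?:[0-9A-Fa-f]{0,4}:){2,7}"
                r"[0-9A-Fa-f]{0,4}(?![0-9A-Fa-f:])"), "<IPV6>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<IP>"),
    (re.compile(r"\b(?:[A-Za-z0-9-]+\.)+(?:com|org|net|io|local)\b"), "<HOST>"),
    (re.compile(r"\blocalhost\b"), "<HOST>"),
    (re.compile(r":\d{2,5}\b"), ":<PORT>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def normalize_sketch(message: str) -> str:
    """Apply each substitution in order; earlier rules shield later ones."""
    for pattern, placeholder in RULES:
        message = pattern.sub(placeholder, message)
    return message


print(normalize_sketch("GET https://api.example.com/foo?x=1"))
print(normalize_sketch("connect to 2001:db8::1 failed"))
```

With the URL and IPv6 rules first, the port and number rules have nothing left to match inside those tokens, which is the failure mode the PR description reports for the old ordering.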

File-Level Changes

Change	Details	Files
Normalize URLs as single placeholders before other substitutions to avoid splitting schemes, hosts, and paths.	Add regex to detect full URLs with schemes and replace them with a <URL> placeholder before any hostname or path substitutions. Ensure URLs with ports and paths (including multi-label subdomains) are fully collapsed into a single token.	`src/log_analyzer_cli/utils.py` `tests/test_utils.py`
Add IPv6-specific normalization and improve hostname matching to support multi-segment domains and broader TLD coverage.	Introduce IPv6 regex that matches typical IPv6 forms, including :: compression, and replaces them with an <IPV6> placeholder before IPv4/port handling. Replace the simple single-label hostname pattern with one that recognizes multi-segment hostnames (e.g., api.example.com, internal.svc.cluster.local) and extend the allowed TLD list. Retain a separate localhost-to-<HOST> replacement for explicit localhost handling.	`src/log_analyzer_cli/utils.py` `tests/test_utils.py`
Refine path detection so it only matches real paths and no longer consumes parts of already-collapsed URLs.	Change the path regex to only match paths starting at line-beginning or preceded by whitespace/"("/"=", and with prefixes "/", "./", or "../". Use a replacement function that preserves the leading separator while substituting the remainder of the path with <PATH>. Add regression tests ensuring paths in non-URL contexts are still normalized and that version strings and emails are not over-normalized.	`src/log_analyzer_cli/utils.py` `tests/test_utils.py`
Tighten CLI exception handling to only catch expected IO/validation errors instead of all exceptions.	Change broad Exception handling in analyze() to catch OSError, ValueError, and TypeError, and continue to print an error and exit with code 1. Update _get_parser() to catch only OSError and ValueError when reading sample files and emit a warning instead of failing generically.	`src/log_analyzer_cli/cli.py`
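The narrowed exception handling in the last row might look roughly like the sketch below. It avoids the CLI framework so that it stays self-contained; `run_guarded` is a hypothetical helper, not a function from `cli.py`, and the exit-code convention is taken from the table above.

```python
import sys
from typing import Callable


def run_guarded(action: Callable[[], None]) -> int:
    """Run an action, mapping only anticipated errors to exit code 1.

    Hypothetical stand-in for the body of analyze(). Anything outside
    the listed types (e.g. KeyError from a real bug) propagates with
    its traceback instead of being reported as a user error.
    """
    try:
        action()
    except (OSError, ValueError, TypeError) as exc:
        print(f"Error: {exc}", file=sys.stderr)
        return 1
    return 0


def read_missing_file() -> None:
    # FileNotFoundError is an OSError subclass, so it is caught above.
    with open("/no/such/dir/sample.log", encoding="utf-8") as handle:
        handle.read()


print(run_guarded(read_missing_file))
```

Since `UnicodeDecodeError` is a subclass of `ValueError`, undecodable log files would still be reported as ordinary errors under this scheme, while genuine programming mistakes surface immediately.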

coderabbitai · 2026-06-23T11:53:58Z

Warning

Review limit reached

@HrachShah, we couldn't start this review because you've reached your PR review rate limit.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1e231583-9a9e-40e9-9d17-e50888bb02d6

📥 Commits

Reviewing files that changed from the base of the PR and between e93757f and 859e166.

📒 Files selected for processing (3)

src/log_analyzer_cli/cli.py
src/log_analyzer_cli/utils.py
tests/test_utils.py

sourcery-ai

Hey - I've left some high level feedback:

The new path replacement re.sub lambda only preserves a single leading character (m.group(0)[0]), which will collapse ./foo and ../foo to . <PATH> and potentially lose intended path semantics; consider preserving the full prefix (./, ../, or leading whitespace/paren) instead of just the first character.
The IPv6 regex (?:[0-9a-fA-F]{0,4}:){2,7}[0-9a-fA-F]{0,4} may match non-IP hex-with-colon sequences (e.g. generic abcd:1234 tokens); if that matters for your logs, you might want to tighten it (e.g. anchor against surrounding context or require at least one ::/multiple segments) to avoid over-normalizing unrelated values.
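One possible way to act on the second point is to treat the regex as a cheap candidate finder and let the standard library's `ipaddress` module decide whether a token really is an IPv6 address. The sketch below is an assumption about how that could be done, not code from this PR; `CANDIDATE_RE` and `collapse_ipv6` are hypothetical names.

```python
import ipaddress
import re

# Loose candidate: a run of hex digits and colons that is not glued to
# other word characters. Validation happens in the replacement function.
CANDIDATE_RE = re.compile(r"(?<![\w:])[0-9A-Fa-f:]{2,}(?![\w:])")


def _replace_if_ipv6(match: re.Match) -> str:
    token = match.group(0)
    if token.count(":") < 2:
        return token  # e.g. "abcd:1234" stays as it is
    try:
        ipaddress.IPv6Address(token)
    except ValueError:  # AddressValueError is a ValueError subclass
        return token  # e.g. the timestamp "12:34:56" stays as it is
    return "<IPV6>"


def collapse_ipv6(text: str) -> str:
    """Replace only tokens that parse as IPv6 addresses."""
    return CANDIDATE_RE.sub(_replace_if_ipv6, text)


print(collapse_ipv6("connect to 2001:db8::1 failed"))
print(collapse_ipv6("key abcd:1234 set at 12:34:56"))
```

This trades a little speed for precision. Zone identifiers such as `fe80::1%eth0` and IPv4-mapped forms are not handled by the candidate pattern here and would need separate treatment.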

Zo Bot added 2 commits May 14, 2026 21:26

narrow exception handlers in CLI commands — open() and click.echo rai…

75d99f2

…se OSError/ValueError/TypeError

sourcery-ai Bot reviewed Jun 23, 2026

HrachShah wants to merge 2 commits into main from fix/normalize-url-and-ipv6
