fix(pipeline): stop creating Route nodes from URLs in config files#646

Open
mvanhorn wants to merge 1 commit into
DeusData:mainfrom
mvanhorn:fix/521-route-nodes-config-files

@mvanhorn

Contributor

What does this PR do?

Indexing a repo of only config files produced spurious Route nodes from arbitrary URL-like strings. The infra-route extractor (cbm_pipeline_extract_infra_routes in src/pipeline/pipeline.c) harvested any CBM_STRREF_URL string literal from any .yaml/.yml/.tf/.hcl/.toml file, regardless of whether the host file actually defines service routes.

This restricts the loose URL string-ref harvesting to genuine Infrastructure-as-Code files (Terraform / HCL) and additionally requires the value to be a bare URL:

  • cbm_is_infra_route_source_file() — only .tf, .tf.json, .hcl are route sources. Generic config (config.yaml), dependency manifests (dependabot.yaml), container orchestration (compose.yaml), and Kubernetes / Kustomize manifests are excluded.
  • cbm_is_bare_endpoint_url() — rejects command strings that merely embed a URL (e.g. a curl ... || exit 1 healthcheck), while still accepting query-string URLs.
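
The file gate described above can be pictured as a suffix allow-list. The following is a minimal sketch, not the code from this PR: the `sketch_` names are hypothetical stand-ins, and matching by case-sensitive path suffix is an assumption about how the real `cbm_is_infra_route_source_file()` decides.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true when `s` ends with `suffix` (case-sensitive). */
static bool sketch_has_suffix(const char *s, const char *suffix) {
    size_t n = strlen(s);
    size_t m = strlen(suffix);
    return n >= m && strcmp(s + (n - m), suffix) == 0;
}

/* Sketch of an allow-list gate: only Terraform / HCL paths count as
 * route sources. Everything else (YAML, TOML, .tfvars, ...) is rejected
 * by default rather than by an explicit deny-list. */
static bool sketch_is_infra_route_source_file(const char *path) {
    static const char *const allowed[] = { ".tf", ".tf.json", ".hcl" };
    if (path == NULL) {
        return false;
    }
    for (size_t i = 0; i < sizeof allowed / sizeof allowed[0]; i++) {
        if (sketch_has_suffix(path, allowed[i])) {
            return true;
        }
    }
    return false;
}
```

An allow-list keeps the default safe: a new config format added to the indexer later would not start emitting Route nodes unless it is listed explicitly.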

Structured topic→endpoint bindings still flow through cbm_pipeline_process_infra_bindings(), so real infrastructure endpoints (Cloud Scheduler / Pub/Sub targets) continue to produce Route nodes.

Why this matters: per #521, a three-file repro (dependabot.yaml + config.yaml + compose.yaml) yielded four bogus routes — a Terraform registry URL, a JWKS discovery URL, an upstream service host, and a healthcheck shell command. None is a route the service serves; they inflate the Route set that get_architecture and cross-repo route matching depend on, making downstream matching noisier.

Checklist

  • Every commit is signed off (git commit -s) — required, CI rejects
    unsigned commits (DCO, see CONTRIBUTING.md)
  • Tests pass locally (make -f Makefile.cbm test)
  • Lint passes (make -f Makefile.cbm lint-ci)
  • New behavior is covered by a test (reproduce-first for bug fixes)

Testing notes

  • Built the indexer and reproduced the issue: before the fix, the three-file config repro produced the four Route nodes described in #521 (Route nodes created from URL strings in config / non-source files); after the fix it produces zero. A Terraform .tf endpoint URL still produces an infra Route.
  • Added unit tests in tests/test_pipeline.c:
    • infra_route_source_file_gate — Terraform/HCL accepted; dependabot/config/compose/k8s/kustomize/toml and .tfvars rejected.
    • infra_bare_endpoint_url_gate — bare URLs accepted; healthcheck/command strings rejected.
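
The accept/reject cases in `infra_bare_endpoint_url_gate` can be illustrated with a small predicate. This is a sketch under stated assumptions, not the PR's `cbm_is_bare_endpoint_url()`: the `sketch_` name is hypothetical, and "bare" is assumed here to mean "starts with `http://` or `https://` and contains no whitespace".

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Sketch: treat a value as a bare endpoint URL only if the whole string
 * is the URL. Assumed rule (not taken from the PR): it must begin with an
 * http(s) scheme, have something after the scheme, and contain no
 * whitespace anywhere. */
static bool sketch_is_bare_endpoint_url(const char *value) {
    static const char *const schemes[] = { "https://", "http://" };
    const char *rest = NULL;

    if (value == NULL) {
        return false;
    }
    for (size_t i = 0; i < sizeof schemes / sizeof schemes[0]; i++) {
        size_t len = strlen(schemes[i]);
        if (strncmp(value, schemes[i], len) == 0) {
            rest = value + len;
            break;
        }
    }
    if (rest == NULL || *rest == '\0') {
        return false; /* no scheme prefix, or nothing after it */
    }
    for (const char *p = rest; *p != '\0'; p++) {
        if (isspace((unsigned char)*p)) {
            return false; /* whitespace suggests a command line, not a URL */
        }
    }
    return true;
}
```

Under these assumptions a healthcheck such as `curl -f http://localhost:8080/health || exit 1` is rejected because it does not start with a scheme, while a query-string URL such as `https://api.example.com/v1/items?limit=10` is accepted because `?` and `=` are not whitespace.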

Fixes #521

Infra Route extraction harvested any URL-like string literal from any
YAML/TF/TOML file, so a repo of only config files produced spurious
Route nodes (terraform registry URL, a JWKS discovery URL, an upstream
host, and a healthcheck shell command). These inflated the Route set
that get_architecture and cross-repo matching rely on.

Restrict the loose string-ref harvesting to genuine Infrastructure-as-Code
files (Terraform / HCL) and require a bare URL value, so generic config,
dependabot, compose and k8s/kustomize manifests no longer emit Routes.
Structured topic->endpoint bindings still flow through
cbm_pipeline_process_infra_bindings(), so real infra endpoints are kept.

Fixes DeusData#521

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QK73cX8EuqqwQEJUbycu6g
Signed-off-by: mvanhorn <mvanhorn@gmail.com>

Development

Successfully merging this pull request may close these issues.

Route nodes created from URL strings in config / non-source files