Skip to content

feat(cypher): name un-aliased RETURN columns by their source text (#598) — corpus 250→313#906

Merged
DecisionNerd merged 1 commit into
mainfrom
feature/598-return-column-names
Jun 24, 2026
Merged

feat(cypher): name un-aliased RETURN columns by their source text (#598) — corpus 250→313#906
DecisionNerd merged 1 commit into
mainfrom
feature/598-return-column-names

Conversation

@DecisionNerd

@DecisionNerd DecisionNerd commented Jun 24, 2026

Copy link
Copy Markdown
Owner

What

openCypher names an un-aliased projection column by the expression as writtenn.prop, count(*), a.x IS NULL. GraphForge left those names to DataFusion, which derives them from the lowered expression (internal var_N qualifiers, substituted literals). So a large class of TCK scenarios computed the right values under the wrong header and failed the header-exact the result should be comparison (column_by_name(<header>)).

  • Parser: captures each un-aliased item's verbatim expression text into a new ReturnItem::display (sliced from the source via the expr's span). RETURN * expands to one display per in-scope variable.
  • Binder: lower_return_items uses display as the default column name when there is no explicit AS, after the existing node/path-variable naming.
  • WITH is untouched — it still requires an explicit alias for non-variable items.

Impact (advisory TCK gate)

Corpus passing 250 → 313 (+63, 0 regressions) — Boolean1–5, Comparison1–2, Match5, List1/6, Graph6, and more, which were value-correct but header-mismatched. Baseline re-blessed.

Test/golden updates

  • gf-ir logical-plan goldens re-blessed: un-aliased columns now carry their source-text alias (e.g. "alias": null → "n.name"). gf-rel goldens unaffected (their queries are all aliased).
  • e2e_baseline node_uuid tests now pin the column with an explicit AS node_uuid — they assert the uuid type/value, not column naming.

Verification

$ cargo test --workspace --no-fail-fast       # ALL GREEN
$ cargo test -p gf-api --test bdd             # 313 passing, 0 regressed
$ cargo clippy --workspace -- -D warnings     # clean

Refs #598 (does not close).

🤖 Generated with Claude Code

Note

Name un-aliased Cypher RETURN columns by their verbatim source expression text

  • Adds a display field to ReturnItem in ast.rs that stores the verbatim source text of un-aliased projection expressions, captured via a new TokenStream.text() helper in the parser.
  • Updates the binder's lower_return_items to use display as the default column name for un-aliased expressions, while bare path/node variables continue to use the variable name and explicit AS aliases take priority.
  • RETURN * expansion now names each synthesized column by its variable identifier.
  • Expands TCK scenario passing count from 250 to 313; existing e2e tests are updated to use explicit AS aliases to remain compatible with the new naming behavior.
  • Behavioral Change: un-aliased property access columns (e.g. RETURN n.node_uuid) are now named n.node_uuid instead of a derived name; callers relying on the old column names must add explicit aliases.

Macroscope summarized 292b913.

Summary by CodeRabbit

  • New Features

    • Improved deterministic column naming for unaliased RETURN expressions to better align with openCypher standards.
  • Tests

    • Expanded test coverage significantly: 313 scenarios now passing (up from 250), including Boolean operations, comparisons, graph literals, list indexing, pattern matching, mathematical functions, and return projections.

openCypher names a projection column by the expression as written (`n.prop`,
`count(*)`, `a.x IS NULL`). GraphForge left un-aliased items' names to DataFusion,
which derives them from the LOWERED expression (internal `var_N` qualifiers,
substituted literals) — so a large class of scenarios computed the right values
under the wrong header and failed the TCK's header-exact result comparison.

The parser now captures each un-aliased item's verbatim expression text into
`ReturnItem::display` (sliced from the source by the expr's span; `RETURN *`
expands to one display per variable). The binder uses it as the default `RETURN`
column name when there is no explicit `AS`, after the existing node/path-var
naming. `WITH` is unchanged (it still requires an alias for non-variable items).

Corpus passing 250 → 313 (+63: Boolean1–5, Comparison1–2, Match5, List, Graph6,
…); 0 regressions; baseline re-blessed. IR goldens re-blessed (un-aliased columns
now carry their source-text alias). e2e node_uuid tests pin the column name with
an explicit `AS node_uuid` (they assert the uuid type/value, not column naming).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

Adds a display: Option<String> field to ReturnItem in the AST to store verbatim expression text for unaliased projections. The parser populates this field via a new TokenStream::text helper. The binder then uses display to assign deterministic column names for unaliased RETURN items. E2E tests add explicit AS node_uuid aliases, and the TCK passing baseline gains 63 new scenarios.

Changes

ReturnItem.display for deterministic projection column names

Layer / File(s) Summary
ReturnItem.display AST contract
crates/gf-ast/src/ast.rs, crates/gf-ast/src/tests.rs
ReturnItem gains pub display: Option<String> with #[serde(default)]. AST round-trip tests (ast_return_clause_roundtrip, ast_call_clause_roundtrip) updated to include display: None in struct initializers.
Parser captures verbatim expression text
crates/gf-cypher/src/parser/mod.rs, crates/gf-cypher/src/parser/clauses.rs
TokenStream::text(span) slices the raw input by byte-offset span, returning "" on out-of-bounds. parse_one_return_item uses it to capture the trimmed expression source as display before optional AS alias parsing; RETURN * sets display: None.
Binder uses display for unaliased projection aliasing
crates/gf-ir/src/binder.rs
lower_return_items now defaults an unaliased item's projection column name to item.display instead of leaving it unspecified. expand_return_wildcard sets display: Some(variable_name) on each expanded RETURN * item.
E2E tests, TCK baseline, and coverage updates
crates/gf-api/tests/e2e_baseline.rs, tests/tck/passing_baseline.txt, tests/tck/coverage_matrix.json
E2E tests add explicit AS node_uuid aliases so column-name lookups remain stable under the new projection naming. TCK passing baseline adds 63 scenarios across Boolean, Comparison, Graph, List, Match, Return, TypeConversion, and Unwind categories. Coverage matrix increments scenarios_passing from 250 to 313.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • DecisionNerd/graphforge#171: Changes how column names are derived for unaliased RETURN projections via expression-to-string conversion and enforces UNION branch column-name equality, directly overlapping with this PR's projection naming goal.
  • DecisionNerd/graphforge#367: Adds a source_text field on ReturnItem and uses it to drive projection column key naming in the executor — the same structural pattern as this PR's display field.
  • DecisionNerd/graphforge#897: Modifies RETURN * wildcard handling in binder.rs, directly adjacent to the expand_return_wildcard change in this PR that sets display on expanded items.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding source-text-based naming for un-aliased RETURN columns in Cypher, with TCK corpus metrics (+63 passing scenarios).
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description comprehensively addresses what changed, why, the impact, and test updates with clear examples and verification results.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/598-return-column-names

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/gf-ir/src/binder.rs (1)

923-934: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

display naming is skipped for aggregate RETURN paths.

Line 930 routes any aggregate-containing RETURN to lower_return_aggregate, so the new fallback at Line 1001 (item.display) is never applied there. As a result, unaliased aggregate outputs still default to agg_{idx} (Line 1077), and aggregate RETURN headers remain inconsistent with the deterministic expression-text naming introduced in this PR.

Suggested starting fix
-                let alias = item.alias.clone().unwrap_or_else(|| format!("agg_{idx}"));
+                let alias = item
+                    .alias
+                    .clone()
+                    .or_else(|| item.display.clone())
+                    .unwrap_or_else(|| format!("agg_{idx}"));

Also applies to: 994-1012, 1053-1078

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gf-ir/src/binder.rs` around lines 923 - 934, The lower_return function
routes aggregate-containing RETURN clauses to lower_return_aggregate, which
bypasses the new display naming fallback introduced in this PR (used in the
non-aggregate path via lower_return_items). This causes unaliased aggregate
outputs to default to agg_{idx} naming instead of using the deterministic
expression-text naming. Apply the same item.display naming logic that is used in
the non-aggregate branch (around the lower_return_items call) to the
lower_return_aggregate function to ensure consistent and deterministic naming
for both aggregate and non-aggregate RETURN clauses.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/gf-ir/src/binder.rs`:
- Around line 923-934: The lower_return function routes aggregate-containing
RETURN clauses to lower_return_aggregate, which bypasses the new display naming
fallback introduced in this PR (used in the non-aggregate path via
lower_return_items). This causes unaliased aggregate outputs to default to
agg_{idx} naming instead of using the deterministic expression-text naming.
Apply the same item.display naming logic that is used in the non-aggregate
branch (around the lower_return_items call) to the lower_return_aggregate
function to ensure consistent and deterministic naming for both aggregate and
non-aggregate RETURN clauses.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 18f3564a-797f-4c7f-8ce2-c60d61ac0c00

📥 Commits

Reviewing files that changed from the base of the PR and between 8ffeea7 and 292b913.

⛔ Files ignored due to path filters (12)
  • CHANGELOG.md is excluded by !**/*.md
  • crates/gf-ir/tests/ir_goldens/golden__filtered_scan.snap is excluded by !**/*.snap
  • crates/gf-ir/tests/ir_goldens/golden__one_hop_expand.snap is excluded by !**/*.snap
  • crates/gf-ir/tests/ir_goldens/golden__optional_match.snap is excluded by !**/*.snap
  • crates/gf-ir/tests/ir_goldens/golden__order_by_limit.snap is excluded by !**/*.snap
  • crates/gf-ir/tests/ir_goldens/golden__parameter.snap is excluded by !**/*.snap
  • crates/gf-ir/tests/ir_goldens/golden__simple_node_scan.snap is excluded by !**/*.snap
  • crates/gf-ir/tests/ir_goldens/golden__two_hop_expand.snap is excluded by !**/*.snap
  • crates/gf-ir/tests/ir_goldens/golden__unwind.snap is excluded by !**/*.snap
  • crates/gf-ir/tests/ir_goldens/golden__variable_length_expand.snap is excluded by !**/*.snap
  • crates/gf-ir/tests/ir_goldens/golden__with_pipeline.snap is excluded by !**/*.snap
  • docs/reference/tck-compliance.md is excluded by !**/*.md, !**/docs/**
📒 Files selected for processing (8)
  • crates/gf-api/tests/e2e_baseline.rs
  • crates/gf-ast/src/ast.rs
  • crates/gf-ast/src/tests.rs
  • crates/gf-cypher/src/parser/clauses.rs
  • crates/gf-cypher/src/parser/mod.rs
  • crates/gf-ir/src/binder.rs
  • tests/tck/coverage_matrix.json
  • tests/tck/passing_baseline.txt

@DecisionNerd DecisionNerd merged commit 4bd3433 into main Jun 24, 2026
43 checks passed
@DecisionNerd DecisionNerd deleted the feature/598-return-column-names branch June 24, 2026 03:54
DecisionNerd added a commit that referenced this pull request Jun 24, 2026
#909)

An aggregate RETURN item with no `AS` (`RETURN count(*)`, `min(x)`, `sum(i)`) was
named `agg_<idx>`, so it failed the TCK's header-exact result comparison. It now
takes the verbatim expression text (`ReturnItem::display`, the same source the
non-aggregate columns use since #906), falling back to `agg_<idx>` only when
neither an explicit alias nor display text is present.

Corpus passing 555 → 570 (+15, 0 regressions); baseline re-blessed. Naming the
group-by keys of a *mixed* aggregate query (`RETURN n.name, sum(n.num)`) is a
follow-up — it needs per-key output aliases on the `Aggregate` op.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant