diff --git a/docs/database-support.md b/docs/database-support.md index af2da54d..3414695e 100644 --- a/docs/database-support.md +++ b/docs/database-support.md @@ -5,23 +5,35 @@ SQL generation. Databases are supported at two tiers. ## Tier 1 — fully tested -Integration tests and/or Docker examples; must not regress. - -| Engine | Coverage | -|---|---| -| **SQLite** | Integration tests in `tests/integration/test_integration.py`; embedded example. | -| **Postgres** | Integration tests in `tests/integration/test_integration_postgres.py`; Docker example. | -| **DuckDB** | Integration tests in `tests/integration/test_integration_duckdb.py` (in-process, no Docker). | -| **MySQL** | Docker example with `verify.py`. | -| **ClickHouse** | Docker example with `verify.py`. | -| **SQL Server** | Docker example with `verify.py` in `examples/sqlserver/`. | -| **Snowflake** | Integration tests in `tests/integration/test_integration_snowflake.py` (skip without `~/.snowflake/connections.toml`); `examples/snowflake/` ships `README.md` + `verify.py`. No Docker (no free local image). | +Live-instance integration tests must not regress. Where Docker images exist, +the suites spin up the engine via `testcontainers`; the cloud-only engines +(BigQuery, Snowflake) skip cleanly when credentials aren't available and run +against the live service in CI when they are. + +| Engine | Live test | Docker example | +|---|---|---| +| **SQLite** | `tests/integration/test_integration.py` (in-process) | `examples/embedded/` | +| **Postgres** | `tests/integration/test_integration_postgres.py` (pytest-postgresql, spawned temp instance) | `examples/postgres/` | +| **DuckDB** | `tests/integration/test_integration_duckdb.py` (in-process) | `examples/embedded/` (DuckDB mode) | +| **MySQL** | `tests/integration/test_integration_mysql.py` (`testcontainers[mysql]`) | `examples/mysql/` | +| **ClickHouse** | `tests/integration/test_integration_clickhouse.py` (`testcontainers[clickhouse]`) | `examples/clickhouse/` | +| **SQL Server** | `tests/integration/test_integration_sqlserver.py` (`testcontainers`, `msodbcsql18` + `unixodbc-dev` on the runner) | `examples/sqlserver/` | +| **Snowflake** | `tests/integration/test_integration_snowflake.py` (skips without `~/.snowflake/connections.toml`; profile name overridable via `$SLAYER_SNOWFLAKE_CONNECTION`) | `examples/snowflake/` (no Docker) | +| **BigQuery** | `examples/bigquery/verify.py` driven by CI against `bigquery-public-data.thelook_ecommerce` (gated on `GCP_PROJECT_ID` / `GCP_SA_KEY_B64` repo secrets) | `examples/bigquery/` (no Docker — managed service) | + +BigQuery does not yet have a pytest-style integration suite; its CI coverage +runs the example's `verify.py` directly via `.github/workflows/ci.yml`. That +exercises auto-ingestion, basic projection, joins, time-grain dimensions, and +the cardinality / sum-of-grouped-equals-total invariants — enough to catch +emitted-SQL regressions, but the verify-script tier is shallower than the +testcontainers suites. ## Tier 2 — code-covered Unit tests for SQL generation; no live-instance verification. -BigQuery, Redshift, Trino/Presto, Databricks/Spark, Oracle. +Redshift, Trino/Presto (Athena uses the Presto dialect), Databricks/Spark, +Oracle. ## Aggregation support @@ -40,6 +52,8 @@ because no standard syntax works everywhere: | ClickHouse | yes | yes | yes | yes | Native `median(x)`, parametric `quantile(p)(x)`, native `stddev_*`/`var_*`/`corr`/`covar*` (camelCase variants emitted by sqlglot for `var_samp`). | | Snowflake | yes | yes | yes | yes | Native `MEDIAN`, `PERCENTILE_CONT(p) WITHIN GROUP`, `STDDEV_*`/`VAR_*`/`CORR`/`COVAR_*`. `LOG10` native; no native `LOG2` (falls through to `LOG(2, x)`). | | MySQL | **no** | **no** | yes | **no** | No native `MEDIAN`/`PERCENTILE_CONT`/`CORR`/`COVAR_*` and no Python-UDF mechanism — SLayer raises `NotImplementedError` for those. `STDDEV_SAMP`/`STDDEV_POP`/`VAR_SAMP`/`VAR_POP` are native on MySQL. Use MariaDB or compute the unsupported aggregations client-side. | +| SQL Server (T-SQL) | **no** | **no** | yes | yes (decomposed) | `MEDIAN` doesn't exist and T-SQL's `PERCENTILE_CONT` is window-only (no `WITHIN GROUP` aggregate form) — SLayer raises `NotImplementedError`. Native `STDEV`/`STDEVP`/`VAR`/`VARP` (slayer renames the canonical `STDDEV_*`/`VAR_*` names at emit time). `CORR`/`COVAR_*` use the same variance-decomposition formula as MySQL (`cov(x,y) = (var(x+y) − var(x) − var(y)) / 2`, `corr = cov / (stddev(x) · stddev(y))`). | +| BigQuery | **no** | **no** | yes | yes | BigQuery has no `MEDIAN` aggregate, and its `PERCENTILE_CONT` is analytic-only (no `WITHIN GROUP` syntax) — the base class emit `PERCENTILE_CONT(p) WITHIN GROUP (ORDER BY x)` fails at runtime. If you need percentile on BigQuery, define a custom `Aggregation` using `APPROX_QUANTILES(x, 100)[OFFSET(N)]`. Native `STDDEV_SAMP`/`STDDEV_POP`/`VAR_SAMP`/`VAR_POP`/`CORR`/`COVAR_SAMP`/`COVAR_POP` (sqlglot may emit `VARIANCE` for `var_samp`). | ### SQLite caveats @@ -105,6 +119,37 @@ If you need percentiles on MySQL, the recommended options are: - Define a custom `Aggregation` on the model with whatever `GROUP_CONCAT`- based or windowed expression suits your data shape and group sizes. +### SQL Server (T-SQL) caveats + +T-SQL has `STDEV`/`STDEVP`/`VAR`/`VARP` (not `STDDEV_SAMP`/`STDDEV_POP`/ +`VAR_SAMP`/`VAR_POP`); sqlglot's tsql transpiler emits incorrect names like +`VAR_SAMP` and `VARIANCE_POP`, so the T-SQL dialect overrides the canonical +spellings via `Anonymous` rewrites in `slayer/sql/dialects/tsql.py`. + +`CORR`/`COVAR_SAMP`/`COVAR_POP` are derived from variance: +`cov(x, y) = (var(x + y) − var(x) − var(y)) / 2`, +`corr = cov / (stddev(x) · stddev(y))`. The decomposition is shared with +MySQL via `_build_covar_decomposition` in `slayer/sql/dialects/base.py`. + +`MEDIAN` doesn't exist, and `PERCENTILE_CONT` in T-SQL is a window function +only — there is no `WITHIN GROUP` aggregate form. SLayer raises +`NotImplementedError` for both at SQL generation time. Use the windowed form +as a custom `Aggregation` if you need it, or compute client-side. + +Other T-SQL specifics surfaced by the dialect: + +- `DATETRUNC(unit, col)` for time-grain dimensions (SQL Server 2022+ — + earlier versions don't have `DATETRUNC` and aren't supported). +- `DATETRUNC(iso_week, col)` for Monday-aligned week truncation — + `@@DATEFIRST`-independent so the bucketing is deterministic. +- `DATEADD(unit, n, col)` for time-shift arithmetic — T-SQL has no + `INTERVAL` literal. +- Bracketed `[ident]` quoting — `.` SLayer aliases get + mangled to `___` at emit and decoded back on result-row + keys (mirror of the BigQuery `___` mangling; see DEV-1571). +- Native `LOG10`, no native `LOG2` (`log2(x)` falls through to the + canonical 2-arg `LOG(2, x)` form). + ### Snowflake caveats Snowflake is a fully managed cloud warehouse — no Docker, no local instance. @@ -130,6 +175,40 @@ Snowflake](configuration/datasources.md#snowflake) for connection setup. canonical 2-arg `LOG(2, x)` form. `LOG10` and the rest of the math / statistical functions are native. +### BigQuery caveats + +BigQuery is a fully managed cloud warehouse — no Docker, no local instance. +CI runs the example's `verify.py` against `bigquery-public-data.thelook_ecommerce`, +gated on `GCP_PROJECT_ID` and `GCP_SA_KEY_B64` repo secrets (forks without +them skip cleanly). Auth via Google Application Default Credentials +(`$GOOGLE_APPLICATION_CREDENTIALS` pointing at a service-account JSON key, +plus `$GCP_PROJECT_ID` for billing). The `bigquery://` driver requires the +`sqlalchemy-bigquery` extra. + +- **No FK introspection.** BigQuery exposes no foreign-key metadata via + `INFORMATION_SCHEMA`, so auto-ingestion cannot discover joins. Hand-declare + `ModelJoin`s on the model. +- **Dotted alias mangling.** BigQuery rejects column names containing `.` + (output schema names must match `[A-Za-z_][A-Za-z0-9_]*`), so SLayer + rewrites `.` aliases (`orders._count`, + `orders.products.category`) to `___` at emit time and + reverses the mapping on result rows. The triple-underscore separator is + distinct from `__` (used by `_query_as_model` for cross-model leaf + flattening), so the two encodings never collide. In `Column.sql`, + fully-qualified table paths must be backticked per-segment + (`` `project`.`dataset`.`table` ``) — a single backticked dotted path of + word-only segments (`` `my_dataset.my_table` ``) would false-positive + mangle. +- **No `MEDIAN` aggregate; `PERCENTILE_CONT` is analytic-only.** Both + raise at SQL generation time (sqlglot doesn't transpile the base class's + `PERCENTILE_CONT(p) WITHIN GROUP (ORDER BY x)` to BigQuery's analytic + form). Use a custom `Aggregation` with `APPROX_QUANTILES(x, 100)[OFFSET(N)]` + when you need it. +- **No native EXPLAIN.** BigQuery has no SQL-level `EXPLAIN`. The + `BigqueryDialect.explain_prefix` is `None`, so `engine.execute(..., + explain=True)` returns the dry-run SQL unchanged rather than an execution + plan. + ## Adding a new dialect 1. Add the mapping to `slayer/engine/query_engine.py:_dialect_for_type()`. diff --git a/docs/interfaces/pg-facade.md b/docs/interfaces/pg-facade.md index e7f00adf..fb277c6b 100644 --- a/docs/interfaces/pg-facade.md +++ b/docs/interfaces/pg-facade.md @@ -142,6 +142,122 @@ Postgres-specific predicates that aren't valid SLayer DSL (`ILIKE`, `::cast`, re `ANY`/`ALL`) parse but are rejected at execution — use the standard comparison / `IN` / `BETWEEN` forms. +### `CAST( AS )` in projection + +A projection of the shape `CAST( AS )` (and the equivalent `col::type` +sugar) is accepted when the inner expression is a bare or qualified column reference +**and** the (source, target) pair is in the allowlist below. The engine still executes +the bare column — the cast is a pure wire-layer type rewrite. The projected column's +Postgres OID is overridden to match the casted type. + +Common BI shapes covered: `SELECT CAST(ordered_at AS TIMESTAMP) FROM orders` (DATE +column promoted for a TIMESTAMP-aware client), `SELECT CAST(amount AS TEXT) AS s +FROM orders` (stringification), `SELECT CAST(customers.region AS TEXT) FROM orders` +(joined column). + +Out of scope: `CAST` around aggregates (`CAST(SUM(amount) AS DOUBLE)`), `TRY_CAST`, +and `CAST` around expressions that aren't a bare column (`CAST(SUBSTRING(...) AS T)`). +`CAST` wrapping a `DATE_TRUNC(...)` continues to route through the time-grain unwrap. + +`CAST(...)` in `ORDER BY` and `GROUP BY` has two layers of admission: + +1. **Unaliased canonical-form** (e.g. `ORDER BY CAST(c AS T)` repeating the + projection's CAST verbatim): **never admitted.** The translator raises + `ORDER BY column '...' is not in the projection list` / the GROUP BY + strict-on-extras error. Workaround: alias and reference the alias. +2. **Aliased reference** (`SELECT CAST(c AS T) AS x ... ORDER BY x` / + `... GROUP BY x`): admitted **only** when the `(source, target)` pair + preserves sort/group semantics under the bare-column engine projection. + +Pairs that **fail** the aliased-reference admission and raise +`ORDER BY on CAST projection '...' with lossy pair X→T is unsupported` +(symmetric message for GROUP BY): + +| Path | Lossy pairs | +|----------|--------------------------------------------------------------------------| +| ORDER BY | `X → TEXT` for every `X` (lex sort ≠ engine's natural sort) | +| GROUP BY | `TIMESTAMP → DATE` (many-to-one rollup); `INT → DOUBLE` (IEEE 754 collapse beyond ±2^53) | + +Every other admitted pair — identity (`X → X`), `DATE → TIMESTAMP`, +`TIMESTAMP → DATE` for ORDER BY, `INT → DOUBLE` — preserves the casted +semantics under the bare-column engine projection, so the alias path stays +open. + +```sql +-- Always rejected (canonical form): +SELECT CAST(delivered_at AS TIMESTAMP) FROM orders +ORDER BY CAST(delivered_at AS TIMESTAMP); + +-- Aliased reference, safe pair → works: +SELECT CAST(delivered_at AS TIMESTAMP) AS dt FROM orders +ORDER BY dt; + +-- Aliased reference, lossy pair → rejected: +SELECT CAST(id AS TEXT) AS s FROM orders ORDER BY s; +SELECT CAST(ordered_at AS DATE) AS d, COUNT(*) FROM orders GROUP BY d; +``` + +The wire-type override still applies in the safe-pair case — `dt` is +wire-typed `TIMESTAMP` even though the engine sorts the underlying `DATE`. +A future ticket can lift the remaining restrictions by pushing the CAST +into the engine SQL. + +Admitted (source, target) coercions: + +| Source type | Admitted target types | +|---------------|------------------------------| +| `DATE` | `DATE`, `TIMESTAMP`, `TEXT` | +| `TIMESTAMP` | `TIMESTAMP`, `DATE`, `TEXT` | +| `INT` | `INT`, `DOUBLE`, `TEXT` | +| `DOUBLE` | `DOUBLE`, `TEXT` | +| `BOOLEAN` | `BOOLEAN`, `TEXT` | +| `TEXT` | `TEXT` | +| *(unknown)* | `TEXT` | + +Pairs outside the allowlist (e.g. `CAST(name AS INT)`, `CAST(amount AS BOOLEAN)`) +raise `Unsupported CAST: cannot project column as (...). Admitted +coercions: see docs/interfaces/pg-facade.md.` Unsupported target types (`UUID`, +`JSON`, `ARRAY`, `STRUCT`, …) raise the standard `Unsupported projection +expression` error. + +#### CAST coarse-OID mapping + +CAST is a **coarse wire-OID hint**, not a precision-preserving conversion. +The SLayer engine projects the bare column unchanged; the pg-facade encoder +is OID-driven, so the wire bytes always match the OID we advertise. Some +PostgreSQL types the user can write in a CAST don't have a one-to-one +SLayer equivalent — those collapse onto the nearest broader SLayer type: + +| User wrote in `CAST(... AS X)` | SLayer maps to | Wire OID advertised | +|---|---|---| +| `INTEGER` / `INT` (pre-existing) | `DataType.INT` | 20 (`int8`) — not 23 (`int4`) | +| `SMALLINT` | `DataType.INT` | 20 (`int8`) — not 21 (`int2`) | +| `TINYINT` / `MEDIUMINT` (non-Postgres widths) | `DataType.INT` | 20 (`int8`) | +| `BIGINT` | `DataType.INT` | 20 (`int8`) ✓ exact match | +| `DECIMAL` / `NUMERIC` | `DataType.DOUBLE` | 701 (`float8`) — not 1700 (`numeric`) | +| `FLOAT` / `REAL` / `DOUBLE` | `DataType.DOUBLE` | 701 (`float8`) ✓ | +| `TIMESTAMPTZ` / `TIMESTAMP WITH TIME ZONE` | `DataType.TIMESTAMP` | 1114 (`timestamp`, no TZ) — not 1184 (`timestamptz`) | +| `TIMESTAMP` / `DATETIME` | `DataType.TIMESTAMP` | 1114 (`timestamp`) ✓ | + +What this means in practice: + +- The wire bytes the client receives are always consistent with the OID we + advertise (the encoder picks the binary/text form from the OID). There is + no value corruption. +- The OID is potentially broader than what the user typed. A client that + asked for `NUMERIC` and got `float8` sees a float on the wire and decodes + it correctly as a float — but loses the "exact precision" expectation. + A client that asked for `TIMESTAMPTZ` sees naive `timestamp` bytes — and + loses TZ-aware decoding semantics. +- Callers needing exact `NUMERIC` precision, narrow integer wire widths, or + TZ-aware timestamps must compute upstream (or wait for SLayer to model + those types natively). + +`DOUBLE → INT` is intentionally excluded: Python's `int()` truncates toward zero +while Postgres rounds half-to-even, so silently admitting the pair would diverge from +`psql` semantics. Pre-aggregate or pre-round on your side when an integer-typed result +is required. + ## Introspection * `INFORMATION_SCHEMA.METRICS` / `DIMENSIONS` / `SCHEMATA` / `TABLES` / `COLUMNS`. diff --git a/slayer/facade/catalog_sql.py b/slayer/facade/catalog_sql.py index 38cea90e..f338e36c 100644 --- a/slayer/facade/catalog_sql.py +++ b/slayer/facade/catalog_sql.py @@ -1222,6 +1222,7 @@ def _substitute_context_functions(self, node: exp.Expression) -> exp.Expression: # Try each substitution branch in order; first hit wins. substituted = ( self._substitute_qualified_context_call(node) + or self._substitute_qualified_context_column(node) or self._substitute_dedicated_func(node) or self._substitute_bareword_column(node) or self._substitute_anonymous_function(node) @@ -1253,6 +1254,31 @@ def _substitute_qualified_context_call( or self._substitute_anonymous_function(rhs) ) + def _substitute_qualified_context_column( + self, node: exp.Expression, + ) -> Optional[exp.Expression]: + """Replace ``pg_catalog.`` where sqlglot parses the + whole thing as ``Column(this=, table='pg_catalog')`` — the + no-parens shape (``pg_catalog.current_user``, + ``pg_catalog.current_catalog``). The Dot-shaped variant + (``pg_catalog.current_database()``) is handled by + ``_substitute_qualified_context_call``. + """ + if not isinstance(node, exp.Column): + return None + table = node.args.get("table") + if table is None: + return None + table_name = ( + str(table.this) if hasattr(table, "this") else str(table) + ).lower() + if table_name != "pg_catalog": + return None + ident = node.this + if not isinstance(ident, exp.Identifier): + return None + return self._literal_for_context_name(str(ident.this).lower()) + def _substitute_dedicated_func(self, node: exp.Expression) -> Optional[exp.Expression]: """Dedicated sqlglot Func subclasses (typed nodes for niladic ctx fns).""" if isinstance(node, (exp.CurrentDatabase, getattr(exp, "CurrentCatalog", exp.CurrentDatabase))): diff --git a/slayer/facade/translator.py b/slayer/facade/translator.py index 3555f6dc..1b42deeb 100644 --- a/slayer/facade/translator.py +++ b/slayer/facade/translator.py @@ -281,6 +281,118 @@ def __init__(self, message: str, *, status: str = "INVALID_ARGUMENT") -> None: exp.NEQ: "<>", } +# DEV-1566: sqlglot DataType.Type → SLayer DataType for CAST( AS ) +# projection support. Aliases sqlglot canonicalises at parse time (STRING→TEXT, +# INTEGER→INT, NUMERIC→DECIMAL, BOOL→BOOLEAN, FLOAT→DOUBLE) need no entry; +# REAL parses to exp.DataType.Type.FLOAT (not normalised), so DOUBLE coverage +# routes through here. Parameterised forms (VARCHAR(255), DECIMAL(10,2)) +# collapse onto the same base member at parse time — precision is dropped at +# the SLayer boundary because SLayer wire types don't carry it. +# +# CAST is a COARSE wire-OID hint, not a precision-preserving conversion. The +# SLayer engine projects the bare column unchanged; the pg-facade encoder is +# OID-driven, so the wire bytes always match the OID we advertise. The only +# "mismatch" exposed by the coarsenings below is that the advertised OID is +# broader than what the user typed: +# +# * DECIMAL/NUMERIC → DataType.DOUBLE (OID 701 `float8`, not 1700 `numeric`). +# * INTEGER / SMALLINT / TINYINT / MEDIUMINT → DataType.INT +# (OID 20 `int8`, not 23/21). +# * TIMESTAMPTZ / TIMESTAMP WITH TIME ZONE / TIMESTAMPLTZ → DataType.TIMESTAMP +# (OID 1114 `timestamp`, no TZ semantics). +# +# Callers needing exact NUMERIC precision, narrow integer wire widths, or +# TZ-aware timestamps must compute upstream. See pg-facade.md +# §"CAST coarse-OID mapping" for the user-facing table. +_SQLGLOT_TYPE_TO_DATATYPE: Dict[exp.DataType.Type, DataType] = { + exp.DataType.Type.TEXT: DataType.TEXT, + exp.DataType.Type.VARCHAR: DataType.TEXT, + exp.DataType.Type.CHAR: DataType.TEXT, + exp.DataType.Type.NCHAR: DataType.TEXT, + exp.DataType.Type.NVARCHAR: DataType.TEXT, + exp.DataType.Type.INT: DataType.INT, + exp.DataType.Type.BIGINT: DataType.INT, + exp.DataType.Type.SMALLINT: DataType.INT, + exp.DataType.Type.TINYINT: DataType.INT, + exp.DataType.Type.MEDIUMINT: DataType.INT, + exp.DataType.Type.DOUBLE: DataType.DOUBLE, + exp.DataType.Type.FLOAT: DataType.DOUBLE, + exp.DataType.Type.DECIMAL: DataType.DOUBLE, + exp.DataType.Type.BOOLEAN: DataType.BOOLEAN, + exp.DataType.Type.DATE: DataType.DATE, + exp.DataType.Type.TIMESTAMP: DataType.TIMESTAMP, + exp.DataType.Type.DATETIME: DataType.TIMESTAMP, + exp.DataType.Type.TIMESTAMPTZ: DataType.TIMESTAMP, + exp.DataType.Type.TIMESTAMPLTZ: DataType.TIMESTAMP, +} + +# DEV-1566: per-pair allowlist of CAST coercions admitted in projection. +# Each pair is one the wire encoders in slayer/pg_facade/types.py handle +# losslessly. The unknown-source (None) case admits ONLY TEXT — for every +# other target we'd push the failure into value_to_text / value_to_binary at +# response time, surfacing as an opaque connection error instead of a clean +# translation error. Identity (X→X) is always admitted; computed implicitly. +_CAST_ADMITTED_TARGETS: Dict[DataType, frozenset] = { + DataType.DATE: frozenset({DataType.DATE, DataType.TIMESTAMP, DataType.TEXT}), + DataType.TIMESTAMP: frozenset({DataType.TIMESTAMP, DataType.DATE, DataType.TEXT}), + DataType.INT: frozenset({DataType.INT, DataType.DOUBLE, DataType.TEXT}), + # DOUBLE → INT intentionally dropped (Python int(x) truncates toward zero; + # Postgres rounds half-to-even). Round-2 Codex review. + DataType.DOUBLE: frozenset({DataType.DOUBLE, DataType.TEXT}), + DataType.BOOLEAN: frozenset({DataType.BOOLEAN, DataType.TEXT}), + DataType.TEXT: frozenset({DataType.TEXT}), +} + +# DEV-1566 — Codex round 3: pairs whose ORDER BY / GROUP BY semantics under +# Postgres differ from what the bare-column engine projection would produce. +# A query that references a CAST-projected item (via alias or canonical form) +# in ORDER BY / GROUP BY is rejected when the pair is in the corresponding +# set — workaround: order/group by the bare column, or wait for a follow-up +# ticket that pushes CAST into the engine SQL. +# +# ORDER BY lossy: every X→TEXT pair (lex sort ≠ engine's natural sort). +# Identity X→X, INT→DOUBLE, DATE↔TIMESTAMP all preserve relative order so +# they stay admitted. +_LOSSY_ORDER_BY_CAST_PAIRS: frozenset = frozenset({ + (DataType.INT, DataType.TEXT), + (DataType.DOUBLE, DataType.TEXT), + (DataType.BOOLEAN, DataType.TEXT), + (DataType.DATE, DataType.TEXT), + (DataType.TIMESTAMP, DataType.TEXT), +}) +# GROUP BY lossy: many-to-one casts where the engine-column grouping +# returns MORE groups than the casted column would. +# - TIMESTAMP → DATE: multiple timestamps per date. +# - INT → DOUBLE: int64 has precise range ±2^53 in IEEE 754 float64; +# larger bigints lose precision so distinct ints can collapse to the +# same double under Postgres's GROUP BY semantics, but the facade +# groups by the bare int and over-reports groups. +# Every other admitted pair is a 1:1 / identity mapping within the +# supported value range. +_LOSSY_GROUP_BY_CAST_PAIRS: frozenset = frozenset({ + (DataType.TIMESTAMP, DataType.DATE), + (DataType.INT, DataType.DOUBLE), +}) + + +def _sqlglot_type_to_datatype(node: exp.DataType) -> Optional[DataType]: + """Map a sqlglot ``DataType`` node to a SLayer ``DataType``. + + Returns ``None`` when the type member isn't in the mapping (UUID, JSON, + ARRAY, STRUCT, …) — caller falls through to the existing + 'Unsupported projection expression' error. + """ + return _SQLGLOT_TYPE_TO_DATATYPE.get(node.this) + + +def _is_admitted_cast(source: Optional[DataType], target: DataType) -> bool: + """Strict per-pair admission. ``None`` source admits only ``TEXT``.""" + if source is None: + return target is DataType.TEXT + if source == target: + return True + return target in _CAST_ADMITTED_TARGETS.get(source, frozenset()) + def _column_to_dotted( col: exp.Column, @@ -867,6 +979,10 @@ class _ProjectionItem(BaseModel): dimension: Optional[FacadeDimension] = None time_grain: Optional[TimeGranularity] = None time_grain_underlying: Optional[FacadeDimension] = None + # DEV-1566: target type when the projection was CAST( AS ). + # Overrides projection_types[i] at _record_metric / _record_dimension time, + # leaving engine_alias and the SlayerQuery measure/dimension untouched. + cast_target: Optional[DataType] = None def _resolve_time_grain_projection( @@ -947,6 +1063,63 @@ def _resolve_column_projection( ) +def _detect_column_cast( + body: exp.Expression, +) -> Optional[Tuple[exp.Column, DataType]]: + """If ``body`` is ``CAST( AS )``, + return ``(inner_col, target_data_type)``. Otherwise ``None``. + + The detector REQUIRES ``body.this`` to be an ``exp.Column`` — aggregates, + hygiene wrappers, function calls, and arithmetic stay outside scope. + ``exp.TryCast`` is rejected (Postgres has no native TRY_CAST equivalent). + """ + if not isinstance(body, exp.Cast) or isinstance(body, exp.TryCast): + return None + inner = body.this + if not isinstance(inner, exp.Column): + return None + target = _sqlglot_type_to_datatype(body.to) + if target is None: + return None + return inner, target + + +def _resolve_column_cast_projection( + *, + body: exp.Column, + cast_target: DataType, + alias_name: Optional[str], + table: FacadeTable, + metrics_by_name: Dict[str, FacadeMetric], + dims_by_name: Dict[str, FacadeDimension], + strip_prefix: Optional[Tuple[str, str]] = None, + alias_map: Optional[Dict[str, str]] = None, +) -> _ProjectionItem: + """Resolve ``CAST( AS )`` to a projection item that runs + the bare column through the engine and overrides the wire OID via + ``cast_target``. Enforces the strict per-pair coercion allowlist. + + DEV-1565: ``alias_map`` carries the LEFT-JOIN subquery's + ``"Stores"."name"`` → catalog ``stores.name`` rewrite so Metabase-style + joined CAST refs resolve through the same dimension-lookup path as + non-casted joined refs. + """ + inner = _resolve_column_projection( + body=body, alias_name=alias_name, table=table, + metrics_by_name=metrics_by_name, dims_by_name=dims_by_name, + strip_prefix=strip_prefix, alias_map=alias_map, + ) + source = _item_cast_source_type(inner) + if not _is_admitted_cast(source=source, target=cast_target): + offending = exp.Cast(this=body, to=exp.DataType.build(cast_target.value)).sql() + raise TranslationError( + f"Unsupported CAST: cannot project {source!s} column as " + f"{cast_target!s} ({offending!r}). Admitted coercions: see " + f"docs/interfaces/pg-facade.md." + ) + return inner.model_copy(update={"cast_target": cast_target}) + + # DEV-1558 B5: hygiene-scalar wrappers Metabase uses for fingerprint queries. # Each maps a sqlglot AST class to a human-readable name for the WARNING log. _HYGIENE_FUNC_CLASSES: Dict[type, str] = { @@ -1050,7 +1223,7 @@ def _build_projection_lookups( return metrics_by_name, metrics_by_formula, dims_by_name -def _resolve_projection( +def _resolve_projection( # NOSONAR(S3776) — flat dispatch over projection-expression shapes; one branch per shape by design expressions: Sequence[exp.Expression], table: FacadeTable, *, schema_name: Optional[str] = None, @@ -1058,6 +1231,7 @@ def _resolve_projection( extra_dims_by_name: Optional[Dict[str, FacadeDimension]] = None, extra_metrics_by_name: Optional[Dict[str, FacadeMetric]] = None, extra_metrics_by_formula: Optional[Dict[str, FacadeMetric]] = None, + allow_column_cast: bool = True, ) -> List[_ProjectionItem]: """Walk the projection list, classifying each item against the table. @@ -1065,6 +1239,13 @@ def _resolve_projection( ``extra_dims_by_name`` / ``extra_metrics_by_*`` carry the materialised ``.`` lookups so the joined dotted refs resolve without needing a configured catalog join. + + DEV-1566: ``allow_column_cast`` gates the ``CAST( AS )`` + projection branch. The pg-facade passes ``True`` (default) — its wire + encoders coerce values to the declared OID. The Flight facade passes + ``False`` because its ``pa.Table.from_pylist`` schema-binding rejects + values whose Python type doesn't match the declared Arrow type and the + facade has no value-coercion pass. """ metrics_by_name, metrics_by_formula, dims_by_name = _build_projection_lookups( table, @@ -1108,6 +1289,25 @@ def _resolve_projection( )) continue + # DEV-1566: CAST( AS ) projection. Runs AFTER the + # time-grain unwrap (preserves Metabase's CAST(DATE_TRUNC(...) AS DATE)) + # and AFTER the aggregate detector (CAST(SUM(...) AS T) is out of + # scope — the detector requires the inner to be a bare Column). + # Gated by allow_column_cast so the Flight facade falls through to + # the "Unsupported projection expression" error instead of producing + # a wire schema its row materialiser can't fulfil. + if allow_column_cast: + cast_match = _detect_column_cast(body) + if cast_match is not None: + inner_col, cast_target = cast_match + out.append(_resolve_column_cast_projection( + body=inner_col, cast_target=cast_target, + alias_name=alias_name, table=table, + metrics_by_name=metrics_by_name, dims_by_name=dims_by_name, + strip_prefix=strip_prefix, alias_map=alias_map, + )) + continue + if isinstance(body, exp.Column): out.append(_resolve_column_projection( body=body, alias_name=alias_name, table=table, @@ -1483,6 +1683,74 @@ def _flip_comparator(op_sql: str) -> str: # --- ORDER BY / GROUP BY ----------------------------------------------------- +def _item_cast_source_type(item: _ProjectionItem) -> Optional[DataType]: + """Return the underlying source DataType of a CAST-projected item, or + None when the source type is unknown (custom metric without a declared + ``data_type``). Caller MUST gate on ``item.cast_target is not None`` + before calling this — the helper's None return is intentionally + overloaded with the unknown-source case so the lossy-pair check can + treat unknown-source casts as lossy by default.""" + if item.metric is not None: + return item.metric.data_type + if item.dimension is not None: + return item.dimension.data_type + return None + + +def _cast_pair_is_lossy( + *, source: Optional[DataType], target: DataType, + lossy_pairs: frozenset, +) -> bool: + """A cast is lossy when either: (a) the source is known and (source, + target) is in ``lossy_pairs``, or (b) the source is unknown — the only + admitted unknown-source target is TEXT (see ``_is_admitted_cast``) + which is always lossy for ORDER BY (lex vs engine's natural sort). + For GROUP BY the unknown-source path is admitted because the engine's + bare-value grouping preserves whatever per-value distinction TEXT + would expose.""" + if source is None: + return target is DataType.TEXT + return (source, target) in lossy_pairs + + +def _lossy_cast_error_message( + *, kind: str, name: str, source: Optional[DataType], target: DataType, +) -> str: + """One-line user-facing message for the lossy-CAST-pair rejection. The + ``kind`` is ``ORDER BY`` or ``GROUP BY``; ``name`` is the user-visible + identifier the message points at (typically the alias). ``source`` is + None when the underlying metric/dimension has no declared data_type + (custom metric path), which is treated as lossy by default.""" + source_str = str(source) if source is not None else "unknown" + return ( + f"{kind} on CAST projection {name!r} with lossy pair " + f"{source_str}→{target!s} is unsupported: the engine query projects " + f"the bare column, so the underlying sort/group semantics don't " + f"match the casted type. Use the bare column, or wait for engine-" + f"side CAST pushdown." + ) + + +def _reject_lossy_cast_or_pass( + *, kind: str, name: str, item: _ProjectionItem, lossy_pairs: frozenset, +) -> None: + """If ``item`` is a CAST projection AND its (source, target) pair is + lossy under bare-column ORDER BY / GROUP BY semantics, raise. Otherwise + no-op. Unknown-source casts (custom metrics without a declared + data_type) are admitted as None→TEXT only; ``_cast_pair_is_lossy`` + treats that case as lossy for ORDER BY by default.""" + if item.cast_target is None: + return + source = _item_cast_source_type(item) + if not _cast_pair_is_lossy( + source=source, target=item.cast_target, lossy_pairs=lossy_pairs, + ): + return + raise TranslationError(_lossy_cast_error_message( + kind=kind, name=name, source=source, target=item.cast_target, + )) + + def _resolve_order_by_item( body: exp.Expression, item_by_projected_name: Dict[str, _ProjectionItem], @@ -1490,14 +1758,16 @@ def _resolve_order_by_item( *, strip_prefix: Optional[Tuple[str, str]] = None, alias_map: Optional[Dict[str, str]] = None, -) -> _ProjectionItem: +) -> Tuple[_ProjectionItem, str]: """Look up the projection item an ORDER BY term resolves to. Tries the bare-name lookup first (alias or canonical metric name from ``_order_by_name``); on miss, DEV-1568 fallback maps a literal aggregate call (``ORDER BY SUM(revenue)``) to the projection registered under a different alias (``SELECT SUM(revenue) AS "rev"``) by matching on the - aggregate's canonical ``measure_formula``. Raises ``TranslationError`` + aggregate's canonical ``measure_formula``. Returns ``(item, name)`` so + the DEV-1566 lossy-CAST check downstream has both the resolved item and + the user-visible name for error messaging. Raises ``TranslationError`` if neither path resolves. """ name = _order_by_name(body, strip_prefix=strip_prefix, alias_map=alias_map) @@ -1510,7 +1780,7 @@ def _resolve_order_by_item( raise TranslationError( f"ORDER BY column {name!r} is not in the projection list" ) - return item + return item, name def _translate_order_by( @@ -1538,10 +1808,14 @@ def _translate_order_by( if not isinstance(ord_expr, exp.Ordered): continue direction = "desc" if ord_expr.args.get("desc") else "asc" - item = _resolve_order_by_item( + item, name = _resolve_order_by_item( ord_expr.this, item_by_projected_name, metric_item_by_formula, strip_prefix=strip_prefix, alias_map=alias_map, ) + _reject_lossy_cast_or_pass( + kind="ORDER BY", name=name, item=item, + lossy_pairs=_LOSSY_ORDER_BY_CAST_PAIRS, + ) if item.metric is not None: # DEV-1568: use the projection's SELECT alias, not the catalog # ``FacadeMetric.name``. ``_record_metric`` registered the SLayer @@ -1587,11 +1861,19 @@ def _order_by_name( def _validate_group_by( # NOSONAR(S3776) — single GROUP BY validation pass; extraction adds indirection group: Optional[exp.Group], derived: List[str], + item_by_projected_name: Dict[str, _ProjectionItem], *, strip_prefix: Optional[Tuple[str, str]] = None, alias_map: Optional[Dict[str, str]] = None, ) -> None: - """Apply the strict-on-extras / lenient-on-omissions policy (§6.1).""" + """Apply the strict-on-extras / lenient-on-omissions policy (§6.1). + + ``item_by_projected_name`` is consulted only to look up the projection + item behind a user-facing name so the DEV-1566 lossy-CAST-pair rejection + can fire. Membership validation still runs against the ``derived`` list + (which carries time-grain canonical forms in addition to the projected + names) to preserve the existing strict-on-extras behaviour. + """ if group is None: return derived_set = set(derived) @@ -1606,8 +1888,8 @@ def _validate_group_by( # NOSONAR(S3776) — single GROUP BY validation pass; e user_items.append(_alias_for_time_grain( grain, col, strip_prefix=strip_prefix, alias_map=alias_map, )) - else: - user_items.append(g.sql()) + continue + user_items.append(g.sql()) for u in user_items: if u not in derived_set: # GROUP BY positional refs (GROUP BY 1) and aggregate terms are @@ -1619,6 +1901,18 @@ def _validate_group_by( # NOSONAR(S3776) — single GROUP BY validation pass; e f"GROUP BY item {u!r} is not in the projection's derived " f"dimension set ({sorted(derived_set)})" ) + # DEV-1566 — Codex round 3: reject GROUP BY on CAST projections + # whose (source, target) pair is lossy under bare-column grouping. + # Unknown-source casts (only admitted as None→TEXT) are admitted + # here — the engine's per-value grouping already collapses to the + # same set of TEXT representations. + item = item_by_projected_name.get(u) + if item is None: + continue + _reject_lossy_cast_or_pass( + kind="GROUP BY", name=u, item=item, + lossy_pairs=_LOSSY_GROUP_BY_CAST_PAIRS, + ) def _is_ignorable_group_item(item: str) -> bool: @@ -1875,6 +2169,7 @@ def translate( catalog_sql_executor: ( "Optional[CatalogSqlExecutorProtocol | Callable[[], CatalogSqlExecutorProtocol]]" ) = None, + allow_column_cast: bool = True, ) -> TranslatorResult: """Translate a SQL string into a TranslatorResult. @@ -1946,7 +2241,9 @@ def translate( ) # Step 5 / 6 — SLayer-table translation. - return _translate_slayer_select(parsed, catalog) + return _translate_slayer_select( + parsed, catalog, allow_column_cast=allow_column_cast, + ) # Lightweight Protocol so the translator doesn't pull catalog_sql at import @@ -1984,7 +2281,8 @@ def _record_metric( }) engine_alias = f"{table.name}.{item.projected_name}" plan.column_name_mapping.append((engine_alias, item.projected_name)) - plan.projection_types.append(item.metric.data_type) + # DEV-1566: CAST( AS T) overrides the declared metric type. + plan.projection_types.append(item.cast_target or item.metric.data_type) def _record_time_grain( @@ -2028,7 +2326,8 @@ def _record_dimension( plan.derived_dims.append(item.dimension.dimension_ref) engine_alias = f"{table.name}.{item.dimension.dimension_ref}" plan.column_name_mapping.append((engine_alias, item.projected_name)) - plan.projection_types.append(item.dimension.data_type) + # DEV-1566: CAST( AS T) overrides the declared dim type. + plan.projection_types.append(item.cast_target or item.dimension.data_type) def _build_projection_plan( @@ -2048,6 +2347,36 @@ def _build_projection_plan( return plan +def _index_items_by_canonical_form( + items: Sequence[_ProjectionItem], +) -> Dict[str, _ProjectionItem]: + """Index projection items by alias plus canonical forms (time-grain only). + + Time-grain items are also keyed by ``grain(col)`` so an unaliased + Metabase-style GROUP BY / ORDER BY (``ORDER BY CAST(DATE_TRUNC('month', + ordered_at) AS DATE)``) resolves against the aliased projection. + + Column-CAST items are intentionally NOT registered here: DEV-1566's CAST + projection is a wire-type override that leaves the engine query + projecting the bare column, so ``ORDER BY CAST(col AS T)`` would sort by + the engine column's natural type (numeric/temporal) instead of the + casted type's semantics (e.g. lex for TEXT), and + ``GROUP BY CAST(ts AS DATE)`` would group per-timestamp instead of + per-date. Both are silently wrong. We reject these queries with the + existing "ORDER BY column not in projection list" / GROUP BY + strict-on-extras error and the user aliases the CAST projection to + reference it (``SELECT CAST(c AS T) AS x ... ORDER BY x``). + """ + by_name: Dict[str, _ProjectionItem] = {item.projected_name: item for item in items} + for item in items: + if item.time_grain is not None and item.time_grain_underlying is not None: + canonical = ( + f"{item.time_grain.value}({item.time_grain_underlying.dimension_ref})" + ) + by_name.setdefault(canonical, item) + return by_name + + def _parse_int_literal(node: Optional[exp.Expression]) -> Optional[int]: """Pull an int out of ``LIMIT N`` / ``OFFSET N`` style nodes.""" if node is None or not isinstance(node.expression, exp.Literal): @@ -2482,7 +2811,15 @@ def _build_item_index(items: List[_ProjectionItem]) -> Dict[str, _ProjectionItem secondary key (time-grain canonical form for `CAST(date_trunc(...))` GROUP BY matches; dimension_ref dotted form for joined-col aliased projections — see DEV-1565). ``setdefault`` preserves the primary - projected_name entry when the secondary key collides with it.""" + projected_name entry when the secondary key collides with it. + + DEV-1566: CAST-projected items are NOT registered under their + ``dimension_ref`` secondary key — that would route ``ORDER BY `` + on a query like ``SELECT CAST(id AS TEXT) AS x ... ORDER BY id`` to the + cast item and trip the lossy-pair rejection, even though the user + explicitly named the bare column (and the engine ordering already + matches that intent). + """ out: Dict[str, _ProjectionItem] = {item.projected_name: item for item in items} for item in items: if item.time_grain is not None and item.time_grain_underlying is not None: @@ -2492,6 +2829,7 @@ def _build_item_index(items: List[_ProjectionItem]) -> Dict[str, _ProjectionItem out.setdefault(canonical, item) elif ( item.dimension is not None + and item.cast_target is None and item.dimension.dimension_ref != item.projected_name ): out.setdefault(item.dimension.dimension_ref, item) @@ -2500,6 +2838,7 @@ def _build_item_index(items: List[_ProjectionItem]) -> Dict[str, _ProjectionItem def _translate_slayer_select( parsed: exp.Select, catalog: FacadeCatalog, + *, allow_column_cast: bool = True, ) -> QueryResult: from_clause = parsed.args.get("from_") if from_clause is None: @@ -2535,18 +2874,16 @@ def _translate_slayer_select( extra_dims_by_name=overlays.extra_dims_by_name or None, extra_metrics_by_name=overlays.extra_metrics_by_name or None, extra_metrics_by_formula=overlays.extra_metrics_by_formula or None, + allow_column_cast=allow_column_cast, ) plan = _build_projection_plan(items, table) + item_by_projected_name = _index_items_by_canonical_form(items) _validate_group_by( - parsed.args.get("group"), plan.derived_dims, + parsed.args.get("group"), plan.derived_dims, item_by_projected_name, strip_prefix=strip_prefix, alias_map=overlays.alias_map, ) - # DEV-1568: WHERE/HAVING/ORDER BY all need the alias→projection-item map. - # Build once and pass to each consumer. - item_by_projected_name = {item.projected_name: item for item in items} - filters: List[str] = [] _apply_where( parsed.args.get("where"), plan.time_dim_by_name, diff --git a/slayer/flight/translator.py b/slayer/flight/translator.py index 2a4ca4c2..8ee4fb91 100644 --- a/slayer/flight/translator.py +++ b/slayer/flight/translator.py @@ -50,8 +50,18 @@ class InfoSchemaResult(TranslatorResult): def translate(sql: str, catalog: FacadeCatalog) -> TranslatorResult: """Translate ``sql`` for the Flight facade (``dialect=None``), converting - the shared ``RowBatch`` results into Arrow-shaped ones.""" - result = _shared_translate(sql, catalog, dialect=None) + the shared ``RowBatch`` results into Arrow-shaped ones. + + ``allow_column_cast=False`` — DEV-1566 ``CAST( AS )`` projections + are gated to the pg-facade. The Flight handler materialises rows via + ``pa.Table.from_pylist`` against a catalog-typed schema, which rejects + values whose Python type doesn't match the declared Arrow type + (e.g. ``datetime.date`` against ``pa.timestamp``); Flight has no + value-coercion pass to bridge the gap. Rejecting at translate time + surfaces a clean ``TranslationError`` instead of an opaque + ``ArrowTypeError`` at materialisation. + """ + result = _shared_translate(sql, catalog, dialect=None, allow_column_cast=False) if isinstance(result, _SharedProbeResult): return ProbeResult(table=row_batch_to_arrow(result.batch)) if isinstance(result, _SharedInfoSchemaResult): diff --git a/slayer/pg_facade/types.py b/slayer/pg_facade/types.py index c020161a..301b38f3 100644 --- a/slayer/pg_facade/types.py +++ b/slayer/pg_facade/types.py @@ -77,6 +77,21 @@ def value_to_text( # NOSONAR(S3776) — flat per-Python-type dispatch # narrowed to a date before the per-Python-type dispatch below. if oid == OID_DATE and isinstance(value, _dt.datetime): value = value.date() + # DEV-1566: symmetric widening for CAST( AS TIMESTAMP). The + # bare `dt.date` branch below emits `YYYY-MM-DD`, which pgjdbc / + # psycopg2 mis-parse for an OID_TIMESTAMP-declared column. Widen to + # `dt.datetime` so the timestamp formatter emits `YYYY-MM-DD HH:MM:SS`. + if ( + oid == OID_TIMESTAMP + and isinstance(value, _dt.date) + and not isinstance(value, _dt.datetime) + ): + value = _dt.datetime(value.year, value.month, value.day) + # DEV-1566: CAST( AS TEXT) must emit `true`/`false` (Postgres + # text shape), not the BOOL wire shape `t`/`f` the next branch would + # produce. Check OID_TEXT BEFORE the bool branch. + if oid == OID_TEXT and isinstance(value, bool): + return b"true" if value else b"false" if isinstance(value, bool): return b"t" if value else b"f" if isinstance(value, float): diff --git a/tests/dialects/test_byte_equivalence.py b/tests/dialects/test_byte_equivalence.py index 409cae4d..df3ea411 100644 --- a/tests/dialects/test_byte_equivalence.py +++ b/tests/dialects/test_byte_equivalence.py @@ -173,13 +173,15 @@ async def test_byte_equivalence_basic_query( ' DATE_TRUNC(\'MONTH\', orders.created_at)', ), ( + # ClickHouse accepts both ``dateTrunc`` and ``DATE_TRUNC``; sqlglot + # now emits the canonical lowercase ClickHouse form. "clickhouse", 'SELECT\n' - ' DATE_TRUNC(\'MONTH\', orders.created_at) AS "orders.created_at",\n' + ' dateTrunc(\'MONTH\', orders.created_at) AS "orders.created_at",\n' ' SUM(orders.amount) AS "orders.revenue_sum"\n' 'FROM public.orders AS orders\n' 'GROUP BY\n' - ' DATE_TRUNC(\'MONTH\', orders.created_at)', + ' dateTrunc(\'MONTH\', orders.created_at)', ), ( "mysql", diff --git a/tests/dialects/test_clickhouse.py b/tests/dialects/test_clickhouse.py index f2123b9e..3a0da3dd 100644 --- a/tests/dialects/test_clickhouse.py +++ b/tests/dialects/test_clickhouse.py @@ -123,6 +123,10 @@ def test_clickhouse_build_date_trunc_week_sunday_shift() -> None: col = sqlglot.parse_one("ordered_at", dialect="clickhouse") out = d.build_date_trunc(col, TimeGranularity.WEEK_SUNDAY, parse=_parse_ch) up = out.sql(dialect="clickhouse").upper() - assert "DATE_TRUNC('WEEK'" in up + # sqlglot emits the canonical ClickHouse spelling ``dateTrunc`` + # (upper-cased here to ``DATETRUNC``). ClickHouse accepts both + # ``DATE_TRUNC`` and ``dateTrunc`` so either is correct on the wire, + # but sqlglot only emits one form — pin it. + assert "DATETRUNC('WEEK'" in up assert "+ INTERVAL 1 DAY" in up assert "- INTERVAL 1 DAY" in up diff --git a/tests/facade/test_translator.py b/tests/facade/test_translator.py index c39a4196..8b977444 100644 --- a/tests/facade/test_translator.py +++ b/tests/facade/test_translator.py @@ -49,6 +49,10 @@ def _catalog() -> FacadeCatalog: Column(name="revenue", type=DataType.DOUBLE), Column(name="status", type=DataType.TEXT), Column(name="ordered_at", type=DataType.TIMESTAMP), + # DEV-1566: DATE column for CAST( AS TIMESTAMP) coverage. + Column(name="delivered_at", type=DataType.DATE), + # DEV-1566: BOOLEAN column for CAST allowlist coverage. + Column(name="is_paid", type=DataType.BOOLEAN), ], measures=[ ModelMeasure(name="aov", formula="revenue:sum / *:count", @@ -130,6 +134,48 @@ def test_show_statement_is_noop_with_tag(dialect) -> None: assert result.command_tag == "SHOW" +def test_command_fallback_warning_suppressed_during_translate(dialect, caplog) -> None: + # sqlglot warns when a statement parses to the generic Command node; for + # facade traffic that path is expected and handled, so translate() must + # not leak one warning line per BI connection. + with caplog.at_level(logging.WARNING, logger="sqlglot"): + result = translate( + sql="SHOW TRANSACTION ISOLATION LEVEL", catalog=_catalog(), dialect=dialect + ) + assert isinstance(result, NoOpResult) + assert not [r for r in caplog.records if "Falling back" in r.getMessage()] + + +@pytest.mark.parametrize( + "sql", + [ + "INSERT INTO orders VALUES (1)", + "UPDATE orders SET id = 2", + "DELETE FROM orders", + "CREATE TABLE x (a INT)", + "DROP TABLE orders", + "ALTER TABLE orders ADD COLUMN foo INT", + ], +) +def test_dml_ddl_rejected_read_only(sql: str, dialect) -> None: + with pytest.raises(TranslationError) as exc_info: + translate(sql=sql, catalog=_catalog(), dialect=dialect) + assert READ_ONLY_MESSAGE in str(exc_info.value) + + +def test_select_star_on_table_rejected(dialect) -> None: + with pytest.raises(TranslationError) as exc_info: + translate(sql="SELECT * FROM orders", catalog=_catalog(), dialect=dialect) + assert "SELECT *" in str(exc_info.value) + assert "INFORMATION_SCHEMA.METRICS" in str(exc_info.value) + + +def test_parse_error_translates(dialect) -> None: + with pytest.raises(TranslationError) as exc_info: + translate(sql="SELECT FROM WHERE", catalog=_catalog(), dialect=dialect) + assert "parse error" in str(exc_info.value).lower() + + # --- DEV-1569: SET / RESET capture on NoOpResult, set_config mutation tunneling --- @@ -471,47 +517,8 @@ def test_reset_setting_op_is_pydantic_model() -> None: ) -def test_command_fallback_warning_suppressed_during_translate(dialect, caplog) -> None: - # sqlglot warns when a statement parses to the generic Command node; for - # facade traffic that path is expected and handled, so translate() must - # not leak one warning line per BI connection. - with caplog.at_level(logging.WARNING, logger="sqlglot"): - result = translate( - sql="SHOW TRANSACTION ISOLATION LEVEL", catalog=_catalog(), dialect=dialect - ) - assert isinstance(result, NoOpResult) - assert not [r for r in caplog.records if "Falling back" in r.getMessage()] -@pytest.mark.parametrize( - "sql", - [ - "INSERT INTO orders VALUES (1)", - "UPDATE orders SET id = 2", - "DELETE FROM orders", - "CREATE TABLE x (a INT)", - "DROP TABLE orders", - "ALTER TABLE orders ADD COLUMN foo INT", - ], -) -def test_dml_ddl_rejected_read_only(sql: str, dialect) -> None: - with pytest.raises(TranslationError) as exc_info: - translate(sql=sql, catalog=_catalog(), dialect=dialect) - assert READ_ONLY_MESSAGE in str(exc_info.value) - - -def test_select_star_on_table_rejected(dialect) -> None: - with pytest.raises(TranslationError) as exc_info: - translate(sql="SELECT * FROM orders", catalog=_catalog(), dialect=dialect) - assert "SELECT *" in str(exc_info.value) - assert "INFORMATION_SCHEMA.METRICS" in str(exc_info.value) - - -def test_parse_error_translates(dialect) -> None: - with pytest.raises(TranslationError) as exc_info: - translate(sql="SELECT FROM WHERE", catalog=_catalog(), dialect=dialect) - assert "parse error" in str(exc_info.value).lower() - # --- table resolution -------------------------------------------------------- @@ -968,6 +975,7 @@ def test_where_on_dimension_alias_does_not_route_to_aggregate_path(dialect) -> N assert len(filters) == 1 + # --- time-grain wrapping ----------------------------------------------------- @@ -1071,6 +1079,23 @@ def test_metabase_sunday_week_wrapper_recognised(dialect) -> None: assert result.query.time_dimensions[0].dimension.full_name == "ordered_at" +def test_one_day_offset_on_non_week_is_preserved(dialect) -> None: + """The day-offset unwrap is scoped to WEEK only: a ``date_trunc('month', + col + INTERVAL '1 day')`` query is NOT a Sunday-week wrapper, so the + column-side offset must be treated as user intent (the column is not a + bare ``ordered_at`` ref) and rejected with the existing translator error. + """ + with pytest.raises(TranslationError): + translate( + sql=( + 'SELECT date_trunc(\'month\', ' + '("orders"."ordered_at" + INTERVAL \'1 day\')), ' + 'COUNT(*) FROM "orders"' + ), + catalog=_catalog(), dialect=dialect, + ) + + def test_sunday_week_wrapper_two_day_offset_rejected(dialect) -> None: """The Sunday-week detector matches ONLY a one-day shift on EACH leg. A two-day shift on either leg is a different bucketing transform (not @@ -1108,21 +1133,6 @@ def test_sunday_week_wrapper_two_day_offset_rejected(dialect) -> None: ) -def test_one_day_offset_on_non_week_is_preserved(dialect) -> None: - """The day-offset unwrap is scoped to WEEK only: a ``date_trunc('month', - col + INTERVAL '1 day')`` query is NOT a Sunday-week wrapper, so the - column-side offset must be treated as user intent (the column is not a - bare ``ordered_at`` ref) and rejected with the existing translator error. - """ - with pytest.raises(TranslationError): - translate( - sql=( - 'SELECT date_trunc(\'month\', ' - '("orders"."ordered_at" + INTERVAL \'1 day\')), ' - 'COUNT(*) FROM "orders"' - ), - catalog=_catalog(), dialect=dialect, - ) def test_partial_sunday_week_wrapper_is_rejected(dialect) -> None: @@ -1173,6 +1183,11 @@ def test_outer_week_day_offset_direction_aware(dialect) -> None: ) + + + + + # --- dialect-only parse acceptance ------------------------------------------ @@ -1336,6 +1351,714 @@ def test_limit_and_offset_pass_through(dialect) -> None: assert result.query.offset == 50 +# --- CAST( AS ) projection (DEV-1566) -------------------------- +# +# The translator admits CAST around a bare/qualified Column reference, overrides +# the wire OID via projection_types, and leaves the engine SQL projecting the +# bare column. The strict allowlist mirrors the (source, target) pairs the wire +# encoders in slayer/pg_facade/types.py can losslessly handle. + + +def test_cast_column_projection_admits_date_to_timestamp(dialect) -> None: + """Linear repro: CAST( AS TIMESTAMP) round-trips through the + translator. Engine still projects the bare column; the wire layer learns + the new OID via projection_types.""" + result = translate( + sql="SELECT CAST(delivered_at AS TIMESTAMP) FROM orders", + catalog=_catalog(), dialect=dialect, + ) + assert isinstance(result, QueryResult) + # Engine sees the bare column — no CAST pushed into SlayerQuery. + assert result.query.dimensions is not None + assert [d.full_name for d in result.query.dimensions] == ["delivered_at"] + # Wire schema reflects the casted type, not the column's declared DATE. + assert result.projection_types == [DataType.TIMESTAMP] + # column_name_mapping uses the inner column's dotted form as projected_name. + assert dict(result.column_name_mapping) == { + "orders.delivered_at": "delivered_at", + } + + +def test_cast_column_with_alias_uses_alias_as_projected_name(dialect) -> None: + """The engine still selects the BARE column; only the projected_name + (user-facing label) carries the alias. So engine_alias stays + ``orders.delivered_at`` — NOT ``orders.ts`` (which would be the + aggregate-alias pattern; aggregates rename the SLayer-level measure).""" + result = translate( + sql="SELECT CAST(delivered_at AS TIMESTAMP) AS ts FROM orders", + catalog=_catalog(), dialect=dialect, + ) + assert isinstance(result, QueryResult) + assert dict(result.column_name_mapping) == {"orders.delivered_at": "ts"} + assert result.projection_types == [DataType.TIMESTAMP] + + +def test_postgres_double_colon_cast_works() -> None: + """``col::TYPE`` sugar parses to ``exp.Cast`` under the postgres dialect — + same outcome as the keyword form.""" + result = translate( + sql="SELECT delivered_at::TIMESTAMP FROM orders", + catalog=_catalog(), dialect="postgres", + ) + assert isinstance(result, QueryResult) + assert result.projection_types == [DataType.TIMESTAMP] + + +def test_cast_joined_column_projection(dialect) -> None: + """CAST around a joined dotted column ref resolves through the same + cross-model dimension path as a bare projection.""" + result = translate( + sql="SELECT CAST(customers.region AS TEXT) FROM orders", + catalog=_catalog(), dialect=dialect, + ) + assert isinstance(result, QueryResult) + assert result.query.dimensions is not None + assert [d.full_name for d in result.query.dimensions] == ["customers.region"] + assert result.projection_types == [DataType.TEXT] + + +@pytest.mark.parametrize("type_name", ["UUID", "JSON", "ARRAY", "STRUCT"]) +def test_cast_unsupported_target_type_raises(type_name: str, dialect) -> None: + """Cast target types not in the SLayer DataType mapping fall through to + the existing 'Unsupported projection expression' error.""" + with pytest.raises(TranslationError) as exc_info: + translate( + sql=f"SELECT CAST(revenue AS {type_name}) FROM orders", + catalog=_catalog(), dialect=dialect, + ) + assert "Unsupported projection expression" in str(exc_info.value) + + +@pytest.mark.parametrize( + ("col", "target"), + [ + ("status", "INT"), # TEXT → INT + ("status", "BOOLEAN"), # TEXT → BOOLEAN + ("revenue", "BOOLEAN"), # DOUBLE → BOOLEAN + ("revenue", "INT"), # DOUBLE → INT (lossy; dropped from allowlist) + ("revenue", "DATE"), # DOUBLE → DATE + ("is_paid", "DATE"), # BOOLEAN → DATE + ("is_paid", "TIMESTAMP"), # BOOLEAN → TIMESTAMP + ("is_paid", "INT"), # BOOLEAN → INT + ("delivered_at", "INT"), # DATE → INT + ("delivered_at", "DOUBLE"),# DATE → DOUBLE + ("ordered_at", "INT"), # TIMESTAMP → INT + ("ordered_at", "DOUBLE"), # TIMESTAMP → DOUBLE + ("id", "DATE"), # INT → DATE + ("id", "BOOLEAN"), # INT → BOOLEAN + ], +) +def test_cast_rejected_coercions_raise(col: str, target: str, dialect) -> None: + """Pairs outside the §5 allowlist surface a strict, named error message.""" + with pytest.raises(TranslationError) as exc_info: + translate( + sql=f"SELECT CAST({col} AS {target}) FROM orders", + catalog=_catalog(), dialect=dialect, + ) + assert "Unsupported CAST" in str(exc_info.value) + + +def test_cast_rejected_error_message_pins_full_contract(dialect) -> None: + """The rejected-coercion error must name the source DataType, target + DataType, the offending SQL, and link the docs reference. Vague messages + would regress agent-debuggability.""" + with pytest.raises(TranslationError) as exc_info: + translate( + sql="SELECT CAST(status AS INT) FROM orders", + catalog=_catalog(), dialect=dialect, + ) + msg = str(exc_info.value) + assert "Unsupported CAST" in msg + assert "TEXT" in msg # source DataType + assert "INT" in msg # target DataType + assert "CAST(status AS INT)" in msg # offending SQL fragment + assert "docs/interfaces/pg-facade.md" in msg # docs pointer + + +@pytest.mark.parametrize( + ("col", "target"), + [ + ("status", "TEXT"), # TEXT → TEXT + ("revenue", "DOUBLE"), # DOUBLE → DOUBLE + ("id", "INT"), # INT → INT + ("delivered_at", "DATE"), # DATE → DATE + ("ordered_at", "TIMESTAMP"),# TIMESTAMP → TIMESTAMP + ("is_paid", "BOOLEAN"), # BOOLEAN → BOOLEAN + ], +) +def test_cast_identity_pair_admitted_for_every_type(col: str, target: str, dialect) -> None: + result = translate( + sql=f"SELECT CAST({col} AS {target}) FROM orders", + catalog=_catalog(), dialect=dialect, + ) + assert isinstance(result, QueryResult) + assert result.projection_types == [DataType(target)] + + +@pytest.mark.parametrize( + ("col", "target", "expected_type"), + [ + # Date/time pair + ("delivered_at", "TIMESTAMP", DataType.TIMESTAMP), + ("ordered_at", "DATE", DataType.DATE), + # Numeric pair (INT→DOUBLE only; DOUBLE→INT dropped) + ("id", "DOUBLE", DataType.DOUBLE), + # X → TEXT (always admitted) + ("delivered_at", "TEXT", DataType.TEXT), + ("ordered_at", "TEXT", DataType.TEXT), + ("id", "TEXT", DataType.TEXT), + ("revenue", "TEXT", DataType.TEXT), + ("is_paid", "TEXT", DataType.TEXT), + ("status", "TEXT", DataType.TEXT), + ], +) +def test_cast_admitted_coercions_parametrised( + col: str, target: str, expected_type: DataType, dialect, +) -> None: + result = translate( + sql=f"SELECT CAST({col} AS {target}) FROM orders", + catalog=_catalog(), dialect=dialect, + ) + assert isinstance(result, QueryResult) + assert result.projection_types == [expected_type] + + +def test_cast_try_cast_rejected(dialect) -> None: + """TRY_CAST parses to exp.TryCast, not exp.Cast, and is explicitly out + of scope — Postgres has no native TRY_CAST.""" + with pytest.raises(TranslationError): + translate( + sql="SELECT TRY_CAST(status AS INT) FROM orders", + catalog=_catalog(), dialect=dialect, + ) + + +def test_cast_aggregate_inner_rejected(dialect) -> None: + """CAST( AS T) is explicitly out of scope (Column only).""" + with pytest.raises(TranslationError): + translate( + sql="SELECT CAST(SUM(revenue) AS DOUBLE) FROM orders", + catalog=_catalog(), dialect=dialect, + ) + + +def test_cast_time_grain_compat_unchanged(dialect) -> None: + """CAST(DATE_TRUNC(...) AS DATE) is the time-grain pattern — body.this is + DateTrunc, not Column, so the new column-CAST branch returns None and the + existing time-grain CAST-unwrap still handles it.""" + result = translate( + sql="SELECT CAST(date_trunc('month', ordered_at) AS DATE), revenue_sum FROM orders", + catalog=_catalog(), dialect=dialect, + ) + assert isinstance(result, QueryResult) + assert result.query.time_dimensions is not None + assert result.query.time_dimensions[0].granularity == TimeGranularity.MONTH + + +def test_cast_order_by_unaliased_rejected(dialect) -> None: + """Codex round 2: ORDER BY CAST( AS ) is rejected because the + engine query still projects the bare column — sorting by the bare column + follows the engine column's natural type (numeric for INT, temporal for + TIMESTAMP) instead of the casted type's semantics (lex for TEXT), which + is silently wrong. Workaround: alias the CAST projection and ORDER BY + the alias (see test_cast_order_by_via_alias_works).""" + with pytest.raises(TranslationError) as exc_info: + translate( + sql=( + "SELECT CAST(id AS TEXT) FROM orders " + "ORDER BY CAST(id AS TEXT) ASC" + ), + catalog=_catalog(), dialect=dialect, + ) + assert "ORDER BY" in str(exc_info.value) + assert "not in the projection list" in str(exc_info.value) + + +def test_cast_group_by_unaliased_rejected(dialect) -> None: + """Codex round 2: GROUP BY CAST( AS ) is rejected. For lossy + pairs like TIMESTAMP→DATE the engine SQL still groups by the bare + column, so multiple timestamps on the same date produce duplicate + rows in the client (silently wrong).""" + with pytest.raises(TranslationError): + translate( + sql=( + "SELECT CAST(ordered_at AS DATE) FROM orders " + "GROUP BY CAST(ordered_at AS DATE)" + ), + catalog=_catalog(), dialect=dialect, + ) + + +def test_cast_order_by_via_alias_works(dialect) -> None: + """The documented workaround: alias the CAST projection and ORDER BY + the alias. The alias resolves to the projection item directly, no + canonical-form registration needed. Sort order still follows the engine + column's natural type — DATE→TIMESTAMP is 1:1 so the result is correct, + but ``SELECT CAST(id AS TEXT) AS x ... ORDER BY x`` would still sort + numerically. The alias path makes the limitation explicit at the SQL + level instead of hiding it behind a canonical-form rewrite.""" + result = translate( + sql=( + "SELECT CAST(delivered_at AS TIMESTAMP) AS dt FROM orders " + "ORDER BY dt ASC" + ), + catalog=_catalog(), dialect=dialect, + ) + assert isinstance(result, QueryResult) + assert result.query.order is not None + assert result.query.order[0].column.name == "delivered_at" + assert result.query.order[0].direction == "asc" + # Wire schema still reflects the cast. + assert result.projection_types == [DataType.TIMESTAMP] + + +def test_cast_unknown_source_datatype_admits_text_only(dialect) -> None: + """Custom metrics with declared data_type=None admit ONLY CAST→TEXT; + every other target is rejected so wire-encode-time crashes don't surface + as opaque connection errors. Custom aggregations carry data_type=None + when constructed without an explicit type — exercise that path.""" + orders = SlayerModel( + name="orders", + data_source="jaffle", + sql_table="orders", + columns=[ + Column(name="id", type=DataType.INT, primary_key=True), + Column(name="amount", type=DataType.DOUBLE), + ], + measures=[ + # No `type=` → ModelMeasure.type is None. + ModelMeasure(name="custom_metric", formula="amount:sum"), + ], + ) + catalog = build_catalog(models_by_datasource={"jaffle": [orders]}) + + # → TEXT admitted. + result = translate( + sql="SELECT CAST(custom_metric AS TEXT) FROM orders", + catalog=catalog, dialect=dialect, + ) + assert isinstance(result, QueryResult) + assert result.projection_types == [DataType.TEXT] + + # → TIMESTAMP rejected with the strict-allowlist message. + with pytest.raises(TranslationError) as exc_info: + translate( + sql="SELECT CAST(custom_metric AS TIMESTAMP) FROM orders", + catalog=catalog, dialect=dialect, + ) + assert "Unsupported CAST" in str(exc_info.value) + + # Codex round 4: ORDER BY on the alias of an unknown-source CAST→TEXT + # must be rejected too — the lex-vs-natural sort gap applies regardless + # of whether the underlying source type was declared. The error message + # surfaces "unknown" for the source. (GROUP BY on a metric alias is + # already rejected by the strict-on-extras membership check — only + # dimensions populate derived_dims — so the lossy GROUP BY path can't + # be triggered through an unknown-source metric.) + with pytest.raises(TranslationError) as exc_info: + translate( + sql=( + "SELECT CAST(custom_metric AS TEXT) AS m FROM orders " + "ORDER BY m" + ), + catalog=catalog, dialect=dialect, + ) + msg = str(exc_info.value) + assert "ORDER BY on CAST projection" in msg + assert "unknown" in msg + + +@pytest.mark.parametrize( + "col", ["delivered_at", "ordered_at", "id", "revenue", "is_paid", "status"], +) +def test_cast_text_target_admitted_from_every_source(col: str, dialect) -> None: + """X → TEXT is always admitted (stringification is universal).""" + result = translate( + sql=f"SELECT CAST({col} AS TEXT) FROM orders", + catalog=_catalog(), dialect=dialect, + ) + assert isinstance(result, QueryResult) + assert result.projection_types == [DataType.TEXT] + + +@pytest.mark.parametrize( + ("type_alias", "expected"), + [ + # TEXT-family aliases. + ("VARCHAR", DataType.TEXT), + ("CHAR", DataType.TEXT), + # INT-family aliases (narrowing widths collapse to DataType.INT — see + # the coarse-OID note in pg-facade.md). + ("INTEGER", DataType.INT), + ("BIGINT", DataType.INT), + ("SMALLINT", DataType.INT), + # DOUBLE-family aliases (floating + DECIMAL/NUMERIC all collapse to + # DataType.DOUBLE — same coarse-OID note). + ("FLOAT", DataType.DOUBLE), + ("REAL", DataType.DOUBLE), + ("DECIMAL", DataType.DOUBLE), + ("NUMERIC", DataType.DOUBLE), + # TIMESTAMP-family aliases (TIMESTAMPTZ / TIMESTAMP WITH TIME ZONE + # collapse to DataType.TIMESTAMP — same coarse-OID note). + ("DATETIME", DataType.TIMESTAMP), + ("TIMESTAMPTZ", DataType.TIMESTAMP), + ], +) +def test_cast_sqlglot_type_aliases_map_to_slayer_datatype( + type_alias: str, expected: DataType, +) -> None: + """Each accepted sqlglot DataType.Type alias collapses onto the canonical + SLayer DataType. Pinned under postgres dialect since some aliases + (TIMESTAMPTZ) only parse cleanly there.""" + # Pick a source column whose declared type makes an + # admitted pair: identity on the expected canonical type. + source_col = { + DataType.TEXT: "status", + DataType.INT: "id", + DataType.DOUBLE: "revenue", + DataType.TIMESTAMP: "ordered_at", + }[expected] + result = translate( + sql=f"SELECT CAST({source_col} AS {type_alias}) FROM orders", + catalog=_catalog(), dialect="postgres", + ) + assert isinstance(result, QueryResult) + assert result.projection_types == [expected] + + +def test_cast_parameterised_type_form_works() -> None: + """sqlglot represents ``VARCHAR(255)`` etc. with their precision modifier + on the SAME ``DataType.Type`` member, so the mapping collapses precision + implicitly — SLayer wire types don't carry it.""" + result = translate( + sql="SELECT CAST(status AS VARCHAR(255)) FROM orders", + catalog=_catalog(), dialect="postgres", + ) + assert isinstance(result, QueryResult) + assert result.projection_types == [DataType.TEXT] + + +def test_cast_non_column_non_aggregate_inner_rejected(dialect) -> None: + """The CAST detector REQUIRES body.this == exp.Column. Inner expressions + that aren't bare columns (hygiene wrappers like SUBSTRING, function + calls, arithmetic) must not accidentally route through the CAST branch + — they must fall through to the existing fallback.""" + with pytest.raises(TranslationError) as exc_info: + translate( + sql="SELECT CAST(SUBSTRING(status, 1, 1) AS TEXT) FROM orders", + catalog=_catalog(), dialect=dialect, + ) + # Specifically: NOT the strict-allowlist message. + assert "Unsupported CAST" not in str(exc_info.value) + + +def test_cast_qualified_ref_order_by_unaliased_rejected(dialect) -> None: + """Codex round 2: same rejection for qualified (joined) CAST refs in + ORDER BY. Engine still selects the bare ``customers.region``; lex sort + semantics don't carry through. User must alias and ORDER BY the alias.""" + with pytest.raises(TranslationError) as exc_info: + translate( + sql=( + "SELECT CAST(customers.region AS TEXT) FROM orders " + "ORDER BY CAST(customers.region AS TEXT) ASC" + ), + catalog=_catalog(), dialect=dialect, + ) + assert "not in the projection list" in str(exc_info.value) + + +def test_cast_qualified_ref_group_by_unaliased_rejected(dialect) -> None: + """Codex round 2: same rejection for qualified-ref GROUP BY CAST.""" + with pytest.raises(TranslationError): + translate( + sql=( + "SELECT CAST(customers.region AS TEXT) FROM orders " + "GROUP BY CAST(customers.region AS TEXT)" + ), + catalog=_catalog(), dialect=dialect, + ) + + +def test_cast_aliased_projection_order_by_via_alias_works(dialect) -> None: + """Workaround: with an aliased CAST projection, ORDER BY + resolves directly through item_by_projected_name. The ORDER BY CAST(...) + canonical form was removed (Codex round 2) — agents must reference the + alias instead.""" + result = translate( + sql=( + "SELECT CAST(delivered_at AS TIMESTAMP) AS ts FROM orders " + "ORDER BY ts ASC" + ), + catalog=_catalog(), dialect=dialect, + ) + assert isinstance(result, QueryResult) + assert result.query.order is not None + assert result.query.order[0].column.name == "delivered_at" + assert result.query.order[0].direction == "asc" + + +def test_cast_aliased_projection_group_by_unaliased_rejected(dialect) -> None: + """An aliased CAST projection + GROUP BY repeating the CAST shape is + still rejected — the canonical form is not in derived_dims after Codex + round 2's removal. Agents must rewrite to ``GROUP BY `` or the + bare column.""" + with pytest.raises(TranslationError): + translate( + sql=( + "SELECT CAST(delivered_at AS TIMESTAMP) AS ts FROM orders " + "GROUP BY CAST(delivered_at AS TIMESTAMP)" + ), + catalog=_catalog(), dialect=dialect, + ) + + +# --- Codex round 3: lossy-pair rejection via alias --------------------------- +# +# Round 3 dropped CAST canonical-form registration so unaliased +# ORDER BY / GROUP BY CAST(...) fail with "not in projection list". But the +# aliased path (SELECT CAST(c AS T) AS x ... ORDER BY x) still resolves via +# the bare alias; it sorts/groups by the bare column, which for lossy pairs +# (X→TEXT for ORDER BY, TIMESTAMP→DATE for GROUP BY) silently disagrees with +# the casted semantics. These pairs are rejected at order/group-by time; +# every other admitted pair (1:1, identity, INT→DOUBLE, DATE↔TIMESTAMP) +# stays admitted via alias because the underlying engine value preserves +# the casted order/grouping. + + +@pytest.mark.parametrize( + ("col", "target"), + [ + ("id", "TEXT"), # INT → TEXT + ("revenue", "TEXT"), # DOUBLE → TEXT + ("is_paid", "TEXT"), # BOOLEAN → TEXT + ("delivered_at", "TEXT"), # DATE → TEXT + ("ordered_at", "TEXT"), # TIMESTAMP → TEXT + ], +) +def test_cast_alias_order_by_lossy_pair_rejected( + col: str, target: str, dialect, +) -> None: + """SELECT CAST( AS TEXT) AS x ... ORDER BY x is admitted in + projection but rejected at ORDER BY resolution — the engine query still + sorts by the bare column's natural type (numeric/temporal) rather than + the casted type's lex order.""" + with pytest.raises(TranslationError) as exc_info: + translate( + sql=( + f"SELECT CAST({col} AS {target}) AS x FROM orders ORDER BY x" + ), + catalog=_catalog(), dialect=dialect, + ) + msg = str(exc_info.value) + assert "ORDER BY on CAST projection" in msg + assert "lossy pair" in msg + + +@pytest.mark.parametrize( + ("col", "target", "expected_source", "expected_target"), + [ + # TIMESTAMP → DATE: many timestamps per date. + ("ordered_at", "DATE", "TIMESTAMP", "DATE"), + # Codex round 5: INT → DOUBLE is also lossy for grouping because + # IEEE 754 float64 cannot represent every int64 distinctly; large + # bigints collapse to the same double under Postgres semantics + # while the facade groups by the bare int and over-reports groups. + ("id", "DOUBLE", "INT", "DOUBLE"), + ], +) +def test_cast_alias_group_by_lossy_pair_rejected( + col: str, target: str, expected_source: str, expected_target: str, + dialect, +) -> None: + """Many-to-one admitted pairs (TIMESTAMP→DATE, INT→DOUBLE) are rejected + in GROUP BY alias paths — the facade's per-engine-column grouping + returns more groups than Postgres semantics would produce.""" + with pytest.raises(TranslationError) as exc_info: + translate( + sql=( + f"SELECT CAST({col} AS {target}) AS d, COUNT(*) FROM orders " + f"GROUP BY d" + ), + catalog=_catalog(), dialect=dialect, + ) + msg = str(exc_info.value) + assert "GROUP BY on CAST projection" in msg + assert "lossy pair" in msg + assert expected_source in msg + assert expected_target in msg + + +@pytest.mark.parametrize( + ("col", "target"), + [ + ("delivered_at", "TIMESTAMP"), # DATE → TIMESTAMP (1:1) + ("ordered_at", "DATE"), # TIMESTAMP → DATE — only for ORDER BY + ("id", "DOUBLE"), # INT → DOUBLE (1:1 within INT range) + ("delivered_at", "DATE"), # identity DATE → DATE + ("revenue", "DOUBLE"), # identity DOUBLE → DOUBLE + ], +) +def test_cast_alias_order_by_safe_pair_admitted( + col: str, target: str, dialect, +) -> None: + """Order-preserving pairs (1:1 / identity / DATE↔TIMESTAMP) stay + admitted via alias because the engine column's natural sort matches + the casted type's sort. The wire-type override still applies.""" + result = translate( + sql=( + f"SELECT CAST({col} AS {target}) AS x FROM orders ORDER BY x" + ), + catalog=_catalog(), dialect=dialect, + ) + assert isinstance(result, QueryResult) + assert result.query.order is not None + assert result.query.order[0].direction == "asc" + assert result.projection_types == [DataType(target)] + + +@pytest.mark.parametrize( + ("col", "target"), + [ + ("delivered_at", "TIMESTAMP"), # DATE → TIMESTAMP (1:1) + ("revenue", "DOUBLE"), # identity + ("delivered_at", "DATE"), # identity + ("status", "TEXT"), # identity TEXT → TEXT + ], +) +def test_cast_alias_group_by_safe_pair_admitted( + col: str, target: str, dialect, +) -> None: + """1:1 / identity pairs preserve the per-engine-column grouping under + the casted type, so GROUP BY stays admitted.""" + result = translate( + sql=( + f"SELECT CAST({col} AS {target}) AS x, COUNT(*) FROM orders " + f"GROUP BY x" + ), + catalog=_catalog(), dialect=dialect, + ) + assert isinstance(result, QueryResult) + assert result.projection_types[0] == DataType(target) + + +def test_cast_order_by_bare_column_does_not_shadow_cast_projection(dialect) -> None: + """Codex round 8: when both the bare column and a CAST alias are + projected, ``ORDER BY `` must route to the bare column + projection — NOT to the CAST item via the secondary-key fallback that + would then trip the lossy-pair rejection. Validates the + ``item.cast_target is None`` guard in ``_build_item_index``.""" + result = translate( + sql="SELECT id, CAST(id AS TEXT) AS x FROM orders ORDER BY id", + catalog=_catalog(), dialect=dialect, + ) + assert isinstance(result, QueryResult) + assert result.query.order is not None + assert result.query.order[0].column.name == "id" + assert result.query.order[0].direction == "asc" + # Wire schema: bare id keeps INT; the CAST alias surfaces as TEXT. + assert result.projection_types == [DataType.INT, DataType.TEXT] + + +def test_cast_order_by_bare_column_without_bare_projection_fails_cleanly(dialect) -> None: + """Codex round 8: when the bare column is NOT projected, ``ORDER BY + `` over a CAST projection must surface ``not in the + projection list`` (the existing strict-on-extras error) — NOT the + lossy-CAST rejection (which would falsely imply the user asked for + lex-sort semantics on the casted column).""" + with pytest.raises(TranslationError) as exc_info: + translate( + sql="SELECT CAST(id AS TEXT) AS x, status FROM orders ORDER BY id", + catalog=_catalog(), dialect=dialect, + ) + msg = str(exc_info.value) + assert "not in the projection list" in msg + assert "lossy pair" not in msg + + +def test_cast_joined_column_metabase_alias_resolves(dialect) -> None: + """Codex round 8: ``CAST("Stores"."name" AS TEXT)`` in a Metabase-style + LEFT JOIN subquery must resolve through the alias_map (\"Stores\" → + \"stores\"). Before the fix the CAST branch dropped the alias_map and + the joined ref came through as ``Stores.name`` → ``Unknown projection + item``. Now the cast branch forwards alias_map and the projection + resolves to ``stores.name`` like the non-casted shape does.""" + sql = _metabase_join_sql( + projection='CAST("Stores"."name" AS TEXT) AS "store_name"', + ) + result = translate(sql=sql, catalog=_join_catalog(), dialect=dialect) + assert isinstance(result, QueryResult) + assert result.query.dimensions is not None + assert [d.full_name for d in result.query.dimensions] == ["stores.name"] + assert result.projection_types == [DataType.TEXT] + + +def test_cast_metric_projection_overrides_wire_type(dialect) -> None: + """A CAST around a metric reference resolves through the metric path + (`_record_metric`), and the cast target wins over the declared metric type.""" + result = translate( + sql="SELECT CAST(revenue_sum AS TEXT) FROM orders", + catalog=_catalog(), dialect=dialect, + ) + assert isinstance(result, QueryResult) + assert result.query.measures is not None + assert result.query.measures[0].formula == "revenue:sum" + # Declared metric type is DOUBLE; CAST( AS TEXT) overrides. + assert result.projection_types == [DataType.TEXT] + + +# --- allow_column_cast gate (Codex round 1 — Flight regression guard) -------- +# +# DEV-1566 admits CAST( AS ) projection in the shared translator. +# The Flight facade materialises rows via pa.Table.from_pylist against a +# catalog-typed schema, which raises ArrowTypeError on values whose Python +# type doesn't match the declared Arrow type (date vs timestamp, bool vs +# utf8, etc.). The Flight shim passes allow_column_cast=False to reject the +# new projection shape at translate time; this test pins the gate. + + +def test_allow_column_cast_false_rejects_cast_projection() -> None: + """With allow_column_cast=False, the CAST projection branch is skipped + and the body falls through to the 'Unsupported projection expression' + error (the existing terminal path).""" + with pytest.raises(TranslationError) as exc_info: + translate( + sql="SELECT CAST(delivered_at AS TIMESTAMP) FROM orders", + catalog=_catalog(), dialect=None, allow_column_cast=False, + ) + assert "Unsupported projection expression" in str(exc_info.value) + + +def test_allow_column_cast_false_leaves_time_grain_cast_unwrap_working() -> None: + """The gate must not regress the time-grain CAST-unwrap path — the + Metabase fingerprint ``CAST(DATE_TRUNC(...) AS DATE)`` is detected by + ``_detect_time_grain``, which runs BEFORE the column-CAST branch.""" + result = translate( + sql=( + "SELECT CAST(date_trunc('month', ordered_at) AS DATE), revenue_sum " + "FROM orders" + ), + catalog=_catalog(), dialect=None, allow_column_cast=False, + ) + assert isinstance(result, QueryResult) + assert result.query.time_dimensions is not None + assert result.query.time_dimensions[0].granularity == TimeGranularity.MONTH + + +def test_allow_column_cast_default_true_unchanged(dialect) -> None: + """Sanity: the default-True path is the same as not passing the kwarg + (pg-facade behaviour). Pinned so the default never silently flips.""" + explicit = translate( + sql="SELECT CAST(delivered_at AS TIMESTAMP) FROM orders", + catalog=_catalog(), dialect=dialect, allow_column_cast=True, + ) + implicit = translate( + sql="SELECT CAST(delivered_at AS TIMESTAMP) FROM orders", + catalog=_catalog(), dialect=dialect, + ) + assert isinstance(explicit, QueryResult) + assert isinstance(implicit, QueryResult) + assert explicit.projection_types == implicit.projection_types == [DataType.TIMESTAMP] + # --- DEV-1565: LEFT JOIN with subquery (Metabase MBQL shape) ----------------- # # Metabase v0.62 emits joins as: diff --git a/tests/flight/test_translator.py b/tests/flight/test_translator.py new file mode 100644 index 00000000..0a5fd142 --- /dev/null +++ b/tests/flight/test_translator.py @@ -0,0 +1,82 @@ +"""Tests for slayer.flight.translator — Flight-shim-specific behaviour. + +Most translator coverage lives in tests/facade/test_translator.py (the shared +SQL → SlayerQuery pipeline). This file pins behaviours that are specific to +the Flight shim's `_shared_translate(..., allow_column_cast=False)` call: + +* DEV-1566 ``CAST( AS )`` projection is rejected at translate + time (Codex round 1 — Flight has no value-coercion pass and would crash + inside ``pa.Table.from_pylist`` if the projection were admitted). +* The time-grain ``CAST(DATE_TRUNC(...) AS DATE)`` Metabase fingerprint is + unaffected — the time-grain unwrap runs before the column-CAST branch. +""" + +from __future__ import annotations + +import pytest + +from slayer.core.enums import DataType +from slayer.core.models import Column, SlayerModel +from slayer.facade.catalog import FacadeCatalog, build_catalog +from slayer.flight.translator import QueryResult, TranslationError, translate + + +def _catalog() -> FacadeCatalog: + orders = SlayerModel( + name="orders", + data_source="jaffle", + sql_table="orders", + columns=[ + Column(name="id", type=DataType.INT, primary_key=True), + Column(name="revenue", type=DataType.DOUBLE), + Column(name="status", type=DataType.TEXT), + Column(name="ordered_at", type=DataType.TIMESTAMP), + Column(name="delivered_at", type=DataType.DATE), + Column(name="is_paid", type=DataType.BOOLEAN), + ], + ) + return build_catalog(models_by_datasource={"jaffle": [orders]}) + + +# --- DEV-1566 gate: Flight rejects CAST( AS ) projections --------- + + +@pytest.mark.parametrize( + ("col", "target"), + [ + ("delivered_at", "TIMESTAMP"), # DATE → TIMESTAMP (date can't fill pa.timestamp) + ("is_paid", "TEXT"), # BOOLEAN → TEXT (bool can't fill pa.utf8) + ("revenue", "TEXT"), # DOUBLE → TEXT + ("id", "TEXT"), # INT → TEXT + ("delivered_at", "TEXT"), # DATE → TEXT + ("ordered_at", "TEXT"), # TIMESTAMP → TEXT + ], +) +def test_flight_rejects_cast_projection(col: str, target: str) -> None: + """The Flight shim sets allow_column_cast=False so the CAST projection + branch is skipped and the body falls through to the existing + 'Unsupported projection expression' terminal error. Without this gate + pa.Table.from_pylist would raise ArrowTypeError at materialisation.""" + with pytest.raises(TranslationError) as exc_info: + translate( + sql=f"SELECT CAST({col} AS {target}) FROM orders", + catalog=_catalog(), + ) + assert "Unsupported projection expression" in str(exc_info.value) + + +def test_flight_admits_time_grain_cast_unwrap() -> None: + """``CAST(DATE_TRUNC(, ) AS DATE)`` is the Metabase time-grain + fingerprint, NOT a column-CAST projection. The time-grain unwrap runs + BEFORE the column-CAST branch in _resolve_projection, so the gate must + not regress it. (This is the pattern Flight clients actually emit.)""" + result = translate( + sql=( + "SELECT CAST(date_trunc('month', ordered_at) AS DATE), revenue_sum " + "FROM orders" + ), + catalog=_catalog(), + ) + assert isinstance(result, QueryResult) + assert result.query.time_dimensions is not None + assert result.query.time_dimensions[0].granularity.value == "month" diff --git a/tests/integration/test_integration_pg_facade.py b/tests/integration/test_integration_pg_facade.py index 3bda77a6..3ce314e4 100644 --- a/tests/integration/test_integration_pg_facade.py +++ b/tests/integration/test_integration_pg_facade.py @@ -373,6 +373,21 @@ async def test_dml_rejected(pg_demo_server) -> None: await conn.close() +async def test_cast_unsupported_coercion_returns_postgres_error(pg_demo_server) -> None: + """DEV-1566: CAST( AS INT) is outside the admitted-coercion + allowlist (truncation/parse semantics differ from Postgres). The + translator surfaces a clean PostgresError with the strict-allowlist + message — not an internal connection crash at wire-encode time.""" + host, port = pg_demo_server + conn = await _connect(host, port) + try: + with pytest.raises(asyncpg.PostgresError) as exc_info: + await conn.fetch("SELECT CAST(name AS INT) FROM customers LIMIT 1") + assert "Unsupported CAST" in str(exc_info.value) + finally: + await conn.close() + + # --- parameterised query (literal substitution) ------------------------------ diff --git a/tests/integration/test_metabase_e2e.py b/tests/integration/test_metabase_e2e.py index 99711100..27e550a0 100644 --- a/tests/integration/test_metabase_e2e.py +++ b/tests/integration/test_metabase_e2e.py @@ -1140,10 +1140,6 @@ def test_date_psycopg_text_round_trip(metabase_e2e_env: MetabaseE2EEnv) -> None: conn.close() -@pytest.mark.xfail( - strict=True, - reason="DEV-1566: pg-facade rejects CAST( AS TIMESTAMP) in projection; no SLayer-query primitive for per-query type coercion", -) async def test_timestamp_round_trip_both_formats(metabase_e2e_env: MetabaseE2EEnv) -> None: host, port = metabase_e2e_env.pg_primary conn = await _asyncpg_connect(host, port, password=metabase_e2e_env.pg_primary_password) diff --git a/tests/pg_facade/test_types.py b/tests/pg_facade/test_types.py index f40bf045..7dc385aa 100644 --- a/tests/pg_facade/test_types.py +++ b/tests/pg_facade/test_types.py @@ -9,6 +9,9 @@ import pytest from slayer.core.enums import DataType +from slayer.core.models import Column, SlayerModel +from slayer.facade.catalog import build_catalog +from slayer.facade.translator import QueryResult, translate from slayer.pg_facade import types as t from slayer.pg_facade.protocol import ( OID_BOOL, @@ -42,6 +45,32 @@ def test_datatype_to_oid_none_falls_back_to_text() -> None: assert t.datatype_to_oid(None) == OID_TEXT +def test_query_result_oid_overridden_by_cast() -> None: + """DEV-1566: end-to-end — a CAST( AS TIMESTAMP) translation + yields a projection_types entry that maps to OID_TIMESTAMP, not the + column's declared OID_DATE.""" + orders = SlayerModel( + name="orders", data_source="jaffle", sql_table="orders", + columns=[ + Column(name="id", type=DataType.INT, primary_key=True), + Column(name="delivered_at", type=DataType.DATE), + ], + ) + catalog = build_catalog(models_by_datasource={"jaffle": [orders]}) + + # Baseline: bare projection → OID_DATE. + bare = translate(sql="SELECT delivered_at FROM orders", catalog=catalog) + assert isinstance(bare, QueryResult) + assert t.datatype_to_oid(bare.projection_types[0]) == OID_DATE + + # CAST projection → OID_TIMESTAMP (wire layer rewritten). + casted = translate( + sql="SELECT CAST(delivered_at AS TIMESTAMP) FROM orders", catalog=catalog, + ) + assert isinstance(casted, QueryResult) + assert t.datatype_to_oid(casted.projection_types[0]) == OID_TIMESTAMP + + # --- value_to_text ----------------------------------------------------------- @@ -50,8 +79,18 @@ def test_value_to_text_none_is_sql_null() -> None: def test_value_to_text_bool() -> None: - assert t.value_to_text(True) == b"t" - assert t.value_to_text(False) == b"f" + # DEV-1566: the default oid is OID_TEXT, which is the Postgres text shape + # for booleans (``true``/``false``). The BOOL-wire ``t``/``f`` shape is + # covered separately by test_value_to_text_bool_in_bool_column. + assert t.value_to_text(True) == b"true" + assert t.value_to_text(False) == b"false" + + +def test_value_to_text_bool_in_bool_column() -> None: + """OID_BOOL keeps the wire-format ``t``/``f`` shape — that's what the + Postgres binary protocol carries on the text-format path for BOOL.""" + assert t.value_to_text(True, OID_BOOL) == b"t" + assert t.value_to_text(False, OID_BOOL) == b"f" def test_value_to_text_int_and_decimal() -> None: @@ -102,6 +141,27 @@ def test_value_to_text_datetime_in_timestamp_column_is_preserved() -> None: assert t.value_to_text(ts, OID_TIMESTAMP) == b"2024-06-01 12:30:45" +def test_value_to_text_date_in_timestamp_column_is_widened() -> None: + """DEV-1566: symmetric widening to the existing OID_DATE-driven narrowing. + CAST( AS TIMESTAMP) needs the wire text payload to look like a + Postgres TIMESTAMP literal (``YYYY-MM-DD HH:MM:SS``) — without the widen + step pgjdbc/psycopg2 mis-parse a bare ``YYYY-MM-DD`` value for a column + declared with OID 1114.""" + d = dt.date(2024, 6, 1) + assert t.value_to_text(d, OID_TIMESTAMP) == b"2024-06-01 00:00:00" + # Default (no oid / OID_TEXT) preserves the bare date shape. + assert t.value_to_text(d) == b"2024-06-01" + + +def test_value_to_text_bool_in_text_column_uses_true_false() -> None: + """DEV-1566: CAST( AS TEXT) must surface Postgres-shaped boolean + text (``true``/``false``), not the BOOL wire shape (``t``/``f``). + Default oid == OID_TEXT so the explicit and default forms agree; + BOOL-wire ``t``/``f`` is reachable only via explicit OID_BOOL.""" + assert t.value_to_text(True, OID_TEXT) == b"true" + assert t.value_to_text(False, OID_TEXT) == b"false" + + def test_value_to_text_str_and_bytes() -> None: assert t.value_to_text("hello") == b"hello" assert t.value_to_text(b"raw") == b"raw"