103 changes: 91 additions & 12 deletions docs/database-support.md
@@ -5,23 +5,35 @@ SQL generation. Databases are supported at two tiers.

## Tier 1 — fully tested

Integration tests and/or Docker examples; must not regress.

| Engine | Coverage |
|---|---|
| **SQLite** | Integration tests in `tests/integration/test_integration.py`; embedded example. |
| **Postgres** | Integration tests in `tests/integration/test_integration_postgres.py`; Docker example. |
| **DuckDB** | Integration tests in `tests/integration/test_integration_duckdb.py` (in-process, no Docker). |
| **MySQL** | Docker example with `verify.py`. |
| **ClickHouse** | Docker example with `verify.py`. |
| **SQL Server** | Docker example with `verify.py` in `examples/sqlserver/`. |
| **Snowflake** | Integration tests in `tests/integration/test_integration_snowflake.py` (skip without `~/.snowflake/connections.toml`); `examples/snowflake/` ships `README.md` + `verify.py`. No Docker (no free local image). |
Live-instance integration tests must not regress. Where Docker images exist,
the suites spin up the engine via `testcontainers`; the cloud-only engines
(BigQuery, Snowflake) skip cleanly when credentials aren't available and run
against the live service in CI when they are.

| Engine | Live test | Docker example |
|---|---|---|
| **SQLite** | `tests/integration/test_integration.py` (in-process) | `examples/embedded/` |
| **Postgres** | `tests/integration/test_integration_postgres.py` (pytest-postgresql, spawned temp instance) | `examples/postgres/` |
| **DuckDB** | `tests/integration/test_integration_duckdb.py` (in-process) | `examples/embedded/` (DuckDB mode) |
| **MySQL** | `tests/integration/test_integration_mysql.py` (`testcontainers[mysql]`) | `examples/mysql/` |
| **ClickHouse** | `tests/integration/test_integration_clickhouse.py` (`testcontainers[clickhouse]`) | `examples/clickhouse/` |
| **SQL Server** | `tests/integration/test_integration_sqlserver.py` (`testcontainers`, `msodbcsql18` + `unixodbc-dev` on the runner) | `examples/sqlserver/` |
| **Snowflake** | `tests/integration/test_integration_snowflake.py` (skips without `~/.snowflake/connections.toml`; profile name overridable via `$SLAYER_SNOWFLAKE_CONNECTION`) | `examples/snowflake/` (no Docker) |
| **BigQuery** | `examples/bigquery/verify.py` driven by CI against `bigquery-public-data.thelook_ecommerce` (gated on `GCP_PROJECT_ID` / `GCP_SA_KEY_B64` repo secrets) | `examples/bigquery/` (no Docker — managed service) |

BigQuery does not yet have a pytest-style integration suite; its CI coverage
runs the example's `verify.py` directly via `.github/workflows/ci.yml`. That
exercises auto-ingestion, basic projection, joins, time-grain dimensions, and
the cardinality / sum-of-grouped-equals-total invariants — enough to catch
emitted-SQL regressions, but the verify-script tier is shallower than the
testcontainers suites.
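
The "skip cleanly when credentials aren't available" behaviour for the cloud-only engines can be sketched roughly as below. This is an illustration built on the standard library's `unittest`, not the project's actual fixture code: the helper names, the test class, and the `"default"` fallback profile are all hypothetical. Only the `~/.snowflake/connections.toml` path and the `SLAYER_SNOWFLAKE_CONNECTION` variable come from the table above.

```python
import os
import unittest
from pathlib import Path


def snowflake_config_path(home: Path) -> Path:
    """Location of the Snowflake connections file under a given home directory."""
    return home / ".snowflake" / "connections.toml"


def snowflake_profile(env: dict, default: str = "default") -> str:
    """Profile name, overridable via $SLAYER_SNOWFLAKE_CONNECTION.

    The fallback value is an assumption made for this sketch.
    """
    return env.get("SLAYER_SNOWFLAKE_CONNECTION", default)


# Evaluated once at import time, so the whole class is skipped on machines
# (or forks) that have no credentials file.
HAS_SNOWFLAKE = snowflake_config_path(Path.home()).is_file()


@unittest.skipUnless(HAS_SNOWFLAKE, "no ~/.snowflake/connections.toml; skipping live tests")
class SnowflakeLiveTest(unittest.TestCase):  # hypothetical test class
    def test_profile_is_resolved(self):
        self.assertTrue(snowflake_profile(dict(os.environ)))
```

Gating on a file or environment check at collection time keeps the suite green on machines without credentials while still running the live tests where they exist.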

## Tier 2 — code-covered

Unit tests for SQL generation; no live-instance verification.

BigQuery, Redshift, Trino/Presto, Databricks/Spark, Oracle.
Redshift, Trino/Presto (Athena uses the Presto dialect), Databricks/Spark,
Oracle.

## Aggregation support

@@ -40,6 +52,8 @@ because no standard syntax works everywhere:
| ClickHouse | yes | yes | yes | yes | Native `median(x)`, parametric `quantile(p)(x)`, native `stddev_*`/`var_*`/`corr`/`covar*` (camelCase variants emitted by sqlglot for `var_samp`). |
| Snowflake | yes | yes | yes | yes | Native `MEDIAN`, `PERCENTILE_CONT(p) WITHIN GROUP`, `STDDEV_*`/`VAR_*`/`CORR`/`COVAR_*`. `LOG10` native; no native `LOG2` (falls through to `LOG(2, x)`). |
| MySQL | **no** | **no** | yes | **no** | No native `MEDIAN`/`PERCENTILE_CONT`/`CORR`/`COVAR_*` and no Python-UDF mechanism — SLayer raises `NotImplementedError` for those. `STDDEV_SAMP`/`STDDEV_POP`/`VAR_SAMP`/`VAR_POP` are native on MySQL. Use MariaDB or compute the unsupported aggregations client-side. |
| SQL Server (T-SQL) | **no** | **no** | yes | yes (decomposed) | `MEDIAN` doesn't exist and T-SQL's `PERCENTILE_CONT` is window-only (no `WITHIN GROUP` aggregate form) — SLayer raises `NotImplementedError`. Native `STDEV`/`STDEVP`/`VAR`/`VARP` (slayer renames the canonical `STDDEV_*`/`VAR_*` names at emit time). `CORR`/`COVAR_*` use the same variance-decomposition formula as MySQL (`cov(x,y) = (var(x+y) − var(x) − var(y)) / 2`, `corr = cov / (stddev(x) · stddev(y))`). |
| BigQuery | **no** | **no** | yes | yes | BigQuery has no `MEDIAN` aggregate, and its `PERCENTILE_CONT` is analytic-only (no `WITHIN GROUP` syntax), so the `PERCENTILE_CONT(p) WITHIN GROUP (ORDER BY x)` emitted by the base class fails at runtime. If you need percentiles on BigQuery, define a custom `Aggregation` using `APPROX_QUANTILES(x, 100)[OFFSET(N)]`. Native `STDDEV_SAMP`/`STDDEV_POP`/`VAR_SAMP`/`VAR_POP`/`CORR`/`COVAR_SAMP`/`COVAR_POP` (sqlglot may emit `VARIANCE` for `var_samp`). |

### SQLite caveats

@@ -105,6 +119,37 @@ If you need percentiles on MySQL, the recommended options are:
- Define a custom `Aggregation` on the model with whatever `GROUP_CONCAT`-
based or windowed expression suits your data shape and group sizes.

### SQL Server (T-SQL) caveats

T-SQL has `STDEV`/`STDEVP`/`VAR`/`VARP` (not `STDDEV_SAMP`/`STDDEV_POP`/
`VAR_SAMP`/`VAR_POP`); sqlglot's tsql transpiler emits incorrect names like
`VAR_SAMP` and `VARIANCE_POP`, so the T-SQL dialect overrides the canonical
spellings via `Anonymous` rewrites in `slayer/sql/dialects/tsql.py`.

`CORR`/`COVAR_SAMP`/`COVAR_POP` are derived from variance:
`cov(x, y) = (var(x + y) − var(x) − var(y)) / 2`,
`corr = cov / (stddev(x) · stddev(y))`. The decomposition is shared with
MySQL via `_build_covar_decomposition` in `slayer/sql/dialects/base.py`.
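
The variance decomposition above can be checked numerically. The sketch below uses only Python's standard library and population variance; the function names are made up for illustration and are unrelated to `_build_covar_decomposition`, whose internals are not shown here.

```python
from math import sqrt
from statistics import pvariance


def covar_pop_decomposed(xs, ys):
    """cov(x, y) = (var(x + y) - var(x) - var(y)) / 2, with population variance."""
    sums = [x + y for x, y in zip(xs, ys)]
    return (pvariance(sums) - pvariance(xs) - pvariance(ys)) / 2


def corr_decomposed(xs, ys):
    """corr = cov / (stddev(x) * stddev(y))."""
    return covar_pop_decomposed(xs, ys) / (sqrt(pvariance(xs)) * sqrt(pvariance(ys)))


def covar_pop_direct(xs, ys):
    """Textbook population covariance, used only as a cross-check."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
print(covar_pop_decomposed(xs, ys), covar_pop_direct(xs, ys))
print(corr_decomposed(xs, ys))
```

The identity follows from `var(x + y) = var(x) + var(y) + 2·cov(x, y)`, and it holds equally when all three variances are sample variances, which gives the `COVAR_SAMP` form.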

`MEDIAN` doesn't exist, and `PERCENTILE_CONT` in T-SQL is a window function
only: it always requires an `OVER` clause, so there is no plain aggregate
form. SLayer raises `NotImplementedError` for both at SQL generation time.
Use the windowed form as a custom `Aggregation` if you need it, or compute
client-side.

Other T-SQL specifics surfaced by the dialect:

- `DATETRUNC(unit, col)` for time-grain dimensions (SQL Server 2022+ —
earlier versions don't have `DATETRUNC` and aren't supported).
- `DATETRUNC(iso_week, col)` for Monday-aligned week truncation —
`@@DATEFIRST`-independent so the bucketing is deterministic.
- `DATEADD(unit, n, col)` for time-shift arithmetic — T-SQL has no
`INTERVAL` literal.
- Bracketed `[ident]` quoting — `<model>.<column>` SLayer aliases get
mangled to `<model>___<column>` at emit and decoded back on result-row
keys (mirror of the BigQuery `___` mangling; see DEV-1571).
- Native `LOG10`, no native `LOG2` (`log2(x)` falls through to the
canonical 2-arg `LOG(2, x)` form).

### Snowflake caveats

Snowflake is a fully managed cloud warehouse — no Docker, no local instance.
@@ -130,6 +175,40 @@ Snowflake](configuration/datasources.md#snowflake) for connection setup.
canonical 2-arg `LOG(2, x)` form. `LOG10` and the rest of the math /
statistical functions are native.

### BigQuery caveats

BigQuery is a fully managed cloud warehouse — no Docker, no local instance.
CI runs the example's `verify.py` against `bigquery-public-data.thelook_ecommerce`,
gated on `GCP_PROJECT_ID` and `GCP_SA_KEY_B64` repo secrets (forks without
them skip cleanly). Authentication uses Google Application Default Credentials
(`$GOOGLE_APPLICATION_CREDENTIALS` pointing at a service-account JSON key,
plus `$GCP_PROJECT_ID` for billing). The `bigquery://` driver requires the
`sqlalchemy-bigquery` extra.

- **No FK introspection.** BigQuery exposes no foreign-key metadata via
`INFORMATION_SCHEMA`, so auto-ingestion cannot discover joins. Hand-declare
`ModelJoin`s on the model.
- **Dotted alias mangling.** BigQuery rejects column names containing `.`
(output schema names must match `[A-Za-z_][A-Za-z0-9_]*`), so SLayer
rewrites `<model>.<column>` aliases (`orders._count`,
`orders.products.category`) to `<model>___<column>` at emit time and
reverses the mapping on result rows. The triple-underscore separator is
distinct from `__` (used by `_query_as_model` for cross-model leaf
flattening), so the two encodings never collide. In `Column.sql`,
fully-qualified table paths must be backticked per-segment
(`` `project`.`dataset`.`table` ``) — a single backticked dotted path of
word-only segments (`` `my_dataset.my_table` ``) would false-positive
mangle.
- **No `MEDIAN` aggregate; `PERCENTILE_CONT` is analytic-only.** Both
raise at SQL generation time (sqlglot doesn't transpile the base class's
`PERCENTILE_CONT(p) WITHIN GROUP (ORDER BY x)` to BigQuery's analytic
form). Use a custom `Aggregation` with `APPROX_QUANTILES(x, 100)[OFFSET(N)]`
when you need it.
- **No native EXPLAIN.** BigQuery has no SQL-level `EXPLAIN`. The
`BigqueryDialect.explain_prefix` is `None`, so `engine.execute(...,
explain=True)` returns the dry-run SQL unchanged rather than an execution
plan.
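
The dotted-alias mangling in the list above can be sketched as follows. This is a minimal illustration, not SLayer's implementation: the function names are hypothetical, and two details are assumptions, namely that every `.` in the alias is replaced and that the reverse mapping is kept as a lookup table recorded at emit time rather than recovered by splitting on `___`.

```python
SEPARATOR = "___"  # triple underscore, as described above


def mangle_alias(alias: str) -> str:
    """Replace each '.' so the alias is a legal BigQuery output column name."""
    return alias.replace(".", SEPARATOR)


def build_alias_map(aliases):
    """Record mangled -> original at emit time so decoding is a plain lookup."""
    return {mangle_alias(alias): alias for alias in aliases}


def decode_row(row: dict, alias_map: dict) -> dict:
    """Restore dotted keys on a result row; unknown keys pass through unchanged."""
    return {alias_map.get(key, key): value for key, value in row.items()}


alias_map = build_alias_map(["orders._count", "orders.products.category"])
row = {"orders____count": 3, "orders___products___category": "Shoes"}
print(decode_row(row, alias_map))
```

Decoding through a recorded map sidesteps ambiguity for columns that themselves start with an underscore: `orders._count` mangles to `orders____count`, with four underscores.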

## Adding a new dialect

1. Add the mapping to `slayer/engine/query_engine.py:_dialect_for_type()`.
116 changes: 116 additions & 0 deletions docs/interfaces/pg-facade.md
@@ -142,6 +142,122 @@ Postgres-specific predicates that aren't valid SLayer DSL (`ILIKE`, `::cast`, re
`ANY`/`ALL`) parse but are rejected at execution — use the standard comparison / `IN` /
`BETWEEN` forms.

### `CAST(<column> AS <type>)` in projection

A projection of the shape `CAST(<column> AS <type>)` (and the equivalent `col::type`
sugar) is accepted when the inner expression is a bare or qualified column reference
**and** the (source, target) pair is in the allowlist below. The engine still executes
the bare column — the cast is a pure wire-layer type rewrite. The projected column's
Postgres OID is overridden to match the casted type.

Common BI shapes covered: `SELECT CAST(ordered_at AS TIMESTAMP) FROM orders` (DATE
column promoted for a TIMESTAMP-aware client), `SELECT CAST(amount AS TEXT) AS s
FROM orders` (stringification), `SELECT CAST(customers.region AS TEXT) FROM orders`
(joined column).

Out of scope: `CAST` around aggregates (`CAST(SUM(amount) AS DOUBLE)`), `TRY_CAST`,
and `CAST` around expressions that aren't a bare column (`CAST(SUBSTRING(...) AS T)`).
`CAST` wrapping a `DATE_TRUNC(...)` continues to route through the time-grain unwrap.

`CAST(...)` in `ORDER BY` and `GROUP BY` has two layers of admission:

1. **Unaliased canonical-form** (e.g. `ORDER BY CAST(c AS T)` repeating the
projection's CAST verbatim): **never admitted.** The translator raises
`ORDER BY column '...' is not in the projection list` / the GROUP BY
strict-on-extras error. Workaround: alias and reference the alias.
2. **Aliased reference** (`SELECT CAST(c AS T) AS x ... ORDER BY x` /
`... GROUP BY x`): admitted **only** when the `(source, target)` pair
preserves sort/group semantics under the bare-column engine projection.

Pairs that **fail** the aliased-reference admission and raise
`ORDER BY on CAST projection '...' with lossy pair X→T is unsupported`
(symmetric message for GROUP BY):

| Path | Lossy pairs |
|----------|--------------------------------------------------------------------------|
| ORDER BY | `X → TEXT` for every `X` (lex sort ≠ engine's natural sort) |
| GROUP BY | `TIMESTAMP → DATE` (many-to-one rollup); `INT → DOUBLE` (IEEE 754 collapse beyond ±2^53) |

Every other admitted pair, including identity (`X → X`), `DATE → TIMESTAMP`,
and (for ORDER BY only) `TIMESTAMP → DATE` and `INT → DOUBLE`, preserves the
casted semantics under the bare-column engine projection, so the alias path
stays open.
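
Why the pairs in the table count as lossy can be seen with plain Python values. The snippet below uses no SLayer APIs and the sample values are arbitrary; it only illustrates the three effects named in the table.

```python
from datetime import datetime

# ORDER BY, X -> TEXT: lexicographic order differs from numeric order.
ids = [9, 10, 2]
numeric_order = sorted(ids)
text_order = sorted(str(i) for i in ids)

# GROUP BY, TIMESTAMP -> DATE: distinct timestamps fall into one date bucket.
stamps = [datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 17, 30)]
timestamp_groups = set(stamps)
date_groups = {stamp.date() for stamp in stamps}

# GROUP BY, INT -> DOUBLE: integers beyond 2**53 collapse onto the same float.
big = 2**53
same_double = float(big) == float(big + 1)

print(numeric_order, text_order)
print(len(timestamp_groups), len(date_groups))
print(same_double)
```

In each case the engine, which works on the bare column, would sort or group differently from what the casted type implies, which is why the alias path is closed for those pairs.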

```sql
-- Always rejected (canonical form):
SELECT CAST(delivered_at AS TIMESTAMP) FROM orders
ORDER BY CAST(delivered_at AS TIMESTAMP);

-- Aliased reference, safe pair → works:
SELECT CAST(delivered_at AS TIMESTAMP) AS dt FROM orders
ORDER BY dt;

-- Aliased reference, lossy pair → rejected:
SELECT CAST(id AS TEXT) AS s FROM orders ORDER BY s;
SELECT CAST(ordered_at AS DATE) AS d, COUNT(*) FROM orders GROUP BY d;
```

The wire-type override still applies in the safe-pair case — `dt` is
wire-typed `TIMESTAMP` even though the engine sorts the underlying `DATE`.
A future ticket can lift the remaining restrictions by pushing the CAST
into the engine SQL.

Admitted (source, target) coercions:

| Source type | Admitted target types |
|---------------|------------------------------|
| `DATE` | `DATE`, `TIMESTAMP`, `TEXT` |
| `TIMESTAMP` | `TIMESTAMP`, `DATE`, `TEXT` |
| `INT` | `INT`, `DOUBLE`, `TEXT` |
| `DOUBLE` | `DOUBLE`, `TEXT` |
| `BOOLEAN` | `BOOLEAN`, `TEXT` |
| `TEXT` | `TEXT` |
| *(unknown)* | `TEXT` |

Pairs outside the allowlist (e.g. `CAST(name AS INT)`, `CAST(amount AS BOOLEAN)`)
raise `Unsupported CAST: cannot project <SOURCE> column as <TARGET> (...). Admitted
coercions: see docs/interfaces/pg-facade.md.` Unsupported target types (`UUID`,
`JSON`, `ARRAY`, `STRUCT`, …) raise the standard `Unsupported projection
expression` error.
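
The allowlist lookup can be pictured as a small table-driven check. The sketch below is hypothetical: the names `ADMITTED` and `check_cast`, the use of plain strings for types, the `ValueError` exception class, and the shortened message are all assumptions made for illustration, not the facade's actual code.

```python
# (source type) -> admitted target types, transcribed from the table above.
ADMITTED = {
    "DATE": {"DATE", "TIMESTAMP", "TEXT"},
    "TIMESTAMP": {"TIMESTAMP", "DATE", "TEXT"},
    "INT": {"INT", "DOUBLE", "TEXT"},
    "DOUBLE": {"DOUBLE", "TEXT"},
    "BOOLEAN": {"BOOLEAN", "TEXT"},
    "TEXT": {"TEXT"},
    None: {"TEXT"},  # unknown source type
}


def check_cast(source, target):
    """Return the target type if (source, target) is admitted, else raise."""
    if target not in ADMITTED.get(source, set()):
        raise ValueError(
            f"Unsupported CAST: cannot project {source or 'UNKNOWN'} column as {target}"
        )
    return target


print(check_cast("DATE", "TIMESTAMP"))
print(check_cast(None, "TEXT"))
```

A call such as `check_cast("DOUBLE", "INT")` raises, matching the exclusion of that pair from the table.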

#### CAST coarse-OID mapping

CAST is a **coarse wire-OID hint**, not a precision-preserving conversion.
The SLayer engine projects the bare column unchanged; the pg-facade encoder
is OID-driven, so the wire bytes always match the OID we advertise. Some
PostgreSQL types the user can write in a CAST don't have a one-to-one
SLayer equivalent — those collapse onto the nearest broader SLayer type:

| User wrote in `CAST(... AS X)` | SLayer maps to | Wire OID advertised |
|---|---|---|
| `INTEGER` / `INT` (pre-existing) | `DataType.INT` | 20 (`int8`) — not 23 (`int4`) |
| `SMALLINT` | `DataType.INT` | 20 (`int8`) — not 21 (`int2`) |
| `TINYINT` / `MEDIUMINT` (non-Postgres widths) | `DataType.INT` | 20 (`int8`) |
| `BIGINT` | `DataType.INT` | 20 (`int8`) ✓ exact match |
| `DECIMAL` / `NUMERIC` | `DataType.DOUBLE` | 701 (`float8`) — not 1700 (`numeric`) |
| `FLOAT` / `REAL` / `DOUBLE` | `DataType.DOUBLE` | 701 (`float8`) ✓ |
| `TIMESTAMPTZ` / `TIMESTAMP WITH TIME ZONE` | `DataType.TIMESTAMP` | 1114 (`timestamp`, no TZ) — not 1184 (`timestamptz`) |
| `TIMESTAMP` / `DATETIME` | `DataType.TIMESTAMP` | 1114 (`timestamp`) ✓ |

What this means in practice:

- The wire bytes the client receives are always consistent with the OID we
advertise (the encoder picks the binary/text form from the OID). There is
no value corruption.
- The OID is potentially broader than what the user typed. A client that
asked for `NUMERIC` and got `float8` sees a float on the wire and decodes
it correctly as a float — but loses the "exact precision" expectation.
A client that asked for `TIMESTAMPTZ` sees naive `timestamp` bytes — and
loses TZ-aware decoding semantics.
- Callers needing exact `NUMERIC` precision, narrow integer wire widths, or
TZ-aware timestamps must compute upstream (or wait for SLayer to model
those types natively).
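
The two-step collapse in the table (user-written type name, then SLayer type, then advertised OID) can be sketched as a pair of lookups. The names below are hypothetical and SLayer types are represented as plain strings rather than `DataType` members; the OID values are the standard PostgreSQL ones listed in the table.

```python
# User-written CAST target -> coarse SLayer type (strings stand in for DataType).
SLAYER_TYPE = {
    "INT": "INT", "INTEGER": "INT", "SMALLINT": "INT",
    "TINYINT": "INT", "MEDIUMINT": "INT", "BIGINT": "INT",
    "DECIMAL": "DOUBLE", "NUMERIC": "DOUBLE",
    "FLOAT": "DOUBLE", "REAL": "DOUBLE", "DOUBLE": "DOUBLE",
    "TIMESTAMPTZ": "TIMESTAMP", "TIMESTAMP WITH TIME ZONE": "TIMESTAMP",
    "TIMESTAMP": "TIMESTAMP", "DATETIME": "TIMESTAMP",
}

# Coarse SLayer type -> PostgreSQL wire OID (int8, float8, timestamp).
WIRE_OID = {"INT": 20, "DOUBLE": 701, "TIMESTAMP": 1114}


def advertised_oid(cast_target: str) -> int:
    """OID advertised for a CAST target, after collapsing to the SLayer type."""
    return WIRE_OID[SLAYER_TYPE[cast_target.upper()]]


for name in ("SMALLINT", "NUMERIC", "TIMESTAMPTZ"):
    print(name, advertised_oid(name))
```

Because the encoder is driven by the advertised OID, a `SMALLINT` cast is sent as `int8` and a `NUMERIC` cast as `float8`: consistent on the wire, but broader than what was requested.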

`DOUBLE → INT` is intentionally excluded: Python's `int(<float>)` truncates toward zero
while Postgres rounds half-to-even, so silently admitting the pair would diverge from
`psql` semantics. Pre-aggregate or pre-round on your side when an integer-typed result
is required.
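
The divergence is easy to see without a database. In the sketch below, Python's built-in `round()` (which rounds half to even) is used as a stand-in for the Postgres behaviour described above; it is not a call into Postgres.

```python
values = [0.5, 1.5, 2.5, -2.5, 2.7, -2.7]

# int() truncates toward zero.
truncated = [int(v) for v in values]

# round() rounds to nearest, ties to even.
half_even = [round(v) for v in values]

for value, t, h in zip(values, truncated, half_even):
    marker = "" if t == h else "  <- differs"
    print(f"{value:>5}: int() -> {t:>2}, half-to-even -> {h:>2}{marker}")
```

The two disagree on `1.5`, `2.7`, and `-2.7`, so a facade that converted with `int()` would return different integers than `psql` for the same query.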

## Introspection

* `INFORMATION_SCHEMA.METRICS` / `DIMENSIONS` / `SCHEMATA` / `TABLES` / `COLUMNS`.
26 changes: 26 additions & 0 deletions slayer/facade/catalog_sql.py
@@ -1222,6 +1222,7 @@ def _substitute_context_functions(self, node: exp.Expression) -> exp.Expression:
# Try each substitution branch in order; first hit wins.
substituted = (
self._substitute_qualified_context_call(node)
or self._substitute_qualified_context_column(node)
or self._substitute_dedicated_func(node)
or self._substitute_bareword_column(node)
or self._substitute_anonymous_function(node)
@@ -1253,6 +1254,31 @@ def _substitute_qualified_context_call(
or self._substitute_anonymous_function(rhs)
)

    def _substitute_qualified_context_column(
        self, node: exp.Expression,
    ) -> Optional[exp.Expression]:
        """Replace ``pg_catalog.<bareword-ctx-fn>`` where sqlglot parses the
        whole thing as ``Column(this=<ctx-fn>, table='pg_catalog')`` — the
        no-parens shape (``pg_catalog.current_user``,
        ``pg_catalog.current_catalog``). The Dot-shaped variant
        (``pg_catalog.current_database()``) is handled by
        ``_substitute_qualified_context_call``.
        """
        if not isinstance(node, exp.Column):
            return None
        table = node.args.get("table")
        if table is None:
            return None
        table_name = (
            str(table.this) if hasattr(table, "this") else str(table)
        ).lower()
        if table_name != "pg_catalog":
            return None
        ident = node.this
        if not isinstance(ident, exp.Identifier):
            return None
        return self._literal_for_context_name(str(ident.this).lower())

def _substitute_dedicated_func(self, node: exp.Expression) -> Optional[exp.Expression]:
"""Dedicated sqlglot Func subclasses (typed nodes for niladic ctx fns)."""
if isinstance(node, (exp.CurrentDatabase, getattr(exp, "CurrentCatalog", exp.CurrentDatabase))):