diff --git a/README.md b/README.md index ac8244e..107a7e9 100644 --- a/README.md +++ b/README.md @@ -128,8 +128,13 @@ Agents Schema is the shared, queryable metadata surface for consumers that start from the warehouse and need context about data that already exists there. It is closest in spirit to `information_schema`, but extensible across many -providers. Compared with MCP servers, Agents Schema is narrower: it publishes -context inside the warehouse, while MCP servers can expose tools, actions, and +providers. In fact `AGENTS.TABLES` and `AGENTS.COLUMNS` are drop-in enriched +versions of `INFORMATION_SCHEMA.TABLES`/`COLUMNS`: the native columns plus +provider-prefixed context (`dbt_description`, `lookml_ai_context`, …). Because +`INFORMATION_SCHEMA` is per-database, these views cover the database that holds +the `AGENTS` schema — point the workflows at the database your data lives in. +Compared with MCP servers, Agents Schema is narrower: it publishes context +inside the warehouse, while MCP servers can expose tools, actions, and source-specific workflows. ### How it works diff --git a/SPEC.md b/SPEC.md index 7e5f62c..22b683b 100644 --- a/SPEC.md +++ b/SPEC.md @@ -82,6 +82,63 @@ The current package delivers one table family per metadata source: Each ingestion replaces its own table family with `CREATE OR REPLACE TABLE` and then inserts the rows parsed from the source metadata. +Each ingestion also refreshes provider-normalized views and generic context views over whichever provider tables currently exist. These views are intended to be familiar drop-in starting points for agents that would otherwise reach for `INFORMATION_SCHEMA.TABLES` or `INFORMATION_SCHEMA.COLUMNS`, while preserving source-provider references for deeper inspection. + +The generic views are documented in `AGENTS.ROOT` under the `core` provider. + +| View | Purpose | +|---|---| +| `AGENTS.TABLES` | `INFORMATION_SCHEMA.TABLES` enriched with matching provider table context. 
| +| `AGENTS.COLUMNS` | `INFORMATION_SCHEMA.COLUMNS` enriched with matching provider column context. | + +--- + +## Generic Context Views + +### Scope + +v1 extends the surfaces `INFORMATION_SCHEMA` already has, `TABLES` and `COLUMNS`, rather than inventing new object types. Relationships, metrics, and entities are intentionally out of scope: the information-schema-faithful home for relationships is the `REFERENTIAL_CONSTRAINTS` / `KEY_COLUMN_USAGE` family (a future extension), and metrics/entities are object types that semantic providers such as OSI already model in their own `AGENTS.OSI_*` tables. The generic views enrich; they do not become a competing semantic model. + +### Merge model + +Each provider publishes a normalized `AGENTS.<PROVIDER>_TABLES` / `AGENTS.<PROVIDER>_COLUMNS` view with a shared shape. The generic views then take the native `INFORMATION_SCHEMA` view as the row spine via `SELECT t.*` (so they inherit whatever native columns the account exposes; nothing is hardcoded) and **left join every provider view that exists** by object identity: + +- `AGENTS.TABLES`: `INFORMATION_SCHEMA.TABLES` joined to each `*_TABLES` view on `table_catalog` / `table_schema` / `table_name`. +- `AGENTS.COLUMNS`: `INFORMATION_SCHEMA.COLUMNS` joined to each `*_COLUMNS` view on `table_catalog` / `table_schema` / `table_name` / `column_name`. + +The merge is generic and provider-agnostic. Each provider's enrichment columns are appended under a `<provider>_` prefix (`dbt_description`, `lookml_ai_context`, `osi_description`, …), so providers never collide and no native column is overwritten. Within a single provider, rows are aggregated to one row per object identity before the join, so duplicate provider rows cannot multiply native rows. A provider that ships a new `*_TABLES`/`*_COLUMNS` view later (for example, a memory provider contributing `memory_*` counts) is picked up automatically with no change to the core views. 
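The aggregate-before-join step described above can be illustrated with a minimal sketch. It assumes SQLite as a stand-in for the warehouse, and the table names `native_tables` and `dbt_tables` are hypothetical; this is not the SQL the package generates, only the same shape: collapse the provider to one row per identity, then left join onto the native spine and expose the result under a provider prefix.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    -- Hypothetical stand-in for INFORMATION_SCHEMA.TABLES (the row spine).
    CREATE TABLE native_tables (table_schema TEXT, table_name TEXT, row_count INTEGER);
    INSERT INTO native_tables VALUES
        ('ANALYTICS', 'ORDERS', 100),
        ('ANALYTICS', 'CUSTOMERS', 20);

    -- Hypothetical provider view that contains a duplicate row for one object.
    CREATE TABLE dbt_tables (table_schema TEXT, table_name TEXT, description TEXT);
    INSERT INTO dbt_tables VALUES
        ('analytics', 'orders', 'One row per order'),
        ('analytics', 'orders', 'One row per order');
    """
)

rows = con.execute(
    """
    SELECT t.*, dbt.description AS dbt_description
    FROM native_tables t
    LEFT JOIN (
        -- Aggregate to one row per identity so duplicates cannot fan out.
        SELECT table_schema, table_name, MIN(description) AS description
        FROM dbt_tables
        WHERE table_name IS NOT NULL
        GROUP BY table_schema, table_name
    ) dbt
      ON LOWER(t.table_schema) = LOWER(dbt.table_schema)
     AND LOWER(t.table_name) = LOWER(dbt.table_name)
    ORDER BY t.table_name
    """
).fetchall()

for row in rows:
    print(row)
```

Each native row appears exactly once: the table without provider metadata keeps a NULL `dbt_description`, and the duplicated provider row does not multiply the matching native row.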
+ +`SELECT t.*` resolves against the `INFORMATION_SCHEMA` of the database that holds the `AGENTS` schema, so `AGENTS.TABLES`/`COLUMNS` cover objects in that database. Provider-specific detail not promoted into the shared shape stays in the source tables (for example `AGENTS.LOOKML_DIMENSION`) and is reachable through the `<provider>_source_object_id` columns. + +### `AGENTS.TABLES` + +`SELECT t.*` from `INFORMATION_SCHEMA.TABLES` plus, for each participating provider, the following prefixed columns: + +| Column | Description | +|---|---| +| `<provider>_table_type` | Provider object kind, such as `DBT_MODEL` or `OSI_DATASET`. | +| `<provider>_display_name` | Provider label for the matched table. | +| `<provider>_description` | Provider description for the matched table. | +| `<provider>_ai_context` | Provider AI context for the matched table. | +| `<provider>_source_object_id` | Provider-specific object identifier(s). | +| `<provider>_source_path` | Source file path when available. | +| `<provider>_materialization` | Provider materialization when available. | +| `<provider>_tags` | Provider tags when available. | + +### `AGENTS.COLUMNS` + +`SELECT t.*` from `INFORMATION_SCHEMA.COLUMNS` plus, for each participating provider, the following prefixed columns: + +| Column | Description | +|---|---| +| `<provider>_display_name` | Provider label for the matched column. | +| `<provider>_description` | Provider description. | +| `<provider>_ai_context` | Provider AI context when available. | +| `<provider>_semantic_type` | Provider semantic field kind when available. | +| `<provider>_is_time_dimension` | Whether the field is marked time-like. | +| `<provider>_expression` | Provider expression or SQL when available. | +| `<provider>_source_object_id` | Provider-specific object identifier. 
| + --- ## Source: dbt @@ -446,6 +503,12 @@ The current source provider names are: | `lookml` | `AGENTS.LOOKML_*` | | `osi` | `AGENTS.OSI_*` | +The current core provider name is: + +| Provider | Objects | +|---|---| +| `core` | `AGENTS.ROOT`, `AGENTS.TABLES`, `AGENTS.COLUMNS` | + --- ## Summary of Current Tables @@ -453,6 +516,8 @@ The current source provider names are: | Table | Source | Purpose | |---|---|---| | `AGENTS.ROOT` | core | Provider registry upserted by dbt, LookML, and OSI workflows | +| `AGENTS.TABLES` | core | `INFORMATION_SCHEMA.TABLES` enriched from provider `*_TABLES` views | +| `AGENTS.COLUMNS` | core | `INFORMATION_SCHEMA.COLUMNS` enriched from provider `*_COLUMNS` views | | `AGENTS.DBT_MODEL` | dbt | dbt models with schema, materialization, documentation, path, and tags | | `AGENTS.DBT_COLUMN` | dbt | Documented dbt model columns | | `AGENTS.DBT_DEPENDENCY` | dbt | Direct dbt dependency edges | diff --git a/proposals/agent-schema-views.md b/proposals/agent-schema-views.md new file mode 100644 index 0000000..b22e05d --- /dev/null +++ b/proposals/agent-schema-views.md @@ -0,0 +1,383 @@ +# Agents Schema Context Views Proposal + +**Status:** Proposal +**Branch:** `agent_schema_views` + +## Summary + +Add information-schema-like views to Agents Schema so agents can query richer metadata through familiar object names: + +```sql +AGENTS.TABLES +AGENTS.COLUMNS +AGENTS.RELATIONSHIPS +AGENTS.METRICS +AGENTS.ENTITIES +``` + +The goal is to make Agents Schema instantly swappable for common `INFORMATION_SCHEMA` exploration patterns while adding richer context. Anywhere an agent would normally ask `INFORMATION_SCHEMA.TABLES` or `INFORMATION_SCHEMA.COLUMNS`, it should be able to ask `AGENTS.TABLES` or `AGENTS.COLUMNS` instead and get the familiar shape plus dbt descriptions, LookML/OSI semantic metadata, source provider references, and eventually profiling or usage context. 
+ +## v1 Scope (Implemented) + +The shipped v1 is deliberately narrower than the full proposal below, which is retained as the longer-term design sketch. + +- **Only the surfaces `INFORMATION_SCHEMA` already has: `AGENTS.TABLES` and `AGENTS.COLUMNS`.** `RELATIONSHIPS`, `METRICS`, and `ENTITIES` are deferred. They are *new* object types that semantic providers like OSI already model in their own tables; adding generic versions now would make this a competing semantic model rather than an information-schema extension. The information-schema-faithful home for relationships is the `REFERENTIAL_CONSTRAINTS` / `KEY_COLUMN_USAGE` family, which is the intended future shape rather than a custom `AGENTS.RELATIONSHIPS` view. +- **Native spine via `SELECT t.*`.** `AGENTS.TABLES`/`COLUMNS` select `t.*` from `INFORMATION_SCHEMA.TABLES`/`COLUMNS` and inherit whatever native columns the account exposes. No native column list is hardcoded. +- **Generic identity merge.** Each provider's `*_TABLES`/`*_COLUMNS` view is left joined by object identity (catalog/schema/table, plus column for columns), with its enrichment columns appended under a `<provider>_` prefix. The set of providers is discovered, not hardcoded. Within a provider, rows are aggregated to one per identity to prevent fanout. +- **No hardcoded memory counts.** Memory participation is purely additive: when a memory provider later publishes its own `*_TABLES`/`*_COLUMNS` view exposing counts, those columns appear automatically. The core views contain no memory-specific logic. +- **Fail-soft.** View creation runs at the end of each provider ingestion but never fails the ingestion; a view that cannot be created logs a warning and is skipped. + +## Motivation + +Most SQL agents already know to inspect: + +```sql +INFORMATION_SCHEMA.TABLES +INFORMATION_SCHEMA.COLUMNS +``` + +But native information schema is too thin for analytic work. 
It can tell an agent that a column exists, but not: + +- which dbt model documented it +- whether it has LookML or OSI semantic context +- which metric uses it +- whether joining it causes fanout +- whether amounts need scaling +- whether a table is a semantic dataset, staging table, or source mirror + +Agents Schema already has source-specific tables. Context views would provide a generic layer over them. + +## Design Principles + +- **Views, not a new source of truth.** Source provider tables remain canonical. +- **Information-schema swappable.** Preserve familiar view names and core columns so agents can reuse existing `INFORMATION_SCHEMA` habits with a richer source. +- **Provider-owned normalization.** Providers publish their own `AGENTS.<PROVIDER>_TABLES`, `AGENTS.<PROVIDER>_COLUMNS`, and related normalized views when they want to participate in the generic layer. +- **Provider-aware.** Preserve `source_provider` and `source_object_id` so agents can drill down. +- **Composable with memory.** If the memory provider exists, views can expose memory/warning counts and optional compact memory text. +- **Sparse first.** Start with dbt/LookML/OSI fields already available today; add warehouse-native metadata later. + +## Proposed Views + +The first columns in each view should intentionally resemble the equivalent `INFORMATION_SCHEMA` view where one exists. Agents and existing metadata snippets should be able to select familiar columns first, then opt into the extended columns. + +Core generic views should not directly know every source table. 
Instead, each provider maps its native metadata into provider-normalized views with the shared shape: + +```text +AGENTS.DBT_TABLES +AGENTS.DBT_COLUMNS +AGENTS.DBT_RELATIONSHIPS +AGENTS.LOOKML_TABLES +AGENTS.LOOKML_COLUMNS +AGENTS.LOOKML_METRICS +AGENTS.OSI_TABLES +AGENTS.OSI_COLUMNS +AGENTS.OSI_RELATIONSHIPS +AGENTS.OSI_METRICS +``` + +Then `AGENTS.TABLES` is a merged information-schema view over every provider `*_TABLES` view: + +```text +AGENTS.TABLES = + INFORMATION_SCHEMA.TABLES + LEFT JOIN provider *_TABLES views by table_catalog/table_schema/table_name +``` + +Provider-specific fields are appended with provider prefixes, such as `dbt_description`, `lookml_ai_context`, or `osi_source_object_id`. This keeps native columns like `table_name` unambiguous while letting providers enrich matching warehouse tables. + +Other generic views can start as unions over provider-normalized views until they get their own native information-schema spine. Provider-specific detail remains in the raw provider tables and is reachable through provider-prefixed source object columns. + +### `AGENTS.TABLES` + +One row per native warehouse table or view from `INFORMATION_SCHEMA.TABLES`, enriched by any provider-normalized `*_TABLES` view that matches the same catalog/schema/table identity. 
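The identity match can be sketched in plain Python. This is an illustration only: `Identity` and `matches` are hypothetical names, and the rule shown (case-folded comparison, with a missing provider catalog matching any native catalog) mirrors the join condition built in `views.py` rather than any public API.

```python
from __future__ import annotations

from typing import NamedTuple


class Identity(NamedTuple):
    """Hypothetical catalog/schema/table triple; the catalog may be unknown."""

    table_catalog: str | None
    table_schema: str
    table_name: str


def matches(native: Identity, provider: Identity) -> bool:
    """Return True when a provider row should enrich a native row."""
    # A provider row without a catalog matches the native row in any catalog.
    if provider.table_catalog is not None:
        if native.table_catalog is None:
            return False
        if native.table_catalog.lower() != provider.table_catalog.lower():
            return False
    # Schema and table names are compared case-insensitively.
    return (
        native.table_schema.lower() == provider.table_schema.lower()
        and native.table_name.lower() == provider.table_name.lower()
    )


native = Identity("PROD", "ANALYTICS", "ORDERS")
print(matches(native, Identity(None, "analytics", "orders")))   # catalog omitted
print(matches(native, Identity("dev", "analytics", "orders")))  # different catalog
```

Because matching is by case-folded name rather than by a guaranteed-unique key, two warehouse objects whose names differ only by case would receive the same enrichment.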
+ +Suggested columns: + +```text +table_catalog +table_schema +table_name +table_type +table_owner +is_transient +clustering_key +row_count +bytes +retention_time +created +last_altered +comment +dbt_description +dbt_source_object_id +dbt_source_path +dbt_materialization +dbt_tags +lookml_description +lookml_ai_context +lookml_source_object_id +osi_description +osi_ai_context +osi_source_object_id +memories_count +warnings_count +``` + +Provider mappings: + +| Source | Mapping | +|---|---| +| `DBT_MODEL` | `schema_name`, `name`, `description`, `materialization`, `file_path`, `tags` | +| `LOOKML_VIEW` | `sql_table_name` when parseable, `name`, `label`, `description`, `ai_context`, `file_path` | +| `OSI_DATASET` | `source_table`, `name`, `description`, `ai_context` | + +Memory contribution: + +- table-anchored memories increment `memories_count` +- warning-bearing table memories increment `warnings_count` +- an optional future `agent_memories` field can aggregate compact memory summaries + +### `AGENTS.COLUMNS` + +One row per field/column-like object. 
+ +Suggested columns: + +```text +table_catalog +table_schema +table_name +column_name +ordinal_position +data_type +is_nullable +display_name +description +ai_context +semantic_type +is_time_dimension +expression +source_provider +source_object_id +memories_count +warnings_count +``` + +Provider mappings: + +| Source | Mapping | +|---|---| +| `DBT_COLUMN` + `DBT_MODEL` | model schema/name, `column_name`, `data_type`, `description` | +| `LOOKML_DIMENSION` | `view_name`, `field_name`, `field_kind`, `type`, `sql`, `description`, `ai_context`, `primary_key` | +| `LOOKML_MEASURE` | `view_name`, `measure_name`, `type`, `sql`, `description`, `ai_context`, `filters` | +| `OSI_FIELD` + `OSI_DATASET` | dataset source table/name, `field_name`, `label`, `description`, `ai_context`, `is_time_dimension`, `expression` | + +Memory contribution: + +- column-anchored memories attach directly +- unit rules, enum meanings, timezone warnings, and null semantics can show up in memory counts + +### `AGENTS.RELATIONSHIPS` + +One row per relationship or dependency edge. + +Suggested columns: + +```text +relationship_name +from_catalog +from_schema +from_table +from_column +to_catalog +to_schema +to_table +to_column +relationship_type +multiplicity +source_provider +source_object_id +memories_count +warnings_count +``` + +Provider mappings: + +| Source | Mapping | +|---|---| +| `DBT_DEPENDENCY` | lineage edge from upstream node to downstream model | +| `OSI_RELATIONSHIP` | explicit semantic relationship with from/to datasets and columns | +| LookML explores | future: join graph from explore definitions once modeled in a table | + +Memory contribution: + +- relationship-anchored memories attach directly +- fanout warnings and safe-join rules surface during join planning + +### `AGENTS.METRICS` + +One row per metric or measure-like semantic object. 
+ +Suggested columns: + +```text +metric_name +display_name +description +ai_context +expression +source_provider +source_object_id +dataset_name +view_name +memories_count +warnings_count +``` + +Provider mappings: + +| Source | Mapping | +|---|---| +| `OSI_METRIC` | metric name, description, ai_context, expression | +| `LOOKML_MEASURE` | measure name, view name, type/sql/filter expression, description, ai_context | +| dbt semantic layer | future provider | + +Memory contribution: + +- metric-anchored memories attach directly +- calculation caveats, exclusions, date policies, and unit rules show up near metrics + +### `AGENTS.ENTITIES` + +One row per canonical business entity when a provider contributes entity metadata. + +Suggested columns: + +```text +entity_id +display_name +description +source_provider +source_object_id +primary_table_schema +primary_table_name +primary_key_columns +memories_count +warnings_count +``` + +Initial provider mappings may be sparse. OSI entity-like structures, dbt semantic models, or custom providers can populate this later. 
+ +Memory contribution: + +- entity-anchored memories define identity rules and cross-source mappings +- examples: account is canonical customer, email is not stable identity, subscription is billing relationship not customer + +## Example Queries + +Column lookup with richer context: + +```sql +SELECT + table_schema, + table_name, + column_name, + data_type, + description, + ai_context, + memories_count, + warnings_count +FROM AGENTS.COLUMNS +WHERE LOWER(column_name) LIKE '%amount%'; +``` + +Find semantic tables with warning-bearing memories: + +```sql +SELECT + table_schema, + table_name, + description, + source_provider, + warnings_count +FROM AGENTS.TABLES +WHERE warnings_count > 0; +``` + +Find metrics with context: + +```sql +SELECT + metric_name, + description, + ai_context, + expression, + source_provider +FROM AGENTS.METRICS +WHERE LOWER(metric_name) IN ('arr', 'mrr', 'revenue'); +``` + +## Memory Provider Interaction + +The views should not own memories. They should consume a memory provider if present. + +If `AGENTS.MEMORY` and `AGENTS.MEMORY_ANCHOR` exist: + +- `AGENTS.TABLES` can count table-anchored memories. +- `AGENTS.COLUMNS` can count column-anchored memories. +- `AGENTS.RELATIONSHIPS` can count relationship-anchored memories. +- `AGENTS.METRICS` can count metric-anchored memories. +- `AGENTS.ENTITIES` can count entity-anchored memories. + +This keeps memory normalized while making the generic views useful for agents that do not know how to join memory tables yet. + +## Should Views Be In Core? + +Yes eventually, but they can start as a proposal or optional package because they introduce cross-provider semantics. + +The current source tables are provider-owned and easy to reason about. 
Views add a second layer: + +```text +source provider tables -> provider-normalized views -> generic context views -> agent queries +``` + +That layer should have tests that pin: + +- row identity rules +- duplicate handling when multiple providers describe the same object +- how descriptions and `ai_context` are selected or combined +- behavior when the memory provider is absent + +## Duplicate And Merge Policy + +The hard part is not defining view columns; it is merging provider records. + +**v1 approach (implemented):** merge by object identity onto the native +`INFORMATION_SCHEMA` spine, with each provider's columns appended under a +`<provider>_` prefix. Providers therefore never collide: there is no +cross-provider "which source wins" decision, because each keeps its own +namespaced columns. Within a single provider, rows are aggregated to one row +per identity before the join so duplicate provider rows cannot multiply native +rows. + +The earlier sketch below considered the alternative of emitting one row per +provider object (a union) and letting agents pick a source. v1 chose prefixed +merge instead, since it preserves the one-row-per-object grain that makes the +views information-schema-swappable. A coalesced single `description`/`ai_context` +with a trust order remains a possible future option. + +- preserve `source_provider` and `<provider>_source_object_id` for drill-down +- later versions can add canonicalization if Agents Schema gains stable warehouse object identifiers + +## Resolved Decisions (v1) + +- **Warehouse views, refreshed per ingestion** (fail-soft), not CLI-materialized tables. +- **Native objects are the spine.** `TABLES`/`COLUMNS` start from `INFORMATION_SCHEMA` and enrich; they are not provider-only unions. +- **Memory counts are omitted until the memory provider ships its own view.** No reserved-but-zero columns. +- **Measures live in the deferred `METRICS` surface, not `COLUMNS`.** v1 columns are physical/field-like only. 
+- **dbt, LookML, and OSI all participate in v1** (LookML/OSI `sql_table_name`/`source_table` are parsed into identity). + +## Open Questions + +- When relationships land, confirm the `REFERENTIAL_CONSTRAINTS` / `KEY_COLUMN_USAGE` shape over a custom view, including how unenforced/OSI relationships are represented when the native constraint views are empty. +- Cross-database coverage: `INFORMATION_SCHEMA` is per-database, so `AGENTS.TABLES`/`COLUMNS` only cover the database holding `AGENTS`. Should multi-database deployments use `SNOWFLAKE.ACCOUNT_USAGE` (account-wide, latent) as an alternate spine? +- Should provider enrichment be prefixed columns (current) or also offer a coalesced single `description`/`ai_context` with a trust order? diff --git a/src/agents_schema/dbt.py b/src/agents_schema/dbt.py index 5200a01..97103c2 100644 --- a/src/agents_schema/dbt.py +++ b/src/agents_schema/dbt.py @@ -6,6 +6,7 @@ from .destinations import Column, Destination, TableSchema, open_destination from .root import upsert_provider_root +from .views import create_context_views __all__ = ["run"] @@ -51,6 +52,7 @@ def run(cfg: dict) -> None: upsert_provider_root(dest, "dbt") _create_tables(dest) _ingest(dest, manifest) + create_context_views(dest) def _load_manifest(path: Path) -> dict: diff --git a/src/agents_schema/destinations.py b/src/agents_schema/destinations.py index 94493ba..2709173 100644 --- a/src/agents_schema/destinations.py +++ b/src/agents_schema/destinations.py @@ -37,6 +37,8 @@ def array_indexes(self) -> set[int]: class Destination(Protocol): def replace_table(self, table: TableSchema) -> None: ... + def replace_view(self, name: str, sql: str) -> None: ... + def existing_table_names(self) -> set[str]: ... def upsert_rows(self, table: TableSchema, rows: Iterable[tuple[Any, ...]]) -> None: ... def insert_rows(self, table: TableSchema, rows: Iterable[tuple[Any, ...]]) -> None: ... def close(self) -> None: ... 
@@ -65,6 +67,22 @@ def replace_table(self, table: TableSchema) -> None: cur.execute(f"CREATE SCHEMA IF NOT EXISTS {self._agents_schema}") cur.execute(_create_table_sql(table, self._agents_schema)) + def replace_view(self, name: str, sql: str) -> None: + with self._con.cursor() as cur: + cur.execute(f"CREATE SCHEMA IF NOT EXISTS {self._agents_schema}") + cur.execute(_create_view_sql(name, sql, self._agents_schema)) + + def existing_table_names(self) -> set[str]: + with self._con.cursor() as cur: + cur.execute( + "SELECT LOWER(table_name) " + "FROM information_schema.tables " + "WHERE table_schema = UPPER(%s) " + "AND table_type = 'BASE TABLE'", + (self._agents_schema,), + ) + return {str(row[0]) for row in cur.fetchall()} + def upsert_rows(self, table: TableSchema, rows: Iterable[tuple[Any, ...]]) -> None: bind_rows = _bind_rows(table, rows) if not bind_rows: @@ -250,6 +268,10 @@ def _create_table_if_not_exists_sql(table: TableSchema, schema: str) -> str: return _create_table_statement_sql("CREATE TABLE IF NOT EXISTS", table, schema) +def _create_view_sql(name: str, sql: str, schema: str) -> str: + return f"CREATE OR REPLACE VIEW {schema}.{_identifier(name)} AS\n{sql}" + + def _create_table_statement_sql(prefix: str, table: TableSchema, schema: str) -> str: definitions = [] for column in table.columns: diff --git a/src/agents_schema/lookml.py b/src/agents_schema/lookml.py index d130b11..36620aa 100644 --- a/src/agents_schema/lookml.py +++ b/src/agents_schema/lookml.py @@ -8,6 +8,7 @@ from .destinations import Column, Destination, TableSchema, open_destination from .root import upsert_provider_root +from .views import create_context_views __all__ = ["run"] @@ -86,6 +87,7 @@ def run(cfg: dict) -> None: upsert_provider_root(dest, "lookml") _create_tables(dest) _ingest(dest, files, lookml_dir) + create_context_views(dest) def _load_lookml_files(lookml_dir: Path) -> list[Path]: diff --git a/src/agents_schema/osi.py b/src/agents_schema/osi.py index 368b79b..217840e 100644 
--- a/src/agents_schema/osi.py +++ b/src/agents_schema/osi.py @@ -10,6 +10,7 @@ from .destinations import Column, Destination, TableSchema, open_destination from .root import upsert_provider_root +from .views import create_context_views __all__ = ["run"] @@ -67,6 +68,7 @@ def run(cfg: dict) -> None: upsert_provider_root(dest, "osi") _create_tables(dest) _ingest(dest, models) + create_context_views(dest) def _load_osi_files(osi_dir: Path) -> list[dict]: diff --git a/src/agents_schema/root.py b/src/agents_schema/root.py index c10b125..9677452 100644 --- a/src/agents_schema/root.py +++ b/src/agents_schema/root.py @@ -16,11 +16,19 @@ ) ROOT_ENTRIES = { + "core": ( + ("overview", "# Core\nShared Agents Schema registry and generic context views."), + ("root", "Provider registry. See AGENTS.ROOT."), + ("tables", "Information-schema-like table context view: information_schema.tables enriched from provider *_TABLES views. See AGENTS.TABLES."), + ("columns", "Information-schema-like column context view: information_schema.columns enriched from provider *_COLUMNS views. See AGENTS.COLUMNS."), + ), "dbt": ( ("overview", "# dbt\nTransformation metadata from dbt manifest.json."), ("model", "One row per dbt model. See AGENTS.DBT_MODEL."), ("column", "One row per documented dbt model column. See AGENTS.DBT_COLUMN."), ("dependency", "Direct dbt DAG edges. See AGENTS.DBT_DEPENDENCY."), + ("tables", "Provider-normalized table context view. See AGENTS.DBT_TABLES."), + ("columns", "Provider-normalized column context view. See AGENTS.DBT_COLUMNS."), ), "lookml": ( ("overview", "# LookML\nSemantic metadata parsed from LookML files."), @@ -28,6 +36,8 @@ ("dimension", "One row per LookML dimension or dimension group. See AGENTS.LOOKML_DIMENSION."), ("measure", "One row per LookML measure. See AGENTS.LOOKML_MEASURE."), ("explore", "One row per LookML explore. See AGENTS.LOOKML_EXPLORE."), + ("tables", "Provider-normalized table context view. 
See AGENTS.LOOKML_TABLES."), + ("columns", "Provider-normalized column context view. See AGENTS.LOOKML_COLUMNS."), ), "osi": ( ("overview", "# OSI\nOpen Semantic Interchange metadata parsed from *.osi.yaml files."), @@ -35,6 +45,8 @@ ("field", "One row per OSI dataset field. See AGENTS.OSI_FIELD."), ("metric", "One row per OSI metric. See AGENTS.OSI_METRIC."), ("relationship", "One row per OSI relationship. See AGENTS.OSI_RELATIONSHIP."), + ("tables", "Provider-normalized table context view. See AGENTS.OSI_TABLES."), + ("columns", "Provider-normalized column context view. See AGENTS.OSI_COLUMNS."), ), } diff --git a/src/agents_schema/views.py b/src/agents_schema/views.py new file mode 100644 index 0000000..23e376f --- /dev/null +++ b/src/agents_schema/views.py @@ -0,0 +1,337 @@ +"""Information-schema-like context views over provider-normalized views. + +v1 scope: extend the surfaces ``INFORMATION_SCHEMA`` already has, `TABLES` and +`COLUMNS`, rather than inventing new object types. Each metadata provider +publishes a normalized ``AGENTS.<PROVIDER>_TABLES`` / ``AGENTS.<PROVIDER>_COLUMNS`` +view with a shared shape. ``AGENTS.TABLES`` and ``AGENTS.COLUMNS`` then take the +native ``INFORMATION_SCHEMA`` view as the row spine (``SELECT t.*``) and merge +**every** provider view that exists by object identity, appending each +provider's columns under a ``<provider>_`` prefix. + +The merge is generic: no native column list is hardcoded (``SELECT t.*`` inherits +whatever the account exposes), and no provider is special-cased. A provider that +ships a new ``*_TABLES`` view later (e.g. a memory provider contributing +``memories_count``) is picked up automatically with no change here. + +Relationships and metrics are intentionally out of scope for v1. The +information-schema-faithful home for relationships is the +``REFERENTIAL_CONSTRAINTS`` / ``KEY_COLUMN_USAGE`` family; see the proposal. 
+""" +from __future__ import annotations + +import sys + +from .destinations import Destination +from .root import upsert_provider_root + +__all__ = [ + "CORE_VIEW_NAMES", + "PROVIDER_VIEW_NAMES", + "create_context_views", + "build_context_view_sql", +] + +CORE_VIEW_NAMES = frozenset({"tables", "columns"}) +PROVIDER_VIEW_NAMES = frozenset( + { + "dbt_tables", + "dbt_columns", + "lookml_tables", + "lookml_columns", + "osi_tables", + "osi_columns", + } +) +_RELATION_RE = r"^[A-Za-z_][A-Za-z0-9_$]*([.][A-Za-z_][A-Za-z0-9_$]*){0,2}$" + + +def create_context_views(dest: Destination) -> None: + """Create provider-normalized views and the generic context views. + + Fail-soft: a view that cannot be created warns but never breaks the + surrounding ingestion, which has already written the provider tables. + """ + upsert_provider_root(dest, "core") + for name, sql in build_context_view_sql(dest.existing_table_names()).items(): + try: + dest.replace_view(name, sql) + except Exception as e: # noqa: BLE001 - the view layer must not fail ingestion + print(f" warning: could not create view agents.{name}: {e}", file=sys.stderr) + + +def build_context_view_sql(existing_tables: set[str]) -> dict[str, str]: + existing = {name.lower() for name in existing_tables} + provider_views = _provider_view_sql(existing) + return provider_views | { + "tables": _merge_view(provider_views, "tables", "information_schema.tables", _TABLE_IDENTITY, _TABLE_MERGE), + "columns": _merge_view(provider_views, "columns", "information_schema.columns", _COLUMN_IDENTITY, _COLUMN_MERGE), + } + + +def _provider_view_sql(existing: set[str]) -> dict[str, str]: + return { + "dbt_tables": _dbt_tables_sql(existing), + "dbt_columns": _dbt_columns_sql(existing), + "lookml_tables": _lookml_tables_sql(existing), + "lookml_columns": _lookml_columns_sql(existing), + "osi_tables": _osi_tables_sql(existing), + "osi_columns": _osi_columns_sql(existing), + } + + +# --- generic merge over the native information_schema spine 
------------------ + + +def _merge_view( + provider_views: dict[str, str], + suffix: str, + spine: str, + identity: tuple[str, ...], + merge_columns: tuple[str, ...], +) -> str: + views = [name for name in provider_views if name.endswith(f"_{suffix}")] + selects = [ + ",\n ".join(f"{alias}.{column} AS {alias}_{column}" for column in merge_columns) + for alias in (_provider_alias(name, suffix) for name in views) + ] + joins = "\n".join(_merge_join(name, _provider_alias(name, suffix), identity, merge_columns) for name in views) + # Every enrichment column is `<provider>_` prefixed, so `t.*` (the native + # spine columns) can never collide with appended columns. Keep that prefix + # if more enrichment is added later. + enrichment = (",\n " + ",\n ".join(selects)) if selects else "" + return f"SELECT\n t.*{enrichment}\nFROM {spine} t\n{joins}" + + +def _provider_alias(view_name: str, suffix: str) -> str: + return view_name.removesuffix(f"_{suffix}") + + +def _merge_join(view_name: str, alias: str, identity: tuple[str, ...], merge_columns: tuple[str, ...]) -> str: + id_select = ",\n ".join(identity) + agg_select = ",\n ".join(f"{_agg(column)} AS {column}" for column in merge_columns) + group_by = ", ".join(identity) + required = [column for column in identity if column not in ("table_catalog", "table_schema")] + where = " AND ".join(f"{column} IS NOT NULL" for column in required) + on = "\n AND ".join(_merge_on(alias, column) for column in identity) + return ( + f"LEFT JOIN (\n" + f" SELECT\n {id_select},\n {agg_select}\n" + f" FROM agents.{view_name}\n" + f" WHERE {where}\n" + f" GROUP BY {group_by}\n" + f") {alias}\n ON {on}" + ) + + +def _merge_on(alias: str, column: str) -> str: + # Enrichment attaches by case-folded object name, not a guaranteed-unique + # key. A provider row with NULL table_catalog matches the spine in any + # catalog; since the spine is single-database, that is effectively + # schema+name (plus column) identity. 
+    if column == "table_catalog":
+        return f"({alias}.{column} IS NULL OR LOWER(t.{column}) = LOWER({alias}.{column}))"
+    return f"LOWER(t.{column}) = LOWER({alias}.{column})"
+
+
+def _agg(column: str) -> str:
+    if column == "tags":
+        return f"ANY_VALUE({column})"
+    if column in ("source_object_id", "source_path"):
+        return f"LISTAGG({column}, ', ') WITHIN GROUP (ORDER BY {column})"
+    return f"MIN({column})"
+
+
+def _empty_view(columns: list[tuple[str, str]]) -> str:
+    projection = ",\n ".join(f"CAST(NULL AS {kind}) AS {name}" for name, kind in columns)
+    return f"SELECT\n {projection}\nWHERE 1 = 0"
+
+
+def _relation_identity_sql(relation: str, fallback_name: str) -> tuple[str, str, str]:
+    """Split a 1-, 2-, or 3-part relation reference into catalog/schema/table."""
+    is_simple = f"REGEXP_LIKE({relation}, '{_RELATION_RE}')"
+    part_count = f"REGEXP_COUNT({relation}, '[.]')"
+    return (
+        f"""CASE
+        WHEN {is_simple} AND {part_count} = 2
+            THEN SPLIT_PART({relation}, '.', 1)
+        ELSE CAST(NULL AS VARCHAR)
+    END AS table_catalog""",
+        f"""CASE
+        WHEN {is_simple} AND {part_count} = 2
+            THEN SPLIT_PART({relation}, '.', 2)
+        WHEN {is_simple} AND {part_count} = 1
+            THEN SPLIT_PART({relation}, '.', 1)
+        ELSE CAST(NULL AS VARCHAR)
+    END AS table_schema""",
+        f"""CASE
+        WHEN {is_simple} AND {part_count} = 2
+            THEN SPLIT_PART({relation}, '.', 3)
+        WHEN {is_simple} AND {part_count} = 1
+            THEN SPLIT_PART({relation}, '.', 2)
+        WHEN {is_simple} AND {part_count} = 0
+            THEN {relation}
+        ELSE {fallback_name}
+    END AS table_name""",
+    )
+
+
+# --- provider-normalized view shapes -----------------------------------------
+
+_TABLE_COLUMNS = [
+    ("table_catalog", "VARCHAR"),
+    ("table_schema", "VARCHAR"),
+    ("table_name", "VARCHAR"),
+    ("table_type", "VARCHAR"),
+    ("display_name", "VARCHAR"),
+    ("description", "TEXT"),
+    ("ai_context", "TEXT"),
+    ("source_provider", "VARCHAR"),
+    ("source_object_id", "VARCHAR"),
+    ("source_path", "VARCHAR"),
+    ("materialization", "VARCHAR"),
+    ("tags", "VARIANT"),
+]
+_TABLE_IDENTITY = ("table_catalog", "table_schema", "table_name")
+_TABLE_MERGE = tuple(
+    name for name, _ in _TABLE_COLUMNS if name not in _TABLE_IDENTITY and name != "source_provider"
+)
+
+_COLUMN_COLUMNS = [
+    ("table_catalog", "VARCHAR"),
+    ("table_schema", "VARCHAR"),
+    ("table_name", "VARCHAR"),
+    ("column_name", "VARCHAR"),
+    ("display_name", "VARCHAR"),
+    ("description", "TEXT"),
+    ("ai_context", "TEXT"),
+    ("semantic_type", "VARCHAR"),
+    ("is_time_dimension", "BOOLEAN"),
+    ("expression", "TEXT"),
+    ("source_provider", "VARCHAR"),
+    ("source_object_id", "VARCHAR"),
+]
+_COLUMN_IDENTITY = ("table_catalog", "table_schema", "table_name", "column_name")
+_COLUMN_MERGE = tuple(
+    name for name, _ in _COLUMN_COLUMNS if name not in _COLUMN_IDENTITY and name != "source_provider"
+)
+
+
+def _dbt_tables_sql(existing: set[str]) -> str:
+    if "dbt_model" not in existing:
+        return _empty_view(_TABLE_COLUMNS)
+    return """SELECT
+    CAST(NULL AS VARCHAR) AS table_catalog,
+    schema_name AS table_schema,
+    name AS table_name,
+    'DBT_MODEL' AS table_type,
+    name AS display_name,
+    description,
+    CAST(NULL AS TEXT) AS ai_context,
+    'dbt' AS source_provider,
+    unique_id AS source_object_id,
+    file_path AS source_path,
+    materialization,
+    tags
+FROM agents.dbt_model"""
+
+
+def _dbt_columns_sql(existing: set[str]) -> str:
+    if not {"dbt_model", "dbt_column"}.issubset(existing):
+        return _empty_view(_COLUMN_COLUMNS)
+    return """SELECT
+    CAST(NULL AS VARCHAR) AS table_catalog,
+    m.schema_name AS table_schema,
+    m.name AS table_name,
+    c.column_name,
+    c.column_name AS display_name,
+    c.description,
+    CAST(NULL AS TEXT) AS ai_context,
+    CAST(NULL AS VARCHAR) AS semantic_type,
+    CAST(NULL AS BOOLEAN) AS is_time_dimension,
+    CAST(NULL AS TEXT) AS expression,
+    'dbt' AS source_provider,
+    c.model_id || '.' || c.column_name AS source_object_id
+FROM agents.dbt_column c
+JOIN agents.dbt_model m ON m.unique_id = c.model_id"""
+
+
+def _lookml_tables_sql(existing: set[str]) -> str:
+    if "lookml_view" not in existing:
+        return _empty_view(_TABLE_COLUMNS)
+    catalog_sql, schema_sql, table_sql = _relation_identity_sql("sql_table_name", "name")
+    return f"""SELECT
+    {catalog_sql},
+    {schema_sql},
+    {table_sql},
+    'LOOKML_VIEW' AS table_type,
+    COALESCE(label, name) AS display_name,
+    description,
+    ai_context,
+    'lookml' AS source_provider,
+    name AS source_object_id,
+    file_path AS source_path,
+    CAST(NULL AS VARCHAR) AS materialization,
+    PARSE_JSON('[]') AS tags
+FROM agents.lookml_view"""
+
+
+def _lookml_columns_sql(existing: set[str]) -> str:
+    if not {"lookml_dimension", "lookml_view"}.issubset(existing):
+        return _empty_view(_COLUMN_COLUMNS)
+    catalog_sql, schema_sql, table_sql = _relation_identity_sql("v.sql_table_name", "v.name")
+    return f"""SELECT
+    {catalog_sql},
+    {schema_sql},
+    {table_sql},
+    d.field_name AS column_name,
+    d.field_name AS display_name,
+    d.description,
+    d.ai_context,
+    d.field_kind AS semantic_type,
+    d.field_kind = 'dimension_group' AND COALESCE(d.type, 'time') = 'time' AS is_time_dimension,
+    d.sql AS expression,
+    'lookml' AS source_provider,
+    d.view_name || '.' || d.field_name AS source_object_id
+FROM agents.lookml_dimension d
+JOIN agents.lookml_view v ON v.name = d.view_name"""
+
+
+def _osi_tables_sql(existing: set[str]) -> str:
+    if "osi_dataset" not in existing:
+        return _empty_view(_TABLE_COLUMNS)
+    catalog_sql, schema_sql, table_sql = _relation_identity_sql("source_table", "name")
+    return f"""SELECT
+    {catalog_sql},
+    {schema_sql},
+    {table_sql},
+    'OSI_DATASET' AS table_type,
+    name AS display_name,
+    description,
+    ai_context,
+    'osi' AS source_provider,
+    name AS source_object_id,
+    CAST(NULL AS VARCHAR) AS source_path,
+    CAST(NULL AS VARCHAR) AS materialization,
+    PARSE_JSON('[]') AS tags
+FROM agents.osi_dataset"""
+
+
+def _osi_columns_sql(existing: set[str]) -> str:
+    if not {"osi_dataset", "osi_field"}.issubset(existing):
+        return _empty_view(_COLUMN_COLUMNS)
+    catalog_sql, schema_sql, table_sql = _relation_identity_sql("d.source_table", "d.name")
+    return f"""SELECT
+    {catalog_sql},
+    {schema_sql},
+    {table_sql},
+    f.field_name AS column_name,
+    COALESCE(f.label, f.field_name) AS display_name,
+    f.description,
+    f.ai_context,
+    CAST(NULL AS VARCHAR) AS semantic_type,
+    f.is_time_dimension,
+    f.expression,
+    'osi' AS source_provider,
+    f.dataset_name || '.' || f.field_name AS source_object_id
+FROM agents.osi_field f
+JOIN agents.osi_dataset d ON d.name = f.dataset_name"""
diff --git a/tests/test_connector_root.py b/tests/test_connector_root.py
index 043917c..0452e33 100644
--- a/tests/test_connector_root.py
+++ b/tests/test_connector_root.py
@@ -8,6 +8,16 @@ class FakeDestination:
     def __init__(self):
         self.calls = []
 
+    def existing_table_names(self):
+        return {
+            call[1].removeprefix("agents.")
+            for call in self.calls
+            if call[0] == "replace"
+        }
+
+    def replace_view(self, name, sql):
+        self.calls.append(("view", name, sql))
+
     def upsert_rows(self, table, rows):
         self.calls.append(("upsert", table.name, list(rows)))
 
@@ -44,6 +54,9 @@ def test_dbt_run_upserts_root_before_source_tables(self):
         self.assertEqual(dest.calls[0][0], "upsert")
         self.assertEqual({row[0] for row in dest.calls[0][2]}, {"dbt"})
         self.assertEqual([call[0] for call in dest.calls[1:4]], ["replace", "replace", "replace"])
+        self.assertEqual(dest.calls[4][0], "upsert")
+        self.assertEqual({row[0] for row in dest.calls[4][2]}, {"core"})
+        self.assertEqual([call[0] for call in dest.calls[5:10]], ["view", "view", "view", "view", "view"])
 
     def test_lookml_run_upserts_root_before_source_tables(self):
         dest = FakeDestination()
@@ -59,6 +72,9 @@ def test_lookml_run_upserts_root_before_source_tables(self):
         self.assertEqual(dest.calls[0][0], "upsert")
         self.assertEqual({row[0] for row in dest.calls[0][2]}, {"lookml"})
         self.assertEqual([call[0] for call in dest.calls[1:5]], ["replace", "replace", "replace", "replace"])
+        self.assertEqual(dest.calls[5][0], "upsert")
+        self.assertEqual({row[0] for row in dest.calls[5][2]}, {"core"})
+        self.assertEqual([call[0] for call in dest.calls[6:11]], ["view", "view", "view", "view", "view"])
 
     def test_osi_run_upserts_root_before_source_tables(self):
         dest = FakeDestination()
@@ -74,6 +90,9 @@ def test_osi_run_upserts_root_before_source_tables(self):
         self.assertEqual(dest.calls[0][0], "upsert")
         self.assertEqual({row[0] for row in dest.calls[0][2]}, {"osi"})
         self.assertEqual([call[0] for call in dest.calls[1:5]], ["replace", "replace", "replace", "replace"])
+        self.assertEqual(dest.calls[5][0], "upsert")
+        self.assertEqual({row[0] for row in dest.calls[5][2]}, {"core"})
+        self.assertEqual([call[0] for call in dest.calls[6:11]], ["view", "view", "view", "view", "view"])
 
 
 if __name__ == "__main__":
diff --git a/tests/test_destinations.py b/tests/test_destinations.py
index 5aa0a72..963caa5 100644
--- a/tests/test_destinations.py
+++ b/tests/test_destinations.py
@@ -1,6 +1,6 @@
 import unittest
 
-from agents_schema.destinations import _create_table_if_not_exists_sql, _merge_sql
+from agents_schema.destinations import _create_table_if_not_exists_sql, _create_view_sql, _merge_sql
 from agents_schema.root import ROOT
 
 
@@ -35,6 +35,11 @@ def test_root_merge_upserts_on_provider_and_key(self):
             sql,
         )
 
+    def test_create_view_sql_validates_view_name_and_wraps_query(self):
+        sql = _create_view_sql("tables", "SELECT 1 AS value", "agents")
+
+        self.assertEqual(sql, "CREATE OR REPLACE VIEW agents.tables AS\nSELECT 1 AS value")
+
 
 if __name__ == "__main__":
     unittest.main()
diff --git a/tests/test_root.py b/tests/test_root.py
index b5078e1..7b1cb6a 100644
--- a/tests/test_root.py
+++ b/tests/test_root.py
@@ -22,7 +22,10 @@ def test_upsert_provider_root_writes_only_requested_provider(self):
         self.assertIs(table, ROOT)
         self.assertTrue(rows)
         self.assertEqual({row[0] for row in rows}, {"dbt"})
-        self.assertEqual({row[1] for row in rows}, {"overview", "model", "column", "dependency"})
+        self.assertEqual(
+            {row[1] for row in rows},
+            {"overview", "model", "column", "dependency", "tables", "columns"},
+        )
 
     def test_upsert_provider_root_has_osi_entries(self):
         dest = FakeDestination()
@@ -30,7 +33,21 @@ def test_upsert_provider_root_has_osi_entries(self):
         upsert_provider_root(dest, "osi")
 
         _, rows = dest.upserts[0]
-        self.assertEqual({row[1] for row in rows}, {"overview", "dataset", "field", "metric", "relationship"})
+        self.assertEqual(
+            {row[1] for row in rows},
+            {"overview", "dataset", "field", "metric", "relationship", "tables", "columns"},
+        )
+
+    def test_upsert_provider_root_has_core_view_entries(self):
+        dest = FakeDestination()
+
+        upsert_provider_root(dest, "core")
+
+        _, rows = dest.upserts[0]
+        self.assertEqual(
+            {row[1] for row in rows},
+            {"overview", "root", "tables", "columns"},
+        )
 
 
 if __name__ == "__main__":
diff --git a/tests/test_views.py b/tests/test_views.py
new file mode 100644
index 0000000..3f34757
--- /dev/null
+++ b/tests/test_views.py
@@ -0,0 +1,98 @@
+import unittest
+
+from agents_schema.views import CORE_VIEW_NAMES, PROVIDER_VIEW_NAMES, build_context_view_sql
+
+
+class ContextViewSqlTests(unittest.TestCase):
+    def test_builds_only_tables_and_columns_surfaces(self):
+        views = build_context_view_sql({"dbt_model", "dbt_column"})
+
+        self.assertEqual(PROVIDER_VIEW_NAMES | CORE_VIEW_NAMES, set(views))
+        self.assertEqual(CORE_VIEW_NAMES, {"tables", "columns"})
+        # relationships / metrics / entities are out of scope for v1
+        self.assertNotIn("relationships", views)
+        self.assertNotIn("metrics", views)
+        self.assertNotIn("entities", views)
+
+    def test_builds_provider_views_from_raw_provider_tables(self):
+        views = build_context_view_sql({"dbt_model", "dbt_column"})
+
+        self.assertIn("FROM agents.dbt_model", views["dbt_tables"])
+        self.assertIn("FROM agents.dbt_column c", views["dbt_columns"])
+
+    def test_core_tables_uses_information_schema_star_spine(self):
+        views = build_context_view_sql({"dbt_model", "lookml_view", "osi_dataset"})
+        tables = views["tables"]
+
+        # native spine is SELECT t.* — no hardcoded information_schema column list
+        self.assertIn("SELECT\n t.*", tables)
+        self.assertIn("FROM information_schema.tables t", tables)
+        self.assertNotIn("t.is_hybrid", tables)
+        self.assertNotIn("t.last_ddl", tables)
+
+    def test_core_tables_merges_all_provider_tables_by_identity(self):
+        views = build_context_view_sql({"dbt_model", "lookml_view", "osi_dataset"})
+        tables = views["tables"]
+
+        # one prefixed enrichment column per provider, joined by identity
+        self.assertIn("dbt.description AS dbt_description", tables)
+        self.assertIn("lookml.ai_context AS lookml_ai_context", tables)
+        self.assertIn("osi.description AS osi_description", tables)
+        self.assertIn("FROM agents.dbt_tables", tables)
+        self.assertIn("FROM agents.osi_tables", tables)
+        self.assertIn("LOWER(t.table_name) = LOWER(dbt.table_name)", tables)
+        self.assertIn("GROUP BY table_catalog, table_schema, table_name", tables)
+        # generic merge: no hardcoded memory counts anywhere
+        self.assertNotIn("memories_count", tables)
+        self.assertNotIn("warnings_count", tables)
+
+    def test_core_columns_merges_by_column_identity(self):
+        views = build_context_view_sql({"dbt_model", "dbt_column"})
+        columns = views["columns"]
+
+        self.assertIn("SELECT\n t.*", columns)
+        self.assertIn("FROM information_schema.columns t", columns)
+        self.assertIn("FROM agents.dbt_columns", columns)
+        self.assertIn("LOWER(t.column_name) = LOWER(dbt.column_name)", columns)
+        self.assertIn("dbt.description AS dbt_description", columns)
+
+    def test_dbt_relationships_view_is_gone(self):
+        views = build_context_view_sql({"dbt_model", "dbt_dependency"})
+
+        self.assertNotIn("dbt_relationships", views)
+
+    def test_osi_columns_parse_source_table_like_osi_tables(self):
+        views = build_context_view_sql({"osi_dataset", "osi_field"})
+
+        # columns identity must align with tables identity (parse source_table)
+        self.assertIn("THEN SPLIT_PART(d.source_table, '.', 2)", views["osi_columns"])
+        self.assertIn("END AS table_schema", views["osi_columns"])
+        self.assertIn("END AS table_name", views["osi_columns"])
+
+    def test_lookml_tables_parse_simple_sql_table_name(self):
+        views = build_context_view_sql({"lookml_view"})
+
+        self.assertIn("REGEXP_COUNT(sql_table_name, '[.]') = 2", views["lookml_tables"])
+        self.assertIn("THEN SPLIT_PART(sql_table_name, '.', 1)", views["lookml_tables"])
+        self.assertIn("END AS table_catalog", views["lookml_tables"])
+
+    def test_lookml_columns_use_same_sql_table_name_identity_as_tables(self):
+        views = build_context_view_sql({"lookml_view", "lookml_dimension"})
+
+        self.assertIn("FROM agents.lookml_dimension d", views["lookml_columns"])
+        self.assertIn("JOIN agents.lookml_view v ON v.name = d.view_name", views["lookml_columns"])
+        self.assertIn("REGEXP_COUNT(v.sql_table_name, '[.]') = 2", views["lookml_columns"])
+        self.assertIn("ELSE v.name", views["lookml_columns"])
+
+    def test_missing_provider_tables_become_typed_empty_views(self):
+        views = build_context_view_sql(set())
+
+        # no provider tables exist: provider views are empty typed projections,
+        # but AGENTS.TABLES still works as an information_schema passthrough
+        self.assertIn("WHERE 1 = 0", views["dbt_tables"])
+        self.assertIn("CAST(NULL AS VARCHAR) AS table_catalog", views["dbt_tables"])
+        self.assertIn("FROM information_schema.tables t", views["tables"])
+
+
+if __name__ == "__main__":
+    unittest.main()
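
Note on `_relation_identity_sql`: the three `CASE` expressions it emits may be easier to review against a plain-Python mirror of the same branching. The sketch below is illustrative only and is not part of this change. `split_relation` and `SIMPLE_RELATION` are hypothetical names, and because `_RELATION_RE` is not shown in the diff, the pattern used here (one to three dot-separated bare identifiers) is an assumption rather than the package's actual regex.

```python
import re

# Assumption: stand-in for _RELATION_RE, which the diff does not show.
# It accepts one to three dot-separated bare identifiers.
SIMPLE_RELATION = re.compile(
    r"^[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*){0,2}$"
)


def split_relation(relation, fallback_name):
    """Return (table_catalog, table_schema, table_name) like the CASE blocks."""
    if relation is None or not SIMPLE_RELATION.match(relation):
        # Non-simple references (templated SQL, quoted names, NULL) carry no
        # catalog or schema and fall back to the provider object name.
        return (None, None, fallback_name)
    parts = relation.split(".")
    if len(parts) == 3:  # two dots: catalog.schema.table
        return (parts[0], parts[1], parts[2])
    if len(parts) == 2:  # one dot: schema.table
        return (None, parts[0], parts[1])
    return (None, None, parts[0])  # no dots: bare table name


if __name__ == "__main__":
    print(split_relation("analytics.public.orders", "orders_view"))
    print(split_relation("public.orders", "orders_view"))
    print(split_relation("${derived.SQL_TABLE_NAME}", "orders_view"))
```

Under this reading, a two-part reference leaves `table_catalog` NULL, which is the case `_merge_on` treats as matching the spine in any catalog.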