
Commit 7f4f696

Edit concepts/models docs (#610)
1 parent 7d39ba2 commit 7f4f696

5 files changed

Lines changed: 181 additions & 122 deletions

docs/concepts/models/model_kinds.md

Lines changed: 49 additions & 38 deletions
Original file line number | Diff line number | Diff line change
@@ -1,19 +1,19 @@
11
# Model kinds
22

3-
This page describes the supported kinds of [models](overview.md), which ultimately determines how the data for a model is loaded.
3+
This page describes the kinds of [models](./overview.md) SQLMesh supports, which determine how the data for a model is loaded.
44

55
## INCREMENTAL_BY_TIME_RANGE
66

7-
Specifies that the model should be computed incrementally based on a time range. This is an optimal choice for datasets in which records are of temporal nature and represent immutable facts such as events, logs, or transactions. Using this kind for datasets that fit the described traits typically results in significant cost and time savings.
7+
Models of the `INCREMENTAL_BY_TIME_RANGE` kind are computed incrementally based on a time range. This is an optimal choice for datasets in which records are captured over time and represent immutable facts such as events, logs, or transactions. Using this kind for appropriate datasets typically results in significant cost and time savings.
88

9-
As the name suggests, a model of this kind is computed incrementally, meaning only missing data intervals are processed during each evaluation. This is in contrast to the [FULL](#full) model kind, where the entire dataset is recomputed every time the model is evaluated.
9+
Only missing time intervals are processed during each execution for `INCREMENTAL_BY_TIME_RANGE` models. This is in contrast to the [FULL](#full) model kind, where the entire dataset is recomputed every time the model is executed.
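
As a rough illustration of what "only missing time intervals are processed" means, the Python sketch below works out which daily intervals still need to be run. The function name `missing_intervals`, the daily granularity, and the `processed` set are assumptions made for this example; they are not SQLMesh's internal API.

```python
from __future__ import annotations

from datetime import date, timedelta


def missing_intervals(start: date, end: date, processed: set[date]) -> list[date]:
    """Return the daily intervals in [start, end] that have not been processed yet."""
    total_days = (end - start).days + 1
    candidates = [start + timedelta(days=offset) for offset in range(total_days)]
    return [day for day in candidates if day not in processed]


# Hypothetical state: January 1st and 2nd were loaded by earlier runs.
processed = {date(2023, 1, 1), date(2023, 1, 2)}
todo = missing_intervals(date(2023, 1, 1), date(2023, 1, 4), processed)
print(todo)
```

A `FULL` model, by contrast, would behave as if `processed` were always empty and recompute every interval on each run.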
1010

11-
In order to take advantage of the incremental evaluation, the model query must contain an expression in its `WHERE` clause that filters the upstream records by time range. SQLMesh provides special macros that represent the start and end of the time range being processed: `@start_date` / `@end_date` and `@start_ds` / `@end_ds`.
11+
An `INCREMENTAL_BY_TIME_RANGE` model query must contain an expression in its SQL `WHERE` clause that filters the upstream records by time range. SQLMesh provides special macros that represent the start and end of the time range being processed: `@start_date` / `@end_date` and `@start_ds` / `@end_ds`.
1212

13-
Refer to [Macros](../macros.md#predefined-variables) for more information on these.
13+
Refer to [Macros](../macros.md#predefined-variables) for more information.
1414

15-
Below is an example of a definition that takes advantage of the model's incremental nature:
16-
```sql linenums="1"
15+
This example implements an `INCREMENTAL_BY_TIME_RANGE` model by specifying the `kind` in the `MODEL` DDL and including a SQL `WHERE` clause to filter records by time range:
16+
```sql linenums="1" hl_lines="3-5 12-13"
1717
MODEL (
1818
name db.events,
1919
kind INCREMENTAL_BY_TIME_RANGE (
@@ -30,7 +30,10 @@ WHERE
3030
```
3131

3232
### Time column
33-
SQLMesh needs to know which column in the model's output represents a timestamp or date associated with each record. This column is used to determine which records will be overridden during data [restatement](../plans.md#restatement-plans), as well as a partition key for engines that support partitioning (such as Apache Spark).
33+
SQLMesh needs to know which column in the model's output represents the timestamp or date associated with each record.
34+
35+
The `time_column` is used to determine which records will be overridden during data [restatement](../plans.md#restatement-plans) and provides a partition key for engines that support partitioning (such as Apache Spark):
36+
3437
```sql linenums="1" hl_lines="4"
3538
MODEL (
3639
name db.events,
@@ -40,7 +43,7 @@ MODEL (
4043
);
4144
```
4245

43-
Additionally, the format in which the timestamp/date is stored is required. By default, SQLMesh uses the `%Y-%m-%d` format, but it can be overridden as follows:
46+
By default, SQLMesh assumes the time column is in the `%Y-%m-%d` format. For other formats, the default can be overridden as follows:
4447
```sql linenums="1" hl_lines="4"
4548
MODEL (
4649
name db.events,
@@ -49,9 +52,9 @@ MODEL (
4952
)
5053
);
5154
```
52-
**Note:** The time format should be defined using the same dialect as the one used to define the model's query.
55+
**Note:** The time format should be defined using the same SQL dialect as the one used to define the model's query.
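
To make the default format concrete, the sketch below renders one date with `%Y-%m-%d` and with a second, hypothetical format. Python's `strftime` is used here only because it happens to share these format tokens; in a real model the format string is interpreted in the model's SQL dialect, as the note above says.

```python
from datetime import date

DEFAULT_TIME_FORMAT = "%Y-%m-%d"  # the default named in the text
COMPACT_TIME_FORMAT = "%Y%m%d"    # hypothetical override, for illustration only

day = date(2023, 1, 15)
default_text = day.strftime(DEFAULT_TIME_FORMAT)
compact_text = day.strftime(COMPACT_TIME_FORMAT)

print(default_text)
print(compact_text)
```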
5356

54-
SQLMesh also uses the time column to automatically append a time range filter to the model's query at runtime, which prevents records that are not part of the target interval from being stored. This is a safety mechanism that prevents the unintended overriding of unrelated records when handling late-arriving data.
57+
SQLMesh also uses the time column to automatically append a time range filter to the model's query at runtime, which prevents records that are not part of the target interval from being stored. This is a safety mechanism that prevents unintentionally overriding unrelated records when handling late-arriving data.
5558

5659
Consider the following model definition:
5760
```sql linenums="1"
@@ -70,7 +73,7 @@ WHERE
7073
receipt_date BETWEEN @start_ds AND @end_ds;
7174
```
7275

73-
At runtime, SQLMesh will automatically modify the model's query to look as follows:
76+
At runtime, SQLMesh will automatically modify the model's query to look like this:
7477
```sql linenums="1" hl_lines="7"
7578
SELECT
7679
event_date::TEXT as event_date,
@@ -82,9 +85,9 @@ WHERE
8285
```
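
The effect of the appended filter can be sketched in Python. This is an illustration under stated assumptions, not SQLMesh's implementation: the row data is invented, `event_date` is assumed to be the model's time column, and `keep_rows_in_interval` is a hypothetical helper. A row that was received inside the target interval but carries an older `event_date` is dropped instead of being written.

```python
from datetime import date


def keep_rows_in_interval(rows, time_column, start, end):
    """Keep only the rows whose time column falls inside [start, end]."""
    return [row for row in rows if start <= row[time_column] <= end]


# Both rows were received inside the target interval, but the second one
# describes an event that happened before the interval started.
rows = [
    {"event_date": date(2023, 1, 2), "event_payload": "on time"},
    {"event_date": date(2022, 12, 30), "event_payload": "late arrival"},
]

kept = keep_rows_in_interval(rows, "event_date", date(2023, 1, 1), date(2023, 1, 3))
print(kept)
```

Without the extra filter, the late-arriving row would land in a date that lies outside the interval being processed and could override records there.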
8386

8487
### Idempotency
85-
It is recommended to ensure that queries of models of this kind are [idempotent](../../glossary/#idempotency) to prevent unexpected results during data [restatement](../plans.md#restatement-plans).
88+
It is recommended that queries of models of this kind be [idempotent](../glossary.md#idempotency) to prevent unexpected results during data [restatement](../plans.md#restatement-plans).
8689

87-
Note, however, that upstream models and tables can impact the extent to which the idempotency property can be guaranteed. For example, referencing an upstream model of kind [FULL](#full) in the model query automatically renders such a model as non-idempotent.
90+
Note, however, that upstream models and tables can impact a model's idempotency. For example, referencing an upstream model of kind [FULL](#full) in the model query automatically causes the model to be non-idempotent.
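
A minimal sketch of why idempotency matters for restatement: if reprocessing an interval replaces exactly the rows in that interval, running it twice gives the same table as running it once. The helper names and the sample rows are hypothetical, and overwrite-by-interval is an assumption about the load step made only for this example.

```python
def restate_interval(table, new_rows, start, end):
    """Replace every row whose `ds` falls in [start, end] with `new_rows`."""
    kept = [row for row in table if not (start <= row["ds"] <= end)]
    return kept + new_rows


def append_only(table, new_rows):
    """Blindly append rows, so repeating the load duplicates data."""
    return table + new_rows


table = [{"ds": "2023-01-01", "n": 5}]
new_rows = [{"ds": "2023-01-02", "n": 7}]

# ISO-formatted date strings compare correctly as plain strings.
once = restate_interval(table, new_rows, "2023-01-02", "2023-01-02")
twice = restate_interval(once, new_rows, "2023-01-02", "2023-01-02")

appended_twice = append_only(append_only(table, new_rows), new_rows)
print(once == twice, len(appended_twice))
```

The guarantee only holds if `new_rows` is the same on every run. If the query reads from an upstream table that is rewritten between runs, the same interval can produce different rows, which is the situation the note above describes.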
8891

8992
### Materialization strategy
9093
Depending on the target engine, models of the `INCREMENTAL_BY_TIME_RANGE` kind are materialized using the following strategies:
@@ -101,16 +104,18 @@ Depending on the target engine, models of the `INCREMENTAL_BY_TIME_RANGE` kind a
101104

102105
## INCREMENTAL_BY_UNIQUE_KEY
103106

104-
This kind signifies that a model should be computed incrementally based on a unique key. If a key is missing in the model's table, the new row is inserted; otherwise the existing row associated with this key is updated with the new one. This kind is a good fit for datasets that have the following traits:
107+
Models of the `INCREMENTAL_BY_UNIQUE_KEY` kind are computed incrementally based on a unique key.
105108

106-
* Each record has a key associated with it.
107-
* There should be at most one record associated with each unique key.
108-
* It is appropriate to upsert records, meaning existing records can be overridden by new arrivals when their keys match.
109+
If a key is missing in the model's table, the new data row is inserted; otherwise, the existing row associated with this key is updated with the new one. This kind is a good fit for datasets that have the following traits:
109110

110-
A [Slowly Changing Dimension](../glossary.md#slowly-changing-dimension-scd) (SCD) is one example that fits this description well.
111+
* Each record has a unique key associated with it.
112+
* There is at most one record associated with each unique key.
113+
* It is appropriate to upsert records, so existing records can be overridden by new arrivals when their keys match.
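
The upsert behavior described above can be sketched in a few lines of Python. The dictionary-backed "table", the `upsert` helper, and the employee rows are all hypothetical; the target engine performs the real merge.

```python
def upsert(table: dict, new_rows: list, key: str) -> dict:
    """Insert rows with unseen keys and overwrite rows whose key already exists."""
    merged = dict(table)  # copy so the input table is left untouched
    for row in new_rows:
        merged[row[key]] = row
    return merged


existing = {
    "alice": {"name": "alice", "title": "Engineer"},
    "bob": {"name": "bob", "title": "Analyst"},
}
arrivals = [
    {"name": "bob", "title": "Senior Analyst"},  # key exists: row is updated
    {"name": "carol", "title": "Designer"},      # key is new: row is inserted
]

result = upsert(existing, arrivals, key="name")
print(result)
```

Because each key maps to a single row, the table never holds more than one record per key, which matches the second trait in the list.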
111114

112-
The name of the unique key column must be provided as part of the model definition, as in the following example:
113-
```sql linenums="1"
115+
A [Slowly Changing Dimension](../glossary.md#slowly-changing-dimension-scd) (SCD) is one approach that fits this description well.
116+
117+
The name of the unique key column must be provided as part of the `MODEL` DDL, as in this example:
118+
```sql linenums="1" hl_lines="3-5"
114119
MODEL (
115120
name db.employees,
116121
kind INCREMENTAL_BY_UNIQUE_KEY (
@@ -126,7 +131,7 @@ FROM raw_employees;
126131
```
127132

128133
Composite keys are also supported:
129-
```sql linenums="1"
134+
```sql linenums="1" hl_lines="4"
130135
MODEL (
131136
name db.employees,
132137
kind INCREMENTAL_BY_UNIQUE_KEY (
@@ -135,8 +140,8 @@ MODEL (
135140
);
136141
```
137142

138-
Similar to the [INCREMENTAL_BY_TIME_RANGE](#incremental_by_time_range) kind, the upstream records can be filtered by time range using the `@start_date`, `@end_date`, and so forth. Use [macros](../macros.md#predefined-variables) in order to process the input data incrementally:
139-
```sql linenums="1"
143+
Models of the `INCREMENTAL_BY_UNIQUE_KEY` kind can also filter upstream records by time range using a SQL `WHERE` clause and the `@start_date`, `@end_date`, or other macros (similar to the [INCREMENTAL_BY_TIME_RANGE](#incremental_by_time_range) kind):
144+
```sql linenums="1" hl_lines="6-7"
140145
SELECT
141146
name::TEXT as name,
142147
title::TEXT as title,
@@ -146,7 +151,7 @@ WHERE
146151
event_date BETWEEN @start_date AND @end_date;
147152
```
148153

149-
**Note:** Models of this kind are inherently [non-idempotent](../../glossary/#idempotency), which should be taken into consideration during data [restatement](../plans.md#restatement-plans).
154+
**Note:** Models of the `INCREMENTAL_BY_UNIQUE_KEY` kind are inherently [non-idempotent](../glossary.md#idempotency), which should be taken into consideration during data [restatement](../plans.md#restatement-plans).
150155

151156
### Materialization strategy
152157
Depending on the target engine, models of the `INCREMENTAL_BY_UNIQUE_KEY` kind are materialized using the following strategies:
@@ -162,12 +167,14 @@ Depending on the target engine, models of the `INCREMENTAL_BY_UNIQUE_KEY` kind a
162167
| DuckDB | not supported |
163168

164169
## FULL
165-
As the name suggests, this kind causes the dataset associated with a model to be fully refreshed (rewritten) upon each model evaluation. It's somewhat easier to use than incremental kinds, due to the lack of any special settings or additional query considerations. This makes it suitable for smaller datasets, where recomputing data from scratch is relatively cheap and doesn't require preservation of processing history. However, using this kind with datasets that have a high volume of records will result in significant runtime and compute costs.
170+
Models of the `FULL` kind cause the dataset associated with a model to be fully refreshed (rewritten) upon each model evaluation.
166171

167-
This kind can be a good fit for aggregate tables that lack temporal dimension. For aggregate tables with temporal dimension, consider the [INCREMENTAL_BY_TIME_RANGE](#incremental_by_time_range) kind instead.
172+
The `FULL` model kind is somewhat easier to use than incremental kinds due to the lack of special settings or additional query considerations. This makes it suitable for smaller datasets, where recomputing data from scratch is relatively cheap and doesn't require preservation of processing history. However, using this kind with datasets containing a large volume of records will result in significant runtime and compute costs.
168173

169-
Example:
170-
```sql linenums="1"
174+
This kind can be a good fit for aggregate tables that lack a temporal dimension. For aggregate tables with a temporal dimension, consider the [INCREMENTAL_BY_TIME_RANGE](#incremental_by_time_range) kind instead.
175+
176+
This example specifies a `FULL` model kind:
177+
```sql linenums="1" hl_lines="3"
171178
MODEL (
172179
name db.salary_by_title_agg,
173180
kind FULL
@@ -194,14 +201,16 @@ Depending on the target engine, models of the `FULL` kind are materialized using
194201
| DuckDB | CREATE OR REPLACE TABLE |
195202

196203
## VIEW
197-
Other model kinds cause the output of a model query to be materialized and stored in a physical table. The `VIEW` kind is different, because no data actually gets written during model evaluation. Instead, a non-materialized view (or "virtual table") is created or replaced based on the model's query.
204+
The model kinds described so far cause the output of a model query to be materialized and stored in a physical table.
198205

199-
**Note:** With this kind, the model's query is evaluated every time the model is referenced in downstream queries. This may incur undesirable compute cost in cases where the model's query is compute-intensive, or when the model is referenced in many downstream queries.
206+
The `VIEW` kind is different, because no data is actually written during model execution. Instead, a non-materialized view (or "virtual table") is created or replaced based on the model's query.
200207

201-
View is the default model kind if the kind is not specified.
208+
**Note:** `VIEW` is the default model kind if no kind is specified.
202209

203-
Example:
204-
```sql linenums="1"
210+
**Note:** With this kind, the model's query is evaluated every time the model is referenced in a downstream query. This may incur undesirable compute cost and time in cases where the model's query is compute-intensive, or when the model is referenced in many downstream queries.
211+
212+
This example specifies a `VIEW` model kind:
213+
```sql linenums="1" hl_lines="3"
205214
MODEL (
206215
name db.highest_salary,
207216
kind VIEW
@@ -213,10 +222,12 @@ FROM db.employees;
213222
```
214223

215224
## EMBEDDED
216-
Embedded models are a way to share common logic between different models of other kinds. This kind is similar to [VIEW](#view), except models of this kind are never evaluated, and therefore there are no data assets (tables or views) associated with them in the data warehouse. Instead, the embedded model's query gets injected directly into a query of each downstream model that references this model in its own query.
225+
Embedded models are a way to share common logic between different models of other kinds.
217226

218-
Example:
219-
```sql linenums="1"
227+
There are no data assets (tables or views) associated with `EMBEDDED` models in the data warehouse. Instead, an `EMBEDDED` model's query is injected directly into the query of each downstream model that references it.
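
The idea of injecting an embedded query can be shown with a deliberately naive Python sketch. It uses plain text substitution, which is an assumption made for brevity; a real implementation would work on a parsed query rather than raw text. The helper `inline_embedded` and both query strings are hypothetical.

```python
def inline_embedded(downstream_sql: str, embedded_name: str, embedded_sql: str) -> str:
    """Replace references to an embedded model with its query as an aliased subquery."""
    alias = embedded_name.split(".")[-1]  # e.g. "db.unique_employees" -> "unique_employees"
    return downstream_sql.replace(embedded_name, f"({embedded_sql}) AS {alias}")


# Hypothetical queries, used only to show the substitution.
embedded_sql = "SELECT DISTINCT name FROM db.employees"
downstream_sql = "SELECT COUNT(*) AS n FROM db.unique_employees"

rendered = inline_embedded(downstream_sql, "db.unique_employees", embedded_sql)
print(rendered)
```

The rendered query reads directly from `db.employees`, so nothing named `db.unique_employees` has to exist in the warehouse.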
228+
229+
This example specifies an `EMBEDDED` model kind:
230+
```sql linenums="1" hl_lines="3"
220231
MODEL (
221232
name db.unique_employees,
222233
kind EMBEDDED
@@ -228,4 +239,4 @@ FROM db.employees;
228239
```
229240

230241
## SEED
231-
This is a special kind reserved for [seed models](seed_models.md).
242+
The `SEED` model kind is used to specify [seed models](./seed_models.md) for using static CSV datasets in your SQLMesh project.
