docs/concepts/models/model_kinds.md
Lines changed: 49 additions & 38 deletions
@@ -1,19 +1,19 @@
1
1
# Model kinds
2
2
3
-
This page describes the supported kinds of [models](overview.md), which ultimately determines how the data for a model is loaded.
3
+
This page describes the kinds of [models](./overview.md) SQLMesh supports, which determine how the data for a model is loaded.
4
4
5
5
## INCREMENTAL_BY_TIME_RANGE
6
6
7
-
Specifies that the model should be computed incrementally based on a time range. This is an optimal choice for datasets in which records are of temporal nature and represent immutable facts such as events, logs, or transactions. Using this kind for datasets that fit the described traits typically results in significant cost and time savings.
7
+
Models of the `INCREMENTAL_BY_TIME_RANGE` kind are computed incrementally based on a time range. This is an optimal choice for datasets in which records are captured over time and represent immutable facts such as events, logs, or transactions. Using this kind for appropriate datasets typically results in significant cost and time savings.
8
8
9
-
As the name suggests, a model of this kind is computed incrementally, meaning only missing data intervals are processed during each evaluation. This is in contrast to the [FULL](#full) model kind, where the entire dataset is recomputed every time the model is evaluated.
9
+
Only missing time intervals are processed during each execution for `INCREMENTAL_BY_TIME_RANGE` models. This is in contrast to the [FULL](#full) model kind, where the entire dataset is recomputed every time the model is executed.
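The idea of processing only missing intervals can be illustrated with a minimal Python sketch. It assumes daily intervals and a hypothetical `processed` set that records which days earlier runs have loaded; it is not SQLMesh's internal implementation, only an illustration of the concept.

```python
from datetime import date, timedelta


def missing_days(start: date, end: date, processed: set[date]) -> list[date]:
    """Return every day in the inclusive range [start, end] not yet processed."""
    days = []
    current = start
    while current <= end:
        if current not in processed:
            days.append(current)
        current += timedelta(days=1)
    return days


# Hypothetical state: days that earlier runs have already loaded.
processed = {date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 4)}

# Only 2023-01-03 and 2023-01-05 remain to be computed for this range.
todo = missing_days(date(2023, 1, 1), date(2023, 1, 5), processed)
print(todo)
```

A `FULL` model, by contrast, would recompute all five days on every run regardless of what had been loaded before.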
10
10
11
-
In order to take advantage of the incremental evaluation, the model query must contain an expression in its `WHERE` clause that filters the upstream records by time range. SQLMesh provides special macros that represent the start and end of the time range being processed: `@start_date` / `@end_date` and `@start_ds` / `@end_ds`.
11
+
An `INCREMENTAL_BY_TIME_RANGE` model query must contain an expression in its SQL `WHERE` clause that filters the upstream records by time range. SQLMesh provides special macros that represent the start and end of the time range being processed: `@start_date` / `@end_date` and `@start_ds` / `@end_ds`.
12
12
13
-
Refer to [Macros](../macros.md#predefined-variables) for more information on these.
13
+
Refer to [Macros](../macros.md#predefined-variables) for more information.
14
14
15
-
Below is an example of a definition that takes advantage of the model's incremental nature:
16
-
```sql linenums="1"
15
+
This example implements an `INCREMENTAL_BY_TIME_RANGE` model by specifying the `kind` in the `MODEL` DDL and including a SQL `WHERE` clause to filter records by time range:
16
+
```sql linenums="1" hl_lines="3-5 12-13"
17
17
MODEL (
18
18
name db.events,
19
19
kind INCREMENTAL_BY_TIME_RANGE (
@@ -30,7 +30,10 @@ WHERE
30
30
```
31
31
32
32
### Time column
33
-
SQLMesh needs to know which column in the model's output represents a timestamp or date associated with each record. This column is used to determine which records will be overridden during data [restatement](../plans.md#restatement-plans), as well as a partition key for engines that support partitioning (such as Apache Spark).
33
+
SQLMesh needs to know which column in the model's output represents the timestamp or date associated with each record.
34
+
35
+
The `time_column` is used to determine which records will be overridden during data [restatement](../plans.md#restatement-plans) and provides a partition key for engines that support partitioning (such as Apache Spark):
36
+
34
37
```sql linenums="1" hl_lines="4"
35
38
MODEL (
36
39
name db.events,
@@ -40,7 +43,7 @@ MODEL (
40
43
);
41
44
```
42
45
43
-
Additionally, the format in which the timestamp/date is stored is required. By default, SQLMesh uses the `%Y-%m-%d` format, but it can be overridden as follows:
46
+
By default, SQLMesh assumes the time column is in the `%Y-%m-%d` format. For other formats, the default can be overridden as follows:
44
47
```sql linenums="1" hl_lines="4"
45
48
MODEL (
46
49
name db.events,
@@ -49,9 +52,9 @@ MODEL (
49
52
)
50
53
);
51
54
```
52
-
**Note:** The time format should be defined using the same dialect as the one used to define the model's query.
55
+
**Note:** The time format should be defined using the same SQL dialect as the one used to define the model's query.
53
56
54
-
SQLMesh also uses the time column to automatically append a time range filter to the model's query at runtime, which prevents records that are not part of the target interval from being stored. This is a safety mechanism that prevents the unintended overriding of unrelated records when handling late-arriving data.
57
+
SQLMesh also uses the time column to automatically append a time range filter to the model's query at runtime, which prevents records that are not part of the target interval from being stored. This is a safety mechanism that prevents unintentionally overriding unrelated records when handling late-arriving data.
55
58
56
59
Consider the following model definition:
57
60
```sql linenums="1"
@@ -70,7 +73,7 @@ WHERE
70
73
receipt_date BETWEEN @start_ds AND @end_ds;
71
74
```
72
75
73
-
At runtime, SQLMesh will automatically modify the model's query to look as follows:
76
+
At runtime, SQLMesh will automatically modify the model's query to look like this:
74
77
```sql linenums="1" hl_lines="7"
75
78
SELECT
76
79
event_date::TEXT as event_date,
@@ -82,9 +85,9 @@ WHERE
82
85
```
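The effect of the appended time range filter can be sketched in Python. This is an illustration under assumptions, not SQLMesh code: the helper `keep_rows_in_interval` and the `event_payload` field are hypothetical, and the interval is treated as inclusive on both ends, as with SQL `BETWEEN`.

```python
from datetime import date


def keep_rows_in_interval(rows, time_column, start, end):
    """Drop rows whose time column falls outside the inclusive [start, end] interval."""
    return [row for row in rows if start <= row[time_column] <= end]


# Hypothetical query output for the 2023-01-02 interval. The second record
# arrived late and carries a time column value from an earlier interval.
rows = [
    {"event_date": date(2023, 1, 2), "event_payload": "a"},
    {"event_date": date(2023, 1, 1), "event_payload": "late"},
]

# Only rows inside the target interval are stored, so data already written
# for 2023-01-01 is left untouched.
to_store = keep_rows_in_interval(rows, "event_date", date(2023, 1, 2), date(2023, 1, 2))
print(to_store)
```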
83
86
84
87
### Idempotency
85
-
It is recommended to ensure that queries of models of this kind are [idempotent](../../glossary/#idempotency) to prevent unexpected results during data [restatement](../plans.md#restatement-plans).
88
+
It is recommended that queries of models of this kind be [idempotent](../glossary.md#idempotency) to prevent unexpected results during data [restatement](../plans.md#restatement-plans).
86
89
87
-
Note, however, that upstream models and tables can impact the extent to which the idempotency property can be guaranteed. For example, referencing an upstream model of kind [FULL](#full) in the model query automatically renders such a model as non-idempotent.
90
+
Note, however, that upstream models and tables can impact a model's idempotency. For example, referencing an upstream model of kind [FULL](#full) in the model query automatically causes the model to be non-idempotent.
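To make the idempotency property concrete, here is a small Python sketch. It models a table as a dictionary keyed by day, which is an assumption made purely for illustration; the function names are hypothetical and do not correspond to any SQLMesh API.

```python
def load_replace(table: dict, day: str, rows: list) -> None:
    """Idempotent load: overwrite whatever was stored for this day."""
    table[day] = list(rows)


def load_append(table: dict, day: str, rows: list) -> None:
    """Non-idempotent load: every run adds the rows again."""
    table.setdefault(day, []).extend(rows)


replaced, appended = {}, {}
for _ in range(2):  # run the same load twice, as a restatement would
    load_replace(replaced, "2023-01-01", ["event_1", "event_2"])
    load_append(appended, "2023-01-01", ["event_1", "event_2"])

print(replaced["2023-01-01"])       # unchanged by the second run
print(len(appended["2023-01-01"]))  # grows with every run
```

Rerunning the idempotent load leaves the stored data unchanged, which is why restating an interval is safe for such models.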
88
91
89
92
### Materialization strategy
90
93
Depending on the target engine, models of the `INCREMENTAL_BY_TIME_RANGE` kind are materialized using the following strategies:
@@ -101,16 +104,18 @@ Depending on the target engine, models of the `INCREMENTAL_BY_TIME_RANGE` kind a
101
104
102
105
## INCREMENTAL_BY_UNIQUE_KEY
103
106
104
-
This kind signifies that a model should be computed incrementally based on a unique key. If a key is missing in the model's table, the new row is inserted; otherwise the existing row associated with this key is updated with the new one. This kind is a good fit for datasets that have the following traits:
107
+
Models of the `INCREMENTAL_BY_UNIQUE_KEY` kind are computed incrementally based on a unique key.
105
108
106
-
* Each record has a key associated with it.
107
-
* There should be at most one record associated with each unique key.
108
-
* It is appropriate to upsert records, meaning existing records can be overridden by new arrivals when their keys match.
109
+
If a key is missing in the model's table, the new data row is inserted; otherwise, the existing row associated with this key is updated with the new one. This kind is a good fit for datasets that have the following traits:
109
110
110
-
A [Slowly Changing Dimension](../glossary.md#slowly-changing-dimension-scd) (SCD) is one example that fits this description well.
111
+
* Each record has a unique key associated with it.
112
+
* There is at most one record associated with each unique key.
113
+
* It is appropriate to upsert records, so existing records can be overridden by new arrivals when their keys match.
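The upsert behavior described above can be sketched in Python as follows. The sketch assumes a table modeled as a dictionary keyed by the unique key; the `upsert` helper and the sample `name` / `title` rows are hypothetical and only illustrate the insert-or-update rule.

```python
def upsert(table: dict, key_column: str, new_rows: list) -> None:
    """Insert rows with unseen keys; replace rows whose key already exists."""
    for row in new_rows:
        table[row[key_column]] = row


table = {}

# First load: the key "Ann" is missing, so the row is inserted.
upsert(table, "name", [{"name": "Ann", "title": "Engineer"}])

# Second load: "Ann" exists and is updated, while "Bob" is inserted.
upsert(
    table,
    "name",
    [
        {"name": "Ann", "title": "Manager"},
        {"name": "Bob", "title": "Analyst"},
    ],
)

print(table)
```

After both loads the table holds exactly one row per key, with the most recent values for keys that appeared more than once.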
111
114
112
-
The name of the unique key column must be provided as part of the model definition, as in the following example:
113
-
```sql linenums="1"
115
+
A [Slowly Changing Dimension](../glossary.md#slowly-changing-dimension-scd) (SCD) is one approach that fits this description well.
116
+
117
+
The name of the unique key column must be provided as part of the `MODEL` DDL, as in this example:
118
+
```sql linenums="1" hl_lines="3-5"
114
119
MODEL (
115
120
name db.employees,
116
121
kind INCREMENTAL_BY_UNIQUE_KEY (
@@ -126,7 +131,7 @@ FROM raw_employees;
126
131
```
127
132
128
133
Composite keys are also supported:
129
-
```sql linenums="1"
134
+
```sql linenums="1" hl_lines="4"
130
135
MODEL (
131
136
name db.employees,
132
137
kind INCREMENTAL_BY_UNIQUE_KEY (
@@ -135,8 +140,8 @@ MODEL (
135
140
);
136
141
```
137
142
138
-
Similar to the [INCREMENTAL_BY_TIME_RANGE](#incremental_by_time_range) kind, the upstream records can be filtered by time range using the `@start_date`, `@end_date`, and so forth. Use [macros](../macros.md#predefined-variables) in order to process the input data incrementally:
139
-
```sql linenums="1"
143
+
Models of the `INCREMENTAL_BY_UNIQUE_KEY` kind can also filter upstream records by time range using a SQL `WHERE` clause and the `@start_date`, `@end_date`, or other macros (similar to the [INCREMENTAL_BY_TIME_RANGE](#incremental_by_time_range) kind):
144
+
```sql linenums="1" hl_lines="6-7"
140
145
SELECT
141
146
name::TEXT as name,
142
147
title::TEXT as title,
@@ -146,7 +151,7 @@ WHERE
146
151
event_date BETWEEN @start_date AND @end_date;
147
152
```
148
153
149
-
**Note:** Models of this kind are inherently [non-idempotent](../../glossary/#idempotency), which should be taken into consideration during data [restatement](../plans.md#restatement-plans).
154
+
**Note:** Models of the `INCREMENTAL_BY_UNIQUE_KEY` kind are inherently [non-idempotent](../glossary.md#idempotency), which should be taken into consideration during data [restatement](../plans.md#restatement-plans).
150
155
151
156
### Materialization strategy
152
157
Depending on the target engine, models of the `INCREMENTAL_BY_UNIQUE_KEY` kind are materialized using the following strategies:
@@ -162,12 +167,14 @@ Depending on the target engine, models of the `INCREMENTAL_BY_UNIQUE_KEY` kind a
162
167
| DuckDB | not supported |
163
168
164
169
## FULL
165
-
As the name suggests, this kind causes the dataset associated with a model to be fully refreshed (rewritten) upon each model evaluation. It's somewhat easier to use than incremental kinds, due to the lack of any special settings or additional query considerations. This makes it suitable for smaller datasets, where recomputing data from scratch is relatively cheap and doesn't require preservation of processing history. However, using this kind with datasets that have a high volume of records will result in significant runtime and compute costs.
170
+
Models of the `FULL` kind cause the dataset associated with a model to be fully refreshed (rewritten) upon each model evaluation.
166
171
167
-
This kind can be a good fit for aggregate tables that lack temporal dimension. For aggregate tables with temporal dimension, consider the [INCREMENTAL_BY_TIME_RANGE](#incremental_by_time_range) kind instead.
172
+
The `FULL` model kind is somewhat easier to use than incremental kinds due to the lack of special settings or additional query considerations. This makes it suitable for smaller datasets, where recomputing data from scratch is relatively cheap and doesn't require preservation of processing history. However, using this kind with datasets containing a large volume of records will result in significant runtime and compute costs.
168
173
169
-
Example:
170
-
```sql linenums="1"
174
+
This kind can be a good fit for aggregate tables that lack a temporal dimension. For aggregate tables with a temporal dimension, consider the [INCREMENTAL_BY_TIME_RANGE](#incremental_by_time_range) kind instead.
175
+
176
+
This example specifies a `FULL` model kind:
177
+
```sql linenums="1" hl_lines="3"
171
178
MODEL (
172
179
name db.salary_by_title_agg,
173
180
kind FULL
@@ -194,14 +201,16 @@ Depending on the target engine, models of the `FULL` kind are materialized using
194
201
| DuckDB | CREATE OR REPLACE TABLE |
195
202
196
203
## VIEW
197
-
Other model kinds cause the output of a model query to be materialized and stored in a physical table. The `VIEW` kind is different, because no data actually gets written during model evaluation. Instead, a non-materialized view (or "virtual table") is created or replaced based on the model's query.
204
+
The model kinds described so far cause the output of a model query to be materialized and stored in a physical table.
198
205
199
-
**Note:** With this kind, the model's query is evaluated every time the model is referenced in downstream queries. This may incur undesirable compute cost in cases where the model's query is compute-intensive, or when the model is referenced in many downstream queries.
206
+
The `VIEW` kind is different, because no data is actually written during model execution. Instead, a non-materialized view (or "virtual table") is created or replaced based on the model's query.
200
207
201
-
View is the default model kind if the kind is not specified.
208
+
**Note:** View is the default model kind if the kind is not specified.
202
209
203
-
Example:
204
-
```sql linenums="1"
210
+
**Note:** With this kind, the model's query is evaluated every time the model is referenced in a downstream query. This may incur undesirable compute cost and time in cases where the model's query is compute-intensive, or when the model is referenced in many downstream queries.
211
+
212
+
This example specifies a `VIEW` model kind:
213
+
```sql linenums="1" hl_lines="3"
205
214
MODEL (
206
215
name db.highest_salary,
207
216
kind VIEW
@@ -213,10 +222,12 @@ FROM db.employees;
213
222
```
214
223
215
224
## EMBEDDED
216
-
Embedded models are a way to share common logic between different models of other kinds. This kind is similar to [VIEW](#view), except models of this kind are never evaluated, and therefore there are no data assets (tables or views) associated with them in the data warehouse. Instead, the embedded model's query gets injected directly into a query of each downstream model that references this model in its own query.
225
+
Embedded models are a way to share common logic between different models of other kinds.
217
226
218
-
Example:
219
-
```sql linenums="1"
227
+
There are no data assets (tables or views) associated with `EMBEDDED` models in the data warehouse. Instead, an `EMBEDDED` model's query is injected directly into the query of each downstream model that references it.
228
+
229
+
This example specifies an `EMBEDDED` model kind:
230
+
```sql linenums="1" hl_lines="3"
220
231
MODEL (
221
232
name db.unique_employees,
222
233
kind EMBEDDED
@@ -228,4 +239,4 @@ FROM db.employees;
228
239
```
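The injection of an embedded model's query into a downstream query can be pictured with a deliberately naive Python sketch. Everything here is an assumption for illustration: the `inject_embedded` helper is hypothetical, the embedded query text is made up (the real one is not shown above), and plain string replacement stands in for whatever SQL rewriting SQLMesh actually performs.

```python
def inject_embedded(downstream_sql: str, model_name: str, embedded_sql: str) -> str:
    """Replace a reference to the embedded model with its query as an aliased subquery."""
    alias = model_name.split(".")[-1]
    return downstream_sql.replace(model_name, f"({embedded_sql}) AS {alias}")


# Hypothetical embedded model query and a downstream query that references it.
embedded_sql = "SELECT DISTINCT name FROM db.employees"
downstream_sql = "SELECT COUNT(*) FROM db.unique_employees"

rendered = inject_embedded(downstream_sql, "db.unique_employees", embedded_sql)
print(rendered)
```

The rendered downstream query no longer refers to `db.unique_employees` as a table or view, which matches the statement that no data asset exists for an `EMBEDDED` model.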
229
240
230
241
## SEED
231
-
This is a special kind reserved for [seed models](seed_models.md).
242
+
The `SEED` model kind is used to specify [seed models](./seed_models.md) for using static CSV datasets in your SQLMesh project.