docs/comparisons.md
+10 -10 (10 additions & 10 deletions)
@@ -7,9 +7,9 @@ There are many tools and frameworks in the data ecosystem. This page tries to ma
7
7
## dbt
8
8
[dbt](https://www.getdbt.com/) is a tool for data transformations. It is a pioneer in this space and has shown how valuable transformation frameworks can be. Although dbt is a fantastic tool, it has trouble scaling with data and organizational size.
9
9
10
-
SQLMesh aims to be dbt formatcompatible. Importing existing dbt projects with minor changes is in development.
10
+
SQLMesh aims to be dbt format-compatible. Importing existing dbt projects with minor changes is in development.
11
11
12
-
### Feature Comparisons
12
+
### Feature comparisons
13
13
| Feature | dbt | SQLMesh
14
14
| ------- | --- | -------
15
15
| `SQL models` | ✅ | ✅
@@ -42,8 +42,8 @@ SQLMesh aims to be dbt format compatible. Importing existing dbt projects with m
42
42
| `Comprehensive Python API` | ❌ | ✅
43
43
44
44
45
-
### Incremental Models
46
-
Implementing an incremental model is difficult and error-prone in dbt because it does not keep track of state. Since there is no state in dbt, the user must write subqueries to find missing date boundaries.
45
+
### Incremental models
46
+
Implementing an incremental model is difficult and error-prone in dbt, because dbt does not keep track of state. Since there is no state in dbt, the user must write subqueries to find missing date boundaries.
47
47
48
48
#### Complexity
49
49
```sql
@@ -62,9 +62,9 @@ WHERE e.ds >= (SELECT MAX(ds) FROM {{ this }})
62
62
{% endif %}
63
63
```
64
64
65
-
Having to manually specify macros to find date boundaries is repetitive and error-prone. As incremental models become more complex, the cognitive burden of having two run times, "first time full refresh" vs "subsequent incremental", increases.
65
+
Having to manually specify macros to find date boundaries is repetitive and error-prone. As incremental models become more complex, the cognitive burden of having two run times, "first time full refresh" vs. "subsequent incremental", increases.
66
66
67
-
SQLMesh keeps track of which date ranges exist so the query can be simplified as follows.
67
+
SQLMesh keeps track of which date ranges exist so that the query can be simplified as follows:
68
68
69
69
```sql
70
70
-- sqlmesh incremental
@@ -77,9 +77,9 @@ WHERE d.ds BETWEEN @start_ds AND @end_ds
77
77
```
78
78
79
79
#### Data leakage
80
-
dbt does not enforce that the data inserted into the incremental table should be there. This can lead to problems or consistency issues such as latearriving data overriding past partitions. SQLMesh wraps all queries under the hood in a subquery with a time filter to enforce that the data inserted for a particular batch is as expected.
80
+
dbt does not enforce that the data inserted into the incremental table should be there. This can lead to problems or consistency issues, such as late-arriving data overriding past partitions. SQLMesh wraps all queries under the hood in a subquery with a time filter to enforce that the data inserted for a particular batch is as expected.
81
81
82
-
dbt also only supports the 'insert/overwrite' incremental load pattern for systems that natively support it. SQLMesh enables 'insert/overwrite' on any system because it is the most robust way to do incremental pipelines. 'Append' pipelines risk data accuracy in the variety of scenarios where your pipelines may run more than once for a given date.
82
+
dbt also only supports the 'insert/overwrite' incremental load pattern for systems that natively support it. SQLMesh enables 'insert/overwrite' on any system, because it is the most robust way to do incremental pipelines. 'Append' pipelines risk data accuracy in the variety of scenarios where your pipelines may run more than once for a given date.
83
83
84
84
85
85
```sql
@@ -103,15 +103,15 @@ WHERE ds BETWEEN @start_ds AND @end_ds
103
103
```
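The risk that 'append' pipelines carry when a run is repeated for the same date can be illustrated with a small sketch. This is plain Python over in-memory lists, not SQLMesh or dbt code; the function names `append_load` and `insert_overwrite_load` and the sample rows are hypothetical.

```python
def append_load(table, batch):
    """Append every row; re-running the same batch duplicates it."""
    table.extend(batch)


def insert_overwrite_load(table, batch, ds):
    """Drop the existing rows for partition `ds`, then insert the batch."""
    table[:] = [row for row in table if row["ds"] != ds]
    table.extend(batch)


# Hypothetical batch for a single date partition.
batch = [{"ds": "2023-01-01", "id": 1}, {"ds": "2023-01-01", "id": 2}]

appended = []
append_load(appended, batch)
append_load(appended, batch)  # accidental second run for the same date

overwritten = []
insert_overwrite_load(overwritten, batch, "2023-01-01")
insert_overwrite_load(overwritten, batch, "2023-01-01")  # same accidental rerun

print(len(appended), len(overwritten))
```

Under these assumptions the appended table ends up with duplicated rows after the rerun, while the overwritten table holds each row once.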
104
104
105
105
#### Data gaps
106
-
The main pattern used in incremental models checks for MAX(ds). This pattern does not catch missing data from the past or data gaps.
106
+
The main pattern used in incremental models checks for MAX(ds). This pattern does not catch missing data from the past, or data gaps.
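To see why a `MAX(ds)` check misses gaps, consider the following sketch. It is an illustration only and does not reflect how SQLMesh stores intervals; `missing_days` and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def missing_days(loaded, start, end):
    """Return every day in [start, end] that is absent from `loaded`."""
    days = []
    current = start
    while current <= end:
        if current not in loaded:
            days.append(current)
        current += timedelta(days=1)
    return days


# January 3rd was never loaded, for example because a run failed.
loaded = {date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 4)}

# A MAX(ds)-style check only looks forward from the latest loaded day.
resume_from = max(loaded)

# Tracking the full set of loaded days exposes the hole in the past.
gaps = missing_days(loaded, date(2023, 1, 1), date(2023, 1, 4))

print(resume_from, gaps)
```

The `MAX(ds)` approach would resume from January 4th and never revisit January 3rd, whereas comparing the expected range against the recorded days surfaces the gap.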
docs/concepts/architecture/serialization.md
+1 -1 (1 addition & 1 deletion)
@@ -4,7 +4,7 @@ SQLMesh executes Python code through [macros](../macros.md) and [Python models](
4
4
5
5
## Serialization format
6
6
7
-
Rather than using Python's `pickle` format, SQLMesh has it's own serialization format. This is because `pickle` is not compatible across Python versions, and would for example prevent you from developing on Python 3.7 and then running Python 3.8 in production.
7
+
Rather than using Python's `pickle` format, SQLMesh has its own serialization format. This is because `pickle` is not compatible across Python versions, and would, for example, prevent you from developing on Python 3.7 and then running Python 3.8 in production.
8
8
9
9
Instead, SQLMesh stores the string representation of your Python implementation and then re-evaluates it. Given a custom Python function or macro, SQLMesh reads the Abstract Syntax Tree (AST) of the function and converts that into a string representation, along with all dependencies and global variables. For more information, refer to [snapshot fingerprinting](../architecture/snapshots.md#fingerprinting).
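The general idea of storing source text and re-evaluating it can be sketched as follows. This is not SQLMesh's actual implementation: `serialize`, `deserialize`, `add_tax`, and `RATE` are hypothetical names, and the sketch assumes Python 3.9 or later for `ast.unparse`.

```python
import ast

# Hypothetical user function that depends on a global variable, RATE.
SOURCE = '''
def add_tax(amount, rate=RATE):
    return round(amount * (1 + rate), 2)
'''


def serialize(source_code):
    """Parse the source into an AST and render it back to a plain string."""
    tree = ast.parse(source_code)
    return ast.unparse(tree)


def deserialize(serialized, global_vars, name):
    """Re-evaluate the stored string with its captured globals."""
    namespace = dict(global_vars)
    exec(serialized, namespace)  # only safe for trusted, self-authored code
    return namespace[name]


payload = serialize(SOURCE)  # a string that can be stored and moved between versions
add_tax = deserialize(payload, {"RATE": 0.2}, "add_tax")

print(payload)
print(add_tax(100))
```

Because the payload is ordinary source text plus the globals it needs, it does not depend on the binary `pickle` protocol of the interpreter that produced it.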
docs/concepts/architecture/snapshots.md
+1 -1 (1 addition & 1 deletion)
@@ -1,5 +1,5 @@
1
1
# Snapshots
2
-
A snapshot is a record of a model at a given time. Along with a copy of the model, a snapshot contains everything needed to evaluate the model and render its query. This allows SQLMesh to have a consistent view of your project's history and its data as the project and its models evolve and change. Since model queries can have macros, each snapshot stores a copy of all macro definitions and global variables at the time the snapshot is taken. Additionally, snapshots store the intervals of time they have data for.
2
+
A snapshot is a record of a model at a given time. Along with a copy of the model, a snapshot contains everything needed to evaluate the model and render its query. This allows SQLMesh to have a consistent view of your project's history and its data as the project and its models evolve and change. Since model queries can have macros, each snapshot stores a copy of all macro definitions and global variables at the time the snapshot is taken. Additionally, snapshots store the intervals of time for which they have data.
3
3
4
4
## Fingerprinting
5
5
Snapshots have unique fingerprints that are derived from their models. SQLMesh uses these fingerprints
docs/concepts/audits.md
+4 -4 (4 additions & 4 deletions)
@@ -42,7 +42,7 @@ AUDIT (
42
42
SELECT * FROM @this_model
43
43
WHERE @column >= @threshold;
44
44
```
45
-
In the example above we utilized [Macros](macros.md) to parameterize the audit implementation. `@this_model` is a special macro which refers to a model that is being audited. For incremental models, this macro also ensures that only relevant data intervals are affected. `@column` and `@threshold` are generic parameters, values for which are set in the model definition.
45
+
In the example above we utilized [macros](macros.md) to parameterize the audit implementation. `@this_model` is a special macro which refers to a model that is being audited. For incremental models, this macro also ensures that only relevant data intervals are affected. `@column` and `@threshold` are generic parameters, values for which are set in the model definition.
46
46
47
47
The generic audit can now be applied to a model by being referenced in its definition:
48
48
```sql linenums="1"
@@ -56,7 +56,7 @@ MODEL (
56
56
```
57
57
Notice how `column` and `threshold` parameters have been set at this point. These values will later be propagated into the audit query and returned by the `@column` and `@threshold` macros accordingly.
58
58
59
-
Please also note that the same audit can be applied more than once to the same model with different sets of parameters.
59
+
Note that the same audit can be applied more than once to the same model with different sets of parameters.
60
60
61
61
## Built-in audits
62
62
SQLMesh comes with a suite of built-in generic audits which covers a broad set of common use cases.
@@ -75,7 +75,7 @@ MODEL (
75
75
```
76
76
77
77
### unique_values
78
-
Makes sure that provided columns only contain unique values.
78
+
Ensures that provided columns only contain unique values.
79
79
80
80
Example:
81
81
```sql linenums="1"
@@ -101,7 +101,7 @@ MODEL (
101
101
```
102
102
103
103
### number_of_rows
104
-
Ensures that the number of rows in the model's table exceeds the configured threshold. For incremental models this check only applies to a data interval that is being evaluated, and not to the entire table.
104
+
Ensures that the number of rows in the model's table exceeds the configured threshold. For incremental models, this check only applies to a data interval that is being evaluated, not to the entire table.
Environments are isolated namespaces that allow you to test and preview your changes.
3
3
4
-
SQLMesh differentiates between production and development environments. Currently only the environment with the name `prod` is treated by SQLMesh as the production one. Environments with other names are considered to be development ones.
4
+
SQLMesh differentiates between production and development environments. Currently, only the environment with the name `prod` is treated by SQLMesh as the production one. Environments with other names are considered to be development ones.
5
5
6
6
[Models](models/overview.md) in development environments get a special suffix appended to the schema portion of their names. For example, to access data for a model with name `db.model_a` in the target environment `my_dev`, the `db__my_dev.model_a` table name should be used in a query. Models in the production environment are referred to by their original names.
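The naming rule described above can be sketched in a few lines. This is an illustration of the rule as stated, not SQLMesh code; `table_name_for` is a hypothetical helper, and it assumes the model name is schema-qualified.

```python
def table_name_for(model_name, environment):
    """Return the table name to query for a model in a given environment."""
    if environment == "prod":
        # Production models keep their original names.
        return model_name
    schema, sep, table = model_name.rpartition(".")
    if not sep:
        raise ValueError(f"expected a schema-qualified name, got {model_name!r}")
    # Development environments get a suffix on the schema portion.
    return f"{schema}__{environment}.{table}"


print(table_name_for("db.model_a", "my_dev"))
print(table_name_for("db.model_a", "prod"))
```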
7
7
8
8
## Why use environments
9
-
Data pipelines and their dependencies tend to grow in complexity over time and so assessing the impact of local changes can become quite challenging. Pipeline owners may not be aware of all downstream consumers of their pipelines, or may drastically underestimate the impact a change would have. That's why it is so important to be able to iterate and test model changes using production dependencies and data, while simultaneously avoiding any impact to existing datasets and/or pipelines that are currently used in production. Recreating the entire data warehouse with given changes would be an ideal solution to fully understand their impact, but this process is usually excessively expensive and time consuming.
9
+
Data pipelines and their dependencies tend to grow in complexity over time, and so assessing the impact of local changes can become quite challenging. Pipeline owners may not be aware of all downstream consumers of their pipelines, or may drastically underestimate the impact a change would have. That's why it is so important to be able to iterate and test model changes using production dependencies and data, while simultaneously avoiding any impact to existing datasets or pipelines that are currently used in production. Recreating the entire data warehouse with given changes would be an ideal solution to fully understand their impact, but this process is usually excessively expensive and time-consuming.
10
10
11
11
SQLMesh environments allow you to easily spin up shallow 'clones' of the data warehouse quickly and efficiently. SQLMesh understands which models have changed compared to the target environment, and only computes data gaps that have been directly caused by the changes. Any changes or backfills within the target environment **do not impact** other environments. At the same time, any computation that was done in this environment **can be safely reused** in other environments.
12
12
@@ -16,17 +16,17 @@ When running the [plan](plans.md) command, the environment name can be supplied
16
16
By default, the [`sqlmesh plan`](plans.md) command targets the production (`prod`) environment.
17
17
18
18
### Example
19
-
A custom name can be provided as an argument to create/update a development environment. For example, to target an environment with name `my_dev`, run:
19
+
A custom name can be provided as an argument to create or update a development environment. For example, to target an environment with name `my_dev`, run:
20
20
21
21
```bash
22
22
sqlmesh plan my_dev
23
23
```
24
24
A new environment is created automatically the first time a plan is applied to it.
25
25
26
-
## How do environments work
26
+
## How environments work
27
27
Whenever a model definition changes, a new model snapshot is created with a unique [fingerprint](architecture/snapshots.md#fingerprinting). This fingerprint allows SQLMesh to detect if a given model variant exists in other environments or if it's a brand new variant. Because models may depend on other models, the fingerprint of a target model variant also includes fingerprints of its upstream dependencies. If a fingerprint already exists in SQLMesh, it is safe to reuse the existing physical table associated with that model variant, since we're confident that the logic that populates that table is exactly the same. This makes an environment a collection of references to model [snapshots](architecture/snapshots.md).
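A fingerprint that incorporates upstream dependencies can be sketched as below. The hashing scheme is an assumption chosen for illustration (SHA-256 over the query text and the sorted upstream fingerprints); it is not SQLMesh's actual algorithm, and `fingerprint` is a hypothetical function.

```python
import hashlib


def fingerprint(query, upstream_fingerprints):
    """Hash a model's query together with its upstream fingerprints."""
    digest = hashlib.sha256()
    digest.update(query.encode("utf-8"))
    # Sort so the result does not depend on the order of dependencies.
    for upstream in sorted(upstream_fingerprints):
        digest.update(upstream.encode("utf-8"))
    return digest.hexdigest()


# The child query is identical in both cases; only the parent changes.
parent_v1 = fingerprint("SELECT 1 AS a", [])
parent_v2 = fingerprint("SELECT 2 AS a", [])
child_v1 = fingerprint("SELECT a FROM parent", [parent_v1])
child_v2 = fingerprint("SELECT a FROM parent", [parent_v2])

print(child_v1 == child_v2)
```

In this sketch, an unchanged model with unchanged parents always produces the same fingerprint, which is what makes reusing an existing table safe, while a change anywhere upstream yields a new fingerprint downstream.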
28
28
29
-
Please refer to the [Plans](plans.md#plan-application) page for additional details.
29
+
Refer to [plans](plans.md#plan-application) for additional details.
30
30
31
31
## Date range
32
32
A development environment includes a start date and end date. When creating a development environment, the intent is usually to test changes on a subset of data. The size of such a subset is determined by a time range defined through the start and end date of the environment. Both start and end date are provided during the [plan](plans.md) creation.
docs/concepts/glossary.md
+8 -2 (8 additions & 2 deletions)
@@ -42,14 +42,20 @@ Combining data from various sources (such as from a data warehouse) into one uni
42
42
## Lineage
43
43
The lineage of your data is a visualization of the life cycle of your data as it flows from data sources downstream to consumption.
44
44
45
+
## Slowly Changing Dimension (SCD)
46
+
A dimension (in a data warehouse, typically a dataset) containing relatively static data that can change slowly but unpredictably, rather than on a regular schedule. Some examples of typical slowly changing dimensions are places and products.
47
+
45
48
## Table
46
49
A table is the visual representation of data stored in rows and columns.
47
50
51
+
## User-Defined Function (UDF)
52
+
Functions that a user of a database server provides to extend its functionality, in contrast to built-in functions that are already provided. UDFs are typically written to satisfy the particular requirements of the user.
53
+
48
54
## View
49
55
A view is the result of a SQL query on a database.
50
56
51
57
## Virtual Data Marts
52
-
Term used to describe's SQLMesh's ability to share tables across environments to ensure tables are only built once while maintaining data integrity and environment isolation. See [Plan Application](plans.md#plan-application) for more information.
58
+
Term used to describe SQLMesh's ability to share tables across environments to ensure tables are only built once while maintaining data integrity and environment isolation. See [plan application](plans.md#plan-application) for more information.
53
59
54
60
## Virtual Update
55
-
Term used to describe a plan that can be applied without having to load any additional data or build any additional tables. See [Plan's Virtual Update](plans.md#virtual-update) for more information.
61
+
Term used to describe a plan that can be applied without having to load any additional data or build any additional tables. See [Virtual Update](plans.md#virtual-update) for more information.
docs/concepts/macros.md
+4 -4 (4 additions & 4 deletions)
@@ -1,6 +1,6 @@
1
1
# Macros
2
2
3
-
Although SQL is not dynamic, data pipelines need some form of dynamicism to be useful. For example, you may want a SQL query that runs the same logic except for a filter on dates that should change with every invocation.
3
+
Although SQL is not dynamic, data pipelines need some form of dynamism in order to be useful. For example, you may want a SQL query that runs the same logic except for a filter on dates that should change with every invocation.
4
4
5
5
```sql linenums="1"
6
6
SELECT *
@@ -71,7 +71,7 @@ Variables:
71
71
* @latest_millis
72
72
73
73
## Jinja
74
-
[Jinja](https://jinja.palletsprojects.com/en/3.1.x/) is a popular templating tool for creating dynamic SQL and **is supported** by SQLMesh, but there are some drawbacks which lead for us to create our own Macro system.
74
+
[Jinja](https://jinja.palletsprojects.com/en/3.1.x/) is a popular templating tool for creating dynamic SQL and is supported by SQLMesh, but it has some drawbacks which led us to create our own macro system.
75
75
76
76
* Jinja is not valid SQL and not parseable.
77
77
```sql linenums="1"
@@ -80,9 +80,9 @@ SE{{ 'lect' }} x {{ 'AS ' + var }}
80
80
FROM {{ 'table CROSS JOIN z' }}
81
81
```
82
82
83
-
* Jinja is verbose and difficult to debug
83
+
* Jinja is verbose and difficult to debug.
84
84
```sql linenums="1"
85
-
TODO example with multiple for loops with trailing or leading comma
85
+
TBD example with multiple for loops with trailing or leading comma
docs/concepts/models/model_kinds.md
+3 -3 (3 additions & 3 deletions)
@@ -84,7 +84,7 @@ WHERE
84
84
### Idempotency
85
85
It is recommended to ensure that queries of models of this kind are [idempotent](../../glossary/#idempotency) to prevent unexpected results during data [restatement](../plans.md#restatement-plans).
86
86
87
-
Please note, however, that upstream models and tables can impact the extent to which the idempotency property can be guaranteed. For example, referencing an upstream model of kind [FULL](#full) in the model query automatically renders such a model as non-idempotent.
87
+
Note, however, that upstream models and tables can impact the extent to which the idempotency property can be guaranteed. For example, referencing an upstream model of kind [FULL](#full) in the model query automatically renders such a model as non-idempotent.
88
88
89
89
### Materialization strategy
90
90
Depending on the target engine, models of the `INCREMENTAL_BY_TIME_RANGE` kind are materialized using the following strategies:
@@ -107,7 +107,7 @@ This kind signifies that a model should be computed incrementally based on a uni
107
107
* There should be at most one record associated with each unique key.
108
108
* It is appropriate to upsert records, meaning existing records can be overridden by new arrivals when their keys match.
109
109
110
-
[Slowly Changing Dimensions](https://en.wikipedia.org/wiki/Slowly_changing_dimension) (SCD) is one example that fits this description well.
110
+
A [Slowly Changing Dimension](../glossary.md#slowly-changing-dimension-scd) (SCD) is one example that fits this description well.
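The upsert behavior described in the list above can be sketched with a dictionary keyed on the unique column. This is a conceptual illustration, not how any particular engine or SQLMesh performs the merge; `upsert` and the sample rows are hypothetical.

```python
def upsert(existing, new_rows, key):
    """Merge rows so that each key appears once and new arrivals win."""
    merged = {row[key]: row for row in existing}
    for row in new_rows:
        merged[row[key]] = row  # overrides the existing record on a key match
    return list(merged.values())


existing = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
arrivals = [{"id": 2, "name": "B"}, {"id": 3, "name": "c"}]

result = upsert(existing, arrivals, "id")
print(result)
```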
111
111
112
112
The name of the unique key column must be provided as part of the model definition, as in the following example:
113
113
```sql linenums="1"
@@ -198,7 +198,7 @@ Other model kinds cause the output of a model query to be materialized and store
198
198
199
199
**Note:** With this kind, the model's query is evaluated every time the model is referenced in downstream queries. This may incur undesirable compute cost in cases where the model's query is compute-intensive, or when the model is referenced in many downstream queries.
200
200
201
-
View is the default model kind if kind is not specified.
201
+
View is the default model kind if the kind is not specified.