Commit 5108b1a

Edits to body (#617)
1 parent 71bb0ff commit 5108b1a

1 file changed

Lines changed: 27 additions & 18 deletions

docs/comparisons.md

@@ -48,16 +48,20 @@ SQLMesh aims to be dbt format-compatible. Importing existing dbt projects with m
4848
| `Comprehensive Python API` | ❌ | ✅
4949

5050
### Environments
51-
Development and staging environments in dbt are expensive to make and not fully representative of what will go into production.
51+
Development and staging environments in dbt are costly to make and not fully representative of what will go into production.
5252

53-
The usual flow for creating a new environment in dbt is to rerun your entire warehouse in a new environment. This may work at small scales, but even if it does, it's a waste of time and money. SQLMesh is able to provide efficient isolated environments with [Virtual Data Marts](concepts/plans.md#plan-application). Creating a development environment in SQLMesh is free -- you can quickly get a full replica of any other environment with a simple command. Environments in dbt cost compute and storage.
53+
The standard approach to creating a new environment in dbt is to rerun your entire warehouse in a new environment. This may work at small scales, but even then it wastes time and money.
5454

55-
Additionally, SQLMesh ensures that promotion of staging environments to production is predictable and consistent. Promotions are simple pointer swaps meaning there is again no wasted compute. There is no concept of promotion in dbt, and queries are all rerun when it's time to deploy something.
55+
SQLMesh provides efficient isolated environments with [Virtual Data Marts](./concepts/plans.md#plan-application). Environments in dbt cost compute and storage, but creating a development environment in SQLMesh is free -- you can quickly access a full replica of any other environment with a single command.
56+
57+
Additionally, SQLMesh ensures that promotion of staging environments to production is predictable and consistent. There is no concept of promotion in dbt, so queries are all rerun when it's time to deploy something. In SQLMesh, promotions are simple pointer swaps so there is no wasted compute.
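
The pointer-swap idea can be sketched in a few lines of Python. This is only an illustration under assumed names (`Environment`, `create_from`, `promote`, and the table identifiers are all hypothetical), not SQLMesh's actual implementation. An environment is modeled as a mapping from model names to already-built physical tables, so creating or promoting one copies references instead of recomputing data.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Hypothetical environment: model name -> physical table id."""
    name: str
    pointers: dict[str, str] = field(default_factory=dict)


def create_from(source: Environment, name: str) -> Environment:
    # Copy only the pointers; no physical table is rebuilt.
    return Environment(name=name, pointers=dict(source.pointers))


def promote(source: Environment, target: Environment) -> list[str]:
    """Point target at source's tables; return the models whose pointer changed."""
    changed = sorted(
        model for model, table in source.pointers.items()
        if target.pointers.get(model) != table
    )
    target.pointers = dict(source.pointers)
    return changed


prod = Environment("prod", {"events": "events__v1", "users": "users__v1"})
dev = create_from(prod, "dev")
dev.pointers["events"] = "events__v2"  # only the changed model is rebuilt in dev
changed = promote(dev, prod)
```

In this sketch, promoting `dev` to `prod` touches only the `events` pointer, and no query is rerun.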
5658

5759
### Incremental models
58-
Implementing an incremental model is difficult and error-prone in dbt, because dbt does not keep track of state. Since there is no state in dbt, the user must write subqueries to find missing date boundaries.
60+
Implementing incremental models is difficult and error-prone in dbt because it does not keep track of state.
5961

6062
#### Complexity
63+
Since there is no state in dbt, users must write and maintain subqueries to find missing date boundaries themselves:
64+
6165
```sql
6266
-- dbt incremental
6367
SELECT *
@@ -70,11 +74,13 @@ JOIN raw.event_dims d
7074
AND d.ds >= (SELECT MAX(ds) FROM {{ this }})
7175
{% endif %}
7276
{% if is_incremental() %}
73-
WHERE e.ds >= (SELECT MAX(ds) FROM {{ this }})
77+
WHERE e.ds >= (SELECT MAX(ds) FROM {{ this }})
7478
{% endif %}
7579
```
7680

77-
Having to manually specify macros to find date boundaries is repetitive and error-prone. As incremental models become more complex, the cognitive burden of having two run times, "first time full refresh" vs. "subsequent incremental", increases.
81+
Manually specifying macros to find date boundaries is repetitive and error-prone.
82+
83+
The example above shows how incremental models behave differently in dbt depending on whether they have been run before. As models become more complex, the cognitive burden of maintaining two run modes, "first time full refresh" vs. "subsequent incremental", increases.
7884

7985
SQLMesh keeps track of which date ranges exist so that the query can be simplified as follows:
8086

@@ -89,11 +95,13 @@ WHERE d.ds BETWEEN @start_ds AND @end_ds
8995
```
9096

9197
#### Data leakage
92-
dbt does not enforce that the data inserted into the incremental table should be there. This can lead to problems or consistency issues, such as late-arriving data overriding past partitions. SQLMesh wraps all queries under the hood in a subquery with a time filter to enforce that the data inserted for a particular batch is as expected.
98+
dbt does not check whether the data inserted into an incremental table should be there or not. This can lead to problems or consistency issues, such as late-arriving data overriding past partitions. These problems are called "data leakage."
9399

94-
dbt also only supports the 'insert/overwrite' incremental load pattern for systems that natively support it. SQLMesh enables 'insert/overwrite' on any system, because it is the most robust way to do incremental pipelines. 'Append' pipelines risk data accuracy in the variety of scenarios where your pipelines may run more than once for a given date.
100+
SQLMesh wraps all queries in a subquery with a time filter under the hood to enforce that the data inserted for a particular batch is as expected.
95101

102+
In addition, dbt only supports the 'insert/overwrite' incremental load pattern for systems that natively support it. SQLMesh enables 'insert/overwrite' on any system, because it is the most robust approach to incremental loading. 'Append' pipelines risk data inaccuracy in the variety of scenarios where your pipelines may run more than once for a given date.
96103

104+
This example shows the time filtering subquery SQLMesh applies to all queries as a guard against data leakage:
97105
```sql
98106
-- original query
99107
SELECT *
@@ -115,29 +123,30 @@ WHERE ds BETWEEN @start_ds AND @end_ds
115123
```
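
The effect of such a guard can be tried out with SQLite from Python's standard library. This is a rough sketch under assumptions: the `guarded` helper and the `raw_events` table are hypothetical, and the `:start_ds` / `:end_ds` bind parameters stand in for the `@start_ds` / `@end_ds` macros. It is not how SQLMesh builds the wrapper internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (id INTEGER, ds TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [(1, "2022-01-01"), (2, "2022-01-02"), (3, "2022-01-02"), (4, "2022-01-03")],
)

# A model query with no time filter of its own.
model_query = "SELECT id, ds FROM raw_events"


def guarded(query: str) -> str:
    # Wrap the query so only rows inside the batch window can pass through.
    return f"SELECT * FROM ({query}) AS _subquery WHERE ds BETWEEN :start_ds AND :end_ds"


rows = conn.execute(
    guarded(model_query) + " ORDER BY id",
    {"start_ds": "2022-01-02", "end_ds": "2022-01-02"},
).fetchall()
```

Even though the inner query selects every row, only the rows dated 2022-01-02 reach the batch for that date.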
116124

117125
#### Data gaps
118-
The main pattern used in incremental models checks for MAX(ds). This pattern does not catch missing data from the past, or data gaps.
126+
The main pattern used to implement incremental models in dbt is checking for the most recent data with MAX(date). This pattern does not catch missing data from the past, or "data gaps."
119127

128+
SQLMesh stores each date interval a model has been run with, so it knows exactly what dates are missing:
120129
```
121130
Expected dates: 2022-01-01, 2022-01-02, 2022-01-03
122131
Missing past data: ?, 2022-01-02, 2022-01-03
123132
Data gap: 2022-01-01, ?, 2022-01-03
124133
```
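
A minimal sketch of why tracking processed dates catches both cases in the listing above. The function name `missing_dates` and the set-of-dates representation are assumptions made for illustration; SQLMesh's internal interval storage may look quite different.

```python
from datetime import date, timedelta


def missing_dates(start: date, end: date, processed: set[date]) -> list[date]:
    """Return every date in [start, end] that has not been processed."""
    days = (end - start).days + 1
    expected = (start + timedelta(days=i) for i in range(days))
    return [d for d in expected if d not in processed]


start, end = date(2022, 1, 1), date(2022, 1, 3)

# Missing past data: MAX(ds) is 2022-01-03, so a MAX-based filter never looks back.
missing_past = missing_dates(start, end, {date(2022, 1, 2), date(2022, 1, 3)})

# Data gap: the hole sits between two processed dates.
gap = missing_dates(start, end, {date(2022, 1, 1), date(2022, 1, 3)})
```

A filter based only on the maximum processed date would report nothing to do in either case, while the comparison against the expected range finds the hole.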
125134

126-
SQLMesh stores each date interval a model has been run with, so it knows exactly what dates are missing.
127-
128135
#### Performance
129-
The subqueries that look for MAX(date) could have a performance impact on the query. SQLMesh is able to avoid these extra subqueries.
136+
Subqueries that look for MAX(date) could have a performance impact on the primary query. SQLMesh is able to avoid these extra subqueries.
137+
138+
Additionally, dbt expects an incremental model to be able to fully refresh the first time it runs. For some large data sets, this is cost-prohibitive or infeasible.
130139

131-
Additionally, dbt expects an incremental model to be able to fully refresh the first time it runs. For some large scale data sets, this is cost prohibitive or infeasible. SQLMesh is able to [batch](../concepts/models/overview#batch_size) up backfills into more manageable chunks.
140+
SQLMesh is able to [batch](../concepts/models/overview#batch_size) up backfills into more manageable chunks.
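
To make the batching idea concrete, here is a small sketch that splits a backfill range into chunks of a fixed number of days. The `batches` function is hypothetical and only illustrates the concept; it does not reproduce how SQLMesh interprets `batch_size`.

```python
from datetime import date, timedelta


def batches(start: date, end: date, batch_size: int) -> list[tuple[date, date]]:
    """Split the inclusive range [start, end] into chunks of at most batch_size days."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    out = []
    cursor = start
    while cursor <= end:
        batch_end = min(cursor + timedelta(days=batch_size - 1), end)
        out.append((cursor, batch_end))
        cursor = batch_end + timedelta(days=1)
    return out


chunks = batches(date(2022, 1, 1), date(2022, 1, 10), batch_size=4)
```

Each chunk can then be run as its own bounded query, so a ten-day backfill becomes three smaller jobs instead of one full refresh.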
132141

133142
### SQL understanding
134-
dbt heavily relies on [Jinja](https://jinja.palletsprojects.com/en/3.1.x/). It has no understanding of SQL and treats all queries as raw strings with no context. This means that simple syntax errors (like a trailing comma) are difficult to debug and require a full run to detect.
143+
dbt heavily relies on [Jinja](https://jinja.palletsprojects.com/en/3.1.x/). It has no understanding of SQL and treats all queries as raw strings without context. This means that simple syntax errors like trailing commas are difficult to debug and require a full run to detect.
135144

136-
Although SQLMesh supports Jinja, it does not rely on it and parses/understands SQL through [SQLGlot](https://github.com/tobymao/sqlglot). Simple errors can be detected at compile time. You no longer have to wait minutes to see that you've referenced a column incorrectly or missed a comma.
145+
SQLMesh supports Jinja, but it does not rely on it. Instead, it parses and understands SQL through [SQLGlot](https://github.com/tobymao/sqlglot). Simple errors can be detected at compile time, so you no longer have to wait minutes to see that you've referenced a column incorrectly or missed a comma.
137146

138-
Additionally, having a first-class understanding of SQL allows SQLMesh to do some interesting things, like transpilation, column-level lineage, and automatic change categorization.
147+
Additionally, having a first-class understanding of SQL allows SQLMesh to do some interesting and useful things like transpilation, column-level lineage, and automatic change categorization.
139148

140149
### Testing
141-
dbt calls data quality checks testing. Although data quality checks are extremely valuable, they are not sufficient for creating robust data pipelines. Data quality checks are great for detecting upstream data issues and large scale problems like nulls and duplicates. But they are not meant for testing edge cases or business logic.
150+
Data quality checks, such as detecting NULL values and duplicated rows, are extremely valuable for detecting upstream data issues and large-scale problems. However, they are not meant for testing edge cases or business logic, and they are not sufficient for creating robust data pipelines.
142151

143-
[Unit and integration tests](concepts/tests.md) are the tools to use to validate business logic. SQLMesh encourages users to add unit tests to all of their models to ensure changes don't unexpectedly break assumptions. Unit tests are designed to be fast and self contained so that they can run in CI.
152+
[Unit and integration tests](./concepts/tests.md) are the tools to use to validate business logic. SQLMesh encourages users to add unit tests to all of their models to ensure changes don't unexpectedly break assumptions. Unit tests are designed to be fast and self-contained so that they can run in continuous integration (CI) frameworks.
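
The difference between a data quality check and a unit test can be shown with a generic example. The sketch below does not use SQLMesh's own test format; it uses SQLite from Python's standard library, and the `orders` table and `revenue_by_customer` function are hypothetical. It feeds a fixed input to a query and asserts the exact expected output, which is what makes an edge case such as refunds testable.

```python
import sqlite3


def revenue_by_customer(conn: sqlite3.Connection) -> list[tuple[int, int]]:
    # Business logic under test: refunds are stored as negative amounts.
    return conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        "GROUP BY customer_id ORDER BY customer_id"
    ).fetchall()


def test_refunds_reduce_revenue() -> None:
    # Self-contained fixture: no warehouse or upstream data is needed.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)", [(1, 100), (1, -30), (2, 50)]
    )
    assert revenue_by_customer(conn) == [(1, 70), (2, 50)]
```

Because the fixture lives in memory, a test like this runs in milliseconds and fits naturally into a CI job.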
