Commit 72721b6

Fill in the documentation page for seed models (#392)
1 parent d1e66d3 commit 72721b6

2 files changed

Lines changed: 113 additions & 3 deletions

Lines changed: 112 additions & 1 deletion
@@ -1,3 +1,114 @@
11
# Seed models
22

3-
TODO
3+
A seed is a special kind of model whose data comes from a static dataset stored in a CSV file, rather than from a user-defined SQL or Python implementation. The CSV files themselves are also part of your SQLMesh project.
4+
5+
Because seeds are also models in SQLMesh, they get the same benefits as SQL and Python models:
6+
7+
* A physical table that reflects the contents of the seed's CSV file is created in the data warehouse.
8+
* Seed models can be referenced in downstream models the same way as other models.
9+
* Changes to CSV files are captured during [planning](../plans.md#plan-application) and versioned using the same [fingerprinting](../architecture/snapshots.md#fingerprinting) mechanism.
10+
* [Environment](../environments.md) isolation also applies to seed models.
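
To illustrate how a change to a CSV file can be detected, here is a minimal sketch that derives a fingerprint from the file's contents. The `csv_fingerprint` helper and the choice of SHA-256 are assumptions made for this illustration only; SQLMesh's actual fingerprinting mechanism is described in the linked snapshots documentation and may differ.

```python
import hashlib


def csv_fingerprint(csv_text: str) -> str:
    """Return a SHA-256 hex digest of the CSV contents.

    Hypothetical helper for illustration; not SQLMesh's implementation.
    """
    return hashlib.sha256(csv_text.encode("utf-8")).hexdigest()


original = "name,date\nNew Year,2023-01-01\nChristmas,2023-12-25\n"
updated = original + "Independence Day,2023-07-04\n"

# Identical contents always produce the same fingerprint.
unchanged = csv_fingerprint(original) == csv_fingerprint(original)
# Adding a row produces a different fingerprint, which signals a new version.
changed = csv_fingerprint(original) != csv_fingerprint(updated)

print(unchanged, changed)
```

Because the fingerprint depends only on the file's contents, re-saving the CSV without changes would not register as a modification in this sketch.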
11+
12+
Seed models are a good fit for static datasets that change rarely or never. Examples of such datasets include:
13+
14+
* Names of national holidays and their dates.
15+
* A static list of identifiers that should be excluded.
16+
17+
## Creating a seed model
18+
19+
Like [SQL models](sql_models.md), seed models are defined in `.sql` files in the `models/` folder of the SQLMesh project. To mark a model as a seed model, use the special kind `SEED` in the model definition:
20+
```sql linenums="1"
MODEL (
  name test_db.national_holidays,
  kind SEED (
    path 'national_holidays.csv'
  )
);
```
28+
The `path` attribute is the path to the seed's CSV file, **relative** to the location of the definition's `.sql` file.
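
The relative-path rule can be sketched in Python as follows. The `resolve_seed_path` helper is hypothetical and only illustrates the resolution logic described above; it is not part of the SQLMesh API.

```python
import posixpath


def resolve_seed_path(model_file: str, seed_path: str) -> str:
    """Resolve a seed's CSV path relative to the model definition file.

    Hypothetical helper for illustration; not part of SQLMesh.
    """
    model_dir = posixpath.dirname(model_file)
    # Join with the directory of the .sql file, then collapse any ".." segments.
    return posixpath.normpath(posixpath.join(model_dir, seed_path))


same_folder = resolve_seed_path(
    "models/national_holidays.sql", "national_holidays.csv"
)
sibling_folder = resolve_seed_path(
    "examples/sushi/models/waiter_names.sql", "../seeds/waiter_names.csv"
)

print(same_folder)
print(sibling_folder)
```

A path such as `'../seeds/waiter_names.csv'` therefore points to a `seeds/` folder next to `models/`, not to a folder relative to the project root.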
29+
30+
The physical table holding the seed's content is created using column types inferred by Pandas. To override the inferred schema, specify the columns in the model definition:
31+
```sql linenums="1" hl_lines="6 7 8 9"
MODEL (
  name test_db.national_holidays,
  kind SEED (
    path 'national_holidays.csv'
  ),
  columns (
    name VARCHAR,
    date DATE
  )
);
```
43+
**Note:** The dataset schema provided in the definition takes precedence over the column names in the CSV file's header. As a result, the order of columns in the model definition must match the order of columns in the CSV file.
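
The following sketch shows why column order matters when declared names replace the CSV header. It assumes a purely positional mapping, which is what the note above implies. The `read_with_declared_columns` helper is hypothetical and uses only the Python standard library; it is not how SQLMesh loads seeds.

```python
import csv
import io

CSV_TEXT = "name,date\nNew Year,2023-01-01\nChristmas,2023-12-25\n"


def read_with_declared_columns(csv_text, declared_columns):
    """Label each CSV value with the declared column name at the same position.

    Hypothetical helper for illustration; the CSV header is read and discarded.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    if len(header) != len(declared_columns):
        raise ValueError("declared columns do not match the CSV column count")
    return [dict(zip(declared_columns, row)) for row in reader]


# Declared order matches the file: values land in the intended columns.
rows = read_with_declared_columns(CSV_TEXT, ["name", "date"])
# Declared order is swapped: the holiday name ends up under "date".
swapped = read_with_declared_columns(CSV_TEXT, ["date", "name"])

print(rows[0])
print(swapped[0])
```

In the swapped case nothing fails at read time in this sketch; the values are simply mislabeled, which is why the declared order needs to mirror the file.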
44+
45+
## Example
46+
47+
For this example, we use the model definition from the previous section and assume it has been saved to the `models/national_holidays.sql` file of the SQLMesh project.
48+
49+
Add the seed's CSV file, named `national_holidays.csv`, to the `models/` folder with the following contents:
50+
```csv linenums="1"
name,date
New Year,2023-01-01
Christmas,2023-12-25
```
55+
56+
When you run the `sqlmesh plan` command, the new model is detected automatically:
57+
```bash
$ sqlmesh plan
======================================================================
Successfully Ran 0 tests against duckdb
----------------------------------------------------------------------
Summary of differences against `prod`:
└── Added Models:
    └── test_db.national_holidays
Models needing backfill (missing dates):
└── test_db.national_holidays: (2023-02-16, 2023-02-16)
Enter the backfill start date (eg. '1 year', '2020-01-01') or blank for the beginning of history:
Apply - Backfill Tables [y/n]: y

All model batches have been executed successfully

test_db.national_holidays ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 1/1 • 0:00:00
```
74+
75+
After the plan has been applied successfully, you can query the table produced by the model evaluation:
76+
```bash
$ sqlmesh fetchdf "SELECT * FROM test_db.national_holidays"

        name        date
0   New Year  2023-01-01
1  Christmas  2023-12-25
```
83+
84+
Changes to the CSV file are picked up by the next `sqlmesh plan` run:
85+
```bash
$ sqlmesh plan
======================================================================
Successfully Ran 0 tests against duckdb
----------------------------------------------------------------------
Summary of differences against `prod`:
└── Directly Modified:
    └── test_db.national_holidays
---

+++

@@ -1,3 +1,4 @@

 name,date
 New Year,2023-01-01
 Christmas,2023-12-25
+Independence Day,2023-07-04
Directly Modified: test_db.national_holidays
[1] [Breaking] Backfill test_db.national_holidays and indirectly modified children
[2] [Non-breaking] Backfill test_db.national_holidays but not indirectly modified children: 1
Models needing backfill (missing dates):
└── test_db.national_holidays: (2023-02-16, 2023-02-16)
Enter the backfill start date (eg. '1 year', '2020-01-01') or blank for the beginning of history:
Apply - Backfill Tables [y/n]: y

All model batches have been executed successfully

test_db.national_holidays ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 1/1 • 0:00:00
```
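
The CSV diff in the plan output above has the shape of a unified diff. As a rough sketch, a comparable diff can be produced with Python's standard `difflib` module. This only illustrates the output format and assumes nothing about how SQLMesh generates its own diff.

```python
import difflib

# The two versions of national_holidays.csv, split into lines.
before = [
    "name,date",
    "New Year,2023-01-01",
    "Christmas,2023-12-25",
]
after = before + ["Independence Day,2023-07-04"]

# lineterm="" keeps difflib from appending newlines to the header lines.
diff_lines = list(difflib.unified_diff(before, after, lineterm=""))

for line in diff_lines:
    print(line)
```

Unchanged rows appear as context lines with a leading space, and the added row is marked with a leading `+`.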

examples/sushi/models/waiter_names.sql

Lines changed: 1 addition & 2 deletions
@@ -4,6 +4,5 @@ MODEL (
     path '../seeds/waiter_names.csv',
     batch_size 5
   ),
-  dialect duckdb,
   owner jen
-)
+)
