|
1 | 1 | # Seed models |
2 | 2 |
|
3 | | -TODO |
| 3 | +A seed is a special kind of model whose data comes from a static dataset defined in a CSV file rather than from a user-defined SQL or Python implementation. The CSV files themselves are also part of your SQLMesh project. |
| 4 | + |
| 5 | +Because seeds are also models in SQLMesh, they get the same benefits that SQL and Python models provide: |
| 6 | + |
| 7 | +* A physical table that reflects the contents of the seed's CSV file is created in the data warehouse. |
| 8 | +* Seed models can be referenced in downstream models the same way as other models. |
| 9 | +* Changes to CSV files are captured during [planning](../plans.md#plan-application) and versioned using the same [fingerprinting](../architecture/snapshots.md#fingerprinting) mechanism. |
| 10 | +* [Environment](../environments.md) isolation also applies to seed models. |
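The idea that a change to a CSV file yields a new version can be illustrated with a plain content hash. This is only a sketch under a stated assumption: `content_fingerprint` is a hypothetical helper, and SHA-256 over the file text stands in for the fingerprint here. SQLMesh's actual fingerprinting is described in the linked documentation and may work differently.

```python
import hashlib


def content_fingerprint(csv_text: str) -> str:
    """Hypothetical helper: hex digest that changes whenever the CSV text changes."""
    return hashlib.sha256(csv_text.encode("utf-8")).hexdigest()


original = "name,date\nNew Year,2023-01-01\nChristmas,2023-12-25\n"
modified = original + "Independence Day,2023-07-04\n"

# Identical contents give an identical digest, so nothing would look changed.
unchanged = content_fingerprint(original) == content_fingerprint(original)
# Appending a row gives a different digest, which is what marks a new version.
changed = content_fingerprint(original) != content_fingerprint(modified)
```

Hashing the contents rather than checking a modification time means that re-saving a file without editing it does not register as a change.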
| 11 | + |
| 12 | +Seed models are a good fit for static datasets that don't change often or at all. Examples of such datasets include: |
| 13 | + |
| 14 | +* Names of national holidays and their dates. |
| 15 | +* A static list of identifiers that should be excluded. |
| 16 | + |
| 17 | +## Creating a seed model |
| 18 | + |
| 19 | +Like [SQL models](sql_models.md), seed models are defined in files with the `.sql` extension in the `models/` folder of the SQLMesh project. To indicate that a model is a seed model, use the special kind `SEED` in the model definition: |
| 20 | +```sql linenums="1" |
| 21 | +MODEL ( |
| 22 | + name test_db.national_holidays, |
| 23 | + kind SEED ( |
| 24 | + path 'national_holidays.csv' |
| 25 | + ) |
| 26 | +); |
| 27 | +``` |
| 28 | +The `path` attribute provided as part of the definition represents a path to the seed's CSV file **relative** to the path of the definition's `.sql` file. |
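The relative lookup can be sketched with `pathlib`. This is a minimal illustration, not SQLMesh's own resolution code: `resolve_seed_path` is a hypothetical helper, and it assumes the definition lives at `models/national_holidays.sql`.

```python
from pathlib import PurePosixPath


def resolve_seed_path(definition_file: str, seed_path: str) -> PurePosixPath:
    """Hypothetical helper: join the seed path onto the definition file's folder."""
    return PurePosixPath(definition_file).parent / seed_path


# Assumed layout: the definition is saved as models/national_holidays.sql.
resolved = resolve_seed_path("models/national_holidays.sql", "national_holidays.csv")
print(resolved)
```

Under this assumption the CSV file is expected next to the definition, inside `models/`, and not at the project root.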
| 29 | + |
| 30 | +The physical table with the seed's content gets created using column types inferred by Pandas. The dataset schema can be overridden as part of the model definition: |
| 31 | +```sql linenums="1" hl_lines="6 7 8 9" |
| 32 | +MODEL ( |
| 33 | + name test_db.national_holidays, |
| 34 | + kind SEED ( |
| 35 | + path 'national_holidays.csv' |
| 36 | + ), |
| 37 | + columns ( |
| 38 | + name VARCHAR, |
| 39 | + date DATE |
| 40 | + ) |
| 41 | +); |
| 42 | +``` |
| 43 | +**Note:** the dataset schema provided in the definition takes precedence over the column names in the header of the CSV file. Columns are therefore matched by position, so the order of columns in the model definition must match the order of columns in the CSV file. |
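The effect of column order can be shown with the standard `csv` module. This is a sketch of positional matching in general, not SQLMesh's loader: `read_seed_rows` is a hypothetical helper that ignores the header and assigns the supplied names by position.

```python
import csv
import io


def read_seed_rows(csv_text: str, column_names: list[str]) -> list[dict[str, str]]:
    """Hypothetical helper: skip the CSV header and name the values by position."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # the header row is discarded; the supplied names win
    return [dict(zip(column_names, row)) for row in reader]


csv_text = "name,date\nNew Year,2023-01-01\nChristmas,2023-12-25\n"

# Names listed in the same order as the file: values land in the right columns.
rows = read_seed_rows(csv_text, ["name", "date"])
# Names listed in the wrong order: holiday names end up under "date".
swapped = read_seed_rows(csv_text, ["date", "name"])
```

With the names swapped, nothing fails at read time. The values are simply labeled incorrectly, which is why the order in the definition matters.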
| 44 | + |
| 45 | +## Example |
| 46 | + |
| 47 | +This example uses the model definition from the previous section and assumes it has been saved to the `models/national_holidays.sql` file of the SQLMesh project. |
| 48 | + |
| 49 | +Add the seed's CSV file with name `national_holidays.csv` to the `models/` folder with the following contents: |
| 50 | +```csv linenums="1" |
| 51 | +name,date |
| 52 | +New Year,2023-01-01 |
| 53 | +Christmas,2023-12-25 |
| 54 | +``` |
| 55 | + |
| 56 | +When you run the `sqlmesh plan` command, the new model is detected automatically: |
| 57 | +```bash |
| 58 | +$ sqlmesh plan |
| 59 | +====================================================================== |
| 60 | +Successfully Ran 0 tests against duckdb |
| 61 | +---------------------------------------------------------------------- |
| 62 | +Summary of differences against `prod`: |
| 63 | +└── Added Models: |
| 64 | + └── test_db.national_holidays |
| 65 | +Models needing backfill (missing dates): |
| 66 | +└── test_db.national_holidays: (2023-02-16, 2023-02-16) |
| 67 | +Enter the backfill start date (eg. '1 year', '2020-01-01') or blank for the beginning of history: |
| 68 | +Apply - Backfill Tables [y/n]: y |
| 69 | + |
| 70 | +All model batches have been executed successfully |
| 71 | + |
| 72 | +test_db.national_holidays ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 1/1 • 0:00:00 |
| 73 | +``` |
| 74 | + |
| 75 | +After the plan has been applied successfully, you can query the table produced by model evaluation: |
| 76 | +```bash |
| 77 | +$ sqlmesh fetchdf "SELECT * FROM test_db.national_holidays" |
| 78 | + |
| 79 | + name date |
| 80 | +0 New Year 2023-01-01 |
| 81 | +1 Christmas 2023-12-25 |
| 82 | +``` |
| 83 | + |
| 84 | +Changes to the CSV file are picked up by the next `sqlmesh plan` command: |
| 85 | +```bash |
| 86 | +$ sqlmesh plan |
| 87 | +====================================================================== |
| 88 | +Successfully Ran 0 tests against duckdb |
| 89 | +---------------------------------------------------------------------- |
| 90 | +Summary of differences against `prod`: |
| 91 | +└── Directly Modified: |
| 92 | + └── test_db.national_holidays |
| 93 | +--- |
| 94 | + |
| 95 | ++++ |
| 96 | + |
| 97 | +@@ -1,3 +1,4 @@ |
| 98 | + |
| 99 | + name,date |
| 100 | + New Year,2023-01-01 |
| 101 | + Christmas,2023-12-25 |
| 102 | ++Independence Day,2023-07-04 |
| 103 | +Directly Modified: test_db.national_holidays |
| 104 | +[1] [Breaking] Backfill test_db.national_holidays and indirectly modified children |
| 105 | +[2] [Non-breaking] Backfill test_db.national_holidays but not indirectly modified children: 1 |
| 106 | +Models needing backfill (missing dates): |
| 107 | +└── test_db.national_holidays: (2023-02-16, 2023-02-16) |
| 108 | +Enter the backfill start date (eg. '1 year', '2020-01-01') or blank for the beginning of history: |
| 109 | +Apply - Backfill Tables [y/n]: y |
| 110 | + |
| 111 | +All model batches have been executed successfully |
| 112 | + |
| 113 | +test_db.national_holidays ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 1/1 • 0:00:00 |
| 114 | +``` |