* remove pandera dependencies in src
* update documentation
* update changelog
* add integration test to check that custom schemas work
* don't test with python 3.12 as pandas/pyarrow have issues
CHANGELOG.md — 2 additions & 1 deletion

@@ -10,12 +10,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 Anything MAY change at any time. The public API SHOULD NOT be considered stable.").
 While in this phase, we will denote breaking changes with a minor increase.
 
-## Unreleased
+## Unreleased patch
 
 ### Changed
 
 * `dac` does not rely on [`pydantic`](https://pypi.org/project/pydantic/) anymore, and uses [`dataclass`](https://docs.python.org/3/library/dataclasses.html#) instead.
   Changes affect `PackConfig` and `PyProjectConfig`.
+* `Schema` does not have to be a `pandera.DataFrameModel` anymore, but any class that implements a `validate` method (see the `_input.interface.Validator` protocol).
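To make the loosened requirement concrete, here is a minimal sketch of a custom `Schema` satisfying the protocol described in the changelog entry: any class with a `validate` method works, no `pandera` required. All names below (the fields, the data shape) are illustrative, not part of `dac`'s API.

```python
from typing import Any


class Schema:
    # Field names exposed as class attributes, so user code can write
    # row[Schema.city] instead of hard-coding "city".
    city = "city"
    population = "population"

    @classmethod
    def validate(cls, data: list[dict[str, Any]]) -> list[dict[str, Any]]:
        # Reject records that are missing an expected field.
        required = {cls.city, cls.population}
        for row in data:
            missing = required - row.keys()
            if missing:
                raise ValueError(f"missing fields: {sorted(missing)}")
        return data

    @classmethod
    def example(cls) -> list[dict[str, Any]]:
        # Tiny synthetic dataset, handy for unit tests downstream.
        return [{cls.city: "Oslo", cls.population: 700_000}]
```

Since the data here is a plain list of dicts, this also shows why the pandas/dask/pyspark/modin restriction could be dropped: the protocol only cares that `validate` exists.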
docs/index.md — 18 additions & 9 deletions

@@ -23,17 +23,17 @@ from demo_data import load
 data = load()
 ```
 
-Depending on how the data was prepared, load may return a [pandas](https://pandas.pydata.org/), [dask](https://www.dask.org/), [pyspark](https://spark.apache.org/docs/latest/api/python/index.html), or [modin](https://github.com/modin-project/modin) dataframe. The limited choice is due to the fact that it must be supported by [pandera](https://pandera.readthedocs.io/en/stable/).
+Data can be in any format. There is no constraint of any kind.
 
-Not only accessing data will be this easy, but you will also have the [pandera DataFrame Model](https://pandera.readthedocs.io/en/stable/dataframe_models.html) associated with the data. How?
+Not only accessing data will be this easy but, depending on how data were prepared, you may also have access to useful metadata. How?
 
 ```python
 from demo_data import Schema
 ```
 
-With the schema you can, for example
+With the schema you could, for example
 
 * access the column names (e.g. `Schema.my_column`)
-* unit test your functions by [synthetizining data](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html)
+* unit test your functions by getting a data example with `Schema.example()`
 
 ## How can a Data Engineer provide a DaC python package?

@@ -45,20 +45,29 @@ and use the command `dac pack` (run `dac pack --help` for detailed instructions)
 
 On a high level, the most important elements you must provide are:
 
-* python code to load the data. It should as a DataFrame in one of the supported libraries: [pandas](https://pandas.pydata.org/), [dask](https://www.dask.org/), [pyspark](https://spark.apache.org/docs/latest/api/python/index.html), or [modin](https://github.com/modin-project/modin)
-* a [pandera DataFrame Model](https://pandera.readthedocs.io/en/stable/dataframe_models.html) fitting the data that can be loaded
+* python code to load the data
+* a `Schema` class that at very least contains a `validate` method, but possibly also
+
+    - data field names (column names, if data is tabular)
+    - an `example` method
+
 * python dependencies
 
+!!! hint "Use `pandera` to define the Schema"
+
+    If the data type you are using is supported by [`pandera`](https://pandera.readthedocs.io/en/stable/index.html) consider using a [`DataFrameModel`](https://pandera.readthedocs.io/en/stable/dataframe_models.html) to define the Schema.
+
 ## What are the advantages of distributing data in this way?
 
 * The code needed to load the data, the data source, and locations are abstracted away from the user.
   This mean that the data engineer can start from local files, transition to SQL database, cloud file storage, or kafka topic, without having the user to notice it or need to adapt its code.
 
-* Column names are passed to the user, and can be abstracted from the data source leveraging on the pandera [`Field.alias`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.pandas.model_components.Field.html). In this way, the user code will not contain hard-coded column names, and changes in data source column names won't impact the user.
+* *If you provide data field names in `Schema`* (e.g. `Schema.column_1`), the user code will not contain hard-coded column names, and changes in data source field names won't impact the user.
 
-* Users can build robust code by [writing unit testing for their functions](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html) effortlessly.
+* *If you provide the `Schema.example` method*, users will be able to build robust code by writing unit testing for their functions effortlessly.
 
-* Semantic versioning can be used to communicate significat changes:
+* Semantic versioning can be used to communicate significant changes:
 
 * a patch update corresponds to a fix in the data: its intended content is unchanged
 * a minor update corresponds to a change in the data that does not break the schema
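The two advantages about field names and `Schema.example` can be sketched from the user's side. The snippet below assumes a hypothetical `demo_data` package; a stand-in `Schema` is inlined so the example is self-contained (in real use it would be `from demo_data import Schema`).

```python
from typing import Any


class Schema:
    # Stand-in for the Schema shipped by a hypothetical `demo_data` package.
    city = "city"
    population = "population"

    @classmethod
    def example(cls) -> list[dict[str, Any]]:
        # Small synthetic dataset provided by the data engineer.
        return [
            {cls.city: "Oslo", cls.population: 700_000},
            {cls.city: "Bergen", cls.population: 290_000},
        ]


def largest_city(data: list[dict[str, Any]]) -> str:
    # User function: no hard-coded field names, so a rename in the data
    # source (with Schema updated accordingly) does not break this code.
    return max(data, key=lambda row: row[Schema.population])[Schema.city]


def test_largest_city() -> None:
    # Effortless unit test: no fixtures needed, Schema.example() is the data.
    assert largest_city(Schema.example()) == "Oslo"


test_largest_city()
```

Because the test only touches `Schema` attributes and `Schema.example()`, it keeps working across patch and minor updates of the data package.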