Commit 7316b11

make dac pandera-independent (#21)

* remove pandera dependencies in src
* update documentation
* update changelog
* add integration test to check that custom schemas work
* don't test with python 3.12 as pandas/pyarrow have issues

1 parent 3db5067 commit 7316b11

9 files changed: 76 additions and 28 deletions

.github/workflows/actions.yml (2 additions, 2 deletions)

@@ -19,7 +19,7 @@ jobs:
       - name: Setup python 🐍
         uses: actions/setup-python@v4
         with:
-          python-version: '3.11'
+          python-version: '3.12'
       - name: Setup cache 💾
         uses: actions/cache@v3
         with:
@@ -84,7 +84,7 @@ jobs:
       - name: Setup python 🐍
         uses: actions/setup-python@v4
         with:
-          python-version: "3.11"
+          python-version: "3.12"
       - name: Prepare release 🙆‍♂️📦test
         run: |
           python -m venv venv || . venv/bin/activate

.gitlab-ci.yml (4 additions, 6 deletions)

@@ -4,7 +4,6 @@ stages:
   - verification
   - deployment
-
 check style:
   stage: verification
   before_script:
@@ -13,7 +12,6 @@ check style:
   script:
     - pre-commit run
-
 test:
   image: $image
   stage: verification
@@ -25,10 +23,10 @@ test:
   parallel:
     matrix:
       - image:
-          - "python:3.9"
-          - "python:3.10"
-          - "python:3.11"
-
+          - "python:3.9"
+          - "python:3.10"
+          - "python:3.11"
+          - "python:3.12"

 deploy package:
   stage: deployment

CHANGELOG.md (2 additions, 1 deletion)

@@ -10,12 +10,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 Anything MAY change at any time. The public API SHOULD NOT be considered stable.").
 While in this phase, we will denote breaking changes with a minor increase.

-## Unreleased
+## Unreleased patch

 ### Changed

 * `dac` does not rely on [`pydantic`](https://pypi.org/project/pydantic/) anymore, and uses [`dataclass`](https://docs.python.org/3/library/dataclasses.html#) instead.
   Changes affect `PackConfig` and `PyProjectConfig`.
+* `Schema` does not have to be a `pandera.DataFrameModel` anymore, but any class that implements a `validate` method (see the `_input.interface.Validator` protocol).

 ## 0.3.3

docs/index.md (18 additions, 9 deletions)

@@ -23,17 +23,17 @@ from demo_data import load
 data = load()
 ```

-Depending on how the data was prepared, load may return a [pandas](https://pandas.pydata.org/), [dask](https://www.dask.org/), [pyspark](https://spark.apache.org/docs/latest/api/python/index.html), or [modin](https://github.com/modin-project/modin) dataframe. The limited choice is due to the fact that it must be supported by [pandera](https://pandera.readthedocs.io/en/stable/).
+Data can be in any format. There is no constraint of any kind.

-Not only accessing data will be this easy, but you will also have the [pandera DataFrame Model](https://pandera.readthedocs.io/en/stable/dataframe_models.html) associated with the data. How?
+Not only accessing data will be this easy but, depending on how data were prepared, you may also have access to useful metadata. How?
 ```python
 from demo_data import Schema
 ```

-With the schema you can, for example
+With the schema you could, for example

 * access the column names (e.g. `Schema.my_column`)
-* unit test your functions by [synthetizining data](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html)
+* unit test your functions by getting a data example with `Schema.example()`

 ## How can a Data Engineer provide a DaC python package?

@@ -45,20 +45,29 @@ and use the command `dac pack` (run `dac pack --help` for detailed instructions)

 On a high level, the most important elements you must provide are:

-* python code to load the data. It should be a DataFrame in one of the supported libraries: [pandas](https://pandas.pydata.org/), [dask](https://www.dask.org/), [pyspark](https://spark.apache.org/docs/latest/api/python/index.html), or [modin](https://github.com/modin-project/modin)
-* a [pandera DataFrame Model](https://pandera.readthedocs.io/en/stable/dataframe_models.html) fitting the data that can be loaded
+* python code to load the data
+* a `Schema` class that at very least contains a `validate` method, but possibly also
+
+    - data field names (column names, if data is tabular)
+    - an `example` method
+
 * python dependencies

+!!! hint "Use `pandera` to define the Schema"
+
+    If the data type you are using is supported by [`pandera`](https://pandera.readthedocs.io/en/stable/index.html) consider using a [`DataFrameModel`](https://pandera.readthedocs.io/en/stable/dataframe_models.html) to define the Schema.
+
 ## What are the advantages of distributing data in this way?

 * The code needed to load the data, the data source, and locations are abstracted away from the user.
   This mean that the data engineer can start from local files, transition to SQL database, cloud file storage, or kafka topic, without having the user to notice it or need to adapt its code.

-* Column names are passed to the user, and can be abstracted from the data source leveraging on the pandera [`Field.alias`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.pandas.model_components.Field.html). In this way, the user code will not contain hard-coded column names, and changes in data source column names won't impact the user.
+* *If you provide data field names in `Schema`* (e.g. `Schema.column_1`), the user code will not contain hard-coded column names, and changes in data source field names won't impact the user.

-* Users can build robust code by [writing unit testing for their functions](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html) effortlessly.
+* *If you provide the `Schema.example` method*, users will be able to build robust code by writing unit testing for their functions effortlessly.

-* Semantic versioning can be used to communicate significat changes:
+* Semantic versioning can be used to communicate significant changes:

   * a patch update corresponds to a fix in the data: its intended content is unchanged
   * a minor update corresponds to a change in the data that does not break the schema
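The pandera-free contract described in this documentation diff can be sketched as a minimal standalone `Schema`. Everything below — the field names `user_id`/`amount` and the example payload — is invented for illustration; a real DaC package would ship its own fields and loader:

```python
# A sketch of a minimal DaC-style Schema: field names as class attributes,
# a `validate` classmethod, and an optional `example` classmethod.
# Field names and example rows are hypothetical, not from the dac repo.


class Schema:
    user_id = "user_id"
    amount = "amount"

    @classmethod
    def validate(cls, check_obj, **kwargs):
        # Minimal duck-typed check: the expected fields must be present.
        for field in (cls.user_id, cls.amount):
            assert field in check_obj, f"missing field: {field}"
        return check_obj

    @classmethod
    def example(cls):
        # A tiny in-memory sample that users can feed to their unit tests.
        return {cls.user_id: [1, 2], cls.amount: [9.99, 5.00]}


validated = Schema.validate(Schema.example())
print(sorted(validated))  # → ['amount', 'user_id']
```

Because user code refers to `Schema.user_id` rather than the literal string, a rename at the data source only requires updating the attribute value in the package, not every consumer.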

src/dac/_input/config.py (3 additions, 8 deletions)

@@ -5,9 +5,8 @@
 from pathlib import Path
 from typing import Optional

-import pandera as pa
-
 from dac._file_helper import temporarily_copied_file
+from dac._input.interface import Validator
 from dac._input.pyproject import PyProjectConfig


@@ -65,7 +64,7 @@ def _check_schema_contains_expected_class(self) -> None:
         ) from e

     try:
-        issubclass(pkg.Schema, pa.SchemaModel)
+        issubclass(pkg.Schema, Validator)
     except Exception as e:
         raise ValueError(
             (f"{self.schema_path.as_posix()} does not contain the required `class Schema(pa.SchemaModel)`")
@@ -99,9 +98,5 @@ def _check_schema_match_data(self) -> None:

     try:
         schema_module.Schema.validate(data, lazy=True)
-    except pa.errors.SchemaErrors as e:
-        raise ValueError("Validation of the schema against the data has failed:" "\n" f"{e.failure_cases}") from e
     except Exception as e:
-        raise ValueError(
-            "Validation of the schema against the data has failed for unexpected reasons:" "\n" f"{e}"
-        ) from e
+        raise ValueError("Validation of the schema against the data has failed:" "\n" f"{e}") from e
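The collapsed error handling above folds every validation failure, whatever its original exception type, into a single `ValueError`, which is what makes the check pandera-agnostic. The pattern can be sketched in isolation (the `Schema` and `check_schema_match_data` below are invented stand-ins, not dac's actual objects):

```python
class Schema:
    @classmethod
    def validate(cls, check_obj, lazy=True):
        # A stand-in validator that rejects empty inputs.
        if not check_obj:
            raise AssertionError("no rows")
        return check_obj


def check_schema_match_data(data):
    # Any failure, regardless of its original type, surfaces as ValueError,
    # so callers never need pandera-specific exception classes.
    try:
        Schema.validate(data, lazy=True)
    except Exception as e:
        raise ValueError(f"Validation of the schema against the data has failed:\n{e}") from e


check_schema_match_data([{"a": 1}])  # passes silently
```

Catching bare `Exception` is what lets a custom `validate` raise `AssertionError`, `KeyError`, or pandera's `SchemaErrors` interchangeably; the trade-off is that pandera's richer `failure_cases` report is reduced to its string form.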

src/dac/_input/interface.py (8 additions)

@@ -0,0 +1,8 @@
+from typing import Any, Protocol, runtime_checkable
+
+
+@runtime_checkable
+class Validator(Protocol):
+    @classmethod
+    def validate(cls, check_obj: Any, **kwargs) -> Any:
+        pass
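Because `Validator` is a `runtime_checkable` `Protocol` with a single method, `issubclass` and `isinstance` checks are structural: any class exposing a `validate` callable satisfies it, with no inheritance required. A quick sketch (the `MySchema` and `NotASchema` classes are invented for illustration):

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Validator(Protocol):
    @classmethod
    def validate(cls, check_obj: Any, **kwargs) -> Any:
        ...


class MySchema:
    # No subclassing of Validator: having a `validate` attribute is enough
    # for the structural (duck-typed) runtime check.
    @classmethod
    def validate(cls, check_obj, **kwargs):
        return check_obj


class NotASchema:
    pass


print(issubclass(MySchema, Validator))    # → True
print(issubclass(NotASchema, Validator))  # → False
```

Note that runtime protocol checks only verify that the attribute exists, not its signature or return type; a class with a `validate` that takes no arguments would still pass `issubclass`.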

test/data/__init__.py (4 additions)

@@ -41,6 +41,10 @@ def get_path_to_sample_schema() -> Path:
     return Path(__file__).parent / "schema/sample.py"


+def get_path_to_sample_custom_schema() -> Path:
+    return Path(__file__).parent / "schema/sample_custom.py"
+
+
 def get_path_to_invalid_schema() -> Path:
     return Path(__file__).parent / "schema/invalid.py"

test/data/schema/sample_custom.py (27 additions)

@@ -0,0 +1,27 @@
+import pandas as pd
+
+
+class Schema:
+    int1 = "int1"
+    float1 = "float1"
+    string1 = "string1"
+    date1 = "date1"
+    datetime1 = "datetime1"
+
+    @classmethod
+    def validate(cls, check_obj: pd.DataFrame, **kwargs) -> pd.DataFrame:
+        # columns are present
+        for col in (cls.int1, cls.float1, cls.string1, cls.date1, cls.datetime1):
+            assert col in check_obj.columns
+
+        # correct types
+        assert check_obj[cls.int1].dtype == int
+        assert check_obj[cls.float1].dtype == float
+        assert check_obj[cls.string1].dtype == object
+        assert check_obj[cls.date1].dtype == object
+        assert check_obj[cls.datetime1].dtype == "datetime64[ns]"
+
+        # no nulls in int1
+        assert check_obj[cls.int1].isna().sum() == 0
+
+        return check_obj

test/integration_test/pack_test.py (8 additions, 2 deletions)

@@ -8,6 +8,7 @@
 from test.data import (
     generate_random_project_name,
     get_path_to_parquet_as_pandas_requirements,
+    get_path_to_sample_custom_schema,
     get_path_to_sample_load_parquet_as_pandas,
     get_path_to_sample_parquet,
     get_path_to_sample_schema,
@@ -22,16 +23,21 @@
 import pandas as pd
 import pandera as pa
 import pytest
-
 from dac import PackConfig, PyProjectConfig


 @pytest.mark.slow
-def test_if_valid_input_then_create_python_wheel():
+def test_if_valid_input_with_pandera_schema_then_create_python_wheel():
     with _packed_data() as config:
         _verify_wheel(wheel_dir=Path(config.wheel_dir), pyproject=config.pyproject)


+@pytest.mark.slow
+def test_if_valid_input_with_custom_schema_then_create_python_wheel():
+    with _packed_data(schema_path=get_path_to_sample_custom_schema()) as config:
+        _verify_wheel(wheel_dir=Path(config.wheel_dir), pyproject=config.pyproject)
+
+
 @pytest.mark.slow
 def test_if_installed_then_can_load_data():
     with _packed_data() as config:
