* remove pandera dependencies in src
* update documentation
* update changelog
* add integration test to check that custom schemas work
* don't test with python 3.12 as pandas/pyarrow have issues
CHANGELOG.md — 2 additions & 1 deletion

@@ -10,12 +10,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 Anything MAY change at any time. The public API SHOULD NOT be considered stable.").
 While in this phase, we will denote breaking changes with a minor increase.
 
-## Unreleased
+## Unreleased patch
 
 ### Changed
 
 * `dac` does not rely on [`pydantic`](https://pypi.org/project/pydantic/) anymore, and uses [`dataclass`](https://docs.python.org/3/library/dataclasses.html#) instead.
   Changes affect `PackConfig` and `PyProjectConfig`.
+* `Schema` does not have to be a `pandera.DataFrameModel` anymore, but any class that implements a `validate` method (see the `_input.interface.Validator` protocol).
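To make the loosened requirement concrete, here is a minimal sketch of a custom `Schema` satisfying the protocol described in the changelog entry: any class with a `validate` method works, no `pandera` required. All names below (the fields, the data shape) are illustrative, not part of `dac`'s API.

```python
from typing import Any


class Schema:
    # Field names exposed as class attributes, so user code can write
    # row[Schema.city] instead of hard-coding "city".
    city = "city"
    population = "population"

    @classmethod
    def validate(cls, data: list[dict[str, Any]]) -> list[dict[str, Any]]:
        # Reject records that are missing an expected field.
        required = {cls.city, cls.population}
        for row in data:
            missing = required - row.keys()
            if missing:
                raise ValueError(f"missing fields: {sorted(missing)}")
        return data

    @classmethod
    def example(cls) -> list[dict[str, Any]]:
        # Tiny synthetic dataset, handy for unit tests downstream.
        return [{cls.city: "Oslo", cls.population: 700_000}]
```

Since the data here is a plain list of dicts, this also shows why the pandas/dask/pyspark/modin restriction could be dropped: the protocol only cares that `validate` exists.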
docs/index.md — 18 additions & 9 deletions

@@ -23,17 +23,17 @@ from demo_data import load
 data = load()
 ```
 
-Depending on how the data was prepared, load may return a [pandas](https://pandas.pydata.org/), [dask](https://www.dask.org/), [pyspark](https://spark.apache.org/docs/latest/api/python/index.html), or [modin](https://github.com/modin-project/modin) dataframe. The limited choice is due to the fact that it must be supported by [pandera](https://pandera.readthedocs.io/en/stable/).
+Data can be in any format. There is no constraint of any kind.
 
-Not only accessing data will be this easy, but you will also have the [pandera DataFrame Model](https://pandera.readthedocs.io/en/stable/dataframe_models.html) associated with the data. How?
+Not only accessing data will be this easy but, depending on how data were prepared, you may also have access to useful metadata. How?
 
 ```python
 from demo_data import Schema
 ```
 
-With the schema you can, for example
+With the schema you could, for example
 
 * access the column names (e.g. `Schema.my_column`)
-* unit test your functions by [synthetizining data](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html)
+* unit test your functions by getting a data example with `Schema.example()`
 
 ## How can a Data Engineer provide a DaC python package?

@@ -45,20 +45,29 @@ and use the command `dac pack` (run `dac pack --help` for detailed instructions)
 
 On a high level, the most important elements you must provide are:
 
-* python code to load the data. It should as a DataFrame in one of the supported libraries: [pandas](https://pandas.pydata.org/), [dask](https://www.dask.org/), [pyspark](https://spark.apache.org/docs/latest/api/python/index.html), or [modin](https://github.com/modin-project/modin)
-* a [pandera DataFrame Model](https://pandera.readthedocs.io/en/stable/dataframe_models.html) fitting the data that can be loaded
+* python code to load the data
+* a `Schema` class that at very least contains a `validate` method, but possibly also
+
+    - data field names (column names, if data is tabular)
+    - an `example` method
+
 * python dependencies
 
+!!! hint "Use `pandera` to define the Schema"
+
+    If the data type you are using is supported by [`pandera`](https://pandera.readthedocs.io/en/stable/index.html) consider using a [`DataFrameModel`](https://pandera.readthedocs.io/en/stable/dataframe_models.html) to define the Schema.
+
 ## What are the advantages of distributing data in this way?
 
 * The code needed to load the data, the data source, and locations are abstracted away from the user.
   This mean that the data engineer can start from local files, transition to SQL database, cloud file storage, or kafka topic, without having the user to notice it or need to adapt its code.
 
-* Column names are passed to the user, and can be abstracted from the data source leveraging on the pandera [`Field.alias`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.pandas.model_components.Field.html). In this way, the user code will not contain hard-coded column names, and changes in data source column names won't impact the user.
+* *If you provide data field names in `Schema`* (e.g. `Schema.column_1`), the user code will not contain hard-coded column names, and changes in data source field names won't impact the user.
 
-* Users can build robust code by [writing unit testing for their functions](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html) effortlessly.
+* *If you provide the `Schema.example` method*, users will be able to build robust code by writing unit testing for their functions effortlessly.
 
-* Semantic versioning can be used to communicate significat changes:
+* Semantic versioning can be used to communicate significant changes:
 
 * a patch update corresponds to a fix in the data: its intended content is unchanged
 * a minor update corresponds to a change in the data that does not break the schema
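The two advantages about field names and `Schema.example` can be sketched from the user's side. The snippet below assumes a hypothetical `demo_data` package; a stand-in `Schema` is inlined so the example is self-contained (in real use it would be `from demo_data import Schema`).

```python
from typing import Any


class Schema:
    # Stand-in for the Schema shipped by a hypothetical `demo_data` package.
    city = "city"
    population = "population"

    @classmethod
    def example(cls) -> list[dict[str, Any]]:
        # Small synthetic dataset provided by the data engineer.
        return [
            {cls.city: "Oslo", cls.population: 700_000},
            {cls.city: "Bergen", cls.population: 290_000},
        ]


def largest_city(data: list[dict[str, Any]]) -> str:
    # User function: no hard-coded field names, so a rename in the data
    # source (with Schema updated accordingly) does not break this code.
    return max(data, key=lambda row: row[Schema.population])[Schema.city]


def test_largest_city() -> None:
    # Effortless unit test: no fixtures needed, Schema.example() is the data.
    assert largest_city(Schema.example()) == "Oslo"


test_largest_city()
```

Because the test only touches `Schema` attributes and `Schema.example()`, it keeps working across patch and minor updates of the data package.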