# Methodology
Pasteur is a pipeline-based system built around a three-stage methodology:
ingestion, synthesis, and evaluation.
The whole process is designed to be deterministic, from raw data to evaluated
product.
## Ingestion
The process begins with collecting the raw data sources.
In a Pasteur project, raw data sources are stored in the `./raw` directory by
default, but this can be changed to, e.g., an S3 bucket or a different drive
mount in case the data is large.
The `./raw` data directory is considered immutable by Pasteur, and can contain
archives (`.zip`, `.tar`), SQL dumps, CSV files, JSON files, or any other type
of file in its raw form.

The raw data sources are then combined and preprocessed to form a `Dataset`.
This is done by a `Dataset` module, which contains all the information required
to wrangle the raw sources into a tidy form (by typing columns, removing
unnecessary data, and renaming columns to adhere to project standards).
Currently, only multi-table tabular data is supported, but in the future
Pasteur should also support multi-modal data such as images, time-series, and
audio.
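
As an illustration, the wrangling a `Dataset` module performs might look like
the following sketch. The function name `ingest_patients`, the column names,
and the data are hypothetical, and Pasteur's actual module API may differ;
this only shows the typing, pruning, and renaming steps described above.

```python
import io

import pandas as pd

# Hypothetical raw CSV source, as it might sit in `./raw`.
RAW_CSV = io.StringIO(
    "PatientID,admDate,SEX\n"
    "1,2023-03-21,M\n"
    "2,2023-03-25,F\n"
)

def ingest_patients(raw: io.StringIO) -> pd.DataFrame:
    """Wrangle a raw source into tidy form: rename, type, and index columns."""
    df = pd.read_csv(raw)
    # Rename columns to adhere to (hypothetical) project naming standards.
    df = df.rename(columns={"PatientID": "id", "admDate": "admission", "SEX": "sex"})
    df["admission"] = pd.to_datetime(df["admission"])  # enforce proper types
    df["sex"] = df["sex"].astype("category")
    return df.set_index("id")

patients = ingest_patients(RAW_CSV)
```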

Datasets represent the raw data sources in full resolution.
With few exceptions, this means that it is intractable to synthesize them
as-is.
We therefore need to downsample Datasets into a more manageable form, which
we name a `View`.
`View` modules have an API similar to `Dataset` modules and work in a similar
way.
However, since the bulk of processing and typing is performed at the `Dataset`
level, `View` modules are used to select subsets of tables, rows, and columns
from the original `Dataset`, or to simplify columns to a smaller domain.
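
A minimal sketch of what a `View` might do, assuming a hypothetical
`admissions` table and a `make_view` helper (both illustrative, not part of
Pasteur's API): it drops columns and shrinks a numerical column's domain.

```python
import pandas as pd

# Hypothetical full-resolution table produced by a `Dataset` module.
admissions = pd.DataFrame({
    "age": [34, 51, 78],
    "ward": ["icu", "er", "icu"],
    "notes": ["...", "...", "..."],
})

def make_view(df: pd.DataFrame) -> pd.DataFrame:
    """Select a subset of columns and simplify one to a smaller domain."""
    view = df[["age", "ward"]].copy()  # drop free-text columns
    view["age"] = pd.cut(
        view["age"], bins=[0, 40, 65, 120], labels=["<40", "40-65", ">65"]
    )  # shrink the column's domain to three categories
    return view

view = make_view(admissions)
```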

The ingestion stage ends with the preprocessed `View` data.
We associate a set of hyperparameters with this `View` data, which can be used
to transform and encode it for synthesis, as well as for evaluation.

## Synthesis
The next stage is synthesis itself.
Pasteur does not dictate a particular procedure for synthesis, other than
splitting it into three stages (`bake`, `fit`, and `sample`), leaving the rest
up to the algorithm creator.
Pasteur does, however, define a formalized procedure for encoding data and
integrating domain knowledge, which is a significant and complex task on its
own.
It separates the encoding process into two reversible steps: Transformation
and Encoding.
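
The three-stage contract can be sketched as a class with `bake`, `fit`, and
`sample` methods. The class name and its internals below are purely
illustrative (a toy model that samples uniform noise within baked ranges),
not Pasteur's actual base class; only the stage split is taken from the text.

```python
import numpy as np

class SynthesizerSketch:
    """Illustrative three-stage synthesizer: bake, fit, sample."""

    def bake(self, data: np.ndarray) -> None:
        # Preprocessing decisions, e.g. remember each column's value range.
        self.lo, self.hi = data.min(axis=0), data.max(axis=0)

    def fit(self, data: np.ndarray) -> None:
        # Train the model; here a toy "model" of independent uniform noise.
        self.rng = np.random.default_rng(0)
        self.d = data.shape[1]

    def sample(self, n: int) -> np.ndarray:
        # Draw n synthetic rows within the baked per-column ranges.
        return self.rng.uniform(self.lo, self.hi, size=(n, self.d))

synth = SynthesizerSketch()
data = np.array([[0.0, 10.0], [1.0, 20.0]])
synth.bake(data)
synth.fit(data)
out = synth.sample(5)
```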

In Transformation, complex types (such as dates) are broken down into simpler
ones (numerical and categorical values) with a formalized specification (the
Base Transform Layer).
This process is performed by modules named Transformers, each of which is
designed to handle a single column type.
Transformers integrate expert domain knowledge into their column
transformations, and the end result is simpler types with orthogonal
information.
For example, consider two date columns we know to be relative: `21-3-23` and
`25-3-23` both contain denormalized information about the year.
With a naive transformation, it would be left up to the synthesis algorithm
to enforce that the denormalized dates (years, months) match.
However, we can transform those two dates into `2023`, `March`, `21`, and
`4 days`, in which case the information held by each column has been
normalized (no duplication).
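
The date decomposition above can be written out concretely. This is a sketch,
not a Pasteur Transformer; the function name `transform_dates` and its output
keys are hypothetical.

```python
from datetime import date

def transform_dates(anchor: date, other: date) -> dict:
    """Decompose two related dates so each output holds orthogonal information."""
    delta = (other - anchor).days  # store the second date only as an offset
    return {
        "year": anchor.year,
        "month": anchor.strftime("%B"),
        "day": anchor.day,
        "offset_days": delta,
    }

parts = transform_dates(date(2023, 3, 21), date(2023, 3, 25))
# parts == {"year": 2023, "month": "March", "day": 21, "offset_days": 4}
```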

In Encoding, the simpler column types are converted into a form suitable for
the synthesis algorithm and metric at hand.
A single synthesis execution can require multiple encodings.
For neural network algorithms, the Base Transform Layer would be converted
into vector batches (e.g., float16/32 vectors, one-hot encodings).
For marginal algorithms, the numerical columns would be discretized into bins.

Encoding and Transformation example:
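As one possible illustration (the column values, bin edges, and category set
are assumptions, not Pasteur output), the same Base Transform Layer columns
can be encoded two ways: binned for a marginal algorithm, and one-hot
float vectors for a neural network.

```python
import numpy as np

# Transformed (Base Transform Layer) columns for three rows: one numerical
# column and one categorical column. Values are illustrative.
offset_days = np.array([4.0, 11.0, 27.0])
month = np.array(["March", "March", "April"])

# Encoding for a marginal algorithm: discretize numericals into bins.
bins = np.array([0, 7, 14, 30])  # week-sized bin edges (assumed)
offset_binned = np.digitize(offset_days, bins) - 1

# Encoding for a neural network: one-hot the categorical as float32 vectors.
categories = np.array(["January", "February", "March", "April"])
one_hot = (month[:, None] == categories[None, :]).astype(np.float32)
```

Both encodings are reversible given the bin edges and the category set, which
is what allows the synthesized output to be decoded back through the same
layers.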
## Evaluation
Finally, in evaluation, arbitrary code is executed on the resulting data to
produce a set of metrics.
Pasteur does not enforce a particular structure on what a metric is, other
than enforcing certain access patterns.
Metrics are split, based on how they access data, into `Column`, `Table`, and
`View` metrics.
This influences how metrics are instantiated in a synthesis execution:
- `Column`: once per column with a matching type
- `Table`: once per table
- `View`: once

And what input they receive:
- `Column`: only their column, plus metadata
- `Table`: the table and its parents, plus metadata for those
- `View`: all tables, plus the View metadata

In addition, `View` and `Table` metrics may choose to receive raw data,
Base Transform Layer data, or encodings (e.g., a classifier module requires
continuous data as input and discrete classes as output).
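
A `Column` metric, for instance, might be sketched as follows. The class
name, the `type` tag, and the `fit`/`measure` method names are hypothetical
stand-ins for Pasteur's metric interface; the toy metric just compares means.

```python
import pandas as pd

class ColumnMetricSketch:
    """Illustrative Column metric: receives only its column plus metadata,
    and is instantiated once per column of matching type."""

    type = "numerical"  # hypothetical type tag used for column matching

    def fit(self, real: pd.Series) -> None:
        # Record what we need from the real column.
        self.real_mean = real.mean()

    def measure(self, synthetic: pd.Series) -> float:
        # Toy metric: absolute difference of means, real vs. synthetic.
        return abs(self.real_mean - synthetic.mean())

metric = ColumnMetricSketch()
metric.fit(pd.Series([1.0, 2.0, 3.0]))
score = metric.measure(pd.Series([2.0, 3.0, 4.0]))
# score == 1.0
```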
## Appendix: Full Methodology
The full methodology and its steps can be seen below.
