|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "attachments": {}, |
| 5 | + "cell_type": "markdown", |
| 6 | + "metadata": {}, |
| 7 | + "source": [ |
| 8 | + "# Methodology\n", |
| 9 | + "Pasteur is a pipeline-based system built around a methodology with three stages:\n", |
| 10 | + "ingestion, synthesis, and evaluation.\n", |
| 11 | + "The whole process is designed to be deterministic, from raw data\n", |
| 12 | + "to evaluated product.\n", |
| 13 | + "\n", |
| 14 | + "" |
| 15 | + ] |
| 16 | + }, |
| 17 | + { |
| 18 | + "attachments": {}, |
| 19 | + "cell_type": "markdown", |
| 20 | + "metadata": {}, |
| 21 | + "source": [ |
| 22 | + "## Ingestion\n", |
| 23 | + "The process begins with collecting the raw data sources.\n", |
| 24 | + "In a Pasteur project, raw data sources are stored in the `./raw` directory by default,\n", |
| 25 | + "but you can change this to, e.g., an S3 bucket or a different drive mount, in case\n", |
| 26 | + "the data is large.\n", |
| 27 | + "The `./raw` data directory is considered immutable by Pasteur, and can contain\n", |
| 28 | + "archives (`.zip`, `.tar`), SQL dumps, CSV files, JSON files or any type of file\n", |
| 29 | + "in its raw form.\n", |
| 30 | + "\n", |
| 31 | + "The raw data sources are then combined and preprocessed to form a `Dataset`.\n", |
| 32 | + "This is done by a `Dataset` module, which contains all relevant information\n", |
| 33 | + "required to wrangle the raw sources into a tidy form (by typing, removing unnecessary \n", |
| 34 | + "data, and renaming columns to adhere to project standards).\n", |
| 35 | + "Currently, only multi-table tabular data is supported, but in the future\n", |
| 36 | + "Pasteur should also support multi-modal data such as images, time-series, and\n", |
| 37 | + "audio.\n", |
| 38 | + "\n", |
| 39 | + "Datasets represent the raw data sources in full resolution.\n", |
| 40 | + "With few exceptions, this means that synthesizing them as-is\n", |
| 41 | + "is intractable.\n", |
| 42 | + "We therefore need to downsample Datasets into a more manageable form.\n", |
| 43 | + "We name this form a `View`.\n", |
| 44 | + "`View` modules have an API similar to `Dataset` modules and follow a similar process.\n", |
| 45 | + "However, since the bulk of processing and typing is performed on the `Dataset`\n", |
| 46 | + "level, `View` modules are used to select subsets of tables, rows, and columns from\n", |
| 47 | + "the original `Dataset`, or simplify columns to a smaller domain.\n", |
| 48 | + "\n", |
| 49 | + "The ingestion stage ends with the preprocessed `View` data.\n", |
| 50 | + "We associate a set of hyperparameters with this `View` data, which can be used\n", |
| 51 | + "to transform and encode it for synthesis, as well as for evaluation." |
| 52 | + ] |
| 53 | + }, |
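|  | + { |
|  | + "attachments": {}, |
|  | + "cell_type": "markdown", |
|  | + "metadata": {}, |
|  | + "source": [ |
|  | + "As a minimal sketch of this idea (the table and function names here are\n", |
|  | + "hypothetical, not Pasteur's actual `View` API), a view might keep a subset of\n", |
|  | + "columns and simplify a high-resolution column to a smaller domain:\n", |
|  | + "\n", |
|  | + "```python\n", |
|  | + "def patients_view(rows):\n", |
|  | + "    # Keep a subset of columns and simplify 'age' to a small domain.\n", |
|  | + "    def age_group(age):\n", |
|  | + "        return 'child' if age < 18 else 'adult' if age < 65 else 'senior'\n", |
|  | + "    return [{'id': r['id'], 'age_group': age_group(r['age'])} for r in rows]\n", |
|  | + "```" |
|  | + ] |
|  | + }, |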
| 54 | + { |
| 55 | + "attachments": {}, |
| 56 | + "cell_type": "markdown", |
| 57 | + "metadata": {}, |
| 58 | + "source": [ |
| 59 | + "## Synthesis\n", |
| 60 | + "The next stage of the methodology is synthesis itself.\n", |
| 61 | + "Pasteur does not dictate a particular procedure for synthesis, other than\n", |
| 62 | + "splitting it into three stages (`bake`, `fit`, and `sample`), leaving the rest up\n", |
| 63 | + "to the algorithm creator.\n", |
| 64 | + "Pasteur does define a formalized procedure for encoding data and integrating domain\n", |
| 65 | + "knowledge, which is a significant and complex task on its own.\n", |
| 66 | + "Pasteur separates the encoding process into two reversible steps: Transformation, and Encoding.\n", |
| 67 | + "\n", |
| 68 | + "In Transformation, complex types (such as dates) are broken down into simpler\n", |
| 69 | + "ones (numerical, categorical values) with a formalized specification (Base Transform Layer).\n", |
| 70 | + "This process is performed by modules named Transformers, each of which is designed\n", |
| 71 | + "to handle a single column type.\n", |
| 72 | + "Transformers integrate expert domain knowledge into their column transformations,\n", |
| 73 | + "and the end result is simpler types with orthogonal information.\n", |
| 74 | + "For example, consider two related date columns, `21-3-23` and `25-3-23`: both\n", |
| 75 | + "contain denormalized information about the year and month.\n", |
| 76 | + "With a naive transformation, it would be left up to the synthesis algorithm to\n", |
| 77 | + "enforce that the denormalized date parts (years, months) match.\n", |
| 78 | + "However, we can transform those two dates into: `2023`, `March`, `23`, `4 days`,\n", |
| 79 | + "in which case the information held by each column has been normalized (no duplication).\n", |
| 80 | + "\n", |
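|  | + "A minimal sketch of such a reversible transformation (the function names here\n", |
|  | + "are hypothetical, not Pasteur's actual Transformer API):\n", |
|  | + "\n", |
|  | + "```python\n", |
|  | + "from datetime import date, timedelta\n", |
|  | + "\n", |
|  | + "def transform(start: date, end: date) -> dict:\n", |
|  | + "    # Absolute parts for the first date, a relative offset for the second:\n", |
|  | + "    # each output column now holds orthogonal (non-duplicated) information.\n", |
|  | + "    return {'year': start.year, 'month': start.month, 'day': start.day,\n", |
|  | + "            'offset_days': (end - start).days}\n", |
|  | + "\n", |
|  | + "def reverse(parts: dict):\n", |
|  | + "    # The step is reversible: the original dates can be reconstructed exactly.\n", |
|  | + "    start = date(parts['year'], parts['month'], parts['day'])\n", |
|  | + "    return start, start + timedelta(days=parts['offset_days'])\n", |
|  | + "```\n", |
|  | + "\n", |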
| 81 | + "In Encoding, the simpler column types are converted to a form suitable for the\n", |
| 82 | + "synthesis algorithm and metric at hand.\n", |
| 83 | + "A synthesis execution can require multiple encodings.\n", |
| 84 | + "For Neural Network algorithms, the Base Transform Layer would be converted\n", |
| 85 | + "into vector batches (e.g., float16/32 vectors, one-hot).\n", |
| 86 | + "For Marginal Algorithms, the numerical columns would be discretized into bins.\n", |
| 87 | + "\n", |
| 88 | + "Encoding and Transformation example:\n", |
| 89 | + "\n", |
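|  | + "A minimal sketch of both encodings (illustrative helpers, not Pasteur's\n", |
|  | + "actual encoder API):\n", |
|  | + "\n", |
|  | + "```python\n", |
|  | + "def discretize(values, edges):\n", |
|  | + "    # Marginal algorithms: map each numerical value to a bin index.\n", |
|  | + "    return [sum(v >= e for e in edges) for v in values]\n", |
|  | + "\n", |
|  | + "def one_hot(index, domain):\n", |
|  | + "    # Neural networks: expand a categorical index into a one-hot vector.\n", |
|  | + "    return [1.0 if i == index else 0.0 for i in range(domain)]\n", |
|  | + "```\n", |
|  | + "\n", |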
| 90 | + "" |
| 91 | + ] |
| 92 | + }, |
| 93 | + { |
| 94 | + "attachments": {}, |
| 95 | + "cell_type": "markdown", |
| 96 | + "metadata": {}, |
| 97 | + "source": [ |
| 98 | + "## Evaluation\n", |
| 99 | + "Finally, in evaluation, arbitrary code is executed on the resulting data to produce a set of metrics.\n", |
| 100 | + "Pasteur does not enforce a particular structure for what a metric is, other than certain\n", |
| 101 | + "data access patterns.\n", |
| 102 | + "Metrics are split based on how they access data into `Column`, `Table`, and `View` metrics.\n", |
| 103 | + "This influences how metrics are instantiated in a synthesis execution: \n", |
| 104 | + "- `Column`: once per column with matching type\n", |
| 105 | + "- `Table`: once per table\n", |
| 106 | + "- `View`: once\n", |
| 107 | + " \n", |
| 108 | + "And what input they receive:\n", |
| 109 | + "- `Column`: only their column + metadata\n", |
| 110 | + "- `Table`: the table and its parents, and metadata for those\n", |
| 111 | + "- `View`: View metadata and all tables\n", |
| 112 | + "\n", |
| 113 | + "In addition, `View` and `Table` metrics may opt to receive raw data,\n", |
| 114 | + "Base Transform Layer data, or encodings (e.g., a classifier module requires\n", |
| 115 | + "continuous data as input and discrete classes as output)." |
| 116 | + ] |
| 117 | + }, |
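|  | + { |
|  | + "attachments": {}, |
|  | + "cell_type": "markdown", |
|  | + "metadata": {}, |
|  | + "source": [ |
|  | + "As a toy example of the `Column` access pattern (hypothetical, not Pasteur's\n", |
|  | + "actual metric API), a metric instantiated once per column might compute the\n", |
|  | + "fraction of missing values from just that column:\n", |
|  | + "\n", |
|  | + "```python\n", |
|  | + "class NullFraction:\n", |
|  | + "    # A `Column`-style metric: it sees only one column (plus its metadata).\n", |
|  | + "    def fit(self, column, meta=None):\n", |
|  | + "        self.value = sum(v is None for v in column) / len(column)\n", |
|  | + "        return self.value\n", |
|  | + "```" |
|  | + ] |
|  | + }, |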
| 118 | + { |
| 119 | + "attachments": {}, |
| 120 | + "cell_type": "markdown", |
| 121 | + "metadata": {}, |
| 122 | + "source": [ |
| 123 | + "## Appendix: Full Methodology\n", |
| 124 | + "The full methodology and its steps can be seen below.\n", |
| 125 | + "" |
| 126 | + ] |
| 127 | + } |
| 128 | + ], |
| 129 | + "metadata": { |
| 130 | + "language_info": { |
| 131 | + "name": "python" |
| 132 | + } |
| 133 | + }, |
| 134 | + "nbformat": 4, |
| 135 | + "nbformat_minor": 2 |
| 136 | +} |