
Commit f29b8bc

committed
add intro + methodology
1 parent 1ecda7f commit f29b8bc

3 files changed

Lines changed: 145 additions & 6 deletions


Lines changed: 6 additions & 6 deletions
@@ -14,13 +14,13 @@
 "module system.\n",
 "\n",
 "The tutorial begins with detailing Pasteur's architecture and module system.\n",
-"Afterwards, we create a project and begin integrating and analyzing a new dataset.\n"
+"Afterwards, we create a project and begin integrating and analyzing a new dataset.\n",
+"Finally, we extend this project by creating modules for each of Pasteur's supported\n",
+"module types.\n",
+"\n",
+"In this tutorial, we will briefly mention, but not cover, the out-of-core features\n",
+"of Pasteur, which will be analyzed in a future \"Advanced Topics\" section."
 ]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": []
 }
 ],
 "metadata": {
Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Methodology\n",
"Pasteur is a pipeline-based system built on a methodology with three stages:\n",
"ingestion, synthesis, and evaluation.\n",
"The whole process is designed to be deterministic, from raw data\n",
"to evaluated product.\n",
"\n",
"![Methodology Overview](./res/graphs-pipeline_simple.svg)"
]
},
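To make the overview concrete, here is a minimal sketch of the three stages as a deterministic chain. All function names and data shapes here are made up for illustration; this is not Pasteur's actual API.

```python
# Illustrative sketch of the three-stage methodology (ingestion,
# synthesis, evaluation) as a deterministic pipeline.
# All names are hypothetical, not Pasteur's actual API.

def ingest(raw_sources: dict) -> dict:
    """Wrangle raw sources into a tidy, typed View."""
    return {"table": sorted(raw_sources["dump"])}

def synthesize(view: dict) -> dict:
    """Stand-in for a synthesis algorithm; here it just copies the data."""
    return {"table": list(view["table"])}

def evaluate(view: dict, synth: dict) -> dict:
    """Compute a trivial metric comparing real and synthetic data."""
    return {"row_count_match": len(view["table"]) == len(synth["table"])}

# Same raw data in, same metrics out: the chain is deterministic.
view = ingest({"dump": [3, 1, 2]})
metrics = evaluate(view, synthesize(view))
```

Because each step is a pure function of its inputs, rerunning the pipeline on the same raw data reproduces the same evaluated product.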
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ingestion\n",
"The process begins with collecting the raw data sources.\n",
"In a Pasteur project, raw data sources are stored in the `./raw` directory by default,\n",
"but you can change this to, e.g., an S3 share or a different drive mount, in case the\n",
"data is large.\n",
"The `./raw` data directory is considered immutable by Pasteur, and can contain\n",
"archives (`.zip`, `.tar`), SQL dumps, CSV files, JSON files, or any other type of file\n",
"in its raw form.\n",
"\n",
"The raw data sources are then combined and preprocessed to form a `Dataset`.\n",
"This is done by a `Dataset` module, which contains all the information\n",
"required to wrangle the raw sources into a tidy form (by typing columns, removing\n",
"unnecessary data, and renaming columns to adhere to project standards).\n",
"Currently, only multi-table tabular data is supported, but in the future\n",
"Pasteur should also support multi-modal data such as images, time series, and\n",
"audio.\n",
"\n",
"Datasets represent the raw data sources in full resolution.\n",
"With few exceptions, this means it is intractable to synthesize them\n",
"as-is.\n",
"We therefore need to downsample Datasets into a more manageable form,\n",
"which we name a `View`.\n",
"`View` modules have an API similar to that of `Dataset` modules and work in a similar way.\n",
"However, since the bulk of the processing and typing is performed at the `Dataset`\n",
"level, `View` modules are used to select subsets of tables, rows, and columns from\n",
"the original `Dataset`, or to simplify columns to a smaller domain.\n",
"\n",
"The ingestion stage ends with the preprocessed `View` data.\n",
"We associate a set of hyperparameters with this `View` data, which can be used\n",
"to transform and encode it for synthesis, as well as for evaluation."
]
},
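As an illustration of the Dataset-to-View step (a sketch with made-up table and column names, not Pasteur's actual module API), selecting a row/column subset and simplifying a column to a smaller domain could look like this:

```python
# Hypothetical sketch of what a View module does to a Dataset:
# select a subset of rows and columns, and simplify a column
# (here, age) to a smaller domain. Not Pasteur's actual API.

dataset = {
    "patients": [
        {"id": 1, "age": 34, "city": "Athens", "notes": "..."},
        {"id": 2, "age": 71, "city": "Berlin", "notes": "..."},
        {"id": 3, "age": 55, "city": "Athens", "notes": "..."},
    ]
}

def make_view(dataset: dict) -> dict:
    """Keep only needed columns/rows and bin age into two groups."""
    def age_group(age: int) -> str:
        return "<50" if age < 50 else "50+"

    return {
        "patients": [
            # Column subset: drop free-text "notes"; smaller domain for age.
            {"id": row["id"], "age_group": age_group(row["age"]), "city": row["city"]}
            for row in dataset["patients"]
            if row["city"] == "Athens"  # row subset
        ]
    }

view = make_view(dataset)
```

The resulting View has fewer rows, fewer columns, and a coarser age domain, which is what makes later synthesis tractable.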
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Synthesis\n",
"The next stage is synthesis itself.\n",
"Pasteur does not dictate a particular procedure for synthesis, other than\n",
"splitting it into three steps (`bake`, `fit`, and `sample`), leaving the rest up\n",
"to the algorithm creator.\n",
"Pasteur does define a formalized procedure for encoding data and integrating domain\n",
"knowledge, which is a significant and complex task on its own.\n",
"Pasteur separates the encoding process into two reversible steps: Transformation and Encoding.\n",
"\n",
"In Transformation, complex types (such as dates) are broken down into simpler\n",
"ones (numerical, categorical values) with a formalized specification (the Base Transform Layer).\n",
"This process is performed by modules named Transformers, each of which is designed\n",
"to handle a single column type.\n",
"Transformers integrate expert domain knowledge into their column transformations,\n",
"and the end result is simpler types with orthogonal information.\n",
"For example, consider two date columns we know to be relative, `21-3-23` and `25-3-23`:\n",
"both contain denormalized information about the year.\n",
"With a naive transformation, it would be left up to the synthesis algorithm to\n",
"ensure that the denormalized date parts (years, months) match.\n",
"However, we can transform those two dates into `2023`, `March`, `21`, `4 days`,\n",
"in which case the information held by each column has been normalized (no duplication).\n",
"\n",
"In Encoding, the simpler column types are converted to a form suitable for the\n",
"synthesis algorithm and metric at hand.\n",
"A synthesis execution can require multiple encodings.\n",
"For neural network algorithms, the Base Transform Layer would be converted\n",
"into vector batches (e.g., float16/32 vectors, one-hot).\n",
"For marginal algorithms, the numerical columns would be discretized into bins.\n",
"\n",
"Encoding and Transformation example:\n",
"\n",
"![Encoding and Transforming example](./res/graphs-encoding.svg)"
]
},
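The relative-date example above can be sketched as a reversible transformer. The class and method names here are illustrative only, not Pasteur's actual Transformer API:

```python
from datetime import date, timedelta

# Hypothetical sketch of a reversible date Transformer, following the
# relative-dates example in the text. Names are illustrative, not
# Pasteur's actual API.
class RelativeDateTransformer:
    """Split a (base, other) date pair into normalized parts and back."""

    def forward(self, base: date, other: date) -> dict:
        # Each output column now holds orthogonal information:
        # year/month/day describe the base date, delta_days the offset.
        return {
            "year": base.year,
            "month": base.month,
            "day": base.day,
            "delta_days": (other - base).days,
        }

    def reverse(self, parts: dict) -> tuple:
        # The transformation is lossless, so it can be inverted exactly.
        base = date(parts["year"], parts["month"], parts["day"])
        other = base + timedelta(days=parts["delta_days"])
        return base, other
```

With this decomposition, a synthesis algorithm never has to learn that the two dates share a year and month; that constraint is enforced by construction when `reverse` rebuilds the pair.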
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluation\n",
"Finally, in evaluation, arbitrary code is executed on the resulting data to produce a set of metrics.\n",
"Pasteur does not enforce a particular structure on what a metric is, beyond certain\n",
"access patterns.\n",
"Metrics are split, based on how they access data, into `Column`, `Table`, and `View` metrics.\n",
"This influences how metrics are instantiated in a synthesis execution:\n",
"- `Column`: once per column with a matching type\n",
"- `Table`: once per table\n",
"- `View`: once\n",
"\n",
"and what input they receive:\n",
"- `Column`: only their column, plus its metadata\n",
"- `Table`: the table and its parents, plus metadata for those\n",
"- `View`: all tables, plus the View metadata\n",
"\n",
"In addition, `View` and `Table` metrics may choose to receive raw data,\n",
"Base Transform Layer data, or encodings (e.g., a classifier metric requires\n",
"continuous data as input and discrete classes as output)."
]
},
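The instantiation rules above can be sketched as a small function that, given a View schema, lists which metric instances a synthesis execution would create. The schema shape and names are hypothetical, not Pasteur's actual metric API:

```python
# Hypothetical sketch of Column/Table/View metric instantiation for a
# synthesis execution. Names and schema shape are illustrative, not
# Pasteur's actual API.

def instantiate_metrics(view_schema: dict) -> list:
    """Return (metric_kind, target) pairs for a given View schema."""
    instances = [("View", "*")]  # View metrics are instantiated once
    for table, columns in view_schema.items():
        instances.append(("Table", table))  # once per table
        for column, ctype in columns.items():
            # Column metrics run once per column with a matching type;
            # here, a metric that only matches numerical columns.
            if ctype == "numerical":
                instances.append(("Column", f"{table}.{column}"))
    return instances

schema = {
    "patients": {"age": "numerical", "city": "categorical"},
    "visits": {"duration": "numerical"},
}
instances = instantiate_metrics(schema)
```

For this schema the execution would create one `View` instance, two `Table` instances, and two `Column` instances (the categorical `city` column is skipped by the numerical-only metric).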
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Appendix: Full Methodology\n",
"The full methodology and its steps can be seen below.\n",
"![Methodology Complete](./res/graphs-complete_pipeline.svg)"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
