Commit a7f6c9d

restructure docs into prelim and tutorial
1 parent f29b8bc commit a7f6c9d

12 files changed

Lines changed: 128 additions & 141 deletions

docs/source/index.rst

Lines changed: 8 additions & 0 deletions
@@ -28,6 +28,14 @@ Welcome to Pasteur's documentation!

This documentation is a work-in-progress!

.. toctree::
   :maxdepth: 2
   :caption: Preliminaries
   :glob:

   preliminaries/*

.. toctree::
   :maxdepth: 2
   :caption: Tutorial
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
# Introduction
Welcome and thank you for your interest in synthetic data generation!

In this section, we cover the preliminaries of Pasteur:
a complete methodology for data synthesis and its system architecture.
Data synthesis consists of too many steps (preprocessing, encoding,
synthesis, decoding, evaluation) to be performed ad hoc in a
reproducible and reliable manner.
For this reason, parallelizing and caching between those steps
(especially for large datasets) becomes critical.

In the next section, we will go through an example project
and explain how to create and use custom dataset, synthesis, and
evaluation modules.
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
# Methodology
Pasteur is a pipeline-based system built around a methodology with three stages:
ingestion, synthesis, and evaluation.
The whole process is designed to be deterministic, from raw data
to evaluated product.

![Methodology Overview](../res/graphs-pipeline_simple.svg)

## Ingestion
The process begins with collecting the raw data sources.
In a Pasteur project, raw data sources are stored in the `./raw` directory by default,
but you can change this to e.g. an S3 share or a different drive mount, in case the
data is large.
The `./raw` data directory is considered immutable by Pasteur, and can contain
archives (`.zip`, `.tar`), SQL dumps, CSV files, JSON files, or any other type of file
in its raw form.

The raw data sources are then combined and preprocessed to form a `Dataset`.
This is done by a `Dataset` module, which contains all the information
required to wrangle the raw sources into a tidy form (by typing, removing unnecessary
data, and renaming columns to adhere to project standards).
Currently, only multi-table tabular data is supported, but in the future
Pasteur should also support multi-modal data such as images, time series, and
audio.

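As a sketch of the kind of wrangling a `Dataset` module performs (the function, file, and column names here are illustrative, not Pasteur's actual API), assuming pandas:

```python
import pandas as pd

def ingest_admissions(raw_csv) -> pd.DataFrame:
    """Illustrative wrangling step: type, prune, and rename raw columns."""
    df = pd.read_csv(raw_csv)
    # Enforce types so downstream stages can rely on them.
    df["admit_date"] = pd.to_datetime(df["admit_date"])
    df["ward"] = df["ward"].astype("category")
    # Remove data that is unnecessary for synthesis.
    df = df.drop(columns=["free_text_notes"])
    # Rename columns to adhere to project standards.
    return df.rename(columns={"pid": "patient_id"})
```

The resulting tidy table is what later stages treat as the `Dataset`.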
Datasets represent the raw data sources in full resolution.
With few exceptions, this means it is intractable to synthesize them
as-is.
We therefore downsample Datasets into a more manageable form,
which we name a `View`.
`View` modules have an API similar to Datasets and work in a similar way.
However, since the bulk of processing and typing is performed at the `Dataset`
level, `View` modules are used to select subsets of tables, rows, and columns from
the original `Dataset`, or to simplify columns to a smaller domain.

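A `View` might then look like the following sketch (names are illustrative, not Pasteur's API): it subsets rows and columns and coarsens one column to a smaller domain:

```python
import pandas as pd

def icu_view(admissions: pd.DataFrame) -> pd.DataFrame:
    """Illustrative View: subset a Dataset and simplify a column's domain."""
    # Select a subset of rows and columns from the Dataset.
    view = admissions.loc[
        admissions["ward"] == "ICU", ["patient_id", "age", "ward"]
    ].copy()
    # Simplify a column to a smaller domain (exact ages -> age bands).
    view["age_band"] = pd.cut(
        view["age"], bins=[0, 18, 65, 120],
        labels=["child", "adult", "senior"],
    )
    return view.drop(columns=["age"])
```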
The ingestion stage ends with the preprocessed `View` data.
We associate a set of hyperparameters with this `View` data, which can be used
to transform and encode it for synthesis, as well as for evaluation.

## Synthesis
The next stage is synthesis itself.
Pasteur does not dictate a particular procedure for synthesis, other than
splitting it into three steps (`bake`, `fit`, and `sample`), leaving the rest up
to the algorithm creator.
Pasteur does, however, define a formalized procedure for encoding data and integrating domain
knowledge, which is a significant and complex task on its own.
Pasteur separates the encoding process into two reversible steps: Transformation and Encoding.

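To illustrate the three-step split (this class is a toy sketch, not Pasteur's actual interface), consider a trivial "synthesizer" that samples each column independently:

```python
import random

class IndependentSynth:
    """Toy synthesizer illustrating the bake/fit/sample split.

    bake:   one-off preprocessing on the encoded data (here: record the columns)
    fit:    learn model parameters (here: the observed values per column)
    sample: generate n synthetic rows from the fitted model
    """

    def bake(self, rows):
        # Derive static metadata from the data, e.g. the column set.
        self.columns = list(rows[0].keys())

    def fit(self, rows):
        # "Model": observed values per column, to be sampled independently.
        self.values = {c: [r[c] for r in rows] for c in self.columns}

    def sample(self, n, seed=0):
        # Deterministic sampling given a seed, matching the pipeline's
        # end-to-end reproducibility goal.
        rng = random.Random(seed)
        return [{c: rng.choice(self.values[c]) for c in self.columns}
                for _ in range(n)]
```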
In Transformation, complex types (such as dates) are broken down into simpler
ones (numerical, categorical values) with a formalized specification (the Base Transform Layer).
This process is performed by modules named Transformers, each of which is designed
to handle a single column type.
Transformers integrate expert domain knowledge into their column transformations,
and the end result is simpler types with orthogonal information.
For example, consider two date columns we know to be related: `21-3-23` and `25-3-23`.
They both contain denormalized information about the year.
With a naive transformation, it would be left up to the synthesis algorithm to
ensure the denormalized dates (years, months) match.
However, we can transform those two dates into `2023`, `March`, `21`, and `4 days`,
in which case the information held by each column has been normalized (no duplication).

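The date example above can be reproduced with a few lines of standard-library Python (a sketch of the idea, not Pasteur's Transformer API):

```python
from datetime import date

def normalize_dates(anchor: date, relative: date):
    """Split two related dates into orthogonal parts: the anchor's
    year, month, and day, plus the second date's offset in days."""
    offset = (relative - anchor).days
    return anchor.year, anchor.strftime("%B"), anchor.day, f"{offset} days"
```

Here the second date is stored only as its distance from the first, so no column duplicates the year or month information.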
In Encoding, the simpler column types are converted to a form suitable for the
synthesis algorithm and metric at hand.
A single synthesis execution can require multiple encodings.
For neural network algorithms, the Base Transform Layer would be converted
into vector batches (e.g., float16/32 vectors, one-hot encodings).
For marginal algorithms, the numerical columns would be discretized into bins.

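As an illustrative sketch of these two encoding targets (NumPy-based, with hypothetical helper names): one-hot float vectors for a neural model, and discretized bins for a marginal model:

```python
import numpy as np

def one_hot(categories, domain):
    """Encode a categorical column as one-hot float32 vectors (neural target)."""
    idx = np.array([domain.index(c) for c in categories])
    return np.eye(len(domain), dtype=np.float32)[idx]

def discretize(values, n_bins, lo, hi):
    """Encode a numerical column as bin indices (marginal target)."""
    edges = np.linspace(lo, hi, n_bins + 1)
    # Clip so that lo and hi themselves land in the first/last bin.
    return np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
```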
Encoding and Transformation example:

![Encoding and Transforming example](../res/graphs-encoding.svg)

## Evaluation
Finally, in evaluation, arbitrary code is executed on the resulting data to produce a set of metrics.
Pasteur does not enforce a particular structure on what a metric is, other than certain
access patterns.
Metrics are split, based on how they access data, into `Column`, `Table`, and `View` metrics.
This influences how metrics are instantiated in a synthesis execution:
- `Column`: once per column with a matching type
- `Table`: once per table
- `View`: once

and what input they receive:
- `Column`: only their column + metadata
- `Table`: the table and its parents, plus metadata for those
- `View`: all tables and the View metadata

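As a sketch of the narrowest access pattern (the class names are hypothetical, not Pasteur's API), a column metric sees only its own column and metadata:

```python
class ColumnMetric:
    """Instantiated once per column with a matching type;
    receives only that column plus its metadata."""
    def compute(self, column, meta):
        raise NotImplementedError

class NullFraction(ColumnMetric):
    """Toy column metric: the fraction of missing values in the column."""
    def compute(self, column, meta):
        return sum(v is None for v in column) / len(column)
```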
In addition, `View` and `Table` metrics may choose to receive raw data,
Base Transform Layer data, or encodings (e.g., a classifier module requires
continuous data as input and discrete classes as output).

## Appendix: Full Methodology
The full methodology and its steps can be seen below.

![Methodology Complete](../res/graphs-complete_pipeline.svg)
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# Architecture

![Parallelization Architecture](../res/graphs-parallel.svg)

![Partitioning](../res/graphs-partitioning.svg)
6 files renamed without changes.
