|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "attachments": {}, |
| 5 | + "cell_type": "markdown", |
| 6 | + "metadata": {}, |
| 7 | + "source": [ |
| 8 | + "# Methodology\n", |
| 9 | + "Pasteur is a pipeline-based system built around a methodology with three stages:\n", |
| 10 | + "ingestion, synthesis, and evaluation.\n", |
| 11 | + "The whole process is designed to be deterministic, from raw data\n", |
| 12 | + "to evaluated product.\n", |
| 13 | + "\n", |
| 14 | + "" |
| 15 | + ] |
| 16 | + }, |
| 17 | + { |
| 18 | + "attachments": {}, |
| 19 | + "cell_type": "markdown", |
| 20 | + "metadata": {}, |
| 21 | + "source": [ |
| 22 | + "## Ingestion\n", |
| 23 | + "The process begins with collecting the raw data sources.\n", |
| 24 | + "In a Pasteur project, raw data sources are stored in the `./raw` directory by default,\n", |
| 25 | + "but you can change this to, e.g., an S3 bucket or a different drive mount, in case\n", |
| 26 | + "the data is large.\n", |
| 27 | + "The `./raw` data directory is considered immutable by Pasteur, and can contain\n", |
| 28 | + "archives (`.zip`, `.tar`), SQL dumps, CSV files, JSON files or any type of file\n", |
| 29 | + "in its raw form.\n", |
| 30 | + "\n", |
| 31 | + "The raw data sources are then combined and preprocessed to form a `Dataset`.\n", |
| 32 | + "This is done by a `Dataset` module, which contains all relevant information\n", |
| 33 | + "required to wrangle the raw sources into a tidy form (by typing, removing unnecessary \n", |
| 34 | + "data, and renaming columns to adhere to project standards).\n", |
| 35 | + "Currently, only multi-table tabular data is supported, but in the future\n", |
| 36 | + "Pasteur should also support multi-modal data such as images, time-series, and\n", |
| 37 | + "audio.\n", |
| 38 | + "\n", |
| 39 | + "Datasets represent the raw data sources in full resolution.\n", |
| 40 | + "With few exceptions, this means that synthesizing them as-is\n", |
| 41 | + "is intractable.\n", |
| 42 | + "We therefore need to downsample Datasets into a more manageable form.\n", |
| 43 | + "We name this form a `View`.\n", |
| 44 | + "`View` modules have an API similar to `Dataset` modules and follow a similar process.\n", |
| 45 | + "However, since the bulk of processing and typing is performed on the `Dataset`\n", |
| 46 | + "level, `View` modules are used to select subsets of tables, rows, and columns from\n", |
| 47 | + "the original `Dataset`, or simplify columns to a smaller domain.\n", |
| 48 | + "\n", |
| 49 | + "The ingestion stage ends with the preprocessed `View` data.\n", |
| 50 | + "We associate a set of hyperparameters with this `View` data, which can be used\n", |
| 51 | + "to transform and encode it for synthesis, as well as for evaluation." |
| 52 | + ] |
| 53 | + }, |
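|  | + { |
|  | + "attachments": {}, |
|  | + "cell_type": "markdown", |
|  | + "metadata": {}, |
|  | + "source": [ |
|  | + "As a minimal sketch of this idea (the table and function names here are\n", |
|  | + "hypothetical, not Pasteur's actual `View` API), a view might keep a subset of\n", |
|  | + "columns and simplify a high-resolution column to a smaller domain:\n", |
|  | + "\n", |
|  | + "```python\n", |
|  | + "def patients_view(rows):\n", |
|  | + "    # Keep a subset of columns and simplify 'age' to a small domain.\n", |
|  | + "    def age_group(age):\n", |
|  | + "        return 'child' if age < 18 else 'adult' if age < 65 else 'senior'\n", |
|  | + "    return [{'id': r['id'], 'age_group': age_group(r['age'])} for r in rows]\n", |
|  | + "```" |
|  | + ] |
|  | + }, |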
| 54 | + { |
| 55 | + "attachments": {}, |
| 56 | + "cell_type": "markdown", |
| 57 | + "metadata": {}, |
| 58 | + "source": [ |
| 59 | + "## Synthesis\n", |
| 60 | + "The next stage of the methodology is synthesis itself.\n", |
| 61 | + "Pasteur does not dictate a particular procedure for synthesis, other than\n", |
| 62 | + "splitting it into three stages (`bake`, `fit`, and `sample`), leaving the rest up\n", |
| 63 | + "to the algorithm creator.\n", |
| 64 | + "Pasteur does define a formalized procedure for encoding data and integrating domain\n", |
| 65 | + "knowledge, which is a significant and complex task on its own.\n", |
| 66 | + "Pasteur separates the encoding process into two reversible steps: Transformation, and Encoding.\n", |
| 67 | + "\n", |
| 68 | + "In Transformation, complex types (such as dates) are broken down into simpler\n", |
| 69 | + "ones (numerical, categorical values) with a formalized specification (Base Transform Layer).\n", |
| 70 | + "This process is performed by modules named Transformers, each of which is designed\n", |
| 71 | + "to handle a single column type.\n", |
| 72 | + "Transformers integrate expert domain knowledge into their column transformations,\n", |
| 73 | + "and the end result is simpler types with orthogonal information.\n", |
| 74 | + "For example, consider two related date columns, `21-3-23` and `25-3-23`: both\n", |
| 75 | + "contain denormalized information about the year and month.\n", |
| 76 | + "With a naive transformation, it would be left up to the synthesis algorithm to\n", |
| 77 | + "enforce that the denormalized date parts (years, months) match.\n", |
| 78 | + "However, we can transform those two dates into: `2023`, `March`, `23`, `4 days`,\n", |
| 79 | + "in which case the information held by each column has been normalized (no duplication).\n", |
| 80 | + "\n", |
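|  | + "A minimal sketch of such a reversible transformation (the function names here\n", |
|  | + "are hypothetical, not Pasteur's actual Transformer API):\n", |
|  | + "\n", |
|  | + "```python\n", |
|  | + "from datetime import date, timedelta\n", |
|  | + "\n", |
|  | + "def transform(start: date, end: date) -> dict:\n", |
|  | + "    # Absolute parts for the first date, a relative offset for the second:\n", |
|  | + "    # each output column now holds orthogonal (non-duplicated) information.\n", |
|  | + "    return {'year': start.year, 'month': start.month, 'day': start.day,\n", |
|  | + "            'offset_days': (end - start).days}\n", |
|  | + "\n", |
|  | + "def reverse(parts: dict):\n", |
|  | + "    # The step is reversible: the original dates can be reconstructed exactly.\n", |
|  | + "    start = date(parts['year'], parts['month'], parts['day'])\n", |
|  | + "    return start, start + timedelta(days=parts['offset_days'])\n", |
|  | + "```\n", |
|  | + "\n", |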
| 81 | + "In Encoding, the simpler column types are converted to a form suitable for the\n", |
| 82 | + "synthesis algorithm and metric at hand.\n", |
| 83 | + "A synthesis execution can require multiple encodings.\n", |
| 84 | + "For Neural Network algorithms, the Base Transform Layer would be converted\n", |
| 85 | + "into vector batches (e.g., float16/32 vectors, one-hot).\n", |
| 86 | + "For Marginal Algorithms, the numerical columns would be discretized into bins.\n", |
| 87 | + "\n", |
| 88 | + "Encoding and Transformation example:\n", |
| 89 | + "\n", |
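|  | + "A minimal sketch of both encodings (illustrative helpers, not Pasteur's\n", |
|  | + "actual encoder API):\n", |
|  | + "\n", |
|  | + "```python\n", |
|  | + "def discretize(values, edges):\n", |
|  | + "    # Marginal algorithms: map each numerical value to a bin index.\n", |
|  | + "    return [sum(v >= e for e in edges) for v in values]\n", |
|  | + "\n", |
|  | + "def one_hot(index, domain):\n", |
|  | + "    # Neural networks: expand a categorical index into a one-hot vector.\n", |
|  | + "    return [1.0 if i == index else 0.0 for i in range(domain)]\n", |
|  | + "```\n", |
|  | + "\n", |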
| 90 | + "" |
| 91 | + ] |
| 92 | + }, |
| 93 | + { |
| 94 | + "attachments": {}, |
| 95 | + "cell_type": "markdown", |
| 96 | + "metadata": {}, |
| 97 | + "source": [ |
| 98 | + "## Evaluation\n", |
| 99 | + "Finally, in evaluation, arbitrary code is executed on the resulting data to produce a set of metrics.\n", |
| 100 | + "Pasteur does not enforce a particular structure for what a metric is, other than certain\n", |
| 101 | + "data access patterns.\n", |
| 102 | + "Metrics are split based on how they access data into `Column`, `Table`, and `View` metrics.\n", |
| 103 | + "This influences how metrics are instantiated in a synthesis execution: \n", |
| 104 | + "- `Column`: once per column with matching type\n", |
| 105 | + "- `Table`: once per table\n", |
| 106 | + "- `View`: once\n", |
| 107 | + " \n", |
| 108 | + "And what input they receive:\n", |
| 109 | + "- `Column`: only their column + metadata\n", |
| 110 | + "- `Table`: the table and its parents, and metadata for those\n", |
| 111 | + "- `View`: View metadata and all tables\n", |
| 112 | + "\n", |
| 113 | + "In addition, `View` and `Table` metrics may opt to receive raw data,\n", |
| 114 | + "Base Transform Layer data, or encodings (e.g., a classifier module requires\n", |
| 115 | + "continuous data as input and discrete classes as output)." |
| 116 | + ] |
| 117 | + }, |
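|  | + { |
|  | + "attachments": {}, |
|  | + "cell_type": "markdown", |
|  | + "metadata": {}, |
|  | + "source": [ |
|  | + "As a toy example of the `Column` access pattern (hypothetical, not Pasteur's\n", |
|  | + "actual metric API), a metric instantiated once per column might compute the\n", |
|  | + "fraction of missing values from just that column:\n", |
|  | + "\n", |
|  | + "```python\n", |
|  | + "class NullFraction:\n", |
|  | + "    # A `Column`-style metric: it sees only one column (plus its metadata).\n", |
|  | + "    def fit(self, column, meta=None):\n", |
|  | + "        self.value = sum(v is None for v in column) / len(column)\n", |
|  | + "        return self.value\n", |
|  | + "```" |
|  | + ] |
|  | + }, |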
| 118 | + { |
| 119 | + "attachments": {}, |
| 120 | + "cell_type": "markdown", |
| 121 | + "metadata": {}, |
| 122 | + "source": [ |
| 123 | + "## Appendix: Full Methodology\n", |
| 124 | + "The full methodology and its steps can be seen below.\n", |
| 125 | + "" |
| 126 | + ] |
| 127 | + } |
| 128 | + ], |
| 129 | + "metadata": { |
| 130 | + "language_info": { |
| 131 | + "name": "python" |
| 132 | + } |
| 133 | + }, |
| 134 | + "nbformat": 4, |
| 135 | + "nbformat_minor": 2 |
| 136 | +} |