|
1 | 1 | # Import Automation System |
2 | 2 |
|
3 | | -The import automation system has three components: |
4 | | -1. [Cloud Build configuration file](cloudbuild/README.md) |
5 | | -2. [Executor](executor/README.md) |
| 3 | +The import automation system has two components: |
| 4 | +- Import Job (Cloud Batch) |
| 5 | +- Ingestion Pipeline (Dataflow + Cloud Workflow) |
6 | 6 |
|
7 | | -## User Manual |
| 7 | +## Import Job |
| 8 | +Import jobs fetch data from external sources and make it available for ingestion into the knowledge graph. Each import typically consists of an import script and a manifest file containing import metadata (e.g., the refresh schedule). To add a new data import to the stack, add the import script and manifest, then run the scheduler script, which configures a Cloud Scheduler job to run the import periodically. Detailed instructions for configuring a new import job are available in the [user guide](executor/README.md).
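As a sketch of what a manifest might look like (`import_specifications` and `import_name` appear in existing manifests; the remaining fields are illustrative assumptions, so consult the [user guide](executor/README.md) for the authoritative schema):

```json
{
  "import_specifications": [
    {
      "import_name": "USBLS_CPIAllItemsAverage",
      "provenance_description": "Illustrative description field",
      "cron_schedule": "0 5 * * 1"
    }
  ]
}
```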
8 | 9 |
|
9 | | -### Specifying Import Targets |
10 | | -Import targets are specified in the commit message using the IMPORTS tag. |
11 | | -The system accepts one or two of the following formats depending on the files |
12 | | -affected by the commit. |
| 10 | +The scheduler job triggers a GCP Workflow, which then creates a GCP Batch job for each data import. An import job downloads the data, processes it, generates resolved MCF, and copies the output to GCS. It relies on the DataCommons [Import tool](https://github.com/datacommonsorg/import/blob/master/docs/usage.md) for MCF generation. The import job also runs several validations to ensure data quality; more details about the validation framework and supported validations can be found in the [README](https://github.com/datacommonsorg/data/tree/master/tools/import_validation).
13 | 11 |
|
14 | | -1. Absolute import name: |
15 | | - {path to the directory containing the manifest file}:{import name in the manifest} |
16 | | -2. Relative import name: {import name in the manifest} |
| 12 | +The status of each import job is recorded in the ImportStatus Spanner table and can be monitored via the [Looker Studio dashboard](https://lookerstudio.google.com/c/reporting/e88fda74-50c9-46c6-88aa-c84342ceba48/).
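For ad-hoc checks outside the dashboard, a query against the status table might look like the following (column names here are assumptions for illustration, not the actual schema; see [spanner_schema.sql](workflow/spanner_schema.sql) for the real table definition):

```sql
-- Hypothetical query: column names are illustrative assumptions.
SELECT import_name, state, latest_run_time
FROM ImportStatus
ORDER BY latest_run_time DESC
LIMIT 20;
```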
17 | 13 |
|
18 | | -In the commit message, use IMPORTS={comma separated list of import names |
19 | | -without spaces between the elements} to specify import targets. |
20 | | -The commit message may contain more than just the tag and list. |
| 14 | +## Ingestion Pipeline |
| 15 | +DataCommons runs import jobs on Cloud Batch that write their output MCF data to GCS. The graph ingestion pipeline (Dataflow) consumes this output and pushes the data into the knowledge graph (Spanner). More details about the ingestion pipeline are available [here](https://github.com/datacommonsorg/import/tree/master/pipeline/ingestion).
21 | 16 |
|
22 | | -A commit can modify files in multiple directories that contain manifest files. |
23 | | -The system will detect the affected directories based on paths of changed files. |
24 | | -If only one directory is affected, both absolute and relative import names |
25 | | -are accepted. If multiple directories are affected, absolute import names must |
26 | | -be used. You can also use IMPORTS=all to ask the system to run all affected |
27 | | -imports and use IMPORTS={path to the directory containing the manifest file}:all |
28 | | -to run all imports in that directory, but they are discouraged as they are not |
29 | | -explicit. |
| 17 | +A GCP [Cloud Workflow](workflow/spanner-ingestion-workflow.yaml) coordinates the auto-refresh import jobs and the ingestion Dataflow pipeline. To maintain data consistency, a global lock ensures that only a single execution of the workflow is active at any time. The workflow relies on [Spanner tables](workflow/spanner_schema.sql) for metadata management and on [helper cloud functions](workflow/ingestion-helper/README.md) to control execution.
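As an illustrative sketch of the lock-guarded control flow (not the actual workflow definition; step names, helper-function URLs, and payload fields are assumptions, so see [spanner-ingestion-workflow.yaml](workflow/spanner-ingestion-workflow.yaml) for the real one):

```yaml
# Illustrative sketch only; URLs, step names, and response fields are assumptions.
main:
  steps:
    - acquire_lock:
        # Hypothetical helper cloud function that reports whether the
        # global lock was acquired.
        call: http.post
        args:
          url: https://example-region-project.cloudfunctions.net/acquire-lock
          auth:
            type: OIDC
        result: lock
    - check_lock:
        switch:
          - condition: ${lock.body.acquired == false}
            next: end  # Another execution is active; exit without running.
    - run_ingestion:
        call: sys.log  # Placeholder for launching Batch import jobs + Dataflow.
        args:
          text: "run import jobs and ingestion pipeline here"
    - release_lock:
        call: http.post
        args:
          url: https://example-region-project.cloudfunctions.net/release-lock
          auth:
            type: OIDC
```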
30 | 18 |
|
| 19 | +Infrastructure deployment for the various components in the import automation stack is automated using a [Terraform script](terraform/main.tf). |
31 | 20 |
|
32 | | -### Example Commit Messages |
33 | | - |
34 | | -Assuming the following directory structure: |
35 | | -``` |
36 | | -. |
37 | | -└── data |
38 | | - └── scripts |
39 | | - └── us_bls |
40 | | - ├── cpi |
41 | | - │ ├── README.md |
42 | | - │ ├── c_cpi_u_1999_2020.csv |
43 | | - │ ├── c_cpi_u_1999_2020.mcf |
44 | | - │ ├── c_cpi_u_1999_2020.tmcf |
45 | | - │ ├── c_cpi_u_1999_2020_import_config.txt |
46 | | - │ ├── cpi_u_1913_2020.csv |
47 | | - │ ├── cpi_u_1913_2020.tmcf |
48 | | - │ ├── cpi_u_1913_2020_import_config.txt |
49 | | - │ ├── cpi_w_1913_2020.csv |
50 | | - │ ├── cpi_w_1913_2020.mcf |
51 | | - │ ├── cpi_w_1913_2020.tmcf |
52 | | - │ ├── cpi_w_1913_2020_import_config.txt |
53 | | - │ ├── generate_csv.py |
54 | | - │ └── manifest.json |
55 | | - └── jolts |
56 | | - ├── BLSJolts.csv |
57 | | - ├── BLSJolts.tmcf |
58 | | - ├── BLSJolts_StatisticalVariables.mcf |
59 | | - ├── README.md |
60 | | - ├── __init__.py |
61 | | - ├── bls_jolts.py |
62 | | - ├── import.config |
63 | | - └── manifest.json |
64 | | -``` |
65 | | - |
66 | | -Assuming scripts/us_bls/cpi/manifest.json has |
67 | | -```json |
68 | | -{ |
69 | | - "import_specifications": [ |
70 | | - { |
71 | | - "import_name": "USBLS_CPIAllItemsAverage", |
72 | | - ... |
73 | | - } |
74 | | - ] |
75 | | -} |
76 | | -``` |
77 | | - |
78 | | -Assuming scripts/us_bls/jolts/manifest.json has |
79 | | -```json |
80 | | -{ |
81 | | - "import_specifications": [ |
82 | | - { |
83 | | - "import_name": "BLS_JOLTS", |
84 | | - ... |
85 | | - } |
86 | | - ] |
87 | | -} |
88 | | -``` |
89 | | - |
90 | | -If the commit only changes files in scripts/us_bls/cpi: |
91 | | -- To import USBLS_CPIAllItemsAverage: |
92 | | - - "fix syntax error IMPORTS=scripts/us_bls/cpi:USBLS_CPIAllItemsAverage" or |
93 | | - - "IMPORTS=USBLS_CPIAllItemsAverage fix memory leak" |
94 | | -- To import BLS_JOLTS |
95 | | - - "nice day IMPORTS=scripts/us_bls/jolts:BLS_JOLTS" |
96 | | -- To import both USBLS_CPIAllItemsAverage and BLS_JOLTS |
97 | | - - "update README IMPORTS=scripts/us_bls/cpi:USBLS_CPIAllItemsAverage,scripts/us_bls/jolts:BLS_JOLTS" or |
98 | | - - "IMPORTS=USBLS_CPIAllItemsAverage,scripts/us_bls/jolts:BLS_JOLTS hope they succeed" |
99 | | - |
100 | | -If the commit changes files in both scripts/us_bls/cpi and scripts/us_bls/jolts |
101 | | -directories: |
102 | | -- To import USBLS_CPIAllItemsAverage: |
103 | | - - "fix syntax error IMPORTS=scripts/us_bls/cpi:USBLS_CPIAllItemsAverage" |
104 | | -- To import BLS_JOLTS |
105 | | -  - "nice day IMPORTS=scripts/us_bls/jolts:BLS_JOLTS"
107 | | -- To import both USBLS_CPIAllItemsAverage and BLS_JOLTS |
108 | | -  - "update README IMPORTS=scripts/us_bls/cpi:USBLS_CPIAllItemsAverage,scripts/us_bls/jolts:BLS_JOLTS"
110 | | - |
111 | | -### Importing to Dev Graph |
112 | | - |
113 | | -1. Fork datacommonsorg/data |
114 | | -2. Create a new branch in the fork |
115 | | -3. Create a pull request from the new branch to master of datacommonsorg/data |
116 | | -4. Push commits to the branch. If you want a commit to execute some imports, |
117 | | - specify the import targets in the commit message |
118 | | - (see [Specifying Import Targets](#specifying-import-targets)). If no tag |
119 | | - is found, no imports will be executed. |
120 | | - |
121 | | -### Scheduling Updates |
122 | | - |
123 | | -1. Push to master of datacommonsorg/data and specify the targets in the commit |
124 | | - message as described by [Specifying Import Target](#specifying-import-targets) |
125 | | - but using the SCHEDULES tag instead of IMPORTS. |
126 | | -2. Configure the production pipeline to pick up the data files at some schedule. |
127 | | - |
128 | | - |
129 | | -## Deployment |
130 | | - |
131 | | -1. Check in the `import-automation` directory to the repository. |
132 | | -2. [Configure](executor/README.md#configuring-the-executor) and [deploy](executor/README.md#deploying-on-app-engine) the executor |
133 | | -3. [Create a Cloud Tasks queue](#creating-cloud-task-queue)
134 | | -4. [Connect the repository to Cloud Build and set up Cloud Build triggers](#setting-up-cloud-build)
135 | | - |
136 | | - |
137 | | -### Creating Cloud Task Queue |
138 | | -- See https://cloud.google.com/tasks/docs/creating-queues#creating_a_queue |
139 | | -- It is recommended that `maxAttempts` be set to one. This can be done by `gcloud tasks queues update <QUEUE_ID> --max-attempts=1` |
140 | | - |
141 | | - |
142 | | -### Setting Up Cloud Build |
143 | | -1. [Connect](https://cloud.google.com/cloud-build/docs/automating-builds/create-manage-triggers#connect_repo) the repository to Cloud Build |
144 | | -2. Create the trigger that runs on pull requests for importing to dev |
145 | | - - See [this](https://cloud.google.com/cloud-build/docs/automating-builds/create-manage-triggers#build_trigger) for how to create triggers |
146 | | - - Choose these settings |
147 | | - - **Event**: `Pull request (GitHub App only)` |
148 | | - - **Source** |
149 | | - - **Repository**: Choose the connected repository from the list |
150 | | - - **Branch**: `^master$` |
151 | | - - **Comment control**: Choose from one of the three options. |
152 | | - - `Required except for owners and collaborators`: Pull requests from owners and collaborators would be able to trigger Cloud Build directly. Non-collaborators would need a `/gcbrun` comment from an owner or collaborator in the pull request |
153 | | - - `Required`: Every pull request would need a `/gcbrun` comment from an owner or collaborator to run Cloud Build |
154 | | - - `Not required`: Every pull request can trigger Cloud Build directly. |
155 | | - - **Build Configuration** |
156 | | - - **File type**: `Cloud Build configuration file (yaml or json)` |
157 | | - - **Cloud Build configuration file location**: `/import-automation/cloudbuild/cloudbuild.yaml` |
158 | | - - **Substitution variables** |
159 | | - - **_EMAIL_ACCOUNT**: `<email account used for sending notifications>`; |
160 | | - - **_EMAIL_TOKEN**: `<password, app password, or access token of the email account>` |
161 | | - - **_GITHUB_AUTH_USERNAME**: `<GitHub username to authenticate with GitHub API>` |
162 | | - - **_GITHUB_AUTH_ACCESS_TOKEN**: `<access token of the GitHub account>` |
163 | | - - **_GITHUB_REPO_NAME**: `<name of the connected repository, e.g., data>` |
164 | | - - **_GITHUB_REPO_OWNER_USERNAME**: `<username of the owner of the repository, e.g., datacommonsorg>` |
165 | | - - **_HANDLER_SERVICE**: `<service the executor is deployed to, e.g., default>` |
166 | | - - **_HANDLER_URI**: `<URI of the executor's endpoint that imports to dev, e.g., />` |
167 | | - - **_IMPORTER_OAUTH_CLIENT_ID**: `<OAuth client ID used to authenticate with the proxy for the importer>` |
168 | | - - **_TASK_LOCATION_ID**: `<location ID of the Cloud Tasks queue, e.g., us-central1>` (This can be found by going to the Cloud Tasks control panel and look at the "Location" column.) |
169 | | - - **_TASK_PROJECT_ID**: `<ID of the Google Cloud project that hosts the task queue, e.g., google.com:datcom-data>` |
170 | | - - **_TASK_QUEUE_NAME**: `<Name of the task queue>` |
171 | | -3. Create the trigger that runs on pushes to *master* for scheduling updates |
172 | | - - Copy the dev trigger but add/override these settings |
173 | | - - **Event**: `Push to a branch` |
174 | | - - **Build Configuration** |
175 | | - - **Substitution variables** (keeping the others) |
176 | | - - **_BASE_BRANCH**: `master` |
177 | | - - **_HEAD_BRANCH**: `master` |
178 | | - - **_PR_NUMBER**: `0` |
179 | | - - **_HANDLER_URI**: `/schedule` |