
Commit 21e697e

Add README for the import automation infra (#1930)
1 parent ae626cc · commit 21e697e

3 files changed

Lines changed: 182 additions & 179 deletions


import-automation/README.md

Lines changed: 11 additions & 170 deletions
@@ -1,179 +1,20 @@
# Import Automation System

-The import automation system has three components:
-1. [Cloud Build configuration file](cloudbuild/README.md)
-2. [Executor](executor/README.md)
+The import automation system has two components:
+- Import Job (Cloud Batch)
+- Ingestion Pipeline (Dataflow + Cloud Workflow)

-## User Manual
+## Import Job
+Import jobs own the task of fetching data from external sources and making it available for ingestion into the knowledge graph. Each import typically includes an import script and a manifest file with import metadata (e.g., the refresh schedule). To add a new data import to the stack, add the import script and manifest, then trigger the scheduler script, which configures a Cloud Scheduler job to run the import periodically. Detailed instructions for configuring a new import job are in the [user guide](executor/README.md).
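To make the manifest-driven flow concrete, here is a minimal Python sketch of what a scheduler script might read from a manifest. The `import_specifications` and `import_name` keys match the manifest examples shown later on this page; `cron_schedule` is an assumed name for the refresh-schedule field, and the authoritative schema is in the [user guide](executor/README.md).

```python
import json

# Minimal sketch: read the fields a scheduler script might need from a
# manifest. "import_specifications" / "import_name" match the manifest
# examples in this repo; "cron_schedule" is an assumed name for the
# refresh-schedule field.
with open("scripts/us_bls/cpi/manifest.json") as f:
    manifest = json.load(f)

for spec in manifest["import_specifications"]:
    name = spec["import_name"]            # e.g. "USBLS_CPIAllItemsAverage"
    schedule = spec.get("cron_schedule")  # assumed field name
    print(f"{name}: refresh schedule {schedule!r}")
```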

-### Specifying Import Targets
-Import targets are specified in the commit message using the IMPORTS tag.
-The system accepts one or two of the following formats depending on the files
-affected by the commit.
+The scheduler job triggers a GCP Workflow, which creates a GCP Batch job for each data import. An import job performs several tasks: downloading the data, processing it, generating resolved MCF, and copying the output to GCS. It relies on the DataCommons [import tool](https://github.com/datacommonsorg/import/blob/master/docs/usage.md) for MCF generation. Several validations also run as part of the import job to ensure data quality; the validation framework and the supported validations are described in the [README](https://github.com/datacommonsorg/data/tree/master/tools/import_validation).
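For a rough picture of the hand-off, the following Python sketch starts one execution of an import workflow, approximately what the scheduler job does under the hood. The project, location, workflow name, and argument payload are placeholders, not the actual deployment values.

```python
from google.cloud.workflows import executions_v1

# Hypothetical sketch: start one execution of the import workflow,
# passing the import name as the workflow argument. All resource names
# below are placeholders.
client = executions_v1.ExecutionsClient()
parent = client.workflow_path("my-gcp-project", "us-central1", "import-workflow")
execution = executions_v1.Execution(
    argument='{"import_name": "USBLS_CPIAllItemsAverage"}')
response = client.create_execution(parent=parent, execution=execution)
print("Started execution:", response.name)
```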

-1. Absolute import name:
-   {path to the directory containing the manifest file}:{import name in the manifest}
-2. Relative import name: {import name in the manifest}
+The status of each import job is recorded in the ImportStatus Spanner table and can be monitored via the [Looker Studio dashboard](https://lookerstudio.google.com/c/reporting/e88fda74-50c9-46c6-88aa-c84342ceba48/).
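For spot checks outside the dashboard, a small Python sketch that queries the table directly; the instance, database, and column names are assumptions (the real tables are defined in [workflow/spanner_schema.sql](workflow/spanner_schema.sql)).

```python
from google.cloud import spanner

# Sketch of checking recent import runs directly. Instance/database IDs
# and column names are assumed, not the actual schema.
client = spanner.Client(project="my-gcp-project")
database = client.instance("my-instance").database("my-database")

with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT ImportName, State, LatestRunTime "
        "FROM ImportStatus ORDER BY LatestRunTime DESC LIMIT 10")
    for import_name, state, run_time in rows:
        print(import_name, state, run_time)
```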

-In the commit message, use IMPORTS={comma separated list of import names
-without spaces between the elements} to specify import targets.
-The commit message may contain more than just the tag and list.
+## Ingestion Pipeline
+The import jobs running on Cloud Batch write their output MCF data to GCS. The graph ingestion pipeline (Dataflow) consumes this output and pushes the data into the knowledge graph (Spanner). More details about the ingestion pipeline are available [here](https://github.com/datacommonsorg/import/tree/master/pipeline/ingestion).
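As a quick way to inspect what the pipeline would consume, a small Python sketch that lists generated MCF files on GCS; the bucket name and prefix are placeholders, since the actual layout is determined by the import jobs.

```python
from google.cloud import storage

# Sketch of inspecting the MCF output that the Dataflow pipeline ingests.
# Bucket name and prefix are placeholders.
client = storage.Client(project="my-gcp-project")
for blob in client.list_blobs("my-import-output-bucket", prefix="imports/"):
    if blob.name.endswith(".mcf"):
        print(blob.name, blob.size)
```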

-A commit can modify files in multiple directories that contain manifest files.
-The system will detect the affected directories based on paths of changed files.
-If only one directory is affected, both absolute and relative import names
-are accepted. If multiple directories are affected, absolute import names must
-be used. You can also use IMPORTS=all to ask the system to run all affected
-imports, and IMPORTS={path to the directory containing the manifest file}:all
-to run all imports in that directory, but these forms are discouraged because
-they are not explicit.
+A GCP [cloud workflow](workflow/spanner-ingestion-workflow.yaml) coordinates control between the auto-refresh import jobs and the ingestion Dataflow pipeline. To maintain data consistency, a global lock ensures that only a single execution of the workflow is active at any time. The workflow relies on several [Spanner tables](workflow/spanner_schema.sql) for metadata management and on [helper cloud functions](workflow/ingestion-helper/README.md) to control execution.
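A minimal Python sketch of the locking idea, claiming a single lock row inside a Spanner read-write transaction; the `IngestionLock` table and its columns are hypothetical stand-ins for whatever [workflow/spanner_schema.sql](workflow/spanner_schema.sql) actually defines.

```python
from google.cloud import spanner

# Illustrative sketch of the global lock: a single row is claimed inside
# a read-write transaction, so at most one workflow execution proceeds.
# Table, columns, and resource IDs are hypothetical; the sketch assumes
# the lock row already exists.
def try_acquire(transaction):
    rows = list(transaction.execute_sql(
        "SELECT Locked FROM IngestionLock WHERE LockId = 'global'"))
    if rows and rows[0][0]:
        return False  # another workflow execution holds the lock
    transaction.execute_update(
        "UPDATE IngestionLock SET Locked = TRUE WHERE LockId = 'global'")
    return True

client = spanner.Client(project="my-gcp-project")
database = client.instance("my-instance").database("my-database")
acquired = database.run_in_transaction(try_acquire)
print("Lock acquired:", acquired)
```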

+Infrastructure deployment for the various components in the import automation stack is automated using a [Terraform script](terraform/main.tf).

-### Example Commit Messages
-
-Assuming the following directory structure:
-```
-.
-└── data
-    └── scripts
-        └── us_bls
-            ├── cpi
-            │   ├── README.md
-            │   ├── c_cpi_u_1999_2020.csv
-            │   ├── c_cpi_u_1999_2020.mcf
-            │   ├── c_cpi_u_1999_2020.tmcf
-            │   ├── c_cpi_u_1999_2020_import_config.txt
-            │   ├── cpi_u_1913_2020.csv
-            │   ├── cpi_u_1913_2020.tmcf
-            │   ├── cpi_u_1913_2020_import_config.txt
-            │   ├── cpi_w_1913_2020.csv
-            │   ├── cpi_w_1913_2020.mcf
-            │   ├── cpi_w_1913_2020.tmcf
-            │   ├── cpi_w_1913_2020_import_config.txt
-            │   ├── generate_csv.py
-            │   └── manifest.json
-            └── jolts
-                ├── BLSJolts.csv
-                ├── BLSJolts.tmcf
-                ├── BLSJolts_StatisticalVariables.mcf
-                ├── README.md
-                ├── __init__.py
-                ├── bls_jolts.py
-                ├── import.config
-                └── manifest.json
-```
-
-Assuming scripts/us_bls/cpi/manifest.json has
-```json
-{
-  "import_specifications": [
-    {
-      "import_name": "USBLS_CPIAllItemsAverage",
-      ...
-    }
-  ]
-}
-```
-
-Assuming scripts/us_bls/jolts/manifest.json has
-```json
-{
-  "import_specifications": [
-    {
-      "import_name": "BLS_JOLTS",
-      ...
-    }
-  ]
-}
-```
-
-If the commit only changes files in scripts/us_bls/cpi:
-- To import USBLS_CPIAllItemsAverage:
-  - "fix syntax error IMPORTS=scripts/us_bls/cpi:USBLS_CPIAllItemsAverage" or
-  - "IMPORTS=USBLS_CPIAllItemsAverage fix memory leak"
-- To import BLS_JOLTS:
-  - "nice day IMPORTS=scripts/us_bls/jolts:BLS_JOLTS"
-- To import both USBLS_CPIAllItemsAverage and BLS_JOLTS:
-  - "update README IMPORTS=scripts/us_bls/cpi:USBLS_CPIAllItemsAverage,scripts/us_bls/jolts:BLS_JOLTS" or
-  - "IMPORTS=USBLS_CPIAllItemsAverage,scripts/us_bls/jolts:BLS_JOLTS hope they succeed"
-
-If the commit changes files in both scripts/us_bls/cpi and scripts/us_bls/jolts
-directories:
-- To import USBLS_CPIAllItemsAverage:
-  - "fix syntax error IMPORTS=scripts/us_bls/cpi:USBLS_CPIAllItemsAverage"
-- To import BLS_JOLTS:
-  - "nice day IMPORTS=scripts/us_bls/jolts:BLS_JOLTS" or
-  - "good try IMPORTS=BLS_JOLTS"
-- To import both USBLS_CPIAllItemsAverage and BLS_JOLTS:
-  - "update README IMPORTS=scripts/us_bls/cpi:USBLS_CPIAllItemsAverage,scripts/us_bls/jolts:BLS_JOLTS" or
-  - "IMPORTS=scripts/us_bls/cpi:USBLS_CPIAllItemsAverage,BLS_JOLTS hope they succeed"
-
-### Importing to Dev Graph
-
-1. Fork datacommonsorg/data
-2. Create a new branch in the fork
-3. Create a pull request from the new branch to master of datacommonsorg/data
-4. Push commits to the branch. If you want a commit to execute some imports,
-   specify the import targets in the commit message
-   (see [Specifying Import Targets](#specifying-import-targets)). If no tag
-   is found, no imports will be executed.
-
-### Scheduling Updates
-
-1. Push to master of datacommonsorg/data and specify the targets in the commit
-   message as described by [Specifying Import Targets](#specifying-import-targets),
-   but using the SCHEDULES tag instead of IMPORTS.
-2. Configure the production pipeline to pick up the data files on a schedule.
-
-## Deployment
-
-1. Check in the `import-automation` directory to the repository.
-2. [Configure](executor/README.md#configuring-the-executor) and [deploy](executor/README.md#deploying-on-app-engine) the executor.
-3. [Create a Cloud Tasks queue](#creating-cloud-task-queue).
-4. [Connect the repository to Cloud Build and set up Cloud Build triggers](#setting-up-cloud-build).
-
-### Creating Cloud Task Queue
-- See https://cloud.google.com/tasks/docs/creating-queues#creating_a_queue
-- It is recommended that `maxAttempts` be set to one. This can be done with `gcloud tasks queues update <QUEUE_ID> --max-attempts=1`.
-
-### Setting Up Cloud Build
-1. [Connect](https://cloud.google.com/cloud-build/docs/automating-builds/create-manage-triggers#connect_repo) the repository to Cloud Build.
-2. Create the trigger that runs on pull requests for importing to dev.
-   - See [this guide](https://cloud.google.com/cloud-build/docs/automating-builds/create-manage-triggers#build_trigger) for how to create triggers.
-   - Choose these settings:
-     - **Event**: `Pull request (GitHub App only)`
-     - **Source**
-       - **Repository**: Choose the connected repository from the list
-       - **Branch**: `^master$`
-       - **Comment control**: Choose one of the three options:
-         - `Required except for owners and collaborators`: Pull requests from owners and collaborators can trigger Cloud Build directly; pull requests from non-collaborators need a `/gcbrun` comment from an owner or collaborator.
-         - `Required`: Every pull request needs a `/gcbrun` comment from an owner or collaborator to run Cloud Build.
-         - `Not required`: Every pull request can trigger Cloud Build directly.
-     - **Build Configuration**
-       - **File type**: `Cloud Build configuration file (yaml or json)`
-       - **Cloud Build configuration file location**: `/import-automation/cloudbuild/cloudbuild.yaml`
-       - **Substitution variables**
-         - **_EMAIL_ACCOUNT**: `<email account used for sending notifications>`
-         - **_EMAIL_TOKEN**: `<password, app password, or access token of the email account>`
-         - **_GITHUB_AUTH_USERNAME**: `<GitHub username to authenticate with the GitHub API>`
-         - **_GITHUB_AUTH_ACCESS_TOKEN**: `<access token of the GitHub account>`
-         - **_GITHUB_REPO_NAME**: `<name of the connected repository, e.g., data>`
-         - **_GITHUB_REPO_OWNER_USERNAME**: `<username of the owner of the repository, e.g., datacommonsorg>`
-         - **_HANDLER_SERVICE**: `<service the executor is deployed to, e.g., default>`
-         - **_HANDLER_URI**: `<URI of the executor's endpoint that imports to dev, e.g., />`
-         - **_IMPORTER_OAUTH_CLIENT_ID**: `<OAuth client ID used to authenticate with the proxy for the importer>`
-         - **_TASK_LOCATION_ID**: `<location ID of the Cloud Tasks queue, e.g., us-central1>` (found in the "Location" column of the Cloud Tasks console)
-         - **_TASK_PROJECT_ID**: `<ID of the Google Cloud project that hosts the task queue, e.g., google.com:datcom-data>`
-         - **_TASK_QUEUE_NAME**: `<name of the task queue>`
-3. Create the trigger that runs on pushes to *master* for scheduling updates.
-   - Copy the dev trigger but add/override these settings:
-     - **Event**: `Push to a branch`
-     - **Build Configuration**
-       - **Substitution variables** (keeping the others)
-         - **_BASE_BRANCH**: `master`
-         - **_HEAD_BRANCH**: `master`
-         - **_PR_NUMBER**: `0`
-         - **_HANDLER_URI**: `/schedule`

import-automation/cloudbuild/README.md

Lines changed: 0 additions & 8 deletions
This file was deleted.

0 commit comments
