Changes from all commits · 82 commits
fee7a82
adding files related to NOA GFS poc
balit-raibot Jan 4, 2026
adcf7ab
resolving comments
balit-raibot Jan 6, 2026
7e0150e
resolving comments
balit-raibot Jan 6, 2026
069eadd
refined schema
balit-raibot Jan 14, 2026
0e97f29
Merge branch 'master' into noa_gfs_poc
balit-raibot Jan 14, 2026
5b46602
refined schema
balit-raibot Jan 14, 2026
94c3769
refined schema
balit-raibot Jan 14, 2026
644825a
used existing stat var mcf file
balit-raibot Jan 15, 2026
6ab31c8
resolving comments
balit-raibot Jan 16, 2026
68b28b8
resolving comments
balit-raibot Jan 16, 2026
2af5ae9
resolving comments
balit-raibot Jan 16, 2026
e0bfb51
resolving comments
balit-raibot Jan 16, 2026
8532138
adding custom stat var processor script
balit-raibot Jan 19, 2026
b13690b
refined schema
balit-raibot Jan 22, 2026
0a643dd
test_data folder creation
balit-raibot Jan 22, 2026
18461a0
renamed parent directory
balit-raibot Jan 23, 2026
d60df2f
Merge branch 'master' into noa_gfs_poc
balit-raibot Jan 23, 2026
4e3b9f6
modified schema file
balit-raibot Jan 23, 2026
3b7efda
Merge branch 'master' into noa_gfs_poc
balit-raibot Jan 23, 2026
d50a1f3
Merge branch 'noa_gfs_poc' of https://github.com/balit-raibot/data in…
balit-raibot Jan 23, 2026
ab3db64
Merge branch 'master' into noa_gfs_poc
balit-raibot Jan 27, 2026
2c83c8f
Merge branch 'master' into noa_gfs_poc
balit-raibot Jan 30, 2026
d7940c5
adding files
smarthg-gi Feb 3, 2026
6200509
Merge branch 'master' into noa_gfs_poc
balit-raibot Mar 2, 2026
0cbda78
Merge branch 'master' into noa_gfs_poc
balit-raibot Mar 12, 2026
e979a2d
Merge branch 'master' into noa_gfs_poc
balit-raibot Mar 12, 2026
dbee392
Merge branch 'master' into noa_gfs_poc
balit-raibot Mar 26, 2026
ec24453
Merge branch 'master' into noa_gfs_poc
balit-raibot Mar 30, 2026
9d6b0a1
resolved comments
balit-raibot Mar 30, 2026
553d0e9
Merge branch 'noa_gfs_poc' of https://github.com/balit-raibot/data in…
balit-raibot Mar 30, 2026
d7987c2
added multiplier logic
balit-raibot Mar 30, 2026
413a3ce
adding run.sh
balit-raibot Mar 30, 2026
8e98ca5
adding manifest
balit-raibot Mar 30, 2026
ab52999
modified README
balit-raibot Mar 30, 2026
d31da68
modified resources
balit-raibot Mar 30, 2026
5827272
corrected custom_processor
balit-raibot Mar 31, 2026
f8ebe59
Merge branch 'master' into noa_gfs_poc
balit-raibot Mar 31, 2026
e258605
modifying measurementMethod
balit-raibot Apr 1, 2026
f7a7bb4
Merge branch 'noa_gfs_poc' of https://github.com/balit-raibot/data in…
balit-raibot Apr 1, 2026
e46638f
modified test data as per new schema
balit-raibot Apr 1, 2026
7973740
Merge branch 'master' into noa_gfs_poc
balit-raibot Apr 1, 2026
251b40e
Merge branch 'master' into noa_gfs_poc
balit-raibot Apr 2, 2026
8c8f65d
adding python based grib to csv conversion
balit-raibot Apr 5, 2026
cc14788
Merge branch 'noa_gfs_poc' of https://github.com/balit-raibot/data in…
balit-raibot Apr 5, 2026
0c14460
adding python based grib to csv conversion
balit-raibot Apr 5, 2026
8ca0b39
Merge branch 'master' into noa_gfs_poc
balit-raibot Apr 5, 2026
710441e
adding absl flags
balit-raibot Apr 5, 2026
2e67bf8
Merge branch 'noa_gfs_poc' of https://github.com/balit-raibot/data in…
balit-raibot Apr 5, 2026
7131ea9
removed wgrib tool from run.sh
balit-raibot Apr 5, 2026
88887d0
changed location of tmcf file
balit-raibot Apr 5, 2026
f18b508
added map for static equalities
balit-raibot Apr 5, 2026
969e7f1
modified pipeline to convert grib to dcid csv
balit-raibot Apr 6, 2026
78b965f
bug fixes
balit-raibot Apr 6, 2026
edef7a9
added download script
balit-raibot Apr 6, 2026
9db56a7
adding libeccodes and pygrib dependencies
balit-raibot Apr 6, 2026
a2bf01a
bug fixes
balit-raibot Apr 6, 2026
621e909
valid mask filter edge case
balit-raibot Apr 6, 2026
763f723
refactored script for parallel execution
balit-raibot Apr 7, 2026
b4215a7
bug fixes
balit-raibot Apr 7, 2026
07e72bb
adding comments
balit-raibot Apr 7, 2026
55bcbe7
Merge branch 'master' into noa_gfs_poc
balit-raibot Apr 7, 2026
c3d5473
Merge branch 'master' into noa_gfs_poc
balit-raibot Apr 7, 2026
c86408f
Merge branch 'master' into noa_gfs_poc
balit-raibot Apr 7, 2026
45ece2d
Merge branch 'master' into noa_gfs_poc
balit-raibot Apr 7, 2026
7be51e8
renamed file
balit-raibot Apr 7, 2026
5dee26f
Merge branch 'noa_gfs_poc' of https://github.com/balit-raibot/data in…
balit-raibot Apr 7, 2026
84402c8
Merge branch 'master' into noa_gfs_poc
balit-raibot Apr 10, 2026
ee18d74
moved code from statvar_imports to scripts
balit-raibot Apr 10, 2026
92e4e04
Merge branch 'noa_gfs_poc' of https://github.com/balit-raibot/data in…
balit-raibot Apr 10, 2026
eb612a3
adding state.json for incremental ingestion
balit-raibot Apr 12, 2026
1656e88
Merge branch 'master' into noa_gfs_poc
balit-raibot Apr 12, 2026
2b5b657
adding dc_bq_ingest.py
balit-raibot Apr 12, 2026
1170e35
Merge branch 'noa_gfs_poc' of https://github.com/balit-raibot/data in…
balit-raibot Apr 12, 2026
7e09438
added import name folder for state.json
balit-raibot Apr 13, 2026
ee2ac1d
moved output to GCS than local
balit-raibot Apr 13, 2026
e434357
bq ingestion from GCS than local
balit-raibot Apr 13, 2026
2fa7921
changed bq table id
balit-raibot Apr 13, 2026
5412a72
Merge branch 'master' into noa_gfs_poc
balit-raibot Apr 13, 2026
4dd43af
adding transformations for bq table
balit-raibot Apr 14, 2026
7b1f0b8
Merge branch 'noa_gfs_poc' of https://github.com/balit-raibot/data in…
balit-raibot Apr 14, 2026
8f39cf0
updated cron and README
balit-raibot Apr 14, 2026
e396836
Merge branch 'master' into noa_gfs_poc
balit-raibot Apr 14, 2026
3 changes: 2 additions & 1 deletion import-automation/executor/Dockerfile
@@ -36,7 +36,8 @@ fonts-liberation \
xdg-utils \
chromium \
chromium-driver \
p7zip-full
p7zip-full \
libeccodes
# Install the Google Cloud CLI
RUN apt-get update && \
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg && \
1 change: 1 addition & 0 deletions import-automation/executor/requirements.txt
@@ -41,6 +41,7 @@ omegaconf
prettytable
protobuf
psutil
pygrib
pylint
pyspellchecker
pytest
61 changes: 61 additions & 0 deletions scripts/noaa_gfs/README.md
@@ -0,0 +1,61 @@
# NOAA: Global Forecast System Dataset
## Overview
The NOAA-GFS 0.25 Atmos dataset provides high-resolution global atmospheric and land-surface data on a 0.25-degree (~28 km) equidistant cylindrical grid that covers the entire Earth's surface, with up to 127 vertical atmospheric layers. It includes a wide range of meteorological variables, such as temperature, wind, humidity, precipitation, and soil moisture, generated four times daily with forecasts extending up to 16 days (384 hours).
The data is distributed in GRIB2 (Gridded Binary Edition 2) format via the NOAA Operational Model Archive and Distribution System (NOMADS) and is a public-domain product of the United States Government.
This pipeline automates the ingestion, format conversion, and standardized mapping of GFS GRIB2 files into Data Commons-compatible StatVar observations.

## Data Source & Provenance
* **Source URL:** [NOMADS NCEP GFS Production](https://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/)
* **Provider:** National Centers for Environmental Prediction (NCEP / NOAA).
* **Update Frequency:** 4 times daily (00z, 06z, 12z, 18z).
* **Variable Inventory:** [NCO Product Description](https://www.nco.ncep.noaa.gov/pmb/products/gfs/gfs.t00z.pgrb2.0p25.anl.shtml)


## Automated Pipeline Logic
The pipeline is Python-driven and configured through a `manifest.json` import specification.

### 1. Data Ingestion (`download_noaa_gfs_grib.py`)
* **Stateful Tracking:** The script retrieves its last successful run checkpoint from `gs://{bucket}/state.json` (see the state-file sketch after this list).
* **Chronological Integrity:** It identifies missing 6-hour slots (00z, 06z, 12z, 18z) and performs memory-efficient streamed downloads of GRIB2 files into local `input_files/` directories.
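For reference, the checkpoint is a small JSON object; the sketch below mirrors the default that `load_state()` in `download_noaa_gfs_grib.py` constructs when no state file exists yet (the values shown are examples only):

```json
{"date": "20260412", "cycle": "18"}
```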

### 2. Transformation & Mapping (`grib_statvar_processor.py`)
This stage converts binary meteorological data into structured CSVs using the `pygrib` library.
* **Parallel Processing:** Utilizes `multiprocessing.Pool` to process GRIB messages across available CPU cores.
* **Coordinate Normalization:** Longitudes are transformed from the 0–360 range to the -180 to 180 range.
* **StatVar Mapping** (see the sketch after this list):
  * **DCID Construction:** Maps GRIB short codes (e.g., `TMP`, `UGRD`) and vertical levels to formal Data Commons identifiers such as `dcid:Temperature_Place_850Millibar`.
  * **Unit Scaling:** Automatically applies multipliers to scale variables such as land and ice cover.
* **GCS Streaming:** Processed CSVs are merged and uploaded directly to the GCS output prefix.
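Because `grib_statvar_processor.py` is not part of this diff, the snippet below is only an illustrative sketch of the two mappings described above; the helper names and the short-code table are assumptions, with the `Temperature_Place_850Millibar` DCID taken from the example in this README:

```python
# Illustrative sketch only: helper names and the mapping table are assumed,
# not taken from grib_statvar_processor.py.
GRIB_SHORT_CODE_TO_VARIABLE = {"TMP": "Temperature"}  # assumed subset

def grib_to_dcid(short_code: str, level_mb: int) -> str:
    """Builds a Data Commons StatVar DCID from a GRIB short code and level."""
    return f"dcid:{GRIB_SHORT_CODE_TO_VARIABLE[short_code]}_Place_{level_mb}Millibar"

def normalize_longitude(lon: float) -> float:
    """Shifts a 0-360 longitude into the -180..180 range."""
    return lon - 360.0 if lon > 180.0 else lon

assert grib_to_dcid("TMP", 850) == "dcid:Temperature_Place_850Millibar"
assert normalize_longitude(270.0) == -90.0
```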

### 3. BigQuery Ingestion (`dc_bq_ingest.py`)
* **Staging Pattern:** Bulk loads raw CSVs from GCS into a staging table (`Observation_Staging`).
* **SQL Transformation:** Executes an `INSERT INTO` query to map staging data to the final production schema, handling type casting and attaching the provenance ID (`dc/base/NOAA_GlobalForecastSystem`).

---

## Pipeline Configuration (`manifest.json`)
The pipeline is governed by specific resource requirements for high-concurrency GRIB decompression:
* **Cron Schedule:** `30 04,10,16,22 * * *` (runs roughly 30 minutes after each GFS cycle's files are typically posted to NOMADS).
* **Resource Limits:** 64 CPUs | 256GB RAM | 4GB Disk.
* **Timeout:** 1 hour (`3600s`).
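A minimal sketch of what the corresponding `manifest.json` entry might look like; the key names here are illustrative assumptions, and only the values come from this README:

```json
{
  "import_name": "NOAA_GlobalForecastSystem",
  "cron_schedule": "30 04,10,16,22 * * *",
  "resource_limits": {"cpu": 64, "memory": "256G", "disk": "4G"},
  "user_script_timeout": 3600
}
```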

---

## Usage Instructions

### Prerequisites
* **Python Libraries:** `pygrib`, `numpy`, `google-cloud-storage`, `google-cloud-bigquery`, `absl-py`.
* **System Requirements:** Requires the `eccodes` GRIB-decoding library (the successor to the deprecated GRIB-API) installed on the host system; the executor Dockerfile installs it as `libeccodes`.
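A possible setup on a Debian-based host; this is a sketch, assuming the `libeccodes` package name used by the executor Dockerfile, so adjust for your distribution and environment:

```bash
# Assumed package and pip names; eccodes is the system library pygrib links against.
sudo apt-get install -y libeccodes
pip install pygrib numpy google-cloud-storage google-cloud-bigquery absl-py requests
```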

### Manual Execution
Although the pipeline is designed for automated execution, each stage can be run manually for debugging:

```bash
# 1. Download missing data
python3 download_noaa_gfs_grib.py --project_id=YOUR_PROJECT_ID

# 2. Process GRIB to CSV and upload to GCS
python3 grib_statvar_processor.py --input=./input_files

# 3. Ingest from GCS to BigQuery
python3 dc_bq_ingest.py --project_id=YOUR_PROJECT_ID --dataset_id=YOUR_DATASET
```
129 changes: 129 additions & 0 deletions scripts/noaa_gfs/dc_bq_ingest.py
@@ -0,0 +1,129 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the 'License');
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an 'AS IS' BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Automates ingestion of processed NOAA GFS meteorological data into BigQuery.
"""

import os
from absl import app, flags, logging
from google.cloud import bigquery
from google.cloud import storage

# --- FLAG DEFINITIONS ---
FLAGS = flags.FLAGS
flags.DEFINE_string('project_id', 'datcom-external', 'GCP Project ID.')
flags.DEFINE_string('bucket_name', 'datcom-prod-imports', 'GCS Bucket containing the CSVs.')
flags.DEFINE_string('gcs_prefix', 'scripts/noaa_gfs/NOAA_GlobalForecastSystem/output/', 'GCS prefix (folder path).')
flags.DEFINE_string('dataset_id', 'data_commons_noaa_gfs', 'BigQuery Dataset ID.')
flags.DEFINE_string('table_id', 'Observation', 'BigQuery Table ID.')
flags.DEFINE_string('staging_table_id', 'Observation_Staging', 'Temporary Staging Table ID.')

def run_mapping_query(bq_client):
    """
    Executes the SQL transformation to map data from Staging to Final table.
    """
    final_table = f"{FLAGS.project_id}.{FLAGS.dataset_id}.{FLAGS.table_id}"
    staging_table = f"{FLAGS.project_id}.{FLAGS.dataset_id}.{FLAGS.staging_table_id}"

    query = f"""
    INSERT INTO `{final_table}` (
        observation_about,

> **Contributor** (review comment on the line above): what about the geo location for latitude/longitude?

        variable_measured,
        value,
        observation_date,
        measurement_method,
        unit,
        prov_id
    )
    SELECT
        placeName,
        variableMeasured,
        CAST(value AS STRING),
        CAST(observationDate AS STRING),
        measurementMethod,
        unit,
        'dc/base/NOAA_GlobalForecastSystem'
    FROM `{staging_table}`;
    """

    try:
        logging.info("Starting transformation query...")
        query_job = bq_client.query(query)
        query_job.result()  # Wait for completion

        # Optional: Truncate staging table after successful migration
        bq_client.query(f"TRUNCATE TABLE `{staging_table}`").result()
        logging.info("Transformation complete and staging table cleared.")
        return True
    except Exception as e:
        logging.error(f"Mapping query failed: {e}")
        return False

def upload_gcs_to_staging(bq_client, gcs_uri):
    """
    Loads raw CSV data into the Staging table.
    """
    table_ref = f"{FLAGS.project_id}.{FLAGS.dataset_id}.{FLAGS.staging_table_id}"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        # WRITE_APPEND used here to collect all CSVs before the final SQL transformation
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    try:
        logging.info(f"Loading to staging: {gcs_uri}")
        load_job = bq_client.load_table_from_uri(gcs_uri, table_ref, job_config=job_config)
        load_job.result()
        return True
    except Exception as e:
        logging.error(f"Failed to load {gcs_uri}: {e}")
        return False

def main(argv):
    """Entry point for the GCS-to-BigQuery ingestion script."""
    # Initialize Clients
    bq_client = bigquery.Client(project=FLAGS.project_id)
    storage_client = storage.Client(project=FLAGS.project_id)

    # Get reference to the bucket and list blobs
    bucket = storage_client.bucket(FLAGS.bucket_name)
    blobs = bucket.list_blobs(prefix=FLAGS.gcs_prefix)

    # Filter for CSV files
    csv_uris = [f"gs://{FLAGS.bucket_name}/{blob.name}" for blob in blobs if blob.name.endswith('.csv')]

    if not csv_uris:
        logging.warning(f"No CSV files found at gs://{FLAGS.bucket_name}/{FLAGS.gcs_prefix}")
        return

    logging.info(f"Found {len(csv_uris)} files in GCS for ingestion.")

    # Step 1: Bulk Load everything into Staging
    success_count = 0
    for uri in csv_uris:
        if upload_gcs_to_staging(bq_client, uri):
            success_count += 1

    logging.info(f"Ingestion batch complete. {success_count}/{len(csv_uris)} URIs processed.")

    # Step 2: Run Mapping SQL if at least some files loaded
    if success_count > 0:
        run_mapping_query(bq_client)

if __name__ == "__main__":
    app.run(main)
135 changes: 135 additions & 0 deletions scripts/noaa_gfs/download_noaa_gfs_grib.py
@@ -0,0 +1,135 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the 'License');
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an 'AS IS' BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Automates GFS GRIB2 source file retrieval from NOAA NOMADS.
This script manages dated directory structures and utilizes memory-efficient
HTTP streaming to download large-scale meteorological datasets for
downstream Data Commons ingestion.
"""

import os
import json
import requests
from datetime import datetime, timedelta
from pathlib import Path
from absl import app, flags, logging
from google.cloud import storage
from google.api_core import exceptions

# --- FLAG DEFINITIONS ---
FLAGS = flags.FLAGS
flags.DEFINE_string('project_id', 'datcom', 'The GCP Project ID.')
flags.DEFINE_string('bucket_name', 'datcom-prod-imports', 'The GCS bucket name.')
flags.DEFINE_string('state_path', 'scripts/noaa_gfs/NOAA_GlobalForecastSystem/state.json', 'The path within the bucket for state.json.')

def get_gcs_client():
    """Initializes the GCS client with a specific Project ID."""
    return storage.Client(project=FLAGS.project_id)

def load_state():
    """Reads state from GCS. Returns default if file doesn't exist."""
    client = get_gcs_client()
    bucket = client.bucket(FLAGS.bucket_name)
    blob = bucket.blob(FLAGS.state_path)

    try:
        state_data = blob.download_as_text()
        logging.info(f"Successfully loaded state from gs://{FLAGS.bucket_name}/{FLAGS.state_path}")
        return json.loads(state_data)
    except exceptions.NotFound:
        logging.warning("State file not found in GCS. Starting from default (24h ago).")
        # Default: Start 24 hours ago
        yesterday = datetime.now() - timedelta(days=1)
        return {"date": yesterday.strftime('%Y%m%d'), "cycle": "18"}

def get_next_slot(current_date_str, current_cycle):
    """Calculates the next 6-hour GFS slot."""
    current_dt = datetime.strptime(f"{current_date_str}{current_cycle}", '%Y%m%d%H')
    next_dt = current_dt + timedelta(hours=6)
    return next_dt.strftime('%Y%m%d'), next_dt.strftime('%H')

def download_gfs_file(date_stamp, cycle, fhour="000"):
    """Downloads the GRIB2 file from NOAA."""
    # 1. Setup Paths
    # Target directory: ./input_files/YYYYMMDD/
    target_dir = Path("./input_files") / date_stamp
    target_dir.mkdir(parents=True, exist_ok=True)

    file_name = f"gfs.t{cycle}z.pgrb2.0p25.f{fhour}"
    output_path = target_dir / file_name

    # 2. Construct URL
    url = (f"https://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/"
           f"gfs.{date_stamp}/{cycle}/atmos/{file_name}")

    logging.info(f"Downloading: {url}")
    logging.info(f"Destination: {output_path}")

    # 3. Perform Streamed Download
    try:
        with requests.get(url, stream=True, timeout=60) as r:
            # Check if file exists on server (e.g., handles 404 if data isn't ready)
            r.raise_for_status()

            with open(output_path, 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024 * 1024):  # 1MB chunks
                    if chunk:
                        f.write(chunk)

        logging.info(f"Successfully downloaded: {date_stamp} Cycle {cycle}")
        return str(output_path)

    except requests.exceptions.HTTPError as e:
        if e.response.status_code == 404:
            logging.error(f"File not found on NOMADS. The {date_stamp} data might not be posted yet.")
        else:
            logging.error(f"HTTP Error: {e}")
    except Exception as e:
        logging.error(f"Download failed: {e}")

    return None

def main(argv):
    """Entry point for the download script."""
    state = load_state()
    current_date = state['date']
    current_cycle = state['cycle']

    # Get the latest possible slot (NOAA usually has a few hours delay)
    now = datetime.now() - timedelta(hours=4)

    logging.info(f"Iterating from: {current_date} {current_cycle}z")

    while True:
        # 1. Determine the next slot to try
        next_date, next_cycle = get_next_slot(current_date, current_cycle)
        next_dt = datetime.strptime(f"{next_date}{next_cycle}", '%Y%m%d%H')

        # 2. Stop if we are trying to download files from the future
        if next_dt > now:
            logging.info("All available files up to current time have been checked.")
            break

        # 3. Attempt Download
        if download_gfs_file(next_date, next_cycle):
            current_date, current_cycle = next_date, next_cycle
        else:
            # If a file isn't found, it might not be posted yet.
            # We stop here to maintain chronological integrity.
            logging.info(f"Reached the end of available data on server at {next_date} {next_cycle}z.")
            break

if __name__ == "__main__":
    app.run(main)