This document defines the architecture stance for cloud-landerox-data and separates:
- What is implemented today (As-Is)
- What is planned next (Target)
- How decisions are made per source/domain (decision matrix)
- Runtime folders are present but not populated yet: functions/ingestion/, functions/trigger/, dataflow/pipelines/
- Runtime implementations are intentionally excluded from this public baseline.
- Shared infrastructure utilities are implemented in shared/common/.
- CI validates quality and tests; deployment workflows are not yet active.
- This repository is intentionally maintained as a public architecture/template baseline.
- Production-specific pipelines and deployment details may live in private repos.
The target is a hybrid GCP data platform:
- Warehouse path for fast SQL analytics on BigQuery native tables.
- Lakehouse path for open-format interoperability on GCS/BigLake.
- Batch + Streaming coexistence as a practical operating model.
- Medallion layering (Bronze -> Silver -> Gold) for data quality progression.
- Governed operation with contracts, replay paths, quality gates, and SLOs.
```mermaid
flowchart LR
    SRC["Sources"] --> CF["Cloud Functions"]
    SRC --> PS["Pub/Sub"]
    SRC -.-> CDC["CDC Connectors - optional"]
    CF --> BZ_GCS["Bronze on GCS"]
    CF --> PS
    PS --> BZ_BQ["Bronze Raw in BigQuery"]
    BZ_GCS --> DF["Dataflow"]
    PS --> DF
    CDC -.-> DF
    DF --> SV_BQ["Silver in BigQuery"]
    DF -.-> SV_GCS["Silver on GCS / BigLake"]
    BZ_BQ --> SQL["SQL Transform Layer"]
    SV_BQ --> SQL
    SV_GCS -.-> SQL
    SQL --> GOLD["Gold in BigQuery"]
```
```mermaid
flowchart LR
    API[APIs / Webhooks] --> ING[Ingestion Function]
    ING -->|validate contract| BRONZE[(Bronze Storage)]
    ING -->|publish| PS[Pub/Sub]
    PS --> DF[Dataflow]
    BRONZE --> DF
    DF -->|valid records| SILVER[(Silver Table)]
    DF -->|invalid records| DLQ[(DLQ)]
    DLQ --> REPLAY[Replay / Backfill Job]
    REPLAY --> DF
    SILVER --> DQ[Data Quality Gates]
    DQ --> GOLD[(Gold Table)]
    ING --> OBS[Observability + SLO]
    DF --> OBS
    DQ --> OBS
```
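The validate-contract and DLQ branches in the flow above can be sketched as a minimal ingestion-side routing function. This is an illustrative sketch: the contract fields and reason codes are assumptions, and a real ingestion function would write Bronze records to GCS and publish to Pub/Sub rather than return a label.

```python
import json

# Minimal illustrative contract: required fields plus accepted schema versions.
# The field names (event_id, event_type, payload, schema_version) are
# hypothetical examples, not the real per-source contracts.
CONTRACT = {
    "required": ("event_id", "event_type", "payload", "schema_version"),
    "schema_versions": {"1.0", "1.1"},
}

def route_event(raw: str) -> tuple[str, dict]:
    """Return ("bronze" | "dlq", record); DLQ records carry a reason code."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return "dlq", {"reason": "MALFORMED_JSON", "raw": raw}
    missing = [f for f in CONTRACT["required"] if f not in event]
    if missing:
        return "dlq", {"reason": "MISSING_FIELDS", "fields": missing, "raw": raw}
    if event["schema_version"] not in CONTRACT["schema_versions"]:
        return "dlq", {"reason": "UNSUPPORTED_SCHEMA_VERSION", "raw": raw}
    return "bronze", event
```

Attaching a machine-readable reason code to every DLQ record is what makes the replay path deterministic: the replay/backfill job can filter by reason instead of re-inspecting payloads.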
Keep diagrams minimal and operationally useful:
- Context overview
- End-to-end flow with controls
- Dataflow processing view (1 or 2 diagrams)
- Cloud Functions view (1 or 2 diagrams)
- Hybrid storage model (BigQuery + Lakehouse/Iceberg)
When to split diagrams:
- Dataflow (1 vs 2): use one when stream and batch/replay share most transforms; use two when windowing, dedup, contracts, or sinks differ materially.
- Cloud Functions (1 vs 2): use one when ingestion and trigger handlers are simple and coupled; use two when they are independent deploy/ownership units.
Canonical templates:
- Diagram Catalog
- Dataflow Shared
- Dataflow Streaming
- Dataflow Batch/Replay
- Functions Shared
- Functions Ingestion HTTP/Webhook
- Functions Trigger/Orchestration
- Storage Hybrid BigQuery + Lakehouse
- Architecture patterns: Event-driven, batch/stream hybrid, Medallion, selective Kappa/Lambda.
- Cross-cutting patterns: Data contracts, schema evolution, idempotency, deduplication, quality gates, replay.
- Organizational model: Data Mesh (team/domain ownership), optional for this personal repo.
- Services: Cloud Functions, Pub/Sub, Dataflow, BigQuery, GCS, BigLake.
- Formats: JSON/NDJSON, Avro, Parquet.
- Table formats: BigQuery native tables, Apache Iceberg (primary external table format), Delta/Hudi only when interoperability requires them.
- Data Contracts + Schema Evolution
  Contract versioning (schema_version) and compatibility policy (backward/forward) per source.
- DLQ + Replay/Backfill
  Invalid records route to the DLQ with reason codes and a deterministic replay path.
- Idempotency + Deduplication
  Stable event keys (for example event_id) and explicit dedup windows or merge keys.
- Data Quality Gates
  Checks at the Bronze -> Silver and Silver -> Gold boundaries.
- Observability + SLO
  Track latency, error rate, freshness, and processed volume.
- Orchestration
  Cloud Scheduler and/or Workflows for batch runs, backfills, and re-runs.
- CDC pattern (conditional)
  Add only when transactional database sources are in scope.
- Governance
  PII classification, retention policy, and dataset/table access controls.
| Pattern | Typical GCP implementation |
|---|---|
| Data contracts + schema evolution | JSON schema in repo, Pub/Sub schema validation (when applicable), BigQuery schema versioning strategy |
| DLQ + replay/backfill | Pub/Sub dead-letter topics/subscriptions, Dataflow error side outputs, replay jobs via Dataflow batch |
| Idempotency + deduplication | Cloud Function idempotency keys, Dataflow key/window dedup, BigQuery merge keys |
| Data quality gates | Validation transforms in Dataflow, SQL checks in BigQuery, quality checks at Bronze->Silver and Silver->Gold |
| Observability + SLO | Cloud Logging structured logs, Cloud Monitoring metrics/alerts, error budget/SLO dashboards |
| Orchestration | Cloud Scheduler + Workflows for schedules, retries, backfills, and re-runs |
| CDC pattern (conditional) | Datastream (or source-native CDC) + Dataflow/BigQuery ingestion path |
| Governance | BigQuery IAM policies, dataset/table ACLs, retention configuration, data classification and catalog metadata |
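The quality-gate row above can be approximated as row-level checks with a pass-rate threshold, evaluated before promoting a batch across a layer boundary. The check names and the 1% failure budget are illustrative assumptions; in BigQuery these would typically be SQL assertions run before the promotion step.

```python
from typing import Callable

def quality_gate(rows: list[dict],
                 checks: dict[str, Callable[[dict], bool]],
                 max_fail_rate: float = 0.01) -> bool:
    """Return True if the batch may be promoted to the next layer."""
    if not rows:
        return False  # an empty batch fails the freshness expectation
    failures = sum(1 for row in rows
                   if not all(pred(row) for pred in checks.values()))
    return failures / len(rows) <= max_fail_rate

# Hypothetical checks for a Silver -> Gold boundary.
checks = {
    "has_key": lambda r: r.get("event_id") is not None,
    "amount_non_negative": lambda r: r.get("amount", 0) >= 0,
}
```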
- Data Mesh
  Organizational scaling pattern; useful if multi-team ownership emerges.
- Databricks/Delta interoperability
  Keep optional for cross-platform use cases; not core for the current GCP-first runtime.
- Public repo (cloud-landerox-data): reference architecture, patterns, templates, and shared engineering practices.
- Infrastructure repo (Terraform, separate): GCP provisioning and environment setup.
- Private runtime repos (optional): production pipeline logic, environment-specific deployments, and sensitive operational details.
For runtimes with many modules (for example, 50+ pipelines), use this structure in the private runtime repo:
```
runtime-data-platform/
├── functions/
│   ├── ingestion/
│   │   └── <domain>/<source>_<mode>/
│   │       └── main.py
│   └── trigger/
│       └── <domain>/<event>_<purpose>/
│           └── main.py
├── dataflow/
│   └── pipelines/
│       └── <domain>/
│           ├── bronze/<pipeline_name>/
│           ├── silver/<pipeline_name>/
│           └── gold/<pipeline_name>/
└── tests/
    ├── functions/<domain>/...
    └── dataflow/<domain>/...
```

Why this layout is aligned with GCP operating reality:
- Keeps deployment blast radius small (one function/pipeline per module folder).
- Maps naturally to Medallion responsibilities in BigQuery/GCS.
- Supports independent scaling and rollback per pipeline.
- Works with path-based CI/CD triggers for selective deployments.
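The path-based selective deployment mentioned above can be sketched as a mapping from changed file paths to deployable module folders. Module depths follow the layout shown earlier; the CI integration around this function (diffing, deploy commands) is assumed.

```python
from pathlib import PurePosixPath

# Each prefix maps to the path depth at which an independently deployable
# module folder sits: functions/<kind>/<domain>/<module>/ and
# dataflow/pipelines/<domain>/<layer>/<module>/.
MODULE_DEPTHS = {
    ("functions",): 4,
    ("dataflow", "pipelines"): 5,
}

def modules_to_deploy(changed_files: list[str]) -> set[str]:
    """Derive the set of module folders touched by a change set."""
    modules: set[str] = set()
    for f in changed_files:
        parts = PurePosixPath(f).parts
        for prefix, depth in MODULE_DEPTHS.items():
            if parts[: len(prefix)] == prefix and len(parts) > depth:
                modules.add("/".join(parts[:depth]))
    return modules
```

Deploying only the returned folders is what keeps the blast radius of a change to a single function or pipeline.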
- Phase 1 (now): Contracts, DLQ, idempotency, base observability.
- Phase 2: Quality gates, replay automation, orchestration hardening.
- Phase 3: CDC (if needed), governance expansion, optional interoperability patterns.
| Decision axis | Preferred option | Alternate option | Use when |
|---|---|---|---|
| Ingestion entry | API/Event -> Pub/Sub | API/Event -> GCS direct | Pub/Sub for resilience/backpressure; direct GCS for low-frequency archival |
| Processing mode | Streaming | Batch | Streaming for low-latency SLAs; batch for backfill/scheduled loads |
| Silver storage | BigQuery native | GCS + BigLake (Iceberg/Parquet/Avro) | BigQuery native for SQL-first speed; BigLake for open-format interoperability |
| Transformation style | SQL ELT | Dataflow ETL | ELT for business logic in SQL; ETL for complex parsing/enrichment/dedup |
| Topology style | Selective Kappa | Lambda-lite split | Kappa when one logic path is feasible; Lambda-lite when stream and batch constraints differ |
| Contract strategy | Backward-compatible evolution | Breaking change with migration | Backward-compatible by default; migration only when required |
| Replay strategy | DLQ replay | Full backfill reprocessing | DLQ replay for scoped errors; full backfill for systemic issues |
| Governance level | Dataset/table IAM + retention | Domain-level policy stack | Basic controls by default; expand with complexity |
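The first axes of the matrix can be encoded as a small per-source decision helper, which makes the choice explicit and testable. The profile fields and thresholds below are illustrative assumptions, not fixed policy.

```python
from dataclasses import dataclass

@dataclass
class SourceProfile:
    """Per-source attributes that drive the decision matrix (illustrative)."""
    events_per_day: int
    latency_sla_seconds: int
    archival_only: bool = False

def ingestion_entry(p: SourceProfile) -> str:
    # Direct GCS only for low-frequency archival feeds; otherwise Pub/Sub
    # for resilience and backpressure, per the matrix above.
    if p.archival_only and p.events_per_day < 1_000:
        return "gcs-direct"
    return "pubsub"

def processing_mode(p: SourceProfile) -> str:
    # Streaming for low-latency SLAs; batch for scheduled/backfill loads.
    # The 5-minute cutoff is an assumed, per-source-tunable threshold.
    return "streaming" if p.latency_sla_seconds <= 300 else "batch"
```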
- Not a pure Kappa-only platform.
- Not a pure lakehouse-only platform.
- Not a Data Mesh platform by default.
It is a pragmatic hybrid platform that chooses patterns per source, SLA, and cost profile.