# dbt User 360 Dimension in BigQuery
This dbt project builds a clean, testable user 360 dimension (`dim_users`) in BigQuery. It combines user identity with resolved hierarchical location data and multi-path attribution (sponsor/site/classroom), modeled on anonymized data patterns from a professional networking/resource platform.
Key goals:
- Resolve hierarchical locations with deduplication and prioritization.
- Unify attribution paths while preserving unlinked users.
- Deliver a BI-ready mart with enforced grain, schema tests, and documentation.
This project began as a monolithic SQL query in a BI tool that combined location normalization, multi-path attribution, and evolving user attributes. It worked for small daily updates but became fragile and hard to maintain.
dbt refactor benefits:
- Modularity — Static logic (location cleaning, attribution unification) lives in intermediate models, insulated from frequent `dim_users` changes.
- Testability — Layered tests catch issues early.
- Consistency — Centralized transformations ensure uniform logic across consumers.
- Performance — Intermediate tables speed up downstream reads; full daily refresh is fast (<50k rows).
Result: Fragile query → production-grade, modular pipeline.
📊 View editable Mermaid source (desktop recommended)
```mermaid
graph TD
A(Raw Sources:<br>users, locations, attributions) --> B(int_locations_clean:<br>Normalize & deduplicate hierarchy<br>One row per from_location_id)
A --> C(int_user_attributions:<br>Unify classroom/invite/sponsor paths<br>Multiple rows per user possible)
B --> D(dim_users:<br>Final user 360 dimension<br>Grain: user_id + optional sponsor/site)
C --> D
D --> E(BI / Reporting:<br>Consistent, testable queries)
style A fill:#757575,stroke:#424242
style B fill:#1e88e5,stroke:#0d47a1,color:#fff
style C fill:#1e88e5,stroke:#0d47a1,color:#fff
style D fill:#0d47a1,stroke:#003087,color:#fff
style E fill:#757575,stroke:#424242
```
A single `from_location_id` can map to multiple types (city, county, state, country) with potential duplicates or hierarchies. The pipeline collects candidates, ranks cities by distance (e.g., `ST_DISTANCE` in miles) + heuristics (e.g., regex checks for address-like strings), picks the best per type, and emits one consistent row per source location — preventing ambiguous geography in downstream queries.
This mapping prioritizes the nearest city if within ~10 miles (or if the original locale resembles a suburb/address), grouping users into standardized cities. For example, scattered suburbs are aggregated under their parent city. The approach enables accurate mapping, regional grouping, and reliable BI visualization without data fragmentation.
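The selection logic above maps naturally onto a window function. Below is a minimal sketch of the pattern, not the project's actual model; the staging relation and every column except `from_location_id` are assumed names:

```sql
-- Sketch: rank candidate locations per source location and type, then keep
-- the single best row. ST_DISTANCE returns meters, so convert to miles for
-- the ~10-mile city snap rule.
with ranked as (
    select
        from_location_id,
        location_type,   -- 'city' | 'county' | 'state' | 'country'
        location_name,
        st_distance(source_point, candidate_point) / 1609.34 as distance_miles,
        row_number() over (
            partition by from_location_id, location_type
            order by st_distance(source_point, candidate_point)
        ) as rn
    from {{ ref('stg_location_candidates') }}  -- assumed staging model
)

select * except (rn)
from ranked
where rn = 1
  -- only snap to a city when it is within ~10 miles of the source point
  and (location_type != 'city' or distance_miles <= 10)
```

The address-like regex heuristics would slot into the `order by` or the final filter.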
Multiple paths (classroom membership, educator assignments, email invitations, direct sponsor invite codes) are normalized into a stacked intermediate model. Independent Learners (type 'IL') are excluded from invitation-based paths. The model produces multiple rows per `user_id` where applicable; `dim_users` selects the canonical attribution for each user.
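The stacked model is essentially a `union all` over the paths. A sketch under assumed staging names (the README does not specify the actual relations or columns):

```sql
-- Sketch of int_user_attributions: one row per (user, attribution path).
select user_id, classroom_id as source_id, 'classroom' as attribution_path
from {{ ref('stg_classroom_memberships') }}

union all

select user_id, invite_id as source_id, 'email_invite' as attribution_path
from {{ ref('stg_email_invitations') }}
where user_type != 'IL'  -- Independent Learners excluded from invite paths

union all

select user_id, code_id as source_id, 'sponsor_code' as attribution_path
from {{ ref('stg_sponsor_invite_codes') }}
where user_type != 'IL'
```

`dim_users` can then pick the canonical row per user, for example with a priority-ordered `row_number()`.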
- Intermediates → tables (persist complex transforms, faster downstream reads and easier debugging)
- Mart (`dim_users`) → table (optimized for BI/reporting queries)

Materialization can be changed in `dbt_project.yml` (e.g., to `view`) if needed.
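In `dbt_project.yml`, that layering is typically expressed per folder; a sketch (the project key is an assumed name):

```yaml
# dbt_project.yml (excerpt, sketch)
models:
  dbt_user_dimension:          # assumed project name
    intermediate:
      +materialized: table     # persist complex transforms
    marts:
      +materialized: table     # change to `view` here if needed
```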
`marts/marts_schema.yml` includes:
- `not_null` on `user_id`
- `unique` on `user_id`
- `dbt_utils.unique_combination_of_columns` on `[user_id, sponsor_id, site_id]`
- `dbt_utils.expression_is_true` to ensure location completeness for users with a `location_id`
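In yml form those tests look roughly like the following; column descriptions are elided, and the exact completeness expression is an assumption:

```yaml
# marts/marts_schema.yml (sketch)
version: 2
models:
  - name: dim_users
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns: [user_id, sponsor_id, site_id]
      - dbt_utils.expression_is_true:
          # users with a location_id must have resolved geography
          expression: "location_id is null or city is not null"
    columns:
      - name: user_id
        tests:
          - not_null
          - unique
```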
This repository is optimized for static code review and portfolio demonstration. No live BigQuery connection or real data is included, so `dbt run`, `dbt test`, and `dbt docs generate` will fail without credentials — this is intentional.
Included for zero-setup docs:
- Pre-generated `target/manifest.json` and `target/catalog.json` with column types, descriptions, and metadata (from a real run).
- Make sure you are in the project directory and your Python virtual environment is active (if using one).
- Generate the manifest file (parses models, sources, tests, and yml documentation — no credentials needed): `dbt parse`
- The repo contains a pre-populated `target/catalog.json` with full column details. If you ever need to regenerate or adjust it:
  - Keep the structure and `unique_id`s matching those in `manifest.json`.
  - Update `metadata.generated_at` to a recent timestamp if desired.
  - The current file includes detailed types and explanatory comments for all sources, intermediate models, and the final mart.
- Start the documentation server: `dbt docs serve`, then open http://localhost:8080 in your browser. You should see:
  - Full lineage graph
  - Column names + descriptions (from `*.yml` files)
  - Data types + custom comments (from `catalog.json`)
  - Model descriptions, tests, and dependencies
If the server fails to start due to "Address already in use":

```shell
lsof -i :8080
kill -9 <PID>
```

then retry `dbt docs serve`.

If you want to run the full pipeline (compile, run, test, generate real docs):
- Create a GCP project and enable BigQuery (free tier is sufficient for small data).
- Install the Google Cloud SDK.
- Authenticate locally: `gcloud auth application-default login`
- Copy `profiles.example.yml` to `~/.dbt/profiles.yml` and update `project` and `dataset` to your own values.
- Run standard dbt commands (limited to non-execution steps):
```shell
dbt debug          # should say "All checks passed!"
dbt compile
dbt docs generate  # pulls metadata from yml files only
dbt docs serve     # open http://localhost:8080
```

Note: `dbt run` and `dbt test` will fail without source tables in your dataset. To execute the full pipeline, load dummy data first (see next section).
See the official guide: dbt + BigQuery setup
This repo includes a single SQL script (dummy_data/setup_dummy_data.sql) that recreates the bronze_raw dataset and populates it with dummy data. Reviewers can run this in their own BigQuery project (free sandbox tier works perfectly).
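Condensed, the script follows the usual create-and-seed shape; the column lists below are illustrative, not the script's actual schema:

```sql
-- Sketch of dummy_data/setup_dummy_data.sql (illustrative columns).
create schema if not exists `dbt-user-dimension-demo.bronze_raw`;

create or replace table `dbt-user-dimension-demo.bronze_raw.users` (
    user_id string,
    user_type string,
    location_id string
);

insert into `dbt-user-dimension-demo.bronze_raw.users`
values ('u_001', 'standard', 'loc_001'),
       ('u_002', 'IL', 'loc_002');
```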
- Create or select a Google Cloud project (free tier / sandbox is sufficient):
- Go to https://console.cloud.google.com
- Create a new project if needed (no billing required for small tests).
- Create a new dataset named
bronze_raw(or use an existing one). - Open a new query tab.
- Copy-paste the entire content of
dummy_data/setup_dummy_data.sql. - Replace all occurrences of
dbt-user-dimension-demowith your actual project ID (found in the top bar or IAM & Admin → Settings). - Run the script (click Run or Ctrl+Enter).
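If you prefer the command line for the find-and-replace step, a `sed` one-liner works (GNU sed shown; on macOS/BSD use `sed -i ''`):

```shell
# Swap the demo project ID for your own in the dummy-data script
sed -i 's/dbt-user-dimension-demo/your-project-id/g' dummy_data/setup_dummy_data.sql
```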
Update `models/sources.yml`:
- Change `database: dbt-user-dimension-demo` to your actual project ID.
- Keep `schema: bronze_raw` (or update it if you used a different dataset name).
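The edited `models/sources.yml` would look roughly like this (table names taken from the lineage diagram above; other details are assumptions):

```yaml
# models/sources.yml (sketch)
version: 2
sources:
  - name: bronze_raw
    database: your-gcp-project-id   # <- your project ID
    schema: bronze_raw
    tables:
      - name: users
      - name: locations
      - name: attributions
```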
Set up dbt credentials (one-time):
- Install the gcloud SDK if not already installed.
- Run `gcloud auth application-default login`.
- Copy `profiles.example.yml` to `~/.dbt/profiles.yml` and update `project` and `dataset` to match your setup.
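A typical `dbt-bigquery` oauth profile looks like this; the profile name must match the `profile:` key in `dbt_project.yml` and is an assumption here:

```yaml
# ~/.dbt/profiles.yml (sketch)
dbt_user_dimension:              # assumed profile name
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth              # uses the application-default credentials above
      project: your-gcp-project-id
      dataset: dbt_dev
      threads: 4
      location: US
```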
Run the dbt pipeline:
```shell
dbt debug
dbt run
dbt test
dbt docs generate && dbt docs serve
```

Open http://localhost:8080 to view the full documentation with real metadata.
```text
.
├── dbt_project.yml
├── profiles.example.yml
├── packages.yml
├── package-lock.yml
├── .python-version
├── .gitignore
├── dummy_data/
│   └── setup_dummy_data.sql
├── models/
│   ├── sources.yml
│   ├── intermediate/
│   │   ├── int_locations_clean.sql
│   │   ├── int_user_attributions.sql
│   │   └── intermediate_schema.yml
│   └── marts/
│       ├── dim_users.sql
│       └── marts_schema.yml
├── macros/
│   └── bigquery_catalog_fix.sql
├── target/
│   ├── catalog.json
│   └── manifest.json
├── deprecated/
│   └── macros/
│       └── utils.sql
└── README.md
```
- dbt Core 1.11+
- `dbt-bigquery` adapter
MIT License
Copyright (c) 2025-2026 Corin Stedman (space-lumps)
See the LICENSE file for full details.
