Digital Twin Builder for Databricks
OntoBricks is a web application that transforms Databricks tables into a materialized knowledge graph. It lets you design ontologies (OWL), map them to Unity Catalog tables via R2RML, materialize triples into a Delta-backed triple store and a Lakebase Postgres graph engine, reason over the graph (OWL 2 RL, SWRL, SHACL), and query it through an auto-generated GraphQL API. The entire pipeline — from metadata import to a queryable knowledge graph — can run in four clicks using LLM-powered automation.
Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.
Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.
OntoBricks uses uv for dependency management. All dependencies are declared in pyproject.toml.
# Clone the repository
git clone <repository-url>
cd OntoBricks
# Install dependencies (uv resolves them from pyproject.toml)
uv sync
# Or use the setup script
scripts/setup.sh

- Python 3.10 or higher
- Databricks workspace access (Databricks Apps must be enabled). Local development uses a Personal Access Token; production uses the App's service principal.
- A SQL Warehouse (you'll need its ID for local dev).
- Databricks Lakebase Autoscaling project + branch + Postgres database: required since v0.4.0 for the domain registry (domains, versions, permissions, schedules, global config) and the Graph DB triple store. Provisioned Lakebase instances are not supported. The Postgres driver (`psycopg[binary]` + `psycopg-pool`) is declared as an optional dependency so volume-only forks can opt out; install with `uv sync --extra lakebase` for any normal deployment.
- Unity Catalog Volume in the catalog/schema that hosts the triplestore VIEWs (`triplestore_<domain>_v<n>`). The volume is reserved for binary artefacts (documents/uploads: domain-scoped attachments imported by the ontology designer).
- `psql` (libpq client) on `PATH` for the Lakebase permission bootstrap scripts (`brew install libpq && brew link --force libpq` on macOS).
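
The prerequisites that can be verified locally (Python version, `uv` and `psql` on `PATH`) could be checked with a small preflight script. This is only a sketch: `check_prerequisites` is a hypothetical helper, not part of OntoBricks, and it does not verify workspace, SQL Warehouse, Lakebase, or Volume access.

```python
import shutil
import sys
from typing import Dict


def check_prerequisites() -> Dict[str, bool]:
    """Return a pass/fail flag for each locally checkable prerequisite."""
    return {
        # The README asks for Python 3.10 or higher.
        "python>=3.10": sys.version_info >= (3, 10),
        # uv is used for dependency management (uv sync).
        "uv": shutil.which("uv") is not None,
        # psql (libpq client) is needed by the Lakebase bootstrap scripts.
        "psql": shutil.which("psql") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'ok' if ok else 'MISSING':8}{name}")
```

Running it before `uv sync` gives a quick list of anything missing on the local machine.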
# Configure credentials
cp .env.example .env
# Edit .env with your Databricks host, token, and warehouse ID
# Start the application
scripts/start.sh
# Open http://localhost:8000

# Install and authenticate the Databricks CLI (>= 0.250.0)
brew install databricks # or curl -fsSL https://databricks.com/install.sh | sh
databricks auth login --host https://<workspace>
# Edit scripts/deploy.config.sh (warehouse, registry catalog/schema,
# Lakebase project/branch/database — see the file header) and then:
make deploy
# Or directly: scripts/deploy.sh

scripts/deploy.sh generates app.yaml from app.yaml.template +
scripts/deploy.config.sh, validates and deploys the DAB bundle on
target dev-lakebase, runs scripts/bootstrap-app-permissions.sh
(app SP CAN_MANAGE on itself), then runs
scripts/bootstrap-lakebase-perms.sh on the registry / graph / sync
schemas. All steps are idempotent.
After the first deploy, bind the sql-warehouse, volume, and
postgres (Lakebase) resources in the Databricks Apps UI
(Compute > Apps > <your-app> > Resources) if the DAB bind did
not take. Open the app and click Settings > Registry > Initialize
to create the Lakebase schema; re-run make bootstrap-lakebase once
afterwards so the freshly created schema picks up USAGE/DML.
Lakebase deploy targets. Pick a Databricks Lakebase Autoscaling project + branch and a Postgres database, then set the `LAKEBASE_PROJECT`, `LAKEBASE_BRANCH`, `LAKEBASE_DATABASE_RESOURCE_SEGMENT` (the `db-…` id from `databricks postgres list-databases "projects/<id>/branches/<branch>" -o json`, not the Postgres database name shown in the SQL UI), and `LAKEBASE_REGISTRY_SCHEMA` defaults in `scripts/deploy.config.sh`. The DAB composes the full Apps `postgres.database` path and binds a `postgres` Apps resource so the runtime auto-injects `PGHOST`/`PGPORT`/`PGDATABASE`/`PGUSER`; the app mints the Lakebase JWT automatically (no user secret required).
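
Because the Apps runtime injects the standard libpq variables, connection settings can be read straight from the environment. The sketch below builds a libpq keyword/value connection string from `PGHOST`, `PGPORT`, `PGDATABASE`, and `PGUSER`. `conninfo_from_env` is a hypothetical helper, not OntoBricks code, and the password (the Lakebase JWT the app mints) is passed in as an opaque argument because the minting call is not described here.

```python
import os
from typing import Mapping, Optional

# Variable names come from the README; the libpq keywords are the standard ones.
_ENV_TO_KEYWORD = {
    "PGHOST": "host",
    "PGPORT": "port",
    "PGDATABASE": "dbname",
    "PGUSER": "user",
}


def _quote(value: str) -> str:
    """Quote a value for a libpq conninfo string (escape backslash and quote)."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def conninfo_from_env(
    env: Optional[Mapping[str, str]] = None,
    password: Optional[str] = None,
) -> str:
    """Build a conninfo string, failing loudly if a variable was not injected."""
    env = os.environ if env is None else env
    missing = [name for name in _ENV_TO_KEYWORD if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing injected variables: {', '.join(missing)}")
    parts = [f"{kw}={_quote(env[name])}" for name, kw in _ENV_TO_KEYWORD.items()]
    if password is not None:
        # Assumption: the minted token is supplied as the Postgres password.
        parts.append(f"password={_quote(password)}")
    return " ".join(parts)
```

A missing variable usually means the `postgres` resource is not bound to the app, so failing early with the variable names makes that easier to spot.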
Upgrading from a pre-v0.4.0 deployment. Pre-v0.4.0 stored the entire registry as JSON on the Unity Catalog Volume. Run `scripts/migrate-registry-to-lakebase.sh` once before upgrading to v0.4.0+ to copy every JSON-shaped artefact (domains, versions, permissions, schedules, global config) into Lakebase. Binary artefacts on the Volume are left untouched.
First deploy only: `make deploy` runs `scripts/bootstrap-app-permissions.sh` automatically, which grants each app's service principal `CAN_MANAGE` on itself. Without that grant the middleware cannot read the app's own ACL and every first-time visitor, including the deploying `CAN_MANAGE` user, lands on the access-denied page. If you deploy via `databricks bundle deploy` directly, run `make bootstrap-perms` once afterwards (it is idempotent).
See Deployment Guide for the full checklist including resource configuration and permissions.
- Ensure all tests pass: `make test`
- Update the version in `pyproject.toml`
- Commit, tag, and push:

      git add -A && git commit -m "Release vX.Y.Z"
      git tag vX.Y.Z
      git push origin main --tags

- Deploy the new version: `make deploy`

| Step | Action | What Happens |
|---|---|---|
| 1 | Import Metadata (Domain > Metadata) | Fetches table and column metadata from Unity Catalog |
| 2 | Generate Ontology (Ontology > Wizard) | LLM designs entities, relationships, and attributes from your metadata |
| 3 | Auto-Map (Mapping > Auto-Map) | LLM generates SQL mappings for every entity and relationship |
| 4 | Synchronize (Digital Twin > Status) | Executes mappings and populates the triple store |
- Ontology Designer — the main ontology graph view lives under Ontology → Designer (visual canvas + AI Assistant).
- Domain Cockpit (Validation) — Active Version shows which registry version is exposed via API / MCP; it can differ from the version you have loaded in the editor.
- Registry → Browse — only place to set the Active (API/MCP) version for a domain; Domain → Versions shows that status as a read-only badge.
- New domain — after New Domain, a full-page loading overlay runs until Domain Information finishes its first load.
- Domain Information — triple-store / snapshot / local graph paths update when you commit the domain name (blur or change) or change version (aligned with naming rules before save).
- Duplicate names — Save to Unity Catalog is blocked if the sanitized domain name already exists in the registry (inline check + confirmation before POST).
- Navbar — domain name and version in the top bar refresh after load, save, clear, import, and version switches (browser cache invalidated on those actions).
The graph triple-store backend is pluggable; the abstraction (GraphDBFactory / GraphDBBackend) is preserved so additional engines can be added in the future. Today only one engine ships:
- Lakebase (Postgres): default; three Postgres objects per domain version (`*_sync` bulk-data table, `*__app` companion for reasoning/cohort writes, `g_<dom>_v<n>` UNION view for reads) inside a configurable Postgres schema on the App-bound Lakebase database (same connection as the optional Lakebase registry backend). Requires the `lakebase` extra (`uv sync --extra lakebase`) so `psycopg` is installed.
Engine-specific options are stored as global JSON (graph_engine_config). For Lakebase the supported keys are database (optional override of PGDATABASE), schema (optional, default ontobricks_graph), sync_mode (app_managed default, or managed_synced to delegate bulk ingest to a Databricks Lakeflow snapshot pipeline), sync_table_mode (snapshot / triggered / continuous — snapshot is the recommended mode), sync_timeout_s (default 600), sync_uc_catalog (UC catalog the synced table is registered in; defaults to the snapshot Delta catalog when unset), and sync_uc_schema (UC schema segment for the synced-table FQN; defaults to the registry UC schema so the Lakeflow object lands in the same UC namespace as other registry artefacts). See docs/lakebase-graphdb.md for the full reference.
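
To make the defaults and allowed values concrete, here is a minimal sketch that merges a raw `graph_engine_config` dictionary with the documented defaults and rejects unknown keys or values. `resolve_graph_engine_config` is a hypothetical helper, not the OntoBricks implementation. Treating `snapshot` as the default `sync_table_mode` is an assumption; the text above only calls it the recommended mode.

```python
from typing import Any, Dict, Mapping

DEFAULTS: Dict[str, Any] = {
    "schema": "ontobricks_graph",
    "sync_mode": "app_managed",
    "sync_table_mode": "snapshot",  # assumption: documented as recommended, not as default
    "sync_timeout_s": 600,
}
# Keys with no default here: the app falls back to PGDATABASE or UC settings.
OPTIONAL_KEYS = ("database", "sync_uc_catalog", "sync_uc_schema")
ALLOWED = {
    "sync_mode": {"app_managed", "managed_synced"},
    "sync_table_mode": {"snapshot", "triggered", "continuous"},
}


def resolve_graph_engine_config(raw: Mapping[str, Any]) -> Dict[str, Any]:
    """Merge raw settings with defaults and validate enumerated values."""
    unknown = set(raw) - set(DEFAULTS) - set(OPTIONAL_KEYS)
    if unknown:
        raise ValueError(f"unknown graph_engine_config keys: {sorted(unknown)}")
    config = {**DEFAULTS, **raw}
    for key, allowed in ALLOWED.items():
        if config[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}")
    timeout = config["sync_timeout_s"]
    if not isinstance(timeout, int) or timeout <= 0:
        raise ValueError("sync_timeout_s must be a positive integer")
    return config
```

For example, `resolve_graph_engine_config({"sync_mode": "managed_synced"})` keeps the default schema and timeout while switching bulk ingest to the Lakeflow path.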
Lakebase permission grants (three schemas). The app service principal needs `USAGE + DML` on up to three Postgres schemas, each covered by one run of `scripts/bootstrap-lakebase-perms.sh`:

| Schema | When to run | Deploy config var |
|---|---|---|
| Registry schema (e.g. `ontobricks_registry`) | After Settings → Registry → Initialize | `LAKEBASE_BOOTSTRAP_SCHEMA` |
| Graph schema (e.g. `ontobricks_graph`) | After first Digital Twin Build | `LAKEBASE_GRAPH_SCHEMA` |
| Sync schema (e.g. `ontobricks`) | After first Lakeflow snapshot (`managed_synced` only) | `LAKEBASE_SYNC_SCHEMA` |

`scripts/deploy.sh` calls the bootstrap for all three automatically. If the Graph DB is on a separate Lakebase instance from the registry, set `LAKEBASE_GRAPH_PROJECT`, `LAKEBASE_GRAPH_BRANCH`, and `LAKEBASE_GRAPH_DATABASE` in `scripts/deploy.config.sh` so the second and third grants target the correct instance.
Lakebase build performance. When the active engine is Lakebase, the Digital Twin build streams warehouse rows in `fetchmany` batches (`SQLWarehouse.iter_rows`) and ingests them via `COPY FROM STDIN` into a per-batch temp table followed by `INSERT … ON CONFLICT DO NOTHING` (and the symmetrical `DELETE … USING` for incremental removes). The FastAPI process never holds the full graph or the full diff: snapshot CTAS and `EXCEPT` execution stay warehouse-side, and the app pipes one batch at a time. There is no Volume archive thread; Postgres is the system of record for the graph.
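
The batch-at-a-time pattern can be illustrated with the standard library alone. The sketch below uses in-memory SQLite as a stand-in for both the SQL Warehouse and Postgres, so it is not the OntoBricks code path: rows are loaded into the staging table with plain inserts where the real build uses `COPY FROM STDIN`, and SQLite's `INSERT OR IGNORE` plays the role of `INSERT … ON CONFLICT DO NOTHING`. The table names (`src`, `stage`, `triples`) and column names (`s`, `p`, `o`) are assumptions made for the example.

```python
import sqlite3
from typing import Iterator, List, Tuple

Row = Tuple[str, str, str]


def iter_batches(cursor: sqlite3.Cursor, batch_size: int) -> Iterator[List[Row]]:
    """Yield rows in fetchmany-sized batches so the full result is never held."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            return
        yield rows


def ingest(source: sqlite3.Connection, target: sqlite3.Connection,
           batch_size: int = 1000) -> int:
    """Copy triples from source to target one batch at a time; return batch count."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS triples ("
        "s TEXT NOT NULL, p TEXT NOT NULL, o TEXT NOT NULL, PRIMARY KEY (s, p, o))"
    )
    target.execute("CREATE TEMP TABLE IF NOT EXISTS stage (s TEXT, p TEXT, o TEXT)")
    cursor = source.execute("SELECT s, p, o FROM src")
    batches = 0
    for batch in iter_batches(cursor, batch_size):
        # Stage the batch, then merge it; duplicates are silently skipped.
        target.execute("DELETE FROM stage")
        target.executemany("INSERT INTO stage (s, p, o) VALUES (?, ?, ?)", batch)
        target.execute(
            "INSERT OR IGNORE INTO triples (s, p, o) SELECT s, p, o FROM stage"
        )
        batches += 1
    target.commit()
    return batches
```

Because the merge ignores rows that already exist, re-running the ingest over the same source leaves the target unchanged, which is the property that makes per-batch retries safe.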
Lakebase managed-synced mode. When `graph_engine_config.sync_mode = "managed_synced"`, the bulk R2RML data movement is moved entirely off the app: a Databricks Lakeflow snapshot pipeline keeps a Postgres synced table in lock-step with the R2RML view, and the FastAPI process only orchestrates (`SyncedTableManager.ensure` + `trigger_and_wait`). Reasoning + cohort writes stay on the direct PG path through a writable companion table; readers see both via a UNION view (back-compat name). PG layout per graph version: `g_<dom>_v<n>_sync` (Lakeflow), `g_<dom>_v<n>__app` (app), `g_<dom>_v<n>` (UNION view). See `docs/graphdb-integration.md` §9 for the full architecture.
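
The per-version layout can be written down as a small naming helper. This is a sketch under stated assumptions, not the project's code: `graph_objects` and `union_view_sql` are hypothetical names, the identifier rule for `<dom>` is invented for the example, and the triple columns (`subject`, `predicate`, `object`) are assumed because the actual table schema is not given here.

```python
import re
from dataclasses import dataclass

# Assumption: domain segments are lowercase identifiers; the real rule may differ.
_IDENT = re.compile(r"^[a-z][a-z0-9_]*$")


@dataclass(frozen=True)
class GraphObjects:
    sync_table: str   # written by the Lakeflow pipeline
    app_table: str    # writable companion for reasoning/cohort writes
    union_view: str   # what readers query


def graph_objects(domain: str, version: int) -> GraphObjects:
    """Derive the three Postgres object names for one graph version."""
    if not _IDENT.match(domain):
        raise ValueError(f"unsupported domain segment: {domain!r}")
    base = f"g_{domain}_v{version}"
    return GraphObjects(f"{base}_sync", f"{base}__app", base)


def union_view_sql(schema: str, objs: GraphObjects) -> str:
    """Render a view that exposes both tables under the back-compat name."""
    cols = "subject, predicate, object"  # assumed column names
    return (
        f'CREATE OR REPLACE VIEW "{schema}"."{objs.union_view}" AS\n'
        f'SELECT {cols} FROM "{schema}"."{objs.sync_table}"\n'
        f"UNION\n"
        f'SELECT {cols} FROM "{schema}"."{objs.app_table}"'
    )
```

For a domain segment `sales` at version 3 this yields `g_sales_v3_sync`, `g_sales_v3__app`, and `g_sales_v3`, matching the layout described above.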
- Design an ontology visually using the OntoViz canvas, or import OWL/RDFS/industry standards (FIBO, CDISC, IOF, HL7 FHIR R4/R4B/R5)
- Map ontology entities to Databricks tables with column-level precision
- Build the Digital Twin — materializes triples into the triple store (incremental by default)
- Query through the GraphQL playground or explore the interactive knowledge graph
- Reason over the graph — run OWL 2 RL inference, SWRL rules, SHACL validation, and constraint checks
- Two-phase search — preview matching entities in a flat list, then select specific ones to expand into the full graph with relationships and neighbors
- Configurable search depth — control the maximum traversal depth and entity cap for graph expansion
- Right-click "Expand neighbours" — enrich the current graph in place with N-hop neighbours of any selected node (depth follows the right-pane Depth slider, default 2); newly added entities are highlighted and the camera zooms to frame them, with a non-blocking spinner in the canvas top-right while the request runs
- Bridge navigation — follow cross-domain bridges to automatically switch domains and focus on the target entity in the knowledge graph
- Data cluster detection — detect communities in the knowledge graph using Louvain, Label Propagation, or Greedy Modularity algorithms; available client-side (Graphology) for the visible subgraph and server-side (NetworkX) for the full graph; cluster results can be visualized with color-by-cluster mode and collapsed into super-nodes
- Cohort discovery: group entities that travel together using rule-based linkage (shared resources via predicates) and compatibility constraints (same-value, value-equals, value-in, value-range); deterministic, explainable cohorts with live counters, why/why-not explainers, and idempotent materialisation as graph triples (`:inCohort`) or Unity Catalog Delta tables. See `docs/cohort_discovery.md`.
- Data quality violation limits: cap the number of violations displayed per rule (configurable via dropdown, default 10) for faster quality checks
- Per-rule progress tracking — SWRL inference and data quality checks report progress for each individual rule
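
The depth and entity-cap controls for graph expansion amount to a bounded breadth-first search. The following is a minimal sketch of that idea over a list of triples, not the OntoBricks implementation: `expand_neighbours` is a hypothetical function, edges are treated as undirected, and the default cap of 100 entities is an assumption (only the default depth of 2 comes from the list above).

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def expand_neighbours(triples: Iterable[Triple], start: str,
                      depth: int = 2, max_entities: int = 100) -> Dict[str, int]:
    """Return {entity: hop distance} for entities within `depth` hops of `start`."""
    # Build an undirected adjacency map from subject/object pairs.
    adjacency: Dict[str, Set[str]] = {}
    for subject, _predicate, obj in triples:
        adjacency.setdefault(subject, set()).add(obj)
        adjacency.setdefault(obj, set()).add(subject)

    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == depth:
            continue  # do not expand past the requested depth
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour in distances:
                continue
            if len(distances) >= max_entities:
                return distances  # entity cap reached
            distances[neighbour] = distances[node] + 1
            queue.append(neighbour)
    return distances
```

Sorting the neighbours keeps the result deterministic when the cap cuts the expansion short.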
The Ontology Designer view (Ontology → Designer) includes a floating AI Assistant (bottom-right of the canvas) that lets you modify your ontology through natural language commands — add entities, remove orphans, list relationships, and more. Conversation history is maintained within the session.
- Deep-linked sidebar sections — shareable URLs, browser Back/Forward support
- Breadcrumb navigation — always see your position (Registry > Domain > Ontology > Section)
- Keyboard shortcuts: `Cmd/Ctrl+S` save, `Cmd/Ctrl+K` search, `?` help overlay
- SQL connection pooling: reusable database connections, no per-query TLS handshake
- CSRF protection — double-submit cookie for all state-changing requests
- Structured JSON logging: set `LOG_FORMAT=json` for production-grade observability
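
For readers unfamiliar with the double-submit cookie pattern mentioned above, here is a framework-independent sketch. It is not the OntoBricks middleware: the cookie name, header name, and function names are hypothetical. The idea is that the server issues a random token in a cookie, the client echoes it in a request header, and state-changing requests are accepted only when the two match.

```python
import hmac
import secrets
from typing import Mapping

CSRF_COOKIE = "csrf_token"      # hypothetical cookie name
CSRF_HEADER = "X-CSRF-Token"    # hypothetical header name
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def issue_csrf_token() -> str:
    """Create an unguessable token to set as the CSRF cookie."""
    return secrets.token_urlsafe(32)


def is_request_allowed(method: str, cookies: Mapping[str, str],
                       headers: Mapping[str, str]) -> bool:
    """Allow safe methods; otherwise require header and cookie tokens to match."""
    if method.upper() in SAFE_METHODS:
        return True
    cookie_token = cookies.get(CSRF_COOKIE)
    header_token = headers.get(CSRF_HEADER)
    if not cookie_token or not header_token:
        return False
    # Constant-time comparison on bytes avoids timing leaks and str-type errors.
    return hmac.compare_digest(cookie_token.encode(), header_token.encode())
```

A cross-site page can make the browser send the cookie but cannot read it to copy its value into the header, which is why the match is sufficient evidence that the request came from the application's own pages.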
OntoBricks exposes the knowledge graph to LLM agents via the Model Context Protocol. Deploy the companion mcp-ontobricks app and connect from Cursor, Claude Desktop, or the Databricks Playground.
Export one or more domains directly from Registry → Browse to a portable
.obx file with per-domain version-mode selection (Latest / Active / All /
Choose). Import with per-domain conflict resolution (Skip / Overwrite / Rename).
No command line required — ideal for ad-hoc transfers and cross-tenant sharing.
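
The per-domain conflict choices (Skip / Overwrite / Rename) can be pictured as a planning step that runs before anything is written. The sketch below is illustrative only: `plan_import` is a hypothetical function, defaulting to Skip when no choice is given is an assumption, and the `_imported` rename suffix is invented for the example rather than taken from the `.obx` importer.

```python
from typing import Dict, Iterable, List, Set, Tuple

Plan = List[Tuple[str, str, str]]  # (source name, action, target name)


def _free_name(name: str, taken: Set[str]) -> str:
    """Pick the first unused '<name>_imported[_n]' candidate (assumed scheme)."""
    candidate = f"{name}_imported"
    counter = 2
    while candidate in taken:
        candidate = f"{name}_imported_{counter}"
        counter += 1
    return candidate


def plan_import(incoming: Iterable[str], existing: Set[str],
                policy: Dict[str, str]) -> Plan:
    """Decide what happens to each incoming domain without modifying anything."""
    taken = set(existing)
    plan: Plan = []
    for name in incoming:
        if name not in taken:
            plan.append((name, "create", name))
            taken.add(name)
            continue
        choice = policy.get(name, "skip")  # assumption: conflicts default to skip
        if choice in ("skip", "overwrite"):
            plan.append((name, choice, name))
        elif choice == "rename":
            target = _free_name(name, taken)
            taken.add(target)
            plan.append((name, "rename", target))
        else:
            raise ValueError(f"unknown conflict policy {choice!r} for {name!r}")
    return plan
```

Producing the plan first mirrors a preview-then-commit flow: the plan can be shown to the user, and only then applied.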
For automated promotion pipelines use the
scripts/registry_transfer.sh command-line tool — export a curated subset
of domains/versions from a source registry into a .zip, then preview and
commit it into the target registry. See
Registry Import / Export for the full reference,
examples, and a comparison of the OBX UI vs CLI approaches.
Detect 19 structural, logical, and semantic pitfalls (P1.1–P4.7) in your ontology from the Ontology → Pitfalls sidebar panel. Fast graph-only checks run immediately; ML-heavy checks (semantic similarity, NLP naming) require installing the optional extra:
uv sync --extra pitfalls

Full documentation is available in docs/. For a comprehensive feature list and architecture details, see INFO.md.