Conventions for creating and editing marimo notebooks in the Developer Data Portal. Read this before touching any notebook.
The setup_pyoso cell is immutable. Do not rename it, rewrite it, or change its body. It is the only place pyoso and mo are imported. All other cells receive these via dependency injection.
@app.cell(hide_code=True)
def setup_pyoso():
# This code sets up pyoso to be used as a database provider for this notebook
# This code is autogenerated. Modification could lead to unexpected results :)
import pyoso
import marimo as mo
pyoso_db_conn = pyoso.Client().dbapi_connection()
return mo, pyoso_db_connDo not link to other notebooks or pages inside the app. Notebooks run inside iframes served from localhost:8000, but the app shell is at localhost:3000. Any relative or absolute path link (/quick-start, ./commits.py) resolves against the wrong origin and breaks. Plain text only — no hrefs, no [text](url) for internal navigation.
Every cell must have hide_code=True. No exceptions — not for content cells, not for infrastructure cells.
@app.cell(hide_code=True)
def _(mo):
...Never pin a marimo version. The top of every notebook file must have:
__generated_with = "unknown"
app = marimo.App(width="full", css_file="styles/root.css")The css_file path is relative to the notebook file. Use the correct variant:
- Root notebooks (
notebooks/*.py):css_file="styles/root.css" - Data notebooks (
notebooks/data/**/*.py):css_file="../../styles/data.css" - Insight notebooks (
notebooks/insights/*.py):css_file="../styles/insights.css"
CSS is built from notebooks/styles/base.css + variant partials by scripts/build_css.py. Run uv run scripts/build_css.py after editing any CSS source file.
When querying oso.stg_opendevdata__ecosystems and listing/ranking multiple ecosystems, always add:
WHERE e.is_crypto = 1 AND e.is_category = 0This filters out ODD's internal category ecosystems. Not needed when filtering by a specific ecosystem name (e.g., WHERE e.name = 'Ethereum').
All queries run against a Trino data warehouse via pyoso. Write Trino-compatible SQL:
DATE_TRUNC('month', dt)notDATE_TRUNC(dt, MONTH)CAST(x AS VARCHAR)notSAFE_CASTCOALESCEnotIFNULL- Single quotes for strings, double quotes for identifiers
- Capitalize keywords (
SELECT,FROM,WHERE)
- Prefix throwaway variables with
_to avoid marimo "multiple definitions" errors:_df = mo.sql("SELECT ...", engine=pyoso_db_conn)
- Named variables shared across cells use descriptive names:
df_top_ecosystems,df_monthly_commits
Cells declare dependencies in their signature. Infrastructure cells live at the bottom of the file — marimo resolves execution order automatically from the dependency graph.
@app.cell(hide_code=True)
def _(mo, pyoso_db_conn, px): # receives what it needs
...Use """ not r""" (raw strings break escape sequences in mermaid and SQL blocks).
Never import or call dotenv. Environment variables are loaded automatically by the server process. The OSO_API_KEY is always available via the environment.
Always use pyoso_db_conn from setup_pyoso. Never do conn = pyoso.Client() in other cells.
DDP data model notebooks follow this order top-to-bottom in the rendered view. Infrastructure cells (helpers, imports, setup) go at the bottom of the file.
- Title —
# Model Name+ one-line intro + preview SQL snippet - Overview — What the model is and why it matters
- Key concepts — Model-specific context (hierarchy, ID systems, data sources)
- Data lineage — Mermaid diagram showing transformation flow
- Key fields (if applicable) — Field documentation
- Model previews — Accordion previews using
render_table_preview - Best Practices — Table with Goal / Recommended Approach / Why columns
- Live data exploration — Stat widgets + charts with executed queries
- Sample queries — Markdown + executed pairs (see pattern below)
- Methodology / Edge cases (if applicable)
- Helper utilities —
render_table_previewand supporting functions - Imports —
plotly.express,pandasif needed setup_pyoso— Always last
Present each query as a markdown + executed pair — two cells, both with hide_code=True:
@app.cell(hide_code=True)
def _(mo):
mo.md("""
### 1. Query Title
Description of what the query does.
```sql
SELECT column1, column2
FROM oso.model_name
WHERE condition
LIMIT 10
```
""")
return
@app.cell(hide_code=True)
def _(mo, pyoso_db_conn):
_df = mo.sql(
"""
SELECT column1, column2
FROM oso.model_name
WHERE condition
LIMIT 10
""",
engine=pyoso_db_conn
)
return- Use narrow date ranges (7–30 days) to keep execution fast
- Use
CURRENT_DATE - INTERVAL 'N' DAY - Convert dates with
pd.to_datetime()for plotly
@app.cell(hide_code=True)
def imports():
import pandas as pd
import plotly.express as px
return (pd, px)_fig.update_layout(
template='plotly_white',
margin=dict(t=20, l=0, r=0, b=0),
height=400,
hovermode='x unified'
)
_fig.update_xaxes(title='', showgrid=False, linecolor="#000", linewidth=1)
_fig.update_yaxes(title='Y-Axis Label', showgrid=True, gridcolor="#E5E5E5", linecolor="#000", linewidth=1)Additions by chart type:
- Horizontal bar:
yaxis=dict(categoryorder='total ascending') - Line/area with dates:
xaxis=dict(tickformat="%b %Y") - Multi-series legend:
legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1, title_text='')
- Primary:
#4C78A8(blue) - Secondary:
#F58518(orange) - Tertiary:
#72B7B2(teal) - Area fill:
rgba(76, 120, 168, 0.2)
mo.hstack([
mo.stat(label="Label", value=f"{value:,}", bordered=True, caption="Context"),
], widths="equal", gap=1)- Always
bordered=Truewith acaption - Place stats above charts:
mo.vstack([stats_row, chart]) - Typically 3–4 stats per row
Use mo.mermaid(), not ASCII art:
mo.mermaid("""
graph TD
A[source_model<br/>Description] --> B[final_model<br/>Description]
""")Use mo.md() for prose. Do not use mo.callout().
Include this cell in all data model notebooks (at the bottom of the file):
@app.cell(hide_code=True)
def _(mo, pyoso_db_conn):
def get_model_preview(model_name, limit=5):
return mo.sql(f"SELECT * FROM {model_name} LIMIT {limit}",
engine=pyoso_db_conn, output=False)
def get_row_count(model_name):
result = mo.sql(f"SHOW STATS FOR {model_name}",
engine=pyoso_db_conn, output=False)
return result['row_count'].sum()
def generate_sql_snippet(model_name, df_results, limit=5):
column_names = df_results.columns.tolist()
columns_formatted = ',\n '.join(column_names)
return mo.md(f"```sql\nSELECT\n {columns_formatted}\nFROM {model_name}\nLIMIT {limit}\n```\n")
def render_table_preview(model_name):
df = get_model_preview(model_name)
if df.empty:
return mo.md(f"**{model_name}**\n\nUnable to retrieve preview.")
sql_snippet = generate_sql_snippet(model_name, df, limit=5)
fmt = {c: '{:.0f}' for c in df.columns if df[c].dtype == 'int64' and ('_id' in c or c == 'id')}
table = mo.ui.table(df, format_mapping=fmt, show_column_summaries=False, show_data_types=False)
row_count = get_row_count(model_name)
col_count = len(df.columns)
title = f"{model_name} | {row_count:,.0f} rows, {col_count} cols"
return mo.accordion({title: mo.vstack([sql_snippet, table])})
return (render_table_preview,)oso.stg_opendevdata__ecosystems— Ecosystem definitions (name,is_crypto,is_chain,is_category)oso.stg_opendevdata__ecosystems_child_ecosystems— Parent-child hierarchy linksoso.stg_opendevdata__ecosystems_repos— Direct repo → ecosystem mappingoso.stg_opendevdata__ecosystems_repos_recursive— Recursive mapping with distance + path
oso.int_opendevdata__repositories_with_repo_id— Maps ODD repos to GitHub integer IDs (canonical bridge for cross-source joins)
oso.int_ddp__developers— Unified developer list (ODD + GHA), keyed byuser_idoso.stg_opendevdata__developers— ODD developers with GraphQL IDs
oso.int_ddp__commits_unified— Combined ODD + GHA commits (not deduped)oso.int_ddp__commits_deduped— Deduplicated unified commitsoso.stg_opendevdata__commits— Raw ODD commits with identity resolution
oso.stg_github__events— Raw events with nested fieldsoso.int_gharchive__github_events— Standardized events (canonical entrypoint)oso.int_ddp_github_events— Curated subset of event typesoso.int_ddp_github_events_daily— Daily aggregation with normalized typesoso.int_gharchive__developer_activities— Daily rollup for MAD metrics
oso.stg_opendevdata__repo_developer_28d_activities— 28-day rolling activity (day × repo × developer)oso.stg_opendevdata__eco_mads— Pre-calculated ecosystem MAD counts (day × ecosystem)
oso.int_github__node_id_map— Decode GitHub GraphQL Node IDs to REST integer IDs
PushEvent, PullRequestEvent, PullRequestReviewEvent, PullRequestReviewCommentEvent, IssuesEvent, WatchEvent, ForkEvent
When writing or documenting model notebooks, be explicit about:
- Freshness: GitHub Archive is ~3 days behind real-time
- Completeness: Public GitHub timeline only — no private repos, no deleted events
- Identity:
actor_idis a GitHub REST ID;canonical_developer_idis from Open Dev Data — don't conflate them - Join keys: Use
repo_id(integer) for cross-source joins, not repo names