DDP Notebooks — Agent Guide

Conventions for creating and editing marimo notebooks in the Developer Data Portal. Read this before touching any notebook.

Critical Rules (Do Not Violate)

Never modify `setup_pyoso`

The setup_pyoso cell is immutable. Do not rename it, rewrite it, or change its body. It is the only place pyoso and mo are imported. All other cells receive these via dependency injection.

@app.cell(hide_code=True)
def setup_pyoso():
    # This code sets up pyoso to be used as a database provider for this notebook
    # This code is autogenerated. Modification could lead to unexpected results :)
    import pyoso
    import marimo as mo
    pyoso_db_conn = pyoso.Client().dbapi_connection()
    return mo, pyoso_db_conn

Never add internal links

Do not link to other notebooks or pages inside the app. Notebooks run inside iframes served from localhost:8000, but the app shell is at localhost:3000. Any relative or absolute path link (/quick-start, ./commits.py) resolves against the wrong origin and breaks. Plain text only — no hrefs, no [text](url) for internal navigation.

Always use `hide_code=True`

Every cell must have hide_code=True. No exceptions — not for content cells, not for infrastructure cells.

@app.cell(hide_code=True)
def _(mo):
    ...

Use `__generated_with = "unknown"`

Never pin a marimo version. The top of every notebook file must have:

__generated_with = "unknown"
app = marimo.App(width="full", css_file="styles/root.css")

The css_file path is relative to the notebook file. Use the correct variant:

Root notebooks (notebooks/*.py): css_file="styles/root.css"
Data notebooks (notebooks/data/**/*.py): css_file="../../styles/data.css"
Insight notebooks (notebooks/insights/*.py): css_file="../styles/insights.css"

CSS is built from notebooks/styles/base.css + variant partials by scripts/build_css.py. Run uv run scripts/build_css.py after editing any CSS source file.

Filter ecosystem queries

When querying oso.stg_opendevdata__ecosystems and listing/ranking multiple ecosystems, always add:

WHERE e.is_crypto = 1 AND e.is_category = 0

This filters out ODD's internal category ecosystems. Not needed when filtering by a specific ecosystem name (e.g., WHERE e.name = 'Ethereum').

Use Trino SQL

All queries run against a Trino data warehouse via pyoso. Write Trino-compatible SQL:

DATE_TRUNC('month', dt) not DATE_TRUNC(dt, MONTH)
CAST(x AS VARCHAR) not SAFE_CAST
COALESCE not IFNULL
Single quotes for strings, double quotes for identifiers
Capitalize keywords (SELECT, FROM, WHERE)

Cell Conventions

Variable naming

Prefix throwaway variables with _ to avoid marimo "multiple definitions" errors:
```
_df = mo.sql("SELECT ...", engine=pyoso_db_conn)
```
Named variables shared across cells use descriptive names: df_top_ecosystems, df_monthly_commits

Dependency injection

Cells declare dependencies in their signature. Infrastructure cells live at the bottom of the file — marimo resolves execution order automatically from the dependency graph.

@app.cell(hide_code=True)
def _(mo, pyoso_db_conn, px):   # receives what it needs
    ...

Markdown strings

Use """ not r""" (raw strings break escape sequences in mermaid and SQL blocks).

No dotenv

Never import or call dotenv. Environment variables are loaded automatically by the server process. The OSO_API_KEY is always available via the environment.

Never call `pyoso.Client()` directly

Always use pyoso_db_conn from setup_pyoso. Never do conn = pyoso.Client() in other cells.

Notebook Structure

DDP data model notebooks follow this order top-to-bottom in the rendered view. Infrastructure cells (helpers, imports, setup) go at the bottom of the file.

Content cells (rendered order)

Title — # Model Name + one-line intro + preview SQL snippet
Overview — What the model is and why it matters
Key concepts — Model-specific context (hierarchy, ID systems, data sources)
Data lineage — Mermaid diagram showing transformation flow
Key fields (if applicable) — Field documentation
Model previews — Accordion previews using render_table_preview
Best Practices — Table with Goal / Recommended Approach / Why columns
Live data exploration — Stat widgets + charts with executed queries
Sample queries — Markdown + executed pairs (see pattern below)
Methodology / Edge cases (if applicable)

Infrastructure cells (bottom of file)

Helper utilities — render_table_preview and supporting functions
Imports — plotly.express, pandas if needed
setup_pyoso — Always last

Queries

Sample query pattern

Present each query as a markdown + executed pair — two cells, both with hide_code=True:

@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    ### 1. Query Title

    Description of what the query does.

    ```sql
    SELECT column1, column2
    FROM oso.model_name
    WHERE condition
    LIMIT 10
    ```
    """)
    return


@app.cell(hide_code=True)
def _(mo, pyoso_db_conn):
    _df = mo.sql(
        """
        SELECT column1, column2
        FROM oso.model_name
        WHERE condition
        LIMIT 10
        """,
        engine=pyoso_db_conn
    )
    return

Live data queries

Use narrow date ranges (7–30 days) to keep execution fast
Use CURRENT_DATE - INTERVAL 'N' DAY
Convert dates with pd.to_datetime() for plotly

Visualizations

Imports cell

@app.cell(hide_code=True)
def imports():
    import pandas as pd
    import plotly.express as px
    return (pd, px)

Chart styling

_fig.update_layout(
    template='plotly_white',
    margin=dict(t=20, l=0, r=0, b=0),
    height=400,
    hovermode='x unified'
)
_fig.update_xaxes(title='', showgrid=False, linecolor="#000", linewidth=1)
_fig.update_yaxes(title='Y-Axis Label', showgrid=True, gridcolor="#E5E5E5", linecolor="#000", linewidth=1)

Additions by chart type:

Horizontal bar: yaxis=dict(categoryorder='total ascending')
Line/area with dates: xaxis=dict(tickformat="%b %Y")
Multi-series legend: legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1, title_text='')

Color palette

Primary: #4C78A8 (blue)
Secondary: #F58518 (orange)
Tertiary: #72B7B2 (teal)
Area fill: rgba(76, 120, 168, 0.2)

Widgets

Stats

mo.hstack([
    mo.stat(label="Label", value=f"{value:,}", bordered=True, caption="Context"),
], widths="equal", gap=1)

Always bordered=True with a caption
Place stats above charts: mo.vstack([stats_row, chart])
Typically 3–4 stats per row

Data lineage diagrams

Use mo.mermaid(), not ASCII art:

mo.mermaid("""
graph TD
    A[source_model<br/>Description] --> B[final_model<br/>Description]
""")

Markdown

Use mo.md() for prose. Do not use mo.callout().

Helper Utilities Cell

Include this cell in all data model notebooks (at the bottom of the file):

@app.cell(hide_code=True)
def _(mo, pyoso_db_conn):
    def get_model_preview(model_name, limit=5):
        return mo.sql(f"SELECT * FROM {model_name} LIMIT {limit}",
                      engine=pyoso_db_conn, output=False)

    def get_row_count(model_name):
        result = mo.sql(f"SHOW STATS FOR {model_name}",
                        engine=pyoso_db_conn, output=False)
        return result['row_count'].sum()

    def generate_sql_snippet(model_name, df_results, limit=5):
        column_names = df_results.columns.tolist()
        columns_formatted = ',\n  '.join(column_names)
        return mo.md(f"```sql\nSELECT\n  {columns_formatted}\nFROM {model_name}\nLIMIT {limit}\n```\n")

    def render_table_preview(model_name):
        df = get_model_preview(model_name)
        if df.empty:
            return mo.md(f"**{model_name}**\n\nUnable to retrieve preview.")
        sql_snippet = generate_sql_snippet(model_name, df, limit=5)
        fmt = {c: '{:.0f}' for c in df.columns if df[c].dtype == 'int64' and ('_id' in c or c == 'id')}
        table = mo.ui.table(df, format_mapping=fmt, show_column_summaries=False, show_data_types=False)
        row_count = get_row_count(model_name)
        col_count = len(df.columns)
        title = f"{model_name} | {row_count:,.0f} rows, {col_count} cols"
        return mo.accordion({title: mo.vstack([sql_snippet, table])})

    return (render_table_preview,)

Data Sources Reference

Open Dev Data (Ecosystem models)

oso.stg_opendevdata__ecosystems — Ecosystem definitions (name, is_crypto, is_chain, is_category)
oso.stg_opendevdata__ecosystems_child_ecosystems — Parent-child hierarchy links
oso.stg_opendevdata__ecosystems_repos — Direct repo → ecosystem mapping
oso.stg_opendevdata__ecosystems_repos_recursive — Recursive mapping with distance + path

Repository bridge

oso.int_opendevdata__repositories_with_repo_id — Maps ODD repos to GitHub integer IDs (canonical bridge for cross-source joins)

Developer models

oso.int_ddp__developers — Unified developer list (ODD + GHA), keyed by user_id
oso.stg_opendevdata__developers — ODD developers with GraphQL IDs

Commit models

oso.int_ddp__commits_unified — Combined ODD + GHA commits (not deduped)
oso.int_ddp__commits_deduped — Deduplicated unified commits
oso.stg_opendevdata__commits — Raw ODD commits with identity resolution

GitHub Archive (event models)

oso.stg_github__events — Raw events with nested fields
oso.int_gharchive__github_events — Standardized events (canonical entrypoint)
oso.int_ddp_github_events — Curated subset of event types
oso.int_ddp_github_events_daily — Daily aggregation with normalized types
oso.int_gharchive__developer_activities — Daily rollup for MAD metrics

Timeseries models

oso.stg_opendevdata__repo_developer_28d_activities — 28-day rolling activity (day × repo × developer)
oso.stg_opendevdata__eco_mads — Pre-calculated ecosystem MAD counts (day × ecosystem)

ID mapping

oso.int_github__node_id_map — Decode GitHub GraphQL Node IDs to REST integer IDs

DDP curated event types

PushEvent, PullRequestEvent, PullRequestReviewEvent, PullRequestReviewCommentEvent, IssuesEvent, WatchEvent, ForkEvent

Data Quality Notes

When writing or documenting model notebooks, be explicit about:

Freshness: GitHub Archive is ~3 days behind real-time
Completeness: Public GitHub timeline only — no private repos, no deleted events
Identity: actor_id is a GitHub REST ID; canonical_developer_id is from Open Dev Data — don't conflate them
Join keys: Use repo_id (integer) for cross-source joins, not repo names

FilesExpand file tree

claude.md

Latest commit

History