DDP Notebooks — Agent Guide

Conventions for creating and editing marimo notebooks in the Developer Data Portal. Read this before touching any notebook.


Critical Rules (Do Not Violate)

Never modify setup_pyoso

The setup_pyoso cell is immutable. Do not rename it, rewrite it, or change its body. It is the only place pyoso and mo are imported. All other cells receive these via dependency injection.

@app.cell(hide_code=True)
def setup_pyoso():
    # This code sets up pyoso to be used as a database provider for this notebook
    # This code is autogenerated. Modification could lead to unexpected results :)
    import pyoso
    import marimo as mo
    pyoso_db_conn = pyoso.Client().dbapi_connection()
    return mo, pyoso_db_conn

Never add internal links

Do not link to other notebooks or pages inside the app. Notebooks run inside iframes served from localhost:8000, but the app shell is at localhost:3000. Any relative or absolute path link (/quick-start, ./commits.py) resolves against the wrong origin and breaks. Plain text only — no hrefs, no [text](url) for internal navigation.
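One way to catch violations before review is a small text scan. The sketch below is an illustration only: `find_internal_links` is a hypothetical helper, not part of the repo, and it conservatively treats any link target that is not an absolute http(s) URL as internal.

```python
import re

# Matches markdown links [text](target) and captures the target.
_MD_LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)[^)]*\)")
# Matches href="target" or href='target' in raw HTML.
_HREF = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def find_internal_links(text: str) -> list[str]:
    """Return link targets that are not absolute http(s) URLs."""
    targets = _MD_LINK.findall(text) + _HREF.findall(text)
    # Anything without an http(s) scheme would resolve against the iframe origin.
    return [t for t in targets if not re.match(r"https?://", t, re.IGNORECASE)]


print(find_internal_links("See [Quick start](/quick-start) and [commits](./commits.py)."))
# ['/quick-start', './commits.py']
```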

Always use hide_code=True

Every cell must have hide_code=True. No exceptions — not for content cells, not for infrastructure cells.

@app.cell(hide_code=True)
def _(mo):
    ...

Use __generated_with = "unknown"

Never pin a marimo version. The top of every notebook file must have:

__generated_with = "unknown"
app = marimo.App(width="full", css_file="styles/root.css")

The css_file path is relative to the notebook file. Use the correct variant:

  • Root notebooks (notebooks/*.py): css_file="styles/root.css"
  • Data notebooks (notebooks/data/**/*.py): css_file="../../styles/data.css"
  • Insight notebooks (notebooks/insights/*.py): css_file="../styles/insights.css"

CSS is built from notebooks/styles/base.css + variant partials by scripts/build_css.py. Run uv run scripts/build_css.py after editing any CSS source file.
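The variant choice can be expressed as a lookup on the notebook's location. The sketch below is illustrative: `css_file_for` is a hypothetical helper, the file names in the usage line are made up, and the data variant is returned exactly as listed above, which presumes data notebooks sit one directory below notebooks/data/.

```python
from pathlib import PurePosixPath


def css_file_for(notebook_path: str) -> str:
    """Map a notebook path (relative to the repo root) to its css_file value."""
    parts = PurePosixPath(notebook_path).parts
    if not parts or parts[0] != "notebooks" or not notebook_path.endswith(".py"):
        raise ValueError(f"not a notebook path: {notebook_path!r}")
    if len(parts) == 2:                              # notebooks/*.py
        return "styles/root.css"
    if parts[1] == "data":                           # notebooks/data/**/*.py
        return "../../styles/data.css"
    if parts[1] == "insights" and len(parts) == 3:   # notebooks/insights/*.py
        return "../styles/insights.css"
    raise ValueError(f"no css variant documented for: {notebook_path!r}")


print(css_file_for("notebooks/insights/retention.py"))  # ../styles/insights.css
```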

Filter ecosystem queries

When querying oso.stg_opendevdata__ecosystems and listing/ranking multiple ecosystems, always add:

WHERE e.is_crypto = 1 AND e.is_category = 0

This filters out Open Dev Data's (ODD's) internal category ecosystems. The filter is not needed when selecting a single ecosystem by name (e.g., WHERE e.name = 'Ethereum').
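As an illustration, a listing query with the filter applied might be assembled like this. `ecosystem_list_query` is a hypothetical helper; the table and column names come from this guide, while the ordering and limit are arbitrary choices for the example.

```python
def ecosystem_list_query(limit: int = 20) -> str:
    """Build a Trino query listing crypto ecosystems, excluding category rows."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"""
SELECT e.name
FROM oso.stg_opendevdata__ecosystems AS e
WHERE e.is_crypto = 1 AND e.is_category = 0
ORDER BY e.name
LIMIT {int(limit)}
""".strip()


print(ecosystem_list_query(10))
```

The resulting string can be passed to mo.sql(..., engine=pyoso_db_conn) in a notebook cell.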

Use Trino SQL

All queries run against a Trino data warehouse via pyoso. Write Trino-compatible SQL:

  • DATE_TRUNC('month', dt) not DATE_TRUNC(dt, MONTH)
  • CAST(x AS VARCHAR) not SAFE_CAST
  • COALESCE not IFNULL
  • Single quotes for strings, double quotes for identifiers
  • Capitalize keywords (SELECT, FROM, WHERE)
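The substitutions above lend themselves to a quick pattern check. This is a rough sketch, not a SQL parser: `non_trino_hints` is a hypothetical helper that only looks for the three constructs named in the list and can be fooled by comments or string literals.

```python
import re

# (pattern, hint) pairs for the non-Trino constructs named in the list above.
_NON_TRINO = [
    (re.compile(r"\bSAFE_CAST\s*\(", re.I), "use CAST(x AS VARCHAR)"),
    (re.compile(r"\bIFNULL\s*\(", re.I), "use COALESCE"),
    # DATE_TRUNC(expr, UNIT): the unit comes last and is unquoted.
    (re.compile(r"\bDATE_TRUNC\s*\(\s*[^',()]+,\s*[A-Za-z]+\s*\)", re.I),
     "use DATE_TRUNC('unit', expr)"),
]


def non_trino_hints(sql: str) -> list[str]:
    """Return a hint for each non-Trino construct found in the SQL text."""
    return [hint for pattern, hint in _NON_TRINO if pattern.search(sql)]


print(non_trino_hints("SELECT DATE_TRUNC(dt, MONTH), IFNULL(a, 0) FROM t"))
# ['use COALESCE', "use DATE_TRUNC('unit', expr)"]
```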

Cell Conventions

Variable naming

  • Prefix throwaway variables with _ to avoid marimo "multiple definitions" errors:
    _df = mo.sql("SELECT ...", engine=pyoso_db_conn)
  • Named variables shared across cells use descriptive names: df_top_ecosystems, df_monthly_commits

Dependency injection

Cells declare dependencies in their signature. Infrastructure cells live at the bottom of the file — marimo resolves execution order automatically from the dependency graph.

@app.cell(hide_code=True)
def _(mo, pyoso_db_conn, px):   # receives what it needs
    ...

Markdown strings

Use """ not r""". Raw strings leave escape sequences such as \n uninterpreted, which breaks mermaid and SQL blocks.
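The difference is easy to see in plain Python, independent of marimo. The strings below are made-up examples:

```python
regular = """line one\nline two"""
raw = r"""line one\nline two"""

# The regular string turns \n into a real newline, so it spans two lines.
print(regular.splitlines())   # ['line one', 'line two']
# The raw string keeps the backslash and the letter n as two literal characters.
print(raw.splitlines())       # ['line one\\nline two']
```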

No dotenv

Never import or call dotenv. Environment variables are loaded automatically by the server process. The OSO_API_KEY is always available via the environment.

Never call pyoso.Client() directly

Always use pyoso_db_conn from setup_pyoso. Never do conn = pyoso.Client() in other cells.


Notebook Structure

DDP data model notebooks follow this order top-to-bottom in the rendered view. Infrastructure cells (helpers, imports, setup) go at the bottom of the file.

Content cells (rendered order)

  1. Title — # Model Name + one-line intro + preview SQL snippet
  2. Overview — What the model is and why it matters
  3. Key concepts — Model-specific context (hierarchy, ID systems, data sources)
  4. Data lineage — Mermaid diagram showing transformation flow
  5. Key fields (if applicable) — Field documentation
  6. Model previews — Accordion previews using render_table_preview
  7. Best Practices — Table with Goal / Recommended Approach / Why columns
  8. Live data exploration — Stat widgets + charts with executed queries
  9. Sample queries — Markdown + executed pairs (see pattern below)
  10. Methodology / Edge cases (if applicable)

Infrastructure cells (bottom of file)

  1. Helper utilities — render_table_preview and supporting functions
  2. Imports — plotly.express, pandas if needed
  3. setup_pyoso — Always last

Queries

Sample query pattern

Present each query as a markdown + executed pair — two cells, both with hide_code=True:

@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    ### 1. Query Title

    Description of what the query does.

    ```sql
    SELECT column1, column2
    FROM oso.model_name
    WHERE condition
    LIMIT 10
    ```
    """)
    return


@app.cell(hide_code=True)
def _(mo, pyoso_db_conn):
    _df = mo.sql(
        """
        SELECT column1, column2
        FROM oso.model_name
        WHERE condition
        LIMIT 10
        """,
        engine=pyoso_db_conn
    )
    return

Live data queries

  • Use narrow date ranges (7–30 days) to keep execution fast
  • Use CURRENT_DATE - INTERVAL 'N' DAY for relative date filters, where N is the window length in days (e.g., INTERVAL '30' DAY)
  • Convert dates with pd.to_datetime() for plotly
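The first two points can be combined into a small predicate builder. This is a sketch under assumptions: `recent_days_filter` is a hypothetical helper, and `bucket_day` in the usage line is a placeholder column name, not one confirmed for any model.

```python
def recent_days_filter(column: str, days: int) -> str:
    """Build a Trino predicate keeping rows from the last `days` days."""
    if not 7 <= days <= 30:
        raise ValueError("keep live-data windows between 7 and 30 days")
    if not column.replace("_", "").isalnum():
        raise ValueError(f"unexpected column name: {column!r}")
    # Trino interval literals quote the number: INTERVAL '30' DAY.
    return f"{column} >= CURRENT_DATE - INTERVAL '{days}' DAY"


print(recent_days_filter("bucket_day", 30))
# bucket_day >= CURRENT_DATE - INTERVAL '30' DAY
```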

Visualizations

Imports cell

@app.cell(hide_code=True)
def imports():
    import pandas as pd
    import plotly.express as px
    return (pd, px)

Chart styling

_fig.update_layout(
    template='plotly_white',
    margin=dict(t=20, l=0, r=0, b=0),
    height=400,
    hovermode='x unified'
)
_fig.update_xaxes(title='', showgrid=False, linecolor="#000", linewidth=1)
_fig.update_yaxes(title='Y-Axis Label', showgrid=True, gridcolor="#E5E5E5", linecolor="#000", linewidth=1)

Additions by chart type:

  • Horizontal bar: yaxis=dict(categoryorder='total ascending')
  • Line/area with dates: xaxis=dict(tickformat="%b %Y")
  • Multi-series legend: legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1, title_text='')
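One way to keep these settings consistent across notebooks is to hold them as plain dictionaries and merge per chart. The sketch below needs no plotly import; `layout_for` and the chart kind keys ("hbar", "dated_line", "multi_series") are hypothetical names chosen for this example.

```python
import copy

# Base layout from the chart styling block above.
BASE_LAYOUT = {
    "template": "plotly_white",
    "margin": dict(t=20, l=0, r=0, b=0),
    "height": 400,
    "hovermode": "x unified",
}

# Per-chart-type additions, keyed by hypothetical chart kind names.
ADDITIONS = {
    "hbar": {"yaxis": dict(categoryorder="total ascending")},
    "dated_line": {"xaxis": dict(tickformat="%b %Y")},
    "multi_series": {
        "legend": dict(orientation="h", yanchor="bottom", y=1.02,
                       xanchor="right", x=1, title_text="")
    },
}


def layout_for(*kinds: str) -> dict:
    """Merge the base layout with the additions for each requested chart kind."""
    layout = copy.deepcopy(BASE_LAYOUT)
    for kind in kinds:
        layout.update(copy.deepcopy(ADDITIONS[kind]))
    return layout
```

In a chart cell the result could then be applied with _fig.update_layout(**layout_for("dated_line", "multi_series")).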

Color palette

  • Primary: #4C78A8 (blue)
  • Secondary: #F58518 (orange)
  • Tertiary: #72B7B2 (teal)
  • Area fill: rgba(76, 120, 168, 0.2)
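The area fill is the primary blue at 20% opacity. A short conversion shows the relationship; `hex_to_rgba` is a hypothetical helper for illustration.

```python
def hex_to_rgba(hex_color: str, alpha: float) -> str:
    """Convert '#RRGGBB' to an 'rgba(r, g, b, a)' string."""
    value = hex_color.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected #RRGGBB, got {hex_color!r}")
    # Read each two-character hex pair as one colour channel.
    r, g, b = (int(value[i:i + 2], 16) for i in (0, 2, 4))
    return f"rgba({r}, {g}, {b}, {alpha})"


print(hex_to_rgba("#4C78A8", 0.2))  # rgba(76, 120, 168, 0.2)
```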

Widgets

Stats

mo.hstack([
    mo.stat(label="Label", value=f"{value:,}", bordered=True, caption="Context"),
], widths="equal", gap=1)

  • Always bordered=True with a caption
  • Place stats above charts: mo.vstack([stats_row, chart])
  • Typically 3–4 stats per row

Data lineage diagrams

Use mo.mermaid(), not ASCII art:

mo.mermaid("""
graph TD
    A[source_model<br/>Description] --> B[final_model<br/>Description]
""")

Markdown

Use mo.md() for prose. Do not use mo.callout().


Helper Utilities Cell

Include this cell in all data model notebooks (at the bottom of the file):

@app.cell(hide_code=True)
def _(mo, pyoso_db_conn):
    def get_model_preview(model_name, limit=5):
        return mo.sql(f"SELECT * FROM {model_name} LIMIT {limit}",
                      engine=pyoso_db_conn, output=False)

    def get_row_count(model_name):
        result = mo.sql(f"SHOW STATS FOR {model_name}",
                        engine=pyoso_db_conn, output=False)
        return result['row_count'].sum()

    def generate_sql_snippet(model_name, df_results, limit=5):
        column_names = df_results.columns.tolist()
        columns_formatted = ',\n  '.join(column_names)
        return mo.md(f"```sql\nSELECT\n  {columns_formatted}\nFROM {model_name}\nLIMIT {limit}\n```\n")

    def render_table_preview(model_name):
        df = get_model_preview(model_name)
        if df.empty:
            return mo.md(f"**{model_name}**\n\nUnable to retrieve preview.")
        sql_snippet = generate_sql_snippet(model_name, df, limit=5)
        fmt = {c: '{:.0f}' for c in df.columns if df[c].dtype == 'int64' and ('_id' in c or c == 'id')}
        table = mo.ui.table(df, format_mapping=fmt, show_column_summaries=False, show_data_types=False)
        row_count = get_row_count(model_name)
        col_count = len(df.columns)
        title = f"{model_name} | {row_count:,.0f} rows, {col_count} cols"
        return mo.accordion({title: mo.vstack([sql_snippet, table])})

    return (render_table_preview,)

Data Sources Reference

Open Dev Data (Ecosystem models)

  • oso.stg_opendevdata__ecosystems — Ecosystem definitions (name, is_crypto, is_chain, is_category)
  • oso.stg_opendevdata__ecosystems_child_ecosystems — Parent-child hierarchy links
  • oso.stg_opendevdata__ecosystems_repos — Direct repo → ecosystem mapping
  • oso.stg_opendevdata__ecosystems_repos_recursive — Recursive mapping with distance + path

Repository bridge

  • oso.int_opendevdata__repositories_with_repo_id — Maps ODD repos to GitHub integer IDs (canonical bridge for cross-source joins)

Developer models

  • oso.int_ddp__developers — Unified developer list (ODD + GHA), keyed by user_id
  • oso.stg_opendevdata__developers — ODD developers with GraphQL IDs

Commit models

  • oso.int_ddp__commits_unified — Combined ODD + GHA commits (not deduped)
  • oso.int_ddp__commits_deduped — Deduplicated unified commits
  • oso.stg_opendevdata__commits — Raw ODD commits with identity resolution

GitHub Archive (event models)

  • oso.stg_github__events — Raw events with nested fields
  • oso.int_gharchive__github_events — Standardized events (canonical entrypoint)
  • oso.int_ddp_github_events — Curated subset of event types
  • oso.int_ddp_github_events_daily — Daily aggregation with normalized types
  • oso.int_gharchive__developer_activities — Daily rollup for MAD metrics

Timeseries models

  • oso.stg_opendevdata__repo_developer_28d_activities — 28-day rolling activity (day × repo × developer)
  • oso.stg_opendevdata__eco_mads — Pre-calculated ecosystem MAD counts (day × ecosystem)

ID mapping

  • oso.int_github__node_id_map — Decode GitHub GraphQL Node IDs to REST integer IDs

DDP curated event types

PushEvent, PullRequestEvent, PullRequestReviewEvent, PullRequestReviewCommentEvent, IssuesEvent, WatchEvent, ForkEvent
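When a query needs to restrict raw events to this curated set, the list can be kept in one place and rendered as a predicate. This is a sketch only: `event_type_filter` is a hypothetical helper, and the default column name `event_type` is an assumption that should be checked against the model being queried.

```python
DDP_EVENT_TYPES = (
    "PushEvent",
    "PullRequestEvent",
    "PullRequestReviewEvent",
    "PullRequestReviewCommentEvent",
    "IssuesEvent",
    "WatchEvent",
    "ForkEvent",
)


def event_type_filter(column: str = "event_type") -> str:
    """Build a Trino IN predicate over the curated event types."""
    # Single quotes for string literals, per the Trino SQL rules in this guide.
    quoted = ", ".join(f"'{name}'" for name in DDP_EVENT_TYPES)
    return f"{column} IN ({quoted})"


print(event_type_filter())
```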


Data Quality Notes

When writing or documenting model notebooks, be explicit about:

  • Freshness: GitHub Archive is ~3 days behind real-time
  • Completeness: Public GitHub timeline only — no private repos, no deleted events
  • Identity: actor_id is a GitHub REST ID; canonical_developer_id is from Open Dev Data — don't conflate them
  • Join keys: Use repo_id (integer) for cross-source joins, not repo names
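The freshness note can be turned into a date cutoff when choosing the end of a live-data window. A minimal sketch, treating the lag of about 3 days as a fixed constant; `latest_reliable_day` is a hypothetical helper.

```python
from datetime import date, timedelta

# Approximate GitHub Archive lag stated in the notes above.
GHARCHIVE_LAG_DAYS = 3


def latest_reliable_day(today: date) -> date:
    """Return the most recent day expected to be fully loaded."""
    return today - timedelta(days=GHARCHIVE_LAG_DAYS)


print(latest_reliable_day(date(2025, 1, 10)))  # 2025-01-07
```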