Conversation
Greptile Summary

This PR adds a dev note documenting the enterprise text-to-SQL SDG pipeline used to generate training data for Nemotron Super v3, along with the accompanying runnable recipe (`docs/assets/recipes/code_generation/enterprise_text_to_sql.py`).
| Filename | Overview |
|---|---|
| docs/devnotes/posts/text-to-sql.md | New dev note documenting the text-to-SQL SDG pipeline; stale companion-file reference (prompts.py, rubrics.py) in the disclaimer note; earlier rounds of review addressed most other issues (dialect validator, unused import, score-zero handling, Window Functions taxonomy, model alias registration). |
| docs/assets/recipes/code_generation/enterprise_text_to_sql.py | Complete, self-contained recipe file; score extraction loop, prompt templates, rubrics, and the `is not none` guard on score columns all look correct. No new issues found. |
| docs/devnotes/.authors.yml | Three new authors (dnathawani, ymeyer, mvansegbroeck) added correctly; entries follow existing schema with name, description, and avatar URL. |
| mkdocs.yml | New nav entry for the text-to-SQL dev note added at the top of the Dev Notes section (most-recent-first order), consistent with the existing pattern. |
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    S1["Stage 1: Seeding & Diversification\n(industry×topic, complexity×concept,\ninstruction style, dialect)"]
    S2["Stage 2: Prompt Generation\n(LLM — natural-language request,\nno SQL jargon)"]
    S3["Stage 3: Schema + Data Generation\n(LLM — DDL + INSERTs,\ndistractor tables/columns, dirty data)"]
    S4["Stage 4: SQL Generation\n(LLM — dialect-specific SQL,\ncleans dirty data, ignores distractors)"]
    S5A["Syntax Validator\n(SQLFluff per dialect)"]
    S5B["5 LLM Judges\n(Prompt · SQL · Context ·\nData Quality · Knowledge)"]
    OUT["Output: 15 flat score columns\nfor downstream filtering"]
    S1 --> S2 --> S3 --> S4 --> S5A & S5B --> OUT
```
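The Stage 5a syntax gate in the flowchart uses SQLFluff per dialect. As a rough stdlib-only stand-in for a single dialect, SQLite can compile a statement via `EXPLAIN` without executing it (the `orders` schema and function name below are illustrative, not from the recipe):

```python
import sqlite3

def is_valid_sqlite(sql: str) -> bool:
    """Return True if `sql` compiles under SQLite (syntax check only, no execution)."""
    conn = sqlite3.connect(":memory:")
    try:
        # Minimal schema so references to `orders` resolve; EXPLAIN compiles
        # the statement without running it, surfacing syntax errors.
        conn.execute("CREATE TABLE orders (name TEXT, total REAL)")
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

print(is_valid_sqlite("SELECT name, SUM(total) FROM orders GROUP BY name"))  # True
print(is_valid_sqlite("SELEC name FRM orders"))                              # False
```

The production pipeline lints each dialect with its matching SQLFluff dialect config; this sketch only covers the SQLite case.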
Prompt To Fix All With AI
This is a comment left during a code review.
Path: docs/devnotes/posts/text-to-sql.md
Line: 128
Comment:
**Stale companion-file references**
The disclaimer names `prompts.py` and `rubrics.py` as companion files, but neither exists in the repository. All prompts (`PROMPT_GEN_TEXT`, `SCHEMA_GEN_PROMPTS`, `SQL_GEN_PROMPTS`, etc.) and scoring rubrics (`SQL_SCORES`, `PROMPT_SCORES`, `DATA_QUALITY_SCORES`, etc.) are defined inline in `enterprise_text_to_sql.py`. A reader who follows this note to look for those files will come up empty-handed.
```suggestion
The code blocks below show the key configuration patterns for each pipeline stage. Model aliases (`prompt_gen`, `context_gen`, etc.) and prompt/rubric definitions are referenced but not fully defined inline; for the complete, runnable pipeline — including all prompt templates and scoring rubrics — see the [Enterprise Text-to-SQL Recipe](../../recipes/code_generation/enterprise_text_to_sql/).
```
How can I resolve this? If you propose a fix, please make it concise.

Reviews (16): Last reviewed commit: "Merge branch 'dhruv/devnotes/text-to-sql..."
Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
PR feedback fixes:

- Fix Window Functions contradiction: Key Takeaway #1 now uses "Geospatial SQL" (Advanced) instead of "Window Functions" (Intermediate)
- Fix score-0 truthiness bug: use `is not none` instead of truthy check in Jinja2 expression columns (inline example + production pipeline)
- Soften Code Sandbox language: "A natural next step would be..." instead of "We are actively implementing..."
- Cut Gretel reference per mvansegbroeck: replaced with NVIDIA/Nemotron team description
- Replace Qwen model references with Nemotron per mvansegbroeck: MODEL_NAME, ASCII diagram labels, Pipeline Overview prose
- Rename sdg_qwen_235b.py -> sdg_ndd_text2sql.py per mvansegbroeck
- Fix Try It Yourself: use MODEL_ALIAS = "nvidia-text" with default provider pattern (matches structured-outputs dev note), remove unused explicit ModelConfig
- Remove placeholder dataset link (#), add "Dataset: Internal" note

New content:

- Add BIRD Benchmark Results section with bar chart (JPG), data table, BIRD caveat paragraph, and Jocelyn Huang acknowledgement (Nemotron Super EX: 26.77% -> 41.80%, +15 pts, beats GPT-OSS-120B)
- Replace "Looking Ahead: Code Sandbox" with broader "Next Steps": Code Sandbox, RL on BIRD via NeMo Gym, schema representation, Spider 2.0
- Add Project Summary table at end of post
- Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1 to match the exact taxonomy string in the code example (greptile)
- Add admonition clarifying code snippets are illustrative, not runnable, with link to Enterprise Text-to-SQL Recipe (nabinchha)
- Add context before score extraction snippet referencing the five LLMJudgeColumnConfig columns and linking to full recipe (nabinchha)
- Add companion file note and recipe link to production pipeline details block for prompts.py, rubrics.py, text2sql_seed.json (nabinchha)
… recipe

- Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1 to match the exact taxonomy string in the code example (greptile)
- Add admonition clarifying inline code snippets are illustrative, with link to runnable Enterprise Text-to-SQL Recipe (nabinchha)
- Add context before score extraction snippet referencing the five LLMJudgeColumnConfig columns and linking to full recipe (nabinchha)
- Replace production pipeline <details> block (230 lines with phantom imports from prompts.py, rubrics.py, text2sql_seed.json) with snippet include of enterprise_text_to_sql.py recipe — self-contained and runnable, consistent with other merged dev notes (nabinchha)
- Wrap minimal inline example in collapsible <details> dropdown
- Rename "A Team Effort" section to "Summary"
- Remove redundant Scale/Dialects/Dataset line
The Step 3/4 prompt templates reference {{ sql_dialect }} but the
Step 1 seeding code never defined it, leaving an unresolved Jinja2
variable for readers following along. Add the sql_dialect sampler
with a comment explaining the pipeline runs once per dialect.
Made-with: Cursor
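The gap this commit describes is easy to reproduce with plain Jinja2 (a stand-in sketch, not the Data Designer sampler API; the template text, variable names, and dialect list are illustrative):

```python
import random
from jinja2 import Environment, StrictUndefined

env = Environment(undefined=StrictUndefined)
template = env.from_string("Write a {{ sql_dialect }} query for: {{ request }}")

# Without a seeded sql_dialect, rendering fails -- the unresolved-variable
# problem readers would hit following the original Step 1 snippet.
try:
    template.render(request="top customers by revenue")
except Exception as e:
    print("unresolved:", type(e).__name__)

# A sampler stand-in: pick one dialect per pipeline run.
sql_dialect = random.choice(["postgresql", "mysql", "sqlite"])
print(template.render(request="top customers by revenue", sql_dialect=sql_dialect))
```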
- Remove specific "60-70%" BIRD claim from intro to avoid contradiction with the 41.80%/38.25% direct-generation results shown later (those higher figures come from specialized systems with schema linking) - Reword MySQL "forbids" to "prompts exclude" -- REGEXP_REPLACE and CONVERT_TZ are valid MySQL functions; the pipeline excluded them for portability, not because the dialect forbids them
Docs preview: https://8eefda37.dd-docs-preview.pages.dev
the new dev note isn't listed in the `mkdocs.yml` nav
```python
category="sql_complexity",
values={
    "Beginner": ["Basic SELECT Statements", "WHERE Clauses", "Simple Aggregations", ...],
    "Intermediate": ["Window Functions", "Recursive CTEs", "Correlated Subqueries", ...],
```
"Recursive CTEs" is listed under Intermediate here, but the recipe included lower on the page via --8<-- (line 240 of enterprise_text_to_sql.py) puts it under Advanced. your own intro also frames recursive CTEs as a senior-engineer concept. maybe swap it with "CASE Expressions" to match the recipe?
Good catch, changed to match recipe
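For reference, the tiering after the swap, as described in this exchange (an illustrative subset; the recipe's full per-tier lists are elided):

```python
# Illustrative subset of the sql_complexity taxonomy after the fix:
# Recursive CTEs move to Advanced, CASE Expressions to Intermediate.
sql_complexity = {
    "Beginner": ["Basic SELECT Statements", "WHERE Clauses", "Simple Aggregations"],
    "Intermediate": ["Window Functions", "CASE Expressions", "Correlated Subqueries"],
    "Advanced": ["Recursive CTEs", "Geospatial SQL"],
}
```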
this still uses a truthy check for score extraction, but the inline snippet in the dev note (line 378) was fixed to use `is not none`
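The truthy-vs-`is not none` difference is easy to demonstrate with plain Jinja2 (field names and the `MISSING` placeholder are illustrative, not from the recipe):

```python
from jinja2 import Environment

env = Environment()
# Truthy check: a legitimate score of 0 is falsy, so it gets replaced.
truthy = env.from_string("{{ row.score if row.score else 'MISSING' }}")
# 'is not none' test: only actual nulls are replaced; 0 survives.
safe = env.from_string("{{ row.score if row.score is not none else 'MISSING' }}")

print(truthy.render(row={"score": 0}))   # MISSING  (bug: valid 0 lost)
print(safe.render(row={"score": 0}))     # 0
print(safe.render(row={"score": None}))  # MISSING
```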
4. **Distractor tables teach schema linking.** Injecting semantically similar but irrelevant tables forces the model to *read* the schema instead of guessing from table names. This is the skill gap between academic benchmarks and production.
5. **Per-dialect generation avoids lowest-common-denominator SQL.** Rather than generating ANSI SQL and hoping it works everywhere, the pipeline produces dialect-specific schemas and queries with appropriate syntax (`strftime` vs `DATE_SUB` vs `interval`, `REPLACE()` vs `regexp_replace`). Each dialect gets its own tailored prompts, validators, and judge prompts.
nit: REPLACE() is available in all three dialects (SQLite, MySQL, PostgreSQL) - it's not really a dialect-specific function. the strftime vs DATE_SUB vs interval example right before it already makes the point well. maybe just drop the REPLACE() vs regexp_replace part? this was flagged in a prior review too.
Good point, changed
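To make the per-dialect point in takeaway 5 concrete, here is the same 30-day recency filter spelled for each dialect (an illustrative sketch; the table and column names are not from the recipe):

```python
# Same intent, three dialect-specific spellings of date arithmetic.
last_30_days = {
    "sqlite":     "WHERE order_date >= datetime('now', '-30 days')",
    "mysql":      "WHERE order_date >= DATE_SUB(NOW(), INTERVAL 30 DAY)",
    "postgresql": "WHERE order_date >= now() - interval '30 days'",
}
for dialect, clause in last_30_days.items():
    print(f"{dialect}: {clause}")
```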
ran slop-guard on the post - scored 23/100, which is in the same range as some existing dev notes (structured-outputs is 29, design-principles is 21), so not blocking. a couple easy wins if you want to tighten the prose: "The key insight:" on line 33 could just be cut (state the insight directly), "performance drops significantly" on line 23 could use the actual number instead, and the "X, not Y" contrast pattern shows up 5 times ("feature, not a bug", "reasoning, not just syntax", "hours, not months", etc.) - effective in moderation but gets repetitive across the post.
| SQL complexity | 3 tiers | 89 concepts | Difficulty level (Beginner → Advanced) |
| SQL task type | 12 categories | 94 concepts | What the query does (analytics, transformation, ...) |
| Data quality | 5 challenges | 12 concepts | Dirty data to inject and clean |
| Knowledge dependency | 3 categories | 8 concepts | Implicit reasoning required |
the taxonomy table says "3 categories | 8 concepts" for knowledge dependency, but the recipe included on this page defines 3 categories x 3 subcategories = 9 concepts. small thing, but the demo recipe shouldn't exceed the stated total.
I specifically remember putting 9 LOL, thanks!
- Add devnote to mkdocs nav after Async All the Way Down
- Swap Recursive CTEs to Advanced, CASE Expressions to Intermediate (matches recipe)
- Fix score extraction truthy check to use 'is not none' (preserves score-0 values)
- Drop REPLACE() vs regexp_replace from dialect takeaway (REPLACE is cross-dialect)
- Tighten prose: remove 'The key insight:', use actual BIRD number, trim X-not-Y
- Fix knowledge dependency count: 8 -> 9 concepts (3x3 in recipe)
…A-NeMo/DataDesigner into dhruv/devnotes/text-to-sql
* docs: add text-to-sql devnote
* add diagram, update content
* correct inconsistencies
* docs: address PR NVIDIA-NeMo#349 feedback and add BIRD benchmark results
* docs: address second round of PR NVIDIA-NeMo#349 feedback
* docs: address round 2 PR NVIDIA-NeMo#349 feedback, replace production block with recipe
* docs: polish Try It Yourself and Summary sections
* docs: add missing sql_dialect sampler to Step 1 code snippet
* fix ascii diagram
* docs: fix BIRD score framing and MySQL dialect wording
* docs: move text-to-sql images to assets/ convention and update refs
* docs: address text-to-sql devnote review comments

---------

Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
Co-authored-by: Yev Meyer <ymeyer@nvidia.com>
Summary
Add a dev note documenting the enterprise-grade text-to-SQL SDG pipeline used to generate training data for Nemotron's SQL capabilities across PostgreSQL, MySQL, and SQLite.
What's in the post