docs: add text-to-sql dev note #349

Merged
dhruvnathawani merged 20 commits into main from dhruv/devnotes/text-to-sql
Apr 14, 2026

Conversation

@dhruvnathawani
Contributor

@dhruvnathawani dhruvnathawani commented Feb 23, 2026

Summary

Add a dev note documenting the enterprise-grade text-to-SQL SDG pipeline used to generate training data for Nemotron's SQL capabilities across PostgreSQL, MySQL, and SQLite.

What's in the post

  • Motivation: The "real-world gap" between academic benchmarks (Spider >85%) and production SQL (Spider 2.0 Lite <50%) — dialect specificity, dirty data, distractor tables, industry-specific schemas, complexity gradients
  • Pipeline walkthrough: Conditional samplers (60 industries, 700 topics, 89 SQL concepts) → three-stage LLM generation (natural language prompt → database context with dirty data + distractors → SQL with chain-of-thought reasoning) → quality waterfall (SQLFluff syntax validation + 4-dimension LLM judge scoring)
  • ASCII pipeline diagram showing the 4-stage flow (Conditional Samplers → Three-Stage LLM Generation → Quality Waterfall → Output)
  • Deep dive into SubcategorySamplerParams for two-level conditional sampling (industry→topic, complexity→sql_concept)
  • Quality waterfall breakdown: 300k generated → ~180k after SQLFluff → 96.5k after judge filtering (68% total rejection)
  • Rich metadata table (industry, topic, complexity, concept, dialect, instruction style, 4 judge scores) enabling precision filtering for downstream training
  • Discussion of chain-of-thought reasoning traces teaching models to think like Data Engineers
  • Results: 96.5k filtered records, 3 SQL dialects, 60 industries, 700 topics, 89 SQL concepts, 100% syntax-verified, all judge dimensions ≥ 3/4
  • 7 key takeaways covering conditional sampling, three-stage generation, dirty data, distractor tables, hard validators, multi-dimension scoring and CoT reasoning
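
The quality waterfall figures quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the counts from the summary; the ~180k post-SQLFluff figure is approximate, so the per-stage rates are approximate too, and the function name is made up for illustration.

```python
# Record counts quoted in the PR summary (the ~180k figure is approximate).
generated = 300_000
after_sqlfluff = 180_000
after_judges = 96_500


def rejection_rate(before: int, after: int) -> float:
    """Fraction of records dropped between two pipeline stages."""
    return 1 - after / before


syntax_rejection = rejection_rate(generated, after_sqlfluff)
judge_rejection = rejection_rate(after_sqlfluff, after_judges)
total_rejection = rejection_rate(generated, after_judges)

print(f"SQLFluff stage: {syntax_rejection:.1%} rejected")
print(f"Judge stage:    {judge_rejection:.1%} rejected")
print(f"End to end:     {total_rejection:.1%} rejected")
```

The end-to-end rate rounds to the 68% stated in the summary, with the judge stage rejecting a somewhat larger share of its input than the syntax stage does.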

Files changed

  • docs/devnotes/posts/text-to-sql.md (new)

@nabinchha nabinchha requested a review from 3mei February 26, 2026 00:11
@dhruvnathawani dhruvnathawani marked this pull request as ready for review March 9, 2026 22:01
@dhruvnathawani dhruvnathawani requested review from a team as code owners March 9, 2026 22:01
@greptile-apps
Contributor

greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR adds a dev note documenting the enterprise text-to-SQL SDG pipeline used to generate training data for Nemotron Super v3, along with the accompanying runnable recipe (`enterprise_text_to_sql.py`), benchmark result images, and author/nav entries. Most issues raised in earlier review rounds have been addressed: the score-zero guard now uses `is not none`, the dialect validator uses `SQL_SQLITE`, the `sql_dialect` seeder column is shown in Step 1, the Key Takeaway taxonomy examples are corrected, and the MySQL function restrictions are clarified.

Confidence Score: 5/5

  • Safe to merge — only one minor P2 documentation nit remains; no logic or correctness issues.
  • The PR is documentation-only. Previous review rounds addressed all P0/P1 findings (score-zero handling, dialect validator, unused import, model alias, taxonomy contradictions). The single remaining finding is a P2 stale reference to non-existent companion files in the disclaimer note.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/devnotes/posts/text-to-sql.md | New dev note documenting the text-to-SQL SDG pipeline; stale companion-file reference (`prompts.py`, `rubrics.py`) in the disclaimer note; earlier rounds of review addressed most other issues (dialect validator, unused import, score-zero handling, Window Functions taxonomy, model alias registration). |
| docs/assets/recipes/code_generation/enterprise_text_to_sql.py | Complete, self-contained recipe file; score extraction loop, prompt templates, rubrics, and `is not none` guard on score columns all look correct. No new issues found. |
| docs/devnotes/.authors.yml | Three new authors (dnathawani, ymeyer, mvansegbroeck) added correctly; entries follow existing schema with name, description, and avatar URL. |
| mkdocs.yml | New nav entry for the text-to-SQL dev note added at the top of the Dev Notes section (most-recent-first order), consistent with the existing pattern. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    S1["Stage 1: Seeding & Diversification\n(industry×topic, complexity×concept,\ninstruction style, dialect)"]
    S2["Stage 2: Prompt Generation\n(LLM — natural-language request,\nno SQL jargon)"]
    S3["Stage 3: Schema + Data Generation\n(LLM — DDL + INSERTs,\ndistractor tables/columns, dirty data)"]
    S4["Stage 4: SQL Generation\n(LLM — dialect-specific SQL,\ncleans dirty data, ignores distractors)"]
    S5A["Syntax Validator\n(SQLFluff per dialect)"]
    S5B["5 LLM Judges\n(Prompt · SQL · Context ·\nData Quality · Knowledge)"]
    OUT["Output: 15 flat score columns\nfor downstream filtering"]

    S1 --> S2 --> S3 --> S4 --> S5A & S5B --> OUT
Prompt To Fix All With AI
This is a comment left during a code review.
Path: docs/devnotes/posts/text-to-sql.md
Line: 128

Comment:
**Stale companion-file references**

The disclaimer names `prompts.py` and `rubrics.py` as companion files, but neither exists in the repository. All prompts (`PROMPT_GEN_TEXT`, `SCHEMA_GEN_PROMPTS`, `SQL_GEN_PROMPTS`, etc.) and scoring rubrics (`SQL_SCORES`, `PROMPT_SCORES`, `DATA_QUALITY_SCORES`, etc.) are defined inline in `enterprise_text_to_sql.py`. A reader who follows this note to look for those files will come up empty-handed.

```suggestion
    The code blocks below show the key configuration patterns for each pipeline stage. Model aliases (`prompt_gen`, `context_gen`, etc.) and prompt/rubric definitions are referenced but not fully defined inline; for the complete, runnable pipeline — including all prompt templates and scoring rubrics — see the [Enterprise Text-to-SQL Recipe](../../recipes/code_generation/enterprise_text_to_sql/).
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (16): Last reviewed commit: "Merge branch 'dhruv/devnotes/text-to-sql..." | Re-trigger Greptile

Comment thread docs/devnotes/posts/text-to-sql.md Outdated
Comment thread docs/devnotes/posts/text-to-sql.md Outdated
Comment thread docs/devnotes/posts/text-to-sql.md Outdated
Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
Comment thread docs/devnotes/posts/text-to-sql.md
Comment thread docs/devnotes/posts/text-to-sql.md Outdated
Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
Comment thread docs/devnotes/posts/text-to-sql.md Outdated
Comment thread docs/devnotes/posts/text-to-sql.md Outdated
Comment thread docs/devnotes/posts/text-to-sql.md Outdated
Comment thread docs/devnotes/posts/text-to-sql.md Outdated
Comment thread docs/devnotes/posts/text-to-sql.md Outdated
Comment thread docs/devnotes/posts/text-to-sql.md
PR feedback fixes:
- Fix Window Functions contradiction: Key Takeaway #1 now uses
  "Geospatial SQL" (Advanced) instead of "Window Functions" (Intermediate)
- Fix score-0 truthiness bug: use `is not none` instead of truthy check
  in Jinja2 expression columns (inline example + production pipeline)
- Soften Code Sandbox language: "A natural next step would be..." instead
  of "We are actively implementing..."
- Cut Gretel reference per mvansegbroeck: replaced with NVIDIA/Nemotron
  team description
- Replace Qwen model references with Nemotron per mvansegbroeck: MODEL_NAME,
  ASCII diagram labels, Pipeline Overview prose
- Rename sdg_qwen_235b.py -> sdg_ndd_text2sql.py per mvansegbroeck
- Fix Try It Yourself: use MODEL_ALIAS = "nvidia-text" with default
  provider pattern (matches structured-outputs dev note), remove unused
  explicit ModelConfig
- Remove placeholder dataset link (#), add "Dataset: Internal" note
New content:
- Add BIRD Benchmark Results section with bar chart (JPG), data table,
  BIRD caveat paragraph, and Jocelyn Huang acknowledgement
  (Nemotron Super EX: 26.77% -> 41.80%, +15 pts, beats GPT-OSS-120B)
- Replace "Looking Ahead: Code Sandbox" with broader "Next Steps":
  Code Sandbox, RL on BIRD via NeMo Gym, schema representation, Spider 2.0
- Add Project Summary table at end of post
Comment thread docs/devnotes/posts/text-to-sql.md Outdated
Comment thread docs/devnotes/posts/text-to-sql.md
Comment thread docs/devnotes/posts/text-to-sql.md
Comment thread docs/devnotes/posts/text-to-sql.md Outdated
dhruvnathawani and others added 3 commits March 11, 2026 11:54
- Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1
  to match the exact taxonomy string in the code example (greptile)
- Add admonition clarifying code snippets are illustrative, not
  runnable, with link to Enterprise Text-to-SQL Recipe (nabinchha)
- Add context before score extraction snippet referencing the five
  LLMJudgeColumnConfig columns and linking to full recipe (nabinchha)
- Add companion file note and recipe link to production pipeline
  details block for prompts.py, rubrics.py, text2sql_seed.json (nabinchha)
… recipe

- Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1
  to match the exact taxonomy string in the code example (greptile)
- Add admonition clarifying inline code snippets are illustrative,
  with link to runnable Enterprise Text-to-SQL Recipe (nabinchha)
- Add context before score extraction snippet referencing the five
  LLMJudgeColumnConfig columns and linking to full recipe (nabinchha)
- Replace production pipeline <details> block (230 lines with phantom
  imports from prompts.py, rubrics.py, text2sql_seed.json) with
  snippet include of enterprise_text_to_sql.py recipe — self-contained
  and runnable, consistent with other merged dev notes (nabinchha)
- Wrap minimal inline example in collapsible <details> dropdown
- Rename "A Team Effort" section to "Summary"
- Remove redundant Scale/Dialects/Dataset line
Comment thread docs/devnotes/posts/text-to-sql.md Outdated
Comment thread docs/devnotes/posts/text-to-sql.md Outdated
The Step 3/4 prompt templates reference {{ sql_dialect }} but the
Step 1 seeding code never defined it, leaving an unresolved Jinja2
variable for readers following along. Add the sql_dialect sampler
with a comment explaining the pipeline runs once per dialect.

Made-with: Cursor
- Remove specific "60-70%" BIRD claim from intro to avoid contradiction
  with the 41.80%/38.25% direct-generation results shown later (those
  higher figures come from specialized systems with schema linking)
- Reword MySQL "forbids" to "prompts exclude" -- REGEXP_REPLACE and
  CONVERT_TZ are valid MySQL functions; the pipeline excluded them for
  portability, not because the dialect forbids them
Comment thread docs/devnotes/posts/text-to-sql.md
nabinchha
nabinchha previously approved these changes Mar 12, 2026
Comment thread docs/devnotes/posts/text-to-sql.md
@github-actions
Contributor

github-actions bot commented Apr 13, 2026

Docs preview: https://8eefda37.dd-docs-preview.pages.dev

Notebook tutorials are placeholder-only in previews.

mvansegbroeck
mvansegbroeck previously approved these changes Apr 13, 2026
@andreatgretel
Contributor

mkdocs.yml:75-85

the new dev note isn't listed in the nav section of mkdocs.yml - all the other dev notes have entries there (lines 78-85). it'll still show up on the blog index via auto-discovery, but it won't appear in the sidebar. should go after "Async All the Way Down" since it's dated 2026-03-11.

Comment thread docs/devnotes/posts/text-to-sql.md Outdated
category="sql_complexity",
values={
"Beginner": ["Basic SELECT Statements", "WHERE Clauses", "Simple Aggregations", ...],
"Intermediate": ["Window Functions", "Recursive CTEs", "Correlated Subqueries", ...],
Contributor

"Recursive CTEs" is listed under Intermediate here, but the recipe included lower on the page via --8<-- (line 240 of enterprise_text_to_sql.py) puts it under Advanced. your own intro also frames recursive CTEs as a senior-engineer concept. maybe swap it with "CASE Expressions" to match the recipe?

Contributor Author

Good catch, changed to match recipe
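
For readers following this thread, the two-level idea (pick a complexity tier, then pick a concept only from that tier) can be sketched in plain Python. This is an illustration, not the Data Designer `SubcategorySamplerParams` API: the function name is hypothetical, and the concept lists are short subsets using only names mentioned in this PR, arranged to match the recipe after the fix.

```python
import random

# Subset of the complexity -> concept taxonomy, using names from this PR.
# The full recipe lists more concepts per tier.
SQL_CONCEPTS = {
    "Beginner": ["Basic SELECT Statements", "WHERE Clauses", "Simple Aggregations"],
    "Intermediate": ["Window Functions", "CASE Expressions", "Correlated Subqueries"],
    "Advanced": ["Recursive CTEs", "Geospatial SQL"],
}


def sample_complexity_and_concept(rng: random.Random) -> tuple[str, str]:
    """Sample a tier first, then a concept conditioned on that tier."""
    complexity = rng.choice(sorted(SQL_CONCEPTS))
    concept = rng.choice(SQL_CONCEPTS[complexity])
    return complexity, concept


rng = random.Random(0)
for _ in range(3):
    print(sample_complexity_and_concept(rng))
```

Because the concept is drawn from the list keyed by the sampled tier, a pairing such as ("Intermediate", "Recursive CTEs") cannot occur once the taxonomy itself is consistent, which is why the placement in the snippet matters.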

@andreatgretel
Contributor

docs/assets/recipes/code_generation/enterprise_text_to_sql.py:546

expr=f"{{{{ {judge_name}.{rubric}.score if {judge_name}.{rubric}.score else ' ' }}}}"

this still uses a truthy check for score extraction, but the inline snippet in the dev note (line 378) was fixed to use is not none. since both appear on the same rendered page via the --8<-- include, readers see contradictory patterns. the truthy version also silently drops legitimate score-0 values. Claude Code caught this one.
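
The difference between the two guards is easy to reproduce outside the template. Jinja2 conditionals follow Python truthiness, so a plain Python analog shows the same behavior; the helper names below are hypothetical and exist only for this sketch.

```python
def extract_truthy(score):
    """Analog of `score if score else ' '`: any falsy value becomes a blank."""
    return score if score else " "


def extract_not_none(score):
    """Analog of `score if score is not none else ' '`: only None becomes a blank."""
    return score if score is not None else " "


for score in (0, 3, None):
    print(repr(score), repr(extract_truthy(score)), repr(extract_not_none(score)))
```

A judge score of 0 is falsy, so the truthy version replaces it with a blank, while the `is not None` version keeps it and only blanks out a missing score.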

Comment thread docs/devnotes/posts/text-to-sql.md Outdated

4. **Distractor tables teach schema linking.** Injecting semantically similar but irrelevant tables forces the model to *read* the schema instead of guessing from table names. This is the skill gap between academic benchmarks and production.

5. **Per-dialect generation avoids lowest-common-denominator SQL.** Rather than generating ANSI SQL and hoping it works everywhere, the pipeline produces dialect-specific schemas and queries with appropriate syntax (`strftime` vs `DATE_SUB` vs `interval`, `REPLACE()` vs `regexp_replace`). Each dialect gets its own tailored prompts, validators, and judge prompts.
Contributor

nit: REPLACE() is available in all three dialects (SQLite, MySQL, PostgreSQL) - it's not really a dialect-specific function. the strftime vs DATE_SUB vs interval example right before it already makes the point well. maybe just drop the REPLACE() vs regexp_replace part? this was flagged in a prior review too.

Contributor Author

Good point, changed
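
To make the distinction in this thread concrete, here is a small sketch of how "30 days before today" is written in each dialect, next to a `REPLACE()` call. Only the SQLite statement is executed, via the standard-library `sqlite3` module; the MySQL and PostgreSQL strings are included for comparison and are not run. The dictionary and function names are made up for this example.

```python
import sqlite3

# "30 days before today" per dialect. Only the SQLite form is executed below.
THIRTY_DAYS_AGO = {
    "sqlite": "date('now', '-30 days')",
    "mysql": "DATE_SUB(CURDATE(), INTERVAL 30 DAY)",
    "postgresql": "CURRENT_DATE - INTERVAL '30 days'",
}


def run_sqlite_demo() -> tuple:
    """Evaluate SQLite date arithmetic and REPLACE() on fixed inputs."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(
            "SELECT date('2026-04-14', '-30 days'), REPLACE('1,200', ',', '')"
        ).fetchone()
    finally:
        conn.close()


print(run_sqlite_demo())
```

The date arithmetic is spelled differently in each dialect, whereas `REPLACE()` with the same three arguments is accepted by all three, which is the reviewer's point about it not being a dialect-specific example.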

@andreatgretel
Contributor

ran slop-guard on the post - scored 23/100, which is in the same range as some existing dev notes (structured-outputs is 29, design-principles is 21), so not blocking. a couple easy wins if you want to tighten the prose: "The key insight:" on line 33 could just be cut (state the insight directly), "performance drops significantly" on line 23 could use the actual number instead, and the "X, not Y" contrast pattern shows up 5 times ("feature, not a bug", "reasoning, not just syntax", "hours, not months", etc.) - effective in moderation but gets repetitive across the post.

Comment thread docs/devnotes/posts/text-to-sql.md Outdated
| SQL complexity | 3 tiers | 89 concepts | Difficulty level (Beginner → Advanced) |
| SQL task type | 12 categories | 94 concepts | What the query does (analytics, transformation, ...) |
| Data quality | 5 challenges | 12 concepts | Dirty data to inject and clean |
| Knowledge dependency | 3 categories | 8 concepts | Implicit reasoning required |
Contributor

the taxonomy table says "3 categories | 8 concepts" for knowledge dependency, but the recipe included on this page defines 3 categories x 3 subcategories = 9 concepts. small thing, but the demo recipe shouldn't exceed the stated total.

Contributor Author

I specifically remember putting 9 LOL, thanks!
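
A mismatch like this one could be caught mechanically by counting the taxonomy and comparing it with the number stated in the table. The sketch below assumes a category-to-subcategories mapping with 3 categories of 3 subcategories each, as described in the review comment; the category and subcategory names are placeholders, not the recipe's actual strings.

```python
# Placeholder names; only the 3 x 3 shape is taken from the review comment.
KNOWLEDGE_DEPENDENCY = {
    "category_a": ["a1", "a2", "a3"],
    "category_b": ["b1", "b2", "b3"],
    "category_c": ["c1", "c2", "c3"],
}

STATED_CONCEPT_COUNT = 9  # value the taxonomy table should show


def count_concepts(taxonomy: dict[str, list[str]]) -> int:
    """Total number of subcategories across all categories."""
    return sum(len(subcategories) for subcategories in taxonomy.values())


actual = count_concepts(KNOWLEDGE_DEPENDENCY)
print(f"{len(KNOWLEDGE_DEPENDENCY)} categories, {actual} concepts")
assert actual == STATED_CONCEPT_COUNT, "taxonomy table is out of sync with the recipe"
```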

dhruvnathawani and others added 3 commits April 14, 2026 09:24
  - Add devnote to mkdocs nav after Async All the Way Down
  - Swap Recursive CTEs to Advanced, CASE Expressions to Intermediate (matches recipe)
  - Fix score extraction truthy check to use 'is not none' (preserves score-0 values)
  - Drop REPLACE() vs regexp_replace from dialect takeaway (REPLACE is cross-dialect)
  - Tighten prose: remove 'The key insight:', use actual BIRD number, trim X-not-Y
  - Fix knowledge dependency count: 8 -> 9 concepts (3x3 in recipe)
@dhruvnathawani dhruvnathawani merged commit 1448f9c into main Apr 14, 2026
50 checks passed
przemekboruta pushed a commit to przemekboruta/DataDesigner that referenced this pull request Apr 14, 2026
* docs: add text-to-sql devnote

* add diagram, update content

* correct inconsistencies

* docs: address PR NVIDIA-NeMo#349 feedback and add BIRD benchmark results
PR feedback fixes:
- Fix Window Functions contradiction: Key Takeaway #1 now uses
  "Geospatial SQL" (Advanced) instead of "Window Functions" (Intermediate)
- Fix score-0 truthiness bug: use `is not none` instead of truthy check
  in Jinja2 expression columns (inline example + production pipeline)
- Soften Code Sandbox language: "A natural next step would be..." instead
  of "We are actively implementing..."
- Cut Gretel reference per mvansegbroeck: replaced with NVIDIA/Nemotron
  team description
- Replace Qwen model references with Nemotron per mvansegbroeck: MODEL_NAME,
  ASCII diagram labels, Pipeline Overview prose
- Rename sdg_qwen_235b.py -> sdg_ndd_text2sql.py per mvansegbroeck
- Fix Try It Yourself: use MODEL_ALIAS = "nvidia-text" with default
  provider pattern (matches structured-outputs dev note), remove unused
  explicit ModelConfig
- Remove placeholder dataset link (#), add "Dataset: Internal" note
New content:
- Add BIRD Benchmark Results section with bar chart (JPG), data table,
  BIRD caveat paragraph, and Jocelyn Huang acknowledgement
  (Nemotron Super EX: 26.77% -> 41.80%, +15 pts, beats GPT-OSS-120B)
- Replace "Looking Ahead: Code Sandbox" with broader "Next Steps":
  Code Sandbox, RL on BIRD via NeMo Gym, schema representation, Spider 2.0
- Add Project Summary table at end of post

* docs: address second round of PR NVIDIA-NeMo#349 feedback

- Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1
  to match the exact taxonomy string in the code example (greptile)
- Add admonition clarifying code snippets are illustrative, not
  runnable, with link to Enterprise Text-to-SQL Recipe (nabinchha)
- Add context before score extraction snippet referencing the five
  LLMJudgeColumnConfig columns and linking to full recipe (nabinchha)
- Add companion file note and recipe link to production pipeline
  details block for prompts.py, rubrics.py, text2sql_seed.json (nabinchha)

* docs: address round 2 PR NVIDIA-NeMo#349 feedback, replace production block with recipe
- Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1
  to match the exact taxonomy string in the code example (greptile)
- Add admonition clarifying inline code snippets are illustrative,
  with link to runnable Enterprise Text-to-SQL Recipe (nabinchha)
- Add context before score extraction snippet referencing the five
  LLMJudgeColumnConfig columns and linking to full recipe (nabinchha)
- Replace production pipeline <details> block (230 lines with phantom
  imports from prompts.py, rubrics.py, text2sql_seed.json) with
  snippet include of enterprise_text_to_sql.py recipe — self-contained
  and runnable, consistent with other merged dev notes (nabinchha)

* docs: polish Try It Yourself and Summary sections
- Wrap minimal inline example in collapsible <details> dropdown
- Rename "A Team Effort" section to "Summary"
- Remove redundant Scale/Dialects/Dataset line

* docs: add missing sql_dialect sampler to Step 1 code snippet

The Step 3/4 prompt templates reference {{ sql_dialect }} but the
Step 1 seeding code never defined it, leaving an unresolved Jinja2
variable for readers following along. Add the sql_dialect sampler
with a comment explaining the pipeline runs once per dialect.

* fix ascii diagram

* docs: fix BIRD score framing and MySQL dialect wording
- Remove specific "60-70%" BIRD claim from intro to avoid contradiction
  with the 41.80%/38.25% direct-generation results shown later (those
  higher figures come from specialized systems with schema linking)
- Reword MySQL "forbids" to "prompts exclude" -- REGEXP_REPLACE and
  CONVERT_TZ are valid MySQL functions; the pipeline excluded them for
  portability, not because the dialect forbids them

* docs: move text-to-sql images to assets/ convention and update refs

* docs: address text-to-sql devnote review comments

  - Add devnote to mkdocs nav after Async All the Way Down
  - Swap Recursive CTEs to Advanced, CASE Expressions to Intermediate (matches recipe)
  - Fix score extraction truthy check to use 'is not none' (preserves score-0 values)
  - Drop REPLACE() vs regexp_replace from dialect takeaway (REPLACE is cross-dialect)
  - Tighten prose: remove 'The key insight:', use actual BIRD number, trim X-not-Y
  - Fix knowledge dependency count: 8 -> 9 concepts (3x3 in recipe)

---------

Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
Co-authored-by: Yev Meyer <ymeyer@nvidia.com>


5 participants