Conversation
Greptile Summary

This PR adds a dev note documenting the enterprise text-to-SQL SDG pipeline used to generate training data for Nemotron Super v3, along with the accompanying runnable recipe (`docs/assets/recipes/code_generation/enterprise_text_to_sql.py`).
| Filename | Overview |
|---|---|
| docs/devnotes/posts/text-to-sql.md | New dev note documenting the text-to-SQL SDG pipeline; stale companion-file reference (prompts.py, rubrics.py) in the disclaimer note; earlier rounds of review addressed most other issues (dialect validator, unused import, score-zero handling, Window Functions taxonomy, model alias registration). |
| docs/assets/recipes/code_generation/enterprise_text_to_sql.py | Complete, self-contained recipe file; score extraction loop, prompt templates, rubrics, and the `is not none` guard on score columns all look correct. No new issues found. |
| docs/devnotes/.authors.yml | Three new authors (dnathawani, ymeyer, mvansegbroeck) added correctly; entries follow existing schema with name, description, and avatar URL. |
| mkdocs.yml | New nav entry for the text-to-SQL dev note added at the top of the Dev Notes section (most-recent-first order), consistent with the existing pattern. |
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    S1["Stage 1: Seeding & Diversification\n(industry×topic, complexity×concept,\ninstruction style, dialect)"]
    S2["Stage 2: Prompt Generation\n(LLM — natural-language request,\nno SQL jargon)"]
    S3["Stage 3: Schema + Data Generation\n(LLM — DDL + INSERTs,\ndistractor tables/columns, dirty data)"]
    S4["Stage 4: SQL Generation\n(LLM — dialect-specific SQL,\ncleans dirty data, ignores distractors)"]
    S5A["Syntax Validator\n(SQLFluff per dialect)"]
    S5B["5 LLM Judges\n(Prompt · SQL · Context ·\nData Quality · Knowledge)"]
    OUT["Output: 15 flat score columns\nfor downstream filtering"]
    S1 --> S2 --> S3 --> S4 --> S5A & S5B --> OUT
```
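The Stage 5a syntax gate in the flowchart uses SQLFluff per dialect. As a rough stdlib-only stand-in for a single dialect, SQLite can compile a statement via `EXPLAIN` without executing it (the `orders` schema and function name below are illustrative, not from the recipe):

```python
import sqlite3

def is_valid_sqlite(sql: str) -> bool:
    """Return True if `sql` compiles under SQLite (syntax check only, no execution)."""
    conn = sqlite3.connect(":memory:")
    try:
        # Minimal schema so references to `orders` resolve; EXPLAIN compiles
        # the statement without running it, surfacing syntax errors.
        conn.execute("CREATE TABLE orders (name TEXT, total REAL)")
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

print(is_valid_sqlite("SELECT name, SUM(total) FROM orders GROUP BY name"))  # True
print(is_valid_sqlite("SELEC name FRM orders"))                              # False
```

The production pipeline lints each dialect with its matching SQLFluff dialect config; this sketch only covers the SQLite case.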
Prompt To Fix All With AI
This is a comment left during a code review.
Path: docs/devnotes/posts/text-to-sql.md
Line: 128
Comment:
**Stale companion-file references**
The disclaimer names `prompts.py` and `rubrics.py` as companion files, but neither exists in the repository. All prompts (`PROMPT_GEN_TEXT`, `SCHEMA_GEN_PROMPTS`, `SQL_GEN_PROMPTS`, etc.) and scoring rubrics (`SQL_SCORES`, `PROMPT_SCORES`, `DATA_QUALITY_SCORES`, etc.) are defined inline in `enterprise_text_to_sql.py`. A reader who follows this note to look for those files will come up empty-handed.
```suggestion
The code blocks below show the key configuration patterns for each pipeline stage. Model aliases (`prompt_gen`, `context_gen`, etc.) and prompt/rubric definitions are referenced but not fully defined inline; for the complete, runnable pipeline — including all prompt templates and scoring rubrics — see the [Enterprise Text-to-SQL Recipe](../../recipes/code_generation/enterprise_text_to_sql/).
```
How can I resolve this? If you propose a fix, please make it concise.

Reviews (16): Last reviewed commit: "Merge branch 'dhruv/devnotes/text-to-sql..."
Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
PR feedback fixes:

- Fix Window Functions contradiction: Key Takeaway #1 now uses "Geospatial SQL" (Advanced) instead of "Window Functions" (Intermediate)
- Fix score-0 truthiness bug: use `is not none` instead of truthy check in Jinja2 expression columns (inline example + production pipeline)
- Soften Code Sandbox language: "A natural next step would be..." instead of "We are actively implementing..."
- Cut Gretel reference per mvansegbroeck: replaced with NVIDIA/Nemotron team description
- Replace Qwen model references with Nemotron per mvansegbroeck: MODEL_NAME, ASCII diagram labels, Pipeline Overview prose
- Rename sdg_qwen_235b.py -> sdg_ndd_text2sql.py per mvansegbroeck
- Fix Try It Yourself: use MODEL_ALIAS = "nvidia-text" with default provider pattern (matches structured-outputs dev note), remove unused explicit ModelConfig
- Remove placeholder dataset link (#), add "Dataset: Internal" note

New content:

- Add BIRD Benchmark Results section with bar chart (JPG), data table, BIRD caveat paragraph, and Jocelyn Huang acknowledgement (Nemotron Super EX: 26.77% -> 41.80%, +15 pts, beats GPT-OSS-120B)
- Replace "Looking Ahead: Code Sandbox" with broader "Next Steps": Code Sandbox, RL on BIRD via NeMo Gym, schema representation, Spider 2.0
- Add Project Summary table at end of post
- Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1 to match the exact taxonomy string in the code example (greptile)
- Add admonition clarifying code snippets are illustrative, not runnable, with link to Enterprise Text-to-SQL Recipe (nabinchha)
- Add context before score extraction snippet referencing the five LLMJudgeColumnConfig columns and linking to full recipe (nabinchha)
- Add companion file note and recipe link to production pipeline details block for prompts.py, rubrics.py, text2sql_seed.json (nabinchha)
… recipe

- Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1 to match the exact taxonomy string in the code example (greptile)
- Add admonition clarifying inline code snippets are illustrative, with link to runnable Enterprise Text-to-SQL Recipe (nabinchha)
- Add context before score extraction snippet referencing the five LLMJudgeColumnConfig columns and linking to full recipe (nabinchha)
- Replace production pipeline <details> block (230 lines with phantom imports from prompts.py, rubrics.py, text2sql_seed.json) with snippet include of enterprise_text_to_sql.py recipe — self-contained and runnable, consistent with other merged dev notes (nabinchha)
- Wrap minimal inline example in collapsible <details> dropdown
- Rename "A Team Effort" section to "Summary"
- Remove redundant Scale/Dialects/Dataset line
The Step 3/4 prompt templates reference {{ sql_dialect }} but the
Step 1 seeding code never defined it, leaving an unresolved Jinja2
variable for readers following along. Add the sql_dialect sampler
with a comment explaining the pipeline runs once per dialect.
Made-with: Cursor
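The gap this commit describes is easy to reproduce with plain Jinja2 (a stand-in sketch, not the Data Designer sampler API; the template text, variable names, and dialect list are illustrative):

```python
import random
from jinja2 import Environment, StrictUndefined

env = Environment(undefined=StrictUndefined)
template = env.from_string("Write a {{ sql_dialect }} query for: {{ request }}")

# Without a seeded sql_dialect, rendering fails -- the unresolved-variable
# problem readers would hit following the original Step 1 snippet.
try:
    template.render(request="top customers by revenue")
except Exception as e:
    print("unresolved:", type(e).__name__)

# A sampler stand-in: pick one dialect per pipeline run.
sql_dialect = random.choice(["postgresql", "mysql", "sqlite"])
print(template.render(request="top customers by revenue", sql_dialect=sql_dialect))
```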
- Remove specific "60-70%" BIRD claim from intro to avoid contradiction with the 41.80%/38.25% direct-generation results shown later (those higher figures come from specialized systems with schema linking) - Reword MySQL "forbids" to "prompts exclude" -- REGEXP_REPLACE and CONVERT_TZ are valid MySQL functions; the pipeline excluded them for portability, not because the dialect forbids them
Docs preview: https://8eefda37.dd-docs-preview.pages.dev
the new dev note isn't listed in the `mkdocs.yml` nav
```python
category="sql_complexity",
values={
    "Beginner": ["Basic SELECT Statements", "WHERE Clauses", "Simple Aggregations", ...],
    "Intermediate": ["Window Functions", "Recursive CTEs", "Correlated Subqueries", ...],
```
"Recursive CTEs" is listed under Intermediate here, but the recipe included lower on the page via --8<-- (line 240 of enterprise_text_to_sql.py) puts it under Advanced. your own intro also frames recursive CTEs as a senior-engineer concept. maybe swap it with "CASE Expressions" to match the recipe?
Good catch, changed to match recipe
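For reference, the tiering after the swap, as described in this exchange (an illustrative subset; the recipe's full per-tier lists are elided):

```python
# Illustrative subset of the sql_complexity taxonomy after the fix:
# Recursive CTEs move to Advanced, CASE Expressions to Intermediate.
sql_complexity = {
    "Beginner": ["Basic SELECT Statements", "WHERE Clauses", "Simple Aggregations"],
    "Intermediate": ["Window Functions", "CASE Expressions", "Correlated Subqueries"],
    "Advanced": ["Recursive CTEs", "Geospatial SQL"],
}
```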
this still uses a truthy check for score extraction, but the inline snippet in the dev note (line 378) was fixed to use `is not none`
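The truthy-vs-`is not none` difference is easy to demonstrate with plain Jinja2 (field names and the `MISSING` placeholder are illustrative, not from the recipe):

```python
from jinja2 import Environment

env = Environment()
# Truthy check: a legitimate score of 0 is falsy, so it gets replaced.
truthy = env.from_string("{{ row.score if row.score else 'MISSING' }}")
# 'is not none' test: only actual nulls are replaced; 0 survives.
safe = env.from_string("{{ row.score if row.score is not none else 'MISSING' }}")

print(truthy.render(row={"score": 0}))   # MISSING  (bug: valid 0 lost)
print(safe.render(row={"score": 0}))     # 0
print(safe.render(row={"score": None}))  # MISSING
```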
4. **Distractor tables teach schema linking.** Injecting semantically similar but irrelevant tables forces the model to *read* the schema instead of guessing from table names. This is the skill gap between academic benchmarks and production.
5. **Per-dialect generation avoids lowest-common-denominator SQL.** Rather than generating ANSI SQL and hoping it works everywhere, the pipeline produces dialect-specific schemas and queries with appropriate syntax (`strftime` vs `DATE_SUB` vs `interval`, `REPLACE()` vs `regexp_replace`). Each dialect gets its own tailored prompts, validators, and judge prompts.
nit: REPLACE() is available in all three dialects (SQLite, MySQL, PostgreSQL) - it's not really a dialect-specific function. the strftime vs DATE_SUB vs interval example right before it already makes the point well. maybe just drop the REPLACE() vs regexp_replace part? this was flagged in a prior review too.
Good point, changed
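To make the per-dialect point in takeaway 5 concrete, here is the same 30-day recency filter spelled for each dialect (an illustrative sketch; the table and column names are not from the recipe):

```python
# Same intent, three dialect-specific spellings of date arithmetic.
last_30_days = {
    "sqlite":     "WHERE order_date >= datetime('now', '-30 days')",
    "mysql":      "WHERE order_date >= DATE_SUB(NOW(), INTERVAL 30 DAY)",
    "postgresql": "WHERE order_date >= now() - interval '30 days'",
}
for dialect, clause in last_30_days.items():
    print(f"{dialect}: {clause}")
```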
ran slop-guard on the post - scored 23/100, which is in the same range as some existing dev notes (structured-outputs is 29, design-principles is 21), so not blocking. a couple easy wins if you want to tighten the prose: "The key insight:" on line 33 could just be cut (state the insight directly), "performance drops significantly" on line 23 could use the actual number instead, and the "X, not Y" contrast pattern shows up 5 times ("feature, not a bug", "reasoning, not just syntax", "hours, not months", etc.) - effective in moderation but gets repetitive across the post.
| SQL complexity | 3 tiers | 89 concepts | Difficulty level (Beginner → Advanced) |
| SQL task type | 12 categories | 94 concepts | What the query does (analytics, transformation, ...) |
| Data quality | 5 challenges | 12 concepts | Dirty data to inject and clean |
| Knowledge dependency | 3 categories | 8 concepts | Implicit reasoning required |
the taxonomy table says "3 categories | 8 concepts" for knowledge dependency, but the recipe included on this page defines 3 categories x 3 subcategories = 9 concepts. small thing, but the demo recipe shouldn't exceed the stated total.
I specifically remember putting 9 LOL, thanks!
- Add devnote to mkdocs nav after Async All the Way Down
- Swap Recursive CTEs to Advanced, CASE Expressions to Intermediate (matches recipe)
- Fix score extraction truthy check to use 'is not none' (preserves score-0 values)
- Drop REPLACE() vs regexp_replace from dialect takeaway (REPLACE is cross-dialect)
- Tighten prose: remove 'The key insight:', use actual BIRD number, trim X-not-Y
- Fix knowledge dependency count: 8 -> 9 concepts (3x3 in recipe)
…A-NeMo/DataDesigner into dhruv/devnotes/text-to-sql
* docs: add text-to-sql devnote
* add diagram, update content
* correct inconsistencies
* docs: address PR NVIDIA-NeMo#349 feedback and add BIRD benchmark results
* docs: address second round of PR NVIDIA-NeMo#349 feedback
* docs: address round 2 PR NVIDIA-NeMo#349 feedback, replace production block with recipe
* docs: polish Try It Yourself and Summary sections
* docs: add missing sql_dialect sampler to Step 1 code snippet
* fix ascii diagram
* docs: fix BIRD score framing and MySQL dialect wording
* docs: move text-to-sql images to assets/ convention and update refs
* docs: address text-to-sql devnote review comments

---------

Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
Co-authored-by: Yev Meyer <ymeyer@nvidia.com>
Summary
Add a dev note documenting the enterprise-grade text-to-SQL SDG pipeline used to generate training data for Nemotron's SQL capabilities across PostgreSQL, MySQL, and SQLite.
What's in the post