DSL-first natural language to SQL for PostgreSQL and Redshift
nlp2sql turns a natural language question into SQL through a reusable Python DSL:
- `await connect(...)`
- `await nlp.ask(...)`
- optional few-shot examples
- optional semantic context
- optional validation and repair
The library is designed for both simple schemas and large warehouses, but all public examples in this repository use the local e-commerce domain shipped with the project itself.
- DSL-first API: `connect()` returns an `NLP2SQL` client with `ask()`, `validate()`, `explain()`, and `suggest()`
- Business-aware generation: optional `SemanticContext` adds canonical tables, metrics, dimensions, rules, and mappings
- Execution modes: generate only, generate plus validate, and generate plus validate plus repair
- Few-shot examples: pass example lists directly or use an example repository implementation
- Large schema support: FAISS plus TF-IDF hybrid retrieval, schema filters, and disk-backed caches
- Multiple providers: OpenAI, Anthropic, and Gemini
- Database support: PostgreSQL and Amazon Redshift
- Async by default: built for services, APIs, notebooks, and workers
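The sparse half of the "FAISS plus TF-IDF hybrid retrieval" idea can be illustrated with a tiny pure-Python TF-IDF ranker over table descriptions. This is a sketch only: the real library combines dense (FAISS) and sparse scores with its own indexing, and none of the names below are part of the nlp2sql API.

```python
import math
from collections import Counter


def tfidf_rank(query: str, docs: dict) -> list:
    """Rank table names in `docs` ({name: description}) against `query`.

    Toy TF-IDF scoring: term frequency in each description weighted by a
    smoothed inverse document frequency across all descriptions.
    """
    tokenized = {name: desc.lower().split() for name, desc in docs.items()}
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))  # document frequency per term
    query_tokens = query.lower().split()

    def score(tokens):
        tf = Counter(tokens)
        return sum(
            (tf[term] / len(tokens)) * math.log((1 + n) / (1 + df[term]))
            for term in query_tokens
        )

    return sorted(tokenized, key=lambda name: score(tokenized[name]), reverse=True)
```

A hybrid retriever would blend a score like this with embedding similarity before picking which tables to put in the prompt.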
| Document | Description |
|---|---|
| Architecture | Runtime flow, services, ports, and diagrams |
| API Reference | Python API, CLI, hooks, and metadata reference |
| Configuration | Environment variables, examples, semantic context, cache behavior |
| Enterprise Guide | Governed usage, scale, and deployment patterns |
| Redshift Support | Redshift-specific notes using public examples |
| Examples | Safe public examples based on the local e-commerce domain |
| Contributing | Contribution guidelines |
```bash
# With UV (recommended)
uv add nlp2sql

# With pip
pip install nlp2sql

# With specific providers
pip install nlp2sql[anthropic,gemini]
pip install nlp2sql[all-providers]

# With embeddings
pip install nlp2sql[embeddings-local]
pip install nlp2sql[embeddings-openai]
```

Set an API key for your provider:

```bash
export OPENAI_API_KEY="your-openai-key"
# or ANTHROPIC_API_KEY / GOOGLE_API_KEY
```

```python
import asyncio

import nlp2sql
from nlp2sql import ProviderConfig


async def main():
    nlp = await nlp2sql.connect(
        "postgresql://testuser:testpass@localhost:5432/testdb",
        provider=ProviderConfig(provider="openai", api_key="sk-..."),
    )
    result = await nlp.ask("Show active users by region")
    print(result.sql)
    print(result.confidence)
    print(result.metadata["sql_intent_plan"])


asyncio.run(main())
```

`connect()` loads the schema, initializes retrieval indexes, and returns a reusable `NLP2SQL` client. `ask()` returns a typed `QueryResult`.
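The three fields read above can be bundled into one debug summary. The sketch below is self-contained: the dataclass is a stand-in for the library's `QueryResult`, carrying only the attributes the quickstart actually reads (`sql`, `confidence`, `metadata`); the real class lives in nlp2sql and may have more.

```python
from dataclasses import dataclass, field


@dataclass
class QueryResult:
    """Stand-in with only the fields the quickstart reads; hypothetical."""
    sql: str
    confidence: float
    metadata: dict = field(default_factory=dict)


def summarize(result: QueryResult) -> dict:
    """Collect the debug fields shown in the quickstart into one dict."""
    return {
        "sql": result.sql,
        "confidence": result.confidence,
        "intent_plan": result.metadata.get("sql_intent_plan"),
    }
```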
Pass examples directly to `connect()`. The library handles indexing for you.

```python
nlp = await nlp2sql.connect(
    "postgresql://testuser:testpass@localhost:5432/testdb",
    provider=ProviderConfig(provider="openai", api_key="sk-..."),
    examples=[
        {
            "question": "Show revenue by source category for the flagship store",
            "sql": (
                "SELECT d.metric_date, mc.source_category, SUM(d.revenue) AS revenue "
                "FROM daily_channel_metrics d "
                "JOIN stores s ON d.store_id = s.id "
                "JOIN marketing_channels mc ON d.channel_id = mc.id "
                "WHERE s.code = 'na_flagship' "
                "GROUP BY d.metric_date, mc.source_category"
            ),
            "database_type": "postgres",
        }
    ],
)
```

Use semantic context when the same question could map to multiple plausible tables or dimensions.
```python
from nlp2sql import (
    DimensionDefinition,
    DomainRule,
    MetricDefinition,
    SemanticContext,
    SemanticEntityMapping,
)

semantic_context = SemanticContext(
    domain="ecommerce_channel_performance",
    canonical_tables=["daily_channel_metrics"],
    required_filters=["s.code = 'na_flagship'", "s.region = 'North America'"],
    entity_mappings=[
        SemanticEntityMapping(
            source_term="North America flagship store",
            target="store_scope",
            resolved_value="na_flagship / North America",
            filter_expression="s.code = 'na_flagship' AND s.region = 'North America'",
        )
    ],
    metric_definitions=[
        MetricDefinition(name="revenue", description="Revenue by day and source category."),
        MetricDefinition(name="orders_count", description="Orders by day and source category."),
    ],
    dimension_definitions=[
        DimensionDefinition(name="metric_date", description="Daily grain."),
        DimensionDefinition(name="source_category", description="Channel grouping."),
    ],
    rules=[
        DomainRule(
            name="preserve_source_breakdown",
            description="Keep source_category when the question asks for a source breakdown.",
            required_dimensions=["source_category"],
            preferred_tables=["daily_channel_metrics"],
        )
    ],
)

result = await nlp.ask(
    "Show daily revenue and order count by source category for the North America flagship store",
    semantic_context=semantic_context,
)
```

`ask()` supports execution-aware modes directly.
```python
result = await nlp.ask(
    "Show revenue by source category for the flagship store",
    validate=True,
    repair=True,
)
```

The available modes are:

- `generate_only`: generate SQL only
- `generate_and_validate`: execute read-only validation when execution is wired
- `generate_validate_repair`: retry on semantic or execution failures when repair hooks are available
The CLI exposes the same concepts:
```bash
nlp2sql query \
  --database-url postgresql://testuser:testpass@localhost:5432/testdb \
  --question "Show daily revenue by source category for the North America flagship store" \
  --examples-file examples.json \
  --semantic-context-file semantic-context.json \
  --validate \
  --repair \
  --show-semantic-context \
  --show-sql-intent-plan \
  --show-selected-examples
```

```mermaid
flowchart TD
    userCode[UserCodeOrCLI] --> dsl[connectAndAskDSL]
    dsl --> analysis[QueryAnalysisService]
    analysis --> semantic[SemanticResolutionService]
    semantic --> retrieval[SchemaRetrievalAndExampleSelection]
    retrieval --> intent[SqlIntentPlanningService]
    intent --> prompt[PromptAssemblyAndAdapters]
    prompt --> llm[LLMGeneration]
    llm --> semval[SemanticValidation]
    semval --> exec[OptionalExecutionAndRepair]
    exec --> result[QueryResultMetadata]
```
At runtime the library:
- analyzes the question
- optionally resolves and merges semantic context
- retrieves relevant schema and examples
- builds a structured SQL intent plan
- assembles the prompt
- generates SQL
- optionally validates, executes, and repairs
- returns a `QueryResult` with debug metadata
See Architecture for the full breakdown.
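The steps above can be sketched as a chain of stages passing a context dict along, each stage optional. This is illustrative only: the stage functions below are hypothetical toys, not the library's internals.

```python
def run_pipeline(question, stages):
    """Thread a context dict through an ordered list of stage functions."""
    ctx = {"question": question}
    for stage in stages:
        ctx = stage(ctx)
    return ctx


def analyze(ctx):
    # Toy question analysis: tokenize the question.
    ctx["tokens"] = ctx["question"].lower().split()
    return ctx


def plan_intent(ctx):
    # Toy intent planning: a "by" hints at an aggregation with grouping.
    ctx["intent"] = "aggregate" if "by" in ctx["tokens"] else "lookup"
    return ctx
```

Because each stage has the same dict-in, dict-out shape, optional steps such as semantic resolution or repair can be spliced in or left out without changing the surrounding flow.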
This repository ships a local e-commerce integration domain used in tests and docs. It includes:
`stores`, `marketing_channels`, `users`, `products`, `orders`, `order_items`, `daily_channel_metrics`
The public examples intentionally stay inside that domain to avoid leaking any private warehouse schema.
To start it locally:
```bash
cd docker
docker compose up -d postgres
```

The default URL is:

```
postgresql://testuser:testpass@localhost:5432/testdb
```

| Provider | Default Model | Context Size | Best For |
|---|---|---|---|
| OpenAI | `gpt-4o-mini` | 128K | Fast general-purpose usage |
| Anthropic | `claude-sonnet-4-20250514` | 200K | Larger schemas and long prompts |
| Gemini | `gemini-2.0-flash` | 1M | High-volume and very large contexts |
All models are configurable through `ProviderConfig`.
`connect()` is the recommended path. Lower-level entry points still exist for advanced wiring:
```python
from nlp2sql import DatabaseType, ProviderConfig, create_and_initialize_service

service = await create_and_initialize_service(
    database_url="postgresql://testuser:testpass@localhost:5432/testdb",
    provider_config=ProviderConfig(provider="openai", api_key="sk-..."),
    database_type=DatabaseType.POSTGRES,
)

result = await service.generate_sql(
    "Count active users by region",
    database_type=DatabaseType.POSTGRES,
)
print(result["sql"])
```

To set up a local development environment:

```bash
git clone https://github.com/luiscarbonel1991/nlp2sql.git
cd nlp2sql
uv sync

# Start the local public e-commerce database
cd docker && docker compose up -d postgres

# Integration tests without LLM calls
cd ..
uv run pytest -m "integration and not llm"

# Optional LLM integration tests
uv run pytest -m "integration and llm"
```

nlp2sql includes a Model Context Protocol server for assistant integration.
```json
{
  "mcpServers": {
    "nlp2sql": {
      "command": "python",
      "args": ["/path/to/nlp2sql/mcp_server/server.py"],
      "env": {
        "OPENAI_API_KEY": "${OPENAI_API_KEY}",
        "NLP2SQL_DEFAULT_DB_URL": "postgresql://testuser:testpass@localhost:5432/testdb"
      }
    }
  }
}
```

See `mcp_server/README.md` for details.
See CONTRIBUTING.md.
MIT License. See LICENSE.
