IMPLEMENTED (2024-12-18)
The buenavista library (Python) was used instead of the originally planned duckgres (Rust).
Reasons for the change:
- Simpler integration with the existing Python codebase
- Easier customization of authentication
- Fewer dependencies (no Rust toolchain required)
- Sufficient performance for the MVP
Files:
- src/pgwire_server.py - Custom PG Wire server based on buenavista
- src/routers/pgwire_auth.py - Authentication bridge for PG Wire
Reference: buenavista
2024-12-18
Workspaces in Keboola are used for interactive work with data: SQL transformations, analysis, exploration. Users need to:
- Connect to a workspace from external tools (DBeaver, DataGrip, Python, R, BI tools)
- Run SQL interactively
- Read data from the project's entire Storage (read-only)
- Write to the workspace (read-write)
DuckDB is an embedded database with no native server mode. Unlike PostgreSQL, MySQL, or Snowflake, it offers no way to connect "over the network".
With the Snowflake backend, users receive credentials:
- Host (Snowflake account)
- Username
- Password / Private Key
- Database, Schema, Warehouse
They then connect with a standard Snowflake driver (JDBC, ODBC, Python connector).
We will use the PostgreSQL Wire Protocol for access to DuckDB workspaces.
Specifically, we will evaluate and deploy one of these implementations:
- duckgres - PostHog, used in production
- duckdb-pgwire - DuckDB extension + server
┌──────────────────────────────────────────────────────────────────────┐
│  USER TOOLS                                                          │
│  DBeaver, DataGrip, psql, Python psycopg2, R, Tableau, ...           │
└───────────────────────────────┬──────────────────────────────────────┘
                                │
                                │ PostgreSQL Wire Protocol (port 5432)
                                │
┌───────────────────────────────▼──────────────────────────────────────┐
│  PG WIRE SERVER                                                      │
│  (duckgres / pgwire)                                                 │
│                                                                      │
│  - Authentication (workspace credentials)                            │
│  - Session management                                                │
│  - Query routing                                                     │
└───────────────────────────────┬──────────────────────────────────────┘
                                │
                                │ DuckDB Python/C API
                                │
┌───────────────────────────────▼──────────────────────────────────────┐
│  WORKSPACE SESSION                                                   │
│                                                                      │
│  workspace_123.duckdb (RW) ← User's working space                    │
│  │                                                                   │
│  ├── ATTACH 'project_1/in_c_sales/orders.duckdb'                     │
│  │     AS in_c_sales_orders (READ_ONLY)                              │
│  ├── ATTACH 'project_1/in_c_sales/customers.duckdb'                  │
│  │     AS in_c_sales_customers (READ_ONLY)                           │
│  ├── ATTACH 'project_1/out_c_reports/summary.duckdb'                 │
│  │     AS out_c_reports_summary (READ_ONLY)                          │
│  └── ... (all project tables attached read-only)                     │
│                                                                      │
│  User can:                                                           │
│  - SELECT FROM any attached table (project data, read-only)          │
│  - CREATE TABLE, INSERT, UPDATE in workspace schema (read-write)     │
│  - Run transformations, CTEs, window functions, etc.                 │
└──────────────────────────────────────────────────────────────────────┘
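The attach step in the diagram can be sketched as a small helper that turns a list of project tables into read-only ATTACH statements. This is only an illustration under stated assumptions, not the server's actual code: `ProjectTable`, `attach_statements`, and the identifier check are hypothetical names introduced here, while the alias format `{bucket}_{name}` follows the diagram.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectTable:
    """Hypothetical record for one project table (field names are assumptions)."""
    bucket: str   # e.g. "in_c_sales"
    name: str     # e.g. "orders"
    db_path: str  # e.g. "project_1/in_c_sales/orders.duckdb"


# Plain SQL identifier: letters, digits, underscores, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def attach_statements(tables):
    """Build one read-only ATTACH statement per project table."""
    statements = []
    for table in tables:
        alias = f"{table.bucket}_{table.name}"
        # The alias is interpolated into SQL text, so reject anything
        # that is not a plain identifier.
        if not _IDENTIFIER.fullmatch(alias):
            raise ValueError(f"unsafe alias: {alias!r}")
        # Escape single quotes inside the path literal.
        path = table.db_path.replace("'", "''")
        statements.append(f"ATTACH '{path}' AS {alias} (READ_ONLY)")
    return statements


example = attach_statements(
    [ProjectTable("in_c_sales", "orders", "project_1/in_c_sales/orders.duckdb")]
)
```

Each returned string could then be passed to the session's DuckDB connection. Building the statements separately from executing them keeps the naming and validation rules testable without a database.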
1. CreateWorkspace API call
   ├── Create workspace_123.duckdb file
   ├── Generate credentials (username, password)
   ├── Store in metadata.duckdb
   └── Return connection string
2. User connects via PG protocol
   ├── PG Wire server authenticates
   ├── Opens workspace_123.duckdb
   ├── ATTACHes all project tables (READ_ONLY)
   └── Session ready
3. User runs queries
   ├── SELECT from project tables → reads from ATTACHed files
   ├── CREATE TABLE in workspace → writes to workspace_123.duckdb
   └── Full SQL support (JOINs across tables, CTEs, etc.)
4. DropWorkspace API call
   ├── Close all sessions
   ├── Delete workspace_123.duckdb
   └── Remove from metadata
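The "Generate credentials" and "authenticates" steps above could look roughly like the following sketch. It assumes the username format `ws_{workspace_id}_{random}` and the SHA-256 password hash noted in the metadata schema; the function names `generate_credentials`, `workspace_id_from_username`, and `password_matches` are hypothetical and not taken from the implementation.

```python
import hashlib
import hmac
import secrets


def generate_credentials(workspace_id: str):
    """Return (username, password, password_hash) for a new workspace."""
    suffix = secrets.token_hex(4)  # random part of the username
    username = f"ws_{workspace_id}_{suffix}"
    password = secrets.token_urlsafe(24)  # returned to the caller once
    password_hash = hashlib.sha256(password.encode("utf-8")).hexdigest()
    return username, password, password_hash


def workspace_id_from_username(username: str) -> str:
    """Recover the workspace id from 'ws_{workspace_id}_{random}'."""
    prefix, _, rest = username.partition("_")
    workspace_id, _, suffix = rest.rpartition("_")
    if prefix != "ws" or not workspace_id or not suffix:
        raise ValueError(f"malformed username: {username!r}")
    return workspace_id


def password_matches(password: str, stored_hash: str) -> bool:
    """Compare a candidate password with the stored SHA-256 hex digest."""
    candidate = hashlib.sha256(password.encode("utf-8")).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(candidate, stored_hash)


username, password, password_hash = generate_credentials("123")
```

Only the hash would be stored in metadata.duckdb; the plain password appears only in the connection string returned to the user. A salted, deliberately slow hash may be preferable to plain SHA-256 if the threat model calls for it.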
| Criterion | PG Wire | REST API | Arrow Flight SQL |
|---|---|---|---|
| Compatibility | All SQL tools | Limited | Growing |
| Interactivity | Native | Limited | Good |
| Ecosystem | Huge | - | Smaller |
| Latency | Low | Medium | Low |
| Streaming | Yes | No | Yes |
| Implementation | Exists | Already have it | Complex |
Winner: PG Wire Protocol
- Maximum compatibility: psql, DBeaver, DataGrip, Tableau, Python (psycopg2), R (RPostgres), Go (pgx), Java (JDBC), .NET...
- Production reference: PostHog runs duckgres in production
- User experience: identical to PostgreSQL, so there is nothing new to learn
- Existing implementations: duckgres, duckdb-pgwire
- We already have a REST API for management operations
- But: REST is not suited to interactive SQL sessions
- Missing: streaming, cursors, prepared statements, transactions
- Arrow Flight SQL is modern and efficient (zero-copy)
- But: tool support is still limited
- More complex implementation
- Option: add it as an alternative in the future
- Users can connect with any PostgreSQL client
- No new tools: users keep working with what they already know
- Production-proven solution (PostHog)
- ATTACH READ_ONLY protects production data from modification
- One more component to operate (the PG Wire server)
- PG Wire is not 100% PostgreSQL: some features will not work
- Memory overhead of ATTACH (file descriptors)
| Feature | Support |
|---|---|
| SELECT, INSERT, UPDATE, DELETE | Yes |
| CREATE/DROP TABLE | Yes |
| JOINs, CTEs, Window Functions | Yes |
| Prepared Statements | Partial |
| Transactions (BEGIN/COMMIT) | DuckDB semantics |
| PostgreSQL-specific functions | No (DuckDB functions instead) |
| COPY FROM/TO | Yes (DuckDB syntax) |
| pg_catalog views | Partial |
| Extensions (PostGIS, etc.) | No |
postgresql://ws_123_user:password@host:5432/workspace_123
# Or as individual parameters
Host: duckdb.keboola.local
Port: 5432
Database: workspace_123
Username: ws_123_user
Password: <generated>
SSL: required
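For clients that take a single URI, the parameters above can be combined as in this sketch. The helper name `build_connection_uri` is hypothetical; the `sslmode=require` query parameter is the standard libpq spelling and is assumed here to correspond to "SSL: required". Percent-encoding matters because generated passwords may contain characters such as `@` or `/`.

```python
from urllib.parse import quote


def build_connection_uri(host, port, database, username, password, sslmode="require"):
    """Assemble a PostgreSQL connection URI with percent-encoded parts."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    db = quote(database, safe="")
    return f"postgresql://{user}:{secret}@{host}:{port}/{db}?sslmode={sslmode}"


uri = build_connection_uri(
    host="duckdb.keboola.local",
    port=5432,
    database="workspace_123",
    username="ws_123_user",
    password="p@ss/word",  # placeholder password for illustration
)
```

The resulting string can be pasted into psql or passed to any driver that accepts libpq-style URIs.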
-- metadata.duckdb
CREATE TABLE workspace_credentials (
    workspace_id VARCHAR PRIMARY KEY,
    username VARCHAR NOT NULL,       -- ws_{workspace_id}_{random}
    password_hash VARCHAR NOT NULL,  -- SHA256
    created_at TIMESTAMPTZ DEFAULT now(),
    expires_at TIMESTAMPTZ,
    FOREIGN KEY (workspace_id) REFERENCES workspaces(id)
);

import duckdb

async def on_client_connect(username: str, password: str) -> duckdb.DuckDBPyConnection:
    # Helpers such as extract_workspace_id, verify_password, get_workspace and
    # list_project_tables are assumed to be defined elsewhere.
    # 1. Authenticate
    workspace_id = extract_workspace_id(username)
    if not verify_password(workspace_id, password):
        raise AuthenticationError()
    # 2. Get workspace info
    workspace = get_workspace(workspace_id)
    project_id = workspace.project_id
    # 3. Open workspace database
    conn = duckdb.connect(workspace.db_path)
    # 4. ATTACH all project tables as READ_ONLY
    tables = list_project_tables(project_id)
    for table in tables:
        alias = f"{table.bucket}_{table.name}"
        conn.execute(f"""
            ATTACH '{table.db_path}' AS {alias} (READ_ONLY)
        """)
    # 5. Create convenient views in workspace
    for table in tables:
        alias = f"{table.bucket}_{table.name}"
        # The target schema must exist before a view can be created in it
        conn.execute(f"CREATE SCHEMA IF NOT EXISTS {table.bucket}")
        conn.execute(f"""
            CREATE OR REPLACE VIEW {table.bucket}.{table.name} AS
            SELECT * FROM {alias}.main.data
        """)
    return conn

from dataclasses import dataclass

@dataclass
class WorkspaceConfig:
    max_attached_tables: int = 1000       # Max ATTACHed databases
    max_memory_per_session: str = "4GB"   # DuckDB memory limit
    max_temp_storage: str = "10GB"        # Temp files for large queries
    session_timeout: int = 3600           # 1 hour idle timeout (seconds)
    query_timeout: int = 300              # 5 min per query (seconds)

- Rejected: Poor user experience for interactive SQL
- Deferred: Smaller ecosystem, more complex implementation
- Can be added later as an alternative
- Rejected: No compatibility with existing tools
- Rejected: Security risk, complex management
- Connection pooling: For higher load
- Read replicas: ATTACH on multiple machines
- Arrow Flight SQL: As an alternative protocol for Python/data science
- Query governor: Limits on CPU, memory, IO per session
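A first step toward the query governor could be DuckDB's own configuration settings, applied when a session opens its database. The sketch below only builds the SET statements. `SessionLimits` and `limit_statements` are hypothetical names, the 4GB and 10GB defaults mirror WorkspaceConfig, and the thread count is an added assumption. `memory_limit`, `max_temp_directory_size`, and `threads` are existing DuckDB settings and apply to the whole database instance.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionLimits:
    """Hypothetical per-session limits; defaults mirror WorkspaceConfig."""
    memory_limit: str = "4GB"
    max_temp_storage: str = "10GB"
    threads: int = 4  # assumption, not part of WorkspaceConfig


# Accept sizes such as "512MB" or "4GB" before interpolating them into SQL.
_SIZE = re.compile(r"\d+(KB|MB|GB|TB)")


def limit_statements(limits: SessionLimits):
    """Translate the limits into DuckDB SET statements."""
    for size in (limits.memory_limit, limits.max_temp_storage):
        if not _SIZE.fullmatch(size):
            raise ValueError(f"invalid size: {size!r}")
    return [
        f"SET memory_limit = '{limits.memory_limit}'",
        f"SET max_temp_directory_size = '{limits.max_temp_storage}'",
        f"SET threads = {int(limits.threads)}",
    ]


statements = limit_statements(SessionLimits())
```

CPU time and IO limits have no direct equivalent among these settings and would still need enforcement in the PG Wire server itself.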
- buenavista - USED in the implementation
- duckgres (PostHog) - originally considered
- duckdb-pgwire - alternative
- PostgreSQL Wire Protocol
- DuckDB ATTACH
- ADR-009: File per Table - foundation for the ATTACH architecture