A toy analytical SQL engine built in C++, designed as a learning project targeting internship roles at companies like Snowflake and Databricks. SwiftQL takes SQL queries as input, parses them, plans their execution, and runs them against structured tabular data stored as CSV files.
Project thesis: "I built a correct SQL engine, then made storage smarter, then made execution significantly faster — and measured every step."
- Project Overview
- Feature Scope
- Architecture
- Data Domain
- Phase 1 — Correct Row-Based Engine
- Phase 2 — Columnar Storage + Hash Join
- Phase 3 — Vectorized Execution
- Phase 4 — Cost-Based Optimizer
- 20-Week Plan
- Benchmarks
- Build Instructions
- Usage
- Limitations
- Possible Extensions
SwiftQL is a single-process analytical query engine. It is not a full DBMS — there are no transactions, no multi-user sessions, and no write path. It is purely a read query engine, which is exactly the right scope for understanding how analytical database systems like Snowflake and Databricks work internally.
The project is structured in four progressive phases, each leaving a working and demonstrable system before moving to the next:
| Phase | Focus | Key Idea |
|---|---|---|
| 1 | Correct row-based SQL engine | Make it work |
| 2 | Columnar storage + encodings + pruning + hash join | Make storage smarter |
| 3 | Vectorized execution + late materialization | Make execution faster |
| 4 | Cost-based optimizer + predicate pushdown | Make planning smarter |
Tech stack:
- Core engine: C++
- Build system: CMake
- Testing: GoogleTest
- Benchmarking: Google Benchmark / custom harness
- Data generation + correctness testing: Python
- `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, `LIMIT`
- `DISTINCT` — eliminates duplicate rows from output
- `IS NULL` / `IS NOT NULL` — null-aware predicate evaluation
- `JOIN ... ON` — hash join execution over columnar storage (Phase 2+)
- Aggregates: `COUNT`, `SUM`, `AVG`, `MIN`, `MAX`
- `EXPLAIN` — prints the query plan tree without executing
- `EXPLAIN ANALYZE` — executes the query and annotates each plan node with rows in, rows out, exclusive self-time (child time excluded), and % of total execution time; footer shows rows returned and separate parse, plan, and execution times
- `--storage row | columnar` — switches the storage backend
- `--execution volcano | vectorized` — switches the execution model
- Query result cache — identical queries served from cache without re-execution
- Cost-based optimizer — uses table and column statistics to reorder predicates and select join sides (Phase 4)
- Formal predicate pushdown — filters pushed as close to the scan as possible (Phase 4)
- CSV-based table storage with a `catalog.json` metadata file

Explicitly out of scope:

- `CREATE TABLE` SQL — tables registered via catalog only
- Subqueries
- Transactions / writes (`INSERT`, `UPDATE`, `DELETE`)
- Indexes
- Distributed execution
- Full SQL null semantics (three-valued logic) — null handling scoped to `IS NULL` / `IS NOT NULL` predicates and null display in output
The codebase is organized into cleanly separated modules, each with a well-defined responsibility and a clear interface.
```
swiftql/
├── CMakeLists.txt
├── README.md
├── catalog.json
├── data/
│ ├── laps.csv
│ └── drivers.csv
├── src/
│ ├── common/ # Value, Schema, Row, TypeId
│ ├── catalog/ # Catalog, TableMetadata, TableStats
│ ├── storage/ # CSVLoader, ColumnarTable, encoders
│ ├── parser/ # Lexer, Parser, AST nodes
│ ├── planner/ # Validator, plan nodes, optimizer
│ ├── execution/ # Operators (volcano + vectorized)
│ └── cli/ # main.cc, result printer
├── include/
├── tests/
├── benchmarks/
└── python_tools/
├── generate_data.py
├── run_queries.py
├── compare_against_sqlite.py
    └── benchmark.py
```
Common layer (`src/common/`): everything else depends on this layer; no module reaches past it.
- `TypeId` enum — `INT`, `DOUBLE`, `STRING`
- `Value` — `std::variant<int64_t, double, std::string>` holding one cell's data, with null state (sketched below)
- `ColumnDef` — name + `TypeId` for one column
- `Schema` — ordered list of `ColumnDef` with lookup by name
- `Row` — `std::vector<Value>` representing one table row
- Error / result types — how the engine signals failures
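A minimal sketch of what the `Value` type could look like; the helper constructors here are illustrative, not the project's actual API:

```cpp
#include <cstdint>
#include <string>
#include <variant>

// One cell of data: a typed payload plus an explicit null flag.
struct Value {
    std::variant<int64_t, double, std::string> data;
    bool is_null = false;

    static Value Null() { Value v; v.is_null = true; return v; }
    static Value Int(int64_t x) { return Value{x, false}; }
    static Value Str(std::string s) { return Value{std::move(s), false}; }
};
```

Keeping the null flag outside the variant keeps every cell the same shape regardless of its type.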
Catalog (`src/catalog/`): the engine's directory of what tables exist.
- `TableMetadata` — table name, file path, `Schema`
- `TableStats` — row count, per-column statistics (min, max, distinct count) — populated at load time, used by the Phase 4 optimizer
- `Catalog` — loads and stores all `TableMetadata` and `TableStats`; answers "does table X exist?", "what columns does it have?", "where is its file?"
- Backed by `catalog.json` on disk — no SQL DDL
Example `catalog.json`:

```json
{
"tables": [
{
"name": "laps",
"file": "data/laps.csv",
"columns": [
{"name": "lap_id", "type": "INT"},
{"name": "team", "type": "STRING"},
{"name": "speed", "type": "DOUBLE"},
{"name": "season", "type": "INT"}
]
}
]
}
```

Storage layer (`src/storage/`): responsible for physically reading table data and turning it into something the execution engine can consume.
Phase 1 — Row storage:
- `CSVLoader` — reads a CSV file line by line, converts each line into a `Row` using the table schema
- Loaded rows held in memory as `std::vector<Row>` for the duration of a query
Phase 2 — Columnar storage:
- `ColumnArray` — a typed column: `std::variant<vector<int64_t>, vector<double>, vector<string>>`
- `ColumnarTable` — map of column name → `ColumnArray` + schema + row count
- `DictionaryEncoder` — maps unique strings to int IDs; stores column as `vector<int32_t>` (sketched below)
- `RLEColumn` — stores repeated-value columns as `(value, run_length)` pairs
- `ColumnChunk` — a segment of a column with min, max, and row count metadata for zone-map pruning
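As an illustration of the encoding idea, here is a minimal dictionary encoder over a plain `std::vector<std::string>`; the struct and function names are assumptions, not the project's actual interfaces:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Dictionary encoding: each unique string gets a small integer ID.
// The column stores only IDs; the dictionary maps IDs back to strings.
struct DictionaryEncodedColumn {
    std::vector<std::string> dictionary;  // ID -> string
    std::vector<int32_t> ids;             // one entry per row
};

DictionaryEncodedColumn dictionary_encode(const std::vector<std::string>& column) {
    DictionaryEncodedColumn out;
    std::unordered_map<std::string, int32_t> seen;
    out.ids.reserve(column.size());
    for (const auto& s : column) {
        auto [it, inserted] = seen.try_emplace(s, static_cast<int32_t>(out.dictionary.size()));
        if (inserted) out.dictionary.push_back(s);
        out.ids.push_back(it->second);
    }
    return out;
}
```

For a column like `team`, with a handful of distinct values over a million rows, this replaces per-row strings with 4-byte IDs, and equality predicates can compare integer IDs instead of strings, which the Phase 3 aggregation hot loop exploits.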
Parser (`src/parser/`): takes a raw SQL string and produces a structured Abstract Syntax Tree (AST). Hand-written recursive descent parser — no parser generator library. One grammar rule is sketched in code after the AST node list below.
Grammar (restricted subset):
```
select_stmt → SELECT [DISTINCT] select_list FROM table_ref
[JOIN IDENT ON expr]
[WHERE expr]
[GROUP BY col_list]
[HAVING expr]
[ORDER BY col_list]
[LIMIT INT_LITERAL]
table_ref → IDENT
select_list → expr (COMMA expr)*
col_list → IDENT (COMMA IDENT)*
expr → or_expr
or_expr → and_expr (OR and_expr)*
and_expr → compare (AND compare)*
compare → primary [(= | != | < | > | <= | >=) primary]
| primary IS NULL
| primary IS NOT NULL
primary → IDENT
| IDENT LPAREN expr RPAREN ← aggregate call
| IDENT LPAREN STAR RPAREN ← COUNT(*)
| INT_LITERAL
| FLOAT_LITERAL
| STRING_LITERAL
            | LPAREN expr RPAREN
```
AST node types:
- `ColumnRef` — reference to a column by name (with optional table qualifier)
- `Literal` — a constant value
- `BinaryExpr` — left expr, operator, right expr
- `IsNullExpr` — expr + is_not_null flag
- `AggregateExpr` — function name, argument expr, is_star flag
- `SelectStatement` — select list, from table, optional join, where, group-by, having, order-by, limit, distinct flag
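To make the recursive-descent shape concrete, here is a self-contained sketch of the `and_expr → compare (AND compare)*` rule; the token and AST types are pared-down stand-ins for the project's richer versions:

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Pared-down token and AST types; the real versions carry positions,
// more operators, and more node kinds.
enum class TokenType { IDENT, AND, END };
struct Token { TokenType type; std::string text; };

struct Expr { virtual ~Expr() = default; };
struct ColumnRef : Expr { std::string name; };
struct BinaryExpr : Expr {
    std::unique_ptr<Expr> left, right;
    std::string op;
};

class Parser {
public:
    explicit Parser(std::vector<Token> tokens) : tokens_(std::move(tokens)) {}

    // and_expr → compare (AND compare)*
    // One method per grammar rule; repetition becomes a loop that folds
    // operands into a left-associative BinaryExpr tree.
    std::unique_ptr<Expr> parseAndExpr() {
        auto left = parseCompare();
        while (match(TokenType::AND)) {
            auto node = std::make_unique<BinaryExpr>();
            node->op = "AND";
            node->left = std::move(left);
            node->right = parseCompare();
            left = std::move(node);
        }
        return left;
    }

private:
    // Stand-in for the full comparison rule: parses a bare column name.
    std::unique_ptr<Expr> parseCompare() {
        auto node = std::make_unique<ColumnRef>();
        node->name = tokens_[pos_++].text;
        return node;
    }

    bool match(TokenType t) {
        if (tokens_[pos_].type != t) return false;
        ++pos_;
        return true;
    }

    std::vector<Token> tokens_;
    std::size_t pos_ = 0;
};
```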
Planner (`src/planner/`): bridges the gap between the AST and the execution plan.
Semantic validation:
- `FROM` table exists in catalog
- All referenced columns exist in the relevant table schema
- Aggregate functions applied to compatible types only
- Non-aggregated `SELECT` columns appear in `GROUP BY` when aggregates are present
- `HAVING` only used when `GROUP BY` is present
- Join columns exist in their respective tables
Plan nodes:
- `SeqScanNode` — read from a table
- `FilterNode` — apply a predicate
- `ProjectNode` — select output columns / compute expressions
- `HashAggregateNode` — group by + aggregation functions
- `HavingNode` — post-aggregation filter
- `DistinctNode` — deduplication via hash set
- `SortNode` — `ORDER BY`
- `LimitNode` — `LIMIT N`
- `HashJoinNode` — build/probe hash join (execution wired in Phase 2; stubbed in Phase 1)
Example plan for `SELECT team, AVG(speed) FROM laps JOIN drivers ON laps.driver_id = drivers.driver_id WHERE season = 2025 GROUP BY team HAVING AVG(speed) > 300`:

```
Project [team, AVG(speed)]
  Having [AVG(speed) > 300]
    Aggregate [group_by=team, agg=AVG(speed)]
      Filter [season = 2025]
        HashJoin [laps.driver_id = drivers.driver_id]
          SeqScan [laps, 4 columns]
          SeqScan [drivers, 5 columns]
```
Phase 4 — Optimizer pass (applied between planning and execution):
- Formal predicate pushdown — `FilterNode`s moved as close to `SeqScanNode` as possible
- Predicate reordering — most selective predicates evaluated first using column distinct counts
- Join side selection — smaller table by row count chosen as build side
Execution (`src/execution/`): storage backend and execution model are two orthogonal dimensions:

| | Volcano (row-at-a-time) | Vectorized (batch) |
|---|---|---|
| Row storage | `--storage row --execution volcano` | — |
| Columnar storage | `--storage columnar --execution volcano` | `--storage columnar --execution vectorized` |
`--storage row --execution vectorized` is not supported — vectorized execution is designed for and built on top of columnar storage. The three supported combinations allow clean isolation of storage gains vs execution gains in benchmarks.
Phase 1 — Volcano / Iterator model:
Each operator implements:
```cpp
void open();   // initialize state
Row* next();   // return next row, nullptr when exhausted
void close();  // release resources
```

Each concrete operator implements this contract; a sketch of `FilterNode` follows the table.

| Operator | Behaviour |
|---|---|
| `SeqScanNode` | Returns rows one at a time from the loaded row vector |
| `FilterNode` | Calls child, evaluates predicate (including `IS NULL`), discards non-matching rows |
| `ProjectNode` | Calls child, evaluates select expressions, emits projected row |
| `HashAggregateNode` | Consumes all child rows into a hash map, emits one result row per group |
| `HavingNode` | Calls child, evaluates post-aggregation predicate, discards non-matching groups |
| `DistinctNode` | Calls child, tracks seen rows in a hash set, suppresses duplicates |
| `SortNode` | Consumes all child rows, sorts, emits in order |
| `LimitNode` | Passes rows through until N have been emitted |
| `HashJoinNode` | Build phase: smaller table into hash map. Probe phase: larger table probed row by row |
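A hypothetical `FilterNode` under this interface might look like the following; the `Row`, `Expr`, and evaluator stand-ins are placeholders, not the project's actual types:

```cpp
#include <memory>

struct Row {};   // stands in for the common-layer row type
struct Expr {};  // stands in for the predicate AST

struct PlanNode {
    virtual ~PlanNode() = default;
    virtual void open() = 0;
    virtual Row* next() = 0;
    virtual void close() = 0;
};

struct FilterNode : PlanNode {
    std::unique_ptr<PlanNode> child;
    const Expr* predicate = nullptr;

    void open() override { child->open(); }
    void close() override { child->close(); }

    // Pull rows from the child until one passes the predicate; each call
    // returns at most one row, so filtering adds no buffering.
    Row* next() override {
        while (Row* row = child->next()) {
            if (evaluates_true(predicate, *row)) return row;
        }
        return nullptr;  // child exhausted
    }

private:
    // Stand-in for the engine's expression evaluator.
    static bool evaluates_true(const Expr*, const Row&) { return true; }
};
```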
Phase 3 — Vectorized model:
Instead of one row at a time, operators exchange chunks. Late materialization is a first-class design principle: VecFilterNode produces a SelectionVector of valid row indices without copying or materializing data — columns are only fully materialized at VecProjectNode at the top of the pipeline.
```cpp
struct DataChunk {
    std::vector<ColumnVector> columns;
    int num_rows = 0;
};

struct SelectionVector {
    std::vector<int> indices;  // valid row indices within the chunk
    int size = 0;
};
```

| Operator | Behaviour |
|---|---|
| `VecScanNode` | Reads 1024 rows at a time from `ColumnarTable`, returns `DataChunk*` |
| `VecFilterNode` | Evaluates predicate across all rows in a tight loop, produces `SelectionVector` — no data copied |
| `VecProjectNode` | Materializes only required columns for rows passing the selection vector |
| `VecHashAggregateNode` | Processes one chunk at a time, updates group-by hash map in batch |
| `VecHashJoinNode` | Probe phase operates over `DataChunk` — batch lookup into build-side hash map |
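A sketch of the tight loop behind `VecFilterNode`, assuming an `int64_t` column and a constant `>` comparison; all names here are illustrative:

```cpp
#include <cstdint>
#include <vector>

// Valid row indices within one chunk (mirrors the struct above).
struct SelectionVector {
    std::vector<int> indices;
    int size = 0;
};

// One tight pass over a column: record indices of passing rows, copy no
// row data. The branchy push_back is the simplest form; production
// engines often use branch-free writes here.
SelectionVector filter_greater_than(const std::vector<int64_t>& column,
                                    int num_rows, int64_t threshold) {
    SelectionVector sel;
    sel.indices.reserve(num_rows);
    for (int i = 0; i < num_rows; ++i) {
        if (column[i] > threshold) sel.indices.push_back(i);
    }
    sel.size = static_cast<int>(sel.indices.size());
    return sel;
}
```

Because only indices are collected, a highly selective filter touches no row data at all until `VecProjectNode` materializes the survivors.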
Query result cache: keyed on the raw SQL string. On a cache hit, cached result rows are returned without touching storage or execution. The cache is in-memory for the lifetime of the process and bypassed with `--no-cache`.
```cpp
std::unordered_map<std::string, std::vector<Row>> result_cache;
```

Directly analogous to Snowflake's result cache.
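The lookup around that map is small enough to sketch in full; the `run_query` wrapper here is hypothetical:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct Row {};  // stands in for the common-layer row type

// Illustrative cache check: identical query strings are served from
// memory; anything else falls through to full execution.
std::vector<Row> run_query(const std::string& sql, bool use_cache,
                           std::unordered_map<std::string, std::vector<Row>>& cache) {
    if (use_cache) {
        auto it = cache.find(sql);
        if (it != cache.end()) return it->second;  // hit: no parse/plan/execute
    }
    std::vector<Row> result;  // ... parse, plan, and execute here ...
    if (use_cache) cache.emplace(sql, result);
    return result;
}
```

In this sketch, `--no-cache` simply passes `use_cache = false`.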
CLI (`src/cli/`):

```sh
./swiftql --catalog catalog.json --query "..."
./swiftql --catalog catalog.json --storage columnar --execution vectorized --query "..."
./swiftql --catalog catalog.json --query "..." --explain
./swiftql --catalog catalog.json --query "..." --explain-analyze
./swiftql --catalog catalog.json --query "..." --no-cache
./swiftql --catalog catalog.json --query "..." --no-optimize
```

Python tools (`python_tools/`):

| Script | Purpose |
|---|---|
| `generate_data.py` | Generates synthetic F1 CSVs at configurable scale (1k / 100k / 1M rows) |
| `run_queries.py` | Runs a query file against SwiftQL and captures output |
| `compare_against_sqlite.py` | Runs same queries against SQLite, diffs results — correctness oracle |
| `benchmark.py` | Automates benchmark runs across all modes, generates results table and matplotlib plots |
F1-themed tables generated synthetically via Python scripts.
Note: Once the MVP is complete, TPC-H benchmark queries will be used for formal performance evaluation.

| Table | Columns |
|---|---|
| `laps` | lap_id, driver_id, team, speed, sector_1, sector_2, sector_3, season, round |
| `drivers` | driver_id, name, nationality, team, age |
| `races` | race_id, round, circuit, country, season |
| `pit_stops` | stop_id, lap_id, driver_id, duration_ms, season |
Goal: A working end-to-end SQL engine covering the full SQL surface area of the project. User types a query, engine returns correct results. Nothing fast yet — just correct.
Hash join is parsed and planned in this phase but execution is stubbed — join queries return a clean "not yet implemented" error at runtime. This keeps the SQL surface area complete from the start without coupling join execution to row storage.
- Full folder structure and CMake build system with GoogleTest
- `TypeId`, `Value` (with null state), `ColumnDef`, `Schema`, `Row`
- Comparison operators on `Value`
- Unit tests: construct rows manually, assert types and comparisons
Checkpoint: Build system works. Value/Schema/Row solid and tested.
- `TableMetadata`, `Catalog` with JSON loading via nlohmann/json
- `CSVLoader::load(filepath, schema)` → `std::vector<Row>`
- `generate_data.py` — generates F1 CSVs from day one
Checkpoint: Catalog resolves table names. CSV loads into typed rows.
- `TokenType` enum covering all keywords (`SELECT`, `FROM`, `WHERE`, `GROUP`, `BY`, `HAVING`, `ORDER`, `LIMIT`, `DISTINCT`, `JOIN`, `ON`, `IS`, `NULL`, `AND`, `OR`, `NOT`, `AS`, `COUNT`, `SUM`, `AVG`, `MIN`, `MAX`), operators, literals, punctuation
- `Token` struct with type, raw value, line/col for error messages
- `Lexer` with `nextToken()` and `peek()`
- AST node structs: `ColumnRef`, `Literal`, `BinaryExpr`, `IsNullExpr`, `AggregateExpr`, `SelectStatement`
Checkpoint: Lexer correctly tokenizes the full SQL target subset.
- `Parser` class consuming `Lexer` output
- One method per grammar rule
- Operator precedence: OR → AND → comparison → primary
- Support for `DISTINCT`, `JOIN ... ON`, `HAVING`, `IS NULL` / `IS NOT NULL`
- `ParseError` with message and position on unexpected tokens
Checkpoint: Parser produces correct AST for all target query patterns including joins, having, distinct, and null predicates.
- `Validator` — semantic checks against the catalog, including join column validation and having/group-by consistency
- `PlanNode` abstract base with `open()`, `next()`, `close()`
- Plan node classes: `SeqScanNode`, `FilterNode`, `ProjectNode`, `HashAggregateNode`, `HavingNode`, `DistinctNode`, `SortNode`, `LimitNode`, `HashJoinNode` (stubbed)
- `Planner::plan(SelectStatement, Catalog, table_rows)` → `PlanNode*` tree — accepts pre-loaded rows; planner performs no I/O
Checkpoint: Plan trees built correctly for all query types. Join queries plan but return "not yet implemented" at execution. Bad queries rejected with clean error messages.
- `Value evaluate(Expr*, const Row&, const Schema&)` — handles `ColumnRef`, `Literal`, `BinaryExpr`, `IsNullExpr`
- Full operator implementations: `SeqScan`, `Filter`, `Project`, `HashAggregate`, `Having`, `Distinct`, `Sort`, `Limit`
- Null handling: null values propagate correctly through expressions; `IS NULL` / `IS NOT NULL` evaluate correctly; nulls display as `NULL` in output (see the evaluator sketch below)
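A simplified sketch of how null propagation and `IS NULL` can flow through the evaluator; the `Value` and `Expr` shapes here are pared-down stand-ins, not the project's actual definitions:

```cpp
#include <cstdint>
#include <memory>
#include <variant>

// Pared-down stand-ins for the engine's Value and Expr types.
struct Value {
    std::variant<int64_t, double> data;
    bool is_null = false;
};
struct Expr { virtual ~Expr() = default; };
struct Literal : Expr { Value value; };
struct IsNullExpr : Expr { std::unique_ptr<Expr> child; bool is_not_null = false; };
struct BinaryExpr : Expr { std::unique_ptr<Expr> left, right; /* op elided */ };

Value evaluate(const Expr* e) {
    if (auto lit = dynamic_cast<const Literal*>(e)) return lit->value;
    if (auto isnull = dynamic_cast<const IsNullExpr*>(e)) {
        // IS [NOT] NULL inspects nullness instead of propagating it.
        bool null = evaluate(isnull->child.get()).is_null;
        return Value{int64_t(null != isnull->is_not_null), false};
    }
    if (auto bin = dynamic_cast<const BinaryExpr*>(e)) {
        Value l = evaluate(bin->left.get());
        Value r = evaluate(bin->right.get());
        if (l.is_null || r.is_null) return Value{int64_t{0}, true};  // null propagates
        return Value{};  // ...apply the actual operator to l.data and r.data...
    }
    return Value{};  // ColumnRef and aggregates elided
}
```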
Checkpoint: `SELECT DISTINCT team, AVG(speed) FROM laps WHERE season = 2025 AND speed IS NOT NULL GROUP BY team HAVING AVG(speed) > 300 ORDER BY team LIMIT 10` returns correct results.
- `main.cc` with `--catalog`, `--query`, `--storage`, `--execution`, `--explain`, `--explain-analyze`, `--no-cache`, `--no-optimize` args
- Aligned result printer with null display
- `EXPLAIN ANALYZE` — executes query; per-node exclusive self-time (child time excluded) and % of execution total; footer shows rows returned and parse/plan/execution breakdown (CSV load excluded from all timers, consistent with TPC-H benchmark methodology)
- Query result cache — `unordered_map<string, vector<Row>>`, bypassed with `--no-cache`
- `compare_against_sqlite.py` correctness harness — 20+ test queries passing vs SQLite
- Consistent error handling throughout — no crashes on bad input
Checkpoint: Phase 1 complete. All 20+ test queries pass vs SQLite. `--explain` and `--explain-analyze` work. Result cache demonstrated. Project fully demonstrable.
Goal: Replace row storage with a columnar layout. Add encodings and zone-map pruning. Wire up hash join execution over the columnar storage layer. Benchmark against Phase 1.
- `ColumnArray` typed column arrays
- `ColumnarTable` — collection of columns + schema + row count
- `CSVToColumnar` converter — CSV rows transposed into column arrays
- `SeqScanNode` rewritten to operate on `ColumnarTable` by row index under `--storage columnar`
- All 20+ test queries still pass
Checkpoint: Engine correct on columnar layout. Both storage modes accessible via the `--storage` flag.
- `required_columns` set pushed down to `SeqScanNode` — planner determines which columns are needed, scan skips the rest
- `DictionaryEncoder` for string columns — unique strings mapped to `int32_t` IDs
- `RLEColumn` for repeated integer columns — stored as `(value, run_length)` pairs
- Storage size measured and recorded before/after encoding
Checkpoint: Fewer columns touched per query. Storage size reduced and measured.
- Each column split into `ColumnChunk`s of 8,192 rows
- Each chunk stores min, max, row count metadata
- `ChunkPruner` skips chunks provably non-matching for simple predicates (`col = val`, `col < val`, `col > val`)
- Wired into `SeqScanNode` — skipped chunks never accessed (pruning rule sketched below)
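The pruning rule itself is a small interval check per chunk; a minimal sketch under assumed names:

```cpp
#include <cstdint>

// Per-chunk zone-map metadata (names illustrative).
struct ChunkStats {
    int64_t min;
    int64_t max;
    int row_count;
};

enum class CmpOp { Eq, Lt, Gt };

// A chunk can be skipped only when the zone map *proves* no row matches.
bool can_skip_chunk(const ChunkStats& c, CmpOp op, int64_t val) {
    switch (op) {
        case CmpOp::Eq: return val < c.min || val > c.max;  // val outside [min, max]
        case CmpOp::Lt: return c.min >= val;                // no row below val
        case CmpOp::Gt: return c.max <= val;                // no row above val
    }
    return false;
}
```

Note the asymmetry: a zone map can only prove absence. Chunks whose [min, max] range overlaps the predicate must still be scanned row by row.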
Checkpoint: Selective queries skip chunks. Chunk skip count and speedup measured on large dataset.
- `HashJoinNode` execution wired up over columnar storage
- Build phase: scan the smaller table (by row count), populate `std::unordered_map<Value, std::vector<Row>>`
- Probe phase: scan the larger table row by row, probe hash map, emit joined rows (build/probe sketched below)
- Join queries execute end-to-end correctly
- All join test queries added to correctness harness and verified against SQLite
Checkpoint: Join queries execute correctly over columnar storage. Results match SQLite.
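A compact sketch of the build/probe pattern, simplified to integer join keys and index output; the real operator keys on `Value` and emits joined rows:

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Inner equi-join on integer keys: build a hash table on the smaller
// side, then stream the larger side through it.
std::vector<std::pair<int, int>> hash_join(const std::vector<int64_t>& build_keys,
                                           const std::vector<int64_t>& probe_keys) {
    // Build phase: key -> row indices on the build side (handles duplicates).
    std::unordered_map<int64_t, std::vector<int>> table;
    table.reserve(build_keys.size());
    for (int i = 0; i < (int)build_keys.size(); ++i)
        table[build_keys[i]].push_back(i);

    // Probe phase: each probe row emits one output pair per matching build row.
    std::vector<std::pair<int, int>> matches;  // (build_row, probe_row)
    for (int j = 0; j < (int)probe_keys.size(); ++j) {
        auto it = table.find(probe_keys[j]);
        if (it != table.end())
            for (int i : it->second) matches.emplace_back(i, j);
    }
    return matches;
}
```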
Benchmark queries on the 1M-row dataset across `--storage row` and `--storage columnar` modes:
Note: Benchmark times measured after CSV load to isolate query execution performance.

| Query | What it tests |
|---|---|
| `SELECT AVG(speed) FROM laps` | Full column scan aggregate |
| `SELECT COUNT(*) FROM laps WHERE season = 2025` | Selective filter + zone-map pruning |
| `SELECT team, speed FROM laps WHERE speed > 300` | Projection of 2 of laps' 9 columns |
| `SELECT team, COUNT(*) FROM laps GROUP BY team` | Group by on dictionary-encoded string column |
| `SELECT l.team, AVG(l.speed) FROM laps l JOIN drivers d ON l.driver_id = d.driver_id GROUP BY l.team` | Hash join + aggregate |
Metrics per query: latency (ms, average of 5 runs), rows/sec, storage size.
Checkpoint: Row vs columnar benchmark numbers documented. Phase 2 demonstrably faster on analytical queries. Codebase cleaned and documented.
Goal: Replace the row-at-a-time Volcano model with batch processing over columnar storage. Late materialization is a first-class design principle. Demonstrate and measure the speedup over Phase 2.
- `DataChunk` and `SelectionVector` abstractions
- `VecScanNode` — reads 1024 rows at a time from `ColumnarTable`, returns `DataChunk*`
- New operator interface: `virtual DataChunk* nextChunk() = 0`
- Volcano operators remain intact — both execution paths coexist, selected via `--execution`
Checkpoint: VecScan returns correct chunks. Total row count across all chunks equals table size.
- `VecFilterNode` — evaluates predicate across entire chunk in a tight loop, produces `SelectionVector` — no data is copied or materialized
- `VecProjectNode` — only columns required for output are materialized, only for rows passing the selection vector — late materialization made explicit
- `EXPLAIN ANALYZE` updated to report materialization points per operator
- All 20+ test queries pass on vectorized path
Checkpoint: Selection vector pattern working. Late materialization documented in EXPLAIN ANALYZE output. Vectorized path correct.
- `VecHashAggregateNode` — processes one chunk at a time, updates group-by hash map in batch; dictionary-encoded string columns use integer ID comparison in the hot loop
- `VecHashJoinNode` — probe phase operates over `DataChunk`, batch lookup into build-side hash map
- Benchmark: same 5 queries across all three supported mode combinations
- Batch size experiment on `SELECT AVG(speed) FROM laps`: sizes 128, 256, 512, 1024, 2048 — latency recorded for each, sweet spot documented
Checkpoint: All three execution mode combinations benchmarked. Batch size sensitivity documented. Vectorized hash join correct.
Goal: Replace the rule-based planner with a statistics-driven optimizer. Use table and column statistics to make smarter planning decisions. Measure the impact independently of storage and execution changes.
- `ColumnStats` — min, max, distinct count per column
- `TableStats` — row count + map of column name → `ColumnStats`
- Statistics computed at table load time and stored in `Catalog` (single-pass computation sketched below)
- `EXPLAIN ANALYZE` updated to show estimated vs actual row counts per plan node
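A sketch of the single-pass statistics computation at load time; container shapes are illustrative:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Illustrative per-column statistics, filled in one pass at load time.
struct ColumnStats {
    int64_t min = 0;
    int64_t max = 0;
    std::size_t distinct_count = 0;
};

ColumnStats compute_stats(const std::vector<int64_t>& column) {
    ColumnStats stats;
    if (column.empty()) return stats;
    stats.min = stats.max = column[0];
    std::unordered_set<int64_t> distinct;
    for (int64_t v : column) {
        stats.min = std::min(stats.min, v);
        stats.max = std::max(stats.max, v);
        distinct.insert(v);  // exact distinct count; fine at this scale
    }
    stats.distinct_count = distinct.size();
    return stats;
}
```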
Checkpoint: Stats populated for all tables on load. Visible in EXPLAIN ANALYZE output.
- Formal predicate pushdown — `FilterNode`s moved as close to `SeqScanNode` as possible in the plan tree, reducing rows flowing through upstream operators
- Predicate reordering — when multiple predicates exist in `WHERE`, the predicate expected to pass the fewest rows is evaluated first; for an equality predicate the estimated selectivity is `1 / distinct_count`, so higher-cardinality columns filter first (sketched below)
- Join side selection — smaller table by `row_count` chosen as build side in `HashJoinNode`; optimizer can override the order tables appear in the query
- Optimizer implemented as a plan tree rewrite pass between planning and execution
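Predicate reordering can then be a sort over stats-derived selectivity estimates; a sketch assuming the `1 / distinct_count` equality model, with all names illustrative:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One conjunct of the WHERE clause plus the stats-derived estimate of the
// fraction of rows it lets through (smaller = more selective).
struct Predicate {
    std::string column;
    double estimated_selectivity;  // e.g. 1.0 / distinct_count for `col = val`
};

// Evaluate the hardest-filtering predicates first so later, costlier
// conjuncts run on as few rows as possible.
void reorder_predicates(std::vector<Predicate>& conjuncts) {
    std::stable_sort(conjuncts.begin(), conjuncts.end(),
                     [](const Predicate& a, const Predicate& b) {
                         return a.estimated_selectivity < b.estimated_selectivity;
                     });
}
```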
Checkpoint: Plan trees visibly reordered by optimizer. EXPLAIN shows pre- and post-optimization plans. Predicate ordering and join sides correct.
- Benchmark optimizer gains in isolation: same queries, same storage and execution mode, optimizer on vs off via `--no-optimize`
- Document which query types benefit most, which are unaffected, and why
- Full 4-phase benchmark comparison — every benchmark query across every phase
- `benchmark.py` generates final comparison plots: latency by phase, rows/sec by phase, batch size sensitivity curve, optimizer impact
Checkpoint: Optimizer gains isolated and documented. Full benchmark suite complete. All plots generated.
| Week | Focus | Checkpoint |
|---|---|---|
| 1 | Scaffold + Common layer | Build system works, Value/Schema/Row tested |
| 2 | Catalog + CSV loader | Tables load from JSON + CSV |
| 3 | Lexer + AST nodes | Tokenizer correct for full SQL subset |
| 4 | Recursive descent parser | AST produced for all target queries incl. JOIN, HAVING, DISTINCT, IS NULL |
| 5 | Planner + validator | Plan trees built, join stubbed, bad queries rejected cleanly |
| 6 | Expression eval + operators | End-to-end queries correct incl. HAVING, DISTINCT, IS NULL, LIMIT |
| 7 | CLI + EXPLAIN ANALYZE + cache + tests | 20+ queries pass vs SQLite, EXPLAIN ANALYZE works, cache demonstrated |
| 8 | Columnar layout | Engine correct on columnar storage, --storage flag works |
| 9 | Projection pushdown + encodings | Fewer columns touched, storage size reduced and measured |
| 10 | Zone-map chunk pruning | Chunks skipped on selective queries, speedup measured |
| 11 | Hash join execution | Join queries execute correctly over columnar storage |
| 12 | Phase 2 benchmarks | Row vs columnar numbers + join benchmarks documented |
| 13 | DataChunk + VecScan | Batch reads correct, row count verified |
| 14 | VecFilter + VecProject + late materialization | Selection vector pattern correct, materialization documented |
| 15 | VecAggregate + VecHashJoin + Phase 3 benchmarks | All 3 mode combos benchmarked, batch size tuned |
| 16 | TableStats + ColumnStats | Stats populated on load, visible in EXPLAIN ANALYZE |
| 17 | Optimizer rules | Predicate pushdown, reordering, join side selection working |
| 18 | Phase 4 benchmarks + full suite | Optimizer gains isolated, all plots generated |
| 19 | Full integration pass | All modes correct, full test suite passing, clean build |
| 20 | README final + verbal prep | Project recruiting-ready |
To be populated during Weeks 12, 15, and 18.
Note: All benchmark times measured after CSV load to isolate query execution performance. Once the MVP is complete, TPC-H benchmark queries will be used for formal evaluation.

| Query | Row + Volcano (ms) | Col + Volcano (ms) | Col + Vectorized (ms) |
|---|---|---|---|
| `SELECT AVG(speed) FROM laps` | — | — | — |
| `SELECT COUNT(*) FROM laps WHERE season = 2025` | — | — | — |
| `SELECT team, speed FROM laps WHERE speed > 300` | — | — | — |
| `SELECT team, COUNT(*) FROM laps GROUP BY team` | — | — | — |
| `SELECT l.team, AVG(l.speed) FROM laps l JOIN drivers d ON l.driver_id = d.driver_id GROUP BY l.team` | — | — | — |
Optimizer impact:

| Query | No Optimizer (ms) | With Optimizer (ms) |
|---|---|---|
| `SELECT AVG(speed) FROM laps WHERE season = 2025 AND speed > 300` | — | — |
| `SELECT l.team, COUNT(*) FROM laps l JOIN drivers d ON l.driver_id = d.driver_id GROUP BY l.team` | — | — |
Batch size sensitivity:

| Batch Size | Latency (ms) |
|---|---|
| 128 | — |
| 256 | — |
| 512 | — |
| 1024 | — |
| 2048 | — |
```sh
# Clone the repository
git clone https://github.com/yourname/swiftql.git
cd swiftql
# Generate test data
python3 python_tools/generate_data.py --rows 100000
# Build
mkdir build && cd build
cmake ..
make -j$(nproc)
# Run tests
./tests/swiftql_tests
```

```sh
# Run a query (defaults: row storage, volcano execution)
./swiftql --catalog catalog.json --query "SELECT team, AVG(speed) FROM laps WHERE season = 2025 GROUP BY team"
# Use columnar storage + vectorized execution
./swiftql --catalog catalog.json --storage columnar --execution vectorized --query "..."
# Print the query plan without executing
./swiftql --catalog catalog.json --query "..." --explain
# Execute and profile each plan node
./swiftql --catalog catalog.json --query "..." --explain-analyze
# Bypass the result cache
./swiftql --catalog catalog.json --query "..." --no-cache
# Disable the optimizer
./swiftql --catalog catalog.json --query "..." --no-optimize
```

Example output:

```
team AVG(speed)
-----------------------
Ferrari 312.45
McLaren 308.91
Mercedes 310.17
```

Example `--explain` output:

```
Project [team, AVG(speed)]
Aggregate [group_by=team, agg=AVG(speed)]
Filter [season = 2025]
SeqScan [laps, 4 columns]
```

Example `--explain-analyze` output:

```
Project [team, AVG(speed)] rows_out=3 time=0.1ms (0.1%)
Aggregate [group_by=team] rows_in=48203 rows_out=3 time=12.4ms (17.2%)
Filter [season = 2025] rows_in=1000000 rows_out=48203 time=38.2ms (53.1%)
SeqScan [laps, 4 columns] rows_out=1000000 time=21.3ms (29.6%)
Rows returned: 3
Parse: 1.2ms
Plan: 0.8ms
Execution: 72.0ms
```

- No write path — `INSERT`, `UPDATE`, `DELETE` are not supported
- No `CREATE TABLE` SQL — tables must be registered via `catalog.json`
- Single join only — multi-way joins not supported
- No subqueries or correlated expressions
- Null handling scoped to `IS NULL` / `IS NOT NULL` predicates — full three-valued logic not implemented
- Commas inside string values not supported in CSV input
- No persistence beyond CSV files and catalog JSON
- Optimizer uses simple heuristics — no dynamic programming join ordering
- Result cache invalidation not implemented — cache is cleared on process restart only
If the project completes ahead of schedule, the following extensions are candidates:
- Binary columnar file format — serialize `ColumnarTable` to a simple binary format on first load, read from binary on subsequent runs; eliminates CSV parsing overhead on cold start, analogous to how Parquet works
- Parallel scan + parallel aggregation — partition table chunks across threads using a thread pool; per-thread aggregation maps merged at the end; expected 2–4× speedup on scan-heavy workloads on multi-core systems
Built as a learning project targeting internship roles at Snowflake and Databricks. Each phase is independently demonstrable with correctness tests and benchmarks.