Add imdb SQL benchmark #22680
adriangb
left a comment
I scrolled over this and it all looks good to me. But at 6k LOC I can't say I read each one. I also kicked off a Copilot review to see if it catches anything. If not, this is ready to merge from my side.
I assume this was already the case? Thanks for investigating the root cause.
Pull request overview
Adds an IMDB benchmark suite to DataFusion’s SQL benchmark framework (benchmarks/sql_benchmarks), extending the existing migration effort from issue #21706.
Changes:
- Adds IMDB table init SQL for Parquet and CSV, plus cleanup SQL.
- Adds IMDB benchmark definitions (`.benchmark`) for queries Q01a–Q33c.
- Updates `BENCH_QUERY` env parsing to accept non-numeric query identifiers (now `Option<String>`).
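
To illustrate the last point, here is a minimal sketch of what string-based `BENCH_QUERY` handling could look like. It is not the code from `benchmarks/benches/sql.rs`; the function names `parse_bench_query`, `bench_query_from_env`, and `matches_query` are hypothetical, and matching on the benchmark file stem is an assumption made for the example.

```rust
/// Hypothetical helper: normalise a raw BENCH_QUERY value.
/// Blank values are treated as "no filter" (an assumption of this sketch).
fn parse_bench_query(raw: Option<&str>) -> Option<String> {
    raw.map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_owned)
}

/// Hypothetical helper: read BENCH_QUERY from the environment as a string,
/// so identifiers with a letter suffix such as "16a" are accepted.
fn bench_query_from_env() -> Option<String> {
    parse_bench_query(std::env::var("BENCH_QUERY").ok().as_deref())
}

/// Hypothetical helper: decide whether a benchmark file stem (e.g. "16a"
/// for "16a.benchmark") is selected by the requested query identifier.
fn matches_query(file_stem: &str, query: Option<&str>) -> bool {
    match query {
        // No filter requested: every benchmark is selected.
        None => true,
        // Compare as strings instead of parsing a number.
        Some(q) => file_stem.eq_ignore_ascii_case(q),
    }
}
```

Keeping the value as a string avoids a numeric parse that would reject suffixes like `a` or `c`; a caller would pass `bench_query_from_env().as_deref()` to `matches_query` for each benchmark file.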
Reviewed changes
Copilot reviewed 117 out of 117 changed files in this pull request and generated 27 comments.
| File | Description |
|---|---|
| benchmarks/benches/sql.rs | Accept BENCH_QUERY as a string (enables suffix queries like 16a). |
| benchmarks/sql_benchmarks/imdb/init/cleanup.sql | Drops IMDB tables after each benchmark run. |
| benchmarks/sql_benchmarks/imdb/init/load_csv.sql | Creates IMDB external tables backed by CSV files. |
| benchmarks/sql_benchmarks/imdb/init/load_parquet.sql | Creates IMDB external tables backed by Parquet files. |
| benchmarks/sql_benchmarks/imdb/benchmarks/01a.benchmark | IMDB benchmark query Q01a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/01b.benchmark | IMDB benchmark query Q01b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/01c.benchmark | IMDB benchmark query Q01c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/01d.benchmark | IMDB benchmark query Q01d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/02a.benchmark | IMDB benchmark query Q02a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/02b.benchmark | IMDB benchmark query Q02b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/02c.benchmark | IMDB benchmark query Q02c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/02d.benchmark | IMDB benchmark query Q02d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/03a.benchmark | IMDB benchmark query Q03a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/03b.benchmark | IMDB benchmark query Q03b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/03c.benchmark | IMDB benchmark query Q03c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/04a.benchmark | IMDB benchmark query Q04a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/04b.benchmark | IMDB benchmark query Q04b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/04c.benchmark | IMDB benchmark query Q04c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/05a.benchmark | IMDB benchmark query Q05a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/05b.benchmark | IMDB benchmark query Q05b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/05c.benchmark | IMDB benchmark query Q05c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/06a.benchmark | IMDB benchmark query Q06a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/06b.benchmark | IMDB benchmark query Q06b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/06c.benchmark | IMDB benchmark query Q06c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/06d.benchmark | IMDB benchmark query Q06d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/06e.benchmark | IMDB benchmark query Q06e. |
| benchmarks/sql_benchmarks/imdb/benchmarks/06f.benchmark | IMDB benchmark query Q06f. |
| benchmarks/sql_benchmarks/imdb/benchmarks/07a.benchmark | IMDB benchmark query Q07a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/07b.benchmark | IMDB benchmark query Q07b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/07c.benchmark | IMDB benchmark query Q07c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/08a.benchmark | IMDB benchmark query Q08a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/08b.benchmark | IMDB benchmark query Q08b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/08c.benchmark | IMDB benchmark query Q08c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/08d.benchmark | IMDB benchmark query Q08d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/09a.benchmark | IMDB benchmark query Q09a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/09b.benchmark | IMDB benchmark query Q09b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/09c.benchmark | IMDB benchmark query Q09c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/09d.benchmark | IMDB benchmark query Q09d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/10a.benchmark | IMDB benchmark query Q10a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/10b.benchmark | IMDB benchmark query Q10b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/10c.benchmark | IMDB benchmark query Q10c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/11a.benchmark | IMDB benchmark query Q11a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/11b.benchmark | IMDB benchmark query Q11b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/11c.benchmark | IMDB benchmark query Q11c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/11d.benchmark | IMDB benchmark query Q11d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/12a.benchmark | IMDB benchmark query Q12a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/12b.benchmark | IMDB benchmark query Q12b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/12c.benchmark | IMDB benchmark query Q12c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/13a.benchmark | IMDB benchmark query Q13a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/13b.benchmark | IMDB benchmark query Q13b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/13c.benchmark | IMDB benchmark query Q13c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/13d.benchmark | IMDB benchmark query Q13d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/14a.benchmark | IMDB benchmark query Q14a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/14b.benchmark | IMDB benchmark query Q14b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/14c.benchmark | IMDB benchmark query Q14c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/15a.benchmark | IMDB benchmark query Q15a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/15b.benchmark | IMDB benchmark query Q15b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/15c.benchmark | IMDB benchmark query Q15c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/15d.benchmark | IMDB benchmark query Q15d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/16a.benchmark | IMDB benchmark query Q16a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/16b.benchmark | IMDB benchmark query Q16b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/16c.benchmark | IMDB benchmark query Q16c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/16d.benchmark | IMDB benchmark query Q16d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/17a.benchmark | IMDB benchmark query Q17a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/17b.benchmark | IMDB benchmark query Q17b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/17c.benchmark | IMDB benchmark query Q17c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/17d.benchmark | IMDB benchmark query Q17d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/17e.benchmark | IMDB benchmark query Q17e. |
| benchmarks/sql_benchmarks/imdb/benchmarks/17f.benchmark | IMDB benchmark query Q17f. |
| benchmarks/sql_benchmarks/imdb/benchmarks/18a.benchmark | IMDB benchmark query Q18a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/18b.benchmark | IMDB benchmark query Q18b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/18c.benchmark | IMDB benchmark query Q18c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/19a.benchmark | IMDB benchmark query Q19a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/19b.benchmark | IMDB benchmark query Q19b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/19c.benchmark | IMDB benchmark query Q19c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/19d.benchmark | IMDB benchmark query Q19d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/20a.benchmark | IMDB benchmark query Q20a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/20b.benchmark | IMDB benchmark query Q20b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/20c.benchmark | IMDB benchmark query Q20c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/21a.benchmark | IMDB benchmark query Q21a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/21b.benchmark | IMDB benchmark query Q21b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/21c.benchmark | IMDB benchmark query Q21c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/22a.benchmark | IMDB benchmark query Q22a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/22b.benchmark | IMDB benchmark query Q22b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/22c.benchmark | IMDB benchmark query Q22c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/22d.benchmark | IMDB benchmark query Q22d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/23a.benchmark | IMDB benchmark query Q23a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/23b.benchmark | IMDB benchmark query Q23b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/23c.benchmark | IMDB benchmark query Q23c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/24a.benchmark | IMDB benchmark query Q24a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/24b.benchmark | IMDB benchmark query Q24b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/25a.benchmark | IMDB benchmark query Q25a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/25b.benchmark | IMDB benchmark query Q25b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/25c.benchmark | IMDB benchmark query Q25c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/26a.benchmark | IMDB benchmark query Q26a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/26b.benchmark | IMDB benchmark query Q26b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/26c.benchmark | IMDB benchmark query Q26c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/27a.benchmark | IMDB benchmark query Q27a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/27b.benchmark | IMDB benchmark query Q27b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/27c.benchmark | IMDB benchmark query Q27c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/28a.benchmark | IMDB benchmark query Q28a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/28b.benchmark | IMDB benchmark query Q28b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/28c.benchmark | IMDB benchmark query Q28c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/29a.benchmark | IMDB benchmark query Q29a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/29b.benchmark | IMDB benchmark query Q29b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/29c.benchmark | IMDB benchmark query Q29c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/30a.benchmark | IMDB benchmark query Q30a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/30b.benchmark | IMDB benchmark query Q30b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/30c.benchmark | IMDB benchmark query Q30c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/31a.benchmark | IMDB benchmark query Q31a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/31b.benchmark | IMDB benchmark query Q31b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/31c.benchmark | IMDB benchmark query Q31c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/32a.benchmark | IMDB benchmark query Q32a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/32b.benchmark | IMDB benchmark query Q32b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/33a.benchmark | IMDB benchmark query Q33a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/33b.benchmark | IMDB benchmark query Q33b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/33c.benchmark | IMDB benchmark query Q33c. |
    DROP TABLE aka_name;
    DROP TABLE aka_title;
    DROP TABLE cast_info;
    DROP TABLE char_name;
    DROP TABLE comp_cast_type;
    DROP TABLE company_name;
    DROP TABLE company_type;
    DROP TABLE complete_cast;
    DROP TABLE info_type;
    DROP TABLE keyword;
    DROP TABLE kind_type;
I see no harm in using IF EXISTS. It's just a defense in depth against confusing error messages if things fail. Wdyt @Omega359 ?
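
As a rough sketch of the suggestion, the snippet below generates `DROP TABLE IF EXISTS` statements for the tables visible in the quoted excerpt. The helper `cleanup_sql` and the `TABLES` constant are hypothetical and only illustrate the shape of the statements; the PR keeps its cleanup as a hand-written SQL file, and the list covers only the tables shown in the excerpt, not the full IMDB schema.

```rust
/// Table names copied from the quoted cleanup.sql excerpt (not the full schema).
const TABLES: [&str; 11] = [
    "aka_name",
    "aka_title",
    "cast_info",
    "char_name",
    "comp_cast_type",
    "company_name",
    "company_type",
    "complete_cast",
    "info_type",
    "keyword",
    "kind_type",
];

/// Hypothetical helper: build one `DROP TABLE IF EXISTS` statement per table,
/// so cleanup does not fail when an earlier step never created a table.
fn cleanup_sql(tables: &[&str]) -> String {
    tables
        .iter()
        .map(|t| format!("DROP TABLE IF EXISTS {t};"))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Calling `cleanup_sql(&TABLES)` yields eleven statements, one per line, each of which is a no-op when the table is missing.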
    AND mc.movie_id = mi_idx.movie_id
    AND it.id = mi_idx.info_type_id;

    result sql_benchmarks/imdb/results/01a.csv
Some minor things to address @Omega359 🙏🏻
    name_pcode_nf varchar(5),
    surname_pcode varchar(5),
    md5sum varchar(32)
    ) STORED AS CSV LOCATION 'data/imdb/char_name.csv' OPTIONS ('has_header' 'false', 'format.delimiter' ',', 'format.escape' '\');
Which issue does this PR close?
Part of #21706
Rationale for this change
Continue work on the SQL benchmark migration.
What changes are included in this PR?
IMDB SQL benchmark.
Are these changes tested?
Yes
    BENCH_NAME=imdb IMDB_FILE_TYPE=csv cargo bench --bench sql
    BENCH_NAME=imdb IMDB_FILE_TYPE=parquet cargo bench --bench sql

Note that IMDB_FILE_TYPE=csv will OOM on most systems because CSV doesn't infer statistics and thus won't get scan predicates and dynamic filters pushed into DataSourceExec. As a result, queries such as 16a join large tables/intermediates before the selective filters have reduced the data enough to avoid an OOM (tested on a 96GB system). Setting PARTITION=1 does not solve the issue.
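
For readers unfamiliar with how a file-type switch like this might be wired up, the following is a hedged sketch of mapping an `IMDB_FILE_TYPE` value to one of the two init scripts listed in the file summary. The function `init_script_for` is hypothetical, and the trimming, case-insensitive matching, and error on unknown values are assumptions of the example, not the benchmark framework's documented behaviour.

```rust
/// Hypothetical helper: choose the init script for an IMDB_FILE_TYPE value.
/// The two paths are the init files listed in the PR's file summary.
fn init_script_for(file_type: &str) -> Result<&'static str, String> {
    match file_type.trim().to_ascii_lowercase().as_str() {
        "csv" => Ok("benchmarks/sql_benchmarks/imdb/init/load_csv.sql"),
        "parquet" => Ok("benchmarks/sql_benchmarks/imdb/init/load_parquet.sql"),
        // Rejecting unknown values is an assumption of this sketch.
        other => Err(format!("unsupported IMDB_FILE_TYPE: {other}")),
    }
}
```

Given the OOM caveat above, a caller would most likely pass `parquet` for routine runs and reserve `csv` for machines with enough memory.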
Are there any user-facing changes?
no