Add imdb SQL benchmark #22680

Open
Omega359 wants to merge 1 commit into apache:main from Omega359:sql-benchmarks/imdb

Conversation

@Omega359
Contributor

@Omega359 Omega359 commented Jun 1, 2026

Which issue does this PR close?

Part of #21706

Rationale for this change

Continues work on the SQL benchmark migration.

What changes are included in this PR?

Adds the IMDB SQL benchmark.

Are these changes tested?

Yes

BENCH_NAME=imdb IMDB_FILE_TYPE=csv cargo bench --bench sql
BENCH_NAME=imdb IMDB_FILE_TYPE=parquet cargo bench --bench sql

Note that IMDB_FILE_TYPE=csv will OOM on most systems because CSV doesn't infer statistics and thus won't get scan predicates and dynamic filters pushed into DataSourceExec. As a result, queries such as 16a join large tables/intermediates before enough of the selective filters have reduced the data size, and they run out of memory (tested on a 96GB system). Setting PARTITION=1 does not solve the issue.

Are there any user-facing changes?

No

@Omega359 Omega359 marked this pull request as ready for review June 1, 2026 14:45
@adriangb adriangb requested a review from Copilot June 2, 2026 01:28
Contributor

@adriangb adriangb left a comment

I scrolled over this and it all looks good to me. But at 6k LOC I can't say I read each one. I also kicked off a Copilot review to see if it catches anything. If not this is ready to merge by me.

@adriangb
Contributor

adriangb commented Jun 2, 2026

> Note that IMDB_FILE_TYPE=csv will OOM on most systems because CSV doesn't infer statistics and thus won't get scan predicates and dynamic filters pushed into DataSourceExec. As a result, queries such as 16a join large tables/intermediates before enough of the selective filters have reduced the data size, and they run out of memory (tested on a 96GB system). Setting PARTITION=1 does not solve the issue.

I assume this was already the case? Thanks for investigating the root cause.

Contributor

Copilot AI left a comment

Pull request overview

Adds an IMDB benchmark suite to DataFusion’s SQL benchmark framework (benchmarks/sql_benchmarks), extending the existing migration effort from issue #21706.

Changes:

  • Adds IMDB table init SQL for Parquet and CSV, plus cleanup SQL.
  • Adds IMDB benchmark definitions (.benchmark) for queries Q01a–Q33c.
  • Updates BENCH_QUERY env parsing to accept non-numeric query identifiers (now Option<String>).
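The BENCH_QUERY change in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code in benchmarks/benches/sql.rs: the function names `parse_bench_query`, `bench_query_from_env` and `matches_query` are hypothetical, and exact, case-insensitive matching against the benchmark file stem is an assumption.

```rust
use std::env;

/// Hypothetical helper: normalise a raw BENCH_QUERY value.
/// Returns None when the value is missing or blank, meaning "run every query".
fn parse_bench_query(raw: Option<String>) -> Option<String> {
    raw.map(|s| s.trim().to_ascii_lowercase())
        .filter(|s| !s.is_empty())
}

/// Hypothetical helper: read the filter from the environment.
/// Keeping the value as a string (not an integer) lets identifiers
/// with a letter suffix, such as "16a", pass through unchanged.
fn bench_query_from_env() -> Option<String> {
    parse_bench_query(env::var("BENCH_QUERY").ok())
}

/// Hypothetical matcher: compare a benchmark file stem (e.g. "16a" for
/// 16a.benchmark) with the optional filter. No filter matches everything.
fn matches_query(file_stem: &str, filter: Option<&str>) -> bool {
    match filter {
        None => true,
        Some(q) => file_stem.eq_ignore_ascii_case(q),
    }
}
```

With this shape, a call such as `matches_query("16a", bench_query_from_env().as_deref())` would select only 16a.benchmark when BENCH_QUERY=16a is set, and every benchmark when it is unset.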

Reviewed changes

Copilot reviewed 117 out of 117 changed files in this pull request and generated 27 comments.

Show a summary per file
| File | Description |
| --- | --- |
| benchmarks/benches/sql.rs | Accept BENCH_QUERY as a string (enables suffix queries like 16a). |
| benchmarks/sql_benchmarks/imdb/init/cleanup.sql | Drops IMDB tables after each benchmark run. |
| benchmarks/sql_benchmarks/imdb/init/load_csv.sql | Creates IMDB external tables backed by CSV files. |
| benchmarks/sql_benchmarks/imdb/init/load_parquet.sql | Creates IMDB external tables backed by Parquet files. |
| benchmarks/sql_benchmarks/imdb/benchmarks/01a.benchmark | IMDB benchmark query Q01a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/01b.benchmark | IMDB benchmark query Q01b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/01c.benchmark | IMDB benchmark query Q01c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/01d.benchmark | IMDB benchmark query Q01d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/02a.benchmark | IMDB benchmark query Q02a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/02b.benchmark | IMDB benchmark query Q02b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/02c.benchmark | IMDB benchmark query Q02c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/02d.benchmark | IMDB benchmark query Q02d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/03a.benchmark | IMDB benchmark query Q03a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/03b.benchmark | IMDB benchmark query Q03b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/03c.benchmark | IMDB benchmark query Q03c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/04a.benchmark | IMDB benchmark query Q04a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/04b.benchmark | IMDB benchmark query Q04b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/04c.benchmark | IMDB benchmark query Q04c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/05a.benchmark | IMDB benchmark query Q05a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/05b.benchmark | IMDB benchmark query Q05b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/05c.benchmark | IMDB benchmark query Q05c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/06a.benchmark | IMDB benchmark query Q06a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/06b.benchmark | IMDB benchmark query Q06b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/06c.benchmark | IMDB benchmark query Q06c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/06d.benchmark | IMDB benchmark query Q06d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/06e.benchmark | IMDB benchmark query Q06e. |
| benchmarks/sql_benchmarks/imdb/benchmarks/06f.benchmark | IMDB benchmark query Q06f. |
| benchmarks/sql_benchmarks/imdb/benchmarks/07a.benchmark | IMDB benchmark query Q07a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/07b.benchmark | IMDB benchmark query Q07b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/07c.benchmark | IMDB benchmark query Q07c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/08a.benchmark | IMDB benchmark query Q08a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/08b.benchmark | IMDB benchmark query Q08b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/08c.benchmark | IMDB benchmark query Q08c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/08d.benchmark | IMDB benchmark query Q08d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/09a.benchmark | IMDB benchmark query Q09a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/09b.benchmark | IMDB benchmark query Q09b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/09c.benchmark | IMDB benchmark query Q09c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/09d.benchmark | IMDB benchmark query Q09d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/10a.benchmark | IMDB benchmark query Q10a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/10b.benchmark | IMDB benchmark query Q10b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/10c.benchmark | IMDB benchmark query Q10c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/11a.benchmark | IMDB benchmark query Q11a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/11b.benchmark | IMDB benchmark query Q11b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/11c.benchmark | IMDB benchmark query Q11c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/11d.benchmark | IMDB benchmark query Q11d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/12a.benchmark | IMDB benchmark query Q12a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/12b.benchmark | IMDB benchmark query Q12b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/12c.benchmark | IMDB benchmark query Q12c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/13a.benchmark | IMDB benchmark query Q13a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/13b.benchmark | IMDB benchmark query Q13b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/13c.benchmark | IMDB benchmark query Q13c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/13d.benchmark | IMDB benchmark query Q13d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/14a.benchmark | IMDB benchmark query Q14a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/14b.benchmark | IMDB benchmark query Q14b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/14c.benchmark | IMDB benchmark query Q14c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/15a.benchmark | IMDB benchmark query Q15a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/15b.benchmark | IMDB benchmark query Q15b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/15c.benchmark | IMDB benchmark query Q15c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/15d.benchmark | IMDB benchmark query Q15d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/16a.benchmark | IMDB benchmark query Q16a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/16b.benchmark | IMDB benchmark query Q16b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/16c.benchmark | IMDB benchmark query Q16c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/16d.benchmark | IMDB benchmark query Q16d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/17a.benchmark | IMDB benchmark query Q17a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/17b.benchmark | IMDB benchmark query Q17b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/17c.benchmark | IMDB benchmark query Q17c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/17d.benchmark | IMDB benchmark query Q17d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/17e.benchmark | IMDB benchmark query Q17e. |
| benchmarks/sql_benchmarks/imdb/benchmarks/17f.benchmark | IMDB benchmark query Q17f. |
| benchmarks/sql_benchmarks/imdb/benchmarks/18a.benchmark | IMDB benchmark query Q18a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/18b.benchmark | IMDB benchmark query Q18b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/18c.benchmark | IMDB benchmark query Q18c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/19a.benchmark | IMDB benchmark query Q19a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/19b.benchmark | IMDB benchmark query Q19b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/19c.benchmark | IMDB benchmark query Q19c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/19d.benchmark | IMDB benchmark query Q19d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/20a.benchmark | IMDB benchmark query Q20a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/20b.benchmark | IMDB benchmark query Q20b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/20c.benchmark | IMDB benchmark query Q20c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/21a.benchmark | IMDB benchmark query Q21a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/21b.benchmark | IMDB benchmark query Q21b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/21c.benchmark | IMDB benchmark query Q21c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/22a.benchmark | IMDB benchmark query Q22a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/22b.benchmark | IMDB benchmark query Q22b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/22c.benchmark | IMDB benchmark query Q22c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/22d.benchmark | IMDB benchmark query Q22d. |
| benchmarks/sql_benchmarks/imdb/benchmarks/23a.benchmark | IMDB benchmark query Q23a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/23b.benchmark | IMDB benchmark query Q23b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/23c.benchmark | IMDB benchmark query Q23c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/24a.benchmark | IMDB benchmark query Q24a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/24b.benchmark | IMDB benchmark query Q24b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/25a.benchmark | IMDB benchmark query Q25a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/25b.benchmark | IMDB benchmark query Q25b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/25c.benchmark | IMDB benchmark query Q25c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/26a.benchmark | IMDB benchmark query Q26a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/26b.benchmark | IMDB benchmark query Q26b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/26c.benchmark | IMDB benchmark query Q26c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/27a.benchmark | IMDB benchmark query Q27a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/27b.benchmark | IMDB benchmark query Q27b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/27c.benchmark | IMDB benchmark query Q27c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/28a.benchmark | IMDB benchmark query Q28a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/28b.benchmark | IMDB benchmark query Q28b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/28c.benchmark | IMDB benchmark query Q28c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/29a.benchmark | IMDB benchmark query Q29a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/29b.benchmark | IMDB benchmark query Q29b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/29c.benchmark | IMDB benchmark query Q29c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/30a.benchmark | IMDB benchmark query Q30a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/30b.benchmark | IMDB benchmark query Q30b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/30c.benchmark | IMDB benchmark query Q30c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/31a.benchmark | IMDB benchmark query Q31a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/31b.benchmark | IMDB benchmark query Q31b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/31c.benchmark | IMDB benchmark query Q31c. |
| benchmarks/sql_benchmarks/imdb/benchmarks/32a.benchmark | IMDB benchmark query Q32a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/32b.benchmark | IMDB benchmark query Q32b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/33a.benchmark | IMDB benchmark query Q33a. |
| benchmarks/sql_benchmarks/imdb/benchmarks/33b.benchmark | IMDB benchmark query Q33b. |
| benchmarks/sql_benchmarks/imdb/benchmarks/33c.benchmark | IMDB benchmark query Q33c. |

Comment on lines +1 to +21
DROP TABLE aka_name;

DROP TABLE aka_title;

DROP TABLE cast_info;

DROP TABLE char_name;

DROP TABLE comp_cast_type;

DROP TABLE company_name;

DROP TABLE company_type;

DROP TABLE complete_cast;

DROP TABLE info_type;

DROP TABLE keyword;

DROP TABLE kind_type;
Contributor

I see no harm in using IF EXISTS. It's just a defense in depth against confusing error messages if things fail. Wdyt @Omega359 ?
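
The IF EXISTS suggestion can be illustrated with a small sketch that generates the guarded statements. It is not part of the PR: `guarded_cleanup_sql` is a hypothetical helper, and the table list covers only the 11 tables visible in the excerpt above (lines 1 to 21 of cleanup.sql), not necessarily the whole file.

```rust
/// Tables named in the cleanup.sql excerpt quoted in this thread.
/// Assumption: the real file may drop more tables than these.
const TABLES: [&str; 11] = [
    "aka_name",
    "aka_title",
    "cast_info",
    "char_name",
    "comp_cast_type",
    "company_name",
    "company_type",
    "complete_cast",
    "info_type",
    "keyword",
    "kind_type",
];

/// Hypothetical helper: emit one `DROP TABLE IF EXISTS` statement per table,
/// separated by blank lines like the original file.
fn guarded_cleanup_sql(tables: &[&str]) -> String {
    tables
        .iter()
        .map(|t| format!("DROP TABLE IF EXISTS {};", t))
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

Calling `guarded_cleanup_sql(&TABLES)` yields statements such as `DROP TABLE IF EXISTS aka_name;`, which succeed even when an earlier load step failed and the table was never created.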

Comment thread benchmarks/sql_benchmarks/imdb/init/cleanup.sql
Comment thread benchmarks/sql_benchmarks/imdb/benchmarks/08b.benchmark
Comment thread benchmarks/sql_benchmarks/imdb/benchmarks/12b.benchmark
AND mc.movie_id = mi_idx.movie_id
AND it.id = mi_idx.info_type_id;

result sql_benchmarks/imdb/results/01a.csv
Comment thread benchmarks/sql_benchmarks/imdb/init/load_csv.sql
Comment thread benchmarks/sql_benchmarks/imdb/init/load_csv.sql
Comment thread benchmarks/sql_benchmarks/imdb/init/load_csv.sql
Comment thread benchmarks/sql_benchmarks/imdb/init/load_csv.sql
Comment thread benchmarks/sql_benchmarks/imdb/init/load_csv.sql
@adriangb
Contributor

adriangb commented Jun 2, 2026

Some minor things to address @Omega359 🙏🏻

name_pcode_nf varchar(5),
surname_pcode varchar(5),
md5sum varchar(32)
) STORED AS CSV LOCATION 'data/imdb/char_name.csv' OPTIONS ('has_header' 'false', 'format.delimiter' ',', 'format.escape' '\');
Contributor

We should also fix #22660 (comment) here

3 participants