This document fixes the first official public vs holdout split for SourceBench leaderboard evaluation.
Benchmark size:
- total queries: 100
- query types: 5
- queries per type: 20
Types:
- VACOS
- DebateQA
- HotpotQA
- Pinocchios
- QuoraQuestions
Version v1 uses:
- 65 public queries
- 35 holdout queries
Per query type:
- 13 public
- 7 holdout
This keeps the query-type distribution balanced between the open leaderboard and the official leaderboard while reserving a substantial hidden test set (35 of the 100 queries) for official validation.
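The counts stated above are mutually consistent, which a few lines of Python can confirm. This is only an arithmetic check; the constant names are illustrative and do not come from the SourceBench codebase.

```python
# Counts taken from the split description above; names are illustrative.
QUERY_TYPES = 5
QUERIES_PER_TYPE = 20
PUBLIC_PER_TYPE = 13
HOLDOUT_PER_TYPE = 7

total = QUERY_TYPES * QUERIES_PER_TYPE
public_total = QUERY_TYPES * PUBLIC_PER_TYPE
holdout_total = QUERY_TYPES * HOLDOUT_PER_TYPE

# Every query of a type lands in exactly one of the two splits.
assert PUBLIC_PER_TYPE + HOLDOUT_PER_TYPE == QUERIES_PER_TYPE
assert public_total + holdout_total == total

print(total, public_total, holdout_total)  # 100 65 35
```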
The v1 split was created once from the benchmark master query pool with stratification by query type.
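As a rough illustration of what stratification by query type means, the sketch below splits a synthetic pool so that every type contributes the same number of public queries. It is a generic seeded per-type shuffle, not the official selection rule, which is not disclosed. The function name `stratified_split`, the seed, and the placeholder type labels are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(queries, public_per_type=13, seed=0):
    """Split (query_id, query_type) pairs into public and holdout lists.

    Each type contributes exactly `public_per_type` queries to the public
    list and the remainder to holdout. The seed is a placeholder, not the
    value used for any official split.
    """
    by_type = defaultdict(list)
    for query_id, query_type in queries:
        by_type[query_type].append(query_id)

    rng = random.Random(seed)
    public, holdout = [], []
    # Sort types and ids first so the result depends only on the seed.
    for query_type in sorted(by_type):
        ids = sorted(by_type[query_type])
        rng.shuffle(ids)
        public += [(qid, query_type) for qid in ids[:public_per_type]]
        holdout += [(qid, query_type) for qid in ids[public_per_type:]]
    return public, holdout


# Synthetic pool: 5 placeholder types with 20 queries each.
pool = [(f"t{t}-q{i:02d}", f"type_{t}") for t in range(5) for i in range(20)]
public, holdout = stratified_split(pool)
print(len(public), len(holdout))  # 65 35
```

Sorting before shuffling makes the outcome reproducible for a given seed, which is consistent with a split that is created once and then frozen.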
Publicly disclosed:
- total benchmark size
- query-type taxonomy
- public/holdout counts
- balanced per-type allocation
Not publicly disclosed:
- the exact holdout query membership
- the exact internal selection rule
- the internal benchmark master file used to materialize the holdout set
Public split:
data/queries/sourcebench_public_queries_v1.csv
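A release check along the following lines could confirm that the public file matches the stated policy of 5 types with 13 queries each. This is a minimal sketch: the column name `query_type` and the helper `check_public_split` are assumptions, since the CSV schema is not described here, and the demo uses placeholder rows rather than real benchmark data.

```python
import csv
import io
from collections import Counter


def check_public_split(csv_file, type_column="query_type",
                       expected_types=5, expected_per_type=13):
    """Return per-type counts, raising ValueError if the split is unbalanced.

    `type_column` is an assumed header name; adjust it to the real file.
    """
    counts = Counter(row[type_column] for row in csv.DictReader(csv_file))
    if len(counts) != expected_types:
        raise ValueError(f"expected {expected_types} types, found {len(counts)}")
    bad = {t: n for t, n in counts.items() if n != expected_per_type}
    if bad:
        raise ValueError(f"unexpected per-type counts: {bad}")
    return dict(counts)


# Demo on an in-memory CSV with placeholder rows, not real benchmark data.
rows = ["query_id,query_type"]
rows += [f"q{t}_{i},type_{t}" for t in range(5) for i in range(13)]
counts = check_public_split(io.StringIO("\n".join(rows)))
print(sum(counts.values()))  # 65
```

Against the real file, one would pass a handle from `open("data/queries/sourcebench_public_queries_v1.csv", newline="")` instead of the in-memory buffer.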
Holdout split:
- stored outside the public repository in the official evaluation environment
The holdout query content must not live inside the public repository.
Recommended public-release behavior:
- keep sourcebench_public_queries_v1.csv in the public repo
- keep only the split policy and high-level counts public
- store the holdout query file only in the official evaluation environment