Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion superbench/benchmarks/model_benchmarks/megatron_gpt3.py
Original file line number Diff line number Diff line change
Expand Up @@ -651,13 +651,16 @@ def _generate_dataset(self):
if self._args.dataset_url:
self._raw_data_path = str(Path(self._args.data_home) / 'data.json')
download_file(self._args.dataset_url, self._raw_data_path)

command = (
'python3 '
f'{os.path.join(self._args.code_base, "tools/preprocess_data.py")} '
f'--input {self._raw_data_path} '
f'--tokenizer-type {self._args.tokenizer_type} '
f'--output-prefix {os.path.join(self._args.data_home, "dataset")} '
f'--workers {str(self._args.num_workers)} '
# num_workers=0 is valid for DataLoader (main process loads data),
# but preprocess_data.py requires workers>=1 for multiprocessing.Pool.
f'--workers {max(1, self._args.num_workers)} '
Comment on lines +661 to +663
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new max(1, self._args.num_workers) clamp is a regression-critical behavior change (fixing the multiprocessing.Pool(0) crash), but there’s no unit test asserting that --workers passed to preprocess_data.py is at least 1 when --num_workers=0. Since this benchmark already has tests, consider adding a test that exercises _generate_dataset() with num_workers=0 and asserts the constructed preprocess command uses --workers 1.

Copilot uses AI. Check for mistakes.
f'--vocab-file {self._vocab_path} '
Comment on lines 659 to 664
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--output-prefix is hardcoded to ${data_home}/dataset, but the rest of the method uses self._args.data_prefix (and exposes it as a CLI arg). If --data_prefix is customized and the dataset needs to be generated (no existing .bin/.idx), preprocessing will generate dataset_text_document.* while the code checks for {data_prefix}.*, leading to a false failure. Consider deriving --output-prefix from data_prefix (e.g., strip a _text_document suffix when present) so generation and subsequent checks are consistent.

Copilot uses AI. Check for mistakes.
f'--merge-file {self._merges_path}'
)
Expand Down
Loading