All changes we make to the assignment code or PDF will be documented in this file.
- code: Halve training tokens for the leaderboard run
- code: update Paloma validation set file name to `tokenized_paloma_c4_100_domains_validation.bin`, as it is a binary file loaded with `np.fromfile("/data/paloma/tokenized_paloma_c4_100_domains_validation.bin", dtype=np.uint16)`
- handout: add guidance to load the validation set with `np.fromfile`, and update references to new file name
- code: update `README.md` to clarify that students should use the provided training script, not their own train script
- code: update dependencies (`pyproject.toml` and `uv.lock`) with packages for WARC processing: `fastwarc` and `tldextract`
- handout: add hint to use `fastwarc` for WARC record iteration earlier in assignment
- handout: fix Together cluster paths to hatespeech and nsfw classifiers
- handout: clarify that students should use the provided training script, not their own train script
- handout: change references to WARC files in the final filtering step to WET files
- handout: provide hints on helpful classes to process the WET files
- code: script to get all assets
- code: improve supplied training script
- handout: update data to 2025
- handout: use WET files instead of WARC files for most tasks
- code: update deployment to use uv
- handout: make sure to specify in problem `train_model` that we provide a training script.
- handout: add `--device cuda` to training command
- handout: added usage example for parallelism with `concurrent.futures` and `submitit`.
- handout: added points to each of the problems
- code: fix type signature of `run_mask_emails`, `run_mask_phone_numbers`, and `run_mask_ips` adapters.
- code: fix expected labels in NSFW classifier test.
- handout: fix typo in mention of adapter `run_classify_quality` for problem `quality_classifier`.
- handout: fix link to Dolma NSFW and hatespeech classifiers, since the HF links point to the same model binary
Initial release.