Pack datasets efficiently for LLM training by combining multiple examples into fixed-length sequences. Pre-pack your datasets locally to avoid wasting expensive GPU time on servers.
Dataset packing is essential for training large language models efficiently. This tool allows you to:
- Save GPU time: Pack datasets on your local machine before uploading to training servers
- Preserve data integrity: Ensures no example is split across multiple packed sequences
- Handle large datasets: Process datasets in chunks to manage memory efficiently
- Flexible tokenization: Support for both standard text data and chat-formatted (SFT) data
This feature requires the `pack` extra:

```bash
pip install "dalla-data-processing[pack]"
# or install all features: pip install "dalla-data-processing[all]"
```

The packing functionality is integrated into the unified `dalla-dp` CLI:

```bash
dalla-dp -i input_dataset -o output_dataset pack --tokenizer-path /path/to/tokenizer
```

Options:

- `-i, --input-dataset`: Path to input HuggingFace dataset (required)
- `-o, --output-dataset`: Path to save output dataset (required)
- `-w, --num-workers`: Number of parallel workers for packing (default: 4)
- `-v, --verbose`: Enable verbose output
- `--overwrite`: Overwrite output if it exists
- `--config`: Path to config YAML file (optional)
- `--tokenizer-path`: Path to tokenizer
- `--max-seq-length`: Maximum sequence length (default: 2048)
- `--chunk-size-gb`: Chunk size in GB (default: 2.0)
- `--text-column`: Text column name (default: `text`, or `messages` for SFT)
- `--subset-order`: Subset processing order
- `--sft`: Enable SFT mode
- `--rbpe`: Use R-BPE tokenizer
Basic packing:
```bash
dalla-dp -i my_dataset -o packed_dataset pack --tokenizer-path /path/to/tokenizer
```

Using a config file:

```bash
dalla-dp -i my_dataset -o packed_dataset pack --config pack_config.yaml
```

Override config values on the command line:

```bash
dalla-dp -i my_dataset -o packed_dataset pack \
    --config pack_config.yaml \
    --max-seq-length 4096
```

With a custom sequence length and worker count:

```bash
dalla-dp -i my_dataset -o packed_dataset -w 8 pack \
    --tokenizer-path /path/to/tokenizer \
    --max-seq-length 4096
```

SFT mode with a subset order:

```bash
dalla-dp -i my_dataset -o packed_dataset pack \
    --tokenizer-path /path/to/tokenizer \
    --sft \
    --subset-order train --subset-order validation
```

Using a custom text column:

```bash
dalla-dp -i my_dataset -o packed_dataset pack \
    --tokenizer-path /path/to/tokenizer \
    --text-column content
```

With verbose output and a smaller chunk size:

```bash
dalla-dp -i my_dataset -o packed_dataset -v pack \
    --tokenizer-path /path/to/tokenizer \
    --chunk-size-gb 1.0
```

Example `pack_config.yaml`:

```yaml
tokenizer_path: "/path/to/tokenizer"
max_seq_length: 2048
chunk_size_gb: 2.0
sft: false
rbpe: false
text_column: "content"
subset_order:
  - "train"
  - "validation"
```

CLI arguments override config values.
You can also use the packing functionality directly in Python:
```python
from dalla_data_processing.packing import DatasetPacker
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/tokenizer")

packer = DatasetPacker(
    input_dataset="my_dataset",
    output_dataset="packed_dataset",
    tokenizer=tokenizer,
    num_workers=4,
    max_seq_length=2048,
    chunk_size_gb=2.0,
    text_column="content",
)
final_path = packer.process()
```

`DatasetPacker` parameters:

- `input_dataset` (str): Path to input dataset
- `output_dataset` (str): Path for output dataset
- `tokenizer`: HuggingFace tokenizer instance
- `subset_order` (list[str], optional): Order to process subsets
- `num_workers` (int): Number of parallel packing processes (default: 4)
- `chunk_size_gb` (float): Size of processing chunks in GB (default: 2.0)
- `max_seq_length` (int): Maximum sequence length (default: 2048)
- `sft` (bool): Enable SFT mode (default: False)
- `rbpe` (bool): Use R-BPE tokenizer (default: False)
- `text_column` (str, optional): Text column name
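For chat-formatted data, the same API works with the SFT-related parameters documented above. A minimal sketch, assuming a tokenizer that defines a chat template and a dataset with a `messages` column (paths and dataset names are placeholders):

```python
from dalla_data_processing.packing import DatasetPacker
from transformers import AutoTokenizer

# SFT mode requires a tokenizer with a chat template defined.
tokenizer = AutoTokenizer.from_pretrained("path/to/tokenizer")

packer = DatasetPacker(
    input_dataset="my_sft_dataset",        # dataset with a "messages" column
    output_dataset="packed_sft_dataset",
    tokenizer=tokenizer,
    sft=True,                              # pack chat-formatted examples
    subset_order=["train", "validation"],
    max_seq_length=4096,
)
final_path = packer.process()
```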
The packer works as follows (see the sketch after this list):

- Packs multiple examples into a single sequence up to `max_seq_length`
- Adds an EOS token between examples
- Pads remaining space with PAD tokens
- Guarantees no example is cut in the middle: if an example doesn't fit, it starts the next sequence
- Best for preserving data integrity
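To make those guarantees concrete, here is a rough sketch of a greedy packing loop with these properties. It is illustrative only, not the library's actual implementation; `tokenized_examples`, `eos_id`, and `pad_id` are stand-ins:

```python
def pack_examples(tokenized_examples, max_seq_length, eos_id, pad_id):
    """Greedily pack token-ID lists into fixed-length sequences without splitting examples."""
    sequences, current = [], []
    for ids in tokenized_examples:
        if len(ids) + 1 > max_seq_length:
            continue  # examples longer than max_seq_length are skipped
        if len(current) + len(ids) + 1 > max_seq_length:
            # Example doesn't fit: pad out the current sequence and start a new one.
            sequences.append(current + [pad_id] * (max_seq_length - len(current)))
            current = []
        current += ids + [eos_id]  # EOS token separates examples
    if current:
        sequences.append(current + [pad_id] * (max_seq_length - len(current)))
    return sequences
```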
Enable SFT mode when working with chat-formatted data:
```bash
dalla-dp -i my_dataset -o packed_dataset pack \
    --tokenizer-path /path/to/tokenizer \
    --sft
```

When to use SFT mode:
- Your dataset has a `messages` field with chat conversations
- Your tokenizer has a chat template defined
- You're doing supervised fine-tuning (SFT) on conversational data
When NOT to use SFT mode:
- Continued pre-training (CPT) on plain text
- Your tokenizer doesn't have a chat template
- Your dataset only has a `text` field
Input format for SFT mode:
```json
{
  "messages": [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"}
  ]
}
```

Defaults: `text` (non-SFT) or `messages` (SFT). Override with `--text-column`.
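In SFT mode, conversations are formatted with the tokenizer's chat template before tokenization, which is why a chat template is required. A minimal sketch of that rendering step using the standard Hugging Face API; the library's internal call may differ:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/tokenizer")

example = {
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"},
    ]
}

# apply_chat_template only works if the tokenizer defines a chat template.
text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
input_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
```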
Your dataset should be in Hugging Face datasets format:
```
my_dataset/
  train/
    data-00000-of-00001.arrow
    dataset_info.json
    state.json
  validation/
    data-00000-of-00001.arrow
    dataset_info.json
    state.json
  dataset_dict.json
```
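This layout is what `DatasetDict.save_to_disk` from the `datasets` library produces. A minimal sketch of preparing an input dataset in this format (the column values here are placeholders):

```python
from datasets import Dataset, DatasetDict

# Build a toy DatasetDict and save it in the layout shown above.
train = Dataset.from_dict({"text": ["first document", "second document"]})
validation = Dataset.from_dict({"text": ["held-out document"]})

DatasetDict({"train": train, "validation": validation}).save_to_disk("my_dataset")
```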
The packed dataset will be saved to:
```
packed_dataset/
  final_dataset/
    train/
      data-00000-of-00001.arrow
      dataset_info.json
      state.json
```
Each example in the final dataset will have:
- `input_ids`: Token IDs (length = `max_seq_length`)
- `labels`: Same as `input_ids` (or masked for SFT)
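A quick way to sanity-check the packed output with the `datasets` library, following the layout above:

```python
from datasets import load_from_disk

packed = load_from_disk("packed_dataset/final_dataset")
sample = packed["train"][0]

# Every packed example should contain exactly max_seq_length token IDs,
# with labels aligned to input_ids.
print(len(sample["input_ids"]), len(sample["labels"]))
print(sample["input_ids"][:10])
```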
The `--chunk-size-gb` parameter controls memory usage:
- Smaller values (0.5-1 GB): Lower memory, more chunks
- Larger values (2-4 GB): Higher memory, fewer chunks
Internally, packing proceeds in stages:

- Analyze: Calculate sizes of dataset subsets
- Split: Divide datasets into manageable chunks
- Tokenize: Convert text to token IDs using parallel processing
- Pack: Combine multiple examples into fixed-length sequences
- Concatenate: Merge all packed chunks into final dataset
The tool automatically:
- Preserves subset ordering
- Removes intermediate files to save disk space
- Handles empty subsets gracefully
- Skips examples longer than `max_seq_length`
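The last point can matter for corpora with very long documents. A quick, illustrative way to estimate how many examples would be skipped, using the `datasets` and `transformers` APIs (the `text` column and paths are placeholders):

```python
from datasets import load_from_disk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/tokenizer")
dataset = load_from_disk("my_dataset")["train"]
max_seq_length = 2048

# Count examples whose tokenized length exceeds max_seq_length;
# these would be skipped during packing.
too_long = dataset.filter(
    lambda ex: len(tokenizer(ex["text"])["input_ids"]) > max_seq_length
)
print(f"{len(too_long)} of {len(dataset)} examples exceed {max_seq_length} tokens")
```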