At your current training speed, 20 epochs would take ~388 days to complete!
- **Data Loading Bottleneck (Main Issue)**
  - Only 2 workers for data loading
  - The CPU can't keep up with the GPU
  - The GPU sits idle waiting for data (see the timing sketch after this list)
- **Large Dataset**
  - 254,976 training samples
  - 42,496 iterations per epoch (i.e., a batch size of 6: 254,976 ÷ 42,496)
- **Complex Neuroscience Features**
  - Homeostatic plasticity calculations
  - Sleep-wake cycle computations
  - Additional overhead per batch
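To confirm that data loading (rather than model compute) is the bottleneck, you can time how long each step blocks waiting for the next batch. A minimal sketch, assuming a standard PyTorch loop; `loader` and the step internals are stand-ins, not names from the actual scripts:

```python
import time
import torch

step_end = time.perf_counter()
for step, batch in enumerate(loader):
    # Time spent blocked waiting for the DataLoader to produce a batch
    data_wait = time.perf_counter() - step_end

    # ... forward / backward / optimizer step ...

    torch.cuda.synchronize()  # wait for queued GPU work so the timing is honest
    step_end = time.perf_counter()
    if step % 50 == 0:
        print(f"step {step}: waited {data_wait:.3f}s for data")
```

If `data_wait` dominates the total step time, the GPU is starved and more workers are the fix.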
```bash
# Use more data loading workers
uv run scripts/train_neuroscience_3090.py --epochs 20 --num-workers 8 --wandb
```

```bash
# Optimized for speed with minimal features
uv run scripts/train_fast_3090.py --epochs 10 --num-workers 8 --minimal --wandb

# With slightly more features
uv run scripts/train_fast_3090.py --epochs 10 --num-workers 8 --batch-size 12 --gradient-accumulation 1
```

```bash
# Create a smaller dataset for faster iteration
head -n 10000 data/train.jsonl > data/train_small.jsonl

# Convert to binary
uv run cortexgpt/data/prepare_data.py \
    --input-file data/train_small.jsonl \
    --output-file data/train_small.bin

# Train on the smaller dataset
uv run scripts/train_neuroscience_3090.py \
    --train-data data/train_small.bin \
    --epochs 5 \
    --num-workers 8
```

```bash
# Monitor GPU utilization (should be >90%)
watch -n 1 nvidia-smi
# Check if CPU is the bottleneck
htop  # Look for high CPU usage during training
```

| Configuration | Speed | Time per Epoch | 20 Epochs |
|---|---|---|---|
| Current (2 workers) | 39.51 s/iter | ~19.4 days | ~388 days |
| With 8 workers | ~5 s/iter | ~2.5 days | ~50 days |
| Fast script (minimal) | ~1-2 s/iter | ~12-24 hours | ~10-20 days |
| Smaller dataset (10k) | ~0.5 s/iter | ~1 hour | ~20 hours |
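These per-epoch estimates follow directly from the iteration count: 39.51 s/iter × 42,496 iterations ≈ 1.68M seconds ≈ 19.4 days per epoch, hence ~388 days for 20 epochs.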
```python
# Add to trainer for ~2x speedup (mixed precision)
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():
    outputs = model(inputs)
    loss = criterion(outputs, targets)  # your existing loss fn

scaler.scale(loss).backward()  # scaled loss keeps fp16 grads from underflowing
scaler.step(optimizer)
scaler.update()
```
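On an RTX 3090 (Ampere), `autocast` can also run in bfloat16 via `torch.cuda.amp.autocast(dtype=torch.bfloat16)`, which largely avoids overflow issues; the fp16 + `GradScaler` pattern above is the more common default.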
```bash
# Reduce memory usage, slight speed penalty
--gradient-checkpointing
```
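For reference, gradient checkpointing trades compute for memory: activations inside checkpointed blocks are recomputed during the backward pass instead of being stored. A sketch of the underlying mechanism with `torch.utils.checkpoint`; the flag presumably wires something like this into the model, and `blocks` here is a stand-in, not the actual CortexGPT module:

```python
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x):
    # Each block's intermediate activations are dropped after the
    # forward pass and recomputed during backward, lowering peak memory.
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x
```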
```bash
# If you have multiple GPUs
torchrun --nproc_per_node=2 scripts/train_neuroscience_3090.py
```
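Note that `torchrun` only launches the processes; the script itself must initialize the process group and wrap the model. If `train_neuroscience_3090.py` doesn't already do this, a minimal sketch (all names besides the torch APIs are stand-ins):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend="nccl")      # torchrun provides the env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)       # build_model() is hypothetical
model = DDP(model, device_ids=[local_rank])

# Each rank needs a distinct shard of the data.
sampler = DistributedSampler(train_dataset)  # train_dataset is hypothetical
```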
- Start with the fast script for initial experiments
- Use 8+ workers for data loading
- Test on a smaller dataset first
- Gradually enable features once training is stable
- Monitor GPU utilization to ensure it stays above 90%
- Batch Size: Larger = more GPU utilization, but watch memory
- Workers: Set to number of CPU cores (usually 8-16)
- Pin Memory: Already enabled, keeps data ready for GPU
- Persistent Workers: Reduces worker startup overhead (all four knobs appear together in the sketch below)
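Those four knobs map directly onto the `DataLoader` constructor. A sketch of the intended configuration, where `train_dataset` stands in for the real dataset object:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    train_dataset,            # stand-in for the actual dataset
    batch_size=12,            # larger batches raise GPU utilization; watch VRAM
    num_workers=8,            # roughly one per CPU core
    pin_memory=True,          # page-locked memory speeds host-to-GPU copies
    persistent_workers=True,  # avoid re-spawning workers every epoch
)
```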
The key is to keep the GPU fed with data continuously!