Description
There seems to be an issue where frequent checkpointing negatively impacts training time, slowing training by more than 2x over the course of ~100 epochs.
Examples of the issue (checkpoint_interval: 1)
The per-epoch times below do not include the (synchronous) checkpointing itself.
# Expected behavior (CUDA)
Epoch 6 completed in 3.9217326641082764 seconds. Total train time so far: 171.78343605995178
...
Epoch 99 completed in 3.8156378269195557 seconds. Total train time so far: 3106.8321845531464
# Issue (ROCm)
Epoch 2 completed in 5.56635046005249 seconds. Total train time so far: 164.07908129692078
...
Epoch 46 completed in 26.325266361236572 seconds. Total train time so far: 3019.2210574150085
Examples of the issue (checkpoint_interval: 10)
Not nearly as bad, but the slowdown is still significant.
Epoch 2 completed in 6.140593528747559 seconds. Total train time so far: 129.89002299308777
...
Epoch 70 completed in 10.53577995300293 seconds. Total train time so far: 943.0159168243408
Solution
Disable checkpointing by setting checkpoint_interval > epochs.
If checkpointing is necessary, the slowdown can be mitigated with a very high checkpoint_interval; checkpoint_interval > 100 is suggested.
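To illustrate why checkpoint_interval > epochs effectively disables checkpointing, here is a minimal sketch assuming a typical modulo-based checkpoint guard in the epoch loop. The loop structure and function names are illustrative assumptions, not the actual training code from this report.

```python
# Illustrative sketch: a modulo-based checkpoint guard, as commonly used
# in epoch loops. Not the actual training code from this report.

def count_checkpoints(epochs: int, checkpoint_interval: int) -> int:
    """Count how many synchronous checkpoint saves such a loop would perform."""
    saves = 0
    for epoch in range(1, epochs + 1):
        if epoch % checkpoint_interval == 0:
            saves += 1  # a synchronous save would block training here
    return saves

print(count_checkpoints(100, 1))    # -> 100: saves every epoch
print(count_checkpoints(100, 10))   # -> 10:  overhead amortized 10x
print(count_checkpoints(100, 101))  # -> 0:   interval > epochs never triggers a save
```

Since no epoch is ever a multiple of an interval larger than the total epoch count, the save branch never fires, which is why raising checkpoint_interval above epochs disables checkpointing entirely.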