Hello LBANN team,
I would like to evaluate LBANN's strong scalability as described in the LBANN publication "The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism" (arXiv '20).
However, I cannot reproduce the scalability of the CosmoFlow benchmark. The paper states that this result relied on spatially parallel I/O, but I cannot find the corresponding option in LBANN.
Could you help me reproduce LBANN's strong-scaling results on NERSC Perlmutter? My questions are:
i) How do I enable spatially parallel I/O? (Is this the "distconv" option?)
ii) Could you share the detailed training parameters for CosmoFlow (batch size and other training options)?
Thank you for your help.