Tasks

- full fault-tolerant training
- [ ] design doc, https://github.com/PaddlePaddle/Paddle/pull/11625
- [ ] recoverable trainer process without shutting down the whole job
- [ ] recoverable pserver process without shutting down the whole job
- [ ] distributed task queue to manage tasks in etcd
- [ ] distributed reader to fetch records from the task queue
- [ ] pserver HA
- [ ] dynamic trainer count on the pserver side, so that gradients can be averaged according to the current trainer count
- [ ] upgrade the EDL controller to CRD so that we can support Kubernetes versions higher than v1.8
- [ ] a tutorial on running a distributed lookup sparse table with EDL
- [ ] update experiment report, https://github.com/PaddlePaddle/cloud/tree/develop/doc/edl/experiment
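
To illustrate the dynamic trainer count item, here is a minimal sketch of averaging gradients over however many trainers reported in a given step. It is plain Python, not the Paddle pserver API; `average_gradients` and the trainer ids are hypothetical names used only for illustration.

```python
from typing import Dict, List


def average_gradients(grads_by_trainer: Dict[str, List[float]]) -> List[float]:
    """Average gradients over the trainers that reported in this step.

    Hypothetical helper: the divisor is the number of trainers currently
    present, not a trainer count fixed when the job started.
    """
    if not grads_by_trainer:
        raise ValueError("no gradients received")

    trainer_count = len(grads_by_trainer)
    length = len(next(iter(grads_by_trainer.values())))
    summed = [0.0] * length

    for trainer_id, grad in grads_by_trainer.items():
        if len(grad) != length:
            raise ValueError(f"gradient size mismatch from {trainer_id}")
        for i, value in enumerate(grad):
            summed[i] += value

    return [total / trainer_count for total in summed]
```

If a trainer leaves or joins between steps, the next call simply receives a smaller or larger dictionary, so the average stays correct without reconfiguring the job.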