2025.md
+8 -2 (8 additions & 2 deletions)
@@ -31,7 +31,7 @@ One selected paper was also published in [ACM SIGOPS Operating Systems Review (V
31
31
| 12:00-12:30 |**Invited talk - Jean-Thomas Acquaviva (DDN Storage)**| From HPC to AI: A Data Journey |
32
32
| 12:30-14:00 ||**Lunch break**|
33
33
| 14:00-14:30 | Zebin Ren (Vrije Universiteit Amsterdam), Krijn Doekemeijer (Vrije Universiteit Amsterdam), Tiziano De Matteis (Vrije Universiteit Amsterdam), Christian Pinto (IBM Research Europe), Radu Stoica (IBM Research Europe), Animesh Trivedi (IBM Research Europe) | An I/O Characterizing Study of Offloading LLM Models and KV Caches to NVMe SSD |
34
-
| 14:30-15:00 |**Invited talk - Yang Zheng (Huawei Technologies)**|**TBD**|
34
+
| 14:30-15:00 |**Invited talk - Yang Zheng (Huawei Technologies)**|**Reliability challenges and opportunities for AI infra: from industry perspective**|
35
35
| 15:00-15:30 |**Invited talk - Shadi Ibrahim (Inria, Rennes)**|**TBD**|
36
36
| 15:30-16:00 ||**Coffee break**|
37
37
| 16:00-16:15 | Joost Hoozemans (Voltron Data, Delft University of Technology), Robin Vonk (Delft University of Technology), Johan Peltenburg (Voltron Data), Felipe Aramburu (Voltron Data), Zaid Al-Ars (Delft University of Technology) | Using GPU Direct Storage with High-Performance Distributed Filesystems |
@@ -75,11 +75,17 @@ Jean-Thomas successively worked for Intel, the University of Versailles and the
75
75
76
76
#### Dr. Yang Zheng, Huawei
77
77
78
-
***Invited Talk: TBD***
78
+
***Invited Talk: Reliability challenges and opportunities for AI infra: from industry perspective***
79
79
80
80
##### Abstract
81
+
AI clusters are emerging as a critical infrastructure and technological frontier. As models grow in size following scaling laws, ensuring stable and reliable operation of large-scale model tasks on massive AI clusters has become a significant challenge in the industry.
82
+
83
+
Training and inference tasks for large models are tightly coupled systems with low fault tolerance. Distributed training involves frequent communication between nodes, strong dependencies across parallel domains, and requirements for computational accuracy. These factors lead to frequent training interruptions due to hardware failures, slow recovery, and fail-slow behavior. Additionally, silent data corruption can prevent models from converging. As the scale of training expands, reliability becomes a major bottleneck.
84
+
85
+
The key challenge is to build a highly available AI system architecture capable of supporting scenarios such as training on clusters with hundreds of thousands of cards, inference on super-nodes with hundreds or thousands of cards, and integrated training-inference tasks. Making fault recovery imperceptible to business operations ("zero" perception) is essential for ensuring the reliability of large-model infrastructure. Addressing these challenges will be critical for advancing the scalability and robustness of AI systems in the future.
81
86
82
87
##### Bio
88
+
Dr. Yang Zheng is a Principal Engineer at the Reliability Technology Lab of Huawei Technologies Co., Ltd. He is also a member of the reliable AI infra project, focusing on research into AI infra testing, monitoring, and recovery. Dr. Zheng received his PhD from Imperial College London, UK. His research interests include elastic training/inference and silent data corruption.