Commit 294f0b0

Invited speech details - 1
1 parent 066589b commit 294f0b0

1 file changed

Lines changed: 8 additions & 2 deletions

2025.md

@@ -31,7 +31,7 @@ One selected paper was also published in [ACM SIGOPS Operating Systems Review (V
 | 12:00-12:30 | **Invited talk - Jean-Thomas Acquaviva (DDN Storage)** | From HPC to AI: A Data Journey |
 | 12:30-14:00 | | **Lunch break** |
 | 14:00-14:30 | Zebin Ren (Vrije Universiteit Amsterdam), Krijn Doekemeijer (Vrije Universiteit Amsterdam), Tiziano De Matteis (Vrije Universiteit Amsterdam), Christian Pinto (IBM Research Europe), Radu Stoica (IBM Research Europe), Animesh Trivedi (IBM Research Europe) | An I/O Characterizing Study of Offloading LLM Models and KV Caches to NVMe SSD |
-| 14:30-15:00 | **Invited talk - Yang Zheng (Huawei Technologies)** | **TBD** |
+| 14:30-15:00 | **Invited talk - Yang Zheng (Huawei Technologies)** | **Reliability challenges and opportunities for AI infra: from industry perspective** |
 | 15:00-15:30 | **Invited talk - Shadi Ibrahim (Inria, Rennes)** | **TBD** |
 | 15:30-16:00 | | **Coffee break** |
 | 16:00-16:15 | Joost Hoozemans (Voltron Data, Delft University of Technology), Robin Vonk (Delft University of Technology), Johan Peltenburg (Voltron Data), Felipe Aramburu (Voltron Data), Zaid Al-Ars (Delft University of Technology) | Using GPU Direct Storage with High-Performance Distributed Filesystems |
@@ -75,11 +75,17 @@ Jean-Thomas successively worked for Intel, the University of Versailles and the
 
 #### Dr. Yang Zheng, Huawei
 
-***Invited Talk: TBD***
+***Invited Talk: Reliability challenges and opportunities for AI infra: from industry perspective***
 
 ##### Abstract
+AI clusters are emerging as a critical infrastructure and technological frontier. As models grow in size following scaling laws, ensuring stable and reliable operation of large-scale model tasks on massive AI clusters has become a significant challenge in the industry.
+
+Training and inference tasks for large models are highly coupled systems with low fault tolerance. Distributed training involves frequent communication between nodes, strong dependencies across parallel domains, and requirements for proper computational accuracy. These factors lead to frequent training interruptions due to hardware failures, slow recovery, and fail-slow behavior. Additionally, silent data corruptions can result in model non-convergence. As the scale of training expands, reliability becomes a major bottleneck.
+
+The key challenge is to build a highly available AI system architecture capable of supporting scenarios such as training on clusters with hundreds of thousands of cards, inference on super-nodes with hundreds or thousands of cards, and integrated training-inference tasks. Achieving "zero" perception of fault recovery in business operations is essential for ensuring the reliability of large model infrastructures. Addressing these challenges will be critical for advancing the scalability and robustness of AI systems in the future.
 
 ##### Bio
+Dr. Yang Zheng is a Principal Engineer at the Reliability Technology Lab of Huawei Technologies Co., Ltd. Dr. Zheng is also a member of the reliable AI infra project, focusing on research into AI infra testing, monitoring, and recovery. Dr. Zheng received his PhD degree from Imperial College London in the UK. His research interests include elastic training/inference and silent data corruption.
 
 <!--
 ### Invited Speakers
