
Commit 4565ce9

SCC24 Postmortem edits
changed underscores to dashes here
1 parent ad19a74 commit 4565ce9

1 file changed

Lines changed: 8 additions & 10 deletions

File tree

content/posts/sc25-scc24-post-mortem.md

@@ -3,7 +3,6 @@ title: "SCC24 Postmortem & Supercomputing 2025 info"
 date: 2025-03-04
 author: ["paco", "org", "aarush"]
 description: 'Description of SCC24 and upcoming 2025'
-draft: true
 math: true
 ---

@@ -115,16 +114,14 @@ I want you to realize that during this whole time our teams are working on build
 Now, acknowledging our issues, SDSC has accommodated the team with express shipping and will allow us to ship Monday morning to arrive on the Friday/Saturday of our competition. And we have a load of hardware/software support problems.
 
 
-We've finally got all hardware of our final cluster, but the AMD GPU BIOS from HPE are different BIOS versions and are set to read only, even with their free firmware tools. We've installed the NICs and set up the Switch, but the copper cables are not working, even after going to the bios and correcting the relevant BIOS tools. And we do not have our cluster up working.
-
-
-<!-- Broadcom Drivers -->
+We've finally got all the hardware for our final cluster, but the AMD GPU BIOSes from HPE are different versions and are set to read-only, even with their free firmware tools. We've installed the NICs, set up the switch, and followed the online instructions for a manual Broadcom driver install, but the copper cables are still not working, even after going into the BIOS and correcting the relevant settings. And our cluster is not yet up and working.
 
 
 ## Dawn of the Final Week
 
 The flight to Atlanta is Friday, the 15th of November. The day this section describes is Friday, the 8th of November. The express shipping driver has arrived early and parked his truck on the loading dock. Three days until shipping.
 
-The final GPUs and Infinity Fabric Links from AMD show shipping status SD yesterday. They've been shipped directly from Taiwan to us. They *smell* new. Of metal and that slight burnt smell leftover from manufacturing. We integrate these GPUs and their Infinity Fabric Links into the final node and they just work with our benchmarks. It's almost magical.
+The final GPUs and Infinity Fabric Links from AMD showed a shipping status of arrived in SD as of yesterday. They've been shipped directly from Taiwan to us. They *smell* new: of metal and that slight burnt smell left over from manufacturing. We integrate these GPUs and their Infinity Fabric Links into the final node, and they just work with our benchmarks. It's almost magical.
 
 We are spending all day at SDSC working. The network is still not set up and the switch is still having some issues. But luckily, Kent Tsui, a Micas Networks engineer, is in San Diego and is willing to help us. At the same time, Wyatt from Liqid also comes by to get an idea of our work, show off the hardware Liqid is working on, and treat the whole team to a meal. So midday Kent comes over and helps us directly in the datacenter. Since we had already gone through all of the initial debugging, he is quickly able to pin the issue down to the copper cables, which had not been tested with the networking equipment before. He explains that optical cables expect some sort of port training, a negotiation between the switch and the NICs to agree on the same speeds and standards. Turning this feature off on the NICs in the BIOS was all we needed. We then just ran some testing:
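The port-training fix can also be sketched at the OS level. This is a minimal, hypothetical analogue of the BIOS change, assuming Linux with `ethtool` available; the interface name (`enp1s0f0`) and link speed (100 GbE) are assumptions, since the post doesn't name them:

```shell
# Hypothetical runtime analogue of the BIOS fix: disable auto-negotiation
# (link training) and force a fixed speed so the NIC and switch agree.
# Interface name and speed are assumptions, not from the post.
ethtool -s enp1s0f0 autoneg off speed 100000 duplex full

# Confirm what the link actually settled on.
ethtool enp1s0f0 | grep -E 'Speed|Auto-negotiation|Link detected'
```

Forcing the speed only works if both ends are pinned to the same value, which is why the mismatch only surfaced with the untested copper cables.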

@@ -169,7 +166,7 @@ Going back to our GPU situation, the AMD BIOS on our first GPU system is still d
 
 It's still Saturday, and the whole team is trying to run their tests. We believe the main compute part of the cluster is complete. There's still a little more setup to go, and Sunday will be grueling. So the app/benchmarking team takes a rest while our sysadmins continue through the night, setting up monitoring, debugging speeds, firewalls, and anything left over. By the time this is done, it's already Sunday midday.
 
-> I remember coming up to our club space and trying to work on more issues at my desk. But every couple of seconds I would lose consciousness only to jolt awake at the sense of my head falling. After this had happened too many times that I had lost my sense of time, I decided to go to sleep. But I was awoken an hour and a half an hour by the team for help. I didn't sleep again until after the cluster had been shipped.
+> I remember coming up to our club space and trying to work on more issues at my desk. I wanted to use every minute. But every couple of seconds I would lose consciousness, only to jolt awake at the sense of my head falling. After this had happened so many times that I had lost my sense of time, I decided to go to sleep. But I was awoken an hour and a half later by the team asking for help. I didn't sleep again until after the cluster had been shipped.
 > <br> &emsp;&emsp; &ndash; Francisco, aka Paco
 
 Setting up the firewall and installing Docker breaks our MPI instance on Sunday. We finally figure out that it's trying to route MPI processes through an incorrect port, and Bryan Chin realizes that specifying the port or configuring the default solves the issue. When we finally have it working, it's already midnight. We have to begin tearing down the cluster.
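The port fix can be sketched as follows, under assumptions the post doesn't state: that the cluster ran Open MPI over its TCP BTL and firewalld. The port range (10000-10099) and the `./hello_mpi` binary are placeholders:

```shell
# Sketch of the fix, assuming Open MPI + firewalld (both are assumptions).
# 1. Open a known TCP port range through the firewall.
firewall-cmd --permanent --add-port=10000-10099/tcp
firewall-cmd --reload

# 2. Pin Open MPI's TCP traffic to that same range so the firewall
#    no longer silently drops MPI connections.
mpirun --mca btl_tcp_port_min_v4 10000 \
       --mca btl_tcp_port_range_v4 100 \
       -np 4 ./hello_mpi
```

The same idea works with a persistent MCA parameter file instead of command-line flags, which keeps the setting from being forgotten under competition pressure.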
@@ -279,8 +276,7 @@ The 3 hour limit given to us included any set up and initialization tasks. After
 
 <div align="center">
 <div style="display: flex; gap: 10px;">
-<img src="/post-media/scc24-postmortem/final_zonal_wind.png" style="max-width: 45%;">
-<img src="/post-media/scc24-postmortem/final_zonal_wind.png" style="max-width: 45%;">
+<img src="/post-media/scc24-postmortem/final-zonal-wind.png" style="max-width: 45%;">
 </div>
 <i>The final zonal wind visualized after our run.</i>
 </div>
@@ -317,7 +313,9 @@ Overall, it was a very fun experience. All of it was. During our disassembly of
 **Our team won 4th overall!**
 
 This makes it the best placement among the US teams for a third year in a row.
-And this year best placement among US+European teams.
+And this year it is also the best placement among US and European teams.
+
+While the majority of teams used an Nvidia-based stack, our team used the AMD ROCm stack, a rarity as H100 GPUs have become ever more prevalent among competitors in recent years.
 
 
 <center>
@@ -339,4 +337,4 @@ Total 100% | | 57.137 / 100
 
 ## Takeaways and Thanks
 
-We'd like to extend thanks to the awesome contributions of the entirety of [SDSC](https://www.sdsc.edu/) and [CSE](https://cse.ucsd.edu/) at UCSD. Specifically the efforts of our mentors Mary P. Thomas, Martin Kandes, Mahidhar Tatineni, and Bryan Chin.
+We'd like to extend thanks for the awesome contributions of the entirety of [SDSC](https://www.sdsc.edu/) and [CSE](https://cse.ucsd.edu/) at UCSD, specifically the efforts of our mentors Mary P. Thomas, Martin Kandes, Mahidhar Tatineni, and Bryan Chin. We also thank our company sponsors: HPE, Pivotal Optics, Aeon Computing, Applied Data Systems, International Computer Concepts, Gigabyte, LIQID, Broadcom, Micas Networks, and AMD. The UCSD Supercomputing Club would never have grown and accomplished as much without your help, and we are grateful for all of it.
