TLDR:
- 4.5 kW limit for entire cluster, 2 kW limit per node
- Benchmarks are: HPL, MLPERF inference, and a mystery benchmark revealed at the start of the benchmarking period.
- Applications are: NAMD, ICON, Reproducibility (data flow lifecycle analysis tool DataLife), and a mystery application revealed at the start of application period
- NAMD is parallelized molecular dynamics simulation software for large biomolecular systems.
- Icosahedral Nonhydrostatic, or ICON, is a complex weather modeling application that is part of Germany's DWD weather monitoring service, part of the global ensemble model of NOAA (a US weather agency), and a tool used by amateur hurricane trackers. Although a GPU port exists, ICON is typically compiled for CPU runs, and its data-heavy nature means that it stresses a system's IO.
- You must try to achieve the best scores for benchmarks, applications, systems, and certain other areas.
- The benchmarking period is 8am-5pm. The application period is the next 48 hours. No external help.
You can find further details on the [SCC SC24 page](https://sc24.supercomputing.org/students/student-cluster-competition/). Team introductions are on the [page as well](https://sc24.supercomputing.org/2024/09/teams-compete-in-the-ultimate-hpc-challenge-at-sc24/). If you are interested in the history of the program, you may also see the official [Student Cluster Competition site](https://www.studentclustercompetition.us/).
## System Specifications
A diagram of our final system, <u>Ocean Vista</u>:
Here is the diagram matching the hardware with the vendor source:
Besides lending the team hardware, we received generous financial donations from Micas Networks and Intel, as well as mentoring and help from our company sponsors and from the faculty and staff of SDSC, UCSD, and the CSE department.
## Preparing for Benchmarks and Applications
Early on, we divided our work into teams with a few overlapping responsibilities. This division also included students from our *home team*: students who helped prepare and worked on the tasks with the *travel team* but ultimately stayed at our institution during the conference.
During the summer we had weekly meetings to sync up on our progress, and throughout those weeks we worked on our tasks using our club resources. Through one of our members, we were able to get access to MI210s in an AMD research cluster; through SDSC, the Expanse supercomputer; and through the club itself, a few older servers. Once in a while, we would meet with our mentors to ask for advice or help with technical and nontechnical issues we came across.
Our goals were to familiarize ourselves with the benchmarks and applications and to practice our HPC skills. There were also stages of the competition application we had to submit every couple of weeks: the final team submission, the final poster, and the final architecture. And while we were working, we were also speccing out what we wanted from our cluster with our hardware vendors.
### NAMD
We spent the summer understanding how NAMD and molecular dynamics in general worked, by running simulations from the user guide and following online tutorials. We also tested a variety of NAMD builds: NAMD is built on the Charm++ message-passing parallel framework, and the Charm++ setup depends on the underlying network layer we wanted to use (`netlrts`, InfiniBand, MPI). One really important detail was that the developers had just released version 3.0, which introduced a new "GPU-resident" mode. GPU-resident mode offloads all calculations, including integration, to the GPU, making simulations much faster than CPU runs and the weaker GPU-offload mode. However, GPU-resident mode could only run on a single node. We realized that these variations would be key to maximizing performance on our cluster.
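The network-layer choice shows up at Charm++ build time. The commands below are an illustrative sketch, not our exact invocations (the target names follow the Charm++ build script's conventions; flags and paths are assumptions):

```bash
# From the Charm++ source tree: one build per network layer (illustrative)
./build charm++ netlrts-linux-x86_64 --with-production   # plain TCP/UDP sockets
./build charm++ verbs-linux-x86_64   --with-production   # InfiniBand verbs
./build charm++ mpi-linux-x86_64     --with-production   # layered on an MPI library
```

NAMD 3.0's GPU-resident mode is then switched on in the simulation config with `CUDASOAintegrate on`; since that mode is single-node, the Charm++ network layer mostly matters for multi-node CPU and GPU-offload runs.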
In the fall, we focused on running the publicly available NAMD benchmarks and toggling the different command line and configuration parameters to maximize performance on the AMD server available to us. We also had invaluable mentor support: Andy Goetz helped us understand how MD works, and Julio Maia, a researcher at AMD and one of the NAMD developers, gave us crucial advice on different types of NAMD simulations and worked with us to resolve issues in NAMD's source code due to incompatibility with AMD hardware.
When we finally got access to our cluster the week before shipping it out (more on that later), there was a new slew of problems. For some reason NAMD would not work with the `netlrts` installation of Charm++; even though the nodes were able to connect to each other over `ssh`, Charm++ was not able to establish its own connections. We switched over to the MPI version, but were unable to get MPI working until right before shipping the cluster.
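For context on the failure mode: the `netlrts` build launches through `charmrun`, which reads a nodelist file naming the hosts and the remote shell to use. A minimal sketch, with placeholder hostnames:

```
# nodelist (hostnames are placeholders)
group main ++shell ssh
  host node01
  host node02

# launch, e.g.:
#   ./charmrun +p64 ++nodelist ./nodelist ./namd3 sim.conf
```

In our case `ssh` itself worked, so the failure was somewhere in the connections `charmrun` establishes between the node processes after the initial login.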
### ICON
Throughout the summer and fall, compiling ICON was a massive struggle for our team: documentation was limited to a few custom architectures, and the complex nature of the program meant that a lot of testing and debugging was required to find the right set of compile parameters for our architecture. The compile process required iterating through build scripts and making sure all of the required dependencies were able to talk to each other. Because our cluster changed in certain weeks, our linker flags and other variables occasionally changed too. Having Spack set up made a huge difference: it let us more easily manage dependencies and installations and make sure everything used a supported version.
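A Spack environment was the mechanism behind that dependency management. The `spack.yaml` below is a hedged sketch: the specs shown are typical ICON-style prerequisites, not our exact list or versions:

```yaml
# spack.yaml -- illustrative environment for an ICON-style CPU build
spack:
  specs:
    - openmpi
    - netcdf-fortran ^hdf5+mpi   # force the MPI-enabled HDF5 variant
    - eccodes
  concretizer:
    unify: true                  # one consistent version of each dependency
```

With the environment activated, `spack install` concretizes everything together, so the compilers and libraries the ICON configure scripts pick up are mutually consistent instead of being resolved one install at a time.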
ICON required relentless debugging and iteration. Being transparent with teammates about problems, deadlines, and resource usage kept us aligned under pressure. Having the support of our home team and mentors was helpful at this stage, providing multiple perspectives and ensuring someone was always trying something new to make the best of a difficult situation.
After many trials, we settled on a CPU-only run for ICON, which freed up the GPUs for other applications. This let us give more resources and time to our other applications, which had been more successful in our testing, while still trying our best with ICON even though we knew it would be a struggle.
## Sourcing Hardware (a major challenge)
The earliest hardware we got was the [32 Port 400Gb Switch](https://micasnetworks.com/product/m2-w6920-32qc2x/) from Micas Networks. Our relationship with the company started with the SCC23 team at last year's conference. For this competition year, Micas Networks was one of our essential supporters. They generously lent us their hardware starting in May, and this support pushed our team to reach 400Gb bandwidth speeds despite having a small cluster.
<!-- Broadcom Drivers -->
## Dawn of the Final Week
The flight to Atlanta is Friday, the 15th of November. The date this section details is Friday, the 8th of November. The express shipping driver has arrived early and parked his truck on the loading dock. Three days until shipping.