- Wipe hard drive
- Create a bootable USB drive with Ubuntu 24.04.3 LTS
- Enter the BIOS and set the USB drive as the first boot option
- Boot
- Enable SSH
- Create a user with username `cadmin` (please consult the team for the password)
- Edit `/etc/hosts` and add `10.0.0.1 orca`. The file should resemble the configuration provided in this repo under `/hosts/etchostscompute1.txt`
- Use `sudo vim network.yaml` to create the network YAML file
  - Enter the following information:

        network:
          version: 2
          ethernets:
            eth0:
              dhcp4: no
              addresses:
                - 10.0.0.X/24

  - Note that the file is space sensitive, so follow the format exactly.
  - `X` is specific to the computer (by convention, it should be the number of the compute node). Always check the `/etc/hosts` file on `orca` (the head node) to make sure a compute node with that number does not already exist.
- Use `esc` then `:wq` to exit vim
- Enter `sudo netplan apply`
  - If you see any formatting errors, double check that you followed the above format exactly
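Because the YAML is whitespace sensitive, generating it can be less error-prone than typing it by hand. The following is a minimal sketch, not part of the original procedure: `render_netplan` is a hypothetical helper name, and the range check assumes only that `10.0.0.1` is reserved for the head node on a `/24` network.

```shell
# Hypothetical helper: print the netplan YAML for the node whose
# address is 10.0.0.<octet>. The interface name eth0 is taken from the steps above.
render_netplan() (
  octet="$1"
  # Reject empty input, non-digits, and leading zeros.
  case "$octet" in
    ''|0*|*[!0-9]*) echo "octet must be a number without leading zeros" >&2; exit 1 ;;
  esac
  # Assumption: .1 is the head node (orca) and .255 is the /24 broadcast address.
  if [ "$octet" -lt 2 ] || [ "$octet" -gt 254 ]; then
    echo "octet must be between 2 and 254" >&2
    exit 1
  fi
  cat <<EOF
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: no
      addresses:
        - 10.0.0.${octet}/24
EOF
)
```

For example, `render_netplan 3 > network.yaml` writes the file for the node at `10.0.0.3`; review the result before running `sudo netplan apply`.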
At this point, your node should be ready to connect to the server. Use an Ethernet cable to connect your node to the network switch.
- SSH into `cadmin` on the head node from your PC
- Verify that you can `ping 10.0.0.X`, where `X` is the number of the added compute node
- Add `10.0.0.X computeX` to `/etc/hosts` on orca, the head node.
- Verify you can `ping computeX` from the head node
- Verify that you can `ssh computeX`.
  - If `ping computeX` works but `ssh computeX` fails, you likely set up the `cadmin` account incorrectly on the compute node
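The duplicate check on the head node's `/etc/hosts` can be automated. This is a rough sketch under the assumption that the hosts file uses the usual `IP name [aliases]` layout; `add_host_entry` is a hypothetical helper, not something provided by this repo.

```shell
# Hypothetical helper: append "IP NAME" to a hosts file unless the IP
# or the hostname already appears on an uncommented line.
add_host_entry() (
  file="$1" ip="$2" name="$3"
  # awk exits with status 1 when a conflicting entry is found.
  if ! awk -v ip="$ip" -v name="$name" '
        /^[[:space:]]*#/ { next }
        $1 == ip { found = 1 }
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit found }
      ' "$file"; then
    echo "refusing to add: $ip or $name is already in $file" >&2
    exit 1
  fi
  printf '%s %s\n' "$ip" "$name" >> "$file"
)
```

Try it on a copy first, for example `cp /etc/hosts /tmp/hosts && add_host_entry /tmp/hosts 10.0.0.3 compute3`. Writing to the real `/etc/hosts` requires root privileges.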
Munge is an authentication service that Slurm uses to authenticate its communications.
Note that `sudo apt install` does not work out of the box on the compute nodes, as they are not connected to the internet. The steps below route apt through the head node's cache proxy.
- SSH into the compute node from the head node
- Configure apt to use the head node's cache proxy
  - Create `/etc/apt/apt.conf.d/01proxy`:

        sudo tee /etc/apt/apt.conf.d/01proxy >/dev/null <<'EOF'
        Acquire::http::Proxy "http://10.0.0.1:3142";
        Acquire::https::Proxy "http://10.0.0.1:3142";
        EOF

- Create the correct sources list
  - Note we are using `noble`:

        . /etc/os-release
        echo $UBUNTU_CODENAME

- Then overwrite `/etc/apt/sources.list` (NOTE: if this doesn't work, try manually replacing `${UBUNTU_CODENAME}` with `noble`):

      sudo tee /etc/apt/sources.list >/dev/null <<EOF
      deb http://archive.ubuntu.com/ubuntu ${UBUNTU_CODENAME} main restricted universe multiverse
      deb http://archive.ubuntu.com/ubuntu ${UBUNTU_CODENAME}-updates main restricted universe multiverse
      deb http://security.ubuntu.com/ubuntu ${UBUNTU_CODENAME}-security main restricted universe multiverse
      deb http://archive.ubuntu.com/ubuntu ${UBUNTU_CODENAME}-backports main restricted universe multiverse
      EOF

- Disable any leftover bad entries in the list files; this comments out `deb` lines that point at `10.0.0.1` (NOTE: if this doesn't work, try removing the `sudo` at the start of the command):

      sudo find /etc/apt/sources.list.d -type f -name '*.list' -exec sudo sed -i 's|^deb http://10\.0\.0\.1|# &|' {} \;

- Refresh and bring the system to a clean state

      sudo apt-get clean
      sudo rm -rf /var/lib/apt/lists/*
      sudo apt-get update
      sudo apt --fix-broken install -y
      sudo apt-get dist-upgrade -y

- Install Munge

      sudo apt-get install -y munge libmunge2

`which munge` should return `/usr/bin/munge`.

Check `systemctl status munge`.
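If `${UBUNTU_CODENAME}` fails to expand in the sources list step, one option is to pass the codename explicitly. The sketch below assumes nothing beyond the four archive lines listed above; `render_sources_list` is a hypothetical helper name.

```shell
# Hypothetical helper: print the four Ubuntu archive lines for a codename.
# Fails with a usage message if no codename is given.
render_sources_list() (
  codename="${1:?usage: render_sources_list CODENAME}"
  cat <<EOF
deb http://archive.ubuntu.com/ubuntu ${codename} main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu ${codename}-updates main restricted universe multiverse
deb http://security.ubuntu.com/ubuntu ${codename}-security main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu ${codename}-backports main restricted universe multiverse
EOF
)
```

A possible use is `render_sources_list noble | sudo tee /etc/apt/sources.list >/dev/null`, which avoids relying on the variable being set in the current shell.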
Chrony is an NTP client. Munge requires all nodes to be time synchronized.

- Run `sudo apt-get install -y chrony`
- Edit the config file (`/etc/chrony/chrony.conf`)
  - Add `server 10.0.0.1 iburst`
- Restart chrony: `sudo systemctl restart chrony`
- Makestep: `sudo chronyc makestep` (allows chrony to make a large jump in time to correct for errors)
  - Should see `200 OK`

Run `chronyc tracking`; you should see `Leap status     : Normal` at the bottom.

Run `sudo systemctl daemon-reload`.

Run `sudo systemctl restart munge`.

Verify with `sudo systemctl status munge`.
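The leap status check can be scripted instead of read by eye. This is a small sketch that assumes `chronyc tracking` prints a line of the form `Leap status     : Normal`; `leap_status_ok` is a hypothetical helper and the comparison is case-insensitive to be tolerant of formatting differences.

```shell
# Hypothetical helper: read `chronyc tracking` output on stdin and
# succeed only if the "Leap status" field is "Normal".
leap_status_ok() {
  awk -F: '
    /^Leap status/ {
      # Trim surrounding whitespace from the value after the colon.
      gsub(/^[[:space:]]+|[[:space:]]+$/, "", $2)
      status = tolower($2)
    }
    END { exit (status == "normal" ? 0 : 1) }
  '
}
```

For instance, `chronyc tracking | leap_status_ok && echo "leap status is normal"` prints the message only when the check passes.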
Steps marked with [CN] should be performed on the compute node; steps marked with [HN] should be performed on the head node.

- [CN] Run `sudo apt-get install -y slurm-wlm`
- [HN] Edit the `slurm.conf` file
  - Add `NodeName=computeX NodeAddr=10.0.0.X State=UNKNOWN`, where `X` is the number of your node
  - Push an updated version of `slurm.conf` to GitHub
- [CN] Create the `slurm.conf` file: `sudo vim /etc/slurm/slurm.conf`
  - Copy and paste the contents from `slurm.conf` on the head node
- [CN] Start slurm: `sudo systemctl restart slurmd`, then verify with `sudo systemctl status slurmd`
- [HN] Run `scontrol reconfigure`
- [HN] Verify status with `sinfo` on the head node; you should see the state of your node as `idle`
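Adding the node line to `slurm.conf` by hand risks defining the same node twice. The following sketch guards against that; `add_slurm_node` is a hypothetical helper, and it assumes node lines start with `NodeName=computeX` exactly as written in the steps above.

```shell
# Hypothetical helper: append the node line for computeX to a slurm.conf
# unless a NodeName=computeX line is already present.
add_slurm_node() (
  conf="$1" node="$2"
  case "$node" in
    ''|*[!0-9]*) echo "node must be a number" >&2; exit 1 ;;
  esac
  # Match the exact node name so compute1 does not match compute10.
  if grep -Eq "^NodeName=compute${node}([[:space:]]|\$)" "$conf"; then
    echo "compute${node} is already defined in $conf" >&2
    exit 1
  fi
  echo "NodeName=compute${node} NodeAddr=10.0.0.${node} State=UNKNOWN" >> "$conf"
)
```

For example, `add_slurm_node slurm.conf 3` run against a working copy of the file, followed by the push and `scontrol reconfigure` steps described above.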
- Install NFS client: `sudo apt install nfs-common`
- Edit `/etc/hosts` and add `10.0.0.2 compute1`
- Add `compute1:/home /home nfs defaults 0 0` to `/etc/fstab`
- Mount from compute1: `sudo mount /home`

Type `ls`; you should see cadmin's home directory rather than an empty directory.
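Re-running the setup should not leave duplicate lines in `/etc/fstab`. A minimal sketch of an idempotent append, assuming the exact entry given above; `ensure_fstab_entry` is a hypothetical helper name.

```shell
# Hypothetical helper: add the NFS entry for /home to an fstab file
# only if that exact line is not already present.
ensure_fstab_entry() (
  fstab="$1"
  entry="compute1:/home /home nfs defaults 0 0"
  # -x matches whole lines only, -F treats the entry as a literal string.
  if ! grep -qxF "$entry" "$fstab"; then
    printf '%s\n' "$entry" >> "$fstab"
  fi
)
```

Running it twice against the same file leaves a single entry. Editing the real `/etc/fstab` requires root privileges, so testing against a copy first is the safer route.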