Step 14: Verify Slurm Cluster and Kubernetes on the Service Cluster

Slurm cluster

After booting the nodes, verify the following to ensure slurm is deployed successfully: On slurm controller node

Verify if the required services are running. Run the following commands and confirm that each service is active (running):
systemctl status munge
systemctl status slurmctld
systemctl status slurmdbd
systemctl status mariadb
Verify the node status with sinfo:

Ensure that the worker nodes are listed and the node state should be idle.

It is recommended to store job output and error files in NFS-mounted directories (/var/log/slurm/) so that job logs are persisted.

Slurm cluster with GPU

On Slurm nodes that have GPUs, it may take some time for Slurmd to start because of the GPU driver installation. To view the logs during this process, you can run:
```
tail -f /var/log/cloud-init-output.log
```
The CUDA installation path on the OIM and nodes must be {client_share_path}/slurm/cuda.
The client_share_path is the same as mentioned in storage_config.yml for nfs_slurm.

PAM Feature for Slurm

Slurm PAM restricts SSH access to compute nodes for non-root users. You can log in only while their job is actively running on the node. After the job is completed, you are automatically logged out.

On login node: Switch to the LDAP user:

ssh <ldap_user>@<login_node_hostname>
sbatch job.sh

While the job is running, ssh as <ldap_user> to the slurm node where the job is running. After the job is completed, <ldap_user> is logged out.

Kubernetes on the service cluster

Run the following commands on Kubernetes controller node:

kubectl get pods -A -o wide
kubectl get nodes -o wide

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.