Step 14: Verify Slurm Cluster and Kubernetes on the Service Cluster

Slurm cluster

After the nodes boot, verify the following to confirm that Slurm is deployed successfully.

On the Slurm controller node:

  • Verify that the required services are running. Run the following commands and confirm that each service is active (running):

systemctl status munge
systemctl status slurmctld
systemctl status slurmdbd
systemctl status mariadb
  • Verify the node status with sinfo:

    [Figure: example sinfo output]
    • Ensure that the worker nodes are listed and that the state of each node is idle.
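
The two checks above can be scripted. The following is a minimal sketch, not part of Omnia: the function names check_services and non_idle_nodes are hypothetical, and it assumes bash, systemd, and the Slurm client tools on the controller node. It uses systemctl is-active for the service check and sinfo in node-oriented mode (-N) with no header (-h) and a "name state" format (-o "%N %T").

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the checks described above (not shipped with Omnia).

# Services named in this step.
services="munge slurmctld slurmdbd mariadb"

# Report whether each service is active; return non-zero if any is not.
check_services() {
    local rc=0 svc
    for svc in $services; do
        if systemctl is-active --quiet "$svc"; then
            echo "OK   $svc"
        else
            echo "FAIL $svc"
            rc=1
        fi
    done
    return "$rc"
}

# Read "<node> <state>" lines on stdin and print every node whose
# state is not exactly "idle" (for example down, drained, or idle*).
non_idle_nodes() {
    awk '$2 != "idle" { print $1 " " $2 }'
}

# Example usage on the Slurm controller node:
#   check_services
#   sinfo -N -h -o "%N %T" | non_idle_nodes
```

If non_idle_nodes prints nothing, every listed node reported the idle state. A node that belongs to several partitions appears once per partition in the sinfo -N output.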

It is recommended to store job output and error files in NFS-mounted directories (/var/log/slurm/) so that job logs are persisted.
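
As an illustration of the recommendation above, a batch script can direct its output and error files to that directory with sbatch directives. This is a sketch only: the job name demo is a placeholder, and it assumes the submitting user has write permission on /var/log/slurm/. In the file name patterns, %x expands to the job name and %j to the job ID.

```shell
#!/bin/bash
# Example job script (placeholder job name; adjust to your workload).
#SBATCH --job-name=demo
#SBATCH --output=/var/log/slurm/%x-%j.out
#SBATCH --error=/var/log/slurm/%x-%j.err

# Record which node ran the job.
echo "Running on $(uname -n)"
```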

Slurm cluster with GPU

  • On Slurm nodes that have GPUs, slurmd may take some time to start because the GPU driver is installed first. To view the logs during this process, run:

    tail -f /var/log/cloud-init-output.log
    
  • The CUDA installation path on the OIM and nodes must be {client_share_path}/slurm/cuda.

  • client_share_path is the value specified for nfs_slurm in storage_config.yml.
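
The GPU checks above might be combined as in the sketch below. It assumes that the driver installation runs as part of cloud-init (suggested by the log file named above) and that the cloud-init CLI is available on the node. The function names cuda_dir and check_gpu_node are hypothetical, and the share path must be supplied from your own storage_config.yml.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a GPU node (not shipped with Omnia).

# Build the expected CUDA path from client_share_path.
# A single trailing slash on the argument is tolerated.
cuda_dir() {
    printf '%s/slurm/cuda\n' "${1%/}"
}

# Wait for cloud-init to finish, then report the slurmd state and
# whether the expected CUDA directory exists.
check_gpu_node() {
    local share="$1"
    cloud-init status --wait
    systemctl is-active slurmd
    if [ -d "$(cuda_dir "$share")" ]; then
        echo "CUDA directory present"
    else
        echo "CUDA directory missing"
    fi
}

# Example usage, with client_share_path taken from storage_config.yml:
#   check_gpu_node "<client_share_path>"
```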

PAM Feature for Slurm

Slurm PAM restricts SSH access to compute nodes for non-root users. A non-root user can log in to a compute node only while one of their jobs is actively running on that node. After the job completes, the user is automatically logged out.

Log in to the login node as the LDAP user and submit a job:

ssh <ldap_user>@<login_node_hostname>
sbatch job.sh

While the job is running, SSH as <ldap_user> to the Slurm node where the job is running. After the job completes, <ldap_user> is logged out of that node.
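
To find the node to SSH to, the running job can be looked up with squeue. The sketch below is one way to do this. The helper name node_for_job is hypothetical, and it assumes job.sh runs long enough for you to log in. sbatch --parsable prints the job ID, and squeue -o "%i %N" prints the job ID and node list for each job.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Omnia).

# Read "<jobid> <nodelist>" lines on stdin and print the node list
# of the job whose ID is passed as the first argument.
node_for_job() {
    awk -v id="$1" '$1 == id { print $2 }'
}

# Example usage on the login node as <ldap_user>:
#   jobid=$(sbatch --parsable job.sh)
#   node=$(squeue -h -u "$USER" -t RUNNING -o "%i %N" | node_for_job "$jobid")
#   ssh "$node"
```

For a job that spans several nodes, squeue prints a compressed node list, which scontrol show hostnames can expand into individual host names.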

Kubernetes on the service cluster

Run the following commands on the Kubernetes controller node:

kubectl get pods -A -o wide
kubectl get nodes -o wide
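
The output of these commands can be filtered to show only problems. The sketch below assumes that a healthy cluster reports every node as Ready and every pod as Running or Completed. This page does not state that expectation, so adjust it to your deployment. The function names not_ready_nodes and unhealthy_pods are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with Omnia).

# Read "kubectl get nodes --no-headers" output on stdin and print
# the name and status of every node whose status is not exactly "Ready".
not_ready_nodes() {
    awk '$2 != "Ready" { print $1, $2 }'
}

# Read "kubectl get pods -A --no-headers" output on stdin and print
# namespace/name and status of pods that are neither Running nor Completed.
unhealthy_pods() {
    awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Example usage on the Kubernetes controller node:
#   kubectl get nodes --no-headers | not_ready_nodes
#   kubectl get pods -A --no-headers | unhealthy_pods
```

Empty output from both pipelines means that nothing was flagged. A cordoned node (status Ready,SchedulingDisabled) is also listed by not_ready_nodes.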

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.