Configure Deployment Required for LDMS Telemetry
Using Omnia, you can deploy Lightweight Distributed Metric Service (LDMS) to collect in-band telemetry from Slurm clusters. The deployment includes installing LDMS producers on Slurm nodes, deploying LDMS aggregator and store components on Service Kubernetes nodes, and integrating LDMS with Kafka for downstream telemetry processing.
LDMS collects system metrics such as CPU, memory, network, I/O, and Slurm job statistics. LDMS includes these components:
LDMS producer (collector): Collects local system metrics and runs on Slurm controller, compute, and login nodes.
LDMS aggregator: Receives and aggregates metrics from producers. Runs as a Kubernetes pod.
LDMS store: Buffers and stores metric batches reliably. Runs as a Kubernetes pod.
Kafka broker: Handles telemetry streaming for consumption by downstream systems.
For more details on LDMS, see Lightweight Distributed Metric Service.
During deployment, Omnia attaches LDMS aggregator and store pods to the admin network. This configuration improves throughput between Slurm nodes and the Kubernetes cluster.
Prerequisites
Ensure that the discovery.yml playbook has been executed successfully with service_kube_control_plane and service_kube_node in the mapping file.
Steps
Specify the following entries in software_config.json. If any entry is missing, Omnia skips LDMS deployment and logs an informational message. For more information, see Input Parameters for Local Repositories.
{"name": "slurm_custom", "arch": ["x86_64","aarch64"]},
{"name": "service_k8s", "version": "1.34.1", "arch": ["x86_64"]},
{"name": "ldms", "arch": ["x86_64", "aarch64"]}
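Before running the playbooks, it can help to confirm that all three entries are present. The following is a minimal sketch, not Omnia's own validation logic. It assumes the entries live in a top-level list named softwares (an assumption about the file layout), and the helper name missing_ldms_entries is hypothetical.

```python
import json

# Entry names taken from the step above. Treating them as a hard requirement
# mirrors the documented "skip if missing" behaviour; this is not Omnia's code.
REQUIRED_ENTRIES = ("slurm_custom", "service_k8s", "ldms")


def missing_ldms_entries(config: dict, key: str = "softwares") -> list:
    """Return the required entry names that are absent from the software list.

    `key` is an assumed name for the top-level list in software_config.json.
    """
    present = {item.get("name") for item in config.get(key, [])}
    return [name for name in REQUIRED_ENTRIES if name not in present]


# Inline sample that deliberately omits the "ldms" entry.
sample = json.loads("""
{
  "softwares": [
    {"name": "slurm_custom", "arch": ["x86_64", "aarch64"]},
    {"name": "service_k8s", "version": "1.34.1", "arch": ["x86_64"]}
  ]
}
""")

print(missing_ldms_entries(sample))
```

An empty list means every entry needed for LDMS deployment was found.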
Ensure the ldms.json file contains the following entries.
Note
If the LDMS RPM is not available, refer to Building LDMS Producer RPM Package for instructions on building LDMS RPMs.
The following ldms.json sample is for x86_64. For aarch64 architecture, update the repo name accordingly in the ldms.json file.
{
  "ldms": {
    "cluster": [
      {"package": "python3-devel", "type": "rpm", "repo_name": "x86_64_appstream"},
      {"package": "python3-cython", "type": "rpm", "repo_name": "x86_64_appstream"},
      {"package": "openssl-libs", "type": "rpm", "repo_name": "x86_64_baseos"},
      {"package": "ovis-ldms", "type": "rpm", "repo_name": "x86_64_ldms"}
    ]
  }
}
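The note above asks you to update the repo names for aarch64. The sketch below shows one way to derive that variant. It assumes the aarch64 repositories use the same names with an aarch64_ prefix in place of x86_64_, which the sample does not confirm, so check the result against your local repository names. The function name retarget_repos is hypothetical.

```python
import copy


def retarget_repos(ldms_cfg: dict, src: str = "x86_64", dst: str = "aarch64") -> dict:
    """Return a copy of the ldms.json content with repo_name prefixes swapped.

    Assumption: aarch64 repos are named like the x86_64 ones, differing only
    in the architecture prefix (for example aarch64_appstream).
    """
    out = copy.deepcopy(ldms_cfg)
    for pkg in out["ldms"]["cluster"]:
        repo = pkg["repo_name"]
        if repo.startswith(src + "_"):
            pkg["repo_name"] = dst + repo[len(src):]
    return out


# Two entries copied from the x86_64 sample above.
x86_cfg = {
    "ldms": {
        "cluster": [
            {"package": "openssl-libs", "type": "rpm", "repo_name": "x86_64_baseos"},
            {"package": "ovis-ldms", "type": "rpm", "repo_name": "x86_64_ldms"},
        ]
    }
}

for pkg in retarget_repos(x86_cfg)["ldms"]["cluster"]:
    print(pkg["package"], pkg["repo_name"])
```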
In local_repo_config.yml, specify the paths for the ovis-ldms RPMs in user_repo_url_x86_64 and user_repo_url_aarch64.
Configure the omnia_config.yml:
omnia_config.yml Variables
Mandatory/Optional
Details
cluster_name
Mandatory
Type: String
Name of the cluster on which you want to deploy Kubernetes.
This input is case-sensitive. Do not add any special characters except _ (underscore) in the cluster name.
deployment
Mandatory
Type: Boolean
Indicates whether Kubernetes is deployed.
Accepted values: true or false
k8s_cni
Mandatory
Type: String
Kubernetes SDN network.
Accepted values: calico
Default value: calico
pod_external_ip_range
Mandatory
Type: String
The load balancer uses these addresses to assign external IPs to Kubernetes services.
Ensure that the IP range provided is not assigned to any node in the cluster.
Sample value: 172.16.107.170-172.16.107.200
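The requirement that no node address falls inside pod_external_ip_range can be checked ahead of deployment. The following is a small sketch using Python's standard ipaddress module. The node addresses in the example are made up for illustration, and the helper names are hypothetical.

```python
import ipaddress


def parse_ip_range(value: str):
    """Split a 'start-end' string into two IP address objects."""
    start_text, end_text = value.split("-")
    start = ipaddress.ip_address(start_text.strip())
    end = ipaddress.ip_address(end_text.strip())
    if start > end:
        raise ValueError(f"range start {start} is after range end {end}")
    return start, end


def nodes_in_range(value: str, node_ips: list) -> list:
    """Return the node IPs that collide with the external IP range."""
    start, end = parse_ip_range(value)
    return [ip for ip in node_ips if start <= ipaddress.ip_address(ip) <= end]


# Hypothetical node addresses; the second one collides with the sample range.
nodes = ["172.16.107.10", "172.16.107.180"]
print(nodes_in_range("172.16.107.170-172.16.107.200", nodes))
```

An empty result means the range is safe to hand to the load balancer.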
k8s_service_addresses
Optional
Type: String
Kubernetes internal network for services.
This network must be unused in your network infrastructure.
Default value: "10.233.0.0/18"
k8s_pod_network_cidr
Optional
Type: String
Kubernetes pod network CIDR for the internal network. Individual pods are assigned IP addresses from this range.
This network must be unused in your network infrastructure.
Default value: "10.233.64.0/18"
csi_powerscale_driver_secret_file_path
Optional
Type: File path
If you want to deploy the CSI driver for PowerScale on your service cluster, add the file path of the secrets.yaml file to this variable.
csi_powerscale_driver_values_file_path
Optional
Type: File path
If you want to deploy the CSI driver for PowerScale on your service cluster, add the file path of the values.yaml file to this variable.
nfs_storage_name
Mandatory
Type: String
Use the same name as specified for nfs_name in storage_config.yml.
k8s_crio_storage_size
Mandatory
Type: String
Specifies the disk size allocated for CRI-O container storage.
Ensure that telemetry_config.yml has the entries specific to LDMS and Kafka deployment.
Note
For LDMS telemetry configuration, at least one sampler plugin is mandatory to collect system metrics.
telemetry_config.yml Parameter
Mandatory/Optional
Details
idrac_telemetry_support
Mandatory
Type: Boolean
If you want iDRAC telemetry support on your service cluster, set this variable to true before executing the telemetry.yml and discovery.yml playbooks.
Accepted values: true or false
Default value: true
Note
If idrac_telemetry_support is set to true, the mysqldb_user, mysqldb_password, and mysqldb_root_password parameters in the omnia_config_credentials.yml file become mandatory.
Note
If you want to deploy only Slurm clusters (slurm_custom), idrac_telemetry_support must be set to false.
idrac_telemetry_collection_type
Mandatory
Specify where to store iDRAC telemetry data.
Supported values:
victoria: Store in VictoriaMetrics only
kafka: Store in Kafka only
victoria,kafka: Store in both (recommended)
Default: victoria,kafka
victoria_configurations > deployment_mode
Mandatory
Supported values:
single-node: Simple deployment (1 pod, suitable for dev/test)
cluster: High-availability deployment (7 pods, recommended for production)
Default: cluster
Cluster mode benefits:
High availability (no single point of failure)
Horizontal scalability (scale components independently)
Better performance (4x ingestion, 2x query speed)
Production-ready architecture
Single-node mode benefits:
Simple setup (fewer resources)
Suitable for small deployments (<10 nodes)
Lower resource usage (~4Gi memory vs ~10Gi for cluster)
victoria_configurations > persistence_size
Conditional Mandatory
The amount of storage allocated for each VictoriaMetrics persistent volume.
Important: Total VictoriaMetrics storage depends on deployment mode:
Single-node mode: Total storage = persistence_size * 1 pod
Cluster mode: Total storage = persistence_size * 3 vmstorage pods
Example (cluster): 8Gi * 3 = 24Gi total VictoriaMetrics storage
Accepted values: Must be specified in the form of X[Ki|Mi|Gi|Ti|Pi|Ei]
Default value: 8Gi (results in 24Gi total storage for cluster mode)
victoria_configurations > retention_period
Conditional Mandatory
Specify the number of hours to retain victoria logs before they are deleted.
Default: 168 (7 days)
kafka_configurations > persistence_size
Conditional Mandatory
The amount of storage allocated for each Kafka persistent volume.
Important: Total Kafka storage = persistence_size * 6 pods:
3 Kafka brokers (each gets persistence_size storage)
3 Kafka controllers (each gets persistence_size storage)
Example: 8Gi * 6 = 48Gi total Kafka storage
Accepted values: Must be specified in the form of X[Ki|Mi|Gi|Ti|Pi|Ei]
Default value: 8Gi (results in 48Gi total storage)
The default 8Gi size is suitable for small clusters (typically fewer than 5 nodes). For larger clusters, increase persistence_size and adjust log_retention_hours and log_retention_bytes based on expected data volume and cluster size.
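The storage arithmetic for persistence_size can be reproduced with a few lines of Python when sizing the backing storage. This sketch uses only the pod counts quoted in this table (1 or 3 for VictoriaMetrics, 6 for Kafka). The helper names are hypothetical, and the parser accepts whole numbers only, which is an assumption about the allowed X values.

```python
import re

# Binary suffixes accepted by the X[Ki|Mi|Gi|Ti|Pi|Ei] form, as powers of 1024.
_SUFFIX_POWER = {"Ki": 1, "Mi": 2, "Gi": 3, "Ti": 4, "Pi": 5, "Ei": 6}

# Pod counts quoted in the table: VictoriaMetrics storage pods per mode,
# and Kafka's 3 brokers plus 3 controllers.
VICTORIA_STORAGE_PODS = {"single-node": 1, "cluster": 3}
KAFKA_STORAGE_PODS = 6


def parse_size(text: str) -> int:
    """Convert a size such as '8Gi' to bytes (whole-number sizes only)."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti|Pi|Ei)", text.strip())
    if match is None:
        raise ValueError(f"expected X[Ki|Mi|Gi|Ti|Pi|Ei], got {text!r}")
    return int(match.group(1)) * 1024 ** _SUFFIX_POWER[match.group(2)]


def total_gib(persistence_size: str, pods: int) -> float:
    """Total storage in GiB when each pod gets its own persistent volume."""
    return parse_size(persistence_size) * pods / 1024 ** 3


print(total_gib("8Gi", VICTORIA_STORAGE_PODS["cluster"]))  # 24.0
print(total_gib("8Gi", KAFKA_STORAGE_PODS))                # 48.0
```

With both components at their defaults in cluster mode, the combined request is 72Gi, which is the figure to compare against the capacity of the configured storage.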
Caution
Ensure that the Kafka broker settings persistence_size, log_retention_hours, and log_retention_bytes are configured based on your data retention requirements. If the persistent volume reaches its capacity before logs are deleted according to the configured log retention period, Kafka brokers may run out of disk space. For more details on managing Kafka log retention and cleanup policies, see Managing Kafka logs with delete and compact policies.
kafka_configurations > log_retention_hours
Conditional Mandatory
Specify the number of hours to retain Kafka logs before they are deleted.
Default: 168 (7 days)
kafka_configurations > log_retention_bytes
Conditional Mandatory
Specify the maximum size of Kafka logs (in bytes) before they are deleted.
Default: -1 (unlimited)
kafka_configurations > log_segment_bytes
Conditional Mandatory
Specify the maximum size of a Kafka log segment (in bytes) before a new segment is rolled.
Default: 1073741824 (1 GiB)
kafka_configurations > topic_partitions
Conditional Mandatory
Specify the partition counts for the following topics:
idrac
ldms
ome
Default partition counts: idrac=1, ldms=2, ome=1
Example:
topic_partitions:
  - name: "idrac"
    partitions: 1
  - name: "ldms"
    partitions: 2
LDMS port configurations > ldms_agg_port
Conditional Mandatory
Specify the aggregator port to be used on the service k8s cluster.
Valid range: 6001-6100
Default: 6001
LDMS port configurations > ldms_store_port
Conditional Mandatory
Specify the store daemon port to be used on the service k8s cluster.
The port can be the same as the LDMS aggregator port specified for ldms_agg_port.
Valid range: 6001-6100
Default: 6001
LDMS port configurations > ldms_sampler_port
Conditional Mandatory
Specify the sampler port to be used on the compute nodes.
Valid range: 10001-10100
Default: 10001
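The three port ranges above are easy to check before you run the playbooks. The sketch below encodes only the ranges stated in this table. The function name invalid_ports is hypothetical and this is not the validation Omnia itself performs.

```python
# Valid ranges as documented for each LDMS port setting.
PORT_RANGES = {
    "ldms_agg_port": (6001, 6100),
    "ldms_store_port": (6001, 6100),
    "ldms_sampler_port": (10001, 10100),
}


def invalid_ports(ports: dict) -> dict:
    """Map each out-of-range or missing port setting to an explanation."""
    problems = {}
    for name, (low, high) in PORT_RANGES.items():
        value = ports.get(name)
        if not isinstance(value, int) or not low <= value <= high:
            problems[name] = f"expected an integer in {low}-{high}, got {value!r}"
    return problems


# The documented defaults; the aggregator and store may share a port.
defaults = {"ldms_agg_port": 6001, "ldms_store_port": 6001, "ldms_sampler_port": 10001}
print(invalid_ports(defaults))
```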
ldms_sampler_configurations > meminfo
Mandatory
Collects memory usage statistics (free, used, buffers, cached, etc.).
plugin_name: meminfo
config_parameters: ""
activation_parameters: interval=1000000 collects memory metrics every 1 second. The interval unit is microseconds.
ldms_sampler_configurations > procstat2
Mandatory
Collects process statistics (CPU, memory, I/O per process).
plugin_name: procstat2
config_parameters: ""
activation_parameters: interval=1000000
The interval unit is microseconds.
ldms_sampler_configurations > vmstat
Mandatory
Collects virtual memory statistics (paging, swapping, memory pressure).
plugin_name: vmstat
config_parameters: ""
activation_parameters: interval=1000000
The interval unit is microseconds.
ldms_sampler_configurations > loadavg
Mandatory
Collects system load average (1, 5, and 15 minute averages).
plugin_name: loadavg
config_parameters: ""
activation_parameters: interval=1000000
The interval unit is microseconds.
ldms_sampler_configurations > procnetdev2
Mandatory
Collects network interface statistics (bytes, packets, errors, drops per interface).
The possible config parameters are:
ifaces=eth0,eth1: Specific interfaces to monitor. If not specified, all network interfaces are monitored.
plugin_name: procnetdev2
config_parameters: ""
activation_parameters: interval=1000000 offset=0
The interval unit is microseconds.
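Because the sampler interval is expressed in microseconds, it is easy to be off by a factor of a thousand when changing it. The helper below is an illustrative sketch that builds an activation_parameters string from a value in seconds. The function name is hypothetical, and the string format follows the interval=... offset=... pattern shown in the rows above.

```python
def activation_parameters(seconds: float, offset_us=None) -> str:
    """Build an activation_parameters string from a sampling period in seconds.

    The interval is written in microseconds, as the samplers expect.
    `offset_us` is appended only when given, matching the procnetdev2 row.
    """
    interval_us = round(seconds * 1_000_000)
    if interval_us <= 0:
        raise ValueError("sampling interval must be positive")
    parts = [f"interval={interval_us}"]
    if offset_us is not None:
        parts.append(f"offset={offset_us}")
    return " ".join(parts)


print(activation_parameters(1))      # interval=1000000
print(activation_parameters(1, 0))   # interval=1000000 offset=0
print(activation_parameters(0.5))    # interval=500000
```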