Configure Deployment Required for LDMS Telemetry

Using Omnia, you can deploy Lightweight Distributed Metric Service (LDMS) to collect in-band telemetry from Slurm clusters. The deployment includes installing LDMS producers on Slurm nodes, deploying LDMS aggregator and store components on Service Kubernetes nodes, and integrating LDMS with Kafka for downstream telemetry processing.

LDMS collects system metrics such as CPU, memory, network, I/O, and Slurm job statistics. The deployment consists of these components:

  • LDMS producer (collector): Collects local system metrics and runs on Slurm controller, compute, and login nodes.

  • LDMS aggregator: Receives and aggregates metrics from producers. Runs as a Kubernetes pod.

  • LDMS store: Buffers and stores metric batches reliably. Runs as a Kubernetes pod.

  • Kafka broker: Handles telemetry streaming for consumption by downstream systems.

For more details on LDMS, see Lightweight Distributed Metric Service.

Note

To consume LDMS metrics from the Kafka ‘ldms’ topic, transform to Prometheus format, and write to VictoriaMetrics, see Configure Vector Telemetry Pipeline to Route Data to Victoria from Kafka.

During deployment, Omnia attaches LDMS aggregator and store pods to the admin network. This configuration improves throughput between Slurm nodes and the Kubernetes cluster.

Supported LDMS Plugins

The following LDMS plugins are supported in Omnia:

  • meminfo: Memory usage statistics

  • procstat2: Process statistics

  • vmstat: Virtual memory statistics

  • loadavg: System load average

  • procnetdev2: Network interface statistics

Note

The LDMS Slurm sampler metrics are not supported in the current telemetry deployment.

Prerequisites

  • Ensure that the provision.yml playbook has been executed successfully with service_kube_control_plane and service_kube_node in the mapping file.

Steps

  1. Specify the following entries in the software_config.json. If any entry is missing, Omnia skips LDMS deployment and logs an informational message. For more information, see Input Parameters for Local Repositories.

{"name": "slurm_custom", "arch": ["x86_64","aarch64"]},
{"name": "service_k8s", "version": "1.34.1", "arch": ["x86_64"]},
{"name": "ldms", "arch": ["x86_64", "aarch64"]}
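As a rough pre-flight check, the following Python sketch reports which of the three required entries are missing before you run the deployment. It assumes the entries sit in a top-level "softwares" list inside software_config.json; that key name and the helper missing_ldms_entries are illustrative assumptions, not part of Omnia.

```python
import json

# Entries that must all be present for Omnia to deploy LDMS (from the step above).
REQUIRED = {"slurm_custom", "service_k8s", "ldms"}

def missing_ldms_entries(config_text: str) -> set[str]:
    """Return the required software names that are absent from the config."""
    config = json.loads(config_text)
    # Assumption: the entries live in a top-level "softwares" list.
    present = {entry.get("name") for entry in config.get("softwares", [])}
    return REQUIRED - present

sample = """
{
  "softwares": [
    {"name": "slurm_custom", "arch": ["x86_64", "aarch64"]},
    {"name": "service_k8s", "version": "1.34.1", "arch": ["x86_64"]},
    {"name": "ldms", "arch": ["x86_64", "aarch64"]}
  ]
}
"""
print(missing_ldms_entries(sample) or "all LDMS prerequisites present")
```

An empty result means none of the three entries is missing, so Omnia should not skip LDMS deployment for that reason.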
  2. Ensure that the ldms.json file contains the following entries.

Note

If the LDMS RPM is not available, refer to Building LDMS Producer RPM Package for instructions on building LDMS RPMs.

The following ldms.json sample is for x86_64. For aarch64 architecture, update the repo name accordingly in the ldms.json file.

{
    "ldms": {
        "cluster": [
            {"package": "python3-devel", "type": "rpm", "repo_name": "x86_64_appstream"},
            {"package": "python3-cython", "type": "rpm", "repo_name": "x86_64_appstream"},
            {"package": "openssl-libs", "type": "rpm", "repo_name": "x86_64_baseos"},
            {"package": "ovis-ldms", "type": "rpm", "repo_name": "x86_64_ldms"}
        ]
    }
}
  3. In local_repo_config.yml, set user_repo_url_x86_64 and user_repo_url_aarch64 to the paths of the ovis-ldms RPMs for the corresponding architecture.

  4. Configure omnia_config.yml:

    omnia_config.yml (columns: Variables, Mandatory/Optional, Details)

    cluster_name

    Mandatory

    • Type: String

    • Name of the cluster on which you want to deploy Kubernetes.

    • This input is case-sensitive. Do not add any special characters except _ (underscore) in the cluster name.

    deployment

    Mandatory

    • Type: Boolean

    • Indicates whether Kubernetes is deployed.

    • Accepted values: true or false

    k8s_cni

    Mandatory

    • Type: String

    • Kubernetes SDN network.

    • Accepted values: calico

    • Default value: calico

    pod_external_ip_range

    Mandatory

    • Type: String

    • The load balancer assigns external IPs to Kubernetes services from this address range.

    • Ensure that the IP range provided is not assigned to any node in the cluster.

    • Ensure that the pod_external_ip_range defined in the omnia_config.yml file is reachable from the OpenManage Enterprise appliance and the SFM network.

    • Sample values: 172.16.107.170-172.16.107.200
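To illustrate the requirement that pod_external_ip_range must not contain any node address, here is a minimal Python sketch using the standard ipaddress module. The node IPs and the helper names are hypothetical; substitute the addresses from your own mapping file.

```python
import ipaddress

def parse_ip_range(text: str):
    """Split 'start-end' into two IPv4Address objects and validate the order."""
    start_text, end_text = text.split("-")
    start = ipaddress.IPv4Address(start_text.strip())
    end = ipaddress.IPv4Address(end_text.strip())
    if start > end:
        raise ValueError(f"range start {start} is after range end {end}")
    return start, end

def nodes_inside_range(ip_range: str, node_ips: list[str]) -> list[str]:
    """Return the node IPs that collide with the load balancer range."""
    start, end = parse_ip_range(ip_range)
    return [ip for ip in node_ips if start <= ipaddress.IPv4Address(ip) <= end]

# Hypothetical node addresses for illustration only.
nodes = ["172.16.107.10", "172.16.107.180"]
print(nodes_inside_range("172.16.107.170-172.16.107.200", nodes))
# ['172.16.107.180']
```

Any address returned by nodes_inside_range is already assigned to a node and should be excluded from the range.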

    k8s_service_addresses

    Optional

    • Type: String

    • Kubernetes internal network for services.

    • This network must be unused in your network infrastructure.

    • Default value: "10.233.0.0/18"

    k8s_pod_network_cidr

    Optional

    • Type: String

    • Kubernetes pod network CIDR for internal network. When used, it will assign IP addresses from this range to individual pods.

    • This network must be unused in your network infrastructure.

    • Default value: "10.233.64.0/18"
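Both k8s_service_addresses and k8s_pod_network_cidr must be unused elsewhere in your infrastructure. The following Python sketch shows one way to detect overlaps among CIDR blocks with the standard ipaddress module. The "admin_network" entry is a made-up example network, not an Omnia default.

```python
import ipaddress
from itertools import combinations

def overlapping_networks(networks: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of named CIDR blocks that overlap each other."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    return [
        (a, b)
        for a, b in combinations(parsed, 2)
        if parsed[a].overlaps(parsed[b])
    ]

networks = {
    "k8s_service_addresses": "10.233.0.0/18",  # default from the table
    "k8s_pod_network_cidr": "10.233.64.0/18",  # default from the table
    "admin_network": "10.233.32.0/24",         # hypothetical existing network
}
print(overlapping_networks(networks))
# [('k8s_service_addresses', 'admin_network')]
```

The two defaults do not overlap each other; the reported pair comes from the hypothetical admin network falling inside the service range.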

    csi_powerscale_driver_secret_file_path

    Optional

    • Type: File path

    • If you want to deploy the CSI driver for PowerScale on your service cluster, add the file path of the secrets.yaml file to this variable.

    csi_powerscale_driver_values_file_path

    Optional

    • Type: File path

    • If you want to deploy the CSI driver for PowerScale on your service cluster, add the file path of the values.yaml file to this variable.

    nfs_storage_name

    Mandatory

    • Type: String

    • Use the same name as one of the name entries defined in storage_config.yml.

    k8s_crio_storage_size

    Mandatory

    • Type: String

    • Specifies the disk size allocated for CRI-O container storage.

    etcd_on_local_disk

    Optional

    • Type: Boolean

    • Determines whether ETCD is deployed on local disk or NFS storage.

    • Accepted values: true or false

    • Default value: false

    • When set to true, ETCD is deployed on local disk on all master nodes. The system prioritizes BOSS card if available, and falls back to SSD/SATA disks if BOSS is not present. The /var/lib/etcd directory is mounted on the selected local disk.

    • When set to false or omitted, ETCD storage is provisioned using NFS, and no local disk configuration is performed for ETCD.

    • Important: Migration from NFS to local disk is not supported during upgrades. This configuration is only applicable for fresh installations.

  5. Ensure that telemetry_config.yml contains the entries required for LDMS and Kafka deployment.

    Note

    For LDMS telemetry configuration, at least one sampler plugin is mandatory to collect system metrics.

    telemetry_config.yml (columns: Parameter, Mandatory/Optional, Details)

    telemetry_sources > idrac > metrics_enabled

    Mandatory

    • Type: Boolean

    • Enable or disable iDRAC metrics collection from Dell PowerEdge servers

    • Collected metrics: temperature, power, fan speed, storage health, CPU/memory errors

    • Data path:
      • iDRAC Receiver → ActiveMQ → KafkaPump → Kafka ‘idrac’ topic

      • iDRAC Receiver → ActiveMQ → VictoriaPump → vmagent → victoria_metrics

    • Accepted values: true or false

    • Default value: true

    Note

    If iDRAC telemetry is enabled, mysqldb_user, mysqldb_password, and mysqldb_root_password parameters in the omnia_config_credentials.yml file become mandatory.

    Note

    If you want to deploy only Slurm clusters (slurm_custom), set metrics_enabled to false.

    telemetry_sources > idrac > collection_targets

    Mandatory

    • Collection targets define where iDRAC data is sent before Vector processing

    • Supported values: victoria_metrics, kafka

    • Multiple targets: Can specify both [victoria_metrics, kafka]

    • Default: [victoria_metrics, kafka]

    idrac_telemetry_configurations > mysqldb_storage

    Conditional Mandatory

    • MySQL database storage for iDRAC telemetry

    • Purpose: Storage configuration for iDRAC telemetry MySQL database

    • Accepted values: Must be specified in the form of X[Ki|Mi|Gi|Ti|Pi|Ei]

    • Default value: 1Gi

    • Required when: telemetry_sources > idrac > metrics_enabled is true

    telemetry_sources > ldms > metrics_enabled

    Mandatory

    • Type: Boolean

    • Enable or disable LDMS metrics collection from compute nodes

    • Collected metrics: CPU, memory, network, disk metrics

    • Data path: LDMS samplers → LDMS aggregator → store_avro_kafka → Kafka ‘ldms’ topic

    • Accepted values: true or false

    • Default value: true

    telemetry_sources > ldms > collection_targets

    Mandatory

    • LDMS only supports Kafka collection (no direct victoria_metrics path)

    • Vector-LDMS bridge consumes from Kafka and routes to victoria_metrics

    • Supported values: kafka

    • Default: [kafka]

    telemetry_sources > dcgm > metrics_enabled

    Optional

    • Type: Boolean

    • Enable or disable DCGM (NVIDIA Data Center GPU Manager) metrics collection

    • Collected metrics: GPU temperature, utilization, memory, ECC errors, power

    • Requires: NVIDIA GPU driver installed on compute nodes

    • Accepted values: true or false

    • Default value: true

    telemetry_sources > powerscale > metrics_enabled

    Optional

    • Type: Boolean

    • Enable or disable PowerScale metrics collection from Dell PowerScale (OneFS) storage

    • Collected metrics: Storage metrics from Dell PowerScale clusters

    • Requires: CSM Observability (Karavi) values file configured

    • Data path: CSM Metrics PowerScale → OTEL Collector → vmagent(shared) → victoria_metrics

    • Accepted values: true or false

    • Default value: true

    telemetry_sources > powerscale > logs_enabled

    Optional

    • Type: Boolean

    • Enable or disable PowerScale logs collection

    • Accepted values: true or false

    • Default value: true

    telemetry_sources > powerscale > collection_targets

    Conditional Mandatory

    • PowerScale uses dedicated vmagent(shared) (no Kafka, no Vector)

    • Supported values: victoria_metrics, victoria_logs

    • Default: [victoria_metrics, victoria_logs]

    telemetry_sources > ufm > metrics_enabled

    Optional

    • Type: Boolean

    • Enable or disable UFM (NVIDIA UFM InfiniBand Fabric Manager) metrics collection

    • Collected metrics: IB port state, transmit/receive data, error counters, fabric topology

    • Requires: NVIDIA UFM appliance with Prometheus exporter enabled

    • Data path: UFM Prometheus Exporter → vmagent(shared) → victoria_metrics

    • Accepted values: true or false

    • Default value: false

    telemetry_sources > ufm > logs_enabled

    Optional

    • Type: Boolean

    • Enable or disable UFM syslog logs collection

    • Accepted values: true or false

    • Default value: false

    telemetry_sources > ufm > collection_targets

    Conditional Mandatory

    • UFM uses vmagent(shared) for metrics and VLAgent for logs

    • Supported values: victoria_metrics, victoria_logs

    • Default: [victoria_metrics, victoria_logs]

    telemetry_sources > vast > metrics_enabled

    Optional

    • Type: Boolean

    • Enable or disable VAST (Data Storage) metrics collection

    • Collected metrics: IB port state, transmit/receive data, error counters, fabric topology

    • Requires: VAST appliance with Prometheus exporter enabled

    • Data path: Prometheus Exporter → vmagent(shared) → victoria_metrics

    • Accepted values: true or false

    • Default value: false

    telemetry_sources > vast > logs_enabled

    Optional

    • Type: Boolean

    • Enable or disable VAST syslog logs collection

    • Accepted values: true or false

    • Default value: false

    telemetry_sources > vast > collection_targets

    Conditional Mandatory

    • VAST uses vmagent(shared) for metrics and VLAgent for logs

    • Supported values: victoria_metrics, victoria_logs

    • Default: [victoria_metrics, victoria_logs]

    telemetry_bridges > vector_ldms > metrics_enabled

    Optional

    • Type: Boolean

    • Enable or disable Vector-LDMS bridge (Kafka-to-victoria_metrics bridge for LDMS metrics)

    • Purpose: Consume LDMS metrics from Kafka ‘ldms’ topic, transform NERSC schema to Prometheus format, and write to victoria_metrics

    • Data flow: Kafka ‘ldms’ topic → Vector-LDMS → vmagent-vector → victoria_metrics

    • Requires: telemetry_sources > ldms > metrics_enabled = true

    • Accepted values: true or false

    • Default value: true
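The dependency between the Vector-LDMS bridge and the LDMS source can be checked before deployment. The sketch below assumes that the YAML nesting in telemetry_config.yml mirrors the "a > b > c" parameter paths used in this table, and that the file has already been loaded into a Python dict; the helper names are illustrative.

```python
def get_path(config: dict, path: str, default=None):
    """Look up an 'a > b > c' style path in a nested dict."""
    node = config
    for key in (part.strip() for part in path.split(">")):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

def check_vector_ldms(config: dict) -> list[str]:
    """Report problems with the Vector-LDMS bridge prerequisites."""
    problems = []
    # Fallback values are the defaults documented in the table.
    bridge_on = get_path(config, "telemetry_bridges > vector_ldms > metrics_enabled", True)
    ldms_on = get_path(config, "telemetry_sources > ldms > metrics_enabled", True)
    targets = get_path(config, "telemetry_sources > ldms > collection_targets", ["kafka"])
    if bridge_on and not ldms_on:
        problems.append("vector_ldms is enabled but ldms metrics_enabled is false")
    if ldms_on and targets != ["kafka"]:
        problems.append("ldms collection_targets supports only [kafka]")
    return problems

config = {
    "telemetry_sources": {"ldms": {"metrics_enabled": False, "collection_targets": ["kafka"]}},
    "telemetry_bridges": {"vector_ldms": {"metrics_enabled": True}},
}
print(check_vector_ldms(config))
# ['vector_ldms is enabled but ldms metrics_enabled is false']
```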

    telemetry_bridges > vector_ome > metrics_enabled

    Optional

    • Type: Boolean

    • Enable or disable Vector-OME metrics routing (Kafka-to-Victoria bridge for OME metrics)

    • Data flow: Kafka ‘ome.*’ topics → Vector-OME → vmagent-vector (metrics)

    • Requires: OME configured to publish to Kafka

    • Accepted values: true or false

    • Default value: true

    telemetry_bridges > vector_ome > logs_enabled

    Optional

    • Type: Boolean

    • Enable or disable Vector-OME logs routing

    • Data flow: Kafka ‘ome.*’ topics → Vector-OME → vlagent-vector (logs)

    • Accepted values: true or false

    • Default value: true

    telemetry_bridges > vector_ome > ome_identifier

    Optional

    • Identifier used by Vector-OME for topic identification and routing

    • Internally used to match topics with the prefix (e.g., ^ome\\..*$)

    • Type: String

    • minLength: 1

    • Default value: ome

    • Note: Change only if your OME Kafka topics use a different prefix

    telemetry_sinks > victoria_metrics > persistence_size

    Conditional Mandatory

    • Storage per vmstorage pod PVC

    • Important: Total VictoriaMetrics storage depends on deployment mode:
      • Single-node mode: Total storage = persistence_size * 1 pod

      • Cluster mode: Total storage = persistence_size * 3 vmstorage pods

    • Example (cluster): 8Gi * 3 = 24Gi total VictoriaMetrics storage

    • Accepted values: Must be specified in the form of X[Ki|Mi|Gi|Ti|Pi|Ei]

    • Default value: 8Gi (results in 24Gi total storage for cluster mode)
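The storage arithmetic above can be reproduced with a small Python sketch. It assumes X is a whole number, which is stricter than the full Kubernetes quantity syntax; the function names are illustrative.

```python
# Binary suffixes accepted by the persistence_size parameter.
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60}

def parse_size(text: str) -> int:
    """Convert a value such as '8Gi' into bytes (assumes an integer X)."""
    number, suffix = text[:-2], text[-2:]
    if suffix not in UNITS or not number.isdigit():
        raise ValueError(f"expected X[Ki|Mi|Gi|Ti|Pi|Ei], got {text!r}")
    return int(number) * UNITS[suffix]

def total_storage_gib(persistence_size: str, cluster_mode: bool) -> float:
    """Total VictoriaMetrics storage in GiB: 3 vmstorage pods in cluster mode, else 1."""
    pods = 3 if cluster_mode else 1
    return parse_size(persistence_size) * pods / UNITS["Gi"]

print(total_storage_gib("8Gi", cluster_mode=True))   # 24.0
print(total_storage_gib("8Gi", cluster_mode=False))  # 8.0
```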

    telemetry_sinks > victoria_metrics > retention_period

    Conditional Mandatory

    • Metric retention period in hours

    • Default: 168 (7 days)

    telemetry_sinks > victoria_metrics > additional_metric_remote_write_endpoints

    Optional

    • Additional remote write endpoints for metrics (optional)

    • Purpose: Send metrics to external VictoriaMetrics instances in addition to Omnia-managed VictoriaMetrics

    • Format: List of endpoint objects with ‘url’ field (must start with http:// or https://)

    • TLS: Set ‘tls_insecure_skip_verify: true’ to skip TLS certificate verification

    • Default: [] (empty — only Omnia VictoriaMetrics receives metrics)

    • Example: - url: https://external-metrics-server:8480/insert/0/prometheus/api/v1/write

      tls_insecure_skip_verify: false

    telemetry_sinks > victoria_logs > storage_size

    Conditional Mandatory

    • Storage per vlstorage pod PVC

    • Total storage = storage_size × 3 vlstorage pods

    • Accepted values: Must be specified in the form of X[Ki|Mi|Gi|Ti|Pi|Ei]

    • Default value: 8Gi (results in 24Gi total storage)

    • Sizing formula: (140 MB/day × retention_days × node_count) / 3 replicas

    Warning

    Storage under-provisioning can lead to data loss before the retention period is reached. Calculate storage requirements based on expected log volume and retention needs.
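As a worked example of the sizing formula, the Python sketch below derives a per-pod storage_size from the retention period and node count. It reads the 140 MB/day figure as a per-node rate and treats MB as MiB when rounding up to whole Gi, which slightly overestimates; both readings are assumptions, and the function name is illustrative.

```python
import math

def victoria_logs_storage_size(retention_hours: int, node_count: int,
                               mb_per_node_per_day: float = 140.0,
                               replicas: int = 3) -> str:
    """Per-pod storage_size from the sizing formula, rounded up to whole Gi."""
    retention_days = retention_hours / 24  # retention_period is given in hours
    total_mb = mb_per_node_per_day * retention_days * node_count
    per_pod_mb = total_mb / replicas
    # Assumption: 1024 MB per Gi, which errs on the side of more storage.
    per_pod_gi = math.ceil(per_pod_mb / 1024)
    return f"{max(per_pod_gi, 1)}Gi"

print(victoria_logs_storage_size(retention_hours=168, node_count=100))  # 32Gi
print(victoria_logs_storage_size(retention_hours=168, node_count=10))   # 4Gi
```

With the default 7-day retention, the default 8Gi per pod covers roughly 25 nodes under these assumptions; larger clusters need a larger storage_size.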

    telemetry_sinks > victoria_logs > retention_period

    Conditional Mandatory

    • Log retention period in hours

    • Type: Integer (hours)

    • Accepted values: 24-8760 (1 day to 1 year)

    • Default: 168 (7 days)

    • Note: Retention is global and applies to all log streams uniformly. Deletion occurs asynchronously during background merge operations.

    Note

    VictoriaLogs does not return an error when log entries with timestamps outside the configured retention window are submitted. Log entries will be automatically removed from VictoriaLogs after the retention period.

    telemetry_sinks > victoria_logs > additional_log_write_endpoints

    Optional

    • Additional remote write endpoints for logs (optional)

    • Purpose: Send logs to external VictoriaLogs instances in addition to Omnia-managed VictoriaLogs

    • Format: List of endpoint objects with ‘url’ field (must start with http:// or https://)

    • TLS: Set ‘tls_insecure_skip_verify: true’ to skip TLS certificate verification

    • Default: [] (empty — only Omnia VictoriaLogs receives logs)

    • Example: - url: https://external-logs-server:9481/internal/insert

      tls_insecure_skip_verify: false

    telemetry_sinks > kafka > persistence_size

    Conditional Mandatory

    • Storage per Kafka pod PVC

    • Total = persistence_size × 6 pods (3 brokers + 3 controllers)

    • Accepted values: Must be specified in the form of X[Ki|Mi|Gi|Ti|Pi|Ei]

    • Default value: 8Gi (results in 48Gi total storage)

    • The default 8Gi size is suitable for small clusters (typically fewer than 5 nodes). For larger clusters, you should increase the persistence_size and adjust log_retention_hours and log_retention_bytes based on expected data volume and cluster size.

    Caution

    Ensure that the Kafka broker settings persistence_size, log_retention_hours, and log_retention_bytes are configured based on your data retention requirements. If the persistent volume reaches its capacity before logs are deleted according to the log retention period configured, Kafka brokers may run out of disk space. For more details on managing Kafka log retention and cleanup policies, see Managing Kafka logs with delete and compact policies.

    telemetry_sinks > kafka > log_retention_hours

    Conditional Mandatory

    • Log retention period in hours

    • Default: 168 (7 days)

    telemetry_sinks > kafka > log_retention_bytes

    Conditional Mandatory

    • Maximum size of Kafka logs (in bytes) before deletion

    • Default: -1 (unlimited)

    telemetry_sinks > kafka > log_segment_bytes

    Conditional Mandatory

    • Maximum size of Kafka log segments (in bytes)

    • Default: 1073741824 (1 GB)

    telemetry_sinks > kafka > topic_partitions

    Conditional Mandatory

    • Topic partitions per source (object format, not array)

    • Format: {topic_name: partition_count}

    • Required when: Source has kafka in collection_targets

    • Allowed topics: idrac, ldms only

    • Default partition counts: idrac=1, ldms=2

    • Example: {idrac: 1, ldms: 2}
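The topic_partitions rules (object format, only idrac and ldms allowed, documented defaults) can be expressed as a short validation sketch in Python. The function name and the way sources are passed in are illustrative assumptions.

```python
from typing import Optional

ALLOWED_TOPICS = {"idrac", "ldms"}
DEFAULT_PARTITIONS = {"idrac": 1, "ldms": 2}  # defaults from the table

def resolve_topic_partitions(kafka_sources: set[str],
                             configured: Optional[dict] = None) -> dict[str, int]:
    """Merge configured partition counts with defaults for sources that use Kafka."""
    configured = configured or {}
    if not isinstance(configured, dict):
        raise TypeError("topic_partitions must be an object, not an array")
    unknown = set(configured) - ALLOWED_TOPICS
    if unknown:
        raise ValueError(f"unsupported topics: {sorted(unknown)}")
    result = {}
    for topic in sorted(kafka_sources & ALLOWED_TOPICS):
        count = configured.get(topic, DEFAULT_PARTITIONS[topic])
        if not isinstance(count, int) or count < 1:
            raise ValueError(f"partition count for {topic} must be a positive integer")
        result[topic] = count
    return result

print(resolve_topic_partitions({"idrac", "ldms"}))     # {'idrac': 1, 'ldms': 2}
print(resolve_topic_partitions({"ldms"}, {"ldms": 4}))  # {'ldms': 4}
```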

    ldms_configurations > agg_port

    Conditional Mandatory

    • Aggregator port on service K8s cluster

    • Valid range: 6001-6100

    • Default: 6001

    ldms_configurations > store_port

    Conditional Mandatory

    • Store daemon port on service K8s cluster

    • The port can be the same as LDMS aggregator port

    • Valid range: 6001-6100

    • Default: 6001

    ldms_configurations > sampler_port

    Conditional Mandatory

    • Sampler port on compute nodes

    • Valid range: 10001-10100

    • Default: 10001
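The documented port ranges lend themselves to a simple range check. This Python sketch flags any of the three LDMS ports that falls outside its valid range; the dict layout mirrors the ldms_configurations keys, and the function name is illustrative.

```python
# Valid ranges from the table above.
PORT_RANGES = {
    "agg_port": (6001, 6100),
    "store_port": (6001, 6100),
    "sampler_port": (10001, 10100),
}

def invalid_ports(ldms_configurations: dict) -> dict[str, int]:
    """Return the configured ports that fall outside their documented range."""
    bad = {}
    for name, (low, high) in PORT_RANGES.items():
        port = ldms_configurations.get(name)
        if port is not None and not low <= port <= high:
            bad[name] = port
    return bad

print(invalid_ports({"agg_port": 6001, "store_port": 6001, "sampler_port": 6002}))
# {'sampler_port': 6002}
```

Note that agg_port and store_port sharing 6001 is accepted, consistent with the table.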

    ldms_configurations > sampler_plugins

    Mandatory

    • Sampler plugins define which metrics to collect from compute nodes

    • Parameters:
      • plugin_name: Name of the LDMS sampler plugin

      • config_parameters: Plugin-specific configuration (as a single string)

      • activation_parameters: Collection schedule in microseconds

        Format: interval=<microseconds> offset=<microseconds>

        Example: interval=30000000 (30 seconds)

    • Default plugins:
      • meminfo: Memory usage statistics (free, used, buffers, cached)

      • procstat2: Process statistics (CPU, memory, I/O per process)

      • vmstat: Virtual memory statistics (paging, swapping, memory pressure)

      • loadavg: System load average (1, 5, and 15 minute averages)

      • procnetdev2: Network interface statistics (bytes, packets, errors, drops per interface)

    • Default activation_parameters: interval=30000000 (30 seconds) for all plugins; procnetdev2 additionally sets offset=0
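Because activation_parameters is expressed in microseconds, it is easy to be off by a factor of 1,000. The following Python sketch converts seconds into the documented string format. The helper name is hypothetical, and config_parameters is left out because its content is plugin-specific.

```python
def activation_parameters(interval_seconds: float, offset_seconds=None) -> str:
    """Build an activation_parameters string with values in microseconds."""
    parts = [f"interval={round(interval_seconds * 1_000_000)}"]
    if offset_seconds is not None:
        parts.append(f"offset={round(offset_seconds * 1_000_000)}")
    return " ".join(parts)

# Entries for the default plugins, using the documented 30-second interval.
plugins = [
    {
        "plugin_name": name,
        "activation_parameters": activation_parameters(
            30, 0 if name == "procnetdev2" else None
        ),
    }
    for name in ("meminfo", "procstat2", "vmstat", "loadavg", "procnetdev2")
]

print(activation_parameters(30))     # interval=30000000
print(activation_parameters(30, 0))  # interval=30000000 offset=0
```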

    powerscale_configurations > otel_collector_storage_size

    Conditional Mandatory

    • PVC size for OTEL Collector metric batching and buffering

    • Accepted values: Must be specified in the form of X[Ki|Mi|Gi|Ti|Pi|Ei]

    • Default value: 5Gi

    powerscale_configurations > csm_observability_values_file_path

    Conditional Mandatory

    Note

    In the values.yaml file, only set karaviMetricsPowerscale -> enabled: true. Set the following parameters to false: karaviMetricsPowerflex -> enabled=false, karaviMetricsPowerstore -> enabled=false, karaviMetricsPowerscaleauthorization -> enabled=false, karaviMetricsPowermax -> enabled=false.

    Note

    Update the isiAuthType in the values.yaml file based on the current auth type setting. To check the current auth type setting, use the command isi http settings view.

    Note

    For CSI PowerScale health metrics, enable controller > healthMonitor > enabled: true and node > healthMonitor > enabled: true in the CSI PowerScale values.yaml (https://raw.githubusercontent.com/dell/helm-charts/csi-isilon-2.15.0/charts/csi-isilon/values.yaml).

    ufm_configuration > ufm_endpoint

    Conditional Mandatory

    • UFM appliance IP address or hostname

    • Required when: telemetry_sources > ufm > metrics_enabled is true

    • Example: 172.20.44.180 or ufm.example.com

    • Default value: "" (empty)

    ufm_configuration > ufm_metrics_port

    Optional

    • UFM Prometheus exporter port

    • Default value: 9001 (UFM default Prometheus port)

    ufm_configuration > scrape_interval

    Optional

    • Prometheus scrape interval for UFM metrics

    • Accepted values: Prometheus duration format (e.g., 15s, 30s, 1m)

    • Default value: 30s

    ufm_configuration > scrape_timeout

    Optional

    • Prometheus scrape timeout (must be <= scrape_interval)

    • Accepted values: Prometheus duration format (e.g., 10s, 15s)

    • Default value: 15s
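The constraint that scrape_timeout must not exceed scrape_interval can be verified with a small Python sketch. It handles only single-unit durations such as 15s or 1m, a simplification of the full Prometheus duration syntax; the function names are illustrative.

```python
import re

_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}

def duration_seconds(text: str) -> float:
    """Parse a single-unit duration such as '15s' or '1m' into seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]

def timeout_is_valid(scrape_interval: str, scrape_timeout: str) -> bool:
    """True when scrape_timeout <= scrape_interval."""
    return duration_seconds(scrape_timeout) <= duration_seconds(scrape_interval)

print(timeout_is_valid("30s", "15s"))  # True
print(timeout_is_valid("15s", "1m"))   # False
```

The same check applies to the scrape_interval and scrape_timeout parameters under vast_configuration.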

    ufm_configuration > tls_mode

    Optional

    • TLS mode for connecting to UFM Prometheus endpoint

    • Accepted values: self_signed, ca_signed

    • self_signed: Skip TLS verification (insecure_skip_verify=true)

    • ca_signed: Use CA certificate for TLS verification

    • Default value: self_signed

    ufm_configuration > ufm_ca_cert_path

    Optional

    • Path to CA certificate file for UFM TLS verification

    • Required when: tls_mode is ca_signed

    • Must be a valid PEM-format certificate file

    • Default value: "" (empty; not used when tls_mode is self_signed)

    ufm_configuration > auth_mode

    Optional

    • Authentication mode for UFM Prometheus endpoint

    • Accepted values: basic, none

    • basic: Use ufm_username/ufm_password from omnia_config_credentials.yml

    • none: No authentication (UFM endpoint is open)

    • Default value: basic

    vast_configuration > vast_endpoint

    Conditional Mandatory

    • VAST appliance IP address or hostname

    • Required when: telemetry_sources > vast > metrics_enabled is true

    • Example: 172.20.44.180 or vast.example.com

    • Default value: "" (empty)

    vast_configuration > vast_metrics_port

    Optional

    • VAST Prometheus exporter port

    • Default value: 9001 (VAST default Prometheus port)

    vast_configuration > scrape_interval

    Optional

    • Prometheus scrape interval for VAST metrics

    • Accepted values: Prometheus duration format (e.g., 15s, 30s, 1m)

    • Default value: 30s

    vast_configuration > scrape_timeout

    Optional

    • Prometheus scrape timeout (must be <= scrape_interval)

    • Accepted values: Prometheus duration format (e.g., 10s, 15s)

    • Default value: 15s

    vast_configuration > tls_mode

    Optional

    • TLS mode for connecting to VAST Prometheus endpoint

    • Accepted values: self_signed, ca_signed

    • self_signed: Skip TLS verification (insecure_skip_verify=true)

    • ca_signed: Use CA certificate for TLS verification

    • Default value: self_signed

    vast_configuration > vast_ca_cert_path

    Optional

    • Path to CA certificate file for VAST TLS verification

    • Required when: tls_mode is ca_signed

    • Must be a valid PEM-format certificate file

    • Default value: "" (empty; not used when tls_mode is self_signed)

    vast_configuration > auth_mode

    Optional

    • Authentication mode for VAST Prometheus endpoint

    • Accepted values: basic, none

    • basic: Use vast_username/vast_password from omnia_config_credentials.yml

    • none: No authentication (VAST endpoint is open)

    • Default value: basic
