Integrate NVIDIA Unified Fabric Manager (UFM) with Omnia Telemetry for Secure Metrics and Logs Streaming

This section describes how to configure NVIDIA Unified Fabric Manager (UFM) to securely stream telemetry metrics and logs to the Service Kubernetes cluster.

This procedure assumes that VictoriaMetrics is deployed in cluster mode in the telemetry namespace of the Service Kubernetes cluster. For more information, see the VictoriaMetrics cluster mode documentation.

Using Omnia, you can deploy UFM Telemetry to collect InfiniBand fabric performance metrics and syslog events from NVIDIA Unified Fabric Manager (UFM) appliances. The deployment includes configuring the UFM Telemetry components on the service Kubernetes cluster, integrating with the UFM Prometheus exporter via vmagent, and ingesting syslog events via the VLAgent pipeline.

UFM Telemetry includes the following components:

UFM Prometheus Exporter: Exposes InfiniBand fabric metrics on a Prometheus-compatible HTTPS endpoint (default port 9001).
vmagent (shared): Scrapes the UFM Prometheus exporter endpoint over TLS and forwards metrics to VictoriaMetrics.
VMServiceScrape CR: Kubernetes custom resource that declares the UFM scrape target for the VictoriaMetrics operator.
VLAgent: Receives UFM syslog events (RFC 3164/5424) and forwards them to VictoriaLogs.
Kubernetes Service + Endpoints: Abstracts the external UFM appliance as a discoverable Kubernetes service for vmagent.
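
To illustrate how the last three components fit together, the following sketch writes a manifest that pairs a selector-less Service with an Endpoints object and a VMServiceScrape. Omnia creates the actual resources during deployment, so this is only an illustration of the pattern. The resource name `ufm-exporter`, the `app` label, the example address, and the `insecureSkipVerify` setting are assumptions and are not taken from the Omnia deployment.

```shell
#!/usr/bin/env bash
# Hypothetical values: replace with your UFM appliance address and file name.
UFM_IP="${UFM_IP:-192.0.2.10}"                 # documentation-only example address
MANIFEST="${MANIFEST:-ufm-scrape-sketch.yaml}"

cat > "$MANIFEST" <<EOF
# Selector-less Service: gives the external UFM appliance a cluster-internal name.
apiVersion: v1
kind: Service
metadata:
  name: ufm-exporter            # hypothetical name
  namespace: telemetry
  labels:
    app: ufm-exporter
spec:
  ports:
    - name: metrics
      port: 9001
      targetPort: 9001
---
# Endpoints with the same name point the Service at the appliance IP.
apiVersion: v1
kind: Endpoints
metadata:
  name: ufm-exporter
  namespace: telemetry
subsets:
  - addresses:
      - ip: ${UFM_IP}
    ports:
      - name: metrics
        port: 9001
---
# VMServiceScrape asks the VictoriaMetrics operator to scrape that Service over HTTPS.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  name: ufm-exporter
  namespace: telemetry
spec:
  selector:
    matchLabels:
      app: ufm-exporter
  endpoints:
    - port: metrics
      scheme: https
      path: /metrics
      tlsConfig:
        insecureSkipVerify: true   # assumption: self-signed UFM certificate
EOF

echo "Wrote $MANIFEST; review it, then apply with: kubectl apply -f $MANIFEST"
```

The script only writes the file. Applying it is left as a manual step so that the manifest can be compared with the resources Omnia has already created.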

Supported Metrics and Logs

Metrics:

The UFM metrics include the following:

Port State: InfiniBand port operational state (up, down, disabled)
Traffic Counters: Transmit/receive data rates (bytes/sec), packet counts per port
Error Counters: Symbol errors, link error recovery, link downed, VL15 dropped, excessive buffer overrun errors
Fabric Topology: Switch information, port mapping, node GUIDs, LID assignments
Telemetry Health: Scrape success rate, scrape duration, ingest latency

Logs:

The UFM logs include the following:

Fabric topology change events, port state transitions, error/warning messages
SM (Subnet Manager) events, SHARP events, UFM health events
Events are labeled with hostname, severity, and facility

Prerequisites

Make sure the following prerequisites are met:

Ensure that the provision.yml playbook has been executed successfully with service_kube_control_plane and service_kube_node in the mapping file.
Ensure the service Kubernetes cluster has sufficient resources to run vmagent (shared instance) and VLAgent.
Ensure network connectivity between the service Kubernetes cluster and the NVIDIA UFM appliance.
Ensure that the UFM Telemetry entries in telemetry_config.yml are enabled. For more details on configuring telemetry_config.yml, see Service Cluster Telemetry Configuration.

Note

UFM Telemetry provides separate feature flags for metric collection and log collection, so you can enable or disable each one independently. For more details, see Configure Deployment Required for iDRAC Telemetry Service.

For UFM metrics collection, configure the following settings on the UFM appliance:

The UFM Prometheus exporter must be enabled and accessible from the service cluster.

Enable UFM Telemetry: Ensure UFM Telemetry is enabled in the gv.cfg configuration file:
```
[Telemetry]
telemetry_provider = telemetry
```
Verify Prometheus endpoint: Confirm that the UFM Prometheus exporter is accessible at https://<ufm_ip>:9001/metrics.
Configure SSL certificates (optional): If using CA-signed TLS, set up SSL and CA certificates in UFM. For detailed steps, see Setting Up SSL and CA Certificates in UFM - Optional Configurations.
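
The endpoint check above can be scripted from any host that has network access to the appliance. The following is a minimal sketch, assuming `curl` is installed and that the `UFM_IP` environment variable (a name chosen for this example) holds the appliance address. It prints the number of metric samples the exporter returns.

```shell
# Count non-comment, non-empty sample lines in Prometheus exposition text read from stdin.
count_samples() {
  grep -v '^#' | grep -c '[^[:space:]]'
}

# Runs only when UFM_IP is set, for example: UFM_IP=192.0.2.10 (hypothetical address).
if [ -n "${UFM_IP:-}" ]; then
  # -k skips certificate verification; drop it when UFM uses a CA-signed certificate.
  curl -sk --max-time 10 "https://${UFM_IP}:9001/metrics" | count_samples
fi
```

A count of 0 suggests that the exporter is unreachable or not yet producing data.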

For UFM log collection, configure the following settings on the UFM appliance:

Enable syslog forwarding using the UFM Web UI:
In the UFM Web UI, configure syslog forwarding to the VLAgent LoadBalancer IP:
1. From the left navigation menu, select Settings → Data Streaming
2. Select System log and complete the fields:
  - Destination: Enter the VLAgent LoadBalancer IP address
  - Syslog Port: Enter 514 (default)
  - System logs Level: Select syslog level from the dropdown based on your requirements
  - Streaming Data: Select UFM logs
3. Click Save
Enable syslog forwarding using the UFM CLI:
Using the UFM CLI, edit the [Logging] section in /opt/ufm/conf/gv.cfg to send syslog messages to the VLAgent LoadBalancer IP:
[Logging]
syslog = true
syslog_addr = <external vlagent loadbalancer IP>:514
ufm_syslog = true
event_syslog = true
syslog_level = WARNING
For detailed information on UFM syslog configuration parameters, see NVIDIA UFM Enterprise User Manual - Configuring Syslog.
Set VLAgent LoadBalancer IP: Configure the VLAgent LoadBalancer IP address for syslog delivery. Retrieve the VLAgent external IP from the service Kubernetes cluster:
```
kubectl get svc -n telemetry | grep vlagent
```
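
As a convenience, the lookup and the [Logging] section can be combined. The sketch below assumes the VLAgent service name contains `vlagent` and is of type LoadBalancer. The helper names `lb_external_ip` and `render_logging_section` are made up for this example. The script prints the section for review and does not modify gv.cfg.

```shell
# Print the EXTERNAL-IP column of the first LoadBalancer service whose name
# matches the pattern in $1. Reads `kubectl get svc` table output from stdin.
lb_external_ip() {
  awk -v pat="$1" 'NR > 1 && $1 ~ pat && $2 == "LoadBalancer" { print $4; exit }'
}

# Render the [Logging] section for gv.cfg using the address in $1.
render_logging_section() {
  printf '[Logging]\nsyslog = true\nsyslog_addr = %s:514\n' "$1"
  printf 'ufm_syslog = true\nevent_syslog = true\nsyslog_level = WARNING\n'
}

# Runs only where kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  VLAGENT_IP=$(kubectl get svc -n telemetry | lb_external_ip vlagent)
  if [ -n "$VLAGENT_IP" ]; then
    render_logging_section "$VLAGENT_IP"
  fi
fi
```

If the EXTERNAL-IP column shows `<pending>`, the load balancer has not assigned an address yet and the printed value should not be used.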

Verify UFM Telemetry Flow

This section outlines the steps to verify UFM telemetry data in VictoriaMetrics and UFM logs in VictoriaLogs.

View Collected UFM Telemetry Data using VictoriaMetrics UI (VMUI) - Cluster Mode Deployment

After applying the telemetry.yml configuration with the VictoriaMetrics deployment mode set to cluster, use the VictoriaMetrics UI (VMUI) to validate that UFM telemetry data is being collected and stored. For more details, see VictoriaMetrics Cluster deployment documentation.

Run the following command to verify that the VictoriaMetrics pods are running:
```
kubectl get pods -n telemetry -o wide | grep vm
```

../../../../_images/verify_umf_telemetry_1.png

Run the following command to verify that the VictoriaMetrics service is running:
```
kubectl get service -n telemetry -o wide | grep vm
```

../../../../_images/verify_umf_telemetry_2.png

Run the following commands to view recent vmagent logs and verify that UFM scraping is working:

VMAGENT_POD=$(kubectl get pods -n telemetry -l app.kubernetes.io/name=vmagent -o jsonpath='{.items[0].metadata.name}')
kubectl logs $VMAGENT_POD -n telemetry -c vmagent --tail=50

../../../../_images/verify_umf_telemetry_3.png

Note the external IP and port number of the vmselect service. You need them to access the VictoriaMetrics UI (VMUI):
```
kubectl get svc -n telemetry | grep vmselect
```

../../../../_images/verify_umf_telemetry_4.png

Access the VMUI in a web browser using:

``https://<external vmselect loadbalancer IP>:8481/select/vmui``

../../../../_images/verify_umf_telemetry_5.png

Filter and view UFM InfiniBand metrics using queries in VMUI. For example, the following query displays UFM InfiniBand metrics:

{source="ufm", subsystem="infiniband"}
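
The same query can also be run from a shell through the Prometheus-compatible query API that vmselect exposes in cluster mode. This is a minimal sketch, assuming tenant (accountID) 0, a certificate that requires `-k`, and a `VMSELECT_IP` environment variable. The variable name and the `vm_query_url` helper are chosen for this example.

```shell
# Build the instant-query URL for vmselect in cluster mode.
# $1 = vmselect LoadBalancer IP, $2 = tenant (accountID); tenant 0 is an assumption.
vm_query_url() {
  printf 'https://%s:8481/select/%s/prometheus/api/v1/query' "$1" "${2:-0}"
}

# Runs only when VMSELECT_IP is set.
if [ -n "${VMSELECT_IP:-}" ]; then
  # count(...) returns how many UFM InfiniBand series currently exist.
  curl -sk --max-time 10 "$(vm_query_url "$VMSELECT_IP")" \
    --data-urlencode 'query=count({source="ufm", subsystem="infiniband"})'
fi
```

An empty `result` array in the JSON response means that no matching series have been ingested yet.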

Key UFM Metrics

Key UFM InfiniBand Metrics
Metric Name	Description	Unit
ib_port_state	InfiniBand port operational state (1=down, 2=init, 3=armed, 4=active)	state
ib_port_xmit_data	Transmitted data per port	bytes
ib_port_rcv_data	Received data per port	bytes
ib_port_xmit_pkts	Transmitted packets per port	packets
ib_port_rcv_pkts	Received packets per port	packets
ib_port_symbol_error	Symbol error count per port	count
ib_port_link_error_recovery	Link error recovery count per port	count
ib_port_link_downed	Link downed count per port	count
ib_port_rcv_errors	Received errors per port	count
ib_port_vl15_dropped	VL15 dropped packets per port	packets
ib_excessive_buffer_overrun_errors	Excessive buffer overrun error count per port	count
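
The metrics in the table can be combined into a few health checks. The sketch below prints example MetricsQL expressions that can be pasted into VMUI, and sends them to vmselect when `VMSELECT_IP` is set. It assumes that the data, packet, and error metrics are cumulative counters, which the table does not state, and that tenant 0 is in use. The one-hour and five-minute windows are arbitrary choices.

```shell
# Print illustrative MetricsQL expressions, one per line.
ufm_example_queries() {
  cat <<'EOF'
ib_port_state != 4
rate(ib_port_xmit_data[5m])
rate(ib_port_rcv_data[5m])
increase(ib_port_symbol_error[1h]) > 0
increase(ib_port_link_downed[1h]) > 0
EOF
}

ufm_example_queries

# Optionally run each expression against vmselect (tenant 0 assumed).
if [ -n "${VMSELECT_IP:-}" ]; then
  ufm_example_queries | while IFS= read -r q; do
    echo "== $q"
    curl -sk --max-time 10 \
      "https://${VMSELECT_IP}:8481/select/0/prometheus/api/v1/query" \
      --data-urlencode "query=${q}"
    echo
  done
fi
```

The first expression lists ports that are not in the active state. The last two list ports whose error counters grew during the past hour.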

View UFM Logs using VictoriaLogs

Configure the VLAgent LoadBalancer IP address for syslog delivery. Retrieve the VLAgent LoadBalancer IP and configure it on the UFM appliance by following the steps outlined in the prerequisites section above:
```
kubectl get svc -n telemetry | grep -E "(vlagent|victoria-logs)"
```

../../../../_images/view_umf_telemetry_1.png

Retrieve the external IP and port of the vlselect service:
```
kubectl get svc -n telemetry | grep vlselect
```

../../../../_images/view_umf_telemetry_2.png

Access the VictoriaLogs web UI in a web browser using:

``https://<external vlselect loadbalancer IP>:9471/select/0/vmui``

../../../../_images/view_umf_telemetry_3.png
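
Recent log entries can also be fetched from a shell through the VictoriaLogs LogsQL query endpoint. This is a minimal sketch, assuming the vlselect service accepts HTTPS on port 9471 as in the URL above, and that `VLSELECT_IP` (a variable name chosen for this example) holds its LoadBalancer IP. The 15-minute window and the limit of 10 entries are arbitrary.

```shell
# Build the LogsQL query URL for the vlselect service. $1 = vlselect LoadBalancer IP.
vl_query_url() {
  printf 'https://%s:9471/select/logsql/query' "$1"
}

# Runs only when VLSELECT_IP is set.
if [ -n "${VLSELECT_IP:-}" ]; then
  # _time:15m selects entries from the last 15 minutes; limit caps the result size.
  curl -sk --max-time 10 "$(vl_query_url "$VLSELECT_IP")" \
    --data-urlencode 'query=_time:15m' \
    --data-urlencode 'limit=10'
fi
```

If no entries are returned, confirm that syslog forwarding on the UFM appliance points to the VLAgent LoadBalancer IP.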
