Telemetry
==========

⦾ **When a Kubernetes worker node fails, affected telemetry services may take time to fail over to available worker nodes.**

**Resolution**: No manual intervention is required. Wait for the telemetry services to recover and fail over automatically. Do not restart pods or nodes during this period, as doing so may extend recovery time.
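
To follow the recovery without intervening, a read-only check such as the sketch below can list the pods that are not yet ready. The ``telemetry`` namespace matches the scripts later in this section; the function name ``pods_not_ready`` is hypothetical, and pods that ran to completion are also reported because their READY count stays below the total.

```shell
#!/bin/bash
# Sketch: report telemetry pods that are not yet Running and fully ready.
# pods_not_ready is a hypothetical helper name, not part of Omnia.
NAMESPACE="${NAMESPACE:-telemetry}"

# Reads `kubectl get pods --no-headers` output on stdin and prints the name
# and status of every pod whose STATUS is not Running or whose READY count
# (for example 0/2) is incomplete.
pods_not_ready() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3
  }'
}

# Usage (read-only; nothing is restarted):
#   kubectl get pods -n "$NAMESPACE" --no-headers | pods_not_ready
```

An empty result suggests that every pod in the namespace is running and ready.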


⦾ **Telemetry Pods Enter CrashLoopBackOff After Worker Node Reboot - Script-Based Resolution**

**Potential Cause**: In Omnia deployments that use PowerScale as NFS-backed persistent
storage, telemetry pods (Kafka and iDRAC/MySQL) may enter a **CrashLoopBackOff** state
following an abrupt worker node reboot or network interruption. During normal operation,
Kafka and MySQL write lock files (``.lock``, ``.pid``, ``.sock``) to their persistent
volumes to prevent concurrent access. When a pod terminates unexpectedly, these lock files
are not released. Because PowerScale operates as an external, highly available NFSv3
server, it retains the lock state across client failures. When the pod restarts, it cannot
acquire the existing locks and fails to initialize, resulting in a crash loop.

**Resolution**: Use the following scripts to automate lock cleanup and data corruption recovery. The iDRAC lock cleanup script first checks the pod logs for data corruption; if it finds any, it stops and directs you to the data corruption recovery script.
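
As a rough aid for deciding which script applies, the sketch below classifies log text read from standard input. The corruption patterns and the ``Unable to lock`` pattern are taken from the ``grep`` expressions in ``idrac_lock_cleanup.sh``; the ``failed to acquire lock`` pattern and the function name ``classify_failure`` are assumptions added for illustration.

```shell
#!/bin/bash
# Sketch: classify pod log text as "corruption", "lock", or "unknown".
# classify_failure is a hypothetical helper; patterns are partly assumed.
classify_failure() {
  logs=$(cat)
  if printf '%s\n' "$logs" | grep -qiE 'trying to read page|corruption in the InnoDB tablespace|innodb_force_recovery'; then
    echo "corruption"   # candidate for idrac_data_corruption_recovery.sh
  elif printf '%s\n' "$logs" | grep -qiE 'unable to lock|failed to acquire lock'; then
    echo "lock"         # candidate for one of the lock cleanup scripts
  else
    echo "unknown"      # inspect the logs manually
  fi
}

# Usage: read the logs of the previous (crashed) container instance.
#   kubectl logs "$POD" -n telemetry --previous --tail=50 | classify_failure
```

Corruption is tested first because, as the iDRAC script notes, lock cleanup does not fix a corrupted tablespace.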

**Usage Instructions**

1. Save each script to a file with the corresponding name (for example, ``kafka_lock_cleanup.sh``, ``idrac_lock_cleanup.sh``, or ``idrac_data_corruption_recovery.sh``).

2. Make the scripts executable::

    chmod +x kafka_lock_cleanup.sh idrac_lock_cleanup.sh idrac_data_corruption_recovery.sh

3. Run the scripts in the following order:

   **Important**: If both Kafka and iDRAC scripts need to be executed, run the Kafka script first and wait for 1 minute before executing the iDRAC script.

   - For Kafka lock issues: ``./kafka_lock_cleanup.sh``
   - For iDRAC lock issues: ``./idrac_lock_cleanup.sh``
   - For iDRAC data corruption: ``./idrac_data_corruption_recovery.sh``
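
When both lock cleanup scripts are needed, the ordering above can be sketched as a small wrapper. ``SCRIPT_DIR`` and ``WAIT_SECONDS`` are hypothetical variables introduced here; the 60-second default reflects the one-minute wait described in the note.

```shell
#!/bin/bash
# Sketch: run the Kafka cleanup, pause, then run the iDRAC cleanup.
# SCRIPT_DIR and WAIT_SECONDS are hypothetical knobs for this example.
SCRIPT_DIR="${SCRIPT_DIR:-.}"
WAIT_SECONDS="${WAIT_SECONDS:-60}"

run_cleanup_in_order() {
  # Stop here if the Kafka cleanup fails, so the iDRAC step is not run blindly.
  "$SCRIPT_DIR/kafka_lock_cleanup.sh" || return 1
  echo "Waiting ${WAIT_SECONDS}s before the iDRAC cleanup..."
  sleep "$WAIT_SECONDS"
  "$SCRIPT_DIR/idrac_lock_cleanup.sh"
}

# Usage:
#   run_cleanup_in_order
```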

**Kafka Lock Cleanup Script**

Save the following script as ``kafka_lock_cleanup.sh``::

   #!/bin/bash
   set -euo pipefail
   NAMESPACE="telemetry"


   echo "=== Kafka Lock Cleanup ==="


   # Step 1: Get PVC names before deleting pods
   echo "[1] Collecting PVC names..."
   PVCS=$(kubectl get pods -n "$NAMESPACE" -l strimzi.io/kind=Kafka \
    -o jsonpath='{.items[*].spec.volumes[*].persistentVolumeClaim.claimName}')
   echo "PVCs found: $PVCS"


   # Step 2: Force delete all Kafka pods
   echo "[2] Force deleting Kafka pods..."
   kubectl delete pod -n "$NAMESPACE" -l strimzi.io/kind=Kafka --force --grace-period=0


   # Step 3: Clean lock files from each PVC
   for PVC in $PVCS; do
    echo "[3] Cleaning lock files from PVC: $PVC"
    kubectl run kafka-lock-cleanup --image=busybox:1.36 -n "$NAMESPACE" --restart=Never --overrides="
    {
    \"spec\": {
        \"containers\": [{
        \"name\": \"cleanup\",
        \"image\": \"busybox:1.36\",
        \"command\": [\"sh\", \"-c\", \"find /data -type f \\\\( -name '*.lock' -o -name '*.sock' -o -name '*.pid' \\\\) -print -delete; echo Done\"],
        \"volumeMounts\": [{\"name\": \"data\", \"mountPath\": \"/data\"}]
        }],
        \"volumes\": [{\"name\": \"data\", \"persistentVolumeClaim\": {\"claimName\": \"$PVC\"}}]
    }
    }"


    # Step 4: Wait for completion
    echo "[4] Waiting for cleanup pod..."
    kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/kafka-lock-cleanup -n "$NAMESPACE" --timeout=120s
    kubectl logs kafka-lock-cleanup -n "$NAMESPACE"
    kubectl delete pod kafka-lock-cleanup -n "$NAMESPACE"
   done


   echo "[5] Verify: kubectl get pods -n $NAMESPACE -l strimzi.io/kind=Kafka"
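
The script ends by printing a verification command. As an optional follow-up, the sketch below blocks until the recreated pods report Ready. The helper name ``wait_for_ready`` and the 300-second default are assumptions; note that ``kubectl wait`` returns an error if no pods match the selector yet, so it may need to be retried shortly after the cleanup.

```shell
#!/bin/bash
# Sketch: wait until all pods matching a label selector are Ready.
# wait_for_ready and the TIMEOUT default are hypothetical.
NAMESPACE="${NAMESPACE:-telemetry}"
TIMEOUT="${TIMEOUT:-300s}"

wait_for_ready() {
  # $1 is a label selector, for example strimzi.io/kind=Kafka
  kubectl wait pod -n "$NAMESPACE" -l "$1" \
    --for=condition=Ready --timeout="$TIMEOUT"
}

# Usage:
#   wait_for_ready strimzi.io/kind=Kafka
```

The same helper can be called with ``app=idrac-telemetry`` after the iDRAC cleanup.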


**iDRAC Lock Cleanup Script**

Save the following script as ``idrac_lock_cleanup.sh``::

   #!/bin/bash
   set -euo pipefail
   NAMESPACE="telemetry"

   echo "=== iDRAC Lock Cleanup ==="

   # Step 1: Check for corruption — abort if found
   echo "[1] Checking logs for data corruption..."
   for POD in $(kubectl get pods -n "$NAMESPACE" -l app=idrac-telemetry -o jsonpath='{.items[*].metadata.name}'); do
     LOGS=$(kubectl logs "$POD" -n "$NAMESPACE" --tail=50 2>/dev/null || echo "")

     # Check for known InnoDB corruption indicators in the recent logs
     if echo "$LOGS" | grep -qiE "trying to read page|corruption in the InnoDB tablespace|innodb_force_recovery"; then
       echo ""
       echo "============================================================"
       echo "ERROR: Data corruption detected in pod: $POD"
       echo ""
       echo "Errors found:"
       echo "$LOGS" | grep -iE "trying to read page|Unable to lock mysql.ibd|corruption|Assertion failure|innodb_force_recovery|Unable to read page" | head -5
       echo ""
       echo "Lock cleanup will NOT fix this issue."
       echo "Run: ./idrac_data_corruption_recovery.sh"
       echo "============================================================"
       exit 1
     fi
   done

   echo "No corruption detected. Proceeding with lock cleanup..."

   # Step 2: Get PVC names
   echo "[2] Collecting PVC names..."
   PVCS=$(kubectl get pods -n "$NAMESPACE" -l app=idrac-telemetry \
    -o jsonpath='{.items[*].spec.volumes[*].persistentVolumeClaim.claimName}')
   echo "PVCs found: $PVCS"

   # Step 3: Force delete all iDRAC pods
   echo "[3] Force deleting iDRAC pods..."
   kubectl delete pod -n "$NAMESPACE" -l app=idrac-telemetry --force --grace-period=0

   # Step 4: Clean lock files from each PVC
   for PVC in $PVCS; do
     echo "[4] Cleaning lock files from PVC: $PVC"
     kubectl run mysql-lock-cleanup --image=busybox:1.36 -n "$NAMESPACE" --restart=Never --overrides="
    {
    \"spec\": {
        \"containers\": [{
        \"name\": \"cleanup\",
        \"image\": \"busybox:1.36\",
        \"command\": [\"sh\", \"-c\", \"find /data -type f \\\\( -name '*.sock' -o -name '*.pid' -o -name '*.lock' -o -name 'ibdata1.lock' \\\\) -print -delete; echo Done\"],
        \"volumeMounts\": [{\"name\": \"data\", \"mountPath\": \"/data\"}]
        }],
        \"volumes\": [{\"name\": \"data\", \"persistentVolumeClaim\": {\"claimName\": \"$PVC\"}}]
    }
    }"

     echo "[5] Waiting for cleanup pod..."
     kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/mysql-lock-cleanup -n "$NAMESPACE" --timeout=120s
     kubectl logs mysql-lock-cleanup -n "$NAMESPACE"
     kubectl delete pod mysql-lock-cleanup -n "$NAMESPACE"
   done

   echo "[6] Verify: kubectl get pods -n $NAMESPACE -l app=idrac-telemetry"
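
To preview what the cleanup pod would remove, the same ``find`` expression can be run without ``-delete``. The sketch below wraps it in a hypothetical helper, ``list_lock_files``, that takes the mount path as its argument. Because the expression uses ``-type f``, only regular files are matched.

```shell
#!/bin/bash
# Sketch: list the files the cleanup would delete, without deleting them.
# list_lock_files is a hypothetical helper name.
list_lock_files() {
  # $1 is the directory where the persistent volume is mounted (e.g. /data)
  find "$1" -type f \( -name '*.sock' -o -name '*.pid' -o -name '*.lock' \) -print
}

# Usage, from a pod that mounts the PVC at /data:
#   list_lock_files /data
```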
  

**iDRAC Data Corruption Recovery Script**

Save the following script as ``idrac_data_corruption_recovery.sh``::

   #!/bin/bash
   set -euo pipefail
   NAMESPACE="telemetry"

   echo "=== iDRAC Data Corruption Recovery ==="
   echo "WARNING: This will DELETE all iDRAC PVCs and wipe stored data."
   read -rp "Type DELETE to confirm: " CONFIRM
   [[ "$CONFIRM" != "DELETE" ]] && echo "Aborted." && exit 0

   # Step 1: List PVCs
   echo "[1] Current iDRAC PVCs:"
   kubectl get pvc -n "$NAMESPACE" -l app=idrac-telemetry

   # Step 2: Force delete all pods FIRST (so the PVCs are no longer in use)
   echo "[2] Force deleting iDRAC pods..."
   kubectl delete pod -n "$NAMESPACE" -l app=idrac-telemetry --force --grace-period=0

   # Step 3: Wait for pods to terminate
   echo "[3] Waiting for pods to terminate..."
   sleep 10

   # Step 4: Delete PVCs (no longer mounted by the deleted pods)
   echo "[4] Deleting iDRAC PVCs..."
   kubectl delete pvc -n "$NAMESPACE" -l app=idrac-telemetry --wait=false

   # Step 5: Verify PVCs are terminating
   echo "[5] Checking PVC status..."
   sleep 5
   REMAINING=$(kubectl get pvc -n "$NAMESPACE" -l app=idrac-telemetry --no-headers 2>/dev/null | wc -l)

   if [[ "$REMAINING" -gt 0 ]]; then
     echo "PVCs still terminating. Removing finalizers..."
     for PVC in $(kubectl get pvc -n "$NAMESPACE" -l app=idrac-telemetry -o jsonpath='{.items[*].metadata.name}'); do
       kubectl patch pvc "$PVC" -n "$NAMESPACE" -p '{"metadata":{"finalizers":null}}'
       echo "Finalizer removed from: $PVC"
     done
     sleep 5
   fi

   # Step 6: Final verification
   echo "[6] Verifying cleanup..."
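   # kubectl exits 0 even when nothing matches, so test for empty output instead
   PVC_LEFT=$(kubectl get pvc -n "$NAMESPACE" -l app=idrac-telemetry --no-headers 2>/dev/null || true)
   POD_LEFT=$(kubectl get pod -n "$NAMESPACE" -l app=idrac-telemetry --no-headers 2>/dev/null || true)
   echo "${PVC_LEFT:-All PVCs deleted.}"
   echo "${POD_LEFT:-All pods deleted.}"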

   # Step 7: Re-deploy
   echo ""
   echo "[7] Re-deploy with: ansible-playbook telemetry/telemetry.yml"
   echo "    Use the SAME inputs as previous deployment."
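
The script uses fixed ``sleep`` intervals while the PVCs terminate. A polling loop is one possible alternative, sketched below with the same ``kubectl get pvc ... --no-headers | wc -l`` count the script already uses. The function name and the ``MAX_TRIES`` and ``INTERVAL`` defaults are assumptions.

```shell
#!/bin/bash
# Sketch: poll until no iDRAC PVCs remain, or give up after MAX_TRIES.
# wait_for_pvc_removal, MAX_TRIES, and INTERVAL are hypothetical.
NAMESPACE="${NAMESPACE:-telemetry}"
MAX_TRIES="${MAX_TRIES:-12}"   # 12 tries x 5 s is about one minute
INTERVAL="${INTERVAL:-5}"

wait_for_pvc_removal() {
  tries=0
  while [ "$tries" -lt "$MAX_TRIES" ]; do
    remaining=$(kubectl get pvc -n "$NAMESPACE" -l app=idrac-telemetry \
      --no-headers 2>/dev/null | wc -l | tr -d ' ')
    if [ "$remaining" -eq 0 ]; then
      echo "All PVCs deleted."
      return 0
    fi
    echo "$remaining PVC(s) still present; retrying in ${INTERVAL}s..."
    sleep "$INTERVAL"
    tries=$((tries + 1))
  done
  return 1   # PVCs still present after all tries
}
```

A non-zero return indicates that PVCs are still present, which is the case the script handles by removing finalizers.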


⦾ **GPU Usage Metrics Not Available via iDRAC Telemetry on PowerEdge XE8712**

**Description**: On the PowerEdge XE8712 equipped with NVIDIA GB200 accelerators, GPU utilization metrics are not correctly reported through iDRAC telemetry. As a result, downstream consumers such as Kafka and VictoriaMetrics show zero GPU usage even when the GPUs are fully utilized. This is inconsistent with on-host monitoring, where ``nvidia-smi`` reports 100% GPU utilization.

**Potential Cause**: The issue has been identified as an iDRAC telemetry limitation specific to this platform and accelerator combination. The behavior has been observed with iDRAC version 1.30.30.50 and earlier.

**Resolution / Workaround**: Until a fix is provided in a future iDRAC release, monitor GPU utilization directly from the host using ``nvidia-smi`` rather than relying on iDRAC-based telemetry for GPU usage metrics.
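
As an illustration of host-side monitoring, the sketch below summarizes per-GPU utilization from the CSV output of ``nvidia-smi``. The helper name ``summarize_gpu_util`` is hypothetical, and the sketch assumes every GPU reports a numeric utilization value.

```shell
#!/bin/bash
# Sketch: summarize GPU utilization from nvidia-smi CSV output.
# summarize_gpu_util is a hypothetical helper name.
# Expects lines of the form "<index>, <utilization>" on stdin.
summarize_gpu_util() {
  awk -F', *' '
    { printf "GPU %s: %s%%\n", $1, $2; sum += $2; n++ }
    END { if (n > 0) printf "Mean utilization: %.1f%%\n", sum / n }
  '
}

# Usage on the host:
#   nvidia-smi --query-gpu=index,utilization.gpu \
#     --format=csv,noheader,nounits | summarize_gpu_util
```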