Kubernetes
⦾ Why do Kubernetes Pods show ImagePullBackOff or ErrImagePull errors in their status?
Potential Cause: These errors occur when the Docker Hub image pull rate limit is exceeded.
Resolution:
Ensure that docker_username and docker_password are provided in /opt/omnia/input/project_default/omnia_config_credentials.yml.
For ErrImagePull and ImagePullBackOff issues, ensure that the local_repo.yml playbook has executed successfully without any package failures. Check the local_repo logs for more details.
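To see which pods are affected before checking credentials or the local repository, the pod listing can be filtered by status. The following is a minimal sketch, not part of Omnia: the function name image_pull_failures is hypothetical, and it assumes the default column layout of kubectl get pods --all-namespaces --no-headers (NAMESPACE NAME READY STATUS RESTARTS AGE).

```shell
# Print "namespace/pod status" for every pod whose STATUS column ends in an
# image pull error. Init containers report "Init:ErrImagePull" or
# "Init:ImagePullBackOff", so the match is anchored at the end of the field.
# Input (stdin): output of `kubectl get pods --all-namespaces --no-headers`.
image_pull_failures() {
    awk '$4 ~ /(ErrImagePull|ImagePullBackOff)$/ { print $1 "/" $2, $4 }'
}

# Usage on a control-plane node:
#   kubectl get pods --all-namespaces --no-headers | image_pull_failures
```

For any pod it reports, kubectl describe pod <pod_name> -n <namespace> lists the pull events, including the image name and the registry error.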
⦾ What to do if the nodes in a Kubernetes cluster reboot?
Resolution: Wait for 15 minutes after the Kubernetes cluster reboots. To verify the status of the cluster nodes, run the following commands from the kube_control_plane:
To get the real-time Kubernetes cluster status, run:
kubectl get nodes
To check which pods are in the Running state, run:
kubectl get pods --all-namespaces
To verify that both the Kubernetes master and kubeDNS are in the Running state, run:
kubectl cluster-info
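The 15-minute wait can also be scripted as a polling loop. This is a sketch under stated assumptions rather than an Omnia utility: the function names are hypothetical, and the defaults of 30 attempts at 30-second intervals are chosen only because they add up to the 15 minutes mentioned above.

```shell
# Count nodes whose STATUS column is not exactly "Ready".
# Input (stdin): output of `kubectl get nodes --no-headers`
# (columns: NAME STATUS ROLES AGE VERSION).
# Note: a cordoned node shows "Ready,SchedulingDisabled" and is counted too.
count_not_ready_nodes() {
    awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Poll until every node reports Ready, or give up after max_attempts tries.
wait_for_nodes_ready() {
    max_attempts="${1:-30}"   # assumed default: 30 attempts
    interval="${2:-30}"       # assumed default: 30 seconds between attempts
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # An empty or failed listing means the API server is not answering yet.
        if nodes="$(kubectl get nodes --no-headers 2>/dev/null)" && [ -n "$nodes" ]; then
            not_ready="$(printf '%s\n' "$nodes" | count_not_ready_nodes)"
            if [ "$not_ready" -eq 0 ]; then
                echo "all nodes Ready"
                return 0
            fi
            echo "attempt $attempt: $not_ready node(s) not Ready"
        else
            echo "attempt $attempt: API server not reachable yet"
        fi
        sleep "$interval"
        attempt=$((attempt + 1))
    done
    return 1
}
```

Calling wait_for_nodes_ready with no arguments on the kube_control_plane uses the assumed defaults; both values can be overridden, for example wait_for_nodes_ready 10 60.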
⦾ What to do when the Kubernetes pods are not in the Running state?
Resolution:
Run kubectl get pods --all-namespaces to get the status of all the pods.
If a pod is not in the Running state, delete it using the command: kubectl delete pods <name of pod> -n <namespace>
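When several pods are affected, the delete commands can be generated from the listing and reviewed before anything is removed. A minimal sketch follows; the function name is hypothetical, and skipping pods in the Completed state is an assumption (finished Jobs are normally left alone).

```shell
# Turn `kubectl get pods --all-namespaces --no-headers` output (stdin) into one
# `kubectl delete pod` command per pod that is neither Running nor Completed.
# The commands are only printed, never executed, so they can be reviewed first.
delete_commands_for_unhealthy_pods() {
    awk '$4 != "Running" && $4 != "Completed" {
        printf "kubectl delete pod %s -n %s\n", $2, $1
    }'
}

# Usage on a control-plane node:
#   kubectl get pods --all-namespaces --no-headers | delete_commands_for_unhealthy_pods
```

Printing instead of deleting is a deliberate choice here: pods that are still starting (for example in ContainerCreating) also appear in the output and may only need more time.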
⦾ What to do when the DNS servers are unresponsive and the Kubernetes pods stop communicating with the servers?
Potential Cause: The host network is faulty, causing DNS to be unresponsive.
Resolution:
In your Kubernetes cluster, run kubectl rollout restart deployments coredns -n kube-system on any of the kube_control_plane nodes.
Wait until the coredns pods are in the Running state.
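The restart and the wait can be combined, since kubectl rollout status blocks until the new pods are available. This is a small sketch; the function name restart_coredns and the 120-second default timeout are assumptions.

```shell
# Restart the coredns deployment and block until the rollout completes
# or the timeout expires (kubectl then exits with a non-zero status).
restart_coredns() {
    wait_timeout="${1:-120s}"   # assumed default timeout
    kubectl rollout restart deployment coredns -n kube-system &&
    kubectl rollout status deployment coredns -n kube-system --timeout="$wait_timeout"
}

# Usage on a control-plane node:
#   restart_coredns        # wait up to the assumed default
#   restart_coredns 300s   # wait up to five minutes
```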
⦾ Why does the NFS-client provisioner go to a ContainerCreating or CrashLoopBackOff state?
Potential Cause: This issue usually occurs when server_share_path given in storage_config.yml for nfs_name does not have an NFS server running.
Resolution:
Ensure that the server_share_path mentioned in storage_config.yml for nfs_name: nfs_k8s has an active NFS server running on it.
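One way to check this from a cluster node is to ask the NFS server for its export list with showmount -e and look for the share path. The sketch below assumes showmount is installed on the node; the function name and the placeholders <nfs_server_ip> and <server_share_path> are illustrative and must be replaced with the values from storage_config.yml.

```shell
# Succeed (exit 0) if the given path is the first field of any line in the
# export list read from stdin, as printed by `showmount -e <nfs_server_ip>`.
# The comparison is exact, so a trailing slash on either side causes a mismatch.
nfs_export_present() {
    export_path="$1"
    awk -v p="$export_path" '$1 == p { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage:
#   showmount -e <nfs_server_ip> | nfs_export_present <server_share_path> \
#       && echo "share is exported" || echo "share is NOT exported"
```

If showmount itself fails or times out, the NFS server is most likely not running or not reachable from that node.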
⦾ If the NFS-client provisioner is in ContainerCreating or CrashLoopBackOff state, why does the kubectl describe <pod_name> command show the following output?
Potential Cause: This is a known issue. For more information, click here.
Resolution:
Wait for some time for the pods to come up, or do the following:
Run the following command to delete the pod:
kubectl delete pod <pod_name> -n <namespace>
After deletion, the pod is restarted and returns to the Running state.
⦾ Why do Kubernetes workloads fail to resolve the PowerScale SmartConnect hostname (e.g., management.ps.com) within the cluster?
Potential Cause: The SmartConnect hostname is not resolvable by the Kubernetes cluster’s internal DNS (CoreDNS). This typically happens when:
- CoreDNS is unaware of the external DNS zone used by PowerScale.
- The SmartConnect service IP or hostname is not defined in CoreDNS or the upstream DNS servers.
Resolution:
Step 1: Identify the SmartConnect Hostname and IP
- In the PowerScale UI, go to:
Cluster Management → Network Configuration → Subnets → <Your Subnet Name>
- Note the following details:
SmartConnect Service Name: e.g., management.ps.com
SmartConnect IP Address: e.g., 10.x.x.x
Step 2: Update the CoreDNS ConfigMap
1. On a control-plane node, edit the CoreDNS ConfigMap:
kubectl -n kube-system edit configmap coredns
2. Locate the Corefile: section and add a hosts block before the forward or proxy section. Example:
hosts {
    10.x.x.x management.ps.com
    fallthrough
}
Replace 10.x.x.x with your actual PowerScale DNS IP. You can find the DNS IP in the [dns] field of /opt/omnia/input/project_default/network_spec.yml.
Step 3: Restart CoreDNS Pods
Apply the changes by restarting CoreDNS:
kubectl -n kube-system rollout restart deployment coredns
Verify the CoreDNS pods are running:
kubectl -n kube-system get pods -l k8s-app=kube-dns
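Before moving on to validation, it can help to confirm that the hosts entry was actually saved in the ConfigMap. The following sketch reads the Corefile text and checks for the hostname inside a hosts block; the function name corefile_maps_host is hypothetical, and the parsing assumes the usual layout with "hosts {" and "}" on their own lines.

```shell
# Succeed (exit 0) if the Corefile on stdin has a hosts { ... } block that
# contains a line mapping some address to the given hostname.
corefile_maps_host() {
    host="$1"
    awk -v h="$host" '
        $1 == "hosts" && $NF == "{" { in_hosts = 1; next }   # entering a hosts block
        in_hosts && $1 == "}"        { in_hosts = 0; next }   # leaving it
        in_hosts { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
        END { exit found ? 0 : 1 }'
}

# Usage on a control-plane node:
#   kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}' \
#       | corefile_maps_host management.ps.com && echo "hosts entry present"
```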
Step 4: Validate DNS Resolution
Launch a temporary pod to test name resolution:
kubectl run -it dns-test --image=busybox --restart=Never -- sh
Inside the pod shell, test DNS:
nslookup management.ps.com
Expected output:
Server: 10.x.x.x
Address 1: management.ps.com
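The same check can be run without an interactive shell, which is convenient in scripts. This is a sketch only: the function name dns_test is hypothetical, and pinning the image to busybox:1.28 is an assumption (the nslookup in that tag is commonly used for in-cluster DNS checks).

```shell
# Run a one-off pod that performs a single nslookup and is removed afterwards.
# --rm deletes the pod when the command finishes; -i attaches so the output
# of nslookup is printed to the terminal.
dns_test() {
    host="$1"
    kubectl run dns-test --image=busybox:1.28 --restart=Never --rm -i -- nslookup "$host"
}

# Usage on a control-plane node:
#   dns_test management.ps.com
```

A pod named dns-test left over from the interactive test must be removed first with kubectl delete pod dns-test, because pod names must be unique within a namespace.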
⦾ Why is kubeadm join --control-plane unsuccessful with the following messages: Failed to pull required certs, secret kubeadm-certs was not found in kube-system, or certificate key expired?
Potential Cause: During kubeadm init, encrypted control-plane certificates are uploaded to the cluster. These certificates require a certificate key, which expires after approximately two hours. If a control-plane node attempts to join after this window, it cannot download or decrypt certificates, resulting in join failure.
Resolution:
On any existing and healthy control-plane node (not the affected node), run the script located on the shared NFS mount:
{{ k8s_client_mount_path }}/generate-control-plane-join.sh
k8s_client_mount_path is the local directory on every Kubernetes node where the NFS share is mounted, allowing all nodes to access and use shared resources automatically.
This script uploads fresh control-plane certificates to the cluster, generates a refreshed control-plane join command, and saves it to {{ k8s_client_mount_path }}/control-plane-join-command.sh.
Reboot the control-plane node where the join previously failed.
After reboot, the node automatically reads the refreshed join command from the shared NFS path and successfully adds itself to the cluster. No manual join command execution is required.
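For background, a refreshed control-plane join command can be assembled from two standard kubeadm commands. The sketch below illustrates that general mechanism and is not the contents of generate-control-plane-join.sh, which may work differently; the function name build_control_plane_join is hypothetical.

```shell
# Append the control-plane flags to a worker join command.
#   $1: join command, e.g. the output of `kubeadm token create --print-join-command`
#   $2: certificate key, e.g. the last line printed by
#       `kubeadm init phase upload-certs --upload-certs`
build_control_plane_join() {
    join_cmd="$1"
    cert_key="$2"
    printf '%s --control-plane --certificate-key %s\n' "$join_cmd" "$cert_key"
}

# Usage as root on a healthy control-plane node:
#   join_cmd="$(kubeadm token create --print-join-command)"
#   cert_key="$(kubeadm init phase upload-certs --upload-certs | tail -n 1)"
#   build_control_plane_join "$join_cmd" "$cert_key"
```

The uploaded certificates expire again after roughly two hours, so a command built this way must be used within that window.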
If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.