Kubernetes 集群calico网络故障排查思路

报错calico/node is not ready: BIRD is not ready: BGP not established with 172.16.0.20,172.16.0.30

复制代码

\\calico未准备好，BGP协议不能与172.16.0.20,172.16.0.30内网IP地址连接
BGP协议:边界网关协议

访问k8s的dashboard界面无法访问网站，查看pod，未知原因导致calico的Pod资源重新创建后无法启动，显示的是0/1状态

复制代码

[root@k8s-master yaml]# kubectl get pod -n kube-system
NAMESPACE              NAME                                        READY   STATUS    RESTARTS   AGE
...
kube-system            calico-kube-controllers-578894d4cd-rsgqd    1/1     Running   0          115d
kube-system            calico-node-64s8s                           1/1     Running   3          127d
kube-system            calico-node-j4t7q                           1/1     Running   0          127d
kube-system            calico-node-n6vr4                           0/1     Running   0          40s

Calico的Pod报错内容

复制代码

[root@k8s-master yaml]# kubectl describe pod -n kube-system calico-node-n6vr4
Events:
  Type     Reason     Age        From                 Message
  ----     ------     ----       ----                 -------
  Normal   Scheduled  <unknown>  default-scheduler    Successfully assigned kube-system/calico-node-n6vr4 to k8s-master
  Normal   Pulled     41s        kubelet, k8s-master  Container image "calico/cni:v3.15.1" already present on machine
  Normal   Created    41s        kubelet, k8s-master  Created container upgrade-ipam
  Normal   Started    40s        kubelet, k8s-master  Started container upgrade-ipam
  Normal   Pulled     40s        kubelet, k8s-master  Container image "calico/cni:v3.15.1" already present on machine
  Normal   Started    39s        kubelet, k8s-master  Started container install-cni
  Normal   Created    39s        kubelet, k8s-master  Created container install-cni
  Normal   Pulled     39s        kubelet, k8s-master  Container image "calico/pod2daemon-flexvol:v3.15.1" already present on machine
  Normal   Pulled     38s        kubelet, k8s-master  Container image "calico/node:v3.15.1" already present on machine
  Normal   Started    38s        kubelet, k8s-master  Started container flexvol-driver
  Normal   Created    38s        kubelet, k8s-master  Created container flexvol-driver
  Normal   Created    37s        kubelet, k8s-master  Created container calico-node
  Normal   Started    37s        kubelet, k8s-master  Started container calico-node
  Warning  Unhealthy  27s        kubelet, k8s-master  Readiness probe failed: 2020-08-14 02:16:54.068 [INFO][142] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 172.16.0.20,172.16.0.30
  Warning  Unhealthy  17s  kubelet, k8s-master  Readiness probe failed: 2020-08-14 02:17:04.059 [INFO][181] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 172.16.0.20,172.16.0.30
  Warning  Unhealthy  7s  kubelet, k8s-master  Readiness probe failed: 2020-08-14 02:17:14.065 [INFO][207] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 172.16.0.20,172.16.0.30

原因：calico没有发现实node节点实际的网卡名称

解决方法

调整calicao的网络插件的网卡发现机制，修改IP_AUTODETECTION_METHOD对应的value值。下载的官方提供的yaml文件中，ip识别策略（IPDETECTMETHOD）没有配置，即默认为first-found，这会导致一个网络异常的ip作为nodeIP被注册，从而影响node之间的网络连接。可以修改成can-reach或者interface的策略，尝试连接某一个Ready的node的IP，以此选择出正确的IP。

复制代码

# 修改calicao的yaml文件，添加两行配置# - name: IP_AUTODETECTION_METHOD# value: "interface=eth1"  # 根据实际网卡名称配置           [root@k8s-master yaml]# vim calico.yaml...(3546行)            # Cluster type to identify the deployment type            - name: CLUSTER_TYPE              value: "k8s,bgp"            #新添加的配置            - name: IP_AUTODETECTION_METHOD              value: "interface=eth1"            # Auto-detect the BGP IP address.            - name: IP              value: "autodetect"            # Enable IPIP            - name: CALICO_IPV4POOL_IPIP              value: "Always"            # Enable or Disable VXLAN on the default IP pool.            - name: CALICO_IPV4POOL_VXLAN              value: "Never"
#重新构建kubectl apply -f calico.yaml

修复完成

复制代码

[root@k8s-master yaml]# kubectl get pod -n kube-system 
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-578894d4cd-rsgqd   1/1     Running   0          115d
calico-node-6ktn4                          1/1     Running   0          26m
calico-node-8k5z8                          1/1     Running   0          26m
calico-node-g87hc                          1/1     Running   0          1m

再次访问集群的各种资源已经可以访问了