Mode Introduction
Project documentation: https://docs.cilium.io/en/stable/operations/performance/tuning/#ebpf-host-routing



In the standard container networking model, traffic leaving a Pod crosses the veth pair from the Pod's network namespace into the host network namespace, where the host's iptables rules handle routing, forwarding and address translation before the packet finally leaves through the host NIC.
In Cilium's Native eBPF Host Routing mode, the bpf_redirect_peer() and bpf_redirect_neigh() helpers allow a packet arriving at the physical NIC's TC ingress to be redirected straight to the TC ingress of the Pod-side veth (eth0), skipping the host-side veth's device-driver receive path and the host's netfilter processing, which significantly improves network performance.
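A quick way to confirm later that this datapath is actually in use (a minimal sketch; the daemonset and node names match the kind cluster built below, and bpftool availability inside the node image is an assumption) is to check the agent's routing report and list the eBPF programs attached to the node's devices:
bash
## "Host: BPF" in the Routing line means eBPF host routing; "Legacy" means packets still traverse the host stack
kubectl -n kube-system exec ds/cilium -- cilium status | grep -i "routing"
## With attach mode TCX, the per-device programs (cil_from_netdev, cil_from_container, ...) can be listed with bpftool, if present in the node image
docker exec cilium-kpr-ebpf-control-plane bpftool net show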
Deployment Steps
Using Cilium Instead of Kube-Proxy
When deploying the Kubernetes cluster, skip the kube-proxy installation and let Cilium take over kube-proxy's role.
Cilium implements socket-based load balancing (host-reachable services) by attaching BPF programs to cgroups. The BPF cgroup attach types Socket LB relies on (connect4, sendmsg4, and so on) are cgroup v2 features; cgroup v1 does not support them, so using kube-proxy replacement requires cgroup v2 to be enabled.
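A minimal pre-flight sketch (the kind node name below comes from the cluster created in the next step; the exact wording of the status output is an assumption) to verify that the node really runs cgroup v2 and that socket load balancing is active once Cilium is up:
bash
## Should print "cgroup2fs" when cgroup v2 is the active hierarchy inside a kind node
docker exec cilium-kpr-ebpf-control-plane stat -fc %T /sys/fs/cgroup
## After Cilium is running, the agent reports whether socket-based load balancing is in effect
kubectl -n kube-system exec ds/cilium -- cilium status --verbose | grep -i "socket lb"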
Quickly Creating a Cluster with Kind and Deploying Cilium in Native Mode
The cluster runs Cilium in native routing mode, and kubeProxyReplacement=true lets Cilium stand in for kube-proxy (a quick verification sketch follows the script).
bash
#!/bin/bash
set -v
# 1. Prepare NoCNI kubernetes environment
cat <<EOF | HTTP_PROXY= HTTPS_PROXY= http_proxy= https_proxy= kind create cluster --name=cilium-kpr-ebpf --image=kindest/node:v1.27.3 --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  ## Disable deployment of the cluster's default CNI
  disableDefaultCNI: true
  ## Disable deployment of kube-proxy
  kubeProxyMode: "none"
nodes:
- role: control-plane
- role: worker
- role: worker
EOF
# 2. Get the control-plane node IP (used as k8sServiceHost below) and remove the control-plane taint
controller_node_ip=`kubectl get node -o wide --no-headers | grep -E "control-plane|bpf1" | awk -F " " '{print $6}'`
kubectl taint nodes $(kubectl get nodes -o name | grep control-plane) node-role.kubernetes.io/control-plane:NoSchedule-
# 3. Install CNI[Cilium v1.17.15]
cilium_version=v1.17.15
docker pull cilium/cilium:$cilium_version && docker pull cilium/operator-generic:$cilium_version
kind load docker-image cilium/cilium:$cilium_version cilium/operator-generic:$cilium_version --name cilium-kpr-ebpf
helm repo add cilium https://helm.cilium.io ; helm repo update
helm install cilium cilium/cilium --version 1.17.15 --namespace kube-system \
  --set k8sServiceHost=$controller_node_ip \
  --set k8sServicePort=6443 \
  --set image.pullPolicy=IfNotPresent \
  --set debug.enabled=true \
  --set debug.verbose="datapath flow kvstore envoy policy" \
  --set bpf.monitorAggregation=none \
  --set monitor.enabled=true \
  --set ipam.mode=cluster-pool \
  --set cluster.name=cilium-kpr-ebpf \
  --set kubeProxyReplacement=true \
  --set routingMode=native \
  --set autoDirectNodeRoutes=true \
  --set ipv4NativeRoutingCIDR="10.0.0.0/8" \
  --set bpf.masquerade=true
# 4. Separate namespace and cgroup v2 verification [https://github.com/cilium/cilium/pull/16259 && https://docs.cilium.io/en/stable/installation/kind/#install-cilium]
#for container in $(docker ps -a --format "table {{.Names}}" | grep cilium-kpr-ebpf);do docker exec $container ls -al /proc/self/ns/cgroup;done
#mount -l | grep cgroup && docker info | grep "Cgroup Version" | awk '$1=$1'
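Once the script finishes, a minimal verification sketch (resource names taken from the defaults above) confirms that kube-proxy really is absent and that Cilium has taken over Service handling:
bash
## kube-proxy should not exist in this cluster (expect "NotFound")
kubectl -n kube-system get ds kube-proxy
## The agent should report KubeProxyReplacement: True
kubectl -n kube-system exec ds/cilium -- cilium status | grep KubeProxyReplacement
## Wait for all Cilium agents to become ready before deploying workloads
kubectl -n kube-system rollout status ds/cilium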
Creating Test Pods
The image is essentially Nginx plus network tooling; it exists only so that packets can be captured while accessing the Service (a usage sketch follows the manifest).
yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: nginx
  name: pod
spec:
  replicas: 3
  # serviceName is required by the StatefulSet API; it points at the Service defined below
  serviceName: pod
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: burlyluo/nettool:latest
        name: nettoolbox
        env:
        - name: NETTOOL_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          privileged: true
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: nginx
            topologyKey: kubernetes.io/hostname
---
apiVersion: v1
kind: Service
metadata:
  name: pod
spec:
  type: NodePort
  selector:
    app: nginx
  ports:
  - name: http
    port: 80
    targetPort: 80
    nodePort: 32000
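A minimal usage sketch (the manifest file name is an assumption) to deploy the workload and generate some traffic against the NodePort:
bash
kubectl apply -f nettool-statefulset.yaml
kubectl rollout status statefulset/pod
## Any node address works for a NodePort Service; 172.18.0.3 is the control-plane node in this kind cluster
curl -s http://172.18.0.3:32000/ -o /dev/null -w "%{http_code}\n"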
Checking the Deployment Result
bash
root@network-demo:~# kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
default pod-0 1/1 Running 0 111s 10.0.1.231 cilium-kpr-ebpf-worker2
default pod-1 1/1 Running 0 106s 10.0.0.153 cilium-kpr-ebpf-worker
default pod-2 1/1 Running 0 100s 10.0.2.19 cilium-kpr-ebpf-control-plane
kube-system cilium-7jjcl 2/2 Running 0 4m39s 172.18.0.3 cilium-kpr-ebpf-control-plane
kube-system cilium-envoy-9jmzg 1/1 Running 0 4m39s 172.18.0.3 cilium-kpr-ebpf-control-plane
kube-system cilium-envoy-pthdk 1/1 Running 0 4m39s 172.18.0.4 cilium-kpr-ebpf-worker
kube-system cilium-envoy-s7j2t 1/1 Running 0 4m39s 172.18.0.2 cilium-kpr-ebpf-worker2
kube-system cilium-mds5q 2/2 Running 0 4m39s 172.18.0.2 cilium-kpr-ebpf-worker2
kube-system cilium-operator-7bfd9d69f4-fvvnp 1/1 Running 0 4m39s 172.18.0.2 cilium-kpr-ebpf-worker2
kube-system cilium-operator-7bfd9d69f4-pb9w6 1/1 Running 0 4m39s 172.18.0.4 cilium-kpr-ebpf-worker
kube-system cilium-psx5j 2/2 Running 0 4m39s 172.18.0.4 cilium-kpr-ebpf-worker
kube-system coredns-5d78c9869d-2r5k2 1/1 Running 0 10m 10.0.0.57 cilium-kpr-ebpf-worker
kube-system coredns-5d78c9869d-hzvn9 1/1 Running 0 10m 10.0.0.67 cilium-kpr-ebpf-worker
kube-system etcd-cilium-kpr-ebpf 1/1 Running 0 10m 172.18.0.3 cilium-kpr-ebpf-control-plane
kube-system kube-apiserver-cilium-kpr-ebpf 1/1 Running 0 10m 172.18.0.3 cilium-kpr-ebpf-control-plane
kube-system kube-controller-manager-cilium-kpr-ebpf 1/1 Running 0 10m 172.18.0.3 cilium-kpr-ebpf-control-plane
kube-system kube-scheduler-cilium-kpr-ebpf 1/1 Running 0 10m 172.18.0.3 cilium-kpr-ebpf-control-plane
Verification
Querying Cilium Details
1. Query Cilium's detailed runtime status
bash
## Use cilium status --verbose for the most detailed runtime status
root@network-demo:~# kubectl exec -n kube-system cilium-7jjcl -- cilium status
KVStore: Disabled
Kubernetes: Ok 1.27 (v1.27.3) [linux/amd64]
Kubernetes APIs: ["EndpointSliceOrEndpoint", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumCIDRGroup", "core/v1::Namespace", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
## kube-proxy has been replaced by Cilium
KubeProxyReplacement: True [eth0 172.18.0.3 172:18:0:1::3 fe80::a43a:82ff:fe6d:b272 (Direct Routing)]
Host firewall: Disabled
SRv6: Disabled
CNI Chaining: none
CNI Config file: successfully wrote CNI configuration file to /host/etc/cni/net.d/05-cilium.conflist
Cilium: Ok 1.17.15 (v1.17.15-4206eaa5)
NodeMonitor: Listening for events on 8 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
IPAM: IPv4: 3/254 allocated from 10.0.2.0/24,
IPv4 BIG TCP: Disabled
IPv6 BIG TCP: Disabled
BandwidthManager: Disabled
## Native routing mode + eBPF host routing
Routing: Network: Native Host: BPF
Attach Mode: TCX
Device Mode: veth
Masquerading: BPF [eth0] 10.0.0.0/8 [IPv4: Enabled, IPv6: Disabled]
Controller Status: 26/26 healthy
Proxy Status: OK, ip 10.0.2.84, 0 redirects active on ports 10000-20000, Envoy: external
Global Identity Range: min 256, max 65535
Hubble: Ok Current/Max Flows: 4095/4095 (100.00%), Flows/s: 108.07 Metrics: Disabled
Encryption: Disabled
Cluster health: 3/3 reachable (2026-05-02T06:26:46Z)
Name IP Node Endpoints
Modules Health: Stopped(0) Degraded(0) OK(55)
2. Query Cilium Endpoint information
In Cilium terminology, an Endpoint is the unit to which Cilium assigns an IP: a Pod can contain multiple containers that all share the same Pod IP, and all containers sharing one address are grouped together into what Cilium calls an Endpoint.
Each node's Cilium agent manages only the Endpoints on its own node, so the output of cilium endpoint list differs per node. The example below uses the control-plane node's agent:
bash
root@network-demo:~# kubectl exec -n kube-system cilium-7jjcl -- cilium endpoint list
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv4 STATUS
ENFORCEMENT ENFORCEMENT
10 Disabled Disabled 25489 k8s:app=nginx 10.0.2.19 ready
k8s:io.cilium.k8s.namespace.labels/metadata.name=default
k8s:io.cilium.k8s.policy.cluster=cilium-kpr-ebpf
k8s:io.cilium.k8s.policy.serviceaccount=default
k8s:io.kubernetes.pod.namespace=default
786 Disabled Disabled 1 k8s:node-role.kubernetes.io/control-plane ready
k8s:node.kubernetes.io/exclude-from-external-load-balancers
reserved:host
2559 Disabled Disabled 4 reserved:health 10.0.2.110 ready
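The agent-local view above is also mirrored into CiliumEndpoint custom resources, so a cluster-wide summary can be pulled with plain kubectl (a minimal sketch; no per-node exec required):
bash
## One CiliumEndpoint object per Pod, including endpoint ID, security identity and IP
kubectl get ciliumendpoints -A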
3. Query Cilium Service information
In Cilium terminology, a Service entry is the actual forwarding state of a Kubernetes Service inside Cilium's eBPF maps (Cilium has taken over kube-proxy's role).
Cilium replaces the iptables-based Service load-balancing table with eBPF maps:
bash
root@network-demo:~# kubectl exec -n kube-system cilium-7jjcl -- cilium service list
ID Frontend Service Type Backend
1 10.96.0.1:443/TCP ClusterIP 1 => 172.18.0.3:6443/TCP (active)
2 10.96.58.81:443/TCP ClusterIP 1 => 172.18.0.3:4244/TCP (active)
3 10.96.0.10:53/UDP ClusterIP 1 => 10.0.0.57:53/UDP (active)
2 => 10.0.0.67:53/UDP (active)
4 10.96.0.10:53/TCP ClusterIP 1 => 10.0.0.57:53/TCP (active)
2 => 10.0.0.67:53/TCP (active)
5 10.96.0.10:9153/TCP ClusterIP 1 => 10.0.0.57:9153/TCP (active)
2 => 10.0.0.67:9153/TCP (active)
6 10.96.231.166:80/TCP ClusterIP 1 => 10.0.1.231:80/TCP (active)
2 => 10.0.0.153:80/TCP (active)
3 => 10.0.2.19:80/TCP (active)
7 172.18.0.3:32000/TCP NodePort 1 => 10.0.1.231:80/TCP (active)
2 => 10.0.0.153:80/TCP (active)
3 => 10.0.2.19:80/TCP (active)
8 0.0.0.0:32000/TCP NodePort 1 => 10.0.1.231:80/TCP (active)
2 => 10.0.0.153:80/TCP (active)
3 => 10.0.2.19:80/TCP (active)
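The table above is rendered from Cilium's load-balancing eBPF maps; a hedged sketch (agent pod name taken from this cluster) for looking at the raw datapath state behind NodePort 32000 is to dump those maps from inside the agent:
bash
## Frontend -> backend entries for NodePort 32000 as stored in the BPF LB maps
kubectl -n kube-system exec cilium-7jjcl -- cilium bpf lb list | grep ":32000" -A 3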
Inspecting iptables Rules
1. iptables rules in a cluster that uses kube-proxy
For comparison, this query is run on a Kubernetes cluster using Calico in VXLAN mode.
bash
## Look only at the NodePort 3112 part
root@calico-vxlan:~# kubectl get svc -n deepflow-otel-spring-demo web-shop
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
web-shop NodePort 172.96.248.195 <none> 18090:3112/TCP 176d
bash
## Search the iptables rules for entries related to the Service's NodePort 3112
root@ce-demo-1:~# iptables-save | grep 3112 -w
## Rule 1:
## Additionally matches traffic destined to localhost, e.g. curl 127.0.0.1:3112
## nfacct is the kernel's netfilter packet-accounting facility; the localhost_nps_accepted_pkts counter attached here is purely statistical,
## used to distinguish traffic sources for monitoring
-A KUBE-NODEPORTS -d 127.0.0.0/8 -p tcp -m comment --comment "deepflow-otel-spring-demo/web-shop:http-shop" -m tcp --dport 3112 -m nfacct --nfacct-name localhost_nps_accepted_pkts -j KUBE-EXT-YBYZULGZYV76JYRW
## Rule 2:
## External traffic to NodeIP:3112 is steered into KUBE-NODEPORTS during PREROUTING;
## every TCP packet with destination port 3112 then jumps to the KUBE-EXT-YBYZULGZYV76JYRW chain
-A KUBE-NODEPORTS -p tcp -m comment --comment "deepflow-otel-spring-demo/web-shop:http-shop" -m tcp --dport 3112 -j KUBE-EXT-YBYZULGZYV76JYRW
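KUBE-EXT-... is only the entry chain; the actual DNAT to a backend Pod happens further down. A sketch for following the chain by hand (chain names are taken from the rules above; the per-service and per-endpoint chain names will differ in your cluster):
bash
## Step 1: the EXT chain jumps to a per-service KUBE-SVC-... chain
iptables-save -t nat | grep "KUBE-EXT-YBYZULGZYV76JYRW"
## Step 2: the KUBE-SVC-... chain probabilistically jumps to KUBE-SEP-... chains,
##         and each KUBE-SEP-... chain holds the DNAT --to-destination <PodIP:targetPort> rule
# iptables-save -t nat | grep "<KUBE-SVC-chain-from-step-1>"
# iptables-save -t nat | grep "<KUBE-SEP-chain-from-step-2>"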
2. iptables rules in the Cilium cluster
Because Cilium replaced kube-proxy at deployment time, there are no Service-related iptables rules (with kube-proxy in iptables mode, a Kubernetes Service is ultimately implemented as iptables rules).
bash
root@network-demo:~# kubectl get svc pod
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
pod NodePort 10.96.160.230 <none> 80:32000/TCP 25h
bash
## Note that no NodePort-related handling shows up in the iptables rules (the grep below returns nothing)
root@network-demo:~# iptables-save | grep 32000
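To make the contrast with the kube-proxy cluster explicit (a minimal sketch run inside a kind node; exact counts vary by configuration), kube-proxy's Service chains should be missing entirely, while any iptables rules that do remain belong to Cilium's own CILIUM_* chains:
bash
## Expect 0 matches for kube-proxy's Service chains
docker exec cilium-kpr-ebpf-control-plane sh -c 'iptables-save | grep -cE "KUBE-(SVC|SEP|NODEPORTS)"'
## The rules that remain were installed by Cilium itself (e.g. CILIUM_OUTPUT, CILIUM_FORWARD)
docker exec cilium-kpr-ebpf-control-plane sh -c 'iptables-save | grep -c "CILIUM"'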
Packet Captures at the Pod Interfaces
bash
root@network-demo:~# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
pod-0 1/1 Running 0 16m 10.0.1.231 cilium-kpr-ebpf-worker2
pod-1 1/1 Running 0 16m 10.0.0.153 cilium-kpr-ebpf-worker
pod-2 1/1 Running 0 16m 10.0.2.19 cilium-kpr-ebpf-control-plane
1. Query the Pod interface and its host-side veth pair
In the Pod's interface output, the @ifN suffix is the ifindex of the peer interface on the host side; matching these indices pairs the Pod's eth0 with its host-side lxc device (see the lookup sketch after the output below).
bash
root@network-demo:~# kubectl exec pod-2 -- ip address show eth0
8: eth0@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ca:e8:ab:e0:8e:c1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.2.19/32 scope global eth0
valid_lft forever preferred_lft forever
root@network-demo:~# docker exec cilium-kpr-ebpf-control-plane ip address show lxc939e7d20131a
9: lxc939e7d20131a@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 9a:6c:ff:92:85:f5 brd ff:ff:ff:ff:ff:ff link-netns cni-e8d9d993-3bd7-9552-06e8-92f205293390
inet6 fe80::986c:ffff:fe92:85f5/64 scope link
valid_lft forever preferred_lft forever
root@network-demo:~# docker exec cilium-kpr-ebpf-control-plane ip -d link show lxc939e7d20131a
9: lxc939e7d20131a@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 9a:6c:ff:92:85:f5 brd ff:ff:ff:ff:ff:ff link-netns cni-e8d9d993-3bd7-9552-06e8-92f205293390 promiscuity 0 minmtu 68 maxmtu 65535
veth addrgenmode eui64 numtxqueues 8 numrxqueues 8 gso_max_size 65536 gso_max_segs 65535
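Instead of eyeballing the @ifN suffixes, the peer ifindex can be resolved programmatically; a minimal sketch (pod and node names taken from this cluster):
bash
## eth0 inside the Pod records its peer's ifindex in iflink
peer_idx=$(kubectl exec pod-2 -- cat /sys/class/net/eth0/iflink)
## Look up that ifindex on the node where pod-2 runs -> prints lxc939e7d20131a@if8
docker exec cilium-kpr-ebpf-control-plane ip -o link | awk -F': ' -v idx="$peer_idx" '$1 == idx {print $2}'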
2. Capture on the client Pod's eth0
Here both the TCP three-way handshake and the four-way teardown are visible; the traffic flow is completely normal.

bash
root@network-demo:~# kubectl exec -it pod-2 -- tcpdump -pnei eth0
07:20:31.814196 ca:e8:ab:e0:8e:c1 > 9a:6c:ff:92:85:f5, ethertype IPv4 (0x0800), length 74: 10.0.2.19.34994 > 10.0.0.153.80: Flags [S], seq 1322885877, win 64240, options [mss 1460,sackOK,TS val 770797784 ecr 0,nop,wscale 7], length 0
07:20:31.814413 9a:6c:ff:92:85:f5 > ca:e8:ab:e0:8e:c1, ethertype IPv4 (0x0800), length 74: 10.0.0.153.80 > 10.0.2.19.34994: Flags [S.], seq 2571982596, ack 1322885878, win 65160, options [mss 1460,sackOK,TS val 2744097757 ecr 770797784,nop,wscale 7], length 0
07:20:31.814429 ca:e8:ab:e0:8e:c1 > 9a:6c:ff:92:85:f5, ethertype IPv4 (0x0800), length 66: 10.0.2.19.34994 > 10.0.0.153.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 770797784 ecr 2744097757], length 0
07:20:31.814534 ca:e8:ab:e0:8e:c1 > 9a:6c:ff:92:85:f5, ethertype IPv4 (0x0800), length 146: 10.0.2.19.34994 > 10.0.0.153.80: Flags [P.], seq 1:81, ack 1, win 502, options [nop,nop,TS val 770797784 ecr 2744097757], length 80: HTTP: GET / HTTP/1.1
07:20:31.814605 9a:6c:ff:92:85:f5 > ca:e8:ab:e0:8e:c1, ethertype IPv4 (0x0800), length 66: 10.0.0.153.80 > 10.0.2.19.34994: Flags [.], ack 81, win 509, options [nop,nop,TS val 2744097757 ecr 770797784], length 0
07:20:31.814760 9a:6c:ff:92:85:f5 > ca:e8:ab:e0:8e:c1, ethertype IPv4 (0x0800), length 302: 10.0.0.153.80 > 10.0.2.19.34994: Flags [P.], seq 1:237, ack 81, win 509, options [nop,nop,TS val 2744097757 ecr 770797784], length 236: HTTP: HTTP/1.1 200 OK
07:20:31.814777 ca:e8:ab:e0:8e:c1 > 9a:6c:ff:92:85:f5, ethertype IPv4 (0x0800), length 66: 10.0.2.19.34994 > 10.0.0.153.80: Flags [.], ack 237, win 501, options [nop,nop,TS val 770797784 ecr 2744097757], length 0
07:20:31.815009 9a:6c:ff:92:85:f5 > ca:e8:ab:e0:8e:c1, ethertype IPv4 (0x0800), length 109: 10.0.0.153.80 > 10.0.2.19.34994: Flags [P.], seq 237:280, ack 81, win 509, options [nop,nop,TS val 2744097757 ecr 770797784], length 43: HTTP
07:20:31.815019 ca:e8:ab:e0:8e:c1 > 9a:6c:ff:92:85:f5, ethertype IPv4 (0x0800), length 66: 10.0.2.19.34994 > 10.0.0.153.80: Flags [.], ack 280, win 501, options [nop,nop,TS val 770797784 ecr 2744097757], length 0
07:20:31.815294 ca:e8:ab:e0:8e:c1 > 9a:6c:ff:92:85:f5, ethertype IPv4 (0x0800), length 66: 10.0.2.19.34994 > 10.0.0.153.80: Flags [F.], seq 81, ack 280, win 501, options [nop,nop,TS val 770797785 ecr 2744097757], length 0
07:20:31.815479 9a:6c:ff:92:85:f5 > ca:e8:ab:e0:8e:c1, ethertype IPv4 (0x0800), length 66: 10.0.0.153.80 > 10.0.2.19.34994: Flags [F.], seq 280, ack 82, win 509, options [nop,nop,TS val 2744097758 ecr 770797785], length 0
07:20:31.815487 ca:e8:ab:e0:8e:c1 > 9a:6c:ff:92:85:f5, ethertype IPv4 (0x0800), length 66: 10.0.2.19.34994 > 10.0.0.153.80: Flags [.], ack 281, win 501, options [nop,nop,TS val 770797785 ecr 2744097758], length 0
3. Capture on the host-side veth pair of the client Pod
Here only the request traffic is visible; the response traffic has already been redirected by bpf_redirect_peer() directly to the Pod's eth0, skipping the host-side veth entirely (the counter check after the capture shows the same asymmetry).

bash
root@network-demo:~# docker exec -it cilium-kpr-ebpf-control-plane tcpdump -pnei lxc939e7d20131a
06:50:21.924398 ca:e8:ab:e0:8e:c1 > 9a:6c:ff:92:85:f5, ethertype IPv4 (0x0800), length 74: 10.0.2.19.55196 > 10.0.0.153.80: Flags [S], seq 3742586534, win 64240, options [mss 1460,sackOK,TS val 768987894 ecr 0,nop,wscale 7], length 0
06:50:21.924547 ca:e8:ab:e0:8e:c1 > 9a:6c:ff:92:85:f5, ethertype IPv4 (0x0800), length 66: 10.0.2.19.55196 > 10.0.0.153.80: Flags [.], ack 190778436, win 502, options [nop,nop,TS val 768987894 ecr 2742287867], length 0
06:50:21.924643 ca:e8:ab:e0:8e:c1 > 9a:6c:ff:92:85:f5, ethertype IPv4 (0x0800), length 146: 10.0.2.19.55196 > 10.0.0.153.80: Flags [P.], seq 0:80, ack 1, win 502, options [nop,nop,TS val 768987894 ecr 2742287867], length 80: HTTP: GET / HTTP/1.1
06:50:21.924874 ca:e8:ab:e0:8e:c1 > 9a:6c:ff:92:85:f5, ethertype IPv4 (0x0800), length 66: 10.0.2.19.55196 > 10.0.0.153.80: Flags [.], ack 237, win 501, options [nop,nop,TS val 768987894 ecr 2742287867], length 0
06:50:21.925086 ca:e8:ab:e0:8e:c1 > 9a:6c:ff:92:85:f5, ethertype IPv4 (0x0800), length 66: 10.0.2.19.55196 > 10.0.0.153.80: Flags [.], ack 280, win 501, options [nop,nop,TS val 768987895 ecr 2742287867], length 0
06:50:21.925263 ca:e8:ab:e0:8e:c1 > 9a:6c:ff:92:85:f5, ethertype IPv4 (0x0800), length 66: 10.0.2.19.55196 > 10.0.0.153.80: Flags [F.], seq 80, ack 280, win 501, options [nop,nop,TS val 768987895 ecr 2742287867], length 0
06:50:21.925536 ca:e8:ab:e0:8e:c1 > 9a:6c:ff:92:85:f5, ethertype IPv4 (0x0800), length 66: 10.0.2.19.55196 > 10.0.0.153.80: Flags [.], ack 281, win 501, options [nop,nop,TS val 768987895 ecr 2742287868], length 0
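The same asymmetry shows up in the interface statistics (a hedged sketch; the counters are cumulative since the interface came up, so compare values before and after a test request). Packets leaving the Pod are received (RX) on the host-side lxc device, while replies redirected by bpf_redirect_peer() bypass it, so its TX counter barely moves apart from host-originated traffic such as kubelet probes:
bash
## RX packets grow with traffic sent by the Pod; TX stays nearly flat because return
## traffic is injected straight into the Pod's eth0 by bpf_redirect_peer()
docker exec cilium-kpr-ebpf-control-plane ip -s link show lxc939e7d20131a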