How a request is routed to the corresponding pod through a k8s Service
When a request is sent to a k8s Service, it is routed to one of the backing pods by the forwarding rules that kube-proxy has programmed on the node. kube-proxy has two main forwarding modes: iptables and ipvs.
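To check which mode kube-proxy is actually running in, a couple of quick checks can be used. This is a minimal sketch, assuming kube-proxy exposes its metrics endpoint on the default port 10249 and, for the second command, a kubeadm-style cluster that stores the configuration in the kube-proxy ConfigMap.
```bash
# On a worker node: kube-proxy reports its active mode on the metrics port (default 10249).
curl -s http://localhost:10249/proxyMode
# Expected output: "iptables" or "ipvs"

# On kubeadm-based clusters the configured mode can also be read from the ConfigMap:
kubectl -n kube-system get configmap kube-proxy -o yaml | grep 'mode:'
```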
Iptables mode
Deployment
On the client side there is one pod, monkey. The monkey pod has a main container, monkey, and an Istio sidecar (name: istio-proxy).
On the server side there are two mooncake-provider pods, with the service listening on port 8080. A mooncake-provider pod likewise has a main container, mooncake-provider, and an Istio sidecar (name: istio-proxy).
The mooncake-provider pods are deployed as a StatefulSet, and the corresponding Service, mooncake-provider svc, is of type ClusterIP.
The relevant IP addresses are as follows:
- monkey pod
fd01:14ba:9ea:8800:3c70:dc10:db2a:3340
- mooncake-provider svc
[fd00:14ba:9ea:8800::ffff:1d22]:8080
- mooncake-provider pods
mooncake-provider-0:[fd01:14ba:9ea:8800:b950:1075:b3ba:de17]:8080
mooncake-provider-1:[fd01:14ba:9ea:8800:b3ec:e045:7b9a:e493]:8080
All of these pod IPs share the same prefix, fd01:14ba:9ea:8800::/64, because they all live in the same cluster, even though they may be scheduled on different worker nodes.
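The addresses above can be collected with standard kubectl commands; a minimal sketch, where <namespace> is a placeholder for the namespace the workloads actually run in.
```bash
# Pod IPs and the worker nodes they are scheduled on:
kubectl -n <namespace> get pods -o wide

# ClusterIP and port of the Service:
kubectl -n <namespace> get svc mooncake-provider
```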
Traffic path
monkey pod -> mooncake-provider svc -> mooncake-provider pods
Connection status
Because the istio-proxy container and the main container are in the same pod, they share the same network namespace, so the state of the network connections can be inspected from either container.
From monkey pod istio container
Seen from the client-side container, there are three connections:
One connection is established to the mooncake-provider svc.
The other two connections are established to each of the two mooncake-provider pods.
```bash
bash-4.4$ ss -an | grep 8080
tcp ESTAB 0 0 [fd01:14ba:9ea:8800:3c70:dc10:db2a:3340]:44488 [fd00:14ba:9ea:8800::ffff:1d22]:8080
tcp ESTAB 0 0 [fd01:14ba:9ea:8800:3c70:dc10:db2a:3340]:37440 [fd01:14ba:9ea:8800:b950:1075:b3ba:de17]:8080
tcp ESTAB 0 0 [fd01:14ba:9ea:8800:3c70:dc10:db2a:3340]:38444 [fd01:14ba:9ea:8800:b3ec:e045:7b9a:e493]:8080
```
From mooncake-provider pod istio container
Seen from the server-side container, there is only one connection, and there is no connection whose source is the Service IP.
That connection comes from the monkey pod.
```bash
k exec -it mooncake-provider-0 -c istio-proxy bash
## The traffic from monkey pod is forwarded to sidecar container istio-proxy inbound port 15006
bash-4.4$ ss -an | grep fd01:14ba:9ea:8800:3c70:dc10:db2a:3340
tcp ESTAB 0 0 [fd01:14ba:9ea:8800:b950:1075:b3ba:de17]:15006 [fd01:14ba:9ea:8800:3c70:dc10:db2a:3340]:37440
## Then the traffic is forwarded to mooncake-provider container port 8080 by the istio local ip [::6]
bash-4.4$ ss -an | grep 8080
tcp LISTEN 0 4096 *:8080 *:*
tcp ESTAB 0 0 [::6]:40691 [fd01:14ba:9ea:8800:b950:1075:b3ba:de17]:8080
tcp ESTAB 0 0 [fd01:14ba:9ea:8800:b950:1075:b3ba:de17]:8080 [::6]:40691
```
The output above contains two entries for the same port 40691 because this is pod-internal communication: within one network namespace, i.e. inside one pod, there are several network interfaces, and the connection goes from the address of one interface to the address of another.
The first entry represents the side that initiated the connection (the client), and the second entry represents the side that accepted it (the server).
Below is a reference example of the network interfaces inside the pod:
```bash
raket@seliius00190:~> k -n monkey-ds-2734 exec -it mooncake-provider-0 -c istio-proxy bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
bash-4.4$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::6/128 scope global
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if4478: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default
link/ether ce:50:b3:82:43:33 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.63.150/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fd01:14ba:9ea:8800:b950:1075:b3ba:de17/128 scope global
valid_lft forever preferred_lft forever
inet6 fe80::cc50:b3ff:fe82:4333/64 scope link
valid_lft forever preferred_lft forever
```
The connection status in the other pod is the same as the one above and is omitted.
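The 15006 inbound hop and the [::6] loopback hop come from the NAT rules that istio-init installs inside the pod's network namespace. They can be inspected from the worker node; the sketch below is an assumption-laden example: it presumes root access on the node, a CRI runtime with crictl available, and <container-id> / <pid> are placeholders for a container of mooncake-provider-0 and its host PID.
```bash
# On the worker node hosting mooncake-provider-0 (run as root); <container-id> is a placeholder.
# Look up the container's host PID (shown under .info.pid in the inspect output):
crictl inspect <container-id> | grep '"pid"'

# Enter the pod's network namespace and dump the NAT rules that istio-init installed
# (use ip6tables for the IPv6 rule set in this IPv6 cluster):
nsenter -t <pid> -n ip6tables -t nat -S | grep ISTIO
# ISTIO_INBOUND / ISTIO_IN_REDIRECT redirect inbound TCP to the sidecar's port 15006;
# ISTIO_OUTPUT / ISTIO_REDIRECT send the application's outbound traffic to port 15001.
```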
Q&A
Why are there two kinds of connections in the monkey pod container, one to the Service and another to the destination pods?
The reasons could be as follows:
- Service-Level Abstraction:
  1. Initial traffic to a Service follows the Kubernetes Service abstraction.
  2. Istio intercepts this traffic and performs service discovery, resolving the request to a specific destination pod.
- Direct Pod-to-Pod Optimization:
  1. Once Istio resolves a Service to a specific pod, it may cache the destination pod's IP for subsequent requests, resulting in direct connections (see the sketch after this list).
  2. This approach reduces latency by bypassing the Service's load-balancing logic for repeat requests.
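The endpoints the sidecar has resolved for the Service can be inspected directly. A hedged sketch, assuming istioctl is installed and <namespace> is a placeholder for the actual namespace; the Envoy admin interface on localhost:15000 inside the istio-proxy container gives the same view.
```bash
# Endpoints Envoy has discovered for the mooncake-provider service (via istioctl):
istioctl proxy-config endpoints monkey -n <namespace> | grep mooncake-provider

# Or, from inside the monkey pod's istio-proxy container, via the Envoy admin API:
kubectl -n <namespace> exec monkey -c istio-proxy -- \
  curl -s localhost:15000/clusters | grep mooncake-provider
```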
Given that both kinds of connections exist in the monkey pod container at the same time, which connection does the traffic actually use?
Why is there only a connection coming from the monkey pod, and none coming from the Service?
The Service IP is in fact a virtual IP; it does not exist on any host or pod. A Service is a logical concept: there is no process that embodies it, and the real entities behind it are the pods associated with the Service. When a client initiates a connection to the Service IP, the connection is forwarded straight to a pod; it is not the case that the client connects to the Service and the Service then connects to the pod.
This forwarding is performed by iptables, and the corresponding iptables rules are defined by kube-proxy.
Whenever Services or pods are added or removed, kube-proxy has to update the iptables rules accordingly, so with a very large number of Services and pods performance can degrade significantly.
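Two quick checks illustrate that the Service is only a virtual IP; a sketch assuming <namespace> is a placeholder and that the second command is run on a node or inside a pod with the usual iproute2 tools.
```bash
# The Service has no process of its own; its real backends are the pod endpoints:
kubectl -n <namespace> get endpoints mooncake-provider -o wide

# The ClusterIP is not configured on any interface, neither on the node nor in the pods:
ip -6 addr show | grep 'fd00:14ba:9ea:8800::ffff:1d22' || echo "ClusterIP is not assigned to any interface"
```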
Why can the source pod's IP still be seen in the destination pod?
Whether the source IP is visible depends on how Kubernetes applies NAT. For a ClusterIP Service, the source IP is preserved:
Packets sent to ClusterIP from within the cluster are never source NAT'd if you're running kube-proxy in iptables mode, (the default).
Refer to https://kubernetes.io/docs/tutorials/services/source-ip/
This behavior is common in intra-cluster communications where preserving the original source IP is essential for accurate service behavior, logging, and security policies.
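This can also be verified in the node's connection-tracking table, where the entry shows the destination rewritten (DNAT) to a pod IP while the original pod source address is kept. A sketch assuming the conntrack-tools package is installed on the worker node.
```bash
# On the worker node, list IPv6 conntrack entries for the service/pod port:
conntrack -L -f ipv6 -p tcp 2>/dev/null | grep 8080
# In the matching entry, the original direction shows dst=<ClusterIP> dport=8080,
# while the reply direction shows src=<pod IP>; the original src (the monkey pod IP)
# is unchanged, i.e. DNAT only, no SNAT for intra-cluster ClusterIP traffic.
```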
Iptables rules for traffic routing
The traffic flow is monkey pod -> mooncake-provider svc -> mooncake-provider pods.
The corresponding iptables rules are listed below.
After the monkey pod's traffic leaves the pod's network namespace and enters the root (host) namespace, it traverses the PREROUTING chain; the matched iptables rules are KUBE-SERVICES -> KUBE-SVC-EMCU6HZ77BOFCMA2 -> KUBE-SEP-HRZ7UVRN35C4XX3N or KUBE-SEP-TL5AGK43LPHIYKVG. At the end of this traversal, the destination address of the traffic is rewritten (DNAT) to the IP of one of the pods.
Because this traffic is not addressed to a process on the host, it does not pass through the INPUT chain. Before the traffic leaves the node it passes through the POSTROUTING chain, where the source address may be rewritten (SNAT) when required; rewriting the source guarantees that the return traffic can travel back along the same path. For intra-cluster traffic to a ClusterIP, as the rules below show, no SNAT is actually applied.
```bash
seliius01059:~ # ip6tables -t nat -L
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- anywhere anywhere /* kubernetes service portals */
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- anywhere anywhere /* kubernetes service portals */
Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
KUBE-POSTROUTING all -- anywhere anywhere /* kubernetes postrouting rules */
Chain KUBE-MARK-MASQ (679 references)
target prot opt source destination
MARK all -- anywhere anywhere MARK or 0x4000
Chain KUBE-POSTROUTING (1 references)
target prot opt source destination
# Traffic originating inside the cluster is RETURNed here, i.e. no SNAT is performed.
RETURN all -- anywhere anywhere mark match ! 0x4000/0x4000
MARK all -- anywhere anywhere MARK xor 0x4000
MASQUERADE all -- anywhere anywhere /* kubernetes service traffic requiring SNAT */ random-fully
Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-SVC-EMCU6HZ77BOFCMA2 tcp -- anywhere mooncake-provider.monkey-off-ds-2714.svc.cluster.local /* monkey-off-ds-2714/mooncake-provider:http-8080 cluster IP */ tcp dpt:http-alt
Chain KUBE-SVC-EMCU6HZ77BOFCMA2 (1 references)
target prot opt source destination
# The KUBE-MARK-MASQ rule below applies to traffic that does not come from inside the cluster (source not in fd01:14ba:9ea:8800::/64): such packets get the 0x4000 mark and are SNAT'd later in the POSTROUTING chain.
KUBE-MARK-MASQ tcp -- !fd01:14ba:9ea:8800::/64 mooncake-provider.monkey-off-ds-2714.svc.cluster.local /* monkey-off-ds-2714/mooncake-provider:http-8080 cluster IP */ tcp dpt:http-alt
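# Load balancing across the two endpoints: the first KUBE-SEP rule is taken with probability 0.5
# (statistic mode random) and the second is taken unconditionally otherwise; with n endpoints
# kube-proxy generates rules with probabilities 1/n, 1/(n-1), ..., 1.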
KUBE-SEP-HRZ7UVRN35C4XX3N all -- anywhere anywhere /* monkey-off-ds-2714/mooncake-provider:http-8080 -> [fd01:14ba:9ea:8800:b3ec:e045:7b9a:e493]:8080 */ statistic mode random probability 0.50000000000
KUBE-SEP-TL5AGK43LPHIYKVG all -- anywhere anywhere /* monkey-off-ds-2714/mooncake-provider:http-8080 -> [fd01:14ba:9ea:8800:b950:1075:b3ba:de17]:8080 */
Chain KUBE-SEP-HRZ7UVRN35C4XX3N (1 references)
target prot opt source destination
# The KUBE-MARK-MASQ rule here marks traffic whose source IP equals the destination IP, i.e. a pod reaching an application in its own pod through the Service (hairpin traffic).
# In that case the reply's source IP must be rewritten to a different address; otherwise the pod would think it is talking to itself and the traffic would be dropped.
KUBE-MARK-MASQ all -- mooncake-provider-1.mooncake-provider.monkey-off-ds-2714.svc.cluster.local anywhere /* monkey-off-ds-2714/mooncake-provider:http-8080 */
DNAT tcp -- anywhere anywhere /* monkey-off-ds-2714/mooncake-provider:http-8080 */ tcp to:[fd01:14ba:9ea:8800:b3ec:e045:7b9a:e493]:8080
Chain KUBE-SEP-TL5AGK43LPHIYKVG (1 references)
target prot opt source destination
KUBE-MARK-MASQ all -- mooncake-provider-0.mooncake-provider.monkey-off-ds-2714.svc.cluster.local anywhere /* monkey-off-ds-2714/mooncake-provider:http-8080 */
DNAT tcp -- anywhere anywhere /* monkey-off-ds-2714/mooncake-provider:http-8080 */ tcp to:[fd01:14ba:9ea:8800:b950:1075:b3ba:de17]:8080
```
Iptables rule with the DNS name resolved (numeric output)
```bash
seliius01059:~ # ip6tables -t nat -L -n | grep KUBE-SVC-EMCU6HZ77BOFCMA2
KUBE-SVC-EMCU6HZ77BOFCMA2 tcp ::/0 fd00:14ba:9ea:8800::ffff:1d22 /* monkey-off-ds-2714/mooncake-provider:http-8080 cluster IP */ tcp dpt:8080
```
Iptables rule explanation
Purpose of KUBE-MARK-MASQ:
When a packet is marked in the KUBE-MARK-MASQ chain, a specific bit in the packet's mark field is set. This mark indicates that the packet should undergo masquerading, which involves changing the source IP address to the node's IP. This process is essential for:
- Outbound Traffic: Ensuring that responses to packets originating from within the cluster are correctly routed back, especially when the destination is outside the cluster's network.
- Service Access: Allowing external clients to access cluster services by translating internal pod IPs to the node's IP, making services accessible outside the cluster.
How It Works:
1. Packet Matching: When a packet matches certain criteria (e.g., destined for a service IP but originating from outside the cluster), it is directed to the KUBE-MARK-MASQ chain.
2. Marking: Within this chain, the packet is marked by setting a specific bit in its mark field.
3. Postrouting Masquerade: Later, in the POSTROUTING chain, rules check for this mark. If present, the packet undergoes masquerading (SNAT), changing its source IP to the node's IP (see the sketch below).
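For reference, the rules dumped above correspond roughly to ip6tables commands of the following shape. This is only an illustrative reconstruction, not the exact rules kube-proxy installs; flags and chain wiring vary with the kube-proxy version.
```bash
# Mark packets that need SNAT (the 0x4000 bit, as seen in the dump above):
ip6tables -t nat -A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000

# In KUBE-POSTROUTING: packets without the mark are returned untouched (no SNAT),
# marked packets have the bit cleared and are then masqueraded:
ip6tables -t nat -A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
ip6tables -t nat -A KUBE-POSTROUTING -j MARK --xor-mark 0x4000
ip6tables -t nat -A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully
```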
Why SNAT must be performed when a pod reaches itself through a Service
Explanation for the KUBE-MARK-MASQ rule in chain KUBE-SEP-HRZ7UVRN35C4XX3N:
If SNAT didn't happen, traffic would leave the pod's network namespace as (pod IP, source port -> virtual IP, virtual port) and then be NAT'd to (pod IP, source port-> pod IP, service port) and immediately sent back to the pod. Thus, this traffic would then arrive at the service with the source being (pod IP, source port). So when this service replies it will be replying to (pod IP, source port), but the pod (the kernel, really) is expecting traffic to come back on the same IP and port it sent the traffic to originally, which is (virtual IP, virtual port) and thus the traffic would get dropped on the way back
References
- kube-proxy-and-iptables-rules
- iptables-rules-for-kube-dns
- k8s-source-ip-visibility