How a request is routed to the corresponding pod through a k8s Service
When a request is sent to a k8s Service, it is routed to one of the backing pods by the forwarding rules that kube-proxy has programmed on the node. kube-proxy has two main forwarding modes: iptables and ipvs.
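To check which mode kube-proxy is actually running in, a couple of quick checks can be used. This is a minimal sketch, assuming kube-proxy exposes its metrics endpoint on the default port 10249 and, for the second command, a kubeadm-style cluster that stores the configuration in the kube-proxy ConfigMap.
```bash
# On a worker node: kube-proxy reports its active mode on the metrics port (default 10249).
curl -s http://localhost:10249/proxyMode
# Expected output: "iptables" or "ipvs"

# On kubeadm-based clusters the configured mode can also be read from the ConfigMap:
kubectl -n kube-system get configmap kube-proxy -o yaml | grep 'mode:'
```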
Iptables mode
Deployment
On the client side there is one pod, monkey. The monkey pod has a main container, monkey, and an Istio sidecar (name: istio-proxy).
On the server side there are two mooncake-provider pods, with the service listening on port 8080. A mooncake-provider pod likewise has a main container, mooncake-provider, and an Istio sidecar (name: istio-proxy).
The mooncake-provider pods are deployed as a StatefulSet, and the corresponding Service, mooncake-provider svc, is of type ClusterIP.
The relevant IP addresses are as follows:
- monkey pod
fd01:14ba:9ea:8800:3c70:dc10:db2a:3340
- mooncake-provider svc
[fd00:14ba:9ea:8800::ffff:1d22]:8080
- mooncake-provider pods
mooncake-provider-0:[fd01:14ba:9ea:8800:b950:1075:b3ba:de17]:8080
mooncake-provider-1:[fd01:14ba:9ea:8800:b3ec:e045:7b9a:e493]:8080
All of these pod IPs share the same prefix, fd01:14ba:9ea:8800::/64, because they all live in the same cluster, even though they may be scheduled on different worker nodes.
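The addresses above can be collected with standard kubectl commands; a minimal sketch, where <namespace> is a placeholder for the namespace the workloads actually run in.
```bash
# Pod IPs and the worker nodes they are scheduled on:
kubectl -n <namespace> get pods -o wide

# ClusterIP and port of the Service:
kubectl -n <namespace> get svc mooncake-provider
```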
Traffic path
monkey pod -> mooncake-provider svc -> mooncake-provider pods
Connection status
Because the istio-proxy container and the main container are in the same pod, they share the same network namespace, so the state of the network connections can be inspected from either container.
From monkey pod istio container
Seen from the client-side container, there are three connections:
One connection is established to the mooncake-provider svc.
The other two connections are established to each of the two mooncake-provider pods.
```bash
bash-4.4$ ss -an | grep 8080
tcp ESTAB 0 0 [fd01:14ba:9ea:8800:3c70:dc10:db2a:3340]:44488 [fd00:14ba:9ea:8800::ffff:1d22]:8080
tcp ESTAB 0 0 [fd01:14ba:9ea:8800:3c70:dc10:db2a:3340]:37440 [fd01:14ba:9ea:8800:b950:1075:b3ba:de17]:8080
tcp ESTAB 0 0 [fd01:14ba:9ea:8800:3c70:dc10:db2a:3340]:38444 [fd01:14ba:9ea:8800:b3ec:e045:7b9a:e493]:8080
```
From mooncake-provider pod istio container
Seen from the server-side container, there is only one connection, and there is no connection whose source is the Service IP.
That connection comes from the monkey pod.
```bash
k exec -it mooncake-provider-0 -c istio-proxy bash
## The traffic from monkey pod is forwarded to sidecar container istio-proxy inbound port 15006
bash-4.4$ ss -an | grep fd01:14ba:9ea:8800:3c70:dc10:db2a:3340
tcp ESTAB 0 0 [fd01:14ba:9ea:8800:b950:1075:b3ba:de17]:15006 [fd01:14ba:9ea:8800:3c70:dc10:db2a:3340]:37440
## Then the traffic is forwarded to mooncake-provider container port 8080 by the istio local ip [::6]
bash-4.4$ ss -an | grep 8080
tcp LISTEN 0 4096 *:8080 *:*
tcp ESTAB 0 0 [::6]:40691 [fd01:14ba:9ea:8800:b950:1075:b3ba:de17]:8080
tcp ESTAB 0 0 [fd01:14ba:9ea:8800:b950:1075:b3ba:de17]:8080 [::6]:40691
```
The output above contains two entries for the same port 40691 because this is pod-internal communication: within one network namespace, i.e. inside one pod, there are several network interfaces, and the connection goes from the address of one interface to the address of another.
The first entry represents the side that initiated the connection (the client), and the second entry represents the side that accepted it (the server).
Below is a reference example of the network interfaces inside the pod:
```bash
raket@seliius00190:~> k -n monkey-ds-2734 exec -it mooncake-provider-0 -c istio-proxy bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
bash-4.4$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::6/128 scope global
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if4478: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default
link/ether ce:50:b3:82:43:33 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.63.150/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fd01:14ba:9ea:8800:b950:1075:b3ba:de17/128 scope global
valid_lft forever preferred_lft forever
inet6 fe80::cc50:b3ff:fe82:4333/64 scope link
valid_lft forever preferred_lft forever
```
The connection status in the other pod is the same as the one above and is omitted.
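The 15006 inbound hop and the [::6] loopback hop come from the NAT rules that istio-init installs inside the pod's network namespace. They can be inspected from the worker node; the sketch below is an assumption-laden example: it presumes root access on the node, a CRI runtime with crictl available, and <container-id> / <pid> are placeholders for a container of mooncake-provider-0 and its host PID.
```bash
# On the worker node hosting mooncake-provider-0 (run as root); <container-id> is a placeholder.
# Look up the container's host PID (shown under .info.pid in the inspect output):
crictl inspect <container-id> | grep '"pid"'

# Enter the pod's network namespace and dump the NAT rules that istio-init installed
# (use ip6tables for the IPv6 rule set in this IPv6 cluster):
nsenter -t <pid> -n ip6tables -t nat -S | grep ISTIO
# ISTIO_INBOUND / ISTIO_IN_REDIRECT redirect inbound TCP to the sidecar's port 15006;
# ISTIO_OUTPUT / ISTIO_REDIRECT send the application's outbound traffic to port 15001.
```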
Q&A
Why are there two kinds of connections in the monkey pod container, one to the Service and another to the destination pods?
The reasons could be as follows:
- Service-Level Abstraction:
  1. Initial traffic to a Service follows the Kubernetes Service abstraction.
  2. Istio intercepts this traffic and performs service discovery, resolving the request to a specific destination pod.
- Direct Pod-to-Pod Optimization:
  1. Once Istio resolves a Service to a specific pod, it may cache the destination pod's IP for subsequent requests, resulting in direct connections (see the sketch after this list).
  2. This approach reduces latency by bypassing the Service's load-balancing logic for repeat requests.
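The endpoints the sidecar has resolved for the Service can be inspected directly. A hedged sketch, assuming istioctl is installed and <namespace> is a placeholder for the actual namespace; the Envoy admin interface on localhost:15000 inside the istio-proxy container gives the same view.
```bash
# Endpoints Envoy has discovered for the mooncake-provider service (via istioctl):
istioctl proxy-config endpoints monkey -n <namespace> | grep mooncake-provider

# Or, from inside the monkey pod's istio-proxy container, via the Envoy admin API:
kubectl -n <namespace> exec monkey -c istio-proxy -- \
  curl -s localhost:15000/clusters | grep mooncake-provider
```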
Given that both kinds of connections exist in the monkey pod container at the same time, which connection does the traffic actually use?
Why is there only a connection coming from the monkey pod, and none coming from the Service?
The Service IP is in fact a virtual IP; it does not exist on any host or pod. A Service is a logical concept: there is no process that embodies it, and the real entities behind it are the pods associated with the Service. When a client initiates a connection to the Service IP, the connection is forwarded straight to a pod; it is not the case that the client connects to the Service and the Service then connects to the pod.
This forwarding is performed by iptables, and the corresponding iptables rules are defined by kube-proxy.
Whenever Services or pods are added or removed, kube-proxy has to update the iptables rules accordingly, so with a very large number of Services and pods performance can degrade significantly.
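Two quick checks illustrate that the Service is only a virtual IP; a sketch assuming <namespace> is a placeholder and that the second command is run on a node or inside a pod with the usual iproute2 tools.
```bash
# The Service has no process of its own; its real backends are the pod endpoints:
kubectl -n <namespace> get endpoints mooncake-provider -o wide

# The ClusterIP is not configured on any interface, neither on the node nor in the pods:
ip -6 addr show | grep 'fd00:14ba:9ea:8800::ffff:1d22' || echo "ClusterIP is not assigned to any interface"
```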
Why can the source pod's IP still be seen in the destination pod?
Whether the source IP is visible depends on how Kubernetes applies NAT. For a ClusterIP Service, the source IP is preserved:
Packets sent to ClusterIP from within the cluster are never source NAT'd if you're running kube-proxy in iptables mode, (the default).
Refer to https://kubernetes.io/docs/tutorials/services/source-ip/
This behavior is common in intra-cluster communications where preserving the original source IP is essential for accurate service behavior, logging, and security policies.
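This can also be verified in the node's connection-tracking table, where the entry shows the destination rewritten (DNAT) to a pod IP while the original pod source address is kept. A sketch assuming the conntrack-tools package is installed on the worker node.
```bash
# On the worker node, list IPv6 conntrack entries for the service/pod port:
conntrack -L -f ipv6 -p tcp 2>/dev/null | grep 8080
# In the matching entry, the original direction shows dst=<ClusterIP> dport=8080,
# while the reply direction shows src=<pod IP>; the original src (the monkey pod IP)
# is unchanged, i.e. DNAT only, no SNAT for intra-cluster ClusterIP traffic.
```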
Iptables rules for traffic routing
The traffic flow is monkey pod -> mooncake-provider svc -> mooncake-provider pods.
The corresponding iptables rules are listed below.
After the monkey pod's traffic leaves the pod's network namespace and enters the root (host) namespace, it traverses the PREROUTING chain; the matched iptables rules are KUBE-SERVICES -> KUBE-SVC-EMCU6HZ77BOFCMA2 -> KUBE-SEP-HRZ7UVRN35C4XX3N or KUBE-SEP-TL5AGK43LPHIYKVG. At the end of this traversal, the destination address of the traffic is rewritten (DNAT) to the IP of one of the pods.
Because this traffic is not addressed to a process on the host, it does not pass through the INPUT chain. Before the traffic leaves the node it passes through the POSTROUTING chain, where the source address may be rewritten (SNAT) when required; rewriting the source guarantees that the return traffic can travel back along the same path. For intra-cluster traffic to a ClusterIP, as the rules below show, no SNAT is actually applied.
```bash
seliius01059:~ # ip6tables -t nat -L
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- anywhere anywhere /* kubernetes service portals */
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- anywhere anywhere /* kubernetes service portals */
Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
KUBE-POSTROUTING all -- anywhere anywhere /* kubernetes postrouting rules */
Chain KUBE-MARK-MASQ (679 references)
target prot opt source destination
MARK all -- anywhere anywhere MARK or 0x4000
Chain KUBE-POSTROUTING (1 references)
target prot opt source destination
# Traffic originating inside the cluster is RETURNed here, i.e. no SNAT is performed.
RETURN all -- anywhere anywhere mark match ! 0x4000/0x4000
MARK all -- anywhere anywhere MARK xor 0x4000
MASQUERADE all -- anywhere anywhere /* kubernetes service traffic requiring SNAT */ random-fully
Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-SVC-EMCU6HZ77BOFCMA2 tcp -- anywhere mooncake-provider.monkey-off-ds-2714.svc.cluster.local /* monkey-off-ds-2714/mooncake-provider:http-8080 cluster IP */ tcp dpt:http-alt
Chain KUBE-SVC-EMCU6HZ77BOFCMA2 (1 references)
target prot opt source destination
# The KUBE-MARK-MASQ rule below applies to traffic that does not come from inside the cluster (source not in fd01:14ba:9ea:8800::/64): such packets get the 0x4000 mark and are SNAT'd later in the POSTROUTING chain.
KUBE-MARK-MASQ tcp -- !fd01:14ba:9ea:8800::/64 mooncake-provider.monkey-off-ds-2714.svc.cluster.local /* monkey-off-ds-2714/mooncake-provider:http-8080 cluster IP */ tcp dpt:http-alt
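# Load balancing across the two endpoints: the first KUBE-SEP rule is taken with probability 0.5
# (statistic mode random) and the second is taken unconditionally otherwise; with n endpoints
# kube-proxy generates rules with probabilities 1/n, 1/(n-1), ..., 1.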
KUBE-SEP-HRZ7UVRN35C4XX3N all -- anywhere anywhere /* monkey-off-ds-2714/mooncake-provider:http-8080 -> [fd01:14ba:9ea:8800:b3ec:e045:7b9a:e493]:8080 */ statistic mode random probability 0.50000000000
KUBE-SEP-TL5AGK43LPHIYKVG all -- anywhere anywhere /* monkey-off-ds-2714/mooncake-provider:http-8080 -> [fd01:14ba:9ea:8800:b950:1075:b3ba:de17]:8080 */
Chain KUBE-SEP-HRZ7UVRN35C4XX3N (1 references)
target prot opt source destination
# The KUBE-MARK-MASQ rule here marks traffic whose source IP equals the destination IP, i.e. a pod reaching an application in its own pod through the Service (hairpin traffic).
# In that case the reply's source IP must be rewritten to a different address; otherwise the pod would think it is talking to itself and the traffic would be dropped.
KUBE-MARK-MASQ all -- mooncake-provider-1.mooncake-provider.monkey-off-ds-2714.svc.cluster.local anywhere /* monkey-off-ds-2714/mooncake-provider:http-8080 */
DNAT tcp -- anywhere anywhere /* monkey-off-ds-2714/mooncake-provider:http-8080 */ tcp to:[fd01:14ba:9ea:8800:b3ec:e045:7b9a:e493]:8080
Chain KUBE-SEP-TL5AGK43LPHIYKVG (1 references)
target prot opt source destination
KUBE-MARK-MASQ all -- mooncake-provider-0.mooncake-provider.monkey-off-ds-2714.svc.cluster.local anywhere /* monkey-off-ds-2714/mooncake-provider:http-8080 */
DNAT tcp -- anywhere anywhere /* monkey-off-ds-2714/mooncake-provider:http-8080 */ tcp to:[fd01:14ba:9ea:8800:b950:1075:b3ba:de17]:8080
```
Iptables rule with the DNS name resolved (numeric output)
```bash
seliius01059:~ # ip6tables -t nat -L -n | grep KUBE-SVC-EMCU6HZ77BOFCMA2
KUBE-SVC-EMCU6HZ77BOFCMA2 tcp ::/0 fd00:14ba:9ea:8800::ffff:1d22 /* monkey-off-ds-2714/mooncake-provider:http-8080 cluster IP */ tcp dpt:8080
```
Iptables rule explanation
Purpose of KUBE-MARK-MASQ:
When a packet is marked in the KUBE-MARK-MASQ chain, a specific bit in the packet's mark field is set. This mark indicates that the packet should undergo masquerading, which involves changing the source IP address to the node's IP. This process is essential for:
- Outbound Traffic: Ensuring that responses to packets originating from within the cluster are correctly routed back, especially when the destination is outside the cluster's network.
- Service Access: Allowing external clients to access cluster services by translating internal pod IPs to the node's IP, making services accessible outside the cluster.
How It Works:
1. Packet Matching: When a packet matches certain criteria (e.g., destined for a service IP but originating from outside the cluster), it is directed to the KUBE-MARK-MASQ chain.
2. Marking: Within this chain, the packet is marked by setting a specific bit in its mark field.
3. Postrouting Masquerade: Later, in the POSTROUTING chain, rules check for this mark. If present, the packet undergoes masquerading (SNAT), changing its source IP to the node's IP (see the sketch below).
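For reference, the rules dumped above correspond roughly to ip6tables commands of the following shape. This is only an illustrative reconstruction, not the exact rules kube-proxy installs; flags and chain wiring vary with the kube-proxy version.
```bash
# Mark packets that need SNAT (the 0x4000 bit, as seen in the dump above):
ip6tables -t nat -A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000

# In KUBE-POSTROUTING: packets without the mark are returned untouched (no SNAT),
# marked packets have the bit cleared and are then masqueraded:
ip6tables -t nat -A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
ip6tables -t nat -A KUBE-POSTROUTING -j MARK --xor-mark 0x4000
ip6tables -t nat -A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully
```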
Why SNAT must be performed when a pod reaches itself through a Service
Explanation for the KUBE-MARK-MASQ rule in chain KUBE-SEP-HRZ7UVRN35C4XX3N:
If SNAT didn't happen, traffic would leave the pod's network namespace as (pod IP, source port -> virtual IP, virtual port) and then be NAT'd to (pod IP, source port-> pod IP, service port) and immediately sent back to the pod. Thus, this traffic would then arrive at the service with the source being (pod IP, source port). So when this service replies it will be replying to (pod IP, source port), but the pod (the kernel, really) is expecting traffic to come back on the same IP and port it sent the traffic to originally, which is (virtual IP, virtual port) and thus the traffic would get dropped on the way back
References
- kube-proxy-and-iptables-rules
- iptables-rules-for-kube-dns
- k8s-source-ip-visibility