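Project environment
- Language: Go
- Framework: go-zero
- Deployment: Alibaba Cloud Kubernetes (k8s) cluster
Symptoms
When calling the WeChat Pay APP unified order API, testers reported an occasional error: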
Post \"https://api.mch.weixin.qq.com/pay/unifiedorder\": dial tcp [240e:e1:a900:50::4a]:443: connect: network is unreachable
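The connection was made to the IPv6 address [240e:e1:a900:50::4a], but the current server configuration does not support IPv6. The next step was to check how the domain resolves: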
[root@localhost xwj-services]# nslookup api.mch.weixin.qq.com
Server: 10.225.136.20
Address: 10.225.136.20#53
Non-authoritative answer:
api.mch.weixin.qq.com canonical name = forward.weixin.qq.com.
forward.weixin.qq.com canonical name = forwardtmp.weixin.qq.com.
Name: forwardtmp.weixin.qq.com
Address: 101.91.0.140
Name: forwardtmp.weixin.qq.com
Address: 101.226.137.13
Name: forwardtmp.weixin.qq.com
Address: 240e:e1:a900:50::4a
Name: forwardtmp.weixin.qq.com
Address: 240e:e1:a900:50::49
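The domain has both IPv4 (A) and IPv6 (AAAA) records. But why does the client only occasionally end up on an IPv6 address?
The WeChat Pay documentation (pay.weixin.qq.com/wiki/doc/ap...) says: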
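2. IPv6
If your server has IPv6 enabled, DNS resolution will often time out because IPv6 support on the Internet is still incomplete;
we recommend explicitly forcing IPv4 resolution when calling the payment API.
Reference code for PHP programs that call the API with curl: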
if(defined('CURLOPT_IPRESOLVE') && defined('CURL_IPRESOLVE_V4'))
{
curl_setopt($ch, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4);
}
Check the IPv6 configuration of the current server:
sysctl -a |grep ipv6|grep disable
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.136.65.38 netmask 255.255.255.0 broadcast 10.136.65.255
ether 00:16:3e:0e:46:ca txqueuelen 1000 (Ethernet)
RX packets 348024466 bytes 124403304860 (115.8 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 276384477 bytes 87874832467 (81.8 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
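There is no inet6 entry, so IPv6 is not enabled. The next step was to find out whether the Go HTTP client can be forced to use IPv4. An online search turned up nothing, so I read the source of Go's net package, which net/http uses for dialing.
net/lookup.go, line 288 (Go 1.18):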
// lookupIPAddr looks up host using the local resolver and particular network.
// It returns a slice of that host's IPv4 and IPv6 addresses.
func (r *Resolver) lookupIPAddr(ctx context.Context, network, host string) ([]IPAddr, error) {
// Make sure that no matter what we do later, host=="" is rejected.
// parseIP, for example, does accept empty strings.
if host == "" {
return nil, &DNSError{Err: errNoSuchHost.Error(), Name: host, IsNotFound: true}
}
if ip, zone := parseIPZone(host); ip != nil {
return []IPAddr{{IP: ip, Zone: zone}}, nil
}
fmt.Println("return ip??")
trace, _ := ctx.Value(nettrace.TraceKey{}).(*nettrace.Trace)
if trace != nil && trace.DNSStart != nil {
trace.DNSStart(host)
}
// The underlying resolver func is lookupIP by default but it
// can be overridden by tests. This is needed by net/http, so it
// uses a context key instead of unexported variables.
resolverFunc := r.lookupIP
if alt, _ := ctx.Value(nettrace.LookupIPAltResolverKey{}).(func(context.Context, string, string) ([]IPAddr, error)); alt != nil {
resolverFunc = alt
}
...
As shown, the DNS lookup resolves IPv4 and IPv6 together and returns both. Next, how an IP is chosen once resolution has finished.
net/ipsock.go, line 249:
// internetAddrList resolves addr, which may be a literal IP
// address or a DNS name, and returns a list of internet protocol
// family addresses. The result contains at least one address when
// error is nil.
func (r *Resolver) internetAddrList(ctx context.Context, net, addr string) (addrList, error) {
var (
err error
host, port string
portnum int
)
switch net {
case "tcp", "tcp4", "tcp6", "udp", "udp4", "udp6":
if addr != "" {
if host, port, err = SplitHostPort(addr); err != nil {
return nil, err
}
if portnum, err = r.LookupPort(ctx, net, port); err != nil {
return nil, err
}
}
case "ip", "ip4", "ip6":
if addr != "" {
host = addr
}
default:
return nil, UnknownNetworkError(net)
}
inetaddr := func(ip IPAddr) Addr {
switch net {
case "tcp", "tcp4", "tcp6":
return &TCPAddr{IP: ip.IP, Port: portnum, Zone: ip.Zone}
case "udp", "udp4", "udp6":
return &UDPAddr{IP: ip.IP, Port: portnum, Zone: ip.Zone}
case "ip", "ip4", "ip6":
return &IPAddr{IP: ip.IP, Zone: ip.Zone}
default:
panic("unexpected network: " + net)
}
}
if host == "" {
return addrList{inetaddr(IPAddr{})}, nil
}
// Try as a literal IP address, then as a DNS name.
ips, err := r.lookupIPAddr(ctx, net, host)
fmt.Println("net",net,ips,net[len(net)-1])
if err != nil {
return nil, err
}
// Issue 18806: if the machine has halfway configured
// IPv6 such that it can bind on "::" (IPv6unspecified)
// but not connect back to that same address, fall
// back to dialing 0.0.0.0.
if len(ips) == 1 && ips[0].IP.Equal(IPv6unspecified) {
ips = append(ips, IPAddr{IP: IPv4zero})
}
var filter func(IPAddr) bool
if net != "" && net[len(net)-1] == '4' {
filter = ipv4only
}
if net != "" && net[len(net)-1] == '6' {
filter = ipv6only
}
return filterAddrList(filter, ips, inetaddr, host)
}
// filterAddrList applies a filter to a list of IP addresses,
// yielding a list of Addr objects. Known filters are nil, ipv4only,
// and ipv6only. It returns every address when the filter is nil.
// The result contains at least one address when error is nil.
func filterAddrList(filter func(IPAddr) bool, ips []IPAddr, inetaddr func(IPAddr) Addr, originalAddr string) (addrList, error) {
var addrs addrList
for _, ip := range ips {
if filter == nil || filter(ip) {
addrs = append(addrs, inetaddr(ip))
}
}
if len(addrs) == 0 {
return nil, &AddrError{Err: errNoSuitableAddress.Error(), Addr: originalAddr}
}
return addrs, nil
}
net/dial.go, line 359:
// DialContext connects to the address on the named network using
// the provided context.
//
// The provided Context must be non-nil. If the context expires before
// the connection is complete, an error is returned. Once successfully
// connected, any expiration of the context will not affect the
// connection.
//
// When using TCP, and the host in the address parameter resolves to multiple
// network addresses, any dial timeout (from d.Timeout or ctx) is spread
// over each consecutive dial, such that each is given an appropriate
// fraction of the time to connect.
// For example, if a host has 4 IP addresses and the timeout is 1 minute,
// the connect to each single address will be given 15 seconds to complete
// before trying the next one.
//
// See func Dial for a description of the network and address
// parameters.
func (d *Dialer) DialContext(ctx context.Context, network, address string) (Conn, error) {
if ctx == nil {
panic("nil context")
}
deadline := d.deadline(ctx, time.Now())
if !deadline.IsZero() {
if d, ok := ctx.Deadline(); !ok || deadline.Before(d) {
subCtx, cancel := context.WithDeadline(ctx, deadline)
defer cancel()
ctx = subCtx
}
}
if oldCancel := d.Cancel; oldCancel != nil {
subCtx, cancel := context.WithCancel(ctx)
defer cancel()
go func() {
select {
case <-oldCancel:
cancel()
case <-subCtx.Done():
}
}()
ctx = subCtx
}
// Shadow the nettrace (if any) during resolve so Connect events don't fire for DNS lookups.
resolveCtx := ctx
if trace, _ := ctx.Value(nettrace.TraceKey{}).(*nettrace.Trace); trace != nil {
shadow := *trace
shadow.ConnectStart = nil
shadow.ConnectDone = nil
resolveCtx = context.WithValue(resolveCtx, nettrace.TraceKey{}, &shadow)
}
// All resolved addresses are obtained here.
addrs, err := d.resolver().resolveAddrList(resolveCtx, "dial", network, address, d.LocalAddr)
if err != nil {
return nil, &OpError{Op: "dial", Net: network, Source: nil, Addr: nil, Err: err}
}
sd := &sysDialer{
Dialer: *d,
network: network,
address: address,
}
var primaries, fallbacks addrList
// FallbackDelay specifies the length of time to wait before
// spawning a RFC 6555 Fast Fallback connection. That is, this
// is the amount of time to wait for IPv6 to succeed before
// assuming that IPv6 is misconfigured and falling back to
// IPv4.
//
// If zero, a default delay of 300ms is used.
// A negative value disables Fast Fallback support.
// FallbackDelay time.Duration
// Note: dualStack() returns d.FallbackDelay >= 0, so the zero value enables it.
if d.dualStack() && network == "tcp" {
primaries, fallbacks = addrs.partition(isIPv4)
} else {
primaries = addrs
}
var c Conn
if len(fallbacks) > 0 {
c, err = sd.dialParallel(ctx, primaries, fallbacks)
} else {
c, err = sd.dialSerial(ctx, primaries)
}
if err != nil {
return nil, err
}
if tc, ok := c.(*TCPConn); ok && d.KeepAlive >= 0 {
setKeepAlive(tc.fd, true)
ka := d.KeepAlive
if d.KeepAlive == 0 {
ka = defaultTCPKeepAlive
}
setKeepAlivePeriod(tc.fd, ka)
testHookSetKeepAlive(ka)
}
return c, nil
}
// dialSerial connects to a list of addresses in sequence, returning
// either the first successful connection, or the first error.
func (sd *sysDialer) dialSerial(ctx context.Context, ras addrList) (Conn, error) {
var firstErr error // The error from the first address is most relevant.
for i, ra := range ras {
select {
case <-ctx.Done():
return nil, &OpError{Op: "dial", Net: sd.network, Source: sd.LocalAddr, Addr: ra, Err: mapErr(ctx.Err())}
default:
}
dialCtx := ctx
if deadline, hasDeadline := ctx.Deadline(); hasDeadline {
partialDeadline, err := partialDeadline(time.Now(), deadline, len(ras)-i)
if err != nil {
// Ran out of time.
if firstErr == nil {
firstErr = &OpError{Op: "dial", Net: sd.network, Source: sd.LocalAddr, Addr: ra, Err: err}
}
break
}
if partialDeadline.Before(deadline) {
var cancel context.CancelFunc
dialCtx, cancel = context.WithDeadline(ctx, partialDeadline)
defer cancel()
}
}
// Try each address in turn and return as soon as one connection succeeds.
c, err := sd.dialSingle(dialCtx, ra)
if err == nil {
return c, nil
}
if firstErr == nil {
firstErr = err
}
}
if firstErr == nil {
firstErr = &OpError{Op: "dial", Net: sd.network, Source: nil, Addr: nil, Err: errMissingAddress}
}
return nil, firstErr
}
// dialSingle attempts to establish and returns a single connection to
// the destination address.
func (sd *sysDialer) dialSingle(ctx context.Context, ra Addr) (c Conn, err error) {
trace, _ := ctx.Value(nettrace.TraceKey{}).(*nettrace.Trace)
if trace != nil {
raStr := ra.String()
if trace.ConnectStart != nil {
trace.ConnectStart(sd.network, raStr)
}
if trace.ConnectDone != nil {
defer func() { trace.ConnectDone(sd.network, raStr, err) }()
}
}
la := sd.LocalAddr
switch ra := ra.(type) {
case *TCPAddr:
la, _ := la.(*TCPAddr)
c, err = sd.dialTCP(ctx, la, ra)
case *UDPAddr:
la, _ := la.(*UDPAddr)
c, err = sd.dialUDP(ctx, la, ra)
case *IPAddr:
la, _ := la.(*IPAddr)
c, err = sd.dialIP(ctx, la, ra)
case *UnixAddr:
la, _ := la.(*UnixAddr)
c, err = sd.dialUnix(ctx, la, ra)
default:
return nil, &OpError{Op: "dial", Net: sd.network, Source: la, Addr: ra, Err: &AddrError{Err: "unexpected address type", Addr: sd.address}}
}
if err != nil {
return nil, &OpError{Op: "dial", Net: sd.network, Source: la, Addr: ra, Err: err} // c is non-nil interface containing nil pointer
}
return c, nil
}
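The partition step in DialContext decides which addresses are dialed first. The sketch below mirrors that idea on plain net.IP values, using the addresses from the nslookup output earlier. The function names are illustrative and this is a simplified model, not the standard library code itself.

```go
package main

import (
	"fmt"
	"net"
)

// isIPv4 reports whether ip is an IPv4 address.
func isIPv4(ip net.IP) bool { return ip.To4() != nil }

// partition models addrList.partition: the family of the first address becomes
// primary, and addresses of the other family become fallbacks.
func partition(ips []net.IP) (primaries, fallbacks []net.IP) {
	var primaryLabel bool
	for i, ip := range ips {
		label := isIPv4(ip)
		if i == 0 || label == primaryLabel {
			primaryLabel = label
			primaries = append(primaries, ip)
		} else {
			fallbacks = append(fallbacks, ip)
		}
	}
	return primaries, fallbacks
}

// parseAll converts textual addresses to net.IP, skipping invalid input.
func parseAll(addrs ...string) []net.IP {
	ips := make([]net.IP, 0, len(addrs))
	for _, a := range addrs {
		if ip := net.ParseIP(a); ip != nil {
			ips = append(ips, ip)
		}
	}
	return ips
}

func main() {
	p, f := partition(parseAll("101.226.137.13", "101.91.0.140",
		"240e:e1:a900:50::4a", "240e:e1:a900:50::49"))
	fmt.Println("both families: primaries", p, "fallbacks", f)

	p, f = partition(parseAll("240e:e1:a900:50::4a", "240e:e1:a900:50::49"))
	fmt.Println("IPv6 only: primaries", p, "fallbacks", f)
}
```

When the list holds only IPv6 addresses there are no fallbacks, so DialContext takes the dialSerial path and IPv6 is the only thing it can try.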
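As the code shows, no address-family filter is applied unless the network explicitly selects IPv4 or IPv6 (network = tcp4/tcp6 or udp4/udp6). In testing, ordinary HTTP requests always used "tcp". With no special settings (dualStack() is true whenever FallbackDelay >= 0), the addresses are partitioned by family: the family of the first resolved address is dialed first and the other family is the fallback. In this environment IPv4 came first, so the client tried IPv4 before IPv6. The test program: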
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
	"time"
)

func main() {
	u := "https://api.mch.weixin.qq.com/pay/unifiedorder"
	b := strings.NewReader("test")
	c := NewClient()
	resp, err := c.Post(u, "text/json;charset=utf-8", b)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println(err)
		return
	}
	ur, _ := url.Parse(u)
	fmt.Println(ur.Host, string(body))
}

// NewClient disables keep-alives so every request dials (and resolves) again.
func NewClient() *http.Client {
	return &http.Client{
		Timeout: 60 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig:   &tls.Config{InsecureSkipVerify: true},
			DisableKeepAlives: true,
			Proxy:             http.ProxyFromEnvironment,
		},
	}
}
Add DNS logging and test in production.
The logging code, from the internetAddrList excerpt above:
// Try as a literal IP address, then as a DNS name.
ips, err := r.lookupIPAddr(ctx, net, host)
fmt.Println("net",net,ips,net[len(net)-1])
Log when the request succeeds and connects over IPv4:
net tcp [{101.226.137.13 } {101.91.0.140 } {240e:e1:a900:50::4a } {240e:e1:a900:50::49 }] 112 api.mch.weixin.qq.com:443
Log when the request fails and connects over IPv6:
net tcp [{240e:e1:a900:50::4a } {240e:e1:a900:50::49 }] 112 api.mch.weixin.qq.com:443
When the failure occurs, the address list contains only IPv6 addresses. No IPv4 address was resolved, so the client has no choice but to dial IPv6.
DNS packet capture
# Normal case
17:05:29.473836 IP shop-rpc-7c84bb44f6-6vj7p.36570 > kube-dns.kube-system.svc.cluster.local.53: 46005+ AAAA? api.mch.weixin.qq.com. (39)
17:05:29.473887 IP shop-rpc-7c84bb44f6-6vj7p.57888 > kube-dns.kube-system.svc.cluster.local.53: 31830+ A? api.mch.weixin.qq.com. (39)
17:05:29.474309 IP kube-dns.kube-system.svc.cluster.local.53 > shop-rpc-7c84bb44f6-6vj7p.36570: 46005 4/0/0 CNAME forward.weixin.qq.com., CNAME forwardtmp.weixin.qq.com., AAAA 240e:e1:a900:50::4a, AAAA 240e:e1:a900:50::49 (258)
17:05:29.474404 IP kube-dns.kube-system.svc.cluster.local.53 > shop-rpc-7c84bb44f6-6vj7p.57888: 31830 4/0/0 CNAME forward.weixin.qq.com., CNAME forwardtmp.weixin.qq.com., A 101.226.137.13, A 101.91.0.140 (234)
# Failure case
17:02:33.045324 IP shop-rpc-7c84bb44f6-6vj7p.48458 > kube-dns.kube-system.svc.cluster.local.53: 41571+ AAAA? api.mch.weixin.qq.com.xwj.svc.cluster.local. (61)
17:02:33.045325 IP shop-rpc-7c84bb44f6-6vj7p.46782 > kube-dns.kube-system.svc.cluster.local.53: 22281+ A? api.mch.weixin.qq.com.xwj.svc.cluster.local. (61)
17:02:33.045846 IP kube-dns.kube-system.svc.cluster.local.53 > shop-rpc-7c84bb44f6-6vj7p.46782: 22281 NXDomain*- 0/1/0 (154)
17:02:33.046166 IP kube-dns.kube-system.svc.cluster.local.53 > shop-rpc-7c84bb44f6-6vj7p.48458: 41571 5/0/0 CNAME api.mch.weixin.qq.com., CNAME forward.weixin.qq.com., CNAME forwardtmp.weixin.qq.com., AAAA 240e:e1:a900:50::4a, AAAA 240e:e1:a900:50::49 (358)
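In the failing capture the queried name carries the suffix xwj.svc.cluster.local. A suffixed name like this is what search-list expansion in resolv.conf produces. The sketch below shows the order in which candidate names would be tried. It assumes `ndots:5` and the search list `xwj.svc.cluster.local svc.cluster.local cluster.local`, which are typical Kubernetes defaults; the pod's actual /etc/resolv.conf is not shown in this post, so both values are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// candidateNames sketches resolv.conf search-list expansion: a name with fewer
// than ndots dots is tried with each search suffix first and as-is last, while
// a name with at least ndots dots is tried as-is first.
func candidateNames(name string, search []string, ndots int) []string {
	if strings.HasSuffix(name, ".") {
		// Already fully qualified: no expansion.
		return []string{name}
	}
	hasNdots := strings.Count(name, ".") >= ndots
	fqdn := name + "."
	var names []string
	if hasNdots {
		names = append(names, fqdn)
	}
	for _, suffix := range search {
		names = append(names, fqdn+suffix+".")
	}
	if !hasNdots {
		names = append(names, fqdn)
	}
	return names
}

func main() {
	// Assumed search list and ndots value; not taken from the original post.
	search := []string{"xwj.svc.cluster.local", "svc.cluster.local", "cluster.local"}
	for _, n := range candidateNames("api.mch.weixin.qq.com", search, 5) {
		fmt.Println(n)
	}
}
```

Under these assumptions the first name tried is the one seen in the failing capture, and the bare name from the normal capture comes last.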
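DNS record types:
A records and CNAME records
An A record maps a domain name to an IP address, while a CNAME record maps a domain name to another domain name, which must eventually lead to an address record. From the client's point of view the result is the same: the name resolves to an IP address.
CNAME records make IP changes easier than A records. A CNAME lets several names map to the same host: when multiple domains must point to the same server IP, give one domain an A record for that IP and make the others aliases (CNAMEs) of it. When the server IP changes, only that one A record has to be updated; the aliased domains follow automatically, with no per-domain change.
A records and AAAA records
Both point to an IP address, but for different IP versions.
An A record points to an IPv4 address and an AAAA record points to an IPv6 address; AAAA is the IPv6 counterpart of the A record.
NXDomain errors
The capture shows that the A query failed and returned NXDomain, so the application received only IPv6 addresses. Note that in the failing capture both queries were sent for the suffixed name api.mch.weixin.qq.com.xwj.svc.cluster.local.
NXDomain can be returned in situations such as:
- Typo: the domain name was misspelled, or the requested domain really does not exist.
- DNS record not yet in effect: if the domain was registered or changed recently, the record may not have propagated to every DNS server yet. In that case you have to wait for it to take effect, usually for the time given by the TTL (Time To Live) value.
- Domain disabled or deleted: the domain may have been suspended or removed, so no matching record exists in DNS.
- DNS server problem: the DNS server itself may be unable to answer correctly because of a server fault, a configuration error, or a network problem.
- DNS cache problem: the local DNS cache may hold expired or incorrect data. Clear the local DNS cache and try the lookup again.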
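Another experiment: resolve api.mch.weixin.qq.com with nslookup and curl
In the same environment, curl and nslookup never hit the resolution failure; so far the problem appears only with the Go net/http client.
DNS capture: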
13:44:11.618584 IP localhost.localdomain.11115 > sgs-dc-01.dobest.corp.domain: 9108+ A? api.mch.weixin.qq.com. (39)
13:44:11.618658 IP localhost.localdomain.11115 > sgs-dc-01.dobest.corp.domain: 11739+ AAAA? api.mch.weixin.qq.com. (39)
13:44:11.619076 IP sgs-dc-01.dobest.corp.domain > localhost.localdomain.11115: 9108 4/0/0 CNAME forward.weixin.qq.com., CNAME forwardtmp.weixin.qq.com., A 101.226.137.13, A 101.91.0.140 (118)
13:44:11.619339 IP sgs-dc-01.dobest.corp.domain > localhost.localdomain.11115: 11739 4/0/0 CNAME forward.weixin.qq.com., CNAME forwardtmp.weixin.qq.com., AAAA 240e:e1:a900:50::4a, AAAA 240e:e1:a900:50::49 (142)
13:44:12.538100 IP localhost.localdomain.43789 > sgs-dc-01.dobest.corp.domain: 9052+ A? api.mch.weixin.qq.com. (39)
13:44:12.538159 IP localhost.localdomain.43789 > sgs-dc-01.dobest.corp.domain: 31643+ AAAA? api.mch.weixin.qq.com. (39)
13:44:12.538692 IP sgs-dc-01.dobest.corp.domain > localhost.localdomain.43789: 9052 4/0/0 CNAME forward.weixin.qq.com., CNAME forwardtmp.weixin.qq.com., A 101.91.0.140, A 101.226.137.13 (118)
13:44:12.538765 IP sgs-dc-01.dobest.corp.domain > localhost.localdomain.43789: 31643 4/0/0 CNAME forward.weixin.qq.com., CNAME forwardtmp.weixin.qq.com., AAAA 240e:e1:a900:50::49, AAAA 240e:e1:a900:50::4a (142)
13:44:13.312403 IP localhost.localdomain.35964 > sgs-dc-01.dobest.corp.domain: 4603+ A? api.mch.weixin.qq.com. (39)
13:44:13.312459 IP localhost.localdomain.35964 > sgs-dc-01.dobest.corp.domain: 25914+ AAAA? api.mch.weixin.qq.com. (39)
13:44:13.313349 IP sgs-dc-01.dobest.corp.domain > localhost.localdomain.35964: 4603 4/0/0 CNAME forward.weixin.qq.com., CNAME forwardtmp.weixin.qq.com., A 101.226.137.13, A 101.91.0.140 (118)
13:44:13.313448 IP sgs-dc-01.dobest.corp.domain > localhost.localdomain.35964: 25914 4/0/0 CNAME forward.weixin.qq.com., CNAME forwardtmp.weixin.qq.com., AAAA 240e:e1:a900:50::4a, AAAA 240e:e1:a900:50::49 (142)
13:44:14.002346 IP localhost.localdomain.25122 > sgs-dc-01.dobest.corp.domain: 52577+ A? api.mch.weixin.qq.com. (39)
13:44:14.002455 IP localhost.localdomain.25122 > sgs-dc-01.dobest.corp.domain: 22434+ AAAA? api.mch.weixin.qq.com. (39)
13:44:14.002983 IP sgs-dc-01.dobest.corp.domain > localhost.localdomain.25122: 22434 4/0/0 CNAME forward.weixin.qq.com., CNAME forwardtmp.weixin.qq.com., AAAA 240e:e1:a900:50::49, AAAA 240e:e1:a900:50::4a (142)
13:44:14.003062 IP sgs-dc-01.dobest.corp.domain > localhost.localdomain.25122: 52577 4/0/0 CNAME forward.weixin.qq.com., CNAME forwardtmp.weixin.qq.com., A 101.91.0.140, A 101.226.137.13 (118)
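With curl and nslookup, the A and AAAA queries are in most cases sent and answered over the same socket (same source port), whereas the Go client uses two different sockets: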
13:45:36.754562 IP localhost.localdomain.60407 > sgs-dc-01.dobest.corp.domain: 20327+ A? api.mch.weixin.qq.com. (39)
13:45:36.754562 IP localhost.localdomain.44534 > sgs-dc-01.dobest.corp.domain: 20065+ AAAA? api.mch.weixin.qq.com. (39)
13:45:36.755395 IP localhost.localdomain.44588 > sgs-dc-01.dobest.corp.domain: 14900+ PTR? 20.136.225.10.in-addr.arpa. (44)
13:45:36.755486 IP sgs-dc-01.dobest.corp.domain > localhost.localdomain.60407: 20327 4/0/0 CNAME forward.weixin.qq.com., CNAME forwardtmp.weixin.qq.com., A 101.226.137.13, A 101.91.0.140 (118)
13:45:36.756127 IP sgs-dc-01.dobest.corp.domain > localhost.localdomain.44588: 14900* 1/0/0 PTR sgs-dc-01.dobest.corp. (79)
13:45:36.756294 IP localhost.localdomain.59552 > sgs-dc-01.dobest.corp.domain: 3799+ PTR? 192.136.225.10.in-addr.arpa. (45)
13:45:36.756886 IP sgs-dc-01.dobest.corp.domain > localhost.localdomain.44534: 20065 4/0/0 CNAME forward.weixin.qq.com., CNAME forwardtmp.weixin.qq.com., AAAA 240e:e1:a900:50::49, AAAA 240e:e1:a900:50::4a (142)
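Suspected cause
My guess is that the Alibaba Cloud DNS server has a protection mechanism against attacks. The Go client resolves every name over two parallel connections, so the request rate may be high enough to get intercepted, which would produce the NXDomain response.
Options for handling unstable DNS resolution
- Switch DNS servers (the cluster currently relies on CoreDNS for resolution, so switching is difficult)
- Add a DNS cache
Final solution
Add a DNS cache. For details see: help.aliyun.com/document_de...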