Background
Recently, thanks to the growth team's continued ad spend, traffic has been climbing steadily, and with it we started receiving alerts that the number of TIME_WAIT (tw) connections had exceeded its threshold. Investigating these TIME_WAIT connections turned out to be a good lesson in how go-redis works.
The Redis client in use is github.com/redis/go-redis/v9 v9.7.1.
Conclusion
Adjusting the client's *Timeout settings brought TIME_WAIT connections down to 0 (from roughly 2000) and ESTABLISHED (etb) connections up to 200 (from roughly 130).
The change:
From
toml
PoolSize = 200
MinIdleConns = 50
ReadTimeout = "30ms"
WriteTimeout = "20ms"
DialTimeout = "20ms"
PoolTimeout = "10ms"
To
toml
PoolSize = 200
MinIdleConns = 50
ReadTimeout = "2s"
WriteTimeout = "2s"
DialTimeout = "3s"
PoolTimeout = "2s"
staleConn (the PromQL behind the panel; it roughly estimates removed connections as pool misses that are not accounted for by growth in total connections):

sum (rate(redis_pool_events{type="miss", env="prod"}[1m]))-clamp_min(sum(delta(redis_pool_connections{state="total", env="prod"}[1m])), 0)

Investigating TIME_WAIT
Logging into an instance, netstat -apn immediately shows swaths of connections to 6379 sitting in TIME_WAIT: roughly 2000 TIME_WAIT against about 150 ESTABLISHED.
bash
root@abc-864774d65d-4gx68:~/etc# netstat -apn | grep 172.0.10.1:6379 | awk '{print $6}' | sort | uniq -c
109 ESTABLISHED
2350 TIME_WAIT
root@abc-864774d65d-4gx68:~/etc# netstat -apn | grep 172.0.10.1:6379 | awk '{print $6}' | sort | uniq -c
116 ESTABLISHED
1447 TIME_WAIT
root@abc-864774d65d-4gx68:~/etc# netstat -apn | grep 172.0.10.1:6379 | awk '{print $6}' | sort | uniq -c
123 ESTABLISHED
1273 TIME_WAIT
root@abc-864774d65d-4gx68:~/etc# netstat -apn | grep 172.0.10.1:6379 | awk '{print $6}' | sort | uniq -c
123 ESTABLISHED
1 FIN_WAIT1
1336 TIME_WAIT
root@abc-864774d65d-4gx68:~/etc# netstat -apn | grep 172.0.10.1:6379 | awk '{print $6}' | sort | uniq -c
116 ESTABLISHED
1676 TIME_WAIT
TIME_WAIT accumulates on the side that actively closes the connection, so my service must be closing its Redis connections in large numbers.
Even though we use go-redis's connection pool, the behavior looks just like short-lived connections. I double-checked that the pool really is in use, i.e. that we are not falling into the per-request-client anti-pattern sketched below.
So the problem is most likely in how the pool is being used.
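To be explicit about what "using the pool correctly" means, here is a minimal sketch of the anti-pattern I was ruling out versus the intended usage (badHandler/goodHandler and the key are illustrative, not our real code; the address comes from the netstat output above):
go
package cache

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// Anti-pattern: constructing a client per request creates and tears down a whole
// connection pool every time, which behaves exactly like short-lived connections.
func badHandler(ctx context.Context) error {
	rdb := redis.NewClient(&redis.Options{Addr: "172.0.10.1:6379"})
	defer rdb.Close()
	return rdb.Get(ctx, "some-key").Err()
}

// Intended usage: build the client once at startup and share it, so the pool
// can actually reuse connections across requests.
var sharedRDB = redis.NewClient(&redis.Options{Addr: "172.0.10.1:6379"})

func goodHandler(ctx context.Context) error {
	return sharedRDB.Get(ctx, "some-key").Err()
}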
Connection pool configuration
My Redis connection pool was configured as follows:
PoolSize = 200
MinIdleConns = 50
ReadTimeout = "30ms"
WriteTimeout = "20ms"
DialTimeout = "20ms"
PoolTimeout = "10ms"
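For reference, this is roughly how those settings map onto redis.Options in go-redis v9 (a sketch; the address is taken from the netstat output above and the constructor name is illustrative):
go
package cache

import (
	"time"

	"github.com/redis/go-redis/v9"
)

// newRedisClient builds the client with the pool settings listed above.
func newRedisClient() *redis.Client {
	return redis.NewClient(&redis.Options{
		Addr:         "172.0.10.1:6379",
		PoolSize:     200,
		MinIdleConns: 50,
		ReadTimeout:  30 * time.Millisecond,
		WriteTimeout: 20 * time.Millisecond,
		DialTimeout:  20 * time.Millisecond,
		PoolTimeout:  10 * time.Millisecond,
	})
}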
This raised questions: why do we never reach 200 ESTABLISHED connections, and since the upstream caller gives my service a 50ms context timeout, why can't a read finish even within 30ms?
Logs
The service's error logs show a large number of context canceled and context deadline exceeded errors, i.e. requests timing out and being canceled by the caller.
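For reference, a small sketch of how these two error kinds can be told apart when they bubble up from go-redis calls (classifyErr is a hypothetical helper, not our actual middleware):
go
package logutil

import (
	"context"
	"errors"
)

// classifyErr distinguishes the two error cases seen in our logs; go-redis
// returns errors for which errors.Is works against the context sentinels.
func classifyErr(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, context.DeadlineExceeded):
		return "context deadline exceeded" // the deadline fired before the command finished
	case errors.Is(err, context.Canceled):
		return "context canceled" // the caller gave up and canceled the request
	default:
		return "other"
	}
}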
redis Stats monitoring
The go-redis version is github.com/redis/go-redis/v9 v9.7.1.
go-redis exposes several pool metrics:
go
type Stats struct {
	Hits     uint32 // number of times free connection was found in the pool
	Misses   uint32 // number of times free connection was NOT found in the pool
	Timeouts uint32 // number of times a wait timeout occurred

	TotalConns uint32 // number of total connections in the pool
	IdleConns  uint32 // number of idle connections in the pool
	StaleConns uint32 // number of stale connections removed from the pool
}
The metrics fall into two groups: event counters (hits/misses/timeouts) and connection counts (totalConns/idleConns/staleConns). They are what feed the dashboard panels below; a sketch of an exporter for them follows this list.
hits: number of requests that found an idle, healthy connection in the pool
misses: number of requests that did not find an idle, healthy connection in the pool
timeouts: number of requests that failed to obtain a connection within PoolTimeout
totalConns: total number of connections in the pool
idleConns: number of idle, healthy connections
staleConns: number of stale connections removed from the pool
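To get these counters onto a dashboard, one option (a sketch only; our real exporter may differ, and the env label is added by the environment) is to poll rdb.PoolStats() periodically and publish the values under the metric names used in the PromQL above:
go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/redis/go-redis/v9"
)

var (
	// Label values mirror the queries earlier in the post:
	// redis_pool_connections{state="total|idle|stale"} and
	// redis_pool_events{type="hit|miss|timeout"}.
	poolConns = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "redis_pool_connections"}, []string{"state"})
	poolEvents = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "redis_pool_events"}, []string{"type"})
)

// ExportPoolStats polls the client's pool stats and exposes them as gauges.
// The event counters are cumulative inside go-redis, so PromQL rate()/delta()
// can be applied to the exported values.
func ExportPoolStats(rdb *redis.Client) {
	prometheus.MustRegister(poolConns, poolEvents)
	go func() {
		for range time.Tick(15 * time.Second) {
			s := rdb.PoolStats()
			poolConns.WithLabelValues("total").Set(float64(s.TotalConns))
			poolConns.WithLabelValues("idle").Set(float64(s.IdleConns))
			poolConns.WithLabelValues("stale").Set(float64(s.StaleConns))
			poolEvents.WithLabelValues("hit").Set(float64(s.Hits))
			poolEvents.WithLabelValues("miss").Set(float64(s.Misses))
			poolEvents.WithLabelValues("timeout").Set(float64(s.Timeouts))
		}
	}()
}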
A few panels stand out:
1/ The stale count is high

2/ There is a meaningful number of misses

Moreover, the miss count is positively correlated with request volume: every traffic peak produces a burst of misses.
3/ The timeouts panel is flat, meaning we never time out while waiting to get a connection from the pool.
So let's analyze two things:
1/ How stale connections are produced
2/ How misses are produced
Tracing backwards from the StaleConns counter, the logic is roughly:
NewClient -> c.init -> c.initHooks -> c.baseClient.process -> c._process -> c.withConn -> c.releaseConn -> isBadConn -> c.connPool.Remove -> p.removeConnInternal -> p.removeConn -> atomic.AddUint32(&p.stats.StaleConns, 1)
p.stats.StaleConns is atomically incremented whenever a connection is about to be removed from the pool.
The key method that decides whether a connection gets removed is isBadConn:
go
// Location: github.com/redis/go-redis/v9@v9.7.1/error.go:81
func isBadConn(err error, allowTimeout bool, addr string) bool {
	if err == nil {
		return false
	}

	// Check for context errors (works with wrapped errors)
	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
		return true
	}

	// Check for pool timeout errors (works with wrapped errors)
	if errors.Is(err, pool.ErrConnUnusableTimeout) {
		return true
	}

	if isRedisError(err) {
		switch {
		case isReadOnlyError(err):
			// Close connections in read only state in case domain addr is used
			// and domain resolves to a different Redis Server. See #790.
			return true
		case isMovedSameConnAddr(err, addr):
			// Close connections when we are asked to move to the same addr
			// of the connection. Force a DNS resolution when all connections
			// of the pool are recycled
			return true
		default:
			return false
		}
	}

	if allowTimeout {
		// Check for network timeout errors (works with wrapped errors)
		var netErr net.Error
		if errors.As(err, &netErr) && netErr.Timeout() {
			return false
		}
	}

	return true
}
The context check in this method, errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded), is the main source of the TIME_WAIT connections I was seeing.
Because our ReadTimeout and the caller's context timeout are both very short (ReadTimeout is 30ms, and the upstream caller invokes my service with a 50ms context timeout), we produce large numbers of context.Canceled / context.DeadlineExceeded errors; each of them makes the pool remove the connection, and closing that connection is what leaves it in TIME_WAIT.
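To see the chain end to end, here is a minimal reproduction sketch (assuming a disposable local Redis at localhost:6379; DEBUG SLEEP may be disabled on managed instances, and the exact error type that surfaces can vary):
go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	rdb := redis.NewClient(&redis.Options{
		Addr:        "localhost:6379",
		ReadTimeout: 30 * time.Millisecond, // same short ReadTimeout as the old config
	})

	for i := 0; i < 10; i++ {
		// 50ms caller deadline, mirroring the upstream context timeout.
		ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
		// DEBUG SLEEP stalls the server for 100ms, well past the 50ms deadline.
		// When the resulting error is context.DeadlineExceeded (or context.Canceled),
		// isBadConn returns true, the pool removes the connection, and closing it
		// from our side is what produces the TIME_WAIT socket.
		_ = rdb.Do(ctx, "DEBUG", "SLEEP", "0.1").Err()
		cancel()
	}

	// StaleConns should have grown with the failed calls, and
	// `netstat -apn | grep 6379` on this host should now show TIME_WAIT entries.
	fmt.Printf("%+v\n", rdb.PoolStats())
}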
Results
OK, so that is where the TIME_WAIT connections come from. After adjusting the *Timeout settings, the TIME_WAIT count improved dramatically and effectively disappeared.
From
toml
PoolSize = 200
MinIdleConns = 50
ReadTimeout = "30ms"
WriteTimeout = "20ms"
DialTimeout = "20ms"
PoolTimeout = "10ms"
To
toml
PoolSize = 200
MinIdleConns = 50
ReadTimeout = "2s"
WriteTimeout = "2s"
DialTimeout = "3s"
PoolTimeout = "2s"
The results:
staleConn:


No more TIME_WAIT connections on the container:
bash
root@abc-6c4ffc8864-4q7nc:~# netstat -apn | grep 172.0.10.1:6379 | awk '{print $6}' | sort | uniq -c
200 ESTABLISHED
root@abc-6c4ffc8864-4q7nc:~# netstat -apn | grep 172.0.10.1:6379 | awk '{print $6}' | sort | uniq -c
200 ESTABLISHED
root@abc-6c4ffc8864-4q7nc:~# netstat -apn | grep 172.0.10.1:6379 | awk '{print $6}' | sort | uniq -c
200 ESTABLISHED
References
source: https://uptrace.dev/blog/golang-context-timeout
The relevant content, quoted below:
You probably should avoid ctx.WithTimeout or ctx.WithDeadline with code that makes network calls. Here is why.
Using context for cancellation
Typically, context.Context is used to cancel operations like this:
go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx := context.Background()
	ctx, cancel := context.WithTimeout(ctx, time.Second)
	defer cancel()

	select {
	case <-ctx.Done():
		fmt.Println(ctx.Err())
		fmt.Println("cancelling...")
	}
}
Later, you can use such context with, for example, Redis client:
go
import "github.com/go-redis/redis/v8"
rdb := redis.NewClient(...)
ctx := context.Background()
ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
defer cancel()
val, err := rdb.Get(ctx, "redis-key").Result()
At first glance, the code above works fine. But what happens when rdb.Get operation exceeds the timeout?
Context deadline exceeded
When context is cancelled, go-redis and most other database clients (including database/sql) must do the following:
Close the connection, because it can't be safely reused.
Open a new connection.
Perform TLS handshake using the new connection.
Optionally, pass some authentication checks, for example, using Redis AUTH command.
Effectively, your application does not use the connection pool any more which makes each operation slower and increases the chance of exceeding the timeout again. The result can be disastrous.
Technically, this problem is not caused by context.Context and using small deadlines with net.Conn can cause similar issues. But because context.Context imposes a single timeout on all operations that use the context, each individual operation has a random timeout which depends on timings of previous operations.
What to do instead?
Your first option is to use fixed net.Conn deadlines:
go
var cn net.Conn
cn.SetDeadline(time.Now().Add(3 * time.Second))
With go-redis, you can use ReadTimeout and WriteTimeout options which control net.Conn deadlines:
go
rdb := redis.NewClient(&redis.Options{
	ReadTimeout:  3 * time.Second,
	WriteTimeout: 3 * time.Second,
})
Alternatively, you can also use a separate context timeout for each operation:
go
ctx := context.Background()
ctx1, cancel1 := context.WithTimeout(ctx, time.Second)
defer cancel1()
op1(ctx1)
ctx2, cancel2 := context.WithTimeout(ctx, time.Second)
defer cancel2()
op2(ctx2)
You should also avoid timeouts smaller than 1 second, because they have the same problem. If you must deliver an SLA no matter what, you can make sure to generate a response in time but let the operation continue in the background:
go
func handler(w http.ResponseWriter, req *http.Request) {
	// Process asynchronously in a goroutine.
	ch := process(req)

	select {
	case <-ch:
		// success
	case <-time.After(time.Second):
		// unknown result
	}
}