记一次golang项目context引发的进程OOM故障

之前写过一篇一种基于etcd实践节点自动故障转移的思路，程序经历过一次线上进程内存持续上涨终OOOM的小事故，本次技术复盘导致本次内存泄露的完整起因。

提炼代码：

业务函数etcdWatchLoop：基于etcd的Watch机制持续监听/foo前缀键值对的变更; 收到Watch信道的变更消息，就去查询当前键值对。

golang 复制代码

func etcdWatchLoop() error {
	ctx, cancle := context.WithTimeout(context.Background(), time.Second*5)
	defer cancle()
	wchan := eClient.Watch(ctx, "/foo", clientv3.WithPrefix())

   var tick = time.NewTicker(time.Minute * 1)
	defer tick.Stop()
	for {
		select {
                case <-tick.C: // 1min 探测一次，防止假死
			fmt.Println("watch tick")
		case resp := <-wchan:
			fmt.Printf("watch result: %v \n", resp)
			ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
			defer cancel()
			if r, err := eClient.Get(ctx, "/foo"); err != nil {
				fmt.Println(err)
			} else {
				// todo logic 
			}
		}
	}
}

程序日志显示：程序进入死循环。

 watch result: {{0 0 0 0 {} [] 0} [] 0 false false <nil> } 
 watch result: {{0 0 0 0 {} [] 0} [] 0 false false <nil> } 
 watch result: {{0 0 0 0 {} [] 0} [] 0 false false <nil> } 
 .....

当时etcd底层正在压缩或者发生网络问题，watch方法产生的信道resp := <-wchan被cancle了，信道被关闭，程序进入了死循环。

故障产生的第一点：没有关注到从closed的信道中能持续读取到零值，导致进入无限循环。

无限循环（持续发送到etcd的get请求）导致了OOM，那具体是哪块内存泄露呢，高频grpc请求还是其他？

事后重现的示例进程。

ps -p <PID> -o etime=显示程序执行了20:33:12，内存从7M上涨到184M，持续进行中。

执行go tool pprof -http=:8090 http://localhost:6060/debug/pprof/heap) 显示调用grpc请求时与context相关的2处堆内存占用较大且持续增长。

故障点二：代码中的defer cancel()函数并不会执行 ，因为是无限循环，函数不会返回，defer压栈的cancel函数无法出栈执行。

godoc：
Calling the CancelFunc cancels the child and its children, removes the parent's reference to the child, and stops any associated timers. Failing to call the CancelFunc leaks the child and its children until the parent is "canceled" or the "timer fires"

修复代码如下：

func etcdWatchLoop1() error {
	ctx, cancle := context.WithTimeout(context.Background(), time.Second*5)
	defer cancle()
	wchan := eClient.Watch(ctx, "/foo", clientv3.WithPrefix()) //  使用超时机制模拟 信道关闭

	var tick = time.NewTicker(time.Minute * 1)
	defer tick.Stop()

	for {
		select {
		case <-tick.C: 
			fmt.Println("watch tick")
		case resp, ok := <-wchan: // 从cancled信道或者超时信道中，信道会关闭，从closed信道会读取到零值，导致死循环
			if ok {
				fmt.Printf("watch result: %v \n", resp)
				ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
				fmt.Printf("watch ptr: %p, %p \n", &ctx, cancel)
				if _, err := eClient.Get(ctx, "/foo"); err != nil {
					fmt.Println(err)
				} else {
                                    // todo logic
				}
				cancel()
			} else {
				wchan = eClient.Watch(ctx, "/foo", clientv3.WithPrefix())
			}
		}
	}
}

利用读信道的参数2，来判断信道是否关闭，如果关闭了，重新初始化监听信道。
context.WithTimeout 产生的cancel，在业务逻辑结束后迅速主动执行。

在本例中，与context相关的内存泄露有两处，且有关联。

<1> 业务函数context.WithTimeout无限循环，未能执行cancel()，导致高频产生的timerCtx堆内存迟迟无法释放。

<2> grpc请求底层源码以第一处产生的timerCtx为父级, 产生的子级cancelCtx接收父级取消传播，此处为父级timerCtx填充了取消信道。

第<1>处：未能调用cancel 导致的内存泄露。

ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
defer cancel()

产生了timerCtx对象, 因函数返回逃逸到堆上（由栈区返回值ctx引用）。

func WithDeadlineCause(parent Context, d time.Time, cause error) (Context, CancelFunc) {
	if parent == nil {
		panic("cannot create context from nil parent")
	}
	if cur, ok := parent.Deadline(); ok && cur.Before(d) {
		// The current deadline is already sooner than the new one.
		return WithCancel(parent)
	}
	c := &timerCtx{    //  withTmeout实际产生了timerCtx对象
		deadline: d,
	}
	c.cancelCtx.propagateCancel(parent, c)  
	dur := time.Until(d)
	if dur <= 0 {
		c.cancel(true, DeadlineExceeded, cause) // deadline has already passed
		return c, func() { c.cancel(false, Canceled, nil) }
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.err == nil {
		c.timer = time.AfterFunc(dur, func() { // 异步启动goroutine执行定时器触发逻辑
			c.cancel(true, DeadlineExceeded, cause)
		})
	}
	return c, func() { c.cancel(true, Canceled, nil) } 
}

context.WithTimeout返回的cancel函数和timer触发函数做了相同的动作：

形成了闭包，捕获了timerCtx对象
与父级context解绑，停止timer资源

区别在于释放的时机：定时器触发函数捕获的timerCtx，要在定时器触发之后才能释放，也就是说timerCtx堆内存被硬生生持有了timeout=10s（连带上timerCtx附加的timer资源）。

于是在本例中，理想情况下，高频产生的timerCtx虽然在10s之后被GC清理，但是架不住无限循环导致的随地分配啊。

有如下简化实验：

for {
	 context.WithTimeout(context.Background(), time.Second*10)
}

GODEBUG = gotrace=1 ./sample 执行程序并打印gc日志：

有关gotrace=1 的输出解释，godoc https://pkg.go.dev/runtime 有详细介绍。

#->#-># MB   heap size at GC mark start, at GC Mark end, and live heap

当第三列值持续上升，说明发生了内存泄露（每次GC之后 live heap在持续上升）。

第<2>处的内存泄露：

在grpc一元请求堆栈函数newClientStreamWithParams内会产生子context： cancelCtx，也会逃逸到堆上（由另一个栈区变量ctx引用)。

WithCancel returns a copy of parent with a new Done channel. The returned

context's Done channel is closed when the returned cancel function is called

or when the parent context's Done channel is closed, whichever happens first.

newClientStreamWithParams
--- ctx, cancel = context.WithCancel(ctx)
--- defer func() {
		if err != nil {
			cancel()
		}
	}()


func WithCancel(parent Context) (ctx Context, cancel CancelFunc) {
	c := withCancel(parent)
	return c, func() { c.cancel(true, Canceled, nil) }
}

func withCancel(parent Context) *cancelCtx {
	if parent == nil {
		panic("cannot create context from nil parent")
	}
	c := &cancelCtx{}
	c.propagateCancel(parent, c)
	return c
}

根据火焰图，此处产生内存泄露的地方是 propagateCancel函数：

设置接受父级的取消传播, 此处是通过懒加载的方式为父级timerCtx填充取消信道，

func (c *cancelCtx) propagateCancel(parent Context, child canceler) {
	c.Context = parent

	done := parent.Done()    // 此函数为timerCtx填充信道， 懒加载
	if done == nil {
		return // parent is never canceled
	}
}
......


func (c *cancelCtx) Done() <-chan struct{} {
	d := c.done.Load()
	if d != nil {
		return d.(chan struct{})
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	d = c.done.Load()
	if d == nil {
		d = make(chan struct{})
		c.done.Store(d)
	}
	return d.(chan struct{})
}

总结

本文复盘了golang项目生产环境某次OOM的现场，记录了本人未能强化的golang的知识点。

从closed信道能持续读取零值
defer 函数压栈，在函数返回之前出栈。
在业务逻辑结束后尽早执行cancel() 解绑子级关系和释放timer资源，避免内存泄露。
强化了pprof的使用方式、理解了火焰图的指标意义
GODEBUG=gotrace=1 输出了gc日志，观察每次gc的堆内存变动。

有些坑位看起来，是菜鸟级别，老鸟轻喷，有则改之无则加勉。

https://go.dev/src/context/context.go?s=9162:9288