Concurrent Programming (4): WaitGroup for Goroutine Coordination

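Basic Usage

In the counter example, 1000 goroutines are created and each performs a +1 operation concurrently. A WaitGroup is then used to wait for all of the goroutines to finish before count is printed.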

package main

import (
    "fmt"
    "sync"
)

type Counter struct {
    m     sync.Mutex
    count int
}

func (c *Counter) Incr() {
    c.m.Lock()
    c.count++
    c.m.Unlock()
}

func main() {
    c := Counter{}
    wg := &sync.WaitGroup{}
    for i := 0; i < 1000; i++ {
       wg.Add(1)
       go func() {
          defer wg.Done()
          c.Incr()
       }()
    }
    // Wait for all goroutines to finish
    wg.Wait()
    fmt.Println(c.count)
}

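WaitGroup is designed to coordinate multiple goroutines: one goroutine blocks until the work done by the others has finished. At its core, WaitGroup is a counter. Add adjusts the counter by a delta (which may be negative), Done decrements it by one, and Wait blocks until it reaches 0.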

type WaitGroup
// Add adds delta (which may be negative) to the counter. If the counter becomes 0, all goroutines blocked on Wait are released.
func (wg *WaitGroup) Add(delta int)
// Done decrements the counter by one.
func (wg *WaitGroup) Done()
// Wait blocks until the counter is 0.
func (wg *WaitGroup) Wait()

The official documentation provides an example that starts three goroutines to issue HTTP requests concurrently and returns once all of them have finished.

func ExampleWaitGroup() {
    var wg sync.WaitGroup
    var urls = []string{
       "http://www.golang.org/",
       "http://www.google.com/",
       "http://www.example.com/",
    }
    for _, url := range urls {
       // Increment the counter
       wg.Add(1)
       // Start a goroutine
       go func(url string) {
          // Decrement the counter when the goroutine finishes
          defer wg.Done()
          // Fetch the URL.
          http.Get(url)
       }(url)
    }
    // Wait for all goroutines to finish
    wg.Wait()
}

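Caveats

Negative counter

If the counter of a WaitGroup goes negative, the call panics immediately. The most common causes are calling Done more times than Add accounted for, or passing Add a negative delta that is larger in magnitude than the current counter.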

func main() {
    wg := sync.WaitGroup{}
    wg.Add(1)
    wg.Done()
    // panic: sync: negative WaitGroup counter
    wg.Done()
}
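
The panic can also be observed without crashing the process. The sketch below recovers it and returns the message; doneTwice is a hypothetical helper written only for this illustration and is not part of the sync package.

```go
package main

import (
	"fmt"
	"sync"
)

// doneTwice (hypothetical helper) calls Done twice after a single Add(1)
// and returns the recovered panic message, or "" if nothing panicked.
func doneTwice() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	var wg sync.WaitGroup
	wg.Add(1)
	wg.Done() // counter: 1 -> 0
	wg.Done() // counter would become -1, so this call panics
	return ""
}

func main() {
	fmt.Println(doneTwice())
}
```

In real code the usual remedy is to pair every Add(1) with exactly one deferred Done rather than to recover from the panic.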

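Wait not ordered after Add

Another common problem is that Wait is not guaranteed to run after Add, which leads to unexpected behavior. In the case below, Add is called inside the goroutine, so the relative order of Wait and Add is not guaranteed: Wait may observe a counter of 0 and return before any goroutine has registered itself, and coordination breaks. Add must therefore happen before Wait. More concretely, call Add before the goroutine is created, or before whatever other event needs to be waited for.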

func main() {
    var wg sync.WaitGroup
    var urls = []string{
       "http://www.golang.org/",
       "http://www.google.com/",
       "http://www.example.com/",
    }
    for _, url := range urls {
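       // wg.Add(1) // correct placement: before the go statement
       go func(url string) {
          wg.Add(1) // wrong: Add runs inside the goroutine and may race with Wait
          defer wg.Done()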
          func() {
             time.Sleep(time.Millisecond * 10)
             fmt.Printf("process url: %s", url)
          }()
       }(url)
    }
    wg.Wait()
}
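
A corrected variant, as a minimal sketch: Add is called in the loop before each go statement, so every registration happens before Wait. The helper processAll and the atomic counter are assumptions added here to make the result observable; the URLs are never actually fetched.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// processAll (hypothetical helper) starts one goroutine per URL and
// returns how many goroutines finished before Wait returned.
func processAll(urls []string) int {
	var wg sync.WaitGroup
	var processed atomic.Int64
	for _, url := range urls {
		wg.Add(1) // register before the goroutine is created
		go func(url string) {
			defer wg.Done()
			_ = url // stand-in for real work on the URL
			processed.Add(1)
		}(url)
	}
	wg.Wait() // every Add above has already happened
	return int(processed.Load())
}

func main() {
	urls := []string{
		"http://www.golang.org/",
		"http://www.google.com/",
		"http://www.example.com/",
	}
	fmt.Println(processAll(urls))
}
```

Because all three Add calls run on the same goroutine that later calls Wait, Wait cannot return until all three goroutines have called Done.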

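Reusing a WaitGroup before Wait has returned

A WaitGroup can be reused: once the counter has returned to zero and all Wait calls have returned, Add, Done, and Wait can be used again for another round of coordination.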

func main() {
    var wg sync.WaitGroup
    var urls = []string{
       "http://www.golang.org/",
       "http://www.google.com/",
       "http://www.example.com/",
    }
    for _, url := range urls {
       wg.Add(1)
       go func(url string) {
          defer wg.Done()
          func() {
             time.Sleep(time.Millisecond * 10)
             fmt.Printf("process url: %s", url)
          }()
       }(url)
    }
    wg.Wait()
    // Reuse wg for a second round
    for _, url := range urls {
       wg.Add(1)
       go func(url string) {
          defer wg.Done()
          func() {
             time.Sleep(time.Millisecond * 10)
             fmt.Printf("process url: %s", url)
          }()
       }(url)
    }
    wg.Wait()
}

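Before a WaitGroup is reused, the counter must be back at zero and every previous Wait call must have returned; otherwise the program misbehaves. In the example below, Add is called again before Wait has returned (Wait and Add run concurrently), which can trigger a panic. In short, a WaitGroup must not be reused until the previous Wait has returned.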

func main() {
    wg := sync.WaitGroup{}
    wg.Add(1)
    go func() {
       time.Sleep(time.Millisecond) // give Wait a chance to be called first
       wg.Done()
       wg.Add(1)
    }()
    // panic: sync: WaitGroup is reused before previous Wait has returned
    wg.Wait()
}
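
One way to stay on the safe side, sketched under the assumption that rounds run strictly one after another: issue the next round's Add calls from the same goroutine that called Wait, and only after Wait has returned. The helper runRounds and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runRounds (hypothetical helper) reuses one WaitGroup for several rounds.
// Add for the next round is only called after Wait for the previous round
// has returned, and both calls happen on the same goroutine.
func runRounds(rounds, workers int) int {
	var wg sync.WaitGroup
	var done atomic.Int64
	for r := 0; r < rounds; r++ {
		for w := 0; w < workers; w++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				done.Add(1)
			}()
		}
		wg.Wait() // returns before the next round's Add calls
	}
	return int(done.Load())
}

func main() {
	fmt.Println(runRounds(3, 4))
}
```

When rounds cannot be ordered this way, declaring a fresh WaitGroup per round avoids the reuse question altogether.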

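Source Walkthrough

To implement WaitGroup, both the counter and the number of goroutines blocked in Wait need to be stored, and when the counter reaches 0 the waiting goroutines must be woken up.

First, the WaitGroup struct definition. noCopy here is simply an empty type that implements the Lock() and Unlock() methods. An earlier chapter described using vet to detect a Mutex being copied after use. Adding noCopy to WaitGroup lets vet detect value copies of a WaitGroup in the same way: vet checks every struct that implements Lock and Unlock for value copies, so including a noCopy field in WaitGroup is enough. If you do not want a custom struct to be copied, you can add a noCopy field to it as well.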

type WaitGroup struct {
    noCopy noCopy
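    // A familiar trick: for performance, one field carries two values. The high 32 bits are the counter; the low 32 bits are the number of waiters (goroutines blocked in Wait).
    state atomic.Uint64 // high 32 bits are counter, low 32 bits are waiter count.
    // Semaphore used to wake the goroutines blocked in Wait
    sema  uint32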
}
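
The same marker idea can be applied to a custom type. The following is a minimal sketch: the noCopy type is redeclared locally because the one in package sync is unexported, and Stats and hitTwice are hypothetical names used only for illustration.

```go
package main

import "fmt"

// noCopy is a local marker type. Its pointer type has Lock and Unlock
// methods, which is what the vet copylocks check looks for.
type noCopy struct{}

func (*noCopy) Lock()   {}
func (*noCopy) Unlock() {}

// Stats is a hypothetical type that should not be copied after first use.
type Stats struct {
	noCopy noCopy
	hits   int
}

// Hit increments the counter through a pointer receiver and returns the new value.
func (s *Stats) Hit() int {
	s.hits++
	return s.hits
}

// hitTwice (hypothetical helper) uses Stats only through a pointer.
func hitTwice() int {
	s := &Stats{}
	s.Hit()
	return s.Hit()
}

func main() {
	fmt.Println(hitTwice())
	// A value copy such as `c := *s` is the kind of statement the vet
	// copylocks check is expected to report for a type like Stats.
}
```

The marker has no effect at run time or at compile time; it only gives go vet something to flag.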

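The Add method adjusts the counter by delta, which may be negative. When reading the source, the race.Enabled branches can be ignored at first; they only run when the race detector is enabled. The core operation is to update the counter and, when it reaches 0, wake the goroutines blocked in Wait. In addition, the state is checked to detect misuse such as Add being called concurrently with Wait.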

func (wg *WaitGroup) Add(delta int) {
    if race.Enabled {
       if delta < 0 {
          // Synchronize decrements with Wait.
          race.ReleaseMerge(unsafe.Pointer(wg))
       }
       race.Disable()
       defer race.Enable()
    }
    // Add delta to the high 32 bits
    state := wg.state.Add(uint64(delta) << 32)
    // High 32 bits: the counter value
    v := int32(state >> 32)
    // Low 32 bits: the number of goroutines blocked in Wait
    w := uint32(state)
    if race.Enabled && delta > 0 && v == int32(delta) {
       // The first increment must be synchronized with Wait.
       // Need to model this as a read, because there can be
       // several concurrent wg.counter transitions from 0.
       race.Read(unsafe.Pointer(&wg.sema))
    }
    // A negative counter panics immediately
    if v < 0 {
       panic("sync: negative WaitGroup counter")
    }
    // Add was called concurrently with Wait: waiters exist, yet this Add moved the counter up from 0
    if w != 0 && delta > 0 && v == int32(delta) {
       panic("sync: WaitGroup misuse: Add called concurrently with Wait")
    }
    // The counter is still positive, or no goroutine is waiting: nothing to wake, so return
    if v > 0 || w == 0 {
       return
    }
    // At this point goroutines are blocked in Wait and the counter has just reached 0.
    // No other goroutine should be calling Wait or Add concurrently, so state is
    // reloaded atomically to check that nothing has modified it.
    // This goroutine has set counter to 0 when waiters > 0.
    // Now there can't be concurrent mutations of state:
    // - Adds must not happen concurrently with Wait,
    // - Wait does not increment waiters if it sees counter == 0.
    // Still do a cheap sanity check to detect WaitGroup misuse.
    if wg.state.Load() != state {
       panic("sync: WaitGroup misuse: Add called concurrently with Wait")
    }
    // Reset waiters count to 0.
    wg.state.Store(0)
    // Release the semaphore once per waiter to wake the goroutines blocked in Wait
    for ; w != 0; w-- {
       runtime_Semrelease(&wg.sema, false, 0)
    }
}
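
The bit arithmetic on state can be tried in isolation. The sketch below mirrors the expressions used in Add without any atomics; pack, unpack, and addDelta are hypothetical helpers and do not exist in package sync.

```go
package main

import "fmt"

// pack builds a state word: counter in the high 32 bits, waiters in the low 32.
func pack(counter int32, waiters uint32) uint64 {
	return uint64(uint32(counter))<<32 | uint64(waiters)
}

// unpack splits a state word the same way Add and Wait do.
func unpack(state uint64) (counter int32, waiters uint32) {
	return int32(state >> 32), uint32(state)
}

// addDelta mirrors wg.state.Add(uint64(delta) << 32) as plain arithmetic.
// A negative delta wraps around in uint64, which subtracts from the high half.
func addDelta(state uint64, delta int) uint64 {
	return state + uint64(delta)<<32
}

func main() {
	// A Done on counter 2 with one waiter leaves counter 1, waiters 1.
	c, w := unpack(addDelta(pack(2, 1), -1))
	fmt.Println(c, w)
	// One Done too many: the int32 view of the counter becomes -1.
	c, w = unpack(addDelta(pack(0, 0), -1))
	fmt.Println(c, w)
}
```

Keeping both values in one word is what lets a single atomic operation update the counter and read the waiter count consistently.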

Done simply calls Add with a delta of -1.

func (wg *WaitGroup) Done() {
    wg.Add(-1)
}
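
Since Done is only Add(-1), the two are interchangeable, and Add can also register a whole batch at once. A minimal sketch, with batch as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// batch (hypothetical helper) registers n tasks with a single Add(n).
// Even-numbered tasks finish with Done, odd-numbered ones with Add(-1).
func batch(n int) int {
	var wg sync.WaitGroup
	var finished atomic.Int64
	wg.Add(n) // one Add for the whole batch, before any goroutine starts
	for i := 0; i < n; i++ {
		go func(i int) {
			finished.Add(1)
			if i%2 == 0 {
				wg.Done()
			} else {
				wg.Add(-1) // same effect as Done
			}
		}(i)
	}
	wg.Wait()
	return int(finished.Load())
}

func main() {
	fmt.Println(batch(5))
}
```

Done remains the more readable choice in practice; Add(-1) appears here only to show the equivalence.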

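Wait blocks while the counter is not 0. Underneath, it uses the semaphore mechanism provided by the runtime: when an Add call brings the counter to 0, it performs a V operation (release), which wakes the goroutines blocked in the P operation (acquire).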

func (wg *WaitGroup) Wait() {
    if race.Enabled {
       race.Disable()
    }
    for {
       state := wg.state.Load()
       // High 32 bits: the counter value
       v := int32(state >> 32)
       // Low 32 bits: the number of goroutines blocked in Wait
       w := uint32(state)
       if v == 0 {
          // Counter is 0, no need to wait.
          if race.Enabled {
             race.Enable()
             race.Acquire(unsafe.Pointer(wg))
          }
          return
       }
       // Increment waiters count.
       if wg.state.CompareAndSwap(state, state+1) {
          if race.Enabled && w == 0 {
             // Wait must be synchronized with the first Add.
             // Need to model this is as a write to race with the read in Add.
             // As a consequence, can do the write only for the first waiter,
             // otherwise concurrent Waits will race with each other.
             race.Write(unsafe.Pointer(&wg.sema))
          }
          // P operation: block until the counter reaches 0 and the semaphore is released
          runtime_Semacquire(&wg.sema)
          // A non-zero state here means Add was called again before this Wait returned
          if wg.state.Load() != 0 {
             panic("sync: WaitGroup is reused before previous Wait has returned")
          }
          if race.Enabled {
             race.Enable()
             race.Acquire(unsafe.Pointer(wg))
          }
          return
       }
    }
}

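As the source shows, WaitGroup is built mainly on atomic operations and a semaphore. The ordering of Wait and Add must be controlled strictly, because calling them concurrently can cause a panic. Note also that Wait can be called from several goroutines at once, all waiting for the counter to reach 0. In the example below, besides the main goroutine, a separate goroutine also waits for the tasks to finish.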

func main() {
    var wg sync.WaitGroup
    var urls = []string{
       "http://www.golang.org/",
       "http://www.google.com/",
       "http://www.example.com/",
    }
    for _, url := range urls {
       wg.Add(1)
       go func(url string) {
          defer wg.Done()
          func() {
             time.Sleep(time.Millisecond * 10)
          }()
       }(url)
    }

    go func() {
       wg.Wait()
       fmt.Println("all url processed")
    }()
    wg.Wait()
    time.Sleep(time.Millisecond)
}

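Summary

WaitGroup is a coordination tool whose main use case is waiting for multiple goroutines to finish their tasks before doing something else. When using it, make sure Add runs before Wait and avoid calling the two concurrently. In the implementation, the high 32 bits of a uint64 hold the counter; because the counter is read as an int32, it can track at most 2**31 - 1 outstanding goroutines. The low 32 bits hold the number of waiters. Wait blocks on a semaphore: while the counter is not 0, the caller blocks until another goroutine releases the semaphore.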
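
The counter limit can be probed directly. This sketch assumes the implementation walked through above, where the counter is read back as an int32; overflowMessage is a hypothetical helper, and the behavior is an implementation detail rather than a documented guarantee.

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// overflowMessage (hypothetical helper) pushes the counter one past
// math.MaxInt32 and returns the recovered panic message, or "" if none.
func overflowMessage() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	var wg sync.WaitGroup
	wg.Add(math.MaxInt32) // counter is now 2**31 - 1
	wg.Add(1)             // the int32 view of the counter wraps to a negative value
	return ""
}

func main() {
	fmt.Println(overflowMessage())
}
```

Under that assumption the second Add is treated like any other negative counter, so the limit surfaces as a panic rather than as silent wraparound.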