文章目录
- 内核级线程模型
- 用户级线程模型
- 两级线程模型
- Golang的线程模型
- G-M-P模型概览
- GMP引言
- [0. 结论](#0. 结论)
- [1. GMP调度模型的设计思想](#1. GMP调度模型的设计思想)
  - [1.1 传统多线程的问题](#1.1 传统多线程的问题)
  - [1.2 Go语言早期引入的 GM 模型](#1.2 Go语言早期引入的 GM 模型)
  - [1.3 当前高效的 GMP 模型](#1.3 当前高效的 GMP 模型)
- [2. 多图详解几种常见的调度场景](#2. 多图详解几种常见的调度场景)
- [3. GMP的数据结构和各种状态](#3. GMP的数据结构和各种状态)
  - [3.1 G 的数据结构和状态](#3.1 G 的数据结构和状态)
  - [3.2 M 的数据结构](#3.2 M 的数据结构)
  - [3.3 P 的数据结构](#3.3 P 的数据结构)
  - [3.4 schedt 的数据结构](#3.4 schedt 的数据结构)
- [4. 调度器的启动](#4. 调度器的启动)
  - [4.1 程序启动流程](#4.1 程序启动流程)
  - [4.2 调度器的启动](#4.2 调度器的启动)
    - [mcommoninit 函数](#mcommoninit 函数)
    - [runtime.procresize 函数](#runtime.procresize 函数)
    - [runtime.p.init 方法](#runtime.p.init 方法)
  - [4.3 怎样创建 G ?](#4.3 怎样创建 G ?)
    - [newproc 函数](#newproc 函数)
    - [runtime.newproc1 函数](#runtime.newproc1 函数)
    - [runtime.gfget 函数](#runtime.gfget 函数)
    - [runtime.malg 函数](#runtime.malg 函数)
    - [runtime.runqput 函数](#runtime.runqput 函数)
    - [runtime.runqputslow 函数](#runtime.runqputslow 函数)
- [5. 调度循环](#5. 调度循环)
  - [runtime.mstart0 函数](#runtime.mstart0 函数)
  - [runtime.schedule 函数](#runtime.schedule 函数)
  - [runtime.globrunqget 函数(老版)](#runtime.globrunqget 函数(老版))
  - [func globrunqget() *g 函数(新版)](#func globrunqget() *g 函数(新版))
  - [func globrunqgetbatch(n int32) (gp *g, q gQueue) 函数(新版)](#func globrunqgetbatch(n int32) (gp *g, q gQueue) 函数(新版))
  - [runtime.runqget 函数](#runtime.runqget 函数)
  - [runtime.findrunnable 函数](#runtime.findrunnable 函数)
- [6. 总结](#6. 总结)
- Reference


用户级线程即协程,由应用程序创建与管理,协程必须与内核级线程绑定之后才能执行。线程由 CPU 调度是抢占式的,协程由用户态调度是协作式的,一个协程让出 CPU 后,才执行下一个协程。

内核级线程模型

内核级线程模型中用户线程与内核线程是一对一关系(1 : 1)。线程的创建、销毁、切换工作都是由内核完成的。应用程序不参与线程的管理工作,只能调用内核级线程编程接口(应用程序创建一个新线程或撤销一个已有线程时,都会进行一次系统调用)。每个用户线程都会被绑定到一个内核线程,用户线程在其生命期内都会绑定到该内核线程;一旦用户线程终止,两个线程都将离开系统。
操作系统调度器管理、调度并分派这些线程。运行时库为每个用户级线程请求一个内核级线程。操作系统的内存管理和调度子系统必须要考虑到数量巨大的用户级线程。操作系统为每个线程创建上下文。进程的每个线程在资源可用时都可以被指派到处理器内核。

用户级线程模型

用户线程模型中的用户线程与内核线程KSE是多对一关系(N : 1)。线程的创建、销毁以及线程之间的协调、同步等工作都是在用户态完成,具体来说就是由应用程序的线程库来完成 。内核对这些是无感知的,内核此时的调度都是基于进程的 。线程的并发处理从宏观来看,任意时刻每个进程只能够有一个线程在运行,且只有一个处理器内核会被分配给该进程。
从上图中可以看出来:库调度器从进程的多个线程中选择一个线程,然后该线程和该进程允许的一个内核线程关联起来。内核线程将被操作系统调度器指派到处理器内核。用户级线程是一种"多对一"的线程映射
两级线程模型

两级线程模型中用户线程与内核线程是多对多关系(N : M)。两级线程模型充分吸收上面两种模型的优点,尽量规避缺点。其线程创建在用户空间中完成,线程的调度和同步也在应用程序中进行。一个应用程序中的多个用户级线程被绑定到一些(小于或等于用户级线程的数目)内核级线程上
Golang的线程模型
Golang在底层实现了混合型线程模型。M 即系统线程,由系统调用产生,一个 M 关联一个 KSE,相当于两级线程模型中的内核级线程;G 为 Goroutine,相当于两级线程模型中的应用级线程。G 与 M 是多对多(N : M)的关系。
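下面用一个最小的可运行示例感受一下 G 的创建:go 关键字创建的每个 Goroutine 都是一个 G,由 Go 运行时调度到某个 M 上执行:
go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) { // 每个 go 语句创建一个 G,由运行时调度到某个 M 上执行
			defer wg.Done()
			fmt.Println("goroutine", id)
		}(i)
	}
	wg.Wait() // 等待所有 G 执行完毕
}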

G-M-P模型概览


GMP引言
众所周知,Go 语言有强大的并发能力,能够简单地通过 go 关键字创建大量的轻量级协程 Goroutine,帮助程序并发执行各种任务。除了会用它,我们还需要掌握其底层原理:每位 Go 程序员最好都花时间把 GMP 调度模型的底层源码学习一遍,才能对这一块 Go 中最为重要的内容有较为深刻的理解。
在学习 Go 语言的 GMP 调度器之前,原以为 GMP 底层原理较为复杂,要花很多时间和精力才能掌握,亲自下场学习之后,才发现其实并不复杂,只需三个多小时就足够:先花半个多小时,学习下刘丹冰Aceld 的 B 站讲解视频《Golang深入理解GPM模型》,然后花两个小时,结合《Go语言设计和实现》6.5节调度器的内容阅读下相关源码,最后,可以快速看看 GoLang-Scheduling In Go 前两篇文章的中译版,这样可以较快掌握 GMP 调度器的设计思想。
当然,如果希望理解的更加透彻,还需要仔细钻研几遍源码,并学习其他各种资料,尤其是 Go 开发者的文章,最好能够输出一篇文章,以加深头脑中神经元的连接和对事情本质的理解,本文就是这一学习思路的结果,希望能帮助到感兴趣的同学。
本文的代码基于 Go1.18.1 版本,跟 Go1.14 版本的调度器的主要逻辑相比,依然没有大的变化,目前看到的改动是调度循环的 runtime.findrunnable() 函数,抽取了多个逻辑封装成了新的方法,比如 M 从 其他 P 上偷取 G 的 runtime.stealWork()。
0. 结论
先给出整篇文章的结论和大纲,便于大家获取重点:
- 为了解决 Go 早期多线程 M 对应多协程 G 调度器的全局锁、中心化状态带来的锁竞争导致的性能下降等问题,Go 开发者引入了处理器 P 结构,形成了当前经典的 GMP 调度模型;
- Go 调度器是指:运行时在用户态提供的、由多个函数组成的一种机制,目的是高效地调度 G 到 M 上去执行;
- Go 调度器的核心思想是:尽可能复用线程 M,避免频繁的线程创建和销毁;利用多核并行能力,限制同时运行(不包含阻塞)的 M 线程数等于 CPU 的核心数目;Work Stealing 任务窃取机制,当某个 M 绑定的 P 没活了,这个 M 会代表自己的 P,从其他 P 的本地运行队列(runq)中窃取一部分 G 来执行;Hand Off 交接机制,为了提高效率,M 阻塞时,会将 M 上 P 的运行队列交给其他 M 执行;基于协作的抢占机制,为了保证公平性和防止 Goroutine 饥饿问题,Go 程序会保证每个 G 运行 10ms 就让出 M,交给其他 G 去执行,这一机制是由单独的系统监控线程通过 retake() 函数给当前的 G 发送抢占信号实现的,让出的 G 如果所在的 P 没有陷入系统调用且本地队列没有满,则优先进入本地 P 队列,否则进入全局队列;基于信号的真抢占机制,Go1.14 引入了基于信号的抢占式调度机制,解决了 GC 垃圾回收和栈扫描时无法被抢占的问题【在 Go 1.14 之前,Go 的调度器主要是协作式的。这意味着 Goroutine 通常只在函数调用、系统调用或显式的运行时调用(如 runtime.Gosched())这些安全点才会让出执行权。这种方式的缺点在于,如果一个 Goroutine 长时间执行一个没有函数调用的密集计算循环(例如,一个大的 for 循环进行纯计算),它就不会主动让出 CPU。Go 1.14 引入的异步抢占机制利用了操作系统的信号,当 GC 需要进行 STW 时,它可以更快地暂停所有 Goroutine,因为不再需要等待那些长时间运行的循环自然地到达函数调用点。sysmon 会主动发出信号来中断这些 Goroutine。类似地,当 GC 需要扫描某个 Goroutine 的栈时,如果该 Goroutine 正在运行且未被抢占,信号机制可以强制其中断,使得栈扫描可以及时进行。这有助于 GC 按预期进度完成,减少整体 GC 时间】;
- 由于数据局部性,新创建的 G 优先放入本地队列;在本地队列满了时,会将本地队列的一半 G 和新创建的 G 打乱顺序,一起放入全局队列;本地队列如果一直没有满,也不用担心,全局队列的 G 永远会有 1/61 的机会被获取到:调度循环中,优先从本地队列获取 G 执行,不过每隔 61 次,就会直接从全局队列获取(示意代码见本节末尾)。至于为啥是 61,Dmitry 的视频讲解了,就是要一个既不大又不小的质数,而且不能跟其他常见的 2 的幂次(如 64、32)重合;
- M 优先执行其所绑定的 P 的本地运行队列中的 G,如果本地队列没有 G,则会从全局队列获取;为了提高效率和负载均衡,会从全局队列获取多个 G,而不是只取一个,个数是自己应该从全局队列中承担的,即 globrunqsize / nprocs + 1;同样,当全局队列没有时,会从其他 M 的 P 上偷取 G 来运行,偷取的个数通常是其他 P 运行队列的一半;
- G 在运行时中的状态可以简化成三种:等待中_Gwaiting、可运行_Grunnable、运行中_Grunning,运行期间大部分情况是在这三种状态间来回切换;
- M 的状态可以简化为只有两种:自旋和非自旋;自旋状态,表示 M 绑定了 P 又没有获取到 G;非自旋状态,表示正在执行 Go 代码中,或正在进入系统调用,或空闲;
- P 结构体中最重要的,是持有一个可运行 G 的长度为 256 的本地环形队列,可以通过 CAS 的方式无锁访问,跟需要加锁访问的全局队列 schedt.runq 相对应;
- 调度器的启动逻辑是:初始化 g0 和 m0,并将二者互相绑定,m0 是程序启动后的初始线程,g0 是 m0 线程的系统栈代表的 G 结构体,负责普通 G 在 M 上的调度切换 --> runtime.schedinit():负责 M、P 的初始化过程,分别调用 runtime.mcommoninit() 初始化 M 的全局列表 allm、调用 runtime.procresize() 初始化全局 P 列表 allp --> runtime.newproc():负责获取空闲的 G 或创建新的 G --> runtime.mstart() 启动调度循环;
- 调度器的循环逻辑是:运行函数 schedule() --> 通过 runtime.globrunqget() 从全局队列、通过 runtime.runqget() 从 P 本地队列、runtime.findrunnable() 从各个地方,获取一个可执行的 G --> 调用 runtime.execute() 执行 G --> 调用 runtime.gogo() 在汇编代码层面上真正执行 G --> 调用 runtime.goexit0() 执行 G 的清理工作,重新将 G 加入 P 的空闲队列 --> 调用 runtime.schedule() 进入下一次调度循环。
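上面提到的 1/61 规则,可以用一段可运行的示意代码表达(pick 为演示自拟的函数,简化自 runtime.schedule 中获取 G 的优先级逻辑,并非运行时源码):
go
package main

import "fmt"

// pick 示意调度循环取 G 的优先级:每 61 次强制看一次全局队列,防止全局队列中的 G 被饿死
func pick(schedtick int, local, global []string) (string, []string, []string) {
	if schedtick%61 == 0 && len(global) > 0 {
		return global[0], local, global[1:] // 每隔 61 次直接从全局队列取
	}
	if len(local) > 0 {
		return local[0], local[1:], global // 平时优先从 P 的本地队列取
	}
	if len(global) > 0 {
		return global[0], local, global[1:] // 本地为空时再找全局队列(findrunnable 的职责之一)
	}
	return "", local, global
}

func main() {
	local, global := []string{"G1", "G2"}, []string{"G3"}
	for tick := 61; tick <= 63; tick++ {
		var g string
		g, local, global = pick(tick, local, global)
		fmt.Println(tick, "->", g) // 61 -> G3(来自全局),62 -> G1,63 -> G2
	}
}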
1. GMP调度模型的设计思想
1.1 传统多线程的问题
在现代的操作系统中,为了提高并发处理任务的能力,一个 CPU 核上通常会运行多个线程,多个线程的创建、切换使用、销毁开销通常较大:
1)一个内核线程的栈通常要占用约 1MB 内存,因为需要分配内存来存放用户栈和内核栈的数据;
2)在一个线程执行系统调用(发生 IO 事件如网络请求或读写文件)不占用 CPU 时,需要及时让出 CPU,交给其他线程执行,这时会发生线程之间的切换;
3)线程在 CPU 上进行切换时,需要保存当前线程的上下文,将待执行的线程的上下文恢复到寄存器中,还需要向操作系统内核申请资源;
在高并发的情况下,大量线程的创建、使用、切换、销毁会占用大量的内存,并浪费较多的 CPU 时间在非工作任务的执行上,导致程序并发处理事务的能力降低。
图1.1 传统多线程之间的切换开销较大
1.2 Go语言早期引入的 GM 模型
为了解决传统内核级的线程的创建、切换、销毁开销较大的问题,Go 语言将线程分为了两种类型:内核级线程 M (Machine),轻量级的用户态的协程 Goroutine,至此,Go 语言调度器的三个核心概念出现了两个:
- M:Machine 的缩写,代表内核线程 OS Thread,是 CPU 调度的基本单元;
- G:Goroutine 的缩写,用户态、轻量级的协程,一个 G 代表了对一段需要被执行的 Go 语言程序的封装;每个 Goroutine 都有自己独立的栈存放自己程序的运行状态;分配的栈大小 2KB,可以按需扩缩容;

图1.2 Go将线程拆分为内核线程 M 和用户线程 G
在早期,Go 将传统线程拆分为了 M 和 G 之后,为了充分利用轻量级的 G 的低内存占用、低切换开销的优点,会在当前一个M上绑定多个 G,某个正在运行中的 G 执行完成后,Go 调度器会将该 G 切换走,将其他可以运行的 G 放入 M 上执行,这时一个 Go 程序中只有一个 M 线程:

图1.3 多个 G 对应一个 M
这个方案的优点是用户态的 G 可以快速切换,不会陷入内核态,缺点是每个 Go 程序都用不了硬件的多核加速能力,并且 G 阻塞会导致跟 G 绑定的 M 阻塞,其他 G 也用不了 M 去执行自己的程序了。
为了解决这些不足,Go 后来快速上线了多线程调度器:

图1.4 多个 M 对应多个 G
每个Go程序,都有多个 M 线程对应多个 G 协程,该方案有以下缺点:
1)全局锁、中心化状态带来的锁竞争导致的性能下降 ;
2)M 会频繁交接 G,导致额外开销、性能下降;每个 M 都得能执行任意的 runnable 状态的 G;
3)每个 M 都需要处理内存缓存,导致大量的内存占用并影响数据局部性;
4)系统调用频繁阻塞和解除阻塞正在运行的线程,增加了额外开销;
1.3 当前高效的 GMP 模型
为了解决多线程调度器的问题,Go 开发者 Dmitry Vyukov 在已有 G、M 的基础上,引入了 P 处理器,由此产生了当前 Go 中经典的 GMP 调度模型。
P:Processor的缩写,代表一个虚拟的处理器 ,它维护一个局部的可运行的 G 队列,可以通过 CAS 的方式无锁访问 ,工作线程 M 优先使用自己的局部运行队列中的 G,只有必要时才会去访问全局运行队列,这大大减少了锁冲突,提高了大量 G 的并发性。每个 G 要想真正运行起来,首先需要被分配一个 P。
如图 1.5 所示,是当前 Go 采用的 GMP 调度模型。可运行的 G 是通过处理器 P 和线程 M 绑定起来的,M 的执行是由操作系统调度器将 M 分配到 CPU 上实现的,Go 运行时调度器负责调度 G 到 M 上执行,主要在用户态运行,跟操作系统调度器在内核态运行相对应。

图1.5 当前高效的GMP调度模型
需要说明的是,Go 调度器也叫 Go 运行时调度器,或 Goroutine 调度器,指的是由运行时在用户态提供的多个函数组成的一种机制,目的是为了高效地调度 G 到 M上去执行 。可以跟操作系统的调度器 OS Scheduler 对比来看,后者负责将 M 调度到 CPU 上运行。从操作系统层面来看,运行在用户态的 Go 程序只是一个请求和运行多个线程 M 的普通进程,操作系统不会直接跟上层的 G 打交道。
至于为什么不直接将本地队列放在 M 上、而是要放在 P 上呢? 这是因为当一个线程 M 阻塞(可能执行系统调用或 IO请求)的时候,可以将和它绑定的 P 上的 G 转移到其他线程 M 去执行,如果直接把可运行 G 组成的本地队列绑定到 M,则万一当前 M 阻塞,它拥有的 G 就不能给到其他 M 去执行了。
Go 对网络 I/O 做了运行时层面的集成优化:当 goroutine 在 net.Conn 的读写上需要等待网络就绪时,goroutine 会被挂起并交由 runtime 的网络轮询器(netpoll)等待事件通知 ,同时当前 P/M 可以继续调度执行其他可运行的 goroutine。这样网络 I/O 的等待通常不会长期占用 OS 线程资源,因此 Go 特别适合高并发、网络 I/O 密集型场景。不过在少数路径(如部分 DNS/cgo 调用等)仍可能发生线程阻塞。
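为了直观感受这一点,下面给出一个示意例子(假设运行环境可以访问 example.com):100 个 goroutine 并发等待网络 I/O,挂起的只是 G,底层的 M 并不会被等待占满:
go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// 等待网络就绪期间,这个 G 被挂起交给 netpoll,M/P 继续调度其他 G
			resp, err := http.Get("https://example.com")
			if err != nil {
				return
			}
			resp.Body.Close()
		}()
	}
	wg.Wait()
	fmt.Println("done")
}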
基于 GMP 模型的 Go 调度器的核心思想是:
- 尽可能复用线程 M:避免频繁的线程创建和销毁;
- 利用多核并行能力:限制同时运行(不包含阻塞)的 M 线程数为 N,N 等于 CPU 的核心数目,这里通过设置 P 处理器的个数为 GOMAXPROCS 来保证,GOMAXPROCS 一般为 CPU 核数;因为 M 和 P 是一一绑定的,没有找到 P 的 M 会放入空闲 M 列表,没有找到 M 的 P 也会放入空闲 P 列表(GOMAXPROCS 的用法见本节末尾的示例);
- Work Stealing 任务窃取机制:M 优先执行其所绑定的 P 的本地队列的 G,如果本地队列为空,可以从全局队列获取 G 运行;如果全局队列也没有,为了提高并发执行的效率,M 可以从其他 M 绑定的 P 的运行队列偷取 G 来执行;这种 GMP 调度模型也叫任务窃取调度模型,这里,任务就是指 G;
- Hand Off 交接机制:M 阻塞时,会将 M 上 P 的运行队列交给其他 M 执行,交接效率要高,才能提高 Go 程序整体的并发度;
- 基于协作的抢占机制:每个真正运行的 G,如果不被打断,将会一直运行下去;为了保证公平,防止新创建的 G 一直获取不到 M 执行造成饥饿问题,Go 程序会保证每个 G 运行 10ms 就要让出 M,交给其他 G 去执行;
- 基于信号的真抢占机制:尽管基于协作的抢占机制能够缓解长时间 GC 导致整个程序无法工作和大多数 Goroutine 饥饿的问题,但仍有部分情况 Go 调度器无法抢占,例如 for 死循环(用户代码)或垃圾回收长时间占用线程(运行时的 STW/标记阶段);为了解决这些问题,Go1.14 引入了基于信号的抢占式调度机制【第一步:监控线程调用操作系统的信号相关系统调用,将指定信号发送给需要抢占的线程;第二步:目标线程收到信号之后,将现场保存到栈上,然后将当前 G 切换为 _Gpreempted 状态,调用 schedule 函数,开启下一轮调度】,能够解决 GC 垃圾回收和栈扫描时存在的问题。
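下面是一个最小示例,演示如何查询/设置上面提到的 GOMAXPROCS(runtime.GOMAXPROCS(0) 表示只查询不修改):
go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("NumCPU:", runtime.NumCPU())          // 机器的 CPU 核数
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // 传 0 表示只查询,默认等于核数
	prev := runtime.GOMAXPROCS(2)                     // 显式限制 P 的个数为 2,返回旧值
	fmt.Println("previous:", prev)
}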
2. 多图详解几种常见的调度场景
在进入 GMP 调度模型的数据结构和源码之前,可以先用几张图形象的描述下 GMP 调度机制的一些场景,帮助理解 GMP 调度器为了保证公平性、可扩展性、及提高并发效率,所设计的一些机制和策略。
1)创建 G:正在 M1 上运行的 G1,通过 go func() 创建 G2 后,由于局部性,G2 优先放入当前 P 的本地队列;【这里"局部性"就是让新创建的 G 先进入当前 P 的本地队列,优先在同一执行上下文跑,提升缓存命中、减少锁竞争和跨核迁移开销。】

图2.1 正在M1上运行的G1通过go func() 创建 G2 后,由于局部性,G2优先放入P的本地队列
2)G 运行完成后:M1 上的 G1 运行完成后(调用goexit()函数),M1 上运行的 Goroutine 会切换为 G0,G0 负责调度协程的切换(运行schedule() 函数),从 M1 上 P 的本地运行队列获取 G2 去执行(函数execute());注意:G0 是每个 M 的系统栈所代表的 G 结构体,负责该 M 上 G 的调度,M0 的 G0 即程序启动时创建的全局 g0;
g0 是 runtime 为 每个 M(OS 线程) 创建的一个特殊 g 结构体,对应该 M 的 系统栈(system stack)。runtime 在执行调度、GC、栈管理等底层逻辑时,会切换到 g0 的系统栈上运行。g0 不用于执行用户代码,也不会像普通 goroutine 那样进入运行队列被调度。每一个 M 都有自己的 g0

图2.2 M1 上的 G1 运行完会切换到 P 本地队列的 G2 运行
3)M 上创建的 G 个数大于本地队列长度时 :如果 P 本地队列最多能存 4 个G(实际上是256个),正在 M1 上运行的 G2 要通过go func()创建 6 个G,那么,前 4 个G 放在 P 本地队列中,G2 创建了第 5 个 G(G7)时,P 本地队列中前一半和 G7 一起打乱顺序放入全局队列 ,P 本地队列剩下的 G 往前移动,G2 创建的第 6 个 G(G8)时,放入 P 本地队列中,因为还有空间;


所以下图有点问题,应该是G7在本地队列,而G8在runnext

图2.3 M1上的G2要创建的G个数多于P本地队列能够存放的G个数时
4)M 的自旋状态:创建新的 G 时,运行的 G 会尝试唤醒其他空闲的 M 绑定 P 去执行,如果 G2 唤醒了 M2,M2 绑定了一个 P2,会先运行 M2 的 G0,这时 M2 没有从 P2 的本地队列中找到 G,会进入自旋状态(spinning);自旋状态的 M2 会尝试从全局运行队列里面获取 G,放到 P2 本地队列去执行,获取的数量满足公式:n = min(globrunqsize/GOMAXPROCS + 1, len(localrunq)/2),含义是每个 P 应该从全局队列承担的 G 数量,为了提高效率,不能太多,要给其他 P 留点;

图2.4 创建新的 G 时,运行的G会尝试唤醒其他空闲的M绑定P去执行
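这个公式可以用一小段可运行的示意代码表达(grabCount 为演示自拟的函数名,简化自 runtime.globrunqget 的计算,并非运行时源码):
go
package main

import "fmt"

// grabCount 计算一个 P 应从全局队列取走的 G 个数(示意)
func grabCount(globRunqSize, gomaxprocs, localRunqCap int) int {
	n := globRunqSize/gomaxprocs + 1
	if n > globRunqSize {
		n = globRunqSize // 不能超过全局队列现有的 G 数
	}
	if n > localRunqCap/2 {
		n = localRunqCap / 2 // 最多取本地队列容量的一半,给其他 P 留余地
	}
	return n
}

func main() {
	fmt.Println(grabCount(100, 8, 256)) // 输出 13
}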
5)任务窃取机制 :自旋状态的 M 会寻找可运行的 G,如果全局队列为空,则会从其他 P 偷取 G 来执行,个数是其他 P 运行队列的一半;

图2.5 自旋状态的 M 会寻找可运行的 G,如果全局队列为空,则会从其他 P 偷取 G 来执行
6)G 发生系统调用时:如果 G 发生系统调用进入阻塞,其所在的 M 也会阻塞,因为会进入内核态等待系统资源,和 M 绑定的 P 会寻找空闲的 M 执行,这是为了提高效率,不能让 P 本地队列的 G 因所在 M 进入阻塞状态而无法执行;需要说明的是,如果是 M 上的 G 进入 Channel 阻塞,则该 M 不会一起进入阻塞,因为 Channel 数据传输只涉及内存拷贝,不涉及系统资源等待;【M 没有阻塞,说明是可以运行 G 的,此时不能让 M 闲着:当前 P 队列如果有可运行的 G,就会尝试获取下一个可运行的 G 来运行;如果当前 P 队列没有可运行的 G,则 M 会解绑当前 P,去绑定下一个 P,获取可运行的 G】

图2.6 如果 G 发生阻塞,其所在的 M 也会阻塞,和 M 绑定的 P 会寻找空闲的 M 执行
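下面的小例子(示意)可以感受 Channel 阻塞只挂起 G、不会阻塞 M:一万个 goroutine 阻塞在 channel 上,配合 GODEBUG=schedtrace=1000 运行可以观察到线程数依然很少:
go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	ch := make(chan struct{})
	for i := 0; i < 10000; i++ {
		go func() { <-ch }() // 阻塞在 channel 上:只挂起 G,不会占用任何 M
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines:", runtime.NumGoroutine()) // 约 10001
	close(ch) // 用 GODEBUG=schedtrace=1000 运行,可看到 threads 数量始终很少
}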
7)G 退出系统调用时:如果刚才进入系统调用的 G2 解除了阻塞,其所在的 M1 会寻找 P 去执行,优先找原来的 P,发现没有找到,则其上的 G2 会进入全局队列,等其他 M 获取执行,M1 进入空闲队列;

图 2.7 当 G 解除阻塞时,所在的 M会寻找 P 去执行,如果没有找到,则 G 进入全局队列,M 进入空闲队列
3. GMP的数据结构和各种状态
3.1 G 的数据结构和状态
/usr/local/go/src/runtime/runtime2.go
G 的数据结构是:
go
type g struct {
// Stack parameters.
// stack describes the actual stack memory: [stack.lo, stack.hi).
// stackguard0 is the stack pointer compared in the Go stack growth prologue.
// It is stack.lo+StackGuard normally, but can be StackPreempt to trigger a preemption.
// stackguard1 is the stack pointer compared in the //go:systemstack stack growth prologue.
// It is stack.lo+StackGuard on g0 and gsignal stacks.
// It is ~0 on other goroutine stacks, to trigger a call to morestackc (and crash).
// 描述了当前 Goroutine 的栈内存范围 [stack.lo, stack.hi)
stack stack // offset known to runtime/cgo
// 用于调度器抢占式调度
stackguard0 uintptr // offset known to liblink
stackguard1 uintptr // offset known to liblink
// 最内侧的 panic 结构体
_panic *_panic // innermost panic - offset known to liblink
// 最内侧的 defer 延迟函数结构体
_defer *_defer // innermost defer
// 当前 G 占用的线程,可能为空
m *m // current m; offset known to arm liblink
// 存储 G 的调度相关的数据
sched gobuf
syscallsp uintptr // if status==Gsyscall, syscallsp = sched.sp to use during gc
syscallpc uintptr // if status==Gsyscall, syscallpc = sched.pc to use during gc
syscallbp uintptr // if status==Gsyscall, syscallbp = sched.bp to use in fpTraceback
stktopsp uintptr // expected sp at top of stack, to check in traceback
// param is a generic pointer parameter field used to pass
// values in particular contexts where other storage for the
// parameter would be difficult to find. It is currently used
// in four ways:
// 1. When a channel operation wakes up a blocked goroutine, it sets param to
// point to the sudog of the completed blocking operation.
// 2. By gcAssistAlloc1 to signal back to its caller that the goroutine completed
// the GC cycle. It is unsafe to do so in any other way, because the goroutine's
// stack may have moved in the meantime.
// 3. By debugCallWrap to pass parameters to a new goroutine because allocating a
// closure in the runtime is forbidden.
// 4. When a panic is recovered and control returns to the respective frame,
// param may point to a savedOpenDeferState.
param unsafe.Pointer
// G 的状态
atomicstatus atomic.Uint32
stackLock uint32 // sigprof/scang lock; TODO: fold in to atomicstatus
// G 的 ID
goid uint64
schedlink guintptr
waitsince int64 // approx time when the g become blocked
//当状态status==Gwaiting时等待的原因
waitreason waitReason // if status==Gwaiting
// 抢占信号
preempt bool // preemption signal, duplicates stackguard0 = stackpreempt
// 抢占时将状态修改成 `_Gpreempted`
preemptStop bool // transition to _Gpreempted on preemption; otherwise, just deschedule
// 在同步安全点收缩栈
preemptShrink bool // shrink stack at synchronous safe point
// asyncSafePoint is set if g is stopped at an asynchronous
// safe point. This means there are frames on the stack
// without precise pointer information.
asyncSafePoint bool
paniconfault bool // panic (instead of crash) on unexpected fault address
gcscandone bool // g has scanned stack; protected by _Gscan bit in status
throwsplit bool // must not split stack
// activeStackChans indicates that there are unlocked channels
// pointing into this goroutine's stack. If true, stack
// copying needs to acquire channel locks to protect these
// areas of the stack.
activeStackChans bool
// parkingOnChan indicates that the goroutine is about to
// park on a chansend or chanrecv. Used to signal an unsafe point
// for stack shrinking.
parkingOnChan atomic.Bool
// inMarkAssist indicates whether the goroutine is in mark assist.
// Used by the execution tracer.
inMarkAssist bool
coroexit bool // argument to coroswitch_m
raceignore int8 // ignore race detection events
nocgocallback bool // whether disable callback from C
tracking bool // whether we're tracking this G for sched latency statistics
trackingSeq uint8 // used to decide whether to track this G
trackingStamp int64 // timestamp of when the G last started being tracked
runnableTime int64 // the amount of time spent runnable, cleared when running, only used when tracking
//G 被锁定只能在这个 m 上运行
lockedm muintptr
fipsIndicator uint8
syncSafePoint bool // set if g is stopped at a synchronous safe point.
runningCleanups atomic.Bool
sig uint32
writebuf []byte
sigcode0 uintptr
sigcode1 uintptr
sigpc uintptr
parentGoid uint64 // goid of goroutine that created this goroutine
gopc uintptr // pc of go statement that created this goroutine
ancestors *[]ancestorInfo // ancestor information goroutine(s) that created this goroutine (only used if debug.tracebackancestors)
startpc uintptr // pc of goroutine function
racectx uintptr
// 这个 g 当前正在阻塞的 sudog 结构体
waiting *sudog // sudog structures this g is waiting on (that have a valid elem ptr); in lock order
cgoCtxt []uintptr // cgo traceback context
labels unsafe.Pointer // profiler labels
timer *timer // cached timer for time.Sleep
sleepWhen int64 // when to sleep until
selectDone atomic.Uint32 // are we participating in a select and did someone win the race?
// goroutineProfiled indicates the status of this goroutine's stack for the
// current in-progress goroutine profile
goroutineProfiled goroutineProfileStateHolder
coroarg *coro // argument during coroutine transfers
bubble *synctestBubble
// Per-G tracer state.
trace gTraceState
// Per-G GC state
// gcAssistBytes is this G's GC assist credit in terms of
// bytes allocated. If this is positive, then the G has credit
// to allocate gcAssistBytes bytes without assisting. If this
// is negative, then the G must correct this by performing
// scan work. We track this in bytes to make it fast to update
// and check for debt in the malloc hot path. The assist ratio
// determines how this corresponds to scan work debt.
gcAssistBytes int64
// valgrindStackID is used to track what memory is used for stacks when a program is
// built with the "valgrind" build tag, otherwise it is unused.
valgrindStackID uintptr
}
G 的主要字段有:
stack:描述了当前 Goroutine 的栈内存范围 [stack.lo, stack.hi);
stackguard0: 可以用于调度器抢占式调度;preempt,preemptStop,preemptShrink跟抢占相关;
panic 和 defer:分别记录这个 G 最内侧的 _panic 和 _defer 结构体;
m:记录当前 G 占用的线程 M,可能为空;
atomicstatus:表示G 的状态;
sched:存储 G 的调度相关的数据;
goid:表示 G 的 ID,对开发者不可见;
需要展开描述的是sched 字段的 runtime.gobuf 结构体:
go
type gobuf struct {
// The offsets of sp, pc, and g are known to (hard-coded in) libmach.
//
// ctxt is unusual with respect to GC: it may be a
// heap-allocated funcval, so GC needs to track it, but it
// needs to be set and cleared from assembly, where it's
// difficult to have write barriers. However, ctxt is really a
// saved, live register, and we only ever exchange it between
// the real register and the gobuf. Hence, we treat it as a
// root during stack scanning, which means assembly that saves
// and restores it doesn't need write barriers. It's still
// typed as a pointer so that any other writes from Go get
// write barriers.
// 栈指针
sp uintptr
// 程序计数器,记录G要执行的下一条指令位置
pc uintptr
// 持有 runtime.gobuf 的 G
g guintptr
ctxt unsafe.Pointer
lr uintptr
bp uintptr // for framepointer-enabled architectures
}
这些字段会在调度器将当前 G 切换离开 M 和调度进入 M 执行程序时用到 ,栈指针 sp 和程序计数器 pc 用来存放或恢复寄存器中的值,改变程序执行的指令。
结构体 runtime.g 的 atomicstatus 字段存储了当前 G 的状态,G 可能处于以下状态:
go
// defined constants
const (
// G status
//
// Beyond indicating the general state of a G, the G status
// acts like a lock on the goroutine's stack (and hence its
// ability to execute user code).
//
// If you add to this list, add to the list
// of "okay during garbage collection" status
// in mgcmark.go too.
//
// TODO(austin): The _Gscan bit could be much lighter-weight.
// For example, we could choose not to run _Gscanrunnable
// goroutines found in the run queue, rather than CAS-looping
// until they become _Grunnable. And transitions like
// _Gscanwaiting -> _Gscanrunnable are actually okay because
// they don't affect stack ownership.
// _Gidle means this goroutine was just allocated and has not
// yet been initialized.
// _Gidle 表示 G 刚刚被分配并且还没有被初始化
_Gidle = iota // 0
// _Grunnable means this goroutine is on a run queue. It is
// not currently executing user code. The stack is not owned.
// _Grunnable 表示 G 没有执行代码,没有栈的所有权,存储在运行队列中
_Grunnable // 1
// _Grunning means this goroutine may execute user code. The
// stack is owned by this goroutine. It is not on a run queue.
// It is assigned an M and a P (g.m and g.m.p are valid).
// _Grunning 可以执行代码,拥有栈的所有权,被赋予了内核线程 M 和处理器 P
_Grunning // 2
// _Gsyscall means this goroutine is executing a system call.
// It is not executing user code. The stack is owned by this
// goroutine. It is not on a run queue. It is assigned an M.
// _Gsyscall 正在执行系统调用,拥有栈的所有权,没有执行用户代码,被赋予了内核线程 M 但是不在运行队列上
_Gsyscall // 3
// _Gwaiting means this goroutine is blocked in the runtime.
// It is not executing user code. It is not on a run queue,
// but should be recorded somewhere (e.g., a channel wait
// queue) so it can be ready()d when necessary. The stack is
// not owned *except* that a channel operation may read or
// write parts of the stack under the appropriate channel
// lock. Otherwise, it is not safe to access the stack after a
// goroutine enters _Gwaiting (e.g., it may get moved).
// _Gwaiting 由于运行时而被阻塞,没有执行用户代码并且不在运行队列上,但是可能存在于 Channel 的等待队列上
_Gwaiting // 4
// _Gmoribund_unused is currently unused, but hardcoded in gdb
// scripts.
_Gmoribund_unused // 5
// _Gdead means this goroutine is currently unused. It may be
// just exited, on a free list, or just being initialized. It
// is not executing user code. It may or may not have a stack
// allocated. The G and its stack (if any) are owned by the M
// that is exiting the G or that obtained the G from the free
// list.
// _Gdead 没有被使用,没有执行代码,可能有分配的栈
_Gdead // 6
// _Genqueue_unused is currently unused.
_Genqueue_unused // 7
// _Gcopystack means this goroutine's stack is being moved. It
// is not executing user code and is not on a run queue. The
// stack is owned by the goroutine that put it in _Gcopystack.
// _Gcopystack 栈正在被拷贝,没有执行代码,不在运行队列上
_Gcopystack // 8
// _Gpreempted means this goroutine stopped itself for a
// suspendG preemption. It is like _Gwaiting, but nothing is
// yet responsible for ready()ing it. Some suspendG must CAS
// the status to _Gwaiting to take responsibility for
// ready()ing this G.
// _Gpreempted 由于抢占而被阻塞,没有执行用户代码并且不在运行队列上,等待唤醒
_Gpreempted // 9
// _Gscan combined with one of the above states other than
// _Grunning indicates that GC is scanning the stack. The
// goroutine is not executing user code and the stack is owned
// by the goroutine that set the _Gscan bit.
//
// _Gscanrunning is different: it is used to briefly block
// state transitions while GC signals the G to scan its own
// stack. This is otherwise like _Grunning.
//
// atomicstatus&~Gscan gives the state the goroutine will
// return to when the scan completes.
// _Gscan GC 正在扫描栈空间,没有执行代码,可以与其他状态同时存在
_Gscan = 0x1000
_Gscanrunnable = _Gscan + _Grunnable // 0x1001
_Gscanrunning = _Gscan + _Grunning // 0x1002
_Gscansyscall = _Gscan + _Gsyscall // 0x1003
_Gscanwaiting = _Gscan + _Gwaiting // 0x1004
_Gscanpreempted = _Gscan + _Gpreempted // 0x1009
)
其中主要的六种状态是:
Gidle:G 被创建但还未完全被初始化;
Grunnable:当前 G 为可运行的,正在等待被运行;
Grunning:当前 G 正在被运行;
Gsyscall:当前 G 正在被系统调用;
Gwaiting:当前 G 正在因某个原因而等待;
Gdead:当前 G 完成了运行;
图3.1描述了G从创建到结束的生命周期中经历的各种状态变化过程:

虽然 G 在运行时中定义的状态较多且复杂,但是我们可以将这些不同的状态聚合成三种:等待中、可运行、运行中,分别由_Gwaiting、_Grunnable、_Grunning 三种状态表示,运行期间大部分情况是在这三种状态来回切换:
- 等待中:G 正在等待某些条件满足,例如:系统调用结束等,包括 _Gwaiting、_Gsyscall 几个状态;
- 可运行:G 已经准备就绪,可以在线程 M 上运行,如果当前程序中有非常多的 G,每个 G 就可能会等待更多的时间,即 _Grunnable;
- 运行中:G 正在某个线程 M 上运行,即 _Grunning。
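这三类状态在 goroutine 的栈回溯里是可见的。下面的小例子(示意)用 runtime.Stack 打印所有 goroutine 的栈,可以看到阻塞在 channel 上的 G 被标注为等待状态:
go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	ch := make(chan int)
	go func() { <-ch }() // 这个 G 将阻塞在 channel 接收上,进入等待中(_Gwaiting)
	time.Sleep(100 * time.Millisecond)

	buf := make([]byte, 1<<16)
	n := runtime.Stack(buf, true) // true 表示打印所有 goroutine 的栈
	fmt.Printf("%s", buf[:n])
	// 输出中能看到类似 "goroutine 18 [chan receive]:" 的状态标注,
	// main goroutine 则标注为 [running]
	close(ch)
}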
3.2 M 的数据结构
M 的数据结构是:
go
type m struct {
// 持有调度栈的 G
g0 *g // goroutine with scheduling stack
morebuf gobuf // gobuf arg to morestack
divmod uint32 // div/mod denominator for arm - known to liblink (cmd/internal/obj/arm/obj5.go)
// Fields not known to debuggers.
procid uint64 // for debuggers, but offset not hard-coded
// 处理 signal 的 g
gsignal *g // signal-handling g
goSigStack gsignalStack // Go-allocated signal handling stack
sigmask sigset // storage for saved signal mask
// 线程本地存储
tls [tlsSlots]uintptr // thread-local storage (for x86 extern register)
// M 的起始函数,由 newm/startm 创建 M 时指定(例如 sysmon)
mstartfn func()
// 在当前线程上运行的 G
curg *g // current running goroutine
caughtsig guintptr // goroutine running during fatal signal
// 执行 go 代码时持有的 p (如果没有执行则为 nil)
p puintptr // attached p for executing go code (nil if not executing go code)
// 用于暂存与当前 M 有潜在关联的 P
nextp puintptr
// 执行系统调用前绑定的 P
oldp puintptr // the p that was attached before executing a syscall
id int64
mallocing int32
throwing throwType
preemptoff string // if != "", keep curg running on this m
locks int32
dying int32
profilehz int32
// 表示当前 M 是否正在寻找 G,在寻找过程中 M 处于自旋状态
spinning bool // m is out of work and is actively looking for work
blocked bool // m is blocked on a note
newSigstack bool // minit on C thread called sigaltstack
printlock int8
incgo bool // m is executing a cgo call
isextra bool // m is an extra m
isExtraInC bool // m is an extra m that does not have any Go frames
isExtraInSig bool // m is an extra m in a signal handler
freeWait atomic.Uint32 // Whether it is safe to free g0 and delete m (one of freeMRef, freeMStack, freeMWait)
needextram bool
g0StackAccurate bool // whether the g0 stack has accurate bounds
traceback uint8
allpSnapshot []*p // Snapshot of allp for use after dropping P in findRunnable, nil otherwise.
ncgocall uint64 // number of cgo calls in total
ncgo int32 // number of cgo calls currently in progress
cgoCallersUse atomic.Uint32 // if non-zero, cgoCallers in use temporarily
cgoCallers *cgoCallers // cgo traceback if crashing in cgo call
park note
alllink *m // on allm
schedlink muintptr
// 表示与当前 M 锁定的那个 G
lockedg guintptr
createstack [32]uintptr // stack that created this thread, it's used for StackRecord.Stack0, so it must align with it.
lockedExt uint32 // tracking for external LockOSThread
lockedInt uint32 // tracking for internal lockOSThread
mWaitList mWaitList // list of runtime lock waiters
mLockProfile mLockProfile // fields relating to runtime.lock contention
profStack []uintptr // used for memory/block/mutex stack traces
// wait* are used to carry arguments from gopark into park_m, because
// there's no stack to put them on. That is their sole purpose.
waitunlockf func(*g, unsafe.Pointer) bool
waitlock unsafe.Pointer
waitTraceSkip int
waitTraceBlockReason traceBlockReason
syscalltick uint32
freelink *m // on sched.freem
trace mTraceState
// these are here because they are too large to be on the stack
// of low-level NOSPLIT functions.
libcall libcall
libcallpc uintptr // for cpu profiler
libcallsp uintptr
libcallg guintptr
winsyscall winlibcall // stores syscall parameters on windows
vdsoSP uintptr // SP for traceback while in VDSO call (0 if not in call)
vdsoPC uintptr // PC for traceback while in VDSO call
// preemptGen counts the number of completed preemption
// signals. This is used to detect when a preemption is
// requested, but fails.
preemptGen atomic.Uint32
// Whether this is a pending preemption signal on this M.
signalPending atomic.Uint32
// pcvalue lookup cache
pcvalueCache pcvalueCache
dlogPerM
mOS
chacha8 chacha8rand.State
cheaprand uint64
// Up to 10 locks held by this m, maintained by the lock ranking code.
locksHeldLen int
locksHeld [10]heldLockInfo
}
M 的字段众多,其中最重要的为下面几个:
g0: Go 运行时系统在启动之初创建的,用来调度其他 G 到 M 上;
mstartfn:表示 M 的起始函数,由创建该 M 时传入(例如系统监控任务 sysmon);go 语句携带的函数则保存在对应 G 的 startpc 字段中;
curg:存放当前正在运行的 G 的指针;
p:指向当前与 M 关联的那个 P;
nextp:用于暂存与当前 M 有潜在关联的 P;
spinning:表示当前 M 是否正在寻找 G,在寻找过程中 M 处于自旋状态;
lockedg:表示与当前M锁定的那个 G,运行时系统会把 一个 M 和一个 G 锁定,一旦锁定就只能双方相互作用,不接受第三者;
M 并没有像 G 和 P 一样的状态标记,但可以认为一个 M 有以下几种状态(可借助下面的 schedtrace 示例观察):
- 自旋中(spinning):M 正在从运行队列获取 G,这时候 M 会拥有一个 P;
- 执行 Go 代码中:M 正在执行 Go 代码,这时候 M 会拥有一个 P;
- 执行原生代码中:M 正在执行原生代码或者阻塞的 syscall,这时 M 并不拥有 P;
- 休眠中:M 发现无待运行的 G 时会进入休眠,并添加到空闲 M 链表中,这时 M 并不拥有 P。
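M 的自旋/空闲状态可以通过 GODEBUG 的 schedtrace 输出直接观察(threads 为 M 总数,spinningthreads 为自旋中的 M 数;示例输出为示意,字段随 Go 版本可能略有差异):
bash
$ GODEBUG=schedtrace=1000 ./yourprogram
SCHED 1013ms: gomaxprocs=8 idleprocs=6 threads=10 spinningthreads=1 idlethreads=4 runqueue=0 [0 1 0 0 0 0 0 0]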
3.3 P 的数据结构
P 的数据结构是:
go
type p struct {
id int32
// p 的状态 pidle/prunning/...
status uint32 // one of pidle/prunning/...
link puintptr
// 每次执行调度器调度 +1
schedtick uint32 // incremented on every scheduler call
// 每次执行系统调用 +1
syscalltick uint32 // incremented on every system call
sysmontick sysmontick // last tick observed by sysmon
// 关联的 m
m muintptr // back-link to associated m (nil if idle)
// 用于 P 所在的线程 M 的内存分配的 mcache
mcache *mcache
pcache pageCache
raceprocctx uintptr
// 本地 P 队列的 defer 结构体池
deferpool []*_defer // pool of available defer structs (see panic.go)
deferpoolbuf [32]*_defer
// Cache of goroutine ids, amortizes accesses to runtime·sched.goidgen.
goidcache uint64
goidcacheend uint64
// Queue of runnable goroutines. Accessed without lock.
// 可运行的 Goroutine 队列,可无锁访问
runqhead uint32
runqtail uint32
runq [256]guintptr
// runnext, if non-nil, is a runnable G that was ready'd by
// the current G and should be run next instead of what's in
// runq if there's time remaining in the running G's time
// slice. It will inherit the time left in the current time
// slice. If a set of goroutines is locked in a
// communicate-and-wait pattern, this schedules that set as a
// unit and eliminates the (potentially large) scheduling
// latency that otherwise arises from adding the ready'd
// goroutines to the end of the run queue.
//
// Note that while other P's may atomically CAS this to zero,
// only the owner P can CAS it to a valid G.
// 线程下一个需要执行的 G
runnext guintptr
// Available G's (status == Gdead)
// 空闲的 G 队列,G 状态 status 为 _Gdead,可重新初始化使用
gFree gList
sudogcache []*sudog
sudogbuf [128]*sudog
// Cache of mspan objects from the heap.
mspancache struct {
// We need an explicit length here because this field is used
// in allocation codepaths where write barriers are not allowed,
// and eliminating the write barrier/keeping it eliminated from
// slice updates is tricky, more so than just managing the length
// ourselves.
len int
buf [128]*mspan
}
// Cache of a single pinner object to reduce allocations from repeated
// pinner creation.
pinnerCache *pinner
trace pTraceState
palloc persistentAlloc // per-P to avoid mutex
// Per-P GC state
gcAssistTime int64 // Nanoseconds in assistAlloc
gcFractionalMarkTime int64 // Nanoseconds in fractional mark worker (atomic)
// limiterEvent tracks events for the GC CPU limiter.
limiterEvent limiterEvent
// gcMarkWorkerMode is the mode for the next mark worker to run in.
// That is, this is used to communicate with the worker goroutine
// selected for immediate execution by
// gcController.findRunnableGCWorker. When scheduling other goroutines,
// this field must be set to gcMarkWorkerNotWorker.
gcMarkWorkerMode gcMarkWorkerMode
// gcMarkWorkerStartTime is the nanotime() at which the most recent
// mark worker started.
gcMarkWorkerStartTime int64
// gcw is this P's GC work buffer cache. The work buffer is
// filled by write barriers, drained by mutator assists, and
// disposed on certain GC state transitions.
gcw gcWork
// wbBuf is this P's GC write barrier buffer.
//
// TODO: Consider caching this in the running G.
wbBuf wbBuf
runSafePointFn uint32 // if 1, run sched.safePointFn at next safe point
// statsSeq is a counter indicating whether this P is currently
// writing any stats. Its value is even when not, odd when it is.
statsSeq atomic.Uint32
// Timer heap.
timers timers
// Cleanups.
cleanups *cleanupBlock
cleanupsQueued uint64 // monotonic count of cleanups queued by this P
// maxStackScanDelta accumulates the amount of stack space held by
// live goroutines (i.e. those eligible for stack scanning).
// Flushed to gcController.maxStackScan once maxStackScanSlack
// or -maxStackScanSlack is reached.
maxStackScanDelta int64
// gc-time statistics about current goroutines
// Note that this differs from maxStackScan in that this
// accumulates the actual stack observed to be used at GC time (hi - sp),
// not an instantaneous measure of the total stack size that might need
// to be scanned (hi - lo).
scannedStackSize uint64 // stack size of goroutines scanned by this P
scannedStacks uint64 // number of goroutines scanned by this P
// preempt is set to indicate that this P should be enter the
// scheduler ASAP (regardless of what G is running on it).
preempt bool
// gcStopTime is the nanotime timestamp that this P last entered _Pgcstop.
gcStopTime int64
// Padding is no longer needed. False sharing is now not a worry because p is large enough
// that its size class is an integer multiple of the cache line size (for any of our architectures).
}
最主要的数据结构是 status 表示 P 的不同的状态,而 runqhead、runqtail 和 runq 三个字段表示处理器持有的运行队列,是一个长度为256的环形队列 ,其中存储着待执行的 G 列表 ,runnext 中是线程下一个需要执行的 G;gFree 存储 P 本地的状态为_Gdead 的空闲的 G,可重新初始化使用。
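为了说明"长度 256 的环形队列 + CAS 无锁访问"是怎么工作的,下面给出一个大幅简化的可运行示意(并非运行时源码:真实实现在 runtime/proc.go 的 runqput/runqget/runqsteal 中,存放的是 guintptr 而非 int):
go
package main

import (
	"fmt"
	"sync/atomic"
)

// localRunq 是 P 本地运行队列的简化示意:
// owner P 作为唯一生产者无锁入队;消费端(owner 与窃取者)通过 CAS 修改 head。
type localRunq struct {
	head atomic.Uint32 // 出队位置,owner 和窃取者都可能 CAS 它
	tail atomic.Uint32 // 入队位置,只有 owner P 修改
	buf  [256]int      // 环形缓冲区,这里用 int 代替 G 指针
}

func (q *localRunq) put(g int) bool {
	h := q.head.Load()
	t := q.tail.Load()
	if t-h >= uint32(len(q.buf)) {
		return false // 队列已满,真实实现会走 runqputslow,把一半 G 转移到全局队列
	}
	q.buf[t%uint32(len(q.buf))] = g
	q.tail.Store(t + 1) // 发布新元素
	return true
}

func (q *localRunq) get() (int, bool) {
	for {
		h := q.head.Load()
		t := q.tail.Load()
		if h == t {
			return 0, false // 队列为空
		}
		g := q.buf[h%uint32(len(q.buf))]
		if q.head.CompareAndSwap(h, h+1) { // 与窃取者竞争,CAS 成功才算取到
			return g, true
		}
	}
}

func main() {
	var q localRunq
	q.put(1)
	q.put(2)
	fmt.Println(q.get()) // 1 true
}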
P 结构体中的状态 status 字段会是以下五种中的一种:
- _Pidle:P 没有运行用户代码或者调度器,被空闲队列或者改变其状态的结构持有,运行队列为空;
- _Prunning:被线程 M 持有,并且正在执行用户代码或者调度器;
- _Psyscall:没有执行用户代码,当前线程陷入系统调用;
- _Pgcstop:被线程 M 持有,当前处理器由于垃圾回收被停止;
- _Pdead:当前 P 已经不被使用;
_Psyscall 常见于 goroutine 执行无法由 netpoll 管理的阻塞系统调用(如文件读写、部分 DNS 解析、cgo 调用等)时:对应的 M 进入内核/外部调用等待,暂时无法执行 Go 代码,P 被标记为 _Psyscall,以便调度器在必要时回收该 P 供其他 goroutine 运行。
"系统调用阻塞的是 M 和那个 G;P 不阻塞,_Psyscall 只是描述 P 正处在 syscall 相关的交接/可回收状态。P 通常会被迅速交给其他 M 去跑它队列里的其他 G;sysmon 负责避免 P 因长 syscall 被长期占用。"
go
const (
// P status
// _Pidle means a P is not being used to run user code or the
// scheduler. Typically, it's on the idle P list and available
// to the scheduler, but it may just be transitioning between
// other states.
//
// The P is owned by the idle list or by whatever is
// transitioning its state. Its run queue is empty.
_Pidle = iota
// _Prunning means a P is owned by an M and is being used to
// run user code or the scheduler. Only the M that owns this P
// is allowed to change the P's status from _Prunning. The M
// may transition the P to _Pidle (if it has no more work to
// do), _Psyscall (when entering a syscall), or _Pgcstop (to
// halt for the GC). The M may also hand ownership of the P
// off directly to another M (e.g., to schedule a locked G).
_Prunning
// _Psyscall means a P is not running user code. It has
// affinity to an M in a syscall but is not owned by it and
// may be stolen by another M. This is similar to _Pidle but
// uses lightweight transitions and maintains M affinity.
//
// Leaving _Psyscall must be done with a CAS, either to steal
// or retake the P. Note that there's an ABA hazard: even if
// an M successfully CASes its original P back to _Prunning
// after a syscall, it must understand the P may have been
// used by another M in the interim.
_Psyscall
// _Pgcstop means a P is halted for STW and owned by the M
// that stopped the world. The M that stopped the world
// continues to use its P, even in _Pgcstop. Transitioning
// from _Prunning to _Pgcstop causes an M to release its P and
// park.
//
// The P retains its run queue and startTheWorld will restart
// the scheduler on Ps with non-empty run queues.
_Pgcstop
// _Pdead means a P is no longer used (GOMAXPROCS shrank). We
// reuse Ps if GOMAXPROCS increases. A dead P is mostly
// stripped of its resources, though a few things remain
// (e.g., trace buffers).
_Pdead
)
P 的五种状态之间的转化关系如图 3.2 所示:

图3.2 P的状态变化
3.4 schedt 的数据结构
调度器的schedt结构体存储了全局的 G 队列,空闲的 M 列表和 P 列表:
/usr/local/go/src/runtime/runtime2.go
go
type schedt struct {
goidgen atomic.Uint64
lastpoll atomic.Int64 // time of last network poll, 0 if currently polling
pollUntil atomic.Int64 // time to which current poll is sleeping
pollingNet atomic.Int32 // 1 if some P doing non-blocking network poll
// schedt的锁
lock mutex
// When increasing nmidle, nmidlelocked, nmsys, or nmfreed, be
// sure to call checkdead().
// 空闲的M列表
midle muintptr // idle m's waiting for work
// 空闲的M列表的数量
nmidle int32 // number of idle m's waiting for work
// 空闲且被锁定(带有 lockedg)的 M 数量
nmidlelocked int32 // number of locked m's waiting for work
// 下一个被创建的 M 的 ID
mnext int64 // number of m's that have been created and next M ID
// 能拥有的最大数量的 M
maxmcount int32 // maximum number of m's allowed (or die)
nmsys int32 // number of system m's not counted for deadlock
nmfreed int64 // cumulative number of freed m's
ngsys atomic.Int32 // number of system goroutines
// 空闲的 P 链表
pidle puintptr // idle p's
// 空闲 P 数量
npidle atomic.Int32
// 处于自旋状态的 M 的数量
nmspinning atomic.Int32 // See "Worker thread parking/unparking" comment in proc.go.
needspinning atomic.Uint32 // See "Delicate dance" comment in proc.go. Boolean. Must hold sched.lock to set to 1.
// Global runnable queue.
// 全局可执行的 G 列表
runq gQueue
// disable controls selective disabling of the scheduler.
//
// Use schedEnableUser to control this.
//
// disable is protected by sched.lock.
disable struct {
// user disables scheduling of user goroutines.
user bool
runnable gQueue // pending runnable Gs
}
// Global cache of dead G's.
// 全局 _Gdead 状态的空闲 G 列表
gFree struct {
lock mutex
stack gList // Gs with stacks
noStack gList // Gs without stacks
}
// Central cache of sudog structs.
// sudog结构的集中存储
sudoglock mutex
sudogcache *sudog
// Central pool of available defer structs.
// 有效的 defer 结构池
deferlock mutex
deferpool *_defer
// freem is the list of m's waiting to be freed when their
// m.exited is set. Linked through m.freelink.
freem *m
gcwaiting atomic.Bool // gc is waiting to run
stopwait int32
stopnote note
sysmonwait atomic.Bool
sysmonnote note
// safePointFn should be called on each P at the next GC
// safepoint if p.runSafePointFn is set.
safePointFn func(*p)
safePointWait int32
safePointNote note
profilehz int32 // cpu profiling rate
procresizetime int64 // nanotime() of last change to gomaxprocs
totaltime int64 // ∫gomaxprocs dt up to procresizetime
customGOMAXPROCS bool // GOMAXPROCS was manually set from the environment or runtime.GOMAXPROCS
// sysmonlock protects sysmon's actions on the runtime.
//
// Acquire and hold this mutex to block sysmon from interacting
// with the rest of the runtime.
sysmonlock mutex
// timeToRun is a distribution of scheduling latencies, defined
// as the sum of time a G spends in the _Grunnable state before
// it transitions to _Grunning.
timeToRun timeHistogram
// idleTime is the total CPU time Ps have "spent" idle.
//
// Reset on each GC cycle.
idleTime atomic.Int64
// totalMutexWaitTime is the sum of time goroutines have spent in _Gwaiting
// with a waitreason of the form waitReasonSync{RW,}Mutex{R,}Lock.
totalMutexWaitTime atomic.Int64
// stwStoppingTimeGC/Other are distributions of stop-the-world stopping
// latencies, defined as the time taken by stopTheWorldWithSema to get
// all Ps to stop. stwStoppingTimeGC covers all GC-related STWs,
// stwStoppingTimeOther covers the others.
stwStoppingTimeGC timeHistogram
stwStoppingTimeOther timeHistogram
// stwTotalTimeGC/Other are distributions of stop-the-world total
// latencies, defined as the total time from stopTheWorldWithSema to
// startTheWorldWithSema. This is a superset of
// stwStoppingTimeGC/Other. stwTotalTimeGC covers all GC-related STWs,
// stwTotalTimeOther covers the others.
stwTotalTimeGC timeHistogram
stwTotalTimeOther timeHistogram
// totalRuntimeLockWaitTime (plus the value of lockWaitTime on each M in
// allm) is the sum of time goroutines have spent in _Grunnable and with an
// M, but waiting for locks within the runtime. This field stores the value
// for Ms that have exited.
totalRuntimeLockWaitTime atomic.Int64
}
除了上面的四个结构体,还有一些全局变量:
go
var (
// 所有的 M
allm *m
// P 的个数,默认为 ncpu 核数
gomaxprocs int32
numCPUStartup int32
forcegc forcegcstate
// schedt 全局结构体
sched schedt
newprocs int32
)
var (
// allpLock protects P-less reads and size changes of allp, idlepMask,
// and timerpMask, and all writes to allp.
// 全局 P 队列的锁
allpLock mutex
// len(allp) == gomaxprocs; may change at safe points, otherwise
// immutable.
// 全局 P 队列,个数为 gomaxprocs
allp []*p
// Bitmask of Ps in _Pidle list, one bit per P. Reads and writes must
// be atomic. Length may change at safe points.
//
// Each P must update only its own bit. In order to maintain
// consistency, a P going idle must the idle mask simultaneously with
// updates to the idle P list under the sched.lock, otherwise a racing
// pidleget may clear the mask before pidleput sets the mask,
// corrupting the bitmap.
//
// N.B., procresize takes ownership of all Ps in stopTheWorldWithSema.
idlepMask pMask
// Bitmask of Ps that may have a timer, one bit per P. Reads and writes
// must be atomic. Length may change at safe points.
//
// Ideally, the timer mask would be kept immediately consistent on any timer
// operations. Unfortunately, updating a shared global data structure in the
// timer hot path adds too much overhead in applications frequently switching
// between no timers and some timers.
//
// As a compromise, the timer mask is updated only on pidleget / pidleput. A
// running P (returned by pidleget) may add a timer at any time, so its mask
// must be set. An idle P (passed to pidleput) cannot add new timers while
// idle, so if it has no timers at that time, its mask may be cleared.
//
// Thus, we get the following effects on timer-stealing in findrunnable:
//
// - Idle Ps with no timers when they go idle are never checked in findrunnable
// (for work- or timer-stealing; this is the ideal case).
// - Running Ps must always be checked.
// - Idle Ps whose timers are stolen must continue to be checked until they run
// again, even after timer expiration.
//
// When the P starts running again, the mask should be set, as a timer may be
// added at any time.
//
// TODO(prattmic): Additional targeted updates may improve the above cases.
// e.g., updating the mask when stealing a timer.
timerpMask pMask
)
此外,src/runtime/proc.go 文件有两个全局变量:
go
var (
// 进程启动后的初始线程
m0 m
// 代表着初始线程的stack
g0 g
mcache0 *mcache
raceprocctx0 uintptr
raceFiniLock mutex
)
关于m0、g0
m0 是 Go 进程启动时创建/使用的第一个 M(初始 OS 线程),负责执行 runtime 的启动流程,并创建并启动 main goroutine ,随后进入正常的调度运行。
g0 是每个 M 都拥有的一个特殊 goroutine,对应该 M 的系统栈。runtime 在执行调度、栈管理、syscall 相关切换等底层逻辑时会切换到 g0 上运行,g0 不用于执行用户代码。

m0:线程身份(第一个 OS 线程)
g0:执行载体(系统栈上的特殊 goroutine,每个线程一个)
m0 是开局自带的,其他 m 是后面扩编的。m0 是程序启动时自带的那条初始线程,负责把 runtime 和调度器搭起来;其他 m 是运行过程中按需创建的工作线程。启动完成后,它们在调度能力上基本没本质区别,只是 m0 有'第一线程'的历史包袱。


到这里,G、M、P、schedt结构体和全局变量都描述完毕,GMP 的全部队列如下表3-1所示:
表3-1 GMP的队列
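根据上文各结构体的字段,GMP 涉及的主要队列/列表可以归纳为:

| 队列/列表 | 所在结构 | 说明 |
| --- | --- | --- |
| runq / runqhead / runqtail | 每个 P | 长度 256 的本地可运行 G 环形队列,可无锁访问 |
| runnext | 每个 P | 下一个优先执行的 G |
| gFree | 每个 P | 本地空闲 G(_Gdead)列表,可复用 |
| sched.runq | 全局 schedt | 全局可运行 G 队列,需加锁访问 |
| sched.gFree | 全局 schedt | 全局空闲 G 列表(分有栈 stack 与无栈 noStack) |
| sched.midle | 全局 schedt | 空闲 M 链表 |
| sched.pidle | 全局 schedt | 空闲 P 链表 |
| allm / allp / allg | 全局变量 | 所有 M / P / 已创建 G 的列表 |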

4. 调度器的启动
4.1 程序启动流程
Go 程序一启动,Go 的运行时 runtime 自带的调度器 scheduler 就开始启动了。
对于一个最简单的Go程序:
go
package main
import "fmt"
func main() {
fmt.Println("hello world")
}
通过 gdb或dlv的方式调试,会发现程序的真正入口不是在 runtime.main,对 AMD64 架构上的 Linux 和 macOS 服务器来说,分别在runtime包的 src/runtime/rt0_linux_amd64.s 和 src/runtime/rt0_darwin_amd64.s:
bash
TEXT _rt0_amd64_linux(SB),NOSPLIT,$-8
JMP _rt0_amd64(SB)
bash
TEXT _rt0_amd64_darwin(SB),NOSPLIT,$-8
JMP _rt0_amd64(SB)
两者均跳转到了 src/runtime/asm_amd64.s 包的 _rt0_amd64 函数:
bash
TEXT _rt0_amd64(SB),NOSPLIT,$-8
MOVQ 0(SP), DI // argc
LEAQ 8(SP), SI // argv
JMP runtime·rt0_go(SB)
_rt0_amd64 函数跳转到了 src/runtime/asm_amd64.s 中的 runtime·rt0_go 函数(各架构有对应实现,下面摘录的是 arm64 版本 src/runtime/asm_arm64.s 中的同一逻辑):
bash
TEXT runtime·rt0_go(SB),NOSPLIT|TOPFRAME,$0
......
// 初始化g0
MOVD $runtime·g0(SB), g
......
// 初始化 m0
MOVD $runtime·m0(SB), R0
// 绑定 g0 和 m0
MOVD g, m_g0(R0)
MOVD R0, g_m(g)
......
BL runtime·schedinit(SB) // 调度器初始化
// 创建一个新的 goroutine 来启动程序
MOVD $runtime·mainPC(SB), R0 // main函数入口
.......
BL runtime·newproc(SB) // 负责根据主函数即 main 的入口地址创建可被运行时调度的执行单元goroutine
.......
// 开始启动调度器的调度循环
BL runtime·mstart(SB)
......
DATA runtime·mainPC+0(SB)/8,$runtime·main<ABIInternal>(SB) // main函数入口地址
GLOBL runtime·mainPC(SB),RODATA,$8
Go程序的真正启动函数 runtime·rt0_go 主要做了几件事:
1)初始化 g0 和 m0,并将二者互相绑定 , m0 是程序启动后的初始线程 ,g0 是 m0 的系统栈代表的 G 结构体,负责普通 G 在M 上的调度切换;
2)schedinit:进行各种运行时组件初始化工作,这包括我们的调度器与内存分配器、回收器的初始化;
3)newproc:负责根据主函数即 main 的入口地址创建可被运行时调度的执行单元;
4)mstart:开始启动调度器的调度循环;
阅读 Go 调度器的源码,需要先从整体结构上对其有个把握,Go 程序启动后的调度器主逻辑如图 4.1 所示:

图4.1 调度器主逻辑
下面分为两部分来分析调度器的原理:调度器的启动和调度循环。
4.2 调度器的启动
调度器启动函数在 src/runtime/proc.go 包的 schedinit() 函数:
go
// The bootstrap sequence is:
//
// call osinit
// call schedinit
// make & queue new G
// call runtime·mstart
//
// The new G calls runtime·main.
// 调度器初始化
func schedinit() {
lockInit(&sched.lock, lockRankSched)
lockInit(&sched.sysmonlock, lockRankSysmon)
lockInit(&sched.deferlock, lockRankDefer)
lockInit(&sched.sudoglock, lockRankSudog)
lockInit(&deadlock, lockRankDeadlock)
lockInit(&paniclk, lockRankPanic)
lockInit(&allglock, lockRankAllg)
lockInit(&allpLock, lockRankAllp)
lockInit(&reflectOffs.lock, lockRankReflectOffs)
lockInit(&finlock, lockRankFin)
lockInit(&cpuprof.lock, lockRankCpuprof)
lockInit(&computeMaxProcsLock, lockRankComputeMaxProcs)
allocmLock.init(lockRankAllocmR, lockRankAllocmRInternal, lockRankAllocmW)
execLock.init(lockRankExecR, lockRankExecRInternal, lockRankExecW)
traceLockInit()
// Enforce that this lock is always a leaf lock.
// All of this lock's critical sections should be
// extremely short.
lockInit(&memstats.heapStats.noPLock, lockRankLeafRank)
lockVerifyMSize()
// raceinit must be the first call to race detector.
// In particular, it must be done before mallocinit below calls racemapshadow.
gp := getg()
if raceenabled {
gp.racectx, raceprocctx0 = raceinit()
}
// 设置机器线程数M最大为10000
sched.maxmcount = 10000
crashFD.Store(^uintptr(0))
// The world starts stopped.
worldStopped()
godebug, parsedGodebug := getGodebugEarly()
if parsedGodebug {
parseRuntimeDebugVars(godebug)
}
ticks.init() // run as early as possible
moduledataverify()
// 栈、内存分配器相关初始化
// 初始化栈
stackinit()
// 初始化内存分配器
mallocinit()
cpuinit(godebug) // must run before alginit
randinit() // must run before alginit, mcommoninit
alginit() // maps, hash, rand must not be used before this call
// 初始化当前系统线程 M0
mcommoninit(gp.m, -1)
modulesinit() // provides activeModules
typelinksinit() // uses maps, activeModules
itabsinit() // uses activeModules
stkobjinit() // must run before GC starts
sigsave(&gp.m.sigmask)
initSigmask = gp.m.sigmask
goargs()
goenvs()
secure()
checkfds()
if !parsedGodebug {
// Some platforms, e.g., Windows, didn't make env vars available "early",
// so try again now.
parseRuntimeDebugVars(gogetenv("GODEBUG"))
}
finishDebugVarsSetup()
// GC初始化
gcinit()
// Allocate stack space that can be used when crashing due to bad stack
// conditions, e.g. morestack on g0.
gcrash.stack = stackalloc(16384)
gcrash.stackguard0 = gcrash.stack.lo + 1000
gcrash.stackguard1 = gcrash.stack.lo + 1000
// if disableMemoryProfiling is set, update MemProfileRate to 0 to turn off memprofile.
// Note: parsedebugvars may update MemProfileRate, but when disableMemoryProfiling is
// set to true by the linker, it means that nothing is consuming the profile, it is
// safe to set MemProfileRate to 0.
if disableMemoryProfiling {
MemProfileRate = 0
}
// mcommoninit runs before parsedebugvars, so init profstacks again.
mProfStackInit(gp.m)
defaultGOMAXPROCSInit()
lock(&sched.lock)
sched.lastpoll.Store(nanotime())
var procs int32
if n, ok := strconv.Atoi32(gogetenv("GOMAXPROCS")); ok && n > 0 {
procs = n
sched.customGOMAXPROCS = true
} else {
// Use numCPUStartup for initial GOMAXPROCS for two reasons:
//
// 1. We just computed it in osinit, recomputing is (minorly) wasteful.
//
// 2. More importantly, if debug.containermaxprocs == 0 &&
// debug.updatemaxprocs == 0, we want to guarantee that
// runtime.GOMAXPROCS(0) always equals runtime.NumCPU (which is
// just numCPUStartup).
// 设置P的值为GOMAXPROCS个数
procs = defaultGOMAXPROCS(numCPUStartup)
}
// 调用procresize调整 P 列表
if procresize(procs) != nil {
throw("unknown runnable goroutine during bootstrap")
}
unlock(&sched.lock)
// World is effectively started now, as P's can run.
worldStarted()
if buildVersion == "" {
// Condition should never trigger. This code just serves
// to ensure runtime·buildVersion is kept in the resulting binary.
buildVersion = "unknown"
}
if len(modinfo) == 1 {
// Condition should never trigger. This code just serves
// to ensure runtime·modinfo is kept in the resulting binary.
modinfo = ""
}
}
schedinit() 函数会设置 M 最大数量为10000,实际中不会达到;会分别调用stackinit() 、mallocinit() 、mcommoninit() 、gcinit() 等执行 goroutine栈初始化、进行内存分配器初始化、进行系统线程M0的初始化、进行GC垃圾回收器的初始化;接着,将 P 个数设置为 GOMAXPROCS 的值,即程序能够同时运行的最大处理器数,最后会调用 runtime.procresize()函数初始化 P 列表。

图4.2 runtime.schedinit() 函数逻辑
schedinit() 函数负责M、P、G 的初始化过程。M/P/G 彼此的初始化顺序遵循:mcommoninit、procresize、newproc,他们分别负责初始化 M 资源池(allm)、P 资源池(allp)、G 的运行现场(g.sched)以及调度队列(p.runq)
mcommoninit 函数
mcommoninit() 函数主要负责对 M0 进行一个初步的初始化,并将其添加到 schedt 全局结构体中,这里访问 schedt 会加锁:
go
// Pre-allocated ID may be passed as 'id', or omitted by passing -1.
func mcommoninit(mp *m, id int64) {
gp := getg()
// g0 stack won't make sense for user (and is not necessary unwindable).
if gp != gp.m.g0 {
callers(1, mp.createstack[:])
}
lock(&sched.lock)
if id >= 0 {
mp.id = id
} else {
// mReserveID() 会返回 sched.mnext 给当前 m,并对 sched.mnext++,记录新增加的这个 M 到 schedt 全局结构体
mp.id = mReserveID()
}
mrandinit(mp)
mpreinit(mp)
if mp.gsignal != nil {
mp.gsignal.stackguard1 = mp.gsignal.stack.lo + stackGuard
}
// Add to allm so garbage collector doesn't free g->m
// when it is just in a register or thread-local storage.
// 添加到 allm 中
mp.alllink = allm
// NumCgoCall() and others iterate over allm w/o schedlock,
// so we need to publish it safely.
// 等价于 allm = mp
atomicstorep(unsafe.Pointer(&allm), unsafe.Pointer(mp))
unlock(&sched.lock)
// Allocate memory to hold a cgo traceback if the cgo call crashes.
if iscgo || GOOS == "solaris" || GOOS == "illumos" || GOOS == "windows" {
mp.cgoCallers = new(cgoCallers)
}
mProfStackInit(mp)
}
runtime.procresize 函数
go
// Change number of processors.
//
// sched.lock must be held, and the world must be stopped.
//
// gcworkbufs must not be being modified by either the GC or the write barrier
// code, so the GC must not be running if the number of Ps actually changes.
//
// Returns list of Ps with local work, they need to be scheduled by the caller.
func procresize(nprocs int32) *p {
assertLockHeld(&sched.lock)
assertWorldStopped()
// 获取先前的 P 个数
old := gomaxprocs
if old < 0 || nprocs <= 0 {
throw("procresize: invalid arg")
}
trace := traceAcquire()
if trace.ok() {
trace.Gomaxprocs(nprocs)
traceRelease(trace)
}
// update statistics
now := nanotime()
if sched.procresizetime != 0 {
sched.totaltime += int64(old) * (now - sched.procresizetime)
}
sched.procresizetime = now
maskWords := (nprocs + 31) / 32
// Grow allp if necessary.
// 如果全局变量 allp 切片中的处理器数量少于期望数量,对 allp 扩容
if nprocs > int32(len(allp)) {
// Synchronize with retake, which could be running
// concurrently since it doesn't run on a P.
// 加锁
lock(&allpLock)
// 如果要达到的 P 个数 nprocs 小于当前全局 P 切片的容量
if nprocs <= int32(cap(allp)) {
// 在当前全局 P 切片上截取前 nprocs 个 P
allp = allp[:nprocs]
} else {
// 否则,调大了,超出全局 P 切片的容量,创建容量为 nprocs 的新的 P 切片
nallp := make([]*p, nprocs)
// Copy everything up to allp's cap so we
// never lose old allocated Ps.
// 将原有的 p 复制到新创建的 nallp 中
copy(nallp, allp[:cap(allp)])
// 新的 nallp 切片赋值给旧的 allp
allp = nallp
}
if maskWords <= int32(cap(idlepMask)) {
idlepMask = idlepMask[:maskWords]
timerpMask = timerpMask[:maskWords]
} else {
nidlepMask := make([]uint32, maskWords)
// No need to copy beyond len, old Ps are irrelevant.
copy(nidlepMask, idlepMask)
idlepMask = nidlepMask
ntimerpMask := make([]uint32, maskWords)
copy(ntimerpMask, timerpMask)
timerpMask = ntimerpMask
}
unlock(&allpLock)
}
// initialize new P's
// 使用 new 创建新的 P 结构体并调用 runtime.p.init 初始化刚刚扩容的allp列表里的 P
for i := old; i < nprocs; i++ {
pp := allp[i]
// 如果 p 是新创建的(新创建的 p 在数组中为 nil),则申请新的 P 对象
if pp == nil {
pp = new(p)
}
pp.init(i)
atomicstorep(unsafe.Pointer(&allp[i]), unsafe.Pointer(pp))
}
gp := getg()
// 当前 G 的 M 上的 P 不为空,并且其 id 小于 nprocs,说明 ID 有效,则可以继续使用当前 G 的 P
if gp.m.p != 0 && gp.m.p.ptr().id < nprocs {
// continue to use the current P
// 继续使用当前 P, 其状态设置为 _Prunning
gp.m.p.ptr().status = _Prunning
gp.m.p.ptr().mcache.prepareForSweep()
} else {
// release the current P and acquire allp[0].
//
// We must do this before destroying our current P
// because p.destroy itself has write barriers, so we
// need to do that from a valid P.
// 否则,释放当前 P 并获取 allp[0]
if gp.m.p != 0 {
trace := traceAcquire()
if trace.ok() {
// Pretend that we were descheduled
// and then scheduled again to keep
// the trace consistent.
trace.GoSched()
trace.ProcStop(gp.m.p.ptr())
traceRelease(trace)
}
gp.m.p.ptr().m = 0
}
gp.m.p = 0
// 将处理器 allp[0] 绑定到当前 M
pp := allp[0]
pp.m = 0
// P 状态设置为 _Pidle
pp.status = _Pidle
// 将allp[0]绑定到当前的 M
acquirep(pp)
trace := traceAcquire()
if trace.ok() {
trace.GoStart()
traceRelease(trace)
}
}
// g.m.p is now set, so we no longer need mcache0 for bootstrapping.
mcache0 = nil
// release resources from unused P's
// 调用 runtime.p.destroy 释放从未使用的 P
for i := nprocs; i < old; i++ {
pp := allp[i]
pp.destroy()
// can't free P itself because it can be referenced by an M in syscall
// 不能释放 p 本身,因为他可能在 m 进入系统调用时被引用
}
// Trim allp.
// 裁剪 allp,保证allp长度与期望处理器数量相等
if int32(len(allp)) != nprocs {
lock(&allpLock)
allp = allp[:nprocs]
idlepMask = idlepMask[:maskWords]
timerpMask = timerpMask[:maskWords]
unlock(&allpLock)
}
var runnablePs *p
// 将除 allp[0] 之外的处理器 P 全部设置成 _Pidle 并加入到全局的空闲队列中
for i := nprocs - 1; i >= 0; i-- {
pp := allp[i]
// 跳过当前 P
if gp.m.p.ptr() == pp {
continue
}
// 设置 P 的状态为空闲状态
pp.status = _Pidle
if runqempty(pp) {
// 放入到全局结构体 schedt 的空闲 P 列表中
pidleput(pp, now)
} else {
// 如果有本地任务,则为其绑定一个 M
pp.m.set(mget())
pp.link.set(runnablePs)
runnablePs = pp
}
}
stealOrder.reset(uint32(nprocs))
var int32p *int32 = &gomaxprocs // make compiler check that gomaxprocs is an int32
atomic.Store((*uint32)(unsafe.Pointer(int32p)), uint32(nprocs))
if old != nprocs {
// Notify the limiter that the amount of procs has changed.
gcCPULimiter.resetCapacity(now, nprocs)
}
// 返回所有包含本地任务的 P 链表
return runnablePs
}
go
pp := allp[i]
// 跳过当前 P
if gp.m.p.ptr() == pp {
continue
}

跳过当前 P 是为了保证调度器的核心不变量:一个 P 同时只能被一个 M 持有;当前 P 已经和当前 M 绑定且用于继续执行,不能再被当作 idle 或 runnable P 去重新入队/重新绑定。
runtime.procresize() 函数的执行过程如下:
1)如果全局变量 allp 切片中的 P 数量少于期望数量,会对切片进行扩容;
2)使用 new 创建新的 P 结构体并调用 runtime.p.init 初始化刚刚扩容的 P;
3)通过指针将线程 m0 和处理器 allp[0] 绑定到一起;
4)调用 runtime.p.destroy 释放不再使用的 P 结构;
5)通过切片截断改变全局变量 allp 的长度,保证它与期望 P 数量相等;
6)将除 allp[0] 之外的处理器 P 全部设置成 _Pidle 并加入到全局 schedt 的空闲 P 队列中;
runtime.procresize() 函数的逻辑如图 4.3 所示:

图4.3 runtime.procresize() 函数逻辑
runtime.p.init 方法
runtime.procresize() 函数调用 runtime.p.init 初始化新创建的 P:
go
// init initializes pp, which may be a freshly allocated p or a
// previously destroyed p, and transitions it to status _Pgcstop.
func (pp *p) init(id int32) {
// p 的 id 就是它在 allp 中的索引
pp.id = id
// 新创建的 p 处于 _Pgcstop 状态
pp.status = _Pgcstop
pp.sudogcache = pp.sudogbuf[:0]
pp.deferpool = pp.deferpoolbuf[:0]
pp.wbBuf.reset()
// 为 P 分配 cache 对象,涉及对象分配
if pp.mcache == nil {
if id == 0 {
if mcache0 == nil {
throw("missing mcache?")
}
// Use the bootstrap mcache0. Only one P will get
// mcache0: the one with ID 0.
pp.mcache = mcache0
} else {
pp.mcache = allocmcache()
}
}
if raceenabled && pp.raceprocctx == 0 {
if id == 0 {
pp.raceprocctx = raceprocctx0
raceprocctx0 = 0 // bootstrap
} else {
pp.raceprocctx = raceproccreate()
}
}
lockInit(&pp.timers.mu, lockRankTimers)
// This P may get timers when it starts running. Set the mask here
// since the P may not go through pidleget (notably P 0 on startup).
timerpMask.set(id)
// Similarly, we may not go through pidleget before this P starts
// running if it is P 0 on startup.
idlepMask.clear(id)
}
需要说明的是,mcache内存结构原来是在 M 上的,自从引入了 P 之后,就将该结构体移到了P上,这样,就不用每个 M 维护自己的内存分配 mcache,由于 P 在有 M 可以执行时才会移动到其他 M 上去,空闲的 M 无须分配内存,这种灵活性使整体线程的内存分配大大减少。
4.3 怎样创建 G ?
我们再回到 4.1 节的程序启动函数 runtime·rt0_go,其中有个动作是通过 runtime.newproc 函数创建 G。runtime.newproc 的入参是 *funcval,代表 go 关键字后面调用的函数:
newproc 函数
go
// Create a new g running fn.
// Put it on the queue of g's waiting to run.
// The compiler turns a go statement into a call to this.
// 创建G,并放入 P 的运行队列
func newproc(fn *funcval) {
gp := getg()
// 获取调用方 PC 寄存器值,即调用方程序要执行的下一个指令地址
pc := sys.GetCallerPC()
// 用 g0 系统栈创建 Goroutine 对象
// 传递的参数包括 fn 函数入口地址, gp(g0),调用方 pc
systemstack(func() {
// 调用 newproc1 获取 Goroutine 结构
newg := newproc1(fn, gp, pc, false, waitReasonZero)
// 获取当前 G 的 P
pp := getg().m.p.ptr()
// 将新的 G 放入 P 的本地运行队列
runqput(pp, newg, true)
// M 启动时唤醒新的 P 执行 G
if mainStarted {
wakep()
}
})
}
runtime.newproc 函数主要是调用 runtime.newproc1 获取新的 Goroutine 结构,将新的 G 放入 P 的本地运行队列;若 mainStarted 为真,还会调用 wakep() 唤醒空闲的 P/M 去执行 G。
mainStarted 用来标记 runtime 启动流程已进入可正常调度/可安全创建线程的阶段 。在此之前调度器通常不会随意创建/唤醒新的 M;在此之后,若存在可运行的 G 且缺少可运行的 M/P,调度器会按需唤醒或创建 M 来执行这些 G
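可以顺带验证编译器确实把 go 语句编译成了对 runtime.newproc 的调用:用 -gcflags=-S 输出汇编并过滤 newproc 即可(示例 main.go 为自拟,输出为示意,具体指令与偏移因版本而异):
bash
$ cat main.go
package main

func main() {
	go func() {}()
	select {}
}

$ go build -gcflags=-S main.go 2>&1 | grep newproc
	CALL	runtime.newproc(SB)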
runtime.newproc 函数的逻辑如图4.4所示:

图4.4 runtime.newproc() 函数逻辑
runtime.newproc1 函数
runtime.newproc1() 函数的逻辑是:
go
// Create a new g in state _Grunnable (or _Gwaiting if parked is true), starting at fn.
// callerpc is the address of the go statement that created this. The caller is responsible
// for adding the new g to the scheduler. If parked is true, waitreason must be non-zero.
// 创建一个运行fn函数的goroutine
func newproc1(fn *funcval, callergp *g, callerpc uintptr, parked bool, waitreason waitReason) *g {
// 因为是在系统栈运行所以此时的 g 为 g0
if fn == nil {
fatal("go of nil func value")
}
// 加锁,禁止这时 G 的 M 被抢占,因为它可以在一个局部变量中保存 P
mp := acquirem() // disable preemption because we hold M and P in local vars.
// 获取 P
pp := mp.p.ptr()
// 从 P 的空闲列表获取一个空闲的 G
newg := gfget(pp)
// 找不到则创建
if newg == nil {
// 创建一个栈大小为 2K 的 G
newg = malg(stackMin)
// CAS 改变 G 的状态为_Gdead
casgstatus(newg, _Gidle, _Gdead)
// 将 _Gdead 状态的 g 添加到 allg,这样 GC 不会扫描未初始化的栈
allgadd(newg) // publishes with a g->status of Gdead so GC scanner doesn't look at uninitialized stack.
}
if newg.stack.hi == 0 {
throw("newproc1: newg missing stack")
}
if readgstatus(newg) != _Gdead {
throw("newproc1: new g is not Gdead")
}
// 计算运行空间大小,对齐
totalSize := uintptr(4*goarch.PtrSize + sys.MinFrameSize) // extra space in case of reads slightly beyond frame
totalSize = alignUp(totalSize, sys.StackAlign)
// 确定 SP 和参数入栈位置
sp := newg.stack.hi - totalSize
if usesLR {
// caller's LR
*(*uintptr)(unsafe.Pointer(sp)) = 0
prepGoExitFrame(sp)
}
if GOARCH == "arm64" {
// caller's FP
*(*uintptr)(unsafe.Pointer(sp - goarch.PtrSize)) = 0
}
// 清理、创建并初始化 G 的运行现场
memclrNoHeapPointers(unsafe.Pointer(&newg.sched), unsafe.Sizeof(newg.sched))
newg.sched.sp = sp
newg.stktopsp = sp
// 保存goexit的地址到sched.pc
newg.sched.pc = abi.FuncPCABI0(goexit) + sys.PCQuantum // +PCQuantum so that previous instruction is in same function
newg.sched.g = guintptr(unsafe.Pointer(newg))
gostartcallfn(&newg.sched, fn)
newg.parentGoid = callergp.goid
// 初始化 G 的基本状态
newg.gopc = callerpc
newg.ancestors = saveAncestors(callergp)
newg.startpc = fn.fn
newg.runningCleanups.Store(false)
if isSystemGoroutine(newg, false) {
sched.ngsys.Add(1)
} else {
// Only user goroutines inherit synctest groups and pprof labels.
newg.bubble = callergp.bubble
if mp.curg != nil {
newg.labels = mp.curg.labels
}
if goroutineProfile.active {
// A concurrent goroutine profile is running. It should include
// exactly the set of goroutines that were alive when the goroutine
// profiler first stopped the world. That does not include newg, so
// mark it as not needing a profile before transitioning it from
// _Gdead.
newg.goroutineProfiled.Store(goroutineProfileSatisfied)
}
}
// Track initial transition?
newg.trackingSeq = uint8(cheaprand())
if newg.trackingSeq%gTrackingPeriod == 0 {
newg.tracking = true
}
gcController.addScannableStack(pp, int64(newg.stack.hi-newg.stack.lo))
// Get a goid and switch to runnable. Make all this atomic to the tracer.
trace := traceAcquire()
// 将 G 的状态设置为_Grunnable
var status uint32 = _Grunnable
if parked {
status = _Gwaiting
newg.waitreason = waitreason
}
if pp.goidcache == pp.goidcacheend {
// Sched.goidgen is the last allocated id,
// this batch must be [sched.goidgen+1, sched.goidgen+GoidCacheBatch].
// At startup sched.goidgen=0, so main goroutine receives goid=1.
pp.goidcache = sched.goidgen.Add(_GoidCacheBatch)
pp.goidcache -= _GoidCacheBatch - 1
pp.goidcacheend = pp.goidcache + _GoidCacheBatch
}
// 生成唯一的goid
newg.goid = pp.goidcache
casgstatus(newg, _Gdead, status)
pp.goidcache++
newg.trace.reset()
if trace.ok() {
trace.GoCreate(newg, newg.startpc, parked)
traceRelease(trace)
}
// Set up race context.
if raceenabled {
newg.racectx = racegostart(callerpc)
newg.raceignore = 0
if newg.labels != nil {
// See note in proflabel.go on labelSync's role in synchronizing
// with the reads in the signal handler.
racereleasemergeg(newg, unsafe.Pointer(&labelSync))
}
}
// 释放对 M 加的锁
releasem(mp)
return newg
}

runtime.newproc1() 函数主要执行三个动作:
1)获取或者创建新的 Goroutine 结构体,会先从处理器的 gFree 列表中查找空闲的 Goroutine,如果不存在空闲的 Goroutine,会通过 runtime.malg 创建一个栈大小足够的新结构体 ,新创建的 G 的状态为_Gdead;
2)将传入的参数 callergp,callerpc,fn更新到 G 的栈上,初始化 G 的相关参数;
3)将 G 状态设置为 _Grunnable 状态,返回;
runtime.newproc1() 函数的逻辑如图 4.5 所示:

图4.5 runtime.newproc1() 函数逻辑
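结合上面的流程,可以用一个可直接运行的小例子验证 go 语句的语义:编译器把 go 语句翻译成对 runtime.newproc 的调用,参数在 go 语句执行的那一刻就完成求值,新建的 G 只是进入运行队列等待调度,并不立即执行:
go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	x := 1
	wg.Add(1)
	// go 语句执行到这里时,编译器生成的代码就会调用 runtime.newproc,
	// 参数 x 在此刻完成求值;新 G 只是进入运行队列,并不立即执行
	go func(v int) {
		defer wg.Done()
		fmt.Println("goroutine sees v =", v) // 稳定输出 1,而不是 2
	}(x)
	x = 2 // 这次修改发生在 go 语句之后,不影响已求值的参数
	wg.Wait()
}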
runtime.gfget 函数
runtime.newproc1() 函数主要通过调用 runtime.gfget() 函数获取 G:
go
// Get from gfree list.
// If local list is empty, grab a batch from global list.
func gfget(pp *p) *g {
retry:
// 如果 P 的空闲列表 gFree 为空,且 sched 的全局空闲列表 gFree 不为空
if pp.gFree.empty() && (!sched.gFree.stack.empty() || !sched.gFree.noStack.empty()) {
lock(&sched.gFree.lock)
// Move a batch of free Gs to the P.
// 从 sched 的 gFree 列表中最多移动 32 个 G 到 P 的 gFree 中
for pp.gFree.size < 32 {
// Prefer Gs with stacks.
gp := sched.gFree.stack.pop()
if gp == nil {
gp = sched.gFree.noStack.pop()
if gp == nil {
break
}
}
pp.gFree.push(gp)
}
unlock(&sched.gFree.lock)
goto retry
}
// 如果此时 P 的空闲列表还是为空,返回nil,说明无空闲的G
gp := pp.gFree.pop()
if gp == nil {
return nil
}
if gp.stack.lo != 0 && gp.stack.hi-gp.stack.lo != uintptr(startingStackSize) {
// Deallocate old stack. We kept it in gfput because it was the
// right size when the goroutine was put on the free list, but
// the right size has changed since then.
systemstack(func() {
stackfree(gp.stack)
gp.stack.lo = 0
gp.stack.hi = 0
gp.stackguard0 = 0
if valgrindenabled {
valgrindDeregisterStack(gp.valgrindStackID)
gp.valgrindStackID = 0
}
})
}
// 设置 G 的栈空间
if gp.stack.lo == 0 {
// Stack was deallocated in gfput or just above. Allocate a new one.
systemstack(func() {
gp.stack = stackalloc(startingStackSize)
if valgrindenabled {
gp.valgrindStackID = valgrindRegisterStack(unsafe.Pointer(gp.stack.lo), unsafe.Pointer(gp.stack.hi))
}
})
gp.stackguard0 = gp.stack.lo + stackGuard
} else {
if raceenabled {
racemalloc(unsafe.Pointer(gp.stack.lo), gp.stack.hi-gp.stack.lo)
}
if msanenabled {
msanmalloc(unsafe.Pointer(gp.stack.lo), gp.stack.hi-gp.stack.lo)
}
if asanenabled {
asanunpoison(unsafe.Pointer(gp.stack.lo), gp.stack.hi-gp.stack.lo)
}
}
// 从 P 的空闲列表获取 G 返回
return gp
}
runtime.gfget() 函数的主要逻辑是:当 P 的空闲列表 gFree 为空时,从 sched 持有的全局空闲列表 gFree 中移动最多 32个 G 到当前的 P 的空闲列表上;然后从 P 的 gFree 列表头返回一个 G;如果还是没有,则返回空,说明获取不到空闲的 G。
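下面用一段示意代码(并非 runtime 源码,池中对象用 int 代替 g 结构体)演示 gfget 这种"两级空闲列表"的设计:本地列表为空时一次性从全局池批量搬运最多 32 个对象,把一次加锁的开销均摊到一批对象上:
go
package main

import (
	"fmt"
	"sync"
)

// pool 对应 sched.gFree:全局空闲列表,访问需要加锁
type pool struct {
	mu     sync.Mutex
	global []int
}

// localCache 对应 p.gFree:每个 P 的本地空闲列表,无锁访问
type localCache struct {
	free []int
}

func (c *localCache) get(p *pool) (int, bool) {
	if len(c.free) == 0 {
		p.mu.Lock()
		// 批量搬运最多 32 个,减少后续 get 对全局锁的竞争
		n := min(32, len(p.global))
		c.free = append(c.free, p.global[:n]...)
		p.global = p.global[n:]
		p.mu.Unlock()
	}
	if len(c.free) == 0 {
		return 0, false // 两级都为空,调用方需要新建(对应 malg)
	}
	g := c.free[len(c.free)-1]
	c.free = c.free[:len(c.free)-1]
	return g, true
}

func main() {
	p := &pool{global: []int{1, 2, 3, 4, 5}}
	c := &localCache{}
	for {
		g, ok := c.get(p)
		if !ok {
			break
		}
		fmt.Println("got", g)
	}
}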
runtime.malg 函数
在 runtime.newproc1() 函数中,如果不存在空闲的 G,会通过 runtime.malg() 创建一个栈大小足够的新结构体:
go
// Allocate a new g, with a stack big enough for stacksize bytes.
// 创建一个新的 g 结构体
func malg(stacksize int32) *g {
newg := new(g)
// stacksize 非负时,按 round2(stackSystem + stacksize) 向上取整,通过 runtime.stackalloc 分配栈空间(stackMin 为 2048,即默认分配 2KB)
if stacksize >= 0 {
stacksize = round2(stackSystem + stacksize)
systemstack(func() {
newg.stack = stackalloc(uint32(stacksize))
if valgrindenabled {
newg.valgrindStackID = valgrindRegisterStack(unsafe.Pointer(newg.stack.lo), unsafe.Pointer(newg.stack.hi))
}
})
newg.stackguard0 = newg.stack.lo + stackGuard
newg.stackguard1 = ^uintptr(0)
// Clear the bottom word of the stack. We record g
// there on gsignal stack during VDSO on ARM and ARM64.
*(*uintptr)(unsafe.Pointer(newg.stack.lo)) = 0
}
return newg
}
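malg 中的栈大小并不是直接取入参,而是按 round2(stackSystem + stacksize) 向上取整到 2 的幂。下面按这一语义给出一个示意实现(stackSystem 在多数平台上为 0,这里按 0 处理):
go
package main

import "fmt"

// round2 将 n 向上取整到 2 的幂,语义与 runtime 中的同名函数一致(此处为示意实现)
func round2(n int32) int32 {
	r := int32(1)
	for r < n {
		r <<= 1
	}
	return r
}

func main() {
	const stackMin = 2048 // 新 G 的初始栈大小,即源码中的 stackMin
	const stackSystem = 0 // 多数平台为 0,个别平台会额外预留系统栈空间
	fmt.Println(round2(stackSystem + stackMin)) // 2048,即 2KB
}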
runtime.runqput 函数
回到 runtime.newproc 函数,在通过 newproc1 获取到 G 后,会调用 runtime.runqput() 函数将 G 放入 P 的本地队列或全局队列:
go
// runqput tries to put g on the local runnable queue.
// If next is false, runqput adds g to the tail of the runnable queue.
// If next is true, runqput puts g in the pp.runnext slot.
// If the run queue is full, runnext puts g on the global queue.
// Executed only by the owner P.
func runqput(pp *p, gp *g, next bool) {
if !haveSysmon && next {
// A runnext goroutine shares the same time slice as the
// current goroutine (inheritTime from runqget). To prevent a
// ping-pong pair of goroutines from starving all others, we
// depend on sysmon to preempt "long-running goroutines". That
// is, any set of goroutines sharing the same time slice.
//
// If there is no sysmon, we must avoid runnext entirely or
// risk starvation.
next = false
}
// 保持一定的随机性:以 1/2 的概率不把当前 G 设为 P 下一个执行的任务
if randomizeScheduler && next && randn(2) == 0 {
next = false
}
if next {
retryNext:
// 将 G 放入到 P 的 runnext 变量中,作为下一个 P 执行的任务
oldnext := pp.runnext
if !pp.runnext.cas(oldnext, guintptr(unsafe.Pointer(gp))) {
goto retryNext
}
if oldnext == 0 {
return
}
// Kick the old runnext out to the regular run queue.
// 获取原来的 runnext 存储的 G,放入 P 本地运行队列,或全局队列
gp = oldnext.ptr()
}
retry:
// 获取 P 环形队列的头和尾部指针
h := atomic.LoadAcq(&pp.runqhead) // load-acquire, synchronize with consumers
t := pp.runqtail
// P 本地环形队列没有满,将 G 放入本地环形队列
if t-h < uint32(len(pp.runq)) {
pp.runq[t%uint32(len(pp.runq))].set(gp)
atomic.StoreRel(&pp.runqtail, t+1) // store-release, makes the item available for consumption
return
}
// P 本地环形队列已满,将 G 放入全局队列
if runqputslow(pp, gp, h, t) {
return
}
// the queue is not full, now the put above must succeed
// 本地队列和全局队列没有满,则不会走到这里,否则循环尝试放入
goto retry
}
runtime.runqput() 函数的主要处理逻辑是:
1)保留一定的随机性,设置 next 为 false,即不将当前 G 设置为 P 的下一个执行的 G;
2)当 next 为 true 时,将 G 设置到 P 的 runnext 作为 P 下一个执行的任务;
3)当 next 为 false 并且本地运行队列还有剩余空间时,将 Goroutine 加入处理器持有的本地运行队列;
4)当处理器的本地运行队列已经没有剩余空间时,就会把本地队列中的一部分 G 和待加入的 G 通过 runtime.runqputslow 添加到调度器持有的全局运行队列上;
runtime.runqput() 函数的逻辑如图 4.6 所示:

图4.6 runtime.runqput() 函数的逻辑
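P 的本地队列 runq 本质上是一个定长 256 的环形数组,靠 runqhead/runqtail 两个游标取模定位。下面是一个抽掉细节的单生产者入队示意(非 runtime 源码,省略了 store-release 等内存序细节):
go
package main

import (
	"fmt"
	"sync/atomic"
)

// ringQueue 模拟 p.runq:定长环形数组 + 头尾游标
type ringQueue struct {
	head atomic.Uint32 // 消费端游标,可能被窃取者并发修改,需原子读
	tail uint32        // 生产端游标,只有拥有该 P 的 M 修改
	buf  [256]int      // 与 p.runq 一样取定长 256
}

func (q *ringQueue) put(v int) bool {
	h := q.head.Load()
	t := q.tail
	if t-h >= uint32(len(q.buf)) {
		return false // 队列已满,runtime 此时会走 runqputslow 放入全局队列
	}
	q.buf[t%uint32(len(q.buf))] = v
	q.tail = t + 1 // 真实实现里这里是 store-release,保证对消费者可见
	return true
}

func main() {
	var q ringQueue
	for i := 0; i < 3; i++ {
		q.put(i)
	}
	fmt.Println(q.buf[:3], "tail =", q.tail)
}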
runtime.runqputslow 函数
runtime.runqputslow() 函数的逻辑如下:
go
// Put g and a batch of work from local runnable queue on global queue.
// Executed only by the owner P.
// 将 G 和 P 本地队列的一部分放入全局队列
func runqputslow(pp *p, gp *g, h, t uint32) bool {
// 初始化一个本地队列长度一半 + 1 的 G 列表 batch
var batch [len(pp.runq)/2 + 1]*g
// First, grab a batch from local queue.
// 首先,从 P 本地队列中获取一部分 G 放入初始化的列表 batch
n := t - h
n = n / 2
if n != uint32(len(pp.runq)/2) {
throw("runqputslow: queue is not full")
}
// 将 P 本地环形队列的前一半 G 放入batch
for i := uint32(0); i < n; i++ {
batch[i] = pp.runq[(h+i)%uint32(len(pp.runq))].ptr()
}
if !atomic.CasRel(&pp.runqhead, h, h+n) { // cas-release, commits consume
return false
}
// 将传入的 G 放入列表 batch 的尾部
batch[n] = gp
// 打乱 batch 列表中G的顺序
if randomizeScheduler {
for i := uint32(1); i <= n; i++ {
j := cheaprandn(i + 1)
batch[i], batch[j] = batch[j], batch[i]
}
}
// Link the goroutines.
// 将 batch列表的 G 串成一个链表.
for i := uint32(0); i < n; i++ {
batch[i].schedlink.set(batch[i+1])
}
// 将 batch 列表设置成 gQueue 队列
q := gQueue{batch[0].guintptr(), batch[n].guintptr(), int32(n + 1)}
// Now put the batch on global queue.
// 现在把 gQueue 队列放入全局队列
lock(&sched.lock)
globrunqputbatch(&q)
unlock(&sched.lock)
return true
}
runtime.runqputslow() 函数会把 P 本地环形队列的前一半 G 获取出来,跟传入的 G 组成一个列表,打乱顺序,再放入全局队列。
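下面用普通切片模拟 runqputslow 的核心动作(非 runtime 源码):取出本地队列前一半、连同新 G 组成 batch、洗牌后整批放入全局队列。洗牌是为了避免一组 G 以固定顺序在本地队列和全局队列之间来回迁移:
go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	local := []int{1, 2, 3, 4, 5, 6, 7, 8} // 假设本地队列已满
	newG := 9                              // 待入队的新 G
	n := len(local) / 2
	// 前一半 + 新 G 组成 batch,对应源码中 batch[n] = gp
	batch := append(append([]int{}, local[:n]...), newG)
	local = local[n:]
	// 洗牌(真实实现用 cheaprandn 做同样的事)
	for i := 1; i < len(batch); i++ {
		j := rand.Intn(i + 1)
		batch[i], batch[j] = batch[j], batch[i]
	}
	fmt.Println("local:", local) // [5 6 7 8]
	fmt.Println("to global:", batch)
}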
综上所述,用下图表示调度器启动流程:


图4.7 调度器启动流程
5. 调度循环
我们再回到 4.1 节的程序启动流程:runtime·rt0_go 函数在调用 runtime.schedinit() 初始化好调度器、调用 runtime.newproc() 创建了 main 函数的 G 之后,会调用 runtime.mstart() 函数启动 M 去执行 G。
asm
TEXT runtime·mstart(SB),NOSPLIT|TOPFRAME,$0
CALL runtime·mstart0(SB)
RET // not reached
runtime.mstart() 是用汇编写的,会直接调用 runtime.mstart0() 函数:
runtime.mstart0 函数
go
// src/runtime/proc.go
func mstart0() {
_g_ := getg()
......
// 初始化 g0 的参数
_g_.stackguard0 = _g_.stack.lo + _StackGuard
_g_.stackguard1 = _g_.stackguard0
mstart1()
......
mexit(osStack)
}
runtime.mstart0() 函数主要调用 runtime.mstart1():
go
// src/runtime/proc.go
// The go:noinline is to guarantee the sys.GetCallerPC/sys.GetCallerSP below are safe,
// so that we can set up g0.sched to return to the call of mstart1 above.
//
//go:noinline
func mstart1() {
gp := getg()
if gp != gp.m.g0 {
throw("bad runtime·mstart")
}
// Set up m.g0.sched as a label returning to just
// after the mstart1 call in mstart0 above, for use by goexit0 and mcall.
// We're never coming back to mstart1 after we call schedule,
// so other calls can reuse the current frame.
// And goexit0 does a gogo that needs to return from mstart1
// and let mstart0 exit the thread.
// 记录当前栈帧,便于其他调用复用,当进入 schedule 之后,再也不会回到 mstart1
gp.sched.g = guintptr(unsafe.Pointer(gp))
gp.sched.pc = sys.GetCallerPC()
gp.sched.sp = sys.GetCallerSP()
asminit()
minit()
// Install signal handlers; after minit so that minit can
// prepare the thread to be able to handle the signals.
// 设置信号 handler;放在 minit 之后,因为 minit 会把线程准备好以处理信号
if gp.m == &m0 {
mstartm0()
}
if debug.dataindependenttiming == 1 {
sys.EnableDIT()
}
// 执行启动函数
if fn := gp.m.mstartfn; fn != nil {
fn()
}
// 如果当前 m 并非 m0,则要求绑定 p
if gp.m != &m0 {
acquirep(gp.m.nextp.ptr())
gp.m.nextp = 0
}
// 准备好后,开始调度循环,永不返回
schedule()
}
runtime.mstart1() 保存调度信息后,会调用 runtime.schedule() 进入调度循环,寻找一个可执行的 G 并执行。
循环调度主逻辑如图5.1所示:

图5.1 循环调度主逻辑
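为了帮助理解这个闭环,下面把调度循环的骨架抽成一段玩具代码(非 runtime 源码):schedule 找到一个 G,execute 执行它,G 结束后又回到 schedule,如此往复,对应 mstart1 -> schedule -> execute -> gogo -> goexit0 -> schedule 的循环:
go
package main

import "fmt"

// toyG 模拟一个可运行的 G,只保留 id 和要执行的函数
type toyG struct {
	id int
	fn func()
}

var runq []toyG // 简化:单一运行队列

func findRunnable() (toyG, bool) {
	if len(runq) == 0 {
		return toyG{}, false
	}
	g := runq[0]
	runq = runq[1:]
	return g, true
}

func schedule() {
	for {
		g, ok := findRunnable()
		if !ok {
			return // 真实 runtime 中 findRunnable 会阻塞,这里用返回模拟 stopm
		}
		execute(g)
	}
}

func execute(g toyG) {
	g.fn() // 真实实现通过 gogo 恢复 G 的现场;G 结束后经 goexit0 回到 schedule
}

func main() {
	for i := 1; i <= 3; i++ {
		i := i
		runq = append(runq, toyG{id: i, fn: func() { fmt.Println("run g", i) }})
	}
	schedule()
}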
runtime.schedule 函数
runtime.schedule() 函数的逻辑是:
go
// One round of scheduler: find a runnable goroutine and execute it.
// Never returns.
func schedule() {
mp := getg().m
if mp.locks != 0 {
throw("schedule: holding locks")
}
if mp.lockedg != 0 {
stoplockedm()
execute(mp.lockedg.ptr(), false) // Never returns.
}
// We should not schedule away from a g that is executing a cgo call,
// since the cgo call is using the m's g0 stack.
if mp.incgo {
throw("schedule: in cgo")
}
top:
pp := mp.p.ptr()
pp.preempt = false
// Safety check: if we are spinning, the run queue should be empty.
// Check this before calling checkTimers, as that might call
// goready to put a ready goroutine on the local run queue.
// 如果当前 M 在自旋,其 P 的本地运行队列应该为空,否则应该抛出错误
if mp.spinning && (pp.runnext != 0 || pp.runqhead != pp.runqtail) {
throw("schedule: spinning with local work")
}
// 阻塞式查找可用 G
gp, inheritTime, tryWakeP := findRunnable() // blocks until work is available
// findRunnable may have collected an allp snapshot. The snapshot is
// only required within findRunnable. Clear it to all GC to collect the
// slice.
mp.clearAllpSnapshot()
if debug.dontfreezetheworld > 0 && freezing.Load() {
// See comment in freezetheworld. We don't want to perturb
// scheduler state, so we didn't gcstopm in findRunnable, but
// also don't want to allow new goroutines to run.
//
// Deadlock here rather than in the findRunnable loop so if
// findRunnable is stuck in a loop we don't perturb that
// either.
lock(&deadlock)
lock(&deadlock)
}
// This thread is going to run a goroutine and is not spinning anymore,
// so if it was marked as spinning we need to reset it now and potentially
// start a new spinning M.
// M 这时候一定是获取到了G
// 如果 M 是自旋状态,重置其状态到非自旋
if mp.spinning {
resetspinning()
}
if sched.disable.user && !schedEnabled(gp) {
// Scheduling of this goroutine is disabled. Put it on
// the list of pending runnable goroutines for when we
// re-enable user scheduling and look again.
lock(&sched.lock)
if schedEnabled(gp) {
// Something re-enabled scheduling while we
// were acquiring the lock.
unlock(&sched.lock)
} else {
sched.disable.runnable.pushBack(gp)
unlock(&sched.lock)
goto top
}
}
// If about to schedule a not-normal goroutine (a GCworker or tracereader),
// wake a P if there is one.
if tryWakeP {
wakep()
}
if gp.lockedm != 0 {
// Hands off own p to the locked m,
// then blocks waiting for a new p.
startlockedm(gp)
goto top
}
// 执行 G
execute(gp, inheritTime)
}
// Schedules gp to run on the current M.
// If inheritTime is true, gp inherits the remaining time in the
// current time slice. Otherwise, it starts a new time slice.
// Never returns.
//
// Write barriers are allowed because this is called immediately after
// acquiring a P in several places.
//
//go:yeswritebarrierrec
func execute(gp *g, inheritTime bool) {
mp := getg().m
if goroutineProfile.active {
// Make sure that gp has had its stack written out to the goroutine
// profile, exactly as it was when the goroutine profiler first stopped
// the world.
tryRecordGoroutineProfile(gp, nil, osyield)
}
// Assign gp.m before entering _Grunning so running Gs have an M.
mp.curg = gp
gp.m = mp
gp.syncSafePoint = false // Clear the flag, which may have been set by morestack.
casgstatus(gp, _Grunnable, _Grunning)
gp.waitsince = 0
gp.preempt = false
gp.stackguard0 = gp.stack.lo + stackGuard
// 只有当 inheritTime == false(开始一个新的 time slice)时才会 schedtick++
if !inheritTime {
mp.p.ptr().schedtick++
}
// Check whether the profiler needs to be turned on or off.
hz := sched.profilehz
if mp.profilehz != hz {
setThreadCPUProfiler(hz)
}
trace := traceAcquire()
if trace.ok() {
trace.GoStart()
traceRelease(trace)
}
gogo(&gp.sched)
}
runtime.schedule() 函数会从下面几个地方查找待执行的 Goroutine:
1)为了保证公平,当全局运行队列中有待执行的 G 时,通过 schedtick 对 61 取模,每 61 次调度会有一次优先从全局运行队列中查找 G,避免两个 G 通过互相唤醒一直占据 P 的本地队列(下面给出一段模拟代码);
2)调用 runtime.runqget() 函数从 P 本地的运行队列中获取待执行的 G;
3)如果前两种方法都没有找到 G,会通过 runtime.findrunnable() 阻塞式地查找 G;
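模拟 schedtick%61 公平性规则的示意代码如下(非 runtime 源码):本地队列永远非空时,全局队列里饥饿的 G 也能保证在 61 次调度之内被执行:
go
package main

import "fmt"

// 本地队列里两个 G 互相唤醒、永远非空,全局队列里有一个饥饿的 G
func main() {
	globalG := "starving-G"
	scheduled := ""
	for schedtick := 1; ; schedtick++ {
		if schedtick%61 == 0 && globalG != "" {
			scheduled, globalG = globalG, "" // 每 61 次优先从全局队列取
		} else {
			scheduled = "local-G" // 其余时间本地队列总能提供 G
		}
		if scheduled == "starving-G" {
			fmt.Println("全局 G 在第", schedtick, "次调度被执行") // 第 61 次
			break
		}
	}
}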


runtime.globrunqget 函数(老版)
老版本中,runtime.schedule 从全局队列获取 G 调用的是 runtime.globrunqget() 函数:
go
// 从全局队列获取 G
func globrunqget(_p_ *p, max int32) *g {
assertLockHeld(&sched.lock)
// 如果全局队列没有 G,则直接返回
if sched.runqsize == 0 {
return nil
}
// 计算n,表示从全局队列放入本地队列的 G 的个数
n := sched.runqsize/gomaxprocs + 1
if n > sched.runqsize {
n = sched.runqsize
}
// n 不能超过要获取的最大个数 max
if max > 0 && n > max {
n = max
}
// 计算本地队列的一半能不能放下 n 个 G,如果放不下,则 n 设为本地队列长度的一半
if n > int32(len(_p_.runq))/2 {
n = int32(len(_p_.runq)) / 2
}
sched.runqsize -= n
// 拿到全局队列的队头作为返回的 G
gp := sched.runq.pop()
n-- // n计数减 1
// 继续取剩下的 n-1个全局队列 G 放入本地队列
for ; n > 0; n-- {
gp1 := sched.runq.pop()
runqput(_p_, gp1, false)
}
return gp
}
func globrunqget() *g 函数(新版)
只取 1 个 G:如果全局队列空就返回 nil;否则 pop() 一个出来。
go
// Try get a single G from the global runnable queue.
// sched.lock must be held.
func globrunqget() *g {
assertLockHeld(&sched.lock)
if sched.runq.size == 0 {
return nil
}
return sched.runq.pop()
}
func globrunqgetbatch(n int32) (gp *g, q gQueue) 函数(新版)
取一批 G,返回值拆成两部分:
- gp:第一个要立刻运行的 G;
- q:额外取到的一串 G(放进一个 gQueue),通常会再整批放回当前 P 的本地队列(runqputbatch)。
go
// Try get a batch of G's from the global runnable queue.
// sched.lock must be held.
func globrunqgetbatch(n int32) (gp *g, q gQueue) {
assertLockHeld(&sched.lock)
if sched.runq.size == 0 {
return
}
n = min(n, sched.runq.size, sched.runq.size/gomaxprocs+1)
gp = sched.runq.pop()
n--
for ; n > 0; n-- {
gp1 := sched.runq.pop()
q.pushBack(gp1)
}
return
}
无论新版还是老版,从全局队列批量获取 n 个 G 时,第一个 G 返回给调度器去执行,剩下的 n-1 个 G 放入 P 的本地队列。其中 n 一般为 全局队列长度/gomaxprocs + 1,含义是平均每个 P 应从全局队列中承担的 G 数量;n 同时不能超过全局队列的现有长度,也不能超过 P 本地队列容量的一半。
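把这个公式抽出来算一遍(示意代码,batchSize 为笔者自拟的函数名):
go
package main

import "fmt"

// 按 globrunqget / globrunqgetbatch 的逻辑计算批量大小:
// n = runqsize/gomaxprocs + 1,再依次用 runqsize、max、本地队列容量的一半封顶
func batchSize(runqsize, gomaxprocs, localCap, max int32) int32 {
	n := runqsize/gomaxprocs + 1
	if n > runqsize {
		n = runqsize
	}
	if max > 0 && n > max {
		n = max
	}
	if n > localCap/2 {
		n = localCap / 2
	}
	return n
}

func main() {
	// 全局队列 100 个 G,4 个 P,本地队列容量 256:
	// 平均每个 P 承担 100/4+1 = 26 个,远小于 128 的上限
	fmt.Println(batchSize(100, 4, 256, 0)) // 26
}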
runtime.runqget 函数
runtime.schedule() 函数调用 runtime.runqget() 函数从 P 本地的运行队列中获取待执行的 G:
go
// Get g from local runnable queue.
// If inheritTime is true, gp should inherit the remaining time in the
// current time slice. Otherwise, it should start a new time slice.
// Executed only by the owner P.
// 从 P 本地队列中获取 G
func runqget(pp *p) (gp *g, inheritTime bool) {
// 如果 P 有一个 runnext,则它就是下一个要执行的 G.
// If there's a runnext, it's the next G to run.
next := pp.runnext
// If the runnext is non-0 and the CAS fails, it could only have been stolen by another P,
// because other Ps can race to set runnext to 0, but only the current P can set it to non-0.
// Hence, there's no need to retry this CAS if it fails.
// 如果 runnext 不为 0 而 CAS 失败,则它只可能是被其他 P 偷走了:
// 其他 P 会竞争把 runnext 置为 0,而只有当前 P 能把它设置为非 0,
// 所以 CAS 失败后无需重试
// cas(next, 0):用 CAS(Compare-And-Swap)把 pp.runnext 从 next 原子地改成 0(清空),只有改成功的一方才能拿走它
if next != 0 && pp.runnext.cas(next, 0) {
return next.ptr(), true
}
for {
//从本地环形队列头遍历
h := atomic.LoadAcq(&pp.runqhead) // load-acquire, synchronize with other consumers
t := pp.runqtail
// 头尾指针相等,表示本地队列为空
if t == h {
return nil, false
}
// 获取头部指针指向的 G
gp := pp.runq[h%uint32(len(pp.runq))].ptr()
if atomic.CasRel(&pp.runqhead, h, h+1) { // cas-release, commits consume
return gp, false
}
}
}
本地队列的获取会先看 P 的 runnext 字段,不为空则直接返回其中的 G;如果 runnext 为空,则从本地环形队列的头部取出一个 G 返回。
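runqget 里"读 head、CAS 前移"的消费模式可以抽象成下面的示意代码(非 runtime 源码):多个消费者并发出队时,CAS 失败说明该槽位被别人(比如窃取者)抢走了,重读 head 再试即可:
go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// ring 模拟 p.runq 的消费端:head 用原子操作,tail 在演示中固定不变
type ring struct {
	head atomic.Uint32
	tail uint32
	buf  [8]int
}

func (q *ring) get() (int, bool) {
	for {
		h := q.head.Load()
		if h == q.tail {
			return 0, false // 队列为空
		}
		v := q.buf[h%uint32(len(q.buf))]
		if q.head.CompareAndSwap(h, h+1) {
			return v, true
		}
		// CAS 失败:被并发消费者抢走该槽位,重试
	}
}

func main() {
	q := &ring{}
	for i := 0; i < 8; i++ {
		q.buf[i] = i
	}
	q.tail = 8
	var wg sync.WaitGroup
	var got atomic.Int32
	for w := 0; w < 4; w++ { // 4 个并发消费者竞争出队
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				if _, ok := q.get(); !ok {
					return
				}
				got.Add(1)
			}
		}()
	}
	wg.Wait()
	fmt.Println("total dequeued:", got.Load()) // 8,每个 G 恰好被取走一次
}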
runtime.findrunnable 函数
阻塞式获取 G 的 runtime.findrunnable() 函数的整个逻辑看起来比较繁琐,其实无非是按这个顺序获取 G: local -> global -> netpoll -> steal -> local -> global -> netpoll:
go
// Finds a runnable goroutine to execute.
// Tries to steal from other P's, get g from local or global queue, poll network.
// tryWakeP indicates that the returned goroutine is not normal (GC worker, trace
// reader) so the caller should try to wake a P.
// 找到一个可运行的 G 去执行
// 会从其他 P 的运行队列偷取,从本地或全局队列获取,或从网络轮询器获取
func findRunnable() (gp *g, inheritTime, tryWakeP bool) {
mp := getg().m
// The conditions here and in handoffp must agree: if
// findrunnable would return a G to run, handoffp must start
// an M.
top:
// We may have collected an allp snapshot below. The snapshot is only
// required in each loop iteration. Clear it to all GC to collect the
// slice.
mp.clearAllpSnapshot()
pp := mp.p.ptr()
// 如果在 gc,则休眠当前 m,直到复始后回到 top
if sched.gcwaiting.Load() {
gcstopm()
goto top
}
// 不等于0,说明在安全点
if pp.runSafePointFn != 0 {
runSafePointFn()
}
// now and pollUntil are saved for work stealing later,
// which may steal timers. It's important that between now
// and then, nothing blocks, so these numbers remain mostly
// relevant.
now, pollUntil, _ := pp.timers.check(0, nil)
// Try to schedule the trace reader.
if traceEnabled() || traceShuttingDown() {
gp := traceReader()
if gp != nil {
trace := traceAcquire()
casgstatus(gp, _Gwaiting, _Grunnable)
if trace.ok() {
trace.GoUnpark(gp, 0)
traceRelease(trace)
}
return gp, false, true
}
}
// Try to schedule a GC worker.
if gcBlackenEnabled != 0 {
gp, tnow := gcController.findRunnableGCWorker(pp, now)
if gp != nil {
return gp, false, true
}
now = tnow
}
// Check the global runnable queue once in a while to ensure fairness.
// Otherwise two goroutines can completely occupy the local runqueue
// by constantly respawning each other.
if pp.schedtick%61 == 0 && !sched.runq.empty() {
lock(&sched.lock)
gp := globrunqget()
unlock(&sched.lock)
if gp != nil {
return gp, false, false
}
}
// Wake up the finalizer G.
if fingStatus.Load()&(fingWait|fingWake) == fingWait|fingWake {
if gp := wakefing(); gp != nil {
ready(gp, 0, true)
}
}
// Wake up one or more cleanup Gs.
if gcCleanups.needsWake() {
gcCleanups.wake()
}
if *cgo_yield != nil {
asmcgocall(*cgo_yield, nil)
}
// local runq
// 取本地队列 local runq,如果已经拿到,立刻返回
if gp, inheritTime := runqget(pp); gp != nil {
return gp, inheritTime, false
}
// global runq
// 全局队列 global runq,如果已经拿到,立刻返回
if !sched.runq.empty() {
lock(&sched.lock)
gp, q := globrunqgetbatch(int32(len(pp.runq)) / 2)
unlock(&sched.lock)
if gp != nil {
if runqputbatch(pp, &q); !q.empty() {
throw("Couldn't put Gs into empty local runq")
}
return gp, false, false
}
}
// Poll network.
// This netpoll is only an optimization before we resort to stealing.
// We can safely skip it if there are no waiters or a thread is blocked
// in netpoll already. If there is any kind of logical race with that
// blocked thread (e.g. it has already returned from netpoll, but does
// not set lastpoll yet), this thread will do blocking netpoll below
// anyway.
// We only poll from one thread at a time to avoid kernel contention
// on machines with many cores.
// 从 netpoll 网络轮询器中尝试获取 G,优先级比从其他 P 偷取 G 要高
if netpollinited() && netpollAnyWaiters() && sched.lastpoll.Load() != 0 && sched.pollingNet.Swap(1) == 0 {
list, delta := netpoll(0)
sched.pollingNet.Store(0)
if !list.empty() { // non-blocking
gp := list.pop()
injectglist(&list)
netpollAdjustWaiters(delta)
trace := traceAcquire()
casgstatus(gp, _Gwaiting, _Grunnable)
if trace.ok() {
trace.GoUnpark(gp, 0)
traceRelease(trace)
}
return gp, false, false
}
}
// Spinning Ms: steal work from other Ps.
//
// Limit the number of spinning Ms to half the number of busy Ps.
// This is necessary to prevent excessive CPU consumption when
// GOMAXPROCS>>1 but the program parallelism is low.
// 自旋 M: 从其他 P 中窃取任务 G
// gomaxprocs:总 P 数
// npidle:空闲 P 数
// 所以 gomaxprocs - npidle = 非空闲(忙碌)P 数
// 2*sched.nmspinning.Load() < gomaxprocs-sched.npidle.Load()这个不等式等价于:nmspinning < (busyPs)/2
// 也就是:把"自旋中的 M"的数量限制在"忙碌 P 数量的一半"以内
// 限制自旋 M 数量到忙碌P数量的一半. 避免在 GOMAXPROCS 很大但可并行的工作很少时,过多 M 自旋/偷工作导致 CPU 空转消耗。
if mp.spinning || 2*sched.nmspinning.Load() < gomaxprocs-sched.npidle.Load() {
if !mp.spinning {
mp.becomeSpinning()
}
// 从其他 P 或 timer 中偷取G
gp, inheritTime, tnow, w, newWork := stealWork(now)
if gp != nil {
// Successfully stole.
return gp, inheritTime, false
}
if newWork {
// 可能有新的 timer 或 GC,重新开始
// There may be new timer or GC work; restart to
// discover.
goto top
}
now = tnow
if w != 0 && (pollUntil == 0 || w < pollUntil) {
// Earlier timer to wait for.
pollUntil = w
}
}
// We have nothing to do.
//
// If we're in the GC mark phase, can safely scan and blacken objects,
// and have work to do, run idle-time marking rather than give up the P.
// 没有可运行的 G 了(runnable work 用尽)。
// 如果处于并发 GC 标记阶段且仍有标记任务可做,
// 就启动/运行一个 idle-time mark worker,利用空闲 CPU 做标记,
// 而不是立即交还 P 并让 M 休眠。
if gcBlackenEnabled != 0 && gcMarkWorkAvailable(pp) && gcController.addIdleMarkWorker() {
node := (*gcBgMarkWorkerNode)(gcBgMarkWorkerPool.pop())
if node != nil {
pp.gcMarkWorkerMode = gcMarkWorkerIdleMode
gp := node.gp.ptr()
trace := traceAcquire()
casgstatus(gp, _Gwaiting, _Grunnable)
if trace.ok() {
trace.GoUnpark(gp, 0)
traceRelease(trace)
}
return gp, false, false
}
gcController.removeIdleMarkWorker()
}
// wasm only:
// If a callback returned and no other goroutine is awake,
// then wake event handler goroutine which pauses execution
// until a callback was triggered.
gp, otherReady := beforeIdle(now, pollUntil)
if gp != nil {
trace := traceAcquire()
casgstatus(gp, _Gwaiting, _Grunnable)
if trace.ok() {
trace.GoUnpark(gp, 0)
traceRelease(trace)
}
return gp, false, false
}
if otherReady {
goto top
}
// Before we drop our P, make a snapshot of the allp slice,
// which can change underfoot once we no longer block
// safe-points. We don't need to snapshot the contents because
// everything up to cap(allp) is immutable.
//
// We clear the snapshot from the M after return via
// mp.clearAllpSnapshop (in schedule) and on each iteration of the top
// loop.
// 放弃当前的 P 之前,对 allp 做一个快照
allpSnapshot := mp.snapshotAllp()
// Also snapshot masks. Value changes are OK, but we can't allow
// len to change out from under us.
idlepMaskSnapshot := idlepMask
timerpMaskSnapshot := timerpMask
// return P and block
// 准备归还 p,对调度器加锁
lock(&sched.lock)
// 进入了 gc,回到顶部并停止 m
if sched.gcwaiting.Load() || pp.runSafePointFn != 0 {
unlock(&sched.lock)
goto top
}
// 全局队列中又发现了任务
if !sched.runq.empty() {
// 赶紧偷掉返回
gp, q := globrunqgetbatch(int32(len(pp.runq)) / 2)
unlock(&sched.lock)
if gp == nil {
throw("global runq empty with non-zero runqsize")
}
if runqputbatch(pp, &q); !q.empty() {
throw("Couldn't put Gs into empty local runq")
}
return gp, false, false
}
if !mp.spinning && sched.needspinning.Load() == 1 {
// See "Delicate dance" comment below.
mp.becomeSpinning()
unlock(&sched.lock)
goto top
}
// 归还当前的 p
if releasep() != pp {
throw("findrunnable: wrong p")
}
// 将 p 放入 idle 链表
now = pidleput(pp, now)
// 完成归还,解锁
unlock(&sched.lock)
// Delicate dance: thread transitions from spinning to non-spinning
// state, potentially concurrently with submission of new work. We must
// drop nmspinning first and then check all sources again (with
// #StoreLoad memory barrier in between). If we do it the other way
// around, another thread can submit work after we've checked all
// sources but before we drop nmspinning; as a result nobody will
// unpark a thread to run the work.
//
// This applies to the following sources of work:
//
// * Goroutines added to the global or a per-P run queue.
// * New/modified-earlier timers on a per-P timer heap.
// * Idle-priority GC work (barring golang.org/issue/19112).
//
// If we discover new work below, we need to restore m.spinning as a
// signal for resetspinning to unpark a new worker thread (because
// there can be more than one starving goroutine).
//
// However, if after discovering new work we also observe no idle Ps
// (either here or in resetspinning), we have a problem. We may be
// racing with a non-spinning M in the block above, having found no
// work and preparing to release its P and park. Allowing that P to go
// idle will result in loss of work conservation (idle P while there is
// runnable work). This could result in complete deadlock in the
// unlikely event that we discover new work (from netpoll) right as we
// are racing with _all_ other Ps going idle.
//
// We use sched.needspinning to synchronize with non-spinning Ms going
// idle. If needspinning is set when they are about to drop their P,
// they abort the drop and instead become a new spinning M on our
// behalf. If we are not racing and the system is truly fully loaded
// then no spinning threads are required, and the next thread to
// naturally become spinning will clear the flag.
//
// Also see "Worker thread parking/unparking" comment at the top of the
// file.
// 这里要非常小心: 线程从自旋到非自旋状态的转换,可能与新 Goroutine 的提交同时发生
wasSpinning := mp.spinning
if mp.spinning {
// M 即将睡眠,状态不再是 spinning
mp.spinning = false
if sched.nmspinning.Add(-1) < 0 {
throw("findrunnable: negative nmspinning")
}
// Note the for correctness, only the last M transitioning from
// spinning to non-spinning must perform these rechecks to
// ensure no missed work. However, the runtime has some cases
// of transient increments of nmspinning that are decremented
// without going through this path, so we must be conservative
// and perform the check on all spinning Ms.
//
// See https://go.dev/issue/43997.
// Check global and P runqueues again.
lock(&sched.lock)
if !sched.runq.empty() {
pp, _ := pidlegetSpinning(0)
if pp != nil {
gp, q := globrunqgetbatch(int32(len(pp.runq)) / 2)
unlock(&sched.lock)
if gp == nil {
throw("global runq empty with non-zero runqsize")
}
if runqputbatch(pp, &q); !q.empty() {
throw("Couldn't put Gs into empty local runq")
}
acquirep(pp)
mp.becomeSpinning()
return gp, false, false
}
}
unlock(&sched.lock)
// 再次检查所有的 runqueue
pp := checkRunqsNoP(allpSnapshot, idlepMaskSnapshot)
if pp != nil {
acquirep(pp)
mp.becomeSpinning()
goto top
}
// Check for idle-priority GC work again.
// 再次检查 idle-priority GC work,和上面重新找 runqueue 的逻辑类似
pp, gp := checkIdleGCNoP()
if pp != nil {
acquirep(pp)
mp.becomeSpinning()
// Run the idle worker.
pp.gcMarkWorkerMode = gcMarkWorkerIdleMode
trace := traceAcquire()
casgstatus(gp, _Gwaiting, _Grunnable)
if trace.ok() {
trace.GoUnpark(gp, 0)
traceRelease(trace)
}
return gp, false, false
}
// Finally, check for timer creation or expiry concurrently with
// transitioning from spinning to non-spinning.
//
// Note that we cannot use checkTimers here because it calls
// adjusttimers which may need to allocate memory, and that isn't
// allowed when we don't have an active P.
// 最后, 检查 timer creation
pollUntil = checkTimersNoP(allpSnapshot, timerpMaskSnapshot, pollUntil)
}
// We don't need allp anymore at this pointer, but can't clear the
// snapshot without a P for the write barrier..
// Poll network until next timer.
// 再次检查 netpoll 网络轮询器,和上面重新找 runqueue 的逻辑类似
if netpollinited() && (netpollAnyWaiters() || pollUntil != 0) && sched.lastpoll.Swap(0) != 0 {
sched.pollUntil.Store(pollUntil)
if mp.p != 0 {
throw("findrunnable: netpoll with p")
}
if mp.spinning {
throw("findrunnable: netpoll with spinning")
}
delay := int64(-1)
if pollUntil != 0 {
if now == 0 {
now = nanotime()
}
delay = pollUntil - now
if delay < 0 {
delay = 0
}
}
if faketime != 0 {
// When using fake time, just poll.
delay = 0
}
list, delta := netpoll(delay) // block until new work is available
// Refresh now again, after potentially blocking.
now = nanotime()
sched.pollUntil.Store(0)
sched.lastpoll.Store(now)
if faketime != 0 && list.empty() {
// Using fake time and nothing is ready; stop M.
// When all M's stop, checkdead will call timejump.
stopm()
goto top
}
lock(&sched.lock)
pp, _ := pidleget(now)
unlock(&sched.lock)
if pp == nil {
injectglist(&list)
netpollAdjustWaiters(delta)
} else {
acquirep(pp)
if !list.empty() {
gp := list.pop()
injectglist(&list)
netpollAdjustWaiters(delta)
trace := traceAcquire()
casgstatus(gp, _Gwaiting, _Grunnable)
if trace.ok() {
trace.GoUnpark(gp, 0)
traceRelease(trace)
}
return gp, false, false
}
if wasSpinning {
mp.becomeSpinning()
}
goto top
}
} else if pollUntil != 0 && netpollinited() {
pollerPollUntil := sched.pollUntil.Load()
if pollerPollUntil == 0 || pollerPollUntil > pollUntil {
netpollBreak()
}
}
// 真的什么都没找到,暂止当前的 m
stopm()
goto top
}
runtime.findrunnable 函数的主要工作是:
1)首先检查是否正在进行 GC,如果是则休眠当前的 M ;
2)尝试从本地队列中取 G,如果取到,则直接返回,否则继续从全局队列中找 G,如果找到则直接返回;
3)检查 netpoll 网络轮询器中是否有 G,如果有,则直接返回;
4)如果此时仍然无法找到 G,则从其他 P 的本地队列中偷取;从其他 P 本地队列偷取的工作会执行四轮,如果找到 G,则直接返回;
5)所有的可能性都尝试过了,在准备休眠 M 之前,还要进行额外的检查;
6)首先检查此时是否是 GC mark 阶段,如果是,则直接返回 mark 阶段的 G;
7)如果仍然没有,则对当前的 P 进行快照,准备对调度器进行加锁;
8)当调度器被锁住后,仍然还需再次检查这段时间里是否有进入 GC,如果已经进入了 GC,则回到第一步,阻塞 M 并休眠;
9)如果又在全局队列中发现了 G,则直接返回;
10)当调度器被锁住后,我们彻底找不到任务了,则归还释放当前的 P,将其放入 idle 链表中,并解锁调度器;
11)当 M、P 已经解绑后,我们需要将 M 的状态切换出自旋状态,并减少 nmspinning;
12)此时仍然需要重新检查所有的队列,如果我们又在全局队列中发现了 g,则直接返回;
13)还需要再检查是否存在 poll 网络的 G,如果有,则直接返回;
14)什么也没找到,休眠当前的 M。
netpoll 里"就绪"的其实也是可运行的 G:大量 goroutine 会因为网络 I/O(读写 socket)被 gopark 挂到 netpoller 上,一旦内核告诉 runtime "这个 fd 可读/可写了",这些 G 就从 _Gwaiting 变成 _Grunnable。findRunnable 的任务就是"找一个能跑的 G",所以它必须把 网络 I/O 唤醒的 G 也纳入来源。
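下面是一个最小可运行示例,感受网络 I/O 与调度的衔接:goroutine 阻塞在 Accept/Read 上时并不会占住 M,而是被挂到 netpoller 上;对端写入数据后,它重新变为 _Grunnable,由 findRunnable 像普通 G 一样取出执行:
go
package main

import (
	"fmt"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	done := make(chan struct{})
	go func() {
		defer close(done)
		conn, err := ln.Accept() // 阻塞:该 G 被挂到 netpoller 上等待连接
		if err != nil {
			return
		}
		defer conn.Close()
		buf := make([]byte, 16)
		n, _ := conn.Read(buf) // 阻塞:等待 fd 可读,期间不占用 M
		fmt.Println("server got:", string(buf[:n]))
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	conn.Write([]byte("ping")) // 数据到达后,阻塞的 G 重新变为 _Grunnable
	conn.Close()
	<-done
}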
runtime.findrunnable 函数的逻辑如图 5.2 所示:

图5.2 runtime.findrunnable()函数逻辑
如果调度循环函数 runtime.schedule() 通过 runtime.globrunqget() 从全局队列、通过 runtime.runqget() 从 P 本地队列、或通过 runtime.findrunnable() 从各个来源获取到了一个可执行的 G,就会调用 runtime.execute() 函数去执行它:execute 通过 runtime.gogo() 把 G 调度到当前线程上真正开始执行;G 运行结束后会返回到 runtime.goexit(),依次进入 runtime.goexit1() 和 runtime.goexit0(),最后在 runtime.goexit0() 的结尾再次调用 runtime.schedule(),进入下一次调度循环。

6. 总结
总结的内容已经放在了开头的结论中了。
最近听到一句话:任何领域的顶尖高手,都是在花费大量时间和身心投入后,达到了用灵魂触碰到更高维度的真实存在的境界,而不是在用头脑在思考和工作,因此作出来的产品都极具美感、实用性和创造性,就好像偷取了上帝的创意一样。
在 Go 调度器的底层原理的学习中,不仅需要亲自花时间去分析源码的细节,更加要大量阅读 Go 开发者的文章,需要用心体会机制设计背后的原因。
Reference
Go语言设计与实现6.5节调度器
https://draveness.me/golang/docs/part3-runtime/ch06-concurrency/golang-goroutine/#g
Go语言原本6.3节MPG模型与并发调度单元
https://golang.design/under-the-hood/zh-cn/part2runtime/ch06sched/mpg/
Go调度器系列(3)图解调度原理
https://lessisbetter.site/2019/04/04/golang-scheduler-3-principle-with-graph/
Golang的协程调度器原理及GMP设计思想
https://www.yuque.com/aceld/golang/srxd6d
详解Go语言调度循环源码实现
https://www.luozhiyun.com/archives/448
golang源码分析之协程调度器底层实现( G、M、P )
https://blog.csdn.net/qq_25870633/article/details/83445946
【译】 Scheduling In Go Part I - OS Scheduler
https://blog.lever.wang/golang/os_schedule/
【译】 Scheduling In Go Part II - Go Scheduler
https://blog.lever.wang/golang/go_schedule/
深入 golang -- GMP调度
https://zhuanlan.zhihu.com/p/502740833
深度解密 Go 语言之 scheduler
https://qcrao.com/post/dive-into-go-scheduler/