基于 Linux 6.8.12 内核源码,一步一步追踪用户线程从创建到执行的完整路径
目录

- 引言:用户线程的"前世今生"
- 用户态起点:clone3 调用与 glibc 的封装
- 内核初吻:clone3 系统调用与参数校验
- 核心缔造者:kernel_clone 与 copy_process
- 栈上的艺术:copy_thread 伪造新世界
- 唤醒新生命:wake_up_new_task 与调度器初识
- 调度器的选择:__schedule 与 context_switch
- 栈之舞:__switch_to_asm 切换内核栈
- 新线程的"第一句话":ret_from_fork_asm
- 最后的归途:swapgs_restore_regs_and_return_to_usermode
- iretq 与 sysretq:两条返回用户态的神谕
- 用户态的最后一公里:L(thread_start) 调用 start_routine
- 总结:一次系统调用背后的宏大叙事
- 参考文献与延伸阅读
1. 引言:用户线程的"前世今生"
在 Linux 系统中,每一个用户线程都像是操作系统这座大城里的一位"居民"。同一进程里的线程们合住一套"房产"(内存空间)、共用一批"工具"(文件描述符),但每人都有一套独立的"随身家当"(寄存器状态和栈)。然而,这些居民并非凭空产生,它们必须通过一个称为 clone(或更新的 clone3)的系统调用,由父进程"生育"出来。
当我们在 C 程序中调用 pthread_create,或者在 Python 中启动一个 threading.Thread,最终都会落入 Linux 内核的 clone3(或 clone)系统调用。内核会一丝不苟地复制或共享父进程的资源,为新生儿搭建好内核栈,伪造一个完美的"出身背景",然后将它交给调度器。调度器在某个合适的时机,会让这个新线程第一次站上 CPU,执行它从父辈那里继承来的第一条指令。从此,它就开始了自己独立的人生。
你是不是曾经好奇过:新线程的第一个函数 start_routine 到底是怎么被调用的?为什么我们感觉不到任何"魔法"?今天,我们就带着 Linux 6.8.12 的内核源码,从 clone3 系统调用入口开始,一步一个脚印,追踪到用户态的 start_routine 被执行。你将看到内核如何在幕后精心构建一个"谎言":通过伪造寄存器和栈帧,让新线程第一次被调度时就仿佛一直在运行,然后优雅地返回用户空间,执行用户指定的函数。
2. 用户态起点:clone3 调用与 glibc 的封装
我们先从用户态的角度看起。用户程序通常不直接发起系统调用(虽然也可以),而是经由 C 标准库 glibc 的封装;线程创建走的正是 glibc 内部的 __clone3 包装函数。它的原型(简化)如下:
```c
int clone3(struct clone_args *cl_args, size_t size, int (*func)(void *), void *arg);
```
注意:真正的系统调用 clone3 只有两个参数(cl_args 和 size),而 glibc 为了方便用户传递线程入口函数,额外增加了 func 和 arg。因此,glibc 需要在这两个参数被内核"看到"之前,妥善保护好它们。
x86-64 架构下的函数调用参数传递规则(System V AMD64 ABI)规定:前六个整数/指针参数依次通过 rdi, rsi, rdx, rcx, r8, r9 传递。因此:

- rdi = cl_args 指针
- rsi = size
- rdx = func
- rcx = arg
glibc 的汇编实现(sysdeps/unix/sysv/linux/x86_64/clone3.S)会做以下关键工作:
```assembly
ENTRY(__clone3)
        /* 参数检查:func 不能为空 */
        testq   %rdx, %rdx
        jz      L(invalid)       /* L(invalid):返回 -EINVAL(此处从略) */

        /* 将 arg 保存到一个不会被系统调用破坏的寄存器 */
        movq    %rcx, %r8

        /* 执行系统调用 */
        movl    $__NR_clone3, %eax
        syscall

        /* 系统调用返回后,检查返回值 */
        testq   %rax, %rax
        jl      L(error)         /* 如果 < 0,出错跳转 */
        jz      L(thread_start)  /* 如果 == 0,表示子线程,跳转到真正的执行入口 */

        /* 父线程直接返回 */
        retq

L(thread_start):
        /* 子线程从这里开始执行 */
        xorl    %ebp, %ebp       /* 清空帧指针,标记最外层帧 */
        movq    %r8, %rdi        /* 将之前保存的 arg 作为第一个参数 */
        call    *%rdx            /* 调用 func(arg) */

        /* func 返回后,调用 exit 系统调用结束线程 */
        movq    %rax, %rdi
        movl    $__NR_exit, %eax
        syscall
        /* 不会执行到这里 */
END(__clone3)
```
关键点在于:

- func 和 arg 被 glibc 分别放在 rdx 和 r8 中,然后才陷入内核。
- 内核入口会把这些通用寄存器连同其他寄存器一起保存进 pt_regs,返回用户态时原样恢复,因此它们的值不会被破坏(syscall 指令本身只会覆盖 rcx 和 r11,这正是 arg 要从 rcx 挪到 r8 的原因)。
- 子线程从内核返回后,rip 落在 syscall 之后的分支上,此时 rdx 中仍然是 func,r8 中是 arg,于是可以顺利调用用户提供的线程入口函数。
所以,start_routine 被调用的"秘密"其实在用户态就已经定下了。内核只负责让子线程带着 ax = 0 回到 syscall 之后的那条指令,剩下的分流是 glibc 自己安排好的。
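在继续深入内核之前,可以用一个小实验直观感受这个"父子分流"。下面是一个示意程序(假设内核和头文件支持 SYS_clone3):它直接通过 syscall(SYS_clone3, ...) 以 fork 语义创建子任务,不共享地址空间、不换栈,因此可以安全地用 C 写分支;真正的线程创建(CLONE_VM + 新栈)则必须像 glibc 那样用汇编在新栈上起跳。

```c
#define _GNU_SOURCE
#include <linux/sched.h>   /* struct clone_args */
#include <sys/syscall.h>   /* SYS_clone3 */
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>
#include <signal.h>
#include <stdio.h>

int main(void)
{
    struct clone_args args;
    memset(&args, 0, sizeof(args));
    args.exit_signal = SIGCHLD;          /* 不设 CLONE_* 标志,等价于 fork() 语义 */

    long ret = syscall(SYS_clone3, &args, sizeof(args));
    if (ret < 0) {
        perror("clone3");
        return 1;
    }
    if (ret == 0) {                      /* 对应汇编里 jz L(thread_start) 的那个分支 */
        printf("child: clone3 returned 0, pid=%d\n", getpid());
        _exit(0);
    }
    printf("parent: child pid=%ld\n", ret);
    waitpid((pid_t)ret, NULL, 0);
    return 0;
}
```

父子进程执行的是同一段 syscall 之后的代码,唯一的区别就是返回值:这正是 glibc 在 __clone3 里用 testq/jz 做分流的全部依据。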
但是,内核又是如何让子线程第一次返回用户态时,恰好从 syscall 的下一条指令接着执行的呢?这就需要我们深入内核的 copy_thread 和调度器返回路径了。
3. 内核初吻:clone3 系统调用与参数校验
当用户调用 syscall 指令后,CPU 会切换到内核态,并根据 MSR_LSTAR 中预存的地址跳转到 entry_SYSCALL_64(定义在 arch/x86/entry/entry_64.S)。我们来分析这个入口的代码:
```assembly
SYM_CODE_START(entry_SYSCALL_64)
        swapgs
        /* tss.sp2 作为临时存储 */
        movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
        SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
        movq    PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp
        ...
        pushq   $__USER_DS                        /* pt_regs->ss */
        pushq   PER_CPU_VAR(cpu_tss_rw + TSS_sp2) /* pt_regs->sp */
        pushq   %r11                              /* pt_regs->flags */
        pushq   $__USER_CS                        /* pt_regs->cs */
        pushq   %rcx                              /* pt_regs->ip */
        pushq   %rax                              /* pt_regs->orig_ax */
        PUSH_AND_CLEAR_REGS rax=$-ENOSYS
        ...
        call    do_syscall_64
        ...
```
这段代码做了几件重要的事情:

- 切换 GS 基址:swapgs 将用户态与内核态的 GS base 互换,以便访问 per-CPU 变量。
- 切换到内核栈:从 pcpu_hot.top_of_stack 取出当前 CPU 的内核栈顶,让 rsp 指向它。此时栈是空的。
- 构造 pt_regs 结构:将用户态的 SS、RSP、RFLAGS、CS、RIP 以及系统调用号等压栈,形成一个标准的 pt_regs 结构(定义在 arch/x86/include/asm/ptrace.h)。这个结构保存了用户现场的完整快照。
- 调用 do_syscall_64:这是通用的 64 位系统调用分发函数,它根据 rax 中的系统调用号调用对应的内核实现。对于 clone3,rax 为 __NR_clone3,对应的表项是 __x64_sys_clone3(由 SYSCALL_DEFINE2 宏展开生成,最终进入 __do_sys_clone3)。
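为了后面理解 POP_REGS 和 iretq 的工作方式,这里先给出 x86-64 下 pt_regs 的完整布局(arch/x86/include/asm/ptrace.h)。字段按内存地址从低到高排列,末尾的 ip/cs/flags/sp/ss 五个字段正好就是硬件中断帧的格式:

```c
struct pt_regs {
	unsigned long r15;
	unsigned long r14;
	unsigned long r13;
	unsigned long r12;
	unsigned long bp;
	unsigned long bx;
	/* 以上为 callee-saved;以下在 C 函数调用中可被破坏 */
	unsigned long r11;
	unsigned long r10;
	unsigned long r9;
	unsigned long r8;
	unsigned long ax;
	unsigned long cx;
	unsigned long dx;
	unsigned long si;
	unsigned long di;
	unsigned long orig_ax;	/* 系统调用号 */
	/* 以下五项与 iretq 期望的硬件帧完全一致 */
	unsigned long ip;
	unsigned long cs;
	unsigned long flags;
	unsigned long sp;
	unsigned long ss;
};
```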
do_syscall_64 的简化形式如下:
```c
#ifdef CONFIG_X86_64
/* 简化版:真实代码还要处理 x32 调用号与无效调用号 */
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
{
	nr = syscall_enter_from_user_mode(regs, nr);

	instrumentation_begin();
	if ((unsigned int)nr < NR_syscalls)
		regs->ax = sys_call_table[nr](regs);
	instrumentation_end();

	syscall_exit_to_user_mode(regs);
}
#endif
```
它会从 sys_call_table 中取出 clone3 的处理函数,也就是 __x64_sys_clone3。
4. 核心缔造者:kernel_clone 与 copy_process
__x64_sys_clone3 在内核中被定义为 SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)。它的主要任务:
- 从用户空间安全地拷贝 clone_args 结构:由 copy_clone_args_from_user 完成,并填充内核侧的 struct kernel_clone_args。
- 检查标志合法性(clone3_args_valid)。
- 调用 kernel_clone(&kargs)。
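对应的实现非常短(kernel/fork.c):

```c
SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
{
	int err;
	struct kernel_clone_args kargs;
	pid_t set_tid[MAX_PID_NS_LEVEL];

	kargs.set_tid = set_tid;

	/* 带版本兼容地从用户空间拷贝 clone_args */
	err = copy_clone_args_from_user(&kargs, uargs, size);
	if (err)
		return err;

	if (!clone3_args_valid(&kargs))
		return -EINVAL;

	return kernel_clone(&kargs);
}
```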
kernel_clone 定义在 kernel/fork.c 中,是 fork/clone/vfork 三大件的中枢。它调用了 copy_process,然后进行收尾工作(如记录 PID、处理 vfork 等待等)。我们重点关注 copy_process,因为它负责创建新的 task_struct 并设置好所有资源。
copy_process 是一个高达几百行的函数,它做了:
- 复制或共享内存描述符(mm)、文件系统信息(fs)、文件描述符表(files)、信号处理等。
- 为新任务分配一个新的 pid。
- 调用 copy_thread,初始化新任务的内核栈和硬件上下文。
copy_thread 是与体系结构相关的函数,在 arch/x86/kernel/process.c 中实现。正是这个函数,为新线程的首次登场铺好了舞台。
5. 栈上的艺术:copy_thread 伪造新世界
copy_thread 接收 struct task_struct *p 和 struct kernel_clone_args *args。它的职责是:

- 在子任务的内核栈上准备好 pt_regs 结构和 inactive_task_frame 结构。
- 设置 p->thread.sp 指向新内核栈的适当位置。
- 决定子任务第一次被调度时从何处开始执行(ret_from_fork_asm)。
下面我们深入阅读 copy_thread 的关键部分(基于 Linux 6.8.12 源码):
```c
int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
	unsigned long clone_flags = args->flags;
	unsigned long sp = args->stack;	// 用户指定的新栈;pthread_create 会传入预先分配好的线程栈,fork 则为 0
	unsigned long tls = args->tls;
	struct inactive_task_frame *frame;
	struct fork_frame *fork_frame;
	struct pt_regs *childregs;
	unsigned long new_ssp;
	int ret = 0;

	// 获取子进程的 pt_regs 位置(位于内核栈顶部)
	childregs = task_pt_regs(p);

	// fork_frame 包含 inactive_task_frame 和紧随其后的 pt_regs
	fork_frame = container_of(childregs, struct fork_frame, regs);
	frame = &fork_frame->frame;

	// 设置 inactive_task_frame 中的 bp 和 ret_addr
	frame->bp = encode_frame_pointer(childregs);
	frame->ret_addr = (unsigned long) ret_from_fork_asm;	// ★ 关键!

	// 将子进程的内核栈指针指向 fork_frame(位于 pt_regs 之下)
	p->thread.sp = (unsigned long) fork_frame;

	// ... 其他初始化代码(FPU、TLS 等)...

	// 如果是内核线程 (PF_KTHREAD)
	if (unlikely(p->flags & PF_KTHREAD)) {
		memset(childregs, 0, sizeof(struct pt_regs));
		kthread_frame_init(frame, args->fn, args->fn_arg);
		return 0;
	}

	// 用户线程(普通进程/线程)
	frame->bx = 0;
	*childregs = *current_pt_regs();	// 复制父进程的 pt_regs
	childregs->ax = 0;			// 子进程的 fork/clone 返回值为 0
	if (sp)
		childregs->sp = sp;		// 如果指定了栈,则替换用户栈

	// ... 处理 CLONE_SETTLS 等 ...
	return ret;
}
```
这里有几个关键点:
5.1 内核栈的布局
内核为每个任务分配一个固定大小的内核栈(THREAD_SIZE,x86_64 下默认为 16KB)。栈从高地址向低地址增长。在栈的最高地址处,存放着一个 struct pt_regs(用户现场);紧挨着 pt_regs 下方的是 struct inactive_task_frame,二者合起来构成 struct fork_frame。
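用一张示意图直观感受这块内核栈(高地址在上,仅示意,不按比例):

```
+---------------------------+ ← 栈的最高地址 (task_top_of_stack)
|      struct pt_regs       | ← childregs = task_pt_regs(p)
| (ss, sp, flags, cs, ip,   |
|  orig_ax, ..., r15)       |
+---------------------------+
| ret_addr = ret_from_fork_ | ┐
|            asm            | │
| bp(伪造的帧指针)          | │ struct inactive_task_frame
| bx = 0                    | │(与 pt_regs 合称 fork_frame)
| r12 ... r15               | ┘ ← p->thread.sp 指向这里
+---------------------------+
|     ......(空闲空间)      |
|    (栈向低地址生长)       |
+---------------------------+
```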
inactive_task_frame 的定义如下:
```c
struct inactive_task_frame {
	unsigned long r15;
	unsigned long r14;
	unsigned long r13;
	unsigned long r12;
	unsigned long bx;
	unsigned long bp;
	unsigned long ret_addr;	// 返回地址
};
```
注意:这个结构的排列顺序与 __switch_to_asm 中的压栈/弹栈顺序完全匹配。
当我们设置 frame->ret_addr = ret_from_fork_asm,并将 p->thread.sp 指向 fork_frame 时,相当于告诉调度器:当这个任务被切换进来时,从栈上依次弹出 r15~rbp 后,最后的 ret 指令将跳转到 ret_from_fork_asm。
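而把这两个结构"粘"在一起的,正是 struct fork_frame(arch/x86/include/asm/switch_to.h):

```c
struct fork_frame {
	struct inactive_task_frame frame;	/* 供 __switch_to_asm 弹出 */
	struct pt_regs regs;			/* 供 iretq 恢复用户现场 */
};
```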
5.2 用户现场 pt_regs 的设置
childregs 是指向子任务 pt_regs 的指针。对于普通的用户线程(非内核线程),copy_thread 直接复制父进程的 pt_regs,然后把 ax 设为 0(使得 clone3 在子线程中"返回"0)。用户栈指针 sp 会被替换为 args->stack(即 pthread_create 事先分配好的用户栈)。而用户指令指针 ip(即 pt_regs->ip)原样保留,仍然指向父进程调用 clone3 时 syscall 指令的下一条指令。
你可能会问:内核为什么不把 childregs->ip 改成 L(thread_start)?答案是:它做不到,也不需要做。L(thread_start) 是 glibc 内部的用户空间地址,内核并不知道它;但内核也根本不需要知道。回忆 entry_SYSCALL_64 的入口代码:syscall 指令把用户态的下一条指令地址保存到 rcx,入口代码再把 rcx 压栈作为 pt_regs->ip。在 glibc 的 __clone3 中,syscall 的下一条指令恰好就是 testq %rax, %rax 这个分支判断。子线程复制了这个 ip,于是它第一次返回用户态后,也会从这个判断开始执行;又因为它的 ax 被强制设为 0,jz 分支命中,它就顺理成章地跳进了 L(thread_start)。
因此,copy_thread 不需要任何特殊处理,只需复制父进程的 pt_regs 并把 ax 清零即可。所有用户态的"分流"逻辑,早已由 glibc 在 syscall 之后安排妥当。这也顺带解释了 rcx 和 r11 的去向:syscall 要用它们保存返回地址和 rflags,所以 glibc 必须提前把 arg 从 rcx 挪到 r8。
6. 唤醒新生命:wake_up_new_task 与调度器初识
当 copy_process 成功返回 task_struct *p 后,kernel_clone 会调用 wake_up_new_task(p)。这个函数定义在 kernel/sched/core.c,负责将新创建的任务放到运行队列中,并让调度器感知到它的存在。
关键代码:
```c
void wake_up_new_task(struct task_struct *p)
{
	struct rq_flags rf;
	struct rq *rq;

	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
	WRITE_ONCE(p->__state, TASK_RUNNING);	// 将状态设为可运行

	// ... 负载均衡相关的代码(选择 CPU)...

	rq = __task_rq_lock(p, &rf);
	activate_task(rq, p, ENQUEUE_NOCLOCK);	// 将任务加入运行队列
	trace_sched_wakeup_new(p);
	wakeup_preempt(rq, p, WF_FORK);		// 检查是否应该抢占当前任务
	task_rq_unlock(rq, p, &rf);
}
```
- WRITE_ONCE(p->__state, TASK_RUNNING):新生儿的"心跳"开始,表示它已经准备好被调度。
- activate_task:通过调度类的 enqueue 回调把任务挂入运行队列(CFS 下是插入红黑树),并更新相关统计信息。
- wakeup_preempt:比较新任务和当前运行任务。如果新任务应当先运行(例如它是实时任务),则设置 TIF_NEED_RESCHED 标志,请求重新调度。
至此,新线程已经进入了调度器的视野,等待合适的时机成为 CPU 的主人。
7. 调度器的选择:__schedule 与 context_switch
调度器会在很多时机被调用,比如:
- 当前进程主动调用 schedule()(如等待 I/O)。
- 时间片用完,时钟中断处理中触发抢占。
- 从系统调用或中断返回到用户空间之前,检查 TIF_NEED_RESCHED 标志。
调度器的主函数是 __schedule,定义在 kernel/sched/core.c:
```c
static void __sched notrace __schedule(unsigned int sched_mode)
{
	struct task_struct *prev, *next;
	unsigned long *switch_count;
	struct rq_flags rf;
	struct rq *rq;
	int cpu;

	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr;
	switch_count = &prev->nivcsw;

	// ... 处理 prev 状态(可能因睡眠而 deactivate)...

	next = pick_next_task(rq, prev, &rf);	// 从运行队列中选出下一个任务

	if (likely(prev != next)) {
		rq->nr_switches++;
		RCU_INIT_POINTER(rq->curr, next);
		++*switch_count;
		rq = context_switch(rq, prev, next, &rf);	// 切换上下文(同时释放 rq 锁)
	} else {
		// ... 无需切换的路径 ...
	}
}
```
pick_next_task 会调用调度类(如 CFS、实时类)的选择函数。对于普通的用户线程,最常用的是 fair_sched_class.pick_next_task,它从红黑树中取一个任务。
当选出 next 后,进入 context_switch:
```c
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf)
{
	prepare_task_switch(rq, prev, next);
	// ... 切换内存管理(switch_mm)...
	switch_to(prev, next, prev);	// 切换寄存器和栈
	barrier();
	return finish_task_switch(prev);
}
```
switch_to 是一个宏,在 x86 下最终调用了 __switch_to_asm(汇编函数)。这就是我们前面多次提到的核心切换点。
8. 栈之舞:__switch_to_asm 切换内核栈
__switch_to_asm 定义在 arch/x86/entry/entry_64.S 中(完整实现见文末源码):
```assembly
SYM_FUNC_START(__switch_to_asm)
        pushq   %rbp
        pushq   %rbx
        pushq   %r12
        pushq   %r13
        pushq   %r14
        pushq   %r15

        /* 切换栈 */
        movq    %rsp, TASK_threadsp(%rdi)
        movq    TASK_threadsp(%rsi), %rsp

        /* 栈保护、RSB 填充等 */

        popq    %r15
        popq    %r14
        popq    %r13
        popq    %r12
        popq    %rbx
        popq    %rbp

        jmp     __switch_to
SYM_FUNC_END(__switch_to_asm)
```
我们来细致分析这段代码执行时发生了什么:

- 在当前任务 prev 的内核栈上压入 %rbp, %rbx, %r12~%r15(callee-saved 寄存器)。按 ABI 约定,这些寄存器必须跨函数调用保持不变,因此切换任务时必须保存。
- 保存当前任务的栈指针:movq %rsp, TASK_threadsp(%rdi),其中 %rdi 是 prev 的 task_struct 指针,TASK_threadsp 是 thread.sp 在结构中的偏移。
- 切换到下一个任务的栈:movq TASK_threadsp(%rsi), %rsp,%rsi 是 next 的 task_struct 指针。现在 rsp 指向 next 的内核栈。对于新线程,这个位置就是 fork_frame(即 p->thread.sp 指向的地方)。
- 恢复 callee-saved 寄存器:从新的栈上弹出 %r15..%rbp。对于新创建的线程,这些值都是我们在 copy_thread 中通过 frame->bx 等成员预设的(bx = 0,bp 为伪造的帧指针等)。
- 跳转到 __switch_to:注意这里是 jmp 而不是 call,所以 __switch_to 执行完毕时的 ret 指令,使用的是当前栈顶上的返回地址。此时 rsp 已经指向 fork_frame.frame.ret_addr(弹出 6 个 8 字节寄存器后,rsp 恰好落在偏移 48 处的 ret_addr 上)。于是,__switch_to 的 ret 会跳转到 ret_from_fork_asm!
这样,新线程第一次被调度时,便从 __switch_to_asm 过渡到了 ret_from_fork_asm,而父线程则是从 __switch_to 正常返回到 context_switch 中继续执行。
9. 新线程的"第一句话":ret_from_fork_asm
ret_from_fork_asm 的定义如下(同样位于 arch/x86/entry/entry_64.S):
```assembly
SYM_CODE_START(ret_from_fork_asm)
        UNWIND_HINT_END_OF_STACK
        ANNOTATE_NOENDBR
        CALL_DEPTH_ACCOUNT

        movq    %rax, %rdi      /* prev */
        movq    %rsp, %rsi      /* regs */
        movq    %rbx, %rdx      /* fn */
        movq    %r12, %rcx      /* fn_arg */
        call    ret_from_fork

        UNWIND_HINT_REGS
        jmp     swapgs_restore_regs_and_return_to_usermode
SYM_CODE_END(ret_from_fork_asm)
```
- %rax 中是 __switch_to 返回的 prev 任务指针(即切换前的任务;新线程本身并不关心它,但 schedule_tail 需要)。
- %rsp 此时恰好指向新线程内核栈上 pt_regs 的起始处:fork_frame 中的 6 个寄存器和 ret_addr 都已弹出,剩下的正是 fork_frame->regs。
- %rbx 和 %r12 在 copy_thread 中被预设:用户线程下为 0,内核线程下分别是线程函数指针和参数。
ret_from_fork 是一个 C 函数,在 Linux 6.8 中定义在 arch/x86/kernel/process.c:
```c
__visible void ret_from_fork(struct task_struct *prev, struct pt_regs *regs,
			     int (*fn)(void *), void *fn_arg)
{
	schedule_tail(prev);	// 调度收尾工作(如释放 rq 锁)

	/* 只有内核线程才带 fn;普通用户线程走不进这个分支 */
	if (unlikely(fn)) {
		fn(fn_arg);
		regs->ax = 0;
	}

	syscall_exit_to_user_mode(regs);
}
```
注意:对于普通用户线程,fn 为 NULL,所以不会执行内核线程函数。
schedule_tail 是必要的:新线程没有走 context_switch 的正常返回路径,必须由它来补上 finish_task_switch 等收尾工作(释放 rq 锁、处理上一个任务的善后)。
最后调用 syscall_exit_to_user_mode(regs)------这个函数标志着我们即将返回用户态。
10. 最后的归途:swapgs_restore_regs_and_return_to_usermode
syscall_exit_to_user_mode(kernel/entry/common.c)完成信号处理、重新调度检查等退出前的收尾工作后返回到 ret_from_fork_asm;随后那条 jmp 指令把控制权交给最终的汇编返回路径 swapgs_restore_regs_and_return_to_usermode。
该标签的完整实现见文末源码,这里截取关键几步:
```assembly
SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
        IBRS_EXIT
        /* 处理 Xen PV 和 PTI 的替代跳转 */
        STACKLEAK_ERASE
        POP_REGS                /* 恢复所有通用寄存器 */
        add     $8, %rsp        /* 跳过 orig_ax */
        UNWIND_HINT_IRET_REGS

.Lswapgs_and_iret:
        swapgs
        CLEAR_CPU_BUFFERS
        /* 断言 IRET 帧指向用户态:检查栈上 CS 的 RPL 位 */
        testb   $3, 8(%rsp)
        jnz     .Lnative_iret
        ud2                     /* 不是用户态?触发非法指令,暴露 bug */

.Lnative_iret:
        iretq                   /* 中断返回:弹出 RIP/CS/RFLAGS/RSP/SS */
```
注意:在执行 POP_REGS 之前,rsp 指向内核栈上 pt_regs 结构的起始位置。POP_REGS 依次弹出 r15, r14, ..., rsi, rdi,然后 add $8, %rsp 跳过 orig_ax 字段。此时,rsp 正好指向 pt_regs 中的 ip 字段(即用户态的 RIP)。而 iretq 期望栈顶依次是 RIP, CS, RFLAGS, RSP, SS,我们的 pt_regs 布局恰好就是这种顺序(pt_regs 本来就是按硬件中断帧设计的)。因此,执行 iretq 后,CPU 会弹出这些值,恢复用户态的指令指针、段寄存器、栈指针和标志位。对于新线程,恢复出的 RIP 就是父进程调用 clone3 时 syscall 的下一条指令,即 __clone3 中的 testq %rax, %rax;随后的 jz 把它送进 L(thread_start)。
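顺带展开 POP_REGS(arch/x86/entry/calling.h,略去注释),可以对照前面 pt_regs 的字段顺序逐一验证:

```assembly
.macro POP_REGS pop_rdi=1
	popq %r15
	popq %r14
	popq %r13
	popq %r12
	popq %rbp
	popq %rbx
	popq %r11
	popq %r10
	popq %r9
	popq %r8
	popq %rax
	popq %rcx
	popq %rdx
	popq %rsi
	.if \pop_rdi
	popq %rdi
	.endif
.endm
```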
至此,新线程从内核态彻底返回到了用户态;两条指令之后,%rip 便落在 L(thread_start)。
11. iretq 与 sysretq:两条返回用户态的神谕
在 swapgs_restore_regs_and_return_to_usermode 中,我们看到使用的是 iretq。但对于普通的系统调用返回(不是中断/异常),内核倾向于使用更快的 sysretq。这两种返回方式在 entry_64.S 中都有体现。为什么这里用了 iretq 而不是 sysretq?
主要原因是:新线程第一次返回用户态,走的根本不是 entry_SYSCALL_64 的快速退出路径。sysretq 只出现在 entry_SYSCALL_64 的尾部:do_syscall_64 返回后,那里会检查本次返回是否满足 sysret 的苛刻条件(返回地址是规范地址、rcx == pt_regs->ip、r11 == pt_regs->flags 等),满足才走 sysretq,否则跳到 swapgs_restore_regs_and_return_to_usermode 用 iretq 兜底。而新线程的内核侧执行是从 ret_from_fork_asm 开始的,它从未经过那个检查点,直接 jmp 到了通用返回路径,因此统一使用 iretq,这样最安全。
另外,无论 sysretq 还是 iretq 都不会替我们切换 GS base,所以返回前都需要一条 swapgs。两者在顶层逻辑上类似,只是底层机制与开销不同:sysretq 更快,但条件苛刻;iretq 更慢,却是万能的。
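上面提到的检查点,正对应 entry_SYSCALL_64 尾部的这两行(见文末源码):do_syscall_64 的返回值放在 %al 中,非零才允许走 sysret 快路径,Xen PV 下则无条件走 iret:

```assembly
	ALTERNATIVE "testb %al, %al; jz swapgs_restore_regs_and_return_to_usermode", \
		"jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
```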
12. 用户态的最后一公里:L(thread_start) 调用 start_routine
终于,用户空间的代码开始执行。根据我们之前分析的 glibc 汇编片段:
```assembly
L(thread_start):
        xorl    %ebp, %ebp
        movq    %r8, %rdi
        call    *%rdx           /* 调用 start_routine(arg) */

        movq    %rax, %rdi
        movl    $SYS_ify(exit), %eax
        syscall
```
此时 rdx 中保存着 func(即用户提供的 start_routine 函数指针),r8 中保存着 arg。因此,call *%rdx 就会执行用户写的线程函数。当该函数返回后,返回值传递给了 exit 系统调用,线程优雅地消亡。
至此,整个链条闭环:用户调用 pthread_create → glibc 封装 → clone3 系统调用 → 内核创建任务 → 调度器选中 → 栈切换 → 返回用户态 → 执行 start_routine。
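作为收束,下面这个最普通的 pthread 程序(示意)就是整条链路的用户侧起点:pthread_create 内部最终走到 __clone3,而 start_routine 正是经由本文描述的全部机制被第一次调用的那个函数。

```c
#include <pthread.h>
#include <stdio.h>

/* 线程入口:正是本文追踪的 start_routine */
static void *start_routine(void *arg)
{
    printf("hello from new thread, arg=%s\n", (const char *)arg);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    /* pthread_create 返回错误码而不是设置 errno */
    if (pthread_create(&tid, NULL, start_routine, "world") != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }
    pthread_join(tid, NULL);
    return 0;
}
```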
13. 总结:一次系统调用背后的宏大叙事
通过追踪 Linux 6.8.12 内核源码,我们完成了一次从用户态到内核态再返回用户态的完整旅程。这一路上,我们看到了:
- 用户态 glibc 的巧思:利用寄存器传递线程入口函数与参数,并在 syscall 之后放置分支逻辑,使得父子可以分流。
- 内核 copy_thread 的精巧构造:伪造 inactive_task_frame,并设置 ret_addr = ret_from_fork_asm,为新线程定制了首次调度时的执行入口。
- 调度器 __schedule 与 __switch_to_asm:通过保存/恢复 callee-saved 寄存器和切换内核栈指针,实现了任务的原子切换。
- ret_from_fork_asm 到 swapgs_restore_regs_and_return_to_usermode:完成最后的调度收尾,并跳转到通用返回路径。
- iretq 指令:根据内核栈上的 pt_regs 结构,恢复用户态的所有寄存器并切换特权级,将控制权交还给用户空间的 L(thread_start)。
- 用户空间的 L(thread_start):最终调用了用户提供的 start_routine。
整个过程涉及了硬件特权级切换、x86-64 汇编、C 语言调度器、进程管理、内存管理等多个模块,是操作系统课本中"进程创建与调度"知识点的真实落地。
为什么如此"麻烦"?因为这是稳定、安全、高效的多任务系统所必须付出的代价。现代操作系统能够在毫秒间完成数千次这样的切换,而你甚至感觉不到它的存在。这正是计算机科学的魅力所在:复杂逻辑被封装在层层抽象背后,留给用户的是一个简单而强大的接口。
## 源码

```
/*
* __schedule() is the main scheduler function.
*
* The main means of driving the scheduler and thus entering this function are:
*
* 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
*
* 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
* paths. For example, see arch/x86/entry_64.S.
*
* To drive preemption between tasks, the scheduler sets the flag in timer
* interrupt handler scheduler_tick().
*
* 3. Wakeups don't really cause entry into schedule(). They add a
* task to the run-queue and that's it.
*
* Now, if the new task added to the run-queue preempts the current
* task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
* called on the nearest possible occasion:
*
* - If the kernel is preemptible (CONFIG_PREEMPTION=y):
*
* - in syscall or exception context, at the next outmost
* preempt_enable(). (this might be as soon as the wake_up()'s
* spin_unlock()!)
*
* - in IRQ context, return from interrupt-handler to
* preemptible context
*
* - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
* then at the next:
*
* - cond_resched() call
* - explicit schedule() call
* - return from syscall or exception to user-space
* - return from interrupt-handler to user-space
*
* WARNING: must be called with preemption disabled!
*/
static void __sched notrace __schedule(unsigned int sched_mode)
{
struct task_struct *prev, *next;
unsigned long *switch_count;
unsigned long prev_state;
struct rq_flags rf;
struct rq *rq;
int cpu;
cpu = smp_processor_id();
rq = cpu_rq(cpu);
prev = rq->curr;
schedule_debug(prev, !!sched_mode);
if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
hrtick_clear(rq);
local_irq_disable();
rcu_note_context_switch(!!sched_mode);
/*
* Make sure that signal_pending_state()->signal_pending() below
* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
* done by the caller to avoid the race with signal_wake_up():
*
* __set_current_state(@state) signal_wake_up()
* schedule() set_tsk_thread_flag(p, TIF_SIGPENDING)
* wake_up_state(p, state)
* LOCK rq->lock LOCK p->pi_state
* smp_mb__after_spinlock() smp_mb__after_spinlock()
* if (signal_pending_state()) if (p->state & @state)
*
* Also, the membarrier system call requires a full memory barrier
* after coming from user-space, before storing to rq->curr.
*/
rq_lock(rq, &rf);
smp_mb__after_spinlock();
/* Promote REQ to ACT */
rq->clock_update_flags <<= 1;
update_rq_clock(rq);
rq->clock_update_flags = RQCF_UPDATED;
switch_count = &prev->nivcsw;
/*
* We must load prev->state once (task_struct::state is volatile), such
* that we form a control dependency vs deactivate_task() below.
*/
prev_state = READ_ONCE(prev->__state);
if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
if (signal_pending_state(prev_state, prev)) {
WRITE_ONCE(prev->__state, TASK_RUNNING);
} else {
prev->sched_contributes_to_load =
(prev_state & TASK_UNINTERRUPTIBLE) &&
!(prev_state & TASK_NOLOAD) &&
!(prev_state & TASK_FROZEN);
if (prev->sched_contributes_to_load)
rq->nr_uninterruptible++;
/*
* __schedule() ttwu()
* prev_state = prev->state; if (p->on_rq && ...)
* if (prev_state) goto out;
* p->on_rq = 0; smp_acquire__after_ctrl_dep();
* p->state = TASK_WAKING
*
* Where __schedule() and ttwu() have matching control dependencies.
*
* After this, schedule() must not care about p->state any more.
*/
deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
if (prev->in_iowait) {
atomic_inc(&rq->nr_iowait);
delayacct_blkio_start();
}
}
switch_count = &prev->nvcsw;
}
next = pick_next_task(rq, prev, &rf);
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
rq->last_seen_need_resched_ns = 0;
#endif
if (likely(prev != next)) {
rq->nr_switches++;
/*
* RCU users of rcu_dereference(rq->curr) may not see
* changes to task_struct made by pick_next_task().
*/
RCU_INIT_POINTER(rq->curr, next);
/*
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
* rq->curr, before returning to user-space.
*
* Here are the schemes providing that barrier on the
* various architectures:
* - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
* switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
* - finish_lock_switch() for weakly-ordered
* architectures where spin_unlock is a full barrier,
* - switch_to() for arm64 (weakly-ordered, spin_unlock
* is a RELEASE barrier),
*/
++*switch_count;
migrate_disable_switch(rq, prev);
psi_sched_switch(prev, next, !task_on_rq_queued(prev));
trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state);
/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
} else {
rq_unpin_lock(rq, &rf);
__balance_callbacks(rq);
raw_spin_rq_unlock_irq(rq);
}
}
/*
* context_switch - switch to the new MM and the new thread's register state.
*/
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next, struct rq_flags *rf)
{
prepare_task_switch(rq, prev, next);
/*
* For paravirt, this is coupled with an exit in switch_to to
* combine the page table reload and the switch backend into
* one hypercall.
*/
arch_start_context_switch(prev);
/*
* kernel -> kernel lazy + transfer active
* user -> kernel lazy + mmgrab_lazy_tlb() active
*
* kernel -> user switch + mmdrop_lazy_tlb() active
* user -> user switch
*
* switch_mm_cid() needs to be updated if the barriers provided
* by context_switch() are modified.
*/
if (!next->mm) { // to kernel
enter_lazy_tlb(prev->active_mm, next);
next->active_mm = prev->active_mm;
if (prev->mm) // from user
mmgrab_lazy_tlb(prev->active_mm);
else
prev->active_mm = NULL;
} else { // to user
membarrier_switch_mm(rq, prev->active_mm, next->mm);
/*
* sys_membarrier() requires an smp_mb() between setting
* rq->curr / membarrier_switch_mm() and returning to userspace.
*
* The below provides this either through switch_mm(), or in
* case 'prev->active_mm == next->mm' through
* finish_task_switch()'s mmdrop().
*/
switch_mm_irqs_off(prev->active_mm, next->mm, next);
lru_gen_use_mm(next->mm);
if (!prev->mm) { // from kernel
/* will mmdrop_lazy_tlb() in finish_task_switch(). */
rq->prev_mm = prev->active_mm;
prev->active_mm = NULL;
}
}
/* switch_mm_cid() requires the memory barriers above. */
switch_mm_cid(rq, prev, next);
prepare_lock_switch(rq, next, rf);
/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev);
barrier();
return finish_task_switch(prev);
}
#define switch_to(prev, next, last) \
do { \
((last) = __switch_to_asm((prev), (next))); \
} while (0)
struct task_struct *__switch_to_asm(struct task_struct *prev,
struct task_struct *next);
/*
* %rdi: prev task
* %rsi: next task
*/
.pushsection .text, "ax"
SYM_FUNC_START(__switch_to_asm)
/*
* Save callee-saved registers
* This must match the order in inactive_task_frame
*/
pushq %rbp
pushq %rbx
pushq %r12
pushq %r13
pushq %r14
pushq %r15
/* switch stack */
movq %rsp, TASK_threadsp(%rdi)
movq TASK_threadsp(%rsi), %rsp
#ifdef CONFIG_STACKPROTECTOR
movq TASK_stack_canary(%rsi), %rbx
movq %rbx, PER_CPU_VAR(fixed_percpu_data) + FIXED_stack_canary
#endif
/*
* When switching from a shallower to a deeper call stack
* the RSB may either underflow or use entries populated
* with userspace addresses. On CPUs where those concerns
* exist, overwrite the RSB with entries which capture
* speculative execution to prevent attack.
*/
FILL_RETURN_BUFFER %r12, RSB_CLEAR_LOOPS, X86_FEATURE_RSB_CTXSW
/* restore callee-saved registers */
popq %r15
popq %r14
popq %r13
popq %r12
popq %rbx
popq %rbp
jmp __switch_to
SYM_FUNC_END(__switch_to_asm)
.popsection
/*
* switch_to(x,y) should switch tasks from x to y.
*
* This could still be optimized:
* - fold all the options into a flag word and test it with a single test.
* - could test fs/gs bitsliced
*
* Kprobes not supported here. Set the probe on schedule instead.
* Function graph tracer not supported too.
*/
__no_kmsan_checks
__visible __notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
struct thread_struct *prev = &prev_p->thread;
struct thread_struct *next = &next_p->thread;
struct fpu *prev_fpu = &prev->fpu;
int cpu = smp_processor_id();
WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
this_cpu_read(pcpu_hot.hardirq_stack_inuse));
if (!test_thread_flag(TIF_NEED_FPU_LOAD))
switch_fpu_prepare(prev_fpu, cpu);
/* We must save %fs and %gs before load_TLS() because
* %fs and %gs may be cleared by load_TLS().
*
* (e.g. xen_load_tls())
*/
save_fsgs(prev_p);
/*
* Load TLS before restoring any segments so that segment loads
* reference the correct GDT entries.
*/
load_TLS(next, cpu);
/*
* Leave lazy mode, flushing any hypercalls made here. This
* must be done after loading TLS entries in the GDT but before
* loading segments that might reference them.
*/
arch_end_context_switch(next_p);
/* Switch DS and ES.
*
* Reading them only returns the selectors, but writing them (if
* nonzero) loads the full descriptor from the GDT or LDT. The
* LDT for next is loaded in switch_mm, and the GDT is loaded
* above.
*
* We therefore need to write new values to the segment
* registers on every context switch unless both the new and old
* values are zero.
*
* Note that we don't need to do anything for CS and SS, as
* those are saved and restored as part of pt_regs.
*/
savesegment(es, prev->es);
if (unlikely(next->es | prev->es))
loadsegment(es, next->es);
savesegment(ds, prev->ds);
if (unlikely(next->ds | prev->ds))
loadsegment(ds, next->ds);
x86_fsgsbase_load(prev, next);
x86_pkru_load(prev, next);
/*
* Switch the PDA and FPU contexts.
*/
raw_cpu_write(pcpu_hot.current_task, next_p);
raw_cpu_write(pcpu_hot.top_of_stack, task_top_of_stack(next_p));
switch_fpu_finish();
/* Reload sp0. */
update_task_stack(next_p);
switch_to_extra(prev_p, next_p);
if (static_cpu_has_bug(X86_BUG_SYSRET_SS_ATTRS)) {
/*
* AMD CPUs have a misfeature: SYSRET sets the SS selector but
* does not update the cached descriptor. As a result, if we
* do SYSRET while SS is NULL, we'll end up in user mode with
* SS apparently equal to __USER_DS but actually unusable.
*
* The straightforward workaround would be to fix it up just
* before SYSRET, but that would slow down the system call
* fast paths. Instead, we ensure that SS is never NULL in
* system call context. We do this by replacing NULL SS
* selectors at every context switch. SYSCALL sets up a valid
* SS, so the only way to get NULL is to re-enter the kernel
* from CPL 3 through an interrupt. Since that can't happen
* in the same task as a running syscall, we are guaranteed to
* context switch between every interrupt vector entry and a
* subsequent SYSRET.
*
* We read SS first because SS reads are much faster than
* writes. Out of caution, we force SS to __KERNEL_DS even if
* it previously had a different non-NULL value.
*/
unsigned short ss_sel;
savesegment(ss, ss_sel);
if (ss_sel != __KERNEL_DS)
loadsegment(ss, __KERNEL_DS);
}
/* Load the Intel cache allocation PQR MSR. */
resctrl_sched_in(next_p);
return prev_p;
}
static __always_inline void __schedule_loop(unsigned int sched_mode)
{
do {
preempt_disable();
__schedule(sched_mode);
sched_preempt_enable_no_resched();
} while (need_resched());
}
asmlinkage __visible void __sched schedule(void)
{
struct task_struct *tsk = current;
#ifdef CONFIG_RT_MUTEXES
lockdep_assert(!tsk->sched_rt_mutex);
#endif
if (!task_is_running(tsk))
sched_submit_work(tsk);
__schedule_loop(SM_NONE);
sched_update_worker(tsk);
}
EXPORT_SYMBOL(schedule);
/*
* Ok, this is the main fork-routine.
*
* It copies the process, and if successful kick-starts
* it and waits for it to finish using the VM if required.
*
* args->exit_signal is expected to be checked for sanity by the caller.
*/
pid_t kernel_clone(struct kernel_clone_args *args)
{
u64 clone_flags = args->flags;
struct completion vfork;
struct pid *pid;
struct task_struct *p;
int trace = 0;
pid_t nr;
/*
* For legacy clone() calls, CLONE_PIDFD uses the parent_tid argument
* to return the pidfd. Hence, CLONE_PIDFD and CLONE_PARENT_SETTID are
* mutually exclusive. With clone3() CLONE_PIDFD has grown a separate
* field in struct clone_args and it still doesn't make sense to have
* them both point at the same memory location. Performing this check
* here has the advantage that we don't need to have a separate helper
* to check for legacy clone().
*/
if ((args->flags & CLONE_PIDFD) &&
(args->flags & CLONE_PARENT_SETTID) &&
(args->pidfd == args->parent_tid))
return -EINVAL;
/*
* Determine whether and which event to report to ptracer. When
* called from kernel_thread or CLONE_UNTRACED is explicitly
* requested, no event is reported; otherwise, report if the event
* for the type of forking is enabled.
*/
if (!(clone_flags & CLONE_UNTRACED)) {
if (clone_flags & CLONE_VFORK)
trace = PTRACE_EVENT_VFORK;
else if (args->exit_signal != SIGCHLD)
trace = PTRACE_EVENT_CLONE;
else
trace = PTRACE_EVENT_FORK;
if (likely(!ptrace_event_enabled(current, trace)))
trace = 0;
}
p = copy_process(NULL, trace, NUMA_NO_NODE, args);
add_latent_entropy();
if (IS_ERR(p))
return PTR_ERR(p);
/*
* Do this prior waking up the new thread - the thread pointer
* might get invalid after that point, if the thread exits quickly.
*/
trace_sched_process_fork(current, p);
pid = get_task_pid(p, PIDTYPE_PID);
nr = pid_vnr(pid);
if (clone_flags & CLONE_PARENT_SETTID)
put_user(nr, args->parent_tid);
if (clone_flags & CLONE_VFORK) {
p->vfork_done = &vfork;
init_completion(&vfork);
get_task_struct(p);
}
if (IS_ENABLED(CONFIG_LRU_GEN_WALKS_MMU) && !(clone_flags & CLONE_VM)) {
/* lock the task to synchronize with memcg migration */
task_lock(p);
lru_gen_add_mm(p->mm);
task_unlock(p);
}
wake_up_new_task(p);
/* forking complete and child started to run, tell ptracer */
if (unlikely(trace))
ptrace_event_pid(trace, pid);
if (clone_flags & CLONE_VFORK) {
if (!wait_for_vfork_done(p, &vfork))
ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
}
put_pid(pid);
return nr;
}
/*
* wake_up_new_task - wake up a newly created task for the first time.
*
* This function will do some initial scheduler statistics housekeeping
* that must be done for every newly created context, then puts the task
* on the runqueue and wakes it.
*/
void wake_up_new_task(struct task_struct *p)
{
struct rq_flags rf;
struct rq *rq;
raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
WRITE_ONCE(p->__state, TASK_RUNNING);
#ifdef CONFIG_SMP
/*
* Fork balancing, do it here and not earlier because:
* - cpus_ptr can change in the fork path
* - any previously selected CPU might disappear through hotplug
*
* Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
* as we're not fully set-up yet.
*/
p->recent_used_cpu = task_cpu(p);
rseq_migrate(p);
__set_task_cpu(p, select_task_rq(p, task_cpu(p), WF_FORK));
#endif
rq = __task_rq_lock(p, &rf);
update_rq_clock(rq);
post_init_entity_util_avg(p);
activate_task(rq, p, ENQUEUE_NOCLOCK);
trace_sched_wakeup_new(p);
wakeup_preempt(rq, p, WF_FORK);
#ifdef CONFIG_SMP
if (p->sched_class->task_woken) {
/*
* Nothing relies on rq->lock after this, so it's fine to
* drop it.
*/
rq_unpin_lock(rq, &rf);
p->sched_class->task_woken(rq, p);
rq_repin_lock(rq, &rf);
}
#endif
task_rq_unlock(rq, p, &rf);
}
asmlinkage void ret_from_fork_asm(void);
__visible void ret_from_fork(struct task_struct *prev, struct pt_regs *regs,
int (*fn)(void *), void *fn_arg);
int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long sp = args->stack;
unsigned long tls = args->tls;
struct inactive_task_frame *frame;
struct fork_frame *fork_frame;
struct pt_regs *childregs;
unsigned long new_ssp;
int ret = 0;
childregs = task_pt_regs(p);
fork_frame = container_of(childregs, struct fork_frame, regs);
frame = &fork_frame->frame;
frame->bp = encode_frame_pointer(childregs);
frame->ret_addr = (unsigned long) ret_from_fork_asm;
p->thread.sp = (unsigned long) fork_frame;
p->thread.io_bitmap = NULL;
p->thread.iopl_warn = 0;
memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
#ifdef CONFIG_X86_64
current_save_fsgs();
p->thread.fsindex = current->thread.fsindex;
p->thread.fsbase = current->thread.fsbase;
p->thread.gsindex = current->thread.gsindex;
p->thread.gsbase = current->thread.gsbase;
savesegment(es, p->thread.es);
savesegment(ds, p->thread.ds);
if (p->mm && (clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM)
set_bit(MM_CONTEXT_LOCK_LAM, &p->mm->context.flags);
#else
p->thread.sp0 = (unsigned long) (childregs + 1);
savesegment(gs, p->thread.gs);
/*
* Clear all status flags including IF and set fixed bit. 64bit
* does not have this initialization as the frame does not contain
* flags. The flags consistency (especially vs. AC) is there
* ensured via objtool, which lacks 32bit support.
*/
frame->flags = X86_EFLAGS_FIXED;
#endif
/*
* Allocate a new shadow stack for thread if needed. If shadow stack,
* is disabled, new_ssp will remain 0, and fpu_clone() will know not to
* update it.
*/
new_ssp = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
if (IS_ERR_VALUE(new_ssp))
return PTR_ERR((void *)new_ssp);
fpu_clone(p, clone_flags, args->fn, new_ssp);
/* Kernel thread ? */
if (unlikely(p->flags & PF_KTHREAD)) {
p->thread.pkru = pkru_get_init_value();
memset(childregs, 0, sizeof(struct pt_regs));
kthread_frame_init(frame, args->fn, args->fn_arg);
return 0;
}
/*
* Clone current's PKRU value from hardware. tsk->thread.pkru
* is only valid when scheduled out.
*/
p->thread.pkru = read_pkru();
frame->bx = 0;
*childregs = *current_pt_regs();
childregs->ax = 0;
if (sp)
childregs->sp = sp;
if (unlikely(args->fn)) {
/*
* A user space thread, but it doesn't return to
* ret_after_fork().
*
* In order to indicate that to tools like gdb,
* we reset the stack and instruction pointers.
*
* It does the same kernel frame setup to return to a kernel
* function that a kernel thread does.
*/
childregs->sp = 0;
childregs->ip = 0;
kthread_frame_init(frame, args->fn, args->fn_arg);
return 0;
}
/* Set a new TLS for the child thread? */
if (clone_flags & CLONE_SETTLS)
ret = set_new_tls(p, tls);
if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
io_bitmap_share(p);
return ret;
}
/*
* A newly forked process directly context switches into this address.
*
* rax: prev task we switched from
* rbx: kernel thread func (NULL for user thread)
* r12: kernel thread arg
*/
.pushsection .text, "ax"
SYM_CODE_START(ret_from_fork_asm)
/*
* This is the start of the kernel stack; even through there's a
* register set at the top, the regset isn't necessarily coherent
* (consider kthreads) and one cannot unwind further.
*
* This ensures stack unwinds of kernel threads terminate in a known
* good state.
*/
UNWIND_HINT_END_OF_STACK
ANNOTATE_NOENDBR // copy_thread
CALL_DEPTH_ACCOUNT
movq %rax, %rdi /* prev */
movq %rsp, %rsi /* regs */
movq %rbx, %rdx /* fn */
movq %r12, %rcx /* fn_arg */
call ret_from_fork
/*
* Set the stack state to what is expected for the target function
* -- at this point the register set should be a valid user set
* and unwind should work normally.
*/
UNWIND_HINT_REGS
jmp swapgs_restore_regs_and_return_to_usermode
SYM_CODE_END(ret_from_fork_asm)
.popsection
__visible void ret_from_fork(struct task_struct *prev, struct pt_regs *regs,
int (*fn)(void *), void *fn_arg)
{
schedule_tail(prev);
/* Is this a kernel thread? */
if (unlikely(fn)) {
fn(fn_arg);
/*
* A kernel thread is allowed to return here after successfully
* calling kernel_execve(). Exit to userspace to complete the
* execve() syscall.
*/
regs->ax = 0;
}
syscall_exit_to_user_mode(regs);
}
__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
{
instrumentation_begin();
__syscall_exit_to_user_mode_work(regs);
instrumentation_end();
exit_to_user_mode();
}
static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs)
{
syscall_exit_to_user_mode_prepare(regs);
local_irq_disable_exit_to_user();
exit_to_user_mode_prepare(regs);
}
/*
* 64-bit SYSCALL instruction entry. Up to 6 arguments in registers.
*
* This is the only entry point used for 64-bit system calls. The
* hardware interface is reasonably well designed and the register to
* argument mapping Linux uses fits well with the registers that are
* available when SYSCALL is used.
*
* SYSCALL instructions can be found inlined in libc implementations as
* well as some other programs and libraries. There are also a handful
* of SYSCALL instructions in the vDSO used, for example, as a
* clock_gettimeofday fallback.
*
* 64-bit SYSCALL saves rip to rcx, clears rflags.RF, then saves rflags to r11,
* then loads new ss, cs, and rip from previously programmed MSRs.
* rflags gets masked by a value from another MSR (so CLD and CLAC
* are not needed). SYSCALL does not save anything on the stack
* and does not change rsp.
*
* Registers on entry:
* rax system call number
* rcx return address
* r11 saved rflags (note: r11 is callee-clobbered register in C ABI)
* rdi arg0
* rsi arg1
* rdx arg2
* r10 arg3 (needs to be moved to rcx to conform to C ABI)
* r8 arg4
* r9 arg5
* (note: r12-r15, rbp, rbx are callee-preserved in C ABI)
*
* Only called from user space.
*
* When user can change pt_regs->foo always force IRET. That is because
* it deals with uncanonical addresses better. SYSRET has trouble
* with them due to bugs in both AMD and Intel CPUs.
*/
SYM_CODE_START(entry_SYSCALL_64)
UNWIND_HINT_ENTRY
ENDBR
swapgs
/* tss.sp2 is scratch space. */
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
movq PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp
SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)
ANNOTATE_NOENDBR
/* Construct struct pt_regs on stack */
pushq $__USER_DS /* pt_regs->ss */
pushq PER_CPU_VAR(cpu_tss_rw + TSS_sp2) /* pt_regs->sp */
pushq %r11 /* pt_regs->flags */
pushq $__USER_CS /* pt_regs->cs */
pushq %rcx /* pt_regs->ip */
SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
pushq %rax /* pt_regs->orig_ax */
PUSH_AND_CLEAR_REGS rax=$-ENOSYS
/* IRQs are off. */
movq %rsp, %rdi
/* Sign extend the lower 32bit as syscall numbers are treated as int */
movslq %eax, %rsi
/* clobbers %rax, make sure it is after saving the syscall nr */
IBRS_ENTER
UNTRAIN_RET
CLEAR_BRANCH_HISTORY
call do_syscall_64 /* returns with IRQs disabled */
/*
* Try to use SYSRET instead of IRET if we're returning to
* a completely clean 64-bit userspace context. If we're not,
* go to the slow exit path.
* In the Xen PV case we must use iret anyway.
*/
ALTERNATIVE "testb %al, %al; jz swapgs_restore_regs_and_return_to_usermode", \
"jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
/*
* We win! This label is here just for ease of understanding
* perf profiles. Nothing jumps here.
*/
syscall_return_via_sysret:
IBRS_EXIT
POP_REGS pop_rdi=0
/*
* Now all regs are restored except RSP and RDI.
* Save old stack pointer and switch to trampoline stack.
*/
movq %rsp, %rdi
movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
UNWIND_HINT_END_OF_STACK
pushq RSP-RDI(%rdi) /* RSP */
pushq (%rdi) /* RDI */
/*
* We are on the trampoline stack. All regs except RDI are live.
* We can do future final exit work right here.
*/
STACKLEAK_ERASE_NOCLOBBER
SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
popq %rdi
popq %rsp
SYM_INNER_LABEL(entry_SYSRETQ_unsafe_stack, SYM_L_GLOBAL)
ANNOTATE_NOENDBR
swapgs
CLEAR_CPU_BUFFERS
sysretq
SYM_INNER_LABEL(entry_SYSRETQ_end, SYM_L_GLOBAL)
ANNOTATE_NOENDBR
int3
SYM_CODE_END(entry_SYSCALL_64)
SYM_INNER_LABEL(native_irq_return_iret, SYM_L_GLOBAL)
ANNOTATE_NOENDBR // exc_double_fault
/*
* This may fault. Non-paranoid faults on return to userspace are
* handled by fixup_bad_iret. These include #SS, #GP, and #NP.
* Double-faults due to espfix64 are handled in exc_double_fault.
* Other faults here are fatal.
*/
iretq
#ifdef CONFIG_X86_ESPFIX64
native_irq_return_ldt:
/*
* We are running with user GSBASE. All GPRs contain their user
* values. We have a percpu ESPFIX stack that is eight slots
* long (see ESPFIX_STACK_SIZE). espfix_waddr points to the bottom
* of the ESPFIX stack.
*
* We clobber RAX and RDI in this code. We stash RDI on the
* normal stack and RAX on the ESPFIX stack.
*
* The ESPFIX stack layout we set up looks like this:
*
* --- top of ESPFIX stack ---
* SS
* RSP
* RFLAGS
* CS
* RIP <-- RSP points here when we're done
* RAX <-- espfix_waddr points here
* --- bottom of ESPFIX stack ---
*/
pushq %rdi /* Stash user RDI */
swapgs /* to kernel GS */
SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi /* to kernel CR3 */
movq PER_CPU_VAR(espfix_waddr), %rdi
movq %rax, (0*8)(%rdi) /* user RAX */
movq (1*8)(%rsp), %rax /* user RIP */
movq %rax, (1*8)(%rdi)
movq (2*8)(%rsp), %rax /* user CS */
movq %rax, (2*8)(%rdi)
movq (3*8)(%rsp), %rax /* user RFLAGS */
movq %rax, (3*8)(%rdi)
movq (5*8)(%rsp), %rax /* user SS */
movq %rax, (5*8)(%rdi)
movq (4*8)(%rsp), %rax /* user RSP */
movq %rax, (4*8)(%rdi)
/* Now RAX == RSP. */
andl $0xffff0000, %eax /* RAX = (RSP & 0xffff0000) */
/*
* espfix_stack[31:16] == 0. The page tables are set up such that
* (espfix_stack | (X & 0xffff0000)) points to a read-only alias of
* espfix_waddr for any X. That is, there are 65536 RO aliases of
* the same page. Set up RSP so that RSP[31:16] contains the
* respective 16 bits of the /userspace/ RSP and RSP nonetheless
* still points to an RO alias of the ESPFIX stack.
*/
orq PER_CPU_VAR(espfix_stack), %rax
SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
swapgs /* to user GS */
popq %rdi /* Restore user RDI */
movq %rax, %rsp
UNWIND_HINT_IRET_REGS offset=8
/*
* At this point, we cannot write to the stack any more, but we can
* still read.
*/
popq %rax /* Restore user RAX */
CLEAR_CPU_BUFFERS
/*
* RSP now points to an ordinary IRET frame, except that the page
* is read-only and RSP[31:16] are preloaded with the userspace
* values. We can now IRET back to userspace.
*/
jmp native_irq_return_iret
#endif
.Lnative_iret:
UNWIND_HINT_IRET_REGS
/*
* Are we returning to a stack segment from the LDT? Note: in
* 64-bit mode SS:RSP on the exception stack is always valid.
*/
#ifdef CONFIG_X86_ESPFIX64
testb $4, (SS-RIP)(%rsp)
jnz native_irq_return_ldt
#endif
.Lswapgs_and_iret:
swapgs
CLEAR_CPU_BUFFERS
/* Assert that the IRET frame indicates user mode. */
testb $3, 8(%rsp)
jnz .Lnative_iret
ud2
#ifdef CONFIG_PAGE_TABLE_ISOLATION
.Lpti_restore_regs_and_return_to_usermode:
POP_REGS pop_rdi=0
/*
* The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS.
* Save old stack pointer and switch to trampoline stack.
*/
movq %rsp, %rdi
movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
UNWIND_HINT_END_OF_STACK
/* Copy the IRET frame to the trampoline stack. */
pushq 6*8(%rdi) /* SS */
pushq 5*8(%rdi) /* RSP */
pushq 4*8(%rdi) /* EFLAGS */
pushq 3*8(%rdi) /* CS */
pushq 2*8(%rdi) /* RIP */
/* Push user RDI on the trampoline stack. */
pushq (%rdi)
/*
* We are on the trampoline stack. All regs except RDI are live.
* We can do future final exit work right here.
*/
STACKLEAK_ERASE_NOCLOBBER
push %rax
SWITCH_TO_USER_CR3 scratch_reg=%rdi scratch_reg2=%rax
pop %rax
/* Restore RDI. */
popq %rdi
jmp .Lswapgs_and_iret
#endif
```