基于 Linux 6.8.12 内核源码,一步一步追踪用户线程从创建到执行的完整路径
目录

- 引言:用户线程的"前世今生"
- 用户态起点:clone3 调用与 glibc 的封装
- 内核初吻:clone3 系统调用与参数校验
- 核心缔造者:kernel_clone 与 copy_process
- 栈上的艺术:copy_thread 伪造新世界
- 唤醒新生命:wake_up_new_task 与调度器初识
- 调度器的选择:__schedule 与 context_switch
- 栈之舞:__switch_to_asm 切换内核栈
- 新线程的"第一句话":ret_from_fork_asm
- 最后的归途:swapgs_restore_regs_and_return_to_usermode
- iretq 与 sysretq:两条返回用户态的神谕
- 用户态的最后一公里:L(thread_start) 调用 start_routine
- 总结:一次系统调用背后的宏大叙事
- 参考文献与延伸阅读
1. 引言:用户线程的"前世今生"
在 Linux 系统中,每一个用户线程都像是操作系统这座大城里的一位"居民"。同一进程里的线程们合住一套"房产"(内存空间)、共用一批"工具"(文件描述符),但每人都有一套独立的"随身家当"(寄存器状态和栈)。然而,这些居民并非凭空产生,它们必须通过一个称为 clone(或更新的 clone3)的系统调用,由父进程"生育"出来。
当我们在 C 程序中调用 pthread_create,或者在 Python 中启动一个 threading.Thread,最终都会落入 Linux 内核的 clone3(或 clone)系统调用。内核会一丝不苟地复制或共享父进程的资源,为新生儿搭建好内核栈,伪造一个完美的"出身背景",然后将它交给调度器。调度器在某个合适的时机,会让这个新线程第一次站上 CPU,执行它从父辈那里继承来的第一条指令。从此,它就开始了自己独立的人生。
你是不是曾经好奇过:新线程的第一个函数 start_routine 到底是怎么被调用的?为什么我们感觉不到任何"魔法"?今天,我们就带着 Linux 6.8.12 的内核源码,从 clone3 系统调用入口开始,一步一个脚印,追踪到用户态的 start_routine 被执行。你将看到内核如何在幕后精心构建一个"谎言":通过伪造寄存器和栈帧,让新线程第一次被调度时就仿佛一直在运行,然后优雅地返回用户空间,执行用户指定的函数。
2. 用户态起点:clone3 调用与 glibc 的封装
我们先从用户态的角度看起。用户程序通常不直接发起系统调用(虽然也可以),而是经由 C 标准库 glibc 的封装;线程创建走的正是 glibc 内部的 __clone3 包装函数。它的原型(简化)如下:
```c
int clone3(struct clone_args *cl_args, size_t size, int (*func)(void *), void *arg);
```
注意:真正的系统调用 clone3 只有两个参数(cl_args 和 size),而 glibc 为了方便用户传递线程入口函数,额外增加了 func 和 arg。因此,glibc 需要在这两个参数被内核"看到"之前,妥善保护好它们。
x86-64 架构下的函数调用参数传递规则(System V AMD64 ABI)规定:前六个整数/指针参数依次通过 rdi, rsi, rdx, rcx, r8, r9 传递。因此:

- rdi = cl_args 指针
- rsi = size
- rdx = func
- rcx = arg
glibc 的汇编实现(sysdeps/unix/sysv/linux/x86_64/clone3.S)会做以下关键工作:
```assembly
ENTRY(__clone3)
        /* 参数检查:func 不能为空 */
        testq   %rdx, %rdx
        jz      L(invalid)       /* L(invalid):返回 -EINVAL(此处从略) */

        /* 将 arg 保存到一个不会被系统调用破坏的寄存器 */
        movq    %rcx, %r8

        /* 执行系统调用 */
        movl    $__NR_clone3, %eax
        syscall

        /* 系统调用返回后,检查返回值 */
        testq   %rax, %rax
        jl      L(error)         /* 如果 < 0,出错跳转 */
        jz      L(thread_start)  /* 如果 == 0,表示子线程,跳转到真正的执行入口 */

        /* 父线程直接返回 */
        retq

L(thread_start):
        /* 子线程从这里开始执行 */
        xorl    %ebp, %ebp       /* 清空帧指针,标记最外层帧 */
        movq    %r8, %rdi        /* 将之前保存的 arg 作为第一个参数 */
        call    *%rdx            /* 调用 func(arg) */

        /* func 返回后,调用 exit 系统调用结束线程 */
        movq    %rax, %rdi
        movl    $__NR_exit, %eax
        syscall
        /* 不会执行到这里 */
END(__clone3)
```
关键点在于:

- func 和 arg 被 glibc 分别放在 rdx 和 r8 中,然后才陷入内核。
- 内核入口会把这些通用寄存器连同其他寄存器一起保存进 pt_regs,返回用户态时原样恢复,因此它们的值不会被破坏(syscall 指令本身只会覆盖 rcx 和 r11,这正是 arg 要从 rcx 挪到 r8 的原因)。
- 子线程从内核返回后,rip 落在 syscall 之后的分支上,此时 rdx 中仍然是 func,r8 中是 arg,于是可以顺利调用用户提供的线程入口函数。
所以,start_routine 被调用的"秘密"其实在用户态就已经定下了。内核只负责让子线程带着 ax = 0 回到 syscall 之后的那条指令,剩下的分流是 glibc 自己安排好的。
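在继续深入内核之前,可以用一个小实验直观感受这个"父子分流"。下面是一个示意程序(假设内核和头文件支持 SYS_clone3):它直接通过 syscall(SYS_clone3, ...) 以 fork 语义创建子任务,不共享地址空间、不换栈,因此可以安全地用 C 写分支;真正的线程创建(CLONE_VM + 新栈)则必须像 glibc 那样用汇编在新栈上起跳。

```c
#define _GNU_SOURCE
#include <linux/sched.h>   /* struct clone_args */
#include <sys/syscall.h>   /* SYS_clone3 */
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>
#include <signal.h>
#include <stdio.h>

int main(void)
{
    struct clone_args args;
    memset(&args, 0, sizeof(args));
    args.exit_signal = SIGCHLD;          /* 不设 CLONE_* 标志,等价于 fork() 语义 */

    long ret = syscall(SYS_clone3, &args, sizeof(args));
    if (ret < 0) {
        perror("clone3");
        return 1;
    }
    if (ret == 0) {                      /* 对应汇编里 jz L(thread_start) 的那个分支 */
        printf("child: clone3 returned 0, pid=%d\n", getpid());
        _exit(0);
    }
    printf("parent: child pid=%ld\n", ret);
    waitpid((pid_t)ret, NULL, 0);
    return 0;
}
```

父子进程执行的是同一段 syscall 之后的代码,唯一的区别就是返回值:这正是 glibc 在 __clone3 里用 testq/jz 做分流的全部依据。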
但是,内核又是如何让子线程第一次返回用户态时,恰好从 syscall 的下一条指令接着执行的呢?这就需要我们深入内核的 copy_thread 和调度器返回路径了。
3. 内核初吻:clone3 系统调用与参数校验
当用户调用 syscall 指令后,CPU 会切换到内核态,并根据 MSR_LSTAR 中预存的地址跳转到 entry_SYSCALL_64(定义在 arch/x86/entry/entry_64.S)。我们来分析这个入口的代码:
```assembly
SYM_CODE_START(entry_SYSCALL_64)
        swapgs
        /* tss.sp2 作为临时存储 */
        movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
        SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
        movq    PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp
        ...
        pushq   $__USER_DS                        /* pt_regs->ss */
        pushq   PER_CPU_VAR(cpu_tss_rw + TSS_sp2) /* pt_regs->sp */
        pushq   %r11                              /* pt_regs->flags */
        pushq   $__USER_CS                        /* pt_regs->cs */
        pushq   %rcx                              /* pt_regs->ip */
        pushq   %rax                              /* pt_regs->orig_ax */
        PUSH_AND_CLEAR_REGS rax=$-ENOSYS
        ...
        call    do_syscall_64
        ...
```
这段代码做了几件重要的事情:

- 切换 GS 基址:swapgs 将用户态与内核态的 GS base 互换,以便访问 per-CPU 变量。
- 切换到内核栈:从 pcpu_hot.top_of_stack 取出当前 CPU 的内核栈顶,让 rsp 指向它。此时栈是空的。
- 构造 pt_regs 结构:将用户态的 SS、RSP、RFLAGS、CS、RIP 以及系统调用号等压栈,形成一个标准的 pt_regs 结构(定义在 arch/x86/include/asm/ptrace.h)。这个结构保存了用户现场的完整快照。
- 调用 do_syscall_64:这是通用的 64 位系统调用分发函数,它根据 rax 中的系统调用号调用对应的内核实现。对于 clone3,rax 为 __NR_clone3,对应的表项是 __x64_sys_clone3(由 SYSCALL_DEFINE2 宏展开生成,最终进入 __do_sys_clone3)。
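为了后面理解 POP_REGS 和 iretq 的工作方式,这里先给出 x86-64 下 pt_regs 的完整布局(arch/x86/include/asm/ptrace.h)。字段按内存地址从低到高排列,末尾的 ip/cs/flags/sp/ss 五个字段正好就是硬件中断帧的格式:

```c
struct pt_regs {
	unsigned long r15;
	unsigned long r14;
	unsigned long r13;
	unsigned long r12;
	unsigned long bp;
	unsigned long bx;
	/* 以上为 callee-saved;以下在 C 函数调用中可被破坏 */
	unsigned long r11;
	unsigned long r10;
	unsigned long r9;
	unsigned long r8;
	unsigned long ax;
	unsigned long cx;
	unsigned long dx;
	unsigned long si;
	unsigned long di;
	unsigned long orig_ax;	/* 系统调用号 */
	/* 以下五项与 iretq 期望的硬件帧完全一致 */
	unsigned long ip;
	unsigned long cs;
	unsigned long flags;
	unsigned long sp;
	unsigned long ss;
};
```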
do_syscall_64 的简化形式如下:
```c
#ifdef CONFIG_X86_64
/* 简化版:真实代码还要处理 x32 调用号与无效调用号 */
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
{
	nr = syscall_enter_from_user_mode(regs, nr);

	instrumentation_begin();
	if ((unsigned int)nr < NR_syscalls)
		regs->ax = sys_call_table[nr](regs);
	instrumentation_end();

	syscall_exit_to_user_mode(regs);
}
#endif
```
它会从 sys_call_table 中取出 clone3 的处理函数,也就是 __x64_sys_clone3。
4. 核心缔造者:kernel_clone 与 copy_process
__x64_sys_clone3 在内核中被定义为 SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)。它的主要任务:
- 从用户空间安全地拷贝 clone_args 结构:由 copy_clone_args_from_user 完成,并填充内核侧的 struct kernel_clone_args。
- 检查标志合法性(clone3_args_valid)。
- 调用 kernel_clone(&kargs)。
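对应的实现非常短(kernel/fork.c):

```c
SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
{
	int err;
	struct kernel_clone_args kargs;
	pid_t set_tid[MAX_PID_NS_LEVEL];

	kargs.set_tid = set_tid;

	/* 带版本兼容地从用户空间拷贝 clone_args */
	err = copy_clone_args_from_user(&kargs, uargs, size);
	if (err)
		return err;

	if (!clone3_args_valid(&kargs))
		return -EINVAL;

	return kernel_clone(&kargs);
}
```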
kernel_clone 定义在 kernel/fork.c 中,是 fork/clone/vfork 三大件的中枢。它调用了 copy_process,然后进行收尾工作(如记录 PID、处理 vfork 等待等)。我们重点关注 copy_process,因为它负责创建新的 task_struct 并设置好所有资源。
copy_process 是一个高达几百行的函数,它做了:
- 复制或共享内存描述符(mm)、文件系统信息(fs)、文件描述符表(files)、信号处理等。
- 为新任务分配一个新的 pid。
- 调用 copy_thread,初始化新任务的内核栈和硬件上下文。
copy_thread 是与体系结构相关的函数,在 arch/x86/kernel/process.c 中实现。正是这个函数,为新线程的首次登场铺好了舞台。
5. 栈上的艺术:copy_thread 伪造新世界
copy_thread 接收 struct task_struct *p 和 struct kernel_clone_args *args。它的职责是:

- 在子任务的内核栈上准备好 pt_regs 结构和 inactive_task_frame 结构。
- 设置 p->thread.sp 指向新内核栈的适当位置。
- 决定子任务第一次被调度时从何处开始执行(ret_from_fork_asm)。
下面我们深入阅读 copy_thread 的关键部分(基于 Linux 6.8.12 源码):
```c
int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
	unsigned long clone_flags = args->flags;
	unsigned long sp = args->stack;	// 用户指定的新栈;pthread_create 会传入预先分配好的线程栈,fork 则为 0
	unsigned long tls = args->tls;
	struct inactive_task_frame *frame;
	struct fork_frame *fork_frame;
	struct pt_regs *childregs;
	unsigned long new_ssp;
	int ret = 0;

	// 获取子进程的 pt_regs 位置(位于内核栈顶部)
	childregs = task_pt_regs(p);

	// fork_frame 包含 inactive_task_frame 和紧随其后的 pt_regs
	fork_frame = container_of(childregs, struct fork_frame, regs);
	frame = &fork_frame->frame;

	// 设置 inactive_task_frame 中的 bp 和 ret_addr
	frame->bp = encode_frame_pointer(childregs);
	frame->ret_addr = (unsigned long) ret_from_fork_asm;	// ★ 关键!

	// 将子进程的内核栈指针指向 fork_frame(位于 pt_regs 之下)
	p->thread.sp = (unsigned long) fork_frame;

	// ... 其他初始化代码(FPU、TLS 等)...

	// 如果是内核线程 (PF_KTHREAD)
	if (unlikely(p->flags & PF_KTHREAD)) {
		memset(childregs, 0, sizeof(struct pt_regs));
		kthread_frame_init(frame, args->fn, args->fn_arg);
		return 0;
	}

	// 用户线程(普通进程/线程)
	frame->bx = 0;
	*childregs = *current_pt_regs();	// 复制父进程的 pt_regs
	childregs->ax = 0;			// 子进程的 fork/clone 返回值为 0
	if (sp)
		childregs->sp = sp;		// 如果指定了栈,则替换用户栈

	// ... 处理 CLONE_SETTLS 等 ...
	return ret;
}
```
这里有几个关键点:
5.1 内核栈的布局
内核为每个任务分配一个固定大小的内核栈(THREAD_SIZE,x86_64 下默认为 16KB)。栈从高地址向低地址增长。在栈的最高地址处,存放着一个 struct pt_regs(用户现场);紧挨着 pt_regs 下方的是 struct inactive_task_frame,二者合起来构成 struct fork_frame。
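用一张示意图直观感受这块内核栈(高地址在上,仅示意,不按比例):

```
+---------------------------+ ← 栈的最高地址 (task_top_of_stack)
|      struct pt_regs       | ← childregs = task_pt_regs(p)
| (ss, sp, flags, cs, ip,   |
|  orig_ax, ..., r15)       |
+---------------------------+
| ret_addr = ret_from_fork_ | ┐
|            asm            | │
| bp(伪造的帧指针)          | │ struct inactive_task_frame
| bx = 0                    | │(与 pt_regs 合称 fork_frame)
| r12 ... r15               | ┘ ← p->thread.sp 指向这里
+---------------------------+
|     ......(空闲空间)      |
|    (栈向低地址生长)       |
+---------------------------+
```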
inactive_task_frame 的定义如下:
```c
struct inactive_task_frame {
	unsigned long r15;
	unsigned long r14;
	unsigned long r13;
	unsigned long r12;
	unsigned long bx;
	unsigned long bp;
	unsigned long ret_addr;	// 返回地址
};
```
注意:这个结构的排列顺序与 __switch_to_asm 中的压栈/弹栈顺序完全匹配。
当我们设置 frame->ret_addr = ret_from_fork_asm,并将 p->thread.sp 指向 fork_frame 时,相当于告诉调度器:当这个任务被切换进来时,从栈上依次弹出 r15~rbp 后,最后的 ret 指令将跳转到 ret_from_fork_asm。
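而把这两个结构"粘"在一起的,正是 struct fork_frame(arch/x86/include/asm/switch_to.h):

```c
struct fork_frame {
	struct inactive_task_frame frame;	/* 供 __switch_to_asm 弹出 */
	struct pt_regs regs;			/* 供 iretq 恢复用户现场 */
};
```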
5.2 用户现场 pt_regs 的设置
childregs 是指向子任务 pt_regs 的指针。对于普通的用户线程(非内核线程),copy_thread 直接复制父进程的 pt_regs,然后把 ax 设为 0(使得 clone3 在子线程中"返回"0)。用户栈指针 sp 会被替换为 args->stack(即 pthread_create 事先分配好的用户栈)。而用户指令指针 ip(即 pt_regs->ip)原样保留,仍然指向父进程调用 clone3 时 syscall 指令的下一条指令。
你可能会问:内核为什么不把 childregs->ip 改成 L(thread_start)?答案是:它做不到,也不需要做。L(thread_start) 是 glibc 内部的用户空间地址,内核并不知道它;但内核也根本不需要知道。回忆 entry_SYSCALL_64 的入口代码:syscall 指令把用户态的下一条指令地址保存到 rcx,入口代码再把 rcx 压栈作为 pt_regs->ip。在 glibc 的 __clone3 中,syscall 的下一条指令恰好就是 testq %rax, %rax 这个分支判断。子线程复制了这个 ip,于是它第一次返回用户态后,也会从这个判断开始执行;又因为它的 ax 被强制设为 0,jz 分支命中,它就顺理成章地跳进了 L(thread_start)。
因此,copy_thread 不需要任何特殊处理,只需复制父进程的 pt_regs 并把 ax 清零即可。所有用户态的"分流"逻辑,早已由 glibc 在 syscall 之后安排妥当。这也顺带解释了 rcx 和 r11 的去向:syscall 要用它们保存返回地址和 rflags,所以 glibc 必须提前把 arg 从 rcx 挪到 r8。
6. 唤醒新生命:wake_up_new_task 与调度器初识
当 copy_process 成功返回 task_struct *p 后,kernel_clone 会调用 wake_up_new_task(p)。这个函数定义在 kernel/sched/core.c,负责将新创建的任务放到运行队列中,并让调度器感知到它的存在。
关键代码:
```c
void wake_up_new_task(struct task_struct *p)
{
	struct rq_flags rf;
	struct rq *rq;

	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
	WRITE_ONCE(p->__state, TASK_RUNNING);	// 将状态设为可运行

	// ... 负载均衡相关的代码(选择 CPU)...

	rq = __task_rq_lock(p, &rf);
	activate_task(rq, p, ENQUEUE_NOCLOCK);	// 将任务加入运行队列
	trace_sched_wakeup_new(p);
	wakeup_preempt(rq, p, WF_FORK);		// 检查是否应该抢占当前任务
	task_rq_unlock(rq, p, &rf);
}
```
- WRITE_ONCE(p->__state, TASK_RUNNING):新生儿的"心跳"开始,表示它已经准备好被调度。
- activate_task:通过调度类的 enqueue 回调把任务挂入运行队列(CFS 下是插入红黑树),并更新相关统计信息。
- wakeup_preempt:比较新任务和当前运行任务。如果新任务应当先运行(例如它是实时任务),则设置 TIF_NEED_RESCHED 标志,请求重新调度。
至此,新线程已经进入了调度器的视野,等待合适的时机成为 CPU 的主人。
7. 调度器的选择:__schedule 与 context_switch
调度器会在很多时机被调用,比如:
- 当前进程主动调用 schedule()(如等待 I/O)。
- 时间片用完,时钟中断处理中触发抢占。
- 从系统调用或中断返回到用户空间之前,检查 TIF_NEED_RESCHED 标志。
调度器的主函数是 __schedule,定义在 kernel/sched/core.c:
```c
static void __sched notrace __schedule(unsigned int sched_mode)
{
	struct task_struct *prev, *next;
	unsigned long *switch_count;
	struct rq_flags rf;
	struct rq *rq;
	int cpu;

	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr;
	switch_count = &prev->nivcsw;

	// ... 处理 prev 状态(可能因睡眠而 deactivate)...

	next = pick_next_task(rq, prev, &rf);	// 从运行队列中选出下一个任务

	if (likely(prev != next)) {
		rq->nr_switches++;
		RCU_INIT_POINTER(rq->curr, next);
		++*switch_count;
		rq = context_switch(rq, prev, next, &rf);	// 切换上下文(同时释放 rq 锁)
	} else {
		// ... 无需切换的路径 ...
	}
}
```
pick_next_task 会调用调度类(如 CFS、实时类)的选择函数。对于普通的用户线程,最常用的是 fair_sched_class.pick_next_task,它从红黑树中取一个任务。
当选出 next 后,进入 context_switch:
```c
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf)
{
	prepare_task_switch(rq, prev, next);
	// ... 切换内存管理(switch_mm)...
	switch_to(prev, next, prev);	// 切换寄存器和栈
	barrier();
	return finish_task_switch(prev);
}
```
switch_to 是一个宏,在 x86 下最终调用了 __switch_to_asm(汇编函数)。这就是我们前面多次提到的核心切换点。
8. 栈之舞:__switch_to_asm 切换内核栈
__switch_to_asm 定义在 arch/x86/entry/entry_64.S 中(完整实现见文末源码):
```assembly
SYM_FUNC_START(__switch_to_asm)
        pushq   %rbp
        pushq   %rbx
        pushq   %r12
        pushq   %r13
        pushq   %r14
        pushq   %r15

        /* 切换栈 */
        movq    %rsp, TASK_threadsp(%rdi)
        movq    TASK_threadsp(%rsi), %rsp

        /* 栈保护、RSB 填充等 */

        popq    %r15
        popq    %r14
        popq    %r13
        popq    %r12
        popq    %rbx
        popq    %rbp

        jmp     __switch_to
SYM_FUNC_END(__switch_to_asm)
```
我们来细致分析这段代码执行时发生了什么:

- 在当前任务 prev 的内核栈上压入 %rbp, %rbx, %r12~%r15(callee-saved 寄存器)。按 ABI 约定,这些寄存器必须跨函数调用保持不变,因此切换任务时必须保存。
- 保存当前任务的栈指针:movq %rsp, TASK_threadsp(%rdi),其中 %rdi 是 prev 的 task_struct 指针,TASK_threadsp 是 thread.sp 在结构中的偏移。
- 切换到下一个任务的栈:movq TASK_threadsp(%rsi), %rsp,%rsi 是 next 的 task_struct 指针。现在 rsp 指向 next 的内核栈。对于新线程,这个位置就是 fork_frame(即 p->thread.sp 指向的地方)。
- 恢复 callee-saved 寄存器:从新的栈上弹出 %r15..%rbp。对于新创建的线程,这些值都是我们在 copy_thread 中通过 frame->bx 等成员预设的(bx = 0,bp 为伪造的帧指针等)。
- 跳转到 __switch_to:注意这里是 jmp 而不是 call,所以 __switch_to 执行完毕时的 ret 指令,使用的是当前栈顶上的返回地址。此时 rsp 已经指向 fork_frame.frame.ret_addr(弹出 6 个 8 字节寄存器后,rsp 恰好落在偏移 48 处的 ret_addr 上)。于是,__switch_to 的 ret 会跳转到 ret_from_fork_asm!
这样,新线程第一次被调度时,便从 __switch_to_asm 过渡到了 ret_from_fork_asm,而父线程则是从 __switch_to 正常返回到 context_switch 中继续执行。
9. 新线程的"第一句话":ret_from_fork_asm
ret_from_fork_asm 的定义如下(同样位于 arch/x86/entry/entry_64.S):
```assembly
SYM_CODE_START(ret_from_fork_asm)
        UNWIND_HINT_END_OF_STACK
        ANNOTATE_NOENDBR
        CALL_DEPTH_ACCOUNT

        movq    %rax, %rdi      /* prev */
        movq    %rsp, %rsi      /* regs */
        movq    %rbx, %rdx      /* fn */
        movq    %r12, %rcx      /* fn_arg */
        call    ret_from_fork

        UNWIND_HINT_REGS
        jmp     swapgs_restore_regs_and_return_to_usermode
SYM_CODE_END(ret_from_fork_asm)
```
- %rax 中是 __switch_to 返回的 prev 任务指针(即切换前的任务;新线程本身并不关心它,但 schedule_tail 需要)。
- %rsp 此时恰好指向新线程内核栈上 pt_regs 的起始处:fork_frame 中的 6 个寄存器和 ret_addr 都已弹出,剩下的正是 fork_frame->regs。
- %rbx 和 %r12 在 copy_thread 中被预设:用户线程下为 0,内核线程下分别是线程函数指针和参数。
ret_from_fork 是一个 C 函数,在 Linux 6.8 中定义在 arch/x86/kernel/process.c:
```c
__visible void ret_from_fork(struct task_struct *prev, struct pt_regs *regs,
			     int (*fn)(void *), void *fn_arg)
{
	schedule_tail(prev);	// 调度收尾工作(如释放 rq 锁)

	/* 只有内核线程才带 fn;普通用户线程走不进这个分支 */
	if (unlikely(fn)) {
		fn(fn_arg);
		regs->ax = 0;
	}

	syscall_exit_to_user_mode(regs);
}
```
注意:对于普通用户线程,fn 为 NULL,所以不会执行内核线程函数。
schedule_tail 是必要的:新线程没有走 context_switch 的正常返回路径,必须由它来补上 finish_task_switch 等收尾工作(释放 rq 锁、处理上一个任务的善后)。
最后调用 syscall_exit_to_user_mode(regs)------这个函数标志着我们即将返回用户态。
10. 最后的归途:swapgs_restore_regs_and_return_to_usermode
syscall_exit_to_user_mode(kernel/entry/common.c)完成信号处理、重新调度检查等退出前的收尾工作后返回到 ret_from_fork_asm;随后那条 jmp 指令把控制权交给最终的汇编返回路径 swapgs_restore_regs_and_return_to_usermode。
该标签的完整实现见文末源码,这里截取关键几步:
```assembly
SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
        IBRS_EXIT
        /* 处理 Xen PV 和 PTI 的替代跳转 */
        STACKLEAK_ERASE
        POP_REGS                /* 恢复所有通用寄存器 */
        add     $8, %rsp        /* 跳过 orig_ax */
        UNWIND_HINT_IRET_REGS

.Lswapgs_and_iret:
        swapgs
        CLEAR_CPU_BUFFERS
        /* 断言 IRET 帧指向用户态:检查栈上 CS 的 RPL 位 */
        testb   $3, 8(%rsp)
        jnz     .Lnative_iret
        ud2                     /* 不是用户态?触发非法指令,暴露 bug */

.Lnative_iret:
        iretq                   /* 中断返回:弹出 RIP/CS/RFLAGS/RSP/SS */
```
注意:在执行 POP_REGS 之前,rsp 指向内核栈上 pt_regs 结构的起始位置。POP_REGS 依次弹出 r15, r14, ..., rsi, rdi,然后 add $8, %rsp 跳过 orig_ax 字段。此时,rsp 正好指向 pt_regs 中的 ip 字段(即用户态的 RIP)。而 iretq 期望栈顶依次是 RIP, CS, RFLAGS, RSP, SS,我们的 pt_regs 布局恰好就是这种顺序(pt_regs 本来就是按硬件中断帧设计的)。因此,执行 iretq 后,CPU 会弹出这些值,恢复用户态的指令指针、段寄存器、栈指针和标志位。对于新线程,恢复出的 RIP 就是父进程调用 clone3 时 syscall 的下一条指令,即 __clone3 中的 testq %rax, %rax;随后的 jz 把它送进 L(thread_start)。
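顺带展开 POP_REGS(arch/x86/entry/calling.h,略去注释),可以对照前面 pt_regs 的字段顺序逐一验证:

```assembly
.macro POP_REGS pop_rdi=1
	popq %r15
	popq %r14
	popq %r13
	popq %r12
	popq %rbp
	popq %rbx
	popq %r11
	popq %r10
	popq %r9
	popq %r8
	popq %rax
	popq %rcx
	popq %rdx
	popq %rsi
	.if \pop_rdi
	popq %rdi
	.endif
.endm
```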
至此,新线程从内核态彻底返回到了用户态;两条指令之后,%rip 便落在 L(thread_start)。
11. iretq 与 sysretq:两条返回用户态的神谕
在 swapgs_restore_regs_and_return_to_usermode 中,我们看到使用的是 iretq。但对于普通的系统调用返回(不是中断/异常),内核倾向于使用更快的 sysretq。这两种返回方式在 entry_64.S 中都有体现。为什么这里用了 iretq 而不是 sysretq?
主要原因是:新线程第一次返回用户态,走的根本不是 entry_SYSCALL_64 的快速退出路径。sysretq 只出现在 entry_SYSCALL_64 的尾部:do_syscall_64 返回后,那里会检查本次返回是否满足 sysret 的苛刻条件(返回地址是规范地址、rcx == pt_regs->ip、r11 == pt_regs->flags 等),满足才走 sysretq,否则跳到 swapgs_restore_regs_and_return_to_usermode 用 iretq 兜底。而新线程的内核侧执行是从 ret_from_fork_asm 开始的,它从未经过那个检查点,直接 jmp 到了通用返回路径,因此统一使用 iretq,这样最安全。
另外,无论 sysretq 还是 iretq 都不会替我们切换 GS base,所以返回前都需要一条 swapgs。两者在顶层逻辑上类似,只是底层机制与开销不同:sysretq 更快,但条件苛刻;iretq 更慢,却是万能的。
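上面提到的检查点,正对应 entry_SYSCALL_64 尾部的这两行(见文末源码):do_syscall_64 的返回值放在 %al 中,非零才允许走 sysret 快路径,Xen PV 下则无条件走 iret:

```assembly
	ALTERNATIVE "testb %al, %al; jz swapgs_restore_regs_and_return_to_usermode", \
		"jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
```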
12. 用户态的最后一公里:L(thread_start) 调用 start_routine
终于,用户空间的代码开始执行。根据我们之前分析的 glibc 汇编片段:
```assembly
L(thread_start):
        xorl    %ebp, %ebp
        movq    %r8, %rdi
        call    *%rdx           /* 调用 start_routine(arg) */

        movq    %rax, %rdi
        movl    $SYS_ify(exit), %eax
        syscall
```
此时 rdx 中保存着 func(即用户提供的 start_routine 函数指针),r8 中保存着 arg。因此,call *%rdx 就会执行用户写的线程函数。当该函数返回后,返回值传递给了 exit 系统调用,线程优雅地消亡。
至此,整个链条闭环:用户调用 pthread_create → glibc 封装 → clone3 系统调用 → 内核创建任务 → 调度器选中 → 栈切换 → 返回用户态 → 执行 start_routine。
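作为收束,下面这个最普通的 pthread 程序(示意)就是整条链路的用户侧起点:pthread_create 内部最终走到 __clone3,而 start_routine 正是经由本文描述的全部机制被第一次调用的那个函数。

```c
#include <pthread.h>
#include <stdio.h>

/* 线程入口:正是本文追踪的 start_routine */
static void *start_routine(void *arg)
{
    printf("hello from new thread, arg=%s\n", (const char *)arg);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    /* pthread_create 返回错误码而不是设置 errno */
    if (pthread_create(&tid, NULL, start_routine, "world") != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }
    pthread_join(tid, NULL);
    return 0;
}
```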
13. 总结:一次系统调用背后的宏大叙事
通过追踪 Linux 6.8.12 内核源码,我们完成了一次从用户态到内核态再返回用户态的完整旅程。这一路上,我们看到了:
- 用户态 glibc 的巧思:利用寄存器传递线程入口函数与参数,并在 syscall 之后放置分支逻辑,使得父子可以分流。
- 内核 copy_thread 的精巧构造:伪造 inactive_task_frame,并设置 ret_addr = ret_from_fork_asm,为新线程定制了首次调度时的执行入口。
- 调度器 __schedule 与 __switch_to_asm:通过保存/恢复 callee-saved 寄存器和切换内核栈指针,实现了任务的原子切换。
- ret_from_fork_asm 到 swapgs_restore_regs_and_return_to_usermode:完成最后的调度收尾,并跳转到通用返回路径。
- iretq 指令:根据内核栈上的 pt_regs 结构,恢复用户态的所有寄存器并切换特权级,将控制权交还给用户空间的 L(thread_start)。
- 用户空间的 L(thread_start):最终调用了用户提供的 start_routine。
整个过程涉及了硬件特权级切换、x86-64 汇编、C 语言调度器、进程管理、内存管理等多个模块,是操作系统课本中"进程创建与调度"知识点的真实落地。
为什么如此"麻烦"?因为这是稳定、安全、高效的多任务系统所必须付出的代价。现代操作系统能够在毫秒间完成数千次这样的切换,而你甚至感觉不到它的存在。这正是计算机科学的魅力所在:复杂逻辑被封装在层层抽象背后,留给用户的是一个简单而强大的接口。
## 源码

```
/*
* __schedule() is the main scheduler function.
*
* The main means of driving the scheduler and thus entering this function are:
*
* 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
*
* 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
* paths. For example, see arch/x86/entry_64.S.
*
* To drive preemption between tasks, the scheduler sets the flag in timer
* interrupt handler scheduler_tick().
*
* 3. Wakeups don't really cause entry into schedule(). They add a
* task to the run-queue and that's it.
*
* Now, if the new task added to the run-queue preempts the current
* task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
* called on the nearest possible occasion:
*
* - If the kernel is preemptible (CONFIG_PREEMPTION=y):
*
* - in syscall or exception context, at the next outmost
* preempt_enable(). (this might be as soon as the wake_up()'s
* spin_unlock()!)
*
* - in IRQ context, return from interrupt-handler to
* preemptible context
*
* - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
* then at the next:
*
* - cond_resched() call
* - explicit schedule() call
* - return from syscall or exception to user-space
* - return from interrupt-handler to user-space
*
* WARNING: must be called with preemption disabled!
*/
static void __sched notrace __schedule(unsigned int sched_mode)
{
struct task_struct *prev, *next;
unsigned long *switch_count;
unsigned long prev_state;
struct rq_flags rf;
struct rq *rq;
int cpu;
cpu = smp_processor_id();
rq = cpu_rq(cpu);
prev = rq->curr;
schedule_debug(prev, !!sched_mode);
if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
hrtick_clear(rq);
local_irq_disable();
rcu_note_context_switch(!!sched_mode);
/*
* Make sure that signal_pending_state()->signal_pending() below
* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
* done by the caller to avoid the race with signal_wake_up():
*
* __set_current_state(@state) signal_wake_up()
* schedule() set_tsk_thread_flag(p, TIF_SIGPENDING)
* wake_up_state(p, state)
* LOCK rq->lock LOCK p->pi_state
* smp_mb__after_spinlock() smp_mb__after_spinlock()
* if (signal_pending_state()) if (p->state & @state)
*
* Also, the membarrier system call requires a full memory barrier
* after coming from user-space, before storing to rq->curr.
*/
rq_lock(rq, &rf);
smp_mb__after_spinlock();
/* Promote REQ to ACT */
rq->clock_update_flags <<= 1;
update_rq_clock(rq);
rq->clock_update_flags = RQCF_UPDATED;
switch_count = &prev->nivcsw;
/*
* We must load prev->state once (task_struct::state is volatile), such
* that we form a control dependency vs deactivate_task() below.
*/
prev_state = READ_ONCE(prev->__state);
if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
if (signal_pending_state(prev_state, prev)) {
WRITE_ONCE(prev->__state, TASK_RUNNING);
} else {
prev->sched_contributes_to_load =
(prev_state & TASK_UNINTERRUPTIBLE) &&
!(prev_state & TASK_NOLOAD) &&
!(prev_state & TASK_FROZEN);
if (prev->sched_contributes_to_load)
rq->nr_uninterruptible++;
/*
* __schedule() ttwu()
* prev_state = prev->state; if (p->on_rq && ...)
* if (prev_state) goto out;
* p->on_rq = 0; smp_acquire__after_ctrl_dep();
* p->state = TASK_WAKING
*
* Where __schedule() and ttwu() have matching control dependencies.
*
* After this, schedule() must not care about p->state any more.
*/
deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
if (prev->in_iowait) {
atomic_inc(&rq->nr_iowait);
delayacct_blkio_start();
}
}
switch_count = &prev->nvcsw;
}
next = pick_next_task(rq, prev, &rf);
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
rq->last_seen_need_resched_ns = 0;
#endif
if (likely(prev != next)) {
rq->nr_switches++;
/*
* RCU users of rcu_dereference(rq->curr) may not see
* changes to task_struct made by pick_next_task().
*/
RCU_INIT_POINTER(rq->curr, next);
/*
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
* rq->curr, before returning to user-space.
*
* Here are the schemes providing that barrier on the
* various architectures:
* - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
* switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
* - finish_lock_switch() for weakly-ordered
* architectures where spin_unlock is a full barrier,
* - switch_to() for arm64 (weakly-ordered, spin_unlock
* is a RELEASE barrier),
*/
++*switch_count;
migrate_disable_switch(rq, prev);
psi_sched_switch(prev, next, !task_on_rq_queued(prev));
trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state);
/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
} else {
rq_unpin_lock(rq, &rf);
__balance_callbacks(rq);
raw_spin_rq_unlock_irq(rq);
}
}
/*
* context_switch - switch to the new MM and the new thread's register state.
*/
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next, struct rq_flags *rf)
{
prepare_task_switch(rq, prev, next);
/*
* For paravirt, this is coupled with an exit in switch_to to
* combine the page table reload and the switch backend into
* one hypercall.
*/
arch_start_context_switch(prev);
/*
* kernel -> kernel lazy + transfer active
* user -> kernel lazy + mmgrab_lazy_tlb() active
*
* kernel -> user switch + mmdrop_lazy_tlb() active
* user -> user switch
*
* switch_mm_cid() needs to be updated if the barriers provided
* by context_switch() are modified.
*/
if (!next->mm) { // to kernel
enter_lazy_tlb(prev->active_mm, next);
next->active_mm = prev->active_mm;
if (prev->mm) // from user
mmgrab_lazy_tlb(prev->active_mm);
else
prev->active_mm = NULL;
} else { // to user
membarrier_switch_mm(rq, prev->active_mm, next->mm);
/*
* sys_membarrier() requires an smp_mb() between setting
* rq->curr / membarrier_switch_mm() and returning to userspace.
*
* The below provides this either through switch_mm(), or in
* case 'prev->active_mm == next->mm' through
* finish_task_switch()'s mmdrop().
*/
switch_mm_irqs_off(prev->active_mm, next->mm, next);
lru_gen_use_mm(next->mm);
if (!prev->mm) { // from kernel
/* will mmdrop_lazy_tlb() in finish_task_switch(). */
rq->prev_mm = prev->active_mm;
prev->active_mm = NULL;
}
}
/* switch_mm_cid() requires the memory barriers above. */
switch_mm_cid(rq, prev, next);
prepare_lock_switch(rq, next, rf);
/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev);
barrier();
return finish_task_switch(prev);
}
#define switch_to(prev, next, last) \
do { \
((last) = __switch_to_asm((prev), (next))); \
} while (0)
struct task_struct *__switch_to_asm(struct task_struct *prev,
struct task_struct *next);
/*
* %rdi: prev task
* %rsi: next task
*/
.pushsection .text, "ax"
SYM_FUNC_START(__switch_to_asm)
/*
* Save callee-saved registers
* This must match the order in inactive_task_frame
*/
pushq %rbp
pushq %rbx
pushq %r12
pushq %r13
pushq %r14
pushq %r15
/* switch stack */
movq %rsp, TASK_threadsp(%rdi)
movq TASK_threadsp(%rsi), %rsp
#ifdef CONFIG_STACKPROTECTOR
movq TASK_stack_canary(%rsi), %rbx
movq %rbx, PER_CPU_VAR(fixed_percpu_data) + FIXED_stack_canary
#endif
/*
* When switching from a shallower to a deeper call stack
* the RSB may either underflow or use entries populated
* with userspace addresses. On CPUs where those concerns
* exist, overwrite the RSB with entries which capture
* speculative execution to prevent attack.
*/
FILL_RETURN_BUFFER %r12, RSB_CLEAR_LOOPS, X86_FEATURE_RSB_CTXSW
/* restore callee-saved registers */
popq %r15
popq %r14
popq %r13
popq %r12
popq %rbx
popq %rbp
jmp __switch_to
SYM_FUNC_END(__switch_to_asm)
.popsection
/*
* switch_to(x,y) should switch tasks from x to y.
*
* This could still be optimized:
* - fold all the options into a flag word and test it with a single test.
* - could test fs/gs bitsliced
*
* Kprobes not supported here. Set the probe on schedule instead.
* Function graph tracer not supported too.
*/
__no_kmsan_checks
__visible __notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
struct thread_struct *prev = &prev_p->thread;
struct thread_struct *next = &next_p->thread;
struct fpu *prev_fpu = &prev->fpu;
int cpu = smp_processor_id();
WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
this_cpu_read(pcpu_hot.hardirq_stack_inuse));
if (!test_thread_flag(TIF_NEED_FPU_LOAD))
switch_fpu_prepare(prev_fpu, cpu);
/* We must save %fs and %gs before load_TLS() because
* %fs and %gs may be cleared by load_TLS().
*
* (e.g. xen_load_tls())
*/
save_fsgs(prev_p);
/*
* Load TLS before restoring any segments so that segment loads
* reference the correct GDT entries.
*/
load_TLS(next, cpu);
/*
* Leave lazy mode, flushing any hypercalls made here. This
* must be done after loading TLS entries in the GDT but before
* loading segments that might reference them.
*/
arch_end_context_switch(next_p);
/* Switch DS and ES.
*
* Reading them only returns the selectors, but writing them (if
* nonzero) loads the full descriptor from the GDT or LDT. The
* LDT for next is loaded in switch_mm, and the GDT is loaded
* above.
*
* We therefore need to write new values to the segment
* registers on every context switch unless both the new and old
* values are zero.
*
* Note that we don't need to do anything for CS and SS, as
* those are saved and restored as part of pt_regs.
*/
savesegment(es, prev->es);
if (unlikely(next->es | prev->es))
loadsegment(es, next->es);
savesegment(ds, prev->ds);
if (unlikely(next->ds | prev->ds))
loadsegment(ds, next->ds);
x86_fsgsbase_load(prev, next);
x86_pkru_load(prev, next);
/*
* Switch the PDA and FPU contexts.
*/
raw_cpu_write(pcpu_hot.current_task, next_p);
raw_cpu_write(pcpu_hot.top_of_stack, task_top_of_stack(next_p));
switch_fpu_finish();
/* Reload sp0. */
update_task_stack(next_p);
switch_to_extra(prev_p, next_p);
if (static_cpu_has_bug(X86_BUG_SYSRET_SS_ATTRS)) {
/*
* AMD CPUs have a misfeature: SYSRET sets the SS selector but
* does not update the cached descriptor. As a result, if we
* do SYSRET while SS is NULL, we'll end up in user mode with
* SS apparently equal to __USER_DS but actually unusable.
*
* The straightforward workaround would be to fix it up just
* before SYSRET, but that would slow down the system call
* fast paths. Instead, we ensure that SS is never NULL in
* system call context. We do this by replacing NULL SS
* selectors at every context switch. SYSCALL sets up a valid
* SS, so the only way to get NULL is to re-enter the kernel
* from CPL 3 through an interrupt. Since that can't happen
* in the same task as a running syscall, we are guaranteed to
* context switch between every interrupt vector entry and a
* subsequent SYSRET.
*
* We read SS first because SS reads are much faster than
* writes. Out of caution, we force SS to __KERNEL_DS even if
* it previously had a different non-NULL value.
*/
unsigned short ss_sel;
savesegment(ss, ss_sel);
if (ss_sel != __KERNEL_DS)
loadsegment(ss, __KERNEL_DS);
}
/* Load the Intel cache allocation PQR MSR. */
resctrl_sched_in(next_p);
return prev_p;
}
static __always_inline void __schedule_loop(unsigned int sched_mode)
{
do {
preempt_disable();
__schedule(sched_mode);
sched_preempt_enable_no_resched();
} while (need_resched());
}
asmlinkage __visible void __sched schedule(void)
{
struct task_struct *tsk = current;
#ifdef CONFIG_RT_MUTEXES
lockdep_assert(!tsk->sched_rt_mutex);
#endif
if (!task_is_running(tsk))
sched_submit_work(tsk);
__schedule_loop(SM_NONE);
sched_update_worker(tsk);
}
EXPORT_SYMBOL(schedule);
/*
* Ok, this is the main fork-routine.
*
* It copies the process, and if successful kick-starts
* it and waits for it to finish using the VM if required.
*
* args->exit_signal is expected to be checked for sanity by the caller.
*/
pid_t kernel_clone(struct kernel_clone_args *args)
{
u64 clone_flags = args->flags;
struct completion vfork;
struct pid *pid;
struct task_struct *p;
int trace = 0;
pid_t nr;
/*
* For legacy clone() calls, CLONE_PIDFD uses the parent_tid argument
* to return the pidfd. Hence, CLONE_PIDFD and CLONE_PARENT_SETTID are
* mutually exclusive. With clone3() CLONE_PIDFD has grown a separate
* field in struct clone_args and it still doesn't make sense to have
* them both point at the same memory location. Performing this check
* here has the advantage that we don't need to have a separate helper
* to check for legacy clone().
*/
if ((args->flags & CLONE_PIDFD) &&
(args->flags & CLONE_PARENT_SETTID) &&
(args->pidfd == args->parent_tid))
return -EINVAL;
/*
* Determine whether and which event to report to ptracer. When
* called from kernel_thread or CLONE_UNTRACED is explicitly
* requested, no event is reported; otherwise, report if the event
* for the type of forking is enabled.
*/
if (!(clone_flags & CLONE_UNTRACED)) {
if (clone_flags & CLONE_VFORK)
trace = PTRACE_EVENT_VFORK;
else if (args->exit_signal != SIGCHLD)
trace = PTRACE_EVENT_CLONE;
else
trace = PTRACE_EVENT_FORK;
if (likely(!ptrace_event_enabled(current, trace)))
trace = 0;
}
p = copy_process(NULL, trace, NUMA_NO_NODE, args);
add_latent_entropy();
if (IS_ERR(p))
return PTR_ERR(p);
/*
* Do this prior waking up the new thread - the thread pointer
* might get invalid after that point, if the thread exits quickly.
*/
trace_sched_process_fork(current, p);
pid = get_task_pid(p, PIDTYPE_PID);
nr = pid_vnr(pid);
if (clone_flags & CLONE_PARENT_SETTID)
put_user(nr, args->parent_tid);
if (clone_flags & CLONE_VFORK) {
p->vfork_done = &vfork;
init_completion(&vfork);
get_task_struct(p);
}
if (IS_ENABLED(CONFIG_LRU_GEN_WALKS_MMU) && !(clone_flags & CLONE_VM)) {
/* lock the task to synchronize with memcg migration */
task_lock(p);
lru_gen_add_mm(p->mm);
task_unlock(p);
}
wake_up_new_task(p);
/* forking complete and child started to run, tell ptracer */
if (unlikely(trace))
ptrace_event_pid(trace, pid);
if (clone_flags & CLONE_VFORK) {
if (!wait_for_vfork_done(p, &vfork))
ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
}
put_pid(pid);
return nr;
}
/*
* wake_up_new_task - wake up a newly created task for the first time.
*
* This function will do some initial scheduler statistics housekeeping
* that must be done for every newly created context, then puts the task
* on the runqueue and wakes it.
*/
void wake_up_new_task(struct task_struct *p)
{
struct rq_flags rf;
struct rq *rq;
raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
WRITE_ONCE(p->__state, TASK_RUNNING);
#ifdef CONFIG_SMP
/*
* Fork balancing, do it here and not earlier because:
* - cpus_ptr can change in the fork path
* - any previously selected CPU might disappear through hotplug
*
* Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
* as we're not fully set-up yet.
*/
p->recent_used_cpu = task_cpu(p);
rseq_migrate(p);
__set_task_cpu(p, select_task_rq(p, task_cpu(p), WF_FORK));
#endif
rq = __task_rq_lock(p, &rf);
update_rq_clock(rq);
post_init_entity_util_avg(p);
activate_task(rq, p, ENQUEUE_NOCLOCK);
trace_sched_wakeup_new(p);
wakeup_preempt(rq, p, WF_FORK);
#ifdef CONFIG_SMP
if (p->sched_class->task_woken) {
/*
* Nothing relies on rq->lock after this, so it's fine to
* drop it.
*/
rq_unpin_lock(rq, &rf);
p->sched_class->task_woken(rq, p);
rq_repin_lock(rq, &rf);
}
#endif
task_rq_unlock(rq, p, &rf);
}
asmlinkage void ret_from_fork_asm(void);
__visible void ret_from_fork(struct task_struct *prev, struct pt_regs *regs,
int (*fn)(void *), void *fn_arg);
int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long sp = args->stack;
unsigned long tls = args->tls;
struct inactive_task_frame *frame;
struct fork_frame *fork_frame;
struct pt_regs *childregs;
unsigned long new_ssp;
int ret = 0;
childregs = task_pt_regs(p);
fork_frame = container_of(childregs, struct fork_frame, regs);
frame = &fork_frame->frame;
frame->bp = encode_frame_pointer(childregs);
frame->ret_addr = (unsigned long) ret_from_fork_asm;
p->thread.sp = (unsigned long) fork_frame;
p->thread.io_bitmap = NULL;
p->thread.iopl_warn = 0;
memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
#ifdef CONFIG_X86_64
current_save_fsgs();
p->thread.fsindex = current->thread.fsindex;
p->thread.fsbase = current->thread.fsbase;
p->thread.gsindex = current->thread.gsindex;
p->thread.gsbase = current->thread.gsbase;
savesegment(es, p->thread.es);
savesegment(ds, p->thread.ds);
if (p->mm && (clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM)
set_bit(MM_CONTEXT_LOCK_LAM, &p->mm->context.flags);
#else
p->thread.sp0 = (unsigned long) (childregs + 1);
savesegment(gs, p->thread.gs);
/*
* Clear all status flags including IF and set fixed bit. 64bit
* does not have this initialization as the frame does not contain
* flags. The flags consistency (especially vs. AC) is there
* ensured via objtool, which lacks 32bit support.
*/
frame->flags = X86_EFLAGS_FIXED;
#endif
/*
* Allocate a new shadow stack for thread if needed. If shadow stack,
* is disabled, new_ssp will remain 0, and fpu_clone() will know not to
* update it.
*/
new_ssp = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
if (IS_ERR_VALUE(new_ssp))
return PTR_ERR((void *)new_ssp);
fpu_clone(p, clone_flags, args->fn, new_ssp);
/* Kernel thread ? */
if (unlikely(p->flags & PF_KTHREAD)) {
p->thread.pkru = pkru_get_init_value();
memset(childregs, 0, sizeof(struct pt_regs));
kthread_frame_init(frame, args->fn, args->fn_arg);
return 0;
}
/*
* Clone current's PKRU value from hardware. tsk->thread.pkru
* is only valid when scheduled out.
*/
p->thread.pkru = read_pkru();
frame->bx = 0;
*childregs = *current_pt_regs();
childregs->ax = 0;
if (sp)
childregs->sp = sp;
if (unlikely(args->fn)) {
/*
* A user space thread, but it doesn't return to
* ret_after_fork().
*
* In order to indicate that to tools like gdb,
* we reset the stack and instruction pointers.
*
* It does the same kernel frame setup to return to a kernel
* function that a kernel thread does.
*/
childregs->sp = 0;
childregs->ip = 0;
kthread_frame_init(frame, args->fn, args->fn_arg);
return 0;
}
/* Set a new TLS for the child thread? */
if (clone_flags & CLONE_SETTLS)
ret = set_new_tls(p, tls);
if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
io_bitmap_share(p);
return ret;
}
/*
* A newly forked process directly context switches into this address.
*
* rax: prev task we switched from
* rbx: kernel thread func (NULL for user thread)
* r12: kernel thread arg
*/
.pushsection .text, "ax"
SYM_CODE_START(ret_from_fork_asm)
/*
* This is the start of the kernel stack; even through there's a
* register set at the top, the regset isn't necessarily coherent
* (consider kthreads) and one cannot unwind further.
*
* This ensures stack unwinds of kernel threads terminate in a known
* good state.
*/
UNWIND_HINT_END_OF_STACK
ANNOTATE_NOENDBR // copy_thread
CALL_DEPTH_ACCOUNT
movq %rax, %rdi /* prev */
movq %rsp, %rsi /* regs */
movq %rbx, %rdx /* fn */
movq %r12, %rcx /* fn_arg */
call ret_from_fork
/*
* Set the stack state to what is expected for the target function
* -- at this point the register set should be a valid user set
* and unwind should work normally.
*/
UNWIND_HINT_REGS
jmp swapgs_restore_regs_and_return_to_usermode
SYM_CODE_END(ret_from_fork_asm)
.popsection
__visible void ret_from_fork(struct task_struct *prev, struct pt_regs *regs,
int (*fn)(void *), void *fn_arg)
{
schedule_tail(prev);
/* Is this a kernel thread? */
if (unlikely(fn)) {
fn(fn_arg);
/*
* A kernel thread is allowed to return here after successfully
* calling kernel_execve(). Exit to userspace to complete the
* execve() syscall.
*/
regs->ax = 0;
}
syscall_exit_to_user_mode(regs);
}
__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
{
instrumentation_begin();
__syscall_exit_to_user_mode_work(regs);
instrumentation_end();
exit_to_user_mode();
}
static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs)
{
syscall_exit_to_user_mode_prepare(regs);
local_irq_disable_exit_to_user();
exit_to_user_mode_prepare(regs);
}
/*
* 64-bit SYSCALL instruction entry. Up to 6 arguments in registers.
*
* This is the only entry point used for 64-bit system calls. The
* hardware interface is reasonably well designed and the register to
* argument mapping Linux uses fits well with the registers that are
* available when SYSCALL is used.
*
* SYSCALL instructions can be found inlined in libc implementations as
* well as some other programs and libraries. There are also a handful
* of SYSCALL instructions in the vDSO used, for example, as a
* clock_gettimeofday fallback.
*
* 64-bit SYSCALL saves rip to rcx, clears rflags.RF, then saves rflags to r11,
* then loads new ss, cs, and rip from previously programmed MSRs.
* rflags gets masked by a value from another MSR (so CLD and CLAC
* are not needed). SYSCALL does not save anything on the stack
* and does not change rsp.
*
* Registers on entry:
* rax system call number
* rcx return address
* r11 saved rflags (note: r11 is callee-clobbered register in C ABI)
* rdi arg0
* rsi arg1
* rdx arg2
* r10 arg3 (needs to be moved to rcx to conform to C ABI)
* r8 arg4
* r9 arg5
* (note: r12-r15, rbp, rbx are callee-preserved in C ABI)
*
* Only called from user space.
*
* When user can change pt_regs->foo always force IRET. That is because
* it deals with uncanonical addresses better. SYSRET has trouble
* with them due to bugs in both AMD and Intel CPUs.
*/
SYM_CODE_START(entry_SYSCALL_64)
UNWIND_HINT_ENTRY
ENDBR
swapgs
/* tss.sp2 is scratch space. */
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
movq PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp
SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)
ANNOTATE_NOENDBR
/* Construct struct pt_regs on stack */
pushq $__USER_DS /* pt_regs->ss */
pushq PER_CPU_VAR(cpu_tss_rw + TSS_sp2) /* pt_regs->sp */
pushq %r11 /* pt_regs->flags */
pushq $__USER_CS /* pt_regs->cs */
pushq %rcx /* pt_regs->ip */
SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
pushq %rax /* pt_regs->orig_ax */
PUSH_AND_CLEAR_REGS rax=$-ENOSYS
/* IRQs are off. */
movq %rsp, %rdi
/* Sign extend the lower 32bit as syscall numbers are treated as int */
movslq %eax, %rsi
/* clobbers %rax, make sure it is after saving the syscall nr */
IBRS_ENTER
UNTRAIN_RET
CLEAR_BRANCH_HISTORY
call do_syscall_64 /* returns with IRQs disabled */
/*
* Try to use SYSRET instead of IRET if we're returning to
* a completely clean 64-bit userspace context. If we're not,
* go to the slow exit path.
* In the Xen PV case we must use iret anyway.
*/
ALTERNATIVE "testb %al, %al; jz swapgs_restore_regs_and_return_to_usermode", \
"jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
/*
* We win! This label is here just for ease of understanding
* perf profiles. Nothing jumps here.
*/
syscall_return_via_sysret:
IBRS_EXIT
POP_REGS pop_rdi=0
/*
* Now all regs are restored except RSP and RDI.
* Save old stack pointer and switch to trampoline stack.
*/
movq %rsp, %rdi
movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
UNWIND_HINT_END_OF_STACK
pushq RSP-RDI(%rdi) /* RSP */
pushq (%rdi) /* RDI */
/*
* We are on the trampoline stack. All regs except RDI are live.
* We can do future final exit work right here.
*/
STACKLEAK_ERASE_NOCLOBBER
SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
popq %rdi
popq %rsp
SYM_INNER_LABEL(entry_SYSRETQ_unsafe_stack, SYM_L_GLOBAL)
ANNOTATE_NOENDBR
swapgs
CLEAR_CPU_BUFFERS
sysretq
SYM_INNER_LABEL(entry_SYSRETQ_end, SYM_L_GLOBAL)
ANNOTATE_NOENDBR
int3
SYM_CODE_END(entry_SYSCALL_64)
SYM_INNER_LABEL(native_irq_return_iret, SYM_L_GLOBAL)
ANNOTATE_NOENDBR // exc_double_fault
/*
* This may fault. Non-paranoid faults on return to userspace are
* handled by fixup_bad_iret. These include #SS, #GP, and #NP.
* Double-faults due to espfix64 are handled in exc_double_fault.
* Other faults here are fatal.
*/
iretq
#ifdef CONFIG_X86_ESPFIX64
native_irq_return_ldt:
/*
* We are running with user GSBASE. All GPRs contain their user
* values. We have a percpu ESPFIX stack that is eight slots
* long (see ESPFIX_STACK_SIZE). espfix_waddr points to the bottom
* of the ESPFIX stack.
*
* We clobber RAX and RDI in this code. We stash RDI on the
* normal stack and RAX on the ESPFIX stack.
*
* The ESPFIX stack layout we set up looks like this:
*
* --- top of ESPFIX stack ---
* SS
* RSP
* RFLAGS
* CS
* RIP <-- RSP points here when we're done
* RAX <-- espfix_waddr points here
* --- bottom of ESPFIX stack ---
*/
pushq %rdi /* Stash user RDI */
swapgs /* to kernel GS */
SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi /* to kernel CR3 */
movq PER_CPU_VAR(espfix_waddr), %rdi
movq %rax, (0*8)(%rdi) /* user RAX */
movq (1*8)(%rsp), %rax /* user RIP */
movq %rax, (1*8)(%rdi)
movq (2*8)(%rsp), %rax /* user CS */
movq %rax, (2*8)(%rdi)
movq (3*8)(%rsp), %rax /* user RFLAGS */
movq %rax, (3*8)(%rdi)
movq (5*8)(%rsp), %rax /* user SS */
movq %rax, (5*8)(%rdi)
movq (4*8)(%rsp), %rax /* user RSP */
movq %rax, (4*8)(%rdi)
/* Now RAX == RSP. */
andl $0xffff0000, %eax /* RAX = (RSP & 0xffff0000) */
/*
* espfix_stack[31:16] == 0. The page tables are set up such that
* (espfix_stack | (X & 0xffff0000)) points to a read-only alias of
* espfix_waddr for any X. That is, there are 65536 RO aliases of
* the same page. Set up RSP so that RSP[31:16] contains the
* respective 16 bits of the /userspace/ RSP and RSP nonetheless
* still points to an RO alias of the ESPFIX stack.
*/
orq PER_CPU_VAR(espfix_stack), %rax
SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
swapgs /* to user GS */
popq %rdi /* Restore user RDI */
movq %rax, %rsp
UNWIND_HINT_IRET_REGS offset=8
/*
* At this point, we cannot write to the stack any more, but we can
* still read.
*/
popq %rax /* Restore user RAX */
CLEAR_CPU_BUFFERS
/*
* RSP now points to an ordinary IRET frame, except that the page
* is read-only and RSP[31:16] are preloaded with the userspace
* values. We can now IRET back to userspace.
*/
jmp native_irq_return_iret
#endif
.Lnative_iret:
UNWIND_HINT_IRET_REGS
/*
* Are we returning to a stack segment from the LDT? Note: in
* 64-bit mode SS:RSP on the exception stack is always valid.
*/
#ifdef CONFIG_X86_ESPFIX64
testb $4, (SS-RIP)(%rsp)
jnz native_irq_return_ldt
#endif
.Lswapgs_and_iret:
swapgs
CLEAR_CPU_BUFFERS
/* Assert that the IRET frame indicates user mode. */
testb $3, 8(%rsp)
jnz .Lnative_iret
ud2
#ifdef CONFIG_PAGE_TABLE_ISOLATION
.Lpti_restore_regs_and_return_to_usermode:
POP_REGS pop_rdi=0
/*
* The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS.
* Save old stack pointer and switch to trampoline stack.
*/
movq %rsp, %rdi
movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
UNWIND_HINT_END_OF_STACK
/* Copy the IRET frame to the trampoline stack. */
pushq 6*8(%rdi) /* SS */
pushq 5*8(%rdi) /* RSP */
pushq 4*8(%rdi) /* EFLAGS */
pushq 3*8(%rdi) /* CS */
pushq 2*8(%rdi) /* RIP */
/* Push user RDI on the trampoline stack. */
pushq (%rdi)
/*
* We are on the trampoline stack. All regs except RDI are live.
* We can do future final exit work right here.
*/
STACKLEAK_ERASE_NOCLOBBER
push %rax
SWITCH_TO_USER_CR3 scratch_reg=%rdi scratch_reg2=%rax
pop %rax
/* Restore RDI. */
popq %rdi
jmp .Lswapgs_and_iret
#endif
```