In Depth: From JVM Thread Creation to the Linux clone System Call

A thread in a Java program is, at the JVM level, a java.lang.Thread object; the entity that actually executes code is a light-weight process (LWP) scheduled by the OS kernel. For OpenJDK 17 on Linux, the thread-creation path is roughly: JVM → pthread_create (glibc) → the clone / clone3 system call → the kernel. Based on the source code of OpenJDK 17 and glibc 2.34+, this article walks down that path layer by layer, following a Java thread from its birth to its acceptance by the kernel.

1. The JVM layer: os::create_thread prepares the pthread attributes

The function where OpenJDK talks to the operating system to create a thread is os::create_thread. It builds a pthread attribute object and finally calls pthread_create. (Note: the listing below, like the full version in the appendix, is from the AIX port in os_aix.cpp, as the os::Aix::on_aix() check and the AIX-only pthread_attr_setsuspendstate_np show; the Linux port in os_linux.cpp follows the same outline but implements the suspended start with a handshake in thread_native_entry instead.) Let's look at the key work it does.

```cpp
bool os::create_thread(Thread* thread, ThreadType thr_type, size_t req_stack_size) {
    // 1. Create the OSThread object and attach it to the JVM Thread
    OSThread* osthread = new OSThread(NULL, NULL);
    osthread->set_thread_type(thr_type);
    osthread->set_state(ALLOCATED);
    thread->set_osthread(osthread);

    // 2. Initialize the pthread attributes
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

    // 3. Set the contention scope (only AIX needs PTHREAD_SCOPE_SYSTEM;
    //    on Linux system-wide contention is already the default)
    // 4. Start the new thread suspended (SUSPENDED); wake it once the JVM is ready
    pthread_attr_setsuspendstate_np(&attr, PTHREAD_CREATE_SUSPENDED_NP);

    // 5. Determine the stack size (JDK-8187028 workaround: add 64K to small stacks)
    size_t stack_size = os::Posix::get_initial_stack_size(thr_type, req_stack_size);
    if (stack_size < 4096 * K) stack_size += 64 * K;
    pthread_attr_setstacksize(&attr, stack_size);

    // 6. For Java and compiler threads, disable the OS guard page (the JVM manages its own)
    if (thr_type == java_thread || thr_type == compiler_thread)
        pthread_attr_setguardsize(&attr, 0);

    // 7. Actually create the thread
    pthread_t tid;
    int ret = pthread_create(&tid, &attr, thread_native_entry, thread);
    // ... error handling and logging
}
```

Notable design decisions

  • Suspended start: PTHREAD_CREATE_SUSPENDED_NP (a non-standard extension) keeps the new thread from running right after creation; the JVM finishes some initialization (such as setting up the Thread object's state) and then wakes it explicitly. This prevents the thread from entering run() while the JVM state is not yet ready.

  • Stack-size compensation: under some glibc versions, the actual allocation for a small requested stack (< 4 MB) can come up 64 KB short, so the JVM proactively adds 64 KB. This is the product of wrestling with an implementation detail of the underlying C library.

  • Guard pages: the JVM installs its own guard pages for Java thread stacks, so it tells pthread not to add another one, saving memory.

pthread_create is the standard POSIX thread-creation interface, but internally glibc does not issue the clone system call directly; it goes through a series of wrappers. Below we dig into the glibc source.

2. The glibc layer: from pthread_create to the clone system call

2.1 __pthread_create_2_1: the overall controller of thread creation

The implementation of pthread_create in glibc lives in nptl/pthread_create.c. It goes through the following main steps (simplified):

  1. Process the thread attributes (if NULL is passed, fetch the defaults).

  2. Allocate the stack and the thread descriptor (struct pthread *pd) via allocate_stack.

  3. Initialize the fields of pd: the start function start_routine, its argument arg, the signal mask, the scheduling policy, and so on.

  4. If a debugger is tracing thread creation, set stopped_start = true so the new thread blocks after creation, waiting for the debugger to attach.

  5. Call create_thread, which actually issues the system call.

2.2 create_thread: preparing the clone arguments

create_thread (in the same file) builds the clone_args structure and calls __clone_internal:

```cpp
static int create_thread (struct pthread *pd, const struct pthread_attr *attr,
                          bool *stopped_start, void *stackaddr,
                          size_t stacksize, bool *thread_ran) {
    // clone_flags decide which resources the new process (thread) shares with its parent
    const int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SYSVSEM
                             | CLONE_SIGHAND | CLONE_THREAD
                             | CLONE_SETTLS | CLONE_PARENT_SETTID
                             | CLONE_CHILD_CLEARTID | 0);

    TLS_DEFINE_INIT_TP (tp, pd);   // obtain the thread-local storage (TLS) pointer

    struct clone_args args = {
        .flags = clone_flags,
        .pidfd = (uintptr_t) &pd->tid,
        .parent_tid = (uintptr_t) &pd->tid,
        .child_tid = (uintptr_t) &pd->tid,
        .stack = (uintptr_t) stackaddr,
        .stack_size = stacksize,
        .tls = (uintptr_t) tp,
    };

    int ret = __clone_internal (&args, &start_thread, pd);
    // ...
}
```

Each bit of clone_flags names a resource the new thread shares with its parent:

  • CLONE_VM: share the memory address space (a prerequisite for being a thread).

  • CLONE_FS: share filesystem information (current working directory, umask, etc.).

  • CLONE_FILES: share the file descriptor table.

  • CLONE_SYSVSEM: share the System V semaphore undo list.

  • CLONE_SIGHAND: share the signal handler table.

  • CLONE_THREAD: put the new task into the same thread group (getpid() returns the same value).

  • CLONE_SETTLS: set up the thread-local storage area.

  • CLONE_PARENT_SETTID: write the thread ID to an address supplied by the parent (pd->tid).

  • CLONE_CHILD_CLEARTID: clear pd->tid when the thread exits, which is what wakes up pthread_join.

2.3 __clone_internal: prefer clone3, fall back to clone

glibc 2.34 added a wrapper for the more modern clone3 system call, which passes its arguments through a structure and is easier to extend than the old clone (which passes arguments in registers). __clone_internal implements the automatic selection:

```cpp
int __clone_internal (struct clone_args *cl_args, int (*func)(void *), void *arg) {
#ifdef HAVE_CLONE3_WRAPPER
    int saved_errno = errno;
    int ret = __clone3_internal (cl_args, func, arg);
    if (ret != -1 || errno != ENOSYS)
        return ret;
    __set_errno (saved_errno);
#endif
    return __clone_internal_fallback (cl_args, func, arg); // legacy clone
}
```

__clone3_internal checks whether the kernel supports clone3 (caching the verdict after the first ENOSYS failure); if so, it calls the __clone3 assembly wrapper, otherwise it degrades to the traditional clone.

2.4 __clone3 in assembly: trapping into the kernel

On x86-64, the assembly implementation of __clone3 (from sysdeps/unix/sysv/linux/x86_64/clone3.S):

```assembly
ENTRY (__clone3)
    // Arguments: rdi = cl_args, rsi = size, rdx = func, rcx = arg
    test %RDI_LP, %RDI_LP   // cl_args must not be NULL
    jz   SYSCALL_ERROR_LABEL
    test %RDX_LP, %RDX_LP   // func must not be NULL
    jz   SYSCALL_ERROR_LABEL

    mov %RCX_LP, %R8_LP     // stash arg in r8 (preserved across the syscall)
    movl $SYS_ify(clone3), %eax  // syscall number 435
    syscall
    test %RAX_LP, %RAX_LP
    jl   SYSCALL_ERROR_LABEL
    jz   L(thread_start)    // a return value of 0 means we are the child (new thread)
    ret                     // the parent thread returns

L(thread_start):
    xorl %ebp, %ebp          // zero the frame pointer to mark the outermost frame
    mov %R8_LP, %RDI_LP      // argument: arg
    call *%rdx               // call func(arg)
    movq %rax, %rdi          // exit(return value of func)
    movl $SYS_ify(exit), %eax
    syscall
END (__clone3)
```

Key points:

  • The child (the new thread) begins executing at L(thread_start): it calls func(arg) directly and then terminates via the exit system call; it never returns into pthread_create.

  • The parent (original) thread continues after the syscall returns; ret takes it back into __clone_internal.

The traditional clone system call passes the stack pointer, flag bits, and so on in registers, which limits the number of parameters; clone3 uses a structure, which is cleaner and leaves room for future extension. Linux kernels support clone3 starting with 5.3.

3. The kernel layer: where clone / clone3 ends up

After the syscall instruction executes, the CPU traps into the kernel, which dispatches on the system call number to sys_clone3 (number 435; the legacy sys_clone has its own number, 56 on x86-64). The kernel then:

  1. Copies the current process's task_struct.

  2. Decides which resources to share based on clone_flags (CLONE_VM → share the memory descriptor, CLONE_FILES → share the file table, etc.).

  3. Allocates a kernel stack for the new task and sets up its TLS (if CLONE_SETTLS is set).

  4. Adds the new task to the run queue; the scheduling policy decides whether it preempts immediately.

For a thread (CLONE_THREAD), the kernel points the new task_struct's group_leader at the parent's thread-group leader, so getpid() returns the same PID. The new task's tgid (thread-group ID) equals the parent's, while its pid (thread ID) is newly allocated.

When the new thread is first scheduled, it returns from kernel mode to user mode at the instruction right after syscall, just like its parent, but running on the freshly supplied stack and with a return value of 0. It is the jz L(thread_start) branch in the glibc wrapper, not the kernel, that then steers it to L(thread_start). This is what "the child starts executing at L(thread_start)" in the assembly above really means.

4. What is special about JVM thread creation

Back in the JVM: after pthread_create returns, the new thread is held back from running user code (on AIX via PTHREAD_CREATE_SUSPENDED_NP; on Linux the new thread parks itself inside thread_native_entry until released). The JVM then:

  • Sets up the state of the thread's Thread object.

  • Attaches the thread to its ThreadGroup if necessary.

  • Finally calls os::start_thread(thread), which wakes the new thread through a condition variable or a similar synchronization primitive, letting it proceed through thread_native_entry.

thread_native_entry calls Thread::call_run(), which ultimately executes the Java code's run() method.

5. Summary and reflections

From the JVM down to the kernel, thread creation spans several abstraction layers, each with its own responsibilities:

| Layer | Responsibility | Key actions |
| --- | --- | --- |
| JVM | Manage the Java thread object; control when it starts | Set pthread attributes (suspended start, stack-size compensation, disabled guard page) |
| glibc | Implement POSIX thread semantics; allocate the user stack and TCB | Build clone_args; invoke the clone3/clone system call |
| Linux kernel | Create the kernel task_struct; schedule it | Share resources per the flags; set up TLS; return to user space to run the start function |

A few design points worth pondering

  1. Why the suspended start is necessary: the JVM must ensure the new thread does not execute Java code before it is fully initialized, or it could observe inconsistent JVM-internal state. A suspended start is lighter-weight than guarding a critical section with a mutex and avoids lock contention.

  2. Precise control of the stack size: the JVM actively compensates for stack space glibc may "steal", reflecting fine-grained memory management. It also disables the OS guard page, because the JVM implements its own protection (yellow and red guard pages) and can handle stack overflow more efficiently.

  3. The graceful evolution of clone3: through the ENOSYS fallback, glibc exploits new-kernel features while staying compatible with old kernels. This "runtime detection" pattern is common in systems programming.

Understanding this whole pipeline not only helps in diagnosing tricky multithreading problems (such as stack overflows and thread-creation failures), it also deepens one's appreciation of the delicate cooperation between user space and the kernel. The next time you write new Thread(() -> {...}).start(), picture the ripples this one line sends through the layers below.

## Source code

os::create_thread (OpenJDK 17, src/hotspot/os/aix/os_aix.cpp):
bool os::create_thread(Thread* thread, ThreadType thr_type,
                       size_t req_stack_size) {

  assert(thread->osthread() == NULL, "caller responsible");

  // Allocate the OSThread object.
  OSThread* osthread = new OSThread(NULL, NULL);
  if (osthread == NULL) {
    return false;
  }

  // Set the correct thread state.
  osthread->set_thread_type(thr_type);

  // Initial state is ALLOCATED but not INITIALIZED
  osthread->set_state(ALLOCATED);

  thread->set_osthread(osthread);

  // Init thread attributes.
  pthread_attr_t attr;
  pthread_attr_init(&attr);
  guarantee(pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED) == 0, "???");

  // Make sure we run in 1:1 kernel-user-thread mode.
  if (os::Aix::on_aix()) {
    guarantee(pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM) == 0, "???");
    guarantee(pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED) == 0, "???");
  }

  // Start in suspended state, and in os::thread_start, wake the thread up.
  guarantee(pthread_attr_setsuspendstate_np(&attr, PTHREAD_CREATE_SUSPENDED_NP) == 0, "???");

  // Calculate stack size if it's not specified by caller.
  size_t stack_size = os::Posix::get_initial_stack_size(thr_type, req_stack_size);

  // JDK-8187028: It was observed that on some configurations (4K backed thread stacks)
  // the real thread stack size may be smaller than the requested stack size, by as much as 64K.
  // This very much looks like a pthread lib error. As a workaround, increase the stack size
  // by 64K for small thread stacks (arbitrarily choosen to be < 4MB)
  if (stack_size < 4096 * K) {
    stack_size += 64 * K;
  }

  // On Aix, pthread_attr_setstacksize fails with huge values and leaves the
  // thread size in attr unchanged. If this is the minimal stack size as set
  // by pthread_attr_init this leads to crashes after thread creation. E.g. the
  // guard pages might not fit on the tiny stack created.
  int ret = pthread_attr_setstacksize(&attr, stack_size);
  if (ret != 0) {
    log_warning(os, thread)("The %sthread stack size specified is invalid: " SIZE_FORMAT "k",
                            (thr_type == compiler_thread) ? "compiler " : ((thr_type == java_thread) ? "" : "VM "),
                            stack_size / K);
    thread->set_osthread(NULL);
    delete osthread;
    return false;
  }

  // Save some cycles and a page by disabling OS guard pages where we have our own
  // VM guard pages (in java threads). For other threads, keep system default guard
  // pages in place.
  if (thr_type == java_thread || thr_type == compiler_thread) {
    ret = pthread_attr_setguardsize(&attr, 0);
  }

  pthread_t tid = 0;
  if (ret == 0) {
    ret = pthread_create(&tid, &attr, (void* (*)(void*)) thread_native_entry, thread);
  }

  if (ret == 0) {
    char buf[64];
    log_info(os, thread)("Thread started (pthread id: " UINTX_FORMAT ", attributes: %s). ",
      (uintx) tid, os::Posix::describe_pthread_attr(buf, sizeof(buf), &attr));
  } else {
    char buf[64];
    log_warning(os, thread)("Failed to start thread - pthread_create failed (%d=%s) for attributes: %s.",
      ret, os::errno_name(ret), os::Posix::describe_pthread_attr(buf, sizeof(buf), &attr));
    // Log some OS information which might explain why creating the thread failed.
    log_info(os, thread)("Number of threads approx. running in the VM: %d", Threads::number_of_threads());
    LogStream st(Log(os, thread)::info());
    os::Posix::print_rlimit_info(&st);
    os::print_memory_info(&st);
  }

  pthread_attr_destroy(&attr);

  if (ret != 0) {
    // Need to clean up stuff we've allocated so far.
    thread->set_osthread(NULL);
    delete osthread;
    return false;
  }

  // OSThread::thread_id is the pthread id.
  osthread->set_thread_id(tid);

  return true;
}

__clone3 (glibc, sysdeps/unix/sysv/linux/x86_64/clone3.S, together with its syscall number and declaration):

#define __NR_clone3 435

/* The clone3 syscall wrapper.  Linux/x86-64 version.
   Copyright (C) 2021-2024 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, see
   <https://www.gnu.org/licenses/>.  */

/* clone3() is even more special than fork() as it mucks with stacks
   and invokes a function in the right context after its all over.  */

#include <sysdep.h>

/* The userland implementation is:
   int clone3 (struct clone_args *cl_args, size_t size,
	       int (*func)(void *arg), void *arg);
   the kernel entry is:
   int clone3 (struct clone_args *cl_args, size_t size);

   The parameters are passed in registers from userland:
   rdi: cl_args
   rsi: size
   rdx: func
   rcx: arg

   The kernel expects:
   rax: system call number
   rdi: cl_args
   rsi: size  */

        .text
ENTRY (__clone3)
	/* Sanity check arguments.  */
	movl	$-EINVAL, %eax
	test	%RDI_LP, %RDI_LP	/* No NULL cl_args pointer.  */
	jz	SYSCALL_ERROR_LABEL
	test	%RDX_LP, %RDX_LP	/* No NULL function pointer.  */
	jz	SYSCALL_ERROR_LABEL

	/* Save the cl_args pointer in R8 which is preserved by the
	   syscall.  */
	mov	%RCX_LP, %R8_LP

	/* Do the system call.  */
	movl	$SYS_ify(clone3), %eax

	/* End FDE now, because in the child the unwind info will be
	   wrong.  */
	cfi_endproc
	syscall

	test	%RAX_LP, %RAX_LP
	jl	SYSCALL_ERROR_LABEL
	jz	L(thread_start)

	ret

L(thread_start):
	cfi_startproc
	/* Clearing frame pointer is insufficient, use CFI.  */
	cfi_undefined (rip)
	/* Clear the frame pointer.  The ABI suggests this be done, to mark
	   the outermost frame obviously.  */
	xorl	%ebp, %ebp

	/* Set up arguments for the function call.  */
	mov	%R8_LP, %RDI_LP	/* Argument.  */
	call	*%rdx		/* Call function.  */
	/* Call exit with return value from function call. */
	movq	%rax, %rdi
	movl	$SYS_ify(exit), %eax
	syscall
	cfi_endproc

	cfi_startproc
PSEUDO_END (__clone3)

libc_hidden_def (__clone3)
weak_alias (__clone3, clone3)

/* The clone3 syscall provides a superset of the functionality of the clone
   interface.  The kernel might extend __CL_ARGS struct in the future, with
   each version with a different __SIZE.  If the child is created, it will
   start __FUNC function with __ARG arguments.

   Different than kernel, the implementation also returns EINVAL for an
   invalid NULL __CL_ARGS or __FUNC (similar to __clone).

   All callers are responsible for correctly aligning the stack.  The stack is
   not aligned prior to the syscall (this differs from the exported __clone).

   This function is only implemented if the ABI defines HAVE_CLONE3_WRAPPER.
*/
extern int __clone3 (struct clone_args *__cl_args, size_t __size,
		     int (*__func) (void *__arg), void *__arg);



create_thread, __clone3_internal, __clone_internal, and __pthread_create_2_1 (glibc, nptl/pthread_create.c and sysdeps/unix/sysv/linux/clone-internal.c):

static int create_thread (struct pthread *pd, const struct pthread_attr *attr,
			  bool *stopped_start, void *stackaddr,
			  size_t stacksize, bool *thread_ran)
{
  /* Determine whether the newly created threads has to be started
     stopped since we have to set the scheduling parameters or set the
     affinity.  */
  bool need_setaffinity = (attr != NULL && attr->extension != NULL
			   && attr->extension->cpuset != 0);
  if (attr != NULL
      && (__glibc_unlikely (need_setaffinity)
	  || __glibc_unlikely ((attr->flags & ATTR_FLAG_NOTINHERITSCHED) != 0)))
    *stopped_start = true;

  pd->stopped_start = *stopped_start;
  if (__glibc_unlikely (*stopped_start))
    lll_lock (pd->lock, LLL_PRIVATE);

  /* We rely heavily on various flags the CLONE function understands:

     CLONE_VM, CLONE_FS, CLONE_FILES
	These flags select semantics with shared address space and
	file descriptors according to what POSIX requires.

     CLONE_SIGHAND, CLONE_THREAD
	This flag selects the POSIX signal semantics and various
	other kinds of sharing (itimers, POSIX timers, etc.).

     CLONE_SETTLS
	The sixth parameter to CLONE determines the TLS area for the
	new thread.

     CLONE_PARENT_SETTID
	The kernels writes the thread ID of the newly created thread
	into the location pointed to by the fifth parameters to CLONE.

	Note that it would be semantically equivalent to use
	CLONE_CHILD_SETTID but it is be more expensive in the kernel.

     CLONE_CHILD_CLEARTID
	The kernels clears the thread ID of a thread that has called
	sys_exit() in the location pointed to by the seventh parameter
	to CLONE.

     The termination signal is chosen to be zero which means no signal
     is sent.  */
  const int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SYSVSEM
			   | CLONE_SIGHAND | CLONE_THREAD
			   | CLONE_SETTLS | CLONE_PARENT_SETTID
			   | CLONE_CHILD_CLEARTID
			   | 0);

  TLS_DEFINE_INIT_TP (tp, pd);

  struct clone_args args =
    {
      .flags = clone_flags,
      .pidfd = (uintptr_t) &pd->tid,
      .parent_tid = (uintptr_t) &pd->tid,
      .child_tid = (uintptr_t) &pd->tid,
      .stack = (uintptr_t) stackaddr,
      .stack_size = stacksize,
      .tls = (uintptr_t) tp,
    };
  int ret = __clone_internal (&args, &start_thread, pd);
  if (__glibc_unlikely (ret == -1))
    return errno;

  /* It's started now, so if we fail below, we'll have to let it clean itself
     up.  */
  *thread_ran = true;

  /* Now we have the possibility to set scheduling parameters etc.  */
  if (attr != NULL)
    {
      /* Set the affinity mask if necessary.  */
      if (need_setaffinity)
	{
	  assert (*stopped_start);

	  int res = INTERNAL_SYSCALL_CALL (sched_setaffinity, pd->tid,
					   attr->extension->cpusetsize,
					   attr->extension->cpuset);
	  if (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (res)))
	    return INTERNAL_SYSCALL_ERRNO (res);
	}

      /* Set the scheduling parameters.  */
      if ((attr->flags & ATTR_FLAG_NOTINHERITSCHED) != 0)
	{
	  assert (*stopped_start);

	  int res = INTERNAL_SYSCALL_CALL (sched_setscheduler, pd->tid,
					   pd->schedpolicy, &pd->schedparam);
	  if (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (res)))
	    return INTERNAL_SYSCALL_ERRNO (res);
	}
    }

  return 0;
}


int
__clone3_internal (struct clone_args *cl_args, int (*func) (void *args),
		   void *arg)
{
#ifdef HAVE_CLONE3_WRAPPER
# if __ASSUME_CLONE3
  return __clone3 (cl_args, sizeof (*cl_args), func, arg);
# else
  static int clone3_supported = 1;
  if (atomic_load_relaxed (&clone3_supported) == 1)
    {
      int ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
      if (ret != -1 || errno != ENOSYS)
	return ret;

      atomic_store_relaxed (&clone3_supported, 0);
    }
# endif
#endif
  __set_errno (ENOSYS);
  return -1;
}

int
__clone_internal (struct clone_args *cl_args,
		  int (*func) (void *arg), void *arg)
{
#ifdef HAVE_CLONE3_WRAPPER
  int saved_errno = errno;
  int ret = __clone3_internal (cl_args, func, arg);
  if (ret != -1 || errno != ENOSYS)
    return ret;

  /* NB: Restore errno since errno may be checked against non-zero
     return value.  */
  __set_errno (saved_errno);
#endif

  return __clone_internal_fallback (cl_args, func, arg);
}


int
__pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
		      void *(*start_routine) (void *), void *arg)
{
  void *stackaddr = NULL;
  size_t stacksize = 0;

  /* Avoid a data race in the multi-threaded case, and call the
     deferred initialization only once.  */
  if (__libc_single_threaded_internal)
    {
      late_init ();
      __libc_single_threaded_internal = 0;
      /* __libc_single_threaded can be accessed through copy relocations, so
	 it requires to update the external copy.  */
      __libc_single_threaded = 0;
    }

  const struct pthread_attr *iattr = (struct pthread_attr *) attr;
  union pthread_attr_transparent default_attr;
  bool destroy_default_attr = false;
  bool c11 = (attr == ATTR_C11_THREAD);
  if (iattr == NULL || c11)
    {
      int ret = __pthread_getattr_default_np (&default_attr.external);
      if (ret != 0)
	return ret;
      destroy_default_attr = true;
      iattr = &default_attr.internal;
    }

  struct pthread *pd = NULL;
  int err = allocate_stack (iattr, &pd, &stackaddr, &stacksize);
  int retval = 0;

  if (__glibc_unlikely (err != 0))
    /* Something went wrong.  Maybe a parameter of the attributes is
       invalid or we could not allocate memory.  Note we have to
       translate error codes.  */
    {
      retval = err == ENOMEM ? EAGAIN : err;
      goto out;
    }


  /* Initialize the TCB.  All initializations with zero should be
     performed in 'get_cached_stack'.  This way we avoid doing this if
     the stack freshly allocated with 'mmap'.  */

#if TLS_TCB_AT_TP
  /* Reference to the TCB itself.  */
  pd->header.self = pd;

  /* Self-reference for TLS.  */
  pd->header.tcb = pd;
#endif

  /* Store the address of the start routine and the parameter.  Since
     we do not start the function directly the stillborn thread will
     get the information from its thread descriptor.  */
  pd->start_routine = start_routine;
  pd->arg = arg;
  pd->c11 = c11;

  /* Copy the thread attribute flags.  */
  struct pthread *self = THREAD_SELF;
  pd->flags = ((iattr->flags & ~(ATTR_FLAG_SCHED_SET | ATTR_FLAG_POLICY_SET))
	       | (self->flags & (ATTR_FLAG_SCHED_SET | ATTR_FLAG_POLICY_SET)));

  /* Inherit rseq registration state.  Without seccomp filters, rseq
     registration will either always fail or always succeed.  */
  if ((int) THREAD_GETMEM_VOLATILE (self, rseq_area.cpu_id) >= 0)
    pd->flags |= ATTR_FLAG_DO_RSEQ;

  /* Initialize the field for the ID of the thread which is waiting
     for us.  This is a self-reference in case the thread is created
     detached.  */
  pd->joinid = iattr->flags & ATTR_FLAG_DETACHSTATE ? pd : NULL;

  /* The debug events are inherited from the parent.  */
  pd->eventbuf = self->eventbuf;


  /* Copy the parent's scheduling parameters.  The flags will say what
     is valid and what is not.  */
  pd->schedpolicy = self->schedpolicy;
  pd->schedparam = self->schedparam;

  /* Copy the stack guard canary.  */
#ifdef THREAD_COPY_STACK_GUARD
  THREAD_COPY_STACK_GUARD (pd);
#endif

  /* Copy the pointer guard value.  */
#ifdef THREAD_COPY_POINTER_GUARD
  THREAD_COPY_POINTER_GUARD (pd);
#endif

  /* Setup tcbhead.  */
  tls_setup_tcbhead (pd);

  /* Verify the sysinfo bits were copied in allocate_stack if needed.  */
#ifdef NEED_DL_SYSINFO
  CHECK_THREAD_SYSINFO (pd);
#endif

  /* Determine scheduling parameters for the thread.  */
  if (__builtin_expect ((iattr->flags & ATTR_FLAG_NOTINHERITSCHED) != 0, 0)
      && (iattr->flags & (ATTR_FLAG_SCHED_SET | ATTR_FLAG_POLICY_SET)) != 0)
    {
      /* Use the scheduling parameters the user provided.  */
      if (iattr->flags & ATTR_FLAG_POLICY_SET)
        {
          pd->schedpolicy = iattr->schedpolicy;
          pd->flags |= ATTR_FLAG_POLICY_SET;
        }
      if (iattr->flags & ATTR_FLAG_SCHED_SET)
        {
          /* The values were validated in pthread_attr_setschedparam.  */
          pd->schedparam = iattr->schedparam;
          pd->flags |= ATTR_FLAG_SCHED_SET;
        }

      if ((pd->flags & (ATTR_FLAG_SCHED_SET | ATTR_FLAG_POLICY_SET))
          != (ATTR_FLAG_SCHED_SET | ATTR_FLAG_POLICY_SET))
        collect_default_sched (pd);
    }

  if (__glibc_unlikely (__nptl_nthreads == 1))
    _IO_enable_locks ();

  /* Pass the descriptor to the caller.  */
  *newthread = (pthread_t) pd;

  LIBC_PROBE (pthread_create, 4, newthread, attr, start_routine, arg);

  /* One more thread.  We cannot have the thread do this itself, since it
     might exist but not have been scheduled yet by the time we've returned
     and need to check the value to behave correctly.  We must do it before
     creating the thread, in case it does get scheduled first and then
     might mistakenly think it was the only thread.  In the failure case,
     we momentarily store a false value; this doesn't matter because there
     is no kosher thing a signal handler interrupting us right here can do
     that cares whether the thread count is correct.  */
  atomic_fetch_add_relaxed (&__nptl_nthreads, 1);

  /* Our local value of stopped_start and thread_ran can be accessed at
     any time. The PD->stopped_start may only be accessed if we have
     ownership of PD (see CONCURRENCY NOTES above).  */
  bool stopped_start = false; bool thread_ran = false;

  /* Block all signals, so that the new thread starts out with
     signals disabled.  This avoids race conditions in the thread
     startup.  */
  internal_sigset_t original_sigmask;
  internal_signal_block_all (&original_sigmask);

  if (iattr->extension != NULL && iattr->extension->sigmask_set)
    /* Use the signal mask in the attribute.  The internal signals
       have already been filtered by the public
       pthread_attr_setsigmask_np interface.  */
    internal_sigset_from_sigset (&pd->sigmask, &iattr->extension->sigmask);
  else
    {
      /* Conceptually, the new thread needs to inherit the signal mask
	 of this thread.  Therefore, it needs to restore the saved
	 signal mask of this thread, so save it in the startup
	 information.  */
      pd->sigmask = original_sigmask;
      /* Reset the cancellation signal mask in case this thread is
	 running cancellation.  */
      internal_sigdelset (&pd->sigmask, SIGCANCEL);
    }

  /* Start the thread.  */
  if (__glibc_unlikely (report_thread_creation (pd)))
    {
      stopped_start = true;

      /* We always create the thread stopped at startup so we can
	 notify the debugger.  */
      retval = create_thread (pd, iattr, &stopped_start, stackaddr,
			      stacksize, &thread_ran);
      if (retval == 0)
	{
	  /* We retain ownership of PD until (a) (see CONCURRENCY NOTES
	     above).  */

	  /* Assert stopped_start is true in both our local copy and the
	     PD copy.  */
	  assert (stopped_start);
	  assert (pd->stopped_start);

	  /* Now fill in the information about the new thread in
	     the newly created thread's data structure.  We cannot let
	     the new thread do this since we don't know whether it was
	     already scheduled when we send the event.  */
	  pd->eventbuf.eventnum = TD_CREATE;
	  pd->eventbuf.eventdata = pd;

	  /* Enqueue the descriptor.  */
	  do
	    pd->nextevent = __nptl_last_event;
	  while (atomic_compare_and_exchange_bool_acq (&__nptl_last_event,
						       pd, pd->nextevent)
		 != 0);

	  /* Now call the function which signals the event.  See
	     CONCURRENCY NOTES for the nptl_db interface comments.  */
	  __nptl_create_event ();
	}
    }
  else
    retval = create_thread (pd, iattr, &stopped_start, stackaddr,
			    stacksize, &thread_ran);

  /* Return to the previous signal mask, after creating the new
     thread.  */
  internal_signal_restore_set (&original_sigmask);

  if (__glibc_unlikely (retval != 0))
    {
      if (thread_ran)
	/* State (c) and we not have PD ownership (see CONCURRENCY NOTES
	   above).  We can assert that STOPPED_START must have been true
	   because thread creation didn't fail, but thread attribute setting
	   did.  */
        {
	  assert (stopped_start);
	  /* Signal the created thread to release PD ownership and early
	     exit so it could be joined.  */
	  pd->setup_failed = 1;
	  lll_unlock (pd->lock, LLL_PRIVATE);

	  /* Similar to pthread_join, but since thread creation has failed at
	     startup there is no need to handle all the steps.  */
	  pid_t tid;
	  while ((tid = atomic_load_acquire (&pd->tid)) != 0)
	    __futex_abstimed_wait_cancelable64 ((unsigned int *) &pd->tid,
						tid, 0, NULL, LLL_SHARED);
        }

      /* State (c) or (d) and we have ownership of PD (see CONCURRENCY
	 NOTES above).  */

      /* Oops, we lied for a second.  */
      atomic_fetch_add_relaxed (&__nptl_nthreads, -1);

      /* Free the resources.  */
      __nptl_deallocate_stack (pd);

      /* We have to translate error codes.  */
      if (retval == ENOMEM)
	retval = EAGAIN;
    }
  else
    {
      /* We don't know if we have PD ownership.  Once we check the local
         stopped_start we'll know if we're in state (a) or (b) (see
	 CONCURRENCY NOTES above).  */
      if (stopped_start)
	/* State (a), we own PD. The thread blocked on this lock either
	   because we're doing TD_CREATE event reporting, or for some
	   other reason that create_thread chose.  Now let it run
	   free.  */
	lll_unlock (pd->lock, LLL_PRIVATE);

      /* We now have for sure more than one thread.  The main thread might
	 not yet have the flag set.  No need to set the global variable
	 again if this is what we use.  */
      THREAD_SETMEM (THREAD_SELF, header.multiple_threads, 1);
    }

 out:
  if (destroy_default_attr)
    __pthread_attr_destroy (&default_attr.external);

  return retval;
}
versioned_symbol (libc, __pthread_create_2_1, pthread_create, GLIBC_2_34);
libc_hidden_ver (__pthread_create_2_1, __pthread_create)
#ifndef SHARED
strong_alias (__pthread_create_2_1, __pthread_create)
#endif		     