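Linux Kernel Source Study: PER_CPU Variables, swapgs, and Stack Switching (Part 1)

Notes:

This article is based on Linux kernel v3.10 on the x86_64 architecture.

It does not cover the details of debugging, tracing, or exception handling.

In the article "x86-64 System Call Implementation Details" we walked through the system call path in detail, but three points were left unexplained:

How swapgs does its job;
What the PER_CPU_VAR macro actually does;
A system call switches between the user stack and the kernel stack, so when is a process's kernel stack initialized, and how is the switch performed?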

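1. Introduction to Per-cpu Variables

Per-cpu variables are a feature introduced in the Linux 2.6 kernel. As the name suggests, for every per-cpu variable the kernel allocates a separate memory area for each CPU, holding that CPU's own copy of the variable.

In a symmetric multiprocessing (SMP) system, when multiple CPUs access the same resource concurrently, locking and the cache coherence protocol are needed to keep the data consistent. With per-cpu variables, each CPU only touches its own copy, so no lock is required. This also avoids the cache line invalidations that coherence traffic causes on shared memory, which improves the cache hit rate. Per-cpu variables can therefore improve performance noticeably, and they are a typical space-for-time trade-off.

In addition, code that accesses a per-cpu variable must disable preemption, so that the task cannot be migrated to another CPU in the middle of the access and end up working on the wrong copy.

According to Section 3.4.2.1, Segment Registers in 64-Bit Mode, in Volume 1 of the Intel 64 and IA-32 Architectures Software Developer's Manual (Intel SDM below), in 64-bit mode on x86_64 the segment bases of CS, DS, ES, and SS are always treated as 0, whatever their actual values are. FS and GS are the two exceptions: they can still be used as base registers in linear address calculations.

3.4.2.1 Segment Registers in 64-Bit Mode

In 64-bit mode: CS, DS, ES, SS are treated as if each segment base is 0, regardless of the value of the associated segment descriptor base. This creates a flat address space for code, data, and stack. FS and GS are exceptions. Both segment registers may be used as additional base registers in linear address calculations (in the addressing of local data and certain operating system data structures).

Linux takes advantage of this property of FS and GS and uses GS as the base register for addressing per-cpu variables.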

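Interestingly, x86_64 provides two GS base registers, IA32_GS_BASE and IA32_KERNEL_GS_BASE, both of which are MSRs (model-specific registers). IA32_GS_BASE is simply a mapping of (an alias for) gs.base, the hidden part of the GS register, so it always holds the base that is currently in effect. IA32_KERNEL_GS_BASE is the swap target: the swapgs instruction exchanges its contents with gs.base. While a task runs in user mode, IA32_KERNEL_GS_BASE holds the kernel's per-cpu base address. On entry to the kernel, swapgs moves that value into gs.base, so that GS points at kernel-only data, and parks the user GS base in IA32_KERNEL_GS_BASE. On the way back to user mode, a second swapgs exchanges the two again and restores the user GS base.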
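To make the exchange concrete, here is a minimal user-space sketch of what swapgs does to the two values. The struct and function names are invented for illustration; the real instruction operates on CPU state, not on a C struct.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two GS-related values of one CPU. */
struct gs_state {
	uint64_t gs_base;        /* active base, visible as IA32_GS_BASE */
	uint64_t kernel_gs_base; /* swap target, IA32_KERNEL_GS_BASE */
};

/* Model of swapgs: exchange the active GS base with the swap target. */
void model_swapgs(struct gs_state *s)
{
	uint64_t tmp;

	assert(s);
	tmp = s->gs_base;
	s->gs_base = s->kernel_gs_base;
	s->kernel_gs_base = tmp;
}
```

Calling model_swapgs once corresponds to kernel entry and calling it a second time to kernel exit; after the pair, both values are back where they started.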

Intel SDM Volume 4 describes these two registers as follows:

Table 2-2. IA-32 Architectural MSRs

Register Address(Hex)	Architectural MSR Name	MSR Description
C000_0101H	IA32_GS_BASE	Map of BASE Address of GS (R/W)
C000_0102H	IA32_KERNEL_GS_BASE	Swap Target of BASE Address of GS (R/W)

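With the introduction above, you should have a first picture of how per-cpu variables work.

[Figure: per-cpu variable architecture diagram]

2. Basic Concepts and Data Structures

Per-cpu variables come in two kinds: static and dynamic. The total space taken by static per-cpu variables is fixed at compile time and does not change while the system runs. Dynamic per-cpu variables are allocated at run time, so the space they take is not fixed and may grow or shrink. This article focuses on how static per-cpu variables work and how they are implemented.

Per-cpu data is allocated in chunks. Each chunk contains a number of units, and the first chunk is used to hold the static per-cpu variables. As more per-cpu variables are allocated and the existing chunks run out of room, new chunks are allocated to hold the new data.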

2.1 chunk

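Each chunk is divided into nr_units units, and each unit corresponds to one CPU.

The first chunk is created during kernel initialization. Each unit in the first chunk consists of three parts: the static area, the reserved area, and the dynamic area.

[Figure: chunk layout on a UMA system]

[Figure: chunk layout on a NUMA system]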

2.2 unit

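A chunk is made up of multiple units. A unit is the storage allocated for one CPU, that is, one unit corresponds to one CPU. A unit is divided into three storage areas:

Static area: holds the static per-cpu data of the kernel itself.
Reserved area: set aside for the static per-cpu data of loadable kernel modules.
Dynamic area: holds per-cpu data allocated dynamically while the kernel runs.

The contents and size of the static area are fixed when the kernel is compiled. The size of the reserved area is also fixed at compile time, but it is only an estimate, and its contents are filled in when modules are loaded. The data in the dynamic area is produced at run time. Static data exists only in the first chunk; everything in the other chunks is dynamic.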
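The layout of a unit can be sketched in a few lines of C. This is only an illustration: the struct and function names are hypothetical, and the sketch assumes the three areas sit back to back starting at offset 0, ignoring any alignment padding the kernel may add.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one unit of the first chunk. */
struct unit_layout {
	size_t static_off;   /* start of the static area */
	size_t reserved_off; /* start of the reserved area */
	size_t dynamic_off;  /* start of the dynamic area */
	size_t unit_size;    /* bytes covered by the three areas */
};

/* Lay the three areas out back to back, starting at offset 0. */
struct unit_layout unit_layout_make(size_t static_size, size_t reserved_size,
				    size_t dyn_size)
{
	struct unit_layout l;

	l.static_off = 0;
	l.reserved_off = static_size;
	l.dynamic_off = static_size + reserved_size;
	l.unit_size = static_size + reserved_size + dyn_size;
	assert(l.unit_size >= static_size); /* guard against size_t overflow */
	return l;
}
```

Every CPU's unit uses the same internal offsets, which is what lets a single offset identify "the same variable" in each CPU's copy.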

2.3 group

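On a NUMA system, memory and CPUs are organized by node. Because per-cpu data has to be allocated on the node of the CPU it belongs to, the addresses of the data within a chunk are not contiguous. The data in each chunk is divided into groups according to the node it lives on, with a one-to-one mapping between nodes and groups. A UMA system effectively has a single node, so the data in a chunk is contiguous.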

2.4 pcpu_base_addr

pcpu_base_addr is the address of the first chunk, which is also the address of the static area. The kernel source describes it as follows:

    /* the address of the first chunk which starts with the kernel static area */
    void *pcpu_base_addr __read_mostly;

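3. Implementation of Static Per-cpu Variables

3.1 Creating a static per-cpu variable

The kernel provides an API for creating static per-cpu variables: the DEFINE_PER_CPU macro, defined in the header include/linux/percpu-defs.h.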

    // file: include/linux/percpu-defs.h 
    #define DEFINE_PER_CPU(type, name)					\
    	DEFINE_PER_CPU_SECTION(type, name, "")

DEFINE_PER_CPU takes two arguments, the type and the name of the variable. For example, the variable old_rsp used on the system call path is defined as follows:

    // arch/x86/kernel/process_64.c
    DEFINE_PER_CPU(unsigned long, old_rsp);

Another per-cpu variable, kernel_stack, is defined as follows:

    // file: arch/x86/kernel/cpu/common.c
    DEFINE_PER_CPU(unsigned long, kernel_stack) =
    	(unsigned long)&init_thread_union - KERNEL_STACK_OFFSET + THREAD_SIZE;

DEFINE_PER_CPU simply passes its arguments through to the DEFINE_PER_CPU_SECTION macro, which is defined as follows:

    // file: include/linux/percpu-defs.h 
    #define DEFINE_PER_CPU_SECTION(type, name, sec)				\
    	__PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES			\
    	__typeof__(type) name

The __PCPU_ATTRS macro it uses is defined as follows:

    // file: include/linux/percpu-defs.h 
    #define __PCPU_ATTRS(sec)						\
    	__percpu __attribute__((section(PER_CPU_BASE_SECTION sec)))	\
    	PER_CPU_ATTRIBUTES

The three macros PER_CPU_BASE_SECTION, PER_CPU_DEF_ATTRIBUTES, and PER_CPU_ATTRIBUTES are defined in the header include/asm-generic/percpu.h. PER_CPU_DEF_ATTRIBUTES and PER_CPU_ATTRIBUTES expand to nothing, and PER_CPU_BASE_SECTION expands to ".data..percpu".

    // file: include/asm-generic/percpu.h
    #define PER_CPU_BASE_SECTION ".data..percpu"

    #ifndef PER_CPU_DEF_ATTRIBUTES
    #define PER_CPU_DEF_ATTRIBUTES
    #endif

    #ifndef PER_CPU_ATTRIBUTES
    #define PER_CPU_ATTRIBUTES
    #endif

When the sec argument is an empty string, __PCPU_ATTRS(sec) expands to:

    __percpu __attribute__((section(".data..percpu")))

The __percpu macro is defined in the header include/linux/compiler.h.

    // file: include/linux/compiler.h
    #ifdef __CHECKER__
    ...
    # define __percpu	__attribute__((noderef, address_space(3)))
    ...
    #else
    ...
    # define __percpu
    ...
    #endif

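The __CHECKER__ macro is defined by Sparse, a static analysis tool, and enables extra static checks on the source. Sparse has to be installed separately and can be ignored here, so __percpu expands to nothing.

__attribute__((section("section_name"))) is a feature supported by the gcc compiler: a variable carrying this attribute is placed in the section named section_name. For more details, see the Variable Attributes page of the official gcc documentation.

__typeof__ is a gcc extension; the operator yields the type of its argument. See the Typeof and Alternate Keywords pages of the official gcc documentation.

So DEFINE_PER_CPU(type, name) finally expands to: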

    __attribute__((section(".data..percpu"))) __typeof__(type) name

Here type is the type of the variable and name is its name. Following this declaration, the compiler places a variable of the given type in the section named .data..percpu.
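The same mechanism can be tried in user space. The sketch below is an illustration only: the macro DEMO_DEFINE_PER_CPU, the section name demo_percpu, and the variable names are made up, and it assumes gcc (or clang) with an ELF linker such as GNU ld, which provides __start_ and __stop_ symbols for sections whose names are valid C identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as the expansion of DEFINE_PER_CPU, with a made-up section. */
#define DEMO_DEFINE_PER_CPU(type, name) \
	__attribute__((section("demo_percpu"), used)) __typeof__(type) name

DEMO_DEFINE_PER_CPU(unsigned long, demo_old_rsp);
DEMO_DEFINE_PER_CPU(int, demo_cpu_number) = 7;

/* Section bounds provided by the linker (assumption: GNU ld or compatible). */
extern char __start_demo_percpu[];
extern char __stop_demo_percpu[];

/* Number of bytes the linker collected into the demo section. */
size_t demo_section_size(void)
{
	return (size_t)((uintptr_t)__stop_demo_percpu -
			(uintptr_t)__start_demo_percpu);
}

/* Returns 1 if p points into the demo section, 0 otherwise. */
int demo_in_section(const void *p)
{
	uintptr_t a = (uintptr_t)p;

	return a >= (uintptr_t)__start_demo_percpu &&
	       a < (uintptr_t)__stop_demo_percpu;
}
```

Both variables end up inside the bounds of the custom section, much as every DEFINE_PER_CPU variable ends up between __per_cpu_start and __per_cpu_end in the kernel image.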

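3.2 Other macros

Besides DEFINE_PER_CPU, there are other macros for defining per-cpu variables. They work the same way, except that the variables they define end up in different sections.

3.3 Linking

Next, let's look at how the kernel's linker script handles the .data..percpu section.

On x86_64 with SMP, the linker script uses the PERCPU_VADDR macro to set the load address, virtual address, and other properties of the per-cpu area: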

    // file: arch/x86/kernel/vmlinux.lds.S
    #if defined(CONFIG_X86_64) && defined(CONFIG_SMP)
    	/*
    	 * percpu offsets are zero-based on SMP.  PERCPU_VADDR() changes the
    	 * output PHDR, so the next output section - .init.text - should
    	 * start another segment - init.
    	 */
    	PERCPU_VADDR(INTERNODE_CACHE_BYTES, 0, :percpu)
    #endif

The PERCPU_VADDR macro is defined in the header include/asm-generic/vmlinux.lds.h and takes three parameters:

cacheline: the cache line size, used to align subsections;
vaddr: the virtual address;
phdr: the program header.

    // file: include/asm-generic/vmlinux.lds.h
    /**
     * PERCPU_VADDR - define output section for percpu area
     * @cacheline: cacheline size
     * @vaddr: explicit base address (optional)
     * @phdr: destination PHDR (optional)
     *
     * Macro which expands to output section for percpu area.
     *
     * @cacheline is used to align subsections to avoid false cacheline
     * sharing between subsections for different purposes.
     *
     * If @vaddr is not blank, it specifies explicit base address and all
     * percpu symbols will be offset from the given address.  If blank,
     * @vaddr always equals @laddr + LOAD_OFFSET.
     *
     * @phdr defines the output PHDR to use if not blank.  Be warned that
     * output PHDR is sticky.  If @phdr is specified, the next output
     * section in the linker script will go there too.  @phdr should have
     * a leading colon.
     *
     * Note that this macros defines __per_cpu_load as an absolute symbol.
     * If there is no need to put the percpu section at a predetermined
     * address, use PERCPU_SECTION.
     */
    #define PERCPU_VADDR(cacheline, vaddr, phdr)				\
    	VMLINUX_SYMBOL(__per_cpu_load) = .;				\
    	.data..percpu vaddr : AT(VMLINUX_SYMBOL(__per_cpu_load)		\
    				- LOAD_OFFSET) {			\
    		PERCPU_INPUT(cacheline)					\
    	} phdr								\
    	. = VMLINUX_SYMBOL(__per_cpu_load) + SIZEOF(.data..percpu);

PERCPU_VADDR in turn references PERCPU_INPUT, which is defined in the same header, include/asm-generic/vmlinux.lds.h:

    // file: include/asm-generic/vmlinux.lds.h
    /**
     * PERCPU_INPUT - the percpu input sections
     * @cacheline: cacheline size
     *
     * The core percpu section names and core symbols which do not rely
     * directly upon load addresses.
     *
     * @cacheline is used to align subsections to avoid false cacheline
     * sharing between subsections for different purposes.
     */
    #define PERCPU_INPUT(cacheline)						\
    	VMLINUX_SYMBOL(__per_cpu_start) = .;				\
    	*(.data..percpu..first)						\
    	. = ALIGN(PAGE_SIZE);						\
    	*(.data..percpu..page_aligned)					\
    	. = ALIGN(cacheline);						\
    	*(.data..percpu..readmostly)					\
    	. = ALIGN(cacheline);						\
    	*(.data..percpu)						\
    	*(.data..percpu..shared_aligned)				\
    	VMLINUX_SYMBOL(__per_cpu_end) = .;

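The effect of PERCPU_VADDR is to gather the data of every input section whose name starts with .data..percpu, from all object files, into the single output section .data..percpu.

VMA vs LMA

As for the output address: according to the official GNU ld documentation, every loadable or allocatable output section has two addresses. The first is the VMA (virtual memory address), the address the section has when the output file runs. The second is the LMA (load memory address), the address at which the section is loaded. In most cases the two are the same. One case where they differ is a data section that is loaded into ROM and copied into RAM when the program starts; the ROM address is then the LMA and the RAM address is the VMA.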

Every loadable or allocatable output section has two addresses. The first is the VMA , or virtual memory address. This is the address the section will have when the output file is run. The second is the LMA, or load memory address. This is the address at which the section will be loaded. In most cases the two addresses will be the same. An example of when they might be different is when a data section is loaded into ROM, and then copied into RAM when the program starts up (this technique is often used to initialize global variables in a ROM based system). In this case the ROM address would be the LMA, and the RAM address would be the VMA.

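See the following sections of the official documentation: 3.1 Basic Linker Script Concepts, 3.6.3 Output Section Address, 3.6.8 Output Section Attributes, and 3.6.8.2 Output Section LMA.

From the code and the documentation we can see that the VMA of the .data..percpu section is the value of the vaddr parameter, which is given as 0, while the LMA is the argument of AT(), that is, VMLINUX_SYMBOL(__per_cpu_load) - LOAD_OFFSET.

The VMLINUX_SYMBOL macro is defined as follows: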

    // file: include/linux/export.h
    /* Indirect, so macros are expanded before pasting. */
    #define VMLINUX_SYMBOL(x) __VMLINUX_SYMBOL(x)

    #define __VMLINUX_SYMBOL(x) x

LOAD_OFFSET is defined as follows; on x86_64 it expands to __START_KERNEL_map:

    // file: arch/x86/kernel/vmlinux.lds.S
    #ifdef CONFIG_X86_32
    #define LOAD_OFFSET __PAGE_OFFSET
    #else
    #define LOAD_OFFSET __START_KERNEL_map
    #endif

The __START_KERNEL_map macro is defined as follows:

    // file: arch/x86/include/asm/page_64_types.h
    #define __START_KERNEL_map	_AC(0xffffffff80000000, UL)

In other words, the VMA is 0 and the LMA equals __per_cpu_load - 0xffffffff80000000.

Use the nm command to look up the address of the symbol __per_cpu_load:

    $ nm vmlinux|grep __per_cpu_load
    ffffffff81d31000 D __per_cpu_load

So the LMA works out to 0xffffffff81d31000 - 0xffffffff80000000 = 0x1d31000.

Check it with objdump -h:

    $ objdump -h vmlinux

    vmlinux:     file format elf64-x86-64

    Sections:
    Idx Name          Size      VMA               LMA               File off  Algn
      0 .text         006db5cb  ffffffff81000000  0000000001000000  00200000  2**12
                      CONTENTS, ALLOC, LOAD, READONLY, CODE
        ......

     14 .data         0012f880  ffffffff81c00000  0000000001c00000  00e00000  2**12
                      CONTENTS, ALLOC, LOAD, DATA
     15 .vvar         000000f0  ffffffff81d30000  0000000001d30000  00f30000  2**4
                      CONTENTS, ALLOC, LOAD, DATA
     16 .data..percpu 00015140  0000000000000000  0000000001d31000  01000000  2**12
                      CONTENTS, ALLOC, LOAD, DATA
        ......

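As shown, the VMA of the .data..percpu section is 0 and its LMA is 0x1d31000, which matches our own calculation.

In addition, the symbols __per_cpu_start and __per_cpu_end mark the start and end addresses of the .data..percpu section, and their difference is the size of the section (0x15140 here, matching the Size column above):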

    $ nm vmlinux|grep __per_cpu_start
    0000000000000000 D __per_cpu_start

    $ nm vmlinux|grep __per_cpu_end
    0000000000015140 D __per_cpu_end
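The arithmetic in this section is easy to double-check in C. The helper names below are hypothetical, and the inputs are simply the values printed by nm above.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_LOAD_OFFSET 0xffffffff80000000ULL /* __START_KERNEL_map */

/* LMA of the section: address of __per_cpu_load minus LOAD_OFFSET. */
uint64_t demo_percpu_lma(uint64_t per_cpu_load)
{
	assert(per_cpu_load >= DEMO_LOAD_OFFSET);
	return per_cpu_load - DEMO_LOAD_OFFSET;
}

/* Size of the section: __per_cpu_end minus __per_cpu_start. */
uint64_t demo_percpu_size(uint64_t per_cpu_start, uint64_t per_cpu_end)
{
	assert(per_cpu_end >= per_cpu_start);
	return per_cpu_end - per_cpu_start;
}
```

With the values from this build, demo_percpu_lma(0xffffffff81d31000) gives 0x1d31000 and demo_percpu_size(0x0, 0x15140) gives 0x15140, the same numbers objdump reports.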

3.4 Copying and initialization

setup_per_cpu_areas

After compiling and linking, the kernel's static per-cpu data has been collected into the .data..percpu section of the vmlinux image. The next step is to make one copy of this data for every CPU and to initialize part of it. This is done by the setup_per_cpu_areas function, defined in arch/x86/kernel/setup_percpu.c:

    // file: arch/x86/kernel/setup_percpu.c
    void __init setup_per_cpu_areas(void)
    {
    	unsigned int cpu;
    	unsigned long delta;
    	int rc;

    	......
            
    	rc = -EINVAL;
    	if (pcpu_chosen_fc != PCPU_FC_PAGE) {
    		const size_t dyn_size = PERCPU_MODULE_RESERVE +
    			PERCPU_DYNAMIC_RESERVE - PERCPU_FIRST_CHUNK_RESERVE;
    		size_t atom_size;

    		/*
    		 * On 64bit, use PMD_SIZE for atom_size so that embedded
    		 * percpu areas are aligned to PMD.  This, in the future,
    		 * can also allow using PMD mappings in vmalloc area.  Use
    		 * PAGE_SIZE on 32bit as vmalloc space is highly contended
    		 * and large vmalloc area allocs can easily fail.
    		 */
    #ifdef CONFIG_X86_64
    		atom_size = PMD_SIZE;
    #else
    		atom_size = PAGE_SIZE;
    #endif
    		rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
    					    dyn_size, atom_size,
    					    pcpu_cpu_distance,
    					    pcpu_fc_alloc, pcpu_fc_free);
    		if (rc < 0)
    			pr_warning("%s allocator failed (%d), falling back to page size\n",
    				   pcpu_fc_names[pcpu_chosen_fc], rc);
    	}
    	if (rc < 0)
    		rc = pcpu_page_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
    					   pcpu_fc_alloc, pcpu_fc_free,
    					   pcpup_populate_pte);
    	if (rc < 0)
    		panic("cannot initialize percpu area (err=%d)", rc);

    	/* alrighty, percpu areas up and running */
    	delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
    	for_each_possible_cpu(cpu) {
    		per_cpu_offset(cpu) = delta + pcpu_unit_offsets[cpu];
    		per_cpu(this_cpu_off, cpu) = per_cpu_offset(cpu);
    		per_cpu(cpu_number, cpu) = cpu;
    		setup_percpu_segment(cpu);
    		setup_stack_canary_segment(cpu);
            
    		......
                
    		/*
    		 * Up to this point, the boot CPU has been using .init.data
    		 * area.  Reload any changed state for the boot CPU.
    		 */
    		if (!cpu)
    			switch_to_new_gdt(cpu);
    	}

    	......
    }
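The loop at the end of setup_per_cpu_areas computes, for each CPU, the offset that turns a zero-based per-cpu symbol address into the address of that CPU's copy. The following user-space model is a rough sketch of that idea under simplifying assumptions: all names are hypothetical, the units are contiguous in one malloc'ed block, and a small struct stands in for the contents of .data..percpu.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_NR_CPUS 4

/* Stand-in for the contents of .data..percpu; member offsets play the
 * role of the zero-based symbol addresses. */
struct demo_template {
	unsigned long old_rsp;
	int cpu_number;
};

struct demo_percpu {
	unsigned char *base;                /* like pcpu_base_addr */
	size_t unit_offsets[DEMO_NR_CPUS];  /* like pcpu_unit_offsets[] */
	uintptr_t cpu_offset[DEMO_NR_CPUS]; /* like per_cpu_offset(cpu) */
};

/* Copy the template into one unit per CPU and record each CPU's offset. */
int demo_setup(struct demo_percpu *pc, const struct demo_template *init,
	       size_t unit_size)
{
	const uintptr_t per_cpu_start = 0; /* the section's VMA is 0 */
	uintptr_t delta;
	int cpu;

	assert(unit_size >= sizeof(*init));
	pc->base = malloc(unit_size * DEMO_NR_CPUS);
	if (!pc->base)
		return -1;

	delta = (uintptr_t)pc->base - per_cpu_start;
	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		pc->unit_offsets[cpu] = (size_t)cpu * unit_size;
		memcpy(pc->base + pc->unit_offsets[cpu], init, sizeof(*init));
		pc->cpu_offset[cpu] = delta + pc->unit_offsets[cpu];
	}
	return 0;
}

/* Like per_cpu(var, cpu): zero-based symbol address plus the CPU's offset. */
void *demo_per_cpu_ptr(const struct demo_percpu *pc, size_t sym, int cpu)
{
	return (void *)(pc->cpu_offset[cpu] + sym);
}
```

Writing through demo_per_cpu_ptr for one CPU leaves the copies of the other CPUs untouched, which is the property the real per_cpu() accessor relies on.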

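The function first checks which first chunk allocator to use. The kernel provides two allocation mechanisms, "embed" and "page", and the percpu_alloc command-line parameter selects between them. The kernel document Documentation/kernel-parameters.txt describes it as follows: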

    // file: Documentation/kernel-parameters.txt
    percpu_alloc=	Select which percpu first chunk allocator to use.
        Currently supported values are "embed" and "page".
        Archs may support subset or none of the	selections.
        See comments in mm/percpu.c for details on each
        allocator.  This parameter is primarily	for debugging
        and performance comparison.

The handler for this parameter is defined in mm/percpu.c:

    // file: mm/percpu.c
    early_param("percpu_alloc", percpu_alloc_setup);

    enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;

    static int __init percpu_alloc_setup(char *str)
    {
    	if (!str)
    		return -EINVAL;

    	if (0)
    		/* nada */;
    #ifdef CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK
    	else if (!strcmp(str, "embed"))
    		pcpu_chosen_fc = PCPU_FC_EMBED;
    #endif
    #ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
    	else if (!strcmp(str, "page"))
    		pcpu_chosen_fc = PCPU_FC_PAGE;
    #endif
    	else
    		pr_warning("PERCPU: unknown allocator %s specified\n", str);

    	return 0;
    }

The pcpu_fc enumeration is defined as follows:

    // file: include/linux/percpu.h
    enum pcpu_fc {
    	PCPU_FC_AUTO,
    	PCPU_FC_EMBED,
    	PCPU_FC_PAGE,

    	PCPU_FC_NR,
    };

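As the handler shows, depending on the parameter the first chunk allocator can be one of three types: auto, embed, or page, where auto is a pseudo type. The embed allocator is the one the kernel prefers: unless the command line asks for page, the embed allocator is tried first.

Concretely, as long as pcpu_chosen_fc is not PCPU_FC_PAGE, pcpu_embed_first_chunk is called to allocate space for the first chunk, and pcpu_page_first_chunk is used only as a fallback if that fails. Since x86_64 uses the embed allocator by default, we concentrate on pcpu_embed_first_chunk.

Before allocating, the code computes dyn_size, the minimum size of the dynamic area in the first chunk.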

    // file: arch/x86/kernel/setup_percpu.c
    		const size_t dyn_size = PERCPU_MODULE_RESERVE +
    			PERCPU_DYNAMIC_RESERVE - PERCPU_FIRST_CHUNK_RESERVE;

The computation of dyn_size uses three macros, defined as follows:

    // file: include/linux/percpu.h
    #define PERCPU_MODULE_RESERVE		(8 << 10)	// 8K

    #define PERCPU_DYNAMIC_RESERVE		(20 << 10)	// 20K

    #define PERCPU_FIRST_CHUNK_RESERVE	PERCPU_MODULE_RESERVE	// 8K

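So on x86_64, dyn_size is 8K + 20K - 8K = 20K. After dyn_size, the code sets atom_size. As the name suggests, atom_size is the allocation atom, the smallest unit in which memory is allocated; in other words, every allocation is a multiple of atom_size. On x86_64 its value is PMD_SIZE, where PMD stands for Page Middle Directory.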

    // file: arch/x86/kernel/setup_percpu.c
    		size_t atom_size;

    		/*
    		 * On 64bit, use PMD_SIZE for atom_size so that embedded
    		 * percpu areas are aligned to PMD.  This, in the future,
    		 * can also allow using PMD mappings in vmalloc area.  Use
    		 * PAGE_SIZE on 32bit as vmalloc space is highly contended
    		 * and large vmalloc area allocs can easily fail.
    		 */
    #ifdef CONFIG_X86_64
    		atom_size = PMD_SIZE;
    #else
    		atom_size = PAGE_SIZE;
    #endif

PMD_SIZE is defined as follows and expands to 2M. That is, whenever space is allocated for per-cpu data here, at least 2M is allocated, and the size is always a multiple of 2M.

    // file: arch/x86/include/asm/pgtable_64_types.h
    #define PMD_SIZE	(_AC(1, UL) << PMD_SHIFT)	// 2M
    /*
     * PMD_SHIFT determines the size of the area a middle-level
     * page table can map
     */
    #define PMD_SHIFT	21
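As a quick illustration of "a multiple of atom_size", the sketch below rounds a requested size up to the next multiple of a power-of-two atom. The DEMO_ names and the helper are hypothetical and only mirror the values discussed above; the kernel's own rounding code is not shown in this article.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PMD_SHIFT 21
#define DEMO_PMD_SIZE ((size_t)1 << DEMO_PMD_SHIFT) /* 2M */

/* Round size up to the next multiple of atom_size (a power of two). */
size_t round_up_to_atom(size_t size, size_t atom_size)
{
	assert(atom_size != 0 && (atom_size & (atom_size - 1)) == 0);
	return (size + atom_size - 1) & ~(atom_size - 1);
}
```

For example, a unit that needs 0x1c140 bytes (static 0x15140 + reserved 8K + dynamic 20K) rounds up to a single 2M atom.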

pcpu_embed_first_chunk

Once the embed allocator has been chosen, pcpu_embed_first_chunk is called to embed the first chunk into bootmem.

    // file: arch/x86/kernel/setup_percpu.c
    pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
    					    dyn_size, atom_size,
    					    pcpu_cpu_distance,
    					    pcpu_fc_alloc, pcpu_fc_free);

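Six arguments are passed to pcpu_embed_first_chunk:

PERCPU_FIRST_CHUNK_RESERVE: the size of the reserved area; the macro expands to 8K;
dyn_size: the minimum size of the dynamic area, computed above as 20K;
atom_size: the allocation atom; the actual allocation size is a multiple of this value, which was determined above to be 2M;
pcpu_cpu_distance: on NUMA systems, the function that computes the distance between two CPUs;
pcpu_fc_alloc: the function that allocates per-cpu memory;
pcpu_fc_free: the function that frees per-cpu memory.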

pcpu_cpu_distance

pcpu_cpu_distance is defined as follows:

    // file: arch/x86/kernel/setup_percpu.c
    static int __init pcpu_cpu_distance(unsigned int from, unsigned int to)
    {
    #ifdef CONFIG_NEED_MULTIPLE_NODES
    	if (early_cpu_to_node(from) == early_cpu_to_node(to))
    		return LOCAL_DISTANCE;
    	else
    		return REMOTE_DISTANCE;
    #else
    	return LOCAL_DISTANCE;
    #endif
    }

LOCAL_DISTANCE and REMOTE_DISTANCE are defined as follows:

    // file: include/linux/topology.h
    /* Conform to ACPI 2.0 SLIT distance definitions */
    #define LOCAL_DISTANCE		10
    #define REMOTE_DISTANCE		20

pcpu_cpu_distance gives the distance between two CPUs. It checks whether the two CPUs belong to the same node: if they do, the distance is LOCAL_DISTANCE (10), otherwise it is REMOTE_DISTANCE (20).
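A user-space version of the same decision might look like the sketch below. The CPU-to-node table is an invented topology (CPUs 0 and 1 on node 0, CPUs 2 and 3 on node 1) standing in for early_cpu_to_node, and the DEMO_ names are hypothetical.

```c
#include <assert.h>

#define DEMO_LOCAL_DISTANCE  10
#define DEMO_REMOTE_DISTANCE 20
#define DEMO_NR_CPUS 4

/* Hypothetical topology: CPUs 0-1 on node 0, CPUs 2-3 on node 1. */
static const int demo_cpu_to_node[DEMO_NR_CPUS] = { 0, 0, 1, 1 };

/* Same node means local distance, different nodes mean remote distance. */
int demo_cpu_distance(unsigned int from, unsigned int to)
{
	assert(from < DEMO_NR_CPUS && to < DEMO_NR_CPUS);
	if (demo_cpu_to_node[from] == demo_cpu_to_node[to])
		return DEMO_LOCAL_DISTANCE;
	return DEMO_REMOTE_DISTANCE;
}
```

With this table, CPUs 0 and 1 are local to each other, while CPUs 0 and 2 are remote.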

pcpu_fc_alloc

pcpu_fc_alloc and pcpu_fc_free are defined as follows; internally they call pcpu_alloc_bootmem and free_bootmem respectively.

    // file: arch/x86/kernel/setup_percpu.c
    /*
     * Helpers for first chunk memory allocation
     */
    static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
    {
    	return pcpu_alloc_bootmem(cpu, size, align);
    }

    static void __init pcpu_fc_free(void *ptr, size_t size)
    {
    	free_bootmem(__pa(ptr), size);
    }

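pcpu_alloc_bootmem takes three parameters:

cpu: the CPU to allocate for;
size: the number of bytes to allocate;
align: the alignment in bytes.

Which allocation function it uses depends on the CONFIG_NEED_MULTIPLE_NODES option. When the option is set, it uses __alloc_bootmem_node_nopanic, which allocates on the specified node; otherwise it uses __alloc_bootmem_nopanic, which may allocate on any node. On success, the address of the allocated memory is returned.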

    // file: arch/x86/kernel/setup_percpu.c
    /**
     * pcpu_alloc_bootmem - NUMA friendly alloc_bootmem wrapper for percpu
     * @cpu: cpu to allocate for
     * @size: size allocation in bytes
     * @align: alignment
     *
     * Allocate @size bytes aligned at @align for cpu @cpu.  This wrapper
     * does the right thing for NUMA regardless of the current
     * configuration.
     *
     * RETURNS:
     * Pointer to the allocated area on success, NULL on failure.
     */
    static void * __init pcpu_alloc_bootmem(unsigned int cpu, unsigned long size,
    					unsigned long align)
    {
    	const unsigned long goal = __pa(MAX_DMA_ADDRESS);
    #ifdef CONFIG_NEED_MULTIPLE_NODES
    	int node = early_cpu_to_node(cpu);
    	void *ptr;

    	if (!node_online(node) || !NODE_DATA(node)) {
    		ptr = __alloc_bootmem_nopanic(size, align, goal);
    		pr_info("cpu %d has no node %d or node-local memory\n",
    			cpu, node);
    		pr_debug("per cpu data for cpu%d %lu bytes at %016lx\n",
    			 cpu, size, __pa(ptr));
    	} else {
    		ptr = __alloc_bootmem_node_nopanic(NODE_DATA(node),
    						   size, align, goal);
    		pr_debug("per cpu data for cpu%d %lu bytes on node%d at %016lx\n",
    			 cpu, size, node, __pa(ptr));
    	}
    	return ptr;
    #else
    	return __alloc_bootmem_nopanic(size, align, goal);
    #endif
    }

Inside the function, a constant goal is defined first:

    const unsigned long goal = __pa(MAX_DMA_ADDRESS);

On x86_64, the MAX_DMA_ADDRESS macro is defined as follows:

    // file: arch/x86/include/asm/dma.h
    #define MAX_DMA_ADDRESS ((unsigned long)__va(MAX_DMA_PFN << PAGE_SHIFT))

    /* 16MB ISA DMA zone */
    #define MAX_DMA_PFN   ((16 * 1024 * 1024) >> PAGE_SHIFT)	// 16M >> 12 = 4096 page frames

The PAGE_SHIFT macro is defined in arch/x86/include/asm/page_64_types.h and gives the shift that corresponds to the size of a single page. It is defined as 12, so a page is 4K.

    // file: arch/x86/include/asm/page_64_types.h
    /* PAGE_SHIFT determines the page size */
    #define PAGE_SHIFT	12
    #define PAGE_SIZE	(_AC(1,UL) << PAGE_SHIFT)

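PFN stands for Page Frame Number, and MAX_DMA_PFN is the highest page frame number of the DMA zone. Since the DMA zone is 16M in size, MAX_DMA_ADDRESS is the highest virtual address of the DMA zone. The macro uses __va, which converts a physical address into a virtual address and is defined as follows: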

    // file: arch/x86/include/asm/page.h
    #define __va(x)			((void *)((unsigned long)(x)+PAGE_OFFSET))

The PAGE_OFFSET macro is defined as follows:

    #define PAGE_OFFSET		((unsigned long)__PAGE_OFFSET)

The __PAGE_OFFSET macro is defined as follows:

    // file: arch/x86/include/asm/page_64_types.h
    /*
     * Set __PAGE_OFFSET to the most negative possible address +
     * PGDIR_SIZE*16 (pgd slot 272).  The gap is to allow a space for a
     * hypervisor to fit.  Choosing 16 slots here is arbitrary, but it's
     * what Xen requires.
     */
    #define __PAGE_OFFSET           _AC(0xffff880000000000, UL)

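As Documentation/x86/x86_64/mm.txt shows, 0xffff880000000000 is a special address: it is the start of the direct mapping of physical memory. The direct mapping is the region of virtual memory into which all physical memory is mapped linearly, starting from physical address 0. On x86_64 this region is 64T, large enough to map the whole of physical memory. On 32-bit x86 the region is only 896M, so only physical memory from 0 to 896M can be mapped directly into virtual memory, and anything above 896M has to be mapped dynamically.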

    // file: Documentation/x86/x86_64/mm.txt
    Virtual memory map with 4 level page tables:

    0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
    hole caused by [48:63] sign extension
    ffff800000000000 - ffff80ffffffffff (=40 bits) guard hole
    ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory
    ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole
    ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space
    ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole
    ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
    ... unused hole ...
    ffffffff80000000 - ffffffffa0000000 (=512 MB)  kernel text mapping, from phys 0
    ffffffffa0000000 - ffffffffff5fffff (=1525 MB) module mapping space
    ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
    ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole

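[Figure: x86_64 virtual memory layout]

Now that we know PAGE_OFFSET is the start of the direct mapping of physical memory, 0xffff880000000000, let's return to the __va macro. Its definition simply adds its argument to the base address PAGE_OFFSET and returns the result. Because physical address 0 is mapped to virtual address PAGE_OFFSET, adding PAGE_OFFSET to a physical address x yields the corresponding virtual address.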

    // file: arch/x86/include/asm/page.h
    #define __va(x)			((void *)((unsigned long)(x)+PAGE_OFFSET))

So MAX_DMA_ADDRESS expands to PAGE_OFFSET + 16 * 1024 * 1024 = 0xffff880001000000, the address 16M above PAGE_OFFSET.
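The expansion can be replayed in user space with 64-bit integers. The DEMO_ macros and function names below are hypothetical stand-ins for the kernel macros and use plain integers instead of pointers.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_OFFSET 0xffff880000000000ULL
#define DEMO_PAGE_SHIFT  12
#define DEMO_MAX_DMA_PFN ((16ULL * 1024 * 1024) >> DEMO_PAGE_SHIFT)

/* Model of __va: physical address plus the direct-mapping base. */
uint64_t demo_va(uint64_t phys)
{
	assert(phys <= UINT64_MAX - DEMO_PAGE_OFFSET); /* no wrap-around */
	return phys + DEMO_PAGE_OFFSET;
}

/* Model of MAX_DMA_ADDRESS: virtual address of the end of the 16M DMA zone. */
uint64_t demo_max_dma_address(void)
{
	return demo_va(DEMO_MAX_DMA_PFN << DEMO_PAGE_SHIFT);
}
```

DEMO_MAX_DMA_PFN evaluates to 4096, and demo_max_dma_address() returns 0xffff880001000000, the value derived above.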

Next comes the __pa macro, which converts a virtual address to a physical address. It is defined as follows:

    // file: arch/x86/include/asm/page.h
    #define __pa(x)		__phys_addr((unsigned long)(x))

__pa in turn uses the __phys_addr macro, which is defined as follows:

    // file: arch/x86/include/asm/page_64.h
    #define __phys_addr(x)		__phys_addr_nodebug(x)
    static inline unsigned long __phys_addr_nodebug(unsigned long x)
    {
    	unsigned long y = x - __START_KERNEL_map;

    	/* use the carry flag to determine if x was < __START_KERNEL_map */
    	x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));

    	return x;
    }

We have already met __START_KERNEL_map above; it is the address at which the kernel text is mapped, defined as follows:

    // file: arch/x86/include/asm/page_64_types.h
    #define __START_KERNEL_map	_AC(0xffffffff80000000, UL)

phys_base is defined in head_64.S:

    // file: arch/x86/kernel/head_64.S
    ENTRY(phys_base)
    	/* This must match the first entry in level2_kernel_pgt */
    	.quad   0x0000000000000000

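The inline function __phys_addr_nodebug first computes y, the difference between its argument x and __START_KERNEL_map. It then compares x with y to decide which base to add, and finally returns y plus that base.

Note the small trick here: y is an unsigned long. When x is greater than or equal to __START_KERNEL_map, the subtraction does not wrap and y is smaller than x; otherwise the subtraction wraps around and y is larger than x.

Back in pcpu_alloc_bootmem, we can now work out the target address goal, which is defined as follows: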

    const unsigned long goal = __pa(MAX_DMA_ADDRESS);

The argument of __pa, MAX_DMA_ADDRESS, was computed above and is smaller than __START_KERNEL_map. So inside __phys_addr_nodebug x is smaller than y, and the value finally returned is:

    x = y + (__START_KERNEL_map - PAGE_OFFSET);

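Since y = x - __START_KERNEL_map, the result is x - PAGE_OFFSET. The argument x here is PAGE_OFFSET + 16M, so the result is 16M.

goal is therefore 16M, a physical address that serves as the preferred start of the region to allocate from.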
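The wrap-around trick is easier to trust after running it. The sketch below copies the logic of __phys_addr_nodebug into a user-space function; the DEMO_ names are hypothetical, uint64_t replaces unsigned long, and phys_base is passed in as a parameter instead of being read from a kernel variable.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_START_KERNEL_MAP 0xffffffff80000000ULL
#define DEMO_PAGE_OFFSET      0xffff880000000000ULL

/* Model of __phys_addr_nodebug for both kernel-text and direct-map addresses. */
uint64_t demo_phys_addr(uint64_t x, uint64_t phys_base)
{
	/* Unsigned subtraction wraps around when x < DEMO_START_KERNEL_MAP. */
	uint64_t y = x - DEMO_START_KERNEL_MAP;

	/* x > y: no wrap, kernel text mapping. Otherwise: direct mapping. */
	return y + ((x > y) ? phys_base
			    : (DEMO_START_KERNEL_MAP - DEMO_PAGE_OFFSET));
}
```

For the direct-map address 0xffff880001000000 the function returns 0x1000000 (16M), and for the kernel-text address 0xffffffff81000000 with phys_base 0 it also returns 0x1000000, so both mappings of the same physical address resolve consistently.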

If CONFIG_NEED_MULTIPLE_NODES is set, the next step is to find the node the CPU belongs to:

    int node = early_cpu_to_node(cpu);

If that node is not online or has no memory of its own, the function falls back to __alloc_bootmem_nopanic. The code uses the NODE_DATA macro, defined as follows:

    // file: arch/x86/include/asm/mmzone_64.h
    #define NODE_DATA(nid)		(node_data[nid])
    extern struct pglist_data *node_data[];

node_data is an array of pointers to struct pglist_data, and each pglist_data structure corresponds to one NUMA node. struct pglist_data is defined in the header include/linux/mmzone.h and also goes by the alias pg_data_t.

    /*
     * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
     * (mostly NUMA machines?) to denote a higher-level memory zone than the
     * zone denotes.
     *
     * On NUMA machines, each NUMA node would have a pg_data_t to describe
     * it's memory layout.
     *
     * Memory statistics and page replacement data structures are maintained on a
     * per-zone basis.
     */
    typedef struct pglist_data {

    	......
    	
    } pg_data_t;

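Now let's look at __alloc_bootmem_nopanic. The function is defined in both mm/nobootmem.c and mm/bootmem.c. According to the Boot time memory management page of the official Linux kernel documentation, which of the two is used is decided by two kernel configuration options.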

Early system initialization cannot use "normal" memory management simply because it is not set up yet. But there is still need to allocate memory for various data structures, for instance for the physical page allocator. To address this, a specialized allocator called the Boot Memory Allocator, or bootmem, was introduced. Several years later PowerPC developers added a "Logical Memory Blocks" allocator, which was later adopted by other architectures and renamed to memblock. There is also a compatibility layer called nobootmem that translates bootmem allocation interfaces to memblock calls.

The selection of the early allocator is done using CONFIG_NO_BOOTMEM and CONFIG_HAVE_MEMBLOCK kernel configuration options. These options are enabled or disabled statically by the architectures' Kconfig files.

Architectures that rely only on bootmem select CONFIG_NO_BOOTMEM=n && CONFIG_HAVE_MEMBLOCK=n.

The users of memblock with the nobootmem compatibility layer set CONFIG_NO_BOOTMEM=y && CONFIG_HAVE_MEMBLOCK=y.

And for those that use both memblock and bootmem the configuration includes CONFIG_NO_BOOTMEM=n && CONFIG_HAVE_MEMBLOCK=y.

In short, early in boot the normal memory management is not available yet, so the kernel relies on an early allocator: either bootmem, or the newer memblock, which can be reached through the bootmem interfaces by way of the nobootmem compatibility layer. CONFIG_NO_BOOTMEM and CONFIG_HAVE_MEMBLOCK select among three setups:

bootmem only: CONFIG_NO_BOOTMEM=n && CONFIG_HAVE_MEMBLOCK=n;
memblock with the nobootmem compatibility layer: CONFIG_NO_BOOTMEM=y && CONFIG_HAVE_MEMBLOCK=y;
both memblock and bootmem: CONFIG_NO_BOOTMEM=n && CONFIG_HAVE_MEMBLOCK=y.

The default configuration used when building the kernel here is the second one, so memblock is used through the nobootmem compatibility layer, and the __alloc_bootmem_nopanic that gets compiled in is the one in mm/nobootmem.c, not the one in mm/bootmem.c.

__alloc_bootmem_nopanic is defined as follows; internally it calls ___alloc_bootmem_nopanic:

    // file: mm/nobootmem.c
    /**
     * __alloc_bootmem_nopanic - allocate boot memory without panicking
     * @size: size of the request in bytes
     * @align: alignment of the region
     * @goal: preferred starting address of the region
     *
     * The goal is dropped if it can not be satisfied and the allocation will
     * fall back to memory below @goal.
     *
     * Allocation may happen on any node in the system.
     *
     * Returns NULL on failure.
     */
    void * __init __alloc_bootmem_nopanic(unsigned long size, unsigned long align,
    					unsigned long goal)
    {
    	unsigned long limit = -1UL;

    	return ___alloc_bootmem_nopanic(size, align, goal, limit);
    }

___alloc_bootmem_nopanic in turn calls __alloc_memory_core_early. If the allocation fails with a non-zero goal, it drops the goal and retries once with goal set to 0:

    // file: mm/nobootmem.c
    static void * __init ___alloc_bootmem_nopanic(unsigned long size,
    					unsigned long align,
    					unsigned long goal,
    					unsigned long limit)
    {
    	void *ptr;

    	if (WARN_ON_ONCE(slab_is_available()))
    		return kzalloc(size, GFP_NOWAIT);

    restart:

    	ptr = __alloc_memory_core_early(MAX_NUMNODES, size, align, goal, limit);

    	if (ptr)
    		return ptr;

    	if (goal != 0) {
    		goal = 0;
    		goto restart;
    	}

    	return NULL;
    }

The MAX_NUMNODES macro is defined as follows. CONFIG_NODES_SHIFT is a kernel configuration option; with the value 10 used in this build, MAX_NUMNODES expands to 1024:

    // file: include/linux/numa.h
    #define MAX_NUMNODES    (1 << NODES_SHIFT)	// 1 << 10 = 1024
    #define NODES_SHIFT     CONFIG_NODES_SHIFT	// 10


    // file: include/generated/autoconf.h
    #define CONFIG_NODES_SHIFT 10

__alloc_memory_core_early is defined as follows. It calls memblock_find_in_range_node to find a suitable range with the memblock allocator:

    // file: 
    static void * __init __alloc_memory_core_early(int nid, u64 size, u64 align,
    					u64 goal, u64 limit)
    {
    	void *ptr;
    	u64 addr;

    	if (limit > memblock.current_limit)
    		limit = memblock.current_limit;

    	addr = memblock_find_in_range_node(goal, limit, size, align, nid);
    	if (!addr)
    		return NULL;

    	memblock_reserve(addr, size);
    	ptr = phys_to_virt(addr);
    	memset(ptr, 0, size);
    	/*
    	 * The min_count is set to 0 so that bootmem allocated blocks
    	 * are never reported as leaks.
    	 */
    	kmemleak_alloc(ptr, size, 0, 0);
    	return ptr;
    }

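memblock_find_in_range_node takes five parameters:

start: the lowest address that may be allocated;
end: the highest address that may be allocated;
size: the size of the memory to allocate;
align: the alignment in bytes;
nid: the node to allocate on; when it is MAX_NUMNODES, any node may be used.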

    // file: mm/memblock.c
    /**
     * memblock_find_in_range_node - find free area in given range and node
     * @start: start of candidate range
     * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
     * @size: size of free area to find
     * @align: alignment of free area to find
     * @nid: nid of the free area to find, %MAX_NUMNODES for any node
     *
     * Find @size free area aligned to @align in the specified range and node.
     *
     * RETURNS:
     * Found address on success, %0 on failure.
     */
    phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
    					phys_addr_t end, phys_addr_t size,
    					phys_addr_t align, int nid)
    {
    	phys_addr_t this_start, this_end, cand;
    	u64 i;

    	/* pump up @end */
    	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
    		end = memblock.current_limit;

    	/* avoid allocating the first page */
    	start = max_t(phys_addr_t, start, PAGE_SIZE);
    	end = max(start, end);

    	for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
    		this_start = clamp(this_start, start, end);
    		this_end = clamp(this_end, start, end);

    		if (this_end < size)
    			continue;

    		cand = round_down(this_end - size, align);
    		if (cand >= this_start)
    			return cand;
    	}
    	return 0;
    }

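memblock_find_in_range_node uses the for_each_free_mem_range_reverse macro, which walks the free memory ranges of the node given by nid. As the name suggests, it iterates in reverse, starting from the highest addresses. The macro's internals belong to the memblock allocator implementation, so we will not go into them here.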
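
The top-down search can be illustrated with a small userspace sketch. It is not kernel code: `struct free_range`, `find_top_down` and the helper names are hypothetical, the free ranges are assumed to be sorted by ascending address, and the kernel's extra step of skipping the first page is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free range [start, end); stands in for a memblock free region. */
struct free_range {
    uint64_t start;
    uint64_t end;
};

/* Round x down to a multiple of align; align must be a power of two. */
static uint64_t round_down_pow2(uint64_t x, uint64_t align)
{
    return x & ~(align - 1);
}

static uint64_t clamp_u64(uint64_t v, uint64_t lo, uint64_t hi)
{
    if (v < lo)
        return lo;
    if (v > hi)
        return hi;
    return v;
}

/*
 * Walk the ranges from the highest one down and return the highest
 * aligned address where size bytes fit inside [start, end).
 * Returns 0 on failure, like memblock_find_in_range_node.
 */
static uint64_t find_top_down(const struct free_range *ranges, size_t n,
                              uint64_t start, uint64_t end,
                              uint64_t size, uint64_t align)
{
    for (size_t i = n; i-- > 0; ) {
        uint64_t this_start = clamp_u64(ranges[i].start, start, end);
        uint64_t this_end = clamp_u64(ranges[i].end, start, end);

        if (this_end < size)
            continue;   /* avoids underflow in this_end - size */

        uint64_t cand = round_down_pow2(this_end - size, align);
        if (cand >= this_start)
            return cand;
    }
    return 0;
}
```

With free ranges [0x1000, 0x5000) and [0x10000, 0x20000), a request for 0x2000 bytes aligned to 0x1000 is placed at 0x1e000, at the top of the higher range.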

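On success, memblock_find_in_range_node returns the physical address of the range it found. __alloc_memory_core_early then calls memblock_reserve(addr, size); to add the range to the reserved regions so that it cannot be handed out again, and ptr = phys_to_virt(addr); to convert the physical address to a virtual address. phys_to_virt is implemented as follows: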

    // file: arch/x86/include/asm/io.h
    static inline void *phys_to_virt(phys_addr_t address)
    {
    	return __va(address);
    }

As shown, the inline function phys_to_virt simply invokes the __va macro, whose implementation was covered earlier and is not repeated here.

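Next, memset(ptr, 0, size); zeroes the memory, and the pointer to the block is returned. This completes the allocation.

The figure below shows the memory allocation process:

Now let's look at how __alloc_bootmem_node_nopanic executes.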

    		ptr = __alloc_bootmem_node_nopanic(NODE_DATA(node),
    						   size, align, goal);

As described earlier, NODE_DATA(node) returns the struct pglist_data that belongs to the given node.

__alloc_bootmem_node_nopanic is defined as follows; it calls ___alloc_bootmem_node_nopanic internally.

    // file: mm/nobootmem.c
    void * __init __alloc_bootmem_node_nopanic(pg_data_t *pgdat, unsigned long size,
    				   unsigned long align, unsigned long goal)
    {
    	if (WARN_ON_ONCE(slab_is_available()))
    		return kzalloc_node(size, GFP_NOWAIT, pgdat->node_id);

    	return ___alloc_bootmem_node_nopanic(pgdat, size, align, goal, 0);
    }

___alloc_bootmem_node_nopanic is defined as follows; it also calls __alloc_memory_core_early:

    // file: mm/nobootmem.c
    void * __init ___alloc_bootmem_node_nopanic(pg_data_t *pgdat,
    						   unsigned long size,
    						   unsigned long align,
    						   unsigned long goal,
    						   unsigned long limit)
    {
    	void *ptr;

    again:
    	ptr = __alloc_memory_core_early(pgdat->node_id, size, align,
    					goal, limit);
    	if (ptr)
    		return ptr;

    	ptr = __alloc_memory_core_early(MAX_NUMNODES, size, align,
    					goal, limit);
    	if (ptr)
    		return ptr;

    	if (goal) {
    		goal = 0;
    		goto again;
    	}

    	return NULL;
    }

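As the code shows, the function first tries to allocate on the requested node (the first argument passed to __alloc_memory_core_early is pgdat->node_id). If that fails, it allocates from any node (the first argument is MAX_NUMNODES). If both attempts fail and goal is non-zero, goal is reset to 0 and the two attempts are repeated.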
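
The fallback order can be modeled in userspace as follows. This is only a sketch: `alloc_core_fn`, `ANY_NODE`, `alloc_node_with_fallback` and `demo_alloc` are hypothetical names, and the callback stands in for __alloc_memory_core_early.

```c
#include <stddef.h>

#define ANY_NODE (-1)   /* hypothetical stand-in for MAX_NUMNODES */

/* Hypothetical allocator callback; returns NULL when the request cannot be met. */
typedef void *(*alloc_core_fn)(int nid, size_t size, size_t goal);

static void *alloc_node_with_fallback(alloc_core_fn alloc_core, int nid,
                                      size_t size, size_t goal)
{
    void *ptr;

    for (;;) {
        ptr = alloc_core(nid, size, goal);       /* 1. preferred node */
        if (ptr)
            return ptr;

        ptr = alloc_core(ANY_NODE, size, goal);  /* 2. any node */
        if (ptr)
            return ptr;

        if (goal == 0)
            return NULL;                         /* 3. nothing left to relax */
        goal = 0;                                /* retry without a goal */
    }
}

/* Demo allocator: succeeds only for "any node" with no goal. */
static int demo_calls;
static char demo_pool[64];

static void *demo_alloc(int nid, size_t size, size_t goal)
{
    demo_calls++;
    if (nid == ANY_NODE && goal == 0 && size <= sizeof(demo_pool))
        return demo_pool;
    return NULL;
}
```

Calling alloc_node_with_fallback(demo_alloc, 1, 64, 4096) makes four attempts with this demo allocator: node 1 and any node with the goal, then node 1 and any node without it.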

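This concludes the analysis of pcpu_fc_alloc, the function that pcpu_embed_first_chunk uses to allocate memory.

To summarize: memblock is a physical memory allocator, so after the allocation the physical address has to be converted to a virtual address with the __va macro or the inline function phys_to_virt. The conversion is simple: add PAGE_OFFSET to the physical address. Because PAGE_OFFSET is the base of the direct mapping of physical memory, memory allocated through memblock lies in the direct-mapping region. In addition, a memblock allocation can specify the node as well as the start and end of the candidate range. If no node is specified, the memory may come from any node.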
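
As a quick illustration of the conversion, here is a minimal userspace sketch. `DEMO_PAGE_OFFSET`, `demo_va` and `demo_pa` are hypothetical names, and the constant is assumed to be the x86_64 direct-mapping base used by v3.10.

```c
#include <stdint.h>

/* Assumed base of the direct mapping on x86_64 in v3.10. */
#define DEMO_PAGE_OFFSET 0xffff880000000000ULL

/* Physical to virtual: the same arithmetic as __va. */
static uint64_t demo_va(uint64_t phys)
{
    return phys + DEMO_PAGE_OFFSET;
}

/* Virtual to physical; only valid for direct-mapping addresses. */
static uint64_t demo_pa(uint64_t virt)
{
    return virt - DEMO_PAGE_OFFSET;
}
```

Under this assumption, the first chunk base 0xffff88007fc00000 that appears in the dmesg output later in this article corresponds to physical address 0x7fc00000.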

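The flow of this allocation path is shown in the chart below:

With the allocation function covered, let's return to pcpu_embed_first_chunk and look at its implementation:

pcpu_embed_first_chunk -1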

    // file: mm/percpu.c
    #if defined(BUILD_EMBED_FIRST_CHUNK)
    /**
     * pcpu_embed_first_chunk - embed the first percpu chunk into bootmem
     * @reserved_size: the size of reserved percpu area in bytes
     * @dyn_size: minimum free size for dynamic allocation in bytes
     * @atom_size: allocation atom size
     * @cpu_distance_fn: callback to determine distance between cpus, optional
     * @alloc_fn: function to allocate percpu page
     * @free_fn: function to free percpu page
     *
     * This is a helper to ease setting up embedded first percpu chunk and
     * can be called where pcpu_setup_first_chunk() is expected.
     *
     * If this function is used to setup the first chunk, it is allocated
     * by calling @alloc_fn and used as-is without being mapped into
     * vmalloc area.  Allocations are always whole multiples of @atom_size
     * aligned to @atom_size.
     *
     * This enables the first chunk to piggy back on the linear physical
     * mapping which often uses larger page size.  Please note that this
     * can result in very sparse cpu->unit mapping on NUMA machines thus
     * requiring large vmalloc address space.  Don't use this allocator if
     * vmalloc space is not orders of magnitude larger than distances
     * between node memory addresses (ie. 32bit NUMA machines).
     *
     * @dyn_size specifies the minimum dynamic area size.
     *
     * If the needed size is smaller than the minimum or specified unit
     * size, the leftover is returned using @free_fn.
     *
     * RETURNS:
     * 0 on success, -errno on failure.
     */
    int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
    				  size_t atom_size,
    				  pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
    				  pcpu_fc_alloc_fn_t alloc_fn,
    				  pcpu_fc_free_fn_t free_fn)
    {
    	void *base = (void *)ULONG_MAX;
    	void **areas = NULL;
    	struct pcpu_alloc_info *ai;
    	size_t size_sum, areas_size, max_distance;
    	int group, i, rc;

    	ai = pcpu_build_alloc_info(reserved_size, dyn_size, atom_size,
    				   cpu_distance_fn);
    	if (IS_ERR(ai))
    		return PTR_ERR(ai);

    	......

pcpu_embed_first_chunk first calls pcpu_build_alloc_info, which uses the supplied arguments to fill in a pcpu_alloc_info structure. The structure is defined as follows:

    // file: include/linux/percpu.h
    struct pcpu_alloc_info {
    	size_t			static_size;
    	size_t			reserved_size;
    	size_t			dyn_size;
    	size_t			unit_size;
    	size_t			atom_size;
    	size_t			alloc_size;
    	size_t			__ai_size;	/* internal, don't use */
    	int			nr_groups;	/* 0 if grouping unnecessary */
    	struct pcpu_group_info	groups[];
    };

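The fields have the following meanings:

static_size:
- size of the static area in a unit; this area exists only in the first chunk;
reserved_size:
- size of the reserved area in a unit; this area exists only in the first chunk;
- as analyzed earlier, on x86_64 this value is 8K;
dyn_size:
- minimum size of the dynamic area, that is, its size in the first chunk;
- as analyzed earlier, on x86_64 this value is 20K;
unit_size:
- size of each unit;
atom_size:
- minimum allocation size;
- as analyzed earlier, on x86_64 this value is 2M;
alloc_size:
- size of the memory to allocate;
nr_groups:
- number of groups to allocate, which normally corresponds to the number of NUMA nodes;
groups[]:
- information about each group.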

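The groups in a chunk are described by the pcpu_group_info structure. A node can contain several CPUs and each CPU corresponds to one unit, so a group can contain several units. A UMA system has a single node and therefore a single group.

struct pcpu_group_info has three fields and is defined as follows: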

    // file: include/linux/percpu.h
    struct pcpu_group_info {
    	int			nr_units;	/* aligned # of units */
    	unsigned long		base_offset;	/* base address offset */
    	unsigned int		*cpu_map;	/* unit->cpu map, empty
    						 * entries contain NR_CPUS */
    };

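The fields have the following meanings:

nr_units: number of units in the group;
base_offset: offset of the group's base address from the base address of the chunk;
cpu_map: the unit-to-CPU mapping of the group. It is an array indexed by unit, whose values are CPU numbers.

pcpu_embed_first_chunk -2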

    	size_sum = ai->static_size + ai->reserved_size + ai->dyn_size;
    	areas_size = PFN_ALIGN(ai->nr_groups * sizeof(void *));

    	areas = alloc_bootmem_nopanic(areas_size);
    	if (!areas) {
    		rc = -ENOMEM;
    		goto out_free;
    	}

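After pcpu_alloc_info is filled in, size_sum is computed as the total size of the static, reserved and dynamic areas. areas_size is the memory needed to hold one void * pointer per group (that is, per node), rounded up to a whole number of pages by PFN_ALIGN. areas[] is an array occupying areas_size bytes; it stores the address of each group's per-cpu data. Next, alloc_bootmem_nopanic allocates the space for areas. alloc_bootmem_nopanic is a macro, defined as follows: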

    // file: include/linux/bootmem.h
    #define alloc_bootmem_nopanic(x) \
    	__alloc_bootmem_nopanic(x, SMP_CACHE_BYTES, BOOTMEM_LOW_LIMIT)

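The value of BOOTMEM_LOW_LIMIT depends on the kernel configuration option CONFIG_NO_BOOTMEM: it is 0 when the option is enabled and __pa(MAX_DMA_ADDRESS) otherwise. CONFIG_NO_BOOTMEM is enabled by default, so BOOTMEM_LOW_LIMIT is 0 here. The comment also explains that allocation is top-down, which makes 0 a safe lower limit.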

    #ifdef CONFIG_NO_BOOTMEM
    /* We are using top down, so it is safe to use 0 here */
    #define BOOTMEM_LOW_LIMIT 0
    #else
    #define BOOTMEM_LOW_LIMIT __pa(MAX_DMA_ADDRESS)
    #endif

The macro SMP_CACHE_BYTES is defined as follows:

    // file: include/linux/cache.h
    #define SMP_CACHE_BYTES L1_CACHE_BYTES

L1_CACHE_BYTES is 64 bytes by default. It is used as the alignment of the allocation, mainly for performance.

    // file: arch/x86/include/asm/cache.h
    /* L1 cache line size */
    #define L1_CACHE_BYTES	(1 << L1_CACHE_SHIFT)	// 1 << 6 = 64
    #define L1_CACHE_SHIFT	(CONFIG_X86_L1_CACHE_SHIFT)


    // file: include/generated/autoconf.h
    #define CONFIG_X86_L1_CACHE_SHIFT 6

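With the macros covered, let's turn to __alloc_bootmem_nopanic. We met this function earlier when discussing the allocation function of the first chunk, where the call chain was pcpu_embed_first_chunk -> pcpu_fc_alloc -> pcpu_alloc_bootmem -> __alloc_bootmem_nopanic. Because no node id is specified, the function can allocate from any node. Even so, the search for free memory runs top-down, so memory at high physical addresses is allocated first.

pcpu_embed_first_chunk -3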

    	/* allocate, copy and determine base address */
    	for (group = 0; group < ai->nr_groups; group++) {
    		struct pcpu_group_info *gi = &ai->groups[group];
    		unsigned int cpu = NR_CPUS;
    		void *ptr;

    		for (i = 0; i < gi->nr_units && cpu == NR_CPUS; i++)
    			cpu = gi->cpu_map[i];
    		BUG_ON(cpu == NR_CPUS);

    		/* allocate space for the whole group */
    		ptr = alloc_fn(cpu, gi->nr_units * ai->unit_size, atom_size);
    		if (!ptr) {
    			rc = -ENOMEM;
    			goto out_free_areas;
    		}
    		/* kmemleak tracks the percpu allocations separately */
    		kmemleak_free(ptr);
    		areas[group] = ptr;

    		base = min(ptr, base);
    	}

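This code iterates over the groups. For each group it walks the units and picks the first entry in gi->cpu_map whose value is not NR_CPUS; if there is none, BUG_ON fires.

pcpu_build_alloc_info fills in a pcpu_group_info structure for each node. When it populates cpu_map[], every unit that is not mapped to a real CPU gets the value NR_CPUS in cpu_map[unit]. So if cpu is still NR_CPUS after all gi->nr_units entries have been examined, the group contains no CPU at all, which can only be a bug, hence the BUG_ON.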
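
The search for the first real CPU of a group can be sketched in isolation. `DEMO_NR_CPUS` and `first_cpu_in_group` are hypothetical names chosen for this illustration; the loop condition mirrors the kernel's.

```c
#include <stddef.h>

#define DEMO_NR_CPUS 8u   /* hypothetical NR_CPUS; marks an unused unit */

/* Return the first real CPU in cpu_map[], or DEMO_NR_CPUS if the group is empty. */
static unsigned int first_cpu_in_group(const unsigned int *cpu_map,
                                       size_t nr_units)
{
    unsigned int cpu = DEMO_NR_CPUS;

    /* Stop as soon as a unit that maps to a real CPU is found. */
    for (size_t i = 0; i < nr_units && cpu == DEMO_NR_CPUS; i++)
        cpu = cpu_map[i];

    return cpu;
}
```

For a map such as {8, 8, 3, 5} the function returns 3; for a map that contains only 8 it returns DEMO_NR_CPUS, which is the case the kernel treats as a bug.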

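Next, memory is allocated for the group that the CPU belongs to. alloc_fn is the allocation function analyzed above; atom_size is the minimum allocation size, 2M on x86_64; gi->nr_units * ai->unit_size is the amount of memory to allocate, where unit_size was computed in pcpu_build_alloc_info. In other words, the memory a group needs is its number of units multiplied by unit_size; the allocation is aligned to atom_size and made on the node of the given CPU.

Note that the allocation function looks up the node of the given CPU and allocates from that node's memory, so the memory of different groups is generally not contiguous.

After the allocation, the group's address is stored in areas[group]; each element of the areas array points to one group's per-cpu data.

Finally, base is replaced by the smaller of ptr and base, so that base ends up at the lowest group address.

pcpu_embed_first_chunk -4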

    	/*
    	 * Copy data and free unused parts.  This should happen after all
    	 * allocations are complete; otherwise, we may end up with
    	 * overlapping groups.
    	 */
    	for (group = 0; group < ai->nr_groups; group++) {
    		struct pcpu_group_info *gi = &ai->groups[group];
    		void *ptr = areas[group];

    		for (i = 0; i < gi->nr_units; i++, ptr += ai->unit_size) {
    			if (gi->cpu_map[i] == NR_CPUS) {
    				/* unused unit, free whole */
    				free_fn(ptr, ai->unit_size);
    				continue;
    			}
    			/* copy and return the unused part */
    			memcpy(ptr, __per_cpu_load, ai->static_size);
    			free_fn(ptr + size_sum, ai->unit_size - size_sum);
    		}
    	}

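This code iterates over all groups and, within each group, over its units. If a unit maps to NR_CPUS (gi->cpu_map[i] == NR_CPUS), it is not backed by a real CPU and its memory is freed. Otherwise, memcpy(ptr, __per_cpu_load, ai->static_size) copies the static per-cpu data from __per_cpu_load into the unit's static area; the data is ai->static_size bytes long. The unused tail of the unit, from offset size_sum up to unit_size, is then freed.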
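
The copy-and-free pass can be imitated in userspace. In this sketch, `populate_group` and `DEMO_NR_CPUS` are hypothetical; instead of calling a real free function it only counts the bytes that would be handed back.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_NR_CPUS 8u   /* hypothetical marker for an unused unit */

/*
 * Seed every mapped unit of a group with the static template and
 * return the number of bytes that would be returned via free_fn.
 */
static size_t populate_group(unsigned char *area, const unsigned int *cpu_map,
                             size_t nr_units, size_t unit_size,
                             const unsigned char *tmpl, size_t static_size,
                             size_t size_sum)
{
    size_t freed = 0;
    unsigned char *ptr = area;

    for (size_t i = 0; i < nr_units; i++, ptr += unit_size) {
        if (cpu_map[i] == DEMO_NR_CPUS) {
            freed += unit_size;             /* unused unit: freed as a whole */
            continue;
        }
        memcpy(ptr, tmpl, static_size);     /* copy the static per-cpu data */
        freed += unit_size - size_sum;      /* tail beyond size_sum is freed */
    }
    return freed;
}
```

With three 16-byte units, a size_sum of 10 and the middle unit unused, 6 + 16 + 6 = 28 bytes are returned.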

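__per_cpu_load is defined in the linker script. It is the symbol that marks where the static per-cpu data starts in the kernel image, and it comes from the PERCPU_VADDR macro in include/asm-generic/vmlinux.lds.h, which was introduced earlier. The nm output shows that the address lies in the kernel image mapping (the region starting at 0xffffffff80000000), and the D flag marks it as a symbol in the initialized data section.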

    $ nm vmlinux|grep per_cpu_load
    ffffffff81d31000 D __per_cpu_load

pcpu_embed_first_chunk -5


    	/* base address is now known, determine group base offsets */
    	max_distance = 0;
    	for (group = 0; group < ai->nr_groups; group++) {
    		ai->groups[group].base_offset = areas[group] - base;
    		max_distance = max_t(size_t, max_distance,
    				     ai->groups[group].base_offset);
    	}
    	max_distance += ai->unit_size;

    	/* warn if maximum distance is further than 75% of vmalloc space */
    	if (max_distance > (VMALLOC_END - VMALLOC_START) * 3 / 4) {
    		pr_warning("PERCPU: max_distance=0x%zx too large for vmalloc "
    			   "space 0x%lx\n", max_distance,
    			   (unsigned long)(VMALLOC_END - VMALLOC_START));
    #ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
    		/* and fail if we have fallback */
    		rc = -EINVAL;
    		goto out_free;
    #endif
    	}

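This code sets each group's base_offset to the difference between the group's address and base, the lowest group address. It also tracks the largest offset in max_distance and finally adds one unit_size to it.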
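
The offset computation is easy to reproduce with plain integers. The sketch below is a userspace illustration; `group_offsets` is a hypothetical helper and the addresses in the usage note are made up.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * areas[] holds the start address of each group. The function fills
 * base_offset[] relative to the lowest address and returns max_distance
 * (the largest offset plus one unit).
 */
static uint64_t group_offsets(const uint64_t *areas, size_t nr_groups,
                              uint64_t unit_size, uint64_t *base_offset)
{
    uint64_t base = UINT64_MAX;
    uint64_t max_distance = 0;

    /* base = min over all groups, as in the allocation loop */
    for (size_t g = 0; g < nr_groups; g++)
        if (areas[g] < base)
            base = areas[g];

    for (size_t g = 0; g < nr_groups; g++) {
        base_offset[g] = areas[g] - base;
        if (base_offset[g] > max_distance)
            max_distance = base_offset[g];
    }
    return max_distance + unit_size;
}
```

For two groups at 0x40000000 and 0x10000000 with 2M units, the offsets are 0x30000000 and 0, and max_distance is 0x30200000.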

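It then checks whether max_distance exceeds 3/4 of the vmalloc space. If it does, a warning is printed. When CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK is defined, meaning a fallback exists, rc is set to -EINVAL and the function jumps to out_free, which releases ai and the areas array, and the error code is returned. When the embed allocator fails, setup_per_cpu_areas falls back to the page-based allocator pcpu_page_first_chunk: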

    		rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
    					    dyn_size, atom_size,
    					    pcpu_cpu_distance,
    					    pcpu_fc_alloc, pcpu_fc_free);
    		if (rc < 0)
    			pr_warning("%s allocator failed (%d), falling back to page size\n",
    				   pcpu_fc_names[pcpu_chosen_fc], rc);
    	}
    	if (rc < 0)
    		rc = pcpu_page_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
    					   pcpu_fc_alloc, pcpu_fc_free,
    					   pcpup_populate_pte);

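The comparison with 3/4 of the vmalloc space exists because only the first chunk is allocated in the direct-mapping region; all other chunks are allocated in the vmalloc area. If max_distance is too large, too little vmalloc address space is left for them.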

pcpu_embed_first_chunk -6


    	pr_info("PERCPU: Embedded %zu pages/cpu @%p s%zu r%zu d%zu u%zu\n",
    		PFN_DOWN(size_sum), base, ai->static_size, ai->reserved_size,
    		ai->dyn_size, ai->unit_size);

    	rc = pcpu_setup_first_chunk(ai, base);
    	goto out_free;

    out_free_areas:
    	for (group = 0; group < ai->nr_groups; group++)
    		free_fn(areas[group],
    			ai->groups[group].nr_units * ai->unit_size);
    out_free:
    	pcpu_free_alloc_info(ai);
    	if (areas)
    		free_bootmem(__pa(areas), areas_size);
    	return rc;
    }
    #endif /* BUILD_EMBED_FIRST_CHUNK */

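This code first prints the configuration of the first chunk. size_sum is the total size of the static, reserved and dynamic areas of a unit; PFN_DOWN converts size_sum to a number of pages, rounding down; base is the base address of the first chunk. PFN_DOWN is defined as follows: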

    // file: include/linux/pfn.h
    #define PFN_DOWN(x)	((x) >> PAGE_SHIFT)		// PAGE_SHIFT = 12

The message can be viewed with dmesg:

    [root@localhost ~]# dmesg|grep -i PERCPU
    [    0.000000] PERCPU: Embedded 33 pages/cpu @ffff88007fc00000 s97048 r8192 d29928 u2097152

On this host, size_sum rounded down to whole pages is 33 pages, and the base address base of the first chunk is 0xffff88007fc00000.
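
The page count can be checked against the s, r and d values in the dmesg line. The following sketch redoes the arithmetic in userspace; `DEMO_PAGE_SHIFT`, `DEMO_PFN_DOWN` and `pages_per_cpu` are hypothetical names that mirror the kernel macros.

```c
#include <stddef.h>

#define DEMO_PAGE_SHIFT 12                          /* 4K pages */
#define DEMO_PFN_DOWN(x) ((x) >> DEMO_PAGE_SHIFT)   /* same arithmetic as PFN_DOWN */

/* Number of whole pages covered by the three areas of one unit. */
static size_t pages_per_cpu(size_t static_size, size_t reserved_size,
                            size_t dyn_size)
{
    size_t size_sum = static_size + reserved_size + dyn_size;

    return DEMO_PFN_DOWN(size_sum);
}
```

With the values from the log, 97048 + 8192 + 29928 = 135168 bytes, which is exactly 33 pages of 4096 bytes.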

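After printing the message, the function calls pcpu_setup_first_chunk to set up the management data of the first chunk, then jumps to the out_free label to release memory that is no longer needed, namely the ai structure and the areas[] array, and returns.

pcpu_setup_first_chunk is defined in mm/percpu.c. It mainly sets up the management data of the first chunk and then stores the chunk's base address in pcpu_base_addr, so that setup_per_cpu_areas can use it.

In setup_per_cpu_areas, pcpu_base_addr and __per_cpu_start are used to compute delta.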

    	/* alrighty, percpu areas up and running */
    	delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
    	for_each_possible_cpu(cpu) {
    		per_cpu_offset(cpu) = delta + pcpu_unit_offsets[cpu];
    		per_cpu(this_cpu_off, cpu) = per_cpu_offset(cpu);
    		per_cpu(cpu_number, cpu) = cpu;

    		......
    	}

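__per_cpu_start was introduced earlier: it is the virtual address of the section .data..percpu, and its value is 0, so delta is equal to pcpu_base_addr. The figure below illustrates delta:

per_cpu_offset is a macro, defined as follows: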

    #define per_cpu_offset(x) (__per_cpu_offset[x])

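After computing delta, the code iterates over every possible CPU and fills in the __per_cpu_offset array. The array has NR_CPUS elements; NR_CPUS is a kernel configuration parameter giving the maximum number of CPUs the kernel supports, which is not necessarily the number of CPUs actually present. Before this loop runs, every element of __per_cpu_offset holds its initial value, the address of __per_cpu_load.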

    // file: arch/x86/kernel/setup_percpu.c
    unsigned long __per_cpu_offset[NR_CPUS] __read_mostly = {
    	[0 ... NR_CPUS-1] = BOOT_PERCPU_OFFSET,
    };

    #define BOOT_PERCPU_OFFSET ((unsigned long)__per_cpu_load)

pcpu_unit_offsets is a global variable that is assigned in pcpu_setup_first_chunk. Let's look at how it is computed:

    // file: mm/percpu.c
    int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
    				  void *base_addr)
    {
    	......

    	unsigned long *unit_off;

    	......

    	unit_off = alloc_bootmem(nr_cpu_ids * sizeof(unit_off[0]));

    	......

    	for (group = 0, unit = 0; group < ai->nr_groups; group++, unit += i) {
    		const struct pcpu_group_info *gi = &ai->groups[group];

    		......

    		for (i = 0; i < gi->nr_units; i++) {
    			cpu = gi->cpu_map[i];
    			......
    			unit_off[cpu] = gi->base_offset + i * ai->unit_size;
    			......
    		}
    	}

    	......

    	pcpu_unit_offsets = unit_off;

    	......

    	/* we're done */
    	pcpu_base_addr = base_addr;
    	return 0;
    }

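pcpu_setup_first_chunk declares unit_off, a pointer to unsigned long, and allocates an array of nr_cpu_ids elements for it. It then walks every group and every unit in the group, computing each unit's offset from pcpu_base_addr, and finally assigns unit_off to pcpu_unit_offsets. In other words, pcpu_unit_offsets is indexed by CPU number and holds the distance from pcpu_base_addr to that CPU's unit. From this, setup_per_cpu_areas computes the elements of __per_cpu_offset: the index is the CPU number and the value is delta + pcpu_unit_offsets[cpu], that is, the distance from __per_cpu_start to the base address of the CPU's unit.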
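
Both steps can be combined into one small userspace sketch. `struct demo_group`, `fill_cpu_offsets` and `DEMO_NR_CPUS` are hypothetical names, and the addresses in the usage note are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NR_CPUS 8u   /* hypothetical marker for an unused unit */

/* Simplified stand-in for struct pcpu_group_info. */
struct demo_group {
    size_t nr_units;
    uint64_t base_offset;
    const unsigned int *cpu_map;
};

/*
 * cpu_offset[cpu] = delta + base_offset of the group + i * unit_size,
 * i.e. the value setup_per_cpu_areas stores in __per_cpu_offset[cpu].
 */
static void fill_cpu_offsets(const struct demo_group *groups, size_t nr_groups,
                             uint64_t unit_size, uint64_t delta,
                             uint64_t *cpu_offset)
{
    for (size_t g = 0; g < nr_groups; g++) {
        const struct demo_group *gi = &groups[g];

        for (size_t i = 0; i < gi->nr_units; i++) {
            unsigned int cpu = gi->cpu_map[i];

            if (cpu == DEMO_NR_CPUS)
                continue;   /* unit not backed by a CPU */
            cpu_offset[cpu] = delta + gi->base_offset + i * unit_size;
        }
    }
}
```

With delta = 0xffff88007fc00000, 2M units, and a second group at base_offset 0x30000000 whose cpu_map is {2, 8, 3}, CPU 3 sits in the third unit of that group and gets the offset 0xffff8800b0000000.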

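Once __per_cpu_offset is filled in, the per_cpu macro is used to initialize per-cpu data for each CPU, such as cpu_number, this_cpu_off and irq_stack_ptr.

per_cpu is defined as follows. It takes two arguments, the variable name and the CPU number, and expands directly to SHIFT_PERCPU_PTR: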

    // file: include/asm-generic/percpu.h
    #define per_cpu(var, cpu) \
    	(*SHIFT_PERCPU_PTR(&(var), per_cpu_offset(cpu)))

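The per_cpu_offset macro was introduced above; it refers to the array __per_cpu_offset[], which is indexed by CPU number and holds the offset of each CPU's unit.

SHIFT_PERCPU_PTR is defined as follows. It takes two arguments, a pointer to the variable and an offset, and uses the RELOC_HIDE macro: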

    // file: include/asm-generic/percpu.h
    #define SHIFT_PERCPU_PTR(__p, __offset)	({				\
    	__verify_pcpu_ptr((__p));					\
    	RELOC_HIDE((typeof(*(__p)) __kernel __force *)(__p), (__offset)); \
    })

The generic definition of RELOC_HIDE is:

    // file: include/linux/compiler.h
    # define RELOC_HIDE(ptr, off)					\
      ({ unsigned long __ptr;					\
         __ptr = (unsigned long) (ptr);				\
        (typeof(ptr)) (__ptr + (off)); })

RELOC_HIDE simply returns (typeof(ptr)) (__ptr + (off)), which is a pointer to the per-cpu variable. The figure below shows the computation:

A per-cpu variable created with DEFINE_PER_CPU(type, var) gets an address between __per_cpu_start and __per_cpu_end. To reach the copy that belongs to a particular CPU, add that CPU's offset __per_cpu_offset[cpu] to this address.
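
The address arithmetic can be modeled in userspace with a byte array standing in for memory. Everything in this sketch is hypothetical (the `demo_` names, the two variables and their offsets); it only mimics the idea that a variable's link-time address, counted from __per_cpu_start = 0, plus the CPU's offset gives the address of that CPU's copy.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_CPUS 4
#define DEMO_UNIT_SIZE 64

/* "Link-time addresses" of two per-cpu variables, relative to __per_cpu_start (0). */
enum { DEMO_VAR_CPU_NUMBER = 0, DEMO_VAR_COUNTER = 8 };

static unsigned char demo_load_image[16];                      /* data at __per_cpu_load */
static unsigned char demo_memory[DEMO_CPUS * DEMO_UNIT_SIZE];  /* one unit per CPU */
static size_t demo_per_cpu_offset[DEMO_CPUS];                  /* __per_cpu_offset[] */

/* Analogue of SHIFT_PERCPU_PTR: variable address + offset of the CPU's unit. */
static unsigned char *demo_per_cpu_ptr(size_t var, int cpu)
{
    return demo_memory + var + demo_per_cpu_offset[cpu];
}

static long demo_per_cpu_read(size_t var, int cpu)
{
    long v;

    memcpy(&v, demo_per_cpu_ptr(var, cpu), sizeof(v));
    return v;
}

static void demo_per_cpu_write(size_t var, int cpu, long v)
{
    memcpy(demo_per_cpu_ptr(var, cpu), &v, sizeof(v));
}

/* Copy the initial image into every unit, then set per-CPU values. */
static void demo_setup(long initial_counter)
{
    memcpy(demo_load_image + DEMO_VAR_COUNTER, &initial_counter,
           sizeof(initial_counter));
    for (int cpu = 0; cpu < DEMO_CPUS; cpu++) {
        demo_per_cpu_offset[cpu] = (size_t)cpu * DEMO_UNIT_SIZE;
        memcpy(demo_memory + demo_per_cpu_offset[cpu], demo_load_image,
               sizeof(demo_load_image));
        demo_per_cpu_write(DEMO_VAR_CPU_NUMBER, cpu, cpu);
    }
}
```

After demo_setup(100), every CPU starts with its own counter of 100, and a write through demo_per_cpu_write(DEMO_VAR_COUNTER, 2, ...) changes only the copy of CPU 2.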

With the per_cpu macro explained, let's briefly look at the values of a few per-cpu variables.

First, this_cpu_off, which is defined as follows:

    // file: arch/x86/kernel/setup_percpu.c
    DEFINE_PER_CPU(unsigned long, this_cpu_off) = BOOT_PERCPU_OFFSET;

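In setup_per_cpu_areas, each CPU's this_cpu_off is set to per_cpu_offset(cpu), that is, __per_cpu_offset[cpu], the offset of the CPU's unit; cpu_number is set to cpu.

The other variables are handled in a similar way and are not covered one by one.

This completes the initialization of the static per-cpu data area.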

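4. APIs

Creating variables

Static per-cpu variables are created with the DEFINE_PER_CPU macro.

Variable operations

The kernel provides the following APIs for accessing static per-cpu variables.

get_cpu_var(var)
put_cpu_var(var)

get_cpu_var is implemented as follows: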

    // file: include/linux/percpu.h
    #define get_cpu_var(var) (*({				\
    	preempt_disable();				\
    	&__get_cpu_var(var); }))

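get_cpu_var first calls preempt_disable to disable preemption. On an SMP system a per-cpu variable is bound to its CPU, so the code must not be preempted and rescheduled onto another CPU while it is working on the variable; otherwise it would operate on the wrong copy. With preemption disabled, __get_cpu_var is invoked, which is implemented as follows: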

    // file:include/asm-generic/percpu.h
    #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

The macro this_cpu_ptr is defined as follows:

    // file: include/asm-generic/percpu.h
    #define this_cpu_ptr(ptr) __this_cpu_ptr(ptr)

__this_cpu_ptr and the macros it relies on are defined as follows:

    // file: arch/x86/include/asm/percpu.h
    #define __this_cpu_ptr(ptr)				\
    ({							\
    	unsigned long tcp_ptr__;			\
    	__verify_pcpu_ptr(ptr);				\
    	asm volatile("add " __percpu_arg(1) ", %0"	\
    		     : "=r" (tcp_ptr__)			\
    		     : "m" (this_cpu_off), "0" (ptr));	\
    	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;	\
    })

    #define __percpu_arg(x)		__percpu_prefix "%P" #x

    #define __percpu_prefix		"%%"__stringify(__percpu_seg)":"

    #define __percpu_seg		gs

In the end, __this_cpu_ptr(ptr) expands to:

    ({							\
    	unsigned long tcp_ptr__;			\
    	__verify_pcpu_ptr(ptr);				\
    	asm volatile("add %%gs:%P1, %0"	\
    		     : "=r" (tcp_ptr__)			\
    		     : "m" (this_cpu_off), "0" (ptr));	\
    	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;	\
    })

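The add instruction forms an address from the base of the %gs segment plus the address of this_cpu_off, loads the value stored there and adds it to the argument ptr; the macro returns the sum.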

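The initialization of the gs register has not been covered yet and is left for the next article. For now it is enough to know that the gs base holds the offset of the current CPU, that is, __per_cpu_offset[cpu], which was introduced above. The gs base plus the address of this_cpu_off therefore gives the address of the current CPU's copy of this_cpu_off. That copy stores __per_cpu_offset[cpu], the same value as the gs base: the offset of the current CPU. Adding this offset to the variable address ptr yields a pointer to the current CPU's copy of the variable. In short, __this_cpu_ptr(ptr) returns a pointer to the current CPU's copy, and __get_cpu_var(var) dereferences it to obtain the value of the variable on the current CPU.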
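
To make the two-step lookup concrete, here is a userspace model in which addresses are simply indexes into a byte array. All `demo_` names and the variable offsets are hypothetical; `demo_gs_base` plays the role of the GS base of the running CPU.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_CPUS 2
#define DEMO_UNIT_SIZE 32

/* Link-time offsets of two per-cpu variables (section starts at 0). */
enum { DEMO_THIS_CPU_OFF = 0, DEMO_COUNTER = 16 };

static unsigned char demo_memory[DEMO_CPUS * DEMO_UNIT_SIZE];
static size_t demo_gs_base;   /* models the GS base of the running CPU */

static size_t demo_load(size_t addr)
{
    size_t v;

    memcpy(&v, demo_memory + addr, sizeof(v));
    return v;
}

static void demo_store(size_t addr, size_t v)
{
    memcpy(demo_memory + addr, &v, sizeof(v));
}

/* per_cpu(this_cpu_off, cpu) = per_cpu_offset(cpu) for every CPU. */
static void demo_boot(void)
{
    for (size_t cpu = 0; cpu < DEMO_CPUS; cpu++) {
        size_t off = cpu * DEMO_UNIT_SIZE;

        demo_store(off + DEMO_THIS_CPU_OFF, off);
    }
}

/* Pretend the code now runs on the given CPU. */
static void demo_run_on(size_t cpu)
{
    demo_gs_base = cpu * DEMO_UNIT_SIZE;
}

/* Models "add %gs:this_cpu_off, ptr": load the offset, add it to ptr. */
static size_t demo_this_cpu_ptr(size_t var)
{
    size_t off = demo_load(demo_gs_base + DEMO_THIS_CPU_OFF);

    return var + off;
}
```

After demo_boot() and demo_run_on(1), demo_this_cpu_ptr(DEMO_COUNTER) yields 32 + 16 = 48, the position of CPU 1's copy of the counter in this model.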

put_cpu_var(var) is implemented as follows; it re-enables preemption:

    // file: include/linux/percpu.h
    #define put_cpu_var(var) do {				\
    	(void)&(var);					\
    	preempt_enable();				\
    } while (0)

get_cpu_var(var) and put_cpu_var(var) should be used as a pair, in the following form:

    get_cpu_var(var);
    ...
    //Do something with the 'var'
    ...
    put_cpu_var(var);
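
Why the pairing matters can be shown with a small userspace model. This is not the kernel API: `demo_get_cpu_hits`, `demo_put_cpu_hits` and `demo_try_migrate` are hypothetical functions, and a plain counter stands in for the preemption count.

```c
#include <assert.h>

#define DEMO_CPUS 2

static int demo_preempt_count;      /* > 0 means preemption is disabled */
static int demo_current_cpu;        /* CPU the code is running on */
static long demo_hits[DEMO_CPUS];   /* one copy of the variable per CPU */

/* Models get_cpu_var: disable preemption, then hand out this CPU's copy. */
static long *demo_get_cpu_hits(void)
{
    demo_preempt_count++;
    return &demo_hits[demo_current_cpu];
}

/* Models put_cpu_var: re-enable preemption. */
static void demo_put_cpu_hits(void)
{
    assert(demo_preempt_count > 0);   /* every put must follow a get */
    demo_preempt_count--;
}

/* The task can only move to another CPU while preemption is enabled. */
static int demo_try_migrate(int cpu)
{
    if (demo_preempt_count > 0)
        return 0;
    demo_current_cpu = cpu;
    return 1;
}

/* Typical usage: the get/put pair brackets the access. */
static void demo_count_hit(void)
{
    long *hits = demo_get_cpu_hits();

    (*hits)++;
    demo_put_cpu_hits();
}
```

Between the get and the put, demo_try_migrate refuses to move the task, so the pointer obtained from demo_get_cpu_hits always refers to the copy of the CPU the code is running on.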

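Pointer operations

Pointer operations are similar to variable operations, and the kernel provides APIs for them as well:

get_cpu_ptr(var)
put_cpu_ptr(var)

These two APIs likewise disable and re-enable preemption. The only difference is that their argument is a pointer; otherwise they work like the variable operations. They are implemented as follows: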

     // file: include/linux/percpu.h
     #define get_cpu_ptr(var) ({             \
         preempt_disable();              \
         this_cpu_ptr(var); })
     
     #define put_cpu_ptr(var) do {               \
         (void)(var);                    \
         preempt_enable();               \
     } while (0)

5. References

1. Per-CPU variables

2. Per-cpu-1- (Basic)

3. Per-cpu-2- (init)

4. Boot time memory management

5. Linker Scripts

6. Intel 64 and IA-32 Architectures Software Developer Manuals