注:本文为 "内存知识" 相关译文。
英文引文,机翻未校。
如有内容异常,请看原文。
图片清晰度受引文原图所限。
Getting Physical With Memory
深入解析内存物理机制
Jan 15th, 2009
When trying to understand complex systems, you can often learn a lot by stripping away abstractions and looking at their lowest levels. In that spirit we take a look at memory and I/O ports in their simplest and most fundamental level: the interface between the processor and bus. These details underlie higher level topics like thread synchronization and the need for the Core i7. Also, since I'm a programmer I ignore things EE people care about. Here's our friend the Core 2 again:
想要理解复杂系统,抛开抽象概念、从底层架构入手往往能收获大量知识。基于这一思路,本文将从最基础的层面讲解内存与输入输出端口,也就是处理器与总线之间的交互接口。这些底层知识是线程同步、酷睿 i7 架构设计等上层内容的基础。作为一名程序员,本文不会涉及电子工程领域关注的相关内容。下面继续以酷睿 2 处理器为例展开说明:

A Core 2 processor has 775 pins, about half of which only provide power and carry no data. Once you group the pins by functionality, the physical interface to the processor is surprisingly simple. The diagram shows the key pins involved in a memory or I/O port operation: address lines, data pins, and request pins. These operations take place in the context of a transaction on the front side bus. FSB transactions go through 5 phases: arbitration, request, snoop, response, and data. Throughout these phases, different roles are played by the components on the FSB, which are called agents. Normally the agents are all the processors plus the northbridge.
酷睿 2 处理器拥有 775 根引脚,其中约半数引脚仅用于供电,不传输数据。按照功能对引脚进行分类后可以发现,处理器的物理接口结构十分简洁。上图展示了内存与输入输出端口操作用到的核心引脚:地址线、数据引脚与请求引脚。这类操作都依托前端总线的事务 完成。前端总线事务分为 5 个阶段:仲裁阶段、请求阶段、侦听阶段、响应阶段与数据传输阶段。前端总线上参与工作的硬件单元被称为代理单元,在各个阶段中承担不同功能。通常情况下,代理单元包含所有处理器以及北桥芯片。
We only look at the request phase in this post, in which 2 packets are output by the request agent, which is usually a processor. Here are the juiciest bits of the first packet, output by the address and request pins:
本文仅介绍请求阶段 。该阶段内,请求代理单元(一般为处理器)会向外发送两组数据包。地址线与请求引脚发出的第一组数据包核心信息如下:

The address lines output the starting physical memory address for the transaction. We have 33 bits but they are interpreted as bits 35-3 of an address in which bits 2-0 are zero. Hence we have a 36-bit address, aligned to 8 bytes, for a total of $2^{36}$ bytes (64 GB) of addressable physical memory. This has been the case since the Pentium Pro. The request pins specify what type of transaction is being initiated; in I/O requests the address pins specify an I/O port rather than a memory address. After the first packet is output, the same pins transmit a second packet in the subsequent bus clock cycle:
地址线会输出本次事务对应的物理内存起始地址。系统使用 33 位信号,对应完整地址的第 35 位至第 3 位,地址的第 2 位至第 0 位固定为 0。由此构成 36 位地址,地址以 8 字节为对齐单位,可寻址的物理内存总容量为 $2^{36}$ 字节(64 GB)。自奔腾 Pro 处理器开始,该寻址规则便沿用至今。请求引脚用于标识当前启动的事务类型;若是输入输出请求,地址线标注的则是输入输出端口编号,而非内存地址。第一组数据包发送完成后,在接下来的一个总线时钟周期内,同一批引脚会继续传输第二组数据包:

The attribute signals are interesting: they reflect the 5 types of memory caching behavior available in Intel processors. By putting this information on the FSB, the request agent lets other processors know how this transaction affects their caches, and how the memory controller (northbridge) should behave. The processor determines the type of a given memory region mainly by looking at page tables, which are maintained by the kernel.
属性信号具备特殊作用,它对应英特尔处理器支持的 5 种内存缓存工作模式。请求代理单元将这类信息发送至前端总线后,其他处理器能够知晓本次事务对自身缓存造成的影响,内存控制器(北桥芯片)也可据此执行对应操作。处理器依靠内核维护的页表,判定不同内存区域对应的缓存模式。
Typically kernels treat all RAM memory as write-back, which yields the best performance. In write-back mode the unit of memory access is the cache line, 64 bytes in the Core 2. If a program reads a single byte in memory, the processor loads the whole cache line that contains that byte into the L2 and L1 caches. When a program writes to memory, the processor only modifies the line in the cache, but does not update main memory. Later, when it becomes necessary to post the modified line to the bus, the whole cache line is written at once. So most requests have 11 in their length field, for 64 bytes. Here's a read example in which the data is not in the caches:
操作系统内核通常将全部随机存取内存设置为回写模式,该模式可以实现最优运行效率。回写模式下,内存的访问单位为缓存行,酷睿 2 处理器的缓存行大小为 64 字节。程序读取内存中单个字节时,处理器会将该字节所在的整行缓存数据载入二级缓存与一级缓存。程序向内存写入数据时,处理器仅修改缓存内的数据,不会同步更新主内存。当需要将已修改的缓存数据发送至总线时,整行缓存数据会一次性完成写入。因此绝大多数请求的数据长度字段数值为 11,对应 64 字节数据量。下图为数据未命中缓存时的内存读取流程:

Some of the physical memory range in an Intel computer is mapped to devices like hard drives and network cards instead of actual RAM memory. This allows drivers to communicate with their devices by writing to and reading from memory. The kernel marks these memory regions as uncacheable in the page tables. Accesses to uncacheable memory regions are reproduced in the bus exactly as requested by a program or driver. Hence it's possible to read or write single bytes, words, and so on. This is done via the byte enable mask in packet B above.
英特尔架构计算机中,部分物理内存地址空间会映射至硬件设备,例如硬盘、网卡等,而非对应实际的随机存取内存。设备驱动程序可通过读写这片内存空间,实现与硬件设备的交互。操作系统内核会在页表中将这类内存区域标记为不可缓存。程序或驱动程序对不可缓存内存区域发起的访问请求,会原封不动地在总线上执行。因此这类内存支持单字节、单字等粒度的读写操作,该功能依靠上文数据包 B 中的字节使能掩码实现。
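The request-phase encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the one-bit-per-byte layout of the byte enable mask is an assumption that the text does not spell out. Only the 36-bit address carried as bits 35-3 comes from the article.

```python
ADDRESS_BITS = 36   # physical address width since the Pentium Pro
CHUNK_BYTES = 8     # addresses on the bus are aligned to 8 bytes


def address_line_value(phys_addr: int) -> int:
    """Return the 33-bit value driven on the address lines (bits 35-3)."""
    if not 0 <= phys_addr < (1 << ADDRESS_BITS):
        raise ValueError("address does not fit in 36 bits")
    return phys_addr >> 3  # bits 2-0 are implied to be zero


def byte_enable_mask(phys_addr: int, size: int) -> int:
    """Assumed layout: one enable bit per byte of the aligned 8-byte chunk."""
    offset = phys_addr % CHUNK_BYTES
    if size < 1 or offset + size > CHUNK_BYTES:
        raise ValueError("access must fit inside one 8-byte chunk")
    return ((1 << size) - 1) << offset


if __name__ == "__main__":
    print(hex(address_line_value(0x12345678)))
    print(bin(byte_enable_mask(0x1003, 2)))
```

Under these assumptions, a 2-byte access at address 0x1003 sets enable bits 3 and 4 of the mask, which is how an uncacheable access can touch single bytes or words.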
The primitives discussed here have many implications. For example:
本文介绍的底层运行机制会带来多方面影响,举例如下:
- Performance-sensitive applications should try to pack data that is accessed together into the same cache line. Once the cache line is loaded, further reads are much faster and extra RAM accesses are avoided.
对运行效率要求较高的应用程序,应将频繁一同访问的数据整合至同一缓存行内。缓存行完成加载后,后续的读取操作速度会大幅提升,同时减少对随机存取内存的额外访问。
- Any memory access that falls within a single cache line is guaranteed to be atomic (assuming write-back memory). Such an access is serviced by the processor's L1 cache and the data is read or written all at once; it cannot be affected halfway by other processors or threads. In particular, 32-bit and 64-bit operations that don't cross cache line boundaries are atomic.
在内存为回写模式的前提下,单次缓存行范围内的所有内存访问操作均具备原子性。这类访问由处理器一级缓存直接处理,数据会一次性完成读写,过程不会被其他处理器或线程中断。其中,不跨越缓存行边界的 32 位、64 位数据操作,同样具备原子性。
- The front side bus is shared by all agents, who must arbitrate for bus ownership before they can start a transaction. Moreover, all agents must listen to all transactions in order to maintain cache coherence. Thus bus contention becomes a severe problem as more cores and processors are added to Intel computers. The Core i7 solves this by having processors attached directly to memory and communicating in a point-to-point rather than broadcast fashion.
前端总线由所有代理单元共享,代理单元发起事务前必须竞争总线使用权。同时,为维持缓存一致性,所有代理单元都需要监听总线上的全部事务。随着处理器核心数量与处理器设备不断增加,前端总线的资源争抢问题会愈发突出。酷睿 i7 处理器对此做出改进,处理器直接与内存相连,采用点对点通信模式替代原有的广播通信模式。
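The cache line points above boil down to simple address arithmetic. Here is a minimal sketch, assuming the 64-byte Core 2 cache line mentioned in the text; the helper names are hypothetical and chosen only for illustration.

```python
CACHE_LINE_BYTES = 64  # Core 2 cache line size, per the text


def cache_line_index(addr: int) -> int:
    """Index of the cache line that contains the byte at addr."""
    return addr // CACHE_LINE_BYTES


def crosses_cache_line(addr: int, size: int) -> bool:
    """True if an access of `size` bytes starting at addr spans two lines."""
    if size < 1:
        raise ValueError("size must be positive")
    return cache_line_index(addr) != cache_line_index(addr + size - 1)


if __name__ == "__main__":
    # An 8-byte value at offset 56 fits in line 0; at offset 60 it straddles
    # lines 0 and 1, so it falls outside the atomicity case described above.
    print(crosses_cache_line(56, 8), crosses_cache_line(60, 8))
```

A check like this is handy when laying out data that is accessed together, since fields sharing a line index are loaded by a single bus request.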
These are the highlights of physical memory requests; the bus will surface again later in connection with locking, multi-threading, and cache coherence. The first time I saw FSB packet descriptions I had a huge "ahhh!" moment so I hope someone out there gets the same benefit. In the next post we'll go back up the abstraction ladder to take a thorough look at virtual memory.
以上便是物理内存请求的相关核心内容,后续讲解锁机制、多线程技术与缓存一致性时,会再次提及总线相关知识。我第一次了解前端总线数据包的细节时豁然开朗,也希望本文能为读者带来帮助。在下一篇文章中,我们将回归上层抽象概念,全面讲解虚拟内存相关内容。
Anatomy of a Program in Memory
程序内存剖析
Jan 27th, 2009
Memory management is the heart of operating systems; it is crucial for both programming and system administration. In the next few posts I'll cover memory with an eye towards practical aspects, but without shying away from internals. While the concepts are generic, examples are mostly from Linux and Windows on 32-bit x86. This first post describes how programs are laid out in memory.
内存管理是操作系统的核心;它对编程和系统管理都至关重要。在接下来的几篇文章中,我将从实用角度出发探讨内存,同时也不会回避底层实现。虽然这些概念具有通用性,但示例主要来自 Linux 和 Windows 上的 32 位 x86 架构。本文首先描述程序在内存中的布局方式。
Each process in a multi-tasking OS runs in its own memory sandbox. This sandbox is the virtual address space, which in 32-bit mode is always a 4GB block of memory addresses. These virtual addresses are mapped to physical memory by page tables, which are maintained by the operating system kernel and consulted by the processor. Each process has its own set of page tables, but there is a catch. Once virtual addresses are enabled, they apply to all software running in the machine, including the kernel itself. Thus a portion of the virtual address space must be reserved to the kernel:
多任务操作系统中的每个进程都在自己的内存沙箱中运行。这个沙箱就是虚拟地址空间 ,在 32 位模式下,它始终是一个 4 GB 的内存地址块 。这些虚拟地址通过页表 映射到物理内存,页表由操作系统内核维护,并由处理器查询。每个进程都有自己的页表,但这里有一个问题:一旦启用虚拟地址,它们就适用于机器上运行的所有软件 ,包括内核本身。因此,虚拟地址空间的一部分必须保留给内核:

This does not mean the kernel uses that much physical memory, only that it has that portion of address space available to map whatever physical memory it wishes. Kernel space is flagged in the page tables as exclusive to privileged code (ring 2 or lower), hence a page fault is triggered if user-mode programs try to touch it. In Linux, kernel space is constantly present and maps the same physical memory in all processes. Kernel code and data are always addressable, ready to handle interrupts or system calls at any time. By contrast, the mapping for the user-mode portion of the address space changes whenever a process switch happens:
这并不意味着内核使用了那么多物理内存,而只是意味着它拥有这部分地址空间,可以用来映射它想要的任何物理内存。内核空间在页表中被标记为仅供特权代码(ring 2 或更低)访问,因此如果用户态程序试图触碰它,就会触发页错误。在 Linux 中,内核空间始终存在,并且在所有进程中映射相同的物理内存。内核代码和数据始终可寻址,随时准备处理中断或系统调用。相比之下,用户态地址空间部分的映射在每次进程切换时都会发生变化:

Blue regions represent virtual addresses that are mapped to physical memory, whereas white regions are unmapped. In the example above, Firefox has used far more of its virtual address space due to its legendary memory hunger. The distinct bands in the address space correspond to memory segments like the heap, stack, and so on. Keep in mind these segments are simply a range of memory addresses and have nothing to do with Intel-style segments. Anyway, here is the standard segment layout in a Linux process:
蓝色区域代表已映射到物理内存的虚拟地址,而白色区域是未映射的。在上面的例子中,Firefox 由于其传奇般的内存消耗,使用的虚拟地址空间要多得多。地址空间中不同的带状区域对应于内存段,如堆、栈等。请记住,这些段只是内存地址的一个范围,与 Intel 风格的段毫无关系。总之,以下是 Linux 进程中的标准段布局:

When computing was happy and safe and cuddly, the starting virtual addresses for the segments shown above were exactly the same for nearly every process in a machine. This made it easy to exploit security vulnerabilities remotely. An exploit often needs to reference absolute memory locations: an address on the stack, the address for a library function, etc. Remote attackers must choose this location blindly, counting on the fact that address spaces are all the same. When they are, people get pwned. Thus address space randomization has become popular. Linux randomizes the stack, memory mapping segment, and heap by adding offsets to their starting addresses. Unfortunately the 32-bit address space is pretty tight, leaving little room for randomization and hampering its effectiveness.
在计算世界还无忧无虑、安全舒适的时候,上图所示各段的起始虚拟地址在机器上几乎每个进程中都完全相同。这使得远程利用安全漏洞变得轻而易举。漏洞利用通常需要引用绝对内存位置:栈上的地址、库函数的地址等。远程攻击者必须盲目地选择这个位置,依赖于所有地址空间都相同这一事实。当它们确实相同时,人们就被攻陷了。因此,地址空间随机化变得流行起来。Linux 通过在起始地址上添加偏移量来随机化栈、内存映射段和堆。不幸的是,32 位地址空间相当紧张,留给随机化的空间很小,从而限制了其有效性。
The topmost segment in the process address space is the stack, which stores local variables and function parameters in most programming languages. Calling a method or function pushes a new stack frame onto the stack. The stack frame is destroyed when the function returns. This simple design, possible because the data obeys strict LIFO order, means that no complex data structure is needed to track stack contents - a simple pointer to the top of the stack will do. Pushing and popping are thus very fast and deterministic. Also, the constant reuse of stack regions tends to keep active stack memory in the CPU caches, speeding up access. Each thread in a process gets its own stack.
进程地址空间中最顶部的段是栈,在大多数编程语言中,它存储局部变量和函数参数。调用方法或函数时,会将一个新的栈帧压入栈。当函数返回时,栈帧被销毁。这种简单的设计之所以可行,是因为数据遵循严格的 LIFO(后进先出)顺序,这意味着不需要复杂的数据结构来跟踪栈内容,一个简单的指向栈顶的指针就足够了。因此,压栈和弹栈非常快速且确定。此外,栈区域的持续复用有助于将活跃的栈内存保留在 CPU 缓存中,从而加快访问速度。进程中的每个线程都有自己的栈。
It is possible to exhaust the area mapping the stack by pushing more data than it can fit. This triggers a page fault that is handled in Linux by expand_stack(), which in turn calls acct_stack_growth() to check whether it's appropriate to grow the stack. If the stack size is below RLIMIT_STACK (usually 8MB), then normally the stack grows and the program continues merrily, unaware of what just happened. This is the normal mechanism whereby stack size adjusts to demand. However, if the maximum stack size has been reached, we have a stack overflow and the program receives a Segmentation Fault. While the mapped stack area expands to meet demand, it does not shrink back when the stack gets smaller. Like the federal budget, it only expands.
有可能通过压入超过栈可容纳的数据量来耗尽栈的映射区域。这会触发一个页错误,在 Linux 中由 expand_stack() 处理,后者又调用 acct_stack_growth() 来检查是否可以增长栈。如果栈大小低于 RLIMIT_STACK(通常为 8 MB),那么栈通常会增长,程序继续愉快地运行,完全不知道刚刚发生了什么。这是栈大小根据需求调整的正常机制。然而,如果已达到最大栈大小,就会发生栈溢出,程序收到段错误。虽然映射的栈区域会扩展以满足需求,但当栈变小时,它不会收缩。就像联邦预算一样,它只增不减。
Dynamic stack growth is the only situation in which access to an unmapped memory region, shown in white above, might be valid. Any other access to unmapped memory triggers a page fault that results in a Segmentation Fault. Some mapped areas are read-only, hence write attempts to these areas also lead to segfaults.
动态栈增长是访问未映射内存区域(上图中的白色区域)可能有效的唯一情况。任何其他对未映射内存的访问都会触发页错误,导致段错误。某些映射区域是只读的,因此对这些区域的写入尝试也会导致段错误。
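To see the RLIMIT_STACK value that bounds stack growth on a given machine, a short sketch using Python's standard `resource` module (Unix-only) may help. The helper names are hypothetical, and the reported value depends on the system and shell configuration, so no particular output is assumed here.

```python
import resource


def describe_limit(value: int) -> str:
    """Format a limit in whole megabytes, or 'unlimited' when no cap is set."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // (1024 * 1024)} MB"


def stack_limits() -> tuple:
    """Return the (soft, hard) RLIMIT_STACK values as readable strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print(f"soft stack limit: {soft}, hard stack limit: {hard}")
```

The soft limit is the one the text refers to: once the stack would grow past it, the fault ends in a Segmentation Fault instead of a larger stack.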
Below the stack, we have the memory mapping segment. Here the kernel maps contents of files directly to memory. Any application can ask for such a mapping via the Linux mmap() system call or CreateFileMapping() / MapViewOfFile() in Windows. Memory mapping is a convenient and high-performance way to do file I/O, so it is used for loading dynamic libraries. It is also possible to create an anonymous memory mapping that does not correspond to any files, being used instead for program data. In Linux, if you request a large block of memory via malloc(), the C library will create such an anonymous mapping instead of using heap memory. 'Large' means larger than MMAP_THRESHOLD bytes, 128 kB by default and adjustable via mallopt().
在栈的下方,是内存映射段。在这里,内核将文件内容直接映射到内存。任何应用程序都可以通过 Linux 的 mmap() 系统调用或 Windows 的 CreateFileMapping() / MapViewOfFile() 来请求这种映射。内存映射是一种方便且高性能的文件 I/O 方式,因此它被用于加载动态库。也可以创建一种匿名内存映射,它不对应任何文件,而是用于程序数据。在 Linux 中,如果你通过 malloc() 请求一大块内存,C 库会创建这样的匿名映射,而不是使用堆内存。"大"指的是超过 MMAP_THRESHOLD 字节,默认值为 128 kB,可通过 mallopt() 调整。
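Both kinds of mapping can be tried from user space. The sketch below uses Python's standard `mmap` module, which wraps mmap() on Linux and the file-mapping APIs on Windows. It requests mappings directly rather than through malloc(), and the function names and the 256 kB size are arbitrary choices for illustration.

```python
import mmap
import tempfile


def anonymous_roundtrip(size: int = 256 * 1024) -> bytes:
    """Create an anonymous mapping, write to it, and read the bytes back."""
    # A file descriptor of -1 asks for memory that is not backed by any file.
    with mmap.mmap(-1, size) as region:
        region[:5] = b"gonzo"
        return region[:5]


def file_roundtrip(payload: bytes) -> bytes:
    """Write payload to a temporary file, then read it through a mapping."""
    with tempfile.TemporaryFile() as f:
        f.write(payload)
        f.flush()
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[:len(payload)]


if __name__ == "__main__":
    print(anonymous_roundtrip())
    print(file_roundtrip(b"scene data"))
```

In the file-backed case the bytes are read by indexing memory rather than by calling read(), which is the convenience the paragraph above describes.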
Speaking of the heap, it comes next in our plunge into address space. The heap provides runtime memory allocation, like the stack, meant for data that must outlive the function doing the allocation, unlike the stack. Most languages provide heap management to programs. Satisfying memory requests is thus a joint affair between the language runtime and the kernel. In C, the interface to heap allocation is malloc() and friends, whereas in a garbage-collected language like C# the interface is the new keyword.
说到堆,它是我们深入地址空间的下一个目标。与栈一样,堆提供运行时内存分配,但用于那些必须比执行分配操作的函数存活更久的数据。大多数语言为程序提供堆管理。因此,满足内存请求是语言运行时和内核之间的共同事务。在 C 中,堆分配的接口是 malloc() 及其相关函数,而在像 C# 这样的垃圾回收语言中,接口是 new 关键字。
If there is enough space in the heap to satisfy a memory request, it can be handled by the language runtime without kernel involvement. Otherwise the heap is enlarged via the brk() system call to make room for the requested block. Heap management is complex, requiring sophisticated algorithms that strive for speed and efficient memory usage in the face of our programs' chaotic allocation patterns. The time needed to service a heap request can vary substantially. Real-time systems have special-purpose allocators to deal with this problem. Heaps also become fragmented, shown below:
如果堆中有足够的空间来满足内存请求,语言运行时可以在不涉及内核的情况下处理它。否则,堆通过 brk() 系统调用来扩大,为请求的块腾出空间。堆管理是复杂的,需要精妙的算法,以在程序混乱的分配模式下追求速度和高效的内存使用。满足堆请求所需的时间可能差异很大。实时系统有专门的分配器来处理这个问题。堆也会变得碎片化,如下图所示:

Finally, we get to the lowest segments of memory: BSS, data, and program text. Both BSS and data store contents for static (global) variables in C. The difference is that BSS stores the contents of uninitialized static variables, whose values are not set by the programmer in source code. The BSS memory area is anonymous: it does not map any file. If you say static int cntActiveUsers, the contents of cntActiveUsers live in the BSS.
最后,我们来到内存的最低段:BSS、数据和程序文本。BSS 和数据都存储 C 中静态(全局)变量的内容。区别在于,BSS 存储的是未初始化 的静态变量的内容,其值在源代码中未被程序员设置。BSS 内存区域是匿名的:它不映射任何文件。如果你写 static int cntActiveUsers,cntActiveUsers 的内容就存在于 BSS 中。
The data segment, on the other hand, holds the contents for static variables initialized in source code. This memory area is not anonymous. It maps the part of the program's binary image that contains the initial static values given in source code. So if you say static int cntWorkerBees = 10, the contents of cntWorkerBees live in the data segment and start out as 10. Even though the data segment maps a file, it is a private memory mapping, which means that updates to memory are not reflected in the underlying file. This must be the case, otherwise assignments to global variables would change your on-disk binary image. Inconceivable!
另一方面,数据段保存源代码中已初始化的静态变量的内容。这个内存区域不是匿名的 。它映射程序二进制镜像中包含源代码中给出的初始静态值的部分。所以如果你写 static int cntWorkerBees = 10,cntWorkerBees 的内容就存在于数据段中,初始值为 10。尽管数据段映射了一个文件,但它是私有内存映射,这意味着对内存的更新不会反映到底层文件中。必须如此,否则对全局变量的赋值会改变你的磁盘二进制镜像。不可想象!
The data example in the diagram is trickier because it uses a pointer. In that case, the contents of pointer gonzo - a 4-byte memory address - live in the data segment. The actual string it points to does not, however. The string lives in the text segment, which is read-only and stores all of your code in addition to tidbits like string literals. The text segment also maps your binary file in memory, but writes to this area earn your program a Segmentation Fault. This helps prevent pointer bugs, though not as effectively as avoiding C in the first place. Here's a diagram showing these segments and our example variables:
图中的数据示例更复杂,因为它使用了指针。在这种情况下,指针 gonzo 的内容(一个 4 字节的内存地址)存在于数据段中。然而,它所指向的实际字符串并不在数据段中。该字符串存在于文本段中,该段是只读的,除了存储所有代码外,还存储字符串字面量等零碎内容。文本段也将你的二进制文件映射到内存中,但对该区域的写入会使你的程序收到段错误。这有助于防止指针错误,尽管不如一开始就不使用 C 语言那么有效。下图展示了这些段和我们的示例变量:

You can examine the memory areas in a Linux process by reading the file /proc/pid_of_process/maps. Keep in mind that a segment may contain many areas. For example, each memory mapped file normally has its own area in the mmap segment, and dynamic libraries have extra areas similar to BSS and data. The next post will clarify what 'area' really means. Also, sometimes people say "data segment" meaning all of data + bss + heap.
你可以通过读取文件 /proc/pid_of_process/maps 来检查 Linux 进程中的内存区域。请记住,一个段可能包含多个区域。例如,每个内存映射文件通常在 mmap 段中有自己的区域,动态库有类似 BSS 和数据的额外区域。下一篇文章将阐明"区域"的真正含义。此外,有时人们说"数据段"指的是数据 + BSS + 堆的全部。
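As a rough illustration of what the maps file contains, here is a small parser sketch. The `Area` type and function names are hypothetical, the sample lines in the usage are made up, and only the fields this sketch needs (address range, permissions, optional path) are kept.

```python
from typing import List, NamedTuple


class Area(NamedTuple):
    start: int
    end: int     # exclusive end address
    perms: str   # e.g. "r-xp": read, no write, execute, private
    path: str    # empty string for anonymous areas


def parse_maps_line(line: str) -> Area:
    """Parse one line: 'start-end perms offset dev inode [path]'."""
    fields = line.split(maxsplit=5)
    start, end = (int(part, 16) for part in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return Area(start, end, fields[1], path)


def read_maps(pid: str = "self") -> List[Area]:
    """Read every memory area of a process (Linux only)."""
    with open(f"/proc/{pid}/maps") as maps_file:
        return [parse_maps_line(line) for line in maps_file if line.strip()]


if __name__ == "__main__":
    for area in read_maps():
        size_kb = (area.end - area.start) // 1024
        print(f"{area.start:#x} {size_kb:>8} kB {area.perms} {area.path}")
```

Running it against `self` lists the interpreter's own areas, where the stack, heap, and each mapped library show up as separate entries.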
You can examine binary images using the nm and objdump commands to display symbols, their addresses, segments, and so on. Finally, the virtual address layout described above is the "flexible" layout in Linux, which has been the default for a few years. It assumes that we have a value for RLIMIT_STACK. When that's not the case, Linux reverts back to the "classic" layout shown below:
你可以使用 nm 和 objdump 命令来检查二进制镜像,以显示符号、它们的地址、段等。最后,上述虚拟地址布局是 Linux 中的"灵活"布局,它已经成为默认布局几年了。它假设我们有一个 RLIMIT_STACK 的值。当不是这种情况时,Linux 会回退到下面所示的"经典"布局:

That's it for virtual address space layout. The next post discusses how the kernel keeps track of these memory areas. Coming up we'll look at memory mapping, how file reading and writing ties into all this and what memory usage figures mean.
以上就是虚拟地址空间布局的全部内容。下一篇文章将讨论内核如何跟踪这些内存区域。接下来我们将研究内存映射、文件读写如何与这一切关联,以及内存使用数据的含义。
How The Kernel Manages Your Memory
内核如何管理内存
Feb 4th, 2009
2009 年 2 月 4 日
After examining the virtual address layout of a process, we turn to the kernel and its mechanisms for managing user memory. Here is gonzo again:
在审视了进程的虚拟地址布局之后,我们来关注内核及其管理用户内存的机制。再次请出 Gonzo:

Linux processes are implemented in the kernel as instances of task_struct, the process descriptor. The mm field in task_struct points to the memory descriptor, mm_struct, which is an executive summary of a program's memory. It stores the start and end of memory segments as shown above, the number of physical memory pages used by the process (rss stands for Resident Set Size), the amount of virtual address space used, and other tidbits. Within the memory descriptor we also find the two work horses for managing program memory: the set of virtual memory areas and the page tables. Gonzo's memory areas are shown below:
Linux 进程在内核中以 task_struct 的实例形式实现,即进程描述符。task_struct 中的 mm 字段指向内存描述符 mm_struct,它是一份程序内存的执行摘要。它存储了如上图所示的内存段起始和结束地址、进程使用的物理内存页数量(rss 代表驻留集大小,Resident Set Size)、已使用的虚拟地址空间大小以及其他细节。在内存描述符中,我们还可以找到管理程序内存的两个主力军:虚拟内存区域集合和页表。Gonzo 的内存区域如下所示:

Each virtual memory area (VMA) is a contiguous range of virtual addresses; these areas never overlap. An instance of vm_area_struct fully describes a memory area, including its start and end addresses, flags to determine access rights and behaviors, and the vm_file field to specify which file is being mapped by the area, if any. A VMA that does not map a file is anonymous. Each memory segment above (e.g., heap, stack) corresponds to a single VMA, with the exception of the memory mapping segment. This is not a requirement, though it is usual in x86 machines. VMAs do not care which segment they are in.
每个虚拟内存区域(VMA)是一段连续的虚拟地址范围;这些区域互不重叠。vm_area_struct 的实例完整描述了一个内存区域,包括其起始和结束地址、用于确定访问权限和行为的标志位,以及 vm_file 字段(用于指明该区域映射的文件,如果有的话)。不映射文件的 VMA 称为匿名区域。上述每个内存段(例如堆、栈)对应一个单独的 VMA,内存映射段除外。这不是硬性要求,但在 x86 机器上通常如此。VMA 并不关心自己属于哪个段。
A program's VMAs are stored in its memory descriptor both as a linked list in the mmap field, ordered by starting virtual address, and as a red-black tree rooted at the mm_rb field. The red-black tree allows the kernel to search quickly for the memory area covering a given virtual address. When you read file /proc/pid_of_process/maps, the kernel is simply going through the linked list of VMAs for the process and printing each one.
程序的 VMA 存储在其内存描述符中,既以 mmap 字段中的链表形式按起始虚拟地址排序,也以 mm_rb 字段为根的红黑树形式组织。红黑树使内核能够快速搜索覆盖给定虚拟地址的内存区域。当你读取文件 /proc/pid_of_process/maps 时,内核只是在遍历该进程的 VMA 链表并逐个打印。
In Windows, the EPROCESS block is roughly a mix of task_struct and mm_struct. The Windows analog to a VMA is the Virtual Address Descriptor, or VAD; they are stored in an AVL tree. You know what the funniest thing about Windows and Linux is? It's the little differences.
在 Windows 中,EPROCESS 块大致是 task_struct 和 mm_struct 的混合体。Windows 中与 VMA 对应的概念是虚拟地址描述符,即 VAD;它们以 AVL 树的形式存储。你知道 Windows 和 Linux 最有趣的地方是什么吗?就是那些细微的差异。
The 4GB virtual address space is divided into pages. x86 processors in 32-bit mode support page sizes of 4KB, 2MB, and 4MB. Both Linux and Windows map the user portion of the virtual address space using 4KB pages. Bytes 0-4095 fall in page 0, bytes 4096-8191 fall in page 1, and so on. The size of a VMA must be a multiple of page size. Here's 3GB of user space in 4KB pages:
4 GB 的虚拟地址空间被划分为页。32 位模式下的 x86 处理器支持 4 KB、2 MB 和 4 MB 的页大小。Linux 和 Windows 都使用 4 KB 页来映射虚拟地址空间的用户部分。字节 0-4095 属于第 0 页,字节 4096-8191 属于第 1 页,依此类推。VMA 的大小必须是页大小的整数倍。以下是按 4 KB 页划分的 3 GB 用户空间:

The processor consults page tables to translate a virtual address into a physical memory address. Each process has its own set of page tables; whenever a process switch occurs, page tables for user space are switched as well. Linux stores a pointer to a process' page tables in the pgd field of the memory descriptor. To each virtual page there corresponds one page table entry (PTE) in the page tables, which in regular x86 paging is a simple 4-byte record shown below:
处理器通过查询页表将虚拟地址转换为物理内存地址。每个进程都有自己的一组页表;每当发生进程切换时,用户空间的页表也会被切换。Linux 将指向进程页表的指针存储在内存描述符的 pgd 字段中。每个虚拟页在页表中对应一个页表项(PTE),在常规 x86 分页中,它是一个简单的 4 字节记录,如下所示:

Linux has functions to read and set each flag in a PTE. Bit P tells the processor whether the virtual page is present in physical memory. If clear (equal to 0), accessing the page triggers a page fault. Keep in mind that when this bit is zero, the kernel can do whatever it pleases with the remaining fields. The R/W flag stands for read/write; if clear, the page is read-only. Flag U/S stands for user/supervisor; if clear, then the page can only be accessed by the kernel. These flags are used to implement the read-only memory and protected kernel space we saw before.
Linux 提供了读取和设置 PTE 中每个标志位的函数。P 位告诉处理器该虚拟页是否存在于物理内存中。如果该位清零(等于 0),访问该页将触发缺页异常。请记住,当该位为零时,内核可以对剩余字段随意处置。R/W 标志代表读/写;如果清零,该页为只读。U/S 标志代表用户/特权级;如果清零,则该页只能由内核访问。这些标志位用于实现我们之前看到的只读内存和受保护的内核空间。
Bits D and A are for dirty and accessed. A dirty page has had a write, while an accessed page has had a write or read. Both flags are sticky: the processor only sets them, they must be cleared by the kernel. Finally, the PTE stores the starting physical address that corresponds to this page, aligned to 4KB. This naive-looking field is the source of some pain, for it limits addressable physical memory to 4 GB. The other PTE fields are for another day, as is Physical Address Extension.
D 位和 A 位分别表示脏页和已访问页。脏页表示有过写操作,已访问页表示有过读或写操作。这两个标志位都是粘滞的:处理器只能设置它们,必须由内核来清除。最后,PTE 存储了与该页对应的起始物理地址,按 4 KB 对齐。这个看似简单的字段却是一些痛苦的根源,因为它将可寻址的物理内存限制在 4 GB。PTE 的其他字段以及物理地址扩展(PAE)留待日后讨论。
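The flags discussed above can be pulled out of a 4-byte PTE with a few masks. The sketch below assumes the standard bit positions of a 32-bit x86 page table entry without PAE (P in bit 0, R/W in bit 1, U/S in bit 2, A in bit 5, D in bit 6, frame address in bits 31-12), since the original figure is not reproduced here. The function and constant names are hypothetical.

```python
PAGE_SHIFT = 12            # 4 KB pages
PTE_PRESENT = 1 << 0       # P
PTE_RW = 1 << 1            # R/W
PTE_USER = 1 << 2          # U/S
PTE_ACCESSED = 1 << 5      # A
PTE_DIRTY = 1 << 6         # D
FRAME_MASK = 0xFFFFF000    # bits 31-12: physical address of the page frame


def page_number(vaddr: int) -> int:
    """Virtual page that contains vaddr (bytes 0-4095 are page 0)."""
    return vaddr >> PAGE_SHIFT


def decode_pte(pte: int) -> dict:
    """Split a 32-bit PTE into the fields discussed in the text."""
    return {
        "present": bool(pte & PTE_PRESENT),
        "writable": bool(pte & PTE_RW),
        "user": bool(pte & PTE_USER),
        "accessed": bool(pte & PTE_ACCESSED),
        "dirty": bool(pte & PTE_DIRTY),
        "frame_address": pte & FRAME_MASK,
    }


if __name__ == "__main__":
    print(page_number(4096), decode_pte(0x12345067))
```

The 4 GB ceiling follows from the same layout: 20 bits of frame address select one of 2^20 frames of 4 KB each, which is 2^32 bytes.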
A virtual page is the unit of memory protection because all of its bytes share the U/S and R/W flags. However, the same physical memory could be mapped by different pages, possibly with different protection flags. Notice that execute permissions are nowhere to be seen in the PTE. This is why classic x86 paging allows code on the stack to be executed, making it easier to exploit stack buffer overflows (it's still possible to exploit non-executable stacks using return-to-libc and other techniques). This lack of a PTE no-execute flag illustrates a broader fact: permission flags in a VMA may or may not translate cleanly into hardware protection. The kernel does what it can, but ultimately the architecture limits what is possible.
虚拟页是内存保护的基本单位,因为其所有字节共享 U/S 和 R/W 标志位。然而,同一块物理内存可能被不同的页映射,且可能带有不同的保护标志位。请注意,PTE 中根本看不到执行权限。这就是为什么经典的 x86 分页允许执行栈上的代码,从而更容易利用栈缓冲区溢出(即使栈不可执行,仍然可以通过 return-to-libc 和其他技术来利用)。PTE 缺少不可执行标志位说明了一个更广泛的事实:VMA 中的权限标志位可能无法完全对应到硬件保护。内核尽其所能,但最终架构限制了可实现的功能。
Virtual memory doesn't store anything, it simply maps a program's address space onto the underlying physical memory, which is accessed by the processor as a large block called the physical address space. While memory operations on the bus are somewhat involved, we can ignore that here and assume that physical addresses range from zero to the top of available memory in one-byte increments. This physical address space is broken down by the kernel into page frames. The processor doesn't know or care about frames, yet they are crucial to the kernel because the page frame is the unit of physical memory management. Both Linux and Windows use 4KB page frames in 32-bit mode; here is an example of a machine with 2GB of RAM:
虚拟内存并不存储任何内容,它只是将程序的地址空间映射到底层物理内存之上,处理器将物理内存作为一个称为物理地址空间的大块来访问。虽然总线上的内存操作有些复杂,但这里我们可以忽略这一点,并假设物理地址从 0 开始以 1 字节为增量一直到可用内存的顶端。内核将物理地址空间划分为页帧。处理器并不知道也不关心页帧的存在,但它们对内核至关重要,因为页帧是物理内存管理的基本单位。Linux 和 Windows 在 32 位模式下都使用 4 KB 的页帧;以下是一台拥有 2 GB RAM 的机器的示例:

In Linux each page frame is tracked by a descriptor and several flags. Together these descriptors track the entire physical memory in the computer; the precise state of each page frame is always known. Physical memory is managed with the buddy memory allocation technique, hence a page frame is free if it's available for allocation via the buddy system. An allocated page frame might be anonymous, holding program data, or it might be in the page cache, holding data stored in a file or block device. There are other exotic page frame uses, but leave them alone for now. Windows has an analogous Page Frame Number (PFN) database to track physical memory.
在 Linux 中,每个页帧由一个描述符和若干标志位来跟踪。这些描述符共同跟踪计算机中的全部物理内存;每个页帧的精确状态始终可知。物理内存通过伙伴内存分配技术来管理,因此如果页帧可以通过伙伴系统分配,则它是空闲的。已分配的页帧可能是匿名的,保存程序数据;也可能位于页缓存中,保存文件或块设备中存储的数据。还有其他一些特殊的页帧用途,但暂时先不管它们。Windows 有一个类似的页帧号(PFN)数据库来跟踪物理内存。
Let's put together virtual memory areas, page table entries and page frames to understand how this all works. Below is an example of a user heap:
让我们把虚拟内存区域、页表项和页帧结合起来,理解这一切是如何工作的。以下是一个用户堆的示例:

Blue rectangles represent pages in the VMA range, while arrows represent page table entries mapping pages onto page frames. Some virtual pages lack arrows; this means their corresponding PTEs have the Present flag clear. This could be because the pages have never been touched or because their contents have been swapped out. In either case access to these pages will lead to page faults, even though they are within the VMA. It may seem strange for the VMA and the page tables to disagree, yet this often happens.
蓝色矩形代表 VMA 范围内的页,箭头代表将页映射到页帧的页表项。某些虚拟页缺少箭头;这意味着它们对应的 PTE 的 Present 标志位被清零。这可能是因为这些页从未被访问过,或者因为它们的内容已被换出。无论哪种情况,访问这些页都会导致缺页异常,即使它们位于 VMA 范围内。VMA 和页表之间出现不一致可能看起来很奇怪,但这经常发生。
A VMA is like a contract between your program and the kernel. You ask for something to be done (memory allocated, a file mapped, etc.), the kernel says "sure", and it creates or updates the appropriate VMA. But it does not actually honor the request right away, it waits until a page fault happens to do real work. The kernel is a lazy, deceitful sack of scum; this is the fundamental principle of virtual memory. It applies in most situations, some familiar and some surprising, but the rule is that VMAs record what has been agreed upon, while PTEs reflect what has actually been done by the lazy kernel. These two data structures together manage a program's memory; both play a role in resolving page faults, freeing memory, swapping memory out, and so on. Let's take the simple case of memory allocation:
VMA 就像你的程序与内核之间的一份契约。你请求做某事(分配内存、映射文件等),内核说"好的",然后创建或更新相应的 VMA。但它并不会立即履行请求,而是等到发生缺页异常时才真正干活。内核是一个懒惰、狡猾的混蛋;这是虚拟内存的基本原则。它适用于大多数情况,有些我们熟悉,有些则令人惊讶,但规则是:VMA 记录的是已经商定的事项,而 PTE 反映的是懒惰的内核实际已经完成的事项。这两个数据结构共同管理程序的内存;两者都在解决缺页异常、释放内存、换出内存等过程中发挥作用。让我们来看内存分配的简单情况:

When the program asks for more memory via the brk() system call, the kernel simply updates the heap VMA and calls it good. No page frames are actually allocated at this point and the new pages are not present in physical memory. Once the program tries to access the pages, the processor page faults and do_page_fault() is called. It searches for the VMA covering the faulted virtual address using find_vma(). If found, the permissions on the VMA are also checked against the attempted access (read or write). If there's no suitable VMA, no contract covers the attempted memory access and the process is punished by Segmentation Fault.
当程序通过 brk() 系统调用请求更多内存时,内核只是简单地更新堆 VMA 就算完事。此时实际上并未分配任何页帧,新页也不存在于物理内存中。一旦程序尝试访问这些页,处理器就会产生缺页异常并调用 do_page_fault()。它使用 find_vma() 搜索覆盖故障虚拟地址的 VMA。如果找到,还会根据尝试的访问方式(读或写)检查 VMA 的权限。如果没有合适的 VMA,则没有契约覆盖这次内存访问尝试,进程就会受到段错误的惩罚。
When a VMA is found the kernel must handle the fault by looking at the PTE contents and the type of VMA. In our case, the PTE shows the page is not present. In fact, our PTE is completely blank (all zeros), which in Linux means the virtual page has never been mapped. Since this is an anonymous VMA, we have a purely RAM affair that must be handled by do_anonymous_page(), which allocates a page frame and makes a PTE to map the faulted virtual page onto the freshly allocated frame.
当找到 VMA 后,内核必须通过查看 PTE 内容和 VMA 类型来处理故障。在我们的例子中,PTE 显示该页不存在。事实上,我们的 PTE 完全是空白的(全为零),这在 Linux 中意味着该虚拟页从未被映射过。由于这是一个匿名 VMA,我们面对的是纯粹的 RAM 事务,必须由 do_anonymous_page() 来处理,它分配一个页帧并创建一个 PTE,将故障虚拟页映射到新分配的帧上。
Things could have been different. The PTE for a swapped out page, for example, has 0 in the Present flag but is not blank. Instead, it stores the swap location holding the page contents, which must be read from disk and loaded into a page frame by do_swap_page() in what is called a major fault.
事情本可以有所不同。例如,被换出页的 PTE 在 Present 标志位上为 0,但并非空白。相反,它存储了保存页内容的交换位置,必须从磁盘读取并由 do_swap_page() 加载到页帧中,这称为主要缺页异常(major fault)。
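The decision path walked through above (look up the VMA, check permissions, then inspect the PTE) can be modeled with a toy simulation. This is not kernel code. The `VMA` type, `lookup_vma`, and `handle_fault` are hypothetical names, PTEs are kept in a plain dictionary keyed by page number, and stack growth and file-backed areas are deliberately ignored.

```python
from bisect import bisect_right
from typing import Dict, List, NamedTuple, Optional

PAGE_SIZE = 4096
PTE_PRESENT = 1  # bit 0 of the toy PTE


class VMA(NamedTuple):
    start: int
    end: int        # exclusive
    writable: bool


def lookup_vma(vmas: List[VMA], addr: int) -> Optional[VMA]:
    """Find the area covering addr; vmas must be sorted and non-overlapping."""
    index = bisect_right([vma.start for vma in vmas], addr) - 1
    if index >= 0 and addr < vmas[index].end:
        return vmas[index]
    return None


def handle_fault(vmas: List[VMA], ptes: Dict[int, int],
                 addr: int, write: bool) -> str:
    """Classify an access the way the text describes (toy model only)."""
    vma = lookup_vma(vmas, addr)
    if vma is None or (write and not vma.writable):
        return "segfault"                  # no contract covers this access
    pte = ptes.get(addr // PAGE_SIZE, 0)   # a missing entry is a blank PTE
    if pte & PTE_PRESENT:
        return "present"                   # already mapped to a page frame
    if pte == 0:
        return "allocate anonymous page"   # never mapped: get a fresh frame
    return "swap in (major fault)"         # not present, but not blank


if __name__ == "__main__":
    areas = [VMA(0x1000, 0x3000, True)]
    print(handle_fault(areas, {}, 0x2004, True))
```

The binary search stands in for the red-black tree lookup; either way the point is that the VMA decides whether the access is legal, and the PTE decides what work remains.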
This concludes the first half of our tour through the kernel's user memory management. In the next post, we'll throw files into the mix to build a complete picture of memory fundamentals, including consequences for performance.
至此,我们完成了内核用户内存管理之旅的前半部分。在下一篇文章中,我们将把文件加入进来,构建一幅完整的内存基础图景,包括对性能的影响。
Page Cache, the Affair Between Memory and Files
页面缓存:内存与文件的交互
Feb 11th, 2009
Previously we looked at how the kernel manages virtual memory for a user process, but files and I/O were left out. This post covers the important and often misunderstood relationship between files and memory and its consequences for performance.
此前我们介绍过内核如何为用户进程管理虚拟内存,但并未提及文件与输入输出。本文讲解文件与内存之间常被误解的关联,以及该关联对运行性能带来的影响。
Two serious problems must be solved by the OS when it comes to files. The first one is the mind-blowing slowness of hard drives, and disk seeks in particular, relative to memory. The second is the need to load file contents in physical memory once and share the contents among programs. If you use Process Explorer to poke at Windows processes, you'll see there are ~15 MB worth of common DLLs loaded in every process. My Windows box right now is running 100 processes, so without sharing I'd be using up to ~1.5 GB of physical RAM just for common DLLs. No good. Likewise, nearly all Linux programs need ld.so and libc, plus other common libraries.
操作系统在处理文件时需要解决两大问题。第一,相较于内存,机械硬盘的运行速度差距极大,磁盘寻道操作的耗时尤为突出。第二,文件内容只需加载一次至物理内存,便可供多个程序共享使用。使用进程资源管理器查看 Windows 进程能够发现,每个进程都会加载约 15 MB 的通用动态链接库。我当前的 Windows 设备正在运行 100 个进程,若不启用共享机制,仅通用动态链接库就会占用约 1.5 GB 物理内存,这种方式显然不可取。与之类似,几乎所有 Linux 程序都依赖 ld.so、libc 以及其他通用库文件。
Happily, both problems can be dealt with in one shot: the page cache, where the kernel stores page-sized chunks of files. To illustrate the page cache, I'll conjure a Linux program named render, which opens file scene.dat and reads it 512 bytes at a time, storing the file contents into a heap-allocated block. The first read goes like this:
页面缓存可以同时解决上述两个问题,内核会将文件数据按照内存页的大小分块存入页面缓存。为直观讲解页面缓存,这里假设存在一个名为 render 的 Linux 程序,该程序打开 scene.dat 文件,每次读取 512 字节数据,并将数据存入堆内存块中。首次读取的流程如下:

After 12 KB have been read, render's heap and the relevant page frames look thus:
读取 12 KB 数据后,render 程序的堆内存以及对应物理页帧的状态如下:

This looks innocent enough, but there's a lot going on. First, even though this program uses regular read calls, three 4 KB page frames are now in the page cache storing part of scene.dat. People are sometimes surprised by this, but all regular file I/O happens through the page cache. In x86 Linux, the kernel thinks of a file as a sequence of 4 KB chunks. If you read a single byte from a file, the whole 4 KB chunk containing the byte you asked for is read from disk and placed into the page cache. This makes sense because sustained disk throughput is pretty good and programs normally read more than just a few bytes from a file region. The page cache knows the position of each 4 KB chunk within the file, depicted above as #0, #1, etc. Windows uses 256 KB views analogous to pages in the Linux page cache.
该过程看似简单,底层却包含多项操作。即便程序调用标准 read 接口,scene.dat 的部分数据也会被存入 3 个 4 KB 大小的物理页帧,并驻留在页面缓存中。不少人对此并不了解,事实上所有常规文件输入输出操作都依托页面缓存完成 。在 x86 架构的 Linux 系统中,内核会将文件划分为连续的 4 KB 数据块。即便仅读取文件内 1 个字节,系统也会从硬盘读取该字节所在的整个 4 KB 数据块,并载入页面缓存。该设计具备合理性,硬盘的连续读写吞吐量表现优异,且程序通常不会仅读取文件局部的少量数据。页面缓存会记录每个 4 KB 数据块在文件内的位置,上文图示中以 #0、#1 等编号标识。Windows 系统采用 256 KB 的视图机制,作用等同于 Linux 页面缓存中的内存页。
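The chunk arithmetic is easy to check by hand. This small sketch assumes the 4 KB page size and 512-byte reads given in the text; `page_of` and `pages_touched` are illustrative names, not kernel functions. It shows which page-cache chunk a file offset lands in and why 12 KB of sequential reads ends up occupying exactly three page frames.

```python
PAGE_SIZE = 4096   # page size the text assumes for x86 Linux
READ_SIZE = 512    # size of each read issued by render

def page_of(offset):
    """Return (page index within the file, byte offset inside that page)."""
    return divmod(offset, PAGE_SIZE)

def pages_touched(total_bytes, read_size=READ_SIZE):
    """Page indexes touched when reading total_bytes sequentially in chunks."""
    touched = []
    for start in range(0, total_bytes, read_size):
        first, _ = page_of(start)
        last, _ = page_of(min(start + read_size, total_bytes) - 1)
        for idx in range(first, last + 1):
            if idx not in touched:
                touched.append(idx)
    return touched

print(page_of(5000))               # (1, 904): byte 5000 lives in chunk #1
print(pages_touched(12 * 1024))    # [0, 1, 2]: three 4 KB chunks
```

Eight 512-byte reads fit in one page, so only the first read of each group of eight can miss the page cache; the other seven are served from a page that is already resident.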
Sadly, in a regular file read the kernel must copy the contents of the page cache into a user buffer, which not only takes cpu time and hurts the cpu caches, but also wastes physical memory with duplicate data. As per the diagram above, the scene.dat contents are stored twice, and each instance of the program would store the contents an additional time. We've mitigated the disk latency problem but failed miserably at everything else. Memory-mapped files are the way out of this madness:
常规文件读取存在明显缺陷,内核需要将页面缓存中的数据拷贝至用户缓冲区。这一操作不仅占用处理器运算资源、影响处理器缓存的使用效率,还会因数据冗余浪费物理内存 。结合上文图示可见,scene.dat 的数据会被存储两份,程序每新增一个运行实例,数据副本就会再增加一份。这种方式缓解了硬盘读写延迟问题,但在其他方面表现较差。内存映射文件可以解决这类问题:

When you use file mapping, the kernel maps your program's virtual pages directly onto the page cache. This can deliver a significant performance boost: Windows System Programming reports run time improvements of 30% and up relative to regular file reads, while similar figures are reported for Linux and Solaris in Advanced Programming in the Unix Environment. You might also save large amounts of physical memory, depending on the nature of your application.
启用文件映射后,内核会将程序的虚拟页直接关联至页面缓存,能够显著提升运行效率。《Windows 系统编程》一书中提到,相比常规文件读取,文件映射可将运行耗时降低 30% 及以上;《UNIX 环境高级编程》也记载,Linux 与 Solaris 系统存在相近的性能提升效果。根据应用程序的运行特性,文件映射还能大幅节省物理内存。
As always with performance, measurement is everything, but memory mapping earns its keep in a programmer's toolbox. The API is pretty nice too: it allows you to access a file as bytes in memory and does not require your soul and code readability in exchange for its benefits. Mind your address space and experiment with mmap in Unix-like systems, CreateFileMapping in Windows, or the many wrappers available in high level languages. When you map a file its contents are not brought into memory all at once, but rather on demand via page faults. The fault handler maps your virtual pages onto the page cache after obtaining a page frame with the needed file contents. This involves disk I/O if the contents weren't cached to begin with.
性能优劣需要依靠实测数据判定,而内存映射是开发过程中实用的技术手段。对应的应用程序接口使用便捷,可像访问内存字节数据一样操作文件,且不会牺牲代码可读性。开发时需要留意地址空间相关规则,类 UNIX 系统可使用 mmap,Windows 系统可使用 CreateFileMapping,各类高级编程语言也提供了对应的封装接口。文件映射不会一次性将整个文件载入内存,而是通过缺页异常按需加载数据。异常处理程序先分配物理页帧并加载目标文件数据,再将程序虚拟页关联至页面缓存。若对应数据未提前缓存,该过程会触发硬盘读写操作。
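As one example of the high-level wrappers mentioned above, here is a minimal sketch using Python's standard `mmap` module. The file is a throwaway temp file standing in for scene.dat, and the helpers `make_scene_file` and `read_via_mapping` are hypothetical names. The point is only that a mapped file is indexed like a byte array, with no explicit read calls.

```python
import mmap
import os
import tempfile

def make_scene_file(payload):
    """Write payload to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".dat")
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    return path

def read_via_mapping(path, start, length):
    """Map the whole file read-only and return bytes [start, start + length)."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Slicing touches only the pages covering this range; those pages
            # are faulted in on demand. The slice itself is copied into a
            # bytes object so it remains valid after the mapping is closed.
            return m[start:start + length]

payload = b"HEADER" + bytes(8192) + b"TAIL"
path = make_scene_file(payload)
header = read_via_mapping(path, 0, 6)
tail = read_via_mapping(path, len(payload) - 4, 4)
os.remove(path)

print(header, tail)
```

Reading the header and the tail touches only the first and last pages of the mapping; the 8 KB in between is never faulted in by this program.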
Now for a pop quiz. Imagine that the last instance of our render program exits. Would the pages storing scene.dat in the page cache be freed immediately? People often think so, but that would be a bad idea. When you think about it, it is very common for us to create a file in one program, exit, then use the file in a second program. The page cache must handle that case. When you think more about it, why should the kernel ever get rid of page cache contents? Remember that disk is 5 orders of magnitude slower than RAM, hence a page cache hit is a huge win. So long as there's enough free physical memory, the cache should be kept full. It is therefore not dependent on a particular process, but rather it's a system-wide resource. If you run render a week from now and scene.dat is still cached, bonus! This is why the kernel cache size climbs steadily until it hits a ceiling. It's not because the OS is garbage and hogs your RAM, it's actually good behavior because in a way free physical memory is a waste. Better use as much of the stuff for caching as possible.
这里提出一个小问题:假设 render 程序的最后一个运行实例退出,页面缓存中存储 scene.dat 的内存页是否会被立刻释放?多数人会认为答案是肯定的,但这种设计并不合理。日常使用中经常出现这类场景:使用一款程序创建文件后退出,再通过另一款程序读取该文件,页面缓存需要适配这类使用场景。进一步分析,内核没有必要主动清空页面缓存内的数据。硬盘的运行速度比内存低 5 个数量级,命中页面缓存可以大幅提升运行效率。只要系统存在空闲物理内存,页面缓存就会持续保留已有数据。页面缓存不属于单个进程,而是面向整个系统的资源。即便一周后再次运行 render 程序,只要 scene.dat 仍在缓存中,读取效率就会得到优化。这也是内核缓存占用量会持续增长直至达到上限的原因。操作系统并非恶意占用内存,空闲的物理内存无法发挥作用,将其用于数据缓存才是更合理的利用方式。
Due to the page cache architecture, when a program calls write() bytes are simply copied to the page cache and the page is marked dirty. Disk I/O normally does not happen immediately, thus your program doesn't block waiting for the disk. On the downside, if the computer crashes your writes will never make it, hence critical files like database transaction logs must be fsync()ed (though one must still worry about drive controller caches, oy!). Reads, on the other hand, normally block your program until the data is available. Kernels employ eager loading to mitigate this problem, an example of which is read ahead where the kernel preloads a few pages into the page cache in anticipation of your reads. You can help the kernel tune its eager loading behavior by providing hints on whether you plan to read a file sequentially or randomly (see madvise(), readahead(), Windows cache hints). Linux does read-ahead for memory-mapped files, but I'm not sure about Windows. Finally, it's possible to bypass the page cache using O_DIRECT in Linux or NO_BUFFERING in Windows, something database software often does.
结合页面缓存的运行机制,程序调用 write() 接口写入数据时,数据仅会被拷贝至页面缓存,对应内存页会被标记为脏页。系统通常不会立即执行硬盘写入操作,程序也不会因等待硬盘操作而阻塞。该机制存在安全隐患,若设备意外宕机,缓存中未落盘的写入数据将会丢失。因此数据库事务日志等重要文件,必须调用 fsync() 接口强制数据写入硬盘,同时还需要考虑磁盘控制器缓存带来的影响。文件读取操作则有所不同,程序一般会进入阻塞状态,直至数据读取完成。为改善该问题,内核采用预加载机制,预读是典型应用:内核会预判后续读取行为,提前将部分内存页载入页面缓存。开发者可以告知内核文件的读取模式为顺序读取或随机读取,以此优化预加载策略,相关接口可参考 madvise()、readahead() 以及 Windows 缓存提示。Linux 系统会为内存映射文件执行预读操作,Windows 系统的对应机制暂不明确。此外,Linux 系统可通过 O_DIRECT 标识、Windows 系统可通过 NO_BUFFERING 标识绕过页面缓存,数据库类软件常会使用该运行模式。
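A rough sketch of both ideas follows: forcing dirty pages to disk with `fsync`, and hinting a sequential access pattern. It assumes a Unix-like system. The hint uses `os.posix_fadvise`, a related call that the text does not name (it lists `madvise()` and `readahead()`), and it is guarded because Python does not expose it on every platform. The function names `durable_append` and `read_sequentially` and the file name `txn.log` are invented for the example.

```python
import os
import tempfile

def durable_append(path, record):
    """Append record, then block until the kernel reports it flushed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, record)   # lands in the page cache; the page is now dirty
        os.fsync(fd)           # ask the kernel to push dirty pages to the device
    finally:
        os.close(fd)

def read_sequentially(path, chunk_size=512):
    """Read the whole file in chunks after hinting sequential access."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            # Advisory only: the kernel may use it to tune read-ahead.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        chunks = []
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)

log_path = os.path.join(tempfile.mkdtemp(), "txn.log")
durable_append(log_path, b"BEGIN\n")
durable_append(log_path, b"COMMIT\n")
contents = read_sequentially(log_path)
print(contents)
```

Calling `fsync` after every record trades throughput for durability, which is the usual choice for a transaction log. As the text warns, the drive's own cache may still sit between the kernel and the platter.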
A file mapping may be private or shared. This refers only to updates made to the contents in memory: in a private mapping the updates are not committed to disk or made visible to other processes, whereas in a shared mapping they are. Kernels use the copy on write mechanism, enabled by page table entries, to implement private mappings. In the example below, both render and another program called render3d (am I creative or what?) have mapped scene.dat privately. Render then writes to its virtual memory area that maps the file:
文件映射分为私有映射 与共享映射 ,二者的区别仅针对内存中的数据修改操作 。私有映射模式下,数据修改不会写入硬盘,也不会被其他进程感知;共享映射模式下,修改内容对其他进程可见,并最终写入硬盘。内核依托页表项实现写时复制 机制,以此支撑私有映射功能。如下示例中,render 程序与另一款名为 render3d 的程序均以私有模式映射 scene.dat 文件,随后 render 向该文件对应的虚拟内存区域执行写入操作:

The read-only page table entries shown above do not mean the mapping is read only, they're merely a kernel trick to share physical memory until the last possible moment. You can see how 'private' is a bit of a misnomer until you remember it only applies to updates. A consequence of this design is that a virtual page that maps a file privately sees changes done to the file by other programs as long as the page has only been read from. Once copy-on-write is done, changes by others are no longer seen. This behavior is not guaranteed by the kernel, but it's what you get in x86 and makes sense from an API perspective. By contrast, a shared mapping is simply mapped onto the page cache and that's it. Updates are visible to other processes and end up in the disk. Finally, if the mapping above were read-only, page faults would trigger a segmentation fault instead of copy on write.
图示中的只读页表项,并不代表该映射为只读模式。这是内核的优化方式,目的是尽可能延长物理内存的共享时长。理解私有映射仅作用于数据修改行为后,就能明白"私有"这一命名的含义。该设计会产生一种现象:以私有模式映射文件的虚拟页,在仅执行读取操作时,可以感知其他程序对原文件做出的修改;一旦触发写时复制,便无法再获取外部程序的修改内容。该表现并非内核强制规定,但 x86 架构系统普遍如此,也符合应用程序接口的设计逻辑。共享映射则是直接将虚拟页关联至页面缓存,数据修改会同步给其他进程并最终写入硬盘。若文件映射被设置为只读模式,缺页异常不会触发写时复制,而是直接引发段错误。
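The private versus shared distinction can be tried out with Python's `mmap` wrapper, where `ACCESS_COPY` requests a private copy-on-write mapping and `ACCESS_WRITE` a shared write-through one. This is a sketch on a throwaway temp file standing in for scene.dat; `file_bytes` is a hypothetical helper that re-reads the file through ordinary read calls.

```python
import mmap
import os
import tempfile

def file_bytes(path):
    """Read the file through regular read calls (served by the page cache)."""
    with open(path, "rb") as f:
        return f.read()

fd, path = tempfile.mkstemp(suffix=".dat")
with os.fdopen(fd, "wb") as f:
    f.write(b"scene-v1" + bytes(4088))   # 4096 bytes in total

with open(path, "r+b") as f:
    # Private mapping: the first write makes a private copy of the page.
    private = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
    private[0:8] = b"PRIVATE!"
    private_view = private[0:8]
    after_private_write = file_bytes(path)[0:8]
    private.close()

    # Shared mapping: writes go to the page cache page that backs the file.
    shared = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE)
    shared[0:8] = b"SHARED!!"
    shared.flush()
    after_shared_write = file_bytes(path)[0:8]
    shared.close()

os.remove(path)

print(private_view, after_private_write, after_shared_write)
```

The private write is visible only through the mapping that made it, while the shared write shows up for anyone reading the file afterwards.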
Dynamically loaded libraries are brought into your program's address space via file mapping. There's nothing magical about it, it's the same private file mapping available to you via regular APIs. Below is an example showing part of the address spaces from two running instances of the file-mapping render program, along with physical memory, to tie together many of the concepts we've seen.
动态链接库也是通过文件映射加载至程序地址空间,其原理和普通接口实现的私有文件映射完全一致。下方示例展示了两个以文件映射方式运行的 render 程序实例的部分地址空间,以及对应的物理内存状态,整合了前文讲解的各类知识点。

This concludes our 3-part series on memory fundamentals. I hope the series was useful and provided you with a good mental model of these OS topics.
至此,本系列共三篇的内存基础内容讲解全部结束。希望本系列内容能够帮助大家建立对这类操作系统知识点的认知框架。
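Overview of Memory Consistency and Cache Coherence
Original post by iccnewer, published 2021-07-23 21:29:44
To meet PPA (Performance, Power, Area) design targets, most modern computer systems and multicore processor chips use a shared-memory hardware architecture. In a shared-memory system, any processor can read from and write to a given shared address space.
For shared memory to be usable in practice, the system must guarantee that memory reads and writes behave correctly. Two basic concepts are involved: memory consistency and cache coherence.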
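1. Memory Consistency
Memory consistency, also called the memory model or memory consistency model, specifies the globally visible ordering rules for reads and writes of shared data by all processors in a multicore shared-memory system. Hardware features such as out-of-order execution, store buffers, and asynchronous cache write-back can reorder a thread's load and store operations relative to program order. The memory consistency model defines which of those reorderings are legal.
Memory consistency is defined in terms of two kinds of memory access, loads and stores, and uses them to describe the legal behaviors of a shared-memory system. On a multicore system, accesses issued to shared memory by different cores are not aligned in time, so without a common set of rules the outcome of those accesses would be unpredictable.
In essence, a memory consistency model is the specification of how the shared-memory system behaves: a single set of constraints that every accessor must follow, so that shared data stays valid as it is read and written.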
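2. Cache Coherence
Architecturally, cache coherence and memory consistency are independent but complementary mechanisms, and neither is subordinate to the other. In a shared-memory system, cache coherence keeps all cached copies of the same physical memory address in agreement with each other and with main memory. The two differ in what they observe: cache coherence looks at a single address and keeps every copy of it, across all cache levels and main memory, consistent; memory consistency looks at the global sequence of memory accesses and constrains the legal ordering of multiple accesses.
In a shared-memory architecture every core can cache shared data, so one physical address may have several copies spread across the cache levels and main memory. If those copies diverge, a processor can act on a stale value and the program misbehaves. This is the cache coherence problem.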
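3. Summary of the Differences
The two mechanisms can be told apart by what they act on and what they constrain:
- Cache coherence: scoped to a single memory address. It constrains the copies held in all processor caches and main memory, keeps every copy of that address in sync, and so solves the problem of diverging copies.
- Memory consistency: scoped to the whole memory space. It constrains the ordering of all load and store operations across processors, and so solves the problem of ordering operations across addresses and cores.
A note on layering: cache coherence is the foundation beneath memory consistency, but a complete cache coherence mechanism does not by itself satisfy memory consistency requirements. The system still needs a memory consistency model to govern the global order of memory accesses.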
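To make "legal orderings" concrete, the sketch below enumerates every interleaving of the classic store-buffering litmus test under sequential consistency. Sequential consistency is one well-known memory model, chosen here as an assumption because the text does not name a specific model. Thread 1 runs `x = 1; r1 = y` and thread 2 runs `y = 1; r2 = x`, with both variables starting at 0. The names `T1`, `T2`, `interleavings`, and `run` are illustrative.

```python
from itertools import combinations

# Each thread is a list of operations in program order.
T1 = [("store", "x", 1), ("load", "y", "r1")]
T2 = [("store", "y", 1), ("load", "x", "r2")]

def interleavings(a, b):
    """Yield every merge of a and b that preserves each thread's own order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        chosen = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in chosen else next(ib) for i in range(n)]

def run(schedule):
    """Execute one interleaving against a single shared memory."""
    memory = {"x": 0, "y": 0}
    regs = {}
    for op, addr, arg in schedule:
        if op == "store":
            memory[addr] = arg
        else:
            regs[arg] = memory[addr]
    return regs["r1"], regs["r2"]

outcomes = {run(s) for s in interleavings(T1, T2)}
print(sorted(outcomes))   # [(0, 1), (1, 0), (1, 1)]
```

The outcome `(0, 0)` never appears: under sequential consistency at least one store precedes both loads. Hardware with store buffers can produce `(0, 0)` for this program even when its caches are fully coherent, which is why coherence alone does not pin down the ordering that software observes.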
reference
- Getting Physical With Memory | Jan 15th, 2009
https://manybutfinite.com/post/getting-physical-with-memory/
- Anatomy of a Program in Memory | Jan 27th, 2009
https://manybutfinite.com/post/anatomy-of-a-program-in-memory/
- How The Kernel Manages Your Memory | Feb 4th, 2009
https://manybutfinite.com/post/how-the-kernel-manages-your-memory/
- Page Cache, the Affair Between Memory and Files | Feb 11th, 2009
https://manybutfinite.com/post/page-cache-the-affair-between-memory-and-files/
- 介绍内存一致性(Memory Consistency)和缓存一致性(Cache Coherence)-CSDN博客
https://blog.csdn.net/iNostory/article/details/119047985