注:本文为 "CPU 缓存" 相关讨论合辑。
英文引文,略作重排。
如有内容异常,请看原文。
CPU 缓存架构演进:统一缓存与分离缓存的技术抉择
一、Intel 为何放弃统一 CPU 缓存?
问题
When Intel introduced the 80486 in 1989, they included their first on-chip cache, ostensibly to compete better with Motorola, who had been including on-chip caches for 5 years (MC68020, 1984).
1989 年 Intel 推出 80486 时,首次在芯片上集成了缓存,其表面动机是为了更好地与 Motorola 竞争,后者自 1984 年的 MC68020 起已集成片上缓存长达 5 年。
Unlike the Motorola CPUs, Intel went with unified L1 cache for the '486. Then, they switched to separate Instruction & Data L1 cache with the Pentium and its successors. Why the switch, and why not use separate cache for the 80486, if it was obviously better?
与 Motorola CPU 不同,Intel 在 486 上采用了统一 L1 缓存,而在 Pentium 及其后续产品中则切换为独立的指令缓存与数据缓存。为何做出这一转变?既然分离缓存显然更优,为何不在 80486 上就采用?
评论
A possible consideration could have been maintaining functionality of existing software relying on immediate availability of self-modified code, while giving developers time to adapt.
一种可能的考量是维持现有软件的功能性(这些软件依赖于自修改代码的立即可用性),同时为开发者留出适应时间。
@Leo why would a separate instruction cache cause problems for self-modifying code? The x86 caches are coherent.
@Leo 独立的指令缓存为何会对自修改代码造成问题?x86 缓存是保持一致的。
@StephenKitt The decision to make caches coherent had to be made. I/D coherency is costly, and with proper programming discipline it is not necessary.
@StephenKitt 使缓存保持一致性是一个必须做出的设计决策。指令/数据一致性代价高昂,且若遵循恰当的编程规范,则并非必需。
回答一:设计历史与架构演进
I'm not sure the separate cache was "obviously better" back when the Intel designers were working on the 80486, at least, not to the designers in question.
我不确定在 Intel 设计师着手 80486 设计的年代,分离缓存是否"显然更优",至少对当时的设计师而言并非如此。
But "better" might not even have been much of a factor. The design history of the cache systems in Motorola and Intel CPUs is quite different, which can explain the different approaches used in the 68040 and 80486.
然而"优劣"本身可能并非决定性因素。Motorola 与 Intel CPU 缓存系统的设计历史截然不同,这可以解释 68040 与 80486 为何采用不同的方案。
On the Motorola side of things, the 68010 introduced a loop mode for tight loops, allowing them to run without instruction fetches from memory. The 68020 replaced this with a dedicated 256-byte instruction cache, but no data cache. The 68030 added a 256-byte data cache alongside this, with a modified Harvard architecture. The 68040 revamped the caches and expanded them to 4 KiB each. The split cache is a natural progression given the architecture's history.
在 Motorola 方面,68010 引入了紧循环模式,使紧循环无需从内存取指即可运行。68020 以专用的 256 字节指令缓存取代了这一模式,但未配备数据缓存。68030 在此基础上增加了 256 字节数据缓存,采用改进型哈佛架构。68040 重新设计了缓存,并将容量扩展至各 4 KiB。鉴于该架构的历史演进,分离缓存是顺理成章的发展。
On the Intel side of things, caches (apart from the TLB and prefetch buffer) didn't appear on-CPU until the 486. On the 386, it was common enough to have external caches via the 82385 cache controller; this was unsurprisingly a unified instruction and data cache. So pulling the cache on-die with a single MMU and a unified cache feels logical given the history here.
从英特尔产品迭代脉络来看,除转译后备缓冲区(TLB)与预取缓冲区以外,通用片上缓存直至 486 处理器才完成芯片集成。80386 平台普遍借助 82385 缓存控制器搭载外置缓存,该类外置缓存均采用指令、数据一体化的统一架构。立足前代技术沿革,英特尔将缓存与单片内存管理单元(MMU)一并集成至处理器硅片、延续统一缓存方案具备技术逻辑。
TLB:全称 Translation Lookaside Buffer,转译后备缓冲区
TLB 为处理器内置专用高速缓冲,用于存放虚拟地址至物理地址的页表映射项,省去频繁访问内存页表的开销、压缩地址转换时延;在 80386、80486 的硬件定义中,TLB 仅服务地址转换,不属于通用指令/数据缓存,这也是原文将其排除在片上通用缓存统计范围的原因。
Both the 68040 and the 80486 benefited from process changes which allowed them to have million-transistor budgets (in both cases, up from under 300,000 transistors in the previous generation to around 1.2 million transistors). The bump seems huge, but once you pull in the FPU, MMU(s), add cache etc., there are still compromises to be made (famously, to the FPU on the 040). In both cases too, the new CPUs were evolutions in their respective architectures, not the huge revamps that the Pentium ended up being; so it's plausible to imagine the designers working mostly on merging existing external features and improving the core design incrementally, rather than looking around and re-evaluating all the design choices.
68040 与 80486 均受益于工艺进步,使其晶体管预算达到百万级(两者均从前代不足 30 万晶体管跃升至约 120 万)。这一增幅看似巨大,但一旦纳入 FPU、MMU、缓存等模块,仍需做出权衡(68040 的 FPU 削减便是著名案例)。此外,这两款新 CPU 均为各自架构的演进版本,而非 Pentium 那样的彻底重构;因此可以合理推测,设计师主要致力于将现有外部功能集成到芯片上,并对核心设计进行渐进式改进,而非全面重新审视所有设计抉择。
By the time they started working on the Pentium, the Intel designers had some feedback from the 486 cache. The 486's architecture had been "good enough" that they could double and even triple its internal clockspeed; this means that the cache subsystem was sufficient to keep the instruction units busy in many workloads (since clock multiplying without improving the memory bus significantly relies on cache even more heavily). However the 486 did suffer from some level of cache contention, and this became particularly troublesome with the additional processing bandwidth in the Pentium instruction units, especially given the availability of two instruction pipes (the U and V pipes). According to Inside the Pentium (Bob Ryan, Byte Magazine, May 1993), this is the main factor which drove Intel to split its cache into two. In addition, the TLB and cache tags were changed to be triple ported, to allow simultaneous access from both ALUs. The Pentium die layout certainly benefited from the split caches, and this is a common reason to favour split caches (the data and instruction caches are used by different sections of the CPU, and they can be laid out close by).
待到着手 Pentium 设计时,Intel 设计师已从 486 缓存获得了反馈。486 的架构"足够优秀",使其内部时钟频率得以翻倍甚至三倍提升;这意味着缓存子系统足以在多数工作负载下保持指令单元繁忙(因为在未显著改进内存总线的情况下提升时钟频率,更加依赖缓存)。然而 486 确实存在一定程度的缓存争用,而 Pentium 指令单元额外的处理带宽使这一问题尤为突出,特别是考虑到双指令流水线(U 管与 V 管)的存在。据 Inside the Pentium(Bob Ryan, Byte Magazine, 1993 年 5 月)记载,这是促使 Intel 将缓存一分为二的主要因素。此外,TLB 与缓存标记改为三端口设计,以允许两个 ALU 同时访问。Pentium 的芯片布局显然受益于分离缓存,这也是支持分离缓存的常见理由:数据缓存与指令缓存分别由 CPU 的不同部分使用,可就近布局。
Importantly given the development practices at the time on x86, the split caches are coherent on x86, so that the caches don't cause problems with self-modifying code.
鉴于当时 x86 平台上的开发实践,x86 的分离缓存保持了一致性,因此不会因自修改代码而产生问题。
讨论
Solid answer. I saw something in my own research suggesting that the physical placement/location on-die of the cache was critical for minimizing cache latency, and this led some designers to Harvard arch. So wondering if that also played a role with Intel switch for P5.
回答很扎实。我在研究中注意到,缓存在芯片上的物理布局位置对于最小化缓存延迟至关重要,这促使部分设计师采用哈佛架构。因此我想知道,这是否也是 Intel 在 P5 上做出转变的因素之一。
Yes, I came across that too; if (and this is a big if) the on-die layout of the 486 is anything like this block diagram, it wouldn't be an issue there, but it could easily have been a problem on the Pentium (and also on the 68040). I imagine there are annotated die shots of 486s and Pentiums somewhere...
是的,我也注意到了这一点;如果(这是一个很大的假设)486 的芯片布局类似于该框图,那么这在其上不构成问题,但在 Pentium(以及 68040)上可能很容易成为问题。我想某处应该存在 486 和 Pentium 的带注释芯片照片......
@BrianH: Split L1 caches are certainly more natural for a pipelined CPU that's normally fetching instructions in parallel with data loads / stores. Two separate caches are cheaper to build than 1 larger multi-ported cache. And yes, tightly integrating the L1d with load/store ports and L1dTLB, and L1i with L1iTLB and instruction fetch, is another well-known advantage of split caches, as @HadiBrais discussed in an SO answer. Many wires (silicon or metal) over long distances are something to avoid. (P5 has 64-bit wide cache access paths).
对于通常并行执行指令取指与数据加载/存储的流水线 CPU 而言,分离 L1 缓存显然更为自然。构建两个独立的缓存比构建一个更大的多端口缓存成本更低。此外,将 L1d 与加载/存储端口及 L1dTLB 紧密集成,将 L1i 与 L1iTLB 及指令取指紧密集成,也是分离缓存的另一项广为人知的优势。长距离的连线(无论是硅连线还是金属连线)应尽量避免。(P5 采用 64 位宽的缓存访问通路。)
Most classic-RISC ISAs (like MIPS that were designed from the ground up for pipelined implementations) do not have coherent instruction caches: you have to run a sync/flush instruction before you can safely jump to an address where the CPU recently used store instructions to store new machine code. x86 on paper required a serializing instruction (like cpuid except it didn't exist until late 486, so there weren't any good user-space choices). In practice CPU vendors wanted to not break existing self-modifying code (primitive JITs or whatever). Coherent L1i + pipeline is harder!
大多数经典 RISC 指令集架构(如从头开始为流水线实现而设计的 MIPS)并不具备一致的指令缓存:在跳转到 CPU 最近通过存储指令写入新机器码的地址之前,必须执行同步/刷新指令。从纸面规范看,x86 需要串行化指令(如 cpuid,但该指令直到 486 后期才出现,因此当时缺乏良好的用户态选择)。实际上,CPU 厂商希望不破坏现有的自修改代码(原始 JIT 等)。实现一致的 L1i 加上流水线更为困难!
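To make the stale-instruction problem described above concrete, here is a minimal toy model in Python. It is only a sketch under loud assumptions: the class `NonCoherentSplitCache` and its methods are hypothetical names invented for illustration, instructions are represented as strings, and no real ISA or flush instruction is modelled. It shows why software on a machine with a non-coherent instruction cache has to flush before jumping to freshly written code.

```python
class NonCoherentSplitCache:
    """Toy model (hypothetical): memory plus an I-cache that stores do NOT update."""

    def __init__(self, memory):
        self.memory = dict(memory)   # address -> instruction word
        self.icache = {}             # address -> cached copy of that word

    def fetch(self, addr):
        # Instruction fetch: fill from memory on a miss, else reuse the cached copy.
        if addr not in self.icache:
            self.icache[addr] = self.memory[addr]
        return self.icache[addr]

    def store(self, addr, word):
        # Data store: reaches memory only; the I-cache is neither snooped nor invalidated.
        self.memory[addr] = word

    def flush_icache(self):
        # Stand-in for the explicit sync/flush step that software must run.
        self.icache.clear()


cpu = NonCoherentSplitCache({0x100: "old"})
first = cpu.fetch(0x100)     # caches "old"
cpu.store(0x100, "new")      # self-modifying store
stale = cpu.fetch(0x100)     # still "old": the I-cache line is stale
cpu.flush_icache()
fresh = cpu.fetch(0x100)     # "new" after the flush
print(first, stale, fresh)   # prints: old old new
```

A coherent design, as on x86, would in effect perform the invalidation inside `store`, which is the extra hardware cost the comments above refer to.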
回答二:缓存策略与关联度分析
First to keep in mind is that the 68k was way more in need of a cache than x86 CPUs, as its memory access was in line with execution, while the x86 prefetch buffer used 'free' cycles to read ahead, thus utilizing the memory much better than the 68k could do.
首先需要记住的是,68k 系列比 x86 CPU 更需要缓存,因为其内存访问与执行同步进行,而 x86 的预取缓冲利用"空闲"周期提前读取,从而比 68k 更高效地利用内存。
Next, it depends on your processor structure to make separate caches worthwhile. Both CPUs use a single address space for instruction and data (von Neumann style) thus the simplest way to speed up memory access is to add a cache which operates on the plain memory interface without any change to the CPU. Caching that way is a feature of the memory system transparent to the CPU.
其次,分离缓存是否值得取决于处理器结构。两种 CPU 均采用单一的指令与数据地址空间(冯·诺依曼风格),因此加速内存访问的最简单方式是在内存接口上添加缓存,而无需改变 CPU 本身。这种方式的缓存对 CPU 透明,属于内存系统的特性。
Regions that get accessed get cached into faster memory in hope of future reuse from there. That's what happened with the various cache designs for 286/386 systems. Simple and straightforward.
被访问的区域被缓存到更快的存储器中,以期将来再次利用。286/386 系统的各类缓存设计便是如此。简单直接。
Now while a simple (unified) cache is a quick and great solution, not all memory locations are equal. Most notably, instructions get reused more often than pure data. So if chip space is scarce and only a few bytes of cache are possible, then it will be better to reserve them for (hopefully) repeated instructions. That's the way Motorola went on the 68020 by adding 256 bytes of instruction cache.
尽管简单的(统一)缓存是一种快速且优秀的解决方案,但并非所有内存位置都同等重要。最显著的是,指令比纯数据更频繁地被复用。因此,若芯片空间有限且仅能容纳少量缓存字节,则将其保留给(有望被重复使用的)指令更为合理。Motorola 在 68020 上增加 256 字节指令缓存正是遵循这一思路。
Such a 'just instructions' cache isn't more complicated than a generic cache. The single difference is that data read access doesn't get cached, everything else works the same. So far the 68020 did not bring a separate cache, but a partial one.
这种"仅指令"缓存并不比通用缓存更复杂。唯一的区别是数据读取访问不被缓存,其余功能完全相同。因此 68020 并未引入分离缓存,而是部分缓存。
It was the 68030 that introduced a data cache of 256 bytes as well --- plus adding a burst mode to read up to 16 bytes at once. Adding a data cache adds complications. Separating instruction and data cache is often described as Modified Harvard. While it's quite easy to add two separate caches on a true Harvard architecture, it's a nightmare for von Neumann style memories. Now each of the caches can hold the same data, so they must be synchronized with each other and memory as well.
68030 引入了 256 字节数据缓存,并增加了突发模式以一次性读取最多 16 字节。增加数据缓存带来了复杂性。将指令缓存与数据缓存分离通常被描述为改进型哈佛架构。虽然在真正的哈佛架构上添加两个独立缓存相当容易,但对于冯·诺依曼风格的内存而言则是一场噩梦。此时两个缓存可能持有相同数据,因此必须相互同步,并与内存同步。
Motorola's decision to go with separate caches in the '030 is more likely a result of keeping investment down. After all, the '030 is mainly a shrink of the '020, so adding a data cache on top might have been less work than redesigning the whole cache system. Also, since the added cache wasn't exactly large, keeping it divided preserved the speed characteristics the 68020 already showed from its instruction cache.
Motorola 在 030 上采用分离缓存的决定,更可能是出于控制投资的考虑。毕竟 030 主要是 020 的缩微版,因此在原有基础上增加数据缓存可能比重新设计整个缓存系统工作量更小。此外,由于新增缓存容量不大,保持分离状态得以维持 68020 指令缓存已展现的速度特性。
When the 68030 became available, desktop 80386 systems were already sold with up to 64 KiB of cache. The sheer size of a more than 100 times bigger cache made Motorola's effort seem useless in comparison. Intel kept the single cache for the 80486 (1989) but now integrated 8 KiB of cache right onto the CPU --- which didn't stop mainboard manufacturers from adding external caches, now in the region of 256 KiB to 1 MiB.
当 68030 问世时,桌面 80386 系统已配备高达 64 KiB 的缓存。超过 100 倍的容量差距使 Motorola 的努力相形见绌。Intel 在 80486(1989 年)上保留了单一缓存,但将 8 KiB 缓存直接集成到 CPU 上;这并未阻止主板厂商增加外部缓存,容量达到 256 KiB 至 1 MiB 级别。
Fact: Size does matter (and simply beats strategy).
事实:容量至关重要(策略在绝对容量面前黯然失色)。
As Stephen already mentioned, these designs were still good to feed the processing units of a 486 fast enough. So when the 68040 came around, Motorola just increased the size to 4+4 KiB cache with just little improvements.
正如 Stephen 所述,这些设计仍足以快速为 486 的处理单元提供数据。因此当 68040 问世时,Motorola 仅将容量提升至 4+4 KiB,改进甚微。
While Motorola's strategy wasn't bad, it wasn't superior either. And Intel's switch to separate the P5's cache into two wasn't driven by some inherent advantage, but by 'need for speed'.
Motorola 的策略虽不差,但也谈不上优越。Intel 将 P5 缓存一分为二并非出于某种固有优势,而是受"速度需求"驱动。
The 486 cache was 4-way associative, meaning each memory location could be buffered in any of 4 cache locations. To find the right location, the tags need to be compared; this takes time. By splitting the cache into two separate parts, this could be reduced to a 2-way system while keeping (in most cases) the same performance --- but due to the less complicated hardware, its response time could be improved.
486 缓存采用 4 路组相联,意味着每个内存位置可缓冲到 4 个缓存位置中的任意一个。为找到正确位置,需要比较标记位,这需要时间。将缓存分为两个独立部分后,可降至 2 路系统,同时(在大多数情况下)保持相同性能,而由于硬件复杂度降低,响应时间得以改善。
The basic idea of Intel's switch is not so much about specializing on instruction or data, but split the cache between two memory regions that (hopefully) overlap as little as possible, resulting in the same (lower) thrashing rate of a 4-way system but using the faster hardware of a 2-way system. Gaining a higher hit rate within instructions is a rather welcome side effect.
Intel 转变的基本理念并非专门针对指令或数据进行特化,而是将缓存分割给两个(希望尽可能少重叠的)内存区域,从而在保持 4 路系统相同(较低)的抖动率的同时,使用 2 路系统更快的硬件。在指令中获得更高的命中率只是一个颇为受欢迎的副作用。
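The associativity argument above can be sketched with a small LRU set-associative cache model. Everything here is an assumption made for illustration: the class `SetAssocCache`, the 16-byte line size, the set counts (chosen so that all three configurations hold 8 KiB in total), and above all the trace, which is deliberately built so that two code lines and two data lines collide in the same set. It is a worst case for a 2-way unified cache, not a measurement of any real processor.

```python
from collections import OrderedDict


class SetAssocCache:
    """LRU set-associative cache that only counts hits and misses."""

    def __init__(self, num_sets, ways, line_size=16):
        self.num_sets, self.ways, self.line_size = num_sets, ways, line_size
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        line = addr // self.line_size
        index, tag = line % self.num_sets, line // self.num_sets
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)          # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(ways) >= self.ways:
                ways.popitem(last=False)   # evict the least recently used tag
            ways[tag] = True


def run(trace, icache, dcache):
    # Route instruction fetches and data accesses to the given caches.
    for kind, addr in trace:
        (icache if kind == "I" else dcache).access(addr)


# Two code lines and two data lines that all map to set 0 (assumed addresses).
trace = [("I", 0x0000), ("D", 0x2000), ("I", 0x1000), ("D", 0x3000)] * 100

unified4 = SetAssocCache(128, 4)           # 8 KiB, 4-way, unified
run(trace, unified4, unified4)
unified2 = SetAssocCache(256, 2)           # 8 KiB, 2-way, unified
run(trace, unified2, unified2)
split_i, split_d = SetAssocCache(128, 2), SetAssocCache(128, 2)  # 4 KiB + 4 KiB, 2-way each
run(trace, split_i, split_d)
split_misses = split_i.misses + split_d.misses

print(unified4.misses, unified2.misses, split_misses)  # prints: 4 400 4
```

On this contrived trace the two 2-way halves avoid the instruction/data conflicts just as the 4-way unified cache does, while a 2-way unified cache of the same size thrashes. Real workloads fall somewhere in between.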
二、缓存存储器的历史与发展
问题
I have tried to research the history and development of memory caching online, but I find it hard to find good information. Many resources online would have you believe caching was introduced with the Intel 80486, and generally assume it is something found only in microprocessors. The Stanford Superfoonly design from the early 1970s included a cache that was projected to provide a ~10x speedup over a PDP-10. I'm sure even earlier examples could be found elsewhere.
我尝试在线研究缓存存储器的历史与发展,但发现难以找到优质资料。许多网络资源让人误以为缓存随 Intel 80486 引入,且普遍认为缓存仅与微处理器相关。1970 年代初的 Stanford Superfoonly 设计包含了缓存,预计可比 PDP-10 提供约 10 倍加速。我相信更早的实例亦可在别处找到。
回答一:从存储层次到现代缓存
Preface: This focuses on real machines, available as production units, not prototypes or experimental designs. Nor are the examples exhaustive. I will also spare any discussion of memory hierarchy but go with the meaning of CPU cache as it's canon today.
前言:本文聚焦于实际生产机器,而非原型或实验设计。所举示例亦非穷尽。此外,我将略过存储层次的讨论,而采用当今公认的 CPU 缓存定义。
The development of Cache is a continuation of storage hierarchy, a principle still visible in IBM Mainframes and interlinked with the development of virtual memory. Both are methods to increase speed of most active memory regions while at the same time providing ever larger amounts of usable address space.
缓存的发展是存储层次结构的延续,这一原则在 IBM 大型机中依然可见,并与虚拟存储器的发展相互交织。两者均为加速最活跃内存区域的方法,同时提供日益增大的可用地址空间。
The first step might have been machines like the Z23, a 1961 transistor based reimplementation of the earlier Z22. While the Z22 only placed the first 16 words (the registers) in Core, the Z23 had 256 additional words of core within the (Drum) address space.
第一步可能始于 Z23 等机器,它是 1961 年基于晶体管对早期 Z22 的重新实现。Z22 仅将前 16 个字(寄存器)置于磁芯存储器中,而 Z23 在(磁鼓)地址空间内额外拥有 256 个磁芯字。
The mid 1960s also mark the point in time where core installation became large enough to completely replace Drum as main memory, redesignating it as very fast external storage. Beside rapid growing size and independent address space layout, this also resulted in independence from Drum timing. This independence allowed the use of different cycle times depending on different memory types, etc.
1960 年代中期也标志着磁芯存储器容量增长到足以完全取代磁鼓作为主存的转折点,磁鼓被重新指定为超高速外部存储。除了快速增长的容量和独立的地址空间布局外,这也带来了脱离磁鼓定时的独立性。这种独立性允许根据不同类型的存储器使用不同的周期时间等。
The next step was virtual addressing, which allowed arbitrary memory regions to be kept in limited core while the rest stayed on media larger but less expensive than drums, namely magnetic disks. Its basis is a TLB (Translation Lookaside Buffer). The first production machines to implement a TLB might have been the IBM 360/67 of 1965 and the GE 645 of 1967, although the latter might still be counted as SST built for the project.
下一步是虚拟寻址,允许将任意内存区域置于有限的磁芯中,同时将其余部分存放在比磁鼓更大但更廉价的介质上,即磁盘。其基础是 TLB(转译后备缓冲区)。首批实现 TLB 的生产机器可能是 1965 年的 IBM 360/67 和 1967 年的 GE 645,尽管后者可能仍被视为为该项目专门制造的 SST。
Core was, at the time, with below 1 µs access, already incredibly fast, but semiconductor memory became a possibility soon after. This created the same opportunity of speed increase as with Core vs. Drum.
当时的磁芯存储器访问时间低于 1 µs,已极为快速,但不久后半导体存储器成为可能。这创造了与磁芯对磁鼓相同的加速机遇。
In January 1968 IBM introduced the /360 Model 85 providing Cache as we know it today. The first units were eventually delivered in December of 1969. The Model 85 used the 2385 Processor Storage in 2 or 4 way configuration with 512 KiB to 4 MiB and 960 ns cycle time, plus a 32 KiB cache at 240 ns.
1968 年 1 月,IBM 推出了 /360 Model 85,提供了我们今天所知的缓存。 首批设备最终于 1969 年 12 月交付。Model 85 采用 2385 处理器存储器,以 2 路或 4 路配置,容量 512 KiB 至 4 MiB,周期时间 960 ns,另加 32 KiB 缓存,访问时间 240 ns。
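As a rough worked example of why a small fast cache pays off, the two cycle times quoted above (240 ns cache, 960 ns main storage) can be plugged into the usual single-level average-access-time formula. This is a sketch only: the function name `effective_access_ns` is made up here, the hit rates are assumed values rather than measurements of the Model 85, and the model ignores overlap, write traffic and the cache lookup time spent on a miss.

```python
def effective_access_ns(hit_rate, cache_ns, memory_ns):
    """Average access time for one cache level: hits cost cache_ns, misses memory_ns."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns


# Assumed hit rates, for illustration only.
results = {h: effective_access_ns(h, cache_ns=240, memory_ns=960)
           for h in (0.80, 0.90, 0.95)}

for hit_rate, ns in results.items():
    print(f"hit rate {hit_rate:.0%}: about {ns:.0f} ns on average")
```

Under these assumptions, a 90% hit rate brings the average to roughly a third of the main storage cycle time, even though the cache is tiny compared with main storage.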
The Model 85 became especially influential thanks to John Liptay's description of its cache design in the article Structural aspects of the System/360 Model 85 --- Part II The cache, published in IBM's Systems Journal Vol. 7 No. 1 of March 1968, p. 15-21.
Model 85 之所以影响深远,特别是因为 John Liptay 的文章 Structural aspects of the System/360 Model 85 --- Part II The cache 对其缓存设计进行了描述,该文发表于 IBM Systems Journal 第 7 卷第 1 期(1968 年 3 月),第 15--21 页。
In fact, IBM did beat themselves to market in August 1969, when they delivered the /360 Model 195, which also included a 32 KiB cache and pushed memory speed to the extreme: four megabytes of core at 754 ns, one megabyte of thin-film memory at 120 ns, and 32 KiB of semiconductor RAM acting as cache at 54 ns.
事实上,IBM 在 1969 年 8 月交付 /360 Model 195 时,抢在了自己前面,该机型同样配备 32 KiB 缓存,将内存速度推向极致:4 MiB 磁芯存储器 754 ns,1 MiB 薄膜存储器 120 ns,32 KiB 半导体 RAM 作为缓存 54 ns。
The 195 was the fastest general purpose computer at its time, only beaten in pure FP power by Cray's CDC6600. The Model 195 performance was comparable to an early 1990s Pentium.
195 是当时最快的通用计算机,仅在纯浮点运算能力上被 Cray 的 CDC6600 超越。Model 195 的性能可与 1990 年代初的 Pentium 相媲美。
Now, for the x86 timeline, the 486 was the first (Intel) implementation with an on-chip cache. Caches have been used before for 286 and 386 systems as well. For the 386 Intel even offered a dedicated cache controller, the 82385, which has been used in a series of motherboards.
就 x86 时间线而言,486 是首个集成片上缓存的 Intel 实现。286 和 386 系统此前也已使用缓存。针对 386,Intel 甚至提供了专用缓存控制器 82385,用于一系列主板。
After all, cache isn't anything special, just logic that keeps some memory contents in a fast RAM and slows the CPU down when the desired content is not within that fast section. As a result even 8-bit systems used cache. The best known examples may be the ZIP Chip and Rocket Chip for the Apple II, both utilizing 8 KiB of static RAM to have a 65C02 run at 4--10 MHz in a standard 1 MHz Apple II. But the idea had already been implemented as early as 1985 by the Speed Demon, running at 3.5 MHz with a 4 KiB cache.
归根结底,缓存并无特殊之处,不过是一种将部分内存保存在快速 RAM 中的逻辑,当所需内容不在该快速区域时使 CPU 降速。因此,即便是 8 位系统也使用了缓存。最著名的例子可能是 Apple II 的 ZIP Chip 和 Rocket Chip,两者均利用 8 KiB 静态 RAM 使 65C02 在标准 1 MHz 的 Apple II 上运行于 4--10 MHz。但早在 1985 年,Speed Demon 就已实现这一点,以 4 KiB 缓存运行于 3.5 MHz。
In 1988 Apple introduced the Apple IIc Plus which essentially included a Zip-Chip based design on the motherboard, running the 1 MHz base system with 4 MHz Cache, making it the most useful Apple II up to date.
1988 年,Apple 推出 Apple IIc Plus,其主板上基本集成了基于 Zip-Chip 的设计,使 1 MHz 基础系统以 4 MHz 缓存运行,成为当时最实用的 Apple II。
Bottom line: 1968/69 would be a safe assumption about first cache architecture as we understand it today delivered in production units. As expected, Cache is way older and way more used than just with the 80486.
底线:1968/69 年可稳妥地视为我们今天所理解的缓存架构首次在生产设备中交付的时间。 正如预期,缓存远比 80486 更为古老且应用广泛。
回答二:Wilkes 的理论奠基
The concept of cache memory was formalised by Maurice Wilkes in his 1965 paper, Slave Memories and Dynamic Storage Allocation. This describes a hierarchical memory setup with a small amount of fast core memory serving a larger amount of slower core memory. It refers to system descriptions of "slave memories" in existing computer designs at the time, the ETL Mk-6 computers and the Atlas 2; these had very small, very-high-speed memories used as instruction caches (the Atlas 2's cache was however never implemented). Wilkes' paper discusses the practicalities of extending the concept to use larger amounts of cache for more general purposes.
缓存存储器的概念由 Maurice Wilkes 在其 1965 年的论文 Slave Memories and Dynamic Storage Allocation 中形式化。该文描述了一种层次化存储器配置,以少量快速磁芯存储器服务于更大容量的慢速磁芯存储器。文中提到了当时现有计算机设计中的"从属存储器"系统描述,即 ETL Mk-6 计算机和 Atlas 2;这些机器拥有极小、极高速的存储器用作指令缓存(然而 Atlas 2 的缓存从未实现)。Wilkes 的论文讨论了将这一概念扩展以使用更大容量缓存来实现更通用目的的实践问题。
It covers many concepts and concerns which will still be familiar to present-day readers: tag bits, cache coherency (which shows up in the paper as the need to write back dirty words in the cache on program switches), associativity...
该文涵盖了许多当今读者依然熟悉的概念和问题:标记位、缓存一致性(在论文中体现为程序切换时需将缓存中的脏字写回)、相联度......
The usefulness of cache memories quickly spread, and even an overview of their history and development would be quite long. One could start by looking at the citations of the Wilkes paper, and other articles published in the 60s and 70s such as DJ Kuck and DH Lawrie's The use and performance of memory hierarchies: A survey (which features an extensive bibliography).
缓存存储器的实用性迅速传播,即便仅概述其历史与发展也相当冗长。可从查阅 Wilkes 论文的引用,以及 60、70 年代发表的其他文章入手,如 DJ Kuck 和 DH Lawrie 的 The use and performance of memory hierarchies: A survey(包含大量参考文献)。
Caches appeared in general-purpose processors in the following years; early examples include DEC's KL10, based on the Superfoonly design you mention, and the various cache-equipped System/360 models mentioned in Raffzahn's answer.
缓存在随后几年出现在通用处理器中;早期例子包括基于您提到的 Superfoonly 设计的 DEC KL10,以及 Raffzahn 回答中提到的各类配备缓存的 System/360 型号。
It took a while for microprocessors to include cache, for a number of reasons, most importantly the available transistor budget (see Why did Intel abandon unified CPU cache? for some discussion of that), but also the fact that early microprocessors were slow enough that memory accesses weren't necessarily a huge problem.
微处理器花了一段时间才集成缓存,原因众多,最重要的是可用的晶体管预算(参见 Why did Intel abandon unified CPU cache? 的相关讨论),但也因为早期微处理器速度足够慢,内存访问未必构成严重问题。
三、分离缓存的定义与效用
问题
I was working on a Computer Architecture question which mentioned that the cache is a split cache with no hazard --- what exactly does this mean?
我在做一道计算机架构题目,其中提到缓存是分离缓存,且不存在冒险,这究竟是什么意思?
引言
A split cache is a cache that consists of two physically separate parts, where one part, called the instruction cache, is dedicated for holding instructions and the other, called the data cache, is dedicated for holding data (i.e., instruction memory operands). Both of the instruction cache and data cache are logically considered to be a single cache, described as a split cache, because both are hardware-managed caches for the same physical address space at the same level of the memory hierarchy. Instruction fetch requests are handled only by the instruction cache and memory operand read and write requests are handled only by the data cache. A cache that is not split is called a unified cache.
分离缓存由两个物理上独立的部分组成,其中一部分称为指令缓存,专用于存储指令;另一部分称为数据缓存,专用于存储数据(即指令的内存操作数)。指令缓存与数据缓存逻辑上被视为单一缓存,称为分离缓存,因为两者均为同一物理地址空间、同一存储层次级别的硬件管理缓存。指令取指请求仅由指令缓存处理,内存操作数的读写请求仅由数据缓存处理。未分离的缓存称为统一缓存。
The Harvard vs. von Neumann architecture distinction originally applies to main memory. However, most modern computer systems implement the modified Harvard architecture whereby the L1 cache implements the Harvard architecture and the rest of the memory hierarchy implements the von Neumann architecture. Therefore, in modern systems, the Harvard vs. von Neumann distinction mostly applies to the L1 cache design. That's why the split cache design is also called the Harvard cache design and the unified cache design is also called von Neumann.
哈佛架构与冯·诺依曼架构的区分最初适用于主存储器。然而,大多数现代计算机系统实现改进型哈佛架构,其中 L1 缓存采用哈佛架构,存储层次结构的其余部分采用冯·诺依曼架构。因此,在现代系统中,哈佛与冯·诺依曼的区分主要适用于 L1 缓存设计。这就是为什么分离缓存设计也被称为哈佛缓存设计,而统一缓存设计也被称为冯·诺依曼缓存设计。
To my knowledge, the idea of the split cache design was first proposed and evaluated by James Bell, David Casasent, and C. Gordon Bell in their paper entitled An Investigation of Alternative Cache Organizations, which was published in 1974 in the IEEE TC journal. The authors found using a simulator that, for almost all cache capacities considered in the study, an equal split results in the best performance.
据我所知,分离缓存设计的思想最早由 James Bell、David Casasent 和 C. Gordon Bell 在其论文 An Investigation of Alternative Cache Organizations 中提出并评估,该文发表于 1974 年的 IEEE TC 期刊。作者使用模拟器发现,对于研究中考虑的几乎所有缓存容量,均等分割均可获得最佳性能。
Typically, the best performance occurs with half of the cache devoted to instructions and half to data.
通常,最佳性能出现在缓存的一半用于指令、另一半用于数据时。
They also provided a comparison with a unified cache design of the same capacity and their initial conclusion was that the split design has no advantage over the unified design.
他们还与相同容量的统一缓存设计进行了比较,初步结论是分离设计相对统一设计并无优势。
As shown in Fig. 6, the performance of the best dedicated cache CUXD (half allotted to instructions and half to data) in general is quite similar to that of a homogeneous cache (CUX); the extra complexity of a dedicated cache control is thus not justifiable.
如图 6 所示,最佳专用缓存 CUXD(一半分配给指令、一半分配给数据)的性能总体上与同质缓存(CUX)相当;因此,专用缓存控制的额外复杂度是不合理的。
It appears to me from Alan's paper that the first processor to use the split cache design was the IBM 801, around 1975, and that the second was probably the S-1 (around 1976). The engineers of these processors may have come up with the split design idea independently.
从 Alan 的论文来看,首个采用分离缓存设计的处理器是 1975 年左右的 IBM 801,第二个可能是 1976 年左右的 S-1。这些处理器的工程师可能独立提出了分离设计的思想。
分离缓存设计的优势
The split cache design was then extensively studied in the next two decades. But it was quickly recognized that the split design is useful for pipelined processors where the instruction fetch unit and the memory access unit are physically located in different parts of the chip. With the unified design, it is impossible to place the cache simultaneously close to the instruction fetch unit and the memory unit, resulting in high cache access latency from one or both units. The split design enables us to place the instruction cache close to the instruction fetch unit and the data cache close to the memory unit, thereby simultaneously reducing the latencies of both. This is the primary advantage of the split design over the unified design.
分离缓存设计在随后二十年间得到广泛研究。但人们很快认识到,分离设计对流水线处理器十分有用,这类处理器的指令取指单元与内存访问单元物理上位于芯片的不同区域。采用统一设计时,无法同时将缓存置于靠近指令取指单元和内存单元的位置,导致其中一个或两个单元的缓存访问延迟较高。分离设计使我们能够将指令缓存置于靠近指令取指单元的位置,将数据缓存置于靠近内存单元的位置,从而同时降低两者的延迟。这是分离设计相对于统一设计的首要优势。
Another advantage of the split design is that it allows instruction and data accesses to occur in parallel without contention. Essentially, a split cache can have double the bandwidth of a unified cache. This improves performance in pipelined processors because instruction and data accesses can occur in the same cycle in different stages of the pipeline.
分离设计的另一优势在于允许指令访问与数据访问并行发生而无争用。本质上,分离缓存可拥有统一缓存两倍的带宽。这提升了流水线处理器的性能,因为指令访问与数据访问可在流水线的不同阶段于同一周期内发生。
Alternatively, the bandwidth of a unified cache can be doubled or improved using multiple access ports or multiple banks. In fact, using two ports provides twice the bandwidth to the whole cache (in contrast, in the split design, the bandwidth is split in half between the instruction cache and the data cache), but adding another port is more expensive in terms of area and power and may impact latency.
或者,统一缓存的带宽可通过多访问端口或多存储体来加倍或改善。事实上,使用两个端口可为整个缓存提供两倍带宽(相比之下,分离设计中带宽在指令缓存与数据缓存之间均分),但增加端口在面积和功耗方面代价更高,且可能影响延迟。
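The bandwidth point can be illustrated with a very small cycle-count model. It is a sketch under stated assumptions: the helper `cycles_needed` is hypothetical, every instruction needs exactly one fetch and at most one data access, a miss never happens, and the one-in-three memory-instruction mix is invented. A split cache is modelled as two accesses per cycle (one fetch plus one data access), which is also what a dual-ported unified cache would offer.

```python
def cycles_needed(needs_data, ports):
    """Cache cycles for a stream of instructions.

    Each instruction needs one fetch; entries that are True also need one data
    access. Up to `ports` accesses can be served in the same cycle.
    """
    cycles = 0
    for data in needs_data:
        requests = 2 if data else 1
        cycles += -(-requests // ports)   # ceiling division
    return cycles


# Assumed mix: one instruction in three touches memory.
stream = [True, False, False] * 100

single_ported_unified = cycles_needed(stream, ports=1)
split_or_dual_ported = cycles_needed(stream, ports=2)
print(single_ported_unified, split_or_dual_ported)  # prints: 400 300
```

The single-ported unified cache loses one cycle for every instruction that also accesses data, which is the contention the split design removes without paying for a second port on one large array.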
Another potential advantage is that the split design allows us to employ different replacement policies for the instruction cache and data cache that may be more suitable for the access patterns of each cache. All Intel Itanium processors use the LRU policy for the L1I and the NRU policy for the L1D. Moreover, starting with Itanium 9500, the L1 ITLB uses NRU but the L1 DTLB uses LRU.
另一潜在优势是,分离设计允许为指令缓存和数据缓存采用不同的替换策略,这些策略可能更适合各自的访问模式。所有 Intel Itanium 处理器对 L1I 采用 LRU 策略,对 L1D 采用 NRU 策略。此外,自 Itanium 9500 起,L1 ITLB 采用 NRU,而 L1 DTLB 采用 LRU。
The last sub-section mentions that the Mark I machine uses different memory technologies for the instruction and data memories. This made me wonder whether this could constitute an advantage for the split design in modern computer systems.
最后一小节提到 Mark I 机器对指令存储器和数据存储器使用不同的存储技术。这使我思考,在现代计算机系统中这是否可构成分离设计的优势。
- LASIC: Loop-Aware Sleepy Instruction Caches Based on STT-RAM Technology: The instruction cache is mostly read-only, except when there is a miss, in which case the line must be fetched and filled into the cache. This means that, when using STT-RAM (or really any other NVRAM technology), the expensive write operations occur less frequently compared to using STT-RAM for the data cache.
  LASIC:基于 STT-RAM 技术的循环感知休眠指令缓存:指令缓存大多为只读,仅在缺失时需要取指并填充到缓存中。这意味着使用 STT-RAM(或任何其他 NVRAM 技术)时,昂贵的写操作频率低于将 STT-RAM 用于数据缓存的情况。
- Feasibility exploration of NVM based I-cache through MSHR enhancements: This paper also proposes using STT-RAM for the instruction cache while the data cache and the L2 cache remain based on SRAM.
  通过 MSHR 增强实现基于 NVM 的 I-cache 可行性探索:该文同样提议将 STT-RAM 用于指令缓存,而数据缓存和 L2 缓存保持基于 SRAM。
So I think we can say that one advantage of the split design is that we can use different memory technologies for the instruction and data caches.
因此我认为可以说,分离设计的优势之一在于可为指令缓存和数据缓存采用不同的存储技术。
分离缓存设计的劣势
The split design has its problems, though. First, the combined space of the instruction and data caches may not be efficiently utilized. A cache line that contains both instructions and data may exist in both caches at the same time. In contrast, in a unified cache, only a single copy of the line would exist in the cache. In addition, the size of the instruction cache and/or the data cache may not be optimal for all applications or different phases of the same application. Simulations have shown that a unified cache of the same total size has a higher hit rate. This is the primary disadvantage of the split design.
然而分离设计也存在问题。首先,指令缓存与数据缓存的合并空间可能未被高效利用。同时包含指令和数据的缓存行可能同时存在于两个缓存中。相比之下,在统一缓存中,该缓存行仅存在单一副本。此外,指令缓存和/或数据缓存的容量对所有应用或同一应用的不同阶段未必最优。模拟表明,相同总容量的统一缓存具有更高的命中率。这是分离设计的主要劣势。
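The utilization problem described above can be sketched with a fully associative LRU model. All specifics are assumptions chosen for illustration: the class `LRUCache`, the sizes (64 + 64 lines split versus 128 lines unified), and a synthetic loop whose code footprint (96 lines) exceeds the instruction cache while its data footprint (16 lines) leaves most of the data cache idle. The workload is deliberately unfavourable to the split design and says nothing about typical programs.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative LRU cache that only counts hits and misses."""

    def __init__(self, lines):
        self.lines, self.store = lines, OrderedDict()
        self.hits = self.misses = 0

    def access(self, key):
        if key in self.store:
            self.store.move_to_end(key)          # most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(self.store) >= self.lines:
                self.store.popitem(last=False)   # evict least recently used
            self.store[key] = True


def hit_rate(*caches):
    hits = sum(c.hits for c in caches)
    return hits / (hits + sum(c.misses for c in caches))


# Synthetic loop: 96 code lines and 16 data lines per iteration (assumed).
trace = []
for _ in range(50):
    for i in range(96):
        trace.append(("I", i))
        if i % 6 == 0:
            trace.append(("D", i // 6))

unified = LRUCache(128)
split_i, split_d = LRUCache(64), LRUCache(64)
for kind, line in trace:
    unified.access((kind, line))
    (split_i if kind == "I" else split_d).access((kind, line))

unified_rate = hit_rate(unified)
split_rate = hit_rate(split_i, split_d)
print(f"unified: {unified_rate:.2f}, split: {split_rate:.2f}")  # prints: unified: 0.98, split: 0.14
```

Here the unified cache lets the code borrow the space the data does not need, whereas the split instruction cache thrashes while three quarters of the data cache sits empty.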
Second, self-modifying code leads to consistency issues that need to be considered at the microarchitecture-level and/or software-level. Maintaining instruction consistency requires more logic and has a higher performance impact in the split design than the unified one.
其次,自修改代码导致一致性问题,需在微架构层面和/或软件层面予以考虑。在分离设计中,维护指令一致性需要更多逻辑,且对性能的影响大于统一设计。
Third, the design and hardware complexity of a split cache compared against a single-ported unified cache, a fully dual-ported unified cache, and dual-ported banked cache of the same overall organization parameters is an important consideration. According to the cache area model proposed in CACTI 3.0: An Integrated Cache Timing, Power, and Area Model, the fully dual-ported design has the biggest area.
第三,与单端口统一缓存、全双端口统一缓存以及相同总体组织参数的双端口分体缓存相比,分离缓存的设计与硬件复杂度是一个重要考量。根据 CACTI 3.0: An Integrated Cache Timing, Power, and Area Model 提出的缓存面积模型,全双端口设计的面积最大。
实际处理器中的统一 L1 与分离 L2 缓存
I'm not aware of any processor designed in the last 15 years that has a unified (L1) cache. In modern processors, the unified design is mostly used for higher-numbered cache levels, which makes sense because they are not directly connected to the pipeline.
据我所知,近 15 年来设计的处理器中没有采用统一 L1 缓存的。在现代处理器中,统一设计主要用于更高层级的缓存,这是合理的,因为它们不直接连接到流水线。
An interesting example where the L2 cache follows the split design is the Intel Itanium 2 9000 processor. This processor has a 3-level cache hierarchy where both the L1 and L2 caches are split and private to each core and the L3 cache is unified and shared between all the cores. The L2D and L2I caches are 256 KB and 1 MB in size, respectively.
一个有趣的例子是 Intel Itanium 2 9000 处理器,其 L2 缓存采用分离设计。该处理器具有 3 级缓存层次结构,其中 L1 和 L2 缓存均为分离式且每核心私有,L3 缓存为统一式且所有核心共享。L2D 和 L2I 缓存的容量分别为 256 KB 和 1 MB。
The separate instruction and data L2 caches provide more efficient access to the caches compared to Itanium 2 processors where instruction requests would contend against data accesses for L2 bandwidth and potentially impact core execution as well as L2 throughput.
与 Itanium 2 处理器相比,独立的指令和数据 L2 缓存提供了更高效的缓存访问------在后者中,指令请求会与数据访问争夺 L2 带宽,并可能影响核心执行以及 L2 吞吐量。
统一 L1 缓存分区
James Bell et al. mentioned in their 1974 paper the idea of partitioning a unified cache between instructions and data. The only paper that I'm aware of that proposed and evaluated such a design is Virtually Split Cache: An Efficient Mechanism to Distribute Instructions and Data , which was published in 2013.
James Bell 等人在其 1974 年的论文中提到了在指令和数据之间划分统一缓存的思想。据我所知,唯一提出并评估此类设计的论文是 2013 年发表的 Virtually Split Cache: An Efficient Mechanism to Distribute Instructions and Data。
The main disadvantage of the split design is that one of the L1 caches may be underutilized while the other may be over-utilized. A split cache doesn't allow one cache to essentially take space from the other when needed. It is for this reason that the unified design has a lower L1 miss rate than the overall miss rate of the split caches (as the paper shows using simulation).
分离设计的主要劣势在于,其中一个 L1 缓存可能利用不足,而另一个可能过度利用。分离缓存不允许一个缓存在需要时从另一个缓存获取空间。正是由于这一原因,统一设计的 L1 缺失率低于分离缓存的总体缺失率(如该论文通过模拟所示)。
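The utilization argument above can be tried on a toy model. The sketch below is not taken from the cited paper: it assumes fully associative LRU caches, and the trace (a 2-line instruction loop plus a 6-line cyclic data set) is contrived so that a 4+4 split thrashes while a unified cache of 8 lines holds everything. Other workloads can behave differently.
上述利用率论点可以用一个玩具模型来验证。以下示意并非出自所引论文:它假设缓存为全相联且采用 LRU 替换,访问序列(2 行指令循环加 6 行循环访问的数据)是刻意构造的,使 4+4 的分离缓存发生抖动,而 8 行的统一缓存恰好容纳全部内容。其他负载的表现可能不同。

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache of `lines` entries with LRU replacement."""

    def __init__(self, lines):
        self.lines = lines
        self.entries = OrderedDict()
        self.misses = 0

    def access(self, line):
        if line in self.entries:
            self.entries.move_to_end(line)  # mark as most recently used
            return
        self.misses += 1
        if len(self.entries) == self.lines:
            self.entries.popitem(last=False)  # evict the least recently used line
        self.entries[line] = True


def make_trace(iterations):
    """Hypothetical workload: 2 instruction lines, 6 data lines, interleaved."""
    trace = []
    for _ in range(iterations):
        for step in range(6):
            trace.append(("I", step % 2))
            trace.append(("D", step))
    return trace


def count_misses(trace, total_lines):
    """Return (unified misses, split misses) for the same total capacity."""
    unified = LRUCache(total_lines)
    icache = LRUCache(total_lines // 2)
    dcache = LRUCache(total_lines // 2)
    for kind, line in trace:
        unified.access((kind, line))
        (icache if kind == "I" else dcache).access(line)
    return unified.misses, icache.misses + dcache.misses


unified_misses, split_misses = count_misses(make_trace(10), total_lines=8)
print(unified_misses, split_misses)  # 8 62
```

In this trace the data side needs six lines but the split design gives it only four, so every data access misses, while half of the instruction cache sits idle.
在该序列中,数据侧需要 6 行而分离设计只给它 4 行,因此每次数据访问都缺失,同时指令缓存有一半处于闲置。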
The Virtually Split Cache (VSC) design is the middle point between the split and unified designs. The VSC dynamically partitions (way-wise) the L1 cache between instructions and data depending on demand. This enables better utilization of the L1 cache, similar to the unified design. However, the VSC has an even lower miss rate because partitioning reduces potential space conflict between lines holding instructions and lines holding data.
虚拟分离缓存(VSC)设计是分离设计与统一设计之间的中间方案。VSC 根据需求动态地(以路为单位)在指令和数据之间划分 L1 缓存。这使得 L1 缓存得以更好地利用,类似于统一设计。然而,VSC 的缺失率更低,因为划分减少了存放指令的行与存放数据的行之间的潜在空间冲突。
Z80000 统一缓存案例
The Zilog Z80000 processor has a scalar 6-stage pipeline with an on-chip single-ported unified cache. The cache is 16-way fully associative and sectored. Each stage of the pipeline takes at least two clock cycles. Each pair of consecutive clock cycles constitutes a processor cycle.
Zilog Z80000 处理器具有标量 6 级流水线,配备片上单端口统一缓存。该缓存为 16 路全相联且分扇区。流水线的每一级至少占用两个时钟周期。每对连续的时钟周期构成一个处理器周期。
There can be up to two cache accesses in a single processor cycle, including up to one instruction fetch and up to one data access. However, the cache, despite being unified and single-ported, is designed in such a way as to have no contention between instruction fetches and data accesses. The unified cache has an access latency of a single clock cycle (which is equal to half a processor cycle). In each processor cycle, an instruction fetch is performed in the first clock cycle and a data access is performed in the second clock cycle. There is no latency benefit from splitting the cache in this case: time-multiplexing accesses to the cache provides the same bandwidth, and the downsides of the split design are avoided.
单个处理器周期内最多可进行两次缓存访问,包括最多一次指令取指和最多一次数据访问。然而,该缓存尽管是统一且单端口的,但其设计方式使得指令取指与数据访问之间不存在争用。统一缓存的访问延迟为单个时钟周期(等于半个处理器周期)。在每个处理器周期中,第一个时钟周期执行指令取指,第二个时钟周期执行数据访问。在这种情况下,分离缓存没有延迟优势,而时分复用访问缓存提供了相同的带宽,且不存在分离设计的缺点。
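The time-multiplexing described above can be pictured as a fixed slot assignment on the single cache port. The function name and request format below are hypothetical and only illustrate the schedule; they do not model the real Z80000 pipeline.
上述时分复用可以理解为对单个缓存端口的固定时隙分配。以下函数名和请求格式均为假设,仅用于示意该调度方式,并不模拟真实的 Z80000 流水线。

```python
def port_timeline(requests):
    """Map per-processor-cycle requests onto clock slots of a single-ported cache.

    `requests` is a list of (wants_fetch, wants_data) pairs, one per processor
    cycle. Clock 2*n is reserved for the instruction fetch and clock 2*n + 1
    for the data access, so the two kinds of access never compete for the port.
    """
    timeline = []
    for cycle, (wants_fetch, wants_data) in enumerate(requests):
        timeline.append((2 * cycle, "ifetch" if wants_fetch else "idle"))
        timeline.append((2 * cycle + 1, "data" if wants_data else "idle"))
    return timeline


timeline = port_timeline([(True, True), (True, False), (False, True)])
for clock, use in timeline:
    print(clock, use)
```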
R3000、80486 与 Pentium P5 的案例
The 80386 does have an external cache controller with an external unified cache. However, just because the cache is external doesn't necessarily mean that it's likely to be unified. Consider the R3000 processor, which was released three years after 80386 and is of the same generation as the 80486. The designers of R3000 opted for a large external cache instead of a small on-chip cache to improve performance.
80386 确实配备了外部缓存控制器和外部统一缓存。然而,缓存为外部并不意味着它必然是统一的。考虑 R3000 处理器,它在 80386 三年后发布,与 80486 同属一代。R3000 的设计者选择使用大型外部缓存而非小型片上缓存以提升性能。
The first section of Chapter 1 of the R3000 Software Reference Manual says that the external cache uses the split design so that it can perform an instruction fetch and a read or write data access in the same "clock phase." It's not clear to me how this exactly works though. My understanding is that the external data and address buses are shared between the two caches and with memory as well. Both caches are direct-mapped, maybe to achieve a single-cycle access latency.
R3000 软件参考手册第 1 章第 1 节指出,外部缓存采用分离设计,以便在同一"时钟相位"内执行指令取指和读/写数据访问。但我并不清楚这究竟如何运作。我的理解是,外部数据总线和地址总线由两个缓存以及内存共享。两个缓存均为直接映射,可能是为了实现单周期访问延迟。
According to the Intel paper titled The i486 CPU: executing instructions in one clock cycle , Intel evaluated both designs and deliberately chose to go for the unified on-chip design. Compared to the same-gen R3000, both processors have similar frequency ranges and the off-chip data width is 32 bits in both processors. However, the unified cache of the 80486 is much smaller than the total cache capacity of the R3000 (up to 16 KB vs. up to 256 KB+256 KB). On the other hand, being on-chip made it more feasible for the 80486 to have wider cache buses.
根据 Intel 论文 The i486 CPU: executing instructions in one clock cycle,Intel 评估了两种设计并刻意选择了统一片上设计。与同代的 R3000 相比,两款处理器的频率范围相似,且片外数据宽度均为 32 位。然而,80486 的统一缓存远小于 R3000 的总缓存容量(最多 16 KB 对比最多 256 KB+256 KB)。另一方面,片上集成使 80486 拥有更宽的缓存总线更为可行。
In particular, the 80486 cache has a 16-byte instruction fetch bus, a 4-byte data load bus, and a 4-byte data load/store bus. The two data buses could be used at the same time to load a single 8-byte operand (double-precision FP operand or segment descriptor) in one access. The R3000 caches share a single 4-byte bus.
具体而言,80486 缓存拥有 16 字节指令取指总线、4 字节数据加载总线和 4 字节数据加载/存储总线。两条数据总线可同时使用,以一次访问加载单个 8 字节操作数(双精度浮点操作数或段描述符)。R3000 缓存共享一条 4 字节总线。
The relatively small size of the 80486 cache may have allowed making it 4-way associative with a single-cycle latency. This means that a load instruction that hits in the cache can supply the data to a dependent instruction in the next cycle without any stalls. On the R3000, if an instruction depends on an immediately preceding load instruction, it has to stall for one cycle in the best-case scenario of a cache hit.
80486 缓存相对较小的容量可能使其能够以 4 路组相联实现单周期延迟。这意味着命中缓存的加载指令可在下一周期为依赖指令提供数据,无需任何停顿。在 R3000 上,若一条指令依赖于紧前的一条加载指令,即使在缓存命中的最佳情况下,也必须停顿一个周期。
The 80486 cache is single-ported, but the instruction prefetch buffer and the wide 16-byte instruction fetch bus help keep contention between instruction fetches and data accesses to a minimum. Intel mentions that simulation results show that the unified design provides a hit rate that is higher than that of a split cache by enough to compensate for the bandwidth contention.
80486 缓存为单端口,但指令预取缓冲和 16 字节宽的指令取指总线有助于将指令取指与数据访问之间的争用降至最低。Intel 提到,模拟结果表明统一设计提供的命中率高于分离缓存,足以补偿带宽争用。
Intel explained in another paper titled Design of the Intel Pentium processor why they decided to change the cache in the Pentium to split. There are two reasons: (1) The 2-wide superscalar Pentium requires the ability to perform up to two data accesses in a single cycle, and (2) Branch prediction increases cache bandwidth demand.
Intel 在另一篇论文 Design of the Intel Pentium processor 中解释了为何决定将 Pentium 的缓存改为分离式。原因有二:(1) 双发射超标量 Pentium 需要在一个周期内执行最多两次数据访问的能力;(2) 分支预测增加了缓存带宽需求。
四、Intel x86 处理器谱系与微架构演进
问题
My simplified understanding of the evolution of the Intel processors over the last 20 years is that the Pentium II and Pentium III architectures were sort of "dead-ends", and today's Intel processors were built on an earlier design introduced with the Pentium (P5) in the early-90s. My understanding is the P5 core was enhanced to support the x86-64 instruction set "borrowed" from AMD and multi-threading, and they went to multiple cores on die.
我对近 20 年 Intel 处理器演进的简化理解是:Pentium II 和 Pentium III 架构某种程度上是"死胡同",而当今的 Intel 处理器建立在 1990 年代初 Pentium(P5)引入的更早设计之上。我的理解是 P5 核心经过增强以支持从 AMD"借鉴"的 x86-64 指令集和多线程,并发展为片上多核。
I am looking for validation or correction of my (obviously simple) understanding here, as well as more details on the hows & whys for Intel to pursue this strategy? Additionally, why were the PII/PIII approaches a "bust"?
我希望得到对这一(显然过于简化的)理解的验证或纠正,以及 Intel 采取这一策略的具体方式和原因的更多细节?此外,为何 PII/PIII 的方法是"失败"的?
回答一:从流水线到乱序执行
Prior to the Pentium, Intel CPUs were pipelined: different parts of the CPU would simultaneously be working on different operations, but the different parts were designed to work in sequence, with every operation proceeding through all parts.
在 Pentium 之前,Intel CPU 采用流水线设计:CPU 的不同部分同时处理不同操作,但各部分按顺序工作,每条指令依次流经所有部分。
The Pentium expanded on that by being superscalar. Rather than there being only exactly one instruction at each stage of execution at any given time, for a certain period there are two pipelines --- one that can perform any defined operation and one that can perform only the simpler operations. If two instructions are independent and can be dispatched down different pipes then they will be.
Pentium 通过超标量设计扩展了流水线。在特定时段内,执行阶段的每一级不再只有一条指令,而是存在两条流水线------一条可执行任何定义的操作,另一条仅能执行较简单的操作。若两条指令相互独立且可分发到不同流水线,则它们将被并行执行。
The Pentium Pro expands on that by introducing out-of-order execution. Rather than only two pipes, with instructions being considered for potential parallel execution in the order they come, there are a bunch of slots for incoming instructions. An instruction that has no remaining preceding dependencies then proceeds through an execution path, its slot being taken by the next instruction in the stream. That way the Pentium Pro can potentially perform three instructions simultaneously, and is more likely to do so than if it required that all three be sequential and independent.
Pentium Pro 通过引入乱序执行进一步扩展。不再是仅有两条流水线、按指令到达顺序考虑潜在并行执行,而是为进入的指令设置一组槽位。无剩余前置依赖的指令随即进入执行路径,其槽位由指令流中的下一条指令占据。这样 Pentium Pro 可同时执行最多三条指令,且比要求三条指令顺序且独立的条件更容易实现这一点。
(Core 2 and later widened the pipeline to a width of 4 instructions. Actually decoded uops, so complex instructions take more slots. Skylake can decode more than 4 instructions per cycle, but the narrowest bottleneck (issue/rename) is still only 4 uops wide.)
(Core 2 及后续产品将流水线宽度扩展至 4 条指令。实际上是解码后的微操作,因此复杂指令占用更多槽位。Skylake 每周期可解码超过 4 条指令,但最窄的瓶颈(发射/重命名)仍仅为 4 个微操作宽。)
There's a problem though: it's now a branching thing with a few places where you might wait a while, but a processor is still a pipeline. Instructions enter at one end, go through n intermediate steps, then finally are considered done. But what happens if there's a branch? In the worst case, if the processor is currently working on m instructions and a branch occurs, it needs to throw away all the work it did on m-1 of them, then start refilling its pipeline from the point of entry, with the various stages after fetching being empty until something new flows into them. That's called a pipeline stall.
然而存在一个问题:如今的流水线虽有分叉,且有若干可能需要等待一段时间的位置,但处理器本质上仍是流水线。指令从一端进入,经过 n 个中间步骤,最终完成。但如果出现分支会怎样?最坏情况下,若处理器正在处理 m 条指令时发生分支,它需要丢弃其中 m-1 条的所有工作,然后从入口处重新填充流水线,取指之后的各级在新指令流入前均为空。这称为流水线停顿。
The Pentium IV took a misstep in gambling on a really long pipeline. The thinking was that a longer pipeline has a greater cost when a stall occurs but gives small, simple individual parts, which are easier to scale up the clock rate on. So what you lose on stalls you more than gain on throughput.
Pentium IV 在超长流水线上赌错了。当时的思路是:流水线越长,停顿时代价越大,但各个部分更小更简单,更容易提升时钟频率。因此在停顿上的损失可从吞吐量上获得更多补偿。
Unfortunately, increasing the clock also increases the heat output, and process improvements didn't turn up quickly enough to correct for that. So it was difficult to scale, and the gamble didn't pay off.
不幸的是,提高时钟频率也增加了热量输出,而工艺改进未能及时跟上以纠正这一问题。因此难以扩展,这场赌博未能获得回报。
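The gamble can be put into rough numbers with a simple cycles-per-instruction model. Every figure below (branch fraction, misprediction rate, penalties, clock rates) is hypothetical and chosen only to show the shape of the trade-off; none of them is a measurement of a real Pentium III or Pentium 4.
可以用一个简单的每指令周期数模型对这场赌博做粗略估算。以下所有数值(分支比例、预测失败率、代价、时钟频率)均为假设,仅用于展示权衡的形态,并非任何真实 Pentium III 或 Pentium 4 的测量结果。

```python
def time_per_instruction_ns(clock_ghz, flush_penalty_cycles,
                            branch_fraction=0.2, mispredict_rate=0.1,
                            base_cpi=1.0):
    """Average time per instruction when mispredicted branches flush the pipeline."""
    cpi = base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles
    return cpi / clock_ghz


# Hypothetical short pipeline: 10-cycle flush penalty at 1.0 GHz.
short_pipe = time_per_instruction_ns(clock_ghz=1.0, flush_penalty_cycles=10)
# Hypothetical long pipeline, assuming the clock scales as hoped (1.5 GHz).
long_hoped = time_per_instruction_ns(clock_ghz=1.5, flush_penalty_cycles=20)
# Same long pipeline, assuming heat limits the clock to 1.1 GHz.
long_limited = time_per_instruction_ns(clock_ghz=1.1, flush_penalty_cycles=20)

print(f"short: {short_pipe:.3f} ns")
print(f"long, clock scales: {long_hoped:.3f} ns")
print(f"long, clock limited: {long_limited:.3f} ns")
```

Under these assumptions the long pipeline wins only while the clock keeps rising; once the clock is capped, the larger flush penalty leaves it slower than the short pipeline.
在这些假设下,只有当时钟频率持续提升时长流水线才占优;一旦频率受限,更大的清空代价会使其慢于短流水线。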
So Intel switched back from the microarchitecture they'd picked for the Pentium IV to the much-developed version of that which had originated in the Pentium Pro, and doubled down on instruction set extensions (both improving the vector stuff and the 64-bit transition, though the switch back put the latter temporarily on hold) and parallelism as software-side drivers of processing improvements, to bolster incremental improvements to each core.
因此 Intel 从 Pentium IV 选择的微架构退回到源自 Pentium Pro 且已大幅发展的版本,并加倍投入指令集扩展(既改进向量运算也推进 64 位过渡,尽管回归暂时搁置了后者)以及作为软件端处理改进驱动力的并行性,以支撑每核心的渐进式改进。
But what they've definitely abandoned is a long pipeline and the expectation of process improvements driving clock speed improvements.
但他们明确放弃的是超长流水线,以及工艺改进驱动时钟频率提升的期望。
回答二:Intel x86 CPU 谱系
Intel x86 CPU Lineages
Intel x86 CPU 系列
There are only a few Intel x86 CPU microarchitectural lineages. Each of these lineages starts with a processor design that was largely made from scratch, incorporating little of any previous CPU's design.
Intel x86 CPU 仅有少数几个微架构谱系。每个谱系均以基本从零开始设计的处理器为起点,很少继承先前 CPU 的设计。
The early microarchitectural lineages were all dead ends. In this stage of x86 CPU design Intel came up with a new microarchitecture for each major new CPU version that completely replaced the old. They are:
早期微架构谱系均为死胡同。在这一 x86 CPU 设计阶段,Intel 为每个主要新版本 CPU 提出全新的微架构,完全取代旧架构。它们包括:
- The original 8086 design, also used largely unchanged in the 8088 and improved in the 80186.
  原始 8086 设计,在 8088 中基本未变使用,在 80186 中得到改进。
- The 80286 design, which added support for protected mode and a 16 megabyte address space.
  80286 设计,增加了对保护模式和 16 MB 地址空间的支持。
- The 80386 design, which added support for 32-bit code and a 4 gigabyte address space.
  80386 设计,增加了对 32 位代码和 4 GB 地址空间的支持。
- The 80486 design, which significantly improved performance through pipelining.
  80486 设计,通过流水线显著提升了性能。
After these processors Intel started to take a more incremental approach to microarchitecture design. Intel would only come up with a completely new design three more times, creating three more lineages of which only one is currently a dead-end:
这些处理器之后,Intel 开始对微架构设计采取更为渐进的方式。Intel 仅再三次提出全新设计,形成三个新谱系,其中目前仅有一个为死胡同:
- The Pentium processor introduced the P5 microarchitecture which was improved on with the Pentium-MMX and much later resurrected to form the basis of the Atom and Xeon Phi CPUs. The major improvement in this design was superscalar dual pipelines.
  Pentium 处理器引入了 P5 微架构,经 Pentium-MMX 改进,并在很久后复活,成为 Atom 和 Xeon Phi CPU 的基础。该设计的主要改进是超标量双流水线。
- The Pentium Pro introduced the P6 microarchitecture, starting the lineage that most Intel CPUs, including all current desktop and server CPUs, have used since. The Pentium II and Pentium III CPUs introduced further improvements of this design. The big improvement with the P6 architecture was out-of-order execution.
  Pentium Pro 引入了 P6 微架构,开启了此后大多数 Intel CPU(包括所有当前桌面和服务器 CPU)使用的谱系。Pentium II 和 Pentium III CPU 引入了该设计的进一步改进。P6 架构的重大改进是乱序执行。
- The Pentium 4 introduced the NetBurst microarchitecture. This lineage is a dead-end. While it was improved on during its lifetime, it's long been abandoned by Intel and not likely to ever be brought back to life like the P5 lineage. Its major improvement was a very long pipeline which allowed for a big jump in clock speeds, but hurt performance overall.
  Pentium 4 引入了 NetBurst 微架构。这一谱系是死胡同。尽管在其生命周期内得到改进,但早已被 Intel 放弃,且不太可能像 P5 谱系那样复活。其主要改进是超长的流水线,允许时钟频率大幅提升,但总体上损害了性能。
The Failure of NetBurst and the Pentium 4
NetBurst 和奔腾 4 的失败
The NetBurst microarchitecture was a failure for largely one reason: it was designed primarily to win the "megahertz wars". In 2000 Intel's main competitor in the x86 CPU market, AMD, was able to release a 1 GHz CPU before Intel could. Intel struggled to beat AMD. They released a 1.13 GHz Pentium III later that year but had to recall it, while AMD was able to push forward to 1.2 GHz.
NetBurst 微架构的失败主要出于一个原因:它主要是为赢得"兆赫兹战争"而设计的。2000 年,Intel 在 x86 CPU 市场的主要竞争对手 AMD 抢先发布了 1 GHz CPU。Intel 奋力追赶 AMD。同年晚些时候他们发布了 1.13 GHz Pentium III,但不得不召回,而 AMD 已推进到 1.2 GHz。
The new NetBurst microarchitecture let Intel and its Pentium 4 CPUs take the megahertz crown decisively. Its much longer pipeline, double the length of the pipeline used in Pentium III, allowed Intel to consistently and significantly beat AMD in raw clock speed. However, that came with a big and ultimately fatal penalty: longer pipelines meant a longer delay any time the processor was forced to discard the contents of the pipeline.
新的 NetBurst 微架构使 Intel 及其 Pentium 4 CPU 果断夺回了兆赫兹桂冠。其更长的流水线------是 Pentium III 所用流水线长度的两倍------使 Intel 在原始时钟频率上持续且显著地击败 AMD。然而这带来了巨大且最终致命的代价:流水线越长,处理器被迫丢弃流水线内容时的延迟就越长。
Intel hoped that better compilers would produce code tuned to the NetBurst architecture to minimize how often this occurred, but in practice this happened often enough in most code to completely offset the performance improvements gained by the faster clock speeds.
Intel 希望更好的编译器能产生针对 NetBurst 架构优化的代码以最小化此类情况的发生,但实际上在大多数代码中这种情况发生得足够频繁,完全抵消了更快时钟频率带来的性能提升。
While being able to advertise faster CPU speeds gave Intel an advantage over AMD in the desktop markets, the NetBurst design ended up hurting Intel significantly in the more lucrative server market. In the latter market, customers weren't as easily swayed by bigger gigahertz numbers, and even overall performance wasn't all they were interested in. They were also concerned about heat, wanting to pack as many CPUs as they could into their server racks without having to pay for bigger and more expensive cooling systems. On the performance-per-watt metric AMD's lower clocked CPUs were able to significantly beat Intel, and this helped AMD grab about a quarter of the server market at its peak.
虽然能够宣传更快的 CPU 速度使 Intel 在桌面市场相对于 AMD 获得优势,但 NetBurst 设计最终在利润更丰厚的服务器市场严重损害了 Intel。在后一市场中,客户不那么容易被更高的吉赫兹数字所打动,甚至整体性能也并非他们唯一的关注点。他们还关心散热,希望在不必为更大更昂贵的冷却系统付费的情况下,在服务器机架中尽可能多地部署 CPU。在每瓦性能指标上,AMD 较低频率的 CPU 能够显著击败 Intel,这帮助 AMD 在巅峰时期占据了约四分之一的服务器市场。
The rise of the Pentium-M
奔腾-M 的崛起
While it took Intel a fairly long while before they realized the NetBurst microarchitecture was a dead-end, the fact that the power hungry Pentium 4 CPUs were a poor choice for laptops was obvious from the beginning. So they had a team of Intel engineers in Israel come up with a low power CPU design that would work well in mobile computers. Rather than base their design on the Pentium 4 they chose to base it on the already more power efficient, if not as fast, Pentium III CPUs. This resulted in the Pentium-M, a CPU that, while not clocked as fast as Pentium 4, could often beat it in real world performance while generating much less heat.
尽管 Intel 花了相当长的时间才意识到 NetBurst 微架构是死胡同,但高功耗的 Pentium 4 CPU 不适合笔记本电脑这一事实从一开始就显而易见。因此他们让 Intel 以色列的一支工程师团队提出一种低功耗 CPU 设计,以在移动计算机中良好工作。他们没有将设计基于 Pentium 4,而是选择基于已更节能(尽管速度不那么快)的 Pentium III CPU。这催生了 Pentium-M------一款虽然时钟频率不如 Pentium 4 高,但在实际性能上常常能击败后者,同时产生少得多的热量的 CPU。
Around the end of Pentium 4 era the advantages of the Pentium-M were becoming obvious. While Intel only sold it as a mobile CPU for use in laptops and other similar applications, motherboard manufacturers and server makers were increasingly releasing products for desktops and servers applications that used Pentium-M CPUs. If Intel didn't kill off the NetBurst design, the market soon would prefer to buy AMD and Pentium-M CPUs instead.
在 Pentium 4 时代末期,Pentium-M 的优势日益明显。尽管 Intel 仅将其作为移动 CPU 销售用于笔记本电脑等类似应用,但主板厂商和服务器制造商越来越多地推出用于桌面和服务器应用的 Pentium-M CPU 产品。如果 Intel 不终止 NetBurst 设计,市场很快会更倾向于购买 AMD 和 Pentium-M CPU。
While Intel was slow to respond to the threat of AMD or leverage the advantages of their own Pentium-M CPUs, they eventually got the message, first releasing a variant of the Core Duo for servers and then the Core 2 CPUs for both servers and desktops. These new more efficient CPUs were based on the Pentium-M. All Intel's desktop and server CPU designs since then are descended from the Pentium-M and so ultimately the Pentium Pro and its P6 microarchitecture.
尽管 Intel 对 AMD 的威胁反应迟缓,也未能充分利用自身 Pentium-M CPU 的优势,但他们最终领会了市场的信号。首先发布了面向服务器的 Core Duo 变体,然后推出了面向服务器和桌面的 Core 2 CPU。这些新型高效 CPU 基于 Pentium-M。此后 Intel 的所有桌面和服务器 CPU 设计均源自 Pentium-M,因此最终源自 Pentium Pro 及其 P6 微架构。
附录:术语表
| 英文术语 | 中文译名 |
|---|---|
| Unified Cache | 统一缓存 |
| Split Cache | 分离缓存 |
| Instruction Cache (L1I) | 指令缓存 |
| Data Cache (L1D) | 数据缓存 |
| Modified Harvard Architecture | 改进型哈佛架构 |
| Associativity (n-way) | 相联度(n 路) |
| Cache Coherency | 缓存一致性 |
| Pipeline Stall | 流水线停顿 |
| Superscalar | 超标量 |
| Out-of-Order Execution | 乱序执行 |
| Translation Lookaside Buffer (TLB) | 转换检测缓冲器 |
| Self-Modifying Code | 自修改代码 |
| Write-Back / Write-Through | 写回 / 写通 |
| Cache Line | 缓存行 |
| Tag / Index / Offset | 标记 / 索引 / 偏移 |
| Hit Rate / Miss Rate | 命中率 / 缺失率 |
| Thrashing | 抖动 |
| Replacement Policy (LRU/NRU) | 替换策略(最近最少使用/非最近使用) |
| Microarchitecture | 微架构 |
| Die Layout | 芯片布局 |
| Clock Multiplier | 时钟倍频器 |
| Memory Bandwidth | 内存带宽 |
| Structural Hazard | 结构冒险 |
Slave Memories and Dynamic Storage Allocation
从属存储器与动态存储分配
M. V. WILKES
M. V. 威尔克斯
Manuscript received November 30, 1964. The work reported in this note was supported in part by Project MAC, a Massachusetts Institute of Technology, Cambridge, research program, sponsored by the Advanced Research Projects Agency, Dept. of Defense, under Office of Naval Research Contract No. Nonr-4102(01). The author is with the University Mathematical Lab., Cambridge, England.
稿件收稿日期:1964 年 11 月 30 日。本文研究受麻省理工学院 MAC 项目部分资助;该项目由美国国防部高级研究计划局立项、海军研究办公室签约(合同编号 Nonr-4102(01))。作者任职于英国剑桥大学数学实验室。
SUMMARY(摘要)
The use is discussed of a fast core memory of, say, 32 000 words as a slave to a slower core memory of, say, one million words in such a way that in practical cases the effective access time is nearer that of the fast memory than that of the slow memory.
本文探讨采用容量例如为 32 000 字的高速磁芯存储器,作为容量约一百万字的低速磁芯存储器的从属存储;在实际应用场景下,系统有效访问时间更贴近高速存储器的访问时延,而非低速存储器。
INTRODUCTION(引言)
In the hierarchic storage systems used at present, core memories are backed up by magnetic drums or disks which are, in their turn, backed up by magnetic tape. In these systems it is natural and efficient for information to be moved in and out of the core memory in blocks. The situation is very different, however, when a fast core memory is backed up by a large slow core memory, since both memories are truly random access and there is no latency time problem. The time spent in transferring to the fast memory words of a program which are not used in a subsequent running is simply wasted.
在当下所用的分层存储系统中,磁芯存储器由磁鼓或磁盘作为后备存储,磁鼓与磁盘又进一步由磁带充当后备介质。在这类架构中,数据以数据块为单位在磁芯存储器中移入移出,该存取方式具备合理性与高效性。但如果高速磁芯存储器以大容量低速磁芯主存作为后备,整体场景便截然不同:两类存储器均为纯随机存取器件,不存在寻道等待时延。将程序后续运行不会调用的数据搬运至高速存储器所耗费的时间,会被白白浪费。
I wish in this note to draw attention to the use of a fast memory as a slave memory. By a slave memory I mean one which automatically accumulates to itself words that come from a slower main memory, and keeps them available for subsequent use without it being necessary for the penalty of main memory access to be incurred again. Since the slave memory can only be a fraction of the size of the main memory, words cannot be preserved in it indefinitely, and there must be wired into the system an algorithm by which they are progressively overwritten. In favorable circumstances, however, a good proportion of the words will survive long enough to be used on subsequent occasions and a distinct gain of speed results. The actual gain depends on the statistics of the particular situation.
本文旨在探讨将高速存储器用作从属存储器的实现思路。从属存储器的定义为:可自动从低速主存中读取数据并缓存,后续访问对应数据时无需再次访问主存,省去主存访问带来的时间损耗。由于从属存储器的容量仅为主存的一小部分,数据无法在其中永久保存,系统需通过硬件固化一套替换算法,逐步对老旧缓存数据进行覆盖。在访问特征理想的场景下,大部分缓存数据能够留存至下次调用,带来明显的运行提速。实际性能提升幅度由具体业务的访问统计特征决定。
小型指令从属存储器
Slave memories have recently come into use for reducing instruction access time in an otherwise conventional computer.$^{1,2}$ A small, very-high-speed memory of, say, 32 words, accumulates instructions as they are taken out of the main memory. Since instructions often occur in small loops a quite appreciable speeding up can be obtained.
从属存储器现已应用于传统计算机的指令访存加速$^{1,2}$。选用例如 32 字容量的超高速存储器,缓存从主存读出的指令;受程序小循环执行特征影响,该设计可实现可观的运行提速。
1 Takahashi, S., H. Nishino, K. Yoshihiro, and K. Fuchi, System design of the ETL Mk-6 computers. Information Processing 1962 (Proc. IFIP Congress 62), Amsterdam, The Netherlands: North-Holland Publishing Co., 1963, p 690.
高桥 S、西野 H、吉广 K、福地 K,ETL Mk-6 计算机系统设计,1962 国际信息处理大会论文集,阿姆斯特丹:荷兰北荷兰出版社,1963:690 页。
2 Ferranti Computing Systems; Atlas 2, London: Ferranti Ltd., 1963.
费兰蒂计算机事业部,Atlas 2 产品手册,伦敦:费兰蒂有限公司,1963。
标记位寻址设计
One method of designing a slave memory for instructions is as follows. Suppose that the main memory has 64K words (where $K=1024$) and, therefore, 16 address bits, and that the slave memory has 32 words and, therefore, 5 address bits. The slave memory is constructed with a word length equal to that of the main memory plus 11 extra bits, which will be called tag bits. An instruction extracted from register $r$ of the main memory is copied into register $r \pmod{32}$ of the slave memory and, at the same time, the 11 most significant bits of $r$ are copied into the 11 tag bits. For example, suppose $r=10259$; the instruction from this register is copied into register 19 of the slave and the number 320 is copied into the tag bits of that register.
指令型从属存储器的一种硬件设计方案如下:假设主存容量为 64K 字($K=1024$),对应 16 位地址;从属存储容量 32 字,占用 5 位地址。从属存储单字位宽 = 主存单字位宽 + 11 位扩展位,这些扩展位称为标记位。主存地址 $r$ 对应的存储单元中的指令,被复制到从属存储下标为 $r \pmod{32}$ 的单元,同时地址 $r$ 的高 11 位存入该单元的 11 位标记位。举例:$r=10259$ 时,指令存入从属存储第 19 号单元,数值 320 存入该单元标记位。
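The address split in this example can be checked with a few lines of arithmetic. The function and constant names are illustrative and do not come from the paper.
该例中的地址拆分可以用几行算术来验证。函数名和常量名仅为示意,并非出自原论文。

```python
ADDRESS_BITS = 16                       # 64K-word main memory
INDEX_BITS = 5                          # 32-word slave memory
TAG_BITS = ADDRESS_BITS - INDEX_BITS    # 11 tag bits


def split_address(r):
    """Return (tag, index) for main-memory register r."""
    index = r % (1 << INDEX_BITS)   # r mod 32: which slave register is used
    tag = r >> INDEX_BITS           # the 11 most significant bits of r
    return tag, index


tag, index = split_address(10259)
print(tag, index)                   # 320 19
print((tag << INDEX_BITS) | index)  # 10259, the original address
```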
命中/缺失读写逻辑
Whenever an instruction is required, the slave is first examined to see whether it already contains that instruction. This is done by accessing the register that might contain the instruction, namely, register $r \pmod{32}$, and examining the tag bits to see whether they are equal to the 11 most significant digits of $r$. If they are, the instruction is taken from the slave; otherwise, it is obtained from the main memory and a copy left in the slave. If the system is to preserve full freedom for the programmer to modify instructions in the accumulator, it is necessary that every time a writing operation is to take place, the slave shall be examined to see whether it contains the word about to be updated. If it does, then the word must be updated in the slave as well as in the main memory.
处理器取指时优先检索从属存储器。硬件先寻址从属存储 $r \pmod{32}$ 单元,比对单元标记位与地址 $r$ 的高 11 位;标记匹配即命中,直接从从属存储取指;未命中则访问主存,并将指令副本写入从属存储。为保障程序员可自由在累加器中修改指令,每次写操作前需要检索从属存储:若待修改数据已缓存于从属存储,则从属存储与主存的数据同步更新。
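The fetch and write rules above can be sketched as a small direct-mapped model. This is only an illustration under stated assumptions: main memory is a Python list, the class and method names are invented, and `main_reads` is a counter added to make hits and misses visible.
上述取指与写入规则可以用一个小型直接映射模型来示意。这只是在若干假设下的示意:主存用 Python 列表表示,类名和方法名均为自拟,`main_reads` 是为观察命中与缺失而额外加入的计数器。

```python
class InstructionSlave:
    """Direct-mapped slave memory for instructions (illustrative model)."""

    def __init__(self, main_memory, words=32):
        self.main = main_memory
        self.words = words
        self.tags = [None] * words   # None means the register holds nothing yet
        self.data = [None] * words
        self.main_reads = 0          # added only to observe misses

    def fetch(self, r):
        index, tag = r % self.words, r // self.words
        if self.tags[index] == tag:
            return self.data[index]          # hit: taken from the slave
        self.main_reads += 1                 # miss: go to main memory
        word = self.main[r]
        self.tags[index], self.data[index] = tag, word  # leave a copy in the slave
        return word

    def write(self, r, word):
        self.main[r] = word                  # main memory is always updated
        index, tag = r % self.words, r // self.words
        if self.tags[index] == tag:
            self.data[index] = word          # keep the slave copy in step


main = list(range(65536))
slave = InstructionSlave(main)
slave.fetch(10259)            # miss, loads register 19 with tag 320
slave.fetch(10259)            # hit
slave.write(10259, 99)        # updates main memory and the slave copy
print(slave.fetch(10259), slave.main_reads)   # 99 1
slave.fetch(10259 + 32)       # same slave register, different tag: overwrites it
print(slave.fetch(10259), slave.main_reads)   # 99 3
```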
LARGE SLAVE MEMORY(大容量从属存储器)
分时系统应用背景
So far the slave principle has been applied to very small superspeed memories associated with the control of a computer. There would, however, appear to be possibilities in the use of a normal sized core memory as a slave to a large core memory, and I will now discuss various ways in which this might be done. I shall be concerned primarily with a computer system designed for on-line time-sharing in which a large number of user programs are held in auxiliary storage and activated, in turn, according to a sequence determined by a scheduling algorithm. When activated, each program runs until it is either completed or held up by an input/output wait, or until the period of time allocated to it by the scheduling algorithm is exhausted. Another program is then activated. See Corbato.$^{3}$
此前从属存储原理仅用于控制器配套的极小容量超高速存储;但常规容量磁芯存储同样可作为大容量主存的从属存储,下文阐述多种实现方案。本文聚焦在线分时计算机系统:大量用户程序存放于辅助存储,由调度算法按次序轮换调入运行。单个程序启动后持续执行,直至运行结束、I/O 阻塞或分配时间片耗尽,随即切换下一个程序。详见 Corbato 相关文献$^{3}$。
3 Corbato, F. J. Proc. 1962 International Federation of Information Processing Congress, Amsterdam, The Netherlands: North-Holland Publishing Co., 1963, p 711.
F. J. 科尔巴托,1962 国际信息处理联合会会议论文,阿姆斯特丹:荷兰北荷兰出版社,1963:711 页。
双层级存储基础方案
Consider a computer in which a working memory of, say, 32K and 1-$\mu$s access time is backed up by a large core memory of, say, one million words and 8-$\mu$s access time. In the simplest scheme to be described, a user makes use of one or more blocks for his program. The large core memory is provided with a base register, which contains the starting address of the 32K block currently active. What we wish to avoid is transferring the whole block to the fast core memory every time it becomes active; this would be wasteful since chances are only a small fraction of the words will actually be accessed before the block ceases to be active. If the fast core memory is operated on the slave principle, no word is copied into it until that word has actually been called for by the program. When this happens, the word is automatically copied by the hardware into the fast memory, and the fact that copying has taken place is indicated by the first of two tag bits being changed from a 0 to a 1. When any reference to storage takes place the fast memory is accessed first,$^{4}$ and, if the first tag bit is a 1, no reference is made to the large memory; this is true whether reading or writing is called for. If a word in the fast memory is changed, a second tag bit is changed from 0 to 1. Two tag bits are all that are required in this system.
设系统高速工作存储容量 32K、访问时延 1 $\mu$s,后端主存为一百万字、访问时延 8 $\mu$s。最简设计中,用户程序占用一个或多个存储块,大容量主存配置基址寄存器,存放当前活跃 32K 存储块的起始地址。我们希望避免每次存储块激活时都整块搬移数据至高速存储,因为多数存储块仅少量数据会被访问,全块拷贝造成资源浪费。采用从属存储机制后,仅程序发起访问的数据才由硬件自动加载至高速存储,用双标记位的第 1 位由 0 置 1 标记缓存有效。所有访存操作优先检索高速存储$^{4}$:第 1 标记位为 1 代表命中,读写均无需访问低速主存;高速存储内数据被修改时,第 2 标记位由 0 置 1。本基础方案仅需 2 位标记位。
4 If the design of the large core memory permits, access to it can be initiated simultaneously with access to the fast memory, and cancelled if it turns out not to be required.
若低速主存硬件架构支持,可与高速存储并行发起访存请求,缓存命中后撤销主存访问。
程序切换数据回写
As time goes on, the fast memory will accumulate all the words of the program in active use. When the number in the base register is changed so that a new program becomes active in the place of the one currently active (a change that is brought about by the supervisor), a scan of the fast memory is initiated. Each register is examined in turn and, if the first tag bit is a 0, no action is taken for that register. No action is similarly taken if the first tag bit is a 1 and the second tag bit is a 0. If, however, both tag bits are 1's, the word in the register under examination is copied into its appropriate place in the large memory.
运行过程中高速存储持续缓存当前活跃程序的访问数据。操作系统修改基址寄存器、切换运行程序时,系统启动高速存储全量扫描:逐单元检查标记位,第 1 标记位为 0 直接跳过;第 1 位为 1、第 2 位为 0 代表数据未修改,同样无需处理;两位标记全为 1 说明缓存数据已被改写,需将数据回写至低速主存对应地址。
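The two-tag-bit scheme and the scan on a program switch can be sketched as follows. The class is a toy model with invented names and a tiny block size. Two details are assumptions rather than statements from the paper: a write to a word not yet copied is placed directly in the fast memory with both bits set, and both bits are cleared once the scan has examined a register.
双标记位方案以及程序切换时的扫描可示意如下。该类是一个玩具模型,名称为自拟,存储块也很小。其中两处细节属于假设而非论文原文:对尚未复制的字执行写操作时,直接写入高速存储并将两位标记同时置 1;扫描检查过某单元后,其两位标记被清零。

```python
class SlaveBlock:
    """Fast memory acting as a slave to the currently active block (toy model)."""

    def __init__(self, main_memory, block_words):
        self.main = main_memory
        self.size = block_words
        self.base = 0                          # base register: start of active block
        self.words = [0] * block_words
        self.copied = [False] * block_words    # first tag bit
        self.altered = [False] * block_words   # second tag bit
        self.main_reads = 0                    # added only to observe misses

    def read(self, offset):
        if not self.copied[offset]:            # first use: copy from the large memory
            self.words[offset] = self.main[self.base + offset]
            self.copied[offset] = True
            self.main_reads += 1
        return self.words[offset]

    def write(self, offset, value):
        self.words[offset] = value             # assumption: no fetch needed on a write
        self.copied[offset] = True
        self.altered[offset] = True

    def switch(self, new_base):
        """Scan the fast memory, copy back altered words, then activate a new block."""
        written_back = 0
        for offset in range(self.size):
            if self.copied[offset] and self.altered[offset]:
                self.main[self.base + offset] = self.words[offset]
                written_back += 1
            self.copied[offset] = False        # assumption: tags reset after the scan
            self.altered[offset] = False
        self.base = new_base
        return written_back


main = list(range(64))
fast = SlaveBlock(main, block_words=8)
fast.read(3)
fast.read(3)                 # second read is served by the fast memory
fast.write(5, 500)           # main[5] is still 5 at this point
print(fast.switch(8))        # 1 word copied back
print(main[5], fast.read(3)) # 500 11
```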
方案变体优化
Many variants of the simple scheme are possible. The tag bits may, for example, be stored in a separate superspeed memory. A 1024-word memory, each having 64 bits, would be suitable; such a memory could be made with an access time of about 100 ns, and would enable the scanning process to be completed more rapidly. Similarly, a number of base registers could be provided and the fast core memory divided into sections, each serving as a slave to a separate program block in the main memory. Such a provision would, in principle, enable short programs belonging to a number of users to remain in the fast memory while some other user was active, being displaced only when the space they occupied was required for some other purpose. This would present the designer of the supervisor with problems similar to those presented by an Atlas-type system of dynamic storage allocation.$^{5}$
上述基础方案存在多种优化变体。例如可将全部标记位独立存放于专属超高速存储:选用 1024 字、单字 64 位的存储器件,访问时延约 100 ns,大幅缩短存储扫描耗时。另一种优化:配置多组基址寄存器,高速存储分区,每个分区单独对接主存内一个程序块。该设计可实现:某用户程序运行时,其余多用户小程序仍驻留高速存储,仅存储空间不足时才被替换;该管理逻辑给调度系统带来的设计难点,与 Atlas 架构动态存储分配体系相近$^{5}$。
5 Kilburn, T., D. B. G. Edwards, M. J. Lanigan, and F. H. Sumner, One level storage system, IRE Trans. on Electronic Computers, vol 11, Apr 1962, pp 223-235.
T. 基尔伯恩、D. B. G. 爱德华兹、M. J. 拉尼根、F. H. 萨姆纳,单级存储系统,IRE 电子计算机汇刊,第 11 卷,1962 年 4 月:223-235 页。
A slave shared by several blocks
An alternative, and perhaps more attractive, scheme would be to retain 32K (or whatever the size of the fast memory may be) as the block length, but to arrange that the fast memory acts as a slave to more than one block in the main memory, it being recognized that this will lead to some overwriting of information in the slave, but will, nevertheless, on the average, be advantageous. Suppose, for example, that there are seven base registers, each containing an address of a register in the main memory at which a program block starts. Four tag bits are necessary, the first three containing either zeros or the number of one of the base registers. The fourth tag bit indicates whether a word has been altered while in the slave.
Put differently, the block length stays at 32K (or whatever the fast memory holds) and the one slave serves several main-memory blocks at once. Words in the slave will sometimes be overwritten, but on average the arrangement pays off. With seven base registers, each holding the start address of a program block, every slave word carries four tag bits: the first three hold zero or the number of a base register, and the fourth records whether the word has been altered.
Read and write logic with rotating programs
At any given time, one of the seven program blocks is active. Whenever access is required to a word in the memory, the hardware looks to see whether that word is to be found in the slave. This is done by reading the word in the appropriate place in the slave and comparing the first 3 tag bits with the number of the base register corresponding to the program block then active. If there is agreement, and if a reading operation is to be performed, the word from the slave is used and operation proceeds. If the three tag bits are all zero, the word is obtained from the main memory and a copy put into the slave memory for future use. If the three tag bits are not zero but correspond to another base register, the fourth digit is examined. If this is a zero, action proceeds as before, the word in the slave being overwritten by the word from the new program block. If, however, the fourth bit is a 1, indicating that the word has been altered while in the slave, that word is copied back into its proper place in the main memory before being overwritten by the word from the new program block. In the case of a writing operation the sequence of events is similar, except that the fourth tag bit is made into a 1 when a word in the slave is modified. Thus, if the seven programs become active in turn, they may be said to share the slave between them and, if each runs in short bursts, there is a fair chance that only a few words belonging to a particular program block get overwritten in the slave before that program block is activated again.
To summarize: on each access the hardware reads the corresponding slave word and compares its first three tag bits with the number of the active block's base register. A match on a read is a hit. All-zero tag bits mean the word is not cached, so it is fetched from main memory and copied into the slave. If the tag names another base register, the fourth bit decides: 0 means the slave word can simply be overwritten, 1 means it was altered and must be written back first. Writes follow the same sequence and additionally set the fourth tag bit to 1. The seven programs thus time-share one slave, and if each runs in short bursts, few of a program's words are likely to be overwritten before it runs again.
Scan on base-register change
There will, normally, be more than seven program blocks ready to take their turn for running and the supervisor will, from time to time, change the address in one of the base registers. When this happens, a scan of the slave is initiated, and all words which belong to the program block being displaced and which have a 1 in the fourth bit of the tag, are copied into the main memory.
That is, because more than seven blocks are usually ready to run, the supervisor occasionally rebinds a base register to a different block. Each rebinding triggers a scan of the slave that writes back every word of the displaced block whose fourth tag bit is 1.
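The shared-slave behaviour (tag comparison, write-back before overwrite, scan on rebinding) can be modelled as a toy class. Everything here is a sketch under stated assumptions: `SharedSlave` and its method names are hypothetical, a write miss overwrites the whole word without fetching it first, and tags of a displaced block are cleared during the scan, which the excerpt implies but does not spell out.

```python
class SharedSlave:
    """Toy model of one slave memory shared by up to seven program blocks."""

    def __init__(self, size, main_memory):
        self.size = size
        self.main = main_memory
        self.base = {}                   # base-register number (1..7) -> start address
        self.owner = [0] * size          # first three tag bits; 0 means empty
        self.altered = [False] * size    # fourth tag bit
        self.data = [0] * size

    def _evict(self, offset):
        # Copy an altered word back to its own block before it is overwritten.
        if self.owner[offset] != 0 and self.altered[offset]:
            self.main[self.base[self.owner[offset]] + offset] = self.data[offset]
        self.altered[offset] = False

    def read(self, reg, offset):
        if self.owner[offset] != reg:    # empty, or held for another block
            self._evict(offset)
            self.data[offset] = self.main[self.base[reg] + offset]
            self.owner[offset] = reg
        return self.data[offset]

    def write(self, reg, offset, value):
        if self.owner[offset] != reg:
            self._evict(offset)
            self.owner[offset] = reg
        self.data[offset] = value
        self.altered[offset] = True      # fourth tag bit set on modification

    def rebind(self, reg, new_base):
        """Supervisor changes a base register: scan, write back, clear tags."""
        for offset in range(self.size):
            if self.owner[offset] == reg:
                self._evict(offset)
                self.owner[offset] = 0   # assumption: displaced words become empty
        self.base[reg] = new_base

main = list(range(100, 132))             # 32 words of "main memory"
slave = SharedSlave(size=4, main_memory=main)
slave.rebind(1, 0)                       # block 1 starts at address 0
slave.rebind(2, 8)                       # block 2 starts at address 8
slave.write(1, 3, -1)                    # block 1 alters word 3, held only in the slave
first = slave.read(2, 3)                 # block 2 displaces it; -1 is copied to main[3]
second = slave.read(1, 3)                # block 1 runs again and re-fetches its word
```

The second read by block 1 sees the value it wrote earlier, because the altered word was copied back to main memory before block 2 overwrote the slave register.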
Summary
On the face of it, the scheme just outlined appears to offer the basis for a satisfactory two-level core storage system without involving too high a degree of complexity in the hardware.
The shared-slave scheme therefore looks like a workable basis for a two-level core store at a manageable hardware cost.
ACKNOWLEDGMENT
The author wishes to express his gratitude to Prof. R. M. Fano, Director of Project MAC, for inviting him to participate in the project. He is also grateful to his colleagues in Cambridge, England, for discussions, particularly to Dr. D. J. Wheeler and N. E. Wiseman, who designed the slave memory of Atlas 2. G. Scarrot first suggested the idea of a slave memory to them.
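Evolution of Intel's L1 cache
Intel first used a split L1 cache (L1I + L1D) in the P5 Pentium of 1993, and its desktop and server lines have not returned to a unified L1 since. The 80486 of 1989 carried a single 8 KB unified L1, Intel's first on-chip level-1 cache.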
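I. Four reasons for the switch
- Contention between the two superscalar pipelines: the Pentium's dual-issue U/V pipes fetch instructions and access data in the same cycle. A single-ported unified L1 creates a structural hazard; splitting the cache removes the resulting stalls.
- Timing of the set-associative arrays: with separate caches, the instruction and data sides can each choose their own associativity and access path, which simplifies tag comparison and shortens access time. The original Pentium's L1I and L1D were independent designs, free of the compromises a unified array imposes.
- Floorplan and routing: L1I sits next to the fetch unit and L1D next to the load/store unit, which shortens on-chip metal interconnect and lowers wire delay and power.
- Manufacturing cost: at equal bandwidth, a dual-ported or multi-ported unified SRAM costs noticeably more in cell area, power, and production cost than two independent single-ported SRAMs.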
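The structural hazard in the first point can be made concrete with a toy cycle count. This is a deliberately crude sketch: the function name, the one-fetch-per-instruction model, and the 40% load/store mix are all assumptions, and real designs soften the conflict with prefetch buffers.

```python
def cycles_needed(instructions, unified_single_port):
    """Count cycles for an in-order toy pipeline issuing one instruction per cycle.

    instructions: sequence of booleans, True if that instruction also
    performs a data access (load or store).
    """
    cycles = 0
    for touches_memory in instructions:
        cycles += 1                      # the instruction fetch uses one cache access
        if touches_memory and unified_single_port:
            cycles += 1                  # the data access waits for the same port
    return cycles

program = [True, False, False, True, False] * 200    # assumed mix: 40% loads/stores
unified = cycles_needed(program, unified_single_port=True)
split = cycles_needed(program, unified_single_port=False)
```

Under these assumptions the single-ported unified cache needs 1400 cycles for 1000 instructions, while the split cache needs 1000, because fetch and data access never compete for a port.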
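II. Other constraints on the choice of L1 organization
(1) Software ecosystem and ISA constraints
x86 has always allowed self-modifying code (SMC) and memory layouts that mix code and data, whereas most RISC architectures leave instruction/data cache coherence to software. A split cache on x86 therefore needs extra I/D synchronization hardware (snooping, flushing, handling of page invalidation), while a unified L1 avoids the synchronization cost of SMC altogether. Early x86 programs routinely mixed code and data segments, an access pattern a unified cache handles natively, and this favored the unified design in the 80486. Once the Pentium microarchitecture was redesigned and the cost of the coherence hardware was under control, a split cache became practical.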
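Why a split cache needs that synchronization hardware can be shown with a toy model. The class `SplitL1`, the `snoop_stores` flag, and the write-through data path are assumptions made to keep the sketch small; this is not a description of any real Intel circuit.

```python
class SplitL1:
    """Toy split L1: stores go to memory on the data side, fetches use an I-cache."""

    def __init__(self, memory, snoop_stores):
        self.memory = memory
        self.icache = {}                   # address -> cached instruction word
        self.snoop_stores = snoop_stores   # True models hardware I/D coherence

    def fetch(self, addr):
        if addr not in self.icache:
            self.icache[addr] = self.memory[addr]
        return self.icache[addr]

    def store(self, addr, value):
        self.memory[addr] = value          # write-through, to keep the model small
        if self.snoop_stores:
            self.icache.pop(addr, None)    # invalidate the stale instruction line

def run_patch(snoop_stores):
    core = SplitL1({0x10: "ADD"}, snoop_stores)
    core.fetch(0x10)                       # the instruction is now cached
    core.store(0x10, "SUB")                # self-modifying code patches it
    return core.fetch(0x10)                # what does the core execute next?

coherent = run_patch(snoop_stores=True)
incoherent = run_patch(snoop_stores=False)
```

With snooping the patched instruction is fetched; without it the core keeps executing the stale copy, which is the failure existing self-modifying x86 software could not tolerate.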
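(2) Differences in access locality
Instruction fetches are mostly sequential and almost never written, so they show strong spatial locality and, in loops, strong temporal locality. Data accesses are more irregular and mix reads with writes. A split cache can tune replacement policy, port configuration, and write policy separately for each pattern; a unified cache has to compromise between the two. The 80486 is a single-issue pipeline with few access conflicts, so the compromise was acceptable; with the Pentium's superscalar parallelism it stopped paying off.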
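(3) Process and transistor budget
The 80486 and the original Pentium have about 1.2 million and 3.1 million transistors respectively. The 80486 continued the external unified cache approach of the 82385 cache controller, and keeping a unified cache on chip limited the redesign effort. The Pentium was a new superscalar design whose transistors went first to the two pipelines, branch prediction, and the FPU; for a given transistor budget, splitting L1 raised effective cache bandwidth without a large increase in the die area devoted to cache.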
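(4) Maturity of multi-ported SRAM
In 1989, production processes could not deliver a large dual-ported unified SRAM cheaply. By 1993, separate single-ported SRAMs beat multi-ported SRAM on both yield and cost, which made the split practical, and later process generations have not reversed that cost comparison.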
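(5) Security and fault containment
A split design holds instructions and data in physically separate arrays, which narrows the ways a stray data write can corrupt cached instructions; in a unified cache both share one array. The benefit should not be overstated: x86 keeps its caches coherent, so a store to a code address still reaches the instruction stream on either design. As commercial operating systems tightened their security requirements in the 1990s, this consideration added some pressure toward the split design.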
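III. Performance effects of the split L1 across Intel processors
Gains
- Two accesses per cycle: with independent ports on L1I and L1D, an instruction fetch and a data access can complete in the same clock, nearly doubling the theoretical L1 bandwidth available to the core. The Pentium's IPC rose markedly over the 80486, and the later 4-wide Core designs built their instruction throughput on the same arrangement.
- Latency tuned per cache: L1I only has to serve reads for fetch, L1D needs read and write ports, and each circuit is trimmed to its access pattern. For Sunny Cove, L1I access is given as 4 cycles and L1D as 5 cycles; compared with a hypothetical unified cache of the same generation, the split lets each side be optimized for lower average latency.
- Hit rate by workload: in loop-heavy workloads the L1I hit rate can exceed 95%, and in data-heavy workloads instructions do not take space from data. In the 80486's unified cache, instructions could crowd out data and useful lines were evicted (thrashing); the split sharply reduces such conflicts.
- Room to evolve: each cache can grow independently. Sunny Cove raised L1D from 32 KB to 48 KB while L1I stayed at 32 KB; a unified cache can only grow as a whole array, at higher cost. Lion Cove (2024) went further, with a 64 KB L1I, a 48 KB L0D as a low-latency first level, and a 192 KB L1D behind it; Cougar Cove (Panther Lake) keeps that hierarchy.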
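Inherent costs
- Lower utilization of total capacity: for the same total size, a unified cache can give its space to instructions or data as demand shifts, while a split cache has fixed partitions. Under unbalanced load the unified cache's dynamic allocation gives a higher overall hit rate; a large unified L2 makes up the difference.
- Coherence overhead for self-modifying code: a split cache on x86 needs I/D synchronization logic. Hardware snoops stores against cached and prefetched instructions and discards the stale copies when code is modified, which costs extra cycles. Software does not have to issue cache-flush instructions such as WBINVD for this, and INVLPG affects TLB entries, not cache lines. Such code is rare in mainstream software, so the overall impact is small.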
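The capacity-utilization cost can be illustrated with a small simulation. This is a sketch under stated assumptions: both organizations are modelled as fully associative LRU caches, the line counts and footprints are invented for the example, and the names `LRUCache`, `trace`, and `hit_rates` are hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative LRU cache that only tracks which lines are present."""

    def __init__(self, lines):
        self.lines = lines
        self.store = OrderedDict()
        self.hits = 0
        self.accesses = 0

    def access(self, key):
        self.accesses += 1
        if key in self.store:
            self.hits += 1
            self.store.move_to_end(key)
        else:
            if len(self.store) >= self.lines:
                self.store.popitem(last=False)    # evict the least recently used line
            self.store[key] = True

def trace(code_lines, data_lines, rounds):
    """Alternate fetches over a code footprint with accesses over a data footprint."""
    for _ in range(rounds):
        for i in range(max(code_lines, data_lines)):
            yield ("I", i % code_lines)
            yield ("D", i % data_lines)

def hit_rates(code_lines, data_lines, rounds=50, lines_per_side=8):
    unified = LRUCache(2 * lines_per_side)        # same total capacity as the split pair
    l1i, l1d = LRUCache(lines_per_side), LRUCache(lines_per_side)
    for kind, line in trace(code_lines, data_lines, rounds):
        unified.access((kind, line))
        (l1i if kind == "I" else l1d).access(line)
    split_hits = l1i.hits + l1d.hits
    return unified.hits / unified.accesses, split_hits / (l1i.accesses + l1d.accesses)

# Unbalanced load: a 2-line loop walking a 12-line data set.
unified_rate, split_rate = hit_rates(code_lines=2, data_lines=12)
```

In this contrived case the unified cache holds the whole 14-line working set and hits on about 99% of accesses, while the 8-line L1D thrashes on the 12-line data set and the split pair hits on only about 50%. Real workloads are far less extreme, which is why a unified L2 behind the split L1 is usually enough to close the gap.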
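IV. Could Intel return to a unified L1?
The assessment below is split by application area and based on current process technology and product roadmaps.
1. Desktop and high-performance P-cores: no realistic path within 10 years
Modern high-performance cores are superscalar designs at least 4 wide, with out-of-order execution and deep branch prediction, and front-end demand for parallel accesses keeps growing. To match today's bandwidth a unified L1 would need multi-ported SRAM, with more area and power than the split design.
Intel's Panther Lake (Cougar Cove P-core) uses a layered 64 KB L1I + 192 KB L1D + 48 KB L0D design, so L1 capacity and complexity are growing, not shrinking. Leaks about the next-generation Nova Lake focus on larger L2/L3/bLLC, and L1 is expected to stay split. Newer non-volatile memories such as MRAM cannot yet meet L1 latency targets (on the order of 1 ns or less) and so give no reason to change the architecture.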
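2. Ultra-low-power E-cores and embedded Atom: possible only in specific custom designs
Ultra-low-power embedded and IoT chips aim to minimize die area and control logic. Recent efficiency cores (Gracemont, Skymont, Darkmont) control power by keeping individual caches small and tuning replacement, and still use split L1I/L1D. Only MCU-derived parts under extreme cost pressure leave room to try a small unified L1; mainstream consumer and mobile efficiency cores will not adopt it.
Erratum: the first Atom (Bonnell, 2008) used a split 32 KB L1I + 24 KB L1D, not a unified L1. The earlier statement that "some first-generation Atom products had a unified L1" was wrong and has been removed.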
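3. Heterogeneous AI cores: only outside the standard CPU cache hierarchy
In AI inference engines the boundary between instruction and data accesses is blurred, and dedicated accelerators (NPU or GPU local SRAM, for example) may use unified internal storage. That storage is not a general-purpose CPU L1, and the main L1 of general-purpose processors will not move to a unified design along with it.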
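V. Why L2 and L3 stay unified
L2 and L3 sit far from the pipeline front end and face no hard requirement to serve an instruction fetch and a data access in the same cycle. A unified organization lets capacity shift between code and data and raises overall utilization. Intel has kept it from the Pentium era through its current processors, with no known plans for a split L2 or L3.
Mainstream general-purpose x86 processors (desktop, mobile, server) are not going to switch to a unified L1. Only custom chips for dedicated embedded use leave limited room to try one, and the split L1 will remain in Intel's main product lines for the long term.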
Appendix: Intel L1 cache sizes over time
| Architecture / product | Years | L1I | L1D | Notes |
|---|---|---|---|---|
| 80486 | 1989 | 8 KB unified | (shared with L1I) | First on-chip L1 |
| P5 Pentium | 1993 | 8 KB | 8 KB | First split L1 |
| P55C (Pentium MMX) | 1997 | 16 KB | 16 KB | Doubled with the MMX revision; P54C kept 8 KB + 8 KB |
| Pentium Pro (P6) | 1995 | 8 KB | 8 KB | Out-of-order execution |
| Pentium II/III | 1997-1999 | 16 KB | 16 KB | P6 cache design carried over |
| Pentium 4 (NetBurst) | 2000-2004 | 12K μop trace cache | 8 KB or 16 KB | Trace cache in place of a conventional L1I |
| Core (Yonah) | 2006 | 32 KB | 32 KB | Back to a conventional split L1 |
| Core 2 (Merom) | 2006 | 32 KB | 32 KB | 64 KB of L1 in total |
| Nehalem to Broadwell | 2008-2014 | 32 KB | 32 KB | Stable configuration |
| Skylake and derivatives | 2015-2019 | 32 KB | 32 KB | 8-way L1I and L1D |
| Sunny Cove (Ice Lake) | 2019 | 32 KB | 48 KB | First L1D increase beyond 32 KB |
| Golden Cove (Alder Lake) | 2021 | 32 KB | 48 KB | Same L1 as Sunny Cove |
| Lion Cove (Lunar Lake) | 2024 | 64 KB | 192 KB + 48 KB L0D | Introduces the 48 KB L0D |
| Cougar Cove (Panther Lake) | 2025-2026 | 64 KB | 192 KB + 48 KB L0D | Carries over the Lion Cove hierarchy |
reference
- hardware - What is the history and development of memory caching? - 2022
- memory layout - Why did Intel abandon unified CPU cache? - 2019
  https://retrocomputing.stackexchange.com/questions/11274/why-did-intel-abandon-unified-cpu-cache
- history - What's the relationship between early 90s Pentium microprocessor and today's Intel designs? - 2017
- cpu architecture - What does a 'Split' cache means. And how is it useful(if it is)? - 2019
- Slave Memories and Dynamic Storage Allocation | IEEE Journals & Magazine | 1965
- ......