System Performance ch6.CPU.concetps 笔记

Brendan Gregg. System Performance ch6.CPU

For multiprocessor systems, the kernel typically provides a run queue for each CPU , and aims to keep threads on the same run queue. This means that threads are more likely to keep running on the same CPUs where the CPU caches have cached their data . These caches are described as having cache warmth, and this strategy to keep threads running on the same CPUs is called CPU affinity. On NUMA systems, per-CPU run queues also improve memory locality. This improves performance by keeping threads running on the same memory node (as described in Chapter 7, Memory), and avoids the cost of thread synchronization (mutex locks) for queue operations, which would hurt scalability if the run queue was global and shared among all CPUs.

对于多处理器系统，内核通常为每个CPU提供一个运行队列，目的是使线程保持在同一个运行队列上。这意味着线程更有可能在CPU缓存缓存其数据的同一个CPU上继续运行。这些缓存被描述为具有缓存温暖，这种保持线程在相同CPU上运行的策略称为CPU亲和性。在NUMA系统上，每个cpu运行队列还可以提高内存局部性。这通过保持线程在相同的内存节点上运行来提高性能(如第7章内存所述)，并避免队列操作的线程同步(互斥锁)成本，如果运行队列是全局的并且在所有cpu之间共享，则线程同步将损害可伸缩性。

所以这里的CPU是指 Processor 还是 core还是hyper thread(logical CPU)? 我倾向于是logical CPU

KS认为这里是logical CPU

但是这样HT会导致一些问题，可以通过一些策略来优化，毕竟logical CPU之间是不同的

CPUs execute instructions chosen from their instruction set. An instruction includes the following steps, each processed by a component of the CPU called a functional unit:

Instruction fetch
Instruction decode
Execute
Memory access (optional) 【often the slowest】
Register write-back (optional)

But we can go faster still. Multiple functional units of the same type can be included, so that even more instructions can make forward progress with each clock cycle. This CPU architecture is called superscalar and is typically used with pipelining to achieve a high instruction throughput.

The instruction width describes the target number of instructions to process in parallel. Modern processors are 3-wide or 4-wide, meaning they can complete up to three or four instructions per cycle. How this works depends on the processor, as there may be different numbers of functional units for each stage.

增大 instruction width 设置多条流水线，从而实现在一个单一core的CPU里同时处理多个指令，就是超标量(superscalar)

它既可以是 out of order 也可以是 in order

6.3.5 Instruction Size

Another instruction characteristic is the instruction size: for some processor architectures it is variable: For example, x86, which is classified as a complex instruction set computer (CISC), allows up to 15-byte instructions. ARM, which is a reduced instruction set computer (RISC), has 4 byte instructions with 4-byte alignment for AArch32/A32, and 2- or 4-byte instructions for ARM Thumb.

指令大小对于某些处理器体系结构，是可变的

About HT

The performance of each hardware thread is not the same as a separate CPU core, and depends on the workload. To avoid performance problems, kernels may spread out CPU load across cores so that only one hardware thread on each core is busy, avoiding hardware thread contention.
Workloads that are stall cycle-heavy (low IPC) may also have better performance than those that are instruction-heavy (high IPC) because stall cycles reduce core contention.

停顿周期重(低IPC)的工作负载也可能比指令重(高IPC)的工作负载具有更好的性能，因为停顿周期减少了核心争用。所以IPC未必能够被准确描述当前的性能情况。

A low IPC indicates that CPUs are often stalled, typically for memory access. A high IPC indicates that CPUs are often not stalled and have a high instruction throughput. These metrics suggest where performance tuning efforts may be best spent.

低IPC表示cpu经常处于停滞状态，通常是为了内存访问。IPC高表明cpu通常不会停滞，并且具有较高的指令吞吐量。这些指标建议在哪些地方进行性能调优是最好的。
At Netflix, cloud workloads range from an IPC of 0.2 (considered slow) to 1.5 (considered good). Expressed as CPI, this range is 5.0 to 0.66.

IPC指示了指令处理的效率，而不是指令本身的效率。用油多不意味着跑得快或者单位容量的油能提供更多的动力。

CPU Utilization

CPU utilization is measured by the time a CPU instance is busy performing work during an interval, expressed as a percentage. It can be measured as the time a CPU is not running the kernel idle thread but is instead running user-level application threads or other kernel threads, or processing interrupts.

CPU利用率往往就是通过CPU没有运行kernel idle thread 而是运行 user-level application threads or other kernel threads, or processing interrupts 的时间来衡量的

所以 processing interrupts 的时间会被计入 CPU utilization

所以 memory stall cycles 可以导致很高的CPU utilization

更大的字长意味着更好的性能，但是这其中的关系并不简单。较大的字长可能会导致更高的无意义的内存和IO开销。对于x64结构，这些结构会通过寄存器的增加和更有效的寄存器约定调用得到补偿，所以x64APP性能可能比x32更好。