Global CPU Load - Tech Stack

A short story

In the stock market, three algorithms have been used to measure how a stock price moves over a period of time: simple moving averages (SMA), weighted moving averages (WMA), and exponential moving averages (EMA). An example of each follows:

Example) Closing prices for the last 6 days (raw data): 1020, 1030, 1000, 1010, 1020, 1060
- 3-day simple moving average ( SMA )
  - Evenly reflects past and recent data
  - = (1010 + 1020 + 1060) / 3
  - = 1030
- 3-day weighted moving average ( WMA )
  - Gives higher weight to recent data (e.g. w1=1/6, w2=2/6, w3=3/6)
  - = 1010 * w1 + 1020 * w2 + 1060 * w3
  - ≈ 168.3 + 340 + 530
  - ≈ 1038
- 3-day exponential moving average (EMA: Exponential Moving Average, or EWMA: Exponentially Weighted Moving Average)
  - Computed from only the previous day's exponential moving average and today's value, which gives more weight to recent data.
  - With the exponential smoothing factor (Exponential Percentage) k, the previous day's exponential moving average is multiplied by 1-k, today's value is multiplied by k, and the two products are added.
    - The smoothing factor is written here as the product of two parameters, where b depends on the period n:
      - k = a * b
  - The exponential smoothing factor (k) varies depending on the environment.
  - In the stock market, k is derived from the averaging period n as follows:
    - (a=2, b=1/(n+1))
    - k = 2 * 1/(n+1) = 2/(n+1)
      - Example: n=2 days, k=2/(n+1)=0.666666...
      - Example: n=3 days, k=2/(n+1)=0.5
      - Example: n=10 days, k=2/(n+1)=0.181818...
  - Each day's value is the previous day's exponential moving average * (1-k) + today's value * k.
    - Day 1: 1020 (the first value is taken at 100%)
    - Day 2 (n=2, k=2/3): 1020 * 33.3% + 1030 * 66.7% ≈ 1026
    - Day 3 (n=3, k=0.5): 1026 * 50% + 1000 * 50% = 1013
    - Day 4 (n=3, k=0.5): 1013 * 50% + 1010 * 50% ≈ 1012
    - Day 5 (n=3, k=0.5): 1012 * 50% + 1020 * 50% = 1016
    - Day 6 (n=3, k=0.5): 1016 * 50% + 1060 * 50% = 1038
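
The three averages in the list above can be reproduced with a few small C functions. This is a minimal sketch, not taken from any library: the function names (`sma`, `wma`, `ema_step`, `ema_example`) are made up for illustration, the WMA weights are assumed to be the linear 1, 2, ..., n weights of the example, and `ema_example` assumes the warm-up rule of the worked example (the period grows until it reaches the target period).

```c
#include <assert.h>
#include <stddef.h>

/* Simple moving average over the last `window` samples of x[0..n-1]. */
double sma(const double *x, size_t n, size_t window)
{
    assert(window > 0 && window <= n);
    double sum = 0.0;
    for (size_t i = n - window; i < n; i++)
        sum += x[i];
    return sum / (double)window;
}

/* Weighted moving average with linear weights 1, 2, ..., window
 * (oldest to newest), normalised by the sum of the weights. */
double wma(const double *x, size_t n, size_t window)
{
    assert(window > 0 && window <= n);
    double sum = 0.0, wsum = 0.0;
    for (size_t i = 0; i < window; i++) {
        double w = (double)(i + 1);
        sum += w * x[n - window + i];
        wsum += w;
    }
    return sum / wsum;
}

/* One EMA step: only the previous average and the new sample are needed. */
double ema_step(double prev, double sample, double k)
{
    return prev * (1.0 - k) + sample * k;
}

/* EMA over the whole series, following the worked example: the first
 * sample seeds the average, and on day d the period is min(d, period),
 * with k = 2 / (period + 1). */
double ema_example(const double *x, size_t n, size_t period)
{
    assert(n > 0 && period > 0);
    double avg = x[0];
    for (size_t i = 1; i < n; i++) {
        size_t p = (i + 1 < period) ? i + 1 : period;
        avg = ema_step(avg, x[i], 2.0 / (double)(p + 1));
    }
    return avg;
}
```

Called on the six closing prices with a window of 3, `sma` gives 1030 and `wma` gives about 1038.3. `ema_example` gives about 1037.9 because it does not round the intermediate days the way the hand calculation does.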

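The example above shows that:

1. SMA is the plain average of the last three days' prices.

2. WMA also averages only the last three days' prices, but applies a weight to each one (the more recent the price, the larger the weight).

3. EMA is also a 3-day moving average, but through the smoothing factor k it both gives recent data more weight and makes use of all earlier data, so that history still contributes. Its storage cost is small as well: as noted above, each step needs only the previous average and today's value.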
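
To see why all earlier data still contributes to the EMA, the recurrence can be unrolled. The derivation below is a sketch under the assumption of a constant smoothing factor k and a first value x_0 used as the seed (the symbols E_t and x_t are introduced here for illustration):

```latex
\begin{aligned}
E_t &= k\,x_t + (1-k)\,E_{t-1}, \qquad E_0 = x_0 \\
    &= k\,x_t + k(1-k)\,x_{t-1} + (1-k)^2\,E_{t-2} \\
    &= k\sum_{j=0}^{t-1} (1-k)^j\,x_{t-j} \;+\; (1-k)^t\,x_0
\end{aligned}
```

With k = 0.5 the weights are 0.5, 0.25, 0.125, and so on: every past sample is included, but its weight decays geometrically with age.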

The algorithm Linux uses to compute CPU load is similar to EMA. It is also called EDA (Exponential Decaying Average) or EDMA (Exponential Damped Moving Average).

When the global CPU load is updated

Fixed tick handler
- tick_handle_periodic() -> tick_periodic() -> do_timer() -> calc_global_load()
tick nohz handler
- tick_nohz_handler() -> tick_sched_do_timer() -> tick_do_update_jiffies64() -> do_timer() -> calc_global_load()
- tick_nohz_update_jiffies() -> tick_do_update_jiffies64() -> do_timer() -> calc_global_load()
- tick_nohz_restart_sched_tick() -> tick_do_update_jiffies64() -> do_timer() -> calc_global_load()
hrtimer tick handler
- tick_sched_timer() -> tick_sched_do_timer() -> tick_do_update_jiffies64() -> do_timer() -> calc_global_load()

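Although calc_global_load() is called on every scheduler tick, it does not update the load on every call. Updates follow a period tracked by calc_load_update, 5 seconds by default. More precisely, an update happens at calc_load_update + 10 ticks: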

/*
 * calc_load - update the avenrun load estimates 10 ticks after the
 * CPUs have updated calc_load_tasks.
 *
 * Called from the global timer code.
 */
void calc_global_load(unsigned long ticks)
{
    unsigned long sample_window;
    long active, delta;

    sample_window = READ_ONCE(calc_load_update);
    if (time_before(jiffies, sample_window + 10))
        return;

    ......
}
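
The early return above can be illustrated outside the kernel. The following is a minimal user-space sketch: `ticks_before` and `load_update_due` are hypothetical names, and `ticks_before` only imitates the wraparound-safe comparison style of the kernel's time_before() by looking at the sign of the difference.

```c
#include <stdbool.h>

/* Wraparound-safe "a is earlier than b" for free-running tick counters.
 * The unsigned subtraction wraps, and the sign of the result tells which
 * value comes first (relies on the usual two's-complement conversion). */
static bool ticks_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/* True when the global update may run at tick `now`: not earlier than
 * 10 ticks after the start of the sample window. */
bool load_update_due(unsigned long now, unsigned long sample_window)
{
    return !ticks_before(now, sample_window + 10);
}
```

For a window starting at tick 1000, `load_update_due` stays false up to tick 1009 and becomes true at tick 1010. The comparison also behaves correctly when the counter wraps past its maximum value.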

How the global CPU load is computed

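As an example of a load value: if two runnable tasks are running on a single-CPU system, the load is 2.0. On a system with 4 CPUs and the same 2 runnable tasks, the reported load is still 2.0, because the global load average sums tasks over all CPUs and is not divided by the CPU count. Relative to capacity, that is 0.5 per CPU.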

The algorithm the Linux kernel uses to compute the global load average is described in the comment at the top of kernel/sched/loadavg.c:

/*
 * Global load-average calculations
 *
 * We take a distributed and async approach to calculating the global load-avg
 * in order to minimize overhead.
 *
 * The global load average is an exponentially decaying average of nr_running +
 * nr_uninterruptible.
 *
 * Once every LOAD_FREQ:
 *
 *   nr_active = 0;
 *   for_each_possible_cpu(cpu)
 *       nr_active += cpu_of(cpu)->nr_running + cpu_of(cpu)->nr_uninterruptible;
 *
 *   avenrun[n] = avenrun[0] * exp_n + nr_active * (1 - exp_n)
 *
 * Due to a number of reasons the above turns in the mess below:
 *
 *  - for_each_possible_cpu() is prohibitively expensive on machines with
 *    serious number of CPUs, therefore we need to take a distributed approach
 *    to calculating nr_active.
 *
 *        \Sum_i x_i(t) = \Sum_i x_i(t) - x_i(t_0) | x_i(t_0) := 0
 *                      = \Sum_i { \Sum_j=1 x_i(t_j) - x_i(t_j-1) }
 *
 *    So assuming nr_active := 0 when we start out -- true per definition, we
 *    can simply take per-CPU deltas and fold those into a global accumulate
 *    to obtain the same result. See calc_load_fold_active().
 *
 *    Furthermore, in order to avoid synchronizing all per-CPU delta folding
 *    across the machine, we assume 10 ticks is sufficient time for every
 *    CPU to have completed this task.
 *
 *    This places an upper-bound on the IRQ-off latency of the machine. Then
 *    again, being late doesn't loose the delta, just wrecks the sample.
 *
 *  - cpu_rq()->nr_uninterruptible isn't accurately tracked per-CPU because
 *    this would add another cross-CPU cacheline miss and atomic operation
 *    to the wakeup path. Instead we increment on whatever CPU the task ran
 *    when it went into uninterruptible state and decrement on whatever CPU
 *    did the wakeup. This means that only the sum of nr_uninterruptible over
 *    all CPUs yields the correct result.
 *
 *  This covers the NO_HZ=n code, for extra head-aches, see the comment below.
 */

As the comment says, the original formula for computing the load in Linux is:

avenrun[n] = avenrun[0] * exp_n + nr_active * (1 - exp_n)

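However, on systems with a large number of CPUs, computing the system-wide nr_active this way is expensive, so an improved scheme is used:

First, nr_active is set to 0 initially (at boot), so the load naturally starts at 0.

Then, on its scheduler tick, each CPU computes the delta (increase or decrease) of its own nr_active and records it in the global variable calc_load_tasks; see calc_load_fold_active().

Next, the per-CPU folds into calc_load_tasks are atomic but are not synchronized with each other across CPUs, since that would be costly. Instead, 10 ticks are allowed for every CPU to finish folding (this is the reason for calc_load_update + 10 ticks in the previous section).

Finally, because the work of computing nr_active has been spread out over time, once the calc_load_update period expires the global variable calc_load_tasks can be read directly and used as nr_active.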
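
The four steps above can be sketched in user-space C. This is an illustration of the delta-folding idea only, not the kernel code: `struct demo_rq`, `demo_load_tasks`, `demo_fold_active`, `demo_tick`, and `demo_read_active` are hypothetical names, and C11 atomics stand in for the kernel's atomic operations.

```c
#include <stdatomic.h>

/* Hypothetical per-CPU state for the sketch. */
struct demo_rq {
    long nr_active;     /* runnable + uninterruptible tasks on this CPU */
    long folded_active; /* value already accounted for in the global sum */
};

/* Stands in for the global calc_load_tasks counter; starts at 0. */
static atomic_long demo_load_tasks;

/* Return the change since the last fold and remember the new value. */
long demo_fold_active(struct demo_rq *rq)
{
    long delta = rq->nr_active - rq->folded_active;

    rq->folded_active = rq->nr_active;
    return delta;
}

/* Run from each CPU's own tick: fold the local delta into the global sum. */
void demo_tick(struct demo_rq *rq)
{
    long delta = demo_fold_active(rq);

    if (delta)
        atomic_fetch_add(&demo_load_tasks, delta);
}

/* Global update: no loop over CPUs, just read the accumulated counter. */
long demo_read_active(void)
{
    return atomic_load(&demo_load_tasks);
}
```

Because every CPU only ever adds the difference from its previous contribution, the global counter equals the sum of all per-CPU counts once each CPU has folded, which is the telescoping sum from the kernel comment.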

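Linux exposes three recent load averages to users, covering the last 1 minute, 5 minutes, and 15 minutes. They are stored in the global array avenrun[3] and can be viewed with the "uptime" command or through the "/proc/loadavg" file: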

$ uptime
20:41:18 up 1:09, 2 users, load average: 0.57, 0.15, 0.11

$ cat /proc/loadavg
0.55 0.28 0.16 1/318 2112
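
A program can read the same values by parsing the /proc/loadavg line. The sketch below is a minimal example with hypothetical names (`struct loadavg_sample`, `parse_loadavg`). The meaning assigned to the last two fields (running/total scheduling entities and the most recently created PID) follows the proc(5) manual page rather than this article.

```c
#include <stdio.h>

/* One parsed /proc/loadavg line (hypothetical struct for this sketch). */
struct loadavg_sample {
    double avg1, avg5, avg15; /* 1, 5 and 15 minute load averages */
    int running, total;       /* "running/total" scheduling entities */
    int last_pid;             /* most recently created PID */
};

/* Parse one line in /proc/loadavg format.
 * Returns 0 on success, -1 if the line does not match. */
int parse_loadavg(const char *line, struct loadavg_sample *out)
{
    int n = sscanf(line, "%lf %lf %lf %d/%d %d",
                   &out->avg1, &out->avg5, &out->avg15,
                   &out->running, &out->total, &out->last_pid);

    return n == 6 ? 0 : -1;
}
```

In practice the line would come from `fgets()` on a `FILE *` opened on /proc/loadavg; the parsing is kept separate here so it can be exercised with a fixed string.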

The three load averages are computed as follows:

nr_active = the number of running and uninterruptible tasks, summed over the runqueues of all CPUs
avenrun[0] = avenrun[0]*k1 + nr_active*(1-k1) // 1 minute
avenrun[1] = avenrun[1]*k2 + nr_active*(1-k2) // 5 minutes
avenrun[2] = avenrun[2]*k3 + nr_active*(1-k3) // 15 minutes

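In these formulas k = e^(-1/n), where n is the number of update periods in the averaging window. For example, avenrun[0] covers 1 minute and is updated every 5 s, which is 12 periods; avenrun[1] corresponds to 60 periods and avenrun[2] to 180 periods. So:

n=12, k = e^(-1/12) = 0.920044 (about 92%);
n=60, k = e^(-1/60) = 0.9835;
n=180, k = e^(-1/180) = 0.9945.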
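
A short worked equation shows what this choice of k means in practice. It is a sketch assuming the load starts at 0 and nr_active then stays at a constant value A (the symbols a_m and A are introduced here for illustration):

```latex
\begin{aligned}
a_m &= k\,a_{m-1} + (1-k)\,A
  \;\Longrightarrow\; A - a_m = k\,(A - a_{m-1}) = k^m\,(A - a_0) \\
a_0 &= 0,\quad k = e^{-1/n},\quad m = n
  \;\Longrightarrow\; a_n = A\,(1 - e^{-1}) \approx 0.632\,A
\end{aligned}
```

So one minute after a constant load appears, the 1-minute average has covered only about 63% of the distance to the true value. The window length is a time constant of the decay, not a hard cutoff.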

Note: nr_active here includes not only the runnable tasks on the runqueues but also tasks in uninterruptible sleep. For the reasons behind this, see:

https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html

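In the Linux implementation, for the sake of precision, the kernel uses fixed-point arithmetic in which 2^11 = 2048 represents a real load of 1.0, and it hardcodes the decay factors for the three load averages: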

#define FSHIFT    11            /* nr of bits of precision */
#define FIXED_1   (1<<FSHIFT)   /* 1.0 as fixed-point */
#define LOAD_FREQ (5*HZ+1)      /* 5 sec intervals */
#define EXP_1     1884          /* 1/exp(5sec/1min) as fixed-point */
#define EXP_5     2014          /* 1/exp(5sec/5min) */
#define EXP_15    2037          /* 1/exp(5sec/15min) */

where:

EXP_1 = 1/(exp(1/12)) * FIXED_1 = 92.00% * 2048 = 1884
EXP_5 = 1/(exp(1/60)) * FIXED_1 = 98.35% * 2048 = 2014
EXP_15 = 1/(exp(1/180)) * FIXED_1 = 99.45% * 2048 = 2037
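
The relation between the decimal factors and the hardcoded constants can be checked with a pair of conversion helpers. This is a minimal sketch: `fixed_to_double` and `double_to_fixed` are hypothetical helper names, not kernel functions, and rounding to nearest is an assumption chosen because it reproduces the three constants.

```c
#define FSHIFT  11
#define FIXED_1 (1UL << FSHIFT) /* 2048 represents 1.0 */

/* Convert a fixed-point value to a floating-point number. */
double fixed_to_double(unsigned long v)
{
    return (double)v / (double)FIXED_1;
}

/* Convert a non-negative number to fixed point, rounding to nearest. */
unsigned long double_to_fixed(double x)
{
    return (unsigned long)(x * (double)FIXED_1 + 0.5);
}
```

With these helpers, `double_to_fixed(0.920044)` yields 1884, `double_to_fixed(0.9835)` yields 2014, and `double_to_fixed(0.9945)` yields 2037, matching EXP_1, EXP_5, and EXP_15.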

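In the code, the formula takes a slightly modified form, shown in the calc_load() function that follows.

The FIXED_1-1 (2047) added at the end acts as a round-up: when active >= load, it keeps the upward trend of the load from being truncated away by the integer division.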

/*
 * a1 = a0 * e + a * (1 - e)
 */
static inline unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
    unsigned long newload;

    newload = load * exp + active * (FIXED_1 - exp);
    if (active >= load)
        newload += FIXED_1-1;

    return newload / FIXED_1;
}
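
The effect of the round-up can be observed by iterating the same arithmetic in user space. The sketch below is not the kernel function: `demo_calc_load` and `demo_run` are hypothetical names, the `round_up` flag is added only to compare both behaviours, and `active` is assumed to be the task count scaled by FIXED_1, as the fixed-point representation implies.

```c
#define FSHIFT  11
#define FIXED_1 (1UL << FSHIFT)
#define EXP_1   1884UL

/* Same arithmetic as the calc_load() shown above; `round_up` switches
 * the "+ FIXED_1 - 1" step on or off for comparison. */
unsigned long demo_calc_load(unsigned long load, unsigned long factor,
                             unsigned long active, int round_up)
{
    unsigned long newload = load * factor + active * (FIXED_1 - factor);

    if (round_up && active >= load)
        newload += FIXED_1 - 1;
    return newload / FIXED_1;
}

/* Apply `steps` 5-second updates with a constant number of active tasks. */
unsigned long demo_run(unsigned long load, unsigned long tasks,
                       unsigned int steps, int round_up)
{
    unsigned long active = tasks * FIXED_1; /* task count in fixed point */

    while (steps--)
        load = demo_calc_load(load, EXP_1, active, round_up);
    return load;
}
```

Starting from 0 with one active task, the rounding version eventually reaches exactly 2048 (a load of 1.00). With `round_up` set to 0 the value stalls at 2036 (about 0.994), because the integer division discards the last fractional increments.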