Global CPU Load

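A short story

In the stock market, three algorithms have been used to track how a stock's price moves over a period of time: simple moving averages (SMA), weighted moving averages (WMA), and exponential moving averages (EMA). The example below illustrates all three: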

  • Example: closing prices for the last 6 days (raw data): 1020, 1030, 1000, 1010, 1020, 1060
    • 3-day simple moving average ( SMA )
      • Weights every value in the window equally
      • = (1010 + 1020 + 1060) / 3
      • = 1030
    • 3-day weighted moving average ( WMA )
      • Gives higher weight to recent data (e.g. w1=1/6, w2=2/6, w3=3/6)
      • = 1010 * w1 + 1020 * w2 + 1060 * w3
      • = 168.33 + 340 + 530
      • ≈ 1038
    • 3-day exponential moving average ( EMA : Exponential Moving Average or EWMA : Exponential Weighted Moving Average)
      • Computed from only the previous day's exponential moving average and today's value, giving more weight to recent data.
      • With the exponential smoothing factor (Exponential Percentage) k, the previous day's exponential moving average is multiplied by 1-k, today's value is multiplied by k, and the two products are added.
        • The smoothing factor is written here as a product of two terms that depend on the period n:
          • k = a * b
      • The appropriate smoothing factor k depends on the application.
      • In the stock market, k is derived from the averaging period n as follows:
        • (a=2, b=1/(n+1))
        • k= 2 * 1/(n+1) = 2/(n+1)
          • Example: n=2 days, k=2/(n+1)=0.666666...
          • Example: n=3 days, k=2/(n+1)=0.5
          • Example: n=10 days, k=2/(n+1)=0.181818...
      • Each day's value is the previous exponential moving average * (1-k) + today's value * k (results shown as approximate whole numbers).
        • Day 1: 1020 (the first value is taken as is)
        • Day 2: 2-day factor, k=2/3: 1020 * 1/3 + 1030 * 2/3 ≈ 1026
        • Day 3: 3-day factor, k=1/2: 1026 * 50% + 1000 * 50% = 1013
        • Day 4: 3-day factor, k=1/2: 1013 * 50% + 1010 * 50% ≈ 1012
        • Day 5: 3-day factor, k=1/2: 1012 * 50% + 1020 * 50% = 1016
        • Day 6: 3-day factor, k=1/2: 1016 * 50% + 1060 * 50% = 1038
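
The three averages above can be sketched in a few lines of C. This is only an illustration: the function names (sma, wma, ema_step, ema) are made up for this example, the WMA uses linear weights 1..window as in the example, and the EMA uses a fixed k = 2/(period+1) seeded with the first sample. Because k is fixed and nothing is rounded, its intermediate values differ slightly from the hand-computed list, which uses k=2/3 on day 2 and whole numbers.

```c
#include <stddef.h>

/* Simple moving average over the last `window` of `n` samples. */
double sma(const double *x, size_t n, size_t window)
{
    double sum = 0.0;
    for (size_t i = n - window; i < n; i++)
        sum += x[i];
    return sum / (double)window;
}

/* Weighted moving average with linear weights 1..window (newest is heaviest). */
double wma(const double *x, size_t n, size_t window)
{
    double sum = 0.0, wsum = 0.0;
    for (size_t i = 0; i < window; i++) {
        double w = (double)(i + 1);
        sum += w * x[n - window + i];
        wsum += w;
    }
    return sum / wsum;
}

/* One EMA step: only the previous average and the new sample are needed. */
double ema_step(double prev, double sample, double k)
{
    return prev * (1.0 - k) + sample * k;
}

/* EMA over the whole series with k = 2/(period+1), seeded with x[0]. */
double ema(const double *x, size_t n, size_t period)
{
    double k = 2.0 / (double)(period + 1);
    double avg = x[0];
    for (size_t i = 1; i < n; i++)
        avg = ema_step(avg, x[i], k);
    return avg;
}
```

For the six closing prices in the example, sma(prices, 6, 3) gives 1030, wma(prices, 6, 3) gives about 1038.33, and ema(prices, 6, 3) gives 1037.8125. Note that ema_step() keeps no history at all, which is the property the kernel relies on.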

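As the example shows:

1. SMA is the plain average of the prices of the last three days.

2. WMA also averages only the last three days, but applies a weight to each price (the more recent the price, the larger the weight).

3. EMA is nominally also a 3-day moving average, but through the smoothing factor k it both gives recent data more weight and keeps every earlier value in play, so historical data still contributes. Its storage cost is small as well: only the previous average has to be kept, not the raw history.

The algorithm Linux uses to compute CPU load is similar to EMA. It is also called EDA (Exponential Decaying Average) or EDMA (Exponential Damped Moving Average).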

When the global CPU load is updated

  • Fixed tick handler
    • tick_handle_periodic() -> tick_periodic() -> do_timer() -> calc_global_load()
  • tick nohz handler
    • tick_nohz_handler() -> tick_sched_do_timer() -> tick_do_update_jiffies64() -> do_timer() -> calc_global_load()
    • tick_nohz_update_jiffies() -> tick_do_update_jiffies64() -> do_timer() -> calc_global_load()
    • tick_nohz_restart_sched_tick() -> tick_do_update_jiffies64() -> do_timer() -> calc_global_load()
  • hrtimer tick handler
    • tick_sched_timer() -> tick_sched_do_timer() -> tick_do_update_jiffies64() -> do_timer() -> calc_global_load()

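Although calc_global_load() is called on every scheduler tick, it does not recompute the load each time. Updates happen once per period: calc_load_update marks the next update time and advances in 5-second steps by default. More precisely, the update runs at calc_load_update + 10 ticks: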

/*
 * calc_load - update the avenrun load estimates 10 ticks after the
 * CPUs have updated calc_load_tasks.
 *
 * Called from the global timer code.
 */
void calc_global_load(unsigned long ticks)
{
    unsigned long sample_window;
    long active, delta;

    sample_window = READ_ONCE(calc_load_update);
    if (time_before(jiffies, sample_window + 10))
        return;
    ......

How the global CPU load is calculated

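As an example of a load value: with two runnable tasks on a single-CPU system, the load is 2.0. On a 4-CPU system with the same two runnable tasks, the reported load average is still 2.0, because Linux does not normalize it by the number of CPUs; spread across the four CPUs it corresponds to 0.5 per CPU.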

The algorithm the Linux kernel uses to compute the global load average is described in the comment at the top of kernel/sched/loadavg.c:

/*
 * Global load-average calculations
 *
 * We take a distributed and async approach to calculating the global load-avg
 * in order to minimize overhead.
 *
 * The global load average is an exponentially decaying average of nr_running +
 * nr_uninterruptible.
 *
 * Once every LOAD_FREQ:
 *
 *   nr_active = 0;
 *   for_each_possible_cpu(cpu)
 *       nr_active += cpu_of(cpu)->nr_running + cpu_of(cpu)->nr_uninterruptible;
 *
 *   avenrun[n] = avenrun[0] * exp_n + nr_active * (1 - exp_n)
 *
 * Due to a number of reasons the above turns in the mess below:
 *
 *  - for_each_possible_cpu() is prohibitively expensive on machines with
 *    serious number of CPUs, therefore we need to take a distributed approach
 *    to calculating nr_active.
 *
 *        \Sum_i x_i(t) = \Sum_i x_i(t) - x_i(t_0) | x_i(t_0) := 0
 *                      = \Sum_i { \Sum_j=1 x_i(t_j) - x_i(t_j-1) }
 *
 *    So assuming nr_active := 0 when we start out -- true per definition, we
 *    can simply take per-CPU deltas and fold those into a global accumulate
 *    to obtain the same result. See calc_load_fold_active().
 *
 *    Furthermore, in order to avoid synchronizing all per-CPU delta folding
 *    across the machine, we assume 10 ticks is sufficient time for every
 *    CPU to have completed this task.
 *
 *    This places an upper-bound on the IRQ-off latency of the machine. Then
 *    again, being late doesn't loose the delta, just wrecks the sample.
 *
 *  - cpu_rq()->nr_uninterruptible isn't accurately tracked per-CPU because
 *    this would add another cross-CPU cacheline miss and atomic operation
 *    to the wakeup path. Instead we increment on whatever CPU the task ran
 *    when it went into uninterruptible state and decrement on whatever CPU
 *    did the wakeup. This means that only the sum of nr_uninterruptible over
 *    all CPUs yields the correct result.
 *
 *  This covers the NO_HZ=n code, for extra head-aches, see the comment below.
 */

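As the comment says, the original formula for computing the load in Linux is the following (the comment writes avenrun[0] on the right-hand side, but each average decays from its own previous value):

avenrun[n] = avenrun[n] * exp_n + nr_active * (1 - exp_n)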

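However, on systems with many CPUs, computing the system-wide nr_active in one pass is expensive, so an improved scheme is used:

First, nr_active starts at 0 (at boot), so the load is naturally 0 as well.

Then, on scheduler ticks, each CPU computes the delta by which its own nr_active has grown or shrunk and adds it to the global variable calc_load_tasks; see calc_load_fold_active().

Next, updates to the global calc_load_tasks have to be atomic, and the CPUs are deliberately not synchronized with one another, so 10 ticks are allowed for every CPU to finish folding its delta (this is the reason for calc_load_update + 10 ticks in the previous section).

Finally, because the work of computing nr_active has been spread out over time, once the calc_load_update period expires the global calc_load_tasks can be read directly and used as nr_active.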
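
The four steps above can be sketched as a small userspace model. Everything here is a simplification made up for illustration: struct cpu_state is not the kernel's struct rq, global_active only plays the role of calc_load_tasks, and fold_active() follows the idea of calc_load_fold_active() without its exact signature.

```c
#include <stdatomic.h>

/* Hypothetical, simplified per-CPU state (not the kernel's struct rq). */
struct cpu_state {
    long nr_running;          /* runnable tasks on this CPU */
    long nr_uninterruptible;  /* may be negative per CPU; only the sum matters */
    long folded;              /* value last folded into the global counter */
};

/* Global accumulator, playing the role of calc_load_tasks. */
static atomic_long global_active;

/* Return the change since the last fold and remember the new value. */
long fold_active(struct cpu_state *cpu)
{
    long now = cpu->nr_running + cpu->nr_uninterruptible;
    long delta = now - cpu->folded;
    cpu->folded = now;
    return delta;
}

/* Run by each CPU on its own; the shared counter is touched only on change. */
void fold_into_global(struct cpu_state *cpu)
{
    long delta = fold_active(cpu);
    if (delta)
        atomic_fetch_add(&global_active, delta);
}

/* The sampler reads the accumulator instead of walking every CPU. */
long read_global_active(void)
{
    return atomic_load(&global_active);
}
```

Since every CPU only ever adds its own delta, the accumulator equals the sum of all per-CPU counts once each CPU has folded, which is what the sum-of-deltas identity in the kernel comment expresses.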

Linux exposes three recent load averages to users, covering the last 1 minute, 5 minutes, and 15 minutes. They are stored in the global array avenrun[3] and can be viewed with the "uptime" command or in the "/proc/loadavg" file:

$ uptime
20:41:18 up 1:09, 2 users, load average: 0.57, 0.15, 0.11

$ cat /proc/loadavg
0.55 0.28 0.16 1/318 2112
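
A program can read the same values by parsing that line. The sketch below assumes the usual /proc/loadavg layout (three averages, runnable/total scheduling entities, most recently created PID); struct loadavg and parse_loadavg() are names invented for this example.

```c
#include <stdio.h>

/* Fields of one /proc/loadavg line (struct name is illustrative). */
struct loadavg {
    double avg1, avg5, avg15;  /* 1-, 5- and 15-minute load averages */
    int runnable, total;       /* runnable / total scheduling entities */
    int last_pid;              /* most recently created PID */
};

/* Parse one line in /proc/loadavg format; returns 0 on success, -1 otherwise. */
int parse_loadavg(const char *line, struct loadavg *out)
{
    int n = sscanf(line, "%lf %lf %lf %d/%d %d",
                   &out->avg1, &out->avg5, &out->avg15,
                   &out->runnable, &out->total, &out->last_pid);
    return n == 6 ? 0 : -1;
}
```

Feeding it the sample line "0.55 0.28 0.16 1/318 2112" yields avg1 = 0.55, runnable = 1, total = 318 and last_pid = 2112. On a live system the line would come from reading /proc/loadavg with fopen() and fgets().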

The three load averages are computed as follows:

  • nr_active = the sum, over the runqueues of all CPUs, of the number of running and uninterruptible tasks.
  • avenrun[0] = avenrun[0]*k1 + nr_active*(1-k1) // 1 minute
  • avenrun[1] = avenrun[1]*k2 + nr_active*(1-k2) // 5 minutes
  • avenrun[2] = avenrun[2]*k3 + nr_active*(1-k3) // 15 minutes

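In these formulas k = e^(-1/n), where n is the number of update periods in the averaging window. avenrun[0] covers 1 minute and is updated every 5 s, which makes 12 periods; avenrun[1] corresponds to 60 periods and avenrun[2] to 180 periods. Therefore:

  • n=12, k = e^(-1/12) = 0.920044 (about 92%);
  • n=60, k = e^(-1/60) = 0.9835;
  • n=180, k = e^(-1/180) = 0.9945.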
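
The factor k can be read as the decay over one sampling interval. The short derivation below uses symbols introduced only for this illustration: \Delta t is the 5 s update interval, T the averaging window, A a constant nr_active, and a_m the average after m updates starting from a_0 = 0.

```latex
\begin{align*}
k &= e^{-\Delta t / T} = e^{-1/n}, \qquad n = \frac{T}{\Delta t} \\
T = 60\,\mathrm{s}:  \quad n &= 12,  \quad k = e^{-1/12}  \approx 0.9200 \\
T = 300\,\mathrm{s}: \quad n &= 60,  \quad k = e^{-1/60}  \approx 0.9835 \\
T = 900\,\mathrm{s}: \quad n &= 180, \quad k = e^{-1/180} \approx 0.9945 \\[4pt]
a_{m+1} &= k\,a_m + (1-k)\,A \;\Longrightarrow\; a_m = A\,(1 - k^m) \\
a_n &= A\,(1 - e^{-1}) \approx 0.632\,A
\end{align*}
```

So after one full window of constant load, the average has covered about 63% of the distance to its final value rather than all of it; the "1-minute" figure is a decay time constant, not a strict 60-second mean.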

Note: nr_active here includes not only the runnable tasks in the runqueue but also tasks in uninterruptible sleep. For the reasons behind this, see:

https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html

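In the implementation, for precision the kernel uses fixed-point arithmetic in which 2^11 = 2048 represents a real load of 1.0, and it hardcodes the decay factors for the three load averages: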

#define FSHIFT   11          /* nr of bits of precision */
#define FIXED_1  (1<<FSHIFT) /* 1.0 as fixed-point */
#define LOAD_FREQ (5*HZ+1)   /* 5 sec intervals */
#define EXP_1    1884        /* 1/exp(5sec/1min) as fixed-point */
#define EXP_5    2014        /* 1/exp(5sec/5min) */
#define EXP_15   2037        /* 1/exp(5sec/15min) */

where:

  • EXP_1 = 1/(exp(1/12)) * FIXED_1 = 92.00% * 2048 = 1884
  • EXP_5 = 1/(exp(1/60)) * FIXED_1 = 98.35% * 2048 = 2014
  • EXP_15 = 1/(exp(1/180)) * FIXED_1 = 99.45% * 2048 = 2037
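
The three constants can be recomputed with a few lines of C. This is a sketch under stated assumptions: exp_neg() is a hand-written Taylor series used only to keep the example free of libm, and fixed_exp() is a hypothetical helper that rounds to the nearest integer.

```c
#define FSHIFT  11
#define FIXED_1 (1UL << FSHIFT)

/* e^(-x) by Taylor series; adequate for the small x used here. */
static double exp_neg(double x)
{
    double term = 1.0, sum = 1.0;
    for (int i = 1; i <= 20; i++) {
        term *= -x / i;
        sum += term;
    }
    return sum;
}

/* Fixed-point decay factor for a window of `window_sec` seconds sampled
 * every `interval_sec` seconds, rounded to nearest. */
unsigned long fixed_exp(double interval_sec, double window_sec)
{
    return (unsigned long)(exp_neg(interval_sec / window_sec) * FIXED_1 + 0.5);
}
```

fixed_exp(5, 60), fixed_exp(5, 300) and fixed_exp(5, 900) return 1884, 2014 and 2037, matching EXP_1, EXP_5 and EXP_15.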

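In the code, the formula is slightly transformed, as calc_load() below shows.

The extra 2047 (FIXED_1-1) added at the end acts as a round-up: when active >= load, it keeps the upward trend of the load from being cut off by the truncating integer division.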

/*
 * a1 = a0 * e + a * (1 - e)
 */
static inline unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
    unsigned long newload;

    newload = load * exp + active * (FIXED_1 - exp);
    if (active >= load)
        newload += FIXED_1-1;

    return newload / FIXED_1;
}
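
To see the fixed-point update at work, the function can be copied into a userspace experiment. In the sketch below calc_load() repeats the arithmetic shown above, while simulate(), load_int() and load_frac() are helpers written for this illustration; the latter two mirror how the integer and two-digit fractional parts are derived for display.

```c
#define FSHIFT   11
#define FIXED_1  (1UL << FSHIFT)
#define EXP_1    1884UL

/* Same arithmetic as calc_load() above, copied for a userspace experiment. */
static unsigned long calc_load(unsigned long load, unsigned long exp,
                               unsigned long active)
{
    unsigned long newload = load * exp + active * (FIXED_1 - exp);
    if (active >= load)
        newload += FIXED_1 - 1;
    return newload / FIXED_1;
}

/* Apply `steps` 5-second updates with a constant number of active tasks. */
unsigned long simulate(unsigned long load, unsigned long nr_active, int steps)
{
    unsigned long active = nr_active * FIXED_1;  /* convert to fixed point */
    for (int i = 0; i < steps; i++)
        load = calc_load(load, EXP_1, active);
    return load;
}

/* Integer part and two-digit fraction of a fixed-point load value. */
unsigned long load_int(unsigned long x)  { return x >> FSHIFT; }
unsigned long load_frac(unsigned long x) { return ((x & (FIXED_1 - 1)) * 100) >> FSHIFT; }
```

Starting from an idle system with one task always active, simulate(0, 1, 1) returns 164, which displays as 0.08, and after enough updates the value settles at exactly 2048, i.e. 1.00. By the same arithmetic without the round-up, the 1-minute value would stall at 2036 and display as 0.99, which is the truncation effect described above.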