Linux Time Subsystem (3): The timekeeping Module

I. Software architecture of the time subsystem

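Following the software architecture diagram, this series analyzes the function of each module from top to bottom. This part walks through the source code of the timekeeping module.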

II. timekeeping

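The timekeeping module is a base module that provides time services. The Linux kernel offers several timelines, such as the real time clock, the monotonic clock and the monotonic raw clock. The timekeeping module tracks and maintains these timelines and serves other modules (timer-related modules, user-space time services and so on). It builds on two other modules, clocksource and tick: tick events from the tick module drive the periodic update of each timeline, and the clocksource module supplies finer-grained time information between ticks.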

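This article first introduces some basic timekeeping concepts, then the initialization of the module. After that it describes, from top to bottom, the services the module provides and how it interacts with the tick module and the clocksource module. It closes with power management.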

III. Core timekeeper data definitions

/**

* struct tk_read_base - base structure for timekeeping readout

* @clock: Current clocksource used for timekeeping.

* @mask: Bitmask for two's complement subtraction of non 64bit clocks

* @cycle_last: @clock cycle value at last update

* @mult: (NTP adjusted) multiplier for scaled math conversion

* @shift: Shift value for scaled math conversion

* @xtime_nsec: Shifted (fractional) nano seconds offset for readout

* @base: ktime_t (nanoseconds) base time for readout

* @base_real: Nanoseconds base value for clock REALTIME readout

*

* This struct has size 56 byte on 64 bit. Together with a seqcount it

* occupies a single 64byte cache line.

*

* The struct is separate from struct timekeeper as it is also used

* for a fast NMI safe accessors.

*

* @base_real is for the fast NMI safe accessor to allow reading clock

* realtime from any context.

*/

struct tk_read_base {

struct clocksource *clock;

u64 mask;

u64 cycle_last;

u32 mult;

u32 shift;

u64 xtime_nsec;

ktime_t base;

u64 base_real;

};

/**

* struct timekeeper - Structure holding internal timekeeping values.

* @tkr_mono: The readout base structure for CLOCK_MONOTONIC

* @tkr_raw: The readout base structure for CLOCK_MONOTONIC_RAW

* @xtime_sec: Current CLOCK_REALTIME time in seconds

* @ktime_sec: Current CLOCK_MONOTONIC time in seconds

* @wall_to_monotonic: CLOCK_REALTIME to CLOCK_MONOTONIC offset

* @offs_real: Offset clock monotonic -> clock realtime

* @offs_boot: Offset clock monotonic -> clock boottime

* @offs_tai: Offset clock monotonic -> clock tai

* @tai_offset: The current UTC to TAI offset in seconds

* @clock_was_set_seq: The sequence number of clock was set events

* @cs_was_changed_seq: The sequence number of clocksource change events

* @next_leap_ktime: CLOCK_MONOTONIC time value of a pending leap-second

* @raw_sec: CLOCK_MONOTONIC_RAW time in seconds

* @monotonic_to_boot: CLOCK_MONOTONIC to CLOCK_BOOTTIME offset

* @cycle_interval: Number of clock cycles in one NTP interval

* @xtime_interval: Number of clock shifted nano seconds in one NTP

* interval.

* @xtime_remainder: Shifted nano seconds left over when rounding

* @cycle_interval

* @raw_interval: Shifted raw nano seconds accumulated per NTP interval.

* @ntp_error: Difference between accumulated time and NTP time in ntp

* shifted nano seconds.

* @ntp_error_shift: Shift conversion between clock shifted nano seconds and

* ntp shifted nano seconds.

* @last_warning: Warning ratelimiter (DEBUG_TIMEKEEPING)

* @underflow_seen: Underflow warning flag (DEBUG_TIMEKEEPING)

* @overflow_seen: Overflow warning flag (DEBUG_TIMEKEEPING)

*

* Note: For timespec(64) based interfaces wall_to_monotonic is what

* we need to add to xtime (or xtime corrected for sub jiffie times)

* to get to monotonic time. Monotonic is pegged at zero at system

* boot time, so wall_to_monotonic will be negative, however, we will

* ALWAYS keep the tv_nsec part positive so we can use the usual

* normalization.

*

* wall_to_monotonic is moved after resume from suspend for the

* monotonic time not to jump. We need to add total_sleep_time to

* wall_to_monotonic to get the real boot based time offset.

*

* wall_to_monotonic is no longer the boot time, getboottime must be

* used instead.

*

* @monotonic_to_boottime is a timespec64 representation of @offs_boot to

* accelerate the VDSO update for CLOCK_BOOTTIME.

*/

struct timekeeper {

struct tk_read_base tkr_mono;

struct tk_read_base tkr_raw;

u64 xtime_sec;

unsigned long ktime_sec;

struct timespec64 wall_to_monotonic;

ktime_t offs_real;

ktime_t offs_boot;

ktime_t offs_tai;

s32 tai_offset;

unsigned int clock_was_set_seq;

u8 cs_was_changed_seq;

ktime_t next_leap_ktime;

u64 raw_sec;

struct timespec64 monotonic_to_boot;

/* The following members are for timekeeping internal use */

u64 cycle_interval;

u64 xtime_interval;

s64 xtime_remainder;

u64 raw_interval;

/* The ntp_tick_length() value currently being used.

* This cached copy ensures we consistently apply the tick

* length for an entire tick, as ntp_tick_length may change

* mid-tick, and we don't want to apply that new value to

* the tick in progress.

*/

u64 ntp_tick;

/* Difference between accumulated time and NTP time in ntp

* shifted nano seconds. */

s64 ntp_error;

u32 ntp_error_shift;

u32 ntp_err_mult;

/* Flag used to avoid updating NTP twice with same second */

u32 skip_second_overflow;

#ifdef CONFIG_DEBUG_TIMEKEEPING

long last_warning;

/*

* These simple flag variables are managed

* without locks, which is racy, but they are

* ok since we don't really care about being

* super precise about how many events were

* seen, just that a problem was observed.

*/

int underflow_seen;

int overflow_seen;

#endif

};

The English comments above already explain each field. The main points are summarized below.

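(1) tk_read_base.clock is the clocksource currently in use. It should be the best one available in the system. If a better clocksource is registered, the clocksource module notifies the timekeeping module to switch to it.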

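(2) mult and shift in tk_read_base are the factors for converting cycle counts to nanoseconds, the same concept as mult and shift in the clocksource. The conversion itself is described later.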

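(3) The CLOCK_REALTIME system clock (the wall clock). Time is like a line with no known start or end, so we call it a timeline. There are many timelines, which differ in how the zero point is defined and in the scale used to measure time. Human wall time and the kernel's CLOCK_REALTIME both describe a timeline; they differ only in the origin and in the scale used to measure the distance between two points. For human time the zero point is the birth of Jesus; for CLOCK_REALTIME it is the Linux epoch, i.e. January 1, 1970... Wall time is also based on seconds, but humans group it into year, month, day, hour, minute and second, where the seconds value counts within the current minute. CLOCK_REALTIME is expressed directly as a number of seconds plus a nanosecond offset within the current second.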

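So xtime_sec measures, in seconds, the distance on the CLOCK_REALTIME timeline from the origin to the current point. xtime_sec is only the whole-second part of the current time point; for better precision a sub-second offset is also needed, which is tk_read_base->xtime_nsec.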

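For the sake of internal precision (the kernel computes time from cycles), the kernel does not store the plain nanosecond offset but a left-shifted value. From the user's point of view, the current time is therefore the point at a distance of xtime_sec seconds plus (xtime_nsec >> shift) nanoseconds from the origin.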

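(4) The CLOCK_MONOTONIC system clock. Unlike the wall clock, it is not stored as a value relative to the Linux epoch. Instead the timekeeper stores the offset between the monotonic clock and the real time clock: adding wall_to_monotonic to the real time clock value gives the monotonic clock value, as the member name suggests. offs_real carries the same information in the opposite direction (monotonic + offs_real = realtime, so it is the negation of wall_to_monotonic) and in a different format (ktime_t rather than timespec64). Each is used where its format performs better.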

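(5) The CLOCK_MONOTONIC_RAW system clock, which is read through tkr_raw and raw_sec and is not subject to NTP frequency adjustment.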

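(6) The CLOCK_TAI system clock. TAI (International Atomic Time) is atomic time. As mentioned before, UTC is based on TAI, that is, on the clock that defines the second by the oscillation frequency of cesium-133; UTC additionally accounts for leap seconds for everyday convenience. CLOCK_TAI uses the cesium-133 second exclusively and makes no leap-second concessions.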
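The mult/shift conversion from item (2) and the shifted xtime_nsec from item (3) can be illustrated with a small user-space sketch. The names toy_calc_mult, toy_cycles_to_ns and toy_xtime_nsec_to_ns are made up for this illustration and are not kernel APIs; the kernel picks mult and shift with more elaborate logic.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical helper: derive mult for a counter running at freq_hz so
 * that ns = (cycles * mult) >> shift. */
static uint32_t toy_calc_mult(uint64_t freq_hz, uint32_t shift)
{
	uint64_t mult;

	assert(freq_hz > 0 && shift < 32);
	mult = (NSEC_PER_SEC << shift) / freq_hz;
	assert(mult <= UINT32_MAX);
	return (uint32_t)mult;
}

/* Convert a cycle delta to nanoseconds (fractional ns are dropped). */
static uint64_t toy_cycles_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}

/* Convert a shifted-nanosecond accumulator back to plain nanoseconds. */
static uint64_t toy_xtime_nsec_to_ns(uint64_t xtime_nsec, uint32_t shift)
{
	return xtime_nsec >> shift;
}
```

With a 32768 Hz counter and shift 10, one cycle is about 30517.58 ns. Converting each cycle separately drops the fraction every time, whereas accumulating cycles * mult in shifted form and shifting once at readout keeps it, which is presumably why xtime_nsec is stored shifted.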

2. Computing the various time values

Reading the time:

(1) realtime = timekeeper.xtime_sec (seconds) + (((timekeeper.tkr_mono.clock->read() - cycle_last) * mult + tkr_mono.xtime_nsec) >> shift) (nanoseconds)

(2) monotonic = realtime + wall_to_monotonic

(3) monotonic_raw = timekeeper.raw_sec (seconds) + (((timekeeper.tkr_raw.clock->read() - cycle_last) * mult + tkr_raw.xtime_nsec) >> shift) (nanoseconds)

(4) uptime = realtime + wall_to_monotonic

(5) boottime = timekeeper.offs_boot + timekeeper.tkr_mono.base + (((timekeeper.tkr_mono.clock->read() - cycle_last) * mult + tkr_mono.xtime_nsec) >> shift)

(6) taitime = timekeeper.offs_tai + timekeeper.tkr_mono.base + (((timekeeper.tkr_mono.clock->read() - cycle_last) * mult + tkr_mono.xtime_nsec) >> shift)
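The readout formulas above can be modeled in user space as follows. This is only a sketch: struct toy_tkr and the toy_* functions are hypothetical stand-ins for the fields of struct tk_read_base, and the mask step (which handles counters narrower than 64 bits) is included because the struct comment mentions it.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the tk_read_base fields used on the read path. */
struct toy_tkr {
	uint64_t mask;       /* counter width mask */
	uint64_t cycle_last; /* counter value at the last update */
	uint32_t mult;
	uint32_t shift;
	uint64_t xtime_nsec; /* shifted ns accumulated at the last update */
	int64_t  base;       /* monotonic base in ns at the last update */
};

/* Nanoseconds since the base: masked cycle delta, scaled, plus the
 * shifted remainder carried over from the last update. */
static uint64_t toy_get_ns(const struct toy_tkr *tkr, uint64_t cycle_now)
{
	uint64_t delta = (cycle_now - tkr->cycle_last) & tkr->mask;

	return (delta * tkr->mult + tkr->xtime_nsec) >> tkr->shift;
}

/* Formula for the monotonic clock in ktime style: base + ns. */
static int64_t toy_monotonic_ns(const struct toy_tkr *tkr, uint64_t cycle_now)
{
	return tkr->base + (int64_t)toy_get_ns(tkr, cycle_now);
}

/* Formulas (5) and (6): monotonic plus offs_boot or offs_tai. */
static int64_t toy_with_offset_ns(const struct toy_tkr *tkr, int64_t offs,
				  uint64_t cycle_now)
{
	return toy_monotonic_ns(tkr, cycle_now) + offs;
}
```

Because the delta is masked, a 32-bit counter that wrapped between the last update and the read still yields the right number of elapsed cycles, as long as it wrapped at most once.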

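Setting the time: see do_settimeofday64 in the section on getting and setting the system clocks.

3. Global variables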

DEFINE_RAW_SPINLOCK(timekeeper_lock);

/*

* The most important data for readout fits into a single 64 byte

* cache line.

*/

static struct {

seqcount_raw_spinlock_t seq;

struct timekeeper timekeeper;

} tk_core ____cacheline_aligned = {

.seq = SEQCNT_RAW_SPINLOCK_ZERO(tk_core.seq, &timekeeper_lock),

};

static struct timekeeper shadow_timekeeper;

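The timekeeper maintains all of the system's clocks. A global variable (a shared resource) needs lock protection: timekeeper_lock and seq both protect the timekeeper, in different situations. Writers serialize on timekeeper_lock, and readers use the seq sequence counter to detect a concurrent update and retry.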

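shadow_timekeeper is used mainly while the system time is being updated. update_wall_time first applies the adjustments to shadow_timekeeper and then copies the result into the real timekeeper in one step. This shortens the time the seq write section is held during the update. Note that every other path that modifies the timekeeper (outside update_wall_time) must sync the shadow timekeeper as well.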
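The reader/writer protocol behind seq can be sketched with C11 atomics. This is a user-space approximation, not the kernel's seqcount implementation: struct toy_core, toy_write and toy_read are hypothetical names, and writers are assumed to be serialized by the caller (the role timekeeper_lock plays in the kernel).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct toy_core {
	atomic_uint seq;      /* even: stable, odd: update in progress */
	_Atomic int64_t sec;
	_Atomic int64_t nsec;
};

/* Writer: bump seq to odd, publish the new values, bump seq to even. */
static void toy_write(struct toy_core *c, int64_t sec, int64_t nsec)
{
	unsigned int s = atomic_load_explicit(&c->seq, memory_order_relaxed);

	atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&c->sec, sec, memory_order_relaxed);
	atomic_store_explicit(&c->nsec, nsec, memory_order_relaxed);
	atomic_store_explicit(&c->seq, s + 2, memory_order_release);
}

/* Reader: retry while an update is in progress or seq changed underneath. */
static void toy_read(struct toy_core *c, int64_t *sec, int64_t *nsec)
{
	unsigned int s0, s1;

	do {
		s0 = atomic_load_explicit(&c->seq, memory_order_acquire);
		*sec = atomic_load_explicit(&c->sec, memory_order_relaxed);
		*nsec = atomic_load_explicit(&c->nsec, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s1 = atomic_load_explicit(&c->seq, memory_order_relaxed);
	} while ((s0 & 1) || s0 != s1);
}
```

Readers never block the writer; they only pay for a retry when they race with an update. Preparing the new state in a shadow copy first keeps the odd-seq window, and therefore the retry window, short.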

4. tk_fast

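The timekeeping module also has tk_fast, an NMI-safe timekeeper with its own update protection mechanism. I have not studied it yet; this is a placeholder to be filled in later.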

IV. timekeeping initialization

The initialization code lives in timekeeping_init, which start_kernel calls during system initialization.

timekeeping_init:

/*

* timekeeping_init - Initializes the clocksource and common timekeeping values

*/

void __init timekeeping_init(void)

{

struct timespec64 wall_time, boot_offset, wall_to_mono;

struct timekeeper *tk = &tk_core.timekeeper;

struct clocksource *clock;

unsigned long flags;

read_persistent_wall_and_boot_offset(&wall_time, &boot_offset);

if (timespec64_valid_settod(&wall_time) &&

timespec64_to_ns(&wall_time) > 0) {

persistent_clock_exists = true;

} else if (timespec64_to_ns(&wall_time) != 0) {

pr_warn("Persistent clock returned invalid value");

wall_time = (struct timespec64){0};

}

if (timespec64_compare(&wall_time, &boot_offset) < 0)

boot_offset = (struct timespec64){0};

/*

* We want set wall_to_mono, so the following is true:

* wall time + wall_to_mono = boot time

*/

wall_to_mono = timespec64_sub(boot_offset, wall_time);

raw_spin_lock_irqsave(&timekeeper_lock, flags);

write_seqcount_begin(&tk_core.seq);

ntp_init();

clock = clocksource_default_clock();

if (clock->enable)

clock->enable(clock);

tk_setup_internals(tk, clock);

tk_set_xtime(tk, &wall_time);

tk->raw_sec = 0;

tk_set_wall_to_mono(tk, wall_to_mono);

timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);

write_seqcount_end(&tk_core.seq);

raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

}

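1. Reading the current time from the persistent clock

The timekeeping module supports several system clocks whose data is kept in RAM and is lost on power-off. After power-on, the kernel therefore reads the current time from a persistent clock (for example an RTC, which is battery backed and keeps its data while the system is powered off) and initializes the system clocks accordingly. The code is: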

read_persistent_wall_and_boot_offset(&wall_time, &boot_offset);

if (timespec64_valid_settod(&wall_time) &&

timespec64_to_ns(&wall_time) > 0) {

persistent_clock_exists = true;

} else if (timespec64_to_ns(&wall_time) != 0) {

pr_warn("Persistent clock returned invalid value");

wall_time = (struct timespec64){0};

}

if (timespec64_compare(&wall_time, &boot_offset) < 0)

boot_offset = (struct timespec64){0};

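(1) read_persistent_wall_and_boot_offset depends on the architecture; see the architecture-specific code for details. Its generic version calls read_persistent_clock64, which on ARM is implemented in arch/arm/kernel/time.c. The function's job is to obtain time information from a hardware clock in the system (for example an RTC).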

read_persistent_wall_and_boot_offset->read_persistent_clock64-> __read_persistent_clock->read_persistent

read_persistent is a function pointer that is registered at runtime.

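(2) timespec64_valid_settod checks whether a timespec64 is valid. A value read from the RTC is considered valid if its seconds are greater than or equal to 0 and less than KTIME_SEC_MAX, and its nanoseconds are less than NSEC_PER_SEC (10^9). (Recent kernels apply a slightly lower upper bound in timespec64_valid_settod to leave headroom for uptime.) KTIME_SEC_MAX is the largest number of seconds a ktime_t can represent, so the seconds read from the RTC must not exceed it. KTIME_SEC_MAX is defined as follows: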

#define KTIME_MAX ((s64)~((u64)1 << 63))

#if (BITS_PER_LONG == 64)

# define KTIME_SEC_MAX (KTIME_MAX / NSEC_PER_SEC)

#else

# define KTIME_SEC_MAX LONG_MAX

#endif

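ktime_t occupies 64 bits. In older kernels its layout differed between 64-bit and 32-bit CPUs: on 64-bit CPUs it was a signed 64-bit integer holding nanoseconds directly, while on 32-bit CPUs the 64 bits were split into two signed ints holding seconds and nanoseconds. Current kernels use a plain signed 64-bit nanosecond value on all architectures.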

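(3) Set the persistent_clock_exists flag, which indicates that the system has RTC hardware and that the timekeeping module will interact with it. For example, if the flag is true at suspend time, the RTC driver must not go to sleep, because the timekeeping module still needs the RTC value on resume to restore its time.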
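The validity rule from item (2) can be written down as a small user-space check. The sketch follows the rule exactly as stated in the text (upper bound KTIME_SEC_MAX); struct toy_timespec64, the TOY_* macros and toy_timespec_valid are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC      1000000000LL
#define TOY_KTIME_MAX     INT64_MAX /* same value as (s64)~((u64)1 << 63) */
#define TOY_KTIME_SEC_MAX (TOY_KTIME_MAX / NSEC_PER_SEC)

struct toy_timespec64 {
	int64_t tv_sec;
	long    tv_nsec;
};

/* Valid if 0 <= tv_sec < KTIME_SEC_MAX and 0 <= tv_nsec < NSEC_PER_SEC. */
static bool toy_timespec_valid(const struct toy_timespec64 *ts)
{
	if (ts->tv_sec < 0 || ts->tv_sec >= TOY_KTIME_SEC_MAX)
		return false;
	if (ts->tv_nsec < 0 || ts->tv_nsec >= NSEC_PER_SEC)
		return false;
	return true;
}
```

With these definitions TOY_KTIME_SEC_MAX works out to 9223372036 seconds, roughly 292 years, which is how far a signed 64-bit nanosecond count can reach.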

2. Setting the default clocksource for the timekeeping module

clock = clocksource_default_clock(); /* (1) */

if (clock->enable)

clock->enable(clock); /* enable the default clocksource */

tk_setup_internals(tk, clock); /* (2) */

(1) At timekeeping initialization time it is hard to pick the best clocksource, because the best one may well not be initialized yet. The strategy is therefore to use a clocksource that is guaranteed to be ready at this point, namely the jiffies-based one. clocksource_default_clock is defined in kernel/time/jiffies.c as a weak symbol, so you may override it if you wish, as long as your clocksource is ready when timekeeping is initialized.

(2) Bind the default clocksource to the timekeeper.

3. Initializing the real time clock, monotonic clock and monotonic raw clock

tk_set_xtime(tk, &wall_time);

tk->raw_sec = 0;

tk_set_wall_to_mono(tk, wall_to_mono);

timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);

write_seqcount_end(&tk_core.seq);

raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

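(1) Initialize the real time clock from the value read from the RTC. If no valid RTC value was obtained, the default real time (wall time) is the Linux epoch.

(2) The monotonic raw clock starts from 0.

(3) At boot, wall_to_monotonic is set to boot_offset - wall_time, which is simply the negated real time clock when boot_offset is 0. The timekeeper does not store the monotonic clock directly; it stores wall_to_monotonic, an offset that yields the monotonic clock when added to the real time clock. At initialization the monotonic clock is therefore effectively 0 (if no valid boot offset was obtained). Once the system is running, real time clock + wall_to_monotonic is the system uptime, and real time clock + wall_to_monotonic + sleep time is the boot-based time (CLOCK_BOOTTIME).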
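The relation wall time + wall_to_mono = boot time from item (3) can be checked with a toy timespec implementation. struct toy_ts and the toy_* helpers are hypothetical; the normalization mirrors the note in the struct timekeeper comment that tv_nsec is always kept positive even when the offset as a whole is negative.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000L

struct toy_ts {
	int64_t tv_sec;
	long    tv_nsec;
};

/* Bring nsec back into [0, NSEC_PER_SEC), borrowing or carrying seconds. */
static struct toy_ts toy_normalize(int64_t sec, int64_t nsec)
{
	struct toy_ts r;

	while (nsec >= NSEC_PER_SEC) {
		nsec -= NSEC_PER_SEC;
		sec++;
	}
	while (nsec < 0) {
		nsec += NSEC_PER_SEC;
		sec--;
	}
	r.tv_sec = sec;
	r.tv_nsec = (long)nsec;
	return r;
}

static struct toy_ts toy_ts_sub(struct toy_ts a, struct toy_ts b)
{
	return toy_normalize(a.tv_sec - b.tv_sec,
			     (int64_t)a.tv_nsec - b.tv_nsec);
}

static struct toy_ts toy_ts_add(struct toy_ts a, struct toy_ts b)
{
	return toy_normalize(a.tv_sec + b.tv_sec,
			     (int64_t)a.tv_nsec + b.tv_nsec);
}
```

For example, with a wall time of 1700000000.5 s and a boot offset of 2.25 s, the offset comes out as (-1699999999 s, 750000000 ns), and adding it back to the wall time gives a monotonic time of 2.25 s.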

V. Getting and setting the current system clock values

1. Reading the monotonic clock: ktime_get and ktime_get_ts64

ktime_t ktime_get(void)

{

struct timekeeper *tk = &tk_core.timekeeper;

unsigned int seq;

ktime_t base;

u64 nsecs;

WARN_ON(timekeeping_suspended);

do {

seq = read_seqcount_begin(&tk_core.seq);

base = tk->tkr_mono.base;

nsecs = timekeeping_get_ns(&tk->tkr_mono);

} while (read_seqcount_retry(&tk_core.seq, seq));

return ktime_add_ns(base, nsecs);

}

EXPORT_SYMBOL_GPL(ktime_get);

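In general the timekeeping module updates the system clocks when a tick arrives, and a ktime_get call will most likely fall between two ticks. Relying only on the value stored at the last tick would give poor precision, since that value is updated once per tick. For high precision the nanosecond part is obtained through timekeeping_get_ns, which takes the shifted nanosecond value recorded at the last update (xtime_nsec), adds the delta between that update and now as read from the clocksource, and returns the result in nanoseconds. ktime_get then adds it to the monotonic base, tkr_mono.base.

ktime_get_ts64 follows the same idea as ktime_get; only the format of the returned value differs.

2. Reading the real time clock: ktime_get_real_ts64 and ktime_get_coarse_real_ts64

The logic of these two functions closely mirrors that of the monotonic clock readers, so you can analyze the code yourself (older kernels exposed the coarse variant as current_kernel_time). The code of ktime_get_real_ts64 is: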

/**

* ktime_get_real_ts64 - Returns the time of day in a timespec64.

* @ts: pointer to the timespec to be set

*

* Returns the time of day in a timespec64 (WARN if suspended).

*/

void ktime_get_real_ts64(struct timespec64 *ts)

{

struct timekeeper *tk = &tk_core.timekeeper;

unsigned int seq;

u64 nsecs;

WARN_ON(timekeeping_suspended);

do {

seq = read_seqcount_begin(&tk_core.seq);

ts->tv_sec = tk->xtime_sec;

nsecs = timekeeping_get_ns(&tk->tkr_mono);

} while (read_seqcount_retry(&tk_core.seq, seq));

ts->tv_nsec = 0;

timespec64_add_ns(ts, nsecs);

}

EXPORT_SYMBOL(ktime_get_real_ts64);

The line nsecs = timekeeping_get_ns(&tk->tkr_mono); calls the clocksource's read function to obtain the delta since the last tick, so ktime_get_real_ts64 is the high-resolution version of the real time clock.

void ktime_get_coarse_real_ts64(struct timespec64 *ts)

{

struct timekeeper *tk = &tk_core.timekeeper;

unsigned int seq;

do {

seq = read_seqcount_begin(&tk_core.seq);

*ts = tk_xtime(tk);

} while (read_seqcount_retry(&tk_core.seq, seq));

}

EXPORT_SYMBOL(ktime_get_coarse_real_ts64);

This variant does not call the clocksource's read function: lower precision, but better performance.

3. Boot-related time values: getboottime64 and ktime_get_ts64

void getboottime64(struct timespec64 *ts)

{

struct timekeeper *tk = &tk_core.timekeeper;

ktime_t t = ktime_sub(tk->offs_real, tk->offs_boot);

*ts = ktime_to_timespec64(t);

}

EXPORT_SYMBOL_GPL(getboottime64);

/**

* ktime_get_ts64 - get the monotonic clock in timespec64 format

* @ts: pointer to timespec variable

*

* The function calculates the monotonic clock from the realtime

* clock and the wall_to_monotonic offset and stores the result

* in normalized timespec64 format in the variable pointed to by @ts.

*/

void ktime_get_ts64(struct timespec64 *ts)

{

struct timekeeper *tk = &tk_core.timekeeper;

struct timespec64 tomono;

unsigned int seq;

u64 nsec;

WARN_ON(timekeeping_suspended);

do {

seq = read_seqcount_begin(&tk_core.seq);

ts->tv_sec = tk->xtime_sec;

nsec = timekeeping_get_ns(&tk->tkr_mono);

tomono = tk->wall_to_monotonic;

} while (read_seqcount_retry(&tk_core.seq, seq));

ts->tv_sec += tomono.tv_sec;

ts->tv_nsec = 0;

timespec64_add_ns(ts, nsec + tomono.tv_nsec);

}

EXPORT_SYMBOL_GPL(ktime_get_ts64);

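How does the boot clock differ from the monotonic clock? The monotonic clock takes a fixed point as its epoch, which on Linux is the moment of boot, so it counts up from 0 and cannot be set by the user. That looks the same as the boot clock; the only difference is how suspend is handled. The monotonic clock does not count the time the system spends asleep, so it yields the system uptime. The boot clock includes sleep time and keeps counting until reboot. In the code above, getboottime64 returns the wall-clock time at which the system booted (offs_real - offs_boot), and ktime_get_ts64 returns the monotonic clock as a timespec64.

4. Reading the TAI clock: ktime_get_clocktai and timekeeping_clocktai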

static ktime_t posix_get_tai_ktime(clockid_t which_clock)

{

return ktime_get_clocktai();

}

/**

* ktime_get_clocktai - Returns the TAI time of day in ktime_t format

*/

static inline ktime_t ktime_get_clocktai(void)

{

return ktime_get_with_offset(TK_OFFS_TAI);

}

ktime_t ktime_get_with_offset(enum tk_offsets offs)

{

struct timekeeper *tk = &tk_core.timekeeper;

unsigned int seq;

ktime_t base, *offset = offsets[offs];

u64 nsecs;

WARN_ON(timekeeping_suspended);

do {

seq = read_seqcount_begin(&tk_core.seq);

base = ktime_add(tk->tkr_mono.base, *offset);

nsecs = timekeeping_get_ns(&tk->tkr_mono);

} while (read_seqcount_retry(&tk_core.seq, seq));

return ktime_add_ns(base, nsecs);

}

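The atomic clock is similar to the real time clock (UTC); they differ only by an offset, recorded in tai_offset. The code is simple enough to read on your own. ktime_get_clocktai returns a ktime_t value, while timekeeping_clocktai returns the value in timespec format.

5. Setting the wall time clock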

/**

* do_settimeofday64 - Sets the time of day.

* @ts: pointer to the timespec64 variable containing the new time

*

* Sets the time of day to the new time and update NTP and notify hrtimers

*/

int do_settimeofday64(const struct timespec64 *ts)

{

struct timekeeper *tk = &tk_core.timekeeper;

struct timespec64 ts_delta, xt;

unsigned long flags;

int ret = 0;

if (!timespec64_valid_settod(ts))

return -EINVAL;

raw_spin_lock_irqsave(&timekeeper_lock, flags);

write_seqcount_begin(&tk_core.seq);

timekeeping_forward_now(tk); // bring the timekeeper up to the current time

xt = tk_xtime(tk);

ts_delta = timespec64_sub(*ts, xt);

if (timespec64_compare(&tk->wall_to_monotonic, &ts_delta) > 0) {

ret = -EINVAL;

goto out;

}

// adjust wall_to_monotonic; offs_real and offs_tai are updated along with it

tk_set_wall_to_mono(tk, timespec64_sub(tk->wall_to_monotonic, ts_delta));

// adjust the wall time clock

tk_set_xtime(tk, ts);

out:

timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

write_seqcount_end(&tk_core.seq);

raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

/* signal hrtimers about time change */

clock_was_set();

if (!ret)

audit_tk_injoffset(ts_delta);

return ret;

}

EXPORT_SYMBOL(do_settimeofday64);
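The key property of do_settimeofday64 is that stepping the wall clock leaves the monotonic clock untouched, because wall_to_monotonic is moved by the opposite amount. A toy model in plain nanoseconds illustrates this; struct toy_tk and the toy_* functions are hypothetical, and the timekeeping_forward_now, NTP and notification steps are left out.

```c
#include <assert.h>
#include <stdint.h>

struct toy_tk {
	int64_t xtime_ns;        /* CLOCK_REALTIME, ns since the epoch */
	int64_t wall_to_mono_ns; /* realtime + this = monotonic */
	int64_t offs_real_ns;    /* monotonic + this = realtime */
};

/* Keep offs_real as the negation of wall_to_monotonic. */
static void toy_set_wall_to_mono(struct toy_tk *tk, int64_t wtm)
{
	tk->wall_to_mono_ns = wtm;
	tk->offs_real_ns = -wtm;
}

static int64_t toy_monotonic(const struct toy_tk *tk)
{
	return tk->xtime_ns + tk->wall_to_mono_ns;
}

/* Returns 0 on success, -1 if the step is rejected. */
static int toy_settimeofday(struct toy_tk *tk, int64_t new_ns)
{
	int64_t delta = new_ns - tk->xtime_ns;

	/* Same test as in the listing: reject a step that would make
	 * wall_to_monotonic positive, i.e. place boot before the epoch. */
	if (tk->wall_to_mono_ns > delta)
		return -1;

	toy_set_wall_to_mono(tk, tk->wall_to_mono_ns - delta);
	tk->xtime_ns = new_ns;
	return 0;
}
```

Starting from a wall time of 1000 s and a monotonic time of 100 s, setting the wall time to 5000 s leaves the monotonic time at 100 s, while an attempt to set it to 50 s is rejected.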

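VI. Interaction with the clocksource module

Apart from calling the clocksource's read function directly, the main interaction between timekeeping and clocksource is changing the clocksource. When a higher-precision clocksource becomes available, timekeeping_notify is called to tell the timekeeping module to switch. The code is: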

clocksource::__clocksource_select-> timekeeping_notify

/**

* timekeeping_notify - Install a new clock source

* @clock: pointer to the clock source

*

* This function is called from clocksource.c after a new, better clock

* source has been registered. The caller holds the clocksource_mutex.

*/

int timekeeping_notify(struct clocksource *clock)

{

struct timekeeper *tk = &tk_core.timekeeper;

if (tk->tkr_mono.clock == clock)

return 0;

stop_machine(change_clocksource, clock, NULL);

tick_clock_notify();

return tk->tkr_mono.clock == clock ? 0 : -1;

}

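As the name suggests, stop_machine stops the tasks on all CPUs (the machine cannot serve anything else in the meantime) and runs a single function, here change_clocksource. (Why use something as heavy as stop_machine instead of calling change_clocksource directly? I am still thinking about that.) change_clocksource performs the following steps: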

static int change_clocksource(void *data)

{

struct timekeeper *tk = &tk_core.timekeeper;

struct clocksource *new, *old;

unsigned long flags;

new = (struct clocksource *) data;

raw_spin_lock_irqsave(&timekeeper_lock, flags);

write_seqcount_begin(&tk_core.seq);

timekeeping_forward_now(tk);

/*

* If the cs is in module, get a module reference. Succeeds

* for built-in code (owner == NULL) as well.

*/

if (try_module_get(new->owner)) {

if (!new->enable || new->enable(new) == 0) {

old = tk->tkr_mono.clock;

tk_setup_internals(tk, new);

if (old->disable)

old->disable(old);

module_put(old->owner);

} else {

module_put(new->owner);

}

}

timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

write_seqcount_end(&tk_core.seq);

raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

return 0;

}

(1) Call timekeeping_forward_now. The clocksource is about to be replaced, so the old one does its job one last time: its read function is called, and the interval since the last read is added to the real time system clock and the monotonic raw system clock.

(2) Call tk_setup_internals to install the new clocksource, then disable the old one. tk_setup_internals is:

static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)

{

u64 interval;

u64 tmp, ntpinterval;

struct clocksource *old_clock;

++tk->cs_was_changed_seq;

old_clock = tk->tkr_mono.clock;

tk->tkr_mono.clock = clock; // install the new clocksource

tk->tkr_mono.mask = clock->mask;

tk->tkr_mono.cycle_last = tk_clock_read(&tk->tkr_mono); // refresh cycle_last

tk->tkr_raw.clock = clock; // install the new clocksource

tk->tkr_raw.mask = clock->mask;

tk->tkr_raw.cycle_last = tk->tkr_mono.cycle_last; // refresh cycle_last

/* Do the ns -> cycle conversion first, using original mult */

tmp = NTP_INTERVAL_LENGTH;

tmp <<= clock->shift;

ntpinterval = tmp; // NTP interval in shifted nanoseconds

tmp += clock->mult/2;

do_div(tmp, clock->mult); // convert the NTP interval to cycles of the new clocksource (rounded)

if (tmp == 0)

tmp = 1;

interval = (u64) tmp;

tk->cycle_interval = interval; // NTP interval in cycles

/* Go back from cycles -> shifted ns */

tk->xtime_interval = interval * clock->mult; // NTP interval in cycles converted back to shifted ns

tk->xtime_remainder = ntpinterval - tk->xtime_interval; // rounding remainder, in shifted ns

tk->raw_interval = interval * clock->mult; // same conversion for the raw clock

/* if changing clocks, convert xtime_nsec shift units */

if (old_clock) {

int shift_change = clock->shift - old_clock->shift; // difference between the new and old shift

if (shift_change < 0) {

tk->tkr_mono.xtime_nsec >>= -shift_change;

tk->tkr_raw.xtime_nsec >>= -shift_change;

} else {

tk->tkr_mono.xtime_nsec <<= shift_change;

tk->tkr_raw.xtime_nsec <<= shift_change;

}

}

tk->tkr_mono.shift = clock->shift; // switch to the new shift factor

tk->tkr_raw.shift = clock->shift;

tk->ntp_error = 0;

tk->ntp_error_shift = NTP_SCALE_SHIFT - clock->shift;

tk->ntp_tick = ntpinterval << tk->ntp_error_shift;

/*

* The timekeeper keeps its own mult values for the currently

* active clocksource. These value will be adjusted via NTP

* to counteract clock drifting.

*/

tk->tkr_mono.mult = clock->mult; // switch to the new mult factor

tk->tkr_raw.mult = clock->mult;

tk->ntp_err_mult = 0;

tk->skip_second_overflow = 0;

}

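Because the clocksource has changed and the new one generally has different operating parameters, some internal timekeeper members must be updated, for example the NTP interval and the mult and shift factors.

(3) Call timekeeping_update. Since the clocksource changed, the timekeeping module must refresh its internal data. TK_CLEAR_NTP clears the old NTP state. TK_MIRROR updates the shadow timekeeper so that it stays in sync with the real timekeeper. TK_CLOCK_WAS_SET is used in paravirtual clock scenarios and is not described in detail here.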
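The interval arithmetic inside tk_setup_internals from step (2) is easier to follow with concrete numbers. The sketch below redoes it in user space; the toy_* names are hypothetical, and TOY_NTP_INTERVAL_LENGTH assumes HZ=100, i.e. an NTP interval of 10 ms.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
#define TOY_HZ 100 /* assumption for this example */
#define TOY_NTP_INTERVAL_LENGTH (NSEC_PER_SEC / TOY_HZ)

struct toy_intervals {
	uint64_t cycle_interval;  /* cycles per NTP interval */
	uint64_t xtime_interval;  /* shifted ns per NTP interval */
	int64_t  xtime_remainder; /* shifted ns lost or gained by rounding */
};

static struct toy_intervals toy_setup_intervals(uint32_t mult, uint32_t shift)
{
	struct toy_intervals iv;
	uint64_t ntpinterval, tmp;

	assert(mult > 0 && shift < 32);
	ntpinterval = (uint64_t)TOY_NTP_INTERVAL_LENGTH << shift;

	/* ns -> cycles, rounded to the nearest cycle, at least one cycle */
	tmp = (ntpinterval + mult / 2) / mult;
	if (tmp == 0)
		tmp = 1;

	iv.cycle_interval = tmp;
	/* cycles -> shifted ns */
	iv.xtime_interval = tmp * mult;
	iv.xtime_remainder = (int64_t)ntpinterval - (int64_t)iv.xtime_interval;
	return iv;
}

/* Rescale a shifted-ns accumulator when the shift factor changes. */
static uint64_t toy_rescale(uint64_t xtime_nsec, uint32_t old_shift,
			    uint32_t new_shift)
{
	if (new_shift < old_shift)
		return xtime_nsec >> (old_shift - new_shift);
	return xtime_nsec << (new_shift - old_shift);
}
```

For a 1 MHz counter (mult 1024000, shift 10) the interval is exactly 10000 cycles with no remainder. For a 32768 Hz counter (mult 31250000, shift 10) 10 ms is 327.68 cycles, which rounds to 328, so xtime_remainder is negative.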

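VII. Interface with the tick device module

1. Periodic tick

When the system uses the periodic tick mechanism, the tick device module calls tick_periodic on every periodic tick to do the following:

(1) For the global tick, call do_timer to update jiffies and compute the system load.

(2) For the global tick, call update_wall_time to update the system time. The timekeeping module updates the system time at its own pace, normally when a periodic tick arrives. With HZ=100 there is a tick event (clockevent event) every 10 ms. Updating too often wastes CPU, and updating too rarely loses precision. The cycle_interval member of the timekeeper is the length of one periodic tick in cycles; if less than one tick has elapsed since the last update, update_wall_time returns without updating the system time.

(3) Call update_process_times and profile_tick to update process times and do kernel profiling work, respectively.

2. Dynamic tick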
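The rule in item (2), accumulate only whole cycle_interval chunks and bail out if less than one has elapsed, can be sketched as follows. This is a deliberately simplified model: struct toy_tk and toy_update_wall_time are hypothetical, and the real update_wall_time additionally handles NTP adjustment, leap seconds and a more efficient accumulation scheme.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

struct toy_tk {
	uint64_t cycle_last;     /* counter value at the last update */
	uint64_t cycle_interval; /* cycles per tick */
	uint64_t xtime_interval; /* shifted ns per tick */
	uint32_t shift;
	uint64_t xtime_sec;
	uint64_t xtime_nsec;     /* shifted ns within the current second */
};

/* Returns the number of whole intervals accumulated (0: nothing to do). */
static unsigned int toy_update_wall_time(struct toy_tk *tk, uint64_t cycle_now)
{
	uint64_t offset = cycle_now - tk->cycle_last;
	uint64_t nsec_per_sec = NSEC_PER_SEC << tk->shift;
	unsigned int n = 0;

	while (offset >= tk->cycle_interval) {
		offset -= tk->cycle_interval;
		tk->cycle_last += tk->cycle_interval;
		tk->xtime_nsec += tk->xtime_interval;
		/* carry full seconds out of the shifted-ns accumulator */
		while (tk->xtime_nsec >= nsec_per_sec) {
			tk->xtime_nsec -= nsec_per_sec;
			tk->xtime_sec++;
		}
		n++;
	}
	return n;
}
```

The cycles left over after the last whole interval are not lost: cycle_last only advances by whole intervals, so the remainder is picked up by the next update or by a readout in between.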

This will be covered later, in the tick device source analysis.

VIII. Power management of the timekeeping module

1. Initialization

static struct syscore_ops timekeeping_syscore_ops = {

.resume = timekeeping_resume,

.suspend = timekeeping_suspend,

};

static int __init timekeeping_init_ops(void)

{

register_syscore_ops(&timekeeping_syscore_ops);

return 0;

}

device_initcall(timekeeping_init_ops);

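During system initialization, timekeeping_init_ops is called to register the system core operations related to timekeeping. In older kernels this was done through sysdev class and sysdev. Suspend and resume implemented that way looked heavyweight and inefficient, so newer kernels provide a syscore_ops based interface for certain core subsystems. The registered callbacks run at the appropriate points of system suspend and resume: syscore suspend runs very late in the suspend sequence, after the ordinary bus devices, and correspondingly syscore resume runs very early in the resume sequence. This belongs to the power management subsystem and is not covered here; see the suspend_enter function. A source analysis of power management will follow in a later article.

2. The suspend callback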

int timekeeping_suspend(void)

{

struct timekeeper *tk = &tk_core.timekeeper;

unsigned long flags;

struct timespec64 delta, delta_delta;

static struct timespec64 old_delta;

struct clocksource *curr_clock;

u64 cycle_now;

// record the suspend time in the global variable timekeeping_suspend_time

read_persistent_clock64(&timekeeping_suspend_time);

/*

* On some systems the persistent_clock can not be detected at

* timekeeping_init by its return value, so if we see a valid

* value returned, update the persistent_clock_exists flag.

*/

if (timekeeping_suspend_time.tv_sec || timekeeping_suspend_time.tv_nsec)

persistent_clock_exists = true;

// suspend_timing_needed is a global variable that is used on resume

suspend_timing_needed = true;

raw_spin_lock_irqsave(&timekeeper_lock, flags);

write_seqcount_begin(&tk_core.seq);

// before suspending, bring the timekeeper's time values up to date

timekeeping_forward_now(tk);

// mark the timekeeping module as suspended; reading the time is not allowed from here on

timekeeping_suspended = 1;

/*

* Since we've called forward_now, cycle_last stores the value

* just read from the current clocksource. Save this to potentially

* use in suspend timing.

*/

// clocksource_start_suspend_timing saves the current clocksource and cycle value on the clocksource side (e.g. in suspend_start) for use on resume

curr_clock = tk->tkr_mono.clock;

cycle_now = tk->tkr_mono.cycle_last;

clocksource_start_suspend_timing(curr_clock, cycle_now);

if (persistent_clock_exists) {

/*

* To avoid drift caused by repeated suspend/resumes,

* which each can add ~1 second drift error,

* try to compensate so the difference in system time

* and persistent_clock time stays close to constant.

*/

// delta is the current difference between the real time clock and the persistent clock

delta = timespec64_sub(tk_xtime(tk), timekeeping_suspend_time);

delta_delta = timespec64_sub(delta, old_delta);

if (abs(delta_delta.tv_sec) >= 2) {

/*

* if delta_delta is too large, assume time correction

* has occurred and set old_delta to the current delta.

*/

old_delta = delta;

} else {

/* Otherwise try to adjust old_system to compensate */

timekeeping_suspend_time =

timespec64_add(timekeeping_suspend_time, delta_delta);

}

}

// sync the shadow timekeeper

timekeeping_update(tk, TK_MIRROR);

halt_fast_timekeeper(tk);

write_seqcount_end(&tk_core.seq);

raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

// the tick, clocksource and clockevent layers now enter suspend

tick_suspend();

clocksource_suspend();

clockevents_suspend();

return 0;

}

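(1) In general, once the whole machine is suspended, the hardware behind the clocksource and clockevent is put into deep sleep or even powered off (there are exceptions: some clocksources carry the CLOCK_SOURCE_SUSPEND_NONSTOP flag). Some hardware with timing capability (the persistent clock), such as the RTC, keeps running. The RTC is not very precise, but timekeeping has to continue across suspend, so the elapsed time must be recorded. That is why read_persistent_clock64 is called here to store the suspend timestamp in timekeeping_suspend_time. The persistent_clock_exists variable indicates whether the system has RTC hardware. In principle it should be set when timekeeping is initialized, but the RTC driver may not have been initialized at that point, so if a valid time value is obtained here, persistent_clock_exists is updated accordingly.

(2) The timekeeping subsystem is about to go to sleep. It updates the timekeeper's clock data one last time; after that the underlying hardware stops, and the hardware counter and hardware timer no longer run.

(3) Mark the timekeeping subsystem as suspended. Reading the time should not happen while it is in this state.

(4) The persistent clock is usually not very precise and may only count with one-second resolution. A single suspend/resume cycle can therefore introduce an error of up to about a second when reading it. To keep repeated suspend/resume cycles from accumulating drift, the delta between the real time clock and the persistent clock is taken into account. delta is the current difference between the real time clock and the persistent clock, and delta_delta is the change in delta between two suspends. If the absolute value of delta_delta is 2 seconds or more, delta is stored in old_delta; otherwise timekeeping_suspend_time is advanced by delta_delta.

3. The resume callback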
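The drift compensation in item (4) can be modeled in plain nanoseconds. This is an approximation under stated assumptions: toy_compensate_suspend_time is a hypothetical name, and the threshold is applied to the whole nanosecond difference, whereas the kernel listing compares only the tv_sec part of a timespec64.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/*
 * xtime_ns:      real time clock at suspend
 * persistent_ns: persistent clock (e.g. RTC) at suspend
 * old_delta:     delta remembered from an earlier suspend (updated in place)
 *
 * Returns the value to record as the suspend timestamp.
 */
static int64_t toy_compensate_suspend_time(int64_t xtime_ns,
					   int64_t persistent_ns,
					   int64_t *old_delta)
{
	int64_t delta = xtime_ns - persistent_ns;
	int64_t delta_delta = delta - *old_delta;
	int64_t mag = delta_delta < 0 ? -delta_delta : delta_delta;

	if (mag >= 2 * NSEC_PER_SEC) {
		/* Large jump: assume the clock was set; start a new baseline. */
		*old_delta = delta;
		return persistent_ns;
	}
	/* Small change: fold it into the recorded suspend time. */
	return persistent_ns + delta_delta;
}
```

A sub-second mismatch is thus absorbed into the suspend timestamp, while a jump of several seconds is treated as a deliberate time change and only resets the baseline.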

/**

* timekeeping_resume - Resumes the generic timekeeping subsystem.

*/

void timekeeping_resume(void)

{

struct timekeeper *tk = &tk_core.timekeeper;

struct clocksource *clock = tk->tkr_mono.clock;

unsigned long flags;

struct timespec64 ts_new, ts_delta;

u64 cycle_now, nsec;

bool inject_sleeptime = false;

// record the wake-up time from the persistent clock

read_persistent_clock64(&ts_new);

// resume the clockevent and clocksource layers

clockevents_resume();

clocksource_resume();

raw_spin_lock_irqsave(&timekeeper_lock, flags);

write_seqcount_begin(&tk_core.seq);

/*

* After system resumes, we need to calculate the suspended time and

* compensate it for the OS time. There are 3 sources that could be

* used: Nonstop clocksource during suspend, persistent clock and rtc

* device.

*

* One specific platform may have 1 or 2 or all of them, and the

* preference will be:

* suspend-nonstop clocksource -> persistent clock -> rtc

* The less preferred source will only be tried if there is no better

* usable source. The rtc part is handled separately in rtc core code.

*/

cycle_now = tk_clock_read(&tk->tkr_mono);

// via the suspend clocksource, get the delta in ns since suspend_start, i.e. the duration of this suspend

nsec = clocksource_stop_suspend_timing(clock, cycle_now);

if (nsec > 0) { // nsec > 0 means a suspend clocksource was available

ts_delta = ns_to_timespec64(nsec);

inject_sleeptime = true;

} else if (timespec64_compare(&ts_new, &timekeeping_suspend_time) > 0) {

ts_delta = timespec64_sub(ts_new, timekeeping_suspend_time);

inject_sleeptime = true;

}

if (inject_sleeptime) {

suspend_timing_needed = false;

__timekeeping_inject_sleeptime(tk, &ts_delta);

}

/* Re-base the last cycle value */

tk->tkr_mono.cycle_last = cycle_now;

tk->tkr_raw.cycle_last = cycle_now;

tk->ntp_error = 0;

timekeeping_suspended = 0;

timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);

write_seqcount_end(&tk_core.seq);

raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

touch_softlockup_watchdog();

tick_resume();

hrtimers_resume();

}

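(1) If a suspend clocksource exists, that is, a clocksource that did not stop during suspend, the more precise clocksource can be used instead of the persistent clock. clocksource_stop_suspend_timing returns the elapsed nanoseconds in that case, and a result greater than 0 selects this path. This relies on the clocksource not wrapping during the suspend, so it should have a long overflow period.

(2) If there is no suspend-nonstop clocksource, the persistent clock value can be used instead.

(3) Call __timekeeping_inject_sleeptime, shown below: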

/**

* __timekeeping_inject_sleeptime - Internal function to add sleep interval

* @delta: pointer to a timespec delta value

*

* Takes a timespec offset measuring a suspend interval and properly

* adds the sleep offset to the timekeeping variables.

*/

static void __timekeeping_inject_sleeptime(struct timekeeper *tk,

const struct timespec64 *delta)

{

if (!timespec64_valid_strict(delta)) {

printk_deferred(KERN_WARNING

"__timekeeping_inject_sleeptime: Invalid "

"sleep delta value!\n");

return;

}

tk_xtime_add(tk, delta);

tk_set_wall_to_mono(tk, timespec64_sub(tk->wall_to_monotonic, *delta));

tk_update_sleep_time(tk, timespec64_to_ktime(*delta));

tk_debug_account_sleep_time(delta);

}

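The monotonic clock does not count sleep time, so the suspend duration is subtracted from wall_to_monotonic.

tk_update_sleep_time adds the suspend duration to offs_boot, since the boot clock does count sleep time.

(4) Resume order of the other modules: clockevents_resume() -> clocksource_resume() -> tick_resume() -> hrtimers_resume(). Resume runs in the reverse order of suspend.

One open question: the suspend function never suspends hrtimers, yet the resume function calls hrtimers_resume(). Why?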
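Steps (1) to (3) of the resume path can be condensed into a toy model in plain nanoseconds. struct toy_tk and the toy_* functions are hypothetical; the point is only to show which source supplies the sleep duration and how injecting it affects the three clocks.

```c
#include <assert.h>
#include <stdint.h>

struct toy_tk {
	int64_t xtime_ns;        /* CLOCK_REALTIME */
	int64_t wall_to_mono_ns; /* realtime + this = monotonic */
	int64_t offs_boot_ns;    /* monotonic + this = boottime */
};

static int64_t toy_monotonic(const struct toy_tk *tk)
{
	return tk->xtime_ns + tk->wall_to_mono_ns;
}

static int64_t toy_boottime(const struct toy_tk *tk)
{
	return toy_monotonic(tk) + tk->offs_boot_ns;
}

/* Prefer the nonstop clocksource result, fall back to the persistent clock.
 * Returns 1 and fills *delta_ns if a sleep duration should be injected. */
static int toy_pick_sleep_delta(int64_t cs_nsec, int64_t rtc_now_ns,
				int64_t rtc_suspend_ns, int64_t *delta_ns)
{
	if (cs_nsec > 0) {
		*delta_ns = cs_nsec;
		return 1;
	}
	if (rtc_now_ns > rtc_suspend_ns) {
		*delta_ns = rtc_now_ns - rtc_suspend_ns;
		return 1;
	}
	return 0;
}

/* Realtime and boottime move forward by the sleep time, monotonic stays. */
static void toy_inject_sleeptime(struct toy_tk *tk, int64_t delta_ns)
{
	assert(delta_ns >= 0);
	tk->xtime_ns += delta_ns;
	tk->wall_to_mono_ns -= delta_ns;
	tk->offs_boot_ns += delta_ns;
}
```

After a 60 s suspend, the real time clock and the boot clock are 60 s further along, while the monotonic clock reads the same value as just before the suspend.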
