Implementation of try_to_free_pages, the Linux page reclaim function

Trying to reclaim pages: try_to_free_pages

int try_to_free_pages(struct zone **zones,
                unsigned int gfp_mask, unsigned int order)
{
        int priority;
        int ret = 0;
        int total_scanned = 0, total_reclaimed = 0;
        struct reclaim_state *reclaim_state = current->reclaim_state;
        struct scan_control sc;
        unsigned long lru_pages = 0;
        int i;

        sc.gfp_mask = gfp_mask;
        sc.may_writepage = 0;

        inc_page_state(allocstall);

        for (i = 0; zones[i] != NULL; i++) {
                struct zone *zone = zones[i];

                zone->temp_priority = DEF_PRIORITY;
                lru_pages += zone->nr_active + zone->nr_inactive;
        }

        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
                sc.nr_mapped = read_page_state(nr_mapped);
                sc.nr_scanned = 0;
                sc.nr_reclaimed = 0;
                sc.priority = priority;
                shrink_caches(zones, &sc);
                shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
                if (reclaim_state) {
                        sc.nr_reclaimed += reclaim_state->reclaimed_slab;
                        reclaim_state->reclaimed_slab = 0;
                }
                if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) {
                        ret = 1;
                        goto out;
                }
                total_scanned += sc.nr_scanned;
                total_reclaimed += sc.nr_reclaimed;

                /*
                 * Try to write back as many pages as we just scanned.  This
                 * tends to cause slow streaming writers to write data to the
                 * disk smoothly, at the dirtying rate, which is nice.   But
                 * that's undesirable in laptop mode, where we *want* lumpy
                 * writeout.  So in laptop mode, write out the whole world.
                 */
                if (total_scanned > SWAP_CLUSTER_MAX + SWAP_CLUSTER_MAX/2) {
                        wakeup_bdflush(laptop_mode ? 0 : total_scanned);
                        sc.may_writepage = 1;
                }

                /* Take a nap, wait for some writeback to complete */
                if (sc.nr_scanned && priority < DEF_PRIORITY - 2)
                        blk_congestion_wait(WRITE, HZ/10);
        }
        if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY))
                out_of_memory(gfp_mask);
out:
        for (i = 0; zones[i] != 0; i++)
                zones[i]->prev_priority = zones[i]->temp_priority;
        return ret;
}

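1. Function overview

try_to_free_pages attempts to reclaim pages under memory pressure and is the core function of direct reclaim. It scans the zones at progressively more aggressive priorities, trying to free enough memory to satisfy the allocation request.

2. Line-by-line walkthrough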

int try_to_free_pages(struct zone **zones,
                unsigned int gfp_mask, unsigned int order)
{
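  • Function definition: returns int; 1 means enough pages were reclaimed, 0 means reclaim failed
  • zones: NULL-terminated array of zones to reclaim from
  • gfp_mask: allocation flags, which constrain what reclaim is allowed to do
  • order: the order of the requested allocation (not referenced in the function body shown here)
        int priority;
  • Priority variable: controls how aggressive reclaim is; the loop runs from gentle to aggressive (numerically from high to low)
        int ret = 0;
  • Return value, initialised to 0 (reclaim failed)
        int total_scanned = 0, total_reclaimed = 0;
  • Statistics
    • total_scanned: cumulative number of pages scanned
    • total_reclaimed: cumulative number of pages reclaimed
        struct reclaim_state *reclaim_state = current->reclaim_state;
  • Fetches the current task's reclaim state
    • current->reclaim_state: pointer to the current task's reclaim state; it may be NULL
    • Used to track how many pages the slab shrinkers freed
        struct scan_control sc;
  • Scan control structure: carries the control parameters and counters for page reclaim
        unsigned long lru_pages = 0;
  • LRU page total: the sum of active and inactive pages across all zones
        int i;
  • Loop counter
        sc.gfp_mask = gfp_mask;
  • Stores the GFP mask in the scan control so the reclaim code can see the allocation flags
        sc.may_writepage = 0;
  • Initialises the writepage permission: writing pages to disk is not allowed at first
        inc_page_state(allocstall);
  • Increments the allocstall counter, recording that an allocation stalled and entered direct reclaim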
        for (i = 0; zones[i] != NULL; i++) {
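  • Iterates over all zones to initialise their per-zone reclaim state
                struct zone *zone = zones[i];
  • Gets the current zone pointer
                zone->temp_priority = DEF_PRIORITY;
  • Sets the temporary priority to DEF_PRIORITY, the default priority
                lru_pages += zone->nr_active + zone->nr_inactive;
  • Accumulates the LRU page total across all zones; it is later passed to shrink_slab
        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
  • Main reclaim loop: starts at the default priority, 12, and works down to 0
  • Meaning of the priority: higher values are gentler, lower values are more aggressive
                sc.nr_mapped = read_page_state(nr_mapped);
  • Reads the mapped page count: the number of pages currently mapped by processes system-wide
                sc.nr_scanned = 0;
  • Resets the per-pass scan counter; each priority pass counts from zero
                sc.nr_reclaimed = 0;
  • Resets the per-pass reclaim counter; each priority pass counts from zero
                sc.priority = priority;
  • Sets the current priority, which controls how aggressive this pass is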
                shrink_caches(zones, &sc);
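  • Core reclaim call: scans each zone and reclaims page cache and anonymous pages
                shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
  • Shrinks the slab caches, reclaiming cached kernel objects
                if (reclaim_state) {
  • Checks whether the task has a reclaim state; the slab shrinkers may have freed pages
                        sc.nr_reclaimed += reclaim_state->reclaimed_slab;
  • Adds the pages freed from slab to this pass's reclaim count
                        reclaim_state->reclaimed_slab = 0;
  • Resets the slab reclaim counter, ready for the next pass
                if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) {
  • Checks whether this pass reclaimed enough pages; SWAP_CLUSTER_MAX is normally 32 pages
                        ret = 1;
  • Sets the success flag: enough pages were reclaimed
                        goto out;
  • Jumps to the cleanup code, leaving the reclaim loop immediately
                total_scanned += sc.nr_scanned;
  • Accumulates the total number of pages scanned
                total_reclaimed += sc.nr_reclaimed;
  • Accumulates the total number of pages reclaimed; in the code shown here this total is never read again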
                if (total_scanned > SWAP_CLUSTER_MAX + SWAP_CLUSTER_MAX/2) {
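  • Checks whether to wake bdflush: true once more than 48 pages (32 + 16) have been scanned in total
                        wakeup_bdflush(laptop_mode ? 0 : total_scanned);
  • Wakes the disk writeback thread
    • Laptop mode: the argument is 0, which asks for all dirty pages to be written out
    • Normal mode: the argument is the number of pages scanned, so writeback keeps pace with scanning
                        sc.may_writepage = 1;
  • Allows writepage: subsequent passes may write pages out to disk
                if (sc.nr_scanned && priority < DEF_PRIORITY - 2)
  • Checks whether to wait
    • sc.nr_scanned: this pass scanned at least one page
    • priority < DEF_PRIORITY - 2: the priority is below 10, i.e. reclaim has become fairly aggressive
                        blk_congestion_wait(WRITE, HZ/10);
  • Waits for write I/O congestion to ease, with a timeout of HZ/10 jiffies (about 100 ms)
        if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY))
  • Checks whether to invoke the OOM killer
    • __GFP_FS: the caller allows filesystem operations
    • !(gfp_mask & __GFP_NORETRY): the caller has not requested a no-retry allocation, so the OOM killer is permitted
                out_of_memory(gfp_mask);
  • Calls the OOM killer, which selects a process and kills it to free memory
out:
  • Label: start of the cleanup code
        for (i = 0; zones[i] != 0; i++)
  • Iterates over all zones for cleanup
                zones[i]->prev_priority = zones[i]->temp_priority;
  • Saves the priority history: records the priority reached during this reclaim in prev_priority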
        return ret;
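  • Returns the result: 1 for success, 0 for failure

3. Reclaim strategy in detail

3.1. Success condition

  • A single pass reclaims at least 32 pages (SWAP_CLUSTER_MAX); this is the only path that returns 1
  • Invoking the OOM killer does not count as success: ret is still 0 on that path

3.2. Exit conditions

  1. Success: a pass reclaims enough pages and the function jumps to out
  2. Loop exhausted: every priority from 12 down to 0 has been tried
  3. OOM: once the loop is exhausted, out_of_memory is called if the GFP flags allow it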
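
The return-value rules above can be checked with a small user-space model. This is only a sketch, not kernel code: model_try_to_free_pages, struct pass_result and struct reclaim_outcome are hypothetical names, and the per-pass scanned and reclaimed counts are supplied by the caller instead of coming from shrink_caches and shrink_slab. Throttling and the out_of_memory call are left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEF_PRIORITY     12
#define SWAP_CLUSTER_MAX 32

/* Hypothetical per-pass input: what one priority pass scanned and freed. */
struct pass_result {
        int scanned;
        int reclaimed;
};

/* Hypothetical summary of how the modelled loop ended. */
struct reclaim_outcome {
        int ret;            /* 1 only if a single pass freed >= SWAP_CLUSTER_MAX */
        int last_priority;  /* priority of the last pass that ran */
        bool may_writepage; /* set once total_scanned exceeds 48 */
};

static struct reclaim_outcome
model_try_to_free_pages(const struct pass_result *passes, size_t n_passes)
{
        struct reclaim_outcome out = { 0, DEF_PRIORITY, false };
        int total_scanned = 0;
        int priority;

        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
                size_t idx = (size_t)(DEF_PRIORITY - priority);
                struct pass_result p = { 0, 0 };

                /* Passes beyond the supplied data scan and free nothing. */
                if (idx < n_passes)
                        p = passes[idx];
                out.last_priority = priority;

                /* The per-pass count is compared, not the running total. */
                if (p.reclaimed >= SWAP_CLUSTER_MAX) {
                        out.ret = 1;
                        return out;
                }
                total_scanned += p.scanned;
                if (total_scanned > SWAP_CLUSTER_MAX + SWAP_CLUSTER_MAX / 2)
                        out.may_writepage = true;
        }
        /* Loop exhausted: ret stays 0, whether or not the OOM killer runs. */
        return out;
}
```

Feeding it the passes {40, 10}, {40, 20}, {40, 32} ends at priority 10 with ret set to 1, while thirteen passes that each free 31 pages still end with ret equal to 0, because the test is applied to each pass separately.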

Coordinating reclaim across multiple zones: shrink_caches

static void
shrink_caches(struct zone **zones, struct scan_control *sc)
{
        int i;

        for (i = 0; zones[i] != NULL; i++) {
                struct zone *zone = zones[i];

                if (zone->present_pages == 0)
                        continue;

                zone->temp_priority = sc->priority;
                if (zone->prev_priority > sc->priority)
                        zone->prev_priority = sc->priority;

                if (zone->all_unreclaimable && sc->priority != DEF_PRIORITY)
                        continue;       /* Let kswapd poll it */

                shrink_zone(zone, sc);
        }
}

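1. Function overview

shrink_caches coordinates page reclaim across several zones. Based on each zone's state and the current reclaim priority, it decides whether to run page reclaim on that zone. It acts as the dispatcher of memory reclaim, handing the reclaim request out to the individual zones.

2. Line-by-line walkthrough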

static void
shrink_caches(struct zone **zones, struct scan_control *sc)
{
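  • static void: file-local function with no return value
  • zones: NULL-terminated array of zone pointers covering every zone to reclaim from
  • sc: pointer to the scan control structure holding reclaim parameters and counters
        int i;
  • Loop counter used to walk the zones array
        for (i = 0; zones[i] != NULL; i++) {
  • Iterates over all zones
    • i = 0: starts at the first zone
    • zones[i] != NULL: loop condition; stops at the NULL pointer that terminates the array
    • i++: each iteration handles one zone
                struct zone *zone = zones[i];
  • Gets the current zone pointer
    • zones[i]: the i-th element of the zones array
    • Kept in the local variable zone for convenience
                if (zone->present_pages == 0)
  • Checks whether the zone has any physical memory
    • zone->present_pages: number of physical pages actually present in the zone
    • A value of 0 means the zone has no physical memory at all
                        continue;
  • Skips empty zones
    • continue ends this iteration when the zone has no physical memory
    • Processing moves straight on to the next zone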
                zone->temp_priority = sc->priority;
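  • Sets the zone's temporary priority
    • sc->priority: the priority of the current pass (from 12 down to 0)
    • zone->temp_priority: the zone's temporary priority field
    • Purpose: records the priority used by this reclaim attempt
                if (zone->prev_priority > sc->priority)
  • Checks whether the priority history needs updating
    • zone->prev_priority: the priority recorded from earlier reclaim of this zone
    • sc->priority: the current priority
    • Condition: true when the recorded priority is numerically greater than the current one, i.e. the history is gentler
                        zone->prev_priority = sc->priority;
  • Updates the priority history
    • Sets the zone's prev_priority to the current priority
    • Design rationale: keeps track of the most aggressive priority used recently; refill_inactive_zone later derives distress from it
                if (zone->all_unreclaimable && sc->priority != DEF_PRIORITY)
  • Checks for an unreclaimable zone
    • zone->all_unreclaimable: whether the zone is flagged as entirely unreclaimable
    • sc->priority != DEF_PRIORITY: the current priority is not the default (12)
    • Condition: true when the zone is flagged unreclaimable and this is not the first, gentlest pass
                        continue;       /* Let kswapd poll it */
  • Skips the unreclaimable zone
    • continue skips reclaim for this zone
    • Design rationale: such a zone is scanned only on the DEF_PRIORITY pass, so the more aggressive passes do not burn CPU on it; as the code comment says, kswapd is left to poll it
                shrink_zone(zone, sc);
  • Reclaims from the zone
    • shrink_zone(zone, sc): the core reclaim function, which does the actual page reclaim for this zone
    • Arguments: the current zone pointer and the scan control structure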
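
The per-zone decision walked through above can be condensed into one predicate. The following is a sketch under stated assumptions: struct toy_zone and zone_should_be_scanned are hypothetical names that keep only the four fields the decision reads or writes, and returning true stands in for the call to shrink_zone.

```c
#include <stdbool.h>

#define DEF_PRIORITY 12

/* Hypothetical cut-down zone holding only the fields shrink_caches touches. */
struct toy_zone {
        unsigned long present_pages;
        int temp_priority;
        int prev_priority;
        bool all_unreclaimable;
};

/* Returns true where shrink_caches would go on to call shrink_zone. */
static bool zone_should_be_scanned(struct toy_zone *zone, int priority)
{
        /* An empty zone is skipped before any bookkeeping happens. */
        if (zone->present_pages == 0)
                return false;

        zone->temp_priority = priority;
        /* prev_priority only ever moves towards more aggressive values here. */
        if (zone->prev_priority > priority)
                zone->prev_priority = priority;

        /* An unreclaimable zone is scanned on the DEF_PRIORITY pass only. */
        if (zone->all_unreclaimable && priority != DEF_PRIORITY)
                return false;

        return true;
}
```

One detail the sketch makes visible: a zone flagged all_unreclaimable still has its temp_priority and prev_priority updated on every pass, even though it is scanned only at DEF_PRIORITY.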

Reclaiming pages from a given zone: shrink_zone

static void
shrink_zone(struct zone *zone, struct scan_control *sc)
{
        unsigned long nr_active;
        unsigned long nr_inactive;

        /*
         * Add one to `nr_to_scan' just to make sure that the kernel will
         * slowly sift through the active list.
         */
        zone->nr_scan_active += (zone->nr_active >> sc->priority) + 1;
        nr_active = zone->nr_scan_active;
        if (nr_active >= SWAP_CLUSTER_MAX)
                zone->nr_scan_active = 0;
        else
                nr_active = 0;

        zone->nr_scan_inactive += (zone->nr_inactive >> sc->priority) + 1;
        nr_inactive = zone->nr_scan_inactive;
        if (nr_inactive >= SWAP_CLUSTER_MAX)
                zone->nr_scan_inactive = 0;
        else
                nr_inactive = 0;

        sc->nr_to_reclaim = SWAP_CLUSTER_MAX;

        while (nr_active || nr_inactive) {
                if (nr_active) {
                        sc->nr_to_scan = min(nr_active,
                                        (unsigned long)SWAP_CLUSTER_MAX);
                        nr_active -= sc->nr_to_scan;
                        refill_inactive_zone(zone, sc);
                }

                if (nr_inactive) {
                        sc->nr_to_scan = min(nr_inactive,
                                        (unsigned long)SWAP_CLUSTER_MAX);
                        nr_inactive -= sc->nr_to_scan;
                        shrink_cache(zone, sc);
                        if (sc->nr_to_reclaim <= 0)
                                break;
                }
        }
}

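1. Function overview

shrink_zone reclaims pages within a single zone. It balances work between the active and inactive lists: pages are moved from the active list to the inactive list and are eventually reclaimed from there.

2. Line-by-line walkthrough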

static void
shrink_zone(struct zone *zone, struct scan_control *sc)
{
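  • static void: file-local function with no return value
  • zone: pointer to the target zone
  • sc: pointer to the scan control structure holding reclaim parameters and counters
        unsigned long nr_active;
  • Active scan count: number of active pages to scan in this call
        unsigned long nr_inactive;
  • Inactive scan count: number of inactive pages to scan in this call
        zone->nr_scan_active += (zone->nr_active >> sc->priority) + 1;
  • Accumulates the active scan count
    • zone->nr_active: total number of active pages in the zone
    • zone->nr_active >> sc->priority: scales the scan amount by priority
      • Higher priority (larger value): larger shift, smaller fraction scanned (gentle)
      • Lower priority (smaller value): smaller shift, larger fraction scanned (aggressive)
    • + 1: keeps the accumulator growing even when the shifted value is 0, so the active list is still sifted slowly
    • The result is added to zone->nr_scan_active, the zone's active scan accumulator
        nr_active = zone->nr_scan_active;
  • Copies the accumulated value into a local variable
        if (nr_active >= SWAP_CLUSTER_MAX)
  • Checks whether the batch threshold has been reached; SWAP_CLUSTER_MAX is normally 32 pages
                zone->nr_scan_active = 0;
  • Resets the active scan accumulator: once the threshold is reached it is zeroed for the next round
        else
                nr_active = 0;
  • Below the threshold, no active pages are scanned this time: if fewer than 32 have accumulated, nr_active is set to 0 and the active list is skipped, while the accumulator carries over to the next call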
        zone->nr_scan_inactive += (zone->nr_inactive >> sc->priority) + 1;
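  • Accumulates the inactive scan count
    • The same logic applied to inactive pages
    • zone->nr_inactive >> sc->priority: scales the inactive scan amount by priority
    • + 1: keeps the accumulator growing even when the shifted value is 0
        nr_inactive = zone->nr_scan_inactive;
  • Copies the accumulated inactive scan count into a local variable
        if (nr_inactive >= SWAP_CLUSTER_MAX)
                zone->nr_scan_inactive = 0;
        else
                nr_inactive = 0;
  • The same threshold check applied to inactive pages
        sc->nr_to_reclaim = SWAP_CLUSTER_MAX;
  • Sets the reclaim target: this call aims to reclaim 32 pages
        while (nr_active || nr_inactive) {
  • Main reclaim loop: runs while there are still active or inactive pages left to scan
  • nr_active || nr_inactive: continues while either is non-zero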
                if (nr_active) {
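  • Checks whether active pages need scanning
                        sc->nr_to_scan = min(nr_active,
                                        (unsigned long)SWAP_CLUSTER_MAX);
  • Computes the size of this batch
    • min(nr_active, (unsigned long)SWAP_CLUSTER_MAX):
    • The smaller of the remaining active scan count and 32
    • Caps a single batch at 32 pages, which avoids holding the lock for long
                        nr_active -= sc->nr_to_scan;
  • Updates the remaining active scan count by subtracting this batch
                        refill_inactive_zone(zone, sc);
  • Core function: refills the inactive list
    • Scans the active list and moves eligible pages to the inactive list
    • This is the first step of reclaim: demoting "hot" pages to "cold" ones
                if (nr_inactive) {
  • Checks whether inactive pages need scanning
                        sc->nr_to_scan = min(nr_inactive,
                                        (unsigned long)SWAP_CLUSTER_MAX);
  • Computes the inactive batch size with the same cap
                        nr_inactive -= sc->nr_to_scan;
  • Updates the remaining inactive scan count
                        shrink_cache(zone, sc);
  • Core function: shrinks the cache
    • Scans the inactive list and actually reclaims pages
    • May write pages back to disk or free them directly
                        if (sc->nr_to_reclaim <= 0)
  • Checks whether the reclaim target has been met; sc->nr_to_reclaim is decremented as pages are freed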
                                break;
  • Leaves the loop early: once enough pages have been reclaimed, exit immediately
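
The scan accumulator and the 32-page batching described in this walkthrough are easier to follow with numbers. The sketch below is a user-space model, not the kernel routine: scan_quota and count_batches are hypothetical helpers, the accumulator is passed in by pointer instead of living in struct zone, and the early exit on nr_to_reclaim is ignored.

```c
#define SWAP_CLUSTER_MAX 32UL

/*
 * Adds this call's share to the accumulator and returns how many pages
 * to scan now: 0 while fewer than SWAP_CLUSTER_MAX have accumulated,
 * otherwise the whole accumulated amount (the accumulator is reset).
 */
static unsigned long scan_quota(unsigned long *nr_scan_acc,
                                unsigned long list_len, int priority)
{
        unsigned long quota;

        *nr_scan_acc += (list_len >> priority) + 1;
        if (*nr_scan_acc < SWAP_CLUSTER_MAX)
                return 0;

        quota = *nr_scan_acc;
        *nr_scan_acc = 0;
        return quota;
}

/* Number of batches of at most SWAP_CLUSTER_MAX pages needed for quota. */
static unsigned long count_batches(unsigned long quota)
{
        unsigned long batches = 0;

        while (quota) {
                unsigned long n = quota < SWAP_CLUSTER_MAX ?
                                  quota : SWAP_CLUSTER_MAX;

                quota -= n;
                batches++;
        }
        return batches;
}
```

With a list of 40960 pages at priority 12, each call contributes (40960 >> 12) + 1 = 11, so the first two calls scan nothing and the third releases a quota of 33, which is worked off as two batches (32 and 1). At priority 0 the whole list length is added in a single call.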

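3. Two-phase reclaim strategy

3.1. Phase 1: refill_inactive_zone

  • Goal: move pages that have "cooled down" from the active list to the inactive list
  • Policy: based on whether a page is mapped, whether it was referenced recently, and whether mapped pages are currently being reclaimed
  • Effect: prepares candidate pages for reclaim

3.2. Phase 2: shrink_cache

  • Goal: actually reclaim pages from the inactive list
  • Actions: write back dirty pages, free clean pages, swap out anonymous pages
  • Effect: physical memory is really freed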

Moving pages to the inactive list: refill_inactive_zone

static void
refill_inactive_zone(struct zone *zone, struct scan_control *sc)
{
        int pgmoved;
        int pgdeactivate = 0;
        int pgscanned = 0;
        int nr_pages = sc->nr_to_scan;
        LIST_HEAD(l_hold);      /* The pages which were snipped off */
        LIST_HEAD(l_inactive);  /* Pages to go onto the inactive_list */
        LIST_HEAD(l_active);    /* Pages to go onto the active_list */
        struct page *page;
        struct pagevec pvec;
        int reclaim_mapped = 0;
        long mapped_ratio;
        long distress;
        long swap_tendency;

        lru_add_drain();
        pgmoved = 0;
        spin_lock_irq(&zone->lru_lock);
        while (pgscanned < nr_pages && !list_empty(&zone->active_list)) {
                page = lru_to_page(&zone->active_list);
                prefetchw_prev_lru_page(page, &zone->active_list, flags);
                if (!TestClearPageLRU(page))
                        BUG();
                list_del(&page->lru);
                if (get_page_testone(page)) {
                        /*
                         * It was already free!  release_pages() or put_page()
                         * are about to remove it from the LRU and free it. So
                         * put the refcount back and put the page back on the
                         * LRU
                         */
                        __put_page(page);
                        SetPageLRU(page);
                        list_add(&page->lru, &zone->active_list);
                } else {
                        list_add(&page->lru, &l_hold);
                        pgmoved++;
                }
                pgscanned++;
        }
        zone->pages_scanned += pgscanned;
        zone->nr_active -= pgmoved;
        spin_unlock_irq(&zone->lru_lock);

        /*
         * `distress' is a measure of how much trouble we're having reclaiming
         * pages.  0 -> no problems.  100 -> great trouble.
         */
        distress = 100 >> zone->prev_priority;

        /*
         * The point of this algorithm is to decide when to start reclaiming
         * mapped memory instead of just pagecache.  Work out how much memory
         * is mapped.
         */
        mapped_ratio = (sc->nr_mapped * 100) / total_memory;

        /*
         * Now decide how much we really want to unmap some pages.  The mapped
         * ratio is downgraded - just because there's a lot of mapped memory
         * doesn't necessarily mean that page reclaim isn't succeeding.
         *
         * The distress ratio is important - we don't want to start going oom.
         *
         * A 100% value of vm_swappiness overrides this algorithm altogether.
         */
        swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;

        /*
         * Now use this metric to decide whether to start moving mapped memory
         * onto the inactive list.
         */
        if (swap_tendency >= 100)
                reclaim_mapped = 1;

        while (!list_empty(&l_hold)) {
                page = lru_to_page(&l_hold);
                list_del(&page->lru);
                if (page_mapped(page)) {
                        if (!reclaim_mapped ||
                            (total_swap_pages == 0 && PageAnon(page)) ||
                            page_referenced(page, 0, sc->priority <= 0)) {
                                list_add(&page->lru, &l_active);
                                continue;
                        }
                }
                list_add(&page->lru, &l_inactive);
        }

        pagevec_init(&pvec, 1);
        pgmoved = 0;
        spin_lock_irq(&zone->lru_lock);
        while (!list_empty(&l_inactive)) {
                page = lru_to_page(&l_inactive);
                prefetchw_prev_lru_page(page, &l_inactive, flags);
                if (TestSetPageLRU(page))
                        BUG();
                if (!TestClearPageActive(page))
                        BUG();
                list_move(&page->lru, &zone->inactive_list);
                pgmoved++;
                if (!pagevec_add(&pvec, page)) {
                        zone->nr_inactive += pgmoved;
                        spin_unlock_irq(&zone->lru_lock);
                        pgdeactivate += pgmoved;
                        pgmoved = 0;
                        if (buffer_heads_over_limit)
                                pagevec_strip(&pvec);
                        __pagevec_release(&pvec);
                        spin_lock_irq(&zone->lru_lock);
                }
        }
        zone->nr_inactive += pgmoved;
        pgdeactivate += pgmoved;
        if (buffer_heads_over_limit) {
                spin_unlock_irq(&zone->lru_lock);
                pagevec_strip(&pvec);
                spin_lock_irq(&zone->lru_lock);
        }

        pgmoved = 0;
        while (!list_empty(&l_active)) {
                page = lru_to_page(&l_active);
                prefetchw_prev_lru_page(page, &l_active, flags);
                if (TestSetPageLRU(page))
                        BUG();
                BUG_ON(!PageActive(page));
                list_move(&page->lru, &zone->active_list);
                pgmoved++;
                if (!pagevec_add(&pvec, page)) {
                        zone->nr_active += pgmoved;
                        pgmoved = 0;
                        spin_unlock_irq(&zone->lru_lock);
                        __pagevec_release(&pvec);
                        spin_lock_irq(&zone->lru_lock);
                }
        }
        zone->nr_active += pgmoved;
        spin_unlock_irq(&zone->lru_lock);
        pagevec_release(&pvec);

        mod_page_state_zone(zone, pgrefill, pgscanned);
        mod_page_state(pgdeactivate, pgdeactivate);
}

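1. Function overview

refill_inactive_zone moves pages from the active list to the inactive list, a key step in page reclaim. It uses a heuristic to decide which active pages should be demoted to the inactive state, in preparation for the actual reclaim that follows.

2. Part 1: variable declarations and initialisation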

static void
refill_inactive_zone(struct zone *zone, struct scan_control *sc)
{
        int pgmoved;
        int pgdeactivate = 0;
        int pgscanned = 0;
        int nr_pages = sc->nr_to_scan;
        LIST_HEAD(l_hold);      /* The pages which were snipped off */
        LIST_HEAD(l_inactive);  /* Pages to go onto the inactive_list */
        LIST_HEAD(l_active);    /* Pages to go onto the active_list */
        struct page *page;
        struct pagevec pvec;
        int reclaim_mapped = 0;
        long mapped_ratio;
        long distress;
        long swap_tendency;

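Variables

  • pgmoved: count of pages moved
  • pgdeactivate: count of deactivated pages (those that end up on the inactive list)
  • pgscanned: count of pages scanned
  • nr_pages: total number of pages to scan
  • l_hold: temporary list for pages taken off the active list
  • l_inactive: pages that will go onto the inactive list
  • l_active: pages that will go back onto the active list
  • reclaim_mapped: flag saying whether mapped pages should be reclaimed
  • mapped_ratio: percentage of memory that is mapped
  • distress: degree of memory pressure
  • swap_tendency: swap tendency score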

3. Part 2: LRU preparation and page isolation

        lru_add_drain();
        pgmoved = 0;
        spin_lock_irq(&zone->lru_lock);
        while (pgscanned < nr_pages && !list_empty(&zone->active_list)) {
                page = lru_to_page(&zone->active_list);
                prefetchw_prev_lru_page(page, &zone->active_list, flags);
                if (!TestClearPageLRU(page))
                        BUG();
                list_del(&page->lru);
                if (get_page_testone(page)) {
                        /*
                         * It was already free!  release_pages() or put_page()
                         * are about to remove it from the LRU and free it. So
                         * put the refcount back and put the page back on the
                         * LRU
                         */
                        __put_page(page);
                        SetPageLRU(page);
                        list_add(&page->lru, &zone->active_list);
                } else {
                        list_add(&page->lru, &l_hold);
                        pgmoved++;
                }
                pgscanned++;
        }
        zone->pages_scanned += pgscanned;
        zone->nr_active -= pgmoved;
        spin_unlock_irq(&zone->lru_lock);

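What this code does: it pulls a batch of pages off the active list onto a temporary list.

Key operations

  1. lru_add_drain(): flushes the per-CPU LRU pagevecs so that every page waiting to be added is on its list
  2. The loop takes pages from the tail of the active list (lru_to_page returns the last entry)
  3. get_page_testone(page): takes a reference and reports whether the page was already being freed
    • If it is being freed, the reference is dropped and the page goes back on the active list
    • Otherwise the page is added to the temporary list l_hold
  4. Updates the zone statistics (pages_scanned and nr_active)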
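
The check in step 3 can be pictured with a toy reference counter. This is a minimal sketch, assuming the 2.6-era convention that the page count is stored biased by -1, so a stored value of -1 means "no users"; toy_page, toy_get_page_testone, toy_put_page and try_isolate are hypothetical names, and plain int arithmetic stands in for the kernel's atomic operations.

```c
#include <stdbool.h>

/* Hypothetical page with a biased count: -1 means the page has no users. */
struct toy_page {
        int count;
};

/* Takes a reference; returns true if the page had no users before. */
static bool toy_get_page_testone(struct toy_page *p)
{
        return ++p->count == 0;
}

/* Drops a reference without any further checks. */
static void toy_put_page(struct toy_page *p)
{
        p->count--;
}

/*
 * Returns true if the page may be moved to the private hold list,
 * false if it is already on its way to being freed and must go back
 * on the LRU untouched.
 */
static bool try_isolate(struct toy_page *p)
{
        if (toy_get_page_testone(p)) {
                toy_put_page(p);        /* undo the reference just taken */
                return false;
        }
        return true;
}
```

A page that still has a user keeps the extra reference while it sits on l_hold; a page whose last user has already gone is left exactly as it was found.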

4. Part 3: the reclaim policy decision

        /*
         * `distress' is a measure of how much trouble we're having reclaiming
         * pages.  0 -> no problems.  100 -> great trouble.
         */
        distress = 100 >> zone->prev_priority;

        /*
         * The point of this algorithm is to decide when to start reclaiming
         * mapped memory instead of just pagecache.  Work out how much memory
         * is mapped.
         */
        mapped_ratio = (sc->nr_mapped * 100) / total_memory;

        /*
         * Now decide how much we really want to unmap some pages.  The mapped
         * ratio is downgraded - just because there's a lot of mapped memory
         * doesn't necessarily mean that page reclaim isn't succeeding.
         *
         * The distress ratio is important - we don't want to start going oom.
         *
         * A 100% value of vm_swappiness overrides this algorithm altogether.
         */
        swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;

        /*
         * Now use this metric to decide whether to start moving mapped memory
         * onto the inactive list.
         */
        if (swap_tendency >= 100)
                reclaim_mapped = 1;

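The decision algorithm in detail

  1. Memory pressure: distress = 100 >> zone->prev_priority

    • The lower the priority (the more aggressive), the larger distress becomes: 0 at priority 12, 100 at priority 0
  2. Mapped memory ratio: mapped_ratio = (sc->nr_mapped * 100) / total_memory

    • The percentage of total memory that is mapped by processes
  3. Swap tendency: swap_tendency = mapped_ratio / 2 + distress + vm_swappiness

    • Combines three factors: the mapped memory ratio, memory pressure, and the system swappiness setting
  4. Decision: if swap_tendency >= 100, reclaim_mapped is set to 1

    • This means mapped memory (the working sets of processes) starts being reclaimed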
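
To get a feel for when the threshold is crossed, the formula can be lifted into a stand-alone function. This is a sketch: should_reclaim_mapped is a hypothetical name, and the kernel globals total_memory and vm_swappiness are passed in as parameters. The swappiness value of 60 used in the examples below is assumed to be the default setting.

```c
/*
 * Returns 1 if mapped pages should start being deactivated, 0 otherwise.
 * Mirrors the distress / mapped_ratio / swap_tendency arithmetic from
 * refill_inactive_zone, using the same integer operations.
 */
static int should_reclaim_mapped(int prev_priority,
                                 unsigned long nr_mapped,
                                 unsigned long total_memory,
                                 int swappiness)
{
        long distress = 100 >> prev_priority;
        long mapped_ratio = (long)((nr_mapped * 100) / total_memory);
        long swap_tendency = mapped_ratio / 2 + distress + swappiness;

        return swap_tendency >= 100;
}
```

With half of memory mapped and swappiness 60, the score is 25 + 0 + 60 = 85 at prev_priority 12, so mapped pages are left alone. At prev_priority 3 it is 25 + 12 + 60 = 97, still below the threshold; at prev_priority 2 it reaches 25 + 25 + 60 = 110 and reclaim_mapped is set. A swappiness of 100 crosses the threshold regardless of the other two terms.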

5. Part 4: classifying the pages

        while (!list_empty(&l_hold)) {
                page = lru_to_page(&l_hold);
                list_del(&page->lru);
                if (page_mapped(page)) {
                        if (!reclaim_mapped ||
                            (total_swap_pages == 0 && PageAnon(page)) ||
                            page_referenced(page, 0, sc->priority <= 0)) {
                                list_add(&page->lru, &l_active);
                                continue;
                        }
                }
                list_add(&page->lru, &l_inactive);
        }

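Page classification logic

For each page on the temporary list:

  1. If the page is mapped (page_mapped(page)):

    • If !reclaim_mapped (mapped pages are not being reclaimed), it goes back to the active list
    • If there is no swap space and the page is anonymous, it goes back to the active list
    • If the page was referenced recently (page_referenced), it goes back to the active list
    • Otherwise it goes to the inactive list
  2. If the page is not mapped (typically page cache), it goes straight to the inactive list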
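
The classification rules above reduce to a small pure function. In this sketch, struct toy_page_info, enum lru_target and classify are hypothetical names; the three booleans stand in for the results of page_mapped, PageAnon and page_referenced, which the real code queries on the page itself.

```c
#include <stdbool.h>

enum lru_target { TO_ACTIVE, TO_INACTIVE };

/* Hypothetical snapshot of the page properties the decision depends on. */
struct toy_page_info {
        bool mapped;     /* stands in for page_mapped(page) */
        bool anon;       /* stands in for PageAnon(page) */
        bool referenced; /* stands in for page_referenced(...) != 0 */
};

static enum lru_target classify(const struct toy_page_info *pg,
                                bool reclaim_mapped,
                                unsigned long total_swap_pages)
{
        if (pg->mapped) {
                /* Any one of the three conditions keeps a mapped page active. */
                if (!reclaim_mapped ||
                    (total_swap_pages == 0 && pg->anon) ||
                    pg->referenced)
                        return TO_ACTIVE;
        }
        /* Unmapped pages are deactivated without a reference check. */
        return TO_INACTIVE;
}
```

Note the asymmetry: the referenced flag matters only for mapped pages, so an unmapped page goes to the inactive list even if the flag is set.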

6. Part 5: moving the batch onto the inactive list

        pagevec_init(&pvec, 1);
        pgmoved = 0;
        spin_lock_irq(&zone->lru_lock);
        while (!list_empty(&l_inactive)) {
                page = lru_to_page(&l_inactive);
                prefetchw_prev_lru_page(page, &l_inactive, flags);
                if (TestSetPageLRU(page))
                        BUG();
                if (!TestClearPageActive(page))
                        BUG();
                list_move(&page->lru, &zone->inactive_list);
                pgmoved++;
                if (!pagevec_add(&pvec, page)) {
                        zone->nr_inactive += pgmoved;
                        spin_unlock_irq(&zone->lru_lock);
                        pgdeactivate += pgmoved;
                        pgmoved = 0;
                        if (buffer_heads_over_limit)
                                pagevec_strip(&pvec);
                        __pagevec_release(&pvec);
                        spin_lock_irq(&zone->lru_lock);
                }
        }
        zone->nr_inactive += pgmoved;
        pgdeactivate += pgmoved;
        if (buffer_heads_over_limit) {
                spin_unlock_irq(&zone->lru_lock);
                pagevec_strip(&pvec);
                spin_lock_irq(&zone->lru_lock);
        }

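Sequence of operations

  1. Initialises a pagevec for batched operations
  2. Moves the pages on l_inactive to the zone's inactive list
  3. Sets the PG_lru flag and clears the PG_active flag
  4. Uses the pagevec to batch the work: whenever it fills up, the LRU lock is dropped and the batch is released
  5. If buffer_heads_over_limit is set, calls pagevec_strip to strip buffer heads from the pages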
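
The "flush when the pagevec is full" pattern in step 4 can be sketched in isolation. Everything here is a toy: TOY_PAGEVEC_SIZE is an arbitrary capacity chosen for the example (not the kernel's PAGEVEC_SIZE), pages are represented by plain integers, and a flush just empties the vector where the real code would drop zone->lru_lock, call __pagevec_release and retake the lock.

```c
#define TOY_PAGEVEC_SIZE 4      /* assumption: example capacity only */

struct toy_pagevec {
        unsigned nr;
        int pages[TOY_PAGEVEC_SIZE];
};

/* Adds a page and returns the space left, like pagevec_add: 0 means full. */
static unsigned toy_pagevec_add(struct toy_pagevec *pv, int page)
{
        pv->pages[pv->nr++] = page;
        return TOY_PAGEVEC_SIZE - pv->nr;
}

/*
 * Moves n_pages pages through the vector. Returns how many times the
 * vector filled up and was flushed; *left_in_vec receives the number of
 * pages still queued when the loop ends.
 */
static unsigned move_pages(unsigned n_pages, unsigned *left_in_vec)
{
        struct toy_pagevec pv = { 0 };
        unsigned flushes = 0;
        unsigned i;

        for (i = 0; i < n_pages; i++) {
                if (!toy_pagevec_add(&pv, (int)i)) {
                        /* Real code: unlock, release the batch, relock. */
                        flushes++;
                        pv.nr = 0;
                }
        }
        *left_in_vec = pv.nr;
        return flushes;
}
```

Moving 10 pages with a capacity of 4 gives two mid-loop flushes and leaves 2 pages queued, which matches the shape of the kernel loop: the leftover pages are released once at the end of the function.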

7. Part 6: putting pages back on the active list and updating statistics

        pgmoved = 0;
        while (!list_empty(&l_active)) {
                page = lru_to_page(&l_active);
                prefetchw_prev_lru_page(page, &l_active, flags);
                if (TestSetPageLRU(page))
                        BUG();
                BUG_ON(!PageActive(page));
                list_move(&page->lru, &zone->active_list);
                pgmoved++;
                if (!pagevec_add(&pvec, page)) {
                        zone->nr_active += pgmoved;
                        pgmoved = 0;
                        spin_unlock_irq(&zone->lru_lock);
                        __pagevec_release(&pvec);
                        spin_lock_irq(&zone->lru_lock);
                }
        }
        zone->nr_active += pgmoved;
        spin_unlock_irq(&zone->lru_lock);
        pagevec_release(&pvec);

        mod_page_state_zone(zone, pgrefill, pgscanned);
        mod_page_state(pgdeactivate, pgdeactivate);
}

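Final stage

  1. Puts the pages on l_active back onto the zone's active list
  2. Asserts that these pages still carry the PG_active flag
  3. Updates the zone's active page count
  4. Updates kernel statistics:
    • pgrefill: number of pages scanned on the active list
    • pgdeactivate: number of pages deactivated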

Reclaiming pages from the inactive list: shrink_cache

static void shrink_cache(struct zone *zone, struct scan_control *sc)
{
        LIST_HEAD(page_list);
        struct pagevec pvec;
        int max_scan = sc->nr_to_scan;

        pagevec_init(&pvec, 1);

        lru_add_drain();
        spin_lock_irq(&zone->lru_lock);
        while (max_scan > 0) {
                struct page *page;
                int nr_taken = 0;
                int nr_scan = 0;
                int nr_freed;

                while (nr_scan++ < SWAP_CLUSTER_MAX &&
                                !list_empty(&zone->inactive_list)) {
                        page = lru_to_page(&zone->inactive_list);

                        prefetchw_prev_lru_page(page,
                                                &zone->inactive_list, flags);

                        if (!TestClearPageLRU(page))
                                BUG();
                        list_del(&page->lru);
                        if (get_page_testone(page)) {
                                /*
                                 * It is being freed elsewhere
                                 */
                                __put_page(page);
                                SetPageLRU(page);
                                list_add(&page->lru, &zone->inactive_list);
                                continue;
                        }
                        list_add(&page->lru, &page_list);
                        nr_taken++;
                }
                zone->nr_inactive -= nr_taken;
                spin_unlock_irq(&zone->lru_lock);

                if (nr_taken == 0)
                        goto done;

                max_scan -= nr_scan;
                if (current_is_kswapd())
                        mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
                else
                        mod_page_state_zone(zone, pgscan_direct, nr_scan);
                nr_freed = shrink_list(&page_list, sc);
                if (current_is_kswapd())
                        mod_page_state(kswapd_steal, nr_freed);
                mod_page_state_zone(zone, pgsteal, nr_freed);
                sc->nr_to_reclaim -= nr_freed;

                spin_lock_irq(&zone->lru_lock);
                /*
                 * Put back any unfreeable pages.
                 */
                while (!list_empty(&page_list)) {
                        page = lru_to_page(&page_list);
                        if (TestSetPageLRU(page))
                                BUG();
                        list_del(&page->lru);
                        if (PageActive(page))
                                add_page_to_active_list(zone, page);
                        else
                                add_page_to_inactive_list(zone, page);
                        if (!pagevec_add(&pvec, page)) {
                                spin_unlock_irq(&zone->lru_lock);
                                __pagevec_release(&pvec);
                                spin_lock_irq(&zone->lru_lock);
                        }
                }
        }
        spin_unlock_irq(&zone->lru_lock);
done:
        pagevec_release(&pvec);
}

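1. Function overview

shrink_cache reclaims pages from the inactive list; this is where pages are actually freed. Through shrink_list it covers the concrete reclaim operations: writing dirty pages back to disk, swapping out anonymous pages, and freeing clean pages.

2. Part 1: variable declarations and initialisation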

static void shrink_cache(struct zone *zone, struct scan_control *sc)
{
        LIST_HEAD(page_list);
        struct pagevec pvec;
        int max_scan = sc->nr_to_scan;

        pagevec_init(&pvec, 1);

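Variables

  • page_list: temporary list holding pages taken off the inactive list for processing
  • pvec: pagevec used to release pages in batches
  • max_scan: maximum number of pages to scan, copied from the scan control structure
  • pagevec_init(&pvec, 1): initialises the pagevec; the argument 1 marks its pages as cold

3. Part 2: preparation and taking the lock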

        lru_add_drain();
        spin_lock_irq(&zone->lru_lock);

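Preparation

  • lru_add_drain(): flushes the per-CPU LRU pagevecs so that every page waiting to be added is on its list
  • spin_lock_irq(&zone->lru_lock): takes the zone's LRU lock with interrupts disabled, protecting the LRU list operations

4. Part 3: the main scan loop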

        while (max_scan > 0) {
                struct page *page;
                int nr_taken = 0;
                int nr_scan = 0;
                int nr_freed;

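Main loop condition: max_scan > 0, meaning there are still pages to scan

Local variables

  • nr_taken: number of pages taken off the inactive list in this batch
  • nr_scan: scan counter for this batch
  • nr_freed: number of pages actually freed

5. Part 4: isolating pages from the inactive list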

                while (nr_scan++ < SWAP_CLUSTER_MAX &&
                                !list_empty(&zone->inactive_list)) {
                        page = lru_to_page(&zone->inactive_list);

                        prefetchw_prev_lru_page(page,
                                                &zone->inactive_list, flags);

                        if (!TestClearPageLRU(page))
                                BUG();
                        list_del(&page->lru);

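Batch isolation logic

  • Loop condition 1: nr_scan++ < SWAP_CLUSTER_MAX, so at most 32 pages are examined per batch
  • Loop condition 2: !list_empty(&zone->inactive_list), the inactive list is not empty
  • lru_to_page(&zone->inactive_list): takes the page at the tail of the list
  • prefetchw_prev_lru_page(): prefetches the preceding list entry, which is the next page to be examined, to improve cache behaviour
  • TestClearPageLRU(page): atomically clears the LRU flag; BUG() fires if the flag was not set
  • list_del(&page->lru): removes the page from the inactive list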
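
One consequence of writing the first loop condition with a post-increment is that nr_scan is bumped on every evaluation of the condition, including the final one that ends the loop. The sketch below isolates that behaviour; take_batch and struct batch_counts are hypothetical names, an integer list length stands in for the inactive list, and the get_page_testone path (which can skip a page without counting it in nr_taken) is left out.

```c
#define SWAP_CLUSTER_MAX 32

struct batch_counts {
        int nr_scan;  /* value of the counter after the loop */
        int nr_taken; /* pages actually removed from the list */
};

/* Models one batch taken from a list holding list_len pages. */
static struct batch_counts take_batch(int list_len)
{
        struct batch_counts c = { 0, 0 };

        /* Same shape as the kernel loop: post-increment, then emptiness test. */
        while (c.nr_scan++ < SWAP_CLUSTER_MAX && list_len > 0) {
                list_len--;
                c.nr_taken++;
        }
        return c;
}
```

In this model the counter always ends one higher than the number of loop iterations: a list of 100 pages yields nr_taken = 32 and nr_scan = 33, and a list of 5 pages yields nr_taken = 5 and nr_scan = 6. The later max_scan -= nr_scan therefore charges slightly more than the number of pages examined.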

6. Part 5: page reference check

                        if (get_page_testone(page)) {
                                /*
                                 * It is being freed elsewhere
                                 */
                                __put_page(page);
                                SetPageLRU(page);
                                list_add(&page->lru, &zone->inactive_list);
                                continue;
                        }
                        list_add(&page->lru, &page_list);
                        nr_taken++;
                }

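Reference check logic

  • get_page_testone(page): takes a reference on the page and reports whether the page had no references before that, which means it is being freed elsewhere. In that case:
    • __put_page(page): drops the reference again
    • SetPageLRU(page): sets the LRU flag again
    • list_add(&page->lru, &zone->inactive_list): puts the page back on the inactive list
    • continue: skips this page and moves on to the next one
  • Otherwise: the page is added to the temporary list page_list with the new reference held, and nr_taken is incremented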
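
The "being freed elsewhere" test is easier to follow with a small model. The sketch below is not kernel code. It assumes, for illustration, that the counter stores "references minus one", so a page with no references holds -1 and an increment that lands on 0 reveals that state. All model_ names are hypothetical.

```c
/* Simplified model: count is assumed to store "references - 1". */
struct model_page {
        int count;
};

/* Take a reference; report 1 if the page had no references before. */
int model_get_page_testone(struct model_page *p)
{
        p->count++;
        return p->count == 0;
}

/* Drop a reference without freeing the page, like __put_page() in the listing. */
void model_put_page(struct model_page *p)
{
        p->count--;
}

/* Number of references under the biased encoding. */
int model_page_count(const struct model_page *p)
{
        return p->count + 1;
}
```

A page that starts at -1 makes model_get_page_testone() return 1, and the caller undoes the increment. A page held only by the page cache starts at 0, the call returns 0, and the page ends up with two references, which matches the "pagecache + us == 2" check in shrink_list.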

7. Part 6: Statistics update

                zone->nr_inactive -= nr_taken;
                spin_unlock_irq(&zone->lru_lock);

                if (nr_taken == 0)
                        goto done;

                max_scan -= nr_scan;
                if (current_is_kswapd())
                        mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
                else
                        mod_page_state_zone(zone, pgscan_direct, nr_scan);

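Statistics update

  • zone->nr_inactive -= nr_taken: updates the zone's inactive page count
  • spin_unlock_irq(&zone->lru_lock): releases the lock so other LRU operations can proceed
  • If nr_taken == 0, nothing was isolated, so jump to done
  • max_scan -= nr_scan: reduces the remaining scan budget
  • The scan counter that gets updated depends on the caller: pgscan_kswapd for kswapd, pgscan_direct for direct reclaim

8. Part 7: The actual reclaim
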
                nr_freed = shrink_list(&page_list, sc);
                if (current_is_kswapd())
                        mod_page_state(kswapd_steal, nr_freed);
                mod_page_state_zone(zone, pgsteal, nr_freed);
                sc->nr_to_reclaim -= nr_freed;

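Core reclaim

  • nr_freed = shrink_list(&page_list, sc): reclaims the isolated pages and returns the number freed
    • This function holds the per-page reclaim logic (writeback, swap, freeing)
  • Reclaim statistics:
    • kswapd_steal: pages reclaimed by kswapd
    • pgsteal: pages reclaimed from this zone
  • sc->nr_to_reclaim -= nr_freed: lowers the remaining reclaim target

9. Part 8: Handling pages that were not reclaimed
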
                spin_lock_irq(&zone->lru_lock);
                /*
                 * Put back any unfreeable pages.
                 */
                while (!list_empty(&page_list)) {
                        page = lru_to_page(&page_list);
                        if (TestSetPageLRU(page))
                                BUG();
                        list_del(&page->lru);
                        if (PageActive(page))
                                add_page_to_active_list(zone, page);
                        else
                                add_page_to_inactive_list(zone, page);
                        if (!pagevec_add(&pvec, page)) {
                                spin_unlock_irq(&zone->lru_lock);
                                __pagevec_release(&pvec);
                                spin_lock_irq(&zone->lru_lock);
                        }
                }
        }

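Handling unreclaimed pages

  • Retake the lock and process the pages that could not be reclaimed
  • Walk the pages left on page_list
  • TestSetPageLRU(page): sets the LRU flag; a flag that is already set triggers BUG()
  • Put each page back on the list matching its state:
    • PageActive(page): back to the active list
    • Otherwise: back to the inactive list
  • pagevec_add() collects the pages; when the vector is full, the lock is dropped, __pagevec_release() drops the references in one batch, and the lock is retaken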
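
The batching pattern in this loop can be sketched in isolation. The model below is not the kernel's pagevec. The capacity of 16 is an assumed value for illustration, and page ids stand in for struct page pointers. It shows how "add returns the free slots left" turns into one lock drop per full vector.

```c
#define MODEL_PAGEVEC_SIZE 16   /* assumed capacity, for illustration only */

struct model_pagevec {
        unsigned int nr;
        int pages[MODEL_PAGEVEC_SIZE];  /* page ids stand in for struct page * */
};

/* Add a page and return the number of free slots left (0 means full). */
unsigned int model_pagevec_add(struct model_pagevec *pvec, int page)
{
        pvec->pages[pvec->nr++] = page;
        return MODEL_PAGEVEC_SIZE - pvec->nr;
}

/*
 * Put back nr_pages pages, flushing the vector whenever it fills up.
 * Returns how many times the LRU lock would be dropped and retaken.
 */
unsigned int model_putback(unsigned int nr_pages, struct model_pagevec *pvec)
{
        unsigned int lock_drops = 0;
        unsigned int i;

        for (i = 0; i < nr_pages; i++) {
                if (!model_pagevec_add(pvec, (int)i)) {
                        lock_drops++;   /* unlock, release the batch, relock */
                        pvec->nr = 0;
                }
        }
        return lock_drops;
}
```

Putting back 40 pages under this model drops the lock twice and leaves 8 pages in the vector, which is why the function still needs the final pagevec_release() after the done label.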

10. Part 9: Cleanup

        spin_unlock_irq(&zone->lru_lock);
done:
        pagevec_release(&pvec);
}

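Wrap-up

  • spin_unlock_irq(&zone->lru_lock): releases the lock for the last time
  • done: the label targeted by the earlier goto
  • pagevec_release(&pvec): releases the pages still held in the page vector

Performing the actual page reclaim: shrink_list
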
static int shrink_list(struct list_head *page_list, struct scan_control *sc)
{
        LIST_HEAD(ret_pages);
        struct pagevec freed_pvec;
        int pgactivate = 0;
        int reclaimed = 0;

        cond_resched();

        pagevec_init(&freed_pvec, 1);
        while (!list_empty(page_list)) {
                struct address_space *mapping;
                struct page *page;
                int may_enter_fs;
                int referenced;

                page = lru_to_page(page_list);
                list_del(&page->lru);

                if (TestSetPageLocked(page))
                        goto keep;

                BUG_ON(PageActive(page));

                if (PageWriteback(page))
                        goto keep_locked;

                sc->nr_scanned++;
                /* Double the slab pressure for mapped and swapcache pages */
                if (page_mapped(page) || PageSwapCache(page))
                        sc->nr_scanned++;

                referenced = page_referenced(page, 1, sc->priority <= 0);
                /* In active use or really unfreeable?  Activate it. */
                if (referenced && page_mapping_inuse(page))
                        goto activate_locked;

#ifdef CONFIG_SWAP
                /*
                 * Anonymous process memory has backing store?
                 * Try to allocate it some swap space here.
                 */
                if (PageAnon(page) && !PageSwapCache(page)) {
                        if (!add_to_swap(page))
                                goto activate_locked;
                }
#endif /* CONFIG_SWAP */

                mapping = page_mapping(page);
                may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
                        (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

                /*
                 * The page is mapped into the page tables of one or more
                 * processes. Try to unmap it here.
                 */
                if (page_mapped(page) && mapping) {
                        switch (try_to_unmap(page)) {
                        case SWAP_FAIL:
                                goto activate_locked;
                        case SWAP_AGAIN:
                                goto keep_locked;
                        case SWAP_SUCCESS:
                                ; /* try to free the page below */
                        }
                }

                if (PageDirty(page)) {
                        if (referenced)
                                goto keep_locked;
                        if (!may_enter_fs)
                                goto keep_locked;
                        if (laptop_mode && !sc->may_writepage)
                                goto keep_locked;

                        /* Page is dirty, try to write it out here */
                        switch(pageout(page, mapping)) {
                        case PAGE_KEEP:
                                goto keep_locked;
                        case PAGE_ACTIVATE:
                                goto activate_locked;
                        case PAGE_SUCCESS:
                                if (PageWriteback(page) || PageDirty(page))
                                        goto keep;
                                /*
                                 * A synchronous write - probably a ramdisk.  Go
                                 * ahead and try to reclaim the page.
                                 */
                                if (TestSetPageLocked(page))
                                        goto keep;
                                if (PageDirty(page) || PageWriteback(page))
                                        goto keep_locked;
                                mapping = page_mapping(page);
                        case PAGE_CLEAN:
                                ; /* try to free the page below */
                        }
                }

                /*
                 * If the page has buffers, try to free the buffer mappings
                 * associated with this page. If we succeed we try to free
                 * the page as well.
                 *
                 * We do this even if the page is PageDirty().
                 * try_to_release_page() does not perform I/O, but it is
                 * possible for a page to have PageDirty set, but it is actually
                 * clean (all its buffers are clean).  This happens if the
                 * buffers were written out directly, with submit_bh(). ext3
                 * will do this, as well as the blockdev mapping.
                 * try_to_release_page() will discover that cleanness and will
                 * drop the buffers and mark the page clean - it can be freed.
                 *
                 * Rarely, pages can have buffers and no ->mapping.  These are
                 * the pages which were not successfully invalidated in
                 * truncate_complete_page().  We try to drop those buffers here
                 * and if that worked, and the page is no longer mapped into
                 * process address space (page_count == 1) it can be freed.
                 * Otherwise, leave the page on the LRU so it is swappable.
                 */
                if (PagePrivate(page)) {
                        if (!try_to_release_page(page, sc->gfp_mask))
                                goto activate_locked;
                        if (!mapping && page_count(page) == 1)
                                goto free_it;
                }

                if (!mapping)
                        goto keep_locked;       /* truncate got there first */

                spin_lock_irq(&mapping->tree_lock);

                /*
                 * The non-racy check for busy page.  It is critical to check
                 * PageDirty _after_ making sure that the page is freeable and
                 * not in use by anybody.       (pagecache + us == 2)
                 */
                if (page_count(page) != 2 || PageDirty(page)) {
                        spin_unlock_irq(&mapping->tree_lock);
                        goto keep_locked;
                }

#ifdef CONFIG_SWAP
                if (PageSwapCache(page)) {
                        swp_entry_t swap = { .val = page->private };
                        __delete_from_swap_cache(page);
                        spin_unlock_irq(&mapping->tree_lock);
                        swap_free(swap);
                        __put_page(page);       /* The pagecache ref */
                        goto free_it;
                }
#endif /* CONFIG_SWAP */

                __remove_from_page_cache(page);
                spin_unlock_irq(&mapping->tree_lock);
                __put_page(page);

free_it:
                unlock_page(page);
                reclaimed++;
                if (!pagevec_add(&freed_pvec, page))
                        __pagevec_release_nonlru(&freed_pvec);
                continue;

activate_locked:
                SetPageActive(page);
                pgactivate++;
keep_locked:
                unlock_page(page);
keep:
                list_add(&page->lru, &ret_pages);
                BUG_ON(PageLRU(page));
        }
        list_splice(&ret_pages, page_list);
        if (pagevec_count(&freed_pvec))
                __pagevec_release_nonlru(&freed_pvec);
        mod_page_state(pgactivate, pgactivate);
        sc->nr_reclaimed += reclaimed;
        return reclaimed;
}

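1. What the function does

Performs the actual page reclaim: unmapping pages, writing back dirty pages, and freeing page cache pages. This is the stage of the reclaim pipeline where memory is really released.

2. Part 1: Variable declarations and initialization
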
static int shrink_list(struct list_head *page_list, struct scan_control *sc)
{
        LIST_HEAD(ret_pages);
        struct pagevec freed_pvec;
        int pgactivate = 0;
        int reclaimed = 0;

        cond_resched();

        pagevec_init(&freed_pvec, 1);

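Variables

  • ret_pages: temporary list holding the pages that could not be reclaimed and must be returned
  • freed_pvec: page vector used to free reclaimed pages in batches
  • pgactivate: number of pages activated (promoted from inactive to active)
  • reclaimed: number of pages successfully reclaimed
  • cond_resched(): gives the scheduler a chance to run another task before the work starts, so the CPU is not held for too long
  • pagevec_init(&freed_pvec, 1): initializes the page vector used for batched freeing

3. Part 2: Main loop and page locking
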
        while (!list_empty(page_list)) {
                struct address_space *mapping;
                struct page *page;
                int may_enter_fs;
                int referenced;

                page = lru_to_page(page_list);
                list_del(&page->lru);

                if (TestSetPageLocked(page))
                        goto keep;

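Main loop: processes every page on the input list.

Page locking

  • TestSetPageLocked(page): tries to lock the page without blocking; if the page is already locked, jump to keep
  • The page lock keeps other operations from modifying the page during reclaim

4. Part 3: Basic state checks
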
                BUG_ON(PageActive(page));

                if (PageWriteback(page))
                        goto keep_locked;

                sc->nr_scanned++;
                /* Double the slab pressure for mapped and swapcache pages */
                if (page_mapped(page) || PageSwapCache(page))
                        sc->nr_scanned++;

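State checks

  • BUG_ON(PageActive(page)): asserts that the page is not active (it came from the inactive list)
  • PageWriteback(page): a page under writeback is skipped
  • Scan accounting: mapped pages and swap cache pages are counted twice, which doubles the pressure later applied to the slab caches

5. Part 4: Page reference check
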
                referenced = page_referenced(page, 1, sc->priority <= 0);
                /* In active use or really unfreeable?  Activate it. */
                if (referenced && page_mapping_inuse(page))
                        goto activate_locked;

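Reference check

  • page_referenced(page, 1, sc->priority <= 0): checks whether the page was referenced recently
  • If it was referenced and its mapping is still in use, the page is activated (promoted to the active list)

6. Part 5: Anonymous pages
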
#ifdef CONFIG_SWAP
                /*
                 * Anonymous process memory has backing store?
                 * Try to allocate it some swap space here.
                 */
                if (PageAnon(page) && !PageSwapCache(page)) {
                        if (!add_to_swap(page))
                                goto activate_locked;
                }
#endif /* CONFIG_SWAP */

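Anonymous pages

  • An anonymous page that is not yet in the swap cache gets swap space allocated through add_to_swap()
  • If the allocation fails, the page is activated because it cannot be reclaimed

7. Part 6: Unmapping mapped pages
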
                mapping = page_mapping(page);
                may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
                        (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

                /*
                 * The page is mapped into the page tables of one or more
                 * processes. Try to unmap it here.
                 */
                if (page_mapped(page) && mapping) {
                        switch (try_to_unmap(page)) {
                        case SWAP_FAIL:
                                goto activate_locked;
                        case SWAP_AGAIN:
                                goto keep_locked;
                        case SWAP_SUCCESS:
                                ; /* try to free the page below */
                        }
                }

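Unmapping

  • try_to_unmap(page): tries to remove the page from the page tables of every process that maps it
  • Three outcomes:
    • SWAP_FAIL: unmapping failed, activate the page
    • SWAP_AGAIN: try again later; the page is unlocked and kept on the list (keep_locked)
    • SWAP_SUCCESS: unmapped successfully, continue reclaiming

8. Part 7: Dirty page writeback
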
                if (PageDirty(page)) {
                        if (referenced)
                                goto keep_locked;
                        if (!may_enter_fs)
                                goto keep_locked;
                        if (laptop_mode && !sc->may_writepage)
                                goto keep_locked;

                        /* Page is dirty, try to write it out here */
                        switch(pageout(page, mapping)) {
                        case PAGE_KEEP:
                                goto keep_locked;
                        case PAGE_ACTIVATE:
                                goto activate_locked;
                        case PAGE_SUCCESS:
                                if (PageWriteback(page) || PageDirty(page))
                                        goto keep;
                                /*
                                 * A synchronous write - probably a ramdisk.  Go
                                 * ahead and try to reclaim the page.
                                 */
                                if (TestSetPageLocked(page))
                                        goto keep;
                                if (PageDirty(page) || PageWriteback(page))
                                        goto keep_locked;
                                mapping = page_mapping(page);
                        case PAGE_CLEAN:
                                ; /* try to free the page below */
                        }
                }

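Dirty page handling

  • Writeback is skipped when the page was referenced, when filesystem operations are not allowed (!may_enter_fs), or in laptop mode while sc->may_writepage is not set
  • pageout(page, mapping): starts writeback of the page
  • Four outcomes:
    • PAGE_KEEP: keep the page
    • PAGE_ACTIVATE: activate the page
    • PAGE_SUCCESS: the write was submitted; a page that is still dirty or under writeback is kept, and only a write that already completed (a synchronous write) falls through to reclaim
    • PAGE_CLEAN: the page is clean, continue reclaiming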
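
The three early exits before pageout() can be written as pure functions, which makes the conditions easier to check one by one. This is a sketch rather than kernel code: the MODEL_GFP_ bit values are hypothetical placeholders for the real __GFP_IO and __GFP_FS flags, and the function names are invented for illustration.

```c
/* Hypothetical flag bits for the model; the real values live in the kernel headers. */
#define MODEL_GFP_IO 0x1u
#define MODEL_GFP_FS 0x2u

/* Mirrors the may_enter_fs expression from the listing. */
int model_may_enter_fs(unsigned int gfp_mask, int page_in_swap_cache)
{
        return (gfp_mask & MODEL_GFP_FS) ||
               (page_in_swap_cache && (gfp_mask & MODEL_GFP_IO));
}

/* Returns 1 if a dirty page would be handed to pageout(), 0 if it is kept. */
int model_should_pageout(int referenced, int may_enter_fs,
                         int laptop_mode, int may_writepage)
{
        if (referenced)
                return 0;       /* recently used: keep_locked */
        if (!may_enter_fs)
                return 0;       /* caller must not enter the filesystem */
        if (laptop_mode && !may_writepage)
                return 0;       /* laptop mode defers writeout */
        return 1;
}
```

Note that a swap cache page only needs the I/O permission, while a file page needs the filesystem permission, so a caller that allows I/O but not filesystem calls can still write anonymous pages to swap.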

9. Part 8: Pages with buffers

                if (PagePrivate(page)) {
                        if (!try_to_release_page(page, sc->gfp_mask))
                                goto activate_locked;
                        if (!mapping && page_count(page) == 1)
                                goto free_it;
                }

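Pages with buffers

  • PagePrivate(page): the page carries private data, typically buffer heads attached by the filesystem
  • try_to_release_page(): tries to release the buffers; on failure the page is activated
  • If the page has no mapping and only one reference is left, it can be freed directly

10. Part 9: Page cache check
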
                if (!mapping)
                        goto keep_locked;       /* truncate got there first */

                spin_lock_irq(&mapping->tree_lock);

                /*
                 * The non-racy check for busy page.  It is critical to check
                 * PageDirty _after_ making sure that the page is freeable and
                 * not in use by anybody.       (pagecache + us == 2)
                 */
                if (page_count(page) != 2 || PageDirty(page)) {
                        spin_unlock_irq(&mapping->tree_lock);
                        goto keep_locked;
                }

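Page cache check

  • The page is freeable only if its reference count is exactly 2 (the page cache plus the current reclaim path)
  • The page must be clean; PageDirty is checked after the count, under mapping->tree_lock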
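
The "count must be exactly 2" rule follows from adding up who holds a reference at this point. A minimal sketch of that arithmetic, with hypothetical parameter names and no kernel types:

```c
/*
 * Model of the final check: the page cache holds one reference and the
 * reclaim path holds one. Any additional user pushes the count above 2.
 */
int model_page_freeable(int pagecache_refs, int reclaim_refs,
                        int other_refs, int dirty)
{
        int count = pagecache_refs + reclaim_refs + other_refs;

        if (count != 2 || dirty)
                return 0;       /* keep_locked: busy or dirtied again */
        return 1;
}
```

A page with one extra user (other_refs == 1) or a page that was dirtied again after writeback is kept, even though it passed every earlier test.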

11. Part 10: Freeing swap cache pages

#ifdef CONFIG_SWAP
                if (PageSwapCache(page)) {
                        swp_entry_t swap = { .val = page->private };
                        __delete_from_swap_cache(page);
                        spin_unlock_irq(&mapping->tree_lock);
                        swap_free(swap);
                        __put_page(page);       /* The pagecache ref */
                        goto free_it;
                }
#endif /* CONFIG_SWAP */

Swap cache pages

  • Remove the page from the swap cache
  • Free the swap entry
  • Drop the page cache reference

12. Part 11: Freeing ordinary page cache pages

                __remove_from_page_cache(page);
                spin_unlock_irq(&mapping->tree_lock);
                __put_page(page);

free_it:
                unlock_page(page);
                reclaimed++;
                if (!pagevec_add(&freed_pvec, page))
                        __pagevec_release_nonlru(&freed_pvec);
                continue;

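Freeing from the page cache

  • Remove the page from the page cache
  • Drop the page cache reference
  • Free the reclaimed pages in batches through freed_pvec

13. Part 12: Failure handling and statistics
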
activate_locked:
                SetPageActive(page);
                pgactivate++;
keep_locked:
                unlock_page(page);
keep:
                list_add(&page->lru, &ret_pages);
                BUG_ON(PageLRU(page));
        }
        list_splice(&ret_pages, page_list);
        if (pagevec_count(&freed_pvec))
                __pagevec_release_nonlru(&freed_pvec);
        mod_page_state(pgactivate, pgactivate);
        sc->nr_reclaimed += reclaimed;
        return reclaimed;
}

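Wrap-up

  • Pages that could not be reclaimed are spliced back onto the caller's list
  • Reclaimed pages still held in freed_pvec are freed
  • The statistics are updated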
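
The three exit labels are stacked so that each one falls through into the next: an activated page is also unlocked, and an unlocked page is also kept. The sketch below reproduces only that control flow. The enum, the DID_ bit names and the function are hypothetical and exist just to make the fall-through observable.

```c
enum model_outcome { MODEL_ACTIVATE_LOCKED, MODEL_KEEP_LOCKED, MODEL_KEEP };

#define DID_ACTIVATE 0x1u
#define DID_UNLOCK   0x2u
#define DID_KEEP     0x4u

/* Returns a bitmask of the actions performed for a page that is not freed. */
unsigned int model_finish_page(enum model_outcome outcome)
{
        unsigned int actions = 0;

        switch (outcome) {
        case MODEL_ACTIVATE_LOCKED:
                goto activate_locked;
        case MODEL_KEEP_LOCKED:
                goto keep_locked;
        case MODEL_KEEP:
                goto keep;
        }

activate_locked:
        actions |= DID_ACTIVATE;        /* SetPageActive(), pgactivate++ */
keep_locked:
        actions |= DID_UNLOCK;          /* unlock_page() */
keep:
        actions |= DID_KEEP;            /* list_add() onto ret_pages */
        return actions;
}
```

This is why goto keep is used only where the page is not locked by this path (the initial trylock failed, or pageout() already unlocked it), while every other failure path uses keep_locked or activate_locked.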

Shrinking kernel object caches: shrink_slab

static int shrink_slab(unsigned long scanned, unsigned int gfp_mask,
                        unsigned long lru_pages)
{
        struct shrinker *shrinker;

        if (scanned == 0)
                scanned = SWAP_CLUSTER_MAX;

        if (!down_read_trylock(&shrinker_rwsem))
                return 0;

        list_for_each_entry(shrinker, &shrinker_list, list) {
                unsigned long long delta;
                unsigned long total_scan;

                delta = (4 * scanned) / shrinker->seeks;
                delta *= (*shrinker->shrinker)(0, gfp_mask);
                do_div(delta, lru_pages + 1);
                shrinker->nr += delta;
                if (shrinker->nr < 0)
                        shrinker->nr = LONG_MAX;        /* It wrapped! */

                total_scan = shrinker->nr;
                shrinker->nr = 0;

                while (total_scan >= SHRINK_BATCH) {
                        long this_scan = SHRINK_BATCH;
                        int shrink_ret;

                        shrink_ret = (*shrinker->shrinker)(this_scan, gfp_mask);
                        if (shrink_ret == -1)
                                break;
                        mod_page_state(slabs_scanned, this_scan);
                        total_scan -= this_scan;

                        cond_resched();
                }

                shrinker->nr += total_scan;
        }
        up_read(&shrinker_rwsem);
        return 0;
}

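1. What the function does

Shrinks the kernel object caches (slab caches) by calling every registered shrinker function, reclaiming memory used by kernel data structures.

2. Part 1: Function definition and initial check
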
static int shrink_slab(unsigned long scanned, unsigned int gfp_mask,
                        unsigned long lru_pages)
{
        struct shrinker *shrinker;

        if (scanned == 0)
                scanned = SWAP_CLUSTER_MAX;

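Parameters

  • scanned: number of pages scanned during page reclaim, which reflects the level of memory pressure
  • gfp_mask: allocation flags that control what the reclaim is allowed to do
  • lru_pages: total number of LRU pages (active plus inactive) in the zones being reclaimed, used to scale the amount to reclaim

Initial check

  • If scanned is 0, it is set to SWAP_CLUSTER_MAX (32)
  • This makes sure some slab reclaim happens even without page scan information

3. Part 2: Lock acquisition and start of the walk
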
        if (!down_read_trylock(&shrinker_rwsem))
                return 0;

        list_for_each_entry(shrinker, &shrinker_list, list) {

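Locking

  • down_read_trylock(&shrinker_rwsem): tries to take the read lock on the shrinker list
    • If the attempt fails (returns 0), the function returns immediately without reclaiming any slab
    • Using trylock avoids blocking under lock contention
  • shrinker_rwsem: the read-write semaphore protecting the shrinker list

Start of the walk

  • list_for_each_entry(shrinker, &shrinker_list, list): walks the shrinker list
  • Each kernel subsystem can register its own shrinker to manage its cache

4. Part 3: Computing the amount to reclaim
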
                unsigned long long delta;
                unsigned long total_scan;

                delta = (4 * scanned) / shrinker->seeks;
                delta *= (*shrinker->shrinker)(0, gfp_mask);
                do_div(delta, lru_pages + 1);
                shrinker->nr += delta;
                if (shrinker->nr < 0)
                        shrinker->nr = LONG_MAX;        /* It wrapped! */

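Steps of the computation

  1. Base increment: delta = (4 * scanned) / shrinker->seeks

    • scanned: reflects memory pressure; the more pages scanned, the higher the pressure
    • shrinker->seeks: cost of recreating an object of this cache; a larger value means reclaiming is more expensive, so the cache is shrunk less
    • Factor 4: an empirical constant that tunes the reclaim intensity
  2. Multiply by the number of reclaimable objects: delta *= (*shrinker->shrinker)(0, gfp_mask)

    • Calling the shrinker with 0 only queries the number of reclaimable objects and frees nothing
  3. Scale: do_div(delta, lru_pages + 1)

    • Dividing by the total number of LRU pages makes the fraction of the cache to scan follow the fraction of the LRU that was scanned
    • The +1 prevents division by zero
  4. Accumulate and check bounds

    • shrinker->nr += delta: adds the result to this shrinker's pending count
    • If the count overflowed (became negative), it is set to LONG_MAX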
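
The steps above can be tried with concrete numbers. The sketch below restates the arithmetic as a standalone function. It uses plain 64-bit division in place of do_div(), and nr_objects is a hypothetical name for the value the shrinker callback returns when it is called with 0.

```c
/* Model of the delta computation from the listing. */
unsigned long long model_slab_delta(unsigned long scanned, int seeks,
                                    unsigned long nr_objects,
                                    unsigned long lru_pages)
{
        unsigned long long delta;

        delta = (4 * scanned) / seeks;  /* base increment */
        delta *= nr_objects;            /* weight by the size of the cache */
        delta /= lru_pages + 1;         /* do_div() in the kernel */
        return delta;
}
```

With scanned = 1000, seeks = 2, 10000 cached objects and 100000 LRU pages the result is 199: scanning 1% of the LRU asks for roughly 2% of the objects, because 4 / seeks is 2. With scanned = 32 the result is only 6, which is far below one batch and is why the value is accumulated in shrinker->nr across calls.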

5. Part 4: Batched reclaim

                total_scan = shrinker->nr;
                shrinker->nr = 0;

                while (total_scan >= SHRINK_BATCH) {
                        long this_scan = SHRINK_BATCH;
                        int shrink_ret;

                        shrink_ret = (*shrinker->shrinker)(this_scan, gfp_mask);
                        if (shrink_ret == -1)
                                break;
                        mod_page_state(slabs_scanned, this_scan);
                        total_scan -= this_scan;

                        cond_resched();
                }

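Batched reclaim logic

  1. Initialization: total_scan = shrinker->nr, then shrinker->nr is reset to 0

    • Saves the accumulated pending amount and resets the counter
  2. Batch loop: while (total_scan >= SHRINK_BATCH)

    • SHRINK_BATCH: batch size (128), which avoids calling the shrinker too often
  3. Reclaim: shrink_ret = (*shrinker->shrinker)(this_scan, gfp_mask)

    • Calls the shrinker to reclaim the requested number of objects
    • this_scan: number of objects to try to reclaim in this call
  4. Error check: if (shrink_ret == -1) break

    • A return value of -1 means the shrinker cannot make progress, so the loop exits early
  5. Statistics: mod_page_state(slabs_scanned, this_scan)

    • Updates the slab scan counter
  6. Scheduling point: cond_resched()

    • Yields the CPU if needed during a long loop so other tasks are not starved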
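
A small model of the batch loop shows what is left over in each case. It is a sketch, not kernel code: fail_after is a hypothetical knob that makes the simulated callback report -1 after that many batches, and a negative value means it never fails.

```c
#define MODEL_SHRINK_BATCH 128  /* batch size named in the text */

/*
 * Model of the batch loop. Returns the amount left over; *calls counts
 * the batches that ran successfully.
 */
unsigned long model_run_batches(unsigned long total_scan, int fail_after,
                                int *calls)
{
        *calls = 0;
        while (total_scan >= MODEL_SHRINK_BATCH) {
                if (fail_after >= 0 && *calls >= fail_after)
                        break;                  /* shrink_ret == -1 */
                (*calls)++;
                total_scan -= MODEL_SHRINK_BATCH;
        }
        return total_scan;
}
```

A pending amount of 300 runs two batches and leaves 44. If the callback gives up after the first batch, 172 is left. An amount of 100 never reaches the batch size, so nothing is called at all.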

6. Part 5: Cleanup and return

                shrinker->nr += total_scan;
        }
        up_read(&shrinker_rwsem);
        return 0;
}

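Wrap-up

  1. Save the remainder: shrinker->nr += total_scan

    • The amount not processed is stored back in the shrinker for the next call
    • This makes reclaim incremental and keeps pending work from being lost
  2. Release the lock: up_read(&shrinker_rwsem)

    • Releases the read lock on the shrinker list
  3. Return: return 0

    • Always returns 0; the pages freed are reported through current->reclaim_state->reclaimed_slab, which try_to_free_pages() reads after the call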
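
The carry-over of the remainder is what lets several small requests add up to one real batch. The sketch below models a single shrinker across repeated visits. The struct and function names are hypothetical, and the delta is passed in directly instead of being computed.

```c
#define MODEL_SHRINK_BATCH 128

struct model_shrinker {
        long nr;        /* pending amount carried between visits */
};

/* One visit to this shrinker; returns the number of batches executed. */
int model_visit(struct model_shrinker *s, long delta)
{
        long total_scan;
        int batches = 0;

        s->nr += delta;
        total_scan = s->nr;
        s->nr = 0;
        while (total_scan >= MODEL_SHRINK_BATCH) {
                batches++;
                total_scan -= MODEL_SHRINK_BATCH;
        }
        s->nr += total_scan;    /* carry the remainder to the next visit */
        return batches;
}
```

Three visits with a delta of 50 each run 0, 0 and then 1 batch, leaving 22 pending. Without the final addition the first two requests would simply be forgotten.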

The following summarizes the complete page reclaim workflow across the functions analyzed above.

Complete memory reclaim workflow

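  • Memory allocation fails
  • try_to_free_pages: direct reclaim entry point
    • Initialize the priority loop, 12 -> 0
    • Set the scan control parameters
  • shrink_caches: coordinates reclaim across zones
    • Walk all zones
    • Zone has physical memory? If not, skip the empty zone
    • Update the zone priority
    • Zone unreclaimable and priority not the default? If so, skip the unreclaimable zone
    • shrink_zone, then continue with the next zone
  • shrink_zone: zone-level reclaim
    • Compute the active and inactive scan amounts
    • Active list loop: refill_inactive_zone (active -> inactive)
      • lru_add_drain: drain the LRU caches
      • Take pages off the active list
      • Compute the policy parameters: distress = 100 >> prev_priority, mapped_ratio = share of mapped memory, swap_tendency = combined score
      • swap_tendency >= 100? reclaim_mapped = 1, otherwise reclaim_mapped = 0
      • Classify each page (page mapped? reclaim condition met?): move it to the inactive list or put it back on the active list
      • Put pages back on the inactive list in batches, put pages back on the active list in batches
      • Update the zone statistics
    • Inactive list loop: shrink_cache (reclaims the pages)
      • Check page type and state (file page? page dirty?): swap out anonymous pages, write back to disk, or free directly
      • Update the reclaim statistics
      • Reclaim target reached? Exit early
  • shrink_slab: slab cache reclaim
    • Take the shrinker lock
    • Walk all shrinkers
    • Compute the amount delta
    • Batch loop: call the shrinker function, update statistics and reschedule
    • Batches done? Move on to the next shrinker
    • Release the lock
  • Pages reclaimed >= 32? Return success, otherwise continue with the next priority
  • Memory allocation succeeds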

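1. Summary of key function responsibilities

1.1. Top-level coordination

  • try_to_free_pages(): reclaim entry point, drives the priority loop
  • shrink_caches(): coordinates reclaim across the zones

1.2. Zone-level reclaim

  • shrink_zone(): schedules reclaim for one zone and computes the scan amounts
  • refill_inactive_zone(): moves pages from the active list to the inactive list
  • shrink_cache(): takes pages off the inactive list and hands them to shrink_list() for reclaim

1.3. Page handling

  • Page classification: mapped vs. unmapped, file-backed vs. anonymous
  • Reclaim policy: based on access frequency, memory pressure, and swap cost
  • Operations: writing back dirty pages, freeing clean pages, swapping out anonymous pages

1.4. Slab reclaim

  • shrink_slab(): schedules reclaim of kernel object caches
  • Shrinker mechanism: cache reclaimers registered by individual subsystems

1.5. Supporting functions

  • lru_add_drain(): flushes the per-CPU LRU caches
  • Statistics: the various page and reclaim counters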

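2. Reclaim policy decision matrix

Page type    Mapping state    Reclaim policy          Cost
File         Unmapped         Free directly
File         Mapped           Reclaim cautiously
Anonymous    Unmapped         Swap out, then free
Anonymous    Mapped           Avoid reclaiming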

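3. Exit conditions

  1. A single priority pass reclaims at least 32 pages (SWAP_CLUSTER_MAX): success
  2. The priority loop runs through every level without reaching that target
  3. The OOM killer is triggered as a last resort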
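
How the priority loop escalates can be sketched with a toy model. Two assumptions are made purely for illustration and are not taken from the listing in this section: each pass scans (lru_pages >> priority) + 1 pages, and a fixed percentage of the scanned pages is reclaimed. The function name and parameters are hypothetical.

```c
#define MODEL_DEF_PRIORITY     12
#define MODEL_SWAP_CLUSTER_MAX 32

/*
 * Model of the priority loop. Returns the priority at which a single pass
 * reached the target, or -1 if every priority level was exhausted.
 */
int model_priority_loop(unsigned long lru_pages, unsigned int reclaim_pct)
{
        int priority;

        for (priority = MODEL_DEF_PRIORITY; priority >= 0; priority--) {
                unsigned long scanned = (lru_pages >> priority) + 1;
                unsigned long reclaimed = scanned * reclaim_pct / 100;

                if (reclaimed >= MODEL_SWAP_CLUSTER_MAX)
                        return priority;        /* ret = 1 in the real function */
        }
        return -1;
}
```

With 40960 LRU pages and half of the scanned pages reclaimable, the passes reclaim 5, 10 and 20 pages before the pass at priority 9 reaches 40 and succeeds. With only 10 LRU pages no pass can reach 32, so the loop runs out of priorities, which corresponds to exit condition 2.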