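Introduction to the gem5 Prefetch Mechanism

This article uses the Stride prefetcher as an example to walk through the prefetch mechanism of the gem5 simulator.

gem5 version: v24.1.0.3

Run mode: SE (syscall emulation)

1. Execution Flow of a Load Instruction

According to the official gem5 documentation, a load instruction executes as follows: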

IEW::tick()->IEW::executeInsts()
  ->LSQUnit::executeLoad()
    ->StaticInst::initiateAcc()
      ->LSQ::pushRequest()
        ->LSQUnit::read()
          ->LSQRequest::buildPackets()
          ->LSQRequest::sendPacketToCache()
    ->LSQUnit::checkViolation()
DcachePort::recvTimingResp()->LSQRequest::recvTimingResp()
  ->LSQUnit::completeDataAccess()
    ->LSQUnit::writeback()
      ->StaticInst::completeAcc()
      ->IEW::instToCommit()
IEW::tick()->IEW::writebackInsts()

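We start from LSQ::pushRequest(). Once the load instruction has computed its address, LSQ::pushRequest() creates an LSQRequest. The &thread[tid] argument is the LSQUnit of the issuing thread; storing it in the request lets the response be handed back to the right unit when the data returns.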

if (htm_cmd || tlbi_cmd) {
    assert(addr == 0x0lu);
    assert(size == 8);
    request = new UnsquashableDirectRequest(&thread[tid], inst, flags);
} else if (needs_burst) {
    request = new SplitDataRequest(&thread[tid], inst, isLoad, addr,
            size, flags, data, res);
} else {
    request = new SingleDataRequest(&thread[tid], inst, isLoad, addr,
            size, flags, data, res, std::move(amo_op));
}

At this point the LSQRequest only holds a virtual address, so the address has to be translated before memory can be accessed. This step is done by initiateTranslation().

void
LSQ::SingleDataRequest::initiateTranslation()
{
    assert(_reqs.size() == 0);

    addReq(_addr, _size, _byteEnable);        // add a Request object

    if (_reqs.size() > 0) {
        _reqs.back()->setReqInstSeqNum(_inst->seqNum);
        _reqs.back()->taskId(_taskId);
        _inst->translationStarted(true);
        setState(State::Translation);
        flags.set(Flag::TranslationStarted);

        _inst->savedRequest = this;
        sendFragmentToTranslation(0);        // start address translation
    } else {
        _inst->setMemAccPredicate(false);
    }
}

void
LSQ::LSQRequest::sendFragmentToTranslation(int i)
{
    numInTranslationFragments++;
    _port.getMMUPtr()->translateTiming(req(i), _inst->thread->getTC(),  // hand off to the MMU
            this, isLoad() ? BaseMMU::Read : BaseMMU::Write);
}

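Note that LSQRequest and the Request object added here are two different things. An LSQRequest mainly keeps sender-side information, such as the reference to the LSQ unit, whereas a Request only keeps information about the memory access itself and is the object the memory system actually processes. An LSQRequest holds both its Request objects and the Packets introduced below. Specifically, it stores a vector of Request pointers (_reqs), and addReq() creates a Request and appends it to that vector. The object sent to the MMU for translation is this Request.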

class LSQRequest : public BaseMMU::Translation,
                   public Packet::SenderState
{
  public:
    LSQUnit& _port;
    const DynInstPtr _inst;
    uint32_t _taskId;
    PacketDataPtr _data;
    std::vector<PacketPtr> _packets;
    std::vector<RequestPtr> _reqs;
    std::vector<Fault> _fault;
    uint64_t* _res;
    const Addr _addr;
    const uint32_t _size;
    const Request::Flags _flags;
    std::vector<bool> _byteEnable;
    uint32_t _numOutstandingPackets;
    AtomicOpFunctorPtr _amo_op;
    bool _hasStaleTranslation;
    // ... (remaining members omitted)
};

When the memory system processes a Request, the following accessors are the ones used most often:

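bool req->hasPC()           // whether a PC is attached
Addr req->getPC()           // get the PC
bool req->hasVaddr()        // whether a virtual address is set
Addr req->getVaddr()        // get the virtual address
bool req->hasPaddr()        // whether a physical address is set
Addr req->getPaddr()        // get the physical address
bool req->isPrefetch()      // whether this is a prefetch request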

Next, LSQUnit::read() performs the store-to-load forwarding check. If an older store's address range fully covers the load's address range, the store data is copied straight to the load.
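
The coverage test behind store-to-load forwarding can be sketched in a few lines. This is a minimal illustration, not the gem5 code: `StoreEntry`, `storeCoversLoad` and `forwardFromStore` are hypothetical names, and partial overlaps (which the real LSQ has to handle separately) are simply reported as "no forwarding".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using Addr = std::uint64_t;

// Hypothetical, simplified store-queue entry (not the gem5 class).
struct StoreEntry
{
    Addr addr;                       // start address of the store
    std::vector<std::uint8_t> data;  // bytes written by the store
};

// True if [ld_addr, ld_addr + ld_size) lies entirely inside the store.
inline bool
storeCoversLoad(const StoreEntry &st, Addr ld_addr, std::size_t ld_size)
{
    return ld_addr >= st.addr &&
           ld_addr + ld_size <= st.addr + st.data.size();
}

// Copy the load's bytes out of the store data. Returns false when the
// store does not fully cover the load (no forwarding in this sketch).
inline bool
forwardFromStore(const StoreEntry &st, Addr ld_addr, std::size_t ld_size,
                 std::uint8_t *ld_data)
{
    if (!storeCoversLoad(st, ld_addr, ld_size))
        return false;
    std::memcpy(ld_data, st.data.data() + (ld_addr - st.addr), ld_size);
    return true;
}
```

For an 8-byte store at 0x1000, a 4-byte load at 0x1002 is fully covered and receives bytes 2 to 5 of the store data, while a 4-byte load at 0x1006 runs past the end of the store and is not forwarded.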

After the check, LSQRequest::buildPackets() creates the memory-access Packet. The Request created earlier is stored inside this Packet.

void
LSQ::SingleDataRequest::buildPackets()
{
    /* Retries do not create new packets. */
    if (_packets.size() == 0) {
        _packets.push_back(
                isLoad()
                    ?  Packet::createRead(req())
                    :  Packet::createWrite(req()));
        _packets.back()->dataStatic(_inst->memData);
        _packets.back()->senderState = this;
       ...
}

Commonly used Packet members and accessors:

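RequestPtr pkt->req                 // the Request carried by the packet
bool pkt->isRead()                  // whether this is a read
bool pkt->isWrite()                 // whether this is a write
bool pkt->hasData()                 // whether the packet carries data
unsigned pkt->getSize()             // access size in bytes
Addr pkt->getAddr()                 // address of the access (physical)
Addr pkt->getOffset(blkSize)        // offset within the cache block
Addr pkt->getBlockAddr(blkSize)     // block-aligned address
T* pkt->getPtr<T>()                 // pointer to the data
void pkt->setDataFromBlock(const uint8_t *blk_data, int blkSize)  // copy data from a cache line into the packet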

Finally, LSQRequest::sendPacketToCache() tries to send the packet to the cache.

void
LSQ::SingleDataRequest::sendPacketToCache()
{
    assert(_numOutstandingPackets == 0);
    if (lsqUnit()->trySendPacket(isLoad(), _packets.at(0)))
        _numOutstandingPackets = 1;
}

2. How a Memory Access Triggers Prefetching

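The cache base class lives in src/mem/cache/base.cc. After a Packet arrives, it is handled mainly by BaseCache::recvTimingReq(PacketPtr pkt), which takes different actions depending on whether the access hits or misses.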

void
BaseCache::recvTimingReq(PacketPtr pkt)
{
    satisfied = access(pkt, blk, lat, writebacks);
    if (satisfied) {
        ppHit->notify(CacheAccessProbeArg(pkt,accessor));
        handleTimingReqHit(pkt, blk, request_time);
    } else {
        handleTimingReqMiss(pkt, blk, forward_time, request_time);
        ppMiss->notify(CacheAccessProbeArg(pkt,accessor));
    }
}

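ppHit and ppMiss are probe points (ProbePoint). Different probe points fire on different events and trigger the behavior attached to them. They are created in BaseCache::regProbePoints(), each with a unique string name passed to its constructor; in the code below, ppHit is named "Hit". The prefetcher uses this unique name to locate the probe point and listen to memory accesses.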

void
BaseCache::regProbePoints()
{
    ppHit = new ProbePointArg<CacheAccessProbeArg>(
        this->getProbeManager(), "Hit");
    ppMiss = new ProbePointArg<CacheAccessProbeArg>(
        this->getProbeManager(), "Miss");
    ppFill = new ProbePointArg<CacheAccessProbeArg>(
        this->getProbeManager(), "Fill");
    ppDataUpdate =
        new ProbePointArg<CacheDataUpdateProbeArg>(
            this->getProbeManager(), "Data Update");
}

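The prefetcher creates its probe listeners (ProbeListener) in Base::regProbeListeners(). Each listener is also constructed with a string name, and a listener receives calls from the probe point whose name matches its own.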

void
Base::regProbeListeners()
{
    if (listeners.empty() && probeManager != nullptr) {
        listeners.push_back(new PrefetchListener(*this, probeManager,
                                                "Miss", false, true));
        listeners.push_back(new PrefetchListener(*this, probeManager,
                                                 "Fill", true, false));
        listeners.push_back(new PrefetchListener(*this, probeManager,
                                                 "Hit", false, false));
        listeners.push_back(new PrefetchEvictListener(*this, probeManager,
                                                 "Data Update"));
    }
}
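
The name-based pairing of probe points and listeners is essentially an observer pattern keyed by a string. The sketch below illustrates only that idea. `MiniProbeManager` is a hypothetical stand-in, not gem5's ProbeManager, and the `int` argument takes the place of the real CacheAccessProbeArg.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a probe manager: it only maps a probe point
// name to the callbacks registered under that name.
class MiniProbeManager
{
  public:
    using Callback = std::function<void(int)>;

    // Counterpart of constructing a listener with a probe point name.
    void
    addListener(const std::string &name, Callback cb)
    {
        listeners[name].push_back(std::move(cb));
    }

    // Counterpart of a probe point firing: call every listener whose
    // name matches the probe point.
    void
    notify(const std::string &name, int arg) const
    {
        auto it = listeners.find(name);
        if (it == listeners.end())
            return;  // nobody listens to this probe point
        for (const Callback &cb : it->second)
            cb(arg);
    }

  private:
    std::map<std::string, std::vector<Callback>> listeners;
};
```

A listener registered under "Miss" is called for every `notify("Miss", ...)` and never for "Hit" or "Fill", which is how the prefetcher ends up seeing only the events it subscribed to.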

Taking ppMiss->notify() as an example: it calls notify() on every listener registered under the name "Miss", which here is Base::PrefetchListener::notify(), and that in turn calls probeNotify().

void
Base::PrefetchListener::notify(const CacheAccessProbeArg &arg)
{
    if (isFill) {
        parent.notifyFill(arg);
    } else {
        parent.probeNotify(arg, miss);
    }
}

void
Base::probeNotify(const CacheAccessProbeArg &acc, bool miss)
{
    const PacketPtr pkt = acc.pkt;
    const CacheAccessor &cache = acc.cache;
    bool has_been_prefetched =
        acc.cache.hasBeenPrefetched(pkt->getAddr(), pkt->isSecure(),
                                    requestorId);
    // Verify this access type is observed by prefetcher
    if (observeAccess(pkt, miss, has_been_prefetched)) {    // filter which accesses train the prefetcher
        if (useVirtualAddresses && pkt->req->hasVaddr()) {
            PrefetchInfo pfi(pkt, pkt->req->getVaddr(), miss);
            notify(acc, pfi);        // let the prefetcher compute prefetches
        } else if (!useVirtualAddresses) {
            PrefetchInfo pfi(pkt, pkt->req->getPaddr(), miss);
            notify(acc, pfi);        // let the prefetcher compute prefetches
        }
    }
}

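Here observeAccess() decides whether the access is passed to the prefetcher at all, based on whether it missed, whether the line had been prefetched, and the configured prefetcher parameters such as prefetchOnPfHit and prefetchOnAccess. If the access is accepted, a PrefetchInfo is built for it.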
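
The hit/miss part of that decision can be modeled roughly as follows. This is a simplified sketch, assuming that misses are always candidates and that the two flags only affect hits. `ObserveConfig` and `shouldObserve` are hypothetical names, and the other filters a real prefetcher applies (instruction versus data, read versus write, uncacheable accesses) are left out, so check the gem5 source before relying on the exact rules.

```cpp
#include <cassert>

// Hypothetical configuration struct; the two field names mirror the
// prefetcher parameters mentioned in the text.
struct ObserveConfig
{
    bool prefetchOnAccess;  // also train on cache hits
    bool prefetchOnPfHit;   // on hits, train only if the line was prefetched
};

// Simplified decision: should this access be shown to the prefetcher?
inline bool
shouldObserve(const ObserveConfig &cfg, bool miss, bool has_been_prefetched)
{
    if (miss)
        return true;                 // misses are always candidates here
    if (cfg.prefetchOnPfHit)
        return has_been_prefetched;  // hit on a prefetched line only
    return cfg.prefetchOnAccess;     // otherwise hits need prefetchOnAccess
}
```

With both flags off, only misses train the prefetcher. Turning on prefetchOnPfHit additionally lets hits on previously prefetched lines through, which keeps a stream going once it has started to hit.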

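Base::notify(acc, pfi) is a pure virtual function implemented by each prefetcher. All prefetchers live under src/mem/cache/prefetcher/ and form a class hierarchy whose root is Base (exposed to Python configuration scripts as BasePrefetcher).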

Below we take the Stride prefetcher as an example and analyze how it computes prefetches. Stride inherits from Queued, whose notify() function looks like this.

void
Queued::notify(const CacheAccessProbeArg &acc, const PrefetchInfo &pfi)
{
    const PacketPtr pkt = acc.pkt;

    std::vector<AddrPriority> addresses;
    calculatePrefetch(pfi, addresses, acc.cache);        // compute prefetch candidates

    size_t max_pfs = getMaxPermittedPrefetches(addresses.size());    // cap the number of prefetches

    size_t num_pfs = 0;
    for (AddrPriority& addr_prio : addresses) {

        addr_prio.first = blockAddress(addr_prio.first);

        bool can_cross_page = (mmu != nullptr);
        if (can_cross_page || samePage(addr_prio.first, pfi.getAddr())) {
            // Page crossing is allowed, or the candidate is on the same
            // page (no address translation needed)
            PrefetchInfo new_pfi(pfi,addr_prio.first);
            insert(pkt, new_pfi, addr_prio.second, acc.cache);        // enqueue the prefetch
            num_pfs += 1;
            if (num_pfs == max_pfs) {
                break;
            }
        } else {
            DPRINTF(HWPrefetch, "Ignoring page crossing prefetch.\n");
        }
    }
}
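
The two address helpers used in this loop are simple bit and division tricks. The sketch below shows one way to write them, assuming a 64-byte cache line and a 4 KiB page. `kBlkSize` and `kPageBytes` are illustrative constants, whereas gem5 takes both values from the system configuration.

```cpp
#include <cassert>
#include <cstdint>

using Addr = std::uint64_t;

// Assumed sizes for illustration only; both must be powers of two.
constexpr Addr kBlkSize = 64;
constexpr Addr kPageBytes = 4096;

// Clear the low bits so the address points at the start of its cache line.
constexpr Addr
blockAddress(Addr a)
{
    return a & ~(kBlkSize - 1);
}

// Two addresses are on the same page when their page numbers are equal.
constexpr bool
samePage(Addr a, Addr b)
{
    return (a / kPageBytes) == (b / kPageBytes);
}
```

For example, 0x1234 aligns down to 0x1200, and a candidate at 0x2000 is not on the same page as a trigger at 0x1fc0, so without an MMU attached it would be dropped as a page-crossing prefetch.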

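The calculatePrefetch() call in Queued::notify() dispatches to the Stride subclass's implementation, whose main code is shown below. It computes a stride from the entry's lastAddr and the current access address. If this stride equals the stride stored in the entry, confidence is increased; otherwise it is decreased, and the stored stride is replaced once confidence is below the threshold. When confidence reaches the threshold, prefetch addresses are computed. Note that prefetching works at cache-line granularity: a stride smaller than blkSize is rounded up to blkSize.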

void
Stride::calculatePrefetch(const PrefetchInfo &pfi,
                                    std::vector<AddrPriority> &addresses,
                                    const CacheAccessor &cache)
{
    Addr pf_addr = useCachelineAddr ? blockAddress(pfi.getAddr())
                                    : pfi.getAddr();
    Addr pc = pfi.getPC();
    bool is_secure = pfi.isSecure();

    // Search for entry in the pc table
    const StrideEntry::KeyType key{pc, is_secure};
    StrideEntry *entry = pc_table.findEntry(key);

    if (entry != nullptr) {
        pc_table.accessEntry(entry);
        int new_stride = pf_addr - entry->lastAddr;
        bool stride_match = (new_stride == entry->stride);

        if (stride_match) {
            entry->confidence++;        // same stride: raise confidence
        } else {
            entry->confidence--;
            if (entry->confidence.calcSaturation() < threshConf) {    // low confidence: adopt the new stride
                entry->stride = new_stride;
            }
        }
        entry->lastAddr = pf_addr;

        if (entry->confidence.calcSaturation() < threshConf) {    // confidence too low: no prefetch
            return;
        }

        int prefetch_stride = entry->stride;
        if (abs(prefetch_stride) < blkSize) {
            prefetch_stride = (prefetch_stride < 0) ? -blkSize : blkSize;
        }

        Addr new_addr = pf_addr + distance * prefetch_stride;    // starting point for prefetch addresses
        // Generate up to degree prefetches
        for (int d = 1; d <= degree; d++) {
            new_addr += prefetch_stride;
            addresses.push_back(AddrPriority(new_addr, 0));        // emit a prefetch candidate
        }
    } else {
        StrideEntry* entry = pc_table.findVictim(key);                // allocate a new entry
        entry->lastAddr = pf_addr;
        pc_table.insertEntry(key, entry);
    }
}
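
To see the training behavior in isolation, here is a self-contained toy version of the same idea. It is a sketch under several assumptions, not the gem5 class: `MiniStridePrefetcher` is a hypothetical name, the table is an unbounded hash map instead of an associative table with replacement, confidence is a plain integer compared against a fixed threshold instead of a saturation ratio, and the distance parameter is taken to be 0.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <unordered_map>
#include <vector>

using Addr = std::uint64_t;

// Hypothetical, simplified table entry.
struct MiniStrideEntry
{
    Addr lastAddr = 0;
    std::int64_t stride = 0;
    int confidence = 0;  // saturates in [0, kMaxConf]
};

class MiniStridePrefetcher
{
  public:
    static constexpr int kMaxConf = 3;     // assumed counter maximum
    static constexpr int kThreshConf = 2;  // assumed confidence threshold
    static constexpr std::int64_t kBlkSize = 64;
    static constexpr int kDegree = 2;      // prefetches per trigger

    // Train on one access and return the addresses to prefetch (if any).
    std::vector<Addr>
    access(Addr pc, Addr addr)
    {
        std::vector<Addr> out;
        auto it = table.find(pc);
        if (it == table.end()) {
            table[pc].lastAddr = addr;  // first sighting: just record it
            return out;
        }

        MiniStrideEntry &e = it->second;
        const std::int64_t new_stride =
            static_cast<std::int64_t>(addr - e.lastAddr);
        if (new_stride == e.stride) {
            if (e.confidence < kMaxConf)
                ++e.confidence;
        } else {
            if (e.confidence > 0)
                --e.confidence;
            if (e.confidence < kThreshConf)
                e.stride = new_stride;  // retrain only at low confidence
        }
        e.lastAddr = addr;

        if (e.confidence < kThreshConf)
            return out;

        // Strides shorter than a cache line are widened to one line.
        std::int64_t pf_stride = e.stride;
        if (std::llabs(pf_stride) < kBlkSize)
            pf_stride = (pf_stride < 0) ? -kBlkSize : kBlkSize;

        Addr next = addr;
        for (int d = 0; d < kDegree; ++d) {
            next += static_cast<Addr>(pf_stride);
            out.push_back(next);
        }
        return out;
    }

  private:
    std::unordered_map<Addr, MiniStrideEntry> table;
};
```

With these settings, a load at one PC touching 0x1000, 0x1100, 0x1200 and 0x1300 produces nothing for the first three accesses (allocate, learn the stride, confidence 1) and then emits 0x1400 and 0x1500 on the fourth, when confidence reaches the threshold.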

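Once a candidate's PrefetchInfo (new_pfi) has been built, notify() calls insert() to work out the physical address. The code is shown below. Two things matter: whether the prefetcher computes prefetch addresses in physical or virtual address space, and whether the prefetch address falls on a different page than the access that triggered it.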

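Prefetching with physical addresses
- No page crossing: the address produced by the prefetcher is already a physical address and is used directly for the prefetch access. √ (no translation needed)
- Page crossing: the stride is added to the virtual address of the triggering access to get the prefetch's virtual address, which must be translated before the prefetch access. × (translation required)
Prefetching with virtual addresses
- No page crossing: the stride is added to the physical address of the triggering access to get the prefetch's physical address, which is used directly for the prefetch access. √ (no translation needed)
- Page crossing: the address produced by the prefetcher is the virtual address, which must be translated before the prefetch access. × (translation required)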

void
Queued::insert(const PacketPtr &pkt, PrefetchInfo &new_pfi,
               int32_t priority, const CacheAccessor &cache)
{
    /*
     * Physical address computation
     * if the prefetch is within the same page
     *   using VA: add the computed stride to the original PA
     *   using PA: no actions needed
     * if we are page crossing
     *   using VA: Create a translation request and enqueue the corresponding
     *       deferred packet to the queue of pending translations
     *   using PA: use the provided VA to obtain the target VA, then attempt to
     *     translate the resulting address
     */
    Addr orig_addr = useVirtualAddresses ?
        pkt->req->getVaddr() : pkt->req->getPaddr();                // trigger address as seen by the prefetcher
    bool positive_stride = new_pfi.getAddr() >= orig_addr;
    Addr stride = positive_stride ?
        (new_pfi.getAddr() - orig_addr) : (orig_addr - new_pfi.getAddr());    // absolute stride

    Addr target_paddr;
    bool has_target_pa = false;
    RequestPtr translation_req = nullptr;
    if (samePage(orig_addr, new_pfi.getAddr())) {        // prefetch and trigger are on the same page
        if (useVirtualAddresses) {
            // Virtual-address prefetching: apply the stride to the
            // trigger's physical address.
            target_paddr = positive_stride ? (pkt->req->getPaddr() + stride) :
                (pkt->req->getPaddr() - stride);
        } else {
            // Physical-address prefetching: the prefetch address is
            // already the physical address.
            target_paddr = new_pfi.getAddr();
        }
        has_target_pa = true;
    } else {        // page crossing: translate first, then prefetch
        if (useVirtualAddresses) {
            // Virtual-address prefetching: the prefetch address is the
            // virtual address to translate.
            has_target_pa = false;
            translation_req = createPrefetchRequest(new_pfi.getAddr(), new_pfi,
                                                    pkt);
        } else if (pkt->req->hasVaddr()) {
            // Physical-address prefetching: add the stride to the trigger's
            // virtual address to get the prefetch's virtual address.
            has_target_pa = false;
            Addr target_vaddr = positive_stride ?
                (pkt->req->getVaddr() + stride) :
                (pkt->req->getVaddr() - stride);
            translation_req = createPrefetchRequest(target_vaddr, new_pfi,
                                                    pkt);
        } else {
            return;
        }
    }

    DeferredPacket dpp(this, new_pfi, 0, priority, cache);
    if (has_target_pa) {
        // Physical address known: insert into the prefetch queue pfq
        Tick pf_time = curTick() + clockPeriod() * latency;
        dpp.createPkt(target_paddr, blkSize, requestorId, tagPrefetch,
                      pf_time);
        addToQueue(pfq, dpp);
    } else {
        // No physical address yet: queue in pfqMissingTranslation for translation
        dpp.setTranslationRequest(translation_req);
        dpp.tc = system->threads[translation_req->contextId()];
        addToQueue(pfqMissingTranslation, dpp);
    }
}
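
The four address cases handled by insert() collapse into one small function if the stride is treated as a wrapping unsigned delta. The sketch below is an illustration of that reasoning, not the gem5 code: `PfTarget` and `resolveTarget` are hypothetical names, a 4 KiB page is assumed, and the triggering access is assumed to carry a valid virtual address (gem5 gives up on the prefetch when it does not).

```cpp
#include <cassert>
#include <cstdint>

using Addr = std::uint64_t;

constexpr Addr kPageBytes = 4096;  // assumed page size

// Hypothetical result type: either a ready physical address or a virtual
// address that still has to go through translation.
struct PfTarget
{
    bool needsTranslation;
    Addr addr;  // physical if !needsTranslation, virtual otherwise
};

inline bool
samePage(Addr a, Addr b)
{
    return (a / kPageBytes) == (b / kPageBytes);
}

// trig_vaddr / trig_paddr: addresses of the access that triggered the
// prefetch. pf_addr: prefetcher output, in the address space selected by
// use_virtual.
inline PfTarget
resolveTarget(bool use_virtual, Addr trig_vaddr, Addr trig_paddr,
              Addr pf_addr)
{
    const Addr orig = use_virtual ? trig_vaddr : trig_paddr;
    // Unsigned wrap-around makes this delta work for negative strides too.
    const Addr delta = pf_addr - orig;

    if (samePage(orig, pf_addr)) {
        // Same page: only the page offset changes, so the physical target
        // is the trigger's physical address plus the delta.
        return {false, trig_paddr + delta};
    }
    // Page crossing: rebuild the target in virtual space and translate it.
    return {true, trig_vaddr + delta};
}
```

The same-page shortcut is valid because virtual and physical addresses share their page offset, so adding an in-page delta to the physical address gives the same result a translation would. Once the delta leaves the page, the physical frame is unknown and only the virtual address can be computed.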

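Entries in pfq are issued when the cache has nothing more urgent to send: the prefetch address is allocated an MSHR, which triggers the actual memory access. getNextQueueEntry() first services the MSHR queue and the write buffer; only when both have nothing to send does it take a prefetch and allocate an MSHR for it.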

void
BaseCache::CacheReqPacketQueue::sendDeferredPacket()
{
    assert(!waitingOnRetry);
    assert(deferredPacketReadyTime() == MaxTick);

    // check for request packets (requests & writebacks)
    // Get the next entry to send: a demand miss, a writeback, or a packet
    // from pfq (prefetches in pfqMissingTranslation are translated first)
    QueueEntry* entry = cache.getNextQueueEntry();

    if (!entry) {
    } else {
        if (checkConflictingSnoop(entry->getTarget()->pkt)) {
            return;
        }
        waitingOnRetry = entry->sendPacket(cache);    // send the packet
    }
    if (!waitingOnRetry) {
        schedSendEvent(cache.nextQueueReadyTime());
    }
}


QueueEntry*
BaseCache::getNextQueueEntry()
{
    MSHR *miss_mshr  = mshrQueue.getNext();
    WriteQueueEntry *wq_entry = writeBuffer.getNext();

    if (wq_entry && (writeBuffer.isFull() || !miss_mshr)) {
        MSHR *conflict_mshr = mshrQueue.findPending(wq_entry);

        if (conflict_mshr && conflict_mshr->order < wq_entry->order) {
            return conflict_mshr;
        }
        return wq_entry;
    } else if (miss_mshr) {
        WriteQueueEntry *conflict_mshr = writeBuffer.findPending(miss_mshr);
        if (conflict_mshr) {
            return conflict_mshr;
        }
        return miss_mshr;
    }

    // Try to issue a prefetch
    assert(!miss_mshr && !wq_entry);
    if (prefetcher && mshrQueue.canPrefetch() && !isBlocked()) {

        PacketPtr pkt = prefetcher->getPacket();    // take a packet from pfq
        if (pkt) {
            Addr pf_addr = pkt->getBlockAddr(blkSize);
            // Drop the prefetch if the block is already in the cache
            if (tags->findBlock({pf_addr, pkt->isSecure()})) {
                prefetcher->pfHitInCache();
                delete pkt;
            } else if (mshrQueue.findMatch(pf_addr, pkt->isSecure())) {
                prefetcher->pfHitInMSHR();
                delete pkt;
            } else if (writeBuffer.findMatch(pf_addr, pkt->isSecure())) {
                prefetcher->pfHitInWB();
                delete pkt;
            } else {
                assert(pkt->req->requestorId() < system->maxRequestors());
                stats.cmdStats(pkt).mshrMisses[pkt->req->requestorId()]++;

                return allocateMissBuffer(pkt, curTick(), false);    // allocate an MSHR for the prefetch
            }
        }
    }

    return nullptr;
}
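
The redundancy checks applied to a prefetch before it gets an MSHR can be modeled with three lookups in a fixed order. This is a minimal sketch with hypothetical names (`MiniCacheState`, `PfOutcome`, `filterPrefetch`). The real cache also matches on the secure bit and updates statistics, which are omitted here.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

using Addr = std::uint64_t;

// Hypothetical outcome labels for one prefetch candidate.
enum class PfOutcome { HitInCache, HitInMSHR, HitInWriteBuffer, Issued };

// Hypothetical cache state: each set holds block-aligned addresses.
struct MiniCacheState
{
    std::set<Addr> cachedBlocks;
    std::set<Addr> mshrBlocks;
    std::set<Addr> writeBufferBlocks;
};

// Apply the three redundancy checks in order. Only a prefetch that passes
// all of them gets an MSHR (modeled by inserting into mshrBlocks).
inline PfOutcome
filterPrefetch(MiniCacheState &st, Addr pf_blk_addr)
{
    if (st.cachedBlocks.count(pf_blk_addr))
        return PfOutcome::HitInCache;
    if (st.mshrBlocks.count(pf_blk_addr))
        return PfOutcome::HitInMSHR;
    if (st.writeBufferBlocks.count(pf_blk_addr))
        return PfOutcome::HitInWriteBuffer;
    st.mshrBlocks.insert(pf_blk_addr);
    return PfOutcome::Issued;
}
```

Issuing the same block twice shows the effect: the first call allocates an entry and the second is filtered as a hit in the MSHR queue, so duplicate prefetches never reach the next memory level.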

Finally, when the next cache level returns the data, BaseCache::recvTimingResp(pkt) is called and fills the cache, which completes the prefetch. The main steps are shown below.

void
BaseCache::recvTimingResp(PacketPtr pkt)
{
    MSHR *mshr = dynamic_cast<MSHR*>(pkt->popSenderState());

    const QueueEntry::Target *initial_tgt = mshr->getTarget();

    bool is_fill = !mshr->isForward &&
        (pkt->isRead() || pkt->cmd == MemCmd::UpgradeResp ||
         mshr->wasWholeLineWrite);

    CacheBlk *blk = tags->findBlock({pkt->getAddr(), pkt->isSecure()});

    if (is_fill && !is_error) {
        const bool allocate = (writeAllocator && mshr->wasWholeLineWrite) ?
            writeAllocator->allocate() : mshr->allocOnFill();
        blk = handleFill(pkt, blk, writebacks, allocate);        // fill the cache
    }
    delete pkt;
}

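Prefetch packets without a physical address, on the other hand, sit in the pfqMissingTranslation queue. Queued::processMissingTranslations() starts their translation; when it finishes, Queued::translationComplete() moves them into pfq, after which they follow the MSHR allocation and memory access path described above.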

void
Queued::translationComplete(DeferredPacket *dp, bool failed,
                            const CacheAccessor &cache)
{
    auto it = pfqMissingTranslation.begin();
    while (it != pfqMissingTranslation.end()) {
        if (&(*it) == dp) {
            break;
        }
        it++;
    }

    if (!failed) {
        Addr target_paddr = it->translationRequest->getPaddr();
        if (cacheSnoop &&
                (cache.inCache(target_paddr, it->pfInfo.isSecure()) ||
                 cache.inMissQueue(target_paddr, it->pfInfo.isSecure()))) {
            statsQueued.pfInCache++;
        } else {
            Tick pf_time = curTick() + clockPeriod() * latency;
            it->createPkt(target_paddr, blkSize, requestorId, tagPrefetch,
                          pf_time);
            addToQueue(pfq, *it);        // insert into pfq with the translated physical address
        }
    }
    pfqMissingTranslation.erase(it);
}