SwiftClockCache: Design and Implementation of a High-Performance Concurrent Cache

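1. Background: Where Traditional Caches Struggle Under Multithreaded Concurrency

In modern high-performance systems, an in-memory cache is a key component for speeding up data access. Traditional eviction algorithms, however, run into serious problems once many threads hit the cache concurrently.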

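1.1 Concurrency Bottlenecks of LRU Caches

LRU (Least Recently Used) is the classic cache eviction policy: the entry that has gone unused the longest is evicted first. The typical implementation pairs a hash table with a doubly linked list: the hash table provides O(1) lookup, and the list maintains access order.

This structure has inherent concurrency bottlenecks in a multithreaded environment:

Every read writes to the list: even a Lookup, which is read-only in meaning, must move the hit node to the head of the list (a splice). A read is therefore really a write and has to be protected by a lock.
Heavy lock contention that sharding does not cure: within one LRU instance the list is shared, so every reader and writer competes for the same mutex. Sharding (Sharded LRU) spreads the load, but each shard still needs its own mutex, and when hot keys concentrate in a few shards, lock contention remains the main bottleneck.
Poor cache locality: list nodes are scattered across memory, so frequent pointer chasing lowers the CPU cache hit rate.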

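1.2 Summary of the Problems with Traditional LRU

Problem	Symptom
Reads require writes	Every Lookup has to move a list node
Lock granularity	One mutex per shard; hot shards contend heavily
Memory layout	List nodes are scattered; unfriendly to CPU caches
Scan resistance	❌ Scan traffic flushes the list directly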

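1.3 The Clock Algorithm: An Eviction Policy That Suits Concurrency

The Clock algorithm (also known as Second-Chance) comes from operating-system page replacement and is an efficient approximation of LRU. Its main advantages are:

No linked list: a fixed-size circular array plus a clock hand, which maps naturally onto a contiguous memory layout.
Wait-free reads: an access is marked with a single fetch_add (an atomic increment of AcquireCounter); no node is moved and no CAS retry loop is needed.
Parallel writes at Slot granularity: inserts and evictions synchronize on a single Slot rather than on a whole shard, so multiple threads can write to different Slots at the same time without blocking one another.
Parallel eviction: the clock hand advances with an atomic fetch_add, so each thread receives its own range of Slots to scan, and multiple threads can evict at the same time without conflict.
Scan resistance: the "second chance" mechanism helps protect hot data.

These properties make the Clock algorithm a good fit for building a high-performance lock-free cache.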

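1.4 RocksDB HyperClockCache and Its Relationship to This Project

SwiftClockCache's design is directly inspired by RocksDB's HyperClockCache and can be seen as a simplified, standalone version of it.

RocksDB is a high-performance embedded key-value storage engine open-sourced by Facebook. HyperClockCache is a newer cache implementation that the RocksDB team introduced in v7.7+; under high concurrency it performs significantly better than the older LRUCache. Some published benchmark results: https://smalldatum.blogspot.com/2022/10/hyping-hyper-clock-cache-in-rocksdb.html

However, RocksDB's HyperClockCache depends heavily on RocksDB internals and is hard to use as a standalone component. SwiftClockCache therefore keeps the core algorithmic design of HyperClockCache and reimplements it with no external dependencies beyond the C++17 standard library and xxHash, so it can be used directly as a general-purpose cache component.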

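2. Project Overview

SwiftClockCache is a high-performance, thread-safe C++17 in-memory cache library based on the Clock eviction algorithm. It combines an open-addressing hash table, sharding, and atomic operations to reduce contention to a very fine granularity under heavy multithreaded load, making insert, lookup, erase, and eviction both safe and efficient.

Core Features

Very low-contention concurrency: reads are wait-free, and writes within a shard run in parallel at Slot granularity, so multiple threads can insert into and evict from different Slots at the same time without blocking one another
Clock eviction: better scan resistance than LRU
TTL support: each cache entry can have its own expiration time
Sharded architecture: multiple shards reduce contention
RAII Handle: the Handle returned by a lookup manages the reference count automatically and releases it when it goes out of scope, so users do not need to wrap values in shared_ptr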

3. Overall Architecture

The project is layered as follows:

┌─────────────────────────────────────────────────────────────────┐
│                SwiftClockCache<Key, Value>                       │
│                                                                 │
│  ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│  │ ClockCacheShard  │ │ ClockCacheShard  │ │ ClockCacheShard  │ │
│  │    Shard 0       │ │    Shard 1       │ │    Shard N-1     │ │
│  └────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘ │
│           │                    │                    │           │
│  ┌────────▼─────────┐ ┌────────▼─────────┐ ┌────────▼─────────┐ │
│  │ FixedClockTable  │ │ FixedClockTable  │ │ FixedClockTable  │ │
│  │                  │ │                  │ │                  │ │
│  │ ┌──┬──┬──┬──┬──┐ │ │ ┌──┬──┬──┬──┬──┐ │ │ ┌──┬──┬──┬──┬──┐ │ │
│  │ │S0│S1│S2│..│Sn│ │ │ │S0│S1│S2│..│Sn│ │ │ │S0│S1│S2│..│Sn│ │ │
│  │ └──┴──┴──┴──┴──┘ │ │ └──┴──┴──┴──┴──┘ │ │ └──┴──┴──┴──┴──┘ │ │
│  │  (Slot array;    │ │ (open-addressing │ │   hash table)    │ │
│  │   each Slot has: │ │                  │ │                  │ │
│  │   AtomicSlotMeta │ │                  │ │                  │ │
│  │   + Key + Value  │ │                  │ │                  │ │
│  │   + HashedKey)   │ │                  │ │                  │ │
│  └──────────────────┘ └──────────────────┘ └──────────────────┘ │
└───────────────────────────┬─────────────────────────────────────┘
                            │ returned by Lookup
                            ▼
                 ┌─────────────────────┐
                 │  Handle<Key, Value> │
                 │  (RAII, move-only)  │
                 │  dtor calls Release │
                 │  → calls into Shard │
                 └─────────────────────┘

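Layer	File	Responsibility
Top-level interface	`swift_clock_cache.h`	Public API, shard routing, hash computation, Handle RAII wrapper
Shard layer	`clock_cache_shard.h`	Per-shard capacity management, eviction triggering
Core engine	`fixed_clock_table.h`	Open-addressing hash table, Clock eviction, lock-free concurrency control
Infrastructure	`base.h`	Time functions, utility macros, error codes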

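4. Core Data Structures in Detail

4.1 SlotMeta: A 64-bit Atomic State Word

All concurrency control centers on SlotMeta, which packs the entire state of a Slot into 64 bits so that every state change can be made with a single atomic operation: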

┌─────────────────────────────────────────────────────────────────┐
│ 63    62       61       60      59 ─── 30    29 ─── 0          │
│ Visible Shareable Occupied (rsvd)  ReleaseCounter AcquireCounter│
│ [1bit]  [1bit]   [1bit]  [1bit]   [30 bits]      [30 bits]    │
└─────────────────────────────────────────────────────────────────┘

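Status flag bits:

Flag	Meaning
`OccupiedBit` (bit 61)	The Slot is occupied (under construction or ready)
`ShareableBit` (bit 62)	The data is fully constructed and may be read concurrently
`VisibleBit` (bit 63)	The data is visible to lookups (not marked for deletion)

State machine transitions: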

                        ┌───────────┐
                        │   Empty   │ ◄─── initial state
                        └─────┬─────┘
                              │ fetch_or(OccupiedBit)
                              ▼
                  ┌─────────────────────┐
          ┌──────►│ UnderConstruction   │──────┐
          │       └──────────┬──────────┘      │
          │                  │ Store(           │ Store(0)
          │                  │  Occupied|       │
          │                  │  Shareable|      ▼
          │                  │  Visible)   ┌─────────┐
          │                  ▼             │  Empty   │
          │          ┌───────────┐         └─────────┘
          │          │  Visible  │
          │          └─────┬─────┘
          │                │ FetchClearVisible()
          │                ▼
          │        ┌─────────────┐
          └────────│  Invisible  │
   CAS(→Under      └─────────────┘
   Construction)
   [refcount==0]

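Empty: OccupiedBit == 0; the Slot is free
UnderConstruction: Occupied=1, Shareable=0; data is being written and other threads must not read it
Visible: Occupied=1, Shareable=1, Visible=1; the normal usable state
Invisible: Occupied=1, Shareable=1, Visible=0; marked for deletion, reclaimed once the reference count drops to zero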
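
To make the bit layout and the four states concrete, here is a minimal sketch that decodes a meta word. Only the bit positions (61 to 63 for the flags, two 30-bit counters below them) come from the description above; the constant and function names are illustrative assumptions, not the library's actual identifiers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; only the bit positions follow the layout described above.
constexpr int kCounterBits = 30;
constexpr uint64_t kCounterMask = (uint64_t{1} << kCounterBits) - 1;
constexpr uint64_t kOccupiedBit = uint64_t{1} << 61;
constexpr uint64_t kShareableBit = uint64_t{1} << 62;
constexpr uint64_t kVisibleBit = uint64_t{1} << 63;

enum class SlotState { kEmpty, kUnderConstruction, kVisible, kInvisible };

// AcquireCounter lives in bits 0..29, ReleaseCounter in bits 30..59.
constexpr uint32_t AcquireCount(uint64_t meta) {
  return static_cast<uint32_t>(meta & kCounterMask);
}
constexpr uint32_t ReleaseCount(uint64_t meta) {
  return static_cast<uint32_t>((meta >> kCounterBits) & kCounterMask);
}

// Refcount = AcquireCounter - ReleaseCounter, taken modulo 2^30.
constexpr uint32_t Refcount(uint64_t meta) {
  return static_cast<uint32_t>((AcquireCount(meta) - ReleaseCount(meta)) & kCounterMask);
}

// Packs flags and both counters into one 64-bit word.
inline uint64_t MakeMeta(uint64_t flags, uint32_t acquire, uint32_t release) {
  assert(acquire <= kCounterMask && release <= kCounterMask);
  return flags | (uint64_t{release} << kCounterBits) | uint64_t{acquire};
}

// Maps a meta word onto the four states listed above.
constexpr SlotState Classify(uint64_t meta) {
  if ((meta & kOccupiedBit) == 0) return SlotState::kEmpty;
  if ((meta & kShareableBit) == 0) return SlotState::kUnderConstruction;
  return (meta & kVisibleBit) != 0 ? SlotState::kVisible : SlotState::kInvisible;
}
```

For example, a word with all three flags set, AcquireCounter 3 and ReleaseCounter 1 classifies as Visible with a refcount of 2; clearing only VisibleBit turns the same word into Invisible.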

Reference counting:

Reference counting uses a dual-counter design (AcquireCounter / ReleaseCounter), 30 bits each:

Refcount = AcquireCounter - ReleaseCounter

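The point of this design is that FetchAddAcquire and FetchAddRelease operate on different bit fields, so each is an independent atomic fetch_add. This avoids the ABA problem of the traditional fetch_add(1) / fetch_sub(1) on a single field, where an acquire followed by a release restores the word to its earlier value and a concurrent CAS cannot tell that anything happened. And because the status flags and the reference count share one 64-bit word, a single CAS can check the state and the reference count together.

The counters are also reused as the countdown for Clock eviction. When no thread holds a Slot (AcquireCounter == ReleaseCounter), the value of AcquireCounter serves as the entry's Clock "life". Every Lookup hit increments AcquireCounter (and the matching Release increments ReleaseCounter, so the two end up equal again but one higher), and when the clock hand sweeps past, ClockUpdate decrements both counters together (acq - 1); the entry is actually evicted only after the value has dropped to 0. Reference counting and the Clock countdown thus share the same fields, with no extra flag bit or separate access-frequency counter: one 64-bit atomic word carries the status flags, the reference count, and the eviction priority.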
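
The double role of the counters can be illustrated with a small sketch. It models only the counter part of the word: the flag bits and counter overflow are left out, and the names DualCounter and TryCountdown are hypothetical. It is meant to show the idea under those assumptions, not to mirror the library's code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr int kCounterBits = 30;
constexpr uint64_t kCounterMask = (uint64_t{1} << kCounterBits) - 1;
constexpr uint64_t kAcquireIncrement = uint64_t{1};
constexpr uint64_t kReleaseIncrement = uint64_t{1} << kCounterBits;

// Hypothetical stand-in for the counter part of SlotMeta
// (flag bits omitted, counter overflow not handled).
struct DualCounter {
  std::atomic<uint64_t> word{0};

  static uint32_t Acq(uint64_t w) { return static_cast<uint32_t>(w & kCounterMask); }
  static uint32_t Rel(uint64_t w) {
    return static_cast<uint32_t>((w >> kCounterBits) & kCounterMask);
  }
  uint32_t Refcount() const {
    uint64_t w = word.load(std::memory_order_acquire);
    return (Acq(w) - Rel(w)) & static_cast<uint32_t>(kCounterMask);
  }

  // Both directions are a plain fetch_add on separate bit fields.
  void Acquire() { word.fetch_add(kAcquireIncrement, std::memory_order_acquire); }
  void Release() {
    assert(Refcount() > 0);  // the caller must hold a reference
    word.fetch_add(kReleaseIncrement, std::memory_order_release);
  }

  // One clock-sweep visit: if nobody holds the entry and its countdown is
  // above zero, lower both counters by one. Returns true if it was lowered.
  bool TryCountdown() {
    uint64_t old = word.load(std::memory_order_acquire);
    if (Acq(old) != Rel(old) || Acq(old) == 0) return false;
    uint64_t lowered = old - kAcquireIncrement - kReleaseIncrement;
    return word.compare_exchange_strong(old, lowered, std::memory_order_acq_rel);
  }
};
```

After one Acquire/Release pair the refcount is back to 0 but both counters read 1, so the first sweep only lowers the countdown and a second sweep would find it at 0.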

4.2 Slot Structure

Each Slot contains:

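struct Slot {
  AtomicSlotMeta meta;                 // 64-bit atomic state word
  std::atomic<uint32_t> displacements; // displacement count (open-addressing probe chains passing through)
  std::atomic<uint32_t> expire_at;     // TTL expiration timestamp
  HashedKey hashed_key;                // 128-bit hash value
  unsigned char key_storage[sizeof(Key)];   // in-place storage for the Key
  unsigned char value_storage[sizeof(Value)]; // in-place storage for the Value
};

Key and Value are constructed in place with placement new on pre-allocated byte arrays, which avoids extra heap allocations.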
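
As an illustration of in-place construction, the sketch below wraps a byte array in a small helper. The InlineStorage class and its constructed_ flag are assumptions made for this example (in the cache itself the Slot state would play that role), and the alignas specifier is added here so that placement new is well-defined for any T.

```cpp
#include <cassert>
#include <new>
#include <utility>

// Hypothetical helper: holds one T inside a raw byte array, no heap allocation.
template <typename T>
class InlineStorage {
 public:
  // Constructs a T in place from the given arguments.
  template <typename... Args>
  T& Construct(Args&&... args) {
    assert(!constructed_);
    T* p = ::new (static_cast<void*>(bytes_)) T(std::forward<Args>(args)...);
    constructed_ = true;
    return *p;
  }

  // Runs the destructor explicitly; the bytes stay reusable.
  void Destroy() {
    assert(constructed_);
    Get().~T();
    constructed_ = false;
  }

  T& Get() { return *std::launder(reinterpret_cast<T*>(bytes_)); }
  bool constructed() const { return constructed_; }

 private:
  alignas(T) unsigned char bytes_[sizeof(T)];  // aligned for T
  bool constructed_ = false;
};
```

The same storage can be constructed, destroyed, and constructed again, which is what a Slot does each time it is reused for a new entry.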

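4.3 Displacement Counter

displacements is a key optimization field in the open-addressing hash table. It records how many other entries "passed through" the current Slot while probing.

On insert: displacements++ on every non-target Slot along the probe path
On erase: roll back along the probe path with displacements--
On lookup: an empty Slot with displacements == 0 means no entry's probe chain passes through it, so the lookup can stop early

This is a more efficient deletion strategy than tombstones: lookups never have to skip over long runs of deleted markers.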
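
The three rules above can be tried out in a single-threaded sketch. Everything here is a simplification for illustration: keys are assumed unique, the hash is the key itself, and the class and method names are hypothetical. The lookup stops at any Slot whose counter is 0, which includes the empty-Slot case described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Single-threaded sketch; keys are assumed unique and all names are hypothetical.
class DisplacementTable {
 public:
  explicit DisplacementTable(size_t size_pow2) : slots_(size_pow2) {
    assert(size_pow2 != 0 && (size_pow2 & (size_pow2 - 1)) == 0);
  }

  bool Insert(uint64_t key) {
    size_t idx = Base(key);
    for (size_t i = 0; i < slots_.size(); ++i, idx = Next(idx, key)) {
      if (!slots_[idx].key) {
        slots_[idx].key = key;
        return true;
      }
      ++slots_[idx].displacements;  // the new entry probes past this Slot
    }
    Unwind(key, slots_.size());  // table full: roll the counters back
    return false;
  }

  bool Contains(uint64_t key) const { return Find(key, nullptr).has_value(); }

  bool Erase(uint64_t key) {
    size_t steps = 0;
    std::optional<size_t> idx = Find(key, &steps);
    if (!idx) return false;
    slots_[*idx].key.reset();
    Unwind(key, steps);  // undo the increments made when it was inserted
    return true;
  }

  uint32_t Displacements(size_t idx) const { return slots_[idx].displacements; }

 private:
  struct Slot {
    std::optional<uint64_t> key;
    uint32_t displacements = 0;
  };

  // Returns the index holding key; *steps is the number of Slots passed first.
  std::optional<size_t> Find(uint64_t key, size_t* steps) const {
    size_t idx = Base(key);
    for (size_t i = 0; i < slots_.size(); ++i, idx = Next(idx, key)) {
      if (slots_[idx].key == key) {
        if (steps != nullptr) *steps = i;
        return idx;
      }
      if (slots_[idx].displacements == 0) return std::nullopt;  // nothing probed past here
    }
    return std::nullopt;
  }

  size_t Base(uint64_t key) const { return static_cast<size_t>(key) & (slots_.size() - 1); }
  size_t Next(size_t idx, uint64_t key) const {
    size_t step = static_cast<size_t>(key >> 32) | 1;  // odd step
    return (idx + step) & (slots_.size() - 1);
  }
  void Unwind(uint64_t key, size_t steps) {
    size_t idx = Base(key);
    for (size_t i = 0; i < steps; ++i, idx = Next(idx, key)) --slots_[idx].displacements;
  }

  std::vector<Slot> slots_;
};
```

With a table of 8 Slots, the keys 0, 8, and 16 all start at Slot 0 and land in Slots 0, 1, and 2. Erasing key 8 empties Slot 1 but leaves its counter at 1, so a lookup for key 16 still continues past it; no tombstone is needed.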

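5. Core Operation Flows

5.1 Hashing and Shard Routing

Hashing uses the XXH3 128-bit algorithm (initialized with a random seed) and produces a 128-bit hash HashedKey{h0, h1}:

h1: used as the base address (base) in the hash table
h0 | 1: used as the open-addressing step (increment); | 1 makes the step odd and therefore coprime with a table of size 2^n, which guarantees that probing can visit every Slot
Upper32of64(h0): used for shard routing; the top shard_bits_ bits of the upper 32 bits select the shard index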
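
The routing rules above can be sketched as follows. HashedKey and Upper32of64 are named in the text; ShardIndex, ProbeSequence, and MakeProbe are hypothetical helpers introduced for this example, and the hash values are assumed to come from the 128-bit hash (no hashing is done here).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct HashedKey {
  uint64_t h0;
  uint64_t h1;
};

inline uint32_t Upper32of64(uint64_t v) { return static_cast<uint32_t>(v >> 32); }

// Shard index: the top shard_bits bits of the upper 32 bits of h0.
inline size_t ShardIndex(const HashedKey& hk, int shard_bits) {
  assert(shard_bits >= 0 && shard_bits <= 32);
  if (shard_bits == 0) return 0;  // avoid shifting a 32-bit value by 32
  return Upper32of64(hk.h0) >> (32 - shard_bits);
}

// Probe sequence inside one table of 2^table_bits Slots (hypothetical helper).
struct ProbeSequence {
  size_t base;
  size_t increment;
  size_t mask;
  size_t At(size_t i) const { return (base + i * increment) & mask; }
};

inline ProbeSequence MakeProbe(const HashedKey& hk, int table_bits) {
  assert(table_bits > 0 && table_bits < 32);
  size_t mask = (size_t{1} << table_bits) - 1;
  // h1 picks the starting Slot; h0 | 1 is an odd step, coprime with 2^n.
  return ProbeSequence{static_cast<size_t>(hk.h1) & mask, static_cast<size_t>(hk.h0 | 1), mask};
}
```

Because the step is odd, the first 2^n probes of a sequence land on 2^n distinct Slots; with an even step, some Slots would never be reached.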

5.2 Insert Flow

  Caller                ClockCacheShard              FixedClockTable
    │                        │                            │
    │  Insert(key,val,ttl)   │                            │
    │───────────────────────►│                            │
    │                        │  check size >= capacity?   │
    │                        │──────┐                     │
    │                        │      │ yes: run eviction   │
    │                        │◄─────┘                     │
    │                        │  Evict(now, count, &data)  │
    │                        │───────────────────────────►│
    │                        │  eviction result           │
    │                        │◄───────────────────────────│
    │                        │                            │
    │                        │  DoInsert(key, val, hk,    │
    │                        │          expire_at)        │
    │                        │───────────────────────────►│
    │                        │                            │  FindSlot probing
    │                        │                            │──────┐
    │                        │                            │      │ for each Slot,
    │                        │                            │◄─────┘ call TryInsert
    │                        │                            │
    │                        │          ┌─────────────────┤
    │                        │          │ empty Slot:     │
    │                        │          │  fetch_or       │
    │                        │          │  (OccupiedBit)  │
    │                        │          │  build Key/Value│
    │                        │          │  Store(Visible) │
    │                        │          │  → return Slot* │
    │                        │          ├─────────────────┤
    │                        │          │ duplicate Key:  │
    │                        │          │  Rollback       │
    │                        │          │  displacement   │
    │                        │          │ → return nullptr│
    │                        │◄─────────┴─────────────────│
    │                        │                            │

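Key steps of TryInsert:

Atomic claim: fetch_or(OccupiedBit); if the Slot was Empty before, the claim succeeds
Construct the data: once the claim has succeeded, Key, Value, HashedKey, and ExpireAt can be written safely
Publish: Store(Visible) makes the data visible to other threads and sets the initial counters to {Acquire=1, Release=1} (refcount=0, meaning no external holder)
Duplicate detection: if probing finds a Visible Slot that already holds the same Key, set already_matches=true and return failure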
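
The claim, construct, and publish steps can be sketched on a single Slot. MiniSlot and this simplified TryInsert are assumptions for illustration: probing, duplicate detection, displacement bookkeeping, and TTL are left out, and the payload is a pair of ints rather than placement-new storage.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint64_t kAcquireOne = uint64_t{1};
constexpr uint64_t kReleaseOne = uint64_t{1} << 30;
constexpr uint64_t kOccupiedBit = uint64_t{1} << 61;
constexpr uint64_t kShareableBit = uint64_t{1} << 62;
constexpr uint64_t kVisibleBit = uint64_t{1} << 63;

// Hypothetical, reduced Slot: an atomic meta word plus a plain payload.
struct MiniSlot {
  std::atomic<uint64_t> meta{0};
  int key = 0;
  int value = 0;
};

// Returns true if this thread claimed the Slot and published the entry.
inline bool TryInsert(MiniSlot& slot, int key, int value) {
  // Step 1: atomic claim. Only the thread that sees OccupiedBit == 0 wins.
  uint64_t old = slot.meta.fetch_or(kOccupiedBit, std::memory_order_acq_rel);
  if ((old & kOccupiedBit) != 0) return false;  // already owned by someone else
  assert((old & (kShareableBit | kVisibleBit)) == 0);  // an Empty Slot has no flags

  // Step 2: the Slot is now UnderConstruction, so plain writes are safe.
  slot.key = key;
  slot.value = value;

  // Step 3: publish with counters {Acquire=1, Release=1}, i.e. refcount 0.
  slot.meta.store(kOccupiedBit | kShareableBit | kVisibleBit | kAcquireOne | kReleaseOne,
                  std::memory_order_release);
  return true;
}
```

The release store in step 3 is what makes the payload writes visible to any thread that later reads the meta word with acquire ordering; a second TryInsert on the same Slot fails at step 1 and leaves the payload untouched.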

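5.3 Lookup Flow

Walk the probe path
For each Slot, atomically increment the reference count with FetchAddAcquire(1)
Check that the state is Visible and the Key matches
On a match: check the TTL and return if not expired (the refcount is already +1, so the caller holds a reference)
On a mismatch: Unref to roll the reference count back
Stop early at an empty Slot with displacements == 0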
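
The "acquire first, check afterwards" pattern in these steps can be sketched on a single Slot. This is a simplified model with hypothetical names (MiniSlot, TryLookup, Publish); probing, TTL, and displacements are omitted. One detail is an assumption of this sketch rather than something the text states: when the Slot was not Shareable, the increment is simply left in place, on the reasoning that the next publish overwrites the whole word.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint64_t kCounterMask = (uint64_t{1} << 30) - 1;
constexpr uint64_t kAcquireOne = uint64_t{1};
constexpr uint64_t kReleaseOne = uint64_t{1} << 30;
constexpr uint64_t kOccupiedBit = uint64_t{1} << 61;
constexpr uint64_t kShareableBit = uint64_t{1} << 62;
constexpr uint64_t kVisibleBit = uint64_t{1} << 63;

struct MiniSlot {
  std::atomic<uint64_t> meta{0};
  int key = 0;
  int value = 0;
};

inline uint32_t Refcount(const MiniSlot& slot) {
  uint64_t m = slot.meta.load(std::memory_order_acquire);
  return static_cast<uint32_t>(((m & kCounterMask) - ((m >> 30) & kCounterMask)) & kCounterMask);
}

// Helper for the example: publishes an entry as Visible with counters {1, 1}.
inline void Publish(MiniSlot& slot, int key, int value) {
  slot.key = key;
  slot.value = value;
  slot.meta.store(kOccupiedBit | kShareableBit | kVisibleBit | kAcquireOne | kReleaseOne,
                  std::memory_order_release);
}

inline void Release(MiniSlot& slot) {
  assert(Refcount(slot) > 0);  // the caller must hold a reference
  slot.meta.fetch_add(kReleaseOne, std::memory_order_release);
}

// On a hit, returns the value and leaves one reference held; nullptr on a miss.
inline const int* TryLookup(MiniSlot& slot, int key) {
  // Take a reference first; this pins a Shareable entry while it is inspected.
  uint64_t old = slot.meta.fetch_add(kAcquireOne, std::memory_order_acquire);
  if ((old & kShareableBit) == 0) {
    return nullptr;  // Empty or UnderConstruction: payload must not be read
  }
  if ((old & kVisibleBit) != 0 && slot.key == key) {
    return &slot.value;  // hit: the caller now owns the reference
  }
  Release(slot);  // mismatch or Invisible: roll the reference back
  return nullptr;
}
```

A miss on a published Slot leaves the refcount at 0, while a hit leaves it at 1 until the caller releases it.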

5.4 Clock Eviction Algorithm

The Clock algorithm approximates LRU by sweeping a "clock hand" around the table in a circle:

        clock_pointer_
             │
             ▼
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐     ┌────────┐
│Slot 0│→│Slot 1│→│Slot 2│→│Slot 3│→···→│Slot N-1│─┐
└──────┘ └──────┘ └──────┘ └──────┘     └────────┘ │
    ▲                                               │
    └───────────────────────────────────────────────┘
                   (circular scan)

Decision logic of ClockUpdate:

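For each Slot the scan reaches:
1. Not Shareable → skip
2. refcount > 0 (pinned) → skip and count it in seen_pinned_count
3. Expired (TTL elapsed) → evict immediately
4. Visible and AcquireCounter > 0 → lower the counters (a "second chance"); do not evict
5. Otherwise → CAS to UnderConstruction; eviction succeeds

The "second chance" mechanism: when a Slot has been accessed (AcquireCounter > 0), the Clock scan does not evict it right away but lowers its counters instead. The Slot is evicted only when a later sweep finds the counter at 0. This gives hot data a better chance of staying in the cache.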

Step size: each fetch_add(kStepSize=4) advances the clock hand and claims a batch of 4 Slots, which reduces how often threads contend on the atomic.
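
A minimal sketch of how a shared clock pointer can hand out disjoint batches follows. kStepSize = 4 is taken from the text; ScanBatch and ClaimBatch are hypothetical names, and the table size is assumed to be a power of two so that wrapping is a simple mask.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint64_t kStepSize = 4;

// The batch of Slots one thread scans after a single fetch_add (hypothetical type).
struct ScanBatch {
  uint64_t first;
  size_t mask;
  // Index of the i-th Slot in the batch, wrapped around the table.
  size_t SlotAt(uint64_t i) const { return static_cast<size_t>(first + i) & mask; }
};

// Advances the shared clock pointer and returns the range this caller owns.
inline ScanBatch ClaimBatch(std::atomic<uint64_t>& clock_pointer, size_t table_size) {
  assert(table_size >= kStepSize && (table_size & (table_size - 1)) == 0);
  uint64_t first = clock_pointer.fetch_add(kStepSize, std::memory_order_relaxed);
  return ScanBatch{first, table_size - 1};
}
```

Two concurrent callers always receive different values from fetch_add, so their batches do not overlap until the pointer wraps around the table.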

Eviction effort cap: IsEvictionEffortExceeded bounds the scan with (freed_count + 1) * eviction_effort_cap <= seen_pinned_count, so the scan does not run without end when many entries are pinned.
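
Putting the decision logic and the effort cap together, a single-threaded sketch of one eviction run might look like this. The five ClockUpdate rules and the cap formula follow the text; the ClockAction enum, the EvictionData struct, the max_visits bound, and the simple index-by-index sweep (instead of claimed batches) are assumptions made to keep the example short.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint64_t kCounterMask = (uint64_t{1} << 30) - 1;
constexpr uint64_t kAcquireOne = uint64_t{1};
constexpr uint64_t kReleaseOne = uint64_t{1} << 30;
constexpr uint64_t kOccupiedBit = uint64_t{1} << 61;
constexpr uint64_t kShareableBit = uint64_t{1} << 62;
constexpr uint64_t kVisibleBit = uint64_t{1} << 63;

enum class ClockAction { kSkipped, kPinned, kAged, kEvicted };

// Applies the five rules above to one Slot's meta word.
inline ClockAction ClockUpdate(std::atomic<uint64_t>& meta, bool expired) {
  uint64_t old = meta.load(std::memory_order_acquire);
  if ((old & kShareableBit) == 0) return ClockAction::kSkipped;  // rule 1
  assert((old & kOccupiedBit) != 0);  // Shareable implies Occupied
  uint64_t acq = old & kCounterMask;
  uint64_t rel = (old >> 30) & kCounterMask;
  if (acq != rel) return ClockAction::kPinned;  // rule 2
  if (!expired && (old & kVisibleBit) != 0 && acq > 0) {  // rule 4
    uint64_t aged = old - kAcquireOne - kReleaseOne;
    // If the CAS fails, another thread touched the Slot; leave it for a later sweep.
    return meta.compare_exchange_strong(old, aged, std::memory_order_acq_rel)
               ? ClockAction::kAged
               : ClockAction::kSkipped;
  }
  // Rules 3 and 5: take exclusive ownership, then free the Slot.
  if (!meta.compare_exchange_strong(old, kOccupiedBit, std::memory_order_acq_rel)) {
    return ClockAction::kSkipped;
  }
  meta.store(0, std::memory_order_release);  // payload would be destroyed before this
  return ClockAction::kEvicted;
}

struct EvictionData {
  uint64_t freed_count = 0;
  uint64_t seen_pinned_count = 0;
};

inline bool IsEvictionEffortExceeded(const EvictionData& d, uint32_t eviction_effort_cap) {
  assert(eviction_effort_cap > 0);
  return (d.freed_count + 1) * eviction_effort_cap <= d.seen_pinned_count;
}

// Sweeps until `wanted` Slots are freed, the effort cap trips, or max_visits runs out.
inline EvictionData Evict(std::vector<std::atomic<uint64_t>>& metas, uint64_t wanted,
                          uint32_t eviction_effort_cap, size_t max_visits) {
  EvictionData d;
  size_t i = 0;
  for (size_t visits = 0; visits < max_visits && d.freed_count < wanted &&
                          !IsEvictionEffortExceeded(d, eviction_effort_cap);
       ++visits) {
    ClockAction action = ClockUpdate(metas[i], /*expired=*/false);
    if (action == ClockAction::kEvicted) ++d.freed_count;
    if (action == ClockAction::kPinned) ++d.seen_pinned_count;
    i = (i + 1) % metas.size();
  }
  return d;
}
```

As a worked instance of the formula, with an illustrative cap of 30 and nothing freed yet, the scan gives up once it has seen 30 pinned Slots; after freeing 2 entries the budget grows to 90. An entry inserted with counters {1, 1} survives the first sweep (its countdown drops to 0) and is evicted on the second.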

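5.5 Erase and Deferred Reclamation

Erase handles two cases:

refcount == 1 (only the current thread holds the entry): CAS directly to UnderConstruction → free → mark Empty
refcount > 1 (other threads still hold the entry): only FetchClearVisible() marks it Invisible; the actual reclamation is deferred until the last holder calls Release

This deferred reclamation ensures that data still in use is never freed early, which is key to memory safety in a lock-free design.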
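
The two cases can be sketched on a single Slot as below. This is an illustration under assumptions, not the library's code: MiniSlot, TryReclaim, and the payload_alive flag (standing in for the constructed Key and Value) are hypothetical, and the erasing thread is assumed to already hold one reference of its own.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint64_t kCounterMask = (uint64_t{1} << 30) - 1;
constexpr uint64_t kAcquireOne = uint64_t{1};
constexpr uint64_t kReleaseOne = uint64_t{1} << 30;
constexpr uint64_t kOccupiedBit = uint64_t{1} << 61;
constexpr uint64_t kShareableBit = uint64_t{1} << 62;
constexpr uint64_t kVisibleBit = uint64_t{1} << 63;

struct MiniSlot {
  std::atomic<uint64_t> meta{0};
  bool payload_alive = false;  // stands in for the constructed Key/Value
};

inline uint32_t RefcountOf(uint64_t m) {
  return static_cast<uint32_t>(((m & kCounterMask) - ((m >> 30) & kCounterMask)) & kCounterMask);
}

inline void Publish(MiniSlot& s) {
  s.payload_alive = true;
  s.meta.store(kOccupiedBit | kShareableBit | kVisibleBit | kAcquireOne | kReleaseOne,
               std::memory_order_release);
}

inline void Acquire(MiniSlot& s) { s.meta.fetch_add(kAcquireOne, std::memory_order_acquire); }

// Takes exclusive ownership if meta still equals `expected`, then frees the Slot.
// On failure, `expected` is refreshed with the current meta value.
inline bool TryReclaim(MiniSlot& s, uint64_t& expected) {
  if (!s.meta.compare_exchange_strong(expected, kOccupiedBit, std::memory_order_acq_rel)) {
    return false;
  }
  s.payload_alive = false;                     // destroy Key/Value
  s.meta.store(0, std::memory_order_release);  // back to Empty
  return true;
}

inline void Release(MiniSlot& s) {
  uint64_t old = s.meta.fetch_add(kReleaseOne, std::memory_order_acq_rel);
  assert(RefcountOf(old) > 0);  // the caller must have held a reference
  uint64_t now = old + kReleaseOne;
  bool invisible = (now & kShareableBit) != 0 && (now & kVisibleBit) == 0;
  if (invisible && RefcountOf(now) == 0) TryReclaim(s, now);  // last holder cleans up
}

// Erase by a thread that already holds one reference to the entry.
inline void Erase(MiniSlot& s) {
  uint64_t old = s.meta.fetch_and(~kVisibleBit, std::memory_order_acq_rel) & ~kVisibleBit;
  while (RefcountOf(old) == 1) {      // only our own reference remains
    if (TryReclaim(s, old)) return;   // case 1: reclaim immediately
  }
  Release(s);  // case 2: others still hold it; the last Release reclaims
}
```

With a single holder, Erase frees the Slot at once. With two holders, Erase only hides the entry, and the payload stays alive until the second holder calls Release.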

6. Handle: RAII Reference Management

Handle<Key, Value> is the main interface through which users interact with the cache:

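auto h = cache.Lookup(key);
if (h) {
    Value& v = *h;      // operator* dereferences to the Value
    h->member;           // operator-> accesses a member
    h.GetKey();          // returns the Key
}
// h goes out of scope → the destructor calls shard_->Release(slot_) automatically

No copying: Handle(const Handle&) = delete
Movable: ownership is transferred and the source Handle becomes empty
Manual release: Reset() releases the reference early

This design keeps the reference count correct: each Handle corresponds to exactly one Acquire and one Release.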
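
The ownership rules above can be reproduced with a small move-only wrapper. This is a sketch, not the library's Handle: FakeShard is a hypothetical stand-in that only counts outstanding references, and the value type is fixed to int to keep the example short.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for a shard: it only counts outstanding references.
struct FakeShard {
  int outstanding = 0;
  void Release(int* /*slot*/) {
    assert(outstanding > 0);
    --outstanding;
  }
};

// Move-only RAII handle: exactly one Release per acquired reference.
class Handle {
 public:
  Handle() = default;
  Handle(FakeShard* shard, int* slot) : shard_(shard), slot_(slot) {}

  Handle(const Handle&) = delete;
  Handle& operator=(const Handle&) = delete;

  Handle(Handle&& other) noexcept
      : shard_(std::exchange(other.shard_, nullptr)),
        slot_(std::exchange(other.slot_, nullptr)) {}

  Handle& operator=(Handle&& other) noexcept {
    if (this != &other) {
      Reset();  // drop the reference currently held, if any
      shard_ = std::exchange(other.shard_, nullptr);
      slot_ = std::exchange(other.slot_, nullptr);
    }
    return *this;
  }

  ~Handle() { Reset(); }

  // Releases the reference early; calling it again is a no-op.
  void Reset() {
    if (slot_ != nullptr) {
      shard_->Release(slot_);
      shard_ = nullptr;
      slot_ = nullptr;
    }
  }

  explicit operator bool() const { return slot_ != nullptr; }
  int& operator*() const {
    assert(slot_ != nullptr);
    return *slot_;
  }

 private:
  FakeShard* shard_ = nullptr;
  int* slot_ = nullptr;
};
```

Moving a Handle transfers the single pending Release to the destination, so the count drops exactly once no matter how many times the Handle is moved.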

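7. Benchmark

The test code below was written entirely by Claude Opus 4.6 at my request. It compares SwiftClockCache against an ordinary sharded LRU on both performance and hit rate.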

/*
 * SwiftClockCache vs LRU Cache: performance and hit-rate comparison benchmark
 *
 * Build (using g++ as an example):
 *   g++ -std=c++17 -O2 -o benchmark benchmark.cc xxhash.cc -lpthread
 *
 * Run:
 *   ./benchmark
 */

#include <algorithm>
#include <atomic>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <list>
#include <memory>
#include <mutex>
#include <numeric>
#include <random>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

#include "include/swift_clock_cache.h"

using namespace swiftclockcache;

// ============================================================================
// Single-shard LRU cache (internal use, guarded by a mutex)
// ============================================================================
template <typename Key, typename Value>
class LRUCacheShard {
 public:
  explicit LRUCacheShard(size_t capacity) : capacity_(capacity) {}

  bool Insert(const Key& key, const Value& value) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = map_.find(key);
    if (it != map_.end()) {
      list_.erase(it->second);
      list_.push_front({key, value});
      it->second = list_.begin();
      return true;
    }
    if (map_.size() >= capacity_) {
      auto& back = list_.back();
      map_.erase(back.first);
      list_.pop_back();
    }
    list_.push_front({key, value});
    map_[key] = list_.begin();
    return true;
  }

  bool Lookup(const Key& key, Value* out_value = nullptr) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = map_.find(key);
    if (it == map_.end()) {
      return false;
    }
    list_.splice(list_.begin(), list_, it->second);
    if (out_value) {
      *out_value = it->second->second;
    }
    return true;
  }

  void Erase(const Key& key) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = map_.find(key);
    if (it != map_.end()) {
      list_.erase(it->second);
      map_.erase(it);
    }
  }

  size_t GetSize() {
    std::lock_guard<std::mutex> lock(mu_);
    return map_.size();
  }

 private:
  using ListType = std::list<std::pair<Key, Value>>;
  size_t capacity_;
  ListType list_;
  std::unordered_map<Key, typename ListType::iterator> map_;
  std::mutex mu_;
};

// ============================================================================
// Sharded LRU cache (same sharding strategy as SwiftClockCache, for a fair comparison)
// ============================================================================
template <typename Key, typename Value>
class ShardedLRUCache {
 public:
  explicit ShardedLRUCache(size_t capacity, size_t num_shards = 16) {
    num_shards_ = RoundUpToPowerOf2(num_shards);
    shard_bits_ = FloorLog2(num_shards_);
    size_t cap_per_shard = std::max(size_t{1}, capacity / num_shards_);
    shards_.reserve(num_shards_);
    for (size_t i = 0; i < num_shards_; i++) {
      shards_.emplace_back(std::make_unique<LRUCacheShard<Key, Value>>(cap_per_shard));
    }
  }

  bool Insert(const Key& key, const Value& value) {
    return GetShard(key).Insert(key, value);
  }

  bool Lookup(const Key& key, Value* out_value = nullptr) {
    return GetShard(key).Lookup(key, out_value);
  }

  void Erase(const Key& key) {
    GetShard(key).Erase(key);
  }

  size_t GetSize() {
    size_t total = 0;
    for (auto& s : shards_) total += s->GetSize();
    return total;
  }

 private:
  LRUCacheShard<Key, Value>& GetShard(const Key& key) {
    // Use the XXH3 hash (same as SwiftClockCache) so keys spread evenly across shards
    auto h128 = XXH3_128bits(&key, sizeof(Key));
    uint64_t h = h128.low64;
    if (shard_bits_ == 0) return *shards_[0];
    uint32_t upper = static_cast<uint32_t>(h >> 32);
    size_t idx = upper >> (32 - shard_bits_);
    return *shards_[idx];
  }

  size_t num_shards_;
  int shard_bits_;
  std::vector<std::unique_ptr<LRUCacheShard<Key, Value>>> shards_;
};

// ============================================================================
// Zipf distribution generator (simulates realistic access patterns)
// ============================================================================
class ZipfGenerator {
 public:
  ZipfGenerator(size_t n, double alpha, uint64_t seed = 42) : n_(n), alpha_(alpha), rng_(seed) {
    // Precompute the CDF
    double c = 0.0;
    for (size_t i = 1; i <= n; i++) {
      c += 1.0 / std::pow(static_cast<double>(i), alpha);
    }
    c = 1.0 / c;

    cdf_.resize(n + 1, 0.0);
    for (size_t i = 1; i <= n; i++) {
      cdf_[i] = cdf_[i - 1] + c / std::pow(static_cast<double>(i), alpha);
    }
  }

  size_t Next() {
    double u = dist_(rng_);
    // Binary search over the CDF
    auto it = std::lower_bound(cdf_.begin(), cdf_.end(), u);
    size_t idx = std::distance(cdf_.begin(), it);
    if (idx == 0) idx = 1;
    if (idx > n_) idx = n_;
    return idx - 1;  // returns a value in [0, n-1]
  }

 private:
  size_t n_;
  double alpha_;
  std::mt19937_64 rng_;
  std::uniform_real_distribution<double> dist_{0.0, 1.0};
  std::vector<double> cdf_;
};

// ============================================================================
// Timing helper
// ============================================================================
class Timer {
 public:
  void Start() { start_ = std::chrono::high_resolution_clock::now(); }
  void Stop() { end_ = std::chrono::high_resolution_clock::now(); }
  double ElapsedMs() const {
    return std::chrono::duration_cast<std::chrono::microseconds>(end_ - start_).count() / 1000.0;
  }
  double ElapsedUs() const {
    return static_cast<double>(std::chrono::duration_cast<std::chrono::microseconds>(end_ - start_).count());
  }

 private:
  std::chrono::high_resolution_clock::time_point start_, end_;
};

// ============================================================================
// Output formatting helpers
// ============================================================================
void PrintSeparator() { std::cout << std::string(80, '-') << std::endl; }

void PrintHeader(const std::string& title) {
  std::cout << std::endl;
  std::cout << "========================================" << std::endl;
  std::cout << " " << title << std::endl;
  std::cout << "========================================" << std::endl;
}

void PrintResult(const std::string& cache_name, double time_ms, size_t ops, size_t hits, size_t total) {
  double throughput = static_cast<double>(ops) / (time_ms / 1000.0);
  double hit_rate = total > 0 ? 100.0 * hits / total : 0.0;
  double latency_ns = (time_ms * 1e6) / static_cast<double>(ops);

  std::cout << std::left << std::setw(20) << cache_name << " | "
            << "time: " << std::right << std::setw(8) << std::fixed << std::setprecision(2) << time_ms << " ms | "
            << "throughput: " << std::setw(10) << std::setprecision(0) << throughput << " ops/s | "
            << "latency: " << std::setw(6) << std::setprecision(1) << latency_ns << " ns/op | "
            << "hit rate: " << std::setw(6) << std::setprecision(2) << hit_rate << "%" << std::endl;
}

// ============================================================================
// Benchmark 1: uniform random access pattern
// ============================================================================
void BenchUniformRandom() {
  PrintHeader("Benchmark 1: uniform random access");

  const size_t kCacheSize = 10000;
  const size_t kKeyRange = 100000;  // key range is far larger than the cache capacity → hit rate around 10%
  const size_t kOps = 500000;

  std::cout << "  cache capacity=" << kCacheSize << " key range=[0," << kKeyRange << ") ops=" << kOps << std::endl;
  PrintSeparator();

  // --- SwiftClockCache ---
  {
    SwiftClockCache<int, int>::Options opts;
    opts.max_size = kCacheSize;
    opts.num_shards = 16;
    SwiftClockCache<int, int> cache(opts);

    std::mt19937_64 rng(12345);
    std::uniform_int_distribution<int> dist(0, kKeyRange - 1);

    // Warm up
    for (size_t i = 0; i < kCacheSize; i++) {
      cache.Insert(dist(rng), 0);
    }

    rng.seed(99999);
    size_t hits = 0;
    Timer timer;
    timer.Start();

    for (size_t i = 0; i < kOps; i++) {
      int key = dist(rng);
      auto h = cache.Lookup(key);
      if (h) {
        hits++;
      } else {
        cache.Insert(key, key);
      }
    }

    timer.Stop();
    PrintResult("SwiftClockCache", timer.ElapsedMs(), kOps, hits, kOps);
  }

  // --- Sharded LRU Cache ---
  {
    ShardedLRUCache<int, int> cache(kCacheSize, 16);

    std::mt19937_64 rng(12345);
    std::uniform_int_distribution<int> dist(0, kKeyRange - 1);

    // Warm up
    for (size_t i = 0; i < kCacheSize; i++) {
      cache.Insert(dist(rng), 0);
    }

    rng.seed(99999);
    size_t hits = 0;
    Timer timer;
    timer.Start();

    for (size_t i = 0; i < kOps; i++) {
      int key = dist(rng);
      if (cache.Lookup(key)) {
        hits++;
      } else {
        cache.Insert(key, key);
      }
    }

    timer.Stop();
    PrintResult("ShardedLRU(16)", timer.ElapsedMs(), kOps, hits, kOps);
  }
}

// ============================================================================
// Benchmark 2: Zipf-distributed access pattern (simulates realistic hot-key access)
// ============================================================================
void BenchZipf() {
  PrintHeader("Benchmark 2: Zipf-distributed access (alpha=0.99)");

  const size_t kCacheSize = 10000;
  const size_t kKeyRange = 100000;
  const size_t kOps = 500000;

  std::cout << "  cache capacity=" << kCacheSize << " key range=[0," << kKeyRange << ") ops=" << kOps << std::endl;
  PrintSeparator();

  // Pre-generate the Zipf key sequence (both caches replay the same sequence for a fair comparison)
  ZipfGenerator zipf(kKeyRange, 0.99, 42);
  std::vector<int> keys(kOps);
  for (size_t i = 0; i < kOps; i++) {
    keys[i] = static_cast<int>(zipf.Next());
  }

  // --- SwiftClockCache ---
  {
    SwiftClockCache<int, int>::Options opts;
    opts.max_size = kCacheSize;
    opts.num_shards = 16;
    SwiftClockCache<int, int> cache(opts);

    size_t hits = 0;
    Timer timer;
    timer.Start();

    for (size_t i = 0; i < kOps; i++) {
      int key = keys[i];
      auto h = cache.Lookup(key);
      if (h) {
        hits++;
      } else {
        cache.Insert(key, key);
      }
    }

    timer.Stop();
    PrintResult("SwiftClockCache", timer.ElapsedMs(), kOps, hits, kOps);
  }

  // --- Sharded LRU Cache ---
  {
    ShardedLRUCache<int, int> cache(kCacheSize, 16);

    size_t hits = 0;
    Timer timer;
    timer.Start();

    for (size_t i = 0; i < kOps; i++) {
      int key = keys[i];
      if (cache.Lookup(key)) {
        hits++;
      } else {
        cache.Insert(key, key);
      }
    }

    timer.Stop();
    PrintResult("ShardedLRU(16)", timer.ElapsedMs(), kOps, hits, kOps);
  }
}

// ============================================================================
// Benchmark 3: scan resistance test
// The Clock algorithm should keep a better hit rate than LRU when facing scans
// ============================================================================
void BenchScanResistance() {
  PrintHeader("Benchmark 3: scan resistance (hot keys + cyclic scan)");

  const size_t kCacheSize = 5000;
  const size_t kHotKeys = 3000;    // hot key range
  const size_t kScanKeys = 20000;  // scan key range (much larger than the cache)
  const size_t kOps = 500000;

  std::cout << "  cache capacity=" << kCacheSize << " hot keys=" << kHotKeys << " scan keys=" << kScanKeys
            << " ops=" << kOps << std::endl;
  std::cout << "  pattern: 80% random hot-key access + 20% sequential scan" << std::endl;
  PrintSeparator();

  // Pre-generate the access sequence
  std::mt19937_64 rng(777);
  std::uniform_int_distribution<int> hot_dist(0, kHotKeys - 1);
  std::uniform_real_distribution<double> ratio_dist(0.0, 1.0);
  std::vector<int> keys(kOps);
  size_t scan_cursor = 0;

  for (size_t i = 0; i < kOps; i++) {
    if (ratio_dist(rng) < 0.8) {
      // 80%: random access to hot keys
      keys[i] = hot_dist(rng);
    } else {
      // 20%: sequential scan (keys are offset into a separate range so they do not overlap the hot keys)
      keys[i] = static_cast<int>(kHotKeys + (scan_cursor % kScanKeys));
      scan_cursor++;
    }
  }

  // --- SwiftClockCache ---
  {
    SwiftClockCache<int, int>::Options opts;
    opts.max_size = kCacheSize;
    opts.num_shards = 16;
    SwiftClockCache<int, int> cache(opts);

    size_t hits = 0;
    Timer timer;
    timer.Start();

    for (size_t i = 0; i < kOps; i++) {
      int key = keys[i];
      auto h = cache.Lookup(key);
      if (h) {
        hits++;
      } else {
        cache.Insert(key, key);
      }
    }

    timer.Stop();
    PrintResult("SwiftClockCache", timer.ElapsedMs(), kOps, hits, kOps);
  }

  // --- Sharded LRU Cache ---
  {
    ShardedLRUCache<int, int> cache(kCacheSize, 16);

    size_t hits = 0;
    Timer timer;
    timer.Start();

    for (size_t i = 0; i < kOps; i++) {
      int key = keys[i];
      if (cache.Lookup(key)) {
        hits++;
      } else {
        cache.Insert(key, key);
      }
    }

    timer.Stop();
    PrintResult("ShardedLRU(16)", timer.ElapsedMs(), kOps, hits, kOps);
  }
}

// ============================================================================
// Benchmark 4: pure read throughput
// ============================================================================
void BenchPureReadThroughput() {
  PrintHeader("Benchmark 4: pure read throughput (100% hits)");

  const size_t kCacheSize = 50000;
  const size_t kOps = 2000000;

  std::cout << "  cache capacity=" << kCacheSize << " ops=" << kOps << " (all keys are in the cache)" << std::endl;
  PrintSeparator();

  // --- SwiftClockCache ---
  {
    SwiftClockCache<int, int>::Options opts;
    opts.max_size = kCacheSize;
    opts.num_shards = 16;
    SwiftClockCache<int, int> cache(opts);

    // Fill the cache
    for (size_t i = 0; i < kCacheSize; i++) {
      cache.Insert(static_cast<int>(i), static_cast<int>(i));
    }

    std::mt19937_64 rng(54321);
    std::uniform_int_distribution<int> dist(0, kCacheSize - 1);

    size_t hits = 0;
    Timer timer;
    timer.Start();

    for (size_t i = 0; i < kOps; i++) {
      int key = dist(rng);
      auto h = cache.Lookup(key);
      if (h) {
        hits++;
      }
    }

    timer.Stop();
    PrintResult("SwiftClockCache", timer.ElapsedMs(), kOps, hits, kOps);
  }

  // --- Sharded LRU Cache ---
  {
    ShardedLRUCache<int, int> cache(kCacheSize, 16);

    for (size_t i = 0; i < kCacheSize; i++) {
      cache.Insert(static_cast<int>(i), static_cast<int>(i));
    }

    std::mt19937_64 rng(54321);
    std::uniform_int_distribution<int> dist(0, kCacheSize - 1);

    size_t hits = 0;
    Timer timer;
    timer.Start();

    for (size_t i = 0; i < kOps; i++) {
      int key = dist(rng);
      if (cache.Lookup(key)) {
        hits++;
      }
    }

    timer.Stop();
    PrintResult("ShardedLRU(16)", timer.ElapsedMs(), kOps, hits, kOps);
  }
}

// ============================================================================
// Benchmark 5: multithreaded concurrent performance
// ============================================================================
void BenchConcurrent() {
  PrintHeader("Benchmark 5: multithreaded concurrent performance");

  const size_t kCacheSize = 50000;
  const size_t kKeyRange = 100000;

  for (int num_threads : {1, 2, 4, 8, 16, 32}) {
    const size_t kOpsPerThread = 200000;

    std::cout << std::endl;
    std::cout << "  threads=" << num_threads << " ops per thread=" << kOpsPerThread
              << " cache capacity=" << kCacheSize << " key range=[0," << kKeyRange << ")" << std::endl;
    PrintSeparator();

    // Pre-generate a Zipf access sequence for each thread
    std::vector<std::vector<int>> thread_keys(num_threads);
    for (int t = 0; t < num_threads; t++) {
      ZipfGenerator zipf(kKeyRange, 0.99, 1000 + t);
      thread_keys[t].resize(kOpsPerThread);
      for (size_t i = 0; i < kOpsPerThread; i++) {
        thread_keys[t][i] = static_cast<int>(zipf.Next());
      }
    }

    // --- SwiftClockCache ---
    {
      SwiftClockCache<int, int>::Options opts;
      opts.max_size = kCacheSize;
      opts.num_shards = 16;
      SwiftClockCache<int, int> cache(opts);

      // Warm up
      for (size_t i = 0; i < kCacheSize / 2; i++) {
        cache.Insert(static_cast<int>(i), static_cast<int>(i));
      }

      std::atomic<size_t> total_hits{0};
      std::atomic<size_t> total_ops{0};

      Timer timer;
      timer.Start();

      std::vector<std::thread> threads;
      for (int t = 0; t < num_threads; t++) {
        threads.emplace_back([&, t]() {
          size_t local_hits = 0;
          for (size_t i = 0; i < kOpsPerThread; i++) {
            int key = thread_keys[t][i];
            auto h = cache.Lookup(key);
            if (h) {
              local_hits++;
            } else {
              cache.Insert(key, key);
            }
          }
          total_hits.fetch_add(local_hits, std::memory_order_relaxed);
          total_ops.fetch_add(kOpsPerThread, std::memory_order_relaxed);
        });
      }

      for (auto& th : threads) th.join();
      timer.Stop();

      PrintResult("SwiftClockCache", timer.ElapsedMs(), total_ops.load(), total_hits.load(), total_ops.load());
    }

    // --- Sharded LRU Cache ---
    {
      ShardedLRUCache<int, int> cache(kCacheSize, 16);

      // Warm up
      for (size_t i = 0; i < kCacheSize / 2; i++) {
        cache.Insert(static_cast<int>(i), static_cast<int>(i));
      }

      std::atomic<size_t> total_hits{0};
      std::atomic<size_t> total_ops{0};

      Timer timer;
      timer.Start();

      std::vector<std::thread> threads;
      for (int t = 0; t < num_threads; t++) {
        threads.emplace_back([&, t]() {
          size_t local_hits = 0;
          for (size_t i = 0; i < kOpsPerThread; i++) {
            int key = thread_keys[t][i];
            if (cache.Lookup(key)) {
              local_hits++;
            } else {
              cache.Insert(key, key);
            }
          }
          total_hits.fetch_add(local_hits, std::memory_order_relaxed);
          total_ops.fetch_add(kOpsPerThread, std::memory_order_relaxed);
        });
      }

      for (auto& th : threads) th.join();
      timer.Stop();

      PrintResult("ShardedLRU(16)", timer.ElapsedMs(), total_ops.load(), total_hits.load(), total_ops.load());
    }
  }
}

// ============================================================================
// Benchmark 6: hit rate at different cache capacities
// ============================================================================
void BenchHitRateVsCapacity() {
  PrintHeader("Benchmark 6: hit rate vs. cache capacity (Zipf alpha=0.99)");

  const size_t kKeyRange = 100000;
  const size_t kOps = 300000;

  // Pre-generate the access sequence
  ZipfGenerator zipf(kKeyRange, 0.99, 2025);
  std::vector<int> keys(kOps);
  for (size_t i = 0; i < kOps; i++) {
    keys[i] = static_cast<int>(zipf.Next());
  }

  std::cout << std::left << std::setw(12) << "capacity" << " | " << std::setw(22) << "SwiftClockCache hit %"
            << " | " << std::setw(22) << "ShardedLRU(16) hit %"
            << " | " << "diff" << std::endl;
  PrintSeparator();

  for (size_t cache_size : {1000, 2000, 5000, 10000, 20000, 50000}) {
    // --- SwiftClockCache ---
    size_t clock_hits = 0;
    {
      SwiftClockCache<int, int>::Options opts;
      opts.max_size = cache_size;
      opts.num_shards = 16;
      SwiftClockCache<int, int> cache(opts);

      for (size_t i = 0; i < kOps; i++) {
        int key = keys[i];
        auto h = cache.Lookup(key);
        if (h) {
          clock_hits++;
        } else {
          cache.Insert(key, key);
        }
      }
    }

    // --- Sharded LRU Cache ---
    size_t lru_hits = 0;
    {
      ShardedLRUCache<int, int> cache(cache_size, 16);

      for (size_t i = 0; i < kOps; i++) {
        int key = keys[i];
        if (cache.Lookup(key)) {
          lru_hits++;
        } else {
          cache.Insert(key, key);
        }
      }
    }

    double clock_rate = 100.0 * clock_hits / kOps;
    double lru_rate = 100.0 * lru_hits / kOps;
    double diff = clock_rate - lru_rate;

    std::cout << std::left << std::setw(12) << cache_size << " | " << std::right << std::setw(20) << std::fixed
              << std::setprecision(2) << clock_rate << "%" << " | " << std::setw(20) << lru_rate << "%" << " | "
              << std::showpos << std::setprecision(2) << diff << "%" << std::noshowpos << std::endl;
  }
}

// ============================================================================
// Benchmark 7: pure insert throughput
// ============================================================================
void BenchInsertThroughput() {
  PrintHeader("Benchmark 7: pure insert throughput");

  const size_t kCacheSize = 10000;
  const size_t kOps = 1000000;

  std::cout << "  cache capacity=" << kCacheSize << " inserts=" << kOps << std::endl;
  PrintSeparator();

  // --- SwiftClockCache ---
  {
    SwiftClockCache<int, int>::Options opts;
    opts.max_size = kCacheSize;
    opts.num_shards = 16;
    SwiftClockCache<int, int> cache(opts);

    Timer timer;
    timer.Start();

    for (size_t i = 0; i < kOps; i++) {
      cache.Insert(static_cast<int>(i), static_cast<int>(i));
    }

    timer.Stop();
    PrintResult("SwiftClockCache", timer.ElapsedMs(), kOps, 0, 0);
  }

  // --- Sharded LRU Cache ---
  {
    ShardedLRUCache<int, int> cache(kCacheSize, 16);

    Timer timer;
    timer.Start();

    for (size_t i = 0; i < kOps; i++) {
      cache.Insert(static_cast<int>(i), static_cast<int>(i));
    }

    timer.Stop();
    PrintResult("ShardedLRU(16)", timer.ElapsedMs(), kOps, 0, 0);
  }
}

// ============================================================================
// Benchmark 8: mixed read/write ratio comparison (Zipf)
// ============================================================================
void BenchMixedReadWrite() {
  PrintHeader("Benchmark 8: performance at different read/write ratios (Zipf alpha=0.99)");

  const size_t kCacheSize = 10000;
  const size_t kKeyRange = 100000;
  const size_t kOps = 500000;

  for (int read_pct : {50, 80, 95, 99}) {
    std::cout << std::endl;
    std::cout << "  reads=" << read_pct << "% writes=" << (100 - read_pct) << "%" << std::endl;
    PrintSeparator();

    // Pre-generate the key sequence and the operation-type (read or write) sequence
    ZipfGenerator zipf(kKeyRange, 0.99, 3000 + read_pct);
    std::mt19937_64 rng(4000 + read_pct);
    std::uniform_int_distribution<int> pct_dist(0, 99);

    std::vector<int> keys(kOps);
    std::vector<bool> is_read(kOps);
    for (size_t i = 0; i < kOps; i++) {
      keys[i] = static_cast<int>(zipf.Next());
      is_read[i] = (pct_dist(rng) < read_pct);
    }

    // --- SwiftClockCache ---
    {
      SwiftClockCache<int, int>::Options opts;
      opts.max_size = kCacheSize;
      opts.num_shards = 16;
      SwiftClockCache<int, int> cache(opts);

      // Warm-up: pre-fill half of the capacity with keys [0, kCacheSize / 2)
      for (size_t i = 0; i < kCacheSize / 2; i++) {
        cache.Insert(static_cast<int>(i), static_cast<int>(i));
      }

      size_t hits = 0, reads = 0;
      Timer timer;
      timer.Start();

      for (size_t i = 0; i < kOps; i++) {
        if (is_read[i]) {
          reads++;
          auto h = cache.Lookup(keys[i]);
          if (h) hits++;
        } else {
          cache.Insert(keys[i], keys[i]);
        }
      }

      timer.Stop();
      PrintResult("SwiftClockCache", timer.ElapsedMs(), kOps, hits, reads);
    }

    // --- Sharded LRU Cache ---
    {
      ShardedLRUCache<int, int> cache(kCacheSize, 16);

      // Warm-up: pre-fill half of the capacity with keys [0, kCacheSize / 2)
      for (size_t i = 0; i < kCacheSize / 2; i++) {
        cache.Insert(static_cast<int>(i), static_cast<int>(i));
      }

      size_t hits = 0, reads = 0;
      Timer timer;
      timer.Start();

      for (size_t i = 0; i < kOps; i++) {
        if (is_read[i]) {
          reads++;
          if (cache.Lookup(keys[i])) hits++;
        } else {
          cache.Insert(keys[i], keys[i]);
        }
      }

      timer.Stop();
      PrintResult("ShardedLRU(16)", timer.ElapsedMs(), kOps, hits, reads);
    }
  }
}

// ============================================================================
// main
// ============================================================================
int main() {
  std::cout << "================================================================" << std::endl;
  std::cout << " SwiftClockCache vs ShardedLRU: performance and hit-rate benchmark" << std::endl;
  std::cout << "================================================================" << std::endl;

  BenchUniformRandom();
  BenchZipf();
  BenchScanResistance();
  BenchPureReadThroughput();
  BenchInsertThroughput();
  BenchMixedReadWrite();
  BenchHitRateVsCapacity();
  BenchConcurrent();

  std::cout << std::endl;
  std::cout << "================================================================" << std::endl;
  std::cout << " All benchmarks finished!" << std::endl;
  std::cout << "================================================================" << std::endl;

  return 0;
}

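The results are as follows:

📊 SwiftClockCache vs ShardedLRU: Performance and Hit-Rate Benchmark

① Uniform random access pattern

Setup: cache capacity = 10,000 | key range = [0, 100,000) | operations = 500,000

Cache	Elapsed	Throughput	Latency	Hit rate
SwiftClockCache	68.79 ms	7.27 M ops/s	137.6 ns/op	10.04%
ShardedLRU(16)	68.38 ms	7.31 M ops/s	136.8 ns/op	10.03%

💡 Under uniform random access the two caches perform almost identically, and the hit rate ≈ cache capacity / key range = 10%, which matches the theoretical expectation.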
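The throughput and latency columns in these tables are consistent with simple conversions from the elapsed time and the operation count. The sketch below shows those conversions; the helper names are hypothetical and are not taken from the benchmark's own PrintResult, whose implementation is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helpers for illustration; not part of the benchmark source.

// Operations per second, given an elapsed time in milliseconds.
double ThroughputOpsPerSec(double elapsed_ms, std::size_t ops) {
  return static_cast<double>(ops) / (elapsed_ms / 1000.0);
}

// Average latency per operation in nanoseconds.
double LatencyNsPerOp(double elapsed_ms, std::size_t ops) {
  return elapsed_ms * 1e6 / static_cast<double>(ops);
}

// Hit rate in percent; returns 0 when no lookups were issued
// (for example, in a pure insert benchmark).
double HitRatePct(std::size_t hits, std::size_t lookups) {
  if (lookups == 0) return 0.0;
  return 100.0 * static_cast<double>(hits) / static_cast<double>(lookups);
}

// Prints one row: elapsed time, throughput in millions of ops/s,
// latency in ns/op, and hit rate in percent.
void PrintDerivedRow(const char* name, double elapsed_ms, std::size_t ops,
                     std::size_t hits, std::size_t lookups) {
  std::printf("%-16s %8.2f ms  %7.2f M ops/s  %7.1f ns/op  %6.2f%%\n", name,
              elapsed_ms, ThroughputOpsPerSec(elapsed_ms, ops) / 1e6,
              LatencyNsPerOp(elapsed_ms, ops), HitRatePct(hits, lookups));
}
```

For the SwiftClockCache row above, 500,000 operations in 68.79 ms work out to roughly 7.27 million operations per second and about 137.6 ns per operation.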

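② Zipf-distributed access pattern (alpha=0.99)

Setup: cache capacity = 10,000 | key range = [0, 100,000) | operations = 500,000

Cache	Elapsed	Throughput	Latency	Hit rate
SwiftClockCache	19.73 ms	25.35 M ops/s	39.5 ns/op	72.88%
ShardedLRU(16)	28.62 ms	17.47 M ops/s	57.2 ns/op	71.99%

🚀 Under this skewed, hot-key access pattern, SwiftClockCache delivers 45% higher throughput, and its hit rate is also 0.89 percentage points higher.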

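③ Scan resistance test (hot set + cyclic scan)

Setup: cache capacity = 5,000 | hot keys = 3,000 | scan keys = 20,000 | operations = 500,000
Pattern: 80% random accesses to the hot set + 20% sequential scan

Cache	Elapsed	Throughput	Latency	Hit rate
SwiftClockCache	19.46 ms	25.69 M ops/s	38.9 ns/op	78.66%
ShardedLRU(16)	24.55 ms	20.37 M ops/s	49.1 ns/op	74.81%

🛡️ The hit-rate gap reaches 3.85 percentage points. Clock's "second chance" mechanism effectively protects hot data from being flushed out by scan traffic.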
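To make the second-chance effect concrete, here is a deliberately tiny, single-threaded sketch. It is not SwiftClockCache's implementation: ToyClockCache, ToyLruCache, HotKeysSurviveScan and the saturating counter limit of 3 are all assumptions chosen for illustration. Two hot keys are touched a few times, then six one-off scan keys are streamed through a cache of capacity 4.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <vector>

// Toy Clock cache (capacity must be > 0). Each slot has a saturating counter.
class ToyClockCache {
 public:
  explicit ToyClockCache(std::size_t capacity) : slots_(capacity) {}

  bool Lookup(int key) {
    auto it = index_.find(key);
    if (it == index_.end()) return false;
    Slot& s = slots_[it->second];
    if (s.chances < kMaxChances) ++s.chances;  // a hit only bumps a counter
    return true;
  }

  void Insert(int key) {
    if (index_.count(key) != 0) return;
    for (;;) {  // advance the clock hand until a free or cold slot is found
      const std::size_t pos = hand_;
      hand_ = (hand_ + 1) % slots_.size();
      Slot& s = slots_[pos];
      if (s.used && s.chances > 0) {  // recently used: spend one chance
        --s.chances;
        continue;
      }
      if (s.used) index_.erase(s.key);  // cold entry: evict it
      s.key = key;
      s.chances = 0;
      s.used = true;
      index_[key] = pos;
      return;
    }
  }

  bool Contains(int key) const { return index_.count(key) != 0; }

 private:
  struct Slot {
    int key = 0;
    int chances = 0;
    bool used = false;
  };
  static constexpr int kMaxChances = 3;  // assumed limit, for illustration
  std::vector<Slot> slots_;
  std::unordered_map<int, std::size_t> index_;
  std::size_t hand_ = 0;
};

// Toy LRU cache: hash map plus a list ordered from most to least recent.
class ToyLruCache {
 public:
  explicit ToyLruCache(std::size_t capacity) : capacity_(capacity) {}

  bool Lookup(int key) {
    auto it = index_.find(key);
    if (it == index_.end()) return false;
    order_.splice(order_.begin(), order_, it->second);  // move to the front
    return true;
  }

  void Insert(int key) {
    if (index_.count(key) != 0) return;
    if (order_.size() == capacity_) {  // full: evict the least recent key
      index_.erase(order_.back());
      order_.pop_back();
    }
    order_.push_front(key);
    index_[key] = order_.begin();
  }

  bool Contains(int key) const { return index_.count(key) != 0; }

 private:
  std::size_t capacity_;
  std::list<int> order_;
  std::unordered_map<int, std::list<int>::iterator> index_;
};

// Touches hot keys 1 and 2 three times each, then streams six scan keys.
// Returns true if both hot keys are still cached afterwards.
template <typename Cache>
bool HotKeysSurviveScan() {
  Cache cache(4);
  cache.Insert(1);
  cache.Insert(2);
  for (int i = 0; i < 3; ++i) {
    cache.Lookup(1);
    cache.Lookup(2);
  }
  for (int key = 100; key < 106; ++key) {
    if (!cache.Lookup(key)) cache.Insert(key);
  }
  return cache.Contains(1) && cache.Contains(2);
}
```

Calling HotKeysSurviveScan<ToyClockCache>() from a main function returns true, while HotKeysSurviveScan<ToyLruCache>() returns false: the scan keys evict each other in the Clock variant, whereas in the LRU variant they push the hot keys off the tail. This is an extreme case; in the benchmark the hot set keeps being re-accessed during the scan, so the gap there is a few percentage points rather than all-or-nothing.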

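④ Pure read throughput (nominally 100% hits)

Setup: cache capacity = 50,000 | operations = 2,000,000 (all keys are expected to be in the cache)

Cache	Elapsed	Throughput	Latency	Hit rate
SwiftClockCache	54.01 ms	37.03 M ops/s	27.0 ns/op	99.20%
ShardedLRU(16)	66.51 ms	30.07 M ops/s	33.3 ns/op	99.20%

⚡ In the pure read scenario SwiftClockCache delivers 23% higher throughput. Note that the measured hit rate is 99.20% rather than a full 100% for both caches, so a small fraction of keys was not resident during the read phase.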

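⑤ Pure insert throughput

Setup: cache capacity = 10,000 | insert operations = 1,000,000

Cache	Elapsed	Throughput	Latency
SwiftClockCache	57.20 ms	17.48 M ops/s	57.2 ns/op
ShardedLRU(16)	140.41 ms	7.12 M ops/s	140.4 ns/op

🔥 This is the largest throughput gap outside the multi-threaded test: SwiftClockCache achieves about 2.45× the insert throughput of ShardedLRU!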

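⑥ Performance at different read/write ratios (Zipf alpha=0.99)

Read : write ratio	SwiftClockCache throughput	ShardedLRU throughput	Improvement	SwiftClockCache hit rate	ShardedLRU hit rate
50 : 50	34.46 M ops/s	19.94 M ops/s	+73%	74.80%	73.83%
80 : 20	40.06 M ops/s	29.38 M ops/s	+36%	76.51%	75.66%
95 : 5	44.84 M ops/s	39.46 M ops/s	+14%	75.73%	75.72%
99 : 1	49.34 M ops/s	38.80 M ops/s	+27%	74.45%	74.45%

📈 SwiftClockCache's advantage is largest in the write-heavy 50 : 50 mix (+73%) and generally shrinks as reads dominate, although the trend is not strictly monotonic: the 99 : 1 row (+27%) shows a larger gain than the 95 : 5 row (+14%).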

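⑦ Hit rate at different cache capacities (Zipf alpha=0.99)

Setup: key range = 100,000 | operations = 300,000

Cache capacity	SwiftClockCache hit rate	ShardedLRU hit rate	Difference
1,000	50.07%	48.93%	+1.14%
2,000	56.85%	55.75%	+1.10%
5,000	65.85%	64.89%	+0.96%
10,000	72.65%	71.80%	+0.84%
20,000	78.84%	78.33%	+0.50%
50,000	83.56%	83.56%	±0.00%

📉 The smaller the cache capacity (that is, the stronger the competition for cache space), the more pronounced SwiftClockCache's hit-rate advantage, with the difference measured in percentage points; when capacity is ample, the two converge.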
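As a rough reference point for these hit rates, one can compute what an idealized static cache would achieve if it permanently held the most popular keys. The sketch below assumes the standard Zipf definition, where the probability of the k-th most popular key is proportional to 1 / k^alpha; whether the benchmark's ZipfGenerator follows exactly this definition is an assumption, and IdealZipfHitRate is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Fraction of accesses that would hit if the cache always held the
// `capacity` most popular keys out of `key_range` keys, assuming
// p(k) proportional to 1 / k^alpha for rank k = 1..key_range.
double IdealZipfHitRate(std::size_t capacity, std::size_t key_range,
                        double alpha) {
  double top = 0.0;    // total weight of the cached (most popular) keys
  double total = 0.0;  // total weight of all keys
  for (std::size_t k = 1; k <= key_range; ++k) {
    const double weight = 1.0 / std::pow(static_cast<double>(k), alpha);
    total += weight;
    if (k <= capacity) top += weight;
  }
  return top / total;
}
```

With capacity = 10,000, key_range = 100,000 and alpha = 0.99, this evaluates to about 0.80. The measured rates in the table sit below such a bound, which is plausible given that the runs include cold-start misses and that neither eviction policy knows the popularity ranking in advance.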

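⑧ Multi-threaded concurrent performance ⭐ (the key metric)

Setup: operations per thread = 200,000 | cache capacity = 50,000 | key range = [0, 100,000) | Zipf alpha=0.99

Threads	SwiftClockCache throughput	ShardedLRU throughput	Ratio	SwiftClockCache hit rate	ShardedLRU hit rate
1	42.87 M ops/s	41.66 M ops/s	1.03x	89.88%	89.88%
2	57.64 M ops/s	23.62 M ops/s	2.44x	91.03%	90.76%
4	90.94 M ops/s	19.64 M ops/s	4.63x	91.52%	91.05%
8	96.22 M ops/s	15.70 M ops/s	6.13x	91.69%	91.17%
16	91.76 M ops/s	10.17 M ops/s	9.02x	91.77%	91.25%
32	92.09 M ops/s	10.36 M ops/s	8.89x	91.82%	91.30%

Key conclusions:

With a single thread the two are nearly on par (1.03x), which suggests that the lock-free design introduces no noticeable single-threaded overhead

As the thread count grows, SwiftClockCache's throughput rises from 42.87 M ops/s (1 thread) to 90.94 M ops/s (4 threads), peaks at 96.22 M ops/s with 8 threads, and then levels off at around 92 M ops/s; the scaling is sublinear rather than linear, but throughput never falls back as threads are added

ShardedLRU degrades severely under high concurrency: its throughput drops from 41.66 M ops/s with 1 thread to roughly 10 M ops/s with 16 or more threads

At 32 threads SwiftClockCache delivers 8.89 times the throughput of ShardedLRU (9.02 times at 16 threads)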
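One way to quantify how each cache scales is to compute the speedup over its own single-thread run and the resulting parallel efficiency from the throughput columns. The helpers below are a hypothetical sketch for that arithmetic, not part of the benchmark.

```cpp
#include <cassert>
#include <cstddef>

// Speedup of an n-thread run relative to the single-thread run.
// Both throughputs must use the same unit; the unit cancels out.
double Speedup(double throughput_n, double throughput_1) {
  return throughput_n / throughput_1;
}

// Parallel efficiency: 1.0 would mean perfectly linear scaling.
double ParallelEfficiency(double throughput_n, double throughput_1,
                          std::size_t threads) {
  return Speedup(throughput_n, throughput_1) / static_cast<double>(threads);
}
```

Plugging in the table's figures, SwiftClockCache at 4 threads reaches a speedup of about 2.12 over its 1-thread run (efficiency about 0.53), while ShardedLRU at 16 threads comes out at about 0.24, meaning it runs at roughly a quarter of its own single-thread throughput.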

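8. Summary

SwiftClockCache is a carefully designed cache that is well suited to high-concurrency scenarios. The author's knowledge is limited, so please point out any errors in this article. Source code: https://github.com/viktorika/swiftclockcache.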