Probabilistic Counting in Caffeine: Count-Min Sketch

Count-Min Sketch Explained

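1. What Is Count-Min Sketch?

Count-Min Sketch (CMS) is a probabilistic data structure that estimates the frequency (count) of a large number of elements in bounded space, which makes it particularly well suited to large data streams.

Its characteristics are:

Small memory footprint (far smaller than allocating one counter per element)
Fast insertion and lookup
Results are approximate (but the error is controllable)

Count-Min Sketch is commonly used for network traffic measurement, cache frequency tracking (such as TinyLFU in Caffeine), and big-data analytics.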

2. How the Algorithm Works

Core structure: a two-dimensional counter array plus several hash functions

Choose $d$ hash functions $h_1, h_2, \ldots, h_d$. Each hash function owns one row of the array, and each row has length $w$.
The overall structure is a $d \times w$ two-dimensional array whose counters all start at 0.
Each hash function maps an input element to a column index in the range $[0, w-1]$.
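
The layout described above can be sketched in Java as follows. This is a minimal illustration, not Caffeine's implementation: the class name `SketchLayout`, the per-row seeds, and the bit-mixing step in `column` are all assumptions that stand in for the $d$ hash functions.

```java
import java.util.Random;

class SketchLayout {
    final int d;          // number of rows, one per hash function
    final int w;          // number of counters per row
    final int[][] table;  // d x w counters; Java zero-initializes them
    final int[] seeds;    // hypothetical per-row seeds standing in for h_1..h_d

    SketchLayout(int d, int w) {
        this.d = d;
        this.w = w;
        this.table = new int[d][w];
        this.seeds = new int[d];
        Random rnd = new Random(1L); // fixed seed so runs are repeatable
        for (int i = 0; i < d; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    // Maps key to a column in [0, w-1] for row i.
    int column(int i, String key) {
        int h = key.hashCode() ^ seeds[i];
        h ^= h >>> 16;       // simple bit mixing; an assumption, not a prescribed hash
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        return Math.floorMod(h, w); // floorMod keeps the index non-negative
    }
}
```

Using `Math.floorMod` rather than `%` matters here because a Java hash value can be negative, and a negative remainder would be an invalid array index.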

Insert operation (add/put)

For an element $x$, each hash function $h_i$ computes $h_i(x)$, and the counter in row $i$, column $h_i(x)$ is incremented by one.
$$\text{for } i = 1 \text{ to } d: \quad \text{count}[i][h_i(x)] \mathrel{+}= 1$$

Query operation (estimate/count)

For an element $x$, read the counter selected by each hash function and return the minimum as the estimated frequency:
$$\hat{f}(x) = \min_{i=1}^{d} \text{count}[i][h_i(x)]$$

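Why take the minimum?

Because of hash collisions, some of an element's counters may be inflated ("polluted") by other elements. Counters are only ever incremented, so every row's counter is at least the true frequency; the smallest one is the least polluted and therefore the closest estimate, which keeps the overestimate small.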
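
The reasoning above can be checked empirically. The following is a minimal sketch under stated assumptions: `TinySketch` and its `column` hash are hypothetical, and the sketch is made deliberately small (3 rows of 8 counters) so that collisions are guaranteed. It compares each row counter against exact counts kept in a `HashMap`.

```java
import java.util.HashMap;
import java.util.Map;

class TinySketch {
    final int d, w;
    final int[][] table;

    TinySketch(int d, int w) {
        this.d = d;
        this.w = w;
        this.table = new int[d][w];
    }

    // Hypothetical row hash: mixes the row number into String.hashCode().
    int column(int i, String key) {
        return Math.floorMod(key.hashCode() * (2 * i + 3) + i * 0x9E3779B9, w);
    }

    void add(String key) {
        for (int i = 0; i < d; i++) table[i][column(i, key)]++;
    }

    // Counter that row i holds for key (true count plus any collision noise).
    int rowCount(int i, String key) {
        return table[i][column(i, key)];
    }

    int estimate(String key) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < d; i++) min = Math.min(min, rowCount(i, key));
        return min;
    }

    // Inserts 50 keys into a deliberately small sketch and checks that no row
    // counter, and hence no estimate, falls below the exact count.
    static boolean neverUnderestimates() {
        TinySketch sketch = new TinySketch(3, 8);
        Map<String, Integer> exact = new HashMap<>();
        for (int j = 0; j < 50; j++) {
            String key = "k" + j;
            for (int n = 0; n <= j % 5; n++) {
                sketch.add(key);
                exact.merge(key, 1, Integer::sum);
            }
        }
        for (Map.Entry<String, Integer> e : exact.entrySet()) {
            for (int i = 0; i < sketch.d; i++) {
                if (sketch.rowCount(i, e.getKey()) < e.getValue()) return false;
            }
            if (sketch.estimate(e.getKey()) < e.getValue()) return false;
        }
        return true;
    }
}
```

`neverUnderestimates()` holds whatever hash functions are plugged in, because the property follows only from counters never being decremented.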

3. Error Analysis and Parameter Selection

Source of error: hash collisions cause frequencies to be overestimated (never underestimated).
Error bound: with probability at least $1 - \delta$, the estimate $\hat{f}(x)$ exceeds the true frequency $f(x)$ by at most $\varepsilon N$, where $N$ is the total number of insertions and $\varepsilon$ is controlled by the array width $w$.
The failure probability $\delta$ is controlled by the number of hash functions $d$.
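
A short derivation shows where these two dependencies come from. It is the standard argument, sketched here under the assumption that the hash functions are pairwise independent, uniform over the $w$ columns, and independent across rows; the symbol $X_i$ (collision noise in row $i$) is introduced only for this derivation.

```latex
\begin{align*}
X_i &= \text{count}[i][h_i(x)] - f(x) \;\ge\; 0
  && \text{(noise added to row $i$ by other elements)} \\
\mathbb{E}[X_i] &= \sum_{y \ne x} f(y)\,\Pr[h_i(y) = h_i(x)]
  \;\le\; \frac{N}{w} \;\le\; \frac{\varepsilon N}{e}
  && \text{(if } w \ge e/\varepsilon\text{)} \\
\Pr[X_i \ge \varepsilon N] &\le \frac{\mathbb{E}[X_i]}{\varepsilon N} \;\le\; \frac{1}{e}
  && \text{(Markov's inequality)} \\
\Pr\big[\hat{f}(x) > f(x) + \varepsilon N\big]
  &\le \prod_{i=1}^{d} \Pr[X_i \ge \varepsilon N] \;\le\; e^{-d} \;\le\; \delta
  && \text{(if } d \ge \ln(1/\delta)\text{)}
\end{align*}
```

The last line uses the fact that the minimum is too large only when every one of the $d$ rows is too large at the same time.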

Parameter selection:

$w = \lceil e / \varepsilon \rceil$, $d = \lceil \ln(1/\delta) \rceil$
For example, with $\varepsilon = 0.01$ and $\delta = 0.001$: $w = \lceil 271.83 \rceil = 272$ and $d = \lceil 6.91 \rceil = 7$.
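
The two formulas translate directly into code. The helper below is a minimal sketch; the class name `SketchParams` and its method names are hypothetical.

```java
class SketchParams {
    // Width w = ceil(e / epsilon): controls the additive error epsilon * N.
    static int width(double epsilon) {
        return (int) Math.ceil(Math.E / epsilon);
    }

    // Depth d = ceil(ln(1 / delta)): controls the failure probability delta.
    static int depth(double delta) {
        return (int) Math.ceil(Math.log(1.0 / delta));
    }
}
```

For $\varepsilon = 0.01$ and $\delta = 0.001$, `width(0.01)` returns 272 (since $e / 0.01 \approx 271.83$ rounds up) and `depth(0.001)` returns 7, so the sketch needs $272 \times 7 = 1904$ counters regardless of how many distinct elements the stream contains.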

4. Example Code

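import java.util.Random;

class CountMinSketch {
    private final int d;          // number of rows (one per hash function)
    private final int w;          // number of counters per row
    private final int[][] table;  // d x w counters, all initially 0
    private final int[] seeds;    // one seed per row, standing in for d hash functions

    public CountMinSketch(int d, int w) {
        this.d = d;
        this.w = w;
        table = new int[d][w];
        seeds = new int[d];
        // Initialize the hash functions: here, one random seed per row
        Random random = new Random();
        for (int i = 0; i < d; i++) {
            seeds[i] = random.nextInt();
        }
    }

    // Column index of key in row i. This seeded mixing is a simplification for
    // illustration; floorMod keeps the result in [0, w-1] even when the mixed
    // hash is negative, which a plain % would not.
    private int index(int i, String key) {
        int h = key.hashCode() ^ seeds[i];
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        return Math.floorMod(h, w);
    }

    // Increment one counter in every row
    public void add(String key) {
        for (int i = 0; i < d; i++) {
            table[i][index(i, key)]++;
        }
    }

    // Return the smallest of the d counters selected for key
    public int estimate(String key) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < d; i++) {
            min = Math.min(min, table[i][index(i, key)]);
        }
        return min;
    }
}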

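5. Pros and Cons

Pros:

Space-efficient, suitable for big-data workloads
Both queries and insertions take $O(d)$ time, which is very fast
The error is controllable

Cons:

Frequency estimates are approximate (they may overestimate, but never underestimate)
No deletion in the basic form described here (counters are only ever incremented)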

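6. Use Cases

Cache frequency tracking (such as TinyLFU in Caffeine)
Network packet counting and trending-keyword statistics
Data stream analysis, anti-spam, and similar tasks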

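The following walk-through uses diagrams and a simplified example to illustrate the Count-Min Sketch algorithm.

Worked Example

1. Data structure diagram

Suppose we have 3 hash functions, each owning one row, and each row has 8 counters (columns 0 to 7):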

Hash function 1: [0] [1] [2] [3] [4] [5] [6] [7]
Hash function 2: [0] [1] [2] [3] [4] [5] [6] [7]
Hash function 3: [0] [1] [2] [3] [4] [5] [6] [7]

Initially, every cell holds 0.

2. Inserting elements

Suppose we insert the element "A":

Hash function 1("A") = 3
Hash function 2("A") = 6
Hash function 3("A") = 1

We add 1 at each of those positions:

Hash function 1: [0] [0] [0] [1] [0] [0] [0] [0]
Hash function 2: [0] [0] [0] [0] [0] [0] [1] [0]
Hash function 3: [0] [1] [0] [0] [0] [0] [0] [0]

Next, insert "B":

Hash function 1("B") = 5
Hash function 2("B") = 3
Hash function 3("B") = 1

Hash function 1: [0] [0] [0] [1] [0] [1] [0] [0]
Hash function 2: [0] [0] [0] [1] [0] [0] [1] [0]
Hash function 3: [0] [2] [0] [0] [0] [0] [0] [0]

Note that "B" and "A" collide under hash function 3, so the counter at column 1 of that row becomes 2.

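3. Querying an element's frequency

Querying the frequency of "A":

Hash function 1("A") = 3, counter value 1
Hash function 2("A") = 6, counter value 1
Hash function 3("A") = 1, counter value 2 (because of the collision)

Take the minimum: 1

So the estimated frequency of "A" is 1.

Querying the frequency of "B":

Hash function 1("B") = 5, counter value 1
Hash function 2("B") = 3, counter value 1
Hash function 3("B") = 1, counter value 2

Take the minimum: 1

So the estimated frequency of "B" is also 1.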
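
The walk-through above can be replayed in a few lines of Java. In this sketch the hash values are hard-coded to the ones used in the example rather than computed; the class name `WorkedExample` and the `COLUMNS` lookup table are assumptions made for illustration (`Map.of` requires Java 9 or later).

```java
import java.util.Arrays;
import java.util.Map;

class WorkedExample {
    // Columns chosen by hash functions 1..3, hard-coded to the example's values.
    static final Map<String, int[]> COLUMNS = Map.of(
        "A", new int[] {3, 6, 1},
        "B", new int[] {5, 3, 1});

    final int[][] table = new int[3][8]; // 3 hash functions, 8 counters each

    void add(String key) {
        int[] cols = COLUMNS.get(key);
        for (int i = 0; i < cols.length; i++) {
            table[i][cols[i]]++;
        }
    }

    int estimate(String key) {
        int[] cols = COLUMNS.get(key);
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < cols.length; i++) {
            min = Math.min(min, table[i][cols[i]]);
        }
        return min;
    }

    public static void main(String[] args) {
        WorkedExample sketch = new WorkedExample();
        sketch.add("A");
        sketch.add("B");
        // Row of hash function 3, where "A" and "B" collide at column 1
        System.out.println(Arrays.toString(sketch.table[2])); // [0, 2, 0, 0, 0, 0, 0, 0]
        System.out.println(sketch.estimate("A")); // 1
        System.out.println(sketch.estimate("B")); // 1
    }
}
```

Even though the shared counter in the third row reads 2, the other two rows are uncontaminated, so the minimum recovers the exact count of 1 for both elements.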

4. Step-by-step flow

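Insert "A":
Hash function 1 --> cell 3, +1
Hash function 2 --> cell 6, +1
Hash function 3 --> cell 1, +1

Insert "B":
Hash function 1 --> cell 5, +1
Hash function 2 --> cell 3, +1
Hash function 3 --> cell 1, +1 (collides with "A")

Query "A":
Read cells 3, 6, and 1 (one per row) and take the minimum

Query "B":
Read cells 5, 3, and 1 (one per row) and take the minimum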

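5. Summary

Count-Min Sketch uses several hash functions to record each item in several counters.
A query takes the minimum of those counters, which limits how much hash collisions can inflate the estimate.
It is very space-efficient and well suited to large data streams.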
