A Detailed Guide to the C++ STL unordered Associative Containers


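The unordered associative containers

C++98 already shipped a set of associative containers: set, map, multiset, and multimap.

These containers are usually implemented as red-black trees, so lookup, insertion, and deletion take O(log N) time: the number of comparisons is bounded by the height of the tree.

For small data sets this is already fast, but with many nodes every lookup still has to walk down the tree comparing keys, and that cost adds up.

Ideally, some computation on the key would lead almost directly to where the element lives, with few comparisons or none at all.

That is why C++11 added the unordered associative containers:

Container	Tree-based counterpart	Main characteristics
unordered_set	set	Stores keys only; keys are unique; elements are unordered
unordered_multiset	multiset	Stores keys only; keys may repeat; elements are unordered
unordered_map	map	Stores keys and values; keys are unique; elements are unordered
unordered_multimap	multimap	Stores keys and values; keys may repeat; elements are unordered

They are used much like their red-black-tree counterparts and share most of the common interface: insert, erase, find, count, size, empty, clear, swap, and so on.

The real difference is the underlying data structure.

The tree-based associative containers rely on a red-black tree to keep elements ordered, while the unordered containers organize their data in a hash table.

The rough idea of a hash table: hash the key to get a hash value, then use that value to place the element in a bucket.

Lookup works the same way: hash the key first, then search only the corresponding bucket.

With a suitable hash function and evenly filled buckets, lookup takes O(1) time on average.

This is not guaranteed, though. If a large number of keys land in the same bucket and collisions pile up, the worst case degrades to O(N).

So the characteristics of the unordered containers can be summarized as follows:

Average lookup is fast, usually faster than in the tree-based containers
Elements are not sorted, and the iteration order is unspecified
The underlying structure is a hash table, not a red-black tree
They suit fast lookup by key, not code that depends on ordered traversal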
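
The O(N) worst case mentioned above can be provoked on purpose. The sketch below is only an illustration: ConstantHash, largest_bucket, and worst_case_bucket are hypothetical names, and the hash functor deliberately maps every key to the same value so that all elements share one bucket.

```cpp
#include <cstddef>
#include <unordered_set>

// Hypothetical hash functor for illustration: every key gets the same hash value.
struct ConstantHash
{
    std::size_t operator()(int) const
    {
        return 0;
    }
};

// Returns the number of elements in the fullest bucket of the container.
template <class Set>
std::size_t largest_bucket(const Set& s)
{
    std::size_t best = 0;
    for (std::size_t i = 0; i < s.bucket_count(); ++i)
    {
        if (s.bucket_size(i) > best)
        {
            best = s.bucket_size(i);
        }
    }
    return best;
}

// Inserts 0..n-1 using the degenerate hash and reports the fullest bucket.
std::size_t worst_case_bucket(int n)
{
    std::unordered_set<int, ConstantHash> s;
    for (int i = 0; i < n; ++i)
    {
        s.insert(i);
    }
    return largest_bucket(s);
}
```

With this hash, worst_case_bucket(100) reports a single bucket holding all 100 elements, so every lookup has to scan that one list.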

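Characteristics of unordered_set

unordered_set is an associative container that stores keys in no particular order.

As with set, the stored element is itself the key; in other words, the value and the key are the same thing.

Key properties of unordered_set:

Elements cannot repeat
Elements have no fixed order
Looking up an element by key is usually fast
Elements that map to the same bucket index (which includes all elements with equal hash values) are stored in the same bucket
Iterators are at least forward iterators

Note that unordered_set does not store elements in sorted order.

For example, after inserting 1, 4, 3, 2, iteration may yield 1 4 3 2 or some other order.

The order depends on the compiler's standard library, the hashing strategy, and the bucket count, so code should not rely on it.

If you need sorted results, use set.

If you only need to check quickly whether a value exists, unordered_set is often the better fit.

Defining an unordered_set

To use unordered_set, include the header: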

#include <unordered_set>

Option 1: construct an empty container of a given element type

unordered_set<int> us1;

Option 2: copy-construct from a container of the same type

unordered_set<int> us2(us1);

Option 3: construct from an iterator range

string str("abcdef");
unordered_set<char> us3(str.begin(), str.end());

If the range contains duplicates, only one copy of each element is kept.

string str("banana");
unordered_set<char> us(str.begin(), str.end());

for (auto ch : us)
{
    cout << ch << " ";
}

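In the code above, a and n each appear more than once, but the unordered_set keeps only the distinct characters.

Common unordered_set interfaces

Commonly used member functions of unordered_set:

Member function	Purpose
insert	Inserts the given element
erase	Removes the given element
find	Looks up the given element
size	Returns the number of elements in the container
empty	Checks whether the container is empty
clear	Removes all elements
swap	Exchanges the contents of two containers
count	Counts the elements equal to the given key

Iterator-related interfaces:

Member function	Purpose
begin	Returns a forward iterator to the first element
end	Returns a forward iterator to the position past the last element

unordered_set has no reverse-iterator interfaces such as rbegin and rend. Its iterators are only required to be forward iterators, and since the elements have no meaningful order, traversing them backwards would not mean much anyway.

Using unordered_set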

#include <iostream>
#include <unordered_set>
using namespace std;

int main()
{
    unordered_set<int> us;

    // Insert elements; duplicates are discarded
    us.insert(1);
    us.insert(4);
    us.insert(3);
    us.insert(3);
    us.insert(2);
    us.insert(2);
    us.insert(3);

    // Range-based for loop; the output order is unspecified
    for (auto e : us)
    {
        cout << e << " ";
    }
    cout << endl;

    // Erase option 1: erase by value
    us.erase(3);

    // Erase option 2: find first, then erase by iterator
    unordered_set<int>::iterator pos = us.find(1);
    if (pos != us.end())
    {
        us.erase(pos);
    }

    // Iterate with an iterator
    unordered_set<int>::iterator it = us.begin();
    while (it != us.end())
    {
        cout << *it << " ";
        ++it;
    }
    cout << endl;

    // count on an unordered_set returns only 0 or 1
    cout << us.count(2) << endl;

    // Number of elements
    cout << us.size() << endl;

    // Remove all elements
    us.clear();

    // Check whether the container is empty
    cout << us.empty() << endl;

    // Exchange the contents of two containers
    unordered_set<int> tmp{ 11, 22, 33, 44 };
    us.swap(tmp);

    for (auto e : us)
    {
        cout << e << " ";
    }
    cout << endl;

    return 0;
}

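One small detail: insert on unordered_set returns the same kind of value as insert on set.

It returns a pair whose first is an iterator and whose second says whether the insertion took place.

If the element was not present, it is inserted and second is true.

If the element was already present, nothing is inserted and second is false; first then points to the existing element.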

#include <iostream>
#include <unordered_set>
using namespace std;

int main()
{
    unordered_set<int> us;

    auto ret1 = us.insert(10);
    cout << ret1.second << endl; // 1

    auto ret2 = us.insert(10);
    cout << ret2.second << endl; // 0

    return 0;
}

unordered_multiset

unordered_multiset has the same underlying structure as unordered_set, a hash table, and nearly the same interface.

The main difference: unordered_set does not allow duplicate keys, while unordered_multiset does.

#include <iostream>
#include <unordered_set>
using namespace std;

int main()
{
    unordered_multiset<int> ums;

    ums.insert(1);
    ums.insert(4);
    ums.insert(3);
    ums.insert(3);
    ums.insert(2);
    ums.insert(2);
    ums.insert(3);

    for (auto e : ums)
    {
        cout << e << " ";
    }
    cout << endl;

    cout << ums.count(3) << endl; // 3

    return 0;
}

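Because unordered_multiset allows duplicates, count is more meaningful there than in unordered_set.

Member function	Meaning in unordered_set	Meaning in unordered_multiset
find	Returns an iterator to the element, or end if it is not found	Returns an iterator to one of the equal elements; which one is unspecified, so do not rely on it
count	The result is always 0 or 1	Returns how many times the value occurs

To get all elements equal to a given value, use equal_range.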

#include <iostream>
#include <unordered_set>
using namespace std;

int main()
{
    unordered_multiset<int> ums{ 1, 2, 2, 2, 3, 4 };

    auto range = ums.equal_range(2);
    while (range.first != range.second)
    {
        cout << *range.first << " ";
        ++range.first;
    }
    cout << endl;

    return 0;
}

One more detail about deletion.

Calling erase(value) on an unordered_multiset removes every element equal to value.

To remove just one of them, first get an iterator with find, then erase by iterator.

#include <iostream>
#include <unordered_set>
using namespace std;

int main()
{
    unordered_multiset<int> ums{ 1, 2, 2, 2, 3 };

    auto pos = ums.find(2);
    if (pos != ums.end())
    {
        ums.erase(pos); // removes just one 2
    }

    cout << ums.count(2) << endl; // 2

    ums.erase(2); // removes all remaining 2s

    cout << ums.count(2) << endl; // 0

    return 0;
}

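Characteristics of unordered_map

unordered_map is a hash-based associative container that stores key-value pairs.

Like map, it stores a mapping from keys to values.

The difference is that map is built on a red-black tree and keeps its elements ordered by key.

unordered_map is built on a hash table; its elements have no fixed order, but average lookup by key is faster.

Key properties of unordered_map:

It stores key-value pairs
Keys are unique and cannot repeat
Keys cannot be modified; values can
The iteration order is unspecified
Lookup by key takes O(1) time on average
It supports operator[], which accesses or modifies a value by key

In unordered_map, each element is a pair<const Key, T>.

The const Key means the key cannot be modified directly.

The reason is easy to see: the hash table uses the key's hash value to decide which bucket an element goes into.

If the key were changed in place, the element might belong in a different bucket, and the hash table's structure would be corrupted.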
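
Since a key cannot be edited in place, "changing" a key in practice means removing the element and inserting it again under the new key, so that it is hashed into the right bucket. The helper below is a minimal sketch of that idea; rename_key is a hypothetical name, not a standard library function.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical helper: moves the value stored under oldKey to newKey.
// Returns false if oldKey is missing or newKey is already taken.
bool rename_key(std::unordered_map<int, std::string>& m, int oldKey, int newKey)
{
    auto pos = m.find(oldKey);
    if (pos == m.end() || m.count(newKey) != 0)
    {
        return false;
    }

    std::string value = pos->second;         // copy the value out first
    m.erase(pos);                            // remove the old element
    m.insert(std::make_pair(newKey, value)); // re-insert so it is hashed under the new key
    return true;
}
```

The element never exists with a stale key, so the bucket structure stays consistent throughout.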

Defining an unordered_map

To use unordered_map, include the header:

#include <unordered_map>

Option 1: specify the key and value types and construct an empty container

unordered_map<int, string> um1;

Option 2: copy-construct from a container of the same type

unordered_map<int, string> um2(um1);

Option 3: construct from an iterator range

vector<pair<int, string>> v{
    { 1, "one" },
    { 2, "two" },
    { 3, "three" }
};

unordered_map<int, string> um3(v.begin(), v.end());

Option 4: specify a bucket count at construction

unordered_map<int, string> um4(100);

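The 100 here is only the minimum initial bucket count; it does not limit the number of elements.

In practice you rarely need to pass a bucket count by hand, unless you already know the data size and want to cut down on growth and rehashing.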

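Common unordered_map interfaces

Commonly used member functions of unordered_map:

Member function	Purpose
insert	Inserts a key-value pair
erase	Removes an element by key or by iterator
find	Looks up an element by key
operator[]	Accesses or modifies a value by key
size	Returns the number of elements in the container
empty	Checks whether the container is empty
clear	Removes all elements
swap	Exchanges the contents of two containers
count	Counts the elements with the given key

Iterator-related interfaces:

Member function	Purpose
begin	Returns a forward iterator to the first element
end	Returns a forward iterator to the position past the last element

As with unordered_set, the iteration order of unordered_map is unspecified.

Inserting into unordered_map

unordered_map stores key-value pairs, which can be built with pair or with make_pair.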

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main()
{
    unordered_map<int, string> um;

    // Style 1: construct the pair directly
    um.insert(pair<int, string>(1, "one"));

    // Style 2: use make_pair
    um.insert(make_pair(2, "two"));
    um.insert(make_pair(3, "three"));

    for (auto e : um)
    {
        cout << "<" << e.first << "," << e.second << "> ";
    }
    cout << endl;

    return 0;
}

As with map, insert returns a pair.

first is an iterator to the element, and second says whether the insertion took place.

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main()
{
    unordered_map<int, string> um;

    auto ret1 = um.insert(make_pair(1, "one"));
    cout << ret1.second << endl; // 1

    auto ret2 = um.insert(make_pair(1, "ONE"));
    cout << ret2.second << endl; // 0
    cout << ret2.first->second << endl; // one

    return 0;
}

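The second insertion with key 1 fails.

unordered_map does not allow duplicate keys, so the existing value is not overwritten.

To overwrite an existing value, use operator[], or find the element and assign to second through the iterator.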
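
The "insert, and overwrite if the key is already there" pattern can be wrapped in a small helper. This is a sketch under the assumption that the map is unordered_map<int, string>; upsert is a hypothetical name. It reuses the iterator returned by insert, so the key is hashed only once.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical helper: inserts key/value, or overwrites the value if the key exists.
// Returns true when a new element was inserted, false when an existing one was updated.
bool upsert(std::unordered_map<int, std::string>& m, int key, const std::string& value)
{
    auto ret = m.insert(std::make_pair(key, value));
    if (!ret.second)
    {
        ret.first->second = value; // key already present: overwrite through the iterator
    }
    return ret.second;
}
```

Compared with m[key] = value, this version does not require the mapped type to be default-constructible.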

Lookup and deletion in unordered_map

find looks up an element by key.

If the key is found, it returns an iterator to that element.

If not, it returns end.

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main()
{
    unordered_map<int, string> um;

    um.insert(make_pair(1, "one"));
    um.insert(make_pair(2, "two"));
    um.insert(make_pair(3, "three"));

    auto pos = um.find(2);
    if (pos != um.end())
    {
        cout << pos->second << endl;
    }

    return 0;
}

Elements can be removed either by key or by iterator.

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main()
{
    unordered_map<int, string> um;

    um.insert(make_pair(1, "one"));
    um.insert(make_pair(2, "two"));
    um.insert(make_pair(3, "three"));

    // Erase by key; returns the number of elements removed
    size_t n = um.erase(3);
    cout << n << endl;

    // Erase by iterator
    auto pos = um.find(2);
    if (pos != um.end())
    {
        um.erase(pos);
    }

    return 0;
}

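Keys in unordered_map are unique, so erase(key) can only return 0 or 1.

Using operator[] with unordered_map

Like map, unordered_map supports operator[].

It takes a key and returns a reference to the corresponding value.

If the key exists, it returns a reference to the existing value.

If the key does not exist, it first inserts a value-initialized value and then returns a reference to it.

Its core logic can be understood roughly like this: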

mapped_type& operator[](const key_type& k)
{
    pair<iterator, bool> ret = insert(make_pair(k, mapped_type()));
    iterator it = ret.first;
    return it->second;
}

So operator[] can both modify existing elements and insert new ones.

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main()
{
    unordered_map<string, int> countMap;

    countMap["apple"] = 1;  // apple is missing: inserted first, then assigned
    countMap["apple"] += 2; // apple exists: the value is modified in place
    countMap["banana"]++;

    cout << countMap["apple"] << endl;  // 3
    cout << countMap["banana"] << endl; // 1

    return 0;
}

operator[] is well suited to counting.

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main()
{
    string str = "abacbca";
    unordered_map<char, int> countMap;

    for (char ch : str)
    {
        countMap[ch]++;
    }

    for (auto e : countMap)
    {
        cout << e.first << ":" << e.second << endl;
    }

    return 0;
}

There is one pitfall to keep in mind.

If you only want to check whether a key exists, avoid writing countMap[key].

When the key is missing, that expression inserts it.

A safer approach is find.

auto pos = countMap.find('x');
if (pos != countMap.end())
{
    cout << pos->second << endl;
}
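
The side effect described above is easy to observe by comparing the container size before and after a read. The two helpers below are hypothetical names used only for this sketch: has_key takes the map by const reference, so it cannot grow it, while growth_after_bracket_read reports how many elements a single operator[] read added.

```cpp
#include <cstddef>
#include <unordered_map>

// Looks a key up with find; the map is const here, so the lookup cannot insert anything.
bool has_key(const std::unordered_map<char, int>& m, char key)
{
    return m.find(key) != m.end();
}

// Shows the side effect of operator[]: returns how many elements the read added.
std::size_t growth_after_bracket_read(std::unordered_map<char, int>& m, char key)
{
    std::size_t before = m.size();
    int value = m[key]; // inserts key with value 0 if it was missing
    (void)value;        // the value itself is not needed for this demonstration
    return m.size() - before;
}
```

Note that operator[] is not even available on a const unordered_map, which is another reason to prefer find (or count) for pure existence checks.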

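Iterating over unordered_map and other common interfaces

unordered_map can be traversed with iterators or with a range-based for loop.

The iteration order, however, must not be treated as a sorted order.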

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main()
{
    unordered_map<int, string> um;

    um.insert(make_pair(2, "two"));
    um.insert(make_pair(1, "one"));
    um.insert(make_pair(3, "three"));

    // Iterate with an iterator
    auto it = um.begin();
    while (it != um.end())
    {
        cout << "<" << it->first << "," << it->second << "> ";
        ++it;
    }
    cout << endl;

    // Range-based for loop
    for (auto e : um)
    {
        cout << "<" << e.first << "," << e.second << "> ";
    }
    cout << endl;

    return 0;
}

Examples of other common interfaces:

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main()
{
    unordered_map<int, string> um;

    um.insert(make_pair(1, "one"));
    um.insert(make_pair(2, "two"));
    um.insert(make_pair(3, "three"));

    cout << um.size() << endl;
    cout << um.count(2) << endl;

    unordered_map<int, string> tmp;
    tmp.insert(make_pair(10, "ten"));

    um.swap(tmp);

    cout << um.size() << endl;

    um.clear();
    cout << um.empty() << endl;

    return 0;
}

unordered_multimap

unordered_multimap relates to unordered_map the way unordered_multiset relates to unordered_set.

unordered_map requires unique keys.

unordered_multimap allows duplicate keys.

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main()
{
    unordered_multimap<int, string> umm;

    umm.insert(make_pair(2, "two"));
    umm.insert(make_pair(2, "double"));
    umm.insert(make_pair(1, "one"));
    umm.insert(make_pair(3, "three"));

    for (auto e : umm)
    {
        cout << "<" << e.first << "," << e.second << "> ";
    }
    cout << endl;

    cout << umm.count(2) << endl; // 2

    return 0;
}

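Because unordered_multimap allows one key to map to several values, it does not support operator[].

The reason is the same as for multimap: with umm[2], the container might hold several elements whose key is 2, so there would be no way to tell which value's reference to return.

To get all values for a given key, use equal_range.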

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main()
{
    unordered_multimap<int, string> umm;

    umm.insert(make_pair(2, "two"));
    umm.insert(make_pair(2, "double"));
    umm.insert(make_pair(2, "second"));
    umm.insert(make_pair(1, "one"));

    auto range = umm.equal_range(2);
    while (range.first != range.second)
    {
        cout << range.first->second << " ";
        ++range.first;
    }
    cout << endl;

    return 0;
}

erase(key) needs attention here too.

For unordered_multimap, erase(key) removes every element with that key.

To remove just one of them, again use find first and then erase by iterator.
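
When a specific key-value pair has to go, find alone is not enough, because it may return any of the elements with that key. One possible approach, sketched below with the hypothetical helper erase_one, is to walk the equal_range for the key and erase the first element whose value matches.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: erases the first element found with this key AND this value.
// Returns true if an element was removed.
bool erase_one(std::unordered_multimap<int, std::string>& m, int key, const std::string& value)
{
    auto range = m.equal_range(key);
    for (auto it = range.first; it != range.second; ++it)
    {
        if (it->second == value)
        {
            m.erase(it); // erases only this element; return right away, it is now invalid
            return true;
        }
    }
    return false;
}
```

Returning immediately after erase matters: the erased iterator must not be incremented afterwards.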

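Bucket-related interfaces

The unordered containers are built on hash tables, so they also expose some bucket-related interfaces.

Tree-based containers such as set and map do not have them.

Member function	Purpose
bucket_count	Returns the number of buckets
bucket_size	Returns the number of elements in a given bucket
bucket	Returns the index of the bucket that holds a given key
load_factor	Returns the current load factor
max_load_factor	Gets or sets the maximum load factor
rehash	Sets the bucket count to at least the given value and redistributes the elements
reserve	Reserves room for at least the given number of elements

The load factor is simply:

load_factor = number of elements / number of buckets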

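The higher the load factor, the more elements each bucket holds on average, and the more collisions are likely.

As elements are added and the load factor would exceed max_load_factor, an unordered container automatically grows its bucket array and rehashes; this process is called rehash.

A rehash redistributes the existing elements into the new buckets, so it invalidates iterators (references and pointers to the elements themselves remain valid).

If you know roughly how many elements you will insert, call reserve up front to reduce repeated growth along the way.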

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main()
{
    unordered_map<string, int> dict;

    dict.reserve(1000);

    dict["apple"] = 1;
    dict["banana"] = 2;
    dict["orange"] = 3;

    cout << dict.size() << endl;
    cout << dict.bucket_count() << endl;
    cout << dict.load_factor() << endl;

    size_t bucketIndex = dict.bucket("apple");
    cout << bucketIndex << endl;
    cout << dict.bucket_size(bucketIndex) << endl;

    return 0;
}

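Everyday application code does not necessarily use these bucket interfaces directly.

Understanding them still helps, especially when investigating performance fluctuations in unordered_map.

For example, if some keys hash poorly and many elements crowd into one bucket, the average O(1) lookup slows down.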
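
The effect of reserve on rehashing can be measured with the bucket interfaces themselves. The function below is a rough sketch (count_rehashes is a hypothetical name): it inserts n integers and counts how often bucket_count() changes, with and without an up-front reserve. The exact numbers without reserve depend on the standard library implementation.

```cpp
#include <cstddef>
#include <unordered_set>

// Inserts 0..n-1 and counts how many times bucket_count() changed on the way.
// reserveFirst controls whether reserve(n) is called before inserting.
std::size_t count_rehashes(int n, bool reserveFirst)
{
    std::unordered_set<int> s;
    if (reserveFirst)
    {
        s.reserve(static_cast<std::size_t>(n));
    }

    std::size_t rehashes = 0;
    std::size_t buckets = s.bucket_count();
    for (int i = 0; i < n; ++i)
    {
        s.insert(i);
        if (s.bucket_count() != buckets)
        {
            ++rehashes; // the bucket array was replaced during this insert
            buckets = s.bucket_count();
        }
    }
    return rehashes;
}
```

With reserve the count should stay at zero for the first n insertions, since enough buckets already exist to keep the load factor within max_load_factor.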

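Custom types as keys

By default, unordered_set and unordered_map know how to handle common types such as int, string, and char.

That is because the standard library already provides std::hash specializations for them.

To use a custom type as a key, you have to tell the container two things:

How to compute a hash value for the type
How to decide whether two objects are equal

For example, define a student record where two students count as the same only by id.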

#include <iostream>
#include <string>
#include <unordered_set>
using namespace std;

struct Student
{
    int id;
    string name;
};

struct StudentHash
{
    size_t operator()(const Student& s) const
    {
        return hash<int>()(s.id);
    }
};

struct StudentEqual
{
    bool operator()(const Student& left, const Student& right) const
    {
        return left.id == right.id;
    }
};

int main()
{
    unordered_set<Student, StudentHash, StudentEqual> students;

    students.insert({ 1, "张三" });
    students.insert({ 2, "李四" });
    students.insert({ 1, "王五" });

    cout << students.size() << endl; // 2

    return 0;
}

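The third insertion uses id 1 again, so it is treated as a duplicate of the first student.

name takes no part in the equality check here; it is just ordinary data.

This also shows that a hash container does not look at hash values alone; it relies on the equality comparison as well.

Two objects with equal hash values are not necessarily equal. The reverse must hold, though: objects that compare equal must produce the same hash value.

The hash value only locates the bucket quickly; confirming that two keys are really the same still takes the equality comparison.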
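
When equality depends on more than one field, the hash should depend on the same fields. The sketch below uses a hypothetical Point type with PointHash and PointEqual functors; the way the two hashes are mixed (multiplying by 31 and adding) is an arbitrary choice for illustration, not a recommended production hash.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_set>

struct Point
{
    int x;
    int y;
};

// Equality uses both fields, so the hash must depend on both fields as well.
struct PointEqual
{
    bool operator()(const Point& a, const Point& b) const
    {
        return a.x == b.x && a.y == b.y;
    }
};

// One simple way to mix two hashes; the multiplier 31 is an arbitrary choice.
struct PointHash
{
    std::size_t operator()(const Point& p) const
    {
        std::size_t hx = std::hash<int>()(p.x);
        std::size_t hy = std::hash<int>()(p.y);
        return hx * 31 + hy;
    }
};

typedef std::unordered_set<Point, PointHash, PointEqual> PointSet;
```

Points that compare equal always hash to the same value here, because both functors read exactly the same fields.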

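Choosing between unordered and tree-based containers

Judging by interface alone, unordered_set looks a lot like set, and unordered_map looks a lot like map.

The actual choice depends on the requirements.

Requirement	Better-suited container
Ordered traversal is needed	set or map
Only fast lookup matters	unordered_set or unordered_map
Keys must be unique	unordered_set or unordered_map
Keys may repeat	unordered_multiset or unordered_multimap
One key maps to one value	unordered_map
One key maps to several values	unordered_multimap

The unordered containers suit lookup, deduplication, counting, and mapping.

For example:

To check whether a word has appeared, use unordered_set
To count character occurrences, use unordered_map
To map a student id to a student record, use unordered_map
To map one tag to several article ids, use unordered_multimap

If you need to output keys in ascending order, or need order-dependent operations such as lower_bound and upper_bound, consider the tree-based containers set and map.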
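
As a small illustration of an order-dependent query that a hash container cannot answer efficiently, the sketch below uses std::map::lower_bound to find the first element whose key is not less than a given key. first_at_or_after is a hypothetical helper name.

```cpp
#include <map>
#include <string>

// Returns the value of the first element whose key is >= key,
// or an empty string if no such element exists.
std::string first_at_or_after(const std::map<int, std::string>& m, int key)
{
    auto it = m.lower_bound(key);
    return it == m.end() ? std::string() : it->second;
}
```

unordered_map has no lower_bound member; answering the same question there would require scanning every element.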

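Summary

The unordered containers are the hash-based associative containers introduced in C++11.

They are used much like set, map, multiset, and multimap, but the underlying structure is a hash table instead of a red-black tree.

On average, lookup by key in the unordered containers takes O(1) time, which is their main appeal.

They do not maintain element order, though, so the iteration order must not be treated as a sorted order.

Finally, here are the containers side by side:

Container	Stored content	Duplicates allowed	operator[] supported
unordered_set	The key itself	No	No
unordered_multiset	The key itself	Yes	No
unordered_map	Key and value	No duplicate keys	Yes
unordered_multimap	Key and value	Duplicate keys allowed	No

When learning these containers, the important thing is to settle three questions:

Does it store single values or key-value pairs
Can keys repeat
Is ordered traversal needed

Once these three points are clear, choosing among the unordered containers comes naturally.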

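Implementing the unordered containers by hand

The previous sections covered how to use unordered_set, unordered_map, and the related containers.

This section goes one level down to the underlying implementation.

The idea is to let the underlying hash table care about only one thing: it stores data of some type T, and it can extract a key from a T.

As a result:

The T that unordered_set passes down is K
The T that unordered_map passes down is pair<const K, V>
The underlying hash table uses KeyOfT to extract the key from a T
When computing the bucket index, the hash functor Hash turns the key into an integer

This is a very common encapsulation pattern in the STL.

The upper-level container tells the lower level what the data looks like, and the lower-level container handles only the generic logic.

Starting with a plain KV hash table

A good first step is a hash table that supports only the KV model.

Each bucket holds singly linked list nodes, and each node stores a pair<K, V>.

When a hash collision occurs, the colliding nodes are chained in the same bucket's linked list.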

template<class K, class V>
struct HashNode
{
    pair<K, V> _kv;
    HashNode<K, V>* _next;

    HashNode(const pair<K, V>& kv)
        : _kv(kv)
        , _next(nullptr)
    {}
};

The hash table stores its bucket array in a vector.

Each bucket slot holds the head pointer of a linked list.

template<class K, class V>
class HashTable
{
    typedef HashNode<K, V> Node;

private:
    vector<Node*> _table;
    size_t _n = 0;
};

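Insertion breaks down into roughly these steps:

Check whether the key already exists
If duplicates are not allowed and the key exists, the insertion fails
Check the load factor and grow the table if necessary
Compute the bucket index for the key
Create a new node and push it onto the front of the bucket's list
Increment the element count

The initial version in the original write-up takes this approach.

It computes the bucket index by taking the key modulo the table size.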

size_t index = key % _table.size();

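This works for integer types such as int, but it quickly runs into a limitation.

If the key is a string, the modulo cannot be applied directly.

That is why a hash functor has to be introduced later.

Why the template parameters need to change

If only unordered_map were implemented, storing pair<K, V> in the node would be enough.

But the goal now is to reuse one hash table to build both unordered_set and unordered_map.

The data stored in a node is then no longer fixed.

An unordered_set node should store K.

An unordered_map node should store pair<const K, V>.

So the hash table's template parameters change from K and V to K and T.

K is the key type.

T is the type of the data actually stored in the hash table.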

template<class K, class T>
class HashTable;

For unordered_set:

HashTable<K, K> _ht;

For unordered_map:

HashTable<K, pair<const K, V>> _ht;

With this change, the hash node has to change as well.

It no longer stores a fixed pair<K, V>; it stores a T.

template<class T>
struct HashNode
{
    T _data;
    HashNode<T>* _next;

    HashNode(const T& data)
        : _data(data)
        , _next(nullptr)
    {}
};

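This raises a new problem.

The underlying hash table only knows that it holds a T.

But T may be K, or it may be pair<const K, V>.

Lookup, deletion, and growth all need the key, so how does the lower level extract the key from a T?

The upper-level container has to supply a functor dedicated to extracting the key.

That functor is KeyOfT.

What KeyOfT does

In unordered_set, T is itself the key.

So extracting the key simply returns the data as is.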

struct SetKeyOfT
{
    const K& operator()(const K& key)
    {
        return key;
    }
};

In unordered_map, T is pair<const K, V>.

The key is in first.

struct MapKeyOfT
{
    const K& operator()(const pair<const K, V>& kv)
    {
        return kv.first;
    }
};

Once the underlying hash table has a T, it does not need to know whether it is a K or a pair.

Calling KeyOfT is enough to get the key.

KeyOfT kot;
const K& key = kot(data);

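This step is the most important part of the whole design.

Without KeyOfT, it would be hard to serve both the set model and the map model with a single hash table.

string keys cannot be used with modulo directly

In the first version of the code, the bucket index was computed like this: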

size_t index = key % _table.size();

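If the key is an int, this works.

But containers such as unordered_map<string, int> are very common.

A string cannot be taken modulo an integer.

So the hash table must not assume that % works on the key.

A more general approach is to provide a hash functor that converts keys of different types to size_t.

Integer types can simply be cast.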

template<class K>
struct HashFunc
{
    size_t operator()(const K& key) const
    {
        return (size_t)key;
    }
};

For string, add a dedicated specialization.

template<>
struct HashFunc<string>
{
    size_t operator()(const string& str) const
    {
        size_t hash = 0;
        for (auto ch : str)
        {
            hash = hash * 131 + ch;
        }
        return hash;
    }
};

With this in place, the bucket index is computed by hashing the key first and then taking the result modulo the bucket count.

Hash hash;
size_t index = hash(key) % _tables.size();

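Note that 131 is not the only possible multiplier; it is just a common choice.

Hash implementations in real standard libraries are more sophisticated. This is only a hand-written model, and it is enough to understand the idea.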
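
To see why the multiplier is there at all, it helps to compare it with a naive hash that only adds up the characters. In the sketch below, sum_hash and mul_hash are hypothetical free functions written for this comparison; mul_hash follows the same multiply-by-131 scheme as the HashFunc<string> specialization above.

```cpp
#include <cstddef>
#include <string>

// Naive hash: just adds the characters, so any permutation of the same letters collides.
std::size_t sum_hash(const std::string& s)
{
    std::size_t h = 0;
    for (unsigned char ch : s)
    {
        h += ch;
    }
    return h;
}

// Multiplicative hash as in the text: the position of each character now affects the result.
std::size_t mul_hash(const std::string& s)
{
    std::size_t h = 0;
    for (unsigned char ch : s)
    {
        h = h * 131 + ch;
    }
    return h;
}
```

For "ab" and "ba", sum_hash produces the same value, while mul_hash produces 97 * 131 + 98 and 98 * 131 + 97, which differ.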

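Why grow to a prime number of buckets

When the hash table grows, the new bucket count should not be chosen arbitrarily.

A poorly chosen bucket count can make a group of keys more likely to cluster in certain buckets.

The original write-up uses a table of primes and, on each growth, picks a prime larger than the current bucket count.

The purpose is to spread the data as evenly as possible.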

size_t GetNextPrime(size_t prime)
{
    static const size_t primeList[] =
    {
        53ul, 97ul, 193ul, 389ul, 769ul,
        1543ul, 3079ul, 6151ul, 12289ul, 24593ul,
        49157ul, 98317ul, 196613ul, 393241ul, 786433ul,
        1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul,
        50331653ul, 100663319ul, 201326611ul, 402653189ul,
        805306457ul, 1610612741ul, 3221225473ul, 4294967291ul
    };

    size_t count = sizeof(primeList) / sizeof(primeList[0]);
    for (size_t i = 0; i < count; ++i)
    {
        if (primeList[i] > prime)
        {
            return primeList[i];
        }
    }

    return primeList[count - 1];
}

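The code above also fixes a boundary case.

If the loop finishes without finding a larger prime, it returns the last value in the table.

Do not write return primeList[i] there, because at that point i may already equal the array length.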
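
Because the prime table is sorted, the linear scan can also be expressed with std::upper_bound, which makes the "past the end" case explicit. This is an alternative sketch, not the version used in the rest of this article; next_prime is a hypothetical name, and the table is shortened to its first seven entries for brevity.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Returns the first table entry strictly greater than current,
// or the last entry if current is already at or beyond the end of the table.
std::size_t next_prime(std::size_t current)
{
    // Shortened prime table for illustration; the values are the first entries of the table above.
    static const std::size_t primes[] = { 53, 97, 193, 389, 769, 1543, 3079 };

    const std::size_t* last = std::end(primes);
    // upper_bound finds the first entry strictly greater than current
    const std::size_t* pos = std::upper_bound(std::begin(primes), last, current);
    return pos == last ? *(last - 1) : *pos;
}
```

The fallback branch plays the same role as the final return statement in GetNextPrime: it never reads past the end of the array.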

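How insertion and growth are implemented

Before inserting, check whether the element already exists.

Neither unordered_set nor unordered_map allows duplicate keys, so if it exists the insertion reports failure.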

iterator ret = Find(kot(data));
if (ret != end())
{
    return make_pair(ret, false);
}

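Next, check whether the table needs to grow.

Here the table grows when the load factor reaches 1.

That is, when the number of elements equals the number of buckets, a new bucket array is allocated.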

if (_n == _tables.size())
{
    size_t newSize = GetNextPrime(_tables.size());
    vector<Node*> newTables(newSize, nullptr);

    for (size_t i = 0; i < _tables.size(); ++i)
    {
        Node* cur = _tables[i];
        while (cur)
        {
            Node* next = cur->_next;
            size_t index = hash(kot(cur->_data)) % newSize;

            cur->_next = newTables[index];
            newTables[index] = cur;

            cur = next;
        }

        _tables[i] = nullptr;
    }

    _tables.swap(newTables);
}

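This does not allocate new nodes; it unlinks the existing nodes and relinks them into the new buckets.

That avoids a lot of object copying.

After growing, the new node is pushed onto the front of its bucket.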

size_t index = hash(kot(data)) % _tables.size();
Node* newNode = new Node(data);

newNode->_next = _tables[index];
_tables[index] = newNode;

++_n;
return make_pair(iterator(newNode, this), true);

How lookup and deletion are implemented

Lookup first computes the bucket index from the key.

Then it only has to search the linked list of that bucket.

iterator Find(const K& key)
{
    if (_tables.empty())
    {
        return end();
    }

    Hash hash;
    KeyOfT kot;
    size_t index = hash(key) % _tables.size();

    Node* cur = _tables[index];
    while (cur)
    {
        if (kot(cur->_data) == key)
        {
            return iterator(cur, this);
        }
        cur = cur->_next;
    }

    return end();
}

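Deletion is similar.

The difference is that deleting from a singly linked list requires tracking the previous node.

If the node to delete is the first one in the bucket, update the bucket head directly.

If it is in the middle, make the previous node skip over it.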

bool Erase(const K& key)
{
    if (_tables.empty())
    {
        return false;
    }

    Hash hash;
    KeyOfT kot;
    size_t index = hash(key) % _tables.size();

    Node* prev = nullptr;
    Node* cur = _tables[index];

    while (cur)
    {
        if (kot(cur->_data) == key)
        {
            if (prev == nullptr)
            {
                _tables[index] = cur->_next;
            }
            else
            {
                prev->_next = cur->_next;
            }

            delete cur;
            --_n;
            return true;
        }

        prev = cur;
        cur = cur->_next;
    }

    return false;
}

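This is a real deletion: the node is freed.

After deletion, any iterator pointing to that node must no longer be used.

Filling in the special member functions

The hash table owns nodes allocated with new, so the compiler-generated destructor is not enough.

Otherwise the nodes would never be freed, which is a memory leak.

The destructor has to delete every node in every bucket's list.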

void Clear()
{
    for (size_t i = 0; i < _tables.size(); ++i)
    {
        Node* cur = _tables[i];
        while (cur)
        {
            Node* next = cur->_next;
            delete cur;
            cur = next;
        }

        _tables[i] = nullptr;
    }

    _n = 0;
}

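The copy constructor cannot be the default shallow copy either.

If two hash tables pointed to the same nodes, destruction would free them twice.

A simple approach is to insert the other table's data again, one element at a time.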

HashTable(const HashTable& ht)
{
    _tables.resize(ht._tables.size(), nullptr);

    for (size_t i = 0; i < ht._tables.size(); ++i)
    {
        Node* cur = ht._tables[i];
        while (cur)
        {
            Insert(cur->_data);
            cur = cur->_next;
        }
    }
}

The assignment operator can use the pass-by-value-and-swap idiom.

HashTable& operator=(HashTable ht)
{
    _tables.swap(ht._tables);
    swap(_n, ht._n);
    return *this;
}

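This is convenient: the by-value parameter ht is a copy of the right-hand side, and after the swap the old resources are released automatically when ht is destroyed.

How the forward iterator advances

A hash table iterator is more involved than a vector iterator.

The elements are not stored contiguously.

They are spread across different buckets, and each bucket is a linked list.

So incrementing the iterator has two cases:

If the current node has a successor, move to the next node in the current bucket
If the current bucket is exhausted, look for the next non-empty bucket after it

Besides the current node pointer, the iterator also stores a pointer to the hash table.

It needs access to the bucket array when crossing buckets.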

template<class K, class T, class KeyOfT, class Hash>
struct HashIterator
{
    typedef HashNode<T> Node;
    typedef HashIterator<K, T, KeyOfT, Hash> Self;

    Node* _node;
    HashTable<K, T, KeyOfT, Hash>* _pht;

    HashIterator(Node* node, HashTable<K, T, KeyOfT, Hash>* pht)
        : _node(node)
        , _pht(pht)
    {}

    T& operator*()
    {
        return _node->_data;
    }

    T* operator->()
    {
        return &_node->_data;
    }
};

The core logic of operator++ is as follows:

Self& operator++()
{
    if (_node->_next)
    {
        _node = _node->_next;
    }
    else
    {
        KeyOfT kot;
        Hash hash;

        size_t index = hash(kot(_node->_data)) % _pht->_tables.size();
        ++index;

        while (index < _pht->_tables.size())
        {
            if (_pht->_tables[index])
            {
                _node = _pht->_tables[index];
                return *this;
            }

            ++index;
        }

        _node = nullptr;
    }

    return *this;
}

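When no non-empty bucket remains, _node is set to nullptr.

That means the iterator has reached end.

Wrapping unordered_set

unordered_set only needs to pass the key down to the underlying hash table.

Its KeyOfT returns the key itself.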

template<class K, class Hash = HashFunc<K>>
class unordered_set
{
public:
    struct SetKeyOfT
    {
        const K& operator()(const K& key)
        {
            return key;
        }
    };

    typedef typename HashTable<K, K, SetKeyOfT, Hash>::iterator iterator;

    pair<iterator, bool> insert(const K& key)
    {
        return _ht.Insert(key);
    }

    bool erase(const K& key)
    {
        return _ht.Erase(key);
    }

    iterator find(const K& key)
    {
        return _ht.Find(key);
    }

    size_t count(const K& key)
    {
        return find(key) != end();
    }

    iterator begin()
    {
        return _ht.begin();
    }

    iterator end()
    {
        return _ht.end();
    }

private:
    HashTable<K, K, SetKeyOfT, Hash> _ht;
};

count here only returns 0 or 1.

That is because unordered_set does not allow duplicates.

Wrapping unordered_map

The underlying table of unordered_map stores pair<const K, V>.

Its KeyOfT extracts first from the pair.

template<class K, class V, class Hash = HashFunc<K>>
class unordered_map
{
public:
    typedef pair<const K, V> value_type;

    struct MapKeyOfT
    {
        const K& operator()(const value_type& kv)
        {
            return kv.first;
        }
    };

    typedef typename HashTable<K, value_type, MapKeyOfT, Hash>::iterator iterator;

    pair<iterator, bool> insert(const value_type& kv)
    {
        return _ht.Insert(kv);
    }

    V& operator[](const K& key)
    {
        pair<iterator, bool> ret = _ht.Insert(value_type(key, V()));
        return ret.first->second;
    }

    iterator find(const K& key)
    {
        return _ht.Find(key);
    }

    bool erase(const K& key)
    {
        return _ht.Erase(key);
    }

    iterator begin()
    {
        return _ht.begin();
    }

    iterator end()
    {
        return _ht.end();
    }

private:
    HashTable<K, value_type, MapKeyOfT, Hash> _ht;
};

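operator[] is written much like the one discussed earlier for the STL map.

It first tries to insert a value-initialized value.

If the key already exists, insert adds no new node but returns an iterator to the existing element.

Finally it returns a reference to the value.

So this unordered_map can also use operator[] for counting.

The complete code after wrapping

The following puts the hash table, the forward iterator, unordered_set, and unordered_map together.

To avoid name clashes with the standard library, everything is placed in the my_unordered namespace.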

#include <iostream>
#include <vector>
#include <string>
#include <utility>
using namespace std;

namespace my_unordered
{
    template<class T>
    struct HashNode
    {
        T _data;
        HashNode<T>* _next;

        HashNode(const T& data)
            : _data(data)
            , _next(nullptr)
        {}
    };

    template<class K>
    struct HashFunc
    {
        size_t operator()(const K& key) const
        {
            return (size_t)key;
        }
    };

    template<>
    struct HashFunc<string>
    {
        size_t operator()(const string& str) const
        {
            size_t hash = 0;
            for (auto ch : str)
            {
                hash = hash * 131 + ch;
            }
            return hash;
        }
    };

    template<class K, class T, class KeyOfT, class Hash>
    class HashTable;

    template<class K, class T, class KeyOfT, class Hash>
    struct HashIterator
    {
        typedef HashNode<T> Node;
        typedef HashIterator<K, T, KeyOfT, Hash> Self;

        Node* _node;
        HashTable<K, T, KeyOfT, Hash>* _pht;

        HashIterator(Node* node, HashTable<K, T, KeyOfT, Hash>* pht)
            : _node(node)
            , _pht(pht)
        {}

        T& operator*()
        {
            return _node->_data;
        }

        T* operator->()
        {
            return &_node->_data;
        }

        Self& operator++()
        {
            if (_node->_next)
            {
                _node = _node->_next;
            }
            else
            {
                KeyOfT kot;
                Hash hash;

                size_t index = hash(kot(_node->_data)) % _pht->_tables.size();
                ++index;

                while (index < _pht->_tables.size())
                {
                    if (_pht->_tables[index])
                    {
                        _node = _pht->_tables[index];
                        return *this;
                    }

                    ++index;
                }

                _node = nullptr;
            }

            return *this;
        }

        bool operator!=(const Self& s) const
        {
            return _node != s._node;
        }

        bool operator==(const Self& s) const
        {
            return _node == s._node;
        }
    };

    template<class K, class T, class KeyOfT, class Hash>
    class HashTable
    {
        typedef HashNode<T> Node;

        friend struct HashIterator<K, T, KeyOfT, Hash>;

    public:
        typedef HashIterator<K, T, KeyOfT, Hash> iterator;

        HashTable() = default;

        HashTable(const HashTable& ht)
        {
            _tables.resize(ht._tables.size(), nullptr);

            for (size_t i = 0; i < ht._tables.size(); ++i)
            {
                Node* cur = ht._tables[i];
                while (cur)
                {
                    Insert(cur->_data);
                    cur = cur->_next;
                }
            }
        }

        HashTable& operator=(HashTable ht)
        {
            _tables.swap(ht._tables);
            swap(_n, ht._n);
            return *this;
        }

        ~HashTable()
        {
            Clear();
        }

        iterator begin()
        {
            for (size_t i = 0; i < _tables.size(); ++i)
            {
                if (_tables[i])
                {
                    return iterator(_tables[i], this);
                }
            }

            return end();
        }

        iterator end()
        {
            return iterator(nullptr, this);
        }

        pair<iterator, bool> Insert(const T& data)
        {
            KeyOfT kot;
            Hash hash;

            iterator ret = Find(kot(data));
            if (ret != end())
            {
                return make_pair(ret, false);
            }

            if (_n == _tables.size())
            {
                size_t newSize = GetNextPrime(_tables.size());
                vector<Node*> newTables(newSize, nullptr);

                for (size_t i = 0; i < _tables.size(); ++i)
                {
                    Node* cur = _tables[i];
                    while (cur)
                    {
                        Node* next = cur->_next;
                        size_t index = hash(kot(cur->_data)) % newSize;

                        cur->_next = newTables[index];
                        newTables[index] = cur;

                        cur = next;
                    }

                    _tables[i] = nullptr;
                }

                _tables.swap(newTables);
            }

            size_t index = hash(kot(data)) % _tables.size();
            Node* newNode = new Node(data);

            newNode->_next = _tables[index];
            _tables[index] = newNode;

            ++_n;
            return make_pair(iterator(newNode, this), true);
        }

        iterator Find(const K& key)
        {
            if (_tables.empty())
            {
                return end();
            }

            Hash hash;
            KeyOfT kot;
            size_t index = hash(key) % _tables.size();

            Node* cur = _tables[index];
            while (cur)
            {
                if (kot(cur->_data) == key)
                {
                    return iterator(cur, this);
                }

                cur = cur->_next;
            }

            return end();
        }

        bool Erase(const K& key)
        {
            if (_tables.empty())
            {
                return false;
            }

            Hash hash;
            KeyOfT kot;
            size_t index = hash(key) % _tables.size();

            Node* prev = nullptr;
            Node* cur = _tables[index];

            while (cur)
            {
                if (kot(cur->_data) == key)
                {
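                    if (prev == nullptr)
                    {
                        // cur is the head of the bucket: the bucket now starts at the next node.
                        _tables[index] = cur->_next;
                    }
                    else
                    {
                        // cur is in the middle or at the tail: bypass it.
                        prev->_next = cur->_next;
                    }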

                    delete cur;
                    --_n;
                    return true;
                }

                prev = cur;
                cur = cur->_next;
            }

            return false;
        }

        void Clear()
        {
            for (size_t i = 0; i < _tables.size(); ++i)
            {
                Node* cur = _tables[i];
                while (cur)
                {
                    Node* next = cur->_next;
                    delete cur;
                    cur = next;
                }

                _tables[i] = nullptr;
            }

            _n = 0;
        }

        size_t Size() const
        {
            return _n;
        }

        bool Empty() const
        {
            return _n == 0;
        }

    private:
        size_t GetNextPrime(size_t prime)
        {
            static const size_t primeList[] =
            {
                53ul, 97ul, 193ul, 389ul, 769ul,
                1543ul, 3079ul, 6151ul, 12289ul, 24593ul,
                49157ul, 98317ul, 196613ul, 393241ul, 786433ul,
                1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul,
                50331653ul, 100663319ul, 201326611ul, 402653189ul,
                805306457ul, 1610612741ul, 3221225473ul, 4294967291ul
            };

            size_t count = sizeof(primeList) / sizeof(primeList[0]);
            for (size_t i = 0; i < count; ++i)
            {
                if (primeList[i] > prime)
                {
                    return primeList[i];
                }
            }

            // Already at or beyond the largest prime in the table: keep that size.
            return primeList[count - 1];
        }

    private:
        vector<Node*> _tables;
        size_t _n = 0;
    };

    template<class K, class Hash = HashFunc<K>>
    class unordered_set
    {
    public:
        struct SetKeyOfT
        {
            const K& operator()(const K& key)
            {
                return key;
            }
        };

        typedef typename HashTable<K, K, SetKeyOfT, Hash>::iterator iterator;

        pair<iterator, bool> insert(const K& key)
        {
            return _ht.Insert(key);
        }

        bool erase(const K& key)
        {
            return _ht.Erase(key);
        }

        iterator find(const K& key)
        {
            return _ht.Find(key);
        }

        size_t count(const K& key)
        {
            return find(key) != end();
        }

        void clear()
        {
            _ht.Clear();
        }

        size_t size() const
        {
            return _ht.Size();
        }

        bool empty() const
        {
            return _ht.Empty();
        }

        iterator begin()
        {
            return _ht.begin();
        }

        iterator end()
        {
            return _ht.end();
        }

    private:
        HashTable<K, K, SetKeyOfT, Hash> _ht;
    };

    template<class K, class V, class Hash = HashFunc<K>>
    class unordered_map
    {
    public:
        typedef pair<const K, V> value_type;

        struct MapKeyOfT
        {
            const K& operator()(const value_type& kv)
            {
                return kv.first;
            }
        };

        typedef typename HashTable<K, value_type, MapKeyOfT, Hash>::iterator iterator;

        pair<iterator, bool> insert(const value_type& kv)
        {
            return _ht.Insert(kv);
        }

        bool erase(const K& key)
        {
            return _ht.Erase(key);
        }

        iterator find(const K& key)
        {
            return _ht.Find(key);
        }

        size_t count(const K& key)
        {
            return find(key) != end();
        }

        V& operator[](const K& key)
        {
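            // Insert a value-initialized V if the key is absent; if it is present, Insert
            // returns the existing element. Either way, hand back a reference to the value.
            pair<iterator, bool> ret = _ht.Insert(value_type(key, V()));
            return ret.first->second;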
        }

        void clear()
        {
            _ht.Clear();
        }

        size_t size() const
        {
            return _ht.Size();
        }

        bool empty() const
        {
            return _ht.Empty();
        }

        iterator begin()
        {
            return _ht.begin();
        }

        iterator end()
        {
            return _ht.end();
        }

    private:
        HashTable<K, value_type, MapKeyOfT, Hash> _ht;
    };
}
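GetNextPrime scans its 28-entry table linearly, which is perfectly adequate at that size. Because the table is sorted, the same lookup can also be phrased with std::upper_bound. The following is only a sketch under assumptions: the name next_prime_sketch and the shortened table are hypothetical and are not part of the listing above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>

// Hypothetical helper, not part of the article's HashTable: returns the first
// table entry strictly greater than `current`, or the largest entry when
// `current` is already at or beyond the end of the table.
inline std::size_t next_prime_sketch(std::size_t current)
{
    // Shortened copy of the prime table; it must stay sorted in ascending order.
    static const std::size_t primes[] = { 53, 97, 193, 389, 769, 1543, 3079, 6151 };

    const std::size_t* first = std::begin(primes);
    const std::size_t* last = std::end(primes);
    assert(std::is_sorted(first, last));  // upper_bound requires sorted input

    // upper_bound finds the first element that compares greater than `current`.
    const std::size_t* it = std::upper_bound(first, last, current);
    return it != last ? *it : *(last - 1);
}
```

With this shortened table, next_prime_sketch(53) yields 97, and any argument at or above 6151 stays at 6151, mirroring the fallback at the end of GetNextPrime.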

A quick test of the wrappers

Test unordered_set first:


int main()
{
    my_unordered::unordered_set<int> s;

    s.insert(1);
    s.insert(3);
    s.insert(3);
    s.insert(2);
    s.insert(5);

    for (auto e : s)
    {
        cout << e << " ";
    }
    cout << endl;

    cout << s.count(3) << endl;   // 1: the duplicate insert of 3 was rejected, one copy remains
    cout << s.count(10) << endl;  // 0: 10 was never inserted

    s.erase(3);

    for (auto e : s)
    {
        cout << e << " ";
    }
    cout << endl;

    return 0;
}
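One way to sanity-check the hand-written container is to run the same sequence of operations against std::unordered_set and compare the observable results. The sketch below does that with the standard container only; SetDemoResult and run_set_demo are hypothetical names introduced for illustration. Iteration order is deliberately not checked, since it is unspecified for unordered containers.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Hypothetical summary type used only by this sketch.
struct SetDemoResult
{
    bool duplicateInserted;        // did the second insert(3) add an element?
    std::size_t sizeAfterInserts;
    std::size_t count3;
    std::size_t count10;
    std::size_t sizeAfterErase;
};

// Replays the test above on std::unordered_set and records what happened.
inline SetDemoResult run_set_demo()
{
    std::unordered_set<int> s;
    SetDemoResult r{};

    s.insert(1);
    s.insert(3);
    r.duplicateInserted = s.insert(3).second;  // second insert of the same key
    s.insert(2);
    s.insert(5);

    r.sizeAfterInserts = s.size();
    r.count3 = s.count(3);
    r.count10 = s.count(10);

    s.erase(3);
    r.sizeAfterErase = s.size();
    assert(s.find(3) == s.end());  // 3 is gone after erase

    return r;
}
```

The wrapper built in this article is expected to report the same values for the same operations: the duplicate insert fails, count returns 1 or 0, and erase removes exactly one element.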

Then test unordered_map:


int main()
{
    my_unordered::unordered_map<string, int> dict;

    dict.insert(make_pair("apple", 1));
    dict.insert(make_pair("banana", 2));
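    dict.insert(make_pair("apple", 10));  // fails: "apple" already exists, its value stays 1

    dict["orange"] = 3;    // key absent: inserted with a default value, then assigned 3
    dict["banana"] += 5;   // key present: modifies the existing value, 2 becomes 7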

    for (auto& kv : dict)
    {
        cout << kv.first << ":" << kv.second << endl;
    }

    auto pos = dict.find("banana");
    if (pos != dict.end())
    {
        cout << pos->second << endl;  // 7
    }

    return 0;
}

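Here the second insert of apple does not overwrite the original value,

because insert fails when the key already exists.

dict["banana"] += 5, on the other hand, does modify the value, because operator[] returns a reference to the mapped value.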
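The difference between insert and operator[] described above matches the behavior of std::unordered_map, so it can be checked against the standard container. This is a minimal sketch using only the standard library; run_map_demo is a hypothetical name introduced for illustration.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Replays the test above on std::unordered_map and returns the final contents.
inline std::unordered_map<std::string, int> run_map_demo()
{
    std::unordered_map<std::string, int> dict;

    dict.insert(std::make_pair("apple", 1));
    dict.insert(std::make_pair("banana", 2));

    // Key already present: insert reports failure and leaves the value alone.
    auto ret = dict.insert(std::make_pair("apple", 10));
    assert(!ret.second && ret.first->second == 1);

    dict["orange"] = 3;   // key absent: inserted as 0, then assigned 3
    dict["banana"] += 5;  // key present: reference to the existing 2, now 7

    return dict;
}
```

After the call, the map holds three entries: apple maps to 1, banana to 7, and orange to 3. The hand-written unordered_map is expected to end up with the same contents, although the traversal order may differ.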

Summary

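The hash table resolves collisions by separate chaining
A vector<Node*> stores the bucket array
Each bucket holds a singly linked list
On growth, existing nodes are remapped into the new buckets
A Hash functor handles keys such as string that cannot be taken modulo directly
KeyOfT extracts the key from whatever type T is
operator++ on the iterator must be able to move from the current bucket to the next non-empty one
unordered_set and unordered_map are thin wrappers over the same HashTable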
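Of the points above, the iterator step is the one that is easiest to get wrong, so here is a stripped-down sketch of it. It assumes non-negative int keys hashed as key % bucket count, and it passes the bucket array explicitly instead of storing a pointer to the table inside the iterator as the article's iterator(node, this) does. SketchNode, next_node, and collect_keys are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal node holding a non-negative int key.
struct SketchNode
{
    int key;
    SketchNode* next;
};

// Returns the node that follows `cur` in iteration order, or nullptr at the end.
inline const SketchNode* next_node(const std::vector<SketchNode*>& buckets,
                                   const SketchNode* cur)
{
    assert(cur != nullptr && !buckets.empty());

    // Case 1: the current chain still has nodes left.
    if (cur->next)
        return cur->next;

    // Case 2: chain exhausted, so recompute the bucket and scan forward
    // for the next non-empty bucket.
    std::size_t index = static_cast<std::size_t>(cur->key) % buckets.size();
    for (std::size_t i = index + 1; i < buckets.size(); ++i)
    {
        if (buckets[i])
            return buckets[i];
    }
    return nullptr;  // no more elements: this plays the role of end()
}

// Walks the whole table and collects the keys in iteration order.
inline std::vector<int> collect_keys(const std::vector<SketchNode*>& buckets)
{
    std::vector<int> keys;
    const SketchNode* cur = nullptr;
    for (SketchNode* head : buckets)  // same idea as begin(): first non-empty bucket
    {
        if (head) { cur = head; break; }
    }
    while (cur)
    {
        keys.push_back(cur->key);
        cur = next_node(buckets, cur);
    }
    return keys;
}
```

The bucket index has to be recomputed from the current key because a node does not record which bucket it lives in; that is why the iterator needs access to the table in addition to the node pointer.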

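Seen this way, unordered_set and unordered_map are not two completely independent implementations.

What they actually share is the underlying hash table.

They differ only in the element type passed down to the hash table, in how the key is extracted from it, and slightly in the interface they expose (only unordered_map provides operator[], for example).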
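To make that division of labor concrete, the sketch below computes a bucket index through the same two customization points, KeyOfT and Hash, for both a set-style element and a map-style element. All names ending in Sketch, and bucket_index itself, are hypothetical. The string hash is an assumed BKDR-style hash with multiplier 131, which may differ from the HashFunc specialization used earlier in the article.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Assumed string hash (BKDR-style); the article's own HashFunc may differ.
struct StringHashSketch
{
    std::size_t operator()(const std::string& s) const
    {
        std::size_t h = 0;
        for (unsigned char ch : s)
            h = h * 131 + ch;
        return h;
    }
};

// Integers can be used as the hash value directly.
struct IntHashSketch
{
    std::size_t operator()(int key) const { return static_cast<std::size_t>(key); }
};

// Set-style element: T is the key itself.
struct SetKeyOfTSketch
{
    const int& operator()(const int& key) const { return key; }
};

// Map-style element: T is pair<const K, V>, and the key is its first member.
struct MapKeyOfTSketch
{
    const std::string& operator()(const std::pair<const std::string, int>& kv) const
    {
        return kv.first;
    }
};

// The shared table only ever sees T; KeyOfT and Hash supply everything else.
template<class T, class KeyOfT, class Hash>
std::size_t bucket_index(const T& data, std::size_t bucketCount)
{
    assert(bucketCount != 0);
    KeyOfT kot;
    Hash hash;
    return hash(kot(data)) % bucketCount;
}
```

For instance, bucket_index<int, SetKeyOfTSketch, IntHashSketch>(55, 53) evaluates to 2, while the map-style call hashes only the string key and ignores the mapped value entirely.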

That is one of the more interesting aspects of how STL containers are layered.