C++ Hash Tables

Hash Table Implementation in Detail

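1. Hashing Concepts

Hashing is a way of organizing data. A hash function maps each key to a storage location; a lookup recomputes that location from the key with the same function and goes straight to it, which is what makes lookups fast.

1.1 Direct Addressing

When the keys fall in a small, dense range, direct addressing is a simple and efficient method. For example, if every key lies in [0, 99], we can allocate an array of 100 elements and use each key's value directly as its array index.

Example: finding the first unique character in a string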

#include <string>
using namespace std;

class Solution {
public:
    int firstUniqChar(string s) {
        // Use (letter's ASCII code - 'a') as the array index; assumes lowercase a-z only
        int count[26] = {0};
        
        for(auto ch : s) {
            count[ch - 'a']++;
        }
        
        for(size_t i = 0; i < s.size(); ++i) {
            if(count[s[i] - 'a'] == 1)
                return i;
        }
        return -1;
    }
};

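Drawback of direct addressing: when the keys are spread over a wide range, it wastes memory and may need more memory than is available.

1.2 Hash Collisions

Two different keys may map to the same location; this is called a hash collision. Ideally a good hash function would avoid collisions altogether, but in practice they are unavoidable, so we need a strategy for resolving them.

1.3 Load Factor

Load factor = N / M

Here N is the number of elements stored in the hash table and M is the size of the table.

Higher load factor: collisions are more likely, space utilization is higher
Lower load factor: collisions are less likely, space utilization is lower

1.4 Converting Keys to Integers

Mapping a key to an array index is easiest when the key is an integer. A key of any other type must first be converted to an integer.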

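2. Hash Functions

2.1 Division Method (Remainder Method)

Formula: h(key) = key % M

Recommendation: choose M to be a prime that is not too close to a power of 2

A practical variation: Java's HashMap uses a power of 2 for M so that the modulo becomes a cheap bit mask. To compensate, it first XORs the high 16 bits of the hash into the low 16 bits, so that all bits of the key influence the index and the distribution stays even.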
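
The power-of-two variation can be sketched in C++ as follows. This is only an illustration of the idea, assuming a 32-bit hash value and a table size that is a nonzero power of two; the names `mix_bits` and `index_for` are made up for this example and are not part of any library.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative helper: fold the high 16 bits into the low 16 bits so that
// they still influence the index after the mask below drops the high bits.
inline std::uint32_t mix_bits(std::uint32_t h) {
    return h ^ (h >> 16);
}

// Illustrative helper: map a 32-bit hash to a slot.
// Assumes table_size is a nonzero power of two, so (table_size - 1) is an
// all-ones mask and `& (table_size - 1)` is equivalent to `% table_size`.
inline std::size_t index_for(std::uint32_t h, std::size_t table_size) {
    return mix_bits(h) & (table_size - 1);
}
```

Without the mixing step, the hashes 0x00010000 and 0x00020000 would both land in slot 0 of a 16-slot table, because they differ only in their high bits. With it, they land in slots 1 and 2.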

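2.2 Multiplication Method (for reference)

Formula: h(key) = floor(M × ((A × key) % 1.0))

Here A ∈ (0, 1), and (A × key) % 1.0 is the fractional part of A × key. Knuth suggests the golden-ratio value A = (√5 - 1)/2 ≈ 0.6180339887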
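
A direct floating-point transcription of this formula might look like the sketch below. The function name `multiplicative_hash` is an assumption made for illustration. Because it relies on `double` arithmetic it loses precision for very large keys, so treat it as a way to check the formula by hand rather than as production code.

```cpp
#include <cmath>
#include <cstddef>

// Illustrative helper: h(key) = floor(M * frac(A * key)), A = (sqrt(5) - 1) / 2.
inline std::size_t multiplicative_hash(unsigned long key, std::size_t M) {
    const double A = (std::sqrt(5.0) - 1.0) / 2.0;
    const double product = A * static_cast<double>(key);
    const double frac = product - std::floor(product);  // (A * key) % 1.0
    return static_cast<std::size_t>(std::floor(static_cast<double>(M) * frac));
}
```

For key = 1234 and M = 1024: A × key ≈ 762.6539, the fractional part is ≈ 0.6539, and 0.6539 × 1024 ≈ 669.6, so the key maps to slot 669.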

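2.3 Universal Hashing (for reference)

Universal hashing adds randomness to the hash function, which protects against attacks that use maliciously constructed data sets.

Formula: h_{ab}(key) = ((a × key + b) % P) % M

P: a sufficiently large prime
a ∈ [1, P-1], b ∈ [0, P-1], both chosen at random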
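
The formula can be sketched as a small function object. The type name `UniversalHash` and its `random` factory are hypothetical names for this example. The sketch assumes that a × key + b fits in 64 bits, which holds when both P and the keys are below 2^31.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>

// Illustrative type: h_ab(key) = ((a * key + b) % P) % M.
struct UniversalHash {
    std::uint64_t a, b, P;
    std::size_t M;

    // Fixed parameters, handy for reproducing a worked example.
    UniversalHash(std::uint64_t a_, std::uint64_t b_, std::uint64_t P_, std::size_t M_)
        : a(a_), b(b_), P(P_), M(M_) {}

    // Draw a from [1, P-1] and b from [0, P-1] at random.
    static UniversalHash random(std::uint64_t P, std::size_t M, std::mt19937_64& rng) {
        std::uniform_int_distribution<std::uint64_t> pick_a(1, P - 1);
        std::uniform_int_distribution<std::uint64_t> pick_b(0, P - 1);
        const std::uint64_t a = pick_a(rng);
        const std::uint64_t b = pick_b(rng);
        return UniversalHash(a, b, P, M);
    }

    // Assumes a * key + b does not overflow 64 bits.
    std::size_t operator()(std::uint64_t key) const {
        return static_cast<std::size_t>(((a * key + b) % P) % M);
    }
};
```

With P = 17, M = 6, a = 3, b = 4, the key 8 hashes to ((3 × 8 + 4) % 17) % 6 = (28 % 17) % 6 = 11 % 6 = 5. Note that a table must keep using the same a and b for as long as it holds data; choosing new ones means rehashing every element.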

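3. Resolving Hash Collisions

3.1 Open Addressing

Every element is stored in the table itself; on a collision, a probing rule finds the next free slot. The load factor must therefore stay below 1.

Linear probing

Starting from the colliding slot, probe forward one slot at a time, wrapping around at the end of the table:

h(key, i) = (hash0 + i) % M

Here hash0 is the key's original hash position and i = 1, 2, ..., M - 1.

Drawback: prone to clustering, where runs of occupied slots build up and lengthen later probes.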

Example: mapping {19, 30, 5, 36, 13, 20, 21, 12} into a table with M = 11

Key	Hash value (key % 11)
19	8
30	8 (collision)
5	5
36	3
13	2
20	9
21	10
12	1
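
The table above lists only the initial hash values. To see where linear probing finally places each key, the insertions can be simulated with a small sketch. The function name `linear_probe_slots` is made up for this example, and it assumes non-negative keys and no more keys than slots.

```cpp
#include <cstddef>
#include <vector>

// Illustrative helper: returns the final slot of each key, in insertion order.
// Assumes keys are non-negative and keys.size() <= M, so a free slot always exists.
inline std::vector<std::size_t> linear_probe_slots(const std::vector<int>& keys,
                                                   std::size_t M) {
    std::vector<bool> used(M, false);
    std::vector<std::size_t> slots;
    for (int key : keys) {
        const std::size_t hash0 = static_cast<std::size_t>(key) % M;
        std::size_t i = 0;
        while (used[(hash0 + i) % M]) {
            ++i;  // h(key, i) = (hash0 + i) % M
        }
        const std::size_t slot = (hash0 + i) % M;
        used[slot] = true;
        slots.push_back(slot);
    }
    return slots;
}
```

For the keys above this yields the slots 8, 9, 5, 3, 2, 10, 0, 1. Key 30 is pushed from 8 to 9, which in turn pushes 20 from 9 to 10 and 21 from 10 around to 0. Only one pair of keys shares a hash value, yet three keys end up displaced; that chain reaction is the clustering effect.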

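Quadratic probing

Probe in quadratic jumps to either side of the colliding slot:

h(key, i) = (hash0 ± i²) % M

This alleviates the clustering problem of linear probing to some extent.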
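
A minimal sketch of this probing order (+1², -1², +2², -2², ...) is shown below. The function name `quadratic_probe_slots`, the probe limit of M / 2 rounds, and the use of M as a "could not place" marker are all assumptions of this example. Negative intermediate values are wrapped back into [0, M) by adding M.

```cpp
#include <cstddef>
#include <vector>

// Illustrative helper: returns the final slot of each key, or M if no free
// slot was found within M / 2 probe rounds. Assumes non-negative keys.
inline std::vector<std::size_t> quadratic_probe_slots(const std::vector<int>& keys,
                                                      std::size_t M) {
    const long long m = static_cast<long long>(M);
    std::vector<bool> used(M, false);
    std::vector<std::size_t> slots;
    for (int key : keys) {
        const long long hash0 = key % m;
        long long slot = hash0;
        bool placed = !used[static_cast<std::size_t>(slot)];
        for (long long i = 1; !placed && i <= m / 2; ++i) {
            const long long step = (i * i) % m;
            const long long plus = (hash0 + step) % m;
            const long long minus = ((hash0 - step) % m + m) % m;  // wrap negatives
            if (!used[static_cast<std::size_t>(plus)]) {
                slot = plus;
                placed = true;
            } else if (!used[static_cast<std::size_t>(minus)]) {
                slot = minus;
                placed = true;
            }
        }
        if (placed) {
            used[static_cast<std::size_t>(slot)] = true;
            slots.push_back(static_cast<std::size_t>(slot));
        } else {
            slots.push_back(M);  // marker: not placed
        }
    }
    return slots;
}
```

For the keys {19, 30, 52, 63, 11, 22} with M = 11, the first four all hash to 8 and are placed at 8, 9, 7, and 1; 11 goes to 0, and 22 (also hashing to 0) finds 1 taken and wraps around to 10.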

3.2 Open Addressing: Code Implementation

namespace open_address {
    enum State { EXIST, EMPTY, DELETE };
    
    template<class K, class V>
    struct HashData {
        pair<K, V> _kv;
        State _state = EMPTY;
    };
    
    template<class K, class V, class Hash = HashFunc<K>>
    class HashTable {
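    public:
        // Start with the smallest prime size so that `% _tables.size()` is never `% 0`
        // (__stl_next_prime is defined in section 3.6)
        HashTable() {
            _tables.resize(__stl_next_prime(0));
        }
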
        bool Insert(const pair<K, V>& kv) {
            if (Find(kv.first)) return false;
            
            // Grow once the load factor reaches 0.7
            if (_n * 10 / _tables.size() >= 7) {
                HashTable<K, V, Hash> newHT;
                newHT._tables.resize(__stl_next_prime(_tables.size() + 1));
                for (size_t i = 0; i < _tables.size(); i++) {
                    if (_tables[i]._state == EXIST) {
                        newHT.Insert(_tables[i]._kv);
                    }
                }
                _tables.swap(newHT._tables);
            }
            
            Hash hash;
            size_t hash0 = hash(kv.first) % _tables.size();
            size_t hashi = hash0;
            size_t i = 1;
            
            while (_tables[hashi]._state == EXIST) {
                hashi = (hash0 + i) % _tables.size(); // linear probing
                ++i;
            }
            
            _tables[hashi]._kv = kv;
            _tables[hashi]._state = EXIST;
            ++_n;
            return true;
        }
        
        HashData<K, V>* Find(const K& key) {
            Hash hash;
            size_t hash0 = hash(key) % _tables.size();
            size_t hashi = hash0;
            size_t i = 1;
            
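            // Stop at an EMPTY slot, or after probing every slot once: DELETE
            // markers could otherwise fill the table and make this loop endless
            while (_tables[hashi]._state != EMPTY && i <= _tables.size()) {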
                if (_tables[hashi]._state == EXIST && 
                    _tables[hashi]._kv.first == key) {
                    return &_tables[hashi];
                }
                hashi = (hash0 + i) % _tables.size();
                ++i;
            }
            return nullptr;
        }
        
        bool Erase(const K& key) {
            HashData<K, V>* ret = Find(key);
            if (ret == nullptr) return false;
            ret->_state = DELETE;
            --_n;
            return true;
        }
        
    private:
        vector<HashData<K, V>> _tables;
        size_t _n = 0;  // number of elements stored in the table
    };
}
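
The three-state design above (EXIST, EMPTY, DELETE) matters because a probe stops at the first EMPTY slot. The sketch below shows what goes wrong if an erased slot is simply reset to empty. `TinyProbeSet` is a hypothetical, deliberately simplified fixed-size set of non-negative ints written for this illustration; it is not the article's HashTable, and it skips duplicate checks and resizing.

```cpp
#include <cstddef>
#include <vector>

enum class SlotState { Empty, Exist, Deleted };

// Illustrative type: fixed-size set of non-negative ints with linear probing.
class TinyProbeSet {
public:
    explicit TinyProbeSet(std::size_t m) : keys_(m, 0), states_(m, SlotState::Empty) {}

    // Assumes the table never becomes completely full.
    void insert(int key) {
        std::size_t idx = home(key);
        while (states_[idx] == SlotState::Exist) {
            idx = (idx + 1) % keys_.size();
        }
        keys_[idx] = key;
        states_[idx] = SlotState::Exist;
    }

    bool contains(int key) const { return locate(key) != keys_.size(); }

    // use_tombstone == true marks the slot Deleted; false resets it to Empty.
    void erase(int key, bool use_tombstone) {
        const std::size_t idx = locate(key);
        if (idx == keys_.size()) return;
        states_[idx] = use_tombstone ? SlotState::Deleted : SlotState::Empty;
    }

private:
    std::size_t home(int key) const {
        return static_cast<std::size_t>(key) % keys_.size();
    }

    // Returns the slot holding key, or keys_.size() if the key is absent.
    std::size_t locate(int key) const {
        const std::size_t m = keys_.size();
        for (std::size_t i = 0; i < m; ++i) {
            const std::size_t idx = (home(key) + i) % m;
            if (states_[idx] == SlotState::Empty) return m;  // probe chain ends
            if (states_[idx] == SlotState::Exist && keys_[idx] == key) return idx;
        }
        return m;
    }

    std::vector<int> keys_;
    std::vector<SlotState> states_;
};
```

With 11 slots, 19 lands in slot 8 and 30 is pushed to slot 9. If 19 is erased by resetting slot 8 to Empty, a later search for 30 starts at slot 8, sees Empty, and wrongly reports that 30 is absent. Marking slot 8 as Deleted keeps the probe chain intact, so 30 is still found.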

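3.3 Hashing String Keys

When the key is a string, it must first be converted to an integer. The HashFunc template below has to be declared before the HashTable templates that use it as their default Hash argument: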

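// Default: for keys that can be cast directly to size_t (integer types)
template<class K>
struct HashFunc {
    size_t operator()(const K& key) const {
        return (size_t)key;
    }
};

// Specialization for string: BKDR hash
template<>
struct HashFunc<string> {
    size_t operator()(const string& key) const {
        size_t hash = 0;
        for (auto e : key) {
            hash *= 131;  // multiply by a prime before adding each character; 31 also works
            hash += e;
        }
        return hash;
    }
};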
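
To see why the multiplication step matters, the sketch below compares a BKDR-style hash with a plain sum of character codes. The free functions `sum_hash` and `bkdr_hash` are illustrative names for this example; `bkdr_hash` follows the same multiply-by-131-then-add scheme as the specialization above, but reads each character as `unsigned char`.

```cpp
#include <cstddef>
#include <string>

// Illustrative helper: naive hash that ignores character order.
inline std::size_t sum_hash(const std::string& s) {
    std::size_t h = 0;
    for (unsigned char c : s) {
        h += c;
    }
    return h;
}

// Illustrative helper: BKDR-style hash, h = h * 131 + c for each character.
inline std::size_t bkdr_hash(const std::string& s) {
    std::size_t h = 0;
    for (unsigned char c : s) {
        h = h * 131 + c;
    }
    return h;
}
```

The strings "abcd" and "bcad" contain the same characters, so `sum_hash` gives them the same value, while `bkdr_hash` tells them apart because each character's contribution depends on its position. For a quick hand check, `bkdr_hash("ab")` is 97 × 131 + 98 = 12805.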

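3.4 Separate Chaining (Hash Buckets)

With separate chaining, elements are no longer stored in the table itself. Each slot holds a pointer, and the elements that collide at a slot are linked into a list hanging off that slot.

Advantages:

The load factor may exceed 1
Collisions are simpler to resolve

Growth policy: grow when the load factor reaches 1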
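
The same policy can be observed in the standard library: `std::unordered_map` is bucket-based and its `max_load_factor()` defaults to 1.0. The sketch below records the highest load factor seen while inserting; the function name `max_observed_load_factor` is an assumption made for this example.

```cpp
#include <cstddef>
#include <unordered_map>

// Illustrative helper: inserts n distinct keys and returns the largest
// load factor observed right after any single insertion.
inline float max_observed_load_factor(std::size_t n) {
    std::unordered_map<int, int> m;
    float worst = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        m[static_cast<int>(i)] = 0;
        if (m.load_factor() > worst) {
            worst = m.load_factor();
        }
    }
    return worst;
}
```

On common implementations the returned value stays at or below 1.0, because the container adds buckets and rehashes before the load factor can pass its maximum. Calling `m.max_load_factor(x)` with a different value changes that threshold.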

3.5 Separate Chaining: Code Implementation

namespace hash_bucket {
    template<class K, class V>
    struct HashNode {
        pair<K, V> _kv;
        HashNode<K, V>* _next;
        
        HashNode(const pair<K, V>& kv)
            : _kv(kv), _next(nullptr) {}
    };
    
    template<class K, class V, class Hash = HashFunc<K>>
    class HashTable {
        typedef HashNode<K, V> Node;
        
    public:
        HashTable() {
            _tables.resize(__stl_next_prime(0), nullptr);
        }
        
        ~HashTable() {
            for (size_t i = 0; i < _tables.size(); i++) {
                Node* cur = _tables[i];
                while (cur) {
                    Node* next = cur->_next;
                    delete cur;
                    cur = next;
                }
                _tables[i] = nullptr;
            }
        }
        
        bool Insert(const pair<K, V>& kv) {
            Hash hs;
            
            // Grow when the load factor reaches 1
            if (_n == _tables.size()) {
                HashTable<K, V, Hash> newHT;
                newHT._tables.resize(__stl_next_prime(_tables.size() + 1), nullptr);
                
                for (size_t i = 0; i < _tables.size(); i++) {
                    Node* cur = _tables[i];
                    while (cur) {
                        newHT.Insert(cur->_kv);
                        cur = cur->_next;
                    }
                }
                _tables.swap(newHT._tables);
            }
            
            size_t hashi = hs(kv.first) % _tables.size();
            
            // Reject the insert if the key is already present
            Node* cur = _tables[hashi];
            while (cur) {
                if (cur->_kv.first == kv.first)
                    return false;
                cur = cur->_next;
            }
            
            // Insert at the head of the bucket's list
            Node* newnode = new Node(kv);
            newnode->_next = _tables[hashi];
            _tables[hashi] = newnode;
            ++_n;
            return true;
        }
        
        Node* Find(const K& key) {
            Hash hs;
            size_t hashi = hs(key) % _tables.size();
            Node* cur = _tables[hashi];
            while (cur) {
                if (cur->_kv.first == key)
                    return cur;
                cur = cur->_next;
            }
            return nullptr;
        }
        
        bool Erase(const K& key) {
            Hash hs;
            size_t hashi = hs(key) % _tables.size();
            Node* prev = nullptr;
            Node* cur = _tables[hashi];
            
            while (cur) {
                if (cur->_kv.first == key) {
                    if (prev == nullptr)
                        _tables[hashi] = cur->_next;
                    else
                        prev->_next = cur->_next;
                    delete cur;
                    --_n;
                    return true;
                }
                prev = cur;
                cur = cur->_next;
            }
            return false;
        }
        
    private:
        vector<Node*> _tables;  // array of bucket head pointers
        size_t _n = 0;          // number of elements stored in the table
    };
}

3.6 Prime Table (Used When Growing)

#include <algorithm>  // lower_bound; like the other listings, this assumes `using namespace std;`

inline unsigned long __stl_next_prime(unsigned long n) {
    static const int __stl_num_primes = 28;
    static const unsigned long __stl_prime_list[] = {
        53, 97, 193, 389, 769,
        1543, 3079, 6151, 12289, 24593,
        49157, 98317, 196613, 393241, 786433,
        1572869, 3145739, 6291469, 12582917, 25165843,
        50331653, 100663319, 201326611, 402653189, 805306457,
        1610612741, 3221225473, 4294967291
    };
    
    const unsigned long* first = __stl_prime_list;
    const unsigned long* last = __stl_prime_list + __stl_num_primes;
    const unsigned long* pos = lower_bound(first, last, n);
    return pos == last ? *(last - 1) : *pos;
}

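4. Optimizing for Extreme Cases

When a single bucket grows very long, Java 8's HashMap applies the following optimization:

When a bucket's chain grows past the threshold (8) and the table has at least 64 slots, the linked list is converted into a red-black tree
Lookups within that bucket improve from O(n) to O(log n)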

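5. Summary

Method	Advantages	Disadvantages	Best suited for
Direct addressing	Simple and fast	Wastes memory	Keys in a small, dense range
Open addressing	Simple to implement	Prone to clustering	Low load factors
Separate chaining	High space utilization	Extra pointer per node	General use

Recommendations:

In practice, separate chaining (hash buckets) is the usual choice
Use the division method as the hash function, with a prime M
Convert string keys with the BKDR hash
Keep the load factor between 0.7 and 1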