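C++ && Data Structures: Hash Tables

I. The unordered containers

[1.1 About the unordered containers](#1.1 About the unordered containers)

[1.2 unordered_set](#1.2 unordered_set)

[1.4 unordered_map](#1.4 unordered_map)

[1.5 Performance comparison](#1.5 Performance comparison)

II. Hashing

[2.1 The idea of hashing](#2.1 The idea of hashing)

[2.2 Common hash functions](#2.2 Common hash functions)

[2.3 Hash collisions and how to resolve them](#2.3 Hash collisions and how to resolve them)

[2.3.1 Closed hashing](#2.3.1 Closed hashing)

[2.3.2 Open hashing](#2.3.2 Open hashing)

[2.4 Resizing a hash table](#2.4 Resizing a hash table)

[2.4.1 Resizing with closed hashing](#2.4.1 Resizing with closed hashing)

[2.4.2 Resizing with open hashing](#2.4.2 Resizing with open hashing)

III. Implementing a hash table

[3.1 The mapping function](#3.1 The mapping function)

[3.2 Closed-hashing hash table](#3.2 Closed-hashing hash table)

[3.3 Open-hashing hash table](#3.3 Open-hashing hash table)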

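I. The unordered containers

1.1 About the unordered containers

In C++98 the STL provided a family of associative containers built on red-black trees, with O(log N) lookup. When a tree holds a very large number of nodes, even that is not ideal, so C++11 added the unordered family of containers, which use a hash table underneath and bring the average lookup cost down to O(1).

1.2 unordered_set

unordered_set is used in much the same way as the set introduced earlier, except that traversal does not visit the elements in sorted order. For example: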

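// Headers used by the snippets in this section
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <set>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
using namespace std;

void test_unordered_set()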
{
	unordered_set<int> s;
	s.insert(2);
	s.insert(3);
	s.insert(1);
	s.insert(2);
	s.insert(5);

	unordered_set<int>::iterator it = s.begin();
	while (it != s.end())
	{
		cout << *it << " ";
		++it;
	}
	cout << endl;
}
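
The loop above prints four values even though five were inserted, because the second `2` is rejected. A minimal sketch of how to observe that from code: `insert_all` is a hypothetical helper written for this illustration, while `insert`, `count`, and `size` are standard `std::unordered_set` members.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical helper: inserts every value and returns how many were new.
// insert() returns pair<iterator, bool>; the bool is false for a duplicate.
std::size_t insert_all(std::unordered_set<int>& s, const std::vector<int>& values)
{
	std::size_t inserted = 0;
	for (int v : values)
	{
		if (s.insert(v).second)
			++inserted;
	}
	return inserted;
}
```

Calling `insert_all(s, {2, 3, 1, 2, 5})` on an empty set should report four new elements. The order in which a later traversal visits them is unspecified and may differ between standard library implementations.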

1.4 unordered_map

void test_unordered_map()
{
	string arr[] = { "watermelon", "watermelon", "apple", "watermelon", "apple", "apple", "watermelon", "apple", "banana", "apple", "banana", "pear" };
	unordered_map<string, int> countMap;
	for (auto& e : arr)
	{
		countMap[e]++;
	}

	for (auto& kv : countMap)
	{
		cout << kv.first << ":" << kv.second << endl;
	}
}

1.5 Performance comparison

void test_op() // performance comparison
{
	// generate n random numbers
	int n = 100000;
	vector<int> v;
	v.reserve(n);
	srand(time(0));
	for (int i = 0; i < n; ++i)
	{
		// store the n random numbers in a vector
		//v.push_back(i);       // sorted input
		//v.push_back(rand());  // many duplicates
		v.push_back(rand() + i);  // few duplicates
	}
	size_t begin1 = clock();
	set<int> s;
	for (auto e : v)
	{
		s.insert(e); // insert into set first
	}
	size_t end1 = clock(); 

	size_t begin2 = clock();
	unordered_set<int> us;
	for (auto e : v)
	{
		us.insert(e); // then insert into unordered_set
	}
	size_t end2 = clock();

	cout << "size:" << s.size() << endl;

	cout << "set insert:" << end1 - begin1 << endl; // clock ticks spent inserting into set
	cout << "unordered_set insert:" << end2 - begin2 << endl; // clock ticks spent inserting into unordered_set
	cout << endl;

	size_t begin3 = clock();
	for (auto e : v)
	{
		s.find(e);
	}
	size_t end3 = clock();

	size_t begin4 = clock();
	for (auto e : v)
	{
		us.find(e);
	}
	size_t end4 = clock();
	// compare lookup performance
	cout << "set find:" << end3 - begin3 << endl;
	cout << "unordered_set find:" << end4 - begin4 << endl;
	cout << endl;

	size_t begin5 = clock();
	for (auto e : v)
	{
		s.erase(e);
	}
	size_t end5 = clock();

	size_t begin6 = clock();
	for (auto e : v)
	{
		us.erase(e);
	}
	size_t end6 = clock();
	// compare erase performance
	cout << "set erase:" << end5 - begin5 << endl;
	cout << "unordered_set erase:" << end6 - begin6 << endl;

	// unordered_map supports the same insert interface as map
	unordered_map<string, int> countMap;
	countMap.insert(make_pair("apple", 1));
	// Across these scenarios the unordered containers perform better overall, find in particular
}

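II. Hashing

2.1 The idea of hashing

Hashing is, at its core, a design approach.

In sequential structures and balanced trees there is no direct relationship between an element's key and its storage position, so **finding an element requires multiple key comparisons: O(N) for a sequential search and O(log N) in a balanced tree.** Lookup cost depends on how many comparisons the search makes.

To make lookups faster, the ideal is a search that needs no comparisons at all and fetches the element from the table in a single step. If we build a storage structure in which some function (HashFunc) maps each element's key to its storage position, ideally one-to-one, then a lookup can go almost straight to the data.

In such a structure:

**① Inserting an element:** apply the function to the element's key to compute its storage position, and store the element there.

**② Finding an element:** apply the same function to the key being searched for to get a storage position, then compare keys at that position to see whether the element is present.

This approach is called hashing, the function that computes the position is called a hash function, and the resulting structure is called a hash table.

For example: the data set { 1,6,7,4,5,9 } with the hash function hash(key) = key % capacity.

Searching this way avoids repeated key comparisons, so it is fast.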
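
The example above can be sketched in a few lines. This is an illustration only: the capacity of 10 is an assumption (the text does not state it), the names `kCapacity`, `slot_of`, `build_table`, and `contains` are made up here, keys are assumed to be non-negative, and the sketch only works for a set with no collisions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: capacity is 10; the text leaves it unspecified.
const std::size_t kCapacity = 10;

// hash(key) = key % capacity
std::size_t slot_of(int key)
{
	return static_cast<std::size_t>(key) % kCapacity;
}

// Stores each key at table[slot_of(key)]; -1 marks an empty slot.
// Only valid for a collision-free set such as { 1,6,7,4,5,9 }.
std::vector<int> build_table(const std::vector<int>& keys)
{
	std::vector<int> table(kCapacity, -1);
	for (int k : keys)
		table[slot_of(k)] = k;
	return table;
}

// A lookup is one index computation plus one key comparison.
bool contains(const std::vector<int>& table, int key)
{
	return table[slot_of(key)] == key;
}
```

With these assumptions, 7 is found at index 7 directly, while 17 maps to the same index but fails the key comparison, which is why step ② still compares keys.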

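2.2 Common hash functions

1. Direct addressing

Use a linear function of the key as the hash address: Hash(key) = A*key + B.

The advantage is simplicity; the drawback is that the distribution of the keys must be known in advance. It suits keys that fall in a small, contiguous range.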
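
A typical use of direct addressing is counting letters, where the key range is known to be 26 contiguous values. The sketch below takes A = 1 and B = -'a'; the function name `count_letters` is made up for this illustration, and input outside lowercase ASCII is simply skipped.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Direct addressing with A = 1, B = -'a': Hash(c) = c - 'a'.
// Assumes lowercase ASCII letters; any other character is ignored.
std::array<int, 26> count_letters(const std::string& text)
{
	std::array<int, 26> counts{};
	for (char c : text)
	{
		if (c >= 'a' && c <= 'z')
			++counts[static_cast<std::size_t>(c - 'a')];
	}
	return counts;
}
```

Every key gets its own slot, so there are no collisions, but the table must be as large as the whole key range. That is why the method only fits small, dense ranges.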

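2. Division method (remainder)

Let m be the number of addresses the hash table allows. Choose as the divisor a prime p that is no greater than m and as close to m as possible (or equal to it), and map keys to hash addresses with Hash(key) = key % p (p <= m).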
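
One way to pick p is a simple downward search with trial division, sketched below. The helper names (`is_prime`, `largest_prime_at_most`, `division_hash`) are hypothetical, and trial division is chosen only for brevity; the implementation later in this post uses a precomputed prime table instead.

```cpp
#include <cassert>
#include <cstddef>

// Trial division; fine for table-sized numbers.
bool is_prime(std::size_t n)
{
	if (n < 2)
		return false;
	for (std::size_t d = 2; d * d <= n; ++d)
	{
		if (n % d == 0)
			return false;
	}
	return true;
}

// Largest prime p with p <= m; returns 0 when m < 2 (no such prime).
std::size_t largest_prime_at_most(std::size_t m)
{
	for (std::size_t p = m; p >= 2; --p)
	{
		if (is_prime(p))
			return p;
	}
	return 0;
}

// Hash(key) = key % p. Assumes p != 0.
std::size_t division_hash(std::size_t key, std::size_t p)
{
	return key % p;
}
```

For m = 10 this picks p = 7, so the key 44 lands at address 2.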

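2.3 Hash collisions and how to resolve them

In the example above, inserting 44 would compute the same position as 4 (with a capacity of 10, for instance, both 4 % 10 and 44 % 10 are 4).

When different keys pass through the same hash function and produce the same address, the result is called a hash collision.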
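
A quick way to see collisions is to count how many keys land in each slot. The sketch below assumes non-negative keys and a capacity greater than zero; `slot_counts` is a name made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many keys map to each slot under hash(key) = key % capacity.
// Any slot whose count is above 1 has a collision.
std::vector<std::size_t> slot_counts(const std::vector<int>& keys, std::size_t capacity)
{
	std::vector<std::size_t> counts(capacity, 0);
	for (int k : keys)
		++counts[static_cast<std::size_t>(k) % capacity];
	return counts;
}
```

With the keys { 1,6,7,4,5,9,44 } and an assumed capacity of 10, slot 4 ends up with a count of 2.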

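2.3.1 Closed hashing

Closed hashing is also called open addressing. When a collision occurs and the hash table is not full, there must be an empty slot somewhere in the table, so the key can be stored in the "next" empty slot after the collision position. Finding that next slot by stepping forward one position at a time is called **linear probing**.

Taking the example above, inserting 44 causes a collision, so we probe forward from the collision position one slot at a time until we reach an empty slot. This is the technique the insert function uses.

Deletion needs care. With closed hashing an element cannot simply be removed from the table, because that would break lookups of other elements whose probe sequence passes through that slot. Instead a pseudo-deletion is used: the slot is marked with a DELETE flag. See the implementation section below for details.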
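
The following is a small sketch of why the DELETE mark matters, not the implementation from later in this post. It assumes a fixed capacity (no resizing) and non-negative int keys; `ProbeSet`, `Slot`, and `index_of` are names invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Slot { Empty, Exist, Deleted };

// Fixed-capacity sketch: no resizing, non-negative int keys only.
struct ProbeSet
{
	std::vector<int> keys;
	std::vector<Slot> state;

	explicit ProbeSet(std::size_t capacity)
		: keys(capacity, 0), state(capacity, Slot::Empty) {}

	// Returns the index holding key, or keys.size() if it is absent.
	std::size_t index_of(int key) const
	{
		std::size_t n = keys.size();
		std::size_t start = static_cast<std::size_t>(key) % n;
		for (std::size_t step = 0; step < n; ++step)
		{
			std::size_t i = (start + step) % n;
			if (state[i] == Slot::Empty)
				return n; // an empty slot ends the probe
			if (state[i] == Slot::Exist && keys[i] == key)
				return i; // deleted slots are stepped over, never matched
		}
		return n;
	}

	bool contains(int key) const { return index_of(key) != keys.size(); }

	bool insert(int key)
	{
		if (contains(key))
			return false;
		std::size_t n = keys.size();
		std::size_t start = static_cast<std::size_t>(key) % n;
		for (std::size_t step = 0; step < n; ++step)
		{
			std::size_t i = (start + step) % n;
			if (state[i] != Slot::Exist) // Empty and Deleted slots can be reused
			{
				keys[i] = key;
				state[i] = Slot::Exist;
				return true;
			}
		}
		return false; // table is full
	}

	bool erase(int key)
	{
		std::size_t i = index_of(key);
		if (i == keys.size())
			return false;
		state[i] = Slot::Deleted; // mark, do not reset to Empty
		return true;
	}
};
```

With capacity 10 and the keys { 1,6,7,4,5,9 } already stored, 44 probes from slot 4 and settles in slot 8. If erasing 4 reset slot 4 to Empty, a later search for 44 would stop at slot 4 and wrongly report it missing; the Deleted mark lets the probe continue.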

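2.3.2 Open hashing

Open hashing is also called separate chaining. The hash function first computes an address for each key; keys with the same address are linked together in a singly linked list, each list is called a bucket, and the head of each list is stored in the hash table.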
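
A compact sketch of the idea, using `std::forward_list` in place of a hand-written singly linked list. It assumes a fixed bucket count and non-negative int keys; `ChainSet` and `bucket_of` are names made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <vector>

// Fixed bucket count, non-negative int keys.
struct ChainSet
{
	std::vector<std::forward_list<int>> buckets;

	explicit ChainSet(std::size_t bucket_count) : buckets(bucket_count) {}

	std::size_t bucket_of(int key) const
	{
		return static_cast<std::size_t>(key) % buckets.size();
	}

	bool contains(int key) const
	{
		for (int k : buckets[bucket_of(key)])
		{
			if (k == key)
				return true;
		}
		return false;
	}

	bool insert(int key)
	{
		if (contains(key))
			return false;
		buckets[bucket_of(key)].push_front(key); // insert at the head of the chain
		return true;
	}

	bool erase(int key)
	{
		auto& chain = buckets[bucket_of(key)];
		// Walk with a trailing iterator, because a singly linked list
		// can only erase the node after a known position.
		for (auto prev = chain.before_begin(), cur = chain.begin();
			cur != chain.end(); ++prev, ++cur)
		{
			if (*cur == key)
			{
				chain.erase_after(prev);
				return true;
			}
		}
		return false;
	}
};
```

Colliding keys such as 4, 44, and 14 simply share bucket 4, and erasing one of them never disturbs lookups of the others, so no deletion marks are needed.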

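2.4 Resizing a hash table

2.4.1 Resizing with closed hashing

The load factor of a hash table is defined as: i = number of elements in the table / length of the table

Because the table length is fixed, i is proportional to the number of elements in the table. A larger load factor therefore means more of the table is filled and collisions are more likely; a smaller load factor means collisions are less likely.

For open addressing the load factor should be kept strictly below roughly 0.7 to 0.8. Beyond 0.8, probe sequences grow long and the cost of a lookup rises steeply. Some hash libraries therefore cap the load factor and resize the table once it is exceeded; Java's HashMap, for example, uses a default load factor of 0.75 (although HashMap itself resolves collisions by chaining). See the implementation section below for how resizing is done.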
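
The resize check can be written without floating point. The sketch below mirrors the condition used by the closed-hashing implementation later in this post (threshold 0.7); the function name `needs_resize` is made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// True when size / capacity >= 0.7, computed in integers as
// 10 * size / capacity >= 7. An empty table (capacity 0) must always grow.
bool needs_resize(std::size_t size, std::size_t capacity)
{
	return capacity == 0 || 10 * size / capacity >= 7;
}
```

Multiplying before dividing matters: `size / capacity` in integer arithmetic would be 0 for any table that is not completely full.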

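2.4.2 Resizing with open hashing

The number of buckets is fixed, so as elements keep being inserted each bucket holds more and more of them. In the extreme case a single bucket's list becomes very long and hurts lookup performance, so the table needs to grow.

In the best case every bucket holds exactly one node, and any further insertion is then guaranteed to cause a collision. The table is therefore resized when the number of elements equals the number of buckets, that is, when the load factor reaches 1.

Note: if the table really cannot be resized, yet many keys hash to the same address, each bucket can hold a red-black tree instead of a singly linked list.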
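
The standard unordered containers expose the same quantities through `load_factor()`, `max_load_factor()`, and `bucket_count()`, and their default maximum load factor is 1.0. The sketch below observes that behavior from the outside; `stays_within_max_load` is a hypothetical helper, and the check reflects what common standard library implementations do after each insertion.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Inserts 0..n-1 and reports whether the container kept
// load_factor() <= max_load_factor() after every insertion.
bool stays_within_max_load(std::size_t n)
{
	std::unordered_set<std::size_t> s;
	for (std::size_t i = 0; i < n; ++i)
	{
		s.insert(i);
		if (s.load_factor() > s.max_load_factor())
			return false;
	}
	return true;
}
```

If rehashing during a bulk load is a concern, calling `reserve(n)` up front asks the container to allocate enough buckets for n elements in one step.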

III. Implementing a hash table

3.1 The mapping function

template<class K>
struct HashFunc
{
	size_t operator()(const K& key)
	{
		return (size_t)key;
	}
};

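// Specialization: ordinary types use the primary template above, string uses this one
template<>
struct HashFunc<string>
{
	// converting a string to an integer is a very common need
	size_t operator()(const string& key)
	{
		// BKDR algorithm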
		size_t val = 0;
		for (auto ch : key)
		{
			val *= 131;
			val += ch;
		}
		return val;
	}
};
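
To see why the multiplier matters, compare a BKDR-style hash with a naive sum of characters. This is an illustrative sketch: `sum_hash` and `bkdr_hash` are names made up here, and unlike `HashFunc<string>` above they read each character as `unsigned char`, which gives the same result for ASCII input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Naive hash: the sum of the character values. Anagrams always collide.
std::size_t sum_hash(const std::string& key)
{
	std::size_t val = 0;
	for (unsigned char ch : key)
		val += ch;
	return val;
}

// BKDR-style hash with multiplier 131, the same recurrence as HashFunc<string>.
std::size_t bkdr_hash(const std::string& key)
{
	std::size_t val = 0;
	for (unsigned char ch : key)
		val = val * 131 + ch;
	return val;
}
```

Because each step multiplies the running value before adding the next character, the position of a character affects the result, so "eat" and "ate" hash differently under `bkdr_hash` while `sum_hash` cannot tell them apart.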

3.2 Closed-hashing hash table

enum State // slot state flag; needed so that deleting an element does not break probing
{
	EMPTY,
	EXIST,
	DELETE
};

template<class K, class V>
struct HashData
{
	pair<K, V> _kv;
	State _state = EMPTY;
};

template<class K, class V, class Hash = HashFunc<K>> // Hash is a functor that converts a key to size_t
class HashTable
{
public:
	bool Insert(const pair<K, V>& kv)
	{
		if (Find(kv.first))
			return false;

		if (_tables.size() == 0 || 10 * _size / _tables.size() >= 7) // grow once the load factor reaches 0.7
		{
			// Alternative: rehash by hand into a new vector
			//size_t newsize = _tables.size() == 0 ? 10 : _tables.size() * 2;
			//vector<HashData<K, V>> newtables(newsize);
			//Hash hash;
			//// walk the old table and remap every element into the new one
			//for (auto& data : _tables)
			//{
			//	if (data._state == EXIST)
			//	{
			//		// recompute the position in the new table
			//		size_t hashi = hash(data._kv.first) % newtables.size();
			//		size_t i = 1;
			//		size_t index = hashi;
			//		while (newtables[index]._state == EXIST)
			//		{
			//			index = hashi + i;
			//			index %= newtables.size();
			//			++i;
			//		}
			//		newtables[index]._kv = data._kv;
			//		newtables[index]._state = EXIST;
			//	}
			//}
			//_tables.swap(newtables);

			size_t newSize = _tables.size() == 0 ? 10 : _tables.size() * 2;
			HashTable<K, V, Hash> newHT; // same Hash as this table
			newHT._tables.resize(newSize); // allocate the new slots up front
			for (auto& e : _tables) // walk the old table
			{
				if (e._state == EXIST)
				{
					newHT.Insert(e._kv); // simply reuse Insert
				}
			}
			_tables.swap(newHT._tables);
		}

		// linear probing
		Hash hash;
		size_t hashi = hash(kv.first) % _tables.size();
		while (_tables[hashi]._state == EXIST)
		{
			// Look for a free slot, starting at the mapped position (not at index 0)
			// and wrapping around to the front after reaching the end.
			hashi++;
			hashi %= _tables.size();
		}
		_tables[hashi]._kv = kv;
		_tables[hashi]._state = EXIST;
		++_size;

		// Drawback of linear probing: when many keys collide around one position they
		// take each other's slots, collisions cluster, and performance drops.
		// Quadratic probing: probe at start + i^2 (i >= 0) instead
		//size_t start = hashi, i = 0;
		//while (_tables[hashi]._state == EXIST)
		//{
		//	++i;
		//	hashi = start + i * i;
		//	hashi %= _tables.size();
		//}
		//_tables[hashi]._kv = kv;
		//_tables[hashi]._state = EXIST;
		//++_size;

		return true;
	}

	HashData<K, V>* Find(const K& key)
	{
		if (_tables.size() == 0) // empty table: nothing to find
		{
			return nullptr;
		}

		Hash hash;
		size_t start = hash(key) % _tables.size();
		size_t hashi = start;
		while (_tables[hashi]._state != EMPTY)
		{
			if (_tables[hashi]._state != DELETE && _tables[hashi]._kv.first == key) // check both the state and the key
			{
				return &_tables[hashi]; // found: return the address of this slot
			}
			else
			{
				hashi++;
				hashi %= _tables.size();

				if (hashi == start) // wrapped all the way around: after many erases and inserts no slot may be EMPTY
					break;
			}
		}
		return nullptr;
	}

	bool Erase(const K& key)
	{
		HashData<K, V>* ret = Find(key);
		if (ret)
		{
			ret->_state = DELETE;
			--_size;
			return true;
		}
		else
		{
			return false; // the key to erase is not in the table
		}
	}

	void Print()
	{
		for (size_t i = 0; i < _tables.size(); i++)
		{
			if (_tables[i]._state == EXIST)
			{
				cout << "[" << i << ":" << _tables[i]._kv.first << "]"; // stream output works for size_t and any printable key type
			}
			else
			{
				cout << "[" << i << ":*]";
			}
		}
		cout << endl;
	}
private:
	vector<HashData<K, V>> _tables;
	size_t _size = 0; // number of valid elements stored
};

Test code:

void TestHashTable1()
{
	int a[] = { 1,11,4,15,26,7,44,9 };
	HashTable<int, int> ht;
	for (auto e : a)
	{
		ht.Insert(make_pair(e, e));
	}
	ht.Print();
	ht.Erase(4);
	cout << ht.Find(44)->_kv.first << endl;
	cout << ht.Find(4) << endl;
	ht.Print();
}
void TestHashTable2()
{
	int a[] = { 3, 33, 2, 13, 5, 12, 1002 };
	HashTable<int, int> ht;
	for (auto e : a)
	{
		ht.Insert(make_pair(e, e));
	}

	ht.Insert(make_pair(15, 15));

	if (ht.Find(13))
	{
		cout << "13 is present" << endl;
	}
	else
	{
		cout << "13 is absent" << endl;
	}

	ht.Erase(13);

	if (ht.Find(13))
	{
		cout << "13 is present" << endl;
	}
	else
	{
		cout << "13 is absent" << endl;
	}
}
void TestHashTable3()
{
	string arr[] = { "apple", "watermelon", "apple", "watermelon", "apple", "apple", "watermelon", "apple", "banana", "apple", "banana" };
	HashTable<string, int> countHT;
	for (auto& str : arr)
	{
		auto ptr = countHT.Find(str);
		if (ptr)
		{
			ptr->_kv.second++;
		}
		else
		{
			countHT.Insert(make_pair(str, 1));
		}
	}
}
void TestHashTable4()
{
	HashFunc<string> hash;
	cout << hash("abcd") << endl;
	cout << hash("bcad") << endl;
	cout << hash("eat") << endl;
	cout << hash("ate") << endl;
	cout << hash("abcd") << endl;
	cout << hash("aadd") << endl << endl;

	cout << hash("abcd") << endl;
	cout << hash("bcad") << endl;
	cout << hash("eat") << endl;
	cout << hash("ate") << endl;
	cout << hash("abcd") << endl;
	cout << hash("aadd") << endl << endl;
}
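
The Insert code above mentions quadratic probing as an alternative to linear probing. As a rough sketch of what that probe order looks like, the function below lists the slots visited for a given start position; `quadratic_probe_sequence` is a name made up for this illustration and it assumes a capacity greater than zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the first `count` slots visited by quadratic probing:
// (start + i * i) % capacity for i = 0, 1, 2, ...
std::vector<std::size_t> quadratic_probe_sequence(std::size_t start,
	std::size_t capacity,
	std::size_t count)
{
	std::vector<std::size_t> slots;
	slots.reserve(count);
	for (std::size_t i = 0; i < count; ++i)
		slots.push_back((start + i * i) % capacity);
	return slots;
}
```

Starting at slot 4 in a table of 10, the probes jump to 4, 5, 8, 3, 0 rather than walking 4, 5, 6, 7, 8, which spreads colliding keys out. The trade-off is that a quadratic sequence does not necessarily visit every slot, so implementations usually keep the load factor lower than with linear probing.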

3.3 Open-hashing hash table

template<class K, class V>
struct HashNode
{
	pair<K, V>_kv;
	HashNode<K, V>* _next;

	HashNode(const pair<K, V>& kv)
		:_kv(kv)
		, _next(nullptr)
	{}
};

template<class K, class V, class Hash = HashFunc<K>>
class HashTable
{
	typedef HashNode<K, V> Node;
public:

	~HashTable()
	{
		/*for (size_t i = 0; i < _tables.size(); i++)
		{
			Node* cur = _tables[i];
			while (cur)
			{
				Node* next = cur->_next;
				delete(cur);
				cur = next;
			}
			_tables[i] = nullptr;
		}*/

		for (auto& cur : _tables)
		{
			while (cur)
			{
				Node* next = cur->_next;
				delete cur;
				cur = next;
			}

			cur = nullptr;
		}
	}

	Node* Find(const K& key)
	{
		if (_tables.size() == 0)
		{
			return nullptr;
		}

		Hash hash;
		size_t hashi = hash(key) % _tables.size();
		Node* cur = _tables[hashi];
		while (cur)
		{
			if (cur->_kv.first == key)
			{
				return cur;
			}

			cur = cur->_next;
		}

		return nullptr;
	}

	bool Erase(const K& key)
	{
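		// Find alone is not enough here: unlinking a node also needs its predecessor
		if (_tables.size() == 0)
		{
			return false;
		}

		// _tables is an array of pointers: hashi is the bucket index, and _tables[hashi] points to the head of that bucket's singly linked list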
		Hash hash;
		size_t hashi = hash(key) % _tables.size();
		Node* cur = _tables[hashi];
		Node* prev = nullptr;
		while (cur)
		{
			if (cur->_kv.first == key) // found
			{
				if (prev == nullptr) // removing the head node
				{
					_tables[hashi] = cur->_next;
				}
				else // removing a node after the head
				{
					prev->_next = cur->_next;
				}
				delete cur;
				--_size;
				return true;
			}
			else // not this node: keep walking
			{
				prev = cur;
				cur = cur->_next;
			}
		}

		return false;
	}

	inline size_t GetNextPrime(size_t prime)
	{
		// prime table taken from the SGI STL
		static const size_t __stl_num_primes = 28;
		static const size_t __stl_prime_list[__stl_num_primes] =
		{
			53, 97, 193, 389, 769,
			1543, 3079, 6151, 12289, 24593,
			49157, 98317, 196613, 393241, 786433,
			1572869, 3145739, 6291469, 12582917, 25165843,
			50331653, 100663319, 201326611, 402653189, 805306457,
			1610612741, 3221225473, 4294967291
		};

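		// find the first prime in the table that is greater than the argument
		for (size_t i = 0; i < __stl_num_primes; ++i)
		{
			if (__stl_prime_list[i] > prime)
			{
				return __stl_prime_list[i];
			}
		}

		return __stl_prime_list[__stl_num_primes - 1]; // already at the largest prime: stay there instead of returning (size_t)-1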
	}

	bool Insert(const pair<K, V>& kv)
	{
		// reject duplicates
		if (Find(kv.first))
		{
			return false;
		}
		Hash hash;
		// grow when the load factor reaches 1
		if (_size == _tables.size())
		{
			//size_t newSize = _tables.size() == 0 ? 10 : _tables.size() * 2;
			vector<Node*> newTables;
			//newTables.resize(newSize, nullptr);
			newTables.resize(GetNextPrime(_tables.size()), nullptr); // grow to the next prime in the table
			// Move the old nodes into the new table. Reusing Insert would allocate a new node,
			// copy the data, and free the old node for every element, which is too expensive,
			// so the existing nodes are relinked directly.
			for (size_t i = 0; i < _tables.size(); i++)
			{
				Node* cur = _tables[i];
				while (cur)
				{
					Node* next = cur->_next;
					size_t hashi = hash(cur->_kv.first) % newTables.size(); // position in the new table
					cur->_next = newTables[hashi];
					newTables[hashi] = cur;

					cur = next; // advance to the next old node
				}

				_tables[i] = nullptr;
			}
			_tables.swap(newTables);
		}

		size_t hashi = hash(kv.first) % _tables.size();
		// insert at the head of the bucket
		Node* newnode = new Node(kv);
		newnode->_next = _tables[hashi];
		_tables[hashi] = newnode;
		++_size;

		return true;
	}

	size_t Size()
	{
		return _size;
	}

	// table length: how many bucket slots there are
	size_t TablesSize()
	{
		return _tables.size();
	}
	// how many buckets are in use (non-empty)
	size_t BucketNum()
	{
		size_t num = 0;
		for (size_t i = 0; i < _tables.size(); ++i)
		{
			if (_tables[i])
			{
				++num;
			}
		}
		return num;
	}

	size_t MaxBucketLenth()
	{
		size_t maxLen = 0;
		for (size_t i = 0; i < _tables.size(); ++i)
		{
			size_t len = 0;
			Node* cur = _tables[i];
			while (cur)
			{
				++len;
				cur = cur->_next;
			}

			if (len > maxLen)
			{
				maxLen = len;
			}
		}

		return maxLen;
	}

private:
	vector<Node*> _tables;
	size_t _size = 0; // number of valid elements stored
};

Test code:

	void TestHashBucket1()
	{
		int a[] = { 3, 33, 2, 13, 5, 12, 1002 };
		HashTable<int, int> ht;
		for (auto e : a)
		{
			ht.Insert(make_pair(e, e));
		}
		ht.Insert(make_pair(15, 15));
		ht.Insert(make_pair(25, 25));
		ht.Insert(make_pair(35, 35));
		ht.Insert(make_pair(45, 45));
		ht.Erase(12);
		ht.Erase(3);
		ht.Erase(33);
	}
	void TestHashBucket2()
	{
		string arr[] = { "apple", "watermelon", "apple", "watermelon", "apple", "apple", "watermelon", "apple", "banana", "apple", "banana" };

		//HashTable<string, int, HashFuncString> countHT;
		HashTable<string, int> countHT;
		for (auto& str : arr)
		{
			auto ptr = countHT.Find(str);
			if (ptr)
			{
				ptr->_kv.second++;
			}
			else
			{
				countHT.Insert(make_pair(str, 1));
			}
		}
	}
	// test for the string-to-integer hash
	void TestHashBucket3()
	{
		//HashTable<string, string, HashStr> ht;
		HashTable<string, string> ht;
		ht.Insert(make_pair("sort", "排序"));
		ht.Insert(make_pair("string", "字符串"));
		ht.Insert(make_pair("left", "左边"));
		ht.Insert(make_pair("right", "右边"));
		ht.Insert(make_pair("", "右边"));

		HashFunc<string> hashstr;
		cout << hashstr("abcd") << endl;
		cout << hashstr("bcda") << endl;
		cout << hashstr("aadd") << endl;
		cout << hashstr("eat") << endl;
		cout << hashstr("ate") << endl;
	}
	void TestHashBucket4()
	{

		int n = 1000000;
		vector<int> v;
		v.reserve(n);
		srand(time(0));
		for (int i = 0; i < n; ++i)
		{
			//v.push_back(i);       // sorted input
			v.push_back(rand() + i);  // few duplicates
			//v.push_back(rand());  // many duplicates
		}

		size_t begin1 = clock();
		HashTable<int, int> ht;
		for (auto e : v)
		{
			ht.Insert(make_pair(e, e));
		}
		size_t end1 = clock();

		cout << "element count: " << ht.Size() << endl;
		cout << "table length: " << ht.TablesSize() << endl;
		cout << "buckets in use: " << ht.BucketNum() << endl;
		cout << "average length of a used bucket: " << (double)ht.Size() / (double)ht.BucketNum() << endl;
		cout << "longest bucket length: " << ht.MaxBucketLenth() << endl;
		cout << "load factor: " << (double)ht.Size() / (double)ht.TablesSize() << endl;
	}