C++-HashMap 数据结构的实现

内容目录

前要说明

HashMap 是一种利用 Hash Table 做多对 Pair 的存储的一种高效数据结构。
关于 HashMap 的实现方法，主要是使用 Key 的 Hash值 进行一个计算，将计算后的结果插入到数组的某一个位置。
当我们要查找 Key 的位置的时候，需要重新计算传入的 Hash值，由于每次 Hash 运算的结果是一样的，那么就会找到正确的位置。
也就是说最好情况下，查找速度可以达到 O(1)，而当所有的 Key 的 Hash值 相同时，则是最坏的查找情况 O(n)。

基于这个理论，我们可以先创建一个表，这个表需要一个索引，而索引是通过 Hash值 计算得来的，为了优化效率并节省内存，在这里可以通过 Hash值 % Hash表大小 来计算出一个 HashMap的容量下当前的索引项。

Array<LinkedList<Pair<Key, Value>>> Map;

第一个 Array 是用来保存索引，只需要开扩为一个 HashMap 的容量即可，这样 Hash值 就会被分配到具有指定容量的数组中。
并将 Key 与 Value 都放到链表里。也就是说，Key 的结果为相同 Hash值 的，会被存储到同一个链表中，若所有的 Pair 都被存储到同一个链表里了。
那么此时查询时间复杂度为 O(n)。

此时利用链表来完成 `Pair` 的存储有几个优势：

链表的插入与删除效率很高，因为是非连续内存，并不需要在删除或插入后，做额外的内存移动处理。
对于链表来说，并不需要明确指定一个大小作为链表的容量，这样就不会有额外的内存浪费或重新分配所占用的性能（相较于数组）
既然不需要指定明确大小，那么在内存允许的范围内，我们可以对其进行大额分配，当相同 Hash值 （也指的是索引项）较多时，可以一直添加下去。

当然也有缺点：

缓存系统优化了内存连续访问的性能，内存不连续时会导致些许性能问题。
当相同的索引项过多时，链表会变得很长，此时的操作时间复杂度会到 O(n)。

为了解决这个缺点问题，聪明的人类想到了一些其他的做法。

平衡二叉搜索树：例如红黑树，AVL树等等，这些会使 Map 具备排序功能，并提供最坏 O(logn) 的操作速度。
跳表：通过牺牲一定内存加快操作速度，并提供 O(logn) 的平均操作速度。
数组：就不说了，有一定优势和劣势，但是劣势多点，具体需要看使用需求。

理解了基本原理之后，就可以开始着手写一个 HashMap 了

实现

Pair

首先最基本的，我们需要实现一个 Pair 类用于存储一对元素。

template <typename KeyType, typename ValueType>
class Pair
{
    KeyType Key;
    ValueType Value;
public:
    Pair(const KeyType& InKey, const ValueType& InValue) : Key(InKey), Value(InValue) {}
    Pair(KeyType&& InKey, const ValueType& InValue) : Key(std::move(InKey)), Value(InValue) {}
    Pair(const KeyType& InKey, ValueType&& InValue) : Key(InKey), Value(std::move(InValue)) {}
    Pair(KeyType&& InKey, ValueType&& InValue) : Key(std::move(InKey)), Value(std::move(InValue)) {}

    Pair(const Pair& InPair) = default;
    Pair(Pair&& InPair) = default;
    ~Pair() = default;

    KeyType& GetKey() { return Key; }
    ValueType& GetValue() { return Value; }

    const KeyType& GetKey() const { return Key; }
    const ValueType& GetValue() const { return Value; }

    void SetKey(const KeyType& InKey) { Key = InKey; }
    void SetValue(const ValueType& InValue) { Value = InValue; }

    void SetKey(const KeyType& InKey) const { Key = InKey; }
    void SetValue(const ValueType& InValue) const { Value = InValue; }

    Pair& operator= (const Pair& InPair) = default;
    Pair& operator= (Pair&& InPair) = default;
};

HashMap

接下来在 HashMap 中，我们需要先创建一个用于存储的结构。并设定一个当前哈希表的容量，还需要一个 Size 来保存当前结构中存储了多少对 Pair。

template <typename KeyType, typename ValueType>
class HashMap
{
    SizeType Capacity = 8;
    SizeType Size = 0;
    Array<LinkedList<Pair<KeyType, ValueType>>> Map;
};

注意：此处的 Array | LinkedList | Pair 均为我自己的实现。
可以使用 std::vector | std::list | std::pair 做出相同的实现效果。

接下来写一些公开函数内容，首先是构造，先将 构造函数 复制构造函数 移动构造函数 复制赋值运算符 移动赋值运算符 析构函数 这 6 个函数创建出来，
并在构造函数中初始化数组大小，再将目前没有实现的函数 delete 掉，以防 bug。

HashMap()
{
    Map.Init(LinkedList<Pair<KeyType, ValueType>>(), Capacity);
}

HashMap(const HashMap& NewMap) = delete;
HashMap(HashMap&& NewMap) = delete;
~HashMap() = default;

HashMap& operator= (const HashMap& NewMap) = delete;
HashMap& operator= (HashMap&& NewMap) = delete;

Add

接下来是新增函数，若我们想新增一个函数，则需要传入一对数值，也就是 Pair。

inline void Add(const Pair<KeyType, ValueType>& InPair)
{
    // 此处首先利用了 std::hash<>{} 去定义了一个 Hash 转换器，并将 Key 传入进去计算哈希值。
    // 最后使用 % Capacity 来转换成数组索引。
    SizeType Hash = std::hash<KeyType>{}(InPair.GetKey()) % Capacity;

    // 在对应索引位置加入 InPair。
    // 这里的 ForEach 为我自己库的函数，可以直接用普通的 for 代替。
    // 先检查一下链表中是否拥有同样名字的 Key，如果有的话，直接设置 Value 的值为新的值。
    // 如果没有的话，我们将 Pair 加入这里。
    SizeType Hash = std::hash<KeyType>{}(InPair.GetKey()) % Capacity;
    bool bFound = false;
    ForEach(Map[Hash], [&bFound, InPair](Pair<KeyType, ValueType>& TempPair)
    {
        if (TempPair.GetKey() == InPair.GetKey())
        {
            TempPair.SetValue(InPair.GetValue());
            bFound = true;
            return false;
        }
        return true;
    });
    if (!bFound)
    {
        Map[Hash].Add(InPair);
        // 对 Size 进行一个扩容。
        ++Size;
    }
}

ReHash

显然，我们一开始的容量只有 8，这也意味着当数据量过多的时候，只有 8 个槽位可供放入哈希索引。
所以说我们需要一个 ReHash 函数用来在超过一定容量后，重新分配所有的哈希值。

inline void ReHash()
{
    // 我这里将容量直接*2了，依据实际的情况可以自己修改，也可以提前暴露出每次扩容所需要的大小。
    // 如果你要存储大量的 Hash 值，则尽可能的提前分配，否则会因为多次扩容导致调用 Rehash 函数
    // 增加一些性能开销。
    Capacity *= 2;

    // 创建一个新的存储结构，并初始化为新的容量大小。
    Array<LinkedList<Pair<KeyType, ValueType>>> TempMap;
    TempMap.Init(LinkedList<Pair<KeyType, ValueType>>(), Capacity);

    // 重新计算所有 Key 的 Hash，并将他们模到一个新的范围内，再添加回临时 Map 中。
    for (auto& Chain : TempMap)
    {
        for (auto& SinglePair : Chain)
        {
            SizeType Hash = std::hash<KeyType>{}(SinglePair.GetKey()) % Capacity;
            TempMap[Hash].Add(SinglePair);
        }
    }

    // 移动新建的 Map 到 之前的 Map 上。
    Map = std::move(TempMap);
}

现在我们需要决定，在什么时候来调用 ReHash。
例如我们现在的容量为 8，这也同时意味着，如果取模后的结果有多个同时分配在了一个链表中。
那么性能就会变差，我们不需要根据一个链表的长度去决定这件事情，可以使用总体的大小来判定是否需要 ReHash。
例如容量为 8 时，我们添加了 64 个元素进来，那么最乐观的情况下，每个数组插槽的链表中会拥有 8 个元素。
此时若查询链表最后一个数值的时候，会调用 8 次的 for。
如果这个时候将容量改为 16。 ReHash 后最乐观的情况下（实际中都是悲观情况），每个链表里只会存入 4 个。

所以这里在 Add 函数中做一个判定，如果这个 HashMap 的总 Pair 量超过了一定比例（暂定为 0.75，可根据需要动态调整）
则需要进行 ReHash。

// Add 函数开头。
// 当 当前容量 /
if (static_cast<float>(Size) / static_cast<float>(Capacity) >= 0.75f)
{
    ReHash();
}

Remove

接下来写 Remove 函数，要求传入一个 Key。
之后首先计算 Key 的哈希值索引，在对应的索引中遍历整个链表，当链表中存储的 Key 等于传入的 Key 时，从链表中删除这个索引。
并让 Size 减去 1。

inline void Remove(const KeyType& Key)
{
    SizeType Hash = std::hash<KeyType>{}(Key) % Capacity;
    for (SizeType i = 0; i < Map[Hash].Num(); ++i)
    {
        if (Map[Hash][i]->GetValue().GetKey() == Key)
        {
            Map[Hash].Remove(i);
            --Size;
            break;
        }
    }
}

Iterator

为了可以快速对 Map 进行迭代，我们需要实现一个迭代器。而元素在 Map 中的分配不是连续的，所以我们需要动态的去查找元素。
首先在 HashMap 类中定义一个迭代器，并实现基本的解引用运算符等于运算符不等于运算符构造函数等。
并定义几个变量。


class Iterator
{
    // 用来存储 Map 本身的指针。
    HashMap* IteratorHashMap = nullptr;
    // 当前检索到的桶。
    SizeType BucketIndex = 0;
    // 当前检索到的 Pair 的指针。
    LinkedListNode<Pair<InternalKey, InternalValue>>* CurrentNode = nullptr;
public:
    // 需要在外部传入 HashMap 的指针。
    explicit Iterator(HashMap* Map) : IteratorHashMap(Map) {}

    // 得到当前 HashMap 指针的函数。
    HashMap* GetHashMap() const
    {
        return IteratorHashMap;
    }

    // 得到当前链表节点的函数。
    LinkedListNode<Pair<InternalKey, InternalValue>>* GetCurrentNode() const
    {
        return CurrentNode;
    }

    Iterator& operator++ () = delete;
    bool operator== (const Iterator& Other) = delete;
    bool operator!= (const Iterator& Other) = delete;

    // 实现一下解引用运算符，从当前链表的指针中获取值。
    [[nodiscard]] Pair<KeyType, ValueType>& operator*() const
    {
        return CurrentNode->GetValue();
    }

};

接下来我们需要一个函数，帮助我们移动到下一个空桶。

void MoveToNextBucket()
{
    // 若当前持有的 HashMap 指针存在的话。
    if (IteratorHashMap)
    {
        // 如果当前桶的索引小于桶的容量，则持续查找，如果找到某一个桶的链表不为空，将当前桶的链表的节点赋值给 CurrentNode，并切换到下一个桶，如果当前桶不为空，则直接切换到下一个桶，并重新迭代。
        while (BucketIndex < IteratorHashMap->GetCapacity())
        {
            if (IteratorHashMap->GetMap()[BucketIndex].GetHeadNode())
            {
                CurrentNode = IteratorHashMap->GetMap()[BucketIndex].GetHeadNode();
                ++BucketIndex;
                break;
            }
            ++BucketIndex;
        }
    }
}

// 构造函数中一开始需要调用一下。
explicit Iterator(HashMap* Map) : IteratorHashMap(Map)
{
    MoveToNextBucket();
}

接下来来实现 ++ 运算符，切换到下一个节点的指针。

Iterator& operator++ ()
{
    if (!IteratorHashMap)
    {
        return *this;
    }

    if (CurrentNode)
    {
        CurrentNode = CurrentNode->GetNextNode();
    }

    if (!CurrentNode)
    {
        MoveToNextBucket();
        if (!CurrentNode)
        {
            IteratorHashMap = nullptr;
        }
    }
    return *this;
}

接下来实现一波等于运算符和不等于运算符，非常简单，只需要检查一下 HashMap 和 CurrentNode 是否跟另外一个 Iterator 相等。

bool operator== (const Iterator& Other) const
{
    return (IteratorHashMap == Other.GetHashMap() && CurrentNode == Other.GetCurrentNode());
}
bool operator!= (const Iterator& Other) const
{
    return (IteratorHashMap != Other.GetHashMap() && CurrentNode != Other.GetCurrentNode());
}

测试

HashMap<int32, String> Map;
Map.Add(Pair(1, String("This is one.")));
Map.Add(Pair(2, String("This is two.")));
Map.Add(Pair(3, String("This is three.")));
Map.Add(Pair(4, String("This is four")));

for (auto i = Map.begin(); i != Map.end(); ++i)
{
    std::cout << (*i).GetValue() << "\n";
}

Map.Remove(2);

std::cout << "\n";
for (auto i : Map)
{
    std::cout << i.GetValue() << "\n";
}

输出：（因为 HashMap 没有顺序，所以是正常现象）

This is four
This is one.
This is three.
This is two.

This is four
This is one.
This is three.

双参数 Add 重载

为了给用户提供一些方便且可操作的空间，需要重载多个双参数类型的 Add 函数，分别传入 Key 与 Value。
并且为了保证性能，还需要重载右值引用的参数。

先重载一个 Pair 右值引用。

// 参数改为右值引用
void Add(Pair<KeyType, ValueType>&& InPair)
{
    if (static_cast<float>(Size) / static_cast<float>(Capacity) >= 0.75f)
    {
        ReHash();
    }
    SizeType Hash = std::hash<KeyType>{}(InPair.GetKey()) % Capacity;

    bool bFound = false;
    ForEach(Map[Hash], [&bFound, InPair](Pair<KeyType, ValueType>& TempPair)
    {
        if (TempPair.GetKey() == InPair.GetKey())
        {
            TempPair.SetValue(InPair.GetValue());
            bFound = true;
            return false;
        }
        return true;
    });
    if (!bFound)
    {
        // 在此处调用 move 从而移动 Pair 到这里。
        Map[Hash].Add(std::move(InPair));
        ++Size;
    }
}

接下来分别实现四个函数，这里实际上需要对内部链表开放万能引用原位构造接口来减少 move 增加性能，不过需要利用到 tuple 等特性，会在之后讲解一下，但这里就直接来调用右值引用参数的 Add 接口

void Add(const KeyType& Key, const ValueType& Value)
{
    Add(Pair<KeyType, ValueType>(Key, Value));
}

void Add(KeyType&& Key, const ValueType& Value)
{
    Add(Pair<KeyType, ValueType>(std::move(Key), Value));
}

void Add(const KeyType& Key, ValueType&& Value)
{
    Add(Pair<KeyType, ValueType>(Key, std::move(Value)));
}

void Add(KeyType&& Key, ValueType&& Value)
{
    Add(Pair<KeyType, ValueType>(std::move(Key), std::move(Value)));
}

移动构造与复制构造

HashMap(const HashMap& NewMap)
{
    Capacity = NewMap.GetCapacity();
    Size = NewMap.GetSize();
    Map = NewMap;
}
HashMap(HashMap&& NewMap) noexcept
{
    Capacity = NewMap.GetCapacity();
    Size = NewMap.GetSize();
    Map = std::move(NewMap);
}

对重要的资源进行拷贝或移动（需要数组支持移动赋值运算符与复制赋值运算符），对基本数据类型进行拷贝。

移动赋值与复制赋值

HashMap& operator= (const HashMap& NewMap)
{
    Capacity = NewMap.GetCapacity();
    Size = NewMap.GetSize();
    Map = NewMap.GetMap();
    return *this;
}
HashMap& operator= (HashMap&& NewMap) noexcept
{
    Capacity = NewMap.GetCapacity();
    Size = NewMap.GetSize();
    Map = std::move(NewMap.GetMap());
    return *this;
}

接口类型的简化

// 这个作为存储 HashMap 具体内容的类型疑似有点太长了，这里可以替换成
Array<LinkedList<Pair<KeyType, ValueType>>> Map;

using MapType = Array<LinkedList<Pair<KeyType, ValueType>>>;
MapType Map;

然后在实现 getter 的时候可以

[[nodiscard]] SizeType GetCapacity() const { return Capacity; }
[[nodiscard]] SizeType GetSize() const { return Size; }

[[nodiscard]] MapType& GetMap() { return Map; }
[[nodiscard]] const MapType& GetMap() const { return Map; }

测试

HashMap<int32, String> Map;
Map.Add(Pair(1, String("This is one.")));
Map.Add(Pair(2, String("This is two.")));
Map.Add(Pair(3, String("This is three.")));
Map.Add(Pair(4, String("This is four")));

HashMap<int32, String> CopyConstruction = Map;
for (auto i : CopyConstruction)
{
    std::cout << i.GetValue() << "\n";
}

HashMap<int32, String> MoveConstruction = {{1, "Test1"}, {2, "Test2"}};
for (auto i : MoveConstruction)
{
    std::cout << i.GetValue() << "\n";
}

HashMap<int32, String> CopyAssign = Map;
for (auto i : CopyAssign)
{
    std::cout << i.GetValue() << "\n";
}

HashMap<int32, String> MoveAssign = {{1, "Test1"}};
MoveAssign = {{2, "Test2"}};

for (auto i : MoveAssign)
{
    std::cout << i.GetValue() << "\n";
}

均运行正常。

扩展

关于完美转发 + 原地构造

以后再更

论证参考：

红石C++镜像：std::map<Key,T,Compare,Allocator>::emplace