Hashmap¶

Terms¶

Collision: when two keys' hash values are the same, they are going to map to the two same location in the array.
Load Factor: assume your hash table has size m, which means that there is room for up to m items in it. But we also care about how many elements we are actually storing in the array right now — we denote this number by s. The ratio $alpha = \frac{s}{m}$ is called the load factor of the hash table. When it is small, there is a lot of room in the table. If it ever exceeds 1, we are guaranteed that a collision must have occurre.

Typically, load factor maintains the ratio of number of <key, value> pairs / number of buckets. Usually it's under ½.

Screen Shot 2020-08-04 at 11.21.20 AM.png

This array has a loading factor of 7/13.

Hashing Function: So how many random keys do we need to draw until we expect to see our first collision? There are m possible indices of h(k) we can draw, and if any two draws have the same index, then we have a collision. This is exactly an instance of the Birthday Paradox, so we would expect our first collision to occur around key number $\theta(\sqrt{m})$ . In other words, even if we think that the keys are drawn randomly (which should be pretty friendly to the hash table), the first collisions should occur around load factor $\alpha = \theta(\sqrt{m})$ . This is all the more reason why we will need to deal with collisions below.
Modulo Arithmetic: $h(k) = k$ mod $m$ the problem is that it might get very clustered key items in particular digits (e.g. m = 10 then all the if all numbers has the same last two digits will fall into the same key). Solution: choose a large prime number.
Chaining: using a linkedlist to store all the collide values, do a linear scan to check if exist in the linkedlist.

Time Complexity: assume the size of each linkedlist is $l_i$ , and we have $s$ elements in total, then the possibability for a given element to fall on $l_i$ is $l_i / s$ , it will take $\theta(l_i)$ to find in the linkedlist. Thus the total time complexity is $\theta (\frac{1}{s} \times \sum_i l_i^{2})$

Probing or Open Addressing: use only one list to store all the values, and find the next avaliable location for inserting if there is a collision.
Linear Probing: if position i is occupied, then try i + 1, i + 2... until find the first free place. In practice it's a bad idea: assume you have filled 0, ..., m / 2 positions in the list. The possibability of filling next element that will fall in 0, ..., m / 2 is ½, which will take m/2 + 1 operations to find the next avaliable place.

TODO: add open addressing

Designing Thread Safe Lookup table (hashmap or `std::map<T,T>`)¶

One solution to prevent data racing is by adding mutex lock everytime we want to do any operation. Can we do better?
There are three common ways of implementing an associateive container like lookup table:
A balanced binary tree, such as red-black tree (have to lock the root, which has no difference than locking the entire tree)
A sorted array (still have to lock the entire array)
A hash table (we can just lock the bucket we are accessing!)

Example implementation¶

template <typename Key, typename Value, typename Hash = std::hash<Key>>
class threadsafe_lookup_table {
private:
  class bucket_type {
  public:
    Value value_for(Key const &key, Value const &default_value) const {
      std::shared_lock<std::shared_mutex> lock(mutex);
      bucket_iterator const found_entry = find_entry_for(key);
      return (found_entry == data.end()) ? default_value : found_entry->second;
    }

    void add_or_update_mapping(Key const &key, Value const &value) {
      std::unique_lock<std::shared_mutex> lock(mutex);
      bucket_iterator const found_entry = find_entry_for(key);
      return (found_entry == data.end()) ? default_value : found_entry->second;
    }

    void remove_mapping(Key const &key) {
      std::unique_lock<std::shared_mutex> lock(mutex);
      bucket_iterator const found_entry = find_entry_for(key);
      if (found_entry != data.end())
        data.erase(found_entry);
    }

  private:
    typedef std::pair<Key, Value> bucket_value;
    typedef std::list<bucket_value> bucket_data;
    typedef typename bucket_data::iterator bucket_iterator;
    bucket_data data;
    mutable std::shared_mutex mutex;

    bucket_iterator find_entry_for(Key const &key) const {
      return std::find_if(
          data.begin(), data.end(),
          [&](bucket_value const &item) { return item.first == key; });
    }
  };
  std::vector<std::unique_lock<bucket_type>> buckets;
  Hash hasher;
  bucket_type &get_bucket(Key const &key) const {
    std::size_t const bucket_index = hasher(key) % buckets.size();
    return *buckets[bucket_index];
  }

public:
  typedef Key key_type;
  typedef Value mapped_type;
  typedef Hash hash_type;
  threadsafe_lookup_table(unsigned num_buckets = 19, // arbitary prime number
                          Hash const &hasher_ = Hash())
      : buckets(num_buckets), hasher(hasher_) {
    for (unsigned i = 0; i < num_buckets; ++i) {
      buckets[i].reset(new bucket_type);
    }
  }
  threadsafe_lookup_table(threadsafe_lookup_table const &other) = delete;
  threadsafe_lookup_table & operator=(threadsafe_lookup_table const &other) = delete;
  Value value_for(Key const &key, Value const &default_value = Value()) const {
    return get_bucket(key).value_for(key, default_value);
  }
  void add_or_update_mapping(Key const &key, Value const &value) {
    get_bucket(key).add_or_update_mapping(key, value);
  }
  void remove_mapping(Key const &key) { get_bucket(key).remove_mapping(key); }
};

Last update: January 9, 2021

Hashmap¶

Terms¶

Designing Thread Safe Lookup table (hashmap or std::map<T,T>)¶

Example implementation¶

Designing Thread Safe Lookup table (hashmap or `std::map<T,T>`)¶