***************** Topic 7 - Hashing ***************** We have seen most of the search tree that you can encounter in CS. We will now see an ADT that you already used before the **hash table** ADT. Hash tables only have a subset of the search trees operations: * Insertions * Deletions * Find All the operations that require any ordering information among the elements are not supported. General Idea ============ A hash table can be seen as an array containing elements. The elements are composed of two parts: * The key. * The element itself. Another important element of the table is the table's size. We will see that the size is crucial part of the ADT. How does it work? ----------------- * Each key is mapped into some number in the range of 0 to :math:`\text{TableSize}-1`. * The mapping is called a **hash function**. * The function must be simple to compute * And should ensure that a key is unique .. admonition:: Example Imagine that we want to use a Hash Table to store names. We could obtain the following table: .. figure:: ../img/topic7/hash_table.* :align: center In this table `John` has to :math:`3`, `Phil` to :math:`4`. .. important:: This example shows a perfect case. Some key could have the same hash! Hash Function ============= This hash function is the most important part of the ADT. .. admonition:: Activity :class: warning * Think about a possible function. * Discuss it with your neighbours. A simple case ------------- Consider a hash table that need to store integers. Then, a simple hash function could be .. math:: \text{key} \mod \text{TableSize} .. admonition:: Activity :class: warning Is it always working for integers? * If yes, prove it. * Otherwise, show a counterexample. Consider a table of size 10, if you have multiple integers ending with a zero, then it will not work. To counter this, we always ensure that the size is prime. It distributes the keys evenly. More general case ----------------- Usually keys are strings. With strings keys the hash function is more complex and must be chosen carefully. We will consider three examples of hash functions. First Example ************* .. admonition:: Activity :class: warning * Write a hash function that takes a string as key and hash it by adding the ASCII value of each char and return the modulo. .. code-block:: cpp int hash(const string &key, int tableSize){ } The table size needs to be large enough. * Consider a size of :math:`10 007` * Suppose the keys have a size of eight or less. * The hash will calculate only values between 0 and :math:`1 016`, which is :math:`127\times 8`. It does not distribute the values evenly. Second Example ************** We can consider a second example. * Consider that the key will have at least three characters. * The value 27 represents the number of letters in the English alphabet (plus the blank). * 729 is :math:`27^2`. We can implement the following function: .. code-block:: cpp int hash(const string &key, int tableSize){ return (key[0] + 27*key[1] + 729*key[2]) % tableSize; } It could work if the word in English are random, but they are not. We could imagine there are :math:`26^3 =17576` possible combinations of three characters. In reality only :math:`2851`! Only 28 percent of the table can actually be hashed, and increasing the size of the table doesn't change anything. Third Example ************* The following code implement the `Horner's method `_. .. code-block:: cpp unsigned int hash(const string &key, int tableSize){ unsigned int hashVal = 0; for (char ch: key){ hashVal = 37 * hashVal + ch; } return hashVal % tableSize; } It calculates a polynomial function of 37 to an :math:`n`th degree, :math:`n` being the size of the string: .. math:: \sum_{i=0}^{KeySize - 1} Key\left[KeySize - i - 1 \right] \times 37^i .. admonition:: Activity :class: warning Is it a good hash function? Separate Chaining ================= How are we dealing collision? ----------------------------- We can have a good has function, but it can always create a collision even if it's uncommon. A first strategy is known as **separate chaining**. The idea -------- * We represent the HashTable by a vector (from the STL) of list. * When a key has a hash that collides we add the key to the list corresponding to the vector index. If it's not very clear, here is an illustration: .. figure:: ../img/topic7/separate_chaining.* :align: center .. admonition:: Activity :class: warning * Complete the interface of the class. * Create the constructor * Create the insert and remove method * Create the private data members. .. literalinclude:: ../code/topic7/HashTable.h :language: cpp :linenos: :lines: 9-12, 20-21, 24-28 Hashing ------- Now we need to implement the hash function. In C++, it is usually done with a function object template: .. code-block:: cpp template class hash{ public: size_t operator() (const Key &k) const; }; It is already implemented for basic types such as :code:`int` or :code:`string`. .. admonition:: Example The code given in the third example would be written like this: .. code-block:: cpp :linenos: template <> class hash{ public: size_t operator() (const string &key){ size_t hashVal = 0; for(char ch: key){ hashVal = 37*hashVal+ch; } return hashVal; } }; The type :code:`size_t` is an unsigned integer type that represents the size of an object. It guarantees that it can store an array index. We can finally see how to implement :code:`myhash()`. .. code-block:: cpp size_t myhash(const HashObj &x) const{ static hash hf; return hf(x) % theLists.size(); } Example ======= We will see a practical example. Consider the following class :code:`Employee`: .. literalinclude:: ../code/topic7/Employee.h :language: cpp :linenos: :lines: 7-33 The first is to implement the hash function object template :code:`operator()`. .. admonition:: Activity :class: warning * Implement the hash function object. Implementation ============== There are 4 methods that need to be implemented: * makeEmpty * contains * remove * insert makeEmpty --------- The method is simple, for each list in the vector, clear the list. contains -------- It is also simple. * You start by hashing the object. * Then, you get the list in the vector. * Finally, you try to find the element in this list. remove ------ The first steps are the same that for :code:`contains`. If the element is not found, return false. Otherwise, erase the element, reduce :code:`currentSize` by 1 and return true. insert ------ The first steps are the same that for :code:`contains`. If the element is not inside the list, insert it. Increase :code:`currentSize` by 1, and if :code:`currentSize` is greater than the size of the list :code:`rehash`. Finally return true. .. admonition:: Activity :class: warning Implement these methods.