Topic 7 - Hashing

We have seen most of the search tree that you can encounter in CS.

We will now see an ADT that you already used before the hash table ADT.

Hash tables only have a subset of the search trees operations:

  • Insertions

  • Deletions

  • Find

All the operations that require any ordering information among the elements are not supported.

General Idea

A hash table can be seen as an array containing elements.

The elements are composed of two parts:

  • The key.

  • The element itself.

Another important element of the table is the table’s size. We will see that the size is crucial part of the ADT.

How does it work?

  • Each key is mapped into some number in the range of 0 to \(\text{TableSize}-1\).

  • The mapping is called a hash function.

    • The function must be simple to compute

    • And should ensure that a key is unique

Example

Imagine that we want to use a Hash Table to store names.

We could obtain the following table:

_images/hash_table.png

In this table John has to \(3\), Phil to \(4\).

Important

This example shows a perfect case. Some key could have the same hash!

Hash Function

This hash function is the most important part of the ADT.

Activity

  • Think about a possible function.

  • Discuss it with your neighbours.

A simple case

Consider a hash table that need to store integers.

Then, a simple hash function could be

\[\text{key} \mod \text{TableSize}\]

Activity

Is it always working for integers? * If yes, prove it. * Otherwise, show a counterexample.

Consider a table of size 10, if you have multiple integers ending with a zero, then it will not work.

To counter this, we always ensure that the size is prime. It distributes the keys evenly.

More general case

Usually keys are strings.

With strings keys the hash function is more complex and must be chosen carefully.

We will consider three examples of hash functions.

First Example

Activity

  • Write a hash function that takes a string as key and hash it by adding the ASCII value of each char and return the modulo.

int hash(const string &key, int tableSize){

}

The table size needs to be large enough.

  • Consider a size of \(10 007\)

  • Suppose the keys have a size of eight or less.

  • The hash will calculate only values between 0 and \(1 016\), which is \(127\times 8\).

It does not distribute the values evenly.

Second Example

We can consider a second example.

  • Consider that the key will have at least three characters.

  • The value 27 represents the number of letters in the English alphabet (plus the blank).

  • 729 is \(27^2\).

We can implement the following function:

int hash(const string &key, int tableSize){
    return (key[0] + 27*key[1] + 729*key[2]) % tableSize;
}

It could work if the word in English are random, but they are not.

We could imagine there are \(26^3 =17576\) possible combinations of three characters.

In reality only \(2851\)!

Only 28 percent of the table can actually be hashed, and increasing the size of the table doesn’t change anything.

Third Example

The following code implement the Horner’s method.

unsigned int hash(const string &key, int tableSize){
    unsigned int hashVal = 0;

    for (char ch: key){
        hashVal = 37 * hashVal + ch;
    }
    return hashVal % tableSize;
}

It calculates a polynomial function of 37 to an \(n\) being the size of the string:

\[\sum_{i=0}^{KeySize - 1} Key\left[KeySize - i - 1 \right] \times 37^i\]

Activity

Is it a good hash function?

Separate Chaining

How are we dealing collision?

We can have a good has function, but it can always create a collision even if it’s uncommon.

A first strategy is known as separate chaining.

The idea

  • We represent the HashTable by a vector (from the STL) of list.

  • When a key has a hash that collides we add the key to the list corresponding to the vector index.

If it’s not very clear, here is an illustration:

_images/separate_chaining.png

Activity

  • Complete the interface of the class.

    • Create the constructor

    • Create the insert and remove method

    • Create the private data members.

 1template <typename HashedObj>
 2class HashTable
 3{
 4public:
 5
 6private:
 7
 8    void rehash();
 9    size_t myhash(const HashObj &x) const;
10
11}

Hashing

Now we need to implement the hash function.

In C++, it is usually done with a function object template:

template <typename Key>
class hash{
public:
    size_t operator() (const Key &k) const;
};

It is already implemented for basic types such as int or string.

Example

The code given in the third example would be written like this:

 1template <>
 2class hash<string>{
 3public:
 4    size_t operator() (const string &key){
 5        size_t hashVal = 0;
 6
 7        for(char ch: key){
 8            hashVal = 37*hashVal+ch;
 9        }
10
11        return hashVal;
12    }
13};

The type size_t is an unsigned integer type that represents the size of an object.

It guarantees that it can store an array index.

We can finally see how to implement myhash().

size_t myhash(const HashObj &x) const{
    static hash<HashedObj> hf;
    return hf(x) % theLists.size();
}

Example

We will see a practical example.

Consider the following class Employee:

 1class Employee{
 2public:
 3
 4    Employee(string name, double salary, int seniority): 
 5        name{name}, salary{salary}, seniority{seniority}
 6    {
 7
 8    }
 9
10    const string& getName() const{
11        return name;
12    } 
13
14    bool operator==(const Employee &rhs) const{
15        return getName() == rhs.getName();
16    }
17
18    bool operator!=(const Employee &rhs){
19        return !(*this == rhs);
20    }
21
22private:
23
24    string name;
25    double salary;
26    int seniority;

The first is to implement the hash function object template operator().

Activity

  • Implement the hash function object.

Implementation

There are 4 methods that need to be implemented:

  • makeEmpty

  • contains

  • remove

  • insert

makeEmpty

The method is simple, for each list in the vector, clear the list.

contains

It is also simple.

  • You start by hashing the object.

  • Then, you get the list in the vector.

  • Finally, you try to find the element in this list.

remove

The first steps are the same that for contains.

If the element is not found, return false.

Otherwise, erase the element, reduce currentSize by 1 and return true.

insert

The first steps are the same that for contains.

If the element is not inside the list, insert it.

Increase currentSize by 1, and if currentSize is greater than the size of the list rehash.

Finally return true.

Activity

Implement these methods.