Sunday, July 20, 2014

Cooking up Hash Tables

Originally I was thinking I should rename this post, but I've decided to go ahead and keep the name since I find it kind of funny. That said, hash tables are super cool, and a really important data structure that I believe you should absolutely know.  Even Googlers says they are one of the most important data structures around.  So how exactly do you implement a simple hash table?  Let's go ahead and explore this!

What is a Hash Table?

A normal table as an array
Index Value
0 Gerald
1 Jack
2 Joseph
The way to think of a hash table is to think of it exactly like you would a normal table!  You have a bunch of items that you want to retrieve at specific positions, and you want to do so as quickly as possibly.
A hash table
Key Value
Oldest Brother Gerald
Middle Brother Jack
Youngest Brother Joseph
The difference is that most tables you need to figure out the position, which is usually given via specific index that corresponds to a given item.  A hash table however simply takes a chosen key (such as a phrase, words, or something else, rather than just a number) and returns a value associated with it!  You can see the differences between these two to the right.

The key difference as you can see between the two is the introduction of a key.  This key allows you to quickly access a value, rather than searching through each array index for the value, make the speed of this to be O(1) on average, with a worst case of O(n).  The big thing that determines the speed is the quality of the hashing function that is used for the keys.

What is a Hash?

As you have probably noticed, the actual hash it the single most important part of creating a hash table.  Without a proper hashing function, you will end up with a lot of problems with what are called collisions.  Before we go into the problems with hashing, let's first talk about how you do it.

A very simple way to think about hashing it the idea of converting an object (either a string, class, or some other thing) into a unique integer value.  Note that I said unique, so just summing the values of each character in a string wouldn't work as some strings would then have the same values.  For the purpose of this tutorial, we are going to focus on hashing with strings since they are ubiquitous, simple, and still provide basic challenges.  So as I said, simply summing the values of each character in a string won't cut it, but that is still a good starting point for turning the string into an integer value.  So how can we help make sure that each sum is as unique as possible, without making the value random?  The answer is with one of the greatest math tools around: Prime numbers.

The Greatness of Prime Numbers

Prime numbers turn out to be an incredible way to generate unique numbers, while making sure that they aren't just random numbers.  If you take any two prime numbers, you'll get a unique number that only has four multiples: The two prime numbers, one, and the number itself.  This means that if we take each character and multiply it by a prime number before adding it to the total sum, we will increase our chance of preventing duplicate hashes.  So the formula ends up looking like so:

hash = prime1;
foreach(character c in key)
{
    hash = (hash * prime2 * c);
}
Where prime1 and prime2 are two unique prime numbers.

Now this formula is pretty nice, but there's actually still one more improvement we can make to it, and that's adding in some XOR operations.  This helps to further make sure that each hash is unique, since again, the chances of XOR operations occurring the same amount of times on the same sequence of characters is very low.  The modified and final version of the formula as a result ends up looking like this:

hash = prime1;
foreach(character c in key)
{
    hash = (hash * prime2) ^ (prime3 * c);
}
Where prime1, prime2 and prime3 are unique prime numbers.

This results in a hash with very few collisions, and is generally accepted as a good way to do hashing currently for basic applications.  The final thing that you will want to do is perform a modulo operation on the final hash result with the maximum number of items currently allowed in the table, to prevent an overflow from occurring.

Edit: Someone pointed out that there are better hashing functions out there, and I want to point out that yes, there are.  The reason I am showing this one is because it is easier to explain, and works fairly well from the little bit of testing I have done.  That said, you can find some better hashing functions (and worse ones) here.

Creating the Actual Hash Table

Creating the actual hash table is even easier than you'd think.  The table itself can be simply an array that holds N items, and when you perform the modulo operation on your final hash, you use it with N.  You can then take the resulting hash, and place the item at the array index of that hash.  When you need to access the item again, simply pass the key back into the hashing function, and retrieve the item from the array with the resulting hash value.

The Problems

There are of course a few problems that still occur with these methods.  The first is that while the hashing formula I mentioned above works fairly well, there will still be instances where two keys would create the same hash.  There are a few different ways that this can be handled, though I won't be going over those in this blog post since that's a topic in and of itself.  The other issue is that this can still only hold so many values, and when you try to insert a new value to a full table, you'll have to rebuild the whole table again with more values, while also transferring the old values.  You can make this re-sizing less of an issue by simply increasing the amount of values held each time exponentially (2-4-16-256-65536-so on and so forth), which will reduce how often you will have to do this.  This is also still only a O(n) operation, so is rather manageable overall.

Conclusions

I hope this tutorial has helped with understanding how a hash table works.  Hash tables are used all over the place, and if you want to be able to work at somewhere such as Google, you absolutely must know how to create a simple hash table.  For those who are interested in seeing some example code, you can find a simple C++ implementation over on my Github page with the code (Though it does not prevent collisions between two hashes that are the same, and again, the hashing function is very simple).  Maybe in a future tutorial, I can talk about how to handle collisions when they do occur.  As always, feel free to leave any comments and feedback below!

No comments:

Post a Comment