Data Structures - Hash Tables

Associative arrays are another abstract data type: structures in which values are associated with meaningful keys rather than with numeric indexes. The first rule to remember is that keys must be unique, but values can be duplicates. Custom objects can be stored as values, and implementations of associative arrays differ across languages; in Objective-C, we have the NSDictionary class. Under the hood, an associative array is commonly backed by a data structure known as a hash table.
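To make the unique-keys rule concrete, here is a small sketch in Python. It uses Python's built-in dict as a stand-in for NSDictionary; the variable names are made up for illustration.

```python
# A plain dict is Python's built-in associative array (used here as a
# stand-in for NSDictionary; the names below are hypothetical).
ages = {"alice": 30, "bob": 30}   # two different keys may share the same value

ages["alice"] = 31                # reusing a key overwrites its value; no second entry is added

key_count = len(ages)
print(key_count)
print(ages)
```

Assigning to an existing key replaces the old value, which is how the structure keeps its keys unique.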

A hash table is essentially a pair of arrays, one that stores keys and the other that stores the associated values. The slots in a hash table are called buckets, and a hash table starts out with a number of empty buckets waiting to be filled with data. What determines which bucket will be used to store the value from our key-value pair? The answer lies in understanding a core part of the hash table: the hash function.

This diagram illustrates the basic idea behind a hash function. It takes in raw data and through some deterministic operation outputs an index.

Imagine we have some raw data, perhaps a custom object, that we are going to store in a hash table. How does a hash function determine where to store that data? Hash functions can be written in many different ways, and some are much better than others. In its most generic sense, a hash function takes some input data (let's focus on the key in our key-value pair here) and performs some operations on it to generate an integer, often a large one. This purposefully represents the data in a shortened form; the process is known as hashing, and the integer output is a hash or hash value. There are two rules to remember when discussing hash functions: 1. hashing is an irreversible process (it is not invertible), and 2. some information is lost when hashing.

A hash function should generate the same output from the same input, using the same set of rules every single time; we can say that hash functions must be deterministic. While a hash function should produce the same hash for two equal objects (allocated separately but holding duplicate data), two equal hashes do not necessarily come from the same data. When two different pieces of data result in the same hash value, we have what is known as a collision; we will discuss how to manage collisions shortly.
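These properties can be seen with a deliberately weak hash function. The sketch below is an assumption for illustration only: toy_hash is a made-up function that sums character codes, not something a real hash table would use.

```python
def toy_hash(key: str) -> int:
    """Hypothetical hash function: the sum of the key's character codes."""
    return sum(ord(ch) for ch in key)

# Deterministic: the same input gives the same output every time.
first = toy_hash("listen")
second = toy_hash("listen")

# Equal data built separately still hashes to the same value.
a = "".join(["li", "sten"])
b = "listen"
same_for_equal_data = toy_hash(a) == toy_hash(b)

# Collision: different keys, same hash. Anagrams share the same character
# sum, which also shows that information (the letter order) is lost.
collide = toy_hash("listen") == toy_hash("silent")

print(first, second, same_for_equal_data, collide)
```

Because the letter order is thrown away, there is no way to get back from the hash to the original key, which is what the two rules above are getting at.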

A hash function takes a key from a key-value pair and ultimately maps it to an integer within the range [0, N-1], where N is the capacity of the bucket array of our hash table. The goal is to use h(k), which represents the hash value of a key (where h is our hash function and k is the key), as an index into our bucket array. Depending on the hash function, our hash value for a key may be a very large integer; we must find a way to reduce that number to fit the size of the hash table. One way to do this is to take the hash value modulo the number of buckets; this final output is the index into the hash table's bucket array. We then go to that index of the array and store the value from our key-value pair there.
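As a rough sketch of the two steps described above (hash the key, then reduce it modulo the number of buckets), consider the following. The names string_hash and bucket_index are hypothetical, and the multiply-by-31 scheme is just one common choice, not the only one.

```python
def string_hash(key: str) -> int:
    """Hypothetical polynomial hash; the result can be a very large integer."""
    h = 0
    for ch in key:
        h = h * 31 + ord(ch)   # mix each character into the running value
    return h

def bucket_index(key: str, capacity: int) -> int:
    """Reduce the hash value to an index in the range [0, capacity - 1]."""
    return string_hash(key) % capacity

N = 8                          # capacity of the bucket array
raw = string_hash("apple")
idx = bucket_index("apple", N)
print(raw, idx)
```

Whatever the size of the raw hash, the modulo step guarantees an index that fits the bucket array.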

The implication of all of this is that if we need to find a value and we have the key, we can just pass that key to our hash function, get our final hash value output, and index directly into the array using that value. Indexing into an array, remember, is a random access operation and takes constant time. A hash table is similar to an array in that we have random access, but unlike both arrays and linked lists, finding a value does not require a linear search, a binary search, or a traversal (searching and traversing take linear time, and a binary search takes logarithmic time).

Collision Management

Hash functions sometimes result in collisions: two different keys produce the same hash value, or hash values that reduce to the same index, so our function attempts to map two different objects to the same bucket. One way to deal with this is to have each bucket hold an array or a linked list rather than a single value. This way, when you index into that bucket, you can either iterate through an array storing multiple entries or traverse a linked list, comparing keys until you find the entry you are looking for. This solution to collisions is known as separate chaining.
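A minimal sketch of separate chaining might look like the following. The class and method names are hypothetical, each bucket is a Python list of (key, value) pairs, and Python's built-in hash() stands in for our hash function.

```python
class ChainedHashTable:
    """Minimal separate-chaining sketch; all names here are hypothetical."""

    def __init__(self, capacity: int = 8) -> None:
        # Each bucket is its own list so that colliding entries can share it.
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, key) -> int:
        # Built-in hash() stands in for h(k); modulo reduces it to a bucket index.
        return hash(key) % len(self.buckets)

    def put(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:        # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))        # new key: chain it onto the bucket

    def get(self, key, default=None):
        # Walk the chain, comparing keys until we find a match.
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default

table = ChainedHashTable(capacity=4)
table.put(1, "one")
table.put(5, "five")   # 5 % 4 == 1, so this lands in the same bucket as key 1
table.put(1, "uno")    # same key again: replaces "one"
print(table.get(1), table.get(5), table.get(9))
```

Lookups stay fast as long as the chains stay short; if many keys pile into the same bucket, finding a value degrades toward a linear search of that chain.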

Next time we'll be covering trees and graphs (briefly). After that, it's all about Cocoa Touch.