Data Structures

Data Structures - Hash Tables

Associative arrays, structures which have meaningful keys associated with values instead of indexes being associated with values, are another abstract data type. The first rule to remember is that keys must be unique, but values can be duplicates. Custom objects can be stored as values, and implementations of associative arrays differ across languages; in Objective-C, we have the NSDictionary class. Underneath the hood of every associative array is a data structure known as a hash table.

A hash table is essentially a pair of arrays, one that stores a key and the other that stores an associated value. Entries in a hash table are called buckets, and hash tables begin with multiple buckets that are waiting to be filled with data. What determines which bucket will be used to store a value from our key value pair? The answer to that lies in understanding a core part of the hash table: the hash function.

   This diagram illustrates the basic idea behind a hash function. It takes in raw data and through some deterministic operation outputs an index.

 

This diagram illustrates the basic idea behind a hash function. It takes in raw data and through some deterministic operation outputs an index.

Imagine we have some raw data, perhaps a custom object, that we are going to store in a hash table. How does a hash function determine where to store that data? Hash functions can be written in many different ways, and some are much better than others. In its most generic sense, a hash function takes some data input (let's focus on the key in our key-value pair here), and performs some operations on it to generate a large integer. This is a way of purposefully representing that data in a shortened way; this is known as hashing, and the integer output is a hash or a hash value. There are two rules to remember when discussing hash functions: 1. Hashing is an irreversible process (it is not invertible) and 2. some information is lost when hashing.

A hash function should be able to generate some data from some input using the same set of rules every single time; we can say that hash functions must be deterministic. While a hash function should produce the same hash for two equal objects (allocated separately but holding duplicate data), two equal hashes do not necessarily come from the same data. When two different pieces of data result in the same hash value, we have what is known as a collision; we will discuss how to manage collisions shortly.

A hash function takes a key from a key-value pair and ultimately maps it to an integer within the range [0, N-1], where N is the capacity of the bucket array of our hash table. The goal is to use h(k), which represents the hash value of a key (where h is our hash function and k is the key), as an index of our bucket array. Depending on the hash function, our hash value for a key may be a very large integer; we must find a way to reduce that number down to a size relative to the size of the hash table. An example of a way to do this is by taking the modulo of the integer with regard to the number of buckets; this final output would represent the hash table bucket array index. We then map to that index of the array and store the value in our key-value pair.

The implication of all of this is that if we need to find a value and we have the key, we can just pass that key to our hash function, get our final hash value output, and index directly into the array using that value. Indexing into an array, remember, is a random access operation (and is a constant time operation). Although similar to arrays in that we have random access, it is unlike both arrays and linked lists in that there is no concept of a linear/binary search or traversing (which are all linear time functions).

Collision Management

Hash functions sometimes result in collisions - this is when a generated hash value isn't unique enough between two different objects, resulting in our function attempting to map two different objects to the same bucket. One way to deal with this is to have a bucket where multiple values are attempting to be stored simply hold an array or a linked list. This way, when you index into that bucket, you can either iterate through an array storing multiple objects, or you can traverse over a linked list to do the same. This solution to collisions is known as separate chaining.

Next time we'll be covering trees and graphs (briefly). After that, it's all about Cocoa Touch.

Data Structures - Stacks & Queues

Today we're going to talk about stacks and queues - two abstract data types that just use data structures (arrays or linked lists) under the hood. For Objective-C support, head near the bottom of the post.

   LIFO (stack) vs FIFO (queue)

 

LIFO (stack) vs FIFO (queue)

To visualize a stack, think of a stack of dishes - you wash the first one, dry it, and place it on a table. Every other dish that is washed and dried is placed on top of this first one, creating a stack. Which dish comes off first when it's time to put them away? The conventional way to do it would be to pull the topmost item off of the stack and put it away. This explains the protocol followed by stacks which is LIFO, or last in, first out. The last item to go on the stack (dish at the top) is the first to come out.

Stacks are simple because you can generally perform three operations - push (add something to the top of the stack), pop (you can take something off the top of the stack), and peek (you can see what item is at the top of the stack without doing anything to it). They're sort of a wrapper around an array or linked list with limitations on what operations can be performed.

The term "stack" can also refer to a concept involving memory allocation - see the bottom of the post for information on this.

Queues can be visualized by thinking of a line in a store - the first person to get on line is the first person to be serviced at the counter. This data type follows the FIFO protocol - first in, last out. The first person in line is the last person to get out of the line. Often times, queues are used in concurrency situations - you can create queues to order processes and control what process runs when.

*In the context of this blog, we are talking about heaps and stacks as they exist within the realms of CPU memory. Keep in mind, however, that heap may also refer to the tree-based data structure, entirely separate from CPU memory. Heaps can be max or min heaps; there is a lot to learn about them, so we will not be covering them as data structures in this blog series.

OBJECTIVE-C SUPPORT:

Stacks are not explicitly supported in Objective-C, but they are simple enough to implement with, say, an NSMutableArray. You can add a category to the NSMutableArray class and add methods for pop, push and peek; it's that simple.

Queues are supported in Objective-C in a few ways: NSOperationQueue/NSOperations, and Grand Central Dispatch (GCD). GCD is lower-level than NSOperationQueue, and is written in C; however, it is important to understand that although it adds another layer of abstraction, NSOperationQueue uses GCD under the hood. You can also implement a queue yourself similarly to how we discussed doing so for a stack using a mutable array.

GCD provides queue implementation for use with many things, concurrency being one of them; it does so in the form of blocks (read up on them!). Both allow for priority queues, meaning you can insert processes to run into the middle of an existing queue by setting its priority.

Whether to use GCD vs NSOperationQueue is a point of contention, but it should be noted that GCD is much more light-weight than NSOperationQueue. It involves no thread-safety locking mechanisms, and is less taxing on the CPU. As such, some developers prefer to use it, as long as it suits their needs and they require none of the higher-level functionality of NSOperationQueue.

*In following convention, UI updates should happen on the main queue, and other processes should be executed on background queues. This ensures that there is minimal (or no) lag for the end-user of an application.

Left for review are hash tables, trees and a brief introduction to graphs. I will try and be brief with the last two so that we can move on to more iOS specific concepts.

 

Perhaps without realizing it, programmers use stacks every day in coding - the stack refers to a place in computer memory (along with the heap). The heap refers to memory on the CPU dedicated to items that are explicitly allocated memory in a program. It is up to the programmer to release the memory holding onto those objects when they're no longer needed. Heap memory is accessed using pointers (or some other reference), and as such, it is somewhat slow to read/write to. The stack, on the other hand, is an abstract data type used to "manage" a region of CPU memory that stores temporary variables which are created by functions. When we refer to "the stack", we are talking about the region of memory (versus just "a stack" which refers to some instance of the data type). Whenever we enter a function, all variables created are pushed onto a stack frame, which itself sits on the stack. A new stack frame exists for each function. When a function exits, the frame is popped off of the top; once a stack variable is freed, that piece of memory becomes available for other stack variables. FOR A LIST OF PROS AND CONS BETWEEN STACKS AND HEAPS, REFER TO THE BOTTOM OF THIS POST.

HEAP

PROS:

  • Variables can be globally accessed
  • There is no limit on memory size (limitations set by CPU)

CONS:

  • Relatively slow access
  • Programmer is in charge of allocating and deallocating variables
  • Efficiency regarding use of memory is not guaranteed.

STACK

PROS:

  • Fast access
  • Don't need to explicitly deallocate variables
  • Memory won't become fragmented from deallocation (efficient use of memory)

CONS:

  • Access to local variables (within function) only
  • There are limitations on size of the stack

 

 

Data Structures - Linked Lists

Before we dive into a new data structure, be aware that there is a great quick guide on Computer Science concepts (data structures and common algorithms) that you can access here.

Now, let's discuss another data structure that builds off of our knowledge of arrays: the Linked List.

The linked list is made up of units called nodes, and each node contains two things: a value, and a pointer to the next node so that we can find it in memory. Unlike arrays, linked lists are collections of data stored that are not stored contiguously. Because of this, linked lists can grow in size very easily; you can simply add on an object to a linked list anywhere in memory, and the previous node will simply contain a pointer to the new node, wherever it may be.

Linked lists, as data structures, are often used to implement other abstract data types; they describe functionality more than implementation. The underlying implementation can be due to data structures such as arrays, hash tables, linked lists, etc. For example, stacks, queues, and associate arrays can be implemented using linked lists; they themselves are abstract data types, but are implemented using data structures.

Another way to differentiate between data structures and abstract data types is that data structures are looked at from the perspective of the person implementing them; an abstract data type is viewed from the perspective of a user, meaning that the data is described by its behavior.

Just as a developer can have variations of a standard array (think dimensions), we can have doubly linked lists, where nodes contain pointers to both the next node and the previous node, and even circular linked lists, where instead of the last node pointing to a "dummy" node, as is sometimes done, the last node contains a pointer to the first one. Linked lists support sequential access; that is, you must traverse from the "head" node through the list until you find your desired object. This means that indexing in a linked list is a linear time function; the Big O notation for linked list access (and also searching) is O(n).

                                This is a diagram illustrating the insertion of a node to an existing linked list

                              

This is a diagram illustrating the insertion of a node to an existing linked list

 

However, insertion and deletion are optimized operations in linked lists - they are both constant time functions (O(1)). If something needs to be inserted, we simply redirect the pointers in the linked list so that the intended previous one points to the new node, and the new node points to the intended next one. The same goes for deletion; we can simply redirect the node pointers so that nothing points to a certain node, and deallocate it from memory.

*note - if you don't already have a pointer to a node that you want to delete, the actual time for deletion would be (search_time) + O(1). The O(1) represents the constant time deletion. The same goes for insertion.

Arrays, at least in my experience, are used far more often than linked lists. There isn't even support for linked lists in Objective-C, although you can implement your own in one of a few ways; check stack overflow for more information on replicating linked list functionality in Objective-C.

One time a linked list may be useful is when using it to implement, for example, a queue. As we will learn, queues are FIFO (first in, last out), which means that the first object to be placed on a queue is the first one to exit the queue. Because insertion is a constant time function for linked lists, inserting to and deletion from the middle of a linked list is optimized. For a priority queue, in which you may break the FIFO rule when something has a high priority flag placed on it, linked lists would work nicely since you can remove directly from themiddle of one.

Next time we'll be covering very simple abstract data types - stacks and queues!

Data Structures - Arrays

In the context of data structures, I'll be writing about them in a language-agnostic fashion. However, I WILL be sure to address how they are used in Cocoa Touch programming (the NS classes that the structures correspond to, differences between Objective-C and other languages, etc). Over the next few blog posts, I'll be going over Arrays, Links, Stacks, Queues, and Hash Tables. We will be discussing some of the main uses of data structures which ultimately determine which is the best to use for your situation: indexing, searching, inserting, deleting, and sorting.

Side note: I finally took the time to find the origins of shell, terminal and console. First of all, most Macs use bash as their shell. The shell is just the program that processes commands and returns output. The terminal a wrapper program that contains and runs the shell program. A console is a special kind of terminal used to communicate with your operating system at a low level (not standard on Mac machines).
Today, we use terminal emulators (the terminal program on a Mac). These are virtual terminals that replace the terminals of years past, which were little more than a keyboard and a monitor used to type in commands, pass them to the shell, process them with the computer, and return an output. Today we more commonly use graphical user interfaces to interact with our computers.

Side note aside, let's begin by discussing a very commonly used data structure, the array.

Arrays are ordered collections of data stored in memory contiguously, which means that a whole block of memory is allocated for an array. For this reason, arrays cannot grow in size easily, since there may not be additional memory RIGHT NEXT to the block of memory already set aside for an array. This leads us to our first point, which is that arrays are immutable (for the first Objective-C specific point on this, see #1 at the bottom of this post).

Arrays can be one-dimensional, two-dimensional (an array of arrays), three-dimensional (an array of two-dimensional arrays), etc. It just depends on how you want to structure your data. They can even be jagged (a two-dimensional array where each array is not necessarily the same length).

Arrays allow for random, or direct, access. This means that we can index right into an array. Our computer already knows the size of an array, so if we say "take array A and go to the 4th index", our computer immediately knows where in memory that array is located, how much space is taken up, and can jump right to the 4th index. This means that indexing in an array is a constant time function, meaning that additional inputs of data in an array structure don't lead to increases in time complexity. The Big O notation for array indexing is O(1).

Searching, on the other hand, is a memory intensive process, as the computer must iterate over every item in the array until a desired object is found. It doesn't know what is being stored in an array, only WHERE in MEMORY it is stored. The best case scenario is that the object we are looking for is at index 0, and the worst case is that it is the last index in the array. Either way, we must assume the worst case scenario, which means iterating over the entire array until we find our object. Searching in an array is a linear time function, meaning that additional inputs result in a corresponding increase in time complexity. The big O notation for array searching is O(n).

We can optimize searching by using a binary search - we sort an array in ascending or descending order, check the middle element, and if that value is higher than our desired value, we search the lower half of the data values. This process continues, halving the data set each time, until we find (or don't find) the desired value.

1. With regard to the Objective-C implementation of arrays, although it is more memory intensive to do so, the language offers the NSMutableArray class (along with NSArray and CFArray. This is an array that DOES have dynamic sizing; however, things change when you're using this kind of data structure. If you are adding objects to a mutable array, there MAY be space in the contiguous memory block for that additional data. However, all the other objects must be shuffled around to keep the indices in place. Additionally, if there is no space in that contiguous block, the entire array must be copied over with the new values to a new contiguous block somewhere else in memory.
As a result, if data is subject to frequent change / growth in size, perhaps there are other data structures available better suited to a developer's needs.

Arrays are good to use when access is prioritized (constant time function). If you are able to insert data into an array (mutable) and you are inserting at the end, THEN insertion is a constant time function (we can jump directly to the last index of the array in memory). However, if end-insertion is not happening, the time complexity is linear (O(n)) since we must iterate through the entire array to find our desired point of insertion.
 

AN ADDITIONAL POINT ON RANDOM/DIRECT ACCESS (INDEXING) OF AN ARRAY BEING CONSTANT-TIME:

The math involved in indexing an array is addition and multiplication - these are both considered to be constant time operations. In other words, adding 1000 + 1000 + 1000 doesn't take much more time than adding 15 + 15 (and the same goes for multiplication). Higher values and more of them don't corresponding to a linear increase in the time required for the computation, so they are considered constant.

A crude representation of an array in a contiguous memory block, with each collection member mapped to it's place in memory

A crude representation of an array in a contiguous memory block, with each collection member mapped to it's place in memory

Let's say we want our program to access the third index in the array (index 2). We know that the first index has a memory address of x004. An address is 2 bytes large in this case, so each address increases by 2 here. To get the index we want, we must compute:
(address of first index) + ((desired index) (2 bytes per index in memory))
This comes out to (x004) + (index 2 * 2) -> (x004 + 4 bytes).
This leads us to the address x008, which holds the values 99 (the value of the third index, or index 2).
As you can see, multiplication and addition are constant operations required to index into an array; this is the reason why array indexing is a constant-time function.

*if you see me using the term deterministic in future posts, take it to mean that if given the same input each time, an algorithm will return the same output.