Data Structures — Unit 5 — Hash Tables, Bit Sets, and Other Cool Data Structures

Bit vector representations

Suppose that we only want to represent subsets of a set S that is finite and reasonably small. We'll also suppose that there is an easy-to-compute one-one and onto function from the members of S to the numbers {0,1,...,|S|-1}.

For the sake of simplicity, I'll assume that S = {0,1,...M-1} for some reasonably small M.

Then we can represent each set, T, with an array of booleans

class Set {
    private: bool A[M] ;
    public: ... 
} ;

x is in the set exactly if

A[x]

Example:

Extra Note The use of a global constant M in the Set is not very good style. In reality, I would probably make M be a template parameter to the class. However for the sake of not cluttering things up with template declarations, I'm going to simply assume M is some global constant int, greater than 0. End Note.

Most operations are now constant time (Θ(1)).

bool Set::contains( int x ) {
    assert( 0 <= x && x < M ) ; 
    return A[x] ; }

void Set::insert( int x ) {
    assert( 0 <= x && x < M ) ; 
    A[x] = true ; }

void Set::remove( int x ) {
    assert( 0 <= x && x < M ) ; 
    A[x] = false ; }
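The note above mentions making M a template parameter. A minimal sketch of what that might look like follows; the class name BoolSet is made up for illustration, and the array is value-initialised so that a new set starts empty.

```cpp
#include <cassert>

// Hypothetical name: the boolean-array set with the universe size M
// supplied as a template parameter instead of a global constant.
template <int M>
class BoolSet {
    static_assert(M > 0, "the universe {0,...,M-1} must be non-empty");
    bool A[M] = {};   // value-initialised: every entry starts out false
public:
    bool contains(int x) const { assert(0 <= x && x < M); return A[x]; }
    void insert(int x)         { assert(0 <= x && x < M); A[x] = true; }
    void remove(int x)         { assert(0 <= x && x < M); A[x] = false; }
};
```

Two sets over different universes, say BoolSet<10> and BoolSet<1000>, are then distinct types, so they cannot be mixed up by accident.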

True bit vectors

In most C++ implementations each bool takes up one byte, so this is not a very space-efficient representation. We can use individual bits instead, as follows.

If we use words of size b and A contains words, x is in the set iff

bit (x mod b) of A[x div b] is 1

Note: char variables in C++ are at least 8 bits, and exactly 8 bits on virtually all current implementations. The code below assumes 8.

class Set {
    private: char A[(M+7)/8] ;
    public: ... 
} ;

Example

With implementations

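bool Set::contains( int x ) {
    assert( 0 <= x && x < M ) ; 
    // Test bit (x mod 8) of byte (x div 8).
    return ( A[x/8] & (1 << (x%8)) ) != 0 ; }

void Set::insert( int x ) {
    assert( 0 <= x && x < M ) ; 
    // Set the bit, leaving the other 7 bits of the byte unchanged.
    A[x/8] = A[x/8] | (1 << (x%8)) ; }

void Set::remove( int x ) {
    assert( 0 <= x && x < M ) ; 
    // Clear the bit, leaving the other 7 bits of the byte unchanged.
    A[x/8] = A[x/8] & ~(1 << (x%8)) ; }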

This improves space efficiency eightfold, but might cost some time, since every operation now needs a divide, a shift, and a mask.

Union, intersection, and complement of sets is easily done using bit parallel "or", "and", and "not" operations.

Using ints or long ints (typically 32 or 64 bits) instead of bytes (8 bits) may make bit-vectors slightly more time efficient.
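As a rough sketch of the last two points, here is a bit vector built from 64-bit words, with union, intersection, and complement done one word at a time. The class name WordSet and its method names are invented for this example, and std::uint64_t is assumed to be available.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: a bit vector over {0,...,M-1} stored in 64-bit words.
template <int M>
class WordSet {
    static constexpr int B = 64;                    // bits per word
    static constexpr int WORDS = (M + B - 1) / B;   // words needed, rounded up
    std::uint64_t A[WORDS] = {};                    // all bits start at 0 (empty set)
public:
    bool contains(int x) const {
        assert(0 <= x && x < M);
        return ((A[x / B] >> (x % B)) & 1u) != 0;
    }
    void insert(int x) {
        assert(0 <= x && x < M);
        A[x / B] |= std::uint64_t(1) << (x % B);
    }
    void remove(int x) {
        assert(0 <= x && x < M);
        A[x / B] &= ~(std::uint64_t(1) << (x % B));
    }
    // Each loop iteration below handles 64 potential members at once.
    WordSet unionWith(const WordSet& o) const {
        WordSet r;
        for (int i = 0; i < WORDS; ++i) r.A[i] = A[i] | o.A[i];
        return r;
    }
    WordSet intersectWith(const WordSet& o) const {
        WordSet r;
        for (int i = 0; i < WORDS; ++i) r.A[i] = A[i] & o.A[i];
        return r;
    }
    WordSet complement() const {
        WordSet r;
        for (int i = 0; i < WORDS; ++i) r.A[i] = ~A[i];
        // Clear the unused high bits of the last word so that they
        // never show up as members outside {0,...,M-1}.
        if (M % B != 0) r.A[WORDS - 1] &= (std::uint64_t(1) << (M % B)) - 1;
        return r;
    }
};
```

The set operations still take time proportional to M, but with a constant factor of about 1/64 compared with handling one element at a time.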

Extending the idea to partial functions

If we have a partial function whose domain is {0,1,2,...,M-1} for a reasonably small M, then we can extend the bit-vector idea by including an array of size M of the range type. The bit-vector represents the preimage of the partial function.
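One way this might look is sketched below. The names SmallMap, put, and get are made up for the example, V is assumed to be default-constructible, and an array of bool stands in for the bit vector to keep the code short.

```cpp
#include <cassert>

// Hypothetical sketch: a partial function from {0,...,M-1} to V.
template <int M, typename V>
class SmallMap {
    bool defined[M] = {};   // plays the role of the bit vector: is f(x) defined?
    V    value[M]   = {};   // value[x] is meaningful only when defined[x] is true
public:
    bool contains(int x) const {
        assert(0 <= x && x < M);
        return defined[x];
    }
    void put(int x, const V& v) {
        assert(0 <= x && x < M);
        defined[x] = true;
        value[x] = v;
    }
    const V& get(int x) const {
        assert(contains(x));   // f(x) must be defined
        return value[x];
    }
    void remove(int x) {
        assert(0 <= x && x < M);
        defined[x] = false;    // the stale entry in value[x] is simply ignored
    }
};
```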

Witness lists

Bit sets still have some inefficient operations. To initialize the set to empty is Θ(M), as is finding an arbitrary member of the set.

Also iterating through all the members of a set requires Θ(M) iterations rather than Θ(N) where N is the current size of the set.

Set::Set() {
    for( int i=0; i < M ; ++i ) A[i] = false ;
}

int Set::getAny() {
    for( int i=0 ;  i < M ; ++i ) {
        if( contains(i) ) return i ; }
    assert( false ) ;
}

Witness lists extend the idea of bit-vectors as follows:

We track the current size of the set with variable "size".
We use a "witness array", W, to contain all the members of the set (in no particular order, but without duplicates) in its first "size" locations. We call these members the witnesses. So x is in the set iff there exists an i with 0 ≤ i < size such that W[i] = x.
If an item x is in the set, instead of A[x] being true, it is the index of x in the witness array.

Thus x is in the set exactly if

0 ≤ A[x] < size and W[ A[x] ] = x

Example:

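Note that we don't need to initialize the A array! This is one of the rare cases where it is ok to fetch from a variable that we may have never stored into: whatever garbage A[x] holds, the test above rejects it unless W confirms that x is a member. (Strictly speaking, the C++ standard makes reading an uninitialized int undefined behaviour, so the trick relies on the implementation simply returning whatever bits happen to be in memory.)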

class Set { // Witness list implementation
    private: int A[ M ] ;
    private: int W[ M ] ;
    private: int size ;
    public: ... 
} ;

with implementations

Set::Set() { // constructor
    size = 0 ; // And that's it!
}

bool Set::contains( int x ) {
    assert( 0 <= x && x < M ) ; 
    return 0 <= A[x] && A[x] < size && W[A[x]]==x ; }

void Set::insert( int x ) {
    assert( 0 <= x && x < M ) ; 
    if( ! contains( x ) ) {
        W[size] = x ;
        A[x] = size ;
        size += 1 ; }
 }

void Set::remove( int x ) {
    assert( 0 <= x && x < M ) ; 
    if( contains( x ) ) {
        // Copy the last witness over the x in the witness list
        A[ W[ size-1] ] = A[x] ;
        W[ A[x] ] = W[ size-1 ] ;
        size -= 1 ; }        
}

int Set::getAny() {
    assert( size > 0 ) ;
    return W[0] ;
}

Now all the operations are Θ(1).
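The witness array also makes iteration and emptying cheap. The sketch below is a variant, not the class above: it zero-fills both arrays once at construction (a one-time Θ(M) cost) so that every read is well defined, and it adds three made-up members, clear, count, and witness. Emptying the set is then Θ(1), and visiting the members takes Θ(N) steps.

```cpp
#include <cassert>

// Hypothetical variant of the witness-list set.
template <int M>
class WitnessSet {
    int A[M] = {};     // zero-filled once, so later reads are always defined
    int W[M] = {};
    int size_ = 0;
public:
    bool contains(int x) const {
        assert(0 <= x && x < M);
        return 0 <= A[x] && A[x] < size_ && W[A[x]] == x;
    }
    void insert(int x) {
        if (!contains(x)) {
            W[size_] = x;
            A[x] = size_;
            ++size_;
        }
    }
    // Theta(1): stale entries left in A and W fail the contains test.
    void clear() { size_ = 0; }
    int count() const { return size_; }
    // The i-th witness, for 0 <= i < count().
    int witness(int i) const {
        assert(0 <= i && i < size_);
        return W[i];
    }
};

// Visits only the N current members, never the other M - N slots.
template <int M>
long sumMembers(const WitnessSet<M>& s) {
    long total = 0;
    for (int i = 0; i < s.count(); ++i) total += s.witness(i);
    return total;
}
```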

Extending the idea to partial functions

Suppose the domain of a partial function is {0,1,...M-1}.

Simply add another array, R, which parallels the A array.

If x is in the domain of the function, then R[x] is the value of the function at that point.

Hashing

I used to go to a university sports complex where only members were allowed in.

Members were expected to leave their membership cards at the front desk when they entered and to retrieve them when they left.
Since a few hundred members would be in the building at a time an efficient way to store and retrieve cards was needed.
The method chosen was this: A box with 100 cubby holes was built (a 10 by 10 grid). Each card was filed according to its last 2 digits.
To retrieve, the attendant grabbed all cards from the appropriate cubby hole and did a linear search for the membership number. (Which, of course, no member ever forgot.)
This is a simple example of hashing. The hash function here is h(n) = n mod 100, where n is the membership number.

Bucket hashing

In its simplest form hashing for representing sets works like this

We pick a fixed positive integer B smaller than the number of values in the element type.
We pick a function, h, from the element type to the set {0,1,...B-1}. h is the hash function.
We call h(x) the "hash value" of x.
h is typically onto, but not one-one. I.e. two values may have the same hash value.
We represent the set with an array, bucket, of B objects, each of which represents a set. (For example we could have an array of B linked lists, or an array of B AVL trees.)
To insert a value, x, into the set, we insert it into bucket[ h(x) ].
To determine if an element, x, is contained in the set, we look in bucket[ h(x) ].
To delete an item from a set, we delete it from bucket[ h(x) ].
I.e. all elements that hash to the same value are kept in the same bucket.
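The steps above can be sketched as follows for a set of ints. This is only one possible arrangement: the class name BucketSet is invented, B = 100 and the "last two digits" hash are borrowed from the cubby-hole story, and each bucket is a std::list.

```cpp
#include <algorithm>
#include <cassert>
#include <list>

// Hypothetical sketch of bucket hashing for a set of ints.
class BucketSet {
    static constexpr int B = 100;    // number of buckets (assumed for the example)
    std::list<int> bucket[B];        // each bucket is itself a small set
    // For non-negative x this is the last two decimal digits of x;
    // the extra "+ B" keeps the result in 0..B-1 for negative x too.
    static int h(int x) { return ((x % B) + B) % B; }
public:
    bool contains(int x) const {
        const std::list<int>& b = bucket[h(x)];
        return std::find(b.begin(), b.end(), x) != b.end();   // linear search of one bucket
    }
    void insert(int x) {
        if (!contains(x)) bucket[h(x)].push_back(x);
    }
    void remove(int x) {
        bucket[h(x)].remove(x);      // does nothing if x is absent
    }
};
```

For instance, 1234 and 5634 both land in bucket 34, so a search for either one scans the same short list and never looks at the other 99 buckets.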

The hash function should do a good job of distributing the values of the element type among the various buckets.

E.g. the probability of a randomly chosen value having a particular hash value should be uniform.
E.g. the probability of 2 similar values having the same hash value should be the same as the probability for 2 dissimilar values, for any reasonable notion of similarity.

If this is the case, a set of size N will have buckets of size roughly N/B.

Example 0

If we use linked lists and have a set of size 1,000 and B=1,000,
then the expected length of each linked list will be 1 and
the expected time to do a successful search will be about 1 comparison (the rough estimate N/(2B) gives 1/2, but a successful search always needs at least one) and
the expected time to do an unsuccessful search will be 1 comparison.
Compare this to 500 and 1000 if hashing is not used.

Example 1

If we use linked lists and have a set of size 1,000,000 and B=1,000,
then the expected length of each linked list will be 1,000 and
the expected time to do a successful search will be 500 comparisons and
the expected time to do an unsuccessful search will be 1,000 comparisons.
Compare this to 500,000 and 1,000,000, if hashing is not used.

Example 2

If we use AVL trees and have a set of size 1,000,000 and B=1,000,
then the expected size of each AVL tree will be 1,000 and the expected height will be less than 15 and
the expected time to do a search will be less than 15 comparisons.
Compare this to 30, if hashing is not used.
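The figures in Examples 1 and 2 appear to come from the N/B estimate. The arithmetic below is a sketch of that reasoning for N = 1,000,000 and B = 1,000; the AVL lines assume the standard bound that an AVL tree with n nodes has height at most about 1.44 log2 n, which these notes do not state.

```latex
\begin{align*}
\text{expected bucket size} &= \frac{N}{B} = \frac{10^{6}}{10^{3}} = 1000 \\
\text{unsuccessful list search} &\approx \frac{N}{B} = 1000 \text{ comparisons} \\
\text{successful list search} &\approx \frac{N}{2B} = 500 \text{ comparisons} \\
\text{AVL height, one bucket} &\lesssim 1.44 \log_2 1000 \approx 1.44 \times 9.97 \approx 14.4 < 15 \\
\text{AVL height, no hashing} &\lesssim 1.44 \log_2 10^{6} \approx 1.44 \times 19.93 \approx 28.7 < 30
\end{align*}
```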

Extending to functions

The above discussion assumes we are representing sets.

The extension to partial functions (tables) is straight-forward.

We hash only the domain element.
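A possible sketch of such a table is shown here, mapping int keys to strings. The names BucketTable, put, and find are made up for the example, and B = 100 is again just an assumed bucket count. Each bucket stores (key, value) pairs, but only the key is passed to h.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <utility>

// Hypothetical sketch: a partial function from int to std::string.
class BucketTable {
    static constexpr int B = 100;   // assumed number of buckets
    std::list<std::pair<int, std::string>> bucket[B];
    static int h(int key) { return ((key % B) + B) % B; }   // hash of the key only
public:
    // Define or redefine the value at key.
    void put(int key, const std::string& value) {
        for (auto& entry : bucket[h(key)]) {
            if (entry.first == key) { entry.second = value; return; }
        }
        bucket[h(key)].push_back(std::make_pair(key, value));
    }
    // Returns a pointer to the value, or nullptr if key is not in the domain.
    const std::string* find(int key) const {
        for (const auto& entry : bucket[h(key)]) {
            if (entry.first == key) return &entry.second;
        }
        return nullptr;
    }
};
```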

Time complexity

Hashing does not improve the worst-case time complexity.

In the worst case, all the values could hash to the same bucket.

Nor, if B is a constant, does it improve the average-case time complexity.

However, from a practical point of view, hashing can make a vast difference as it can improve times by large constants.

Good Hash Functions

Selecting a good hash function is something of an art.

[to be completed.]

Caching

Caching attempts to store frequently used items where they will be found most quickly.

For example, if we store items in a linked list,

then every time an item is searched for we could move that item closer to the head of the list.
Eventually the most frequently used items will be closest to the head of the list.
The past is a good predictor of the future.
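As an illustration of the linked-list case, the sketch below uses the move-to-front policy, which is one specific way of moving an item "closer to the head": every successful search relinks the found node at the front. The class name MoveToFrontList is invented, and std::list::splice does the relinking without copying the element.

```cpp
#include <algorithm>
#include <cassert>
#include <list>

// Hypothetical sketch of a self-organising list using move-to-front.
class MoveToFrontList {
    std::list<int> items;
public:
    void insert(int x) { items.push_back(x); }
    // Not const: a successful search reorders the list.
    bool contains(int x) {
        std::list<int>::iterator it = std::find(items.begin(), items.end(), x);
        if (it == items.end()) return false;
        items.splice(items.begin(), items, it);   // relink the found node at the head
        return true;
    }
    int front() const {
        assert(!items.empty());
        return items.front();
    }
};
```

After a run of searches, recently requested items sit near the head, so repeated lookups of the same few items need only a few comparisons each.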

A similar trick is possible with plain search trees and (to a lesser extent) AVL search trees.