## Lecture 23 - 03/13

### Hashing

• Hashing is most common implementation for a set/map
• Slight performance improvement over BST implementations
• With BSTs:
• We must have comparable items
• Must also maintain bushiness (non-trivial, can use methods like red-black tree or 2-3 tree)
• BST gets us to inserting/checking for containing data in Θ(logN)$\Theta(\log N)$, but we can improve on this
• Terrible idea for storing data: unordered array
• contains and insert take Θ(N)$\Theta(N)$ time
• Good idea: use the data itself as an array index
• This leads to constant time lookup - just looking up memory location specified by data value
• This is highly wasteful of memory, but gives us constant time contains and insert
• Generalizing this idea: use data object's actual value to get an index number
• For Strings, we can try converting the object to base 26$26$
• Ex: for 'cat' let c = 3$3$, a = 1$1$, and t = 20$20$ (position in alphabet). Then we can say the index of 'cat' =3262+1261+20260=2074$= 3*26^2+1*26^1+20*26^0=2074$
• Could also use base 32$32$ (this is particularly nice for binary representations, since it would make each letter of the String 5$5$ bits (32=25$32=2^5$), so every 5$5$ bits of memory encodes for a letter - try to understand why multiplying by 25$2^5$ is equivalent to shifting bits right 5$5$ places)
• Summary:
• We have some sort of hashCode that takes some data object and returns some integer representation, that we can then use as an index
• We need to resolve ambiguity - when hashCode returns same index for different objects - called collision handling
• Better idea - HashTable:
• Have a fixed number of boxes M$M$ in an array, take hashCode number modulo M$M$, and use this as index
• Collision handling: use method like external chaining - at each of the M$M$ boxes, have a linked-list/ArrayList/... structure to store all the items of that common index
• Other methods for collision resolution exist, such as open addressing (look this up, not necessarily better than external chaining; in this class we will use external chaining)
• Note: a good hashCode will give index values that are well balanced (very even distribution across all M$M$ values)
• Load factor: if we have N$N$ items distributed over M$M$ buckets, the load factor is L=N/M$L = N/M$
• Average runtime for contains and insert is Θ(L)$\Theta(L)$
• We can ensure a very fast data structure by making sure that L$L$ is always small
• Whenever L$L$ exceeds some threshold, we resize our array of boxes, thus increasing M$M$ and decreasing L$L$
• Must appropriately re-assign all the N$N$ elements to their new positions in the resized array of boxes
• Note: for doing modulo operation, use Math.floorMod(hashCodeValue, M), as this accounts for negative hashCode values
• Performance: if we ensure resizing to keep the load factor L$L$ below a certain constant threshold, and we ensure the N$N$ items of the table are spread out/well distributed across the M$M$ buckets, we have an amortized constant average runtime: Θ(1)$\Theta(1)$
• Hash functions:
• Constant time amortized performance depends on a balanced hash function (like ensuring a BST is bushy, but a good hash function is luckily a much conceputally simpler problem)
• Using a base that is a power of 2$2$ (2,4,8,16,32,...$2,4,8,16,32,...$) makes for a bad hashCode
• For strings, base 31$31$ is used
• Important to have many parts of the data contribute to the hashCode value - nothing should be ignored
• Base of hashCodes should NOT be a factor of the number of buckets M$M$ of the array
• For objects that consist of multiple members, such as a Collection object, we can get a hashCode of the entire Collection by something like the following:
• Take the first member's own hashcode (most likely by calling that member object's own hashCode function)
• Add it to some total hashCode number, and then "smear/elevate" the number (spread it's encoding out over all the bits by multiplying by a number like 31$31$)
• Repeat with the remaining members (get the hashCode of the next member, add it to the total, smear the total, repeat)
• Can save time in hashing function by only looking at 'top' portion of data when computing
• Java requires all methods have hashCode method
• Never store mutable objects in a HashTable/HashMap
• Never override equals without also overriding hashCode