散列表

Introduction

如果所有的键都是小整数,我们可以用一个数组来实现无序的符号表,将键作为数组的索引而数组中的键 i 处储存的就是它对应的值。这样我们就可以快速访问任意键的值。

使用散列的查找算法分为两步:
第一步:使用散列函数将被查找的键转化为数组的一个索引
第二步:处理碰撞冲突(拉链法和线性探测法)

Issues.

  • Computing the hash function.
  • Equality test: Method for checking whether two keys are equal.
  • Collision resolution: Algorithm and data structure to handle two keys that hash to the same array index.

散列表是算法在时间空间上作出权衡的经典例子。

  • 如果没有内存限制,我们可以直接将键作为(可能是一个超大的)数组的索引,那么所有的查找操作只需要访问内存一次即可完成。但这种理想情况不会经常出现,因为当键很多时需要的内存太大。
  • 如果没有时间限制,我们可以使用无序数组并进行顺序查找,这样就只需要很少的内存。

Classic space-time tradeoff.

  • No space limitation: trivial hash function with key as index.
  • No time limitation: trivial collision resolution with sequential search.
  • Space and time limitations: hashing (the real world).

hash function



Idealistic goal. Scramble the keys uniformly to produce a table index.
  • Efficiently computable.
  • Each table index equally likely for each key.





Java’s hash code conventions

All Java classes inherit a method hashCode(), which returns a 32-bit int.

Requirement. If x.equals(y), then (x.hashCode() == y.hashCode()).
Highly desirable. If !x.equals(y), then (x.hashCode() != y.hashCode()).

Default implementation. Memory address of x.
Legal (but poor) implementation. Always return 17.
Customized implementations. Integer, Double, String, File, URL, Date, … User-defined types. Users are on their own.

Implementing hash code: integers, booleans, and doubles

Java library implementations

Implementing hash code: strings

Java library implementation

Performance optimization.

  • Cache the hash value in an instance variable.
  • Return cached value.

Hash code design

separate chaining

Collisions

Collisions. Two distinct keys hashing to same index.

  • Birthday problem => can’t avoid collisions unless you have a ridiculous (quadratic) amount of memory.
  • Coupon collector + load balancing => collisions are evenly distributed.

Separate chaining symbol table

Use an array of M < N linked lists.

  • Hash: map key to integer i between 0 and M - 1.
  • Insert: put at front of ith chain (if not already there).
  • Search: need to search only ith chain.

Separate chaining ST: Java implementation

Analysis of separate chaining

Proposition. Under uniform hashing assumption, prob. that the number of keys in a list is within a constant factor of N / M is extremely close to 1.

ST implementations: summary

linear probing

Collision resolution: open addressing

Open addressing.
When a new key collides, find next empty slot, and put it there.

Linear probing hash table demo

Hash. Map key to integer i between 0 and M-1.
Search. Search table index i; if occupied but no match, try i+1, i+2, etc.

Linear probing hash table summary

Hash. Map key to integer i between 0 and M-1.
Insert. Put at table index i if free; if not try i+1, i+2, etc.
Search. Search table index i; if occupied but no match, try i+1, i+2, etc.
Note. Array size M must be greater than number of key-value pairs N.

Linear probing ST implementation

ST implementations: summary