https://github.com/0xh3xa/algorithms
Algorithms and Data Structures ❤️🚀
https://github.com/0xh3xa/algorithms
algorithms datastructures java study-notes
Last synced: 10 months ago
JSON representation
Algorithms and Data Structures ❤️🚀
- Host: GitHub
- URL: https://github.com/0xh3xa/algorithms
- Owner: 0xh3xa
- Created: 2019-10-29T20:13:08.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2023-10-30T17:55:59.000Z (over 2 years ago)
- Last Synced: 2023-11-18T03:21:32.575Z (over 2 years ago)
- Topics: algorithms, datastructures, java, study-notes
- Language: Java
- Homepage:
- Size: 381 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Algorithms and Data Structures
[![Algorithms][alg-img]][rep-url] [![Data Structures][datastr-img]][rep-url] [![Open Source Love][open-source-img]][rep-url]
Algorithms and data structures' implementations in Java from the `Algorithms 4th edition` :book:
# What is this course?
* Intermediate level survey course
* Programming and problem solving with applications
# Definitions
* `Algorithms`: Method for solving a problem
* `Data structures`: Method to store information
* `Program = Algorithms + Data structures`
# Topics
* `Part I`
01. Data types: stack, queue, bag, union find, priority queue
02. Sorting: quicksort, mergesort, heapsort, radix sorts
03. Searching: BST, red-black BST, hash table
* `Part II`
04. Graphs: BFS, DFS, Prime, Kruskal, Dijkstra
05. Strings: KMP, regular expression, TST, Huffman, LZW
06. Advanced: B-tee, suffix array, maxflow
---
# Why Algorithms is so important?
Algorithms all around us
01. Internet: web search, packet routing, distribute file sharing, ...
02. Biology: human genome project, protein folding, ...
03. Computers: circuit layout, file system, compilers, ...
04. Computer graphics: Movies, video games, virtual reality, ...
05. Security: cell phones, e-commerce, voting machines, ...
06. Multimedia: MP3, JPG, Divx, HDTV, face recognition, ...
07. Social networks: recommendations, news feeds, advertisements, ...
08. Physics: N-body simulation, particle collision simulation, ...
# Steps for solving the problem
01. Model the problem
02. Find an algorithm to solve it
03. Fast enough? Fits in memory?
04. If not, figure out why
05. Find a way to address the problem
06. Iterate until satisfied
---
# Algorithm Analyze
## Reasons to Analyze Algorithms
* Predict performance
* Compare algorithms
* Provide guarantees
* Understand theoretical basis
## The scientific method
* Observe some feature in the natural world
* Hypothesize a model that is consistent with the observations
* Predict events using the hypothesis
* Verify the predictions by making further observations
* Validate by repeating until the hypothesis and observations agree
## Principles
* Experiments must be reproducible.
* Hypotheses must be falsifiable.
## Empirical Analysis
* Manual measurement (benchmarking) with a stopwatch or programmatic timing method.
* Measure running time for different input sizes *N* (e. g. doubling time) and observe the relationship between the running times.
## Data Analysis
* Plot running time T(N) vs input size *N*.
* Plot as log-log plot, often get a straight line. lg(T(N)) vs lg(N). Plot tells you the exponent of *N*.
* Regression, power law: *a x Nb*
* Once you have the power b from the slope of the log-log plot, solve for *a* in the equation *T(N) = a x Nb*
## Doubling Hypothesis
* Run program, doubling the size of the input and observe ratios. Observe to what it converges, do not take the average!
| N | T(N) | Ratio | lg(Ratio) |
| ---- | ---- | ----- | -------- |
| 1000 | 0. 1 | - | - |
| 2000 | 0. 8 | 7. 7 | 2. 9 |
| 4000 | 6. 4 | 8 | 3 |
| ... | ... | ... | ... |
* Hypothesis: Running time is about *a x Nb*, where *b = lg(Ratio)*
* Caveat: Cannot identify logarithmic factors with the doubling hypothesis.
* Calculate *a* by solving *T(N) = a x Nb* for a with all other variables now known.
## Experimental algorithmic
* System independent effects (determines constant *a* and exponent *b* in power law)
+ Algorithm
+ Input data
* System dependent effects (contribute only to constant *a* in power law)
+ Hardware: CPU, memory, cache
+ Software compiler, interpreter, garbage collector
+ System: OS, network, other applications
## Mathematical Models
* Analyze individual operations to determine complexity
* Simplification 1
Count only the most expensive ones, i. e. those that take the most time or where time x frequency is highest.
* Simplification 2
Ignore lower order terms, e. g. in 5xN3 + 20N + 16, ignore the term with N
and the constant 16 (which is 16 x N0) because they are less significant in
comparison with the highest order term. We use *tilde notation* `~` to say that *5
x N3 + 20 x N + 16 __~ 5 x N3__*. Technical definition is that for *f(N) ~
G(N)* when *N* goes towards infinity, the lower order terms become so
insignificant that *f(N)/g(N) = 1*
## Order-of-Growth Classifications
- A great number of algorithms (most) are described by the following order of growth functions
+ 1 (constant)
+ log N (logarithmic)
+ N (linear)
+ N log N (linearithmic)
+ N2 (quadratic)
+ N3 (cubic)
+ 2N (exponential)
> Note: lgN means log2N

* We say the algorithm "is proportional to" e. g. constant time
---
# Properties
01. Reflexive: p is connected to q
02. Symmetric: if p is connected to q, then q is connected to p
03. Transitive: if p is connected to q and q is connected to r, then p is connected to r
# Dynamic connectivity
Applications based on this:
01. Pixels in a digital photo
02. Computers in network
03. Friends in a social network
04. Transistors in a computer chip
05. Elements in a mathematical set
06. Variables names in Fortran program
07. Metallic sites in a composite system
## Quick find (Eager approach)
* Data structure
* Integer array `id[]` of size N
* Interpretation: p and q are connected (iff) if and only if they have the same id
* Complexity
| Initialize | Union | Find |
|------------|-------|------|
| N | N | 1 |
* Defect
- Find too expensive (Could be N array accesses)
- If you have N union commands over N objects will be O(N2) quadratic
* Note: We can not accept Quadratic in big problems Quadratic algorithms do not scale
### Rough standards (for now)
* 109 operations per second
* 109 words of main memory
* Touch all words in approximately 1 second
### E. g Huge problem for quick find
* 109 union commands of 109 objects
* Quick find takes more than 1018 operations
* 30+ years of computer time!
## Quick Union
* Set the first element based on the root of the second element
* Complexity
| Initialize | Union | Find |
|------------|-------|------|
| N | N' | 1 |
* Defect
- Tree can get tall
- Find too expensive (Could be N array accesses)
## Quick Union improvement
01. Weighted Quick union
* Modify the Quick union to avoid tall trees
* Balance by linking root of smaller tree to root of larger tree
* Depth of any node x is at most `lg N`
* Complexity
| Initialize | Union | Find |
|------------|-------|------|
| N | lg N' | lg N |
02. Quick union with path compression `N + M lg N`
03. Weighted Quick union with path compression `N + M lg * N`
> Note: WQUPC reduce time from 30 years to 6 seconds
## Union find Applications
01. Percolation
02. Games (Go, Hex)
03. Dynamic connectivity
04. Last common ancestor
05. Hoshen-kopelman algorithm in physics
## Percolation
A model for many physical systems
* N-by-N grid of sites
* Each site is open probability p (or blocked with probability 1-p)
* System percolates iff top and bottom are connected by open sites
* Applications in real life
| Model | System | Vacant site | Occupied site | Percolates |
| -------| -------|-------------|-------------|----------|
| electricity | material| conductor | insulated | conducts |
| fluid flow | material| empty | blocked | porous |
| social interaction | population | person | empty | communicates |
---
# Data structures Design
* Good practice to make an abstraction between the outside world and internal implementation, In java we will use interface
- Benefits
+ Client can't know details of implementation
+ Implementation can't know details of client needs
+ Design: creates modular, reusable libraries
+ Performance: use optimized implementation where it matters
* Client: program using operations defined in interface
* Implementation: actual code implementing operations
* Interface: description of data type, basic operations
## Stack
* `LIFO` (last in first out), useful in many applications
* Operations: `push(Item item), pop(), size(), isEmpty()`
* There are two implementation of stack using `Linkedlist` and `Array`
* Stack removes the item most recently added
* What are the Differences between LinkedList and Array implementation?
- Linkedlist: Use extra space for dealing with links
- Array: resize/shrink the array takes some time
* When should I use Linkedlist or Array implementation?
- if time is important and don't want to lose any input i. e. dealing with internet packet use `Linkedlist` implementation, But if you take care of `memory` space use `Array` implementation
* How duplicate/shrinking array?
- `Duplicate` When reach 100% full the array resize`(arr.length * 2)`
- `Shrink` when reach one quarter full to the half `resize(arr.length / 2)`
* Stack applications
- Parsing in a compiler
- Java virtual machine
- Undo in word processor
- Back button in a web browser
- Implementation function calls in a compiler
- Arithmetic expression evaluation
- Reverse objects
* `LinkedList` implementation code
``` java
public class LinkedStack implements Stack {
private class NodeList {
Item item;
NodeList next;
public NodeList(Item item) {
this.item = item;
}
}
private NodeList first = null;
private int size = 0;
public void push(Item item) {
if (first == null) {
first = new NodeList(item);
first.next = null;
} else {
NodeList oldFirst = first;
first = new NodeList(item);
first.next = oldFirst;
}
size++;
}
public Item pop() {
if (isEmpty()) {
throw new NoSuchElementException("stack underflow");
}
Item item = first.item;
first = first.next;
size--;
return item;
}
}
```
---
## Queue
* `FIFO` (first in first out), useful in many applications
* Operations: `enqueue(Item item), dequeue(), size(), isEmpty()`
* Queue removes the item lest recently
* There are two implementation of stack using `Linkedlist` and `Array`
* Queue applications
- CPU scheduling
- Disk scheduling
- Data transfer asynchronously between two processes. Queue is used for synchronization.
- Breadth First search in a Graph
- Call Center phone systems
* `LinkedList` implementation code
``` java
public class LinkedQueue implements Queue {
private class NodeList {
Item item;
NodeList next;
public NodeList(Item item) {
this.item = item;
}
}
private NodeList first;
private NodeList last;
private int size;
public LinkedQueue() {
first = null;
last = null;
size = 0;
}
public void enqueue(Item item) {
NodeList oldLast = last;
last = new NodeList(item);
last.next = null;
if (isEmpty()) {
first = last;
} else {
oldLast.next = last;
}
size++;
}
public Item dequeue() {
if (isEmpty()) {
throw new NoSuchElementException("Queue underflow");
}
Item item = first.item;
first = first.next;
if (isEmpty()) {
last = null;
}
size--;
return item;
}
public Item peek() {
if (isEmpty()) {
throw new NoSuchElementException("Queue underflow");
}
return first.item;
}
}
```
---
# Elementary sorts
* Rearrange array of N times into ascending/descending order based on a key
- Selection sort
- Insertion sort
- Shell sort
- Heap sort
- Quick sort
* Implementation in java there are `Comparable` and `Comparator` interfaces we will them any of them in the implementation of the sort algorithms
to allow sort any generic data types
* There are three return values: 1, 0, -1 and throw Exception if incompatible types or null
+ V less than W (return -1)
+ V equal to W (return 0)
+ V greater than W (return 1)
* Total order
+ `Antisymmetric` : if v<=w and w<=v, then v=w
+ `Transitivity` : if v<=w and w<=x, then v<=x
+ `Totality` : either v<=w or w<=v or both
## Selection sort
* Scan from left to right
* Find the index of `min` of smallest remaining entry, then swap `a[i]` and `a[min]` `-→` `Time Complexity O(N2)` and doesn't sensitive if the input is sorted
`Code`
``` java
public static > void sort(Item[] arr) {
int N = arr.length;
int min;
for (int i = 0; i < N; i++) {
min = i;
for (int j = i + 1; j < N; j++) {
if (less(arr[j], arr[min])) {
min = j;
}
}
swap(arr, i, min);
}
}
```
## Insertion sort
* Scan from left to right
* Swap `a[i]` with each larger entry to its left ← `Time Complexity O(N2)` and has good performance over `partially sorted arrays`
* Fast when the array is partially sorted `O(N)`
* Array called partially sorted when number of elements to be changed less than or equal cN
`Code`
``` java
public static > void sort(Item[] arr) {
int N = arr.length;
for (int i = 1; i < N; i++) {
for (int j = i; j > 0 && less(arr[j], arr[j - 1]); j--) {
swap(arr, j, j - 1);
}
}
}
```
## Shell sort
* Move entries more than one position at a time by `h-sorting` the array What's the `h value` Knuth says `3x+1`
* Complexity N3/2
`Code`
``` java
public static > void sort(Item[] arr) {
int N = arr.length;
int h = 1;
while (h < N / 3)
h = 3 * h + 1;
while (h >= 1) {
for (int i = h; i < N; i++) {
for (int j = i; j >= h && less(arr[j], arr[j - h]); j -= h) {
swap(arr, j, j - h);
}
}
h /= 3;
}
}
```
* Why Shell sort uses insertion sort internally?
01. Fast unless array size is huge
02. Tiny used in some embedded systems
03. Hardware sort prototype
## Shuffle sort
* Generate a random real number for each array entry
* Sort array
* Knuth shuffle
- Pick integer r between 0 and i uniformly at random
- Swap `a[i]` and `a[r]`
- Complexity: `O(N)`
`Code`
``` java
public static > void shuffle(Item[] arr) {
int N = arr.length;
Random random = new Random();
for (int i = 0; i < N; i++) {
int r = random.nextInt(i + 1);
swap(arr, i, r);
}
}
```
## Applications in sorting
* Convex hull of a set of N points
- Is the smallest perimeter fence enclosing the points
- Equivalent definitions:
01. Smallest convex set containing all the points
02. Smallest area convex polygon enclosing the points
03. Convex polygon enclosing the points, whose vertices are points in set
- Convex hull output. Sequence of vertices in counterclockwise order
- Mechanical algorithm. Hammer nails perpendicular to plane, search elastic rubber band around points
- Convex hull application
01. Robot motion planning. Find shortest path in the plan from s to t that avoids a polygonal obstacle
+ Fact. Shortest path is either straight line from s to t or it is one of two polygonai chains of convex hull
02. Farthest pair problem. Given N points in the plane, find a pair, find a pair of points with the largest Euclidean distance between them
+ Fact. Farthest pair of points are extreme points on convex hull
- Convex hull: geometric properties
+ Fact. Can traverse the convex hull by making only counterclockwise turns
+ Fact. The vertices of convex hull appear in interesting order of polar angle with respect to point p with lowest y-coordinate
- Graham scan, based on above facts |
+ Choose point p with smallest y-coordinate
+ Sort points by polar angle with p
+ Consider points in order, discard unless if create a ccw turn
+ Q. How to find point p with smallest y-coordinate?
A. Define a total order, comparing by y-coordinate
+ Q. How to sort points by polar angle with respect to p?
A. Define a total order for each point p
+ Q. How to determine where p1 → p2 → p3 is counterclockwise turn?
A. Computational geometry
+ Q. How to sort efficiently?
A. Merge sort in `N lg N`
- Implement CCW
+ CCW. Given three points a, b and c, is a→b→c a counterclockwise turn?
A. Determinant (or cross product) gives 2x signed area of planer triangle
01. if signed area > 0, then a→b→c is counterclockwise
02. if signed area < 0, then a→b→c is clockwise
03. if signed area = 0, then a→b→c are colliner
+ Running time `N lg N` for sorting and linear for rest
## Merge sort
* This sort based on the technique of `divide-and-conquer`
* Java sort for objects
* Steps:
- Divide array into two halves
- Recursively sort each half
- Merge two halves
* Complexity `N lg N`
`Code`
``` java
public static > void sort(Item[] arr, Item[] aux, int lo, int hi) {
if (hi <= lo)
return;
int mid = lo + (hi - lo) / 2;
sort(arr, aux, lo, mid);
sort(arr, aux, mid + 1, hi);
merge(arr, aux, lo, mid, hi);
}
private static > void merge(Item[] arr, Item[] aux, int lo, int mid, int hi) {
for (int k = lo; k <= hi; k++)
aux[k] = arr[k];
int i = lo, j = mid + 1;
for (int k = lo; k <= hi; k++) {
if (i > mid)
arr[k] = aux[j++];
else if (j > hi)
arr[k] = aux[i++];
else if (less(aux[j], aux[i]))
arr[k] = aux[j++];
else
arr[k] = aux[i++];
}
}
```
* Mergesort improvement
- Stop if array is already sorted: !less(arr[mid+1], arr[mid])
- Cutoff to insertion sort = 7
- Eliminate-the-copy-to-the-auxiliary-array trick
* First draft of a Report on the EDVAC by John von Neuman
* Compare Running time between Insertionsort and Mergesort
- Laptop executes 108 compares/second
- Supercomputer executes 1012 compares/second
- In this below table will compare between normal computer in first row and supercomputer in second row
Insertionsort N2
| Million | Billion |
|-----------|-----------|
| 2. 8 hours | 317 years |
| 1 second | 1 week |
Mergesort `N lg N`
| Million | Billion |
|----------|---------|
| 1 second | 18 min |
| instant | instant |
> Note: Good algorithm are better than supercomputers
### Bottom-up version of Mergesort
* Basic plan
01. Pass through array, merging subarrays of size 1
02. Repeat for subarrays of size 2, 4, 8, 16, ...
03. Slower than Recursive by 10%
`Code`
``` java
public static > void sortBottomUp(Item[] arr) {
int N = arr.length;
Item[] aux = (Item[]) new Comparable[N];
for (int sz = 1; sz < N; sz = sz + sz)
for (int lo = 0; lo <= N - sz; lo += sz + sz)
merge(arr, aux, lo, lo + sz - 1, Math.min(lo + sz + sz - 1, N - 1));
}
```
## Sort Stability
* Suppose you want to sort `BY_NAME` then `BY_SECTION`
* You should sort and keep the equal elements that they came as input, don't change equal elements position
* Which sorts are stable?
+ Insertionsort
+ Mergesort
* Why Selectionsort and Shellsort not stable?
+ Selectionsort keeps keep pointer from past and might move an item some equal item
- Shellsort makes long distance exchanges
## Quick sort
* One of the most important algorithm in 20th century
* Java sort for primitive types
* Basic plan
01. Shuffle the array
02. Partition
03. Sort
* Complexity `N lg N`
* Faster than Mergesort
* In place algorithm
* Not stable
* Worst case in quicksort will not gonna happen
* Problems in quick sort
`Code`
``` java
public static > void sort(Item[] arr) {
KnuthShuffleSort.shuffle(arr);
sort(arr, 0, arr.length - 1);
}
private static > void sort(Item[] arr, int lo, int hi) {
if (hi <= lo)
return;
if (hi <= lo + CUTOFF) {
InsertionSort.sort(arr, lo, hi);
return;
}
int j = partition(arr, lo, hi);
sort(arr, lo, j - 1);
sort(arr, j + 1, hi);
}
private static > int partition(Item[] arr, int lo, int hi) {
int i = lo, j = hi + 1;
while (true) {
while (less(arr[++i], arr[lo]))
if (i == hi)
break;
while (less(arr[lo], arr[--j]))
;
if (i >= j)
break;
swap(arr, i, j);
}
swap(arr, lo, j);
return j;
}
```
* Improvement
01. Insertion sort for small subarrays
- Cut OFF ~ 10 items
02. Median of sample
- Best choice of pivot item = median
- Median-of-3 random items
## Selection
* Goal. Given an array of N items, find the kth largest
- Min(k=0), max(k=N-1), median(k=N/2)
* Applications
- Order statistics
- Find the top k
* Use theory as a guide
- Easy `N lg N` upper bound. How? By sorting the array and loop util reach kth
- Easy `N` upper bound for k = 1,2,3. How?
- Easy `N` lower bound. Why?
* Which is true?
- `N lg N` lower bound? ← Is selection as hard as sorting?
- `N` upper bound? ← Is there a linear algorithm for each k?
* Quick-select
- Version of Quick-sort
- Entry a[j] is in
- No larger entry to the left of j
- No smaller entry to the right of j
- Repeat in one sub-array, depending on j; finished when j equals k
- Analysis: Linear time on average
> Remark. Quick-select uses ~ 1/2N2 compares in the worst case, but (as with quicksort) the random shuffle provides a probabilistic guarantee
- Algorithm
``` java
public static > Comparable select(Item[] arr, int k) {
KnuthShuffleSort.shuffle(arr);
int lo = 0, hi = arr.length - 1;
while (hi > lo) {
int j = partition(arr, lo, hi);
if (j < k)
lo = j + 1;
else if (j > k)
hi = j - 1;
else
return arr[k];
}
return arr[k];
}
```
> Remark. But, constants are too high => not used in practice
* Use theory as a guide
- Still in worthwhile to seek practical linear time (worst-case) algorithm
- Until one is discovered, use quick-select if you don't need a full sort
## Duplicate keys
* Often, purpose of sort is to bring items with equal keys together
- Sort population by age
- Find collinear points
- Remove duplicates from mailing list
- Sort job applicants by college attended
* Typical characteristics of such applications
- Huge array
- Small number of key values
* Mergesort with duplicate keys. always between 1/2 N lg N and N lg N compares
* Quicksort with duplicate keys.
- Algorithms goes quadratic unless partition stop on equal keys
- 1990s C user found this defect in qsort()
- Mistake. Put all items equals to the partitioning item in one side
+ Consequence. ~1/2N2 compares when all keys equal
B A A B A B B B C C C A A A A A A A A A A `A`
- Recommended. Stp scan on item equals to the partitioning item
+ Consequence. ~ N lg N compares when all keys equal
B A A B A B B B C C C A A A A A `A` A A A A A
- Desirable. Put all items equal to the partitioning item in place
A A A `B B B B B` C C C `A A A A A A A A A A A`
* 3-way partitioning
- Goal. Partition array into 3 parts so that:
01. Entries between lt and gt equal to partition item v
02. No larger entries to left of lt
03. No smaller entries to right of gt
- Dutch national flag problem. [Edsger Dijkstra]
+ Conventional wisdom until mid 1990s: not worth doing
+ New approach discovered when fixing mistake in C library qsort()
+ Now incorporated into qsort() and Java system sort
- Steps
01. Let v be partitioning item a[lo]
02. Scan i from left to right
. (a[i] < v): exchange a[lt] with a[i]; increment both it and i
. (a[i] > v): exchange a[gt] with a[i]; decrement gt
. (a[i] == v): increment i
`Code`
``` java
private static > void sort(Item[] arr, int lo, int hi) {
if (hi <= lo)
return;
int lt = lo, gt = hi;
Item v = arr[lo];
int i = lo;
while (i <= gt) {
int cmp = arr[i].compareTo(v);
if (cmp < 0)
swap(arr, lt++, i++);
else if (cmp > 0)
swap(arr, i, gt--);
else
i++;
}
sort(arr, lo, lt - 1);
sort(arr, gt + 1, hi);
}
```
* Proposition. [Sedgewick-Bentley, 1997]
- Quicksort with 3-way partition is entropy-optimal
* Bottom line. Randomized quicksort with 3-way partitioning reduces running time from linearithmic to linear in broad class of application
## System sorts
* Sort applications
- obvious applications
+ Sort a list of names
+ Organize an MP3 library
+ Display Google PageRank results
+ List Rss feed in reverse chronological order
- Problems became easy once items are in sorted order
+ Find the median
+ Binary search in a database
+ Identify statistical outliers
+ Find duplicates in a mailing list
- Non-obvious applications
+ Data compression
+ Computer graphics
+ Computational biology
+ Load balancing on a parallel computers
* Java System sort
- Arrays.sort()
- Has different method for each primitive type
- Has a method for data types that implement Comparable
- Has a method that uses a Comparator
- Uses tuned quicksort for primitive types; tuned mergesort for objects
## System sort: Which algorithm to use?
* Many sorting algorithms to choose from
- Internal sorts
+ Insertion sort, selection sort, bubblesort, shaker sort
+ Quicksort, mergesort, heapsort, samplesort, shellsort
+ Solitaire sort, red-black sort, splaysort, Yaroslavskiy sort, psort
- External sorts. Poly-phase mergesort, cascade-merge, oscallating sort
- String/radix sorts. Distribution, MSD, LSD, 3-way string quicksort
- Parallel sorts
+ Bitonic sort, Batcher even-odd sort
+ Smooth sort, cube sort, column sort
+ GPUsort
* Applications have diverse attributes
- Stable?
- Parallel?
- Deterministic?
- Key all distinct?
- Multiple key types?
- Linked list or arrays?
- Large or small items?
- Is your array randomly ordered?
- Need guaranteed performance?
* Elementary sort may be method of choice for some combination
- Cannot cover all combinations of attributes
* Q. Is the system sort good enough?
A. Usually
---
## Sort complexity
|Name|In-place|Stable|Best |Average |Worst|Remarks|
|----|-------|------|------|---------|-----|-------|
|Selectionsort|Yes|No|1/2 N2|1/2 N2|1/2 N2|N exchanges|
|Insertionsort|Yes|Yes|N|l/4N2|l/2N2|use for small N or partially ordered|
|Shellsort|Yes|No|N log3N|?|cN3/2|tight code, sub-quadratic|
|Mergesort|No|Yes|½ N lg N|N lg N|N lg N|N lg N guarantee, stable|
|Quicksort|Yes|No|N|N lg N|½ N2|N lg N probabilistic guarantee fastest in practice|
|3-ways Quicksort|Yes|No|N2/2|2 N ln N|½ N2|improves quicksort in presence of duplicate keys|
|Timesort|No|Yes|N|N lg N|N lg N|-|
---
# Priority Queues
* A collection is a data types that store group of items
|data type|key operations|data structures|
|---------|--------------|---------------|
|stack|Push, Pop|linked list, resizing array|
|queue|Enqueue, Dequeue|linked list, resizing array|
|priority queue|Insert, Del-Max/Min|binary heap|
|symbol table|Put, Get, Delete|BST, hash table|
|set|Add, Contains, Delete|BST, hash table|
## Priority Queue (PQ)
* Collections. Insert and delete items. which item to delete?
- Stack. Remove the item most recently added
- Queue. Remove the item least recently added
- Randomized queue. Remove a random item
- Priority Queue. Remove the `largest` or `smallest` item
* Operations: `insert(Item item), delMax(), isEmpty(), max(), size()`
* Applications
- Event-driven simulation: [customers in a line, colliding particles]
- Numerical computation: [reducing roundoff error]
- Data compression: [Huffman codes]
- Graph searching: [Dijkstra's algorithm, Prim's algorithms]
- Number theory: [sum of powers]
- Artificial intelligence: [A* search]
- Statistics: [maintain largest M values in a sequence]
- Operating systems: [load balancing, interrupt handling]
- Discrete optimization: [bin packing, scheduling]
- Spam filtering: [Bayesian spam filters]
* Implementation: you can make multiple implementation using unordered and ordered array, and Binary heap
* Complexity of unordered and ordered array
|Implementation|Insert|Del max|Max|
|--------------|------|-------|---|
|unordered array|1|N|N|
|ordered array|N|1|1|
|goal|lg N|lg N|lg N|
---
## Binary heaps
* Binary tree, Empty or node with links to left and right binary trees
* Complete tree, perfectly balanced, except for bottom level
* Complexity: `lg N`
* Height of complete tree with N nodes is `lg N` , why?
Height only increase when N is a power of 2
* A complete binary tree in nature

* Binary heap representations
- Binary heap, Array representation of a heap-ordered complete binary tree.
- Heap-ordered binary tree
+ Keys in nodes
+ Parent's key no smaller than children's keys
- Array representation
+ Indices start at 1
+ Take nodes in level order
+ No explicit links needed!
* Binary heap properties
- Parent of node at k is at `k/2`
- Children of node at k are at `2k` and right `2k+1`
* Best practice use immutable keys
* Underflow and overflow
- Underflow: throw exception if deleting from empty PQ
- Overflow: add no-arg constructor and use resize array
* To avoid maxing the operating like `delMax(), delMin()` there are two implementations `MaxPQ` and `MinPQ`
- The difference in comparing `less()`, `greater()`
* Good practice to use immutable keys, to prevent the client to change the key while processing the PQ
- Advantages
+ Simplifies debugging
+ Safer in presence of hostile code
+ Simplifies concurrent programming
+ Safe to use as key in priority queue or symbol table
- Disadvantages
+ Must create new object for each data type value
`Code`
``` java
public void insert(Item key) {
pq[++N] = key;
swim(N);
}
public Item delMax() {
if (isEmpty())
throw new IndexOutOfBoundsException();
Item max = pq[1];
swap(1, N);
N--; // decrement N here, will be used in sink
sink(1);
pq[N + 1] = null; // prevent loitering, object no longer needed
return max;
}
public Item max() {
return pq[1];
}
private void swim(int k) {
while (k > 1 && less(k / 2, k)) { // k less than parent, swap
swap(k / 2, k);
k /= 2;
}
}
private void sink(int k) {
while (2 * k <= N) {
int j = 2 * k;
if (j < N && less(j, j + 1))
j++;
if (!less(k, j)) // K less than children then break
break;
swap(k, j);
k = j;
}
}
```
---
## Heapsort
* Basic plan
- Create max heap with all N keys
- Repeatedly remove the maximum key
* Steps
- First pass
+ Build heap using bottom-up method
+ arrange from max parent to min child
``` java
for (int k = N/2; k >= 1; k--) sink(arr, k, N);
```
+ we will loop from N/2 to 1 because in sink operation
- Second pass
+ Remove the maximum, one at a time
+ Leave in array, instead of nulling out
+ replace from min to max in the array
``` java
while (N > 1) {
swap(arr, 1, N--);
sink(arr, 1, N);
}
```
* In place sorting algorithm with `N lg N` worst-case
* Proposition. Heap construction uses <= 2 N compares and exchanges
* Proposition. Heapsort uses <= 2 N lg N compares and exchanges
* Significance. In-place sorting algorithm with N lg N worst-case
- Mergesort: no, linear extra space
- Quicksort: no, quadratic time in worst-case
- Heapsort: yes
* Bottom line: heapsort is optimal for both time and space, but:
- Inner loop longer than quicksort's
- Make poor usage of cache memory
- Not stable
`Code`
``` java
public final static > void sort(Item[] pq) {
int n = pq.length;
// heapify phase
for (int k = n / 2; k >= 1; k--) {
sink(pq, k, n);
}
// sortdown phase
int k = n;
while (k > 1) {
swap(pq, 1, k--);
sink(pq, 1, k);
}
}
// Get the largest and put as a parent
private static > void sink(Item[] pq, int k, int n) {
while (2 * k <= n) {
int j = 2 * k;
if (j < n && less(pq, j, j + 1))
j++;
if (!less(pq, k, j)) // K greater or equal than children then break
break;
swap(pq, k, j);
k = j;
}
}
```
* Sorting algorithms complexity till heapsort
|Name|In-place|Stable|Best |Average |Worst|Remarks|
|----|-------|------|------|---------|-----|-------|
|Selectionsort|Yes|No|1/2 N2|1/2 N2|1/2 N2|N exchanges|
|Insertionsort|Yes|Yes|N|l/4N2|l/2N2|use for small N or partially ordered|
|Shellsort|Yes|No|N log3N|?|cN3/2|tight code, sub-quadratic|
|Mergesort|No|Yes|½ N lg N|N lg N|N lg N|N lg N guarantee, stable|
|Quicksort|Yes|No|N|N lg N|½ N2|N lg N probabilistic guarantee fastest in practice|
|3-ways Quicksort|Yes|No|N2/2|2 N ln N|½ N2|improves quicksort in presence of duplicate keys|
|Timesort|No|Yes|N|N lg N|N lg N|-|
|Heapsort|Yes|No|N|2 N lg N|2 N lg N|N lg N|N lg N guarantee, in-place|
|???|Yes|Yes|N lg N|N lg N|N lg N|holy sorting grail|
---
## Event driven simulation
* Goal. Simulate the motion of N moving particles that behave according to the laws of elastic collision
* Idea goes back to Einstein
* Application based on priority queue, without PQ you can not do it with large number of particles because would required quadratic time and not affordable
* Hard disc model
- Moving particles interact via elastic collisions with each other and walls
- Each particle is a disc with know position, velocity, mass and radius
- No other forces
* Significance. Relates macroscopic observables to microscopic dynamics
- Maxwell-Boltzmann: distribution of speeds as function of temperature
- Einstein: explain Brownian motion of pollen grains
* Time-driven simulation
- Update the position of each particle every after dt units of time, and check for overlaps
- If overlap, roll back the clock to the time of the collision, update the velocities of the colliding particles, and continue the simulation
- Main drawbacks
+ ~ N2/2 overlap checks per time quantum
+ Simulation is too slow if dt is very small
+ May miss collisions if dt is too large
* Event-driven simulation
- Between collisions, particle move in straight-line trajectories
- Change sate only when something happens
- Focus on times when collision occur
- Maintain PQ of collision events, prioritized by time
- Remove the min = get next collision
- Collision prediction. Given position, velocity, and radius of a particle, when will it collide next a wall or another particle?
- Collision resolution. if collision occurs, update colliding particle(s) according to law of elastic collisions
- Steps
+ Initialization
01. Fill PQ with all potential particle-wall collisions
02. Fill PQ with all potential particle-particle collisions
---
# Symbol tables
## API
* Key-value pair abstraction
- Insert a value with specified key
- Give a key, search for the corresponding value
* Ex. DNS lookup
- Insert URL with specified IP address
- Give URL, find corresponding IP address
* Applications
01. Dictionary
02. Book index
03. File share
04. Compiler
05. Routing table
06. DNS
07. Genomics
08. File system
09. Web search
* Operations:
```
put(Key key, Value val)
get(Key key)
delete(Key key)
contains(Key key)
isEmpty()
size()
keys()
```
* Conventions:
- Values are not null
- Method get() returns null if key not present
- Method put() overwrites old values with new value
* Key type: several natural assumption
- Assume keys are Comparable, use compareTo()
- Assume keys are any generic type, use equals() to test equality
- Assume keys are any generic type, use equals() to test equality, use hashCode() to scramble key
* Best practices: Use immutable types for symbol data keys
- Immutable in Java: String, Integer, Double, java.io. File, ...
- Mutable in Java:: StringBuilder, java.net. URL, Arrays, ...
* Equality test
- All Java classes inherit a method equals()
- Java requirements. For any references x, y and z
+ Reflexive: x.equals(x) is true
+ Symmetric: x.equals(y), iff y.equals(x)
+ Transitive: if x.equals(y) and y.equals(z), then x.equals(z)
+ Non-null: x.equals(null) is false
* Equals design
- Standard recipe for user-defined type
+ Optimization for reference equality
+ Check again null
+ Check that two objects are of the same type and cast
+ Compare each significant field
01. if field is a primitive type, use ==
02. if field is an object, use equals()()
03. if field is any array, apply to each entry
- Best practices
+ No need to use calculated fields that depends on other fields
+ Compare fields mostly likely to differ first
+ Make compareTo() consistent with equals() `x.equals(y), if and only if (x.compareTo(y) == 0)`
## Elementary implementations
* Sequential search (unordered list)
- Data structure: Maintain an (unordered) linked list of key-value pairs
- Node contains key and value
- Search: Scan through all keys until find a match
- Insert: Scan through all keys until find a match, if no match add to front
- summary
|worst-case search|worst-case insert|average-case search|average-case insert|ordered|key interface|
|-----------------|-----------------|-------------------|-------------------|-------|-------------|
|N|N|N / 2|N|no|equals()|
* Binary search in an order array
- Data structure: Maintain an ordered array of key-value pairs
- Rank helper function. how many keys < k?
- summary
|worst-case search|worst-case insert|average-case search|average-case insert|ordered|key interface|
|-----------------|-----------------|-------------------|-------------------|-------|-------------|
|lg N|N|lg N|N / 2|yes|compareTo()|
* BST Binary search tree
## Ordered operations
``` java
void put(Key key, Value value)
Value get(Key key)
void delete(Key key)
boolean contains(Key key)
boolean isEmpty()
int size()
Key min()
Key max()
Key floor(Key key) // largest key less than or equal to key
Key ceiling(Key key) // smallest key greater than or equal to key
int rank(Key key) // number of key less than key
Key select(int k) // Key of rank k
void deleteMin() // delete smallest key
void deleteMax() // delete largest key
int size(Key lo, Key hi)
Iterable keys(Key lo, Key hi)
Iterable keys()
```
## Binary search trees
* Classic data structure provides efficient implementations of ST algorithm
* BST is a binary tree in symmetric order
* A binary tree is either
- Empty
- Two disjoint binary tress(left and right)
* Symmetric order each node has a key, and every node's key is:
- Larger than all keys in its left subtree
- Smaller than all keys in its right subtree
* Representation in java using a Node has key, value and reference to left and right
`Code`
``` java
public void put(Key key, Value val) {
root = put(root, key, val);
size++;
}
private Node put(Node node, Key key, Value val) {
if (node == null)
return new Node(key, val);
int cmp = key.compareTo(node.key);
if (cmp < 0)
node.left = put(node.left, key, val);
else if (cmp > 0)
node.right = put(node.right, key, val);
else
node.val = val;
return node;
}
public Value get(Key key) {
Node node = root;
while (node != null) {
int cmp = key.compareTo(node.key);
if (cmp < 0)
node = node.left;
else if (cmp > 0)
node = node.right;
else
return node.val;
}
return null;
}
```
* Tree shape
- Many BSTs correspond to same set of keys
- Number of compares for search/insert is equal to 1 + depth of node
- Remark. Tree shape depends on order of insertion
- Worst case will be in natural order
* Mathematical analysis
- Proposition. If N distinct keys are inserted into a BST in random order, the expected number of compares for a search/insert is ~ 2 ln N
- Pf. 1-1 correspondence with quicksort partitioning
- But worst case height is N
|worst-case search|worst-case insert|average-case search|average-case insert|ordered|key interface|
|-----------------|-----------------|-------------------|-------------------|-------|-------------|
|N|N|1.39 lg N|1.39 lg N|stay tunned|compareTo()|
## Ordered ST operations
* Minimum and maximum
- Minimum. Smallest key in table
- Maximum. Largest key in table
- Q. How to find min/max in BST?
+ For min move to the left from the root until find null key
+ For max move to the right from the root until find null key
* Floor and ceiling
- Floor. Largest key <= to given key
- Ceiling. Smallest key >= to given key
* Computing the floor
- Case 1. [k equals the key at root]
+ The floor of k is k
- Case 2. [k is less than the key at root]
+ The floor of k in the left subtree
- Case 3. [k is greater than the key at root]
+ The floor of k is in the right subtree
+ if there is any key <= k in right subtree otherwise it is the key in the root
``` java
public Key floor(Key key) {
Node node = floor(root, key);
if (node == null) return null;
return node.key;
}
private Node floor(Node node, Key key) {
if (node == null) return null;
int cmp = key.compareTo(node.key);
if (cmp == 0) return node;
if (cmp < 0) return floor(node.left, key);
Node n = floor(node.right, key);
if (n != null) return n;
else return node;
}
```
* Ceiling
``` java
public Key ceiling(Key key) {
Node node = ceiling(root, key);
if (node == null)
return null;
return node.key;
}
private Node ceiling(Node node, Key key) {
if (node == null)
return null;
int cmp = key.compareTo(node.key);
if (cmp == 0)
return node;
if (cmp < 0) {
Node n = ceiling(node.left, key);
if (n != null)
return n;
else
return n;
}
return ceiling(node.right, key);
}
```
* Rank
* Rank. How many keys < k?
- Easy recursive algorithm (3 cases!)
``` java
public int rank(Key key) {
return rank(key, root);
}
private int rank(Key key, Node node) {
if (node == null) return 0;
int cmp = key.compareTo(node.key);
if (cmp < 0) return rank(key, node.left);
else if (cmp > 0) return 1 + size(node.left) + rank(key, node.right);
else return size(node.left);
}
```
* Inorder traversal
- Traverse left subtree
- Enqueue key
- Traverse right substree
``` java
public Iterable keys() {
Queue q = new Queue<>();
inorder(root, q);
return q;
}
private void inorder(Node node, Queue q) {
if (node == null) return;
inorder(node.left, q);
q.enqueue(node.key);
inoder(node.right, q);
}
```
* Property. inorder traversal of a BST yields keys in ascending order
## Deletion in BST
* To remove a node with a given key
- Set its value to null
- Leave key in tree to guide searches (but don't consider it equal in search)
- Cost ~ 2 ln N' per insert, search, and delete (if keys in random order)
- Unsatisfied solution. Tombstone (memory) overload
* To delete the minimum
- Go left until finding a node with null left link
- Replace that node by its right link
- Update subtree counts
``` java
public void deleteMin() {
root = deleteMin(root);
}
private Node deleteMin(Node node) {
if (node.left == null) return node.right;
node.left = deleteMin(node.left);
node.count = 1 + size(node.left) + size(node.right);
return node;
}
```
* Hibbard deletion
- To delete a node with key k: search for node n containing key k
+ Case 0. Delete n by setting parent link to null
+ Case 1. Delete n by replacing parent link
+ Case 2. [2 children]
01. Find successor x of n
02. Delete the minimum in n's right subtree
03. Put x in n's spot
``` java
public void delete(Key key) {
size--;
root = delete(root, key);
}
private Node delete(Node node, Key key) {
if (node == null)
return null;
int cmp = key.compareTo(node.key);
if (cmp < 0)
node.left = delete(node.left, key);
else if (cmp > 0)
node.right = delete(node.right, key);
else {
if (node.right == null)
return node.left;
if (node.left == null)
return node.right;
Node t = node;
node = min(t.right);
node.right = deleteMin(t.right);
node.left = t.left;
}
node.count = size(node.left) + size(node.right) + 1;
return node;
}
```
---
## Balanced search tree
### 2-3 search trees
* Allow 1 or 2 keys per node
- 2-node: one key, two children
- 3-node: two keys, three children
* Perfect balance: every path from the root to null link has same length
* Implementation: Red-black BSTs
### Red-black BSTs
* Represent 2-3 tree as a BST (binary search tree)
* Left-leaning red-black BSTs (Guibas-Sedgewick 1979 and 2007)
- Use "internal" left-learning links as "glue" for 3-nodes
- A BST such that:
+ No node has two red links connected to it
+ Every path from root to the null link has the same number of black links
+ Red links lean left
### B-Tree
* TODO this part
---
## Geometric applications of BSTs
* Intersections among geometric objects
* Applications. CAD, games, movies, virtual reality, database, ...
* Efficient solutions. Binary search trees (and extensions)
### 1 d range search
* Extensions of ordered symbol table
- Same operations or ordered symbol table
- Range search: find all keys between k1 and k2
- Range count: number of keys between k1 and k2
* Application. Database queries
- i.e salary between val-1 AND val-2
* Geometric interpretation
- Keys are point of a line
- Find/count points in a given 1 d interval
* Implementations
- Unordered array. Fast insert, slow range search
- Order array. Slow insert, binary search for k1 and k2 to do range search
- 1 d range count
### line segment intersection
* TODO this part
### kd trees
* TODO this part
### interval search trees
* Instead of points the data is interval
* 1 d interval search. Data structure to hold set of (overlapping) intervals
* Create BST, where each node stores an interval (lo, hi)
- Use left endpoints as BST key
- Store max endpoint in subtree rooted at node
* TODO this part
### rectangle intersection
* TODO this part
---
## Hash tables
* Basic plan:
- Save items in a key-indexed table (index is a function of the key)
- Hash function: Method for computing array index from key
* Issues:
- Computing the has function
- Equality test: Method for checking whether two keys are equal
- Collision resolution: Algorithm and data structure to handle two keys that hash to the same array index
* Classic space-time tradeoff
- No space limitation: trivial hash function with key as index
- No time limitation: trivial collision resolution with sequential search
- Space and time limitations: hashing (the real world)
### Hash functions
* Efficiently computable
* Each table index equally likely for each key
* Ex 1. Phone numbers:
- Bad: first three digits
- Better: last three digits
* Ex 2. Social security numbers:
- Bad: first three digits
- Better: last three digits
* Practical challenge, need different approach for each key type
* Java's hash code conventions
- All java classes inherit a method `hashCode()` , with return 32-bit in
- Requirement, if `x.equals(y), then (x.hashCode() == y.hashCode())`
- Highly desirable: if `!x.equals(y), then (x.hashCode() != y.hashCode())`
- Default implementation. Memory address of x
- Legal (but poor) implementation. Always return 17
- Customized implementations. Integer, Double, String, File, URL, Date, ...
``` tree
x
|
-----
| |
-----
|
x.hashCode()
```
* Implementing hash code: strings
- Cache the hash value in an instance variable
- Return cached value
``` java
public final class String {
private int hash = 0; // cache of hash code
private final char[] s;
public int hashCode() {
int h = hash;
if (h != 0) return h; // returned cached value
for (int i = 0; i < length(); i++) {
h = s[i] + (31 * h);
}
hash = h; // store cache of hash code
return h;
}
}
```
* Implementing hash code: user-defined types
``` java
public final class Transaction implements Comparable {
private final String who;
private final Date when;
private final double amount;
public int hashCode() {
int hash = 17; // nonzero constant
// typical a small prime 31
// get all fields in the class
hash = 31 * hash + who.hashCode();
hash = 31 * hash + when.hashCode();
hash = 31 * hash + ((Double) amount).hashCode();
return hash;
}
}
```
* Hash code design
- Standard recipe for user-defined types
+ Combine each significant field using the `31x+y` rule
+ If field is a primitive type, use wrapper type `hashCode()`
+ If field is null, return 0
+ If field is a reference type, use `hashCode()` ← applies rule recursively
+ If field is an array, apply to each entry ← or use `Arrays.deepHashCode()`
- In practice. Recipe works reasonably well, used in Java libraries
- In theory. Keys are bit string: universal hash functions exits
- Basic rule. Need to use the whole key to compute hash code, consult an expert for state-of-the-art hash codes
* Modular hashing
- Hash code. An int between -231 and 231-1
- Hash function. An int between 0 and M-1 (for uses as array index)
+ M typically a prime or power of 2
- Why choose a prime for M?
+ We will use all the bits in the number in that point
There is a problem in below code for detect the negative values
``` java
// Bug
private int hash(Key key) {
return key.hashCode() % M;
}
```
There is another problem in below code which is will hit
-231
``` java
// 1-in-a-billion bug
private int hash(Key key) {
return Math.abs(key.hashCode()) % M;
}
```
The correct implementation, this is a template for hashCode to get number between 0 and M - 1
``` java
private int hash(Key key) {
return (key.hashCode() & 0x7fffffff) % M;
}
```
* Uniform hashing assumption
- Uniform hashing assumption. Each key is equally likely to hash to an integer between 0 and M - 1
- Bins and balls. Throw balls uniformly at random into M bins
- Classically studies in statistics
+ Birthday problem. Expect two balls in the same bin after `~ sqrt(PI M /2)` tosses
+ Coupon collector. Expect every bin hash >= 1 ball after `~ M ln M` tosses
+ Load balancing. After M tosses, expect most loaded bin hash `Omega (log M / Log Log M)` balls
### Separate chaining (Collision resolution)
* Collision, Two distinct keys hashing to the same index
- Birthday problem → can't avoid collisions unless you have a ridiculous quadratic amount of memory
- Coupon collector + load balancing → collisions will be evenly distributed
- Challenge. Deal with collisions efficiently
* Separate chaining symbol table
- Use an array of M < N linked lists [H.P. Luhn, IBM 1953]
+ Hash: map key to integer i between 0 and M-1
+ Insert: put at front of ith chain (if not already there)
+ Search: need to search only ith chain
``` java
public class SeparateChainingHashST {
private int M = 97;
private Node[] st = new Node[M];
private static class Node {
private Key key;
private Value val;
private Node next;
}
private int hash(Key key) {
return (key.hashCode() & 0xfffffff) % M;
}
public Value get(Key key) {
int i = hash(key);
for (Node node = st[i]; node != null; node = node.next) {
if (key.equals(node.key)) return node.val;
}
return null;
}
public void put(Key key, Value val) {
int i = hash(key);
for (Node node = st[i]; node != null; node = node.next) {
if (key.equals(node.key)) {
node.val = val;
return;
}
}
st[i] = new Node(key, val, st[i]); // put the new node at the beginning of the LinkedList
}
}
```
* Consequence. Number of probes for search/insert is proportional to N/M
- M too large ==> too many empty chains
- M too small ==> chains too long
- Typical choice `M ~ N/5` ==> constant time ops
### Linear probing
* Open addressing. When a new key collides, find the next empty slot and put it there
* Hash. Map key to integer i between 0 and M-1
* Insert. Put at table index i if free, if not try i+1, i+2, etc
* Search. Search table index i, if occupied but no match, try i+1, i+2, etc
> Note: Array size M must be greater than number of key-value pairs N
``` java
public class LinearProbingHashST {
private int M = 30001;
private Value[] vals = (Value[]) new Object[M];
private Key[] keys = (Key[]) new Object[M];
private int hash(Key key) {
return (key.hashCode() & 0xfffffff) % M;
}
public void put(Key key, Value val) {
int i;
for (i = hash(key); keys[i] != null; i = (i + 1) % M) {
if (keys[i].equals(key)) break;
}
keys[i] = key;
vals[i] = val;
}
public Value get(Key key) {
for (int i = hash(key); keys[i] != null; i = (i + 1) % M) {
if (key.equals(keys[i])) return vals[i];
}
return null;
}
}
```
* Clustering
- Cluster. A contiguous block of items
- Observation. New keys likely to hash into middle of big clusters
* Knuth's parking problem
- Model. Cars arrive at one-way street with M parking spaces, each desires a random space i: if space i is taken, try i + 1, i + 2, etc
- Q. What is mean displacement of a car?
+ Half-full. With M / 2 cars, mean displacement is ~ 3 / 2
+ Full. With M cars, mean displacement is ~ sqrt(PI * M / 8)
### ST implementations: summary
|ST implementation|Worst-case search| Worst-case insert|Worst-case delete|Ordered iteration|key interface|
|-----------------|-----------------|------------------|-----------------|-----------------|-------------|
|linked list|N|N|N|no|equals()|
|binary search (ordered array)|log N|N|N|yes|compareTo()|
|BST|N|N|N|stay tuned|compareTo()|
|2-3 tree|c lg N|c lg N|?|yes|compareTo()|
|red-black tree|2 lg N|2 lg N|2 lg N|yes|compareTo()|
|separate chaining|lg N*|lg N*|lg N*|no|equals()|
|linear probing|lg N*|lg N*|lg N*|no|equals()|
> Note: * under uniform hashing assumption
### Context
* War storey: String hashing in Java
- String hashCode() in Java 1.1
+ For long strings: only examine 8-9 evenly spaced characters
+ Benefit: save time in performing arithmetic
+ Downside: great potential for bad collision patterns
``` java
public int hashCode() {
int hash = 0;
int skip = Math.max(1, length() / 8);
for (int i = 0; i < length(); i += skip) {
hash = st[i] + (37 * hash);
}
return hash;
}
```
- Could end with examine the same spaced character, so you have to examine the all string
* Q. Is the uniform hashing assumption important in practice?
- A. Obvious situations: aircraft control, nuclear reactor, pacemaker
- A. Surprising situations. denial-of-service attacks
* Algorithmic complexity attach on Java
- Goal. Find family of strings with the same hash code
- Solution. the base 31 hash code is part of Java's string API
* Diversion: one-way hash functions
- One-way hash function. "Hard" to find a key that will hash to a desired value (or two keys that hash to the same value)
+ Ex. MD4, MD5, SHA-0, SHA-1, SHA-2, WHIRLPOOL, RIPEMD-160, etc
+ Applications. Digital fingerprint, message digest, storing passwords
+ Caveat. Too expensive for use in ST implementations
* Separate chaining vs. Linear probing
- Separate chaining
+ Easier to implement delete
+ Performance degrades gracefully
+ Clustering less sensitive to poorly-designed hash function, if you have a bad function
- Linear probing
+ Less wasted space
+ Better cache performance
- Q. How to delete?
- Q. How to resize?
* Hashing: variations on the theme
- Many improved versions have been studied.
- Two-probe hashing (separate-chaining variant)
+ Hash to two positions, insert key in shorter of two chains
+ Reduces expected length of the longest chain to log log N
- Double hashing (linear-probing variant)
+ Use linear probing, but skip a variable amount, not just 1 each time
+ Effectively eliminates clustering
+ Can allow table to become nearly full
+ More difficult to implement delete
- Cuckoo hashing (linear-probing variant)
+ Hash key to two positions, insert key either position, if occupied reinsert displaced key into its alternative position (and recur)
_ Constant worst case time for search
* Applications
- Security One way hash function: MD4, MD5, SHA-0, SHA-1, SHA-2, WHIRLPOOL, RIPEMD-160, ...
+ Digital fingerprint
+ Message digest
+ Storing passwords
- Dictionary lookup
+ DNS lookup
+ Amino acids
+ Class list
* Hash tables vs.balanced search trees
- Hash tables
+ Simpler to code
+ No effective alternative for unordered keys
+ Faster for simple keys (a few arithmetic ops versus log N compares)
+ Better system support in Java for strings (e.g. cached hash code)
- Balanced search trees
+ Stronger performance guarantee
+ Support for ordered ST operations
+ Easier to implement `compareTo()` correctly than `equals()` and `hashCode()`
- Java system includes both:
+ Red-black BSTs: java.util. TreeMap, java.util. TreeSet
+ Hash tables: java.util. HashMap, java.util. IdentityHashMap
### Applications
* Sets
- Mathematical set: a collection of `distinct` keys
- Operations: `add(Key key), contains(Key key), remove(Key key), size(), iterator()`
* Dictionary lookup
- Command-line arguments
+ A comma-separated value (CVS) file
+ Key field
+ Value field
- Ex 1. DNS lookup
- Ex 2. Amino acids
- Ex 3. Class list
* File indexing
- Goal. Index a PC (or the web), given a list of files specified, create an index so that you can efficiently find all files contains a given query string
- Book index
+ Goal. preprocess a text corpus to support concordance queries: given a word, find all occurrences with their immediate contexts
* Sparse vectors
- Matrix-vector multiplication
- Problem. sparse matrix-vector multiplication
- Assumptions. Matrix dimension is 10, 000 average nonzeros per row ~ 10
- Vector representations
+ 1D array (standard) representation
. Constant time access to elements
. Space proportional to N
+ Symbol table representation
. Key = index, value = entry
. Efficient iterator
. Space proportional to number of nonzeros
- Matrix representations
+ 2D array (standard) matrix representation. Each row of matrix is an array
+ Space proportional of N2
- Sparse matrix representation: Each row of matrix is a sparse vector
+ Efficient access to elements
+ Space proportional to number of nonzeros (plus N)
``` java
public class SparseVector {
private HashST v;
public SparseVector() {
v = new HashST<>(); // empty ST represents all 0s vector
}
public void put(int i, double x) {
v.put(i, x); // put all not zero values
}
public double get(int i) {
if (!v.contains(i)) return 0.0;
return v.get(i);
}
public Iterable indices() {
return v.keys;
}
public double dot(double[] that) {
double sum = 0.0;
fo (int i : indices()) {
sum += that[i] * this.get(i);
}
return sum;
}
}
```
---
# Graph
## Undirected graphs
* Graph.: Set of vertices connected pairwise by edges
* Why study graph algorithms?
- Thousand of practical applications
- Hundreds of graph algorithms known
- Interesting and broadly useful abstraction
- Challenging branch of computer science and discrete math
* Example of graph
- Protein-protein interaction network
- The internet as mapped by the Opte project
- Map of science click-streams
- Facebook friends
- One week of Enron emails
- Framingham heart study
* Graph terminology
- Path: Sequence of vertices connected edges
- Cycle: Path whose first and last vertices are the same
- Two vertices are `connected` if there is a path between them
* Some graph-processing problems
- Path. Is there a path between v and w?
- Shortest path. What is the shortest path between v and w?
- Cycle. Is there a cycle in the graph?
- Euler cycle. Is there a cycle that uses each edge exactly once?
- Hamilton cycle. Is there a cycle that uses each vertex exactly once
- Connectivity. Is there a way to connect all of the vertices?
- MST. What is the best way to connect all of the vertices?
- Bi-connectivity. Is there a vertex whose removal disconnects the graph?
- Planarity. Can you draw the graph in the plane with no crossing edges
- Graph isomorphism. Do two adjacency lists represent the same graph?
### Undirected Graph API
* Graph drawing provides intuition about the structure of the graph
* Caveat intuition can be misleading
* Vertex representation
- Will use integers between 0 and V-1
- Applications: convert between names and integers which symbol table
- Implementations
01. Set-of-edges graph representation
. Maintain a list of the edges (linked list or array)
. In-efficient implementation make unusable in huge graph
02. Adjacency-matrix graph representation
. Maintain a two-dimensional V-by-V boolean array, for each edge v-w in graph: adj[v][w] = adj[w][v] = true
. Not very widely used because for a huge graph, you would have billion square number of entries
03. Adjacency-list graph representation
. Maintain vertex-indexed array of lists
. Widely used
* Operations: `addEdge(int v, int w), adj(int v), V(), E(), toString()`
`Code`
``` java
public class Graph {
private final int vertices;
private int edges;
private Bag[] adj;
public Graph(int v) {
this.vertices = v;
this.edges = 0;
adj = (Bag[]) new Bag[v];
for (int i = 0; i < v; i++) {
adj[v] = new Bag();
}
}
public int getVertices() {
return vertices;
}
public int getEdges() {
return edges;
}
public void addEdge(int v, int w) {
adj[v].add(w);
adj[w].add(v);
edges++;
}
public int degree(int v) {
return adj[v].size();
}
public Iterable adj(int v) {
return adj[v];
}
}
```
* In practice, use adjacency-lists representation
- Algorithms based on iterating over vertices adjacent to v
- Real-world graphs tend to be `sparse` (huge number of vertices small average vertex degree)
| representation | space | add edge | edge between v and w? | iterate over vertices adjacent to v? |
|------------------|---------------|---------------|-----------------------|--------------------------------------|
| list of edges | E | 1 | E | E |
| adjacency matrix | v2 | 1* | 1 | v |
| adjacency lists | E + V | 1 | degree(v) | degree(v) |
---
### Depth-first search DFS Undirected search
* Classical graphical search algorithm
* Once you explored vertex suspend then explore the new vertices
* Good example: Maze graph
- Vertex = intersection
- Edge = passage
* Algorithm Idea:
- Unroll a ball of string behind you
- Mark each visited intersection and each visited passage
- Retrace steps when no unvisited options
* Goal: systematically search through a graph
* Idea: Mimic maze exploration
* Typically applications:
- Find all vertices connected to a given source vertex
- Find a path between two vertices
* Design pattern: We decouple Graph representation from graph-processing routine
- Create a Graph object
- Pass the Graph to a graph-processing routine
- Query the graph-processing routine for information
* Depth-first search demo
- To visit a vertex v:
+ Mark v as visited
+ Recursively visit all unmarked vertices w adjacent to v
- Put unvisited vertices on a `stack`
`Code`
``` java
public class DepthFirstPath {
private boolean[] marked;
private int[] edgeTo;
private int s;
public DepthFirstPath(Graph G, int s) {
dfs(G, s);
}
private void dfs(Graph G, int v) {
marked[v] = true;
for (int w : G.adj(v)) {
if (!marked[w]) {
dfs(G, w);
edgeTo[w] = v;
}
}
}
}
```
* Proposition. DFS marks all vertices connected to s in time proportional to the `sum of their degrees` from single source
- PF. [correctness]
+ If w marked, then w connected to s (why?)
+ If w connected to s, then w marked
+ If w unmarked, then consider last edge on a path from s to w that goes from a marked vertex to an unmarked one
- Pf. [running time]
+ Each vertex connected to s is visited once
* Proposition. After DFS, can find vertices connected to s in constant time and can find a path to s (if not exist) in time proportional to its length
- Pf. edgeTo[] is parent-link representation of a tree rooted at s
* Challenge. Flood fill (Photoshop magic wand)
- Assumptions. Pictures has millions to billions of pixels
+ Solution. Build a grid graph
01. Vertex: pixel
02. Edge: between two adjacent gray pixels
03. Blob: all pixels connected to given pixel
---
### Breadth-first search BFS Undirected search
* Explore vertices and it's adjacency then go next vertices for exploration
* Not recursive algorithm, it uses a queue
* Put s into a FIFO queue, and mark s as visited, Repeat until the queue is empty
- Remove the least recently added vertex v
- Add each of v's unvisited neighbors to the queue, and mark them as visited
* Proposition. BFS computes the shortest paths (fewest number of edges) from s to all other vertices in a graph in time proportional to `E + V` from single source
- Pf. [correctness] Queue always consists of zero or more vertices of distance k from s, followed by zero more vertices of distance k + 1
- Pf. [running time] Each vertex connected to s is visited once
`Code`
``` java
public class BreadthFirstPath {
private boolean[] marked;
private int[] edgeTo;
private int[] distTo;
private void bfs(Graph G, int s) {
Queue q = new Queue();
q.enqueue(s);
marked[s] = true;
distTo[s] = 0;
while (!q.isEmpty()) {
int v = q.dequeue();
for (int w : G.adj(v)) {
if (!marked[w]) {
q.enqueue(w);
marked[w] = true;
edgeTo[w] = v;
distTo[w] = distTo[v] + 1;
}
}
}
}
}
```
* Applications
- routing: Fewest number of hops in a communication network
### BFS vs. DFS
* Depth-first search. Put unvisited vertices on a `stack`
* Breadth-first search. Put unvisited vertices on a `queue`
* Shortest path. Find from s to w that uses `fewest number of edges`
```tree
1
/ \
/ \
2 3
/ \ / \
4 5 6 7
```
* Level order, BFS: 1, 2, 3, 4, 5, 6, 7
* Pre-order, DFS: 1, 2, 4, 5, 3, 6, 7
---
### Connected components
* Connectivity queries
- Def. vertices v and w are connected if there is a path between them
- Goal. preprocess graph to answer queries of the form is v connected to w? in constant time
- Operations: `connected(int v, int w), count(), id(int v)`
* Union-find? not quite
* Depth-first search. Yes
* The relation "is connected to" is an equivalence relation
- Reflexive: v is connected to v
- Symmetric: if v is connected to w, then w is connected to v
- Transitive: if v connected to w and w connected to x, then v connected to x
* Def. A connected component is a maximal set of connected vertices
* Goal. Partition vertices into connected components
* Demo
- To visit a vertex v:
+ Mark vertex v as visited
+ Recursively visit all unmarked vertices adjacent to v
``` java
public class ConnectedComponent {
private boolean[] marked;
private int[] id;
private int count;
public ConnectedComponent(Graph graph) {
marked = new boolean[graph.getVertices()];
id = new int[graph.getVertices()];
for (int s = 0; s < graph.getVertices(); s++) {
if (!marked[s]) {
dfs(graph, s);
count++;
}
}
}
private void dfs(Graph graph, int v) {
marked[v] = true;
id[v] = count; // all vertices discovered in same call have samd id
for (int w : graph.adj(v)) {
if (!marked[w])
dfs(graph, w);
}
}
/**
* two vertices are connected if have the same id
*/
public boolean connected(int v, int w) {
return id[v] == id[w];
}
/**
* the id of the vert
*/
public int id(int v) {
return id[v];
}
/**
* sum of the connected components in the graph
*/
public int count() {
return count;
}
}
```
* Applications
- Particle detection
+ Given grayscale image of particles, identity blobs
01. Vertex: pixel
02. Edge: between two adjacent pixels with grayscale value >= 70 "black 0, white = 255"
03. Blob: connected components of 20-30 pixels
+ Particle tracking. Track moving particle over time
### Graph challenges
* Graph-processing challenge 1
- Problem. Is a graph bipartite?
+ You can use DFS to solve it
+ Divide the vertices into two subsets with property every edge connects one subset to another
>0-1
0-2
0-5
0-6
>1-3
2-3
2-4
>4-5
4-6
+ Application: is dating graph bipartite?
- Problem. Find a cycle
+ You can use DFS to solve it
+ Ex. 0-5-4-6-0
- Problem. the seven bridges of konigsberg [1736]
+ Euler tour. Is there a (general) cycle that uses each edge exactly once?
. Answer. A connected grapy is Eulerian iff all vertices have `even` degree
- Problem. Find a (general) cycle that uses every edge exactly once
+ Eulerian tour (classic graph-processing problem)
- Problem. Find a (cycle) that visits every vertex exactly once
+ Sometimes called "Travelling sells person between cities and visit each city once"
. Hamiltonian cycle (classical NP-complete problem)
- Problem. Are two graphs identically except for vertex names?
+ Graph isomorphism problem longstanding open problem
+ No one knows how to classify the problem
- Problem. Lay out a graph in the plane without crossing edges?
+ Classic problem in graph processing
+ Linear-time DFS-based planarity algorithm discovered by Tarjan in 1970s
---
## Directed graphs
* Edges now have direction
* Digraph: set of vertices connected pairwise by `directed` edges
* Examples
- road network
+ Vertex = intersection
+ Edge = one-way street
- Political blogosphere graph
+ Vertex = political blog
+ Edge = link
- Overnight interbank load graph
+ Vertex = bank
+ Edge = overnight loan
- Implication graph
+ Vertex = variable
+ Edge = logical implication
- Combinatorial circuit
+ Vertex = logical gate
+ Edge = wire
- WordNet graph
+ Vertex = synset
+ Edge = hypernym relationship
* Digraph applications
|Digraph|Vertex|Directed Edge|
|-------|------|-------------|
|web|web page|hyperlink|
|game|board position|legal move|
|object graph|object|pointer|
|cell phone|person|placed call|
|financial|bank|transaction|
|control flow|code block|jump|
|Inheritance hierarchy |class|inherits from|
* Some digraph problems:
- Is there a directed path from s to t?
- Shortest path, What's the shortest directed path from s to t?
- Topological sort, can you draw a digraph so that all edge point upwards?
- Strong connectivity, Is there a directed path between all pairs of vertices?
- Transitive closure, For which vertices v and w is there a path from v to w?
- PageRank, What is the importance of a web page?
### Digraph API
* Operations: `addEdge(int v, int w), adj(int v), vertices(), edges(), reverse(), toString()`
``` java
public class Digraph {
private static final String NEW_LINE = System.lineSeparator();
private final int vertices;
private int edges;
private Bag[] adj;
public Digraph(int v) {
this.vertices = v;
this.edges = 0;
adj = (Bag[]) new Bag[v];
for (int i = 0; i < v; i++) {
adj[i] = new Bag();
}
}
public int getVertices() {
return vertices;
}
public int getEdges() {
return edges;
}
public void addEdge(int v, int w) {
adj[v].add(w);
edges++;
}
/**
* get number of adjacency to vertex v
*/
public int degree(int v) {
return adj[v].size();
}
/**
* get adjacency to vertex v
*/
public Iterable adj(int v) {
return adj[v];
}
/**
* Returns the reverse of the digraph
*/
public Digraph reverse() {
Digraph reverse = new Digraph(vertices);
for (int v = 0; v < vertices; v++) {
for (int w : adj(v)) {
reverse.addEdge(w, v);
}
}
return reverse;
}
@Override
public String toString() {
StringBuilder builder = new StringBuilder();
for (int v = 0; v < vertices; v++) {
builder.append(v + ": ");
for (int w : adj[v]) {
builder.append(w + " ");
}
builder.append(NEW_LINE);
}
return builder.toString();
}
}
```
* Digraph representations
- In practice. Use adjacency-list representation
+ Algorithms based on iterating over vertices adjacent to v
+ Real-world graphs tend to be sparse
| representation | space | add edge | edge between v and w? | iterate over vertices adjacent to v? |
|------------------|---------------|---------------|-----------------------|--------------------------------------|
| list of edges | E | 1 | E | E |
| adjacency matrix | v2 | 1+ | 1 | v | |
| adjacency lists | E + V | 1 | outdegree(v) | outdegree(v) |
### Depth-first search in digraphs
* Problem. find all vertices reachable from s along a directed path
* Some method as for undirected graphs
- Every undirected graph is digraph (with edges in both directions)
- DFS is a `digraph` algorithm
* Applications:
01. Reachability application: Program control-flow analysis
- Every program is a digraph
+ Vertex = basic block of instructions (straight-line program)
+ Edge = jump
- Dead-code elimination
+ find and remove unreachable code
- Infinite-loop detected
+ Determine whether exit is unreachable
02. Mark-sweep garbage collector
- Every data structure is a digraph
+ Vertex = object
+ Edge = reference
- Memory cost. Uses 1 extra mark bit per object (plus DFS stack)
- Roots. Objects known to be directly accessible by program (e.g. stack)
- Reachable objects. Objects indirectly accessible by program (starting at a root and following a chain of pointers)
- Mark-sweep algorithm. [McCarthy, 1960]
+ Mark: mark all reachable objects
+ Sweep: if object is unmarked, it is garbage (so add to free list)
* DFS enables direct solution of simple digraph problems
01. Reachability
02. Path finding
03. Topological sort
04. Directed cycle direction
* Basis for solving difficult digraph problems
01. 2-satisfiability
02. Directed Euler path
03. Strongly-connected components
### Breadth-first search in digraphs
* BFS is a `digraph` algorithms
* Proposition. BFS computes shortest paths (fewest number of edges) from s to all other vertices n a digraph in time proportional to `E + V`
* Multiple-source shortest paths
- Given. a digraph and set of source vertices, find shortest path from any vertex in the set to each other vertex
- Q. How to implement multi-source constructor?
+ A. Use BFS, but initialize by `enqueuing all source vertices`
* Web crawler
- Goal. Crawl web, starting from some root web page, say www.princeton.edu
+ Solution. [BFS with implicit digraph]
. Choose root web page as source s
. Maintain a Queue of websites to explore
. Maintain a SET of discovered websites
. Dequeue the next website and enqueue websites to which it links (provided you haven't done so before)
- Q. Why not use DFS?
+ A. You gonna far away for searching web, some web page traps new page
`Code`
``` java
public class BareBonesWebCrawler {
private Queue queue = new ArrayQueue<>();
private Set discovered = new HashSet<>();
private static final String REGEX = "https://(\\w+\\.)*(\\w+)";
public BareBonesWebCrawler(String root) {
queue.enqueue(root);
discovered.add(root);
bfs();
}
private void bfs() {
while (!queue.isEmpty()) {
String v = queue.dequeue();
System.out.println(v);
String input = readRawHtml(v);
Pattern pattern = Pattern.compile(REGEX);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
String w = matcher.group();
if (!discovered.contains(w)) {
discovered.add(w);
queue.enqueue(w);
}
}
}
}
private String readRawHtml(String url) {
URL u;
try {
u = new URL(url);
URLConnection conn = u.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
StringBuffer buffer = new StringBuffer();
String inputLine;
while ((inputLine = in.readLine()) != null) {
buffer.append(inputLine);
}
in.close();
// System.out.println(buffer.toString());
return buffer.toString();
} catch (IOException e) {
e.printStackTrace();
}
return "";
}
public static void main(String[] args) {
new BareBonesWebCrawler("https://github.com/openjdk");
}
}
```
### Topological sort
* Goal. Given a set of tasks to completed with precedence constraints in which order should we schedule with the tasks?
* Digraph model
- Vertex = task
- Edge = precedence constraint
* DAG
- `acyclic` Digraph with no cycle
- If you have cycle there is no way to solve the problem
* Topological sort. Redraw DAG so all edges points upwards
* Solution. DFS
* Demo
01. run depth-first search
02. Return vertices in reverse post order
`Code`
``` java
public class DepthFirstOrder {
private boolean[] marked;
private Stack reverseOrder;
public DepthFirstOrder(Digraph graph) {
reverseOrder = new ArrayStack<>();
marked = new boolean[graph.getVertices()];
for (int v = 0; v < graph.getVertices(); v++) {
if (!marked[v]) {
dfs(graph, v);
}
}
}
private void dfs(Digraph graph, int v) {
marked[v] = true;
for (int w : graph.adj(v))
if (!marked[w]) {
dfs(graph, w);
reverseOrder.push(v);
}
// return all vertices in reverse DFS post-order
public Stack reversePost() {
return reverseOrder;
}
}
```
, Detect cycle in directed graph
``` java
public class DirectedCycleDetector {
private boolean[] marked;
private int[] edgeTo;
private Stack cycle; // vertices on a cycle
private boolean[] onStack; // vertices on recursive call stack
public DirectedCycleDetector(Digraph graph) {
onStack = new boolean[graph.getVertices()];
marked = new boolean[graph.getVertices()];
edgeTo = new int[graph.getVertices()];
for (int v = 0; v < graph.getVertices(); v++) {
if (!marked[v] && cycle == null)
dfs(graph, v);
}
}
private void dfs(Digraph graph, int v) {
onStack[v] = true;
marked[v] = true;
for (int w : graph.adj(v)) {
if (cycle != null) {
// short circuit if directed cycle found
return;
} else if (!marked[w]) {
// found new vertex, so recur
edgeTo[w] = v;
dfs(graph, w);
} else if (onStack[w]) {
// trace back directed cycle
cycle = new Stack<>();
for (int x = v; x != w; x = edgeTo[x]) {
cycle.push(x);
}
cycle.push(w);
cycle.push(v);
}
}
onStack[v] = false;
}
public boolean hasCycle() {
return cycle != null;
}
public Iterable cycle() {
return cycle;
}
}
```
, Will check if no cycle first then get order
``` java
public class TopologicalSort {
private Stack order;
public TopologicalSort(Digraph graph) {
DirectedCycleDetector finder = new DirectedCycleDetector(graph);
if (!finder.hasCycle()) {
DepthFirstOrder dfs = new DepthFirstOrder(graph);
order = dfs.reversePost();
}
}
public Stack order() {
return order;
}
public boolean hasOrder() {
return order != null;
}
public boolean isDag() {
return hasOrder();
}
}
```
* Used for package management, like maven, brew, etc
* Proposition. Reverse DFS post-order of a DAG is a topological order
- Pf. [correctness] Consider any edge v → w. When dfs(v) is called
+ Case 1: dfs(w) has already been called and returned. Thus, w was done before v
+ Case 2: dfs(w) has not yet been called, dfs(w) will get called directly on indirectly by dfs(v) and will finish before dfs(v). Thus, w will be done before v
+ Case 3: dfs(w) has already been called, but has not yet returned. Can't happen in a DAG: function call stack contains path from w to v, so v→w would complete a cycle
* Directed cycle detection
- Proposition. A digraph has a topological order iff no directed cycle
+ Pf.
01. If directed cycle, topological order impossible
02. If directed cycle, DFS-based algorithm finds a topological order
- Goal. Given a digraph. find a directed cycle
+ Solution. DFS
- Applications
01. Scheduling. Given a set of tasks to be completed with precedence constraints, in what order should we schedule the tasks??
. Remark. A directed cycle implies scheduling problem is infeasible
02. Java compiler does cycle detection (cyclic inheritance)
03. Microsoft Excel does cycle detection (and has a circular reference toolbar!)
`Code`
``` java
public class DirectedCycleDetector {
private boolean[] marked;
private int[] edgeTo;
private Stack cycle; // vertices on a cycle
private boolean[] onStack; // vertices on recursive call stack
public DirectedCycleDetector(Digraph graph) {
onStack = new boolean[graph.getVertices()];
marked = new boolean[graph.getVertices()];
edgeTo = new int[graph.getVertices()];
for (int v = 0; v < graph.getVertices(); v++) {
if (!marked[v] && cycle == null)
dfs(graph, v);
}
}
private void dfs(Digraph graph, int v) {
onStack[v] = true;
marked[v] = true;
for (int w : graph.adj(v)) {
if (cycle != null) {
// short circuit if directed cycle found
return;
} else if (!marked[w]) {
// found new vertex, so recur
edgeTo[w] = v;
dfs(graph, w);
} else if (onStack[w]) {
// trace back directed cycle
cycle = new Stack<>();
for (int x = v; x != w; x = edgeTo[x]) {
cycle.push(x);
}
cycle.push(w);
cycle.push(v);
}
}
onStack[v] = false;
}
public boolean hasCycle() {
return cycle != null;
}
public Iterable cycle() {
return cycle;
}
}
```
### Strong components
* Def. Vertices v and w are `strongly connected` if there is a directed path from v to w and a directed path from w to v
* Key property. Strong connectivity is an `equivalence relation`
- v is strongly connected to v
- If v is strongly connected to w, then w is strongly connected to v
- If v is strongly connected to w and w connected to x, then v is strongly connected x
* v and w are `connected` if there is a path between v and w
* Applications
- Food web graph
+ Vertex = species
+ Edge = from producer to consumer
+ Strong component. Subset of species with common energy flow
- Software modules
+ Vertex = software module
+ Edge = from module to dependency
+ Strong component. Subset of mutually interacting modules
01. Approach 1. Package strong components together
02. Approach 2. Use to improve design
* Kosaraju-Sharir algorithm
- Reverse graph. Strong components in G are same as GR
- Kernel DAG. Contact each strong components into a single vertex
- Idea
+ Compute topological order (reverse post-order) in kernel DAG
+ Run DFS, considering vertices in reverse topological order
- Demo
01. Phase 1. Compute reverse postorder in GR
02. Phase 2. Run DFS in G, visiting unmarked vertices in reverse postorder of GR
- Simple (but mysterious) algorithm for computing strong components
- Proposition. Kosaraju-Sharir algorithm computes the strong components of a digraph in time proportional to E + V
+ pf.
01. Running time: bottleneck is running DFS twice (and computing GR)
02. Correctness: tricky
03. Implementation: easy
`Code`
``` java
public class KosarajuSharirCC {
private boolean[] marked;
private int[] id;
private int count;
public KosarajuSharirCC(Digraph graph) {
marked = new boolean[graph.getVertices()];
id = new int[graph.getVertices()];
count = 0;
DepthFirstOrder order = new DepthFirstOrder(graph);
for (int s : order.reversePost()) {
if (!marked[s]) {
dfs(graph, s);
count++;
}
}
}
private void dfs(Digraph graph, int v) {
marked[v] = true;
id[v] = count;
for (int w : graph.adj(v)) {
if (!marked[w]) {
dfs(graph, w);
}
}
}
public boolean stronglyConnected(int v, int w) {
return id[v] == id[w];
}
public int id(int v) {
return id[v];
}
public int count() {
return count;
}
}
```
---
## MST (Minimum spanning trees)
* Given. `Undirected graph` G with positive edge weights (connected)
* Def. A `spanning tree` of G is a `subgraph` T that is both a `tree` (connected and `acyclic`) and `spanning` (includes all of the vertices)
* Goal. Find a min weight spanning tree
* Applications
- Dithering
- Cluster analysis
- Real-time face verification
- Image registration with Renyi entropy
- Find road networks in satellite and aerial imagery
- Network design (communication, electrical, computer, road)
- Auto config protocol for ether bridging to avoid cycles in a network
- Reducing data storage in sequencing amino acids in a protein
### Greedy algorithm
* General principle of Algorithm desing
* Simplifying assumptions
- Edge weights are `distinct`
- Graph is `connected`
* Based on these MST exists and unique
* Cut property
- Def. A `cut` in a graph is partition of its vertices into two non-empty set
- Def. A `crossing edge` connects a vertex in one set with a vertex in the other
- Cut property. Given any cut, the crossing edge of min weight is the MST
- Pf. Suppose min-weight crossing edge *e* is not in the MST
+ Adding e to the MST creates a cycle
+ Some other edge *f* in cycle must be a crossing edge
+ Removing *f* and adding *e* is also a spanning tree
+ Since weight of *e* is less than the weight of *f* that spanning tree is lower weight
* Greedy MST algorithm demo
- Start with all edges colored gray
- Find cut with no black crossing edges, color its min-weight edge black
- Repeat until V-1 edges are colored black
* Proposition. The greedy algorithm computes the MST
- Pf. [correctness]
+ Any edge colored black is in the MST (via cut property)
+ Fewer than *V-1* black edges → cut with no black crossing edges (consider cut whose vertices are one connected component)
* Efficient implementations. Choose cut? find min-weight edge?
01. Kruskal's algorithm [stay tuned]
02. Prim's algorithm [stay tuned]
03. Boruvka's algorithm
* Q. What if edge weights are not all distinct?
- A. Greedy MST algorithm still correct if equal weights are present!
* Q. What iff graph is not connected?
- A. Compute m,minimum spanning forest = MST of each components
### Edge-weight graph API
* Edge abstraction needed for weighted edges
* Idiom for processing an edge e: `int v = e.either(), w = e.other(v); `
``` java
public class Edge implements Comparable {
private final int v, w;
private final double weight;
public Edge(int v, int w, double weight) {
this.v = v;
this.w = w;
this.weight = weight;
}
/**
* Either endpoint
*/
public int either() {
return v;
}
/**
* Other endpoint
*/
public int other(int vertex) {
if (vertex == v)
return w;
else
return v;
}
@Override
public int compareTo(Edge that) {
if (this.weight < that.weight)
return -1;
else if (this.weight > that.weight)
return 1;
return 0;
}
}
```
* Graph API which has weighted edges, we will use `EdgeWeightedGraph`
``` java
public class EdgeWeightedGraph {
private final int vertices;
private Bag[] adj;
public EdgeWeightedGraph(int v) {
this.vertices = v;
adj = (Bag[]) new Bag[v];
for (int i = 0; i < v; i++) {
adj[v] = new Bag();
}
}
public int getVertices() {
return vertices;
}
public void addEdge(Edge edge) {
int v = edge.either(), w = edge.other(v);
adj[v].add(edge);
adj[w].add(edge);
}
public Iterable adj(int v) {
return adj[v];
}
}
```
* Edge-weighted graph: adjacency-list representation
- Maintain vertex-indexed array of Edge lists
* Minimum spanning tree API
- Q. How to represent the MST?
``` java
public class MST {
MST(EdgeWeightedGraph G)
Iterable edges() // edges in MST
double weight() // weight of MST
}
```
### Kruskal's algorithm
* classical algorithm for computing MST
* Steps:
- Consider edges in ascending order of weight
- Add next edge to tree T unless doing so would create a cycle
- Ignore edge that create a cycle
* Proposition. [Kruskal 1956] Kruskal's algorithm computes the MST
- Pf. Kruskal's algorithm is a special case of the greedy MST algorithm
+ Suppose Kruskal's algorithm colors the edge *e=v-w* black
+ Cut = set of vertices connected to *v* in tree *T*
+ No crossing edge is black
+ No crossing edge has lower weight, why?
* Challenge. Would adding edge *v-w* to tree *T* create a cycle? if not, add it
- *V* run DFS from v, check if w is reachable (T has at most V-1 edges)
- log*V use union-find data structure
- Efficient solution. Use the `Union-find` data structure
. Maintain a set for each connected component in T
. If v and w are in same set, then add v-w would create a cycle
. To add v-w to T, merge sets containing v and w
`Code`
``` java
public class KruskalMST {
private double weight;
private Queue mst;
public KruskalMST(EdgeWeightedGraph graph) {
MinPQ pq = new MinPQ<>(graph.getVertices());
mst = new ArrayQueue<>();
for (Edge e : graph.edges())
pq.insert(e);
UnionFind unionFind = new WeightedQuickUnionPassCompression(graph.getVertices());
while (!pq.isEmpty() && mst.size() < graph.getVertices() - 1) {
Edge e = pq.delMin();
int v = e.either(), w = e.other(v);
if (!unionFind.isConnected(v, w)) {
unionFind.union(v, w);
mst.enqueue(e);
weight += e.weight();
}
}
}
public Iterable edges() {
return mst;
}
public double weight() {
return this.weight;
}
}
```
* Proposition. Kruskal's algorithm computes MST in time proportional to *E log E* (in the worst case)
### Prim's algorithm
* Classical algorithm for computing MST
* Steps:
- Start with vertex 0 and greedily grow tree T
- Add to T the min weight edge with exactly one endpoint in T
- Repeat until V-1 edges
* Proposition. Prim's algorithm computes the MST
- Pf. Prime's algorithm is a special case of the greedy MST algorithm
+ Suppose edge *e* = min weight edge connecting a vertex on the tree to a vertex not on the tree
+ Cut = set of vertices connected on tree
+ No crossing edge is black
+ No crossing edge has lower weight
* Challenge. Find the min weight edge with exactly one endpoint in *T*
- *E* ← try all edges
- *lg E* ← Priority Queue
### Prim's algorithm: lazy implementation
* Challenge. Find the min weight edge with exactly one endpoint in *T*
- Lazy solution. Maintain a PQ of `edges` with (at least) one endpoint in T
+ Key = edge, priority = weight of edge
+ Delete-min to determine next edge *e = v-w* to add to *T*
+ Disregard if both endpoints v and w are in *T*
. add to PQ any edge incident to w (assuming other endpoint not in *T*)
. add *w* to *T*
* Prim's algorithm (lazy) demo
- Start with vertex 0 and greedily grow tree *T*
- Add to *T* the min weight edge with exactly one endpoint *T*
- Repeat until *V-1* edges
`Code`
``` java
public class LazyPrimMST {
private double weight;
private boolean[] marked;
private Queue mst;
private MinPQ pq;
public LazyPrimMST(EdgeWeightedGraph graph) {
pq = new MinPQ<>(graph.getVertices());
mst = new ArrayQueue<>();
marked = new boolean[graph.getVertices()];
for (int v = 0; v < graph.getVertices(); v++) { // run Prim from all vertices to
if (!marked[v])
prim(graph, v); // get a minimum spanning forest
}
}
private void prim(EdgeWeightedGraph graph, int s) {
scan(graph, s);
while (!pq.isEmpty()) {
Edge e = pq.delMin();
int v = e.either(), w = e.other(v);
if (marked[v] && marked[w])
continue;
mst.enqueue(e);
weight += e.weight();
if (!marked[v])
scan(graph, v); // v become part of tree
if (!marked[w])
scan(graph, w); // w become part of tree
}
}
// add all edges e incident to v onto pq if the other endpoint has not yet been
// scanned
private void scan(EdgeWeightedGraph graph, int v) {
marked[v] = true;
for (Edge e : graph.adj(v))
if (!marked[e.other(v)])
pq.insert(e);
}
public Iterable edges() {
return mst;
}
public double weight() {
return this.weight;
}
}
```
* Proposition. Lazy Prim's algorithm computes the MST in time proportional to *E log E*
and extra space proportional to *E* (in the worst case)
- Pf.
|operation|frequency|binary heap|
|---------|---------|-----------|
|delete min| E | lg E |
|insert| E | lg E |
### Prim's algorithm: eager implementation
* Challenge. Find the min weight edge with exactly one endpoint in *T*
* Eager solution. Maintain a PQ of vertices connected by an edge to *T*, where priority of vertex *v* = weight of shortest edge connected *v* to T
- Delete min vertex *v* and add its associated edge *e = v-w* to *T*
- Update PQ by considering all edges *e=v-x* incident to v
+ ignore if x is already in T
+ add x to PQ if not already on it
+ `decrease priority` of x if v-x becomes shortest edge connecting x to T
* Prim's algorithm (eager) demo
- Start with vertex 0 and greedily grow tree *
- Add to T the min weight edge with exactly one endpoint in T
- Repeat until *V-1* edges
* Indexed priority queue
- Associate an index between 0 and N-1 with each key in a priority queue
+ Client can insert and delete-the-minimum
+ Client can change the key by specify the index
- Implementation
+ Start with same code as MinPQ
+ Maintain parallel arrays keys[], pq[], and qp[] so that:
01. keys[i] is the priority of i
02. pq[i] is the index of the key in heap position i
03. pq[i] is the index of the key in heap position i
04. qp[i] is the heap position of the key with index i
+ Use swim(qp[k]) implementation
``` java
public class IndexMinPQ> {
IndexMinPQ(int N)
void insert(int i, Key key)
void decreaseKey(int i, Key key)
boolean contains(int i)
int delMin()
boolean isEmpty()
int size()
}
```
* Prims' (eager) algorithm: running time
- Depends on PQ implementation: v insert, v delete-min, E decrease key
|PQ implementation|Insert|delete-min|decrease-key|total|
|-----------------|------|----------|------------|-----|
|array|1|V|1|V2|
|binary heap|log V| log V|log V|E log V|
|d-way heap|logdV|d logdV|logdV|E log1/vV|
|Fibonacci heap|1+|log V*|1+|E + V log V|
> Note: + amortized
### MST context
* Euclidean MST
- Given *N* points in the plane, find MST connecting them, where the distances between point paris are their `Euclidean` distances
* Scientific application: clustering
- K-clustering. Divide a set of objects classify into K coherent groups
- Distance function. Numeric value specifying "closeness" of two objects
- Goal. Divide into clusters so that objects in different clusters are far apart
* Applications
- Routing in mobile ad hoc networks
- Document categorization for web search
- Similarity searching in medical image databases
- Sky-cat: cluster 109 objects into stars, quasars, galaxies
---
## Shortest path
* Given an edge weighted digraph, find the shortest path from *s* to *t*
* Applications:
- Map routing
- Car navigation
- Network routing protocols (OSPF, BGP, RIP)
- Optimal pipelining of VLSI chip
- Robot navigation
* Which vertices?
- Source-sink: from one vertex to another
- Single source: from one vertex to every other
- All pairs: between all pairs of vertices
* Restrictions on edge weights
- Nonnegative weights
- Arbitrary weights
- Euclidean weights
* Cycles?
- No directed cycles
- No "negative cycles"
### Weighted directed edge API
* Similar to Edge for undirected graphs, but a bit simpler
`Code`
``` java
public class DirectedEdge {
private final int v, w;
private final double weight;
public DirectedEdge(int v, int w, double weight) {
this.v = v;
this.w = w;
this.weight = weight;
}
public int from() {
return v;
}
public int to() {
return w;
}
public double weight() {
return weight;
}
}
```
### Edge-weighted digraph API
* Same as EdgeWeightedGraph except replace Graph with Digraph
`Code`
``` java
public class EdgeWeightedDigraph {
private final int vertices;
private final Bag[] adj;
public EdgeWeightedDigraph(int vertices) {
this.vertices = vertices;
adj = (Bag[]) new Bag[vertices];
for (int v = 0; v < vertices; v++) {
adj[v] = new Bag<>();
}
}
public void addEdge(DirectedEdge e) {
int v = e.from();
adj[v].add(e);
}
public Iterable adj(int v) {
return adj[v];
}
}
```
### Single-source shortest paths API
* Goal. Find the shortest path from s to every other vertex
`API`
``` java
public class SP {
SP(EdgeWeightedDigraph graph, int s)
double distTo(int v)
Iterable pathTo(int v)
boolean hasPathTo(int v)
}
```
### Shortest paths properties
* Goal. Find the shortest path from s to every other vertex
* Observation. A shortest-paths tree (SPT) solution exists, why?
* Consequence. Can represent the SPT with two vertices-indexed arrays:
- distTo[v] is length of shortest path from s to v
- edgeTo[v] is last edge on shortest path from s to v
`Code`
``` java
public double distTo(int v) {
return distTo[v];
}
public Iterable pathTo(int v) {
Stack path = new Stack<>();
for (DirectedEdge e = edgeTo[v]; e != null; e = edgeTo[e.from()]) {
path.push(e);
}
return path;
}
```
* Edge relaxation
- Relax edge *e = v → w*
+ distTo[v] is length of shortest `known` path from s to v
+ distTo[v] is length of shortest `known` path from s to w
+ edgeTo[w] is last edge on shortest `known` path from s to w
+ If e = v→w gives shorter path to w through v, update both distTo[w] and edgeTo[w]
`Code`
``` java
private void relax(DirectedEdge e) {
int v = e.from(), w = e.to();
if (distTo[w] > distTo[v] + e.weight() {
distTo[w] = distTo[v] + e.weight();
edgeTo[w] = e;
if (pq.contains(w)) pq.decreaseKey(w, distTo[w]);
else pq.insert(w, distTo[w]);
}
}
```
* Shortest-paths optimality conditions
- Proposition. Let *G* be an edge-weighted digraph then distTo[] are the shortest path distances from s iff
+ distTo[s] = 0
+ For each vertex v, distTo[v] is the length of some path from s to v
+ For each edge = e = v→w, distTo[w] <= distTo[v] + e.weight()
+ Pf. [necessary]
. Suppose that distTo[w] > distTo[v] + e.weight() for some edge e = v→w
. Then, e gives a path from s to w (through v) of length less than distTo[w]
* Generic shortest-paths algorithm
- Initialize distTo[s] = 0 and distTo[v] = ∞ for all other vertices
- Repeat until optimality donstions are satisfied
+ Relax any edge
* Efficient implementations. How to choose which edge to relax?
- Dijkstra's algorithm (non-negative weights)
- Topological sort algorithm (no directed cycles)
- Bellman-Ford algorithm (no negative cycles)
### Dijkstra's algorithm
* Dijkstra's algorithm demo
- Consider vertices in increasing order of distance from s (non-tree vertex with the lowest distTo[] value)
- Add vertex to tree and relax all edges pointing from that vertex
`Code`
``` java
public class DijkstraSP {
private DirectedEdge[] edgeTo;
private double[] distTo;
private IndexMinPQ pq;
public DijkstraSP(EdgeWeightedDigraph graph, int s) {
edgeTo = new DirectedEdge[graph.getVertices()];
distTo = new double[graph.getVertices()];
pq = new IndexMinPQ<>(graph.getVertices());
for (int v = 0; v < graph.getVertices(); v++) {
distTo[v] = Double.POSITIVE_INFINITY;
}
distTo[s] = 0.0;
pq.insert(s, 0.0);
while (!pq.isEmpty()) {
int v = pq.delMin();
for (DirectedEdge e : graph.adj(v)) {
relax(e);
}
}
}
private void relax(DirectedEdge e) {
int v = e.from(), w = e.to();
if (distTo[w] > distTo[v] + e.weight(){
distTo[w] = distTo[v] + e.weight();
edgeTo[w] = e;
if (pq.contains(w)) pq.decreaseKey(w, distTo[w]);
else pq.insert(w, distTo[w]);
}
}
}
```
* Proposition. Dijkstra's algorithm computes a SPT in any edge-weighted digraph with nonnegative weights
- Pf.
+ Each edge e = v→ w is relaxed exactly once (when v is relaxed) leaving distTo[w] <= distTo[v] + e.weight()
+ Inequality holds until algorithm terminates because
. distTo[w] cannot increase
. distTo[v] will not change
* Computing spanning trees in graphs
- Dijkstra's algorithm seems familiar?
+ Prim's algorithm is essentially the `same` algorithm
+ Both are in a family of algorithms that computes a graph's `spanning tree`
- Main distinction. Rule uses to choose next vertex for the tree
+ Prim's: Closest vertex to the tree (via an undirected edge)
+ Dijkstra's: Closest vertex to the source (via a directed path)
> Note: DFS and BFS are also in this family of algorithms
* Dijkstra's algorithm: which priority queue?
- Bottom line
+ Array implementation optimal for dense graphs
+ Binary heap much faster for sparse graphs
+ 4-way heap worth the trouble in performance critical situations
+ Fibonacci heap best in theory, but not worth implementing
|PQ implementation|Insert|delete-min|decrease-key|total|
|-----------------|------|----------|------------|-----|
|array|1|V|1|V2|
|binary heap|log V| log V|log V|E log V|
|d-way heap|logdV|d logdV|logdV|E log1/vV|
|Fibonacci heap|1+|log V*|1+|E + V log V|
### Edge-weighted DAGs
* Q. Suppose that an edge digraph has no directed cycles is it easier to find shortest paths than in general digraph?
- A. Yes
* Acyclic shortest paths demo
- Consider vertices in topological order
- Relax all edges pointing from that vertex
* Proposition. Topological sort algorithm computes SPT in any edge weighted DAG in time proportional to *E + V*
- Pf.
+ Each edge e = v→w relaxed exactly once (when v is relaxed)
+ Inequality holds until algorithm terminates because
. distTo[w] cannot increase
. distTo[v] will not change
+ Thus, upon termination, shortest-paths optimality conditions hold
`Code`
``` java
public class AcyclicSP {
private DirectedEdge[] edgeTo;
private double[] distTo;
public AcyclicSP(EdgeWeightedDigraph graph, int s) {
edgeTo = new DirectedEdge[graph.getVertices()];
distTo = new double[graph.getVertices()];
for (int v = 0; v < graph.getVertices(); v++) {
distTo[v] = Double.POSITIVE_INFINITY;
}
distTo[s] = 0.0;
TopologicalSort topological = new TopologicalSort(graph);
for (int v : topological.order()) {
for (DirectedEdge e : graph.adj(v)) {
relax(e);
}
}
}
private void relax(DirectedEdge e) {
int v = e.from(), w = e.to();
if (distTo[w] > distTo[v] + e.weight()) {
distTo[w] = distTo[v] + e.weight();
edgeTo[w] = e;
}
}
public boolean hasPathTo(int v) {
return distTo[v] < Double.POSITIVE_INFINITY;
}
public double distTo(int v) {
return distTo[v];
}
public Iterable pathTo(int v) {
if (!hasPathTo(v))
return null;
Stack path = new ArrayStack<>();
for (DirectedEdge e = edgeTo[v]; e != null; e = edgeTo[e.from()]) {
path.push(e);
}
return path;
}
}
```
* Applications
- Content-aware resizing
+ Seam carving. Resizing an image without distortion for display on cell phones and web browsers
+ Photoshop, GIMP, etc
- To find vertical seam:
+ Grid DAG: vertex = pixel, edge = from pixel to 3 downward neighbors
+ Weight of pixel = energy function of 8 neighboring pixels
+ Seam = shortest path (sum of vertex weights) from top to bottom
- To remove vertical seam:
+ Delete pixels on seam (one in each row)
* Longest paths in edge-weighted DAGs
- Formulate as a shortest paths problem in edge-weighted DAGs
[equivalent: reverse senes of equality in relax()]
+ Negate all weights
+ Find shortest paths
+ Negate weights in result
- Parallel job scheduling. Given a set of jobs with durations and precedence constraints, schedule the jobs (by finding a start time for each) so as to achieve the minimum completion time, while respecting the constraints
+ Critical path method (CPM)
### Negative weights
* Dijkstra. Doesn't work with negative edge weights
* Negative cycles
- Def. A negative cycle is a directed cycle whose sum of edge weights is negative
* Bellman-Ford algorithm
- Initialize distTo[s] = 0 and distTo[v] = ∞ for all other vertices
- Repeat V times:
+ Relax each edge
- Proposition. Dynamic programming algorithm computes SPT in any edge weighted digraph with no negative cycles in time proportional to *E x V*
+ Pf idea. After passing *i*, found shortest path containing at most *i* edges
- Improvement
+ Observation. If distTo[v] does not change during pass i, no need to relax any edge pointing from v in pass i+1
+ FIFO implementation. Maintain queue of vertices whose distTo[] changed
+ Overall effect
. The running time is still proportional to *E x V* in worst case
. But much faster than in practice
### Single source shortest-path implementation: cost summary
|algorithm|restriction|typical case|worst case|extra space|
|---------|-----------|------------|----------|-----------|
|topological sort|no directed cycles|E + V|E + V|V|
|Dijkstra (binary heap)| no negative weights|E log V|E log V|V|
|Bellman-ford|no negative cycles| E V | E V | V
|Bellman-ford (queue-based) |no negative cycles| E + V | E V | V
* Remark 1. Directed cycles make the problem harder
* Remark 2. Negative weights make the problem harder
* Remark 3. Negative cycles make the problem intractable
### Finding a negative cycle
* Negative cycle. Add two method to the API for SP
``` java
boolean hasNegativeCycle()
Iterable negativeCycle()
```
* Observation. If there is a negative cycle, Bellman-Forest gets stuck in loop, un-updating distTo[] and edgeTo[] entries of vertices in the cycle
* Proposition. If any vertex v is updated in phase V, there exist a negative cycle (and can trace back edgeTo[v] entries to find it)
- In practice. Check for negative cycles more frequently
* Applications
- Arbitrage detection
+ Problem. Given table of exchange rates, is there an arbitrage opportunity?
- Ex. $1.000 → 741 Euros → 1,012.206 Canadian dollars → $1,007.14497
- Currency exchange graph
+ Vertex = currency
+ Edge = transaction, with weight equal to exchange rate
+ Find a directed cycle whose product of edge weights is > 1
|Currency|USD|EUR|GBP|CHF|CAD|
|--------|---|---|---|---|---|
|USD|1|0.741|0.657|1.061|1.011|
|EUR|1.350|1|0.888|1.433|1.366
|GBP|1.521|1.126|1|1.614|1.538|
|CHF|0.943|0.698|0.620|1|0.953|
|CAD|0.995|0.732|0.650|1.049|1|
### Shortest paths summary
* Dijkstra's algorithm
- Nearly linear-time when weights are nonnegative
- Generalization en-compresses DFS, BFS and Prim
* Acyclic edge-weighted digraphs
- Arise in applications
- Faster than Dijkstra's algorithm
- Negative weights are no problem
* Negative weights and negative cycles
- Arise in applications
- If no negative cycles, can find shortest paths via Bellman-Ford
- If negative cycles, can find one via Bellman-Ford
* Shortest-paths is a broadly useful problem-solving model
---
# Maximum flow
* General problem solving model
## Mincut problem
* Input. An edge-weighted digraph, source vertex *s* and target vertex *t*
* Def. A `st-cut (cut)` is a partition of the vertices into two disjoint sets with *s* in one set *A* and *t* in the other set *B*
* Def. its capacity is the sum of the capacities of the edges from *A* to *B*
* Minimum st-cut (mincut) problem. Find a cut of minimum capcity
* Applications
- Free world goal. Cut supplies (if cold war turns into real war)
- Government-in-power's goal. Cut off communication of set of people
## Maxflow problem
* Input. An edge-weighted digraph, source vertex *s* and target *t*
* Every edge has a positve capcity
* Def. An st-flow (flow) is an assignment of values to the edges such that:
- Capacity constraint: 0 <= edge's flow <= edge's capacity
- Local equilibrium: inflow = outflow at every vertex (except s and t)
* Maximum st-flow (maxflow) problem. Find a flow of maximum value
### Ford-Fulkerson algorithm
* Steps
- Initialization. Start with 0 flow
- Augmenting path. Find an undirected path from s to t such that:
+ Can increase flow on forward edges (not full)
+ Can decrease flow on backward edge (not empty)
- Termination. All paths from s to t are blocked by either a:
+ Full forward edge
+ Empty backward edge
* Ford-Fulkerson algorithm
```
Start with 0 flow
While there exists an augmenting path
- find an augmenting path
- compute bottleneck capcity
- increase flow on that path by bottleneck capacity
```
* Questions
- How to compute a mincut? Easy
- How to find an augmenting path? BFS works well
- If FF terminates, does it always compute a maxflow? Yes
- Does FF always terminate? If so, after how many augmentations?
### Running time analysis
* TODO this part
### Java code
* TODO this part
---
# Strings
* String. Sequence of characters
* Important fundamentals abstractions
- Information processing
- Genomic sequences
- Communication systems (e.g email)
- Programming systems (e.g java programs)
* C char data type
- Typically an 8-bit integer
- Supports 7-bit ASCII
- Can represent only 256 characters
* Java char data type
- Support 16-bit Unicode
- Support 21-bit Unicode 3.0 (awkwardly)
* String data type
- String data type in Java. Sequence of characters (immutable)
- Length. Number of characters
- Substring extraction. Get a contiguous subsequence of characters
- String concatenation. Append one character to end of another string
`Code`
``` java
public final class String implements Comparable {
private char[] value; // characters
private int offset; // index of first char in array
private int length; // length of substring
private int hash; // cache of hashCode()
public int length() { return length;}
public char charAt(int i) { return value[i + offset];}
private String(int offset, int length, char[] value) {
this.offset = offset;
this.length = length;
this.value = value;
}
public String substring(int from, int to) {
return new String(offset + from, to - from, value); // copy of reference to char array
}
}
```
* String data type performance
- Memory `40 + 2N` bytes for a virgin String of length of N
- Remark. StringBuffer data type is similar, but thread safe (and slower)
- String operations
|operation|guarantee|extra space|
|---------|---------|-----------|
|length()|1|1|
|charAt()|1|1|
|substring()|1|1|
|concat()|N|N|
* StringBuilder data type performance
- Remark StringBuffer data type is similar, but thread safe (and slower)
- StringBuilder operations
|operation|guarantee|extra space|
|---------|---------|-----------|
|length()|1|1|
|charAt()|1|1|
|substring()|N|N|
|concat()|1*|1*|
* String vs. StringBuilder
- Q. How to efficiently reverse a string?
Quadratic time
``` java
public static String reverse(String s) {
String rev = "";
for (int i = s.length() - 1; i >= 0; i--) rev += s.charAt(i);
return rev;
}
```
Linear time
``` java
public static String reverse(String s) {
StringBuilder rev = new StringBuilder();
for (int i = s.length() - 1; i >= 0; i--) rev.append(s.charAt(i));
return rev.toString();
}
```
> Note: For `concat` string use `StringBuilder` , for `substring` use `String`
* Alphabets
- Radix. Number of digits *R* in alphabet
|name|R()|lgR()|characters|
|----|---|-----|----------|
|Binary|2|1|01|
|Octal|8|3|01234567|
|Decimal|10|4|0123456789|
|HexaDecimal|16|4|0123456789ABCED|
|DNA|4|2|ACTG|
|LowerCase|26|5|a-z|
|UpperCase|26|5|A-Z|
|Protein|20|5|A-Y|
|Base64|64|6|A-Za-z0-9+/|
|ASCII|128|7|ASCII characters|
|Extended ASCII|256|8|extended ASCII characters|
|Unicode16|65536|16|unicode characters|
## Key-indexed counting
* We stopped in sorting algorithm at Heap sort
- Lower bound ~ N lg N compares required by any compared-based algorithm
- Q. Can e do better (despite the lower bound)?
+ A. Yes if e don't depend o key compares
* Frequency of operations = key compares
* Assumption about keys
- Assumption. Keys are integers between 0 and R-1
- Implication. Can use key as an array index
- Applications
01. Sort string by first letter
02. Sort class roster by section
03. Sort phone numbers by area code
04. Subroutine in a sorting algorithm [stay tuned]
> Remark. keys may have associated data => can't just count up number of keys of each value
* Goal. Sort an array a[] of N integers between 0 and R-1
* Key-indexed counting demo
- Count frequencies of each letter using key as index
- Compute frequency cumulates which specify destination
- Access cumulates using key as index to remove items
- Copy back into original array
* Performance ~ `11 N + 4 R` and extra space `N + R`
- R is Radix from table above
`Code`
``` java
public final static void sort(char[] a) {
int N = a.length;
int R = 256;
int[] count = new int[R + 1];
char[] aux = new char[N];
// count frequencies
for (int i = 0; i < N; i++)
count[a[i] + 1]++; // offset by 1
// compute cumulates
for (int r = 0; r < R; r++)
count[r + 1] += count[r];
// move items
for (int i = 0; i < N; i++)
aux[count[a[i]]++] = a[i];
// copy back
for (int i = 0; i < N; i++)
a[i] = aux[i];
}
```
## LSD radix sort
* Least significant digit first string sort
- Consider characters from right to left
- Stably sort using *dth* character as the key (using key-indexed counting)
* Proposition. LSD sorts fixed-length strings in ascending order
- Pf. [by induction on i]
+ After pass *i*, strings are sorted by last *i* characters
. If two strings differ on sort key, key-indexed sort puts them in proper relative order
. If two strings agree on sort key, stability keeps them in proper relative order
`Code`
``` java
public static void sort(String[] a, int W) {
int R = 256;
int N = a.length;
String[] aux = new String[N];
for (int d = W - 1; d >= 0; d--) {
int[] count = new int[R + 1];
for (int i = 0; i < N; i++)
count[a[i].charAt(d) + 1]++;
for (int r = 0; r < R; r++)
count[r + 1] += count[r];
for (int i = 0; i < N; i++)
aux[count[a[i].charAt(d)]++] = a[i];
for (int i = 0; i < N; i++)
a[i] = aux[i];
}
}
```
* Problem. Sort one million 32-bit integers
- Ex. Google interview
- Which sorting method to use?
01. Insertion sort
02. Mergesort
03. Quicksort
04. Heapsort
05. LSD string sort
* Summary of the performance of sorting algorithms
|algorithm|guarantee|random|extra space|stable?|operations on keys|
|---------|---------|------|-----------|-------|------------------|
|insertion sort|N2/2|N2/4|1|yes|compareTo()|
|mergesort|N lg N|N lg N|N |yes|compareTo()|
|quicksort|1.39 N lg N|1.39 N lg N|c lg N |no|compareTo()|
|heapsort|2 N lg N|2 N lg N|1|no|compareTo()|
|LSD*|2 W N |2 W N|N + R|yes|charAt()|
## MSD radix sort
* Most significant digit first string sort
* Steps
- Partition array into R pieces according to first character (use key-indexed counting)
- Recursively sort all strings that start ith each character (key-indexed counts delineate subarrays to sort)
* Variable-length strings
- Treat strings as if they had an extra char at end (smaller than any char)
- C strings. Have extra char '\0' at end → no extra work to do
`Code`
``` java
public final static void sort(String[] a) {
String[] aux = new String[a.length];
sort(a, aux, 0, a.length - 1, 0);
}
private final static void sort(String[] a, String[] aux, int lo, int hi, int d) {
if (hi <= lo)
return;
int[] count = new int[R + 2];
for (int i = lo; i <= hi; i++) {
int c = charAt(a[i], d);
count[c + 2]++;
}
for (int r = 0; r < R + 1; r++)
count[r + 1] += count[r];
for (int i = lo; i <= hi; i++) {
int c = charAt(a[i], d);
aux[count[c + 1]++] = a[i];
}
for (int i = lo; i <= hi; i++)
a[i] = aux[i - lo];
for (int r = 0; r < R; r++)
sort(a, aux, lo + count[r], lo + count[r + 1] - 1, d + 1); // sort R subarrays recursively
}
private final static int charAt(String s, int d) {
if (d < s.length())
return s.charAt(d);
else
return -1;
}
```
* Performance
- Observation 1. Much too slow for small subarrays
+ Each function call needs its own count[] array
+ ASCII (256 counts) 100x slower than copy pass for N=2
+ Unicode (65,536 counts): 32.000x slower for N = 2
- Observation 2. Huge number of small subarrays because of recursion
- Improvement
01. Cutoff to insertion sort for small subarrays
- Number of characters examined
+ MSD examines just enough characters to sort the keys
+ Number of characters examined depends on keys
+ Can be sublinear in input size
* Disadvantages of MSD string sort
- Access memory "randomly" (cache inefficient)
- Inner loop has a lot of instructions
- Extra space for count[]
- Extra space for aux[]
* Disadvantages of quicksort
- Linearithmic number of string compares (not linear)
- Has to re-scan many characters in keys with long prefix matches
* Complexity
|algorithm|guarantee|random|extra space|stable?|operations on keys|
|---------|---------|------|-----------|-------|------------------|
|insertion sort|N2/2|N2/4|1|yes|compareTo()|
|mergesort|N lg N|N lg N|N |yes|compareTo()|
|quicksort|1.39 N lg N|1.39 N lg N|c lg N |no|compareTo()|
|heapsort|2 N lg N|2 N lg N|1|no|compareTo()|
|LSD*|2 W N|2 W N|N + R|yes|charAt()|
|MSD*|2 W N|N log2N|N + D R|yes|charAt()|
## 3-Way radix quicksort
* Combines benefits of Quicksort and MSD radix
* Overview. Do 3-way partitioning on the dth character
- Less overhead than R-way partition in MSD string sort
- Does not re-examine characters equal to the partition char (but does re-examine characters not equal to the partition char)
`Code`
``` java
public final class MSDQuickSort {
public final static void sort(String[] a) {
sort(a, 0, a.length - 1, 0);
}
private final static void sort(String[] a, int lo, int hi, int d) {
if (hi <= lo)
return;
int lt = lo, gt = hi;
int v = charAt(a[lo], d);
int i = lo + 1;
while (i <= gt) {
int t = charAt(a[i], d);
if (t < v)
swap(a, lt++, i++);
else if (t > v)
swap(a, i, gt--);
else
i++;
}
sort(a, lo, lt - 1, d);
if (v >= 0)
sort(a, lt, gt, d + 1);
sort(a, gt + 1, hi, d);
}
private final static int charAt(String s, int d) {
if (d < s.length())
return s.charAt(d);
else
return -1;
}
private static void swap(String[] a, int i, int j) {
String temp = a[i];
a[i] = a[j];
a[j] = swamp;
}
}
```
* Performance
- Uses ~ 2 N ln N character compares on average for random string
### 3-way string quicksort vs. MSD string sort
* MSD string sort
- Is cache-inefficient
- Too much memory sortig count[]
- Too much overhead reinitialiazing count[] and aux[]
* 3-way string quicksort
- Has a short inner loop
- Is cache-friendly
- Is in-place
* Complexity
|algorithm|guarantee|random|extra space|stable?|operations on keys|
|---------|---------|------|-----------|-------|------------------|
|insertion sort|N2/2|N2/4|1|yes|compareTo()|
|mergesort|N lg N|N lg N|N |yes|compareTo()|
|quicksort|1.39 N lg N|1.39 N lg N|c lg N |no|compareTo()|
|heapsort|2 N lg N|2 N lg N|1|no|compareTo()|
|LSD+|2 W N|2 W N|N + R|yes|charAt()|
|MSD+|2 W N|N log2N|N + D R|yes|charAt()|
|3-way string quicksort|1.39 W N lg N|1.39 W N lg N|log N + W|no|charAt()|
## Suffix arrays
* Keyword-in-context search
- Given a text of N characters, preprocess it to enable fast substring search (find all occurrence of query string context)
- Applications
+ Grep
+ Linguistics
+ Databases
+ Web search
+ Word processing
- Suffix sort
+ Take the input string and form suffixes
+ Sort suffixes to bring repeated substrings together
+ You can use Binary search
- Suffix-sort soltion
+ Preprocess: suffix sort the text
+ Query: binary search for query, scan until mismatch
* Longest repeated substring
- Given a string of *N* characters, find the longest repeated substring
- Applications
+ Bioinformatics
+ Cryptanalysis
+ Data compression
+ Visualize repetitions in music
- Brute-force algorithm
+ Try all indices *i* and *j* for start of position match
+ Compute longest common prefix (LCP) for each pair
+ The same solution before: Suffix-sort solution
`Code`
``` java
public String lrs(String s) {
int N = s.length();
String[] suffixes = new String[N];
for (int i = 0; i < N; i++) suffix[i] = s.substring(i, N);
Arrays.sort(suffixes);
String lrs = "";
for (int i = 0; i < N-1; i++) {
int len = lcp(suffix[i], suffix[i+1]);
if (len > lrs.length()) lrs = suffixes[i].substring(0, len);
}
return lrs;
}
```
* Sorting challenge
- Problem. Five scientists A, B, C, D, and E are looking for long repeated substring in a genome with over 1 billion nucleotides
+ A has a grad student do it by hand
+ B uses brute force (check all pairs)
+ C uses suffix sorting solution with insertion sort
+ D uses suffix sorting solution wit mergesort
+ ✔ E uses suffix sorting solutio with 3-way string quicksort (but only if LRS is not long)
- Q. Which one is more likely to lead to a cure for cancer?
- Problem. Suffix sort an arbitrary string of length N
+ Q. What is the worst-case running time of best algorithm for problem?
. ✔ Linearithmic (Manber-Myers algorithm)
. ✔ Linear (suffix trees)
---
# Search in String
* Data structure for searching in String
* Complexity for previous searching algorithms
|algorithm|search|insert|delete|ordered operations|operations on keys|
|---------|------|------|------|------------------|------------------|
|red-black BST|lg N|lg N|lg N|yes|compareAt()|
|hash table|1+|1+|1+|no|equals() hashCode()|
* Q. Can we do better?
- A. yes, if we can avoid examining the entire key, as with string sorting
* String symbol table API
``` java
public class StringST {
StringST()
void put(String key, Value val)
void get(String key)
void delete(String key)
}
```
* Goal. Faster than hashing, more flexible than BSTs
* Challenge. Efficient performance for string keys
## R-way Tries
* From re`trie`val, but pronounced "try", to distribute from binary search tree
- For now store *characters* in nodes (not key)
- Each node has *R* children, one for each possible character, where *R* is the possible of character
- Store values in nodes corresponding to last characters in keys
* Search in a trie
- Follow links corresponding to each character in the key
+ Search hit: node where search ends has a non-null value
+ Search miss: reach null link or node where search ends has null value
* Insertion into a trie
- Follow links correspoinding to each character in the key
+ Encounter a null link: create new node
+ Encounter the last character of the key: set value in that node
* Delete in an R-way tries
- Find the node corresponding to key and set value to null
- If node has null value and all null links, remove that node (and recur)
* Trie performance
- Search hit. Need to examine all *L* characters for equality
- Search miss.
+ Could have mismatch on
- first character
+ Typical case examine only a few characters
- Space. *R* null links at each leaf. (but sub-linear space possible if many short strings share common prefixes)
- Bottom line. Fast search hit and even faster search miss, but `wastes space`
* Goal. Design a data structure to perform efficient spell checking
- Solution. Build a 26-way trie (key = word, value = bit), 26 English letters
`Code`
``` java
public class TriesST {
private final static int R = 256;
private class Node {
private Object val;
private Node[] next = (Node[]) new Object[R];
}
private Node root;
public void put(String key, Value val) {
root = put(root, key, val, 0);
}
private Node put(Node node, String key, Value val, int d) {
if (node == null)
node = new Node();
if (d == key.length()) {
node.val = val;
return node;
}
char c = key.charAt(d);
node.next[c] = put(node.next[c], key, val, d + 1);
return node;
}
public boolean contains(String key) {
return get(key) != null;
}
public Value get(String key) {
Node node = get(root, key, 0);
if (node == null)
return null;
return (Value) node.val;
}
private Node get(Node node, String key, int d) {
if (node == null)
return null;
if (d == key.length())
return node;
char c = key.charAt(d);
return get(node.next[c], key, d + 1);
}
}
```
* Comparison
|algorithm|search|search miss|insert|space|
|---------|------|-----------|------|-----|
|red-black BST|L + c lg2 N|c lg2N|c lg2N|4N|
|hash table (linear probing)|L|L|L|4N to 16N|
|R-way tries|L|log*N|L|(R+1)N|
* R-way tries
- Method of choice for small R
- Too much memory for large R
* Challenge. Use less memory, e.g. 65,536-way trie for Unicode!
## Ternary search tries
* Store characters and values in nodes (not keys)
* Each node has 3 children: small(left), equal(middle), larger(right)
* Search in a TST
- Follow links corresponding to each character in the key
+ If less take left link, if greater take right link
+ If equal take the middle link and move to the next key character
- Search hit. Node where search ends has a non-nul value
- Search miss. Reach a null link or node where search node ends has null value
* 26-way trie vs. TST
- 26-way trie. 26 null links in each leaf
- TST. 3 null links in each leaf
`Code`
``` java
public class TernaryST {
private class Node {
Value val;
char c;
Node left, mid, right;
}
private Node root;
public void put(String key, Value val) {
root = put(root, key, val, 0);
}
private Node put(Node node, String key, Value val, int d) {
char c = key.charAt(d);
if (node == null) {
node = new Node();
node.c = c;
}
if (c < node.c)
node.left = put(node.left, key, val, d);
else if (c > node.c)
node.right = put(node.right, key, val, d);
else if (d < key.length() - 1)
node.mid = put(node.mid, key, val, d + 1);
else
node.val = val;
return node;
}
public boolean contains(String key) {
return get(key) != null;
}
public Value get(String key) {
Node node = get(root, key, 0);
if (node == null)
return null;
return node.val;
}
private Node get(Node node, String key, int d) {
if (node == null)
return null;
char c = key.charAt(d);
if (c < node.c)
return get(node.left, key, d);
else if (c > node.c)
return get(node.right, key, d);
else if (d < key.length() - 1)
return get(node.mid, key, d + 1);
else
return node;
}
}
```
* Comparison
|algorithm|search|search miss|insert|space|
|---------|------|-----------|------|-----|
|red-black BST|L + c lg2 N|c lg2N|c lg2N|4N|
|hash table (linear probing)|L|L|L|4N to 16N|
|R-way tries|L|logRN|L|(R+1)N|
|TST|L + ln N|ln N|L + ln N|4 N|
> Remark. can build balanced TST via rotations to achieve `L + Log N` worst-case guarantees
* String symbol table implementation cost summary
|algorithm|search|search miss|insert|space|
|---------|------|-----------|------|-----|
|red-black BST|L + c lg2 N|c lg2N|c lg2N|4N|
|hash table (linear probing)|L|L|L|4N to 16N|
|R-way tries|L|log*N|L|(R+1)N|
|TST|L + ln N|ln N|L + ln N|4 N|
|TST with R2|L + ln N|ln N|L + ln N|4 N + R2|
* Bottom line. TST is as fast as hashing (for string keys), space efficient
* TST vs. hashing
- Hashing
01. Need to examine entire key
02. Search hits and misses cost about the same
03. Performance relies on hash function
04. Does not support ordered symbol table operations
- TSTs
01. works only for strings \(or digital keys\)
02. Only examines just enough key characters
03. Search miss may involve only a fe characters
04. Supports ordered symbol table operations \(plus others\)
* Bottom line. TSTs are:
- Faster than hashing (especially for search misses)
- More flexible than red-black BSTs [stay tuned]
## Character based operations
* The string symbol table API supports several useful character-based operations
* Prefix match. keys with prefix "sh": "she", "shells" and "shore"
* wildcard match. keys that match ".he": "she" and "the"
* Longest prefix. key that is the longest prefix of "shellsort": "shells"
* Prefix matches
- Find all keys in a symbol table starting with a given prefix
- Ex. Autocomplete in a cell phone, search bar (Google search), text editor or shell
01. User types characters one at a time
02. System reports all matching strings
* T9 texting
- Goal. type text messages on a phone keypad
- Multi-tap input. Enter a letter by repeatedly pressing a key until the desired
- T9 text input
01. Find all words that correspond to given sequence of numbers
02. Press 0 to see all completion options
* Patricia trie
- Remove one-way branching
- Each node represents a sequence of characters
- Implementation: one step beyond this course
- Applications
01. Database search
02. P2P network search
03. IP routing table: find longest prefix match
04. Compressed quad-tree for N-body simulation
05. Efficiently storing and query XML documents
### String symbol tables summary
* A success stroy in algorithm design and analysis
* Red-black BST
- Performance guarantee: log N key compares
- Supports ordered symbol table API
* Hash tables
- Performance guarantee: constant number of probes
- Requires good hash function for key type
* Tries. R-way TST
- Performance guarantee: log N characters accessed
- Support character-based operations
`Bottom line. You can get at anything by examining 50-100 bits!!!!!`
---
# Substring search algorithms
* Goal. find pattern of length *M* in a text of length *N* (typically N > M)
* Applications
- Find and replace
- computer forensics. Search memory or disk for signatures, e.g. al URLs or RSA keys that the user has entered
- Identify patterns indicative of spam
01. Profits
02. Lose weight
03. herbal viagra
04. there is no catch
05. this is a one-time mailing
06. this message is sent in compliance with spam regulations
- Screen scraping. Extract relevant data from web page
01. Ex. find string delimited by and after first occurrence of pattern last trade
## Brute-force substring search
* Check for pattern starting at each text position
* Worst case ~ *M N* char compares
* Brute-force algorithm can be slow if text and pattern are repetitive
`Code`
``` java
public final static int indexOf(String text, String sub) {
int N = text.length();
int M = sub.length();
for (int i = 0; i <= N - M; i++) {
int j;
for (j = 0; j < M; j++) {
if (sub.charAt(i + j) != text.charAt(j))
break;
if (j == M)
return i; // index in text where patterns starts
}
}
return N; // not found
}
```
* Bad performance for search in repeated e.g. "AB" in "AAAAAAAAAB"
- The pattern could be so long
* Backup
- In many applications, we want to avoid `backup` in text stream, we will see in the next algorithms
01. Treat input as stream of data
02. Abstract model: standard input
- Approach 1. Maintain buffer of lat *M* characters
- Approach 2. Stay tuned
``` java
/**
* Alternative impl brute-force with explicit backup
*
* @param text String
* @param sub String
* @return int
*/
public final static int indexOfEnhanced(String text, String sub) {
int i, N = text.length();
int j, M = sub.length();
for (i = 0, j = 0; i < N && j < M; i++) {
if (text.charAt(i) == sub.charAt(j))
j++;
else { // backup
i -= j;
j = 0;
}
}
if (j == M)
return i - M;
else
return -1;
}
```
* Brute-force is not always good enough
* Theoretical challenge. Linear-time guarantee
* Practical challenge. Avoid backup in text stream
## Knuth-Morris-Pratt (KMP)
* Coolest algorithm :)
* This is the algorithm when theoretical meets practices, this discovered by two theoreticians and a hacker
* Intuition. Suppose e are search in text for pattern BAAAAAAAAA
* KMP. Clever method to `avoid backup` in brute-force substring
* Deterministic finite state automaton (DFA)
- Finite number of states (includes start and halt)
- Exactly one transition for each char in alphabet
- Accept if sequence of transitions leads to halt state
* Q. what is interpretation of DFA state after reading in txt[i]?
- A. State = number of characters in pattern that have been matched
* Key difference from brute-force implementation
- Need to pre-compute dfa[][] from pattern
- Text pointer i never decrements
* Running time
- Simulate DFA on text: at most N character accesses
- Build DFA: ho to do efficiently? [warning: tricky algorithm ahead!]
* How to build DFA from pattern?
- If in state j and next char `c != pat.charAt(j)` , then the last `j-1` characters of input are pat[1..j-1] followed by c
* Proposition. KMP substring search access no more than `M + N` chars to search for a pattern of length M in a text of length N
- Pf. each pattern char accessed once when constructing DFA, each text char accessed once (in the worst case) when simulating of the DFA
* Proposition. KMP constructs dfa[][] in time and space proportional to *R M*
- Because we have all of those mismatches
`Code`
``` java
public class KMP {
private static final int R = 256;
private int M;
private int[][] dfa;
public KMP(String pattern) {
M = pattern.length();
dfa = new int[R][M];
dfa[pattern.charAt(0)][0] = 1;
for (int i = 0, j = 1; j < M; j++) {
for (int c = 0; c < R; c++) {
dfa[c][j] = dfa[c][i]; // copy mis-match cases
}
dfa[pattern.charAt(j)][j] = j + 1; // set match case
i = dfa[pattern.charAt(j)][i]; // update restart state
}
}
public int search(String text) {
int i, j, N = text.length();
for (i = 0, j = 0; i < N && j < M; i++) {
j = dfa[text.charAt(i)][j]; // no backup
}
if (j == M)
return i - M;
else
return N;
}
}
```
## Boyer-Moore
* KMP in linear time can we improve over that? yes : D
* Rather than scan elements from left to right will do scan from right to left
, to know fast if then end pattern matches or not to avoid from 1 to M-1
* Key here: find the mismatch quickly
* Steps
- Scan characters in pattern from right to left
- Can skip as many as M text chars when finding one not in the pattern
* How much to skip?
- Pre-compute index of rightmost occurrence of character c in pattern (-1 if character not in the pattern)
* Property. substring search with the Boyer-Moore mismatched character heuristic takes about ~ `N / M` character compares to search for a pattern of length M in a text of length N
* worse-case can be as bad as ~ `M N`
* Boyer-Moore variant. Can improve worst case to ~ `3 N` by adding a KMP-like rule to guard against repetitive patterns
`Code`
``` java
public class BoyerMoore {
private static final int R = 256;
private int[] right;
private String pat;
public BoyerMoore(String pattern) {
this.pat = pat;
right = new int[R];
for (int c = 0; c < R; c++)
right[c] = -1;
for (int j = 0; j < pat.length(); j++)
right[pat.charAt(j)] = j;
}
public int indexOf(String text) {
int N = text.length();
int M = pat.length();
int skip;
for (int i = 0; i <= N - M; i += skip) {
skip = 0;
for (int j = M - 1; j >= 0; j--) {
if (pat.charAt(j) != text.charAt(i + j)) {
skip = Math.max(1, j - right[text.charAt(i + j)]);
break;
}
}
if (skip == 0)
return i;
}
return -1; // not found
}
}
```
## Rabin-Karp
* Basic idea = modular hashing
- Computer a hash of pattern characters 0 to M-1
- For each i, compute a hash of text characters i to M+i-1
- If pattern hash = text substring hash check for a match
* Efficiently computing the hash function
- Modular hash function. Using the notation *ti* for txt.charAt(i)
+ We wish to compute xi = ti RM-1* + ti+1 RM-2+...+ti+M-1 R0 (mod Q)
- Horner's method. Linear-time method to evaluate degree-M polynomial
`Code`
``` java
private long hash(String key, int M) {
long h = 0;
for (int j = 0; j < M; j++)
h = (R * h + key.charAt(j)) % Q;
return h;
}
```
* Complexity `7 N`
* Rabin-Karp search
- Advantages
+ Extends to 2d patterns
+ Extends to finding multiple patterns
- disdvantages
+ Arithmetic ops slower than char compares
+ Las Vegas version require backup
+ Poor worst-case guarantee
`Code`
``` java
/**
* Rabin-Karp substring search for operation Index of Which return the first
* occupance of substring in text
*/
public class RabinKarp {
private static final int R = 256;
private final String pattern;
private final long patternHash;
private final int M; // pattern length
private final long Q; // modulus
private long RM;
public RabinKarp(String pattern) {
this.pattern = pattern;
M = pattern.length();
Q = longRandomPrime();
// R^(M-1) % Q
RM = 1;
for (int i = 1; i <= M - 1; i++) {
RM = (R * RM) % Q;
}
patternHash = hash(pattern, M);
}
// a random 31-bit prime
private static long longRandomPrime() {
BigInteger prime = BigInteger.probablePrime(31, new Random());
return prime.longValue();
}
private long hash(String key, int M) {
long h = 0;
for (int i = 0; i < M; i++) {
h = (R * h + key.charAt(i)) % Q;
}
return h;
}
public int search(String text) {
int N = text.length();
if (N < M)
return -1;
// hash of first M characters in text
long h = hash(text, M);
if ((patternHash == h) && check(text, 0))
return 0;
// check for hash match; if hash match, check for exact match
for (int i = M; i < N; i++) {
// Remove leading digit
h = (h + Q - RM * text.charAt(i - M) % Q) % Q;
// Add trailing digit
h = (R * h + text.charAt(i)) % Q;
// Check match
int offset = i - M + 1;
if ((patternHash == h) && check(text, offset))
return offset;
}
return -1;
}
// Las Vegas version: does pattern[] match text[i..i-m+1] ?
private boolean check(String text, int i) {
for (int j = 0; j < M; j++) {
if (pattern.charAt(j) != text.charAt(i + j)) {
return false;
}
}
return true;
}
// Monte Carlo version: always return true
// private boolean check(String text, int i) {
// return true;
// }
}
```
* Compare the substring algorithms
|algorithm|version|guarantee|typical|backup in input|correct|extra space|
|---------|-------|---------|-------|---------------|-------|-----------|
|brute force|M N|1.1 N|yes|yes|1|
|knuth-Morris-Pratt (full DFA)|2 N|1.1 N|no|yes|M R|
|knuth-Morris-Pratt (mismatch transitions only)|3 N|1.1 N|no|yes|M|
|Boyer-Moore (full algorithm)|3 N|N / M|yes|yes|R|
|Boyer-Moore (mismatched char)|M N|N / M|yes|yes|R|
|Robin-Karp (monte carlo)|7 N|7 N|no|yes*|1|
|Robin-Karp (las vegas)|7 N*|7 N|yes|yes|1|
`*` probability guarantee with uniform hash function
---
# Regular expression
* Pattern matching
- Substring search. Find a string iin text
- Pattern matching. Find one of a specified set of strings in text
- Ex. [genomics]
+ Fragile X syndrome is a common cause of metntal retardation
+ Human genome contains triplet repeats of CGG or AGG bracketed by GCC at the beginning and CTG at the end
+ Pattern GCG(GGC|AGG)*CTG
* Applicatons
- Process natural language
- Scan for virus signature
- Specify a programming language
- Access information in digital libraries
- Search genome using PROSITE patterns
- Filter text (spam, NetNanny, Carnivore, malware)
- Validate data-entry fields (dates, email, URL, credit card)
- Parse text files
+ Compile a Java program
+ Crawl and index the web
+ Read in data stored in ad hoc input file format
+ Create Java docmentation from Javadoc comments
* Regular expressions
- Is a notation to specify a set of string `possible to infinite`
|operation|order|example RE|matches|does not match|
|---------|-----|----------|-------|--------------|
|concatenation|3|AABAAB|AABAAB|every other string|
|or|4|AA|BAAB|AA BAAB|every other string
|closure|2|AB*A|AA ABBBBBBBA|AB ABABA|
|parentheses|1|(AB)*A|A ABABABABA|AA ABBA|
## NFA
* Duality between REs and DFAs
- RE. Concise way to describe a set of strings
- DFA. Machine to recognize whether a given string is in a given set
- Kleene's theorem
+ For any DFA, there exists a RE that describes the same set of strings
+ For any RE, there exists a DFA that recognizes the same set of strings
* Pattern matching implementation: basic plan (first attempt)
- Overview is the same as for KMP
+ No backup in text input stream
+ Linear-time guarantee
- Underlying abstraction. Deterministic finite state automata (DFA)
- Basic plan. [apply Kleen's theorem]
+ Build DFA from RE
+ Simulate DFA with text as input
- Bad news. Basic plan is infeasible (DFA may have exponential # of states)
* Pattern matching implementation: basic plan (revised)
- Overview is the same as for KMP
+ No backup in text input stream
+ Quadratic-time guarantee (linear-time typical)
- Underlying abstraction. Nondeterministic finite state automata (NFA)
- Basic plan. [apply Kleen's theorem]
+ Build NFA from RE
+ Simulate NFA with text as input
* Nondeterministic finite-state automata
- Regular expression matching NFA
+ RE enclosed in parantheses
+ One state per RE character (start = 0, accept = M)
+ Red e-transition (change state, but don't scan text)
+ Black mtach transition (change state and scan to next text char)
+ Accept if any sequence of transitions ends in accept state
### NFA simulation
* NFA representation
- State names. Integer from 0 to M
- Match-transitions. Keep regular expression in array re[]
- e-transitions. Store in digraph G
* Q. How to efficiently simulte an NFA?
- A. Maintain set of all possible states that NFA could be in after reading the first *i* text characters
### NFA constuction
* Sates. Incluse a state for each symbol in the RE, plus an accept state
* Proposition. Building the NFA corresponding to an *M*-character RE takes time and space proportional to M
- Pf. For each of the *M* characters in the RE, we add at most three e-transitions and execute at most two stack operations
`Code`
``` java
public class NFA {
private char[] re;
private Digraph graph;
private int M;
private String regexp;
public NFA(String regexp) {
this.regexp = regexp;
this.re = regexp.toCharArray();
this.M = regexp.length();
this.graph = buildEpsilonTransitionDigraph();
}
private Digraph buildEpsilonTransitionDigraph() {
Digraph g = new Digraph(M + 1);
Stack ops = new ArrayStack<>();
for (int i = 0; i < M; i++) {
int lp = i;
if (regexp.charAt(i) == '(' || regexp.charAt(i) == '|')
ops.push(i);
else if (regexp.charAt(i) == ')') {
int or = ops.pop();
if (regexp.charAt(or) == '|') {
lp = ops.pop();
g.addEdge(lp, or + 1);
g.addEdge(or, i);
} else
lp = or;
}
if (i < M - 1 && regexp.charAt(i + 1) == '*') {
g.addEdge(lp, i + 1);
g.addEdge(i + 1, lp);
}
if (regexp.charAt(i) == '(' || regexp.charAt(i) == '*' || regexp.charAt(i) == ')')
g.addEdge(i, i + 1);
}
return g;
}
public boolean recognize(String txt) {
Bag pc = new Bag<>();
DirectedDFS dfs = new DirectedDFS(graph, 0);
for (int v = 0; v < graph.getVertices(); v++) {
if (dfs.visited(v))
pc.add(v);
}
for (int i = 0; i < txt.length(); i++) {
Bag match = new Bag<>();
for (int v : pc) {
if (v == M)
continue;
if ((re[v] == txt.charAt(i)) || re[v] == '.')
match.add(v + 1);
}
dfs = new DirectedDFS(graph, match);
pc = new Bag<>();
for (int v = 0; v < graph.getVertices(); v++) {
if (dfs.visited(v))
pc.add(v);
}
}
for (int v : pc)
if (v == M)
return true;
return false;
}
}
```
### Applications
* Grep. Take a RE as a command-line argument and print the lines from standard input having some substring that is matched by the RE
- Bottom-line. Worst-case for grep (proportional to M N) is the same as for brute-force substring search
- Typical grep application: Crossword puzzles
`Code`
``` java
public class Grep {
public static void main(String[] args) throws FileNotFoundException {
String regexp = "(.*" + args[1] + ".*)";
NFA nfa = new NFA(regexp);
Scanner scanner = new Scanner(new File(args[0]));
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
if (nfa.recognize(line))
System.out.println(line);
}
}
}
```
* To complete the implementation
- Add wildcard
- Add multiway or
- Handle metacharacters
- Support character classes
- Add capturing capabilities
- Extend the closure operator
- Error checking and recovery
- Greedy vs. reluctant matching
* Broadly applicables programmer's tool
- Originated in Unix in the 1970s
- Many languages support extended regular expression
- Built into grep, awk, emacs, Perl, PHP, Python, Javascript
### Context
* Abstract machines, languages, and nondeterminism
- Basis of the theory of computation
- Intensively studied since the 1930s
- Basis of programming language
* Compiler. A program that translates a program to machine code
- KMP string → DFA
- grep RE → NFA
- javac Java language → Java byte code
|operation|KMP|grep|Java|
|---------|---|----|----|
|Pattern|string|RE|program|
|compiler output|DFA|NFA|byte code|
|simulator|DFA simulator|NFA-simulator|JVM|
---
# Data compression
## Introduction
* Compression reduces size of a file
- To save `space` when sorting it
- To save `time` when transmitting it
- Most files have lots of redundancy
* Who needs compressions?
- Moore's law: # transistors on a chip doubles every 18-24 months
- Parkinsons' law: data expands to fill space available
- Text, image, sound, video, ....
* Applications
- Generic file compression
01. Files: GZIP, BZIP, 7z
02. Archives: RKZIP
03. File systems: NTFS, HFS+, ZFS
- Multimedia
01. Images: GIF, JPEG
02. Sound: MP3
03. Video: MPED, DivX, HDTV
- Communication
01. ITU-T T4 Group 3 fax
02. Skype
- Databases
01. Google
02. Facebook
* Lossless compression and expansion
- Message. Binary data B we want to compress
- Compress. Generates a "compressed" representation C(B)
- Expand. Reconstructs original bit-stream B
- Compression ration. Bits in `C(B) / bits in B`
+ Ex. 50-75% or better compression ration for natural language
* Food for thought
- Data compression has been omnipresent since antiquilty
+ Number systems
+ Natural languages
+ Mathematical notation
- has played a central role in communication technology
+ Grade 2 Braille
+ Morse code
+ Telephone system
- And is part of modern life
+ MP3
+ MPEG
* Another application. Data representation: genomic code
- Genome. String over the alphabet { A, C, T, G }
+ Goal. Encode an *N*-character genome: ATAGATGCATِG...
+ Standard ASCII encoding
. 8 bits per chars
. 8 N its
. because they are 4 characters you can use 2 bits to store them
- Fixed-length code. A-bit code support alphabet of size 2k
- Amazing but true. Initial genomic databases in 1990s used ASCII
## Run length coding
* Simple type of redundancy in a bitstream. Long runs of repeated bits
*
- This string
```
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
```
- Representation. 4-bit counts to represent alternating runs of 0s and 1s: 15 0s, then7 1s, then 7 0s, then 11 1a
- Q. How many bits to store the counts?
+ A. We'll use 8 (but 4 in the example above)
- Q. What to do when run length exceeds max count?
+ A. if longer than 55, intersperse runs on length -
- Applications. JPEG, ITU-T T$ Group 3 fax, ....
## Huffman compression
* Variable-length codes
* Use different number of bits to encode different chars
* Ex. Morse code: `... - - - . . .`
- Issue. Ambiguity
+ SOS ?
+ V7 ?
+ IAMIE ?
+ EEWNI ?
- In practice. Use a medium gap to separate codewords
* Q. How do we avoid ambiguity?
- A. Ensure that no codeword is a `prefix` of another
+ Ex 1. Fixed-length code
+ Ex 2. General prefix-free code
+ Ex 3. General prefix-free code
* Q. How to represent the prefix-free code?
- A. binary trie!
+ Chars in leaves
+ Codeword is path from root to leaf
* Compression
- Method 1: start at leaf, follow path up to the root, print bits in reverse
- Method 2: creates ST of key-value pairs
* Expansion
- Start at root
- Go left if bit is 0, go right if 1
- If leaf node, print char and return to root
* Q. How to write the trie?
- A. Write pre-order traversal of trie, mark leaf and internal nodes with a bit
* How to find best prefix-free code?
- Shannon-Fano algorithm
+ Partition symbols S into two subsets S0 and S1 of (roughly) equal freq
+ Codewords for symbols in S0 start with 0, for symbols in S1 start with 1
+ Recur in S0 and S1
- Problem 1. How to divide up symbols?
- Problem 2. Not optimal!
* Huffman algorithm demo
- Start with one node corresponding to each character wit weight equal to frequency
- Select two tries with min weight
- Merge into single trie with cumulative weight
* Applications
- JPEG
- PDF adobe
- MP3
- Divx MPEG4
- Gzip
* Huffman codes
- Q. How to find best prefix-free code?
. Huffman algorithm:
01. Count frequency freq\[i\] for each char i in input
02. Start with one node corresponding to each char i \(with weight freq\[i\]\)
03. Repeat until single trie formed
Select two tries with min weight freq[i] and freq[j]
Merge into single trie with freq[i] + freq[j]
* Running time. Use a binary heap `N + R log R`
---
## LZW compression
* Static model. Same model for all texts
- Fast
- Not optimal: different texts have different statistical properties
- Ex: ASCII, Morse code
* Dynamic model. Generate model based on text
- Preliminary pass needed to generate model
- Must transmit the model
- Ex: Huffman code
* Adaptive model. Progressively learn and update model as you read text
- More accurate modeling produces better compression
- Decoding must start from beginning
- EX: LZW
* Steps
01. Create ST associating W-bit codewords with string keys
02. Initialize ST with codewords for single-char keys
03. Find longest string s in ST that is a prefix of un-scanned part of input
04. Write the w-bit codeword associated with s
05. Add s\+c to ST, where c is next char in the input
* Q. How to represent LZW compression code table?
- A. A trie to support longest prefix match
* How big to make ST?
- How long is message?
- While message similar model
- [many variations have been developed]
* What to do when ST fills up?
- Throw away and start over [GIF]
- Throw away when not effective [Unix compression]
- [many other variations]
* LZW in the real world
- Lempel-Ziv and friends
. LZ77
. LZ78
. LZW
. Deflate / zlib = LZ77 variant + huffman
---
## Data compression summary
* Lossless compression
- Represent fixed-length symbols with variable-length codes [huffman]
- Represent variable-length symbols with fixed-length codes [LZW]
* Lossy compression [not covered in this course]
- JPEG, MPEG, MP3
- FFT, wavelets, fractals
---
# Overview: introduction to advanced topics
* Reduction: design algorithms, establish lower bounds, classify problems
* Linear programming: the ultimate practical problem-solving model
* Intractability: problems beyond our reach
* Shifting gears
- From individual problems to problem-solving models
- From linear/quadratic to polynomial/exponential scale
- From details of implementation to conceptual framework
* Goal
- Place algorithms we've studied in a larger context
- Introduce you to important and essential ideas
- Inspire you to learn more about algorithms :smile: !
---
## Reductions
* Classify problems according to computational requirements
|complexity|order of growth|examples|
|----------|---------------|--------|
|linear|N|min, max, median, burrows-wheeler transform|
|linearithmic|N log N|sorting, convex hull, closest pair, farthest pair|
|quadratic|N2|?|
|.....|.....|.....|
|exponential|cN|?|
* Frustrating news. Huge number of problems have defined classification
* Suppose we could (could not) solve problem X efficiently
what else could (could not) we solve efficiently?
* Definition
- Problem X reduces to problem Y if you can use an algorithm that solves Y to help solve X
- Cost of solving X = total cost of solving Y + cost of reduction
- Ex 1. [finding the median reduces to sorting]
. To find median of N items
01. Sort N items
02. Return item in the middle
. Cost of solving finding the median `N log N + 1`
- Ex 2. [element distinctness reduces to sorting]
. To solve elements distinctness on N times:
01. Sort N items
02. Check adjacent paris for equality
. Cost of solving finding the median `N log N + 1`
---
## Design algorithms
* Def. Problem X reduces to problem Y if you can use an algorithm that solves Y to help solve X
* Design algorithm. Given algorithm for Y, can also solve X
* Ex.
- 3-collinear reduces to sorting
- Finding the median reduces to sorting
- Element distinctness reduces to sorting
- CPM reduces to topological sort
- Arbitrage reduces to shortest paths
- Burrows-wheeler transform reduces to suffix sort
* Mentality. Since I know how to solve Y, can I use that algorithm to solve X
---
## Establish lower bound
* Goal. prove that a problem require a certain number of steps
* Ex. In decision tree model, any compare-based sorting algorithm requires theta N log N in the worst case
* BAd news. very difficult to establish lower bounds from scratch
* Good news. spreed theta N log N lower bound to Y by reducing sorting to Y
* Linear-time reductions
- Def. problem X `linear-time reduces` to problem Y if x can be solved with:
. Linear number of standard computational steps
. Constant number of calls to Y
- Ex. almost all of the reductions we've seen so far
- Establish lower bound
. If x takes theta N log N steps, then so does Y
. If x takes theta N2, then so does Y
- Mentality
. If I could easily solve Y, then I could easily solve X
. I can't easily solve X
. Therefore, I can't easily solve Y
---
## Classify problems
* Desiderata. Problem with algorithm that matches lower bound
- Ex. sorting and convex hull have complexity of N log N
* Desiderata. prove that two problems X and Y have the same complexity
- First, show that problem X linear-time reduces to Y
- Second, show that Y linear-time reduces to X
- Concludes that X and Y have same complexity
* Integer arithmetic reductions
- Integer multiplication. given two N-bit integers, compute their product
- Brute force. N2 bit operations
- Q. Is brute-force algorithm optimal?
[open-source-img]:https://badges.frapsoft.com/os/v1/open-source.svg?v=10
[alg-img]:https://img.shields.io/static/v1?label=Topic&message=Algorithms&color=orange&style=flat
[datastr-img]:https://img.shields.io/static/v1?label=Topic&message=Data-Structures&color=blue&style=flat
[rep-url]: https://github.com/ahmadmoawad/algorithms