https://github.com/0xh3xa/algorithms


# Algorithms and Data Structures

[![Algorithms][alg-img]][rep-url] [![Data Structures][datastr-img]][rep-url] [![Open Source Love][open-source-img]][rep-url]

Java implementations of algorithms and data structures from the book `Algorithms 4th edition` :book:

# What is this course?

* Intermediate level survey course
* Programming and problem solving with applications

# Definitions

* `Algorithms`: Method for solving a problem
* `Data structures`: Method to store information
* `Program = Algorithms + Data structures`

# Topics

* `Part I`

01. Data types: stack, queue, bag, union find, priority queue
02. Sorting: quicksort, mergesort, heapsort, radix sorts
03. Searching: BST, red-black BST, hash table

* `Part II`

04. Graphs: BFS, DFS, Prim, Kruskal, Dijkstra
05. Strings: KMP, regular expression, TST, Huffman, LZW
06. Advanced: B-tree, suffix array, maxflow

---

# Why are algorithms so important?

Algorithms are all around us

01. Internet: web search, packet routing, distributed file sharing, ...
02. Biology: human genome project, protein folding, ...
03. Computers: circuit layout, file system, compilers, ...
04. Computer graphics: Movies, video games, virtual reality, ...
05. Security: cell phones, e-commerce, voting machines, ...
06. Multimedia: MP3, JPG, DivX, HDTV, face recognition, ...
07. Social networks: recommendations, news feeds, advertisements, ...
08. Physics: N-body simulation, particle collision simulation, ...

# Steps for solving the problem

01. Model the problem
02. Find an algorithm to solve it
03. Fast enough? Fits in memory?
04. If not, figure out why
05. Find a way to address the problem
06. Iterate until satisfied

---

# Analysis of Algorithms

## Reasons to Analyze Algorithms

* Predict performance
* Compare algorithms
* Provide guarantees
* Understand theoretical basis

## The scientific method

* Observe some feature in the natural world
* Hypothesize a model that is consistent with the observations
* Predict events using the hypothesis
* Verify the predictions by making further observations
* Validate by repeating until the hypothesis and observations agree

## Principles

* Experiments must be reproducible.
* Hypotheses must be falsifiable.

## Empirical Analysis

* Manual measurement (benchmarking) with a stopwatch or programmatic timing method.
* Measure the running time for different input sizes *N* (e.g. doubling the input size each time) and observe the relationship between the running times.

## Data Analysis

* Plot running time T(N) vs input size *N*.
* Plot lg(T(N)) vs lg(N) (a log-log plot). The points often fall on a straight line, and the slope of that line is the exponent of *N*.
* Regression: fit a power law *T(N) = a × N<sup>b</sup>*
* Once you have the power *b* from the slope of the log-log plot, solve for *a* in the equation *T(N) = a × N<sup>b</sup>*

## Doubling Hypothesis

* Run the program, doubling the size of the input each time, and observe the ratios of the running times. Observe the value the ratio converges to; do not take the average!

| N | T(N) | Ratio | lg(Ratio) |
| ---- | ---- | ----- | -------- |
| 1000 | 0.1 | - | - |
| 2000 | 0.8 | 7.7 | 2.9 |
| 4000 | 6.4 | 8 | 3 |
| ... | ... | ... | ... |

* Hypothesis: Running time is about *a × N<sup>b</sup>*, where *b = lg(Ratio)*
* Caveat: Cannot identify logarithmic factors with the doubling hypothesis.
* Calculate *a* by solving *T(N) = a × N<sup>b</sup>* for *a*, with all other variables now known.
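
The arithmetic of the doubling hypothesis can be sketched in a few lines of Java. This is a minimal illustration, not code from the repository: the class `DoublingEstimate` and its method names are assumptions, and the sample values are the ones from the table above.

```java
final class DoublingEstimate {

    /** Exponent b in T(N) ~ a * N^b, estimated from the running times at N and 2N. */
    static double exponent(double timeAtN, double timeAt2N) {
        return Math.log(timeAt2N / timeAtN) / Math.log(2); // lg of the ratio
    }

    /** Constant a, obtained by solving T(N) = a * N^b for a. */
    static double constant(double n, double timeAtN, double b) {
        return timeAtN / Math.pow(n, b);
    }

    public static void main(String[] args) {
        double b = exponent(0.8, 6.4);      // running times at N = 2000 and N = 4000
        double a = constant(4000, 6.4, b);
        System.out.printf("T(N) is about %.3e * N^%.2f%n", a, b);
    }
}
```

In a real experiment the ratio would come from timing the program itself, and only the last few ratios (where the value has converged) should be used.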

## Experimental algorithmics

* System independent effects (determine constant *a* and exponent *b* in the power law)

  + Algorithm
  + Input data

* System dependent effects (contribute only to constant *a* in the power law)

  + Hardware: CPU, memory, cache
  + Software: compiler, interpreter, garbage collector
  + System: OS, network, other applications

## Mathematical Models

* Analyze individual operations to determine complexity

* Simplification 1

Count only the most expensive operations, i.e. those that take the most time or where cost × frequency is highest.

* Simplification 2

Ignore lower-order terms. For example, in *5 N<sup>3</sup> + 20 N + 16*, ignore the term *20 N*
and the constant 16 (which is *16 N<sup>0</sup>*) because they are insignificant
compared with the highest-order term. We use *tilde notation* `~` to write
*5 N<sup>3</sup> + 20 N + 16 __~ 5 N<sup>3</sup>__*. Technically, *f(N) ~ g(N)* means that
*f(N) / g(N)* approaches 1 as *N* goes to infinity, because the lower-order terms
become insignificant.

## Order-of-Growth Classifications

- Most algorithms are described by one of the following order-of-growth functions
  + 1 (constant)
  + log N (logarithmic)
  + N (linear)
  + N log N (linearithmic)
  + N<sup>2</sup> (quadratic)
  + N<sup>3</sup> (cubic)
  + 2<sup>N</sup> (exponential)

> Note: lg N means log<sub>2</sub> N

*(Figure: plot of the order-of-growth functions)*

* We say the running time of the algorithm "is proportional to" one of these functions, e.g. constant time

---

# Properties

01. Reflexive: p is connected to p
02. Symmetric: if p is connected to q, then q is connected to p
03. Transitive: if p is connected to q and q is connected to r, then p is connected to r

# Dynamic connectivity

Applications based on this:

01. Pixels in a digital photo
02. Computers in a network
03. Friends in a social network
04. Transistors in a computer chip
05. Elements in a mathematical set
06. Variable names in a Fortran program
07. Metallic sites in a composite system

## Quick find (Eager approach)

* Data structure
  - Integer array `id[]` of size N
  - Interpretation: p and q are connected if and only if (iff) they have the same id
* Complexity

| Initialize | Union | Find |
|------------|-------|------|
| N | N | 1 |

* Defect

  - Union too expensive (N array accesses)
  - N union commands on N objects take quadratic time, O(N<sup>2</sup>)

* Note: Quadratic algorithms do not scale, so they are unacceptable for large problems
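
A minimal sketch of the eager approach, assuming integer sites `0..N-1` (the class name `QuickFindUF` is an assumption, not necessarily the name used in the repository). It shows why the connectivity check is a single comparison while `union` has to scan the whole array:

```java
final class QuickFindUF {
    private final int[] id; // id[i] = component identifier of site i

    QuickFindUF(int n) {
        id = new int[n];
        for (int i = 0; i < n; i++) {
            id[i] = i; // every site starts in its own component
        }
    }

    /** Constant time: two sites are connected iff they share an id. */
    boolean connected(int p, int q) {
        return id[p] == id[q];
    }

    /** Linear time: every entry equal to id[p] is rewritten to id[q]. */
    void union(int p, int q) {
        int pid = id[p];
        int qid = id[q];
        if (pid == qid) {
            return;
        }
        for (int i = 0; i < id.length; i++) {
            if (id[i] == pid) {
                id[i] = qid;
            }
        }
    }
}
```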

### Rough standards (for now)

* 10<sup>9</sup> operations per second
* 10<sup>9</sup> words of main memory
* Touch all words in approximately 1 second

### E.g. a huge problem for quick find

* 10<sup>9</sup> union commands on 10<sup>9</sup> objects
* Quick find takes more than 10<sup>18</sup> operations
* 30+ years of computer time!

## Quick Union

* To merge the components containing p and q, set the id of p's root to the id of q's root
* Complexity

| Initialize | Union | Find |
|------------|-------|------|
| N | N † | N |

† includes the cost of finding roots

* Defect

- Trees can get tall
- Find too expensive (could be N array accesses)

## Quick Union improvement

01. Weighted quick union
    * Modify quick union to avoid tall trees
    * Balance by linking the root of the smaller tree to the root of the larger tree
    * Depth of any node x is at most `lg N`
    * Complexity

    | Initialize | Union | Find |
    |------------|-------|------|
    | N | lg N † | lg N |

    † includes the cost of finding roots

02. Quick union with path compression: `N + M lg N`

03. Weighted quick union with path compression: `N + M lg* N`

> Note: the bounds in 02 and 03 are worst-case times for M union-find operations on N objects. WQUPC reduces the time from 30 years to 6 seconds
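
A minimal sketch of weighted quick union with path compression. The class name `WeightedQuickUnionUF` is an assumption, and the compression used here is the one-pass "path halving" variant (each visited node is pointed at its grandparent), which is one common way to implement the idea rather than the only one:

```java
final class WeightedQuickUnionUF {
    private final int[] parent; // parent[i] = parent of i in the forest
    private final int[] size;   // size[r]  = number of sites in the tree rooted at r

    WeightedQuickUnionUF(int n) {
        parent = new int[n];
        size = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
            size[i] = 1;
        }
    }

    /** Follows parent links to the root, halving the path on the way up. */
    int find(int p) {
        while (p != parent[p]) {
            parent[p] = parent[parent[p]]; // point p at its grandparent
            p = parent[p];
        }
        return p;
    }

    boolean connected(int p, int q) {
        return find(p) == find(q);
    }

    /** Links the root of the smaller tree to the root of the larger tree. */
    void union(int p, int q) {
        int rootP = find(p);
        int rootQ = find(q);
        if (rootP == rootQ) {
            return;
        }
        if (size[rootP] < size[rootQ]) {
            parent[rootP] = rootQ;
            size[rootQ] += size[rootP];
        } else {
            parent[rootQ] = rootP;
            size[rootP] += size[rootQ];
        }
    }
}
```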

## Union find Applications

01. Percolation
02. Games (Go, Hex)
03. Dynamic connectivity
04. Least common ancestor
05. Hoshen-Kopelman algorithm in physics

## Percolation

A model for many physical systems

* N-by-N grid of sites
* Each site is open with probability p (or blocked with probability 1-p)
* System percolates iff top and bottom are connected by open sites
* Applications in real life

| Model | System | Vacant site | Occupied site | Percolates |
| -------| -------|-------------|-------------|----------|
| electricity | material| conductor | insulated | conducts |
| fluid flow | material| empty | blocked | porous |
| social interaction | population | person | empty | communicates |
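
The percolation check maps directly onto union-find. The sketch below is an illustration under stated assumptions, not the repository's code: the class name `PercolationSketch`, the 0-based `(row, col)` indexing, and the two virtual sites (one joined to every open site in the top row, one to every open site in the bottom row) are all choices made for this example, and the embedded union-find is unweighted to keep it short.

```java
final class PercolationSketch {
    private final int n;
    private final boolean[] open;
    private final int[] parent; // union-find forest over n*n sites plus 2 virtual sites
    private final int top;      // virtual site joined to the open sites of the top row
    private final int bottom;   // virtual site joined to the open sites of the bottom row

    PercolationSketch(int n) {
        this.n = n;
        open = new boolean[n * n];
        parent = new int[n * n + 2];
        for (int i = 0; i < parent.length; i++) {
            parent[i] = i;
        }
        top = n * n;
        bottom = n * n + 1;
    }

    private int find(int p) {
        while (p != parent[p]) {
            parent[p] = parent[parent[p]]; // path halving
            p = parent[p];
        }
        return p;
    }

    private void union(int p, int q) {
        parent[find(p)] = find(q);
    }

    /** Opens the site and joins it to its open neighbours (up, down, left, right). */
    void open(int row, int col) {
        int site = row * n + col;
        if (open[site]) {
            return;
        }
        open[site] = true;
        if (row == 0) union(site, top);
        if (row == n - 1) union(site, bottom);
        if (row > 0 && open[site - n]) union(site, site - n);
        if (row < n - 1 && open[site + n]) union(site, site + n);
        if (col > 0 && open[site - 1]) union(site, site - 1);
        if (col < n - 1 && open[site + 1]) union(site, site + 1);
    }

    /** The system percolates iff the two virtual sites are connected. */
    boolean percolates() {
        return find(top) == find(bottom);
    }
}
```

With the virtual sites, `percolates()` is a single connectivity query instead of a check of every top-row site against every bottom-row site.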

---

# Data structures Design

* It is good practice to separate the outside world (the client) from the internal implementation with an abstraction. In Java we use an interface for this

  - Benefits
    + The client does not know (and cannot depend on) the details of the implementation
    + The implementation does not know the details of the client's needs
    + Design: creates modular, reusable libraries
    + Performance: use an optimized implementation where it matters

* Client: program using the operations defined in the interface

* Implementation: actual code implementing the operations

* Interface: description of the data type and its basic operations

## Stack

* `LIFO` (last in first out), useful in many applications
* Operations: `push(Item item), pop(), size(), isEmpty()`

* There are two implementations of a stack, using a `LinkedList` or an `Array`

* Stack removes the item most recently added
* What are the differences between the LinkedList and Array implementations?

  - LinkedList: uses extra space for the links
  - Array: resizing (growing or shrinking) the array occasionally takes some time

* When should I use the LinkedList or the Array implementation?

  - Use the `LinkedList` implementation if every single operation must be fast and you cannot afford to lose any input (e.g. when handling internet packets). Use the `Array` implementation if you care about `memory` space

* How does the array grow and shrink?

  - `Grow`: when the array is 100% full, double it: `resize(arr.length * 2)`
  - `Shrink`: when the array is one-quarter full, halve it: `resize(arr.length / 2)`

* Stack applications

- Parsing in a compiler
- Java virtual machine
- Undo in word processor
- Back button in a web browser
- Implementing function calls in a compiler
- Arithmetic expression evaluation
- Reverse objects

* `LinkedList` implementation code

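``` java
import java.util.NoSuchElementException;

// Stack<Item> is the repository's own interface (push, pop, size, isEmpty)
public class LinkedStack<Item> implements Stack<Item> {

    // singly linked list node
    private class NodeList {
        Item item;
        NodeList next;

        public NodeList(Item item) {
            this.item = item;
        }
    }

    private NodeList first = null; // top of the stack
    private int size = 0;

    public boolean isEmpty() {
        return first == null;
    }

    public int size() {
        return size;
    }

    // insert the new node at the front of the list
    public void push(Item item) {
        NodeList oldFirst = first;
        first = new NodeList(item);
        first.next = oldFirst; // null when the stack was empty
        size++;
    }

    // remove and return the item at the front of the list
    public Item pop() {
        if (isEmpty()) {
            throw new NoSuchElementException("stack underflow");
        }
        Item item = first.item;
        first = first.next;
        size--;
        return item;
    }
}
```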
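
For comparison, the array implementation with the doubling and halving strategy described above can be sketched as follows. This is a hedged illustration: the class name `ResizingArrayStack` and the `capacity()` accessor (added only so the resizing can be observed) are assumptions, not the repository's code.

```java
import java.util.NoSuchElementException;

final class ResizingArrayStack<Item> {
    private Item[] items;
    private int size = 0;

    @SuppressWarnings("unchecked")
    ResizingArrayStack() {
        items = (Item[]) new Object[1]; // generic array creation needs a cast
    }

    boolean isEmpty() {
        return size == 0;
    }

    int size() {
        return size;
    }

    /** Length of the underlying array; exposed only for illustration. */
    int capacity() {
        return items.length;
    }

    void push(Item item) {
        if (size == items.length) {
            resize(2 * items.length); // grow: array is 100% full
        }
        items[size++] = item;
    }

    Item pop() {
        if (isEmpty()) {
            throw new NoSuchElementException("stack underflow");
        }
        Item item = items[--size];
        items[size] = null; // avoid loitering
        if (size > 0 && size == items.length / 4) {
            resize(items.length / 2); // shrink: array is one-quarter full
        }
        return item;
    }

    private void resize(int capacity) {
        @SuppressWarnings("unchecked")
        Item[] copy = (Item[]) new Object[capacity];
        for (int i = 0; i < size; i++) {
            copy[i] = items[i];
        }
        items = copy;
    }
}
```

Shrinking at one-quarter full rather than one-half full avoids repeated resizing when a client alternates `push` and `pop` right at a boundary.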

---

## Queue

* `FIFO` (first in first out), useful in many applications
* Operations: `enqueue(Item item), dequeue(), size(), isEmpty()`

* Queue removes the item least recently added
* There are two implementations of a queue, using a `LinkedList` or an `Array`

* Queue applications

- CPU scheduling
- Disk scheduling
- Data transfer asynchronously between two processes. Queue is used for synchronization.
- Breadth First search in a Graph
- Call Center phone systems

* `LinkedList` implementation code

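``` java
import java.util.NoSuchElementException;

// Queue<Item> is the repository's own interface, not java.util.Queue
public class LinkedQueue<Item> implements Queue<Item> {

    // singly linked list node
    private class NodeList {
        Item item;
        NodeList next;

        public NodeList(Item item) {
            this.item = item;
        }
    }

    private NodeList first; // least recently added node (dequeue end)
    private NodeList last;  // most recently added node (enqueue end)
    private int size;

    public LinkedQueue() {
        first = null;
        last = null;
        size = 0;
    }

    public boolean isEmpty() {
        return first == null;
    }

    public int size() {
        return size;
    }

    // add the item to the end of the list
    public void enqueue(Item item) {
        NodeList oldLast = last;
        last = new NodeList(item);
        last.next = null;
        if (isEmpty()) {
            first = last;
        } else {
            oldLast.next = last;
        }
        size++;
    }

    // remove and return the item at the front of the list
    public Item dequeue() {
        if (isEmpty()) {
            throw new NoSuchElementException("Queue underflow");
        }
        Item item = first.item;
        first = first.next;
        if (isEmpty()) {
            last = null; // avoid loitering once the queue is empty
        }
        size--;
        return item;
    }

    // return the item at the front of the list without removing it
    public Item peek() {
        if (isEmpty()) {
            throw new NoSuchElementException("Queue underflow");
        }
        return first.item;
    }
}
```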

---

# Elementary sorts

* Rearrange an array of N items into ascending/descending order based on a key

- Selection sort
- Insertion sort
- Shell sort
- Heap sort
- Quick sort

* Implementation: Java provides the `Comparable` and `Comparator` interfaces. The sort implementations use one of them so that they can sort any generic data type

* `compareTo()` returns a negative integer, zero, or a positive integer (by convention -1, 0, 1) and throws an exception if the types are incompatible or either one is null

  + V less than W (return -1)
  + V equal to W (return 0)
  + V greater than W (return 1)

* Total order

+ `Antisymmetry` : if v<=w and w<=v, then v=w
+ `Transitivity` : if v<=w and w<=x, then v<=x
+ `Totality` : either v<=w or w<=v or both
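
The sort code in the following sections calls two helpers, `less()` and `swap()`, that are not shown in these notes. The sketch below shows one plausible form of them, together with a small `Comparable` type that defines a total order. The names `Version` and `SortHelpers` are assumptions made for this example; the repository may define its helpers differently.

```java
/** A key type with a total order: compare by major number, then by minor number. */
final class Version implements Comparable<Version> {
    final int major;
    final int minor;

    Version(int major, int minor) {
        this.major = major;
        this.minor = minor;
    }

    @Override
    public int compareTo(Version that) {
        if (this.major != that.major) {
            return Integer.compare(this.major, that.major);
        }
        return Integer.compare(this.minor, that.minor);
    }
}

final class SortHelpers {

    /** Is v strictly less than w? */
    static <Item extends Comparable<Item>> boolean less(Item v, Item w) {
        return v.compareTo(w) < 0;
    }

    /** Exchanges arr[i] and arr[j]. */
    static <Item> void swap(Item[] arr, int i, int j) {
        Item tmp = arr[i];
        arr[i] = arr[j];
        arr[j] = tmp;
    }
}
```

Because the sorts touch the data only through `less()` and `swap()`, the same sort code works for any type that implements `Comparable`.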

## Selection sort

* Scan from left to right
* In iteration i, find the index `min` of the smallest remaining entry, then swap `a[i]` and `a[min]`
* Time complexity `O(N^2)`. The running time is insensitive to the input: it is quadratic even if the array is already sorted

`Code`

``` java
public static <Item extends Comparable<Item>> void sort(Item[] arr) {
    int N = arr.length;
    for (int i = 0; i < N; i++) {
        int min = i; // index of the smallest entry in arr[i..N-1]
        for (int j = i + 1; j < N; j++) {
            if (less(arr[j], arr[min])) {
                min = j;
            }
        }
        swap(arr, i, min);
    }
}
```

## Insertion sort

* Scan from left to right
* In iteration i, swap `a[i]` with each larger entry to its left
* Time complexity `O(N^2)` in the average and worst case

* Fast when the array is partially sorted: `O(N)`

* An array is called partially sorted when the number of inversions (pairs of entries that are out of order) is at most cN for some constant c

`Code`

``` java
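public static <Item extends Comparable<Item>> void sort(Item[] arr) {
    int N = arr.length;
    for (int i = 1; i < N; i++) {
        // move arr[i] left until it is no longer smaller than its left neighbour
        for (int j = i; j > 0 && less(arr[j], arr[j - 1]); j--) {
            swap(arr, j, j - 1);
        }
    }
}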
```

## Shell sort

* Move entries more than one position at a time by `h-sorting` the array
* Which `h` values to use? Knuth suggests the `3x+1` sequence: 1, 4, 13, 40, ...

* Worst-case complexity with the `3x+1` increments: `O(N^(3/2))`

`Code`

``` java
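public static <Item extends Comparable<Item>> void sort(Item[] arr) {
    int N = arr.length;

    // 3x+1 increment sequence: 1, 4, 13, 40, ...
    int h = 1;
    while (h < N / 3)
        h = 3 * h + 1;

    while (h >= 1) {
        // h-sort the array (insertion sort with stride h)
        for (int i = h; i < N; i++) {
            for (int j = i; j >= h && less(arr[j], arr[j - h]); j -= h) {
                swap(arr, j, j - h);
            }
        }
        h /= 3; // move to the next smaller increment
    }
}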
```

* Why is Shell sort useful in practice?

01. Fast unless the array size is huge
02. Tiny code footprint (used in some embedded systems)
03. Hardware sort prototype

## Shuffle sort

* Shuffle by sorting
  - Generate a random real number for each array entry
  - Sort the array, using the random numbers as keys
* Knuth shuffle

  - In iteration i, pick an integer r between 0 and i uniformly at random
  - Swap `a[i]` and `a[r]`
  - Complexity: `O(N)`

`Code`

``` java
// requires: import java.util.Random;
public static <Item extends Comparable<Item>> void shuffle(Item[] arr) {
    int N = arr.length;
    Random random = new Random();
    for (int i = 0; i < N; i++) {
        int r = random.nextInt(i + 1); // uniform in [0, i]
        swap(arr, i, r);
    }
}
```

## Applications in sorting

* Convex hull of a set of N points

  - The convex hull is the smallest-perimeter fence enclosing the points
  - Equivalent definitions:
    01. Smallest convex set containing all the points
    02. Smallest-area convex polygon enclosing the points
    03. Convex polygon enclosing the points, whose vertices are points in the set

  - Convex hull output. Sequence of vertices in counterclockwise order
  - Mechanical algorithm. Hammer nails perpendicular to the plane, stretch an elastic rubber band around the points
  - Convex hull applications
    01. Robot motion planning. Find the shortest path in the plane from s to t that avoids a polygonal obstacle
        + Fact. The shortest path is either the straight line from s to t or one of the two polygonal chains of the convex hull

    02. Farthest pair problem. Given N points in the plane, find a pair of points with the largest Euclidean distance between them
        + Fact. The farthest pair of points are extreme points on the convex hull
  - Convex hull: geometric properties
    + Fact. Can traverse the convex hull by making only counterclockwise turns
    + Fact. The vertices of the convex hull appear in increasing order of polar angle with respect to the point p with the lowest y-coordinate
  - Graham scan, based on the above facts
    + Choose the point p with the smallest y-coordinate
    + Sort the points by polar angle with p
    + Consider the points in order; discard a point unless it creates a ccw turn
    + Q. How to find the point p with the smallest y-coordinate?

      A. Define a total order, comparing by y-coordinate

    + Q. How to sort the points by polar angle with respect to p?

      A. Define a total order for each point p

    + Q. How to determine whether p1 → p2 → p3 is a counterclockwise turn?

      A. Computational geometry

    + Q. How to sort efficiently?

      A. Merge sort in `N lg N`

  - Implement CCW
    + CCW. Given three points a, b and c, is a→b→c a counterclockwise turn?

      A. The determinant (or cross product) gives 2x the signed area of the planar triangle

      01. If signed area > 0, then a→b→c is counterclockwise
      02. If signed area < 0, then a→b→c is clockwise
      03. If signed area = 0, then a, b and c are collinear

    + Running time: `N lg N` for sorting and linear for the rest
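
The CCW test above reduces to the sign of one cross product. A minimal sketch, assuming points are passed as plain `double` coordinates (the class name `Ccw` and this signature are choices made for the example):

```java
final class Ccw {

    /**
     * Returns +1 if a -> b -> c is a counterclockwise turn,
     * -1 if it is clockwise, and 0 if the three points are collinear.
     */
    static int ccw(double ax, double ay, double bx, double by, double cx, double cy) {
        // twice the signed area of triangle abc (cross product of ab and ac)
        double area2 = (bx - ax) * (cy - ay) - (by - ay) * (cx - ax);
        if (area2 > 0) return +1;
        if (area2 < 0) return -1;
        return 0;
    }
}
```

With floating-point coordinates, nearly collinear points can be misclassified because of round-off error, so a robust implementation needs extra care around the zero case.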

## Merge sort

* This sort is based on the `divide-and-conquer` technique

* Java sort for objects
* Steps:

- Divide array into two halves
- Recursively sort each half
- Merge two halves

* Complexity `N lg N`

`Code`

``` java
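// sort arr[lo..hi], using aux[] as scratch space
public static <Item extends Comparable<Item>> void sort(Item[] arr, Item[] aux, int lo, int hi) {
    if (hi <= lo)
        return;
    int mid = lo + (hi - lo) / 2;
    sort(arr, aux, lo, mid);
    sort(arr, aux, mid + 1, hi);
    merge(arr, aux, lo, mid, hi);
}

// merge the sorted halves arr[lo..mid] and arr[mid+1..hi]
private static <Item extends Comparable<Item>> void merge(Item[] arr, Item[] aux, int lo, int mid, int hi) {

    // copy to the auxiliary array
    for (int k = lo; k <= hi; k++)
        aux[k] = arr[k];

    int i = lo, j = mid + 1;
    for (int k = lo; k <= hi; k++) {
        if (i > mid)                    // left half exhausted
            arr[k] = aux[j++];
        else if (j > hi)                // right half exhausted
            arr[k] = aux[i++];
        else if (less(aux[j], aux[i]))  // take from the right only if strictly smaller (keeps the sort stable)
            arr[k] = aux[j++];
        else
            arr[k] = aux[i++];
    }
}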
```

* Mergesort improvement

- Skip the merge if the two halves are already in order: `!less(arr[mid+1], arr[mid])`
- Use insertion sort for small subarrays (cutoff ≈ 7 items)
- Eliminate the copy to the auxiliary array by switching the roles of the two arrays in each recursive call

* Reference: *First Draft of a Report on the EDVAC* by John von Neumann
* Comparing the running times of Insertionsort and Mergesort

- Laptop executes 10<sup>8</sup> compares/second
- Supercomputer executes 10<sup>12</sup> compares/second
- The tables below give the time needed to sort a million and a billion items on each computer

Insertionsort N<sup>2</sup>

| Computer | Million | Billion |
|----------|-----------|-----------|
| Laptop | 2.8 hours | 317 years |
| Supercomputer | 1 second | 1 week |

Mergesort `N lg N`

| Computer | Million | Billion |
|----------|----------|---------|
| Laptop | 1 second | 18 min |
| Supercomputer | instant | instant |

> Note: Good algorithms are better than supercomputers

### Bottom-up version of Mergesort

* Basic plan

01. Pass through the array, merging subarrays of size 1
02. Repeat for subarrays of size 2, 4, 8, 16, ...

* About 10% slower than the recursive version

`Code`

``` java
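@SuppressWarnings("unchecked")
public static <Item extends Comparable<Item>> void sortBottomUp(Item[] arr) {
    int N = arr.length;
    Item[] aux = (Item[]) new Comparable[N];

    // sz is the size of the sorted subarrays being merged: 1, 2, 4, 8, ...
    for (int sz = 1; sz < N; sz = sz + sz)
        // merge arr[lo..lo+sz-1] with arr[lo+sz..lo+2sz-1]; a lone trailing run needs no merge
        for (int lo = 0; lo < N - sz; lo += sz + sz)
            merge(arr, aux, lo, lo + sz - 1, Math.min(lo + sz + sz - 1, N - 1));
}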
```

## Sort Stability

* Suppose you want to sort `BY_NAME` and then `BY_SECTION`

* A stable sort preserves the relative order of items with equal keys, so after the second sort the items within each section are still ordered by name
* Which sorts are stable?

  + Insertionsort
  + Mergesort

* Why are Selectionsort and Shellsort not stable?

  + Selectionsort: a long-distance exchange might move an item past some equal item
  + Shellsort makes long-distance exchanges
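
The effect of stability can be observed with the standard library, whose object sort `Arrays.sort(T[], Comparator)` is documented to be stable. The `Student` type, its fields and the sample names below are made up for this sketch:

```java
import java.util.Arrays;
import java.util.Comparator;

final class StabilityDemo {

    /** Hypothetical record type used only for this example. */
    static final class Student {
        final String name;
        final int section;

        Student(String name, int section) {
            this.name = name;
            this.section = section;
        }
    }

    /** Sorts a copy by name, then by section; the second sort keeps the name order within a section. */
    static Student[] sortByNameThenSection(Student[] students) {
        Student[] copy = students.clone();
        Arrays.sort(copy, Comparator.comparing((Student s) -> s.name));
        Arrays.sort(copy, Comparator.comparingInt((Student s) -> s.section)); // stable sort
        return copy;
    }

    public static void main(String[] args) {
        Student[] students = {
            new Student("Chen", 3), new Student("Andrews", 3),
            new Student("Fox", 1), new Student("Battle", 1)
        };
        for (Student s : sortByNameThenSection(students)) {
            System.out.println(s.section + " " + s.name);
        }
    }
}
```

With an unstable sort in the second step, the students inside one section could come out in any order.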

## Quick sort

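* One of the most important algorithms of the 20th century
* Java sort for primitive types
* Basic plan

01. Shuffle the array
02. Partition so that, for some j, `a[j]` is in place, no larger entry is to the left of j and no smaller entry is to the right of j
03. Sort each piece recursively

* Average complexity `N lg N`

* Faster than Mergesort in practice
* In-place algorithm
* Not stable
* The quadratic worst case is extremely unlikely to happen, thanks to the random shuffle
* Problems in quick sort: it is not stable, and without the shuffle the worst case is quadratic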

`Code`

``` java
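private static final int CUTOFF = 10; // use insertion sort for subarrays of about 10 items

public static <Item extends Comparable<Item>> void sort(Item[] arr) {
    KnuthShuffleSort.shuffle(arr); // the random shuffle guards against the worst case
    sort(arr, 0, arr.length - 1);
}

private static <Item extends Comparable<Item>> void sort(Item[] arr, int lo, int hi) {
    if (hi <= lo)
        return;
    if (hi <= lo + CUTOFF) {
        InsertionSort.sort(arr, lo, hi);
        return;
    }
    int j = partition(arr, lo, hi);
    sort(arr, lo, j - 1);
    sort(arr, j + 1, hi);
}

// partition arr[lo..hi] around arr[lo]; returns the final index of the partitioning item
private static <Item extends Comparable<Item>> int partition(Item[] arr, int lo, int hi) {
    int i = lo, j = hi + 1;
    while (true) {
        // scan right while entries are smaller than the partitioning item
        while (less(arr[++i], arr[lo]))
            if (i == hi)
                break;
        // scan left while entries are larger than the partitioning item
        while (less(arr[lo], arr[--j]))
            ;
        if (i >= j) // pointers crossed
            break;
        swap(arr, i, j);
    }
    swap(arr, lo, j); // put the partitioning item in place
    return j;
}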
```

* Improvement
  01. Insertion sort for small subarrays
      - Cutoff ~ 10 items
  02. Median of sample
      - Best choice of pivot item = median
      - Estimate the median with the median of 3 random items

## Selection

* Goal. Given an array of N items, find the kth smallest

- Min (k = 0), max (k = N-1), median (k = N/2)

* Applications

- Order statistics
- Find the top k

* Use theory as a guide
  - Easy `N lg N` upper bound. How? Sort the array and return the entry at index k
  - Easy `N` upper bound for k = 1, 2, 3. How?
  - Easy `N` lower bound. Why?

* Which is true?
  - `N lg N` lower bound? ← Is selection as hard as sorting?
  - `N` upper bound? ← Is there a linear algorithm for each k?

* Quick-select
  - A variant of Quick-sort
  - Partition so that entry a[j] is in place
    - No larger entry to the left of j
    - No smaller entry to the right of j
  - Repeat in one sub-array, depending on j; finished when j equals k

  - Analysis: Linear time on average

> Remark. Quick-select uses ~ ½ N<sup>2</sup> compares in the worst case, but (as with quicksort) the random shuffle provides a probabilistic guarantee

- Algorithm

``` java
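public static <Item extends Comparable<Item>> Item select(Item[] arr, int k) {
    KnuthShuffleSort.shuffle(arr);
    int lo = 0, hi = arr.length - 1;
    while (hi > lo) {
        int j = partition(arr, lo, hi); // arr[j] is now in its final sorted position
        if (j < k)
            lo = j + 1; // the kth smallest is in the right sub-array
        else if (j > k)
            hi = j - 1; // the kth smallest is in the left sub-array
        else
            return arr[k];
    }
    return arr[k];
}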
```

> Remark. A compare-based selection algorithm with a linear worst-case running time exists, but its constants are too high => not used in practice

* Use theory as a guide

- Still worthwhile to seek a practical linear-time (worst-case) algorithm
- Until one is discovered, use quick-select if you don't need a full sort

## Duplicate keys

* Often, purpose of sort is to bring items with equal keys together
- Sort population by age
- Find collinear points
- Remove duplicates from mailing list
- Sort job applicants by college attended

* Typical characteristics of such applications
- Huge array
- Small number of key values

* Mergesort with duplicate keys. Always between ½ N lg N and N lg N compares

* Quicksort with duplicate keys.
- Algorithm goes quadratic unless partitioning stops on equal keys

- In the 1990s, a C user found this defect in qsort()
- Mistake. Put all items equal to the partitioning item on one side
+ Consequence. ~ ½ N<sup>2</sup> compares when all keys are equal

B A A B A B B B C C C A A A A A A A A A A `A`

- Recommended. Stop scans on items equal to the partitioning item
+ Consequence. ~ N lg N compares when all keys are equal

B A A B A B B B C C C A A A A A `A` A A A A A

- Desirable. Put all items equal to the partitioning item in place

A A A `B B B B B` C C C `A A A A A A A A A A A`

* 3-way partitioning
  - Goal. Partition the array into 3 parts so that:
    01. Entries between lt and gt are equal to the partitioning item v
    02. No larger entries to the left of lt
    03. No smaller entries to the right of gt
  - Dutch national flag problem. [Edsger Dijkstra]
    + Conventional wisdom until the mid 1990s: not worth doing
    + New approach discovered when fixing a mistake in the C library qsort()
    + Now incorporated into qsort() and the Java system sort
  - Steps
    01. Let v be the partitioning item a[lo]
    02. Scan i from left to right
        - (a[i] < v): exchange a[lt] with a[i]; increment both lt and i
        - (a[i] > v): exchange a[gt] with a[i]; decrement gt
        - (a[i] == v): increment i

`Code`

``` java
private static <Item extends Comparable<Item>> void sort(Item[] arr, int lo, int hi) {
    if (hi <= lo)
        return;
    int lt = lo, gt = hi;
    Item v = arr[lo]; // partitioning item
    int i = lo;
    // invariant: arr[lo..lt-1] < v, arr[lt..i-1] == v, arr[gt+1..hi] > v
    while (i <= gt) {
        int cmp = arr[i].compareTo(v);
        if (cmp < 0)
            swap(arr, lt++, i++);
        else if (cmp > 0)
            swap(arr, i, gt--);
        else
            i++;
    }
    sort(arr, lo, lt - 1);
    sort(arr, gt + 1, hi);
}
```

* Proposition. [Sedgewick-Bentley, 1997]
  - Quicksort with 3-way partitioning is entropy-optimal

* Bottom line. Randomized quicksort with 3-way partitioning reduces the running time from linearithmic to linear in a broad class of applications

## System sorts

* Sort applications

- obvious applications
+ Sort a list of names
+ Organize an MP3 library
+ Display Google PageRank results
+ List RSS feed in reverse chronological order

- Problems become easy once items are in sorted order

+ Find the median
+ Binary search in a database
+ Identify statistical outliers
+ Find duplicates in a mailing list

- Non-obvious applications

+ Data compression
+ Computer graphics
+ Computational biology
+ Load balancing on a parallel computer

* Java System sort

- Arrays.sort()
- Has different method for each primitive type
- Has a method for data types that implement Comparable
- Has a method that uses a Comparator
- Uses tuned quicksort for primitive types; tuned mergesort for objects
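
The three flavours of `Arrays.sort()` listed above can be exercised as follows. The wrapper class `SystemSortDemo` and its method names are assumptions made for this sketch; the `java.util.Arrays` and `java.util.Comparator` calls are standard library methods.

```java
import java.util.Arrays;
import java.util.Comparator;

final class SystemSortDemo {

    /** Primitive type: uses the overload for int[]. */
    static int[] sortedInts(int[] a) {
        int[] copy = a.clone();
        Arrays.sort(copy);
        return copy;
    }

    /** Comparable type: String implements Comparable<String>, so its natural order is used. */
    static String[] sortedNatural(String[] a) {
        String[] copy = a.clone();
        Arrays.sort(copy);
        return copy;
    }

    /** Comparator: an alternative order supplied by the client. */
    static String[] sortedDescending(String[] a) {
        String[] copy = a.clone();
        Arrays.sort(copy, Comparator.reverseOrder());
        return copy;
    }
}
```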

## System sort: Which algorithm to use?

* Many sorting algorithms to choose from

- Internal sorts
+ Insertion sort, selection sort, bubblesort, shaker sort
+ Quicksort, mergesort, heapsort, samplesort, shellsort
+ Solitaire sort, red-black sort, splaysort, Yaroslavskiy sort, psort

- External sorts. Poly-phase mergesort, cascade-merge, oscillating sort

- String/radix sorts. Distribution, MSD, LSD, 3-way string quicksort

- Parallel sorts
+ Bitonic sort, Batcher even-odd sort
+ Smooth sort, cube sort, column sort
+ GPUsort

* Applications have diverse attributes

- Stable?
- Parallel?
- Deterministic?
- Keys all distinct?
- Multiple key types?
- Linked list or arrays?
- Large or small items?
- Is your array randomly ordered?
- Need guaranteed performance?

* An elementary sort may be the method of choice for some combinations of attributes
  - The system sort cannot cover all combinations of attributes
* Q. Is the system sort good enough?

A. Usually

---

## Sort complexity

|Name|In-place|Stable|Best|Average|Worst|Remarks|
|----|-------|------|------|---------|-----|-------|
|Selectionsort|Yes|No|½ N<sup>2</sup>|½ N<sup>2</sup>|½ N<sup>2</sup>|N exchanges|
|Insertionsort|Yes|Yes|N|¼ N<sup>2</sup>|½ N<sup>2</sup>|use for small N or partially ordered|
|Shellsort|Yes|No|N log<sub>3</sub> N|?|c N<sup>3/2</sup>|tight code, sub-quadratic|
|Mergesort|No|Yes|½ N lg N|N lg N|N lg N|N lg N guarantee, stable|
|Quicksort|Yes|No|N lg N|2 N ln N|½ N<sup>2</sup>|N lg N probabilistic guarantee, fastest in practice|
|3-way Quicksort|Yes|No|N|2 N ln N|½ N<sup>2</sup>|improves quicksort in presence of duplicate keys|
|Timsort|No|Yes|N|N lg N|N lg N|-|

---

# Priority Queues

* A collection is a data type that stores a group of items

|data type|key operations|data structures|
|---------|--------------|---------------|
|stack|Push, Pop|linked list, resizing array|
|queue|Enqueue, Dequeue|linked list, resizing array|
|priority queue|Insert, Del-Max/Min|binary heap|
|symbol table|Put, Get, Delete|BST, hash table|
|set|Add, Contains, Delete|BST, hash table|

## Priority Queue (PQ)

* Collections. Insert and delete items. Which item to delete?

- Stack. Remove the item most recently added
- Queue. Remove the item least recently added
- Randomized queue. Remove a random item
- Priority Queue. Remove the `largest` or `smallest` item

* Operations: `insert(Item item), delMax(), isEmpty(), max(), size()`

* Applications

- Event-driven simulation: [customers in a line, colliding particles]
- Numerical computation: [reducing roundoff error]
- Data compression: [Huffman codes]
- Graph searching: [Dijkstra's algorithm, Prim's algorithm]
- Number theory: [sum of powers]
- Artificial intelligence: [A* search]
- Statistics: [maintain largest M values in a sequence]
- Operating systems: [load balancing, interrupt handling]
- Discrete optimization: [bin packing, scheduling]
- Spam filtering: [Bayesian spam filters]

* Implementation: a PQ can be implemented with an unordered array, an ordered array, or a binary heap
* Complexity of the unordered and ordered array implementations, and the goal

|Implementation|Insert|Del max|Max|
|--------------|------|-------|---|
|unordered array|1|N|N|
|ordered array|N|1|1|
|goal|lg N|lg N|lg N|
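
The first row of the table can be made concrete with a minimal unordered-array sketch: `insert` appends in constant time and `delMax` scans all N keys. The class name `UnorderedArrayMaxPQ` and the fixed capacity are simplifying assumptions for this example.

```java
import java.util.NoSuchElementException;

final class UnorderedArrayMaxPQ<Key extends Comparable<Key>> {
    private final Key[] pq; // keys in arrival order, not sorted
    private int n = 0;      // number of keys

    @SuppressWarnings("unchecked")
    UnorderedArrayMaxPQ(int capacity) {
        pq = (Key[]) new Comparable[capacity];
    }

    boolean isEmpty() {
        return n == 0;
    }

    /** Constant time: append at the end. */
    void insert(Key key) {
        pq[n++] = key;
    }

    /** Linear time: scan for the largest key, then fill its slot with the last key. */
    Key delMax() {
        if (isEmpty()) {
            throw new NoSuchElementException("priority queue underflow");
        }
        int max = 0;
        for (int i = 1; i < n; i++) {
            if (pq[max].compareTo(pq[i]) < 0) {
                max = i;
            }
        }
        Key result = pq[max];
        pq[max] = pq[n - 1];
        pq[--n] = null; // avoid loitering
        return result;
    }
}
```

An ordered array reverses the trade-off: the insert shifts entries to keep the array sorted, and the maximum is always the last entry.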

---

## Binary heaps

* Binary tree: empty, or a node with links to left and right binary trees

* Complete tree: perfectly balanced, except for the bottom level

* Complexity: `lg N`

* The height of a complete tree with N nodes is `⌊lg N⌋`. Why?

  The height only increases when N is a power of 2

* *(Figure: a complete binary tree in nature)*

* Binary heap representations

- Binary heap, Array representation of a heap-ordered complete binary tree.

- Heap-ordered binary tree
+ Keys in nodes
+ Parent's key no smaller than children's keys

- Array representation
+ Indices start at 1
+ Take nodes in level order
+ No explicit links needed!

* Binary heap properties

- Parent of node at k is at `k/2`
- Children of node at k are at `2k` (left) and `2k+1` (right)

* Underflow and overflow

- Underflow: throw an exception if deleting from an empty PQ
- Overflow: add a no-arg constructor and use a resizing array

* To avoid mixing operations like `delMax()` and `delMin()` in one class, there are two implementations: `MaxPQ` and `MinPQ`
  - The only difference is the comparison: `less()` vs `greater()`

* It is good practice to use immutable keys, so that the client cannot change a key while it is in the PQ

- Advantages
  + Simplifies debugging
  + Safer in presence of hostile code
  + Simplifies concurrent programming
  + Safe to use as key in priority queue or symbol table

- Disadvantages
  + Must create a new object for each data type value

`Code`

``` java
public void insert(Item key) {
    pq[++N] = key; // add at the end (indices start at 1)
    swim(N);       // restore heap order
}

public Item delMax() {
    if (isEmpty())
        throw new IndexOutOfBoundsException();
    Item max = pq[1];
    swap(1, N);
    N--; // decrement N here, will be used in sink
    sink(1);
    pq[N + 1] = null; // prevent loitering, object no longer needed
    return max;
}

public Item max() {
    return pq[1];
}

private void swim(int k) {
    while (k > 1 && less(k / 2, k)) { // parent is less than child k: swap
        swap(k / 2, k);
        k /= 2;
    }
}

private void sink(int k) {
    while (2 * k <= N) {
        int j = 2 * k;
        if (j < N && less(j, j + 1)) // pick the larger of the two children
            j++;
        if (!less(k, j)) // k is not less than its larger child: heap order restored
            break;
        swap(k, j);
        k = j;
    }
}
```

---

## Heapsort

* Basic plan

- Create max heap with all N keys
- Repeatedly remove the maximum key

* Steps
- First pass
+ Build heap using bottom-up method
+ Afterwards every parent is no smaller than its children (heap order)

``` java
for (int k = N/2; k >= 1; k--) sink(arr, k, N);
```

+ The loop runs from N/2 down to 1 because the nodes after N/2 are leaves, which are already heaps of size 1, so there is nothing to sink

- Second pass
+ Remove the maximum, one at a time
+ Leave in array, instead of nulling out
+ Each removed maximum is swapped to the end of the array, so the array fills with sorted keys from right to left

``` java
while (N > 1) {
swap(arr, 1, N--);
sink(arr, 1, N);
}
```

* Proposition. Heap construction uses <= 2 N compares and exchanges

* Proposition. Heapsort uses <= 2 N lg N compares and exchanges

* Significance. In-place sorting algorithm with N lg N worst-case

- Mergesort: no, linear extra space
- Quicksort: no, quadratic time in worst-case
- Heapsort: yes

* Bottom line: heapsort is optimal for both time and space, but:

- Inner loop is longer than quicksort's
- Makes poor use of cache memory
- Not stable

`Code`

``` java
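// Heap positions are 1-based; less() and swap() are assumed to map position k to array index k - 1
public final static <Item extends Comparable<Item>> void sort(Item[] pq) {
    int n = pq.length;

    // heapify phase
    for (int k = n / 2; k >= 1; k--) {
        sink(pq, k, n);
    }

    // sortdown phase
    int k = n;
    while (k > 1) {
        swap(pq, 1, k--); // move the current maximum to its final position
        sink(pq, 1, k);
    }
}

// Move the key at position k down until it is no smaller than both children
private static <Item extends Comparable<Item>> void sink(Item[] pq, int k, int n) {
    while (2 * k <= n) {
        int j = 2 * k;
        if (j < n && less(pq, j, j + 1)) // pick the larger of the two children
            j++;
        if (!less(pq, k, j)) // k is greater than or equal to its larger child: stop
            break;
        swap(pq, k, j);
        k = j;
    }
}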
```

* Sorting algorithm complexity, up to and including heapsort

|Name|In-place|Stable|Best|Average|Worst|Remarks|
|----|-------|------|------|---------|-----|-------|
|Selectionsort|Yes|No|½ N<sup>2</sup>|½ N<sup>2</sup>|½ N<sup>2</sup>|N exchanges|
|Insertionsort|Yes|Yes|N|¼ N<sup>2</sup>|½ N<sup>2</sup>|use for small N or partially ordered|
|Shellsort|Yes|No|N log<sub>3</sub> N|?|c N<sup>3/2</sup>|tight code, sub-quadratic|
|Mergesort|No|Yes|½ N lg N|N lg N|N lg N|N lg N guarantee, stable|
|Quicksort|Yes|No|N lg N|2 N ln N|½ N<sup>2</sup>|N lg N probabilistic guarantee, fastest in practice|
|3-way Quicksort|Yes|No|N|2 N ln N|½ N<sup>2</sup>|improves quicksort in presence of duplicate keys|
|Timsort|No|Yes|N|N lg N|N lg N|-|
|Heapsort|Yes|No|N lg N|2 N lg N|2 N lg N|N lg N guarantee, in-place|
|???|Yes|Yes|N lg N|N lg N|N lg N|holy sorting grail|

---

## Event driven simulation

* Goal. Simulate the motion of N moving particles that behave according to the laws of elastic collision

* Idea goes back to Einstein

* Application based on priority queue, without PQ you can not do it with large number of particles because would required quadratic time and not affordable

* Hard disc model

- Moving particles interact via elastic collisions with each other and walls
- Each particle is a disc with know position, velocity, mass and radius
- No other forces

* Significance. Relates macroscopic observables to microscopic dynamics
- Maxwell-Boltzmann: distribution of speeds as function of temperature
- Einstein: explain Brownian motion of pollen grains

* Time-driven simulation
- Update the position of each particle after every dt units of time, and check for overlaps
- If there is an overlap, roll back the clock to the time of the collision, update the velocities of the colliding particles, and continue the simulation
- Main drawbacks
+ ~ N²/2 overlap checks per time quantum
+ Simulation is too slow if dt is very small
+ May miss collisions if dt is too large

* Event-driven simulation
- Between collisions, particles move in straight-line trajectories
- Change state only when something happens
- Focus only on the times when collisions occur
- Maintain PQ of collision events, prioritized by time
- Remove the min = get next collision

- Collision prediction. Given position, velocity, and radius of a particle, when will it next collide with a wall or another particle?

- Collision resolution. If a collision occurs, update the colliding particle(s) according to the laws of elastic collisions

- Steps
+ Initialization
01. Fill PQ with all potential particle-wall collisions
02. Fill PQ with all potential particle-particle collisions
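The prediction and initialization steps above can be sketched for the simplest case, a disc bouncing between the two vertical walls of a unit box. This is only an illustration under stated assumptions: the class `CollisionSketch`, the `Event` type, and the box `[0, 1]` are hypothetical, and `java.util.PriorityQueue` stands in for the course's own priority queue.

```java
import java.util.PriorityQueue;

class CollisionSketch {
    // Hypothetical event: a collision predicted to happen at a given time
    static final class Event implements Comparable<Event> {
        final double time;
        final String description;

        Event(double time, String description) {
            this.time = time;
            this.description = description;
        }

        @Override
        public int compareTo(Event that) {
            return Double.compare(this.time, that.time); // earliest event first
        }
    }

    // Collision prediction: time until a disc at x (radius r, velocity vx)
    // touches a vertical wall of the unit box [0, 1]
    static double timeToHitVerticalWall(double x, double vx, double r) {
        if (vx > 0) return (1.0 - r - x) / vx;
        if (vx < 0) return (r - x) / vx;
        return Double.POSITIVE_INFINITY; // not moving horizontally: never hits
    }

    // Initialization: fill the PQ with all particle-wall events,
    // then remove the min to get the next collision
    static String nextEvent(double[] xs, double[] vxs, double r) {
        PriorityQueue<Event> pq = new PriorityQueue<>();
        for (int i = 0; i < xs.length; i++) {
            double t = timeToHitVerticalWall(xs[i], vxs[i], r);
            if (t < Double.POSITIVE_INFINITY)
                pq.add(new Event(t, "particle " + i + " hits a wall"));
        }
        return pq.isEmpty() ? "none" : pq.poll().description;
    }
}
```

A full simulation would also predict particle-particle collisions and re-predict events for the particles involved after each collision is resolved.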

---

# Symbol tables

## API

* Key-value pair abstraction

- Insert a value with specified key
- Given a key, search for the corresponding value

* Ex. DNS lookup

- Insert URL with specified IP address
- Given a URL, find the corresponding IP address

* Applications

01. Dictionary
02. Book index
03. File share
04. Compiler
05. Routing table
06. DNS
07. Genomics
08. File system
09. Web search

* Operations:

```

put(Key key, Value val)
get(Key key)
delete(Key key)
contains(Key key)
isEmpty()
size()
keys()
```

* Conventions:

- Values are not null
- Method get() returns null if key not present
- Method put() overwrites old values with new value

* Key type: several natural assumptions

- Assume keys are Comparable, use compareTo()
- Assume keys are any generic type, use equals() to test equality
- Assume keys are any generic type, use equals() to test equality, use hashCode() to scramble key

* Best practices: Use immutable types for symbol table keys

- Immutable in Java: String, Integer, Double, java.io.File, ...
- Mutable in Java: StringBuilder, java.net.URL, arrays, ...

* Equality test

- All Java classes inherit a method equals()
- Java requirements. For any references x, y and z

+ Reflexive: x.equals(x) is true
+ Symmetric: x.equals(y) iff y.equals(x)
+ Transitive: if x.equals(y) and y.equals(z), then x.equals(z)
+ Non-null: x.equals(null) is false

* Equals design

- Standard recipe for user-defined type

+ Optimization for reference equality
+ Check against null
+ Check that two objects are of the same type and cast
+ Compare each significant field
01. if field is a primitive type, use ==
02. if field is an object, use equals()
03. if field is an array, apply to each entry

- Best practices
+ No need to use calculated fields that depend on other fields
+ Compare fields most likely to differ first
+ Make compareTo() consistent with equals(): `x.equals(y) if and only if (x.compareTo(y) == 0)`
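As an illustration of the recipe, here is a minimal sketch for a hypothetical immutable type `SimpleDate` (the class and its fields are assumptions made for this example, not part of the repository). It also overrides `hashCode()` so that equal objects report equal hash codes.

```java
final class SimpleDate {
    private final int month;
    private final int day;
    private final int year;

    SimpleDate(int month, int day, int year) {
        this.month = month;
        this.day = day;
        this.year = year;
    }

    @Override
    public boolean equals(Object other) {
        if (other == this) return true;                        // optimization for reference equality
        if (other == null) return false;                       // check against null
        if (other.getClass() != this.getClass()) return false; // same type?
        SimpleDate that = (SimpleDate) other;                  // cast is now safe
        // compare each significant field, the one most likely to differ first
        return this.day == that.day
            && this.month == that.month
            && this.year == that.year;
    }

    @Override
    public int hashCode() {
        // equal objects must have equal hash codes
        int hash = 17;
        hash = 31 * hash + month;
        hash = 31 * hash + day;
        hash = 31 * hash + year;
        return hash;
    }
}
```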

## Elementary implementations

* Sequential search (unordered list)

- Data structure: Maintain an (unordered) linked list of key-value pairs
- Node contains key and value
- Search: Scan through all keys until a match is found
- Insert: Scan through all keys until a match is found; if there is no match, add to the front
- summary

|worst-case search|worst-case insert|average-case search|average-case insert|ordered|key interface|
|-----------------|-----------------|-------------------|-------------------|-------|-------------|
|N|N|N / 2|N|no|equals()|

* Binary search in an ordered array

- Data structure: Maintain an ordered array of key-value pairs
- Rank helper function: how many keys < k?
- summary

|worst-case search|worst-case insert|average-case search|average-case insert|ordered|key interface|
|-----------------|-----------------|-------------------|-------------------|-------|-------------|
|lg N|N|lg N|N / 2|yes|compareTo()|
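The rank helper is what makes the ordered-array search logarithmic. A minimal sketch, assuming the first `n` entries of `keys` are sorted and distinct (the class name `BinarySearchRank` is hypothetical):

```java
class BinarySearchRank {
    // Number of keys among keys[0..n-1] that are strictly less than key
    static <Key extends Comparable<Key>> int rank(Key[] keys, int n, Key key) {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2; // avoids int overflow of (lo + hi)
            int cmp = key.compareTo(keys[mid]);
            if (cmp < 0) hi = mid - 1;
            else if (cmp > 0) lo = mid + 1;
            else return mid;              // found: exactly mid keys are smaller
        }
        return lo;                        // not found: lo keys are smaller
    }
}
```

A `get()` built on top of this would compare `keys[rank]` with the search key, and a `put()` would shift the larger keys one position to the right, which is where the linear insert cost comes from.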

* BST Binary search tree

## Ordered operations

``` java
void put(Key key, Value value)
Value get(Key key)
void delete(Key key)
boolean contains(Key key)
boolean isEmpty()
int size()
Key min()
Key max()
Key floor(Key key) // largest key less than or equal to key
Key ceiling(Key key) // smallest key greater than or equal to key
int rank(Key key) // number of keys less than key
Key select(int k) // Key of rank k
void deleteMin() // delete smallest key
void deleteMax() // delete largest key
int size(Key lo, Key hi)
Iterable<Key> keys(Key lo, Key hi)
Iterable<Key> keys()
```

## Binary search trees

* Classic data structure that provides efficient implementations of symbol-table operations

* BST is a binary tree in symmetric order

* A binary tree is either
- Empty
- Two disjoint binary trees (left and right)

* Symmetric order: each node has a key, and every node's key is:

- Larger than all keys in its left subtree
- Smaller than all keys in its right subtree

* Representation in Java: a Node holds a key, a value, and references to the left and right subtrees

`Code`

``` java
public void put(Key key, Value val) {
    if (get(key) == null) size++; // count only new keys; overwriting a value keeps the size
    root = put(root, key, val);
}

// Insert into the subtree rooted at node and return the (possibly new) subtree root
private Node put(Node node, Key key, Value val) {
    if (node == null)
        return new Node(key, val);
    int cmp = key.compareTo(node.key);
    if (cmp < 0)
        node.left = put(node.left, key, val);
    else if (cmp > 0)
        node.right = put(node.right, key, val);
    else
        node.val = val; // key already present: overwrite the value
    return node;
}

public Value get(Key key) {
    Node node = root;
    while (node != null) {
        int cmp = key.compareTo(node.key);
        if (cmp < 0)
            node = node.left;
        else if (cmp > 0)
            node = node.right;
        else
            return node.val;
    }
    return null; // key not present
}
```

* Tree shape

- Many BSTs correspond to same set of keys
- Number of compares for search/insert is equal to 1 + depth of node
- Remark. Tree shape depends on order of insertion
- Worst case: keys inserted in sorted (natural) order, so the tree degenerates into a linked list

* Mathematical analysis
- Proposition. If N distinct keys are inserted into a BST in random order, the expected number of compares for a search/insert is ~ 2 ln N
- Pf. 1-1 correspondence with quicksort partitioning
- But worst case height is N

|worst-case search|worst-case insert|average-case search|average-case insert|ordered|key interface|
|-----------------|-----------------|-------------------|-------------------|-------|-------------|
|N|N|1.39 lg N|1.39 lg N|stay tuned|compareTo()|

## Ordered ST operations

* Minimum and maximum

- Minimum. Smallest key in table
- Maximum. Largest key in table
- Q. How to find min/max in BST?

+ For min, follow left links from the root until reaching a node whose left link is null
+ For max, follow right links from the root until reaching a node whose right link is null
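A minimal, self-contained sketch of that walk, using a hypothetical `int`-keyed `Node` instead of the generic node used elsewhere in these notes:

```java
class MinMaxBST {
    // Hypothetical minimal node: an int key plus left and right links
    static final class Node {
        final int key;
        Node left, right;

        Node(int key) {
            this.key = key;
        }
    }

    // Standard BST insert, only needed here to build a tree to search
    static Node insert(Node node, int key) {
        if (node == null) return new Node(key);
        if (key < node.key) node.left = insert(node.left, key);
        else if (key > node.key) node.right = insert(node.right, key);
        return node;
    }

    // Smallest key: keep going left while a left child exists (node must be non-null)
    static Node min(Node node) {
        while (node.left != null) node = node.left;
        return node;
    }

    // Largest key: keep going right while a right child exists (node must be non-null)
    static Node max(Node node) {
        while (node.right != null) node = node.right;
        return node;
    }
}
```

Both walks follow a single root-to-leaf path, so the cost is proportional to the height of the tree.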

* Floor and ceiling

- Floor. Largest key <= a given key
- Ceiling. Smallest key >= a given key

* Computing the floor

- Case 1. [k equals the key at root]
+ The floor of k is k

- Case 2. [k is less than the key at root]
+ The floor of k is in the left subtree

- Case 3. [k is greater than the key at root]
+ The floor of k is in the right subtree if there is any key <= k in the right subtree
+ Otherwise it is the key at the root

``` java
public Key floor(Key key) {
Node node = floor(root, key);
if (node == null) return null;
return node.key;
}

private Node floor(Node node, Key key) {
if (node == null) return null;

int cmp = key.compareTo(node.key);
if (cmp == 0) return node;
if (cmp < 0) return floor(node.left, key);

Node n = floor(node.right, key);
if (n != null) return n;
else return node;
}
```

* Ceiling

``` java
public Key ceiling(Key key) {
    Node node = ceiling(root, key);
    if (node == null)
        return null;
    return node.key;
}

private Node ceiling(Node node, Key key) {
    if (node == null)
        return null;

    int cmp = key.compareTo(node.key);
    if (cmp == 0)
        return node;
    if (cmp < 0) {
        // a smaller candidate may exist in the left subtree
        Node n = ceiling(node.left, key);
        if (n != null)
            return n;
        else
            return node; // nothing in the left subtree is >= key, so this node is the ceiling
    }

    return ceiling(node.right, key);
}
```

* Rank. How many keys < k?

- Easy recursive algorithm (3 cases!)

``` java
public int rank(Key key) {
return rank(key, root);
}

private int rank(Key key, Node node) {
if (node == null) return 0;
int cmp = key.compareTo(node.key);

if (cmp < 0) return rank(key, node.left);
else if (cmp > 0) return 1 + size(node.left) + rank(key, node.right);
else return size(node.left);
}
```

* Inorder traversal

- Traverse left subtree
- Enqueue key
- Traverse right subtree

``` java
public Iterable<Key> keys() {
    Queue<Key> q = new Queue<>();
    inorder(root, q);
    return q;
}

// Left subtree, then the node itself, then right subtree
private void inorder(Node node, Queue<Key> q) {
    if (node == null) return;
    inorder(node.left, q);
    q.enqueue(node.key);
    inorder(node.right, q);
}
```

* Property. Inorder traversal of a BST yields keys in ascending order

## Deletion in BST

* To remove a node with a given key (lazy approach)

- Set its value to null
- Leave key in tree to guide searches (but don't consider it equal in search)
- Cost ~ 2 ln N' per insert, search, and delete (if keys in random order), where N' is the number of key-value pairs ever inserted
- Unsatisfactory solution. Tombstones overload memory

* To delete the minimum

- Go left until finding a node with null left link
- Replace that node by its right link
- Update subtree counts

``` java
public void deleteMin() {
root = deleteMin(root);
}

private Node deleteMin(Node node) {
if (node.left == null) return node.right;
node.left = deleteMin(node.left);
node.count = 1 + size(node.left) + size(node.right);
return node;
}
```

* Hibbard deletion

- To delete a node with key k: search for node n containing key k
+ Case 0. [0 children] Delete n by setting parent link to null
+ Case 1. [1 child] Delete n by replacing parent link
+ Case 2. [2 children]
01. Find successor x of n
02. Delete the minimum in n's right subtree
03. Put x in n's spot

``` java
public void delete(Key key) {
    if (get(key) == null) return; // key not present: nothing to delete, size unchanged
    size--;
    root = delete(root, key);
}

private Node delete(Node node, Key key) {
    if (node == null)
        return null;
    int cmp = key.compareTo(node.key);
    if (cmp < 0)
        node.left = delete(node.left, key);
    else if (cmp > 0)
        node.right = delete(node.right, key);
    else {
        if (node.right == null)
            return node.left;           // no right child
        if (node.left == null)
            return node.right;          // no left child
        Node t = node;
        node = min(t.right);            // successor of the deleted node
        node.right = deleteMin(t.right);
        node.left = t.left;
    }
    node.count = size(node.left) + size(node.right) + 1; // update subtree count
    return node;
}
```

---

## Balanced search tree

### 2-3 search trees

* Allow 1 or 2 keys per node

- 2-node: one key, two children
- 3-node: two keys, three children

* Perfect balance: every path from the root to null link has same length
* Implementation: Red-black BSTs

### Red-black BSTs

* Represent 2-3 tree as a BST (binary search tree)
* Left-leaning red-black BSTs (Guibas-Sedgewick 1979 and 2007)
- Use "internal" left-leaning links as "glue" for 3-nodes
- A BST such that:

+ No node has two red links connected to it
+ Every path from root to the null link has the same number of black links
+ Red links lean left
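The three conditions can be checked mechanically. The sketch below is an illustration only: `RedBlackCheck` and its `int`-keyed `Node` are hypothetical, each node stores the color of the link from its parent, and the root is assumed to be black.

```java
class RedBlackCheck {
    static final boolean RED = true;
    static final boolean BLACK = false;

    // Hypothetical node: color is the color of the link coming from the parent
    static final class Node {
        final int key;
        final boolean color;
        final Node left, right;

        Node(int key, boolean color, Node left, Node right) {
            this.key = key;
            this.color = color;
            this.left = left;
            this.right = right;
        }
    }

    static boolean isRed(Node x) {
        return x != null && x.color == RED; // null links are black
    }

    // Red links lean left, and no node has two red links connected to it
    static boolean is23(Node x) {
        if (x == null) return true;
        if (isRed(x.right)) return false;            // right-leaning red link
        if (isRed(x) && isRed(x.left)) return false; // two red links in a row
        return is23(x.left) && is23(x.right);
    }

    // Every path from the root to a null link has the same number of black links
    static boolean isBalanced(Node root) {
        int black = 0; // black links on the leftmost path
        for (Node x = root; x != null; x = x.left)
            if (!isRed(x)) black++;
        return isBalanced(root, black);
    }

    private static boolean isBalanced(Node x, int black) {
        if (x == null) return black == 0;
        if (!isRed(x)) black--;
        return isBalanced(x.left, black) && isBalanced(x.right, black);
    }
}
```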

### B-Tree

* TODO this part

---

## Geometric applications of BSTs

* Intersections among geometric objects
* Applications. CAD, games, movies, virtual reality, database, ...
* Efficient solutions. Binary search trees (and extensions)

### 1 d range search

* Extensions of ordered symbol table
- Same operations as ordered symbol table
- Range search: find all keys between k1 and k2
- Range count: number of keys between k1 and k2

* Application. Database queries
- e.g. salary between val-1 AND val-2

* Geometric interpretation
- Keys are points on a line
- Find/count points in a given 1 d interval

* Implementations
- Unordered array. Fast insert, slow range search
- Ordered array. Slow insert, binary search for k1 and k2 to do range search
- 1 d range count
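With an ordered array, range count reduces to two rank computations. A minimal sketch, assuming sorted, distinct `int` keys; the class `RangeCount1D` is hypothetical and uses `java.util.Arrays.binarySearch` for the rank:

```java
import java.util.Arrays;

class RangeCount1D {
    // Number of keys strictly less than key (keys must be sorted and distinct)
    static int rank(int[] keys, int key) {
        int i = Arrays.binarySearch(keys, key);
        return i >= 0 ? i : -(i + 1); // not found: binarySearch encodes the insertion point
    }

    static boolean contains(int[] keys, int key) {
        return Arrays.binarySearch(keys, key) >= 0;
    }

    // Number of keys k with lo <= k <= hi
    static int size(int[] keys, int lo, int hi) {
        if (lo > hi) return 0;
        if (contains(keys, hi)) return rank(keys, hi) - rank(keys, lo) + 1;
        return rank(keys, hi) - rank(keys, lo);
    }
}
```

The same two-rank formula works on a BST that maintains subtree counts, which keeps the count logarithmic while also allowing fast inserts.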

### line segment intersection

* TODO this part

### kd trees

* TODO this part

### interval search trees

* Instead of points, the data are intervals
* 1 d interval search. Data structure to hold set of (overlapping) intervals

* Create BST, where each node stores an interval (lo, hi)
- Use left endpoints as BST key
- Store max endpoint in subtree rooted at node

* TODO this part

### rectangle intersection

* TODO this part

---

## Hash tables

* Basic plan:

- Save items in a key-indexed table (index is a function of the key)
- Hash function: Method for computing array index from key

* Issues:

- Computing the hash function
- Equality test: Method for checking whether two keys are equal
- Collision resolution: Algorithm and data structure to handle two keys that hash to the same array index

* Classic space-time tradeoff

- No space limitation: trivial hash function with key as index
- No time limitation: trivial collision resolution with sequential search
- Space and time limitations: hashing (the real world)

### Hash functions

* Efficiently computable
* Each table index equally likely for each key
* Ex 1. Phone numbers:

- Bad: first three digits
- Better: last three digits

* Ex 2. Social security numbers:

- Bad: first three digits
- Better: last three digits

* Practical challenge: need a different approach for each key type

* Java's hash code conventions

- All Java classes inherit a method `hashCode()`, which returns a 32-bit int

- Requirement, if `x.equals(y), then (x.hashCode() == y.hashCode())`

- Highly desirable: if `!x.equals(y), then (x.hashCode() != y.hashCode())`

- Default implementation. Memory address of x

- Legal (but poor) implementation. Always return 17

- Customized implementations. Integer, Double, String, File, URL, Date, ...

``` tree
x
|
-----
| |
-----
|
x.hashCode()
```

* Implementing hash code: strings

- Cache the hash value in an instance variable
- Return cached value

``` java
public final class String {
private int hash = 0; // cache of hash code
private final char[] s;

public int hashCode() {
int h = hash;
if (h != 0) return h; // returned cached value
for (int i = 0; i < length(); i++) {
h = s[i] + (31 * h);
}
hash = h; // store cache of hash code
return h;
}
}
```

* Implementing hash code: user-defined types

``` java
public final class Transaction implements Comparable<Transaction> {
private final String who;
private final Date when;
private final double amount;

public int hashCode() {
int hash = 17; // nonzero constant
// multiplier is typically a small prime such as 31
// combine every significant field of the class
hash = 31 * hash + who.hashCode();
hash = 31 * hash + when.hashCode();
hash = 31 * hash + ((Double) amount).hashCode();
return hash;
}
}
```

* Hash code design

- Standard recipe for user-defined types

+ Combine each significant field using the `31x+y` rule

+ If field is a primitive type, use wrapper type `hashCode()`

+ If field is null, return 0

+ If field is a reference type, use `hashCode()` ← applies rule recursively

+ If field is an array, apply to each entry ← or use `Arrays.deepHashCode()`

- In practice. Recipe works reasonably well, used in Java libraries

- In theory. Keys are bit strings: universal hash functions exist

- Basic rule. Need to use the whole key to compute hash code, consult an expert for state-of-the-art hash codes

* Modular hashing

- Hash code. An int between -2^31 and 2^31 - 1
- Hash function. An int between 0 and M - 1 (for use as array index)
+ M is typically a prime or a power of 2

- Why choose a prime for M?
+ With a prime M, the remainder depends on all the bits of the hash code

The code below has a bug: `%` returns a negative result when `hashCode()` is negative

``` java
// Bug
private int hash(Key key) {
return key.hashCode() % M;
}
```

The code below has another bug: `Math.abs()` returns a negative value when `hashCode()` is -2^31

``` java
// 1-in-a-billion bug
private int hash(Key key) {
return Math.abs(key.hashCode()) % M;
}
```

The correct implementation; this is a template for turning a hash code into an index between 0 and M - 1

``` java
private int hash(Key key) {
return (key.hashCode() & 0x7fffffff) % M;
}
```
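To see the three variants side by side, the sketch below applies them to raw hash code values. The class `ModularHashDemo` and its method names are hypothetical; the methods take the hash code as an `int` so the corner cases can be tried directly.

```java
class ModularHashDemo {
    // Bug: in Java the sign of % follows the dividend, so this can be negative
    static int buggyHash(int hashCode, int m) {
        return hashCode % m;
    }

    // 1-in-a-billion bug: Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE
    static int absHash(int hashCode, int m) {
        return Math.abs(hashCode) % m;
    }

    // Clears the sign bit first, so the result is always in [0, m - 1]
    static int safeHash(int hashCode, int m) {
        return (hashCode & 0x7fffffff) % m;
    }
}
```

For example, `buggyHash(-7, 5)` is `-2`, and `absHash(Integer.MIN_VALUE, 97)` is still negative, while `safeHash` stays within the array bounds for both inputs.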

* Uniform hashing assumption

- Uniform hashing assumption. Each key is equally likely to hash to an integer between 0 and M - 1

- Bins and balls. Throw balls uniformly at random into M bins

- Classic problems studied in statistics

+ Birthday problem. Expect two balls in the same bin after `~ sqrt(PI M / 2)` tosses

+ Coupon collector. Expect every bin has >= 1 ball after `~ M ln M` tosses

+ Load balancing. After M tosses, expect most loaded bin has `Omega (log M / log log M)` balls

### Separate chaining (Collision resolution)

* Collision. Two distinct keys hashing to the same index

- Birthday problem → can't avoid collisions unless you have a ridiculous quadratic amount of memory

- Coupon collector + load balancing → collisions will be evenly distributed

- Challenge. Deal with collisions efficiently

* Separate chaining symbol table

- Use an array of M < N linked lists [H.P. Luhn, IBM 1953]

+ Hash: map key to integer i between 0 and M-1

+ Insert: put at front of ith chain (if not already there)

+ Search: need to search only ith chain

``` java
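public class SeparateChainingHashST<Key, Value> {
    private int M = 97;              // number of chains
    private Node[] st = new Node[M]; // array of chains

    // Static nested class cannot use Key/Value, so it stores Objects
    // (this avoids generic array creation for st)
    private static class Node {
        private final Object key;
        private Object val;
        private final Node next;

        Node(Object key, Object val, Node next) {
            this.key = key;
            this.val = val;
            this.next = next;
        }
    }

    private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }

    @SuppressWarnings("unchecked")
    public Value get(Key key) {
        int i = hash(key);
        for (Node node = st[i]; node != null; node = node.next) {
            if (key.equals(node.key)) return (Value) node.val;
        }
        return null;
    }

    public void put(Key key, Value val) {
        int i = hash(key);
        for (Node node = st[i]; node != null; node = node.next) {
            if (key.equals(node.key)) {
                node.val = val; // key already present: overwrite the value
                return;
            }
        }
        st[i] = new Node(key, val, st[i]); // put the new node at the front of the ith chain
    }
}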
```

* Consequence. Number of probes for search/insert is proportional to N/M

- M too large ==> too many empty chains

- M too small ==> chains too long

- Typical choice `M ~ N/5` ==> constant time ops

### Linear probing

* Open addressing. When a new key collides, find the next empty slot and put it there

* Hash. Map key to integer i between 0 and M-1

* Insert. Put at table index i if free, if not try i+1, i+2, etc

* Search. Search table index i, if occupied but no match, try i+1, i+2, etc

> Note: Array size M must be greater than number of key-value pairs N

``` java
public class LinearProbingHashST<Key, Value> {
    private int M = 30001; // table size; must stay larger than the number of keys
    @SuppressWarnings("unchecked")
    private Value[] vals = (Value[]) new Object[M];
    @SuppressWarnings("unchecked")
    private Key[] keys = (Key[]) new Object[M];

    private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }

    public void put(Key key, Value val) {
        int i;
        // probe i, i + 1, ... (wrapping around) until the key or an empty slot is found
        for (i = hash(key); keys[i] != null; i = (i + 1) % M) {
            if (keys[i].equals(key)) break;
        }
        keys[i] = key;
        vals[i] = val;
    }

    public Value get(Key key) {
        for (int i = hash(key); keys[i] != null; i = (i + 1) % M) {
            if (key.equals(keys[i])) return vals[i];
        }
        return null; // hit an empty slot: key not present
    }
}
```

* Clustering

- Cluster. A contiguous block of items

- Observation. New keys likely to hash into middle of big clusters

* Knuth's parking problem

- Model. Cars arrive at one-way street with M parking spaces, each desires a random space i: if space i is taken, try i + 1, i + 2, etc

- Q. What is mean displacement of a car?

+ Half-full. With M / 2 cars, mean displacement is ~ 3 / 2

+ Full. With M cars, mean displacement is ~ sqrt(PI * M / 8)

### ST implementations: summary

|ST implementation|Worst-case search| Worst-case insert|Worst-case delete|Ordered iteration|key interface|
|-----------------|-----------------|------------------|-----------------|-----------------|-------------|
|linked list|N|N|N|no|equals()|
|binary search (ordered array)|lg N|N|N|yes|compareTo()|
|BST|N|N|N|yes|compareTo()|
|2-3 tree|c lg N|c lg N|?|yes|compareTo()|
|red-black tree|2 lg N|2 lg N|2 lg N|yes|compareTo()|
|separate chaining|lg N*|lg N*|lg N*|no|equals()|
|linear probing|lg N*|lg N*|lg N*|no|equals()|

> Note: * under uniform hashing assumption

### Context

* War story: String hashing in Java

- String hashCode() in Java 1.1

+ For long strings: only examine 8-9 evenly spaced characters

+ Benefit: save time in performing arithmetic

+ Downside: great potential for bad collision patterns

``` java
public int hashCode() {
int hash = 0;
int skip = Math.max(1, length() / 8);
for (int i = 0; i < length(); i += skip) {
hash = s[i] + (37 * hash);
}
return hash;
}
```

- The skipped characters may be exactly the ones that distinguish the keys, so the hash function has to examine the whole string

* Q. Is the uniform hashing assumption important in practice?

- A. Obvious situations: aircraft control, nuclear reactor, pacemaker

- A. Surprising situations: denial-of-service attacks

* Algorithmic complexity attack on Java

- Goal. Find family of strings with the same hash code

- Solution. The base-31 hash code is part of Java's String API, so colliding strings are easy to construct
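A small illustration of why this works: `"Aa"` and `"BB"` have the same `String.hashCode()` (65 * 31 + 97 = 66 * 31 + 66 = 2112), and because the hash is a base-31 polynomial, any concatenation of such equal-length blocks collides as well. The class and method below are hypothetical names for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class HashCollisionDemo {
    // All strings made of `blocks` two-character blocks drawn from {"Aa", "BB"}:
    // 2^blocks different strings that share a single hash code
    static List<String> collidingStrings(int blocks) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            result = next;
        }
        return result;
    }
}
```

Feeding such a family of keys to a hash table puts every key in the same chain (or cluster), which turns constant-time operations into linear-time ones.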

* Diversion: one-way hash functions

- One-way hash function. "Hard" to find a key that will hash to a desired value (or two keys that hash to the same value)

+ Ex. MD4, MD5, SHA-0, SHA-1, SHA-2, WHIRLPOOL, RIPEMD-160, etc

+ Applications. Digital fingerprint, message digest, storing passwords

+ Caveat. Too expensive for use in ST implementations

* Separate chaining vs. Linear probing

- Separate chaining

+ Easier to implement delete

+ Performance degrades gracefully

+ Clustering is less sensitive to a poorly designed hash function

- Linear probing
+ Less wasted space
+ Better cache performance

- Q. How to delete?
- Q. How to resize?

* Hashing: variations on the theme

- Many improved versions have been studied.

- Two-probe hashing (separate-chaining variant)

+ Hash to two positions, insert key in shorter of two chains

+ Reduces expected length of the longest chain to log log N

- Double hashing (linear-probing variant)

+ Use linear probing, but skip a variable amount, not just 1 each time

+ Effectively eliminates clustering

+ Can allow table to become nearly full

+ More difficult to implement delete

- Cuckoo hashing (linear-probing variant)
+ Hash key to two positions, insert key into either position, if occupied reinsert displaced key into its alternative position (and recur)

+ Constant worst case time for search

* Applications

- Security. One-way hash functions: MD4, MD5, SHA-0, SHA-1, SHA-2, WHIRLPOOL, RIPEMD-160, ...
+ Digital fingerprint
+ Message digest
+ Storing passwords

- Dictionary lookup
+ DNS lookup
+ Amino acids
+ Class list

* Hash tables vs. balanced search trees

- Hash tables

+ Simpler to code

+ No effective alternative for unordered keys

+ Faster for simple keys (a few arithmetic ops versus log N compares)

+ Better system support in Java for strings (e.g. cached hash code)

- Balanced search trees

+ Stronger performance guarantee

+ Support for ordered ST operations

+ Easier to implement `compareTo()` correctly than `equals()` and `hashCode()`

- Java system includes both:

+ Red-black BSTs: java.util.TreeMap, java.util.TreeSet

+ Hash tables: java.util.HashMap, java.util.IdentityHashMap

### Applications

* Sets

- Mathematical set: a collection of `distinct` keys
- Operations: `add(Key key), contains(Key key), remove(Key key), size(), iterator()`
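One typical use of a set is an exception filter: keep only the words of an input that appear on a whitelist. The sketch below is a hypothetical example that uses `java.util.HashSet` from the standard library rather than the set implementation of these notes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class WhitelistFilter {
    // Keep the input words that are on the whitelist, preserving input order
    static List<String> filter(List<String> whitelist, List<String> input) {
        Set<String> set = new HashSet<>(whitelist); // duplicate keys collapse into one
        List<String> kept = new ArrayList<>();
        for (String word : input) {
            if (set.contains(word)) kept.add(word); // expected constant-time lookup
        }
        return kept;
    }
}
```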

* Dictionary lookup

- Command-line arguments
+ A comma-separated value (CSV) file
+ Key field
+ Value field

- Ex 1. DNS lookup
- Ex 2. Amino acids
- Ex 3. Class list

* File indexing

- Goal. Index a PC (or the web): given a list of files, create an index so that you can efficiently find all files containing a given query string

- Book index
+ Goal. Preprocess a text corpus to support concordance queries: given a word, find all occurrences with their immediate contexts

* Sparse vectors

- Matrix-vector multiplication

- Problem. sparse matrix-vector multiplication

- Assumptions. Matrix dimension is 10,000; average nonzeros per row ~ 10

- Vector representations

+ 1D array (standard) representation

. Constant time access to elements

. Space proportional to N

+ Symbol table representation

. Key = index, value = entry

. Efficient iterator

. Space proportional to number of nonzeros

- Matrix representations

+ 2D array (standard) matrix representation. Each row of matrix is an array

+ Space proportional to N²

- Sparse matrix representation: Each row of matrix is a sparse vector

+ Efficient access to elements

+ Space proportional to number of nonzeros (plus N)

``` java
public class SparseVector {
    private HashST<Integer, Double> v;

    public SparseVector() {
        v = new HashST<>(); // empty ST represents the all-zeros vector
    }

    public void put(int i, double x) {
        v.put(i, x); // store only the nonzero entries
    }

    public double get(int i) {
        if (!v.contains(i)) return 0.0; // missing index means the entry is zero
        return v.get(i);
    }

    public Iterable<Integer> indices() {
        return v.keys();
    }

    // Dot product with a dense vector: time proportional to the number of nonzeros
    public double dot(double[] that) {
        double sum = 0.0;
        for (int i : indices()) {
            sum += that[i] * this.get(i);
        }
        return sum;
    }
}
```

---

# Graph

## Undirected graphs

* Graph: Set of vertices connected pairwise by edges

* Why study graph algorithms?

- Thousands of practical applications

- Hundreds of graph algorithms known

- Interesting and broadly useful abstraction

- Challenging branch of computer science and discrete math

* Examples of graphs

- Protein-protein interaction network

- The internet as mapped by the Opte project

- Map of science click-streams

- Facebook friends

- One week of Enron emails

- Framingham heart study

* Graph terminology

- Path: Sequence of vertices connected by edges

- Cycle: Path whose first and last vertices are the same

- Two vertices are `connected` if there is a path between them

* Some graph-processing problems

- Path. Is there a path between v and w?

- Shortest path. What is the shortest path between v and w?

- Cycle. Is there a cycle in the graph?

- Euler cycle. Is there a cycle that uses each edge exactly once?

- Hamilton cycle. Is there a cycle that uses each vertex exactly once?

- Connectivity. Is there a way to connect all of the vertices?

- MST. What is the best way to connect all of the vertices?

- Bi-connectivity. Is there a vertex whose removal disconnects the graph?

- Planarity. Can you draw the graph in the plane with no crossing edges?

- Graph isomorphism. Do two adjacency lists represent the same graph?

### Undirected Graph API

* Graph drawing provides intuition about the structure of the graph

* Caveat: intuition can be misleading

* Vertex representation

- Will use integers between 0 and V-1

- Applications: convert between names and integers with a symbol table

- Implementations

01. Set-of-edges graph representation

. Maintain a list of the edges (linked list or array)

. Inefficient, which makes it unusable for huge graphs

02. Adjacency-matrix graph representation

. Maintain a two-dimensional V-by-V boolean array, for each edge v-w in graph: adj[v][w] = adj[w][v] = true

. Not widely used: for a huge graph with billions of vertices, the matrix would need billions squared entries

03. Adjacency-list graph representation

. Maintain vertex-indexed array of lists

. Widely used

* Operations: `addEdge(int v, int w), adj(int v), V(), E(), toString()`

`Code`

``` java
public class Graph {

    private final int vertices;
    private int edges;
    private Bag<Integer>[] adj; // adj[v] = vertices adjacent to v

    @SuppressWarnings("unchecked")
    public Graph(int v) {
        this.vertices = v;
        this.edges = 0;
        adj = (Bag<Integer>[]) new Bag[v];
        for (int i = 0; i < v; i++) {
            adj[i] = new Bag<>(); // one empty adjacency list per vertex
        }
    }

    public int getVertices() {
        return vertices;
    }

    public int getEdges() {
        return edges;
    }

    // Undirected edge: record it in both adjacency lists
    public void addEdge(int v, int w) {
        adj[v].add(w);
        adj[w].add(v);
        edges++;
    }

    public int degree(int v) {
        return adj[v].size();
    }

    public Iterable<Integer> adj(int v) {
        return adj[v];
    }
}
```

* In practice, use adjacency-lists representation

- Algorithms based on iterating over vertices adjacent to v

- Real-world graphs tend to be `sparse` (huge number of vertices, small average vertex degree)

| representation   | space | add edge | edge between v and w? | iterate over vertices adjacent to v? |
|------------------|-------|----------|-----------------------|--------------------------------------|
| list of edges    | E     | 1        | E                     | E                                    |
| adjacency matrix | V²    | 1*       | 1                     | V                                    |
| adjacency lists  | E + V | 1        | degree(v)             | degree(v)                            |

> Note: * disallows parallel edges

---

### Depth-first search DFS Undirected search

* Classic graph search algorithm

* On reaching an unvisited vertex, suspend the current vertex and explore the new one first

* Good example: Maze graph

- Vertex = intersection
- Edge = passage

* Algorithm Idea:

- Unroll a ball of string behind you
- Mark each visited intersection and each visited passage
- Retrace steps when no unvisited options

* Goal: systematically search through a graph

* Idea: Mimic maze exploration

* Typical applications:

- Find all vertices connected to a given source vertex
- Find a path between two vertices

* Design pattern: We decouple Graph representation from graph-processing routine

- Create a Graph object
- Pass the Graph to a graph-processing routine
- Query the graph-processing routine for information

* Depth-first search demo

- To visit a vertex v:

+ Mark v as visited
+ Recursively visit all unmarked vertices w adjacent to v

- Put unvisited vertices on a `stack`

`Code`

``` java
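public class DepthFirstPath {
    private boolean[] marked; // marked[v] = is v connected to s?
    private int[] edgeTo;     // edgeTo[v] = previous vertex on the path from s to v
    private int s;            // source vertex

    public DepthFirstPath(Graph G, int s) {
        this.s = s;
        marked = new boolean[G.getVertices()];
        edgeTo = new int[G.getVertices()];
        dfs(G, s);
    }

    private void dfs(Graph G, int v) {
        marked[v] = true;
        for (int w : G.adj(v)) {
            if (!marked[w]) {
                edgeTo[w] = v; // w was first reached from v
                dfs(G, w);
            }
        }
    }
}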
```

* Proposition. DFS marks all vertices connected to s in time proportional to the `sum of their degrees` from single source

- Pf. [correctness]
+ If w marked, then w connected to s (why?)
+ If w connected to s, then w marked
+ If w were unmarked, consider the last edge on a path from s to w that goes from a marked vertex to an unmarked one: DFS would have followed it, a contradiction

- Pf. [running time]
+ Each vertex connected to s is visited once

* Proposition. After DFS, can find vertices connected to s in constant time and can find a path to s (if one exists) in time proportional to its length

- Pf. edgeTo[] is parent-link representation of a tree rooted at s
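Recovering a path from the parent links can be sketched as follows. The helper takes the `marked[]` and `edgeTo[]` arrays as parameters so that it is self-contained; the class name `PathFromEdgeTo` and this signature are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class PathFromEdgeTo {
    // Path from s to v (s first), or null if v is not connected to s
    static Deque<Integer> pathTo(boolean[] marked, int[] edgeTo, int s, int v) {
        if (!marked[v]) return null;
        Deque<Integer> path = new ArrayDeque<>();
        // follow parent links from v back to s, pushing each vertex on a stack
        for (int x = v; x != s; x = edgeTo[x]) {
            path.push(x);
        }
        path.push(s);
        return path; // iterating the stack yields s, ..., v
    }
}
```

Each loop iteration moves one edge closer to s, so the running time is proportional to the length of the path.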

* Challenge. Flood fill (Photoshop magic wand)

- Assumptions. Picture has millions to billions of pixels

+ Solution. Build a grid graph
01. Vertex: pixel
02. Edge: between two adjacent gray pixels
03. Blob: all pixels connected to given pixel
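A minimal flood-fill sketch on an implicit grid graph. It is an illustration under assumptions: pixels are `int` color values, two pixels are adjacent when they share a side and have the same color, and the class `FloodFillSketch` is hypothetical. An explicit stack replaces recursion because a blob of millions of pixels could overflow the call stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class FloodFillSketch {
    // Recolor the blob containing (row, col) and return the number of pixels changed
    static int fill(int[][] image, int row, int col, int newColor) {
        int oldColor = image[row][col];
        if (oldColor == newColor) return 0; // nothing to do, and avoids an endless loop
        int filled = 0;
        Deque<int[]> stack = new ArrayDeque<>();
        stack.push(new int[] {row, col});
        while (!stack.isEmpty()) {
            int[] p = stack.pop();
            int r = p[0], c = p[1];
            if (r < 0 || r >= image.length || c < 0 || c >= image[r].length) continue;
            if (image[r][c] != oldColor) continue; // different color or already filled
            image[r][c] = newColor;                // recoloring doubles as the "marked" flag
            filled++;
            stack.push(new int[] {r + 1, c});
            stack.push(new int[] {r - 1, c});
            stack.push(new int[] {r, c + 1});
            stack.push(new int[] {r, c - 1});
        }
        return filled;
    }
}
```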

---

### Breadth-first search BFS Undirected search

* Explore a vertex and all of its adjacent vertices before moving on to the next level of vertices

* Not a recursive algorithm; it uses a queue

* Put s into a FIFO queue and mark s as visited. Repeat until the queue is empty:

- Remove the least recently added vertex v
- Add each of v's unvisited neighbors to the queue, and mark them as visited

* Proposition. BFS computes the shortest paths (fewest number of edges) from s to all other vertices in a graph in time proportional to `E + V` from single source

- Pf. [correctness] Queue always consists of zero or more vertices of distance k from s, followed by zero or more vertices of distance k + 1

- Pf. [running time] Each vertex connected to s is visited once

`Code`

``` java
public class BreadthFirstPath {
private boolean[] marked;
private int[] edgeTo;
private int[] distTo;

private void bfs(Graph G, int s) {
Queue<Integer> q = new Queue<>();
q.enqueue(s);
marked[s] = true;
distTo[s] = 0;
while (!q.isEmpty()) {
int v = q.dequeue();
for (int w : G.adj(v)) {
if (!marked[w]) {
q.enqueue(w);
marked[w] = true;
edgeTo[w] = v;
distTo[w] = distTo[v] + 1;
}
}
}
}
}
```

* Applications

- routing: Fewest number of hops in a communication network

### BFS vs. DFS

* Depth-first search. Put unvisited vertices on a `stack`

* Breadth-first search. Put unvisited vertices on a `queue`

* Shortest path. Find the path from s to w that uses the `fewest number of edges`

```tree
1
/ \
/ \
2 3
/ \ / \
4 5 6 7

```

* Level order, BFS: 1, 2, 3, 4, 5, 6, 7
* Pre-order, DFS: 1, 2, 4, 5, 3, 6, 7

---

### Connected components

* Connectivity queries

- Def. vertices v and w are connected if there is a path between them
- Goal. Preprocess graph to answer queries of the form `is v connected to w?` in constant time
- Operations: `connected(int v, int w), count(), id(int v)`

* Union-find? not quite
* Depth-first search. Yes

* The relation "is connected to" is an equivalence relation

- Reflexive: v is connected to v
- Symmetric: if v is connected to w, then w is connected to v
- Transitive: if v connected to w and w connected to x, then v connected to x

* Def. A connected component is a maximal set of connected vertices

* Goal. Partition vertices into connected components

* Demo

- To visit a vertex v:

+ Mark vertex v as visited
+ Recursively visit all unmarked vertices adjacent to v

``` java
public class ConnectedComponent {

private boolean[] marked;
private int[] id;
private int count;

public ConnectedComponent(Graph graph) {
marked = new boolean[graph.getVertices()];
id = new int[graph.getVertices()];
for (int s = 0; s < graph.getVertices(); s++) {
if (!marked[s]) {
dfs(graph, s);
count++;
}
}
}

private void dfs(Graph graph, int v) {
marked[v] = true;
id[v] = count; // all vertices discovered from the same source get the same id
for (int w : graph.adj(v)) {
if (!marked[w])
dfs(graph, w);
}
}

/**
* two vertices are connected if they have the same id
*/
public boolean connected(int v, int w) {
return id[v] == id[w];
}

/**
* the id of the component containing vertex v
*/
public int id(int v) {
return id[v];
}

/**
* number of connected components in the graph
*/
public int count() {
return count;
}
}
```

* Applications

- Particle detection

+ Given grayscale image of particles, identify blobs

01. Vertex: pixel
02. Edge: between two adjacent pixels with grayscale value >= 70 (black = 0, white = 255)
03. Blob: connected components of 20-30 pixels

+ Particle tracking. Track moving particle over time
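
A rough sketch of the particle-detection model above: pixels at or above a threshold are vertices, side-by-side pixels are joined by an edge, and a blob is a connected component whose size falls in a given range. The 4-neighbour adjacency, the class name `BlobCounterSketch`, and the parameter names are assumptions made for illustration.

``` java
import java.util.ArrayDeque;
import java.util.Deque;

class BlobCounterSketch {

    // Counts connected components of pixels with value >= threshold
    // whose size lies in [minSize, maxSize].
    // Assumption: two pixels are adjacent when they share a side.
    static int countBlobs(int[][] image, int threshold, int minSize, int maxSize) {
        int rows = image.length, cols = image[0].length;
        boolean[][] marked = new boolean[rows][cols];
        int[] dr = {1, -1, 0, 0};
        int[] dc = {0, 0, 1, -1};
        int blobs = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (marked[r][c] || image[r][c] < threshold) continue;
                // DFS with an explicit stack, so large images do not overflow the call stack.
                int size = 0;
                Deque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[] {r, c});
                marked[r][c] = true;
                while (!stack.isEmpty()) {
                    int[] p = stack.pop();
                    size++;
                    for (int d = 0; d < 4; d++) {
                        int nr = p[0] + dr[d], nc = p[1] + dc[d];
                        if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) continue;
                        if (marked[nr][nc] || image[nr][nc] < threshold) continue;
                        marked[nr][nc] = true;
                        stack.push(new int[] {nr, nc});
                    }
                }
                if (size >= minSize && size <= maxSize) blobs++;
            }
        }
        return blobs;
    }
}
```

With the numbers from the notes, a call would look like `countBlobs(image, 70, 20, 30)`.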

### Graph challenges

* Graph-processing challenge 1

- Problem. Is a graph bipartite?

+ You can use DFS to solve it

+ Divide the vertices into two subsets such that every edge connects a vertex in one subset to a vertex in the other

>0-1
0-2
0-5
0-6

>1-3
2-3
2-4

>4-5
4-6

+ Application: is dating graph bipartite?

- Problem. Find a cycle

+ You can use DFS to solve it

+ Ex. 0-5-4-6-0

- Problem. The seven bridges of Königsberg [1736]

+ Euler tour. Is there a (general) cycle that uses each edge exactly once?

. Answer. A connected graph is Eulerian iff all vertices have `even` degree

- Problem. Find a (general) cycle that uses every edge exactly once

+ Eulerian tour (classic graph-processing problem)

- Problem. Find a cycle that visits every vertex exactly once

+ Closely related to the "travelling salesperson" problem, where each city must be visited exactly once

. Hamiltonian cycle (classical NP-complete problem)

- Problem. Are two graphs identical except for vertex names?

+ Graph isomorphism is a longstanding open problem

+ No one knows how to classify the problem

- Problem. Lay out a graph in the plane without crossing edges?

+ Classic problem in graph processing

+ Linear-time DFS-based planarity algorithm discovered by Tarjan in 1970s
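
As an illustration of the first challenge, here is a sketch of a DFS-based bipartiteness check by two-colouring: each newly discovered vertex gets the opposite colour of its parent, and an edge between two vertices of the same colour proves the graph is not bipartite. The class name `BipartiteSketch` and its constructor signature are hypothetical.

``` java
import java.util.ArrayList;
import java.util.List;

class BipartiteSketch {

    private final List<List<Integer>> adj = new ArrayList<>();
    private final boolean[] marked;
    private final boolean[] color;   // color[v] = side of the partition assigned to v
    private boolean bipartite = true;

    // n = number of vertices, edges[i] = {v, w} for an undirected edge v-w
    BipartiteSketch(int n, int[][] edges) {
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) {
            adj.get(e[0]).add(e[1]);
            adj.get(e[1]).add(e[0]);
        }
        marked = new boolean[n];
        color = new boolean[n];
        for (int s = 0; s < n; s++) {
            if (!marked[s]) dfs(s);
        }
    }

    private void dfs(int v) {
        marked[v] = true;
        for (int w : adj.get(v)) {
            if (!marked[w]) {
                color[w] = !color[v]; // put w on the opposite side from v
                dfs(w);
            } else if (color[w] == color[v]) {
                bipartite = false;    // edge inside one side: odd cycle
            }
        }
    }

    boolean isBipartite() {
        return bipartite;
    }
}
```

On the nine example edges listed under the bipartite problem, this check reports `true` (one valid split is {0, 3, 4} and {1, 2, 5, 6}), while a triangle reports `false`.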

---

## Directed graphs

* Edges now have direction

* Digraph: set of vertices connected pairwise by `directed` edges

* Examples

- road network
+ Vertex = intersection
+ Edge = one-way street

- Political blogosphere graph
+ Vertex = political blog
+ Edge = link

- Overnight interbank loan graph
+ Vertex = bank
+ Edge = overnight loan

- Implication graph
+ Vertex = variable
+ Edge = logical implication

- Combinational circuit
+ Vertex = logical gate
+ Edge = wire

- WordNet graph
+ Vertex = synset
+ Edge = hypernym relationship

* Digraph applications

|Digraph|Vertex|Directed Edge|
|-------|------|-------------|
|web|web page|hyperlink|
|game|board position|legal move|
|object graph|object|pointer|
|cell phone|person|placed call|
|financial|bank|transaction|
|control flow|code block|jump|
|Inheritance hierarchy |class|inherits from|

* Some digraph problems:

- Is there a directed path from s to t?

- Shortest path, What's the shortest directed path from s to t?

- Topological sort, Can you draw a digraph so that all edges point upwards?

- Strong connectivity, Is there a directed path between all pairs of vertices?

- Transitive closure, For which vertices v and w is there a path from v to w?

- PageRank, What is the importance of a web page?

### Digraph API

* Operations: `addEdge(int v, int w), adj(int v), vertices(), edges(), reverse(), toString()`

``` java
public class Digraph {

private static final String NEW_LINE = System.lineSeparator();

private final int vertices;
private int edges;
private Bag<Integer>[] adj;

public Digraph(int v) {
this.vertices = v;
this.edges = 0;
adj = (Bag<Integer>[]) new Bag[v];
for (int i = 0; i < v; i++) {
adj[i] = new Bag<>();
}
}

public int getVertices() {
return vertices;
}

public int getEdges() {
return edges;
}

public void addEdge(int v, int w) {
adj[v].add(w);
edges++;
}

/**
* number of vertices adjacent from vertex v (its out-degree)
*/
public int degree(int v) {
return adj[v].size();
}

/**
* vertices adjacent from vertex v
*/
public Iterable<Integer> adj(int v) {
return adj[v];
}

/**
* Returns the reverse of the digraph
*/
public Digraph reverse() {
Digraph reverse = new Digraph(vertices);
for (int v = 0; v < vertices; v++) {
for (int w : adj(v)) {
reverse.addEdge(w, v);
}
}
return reverse;
}

@Override
public String toString() {
StringBuilder builder = new StringBuilder();
for (int v = 0; v < vertices; v++) {
builder.append(v + ": ");
for (int w : adj[v]) {
builder.append(w + " ");
}
builder.append(NEW_LINE);
}
return builder.toString();
}
}
```

* Digraph representations

- In practice. Use adjacency-list representation

+ Algorithms based on iterating over vertices adjacent to v

+ Real-world graphs tend to be sparse

| representation   | space | add edge | edge between v and w? | iterate over vertices adjacent to v? |
|------------------|-------|----------|-----------------------|--------------------------------------|
| list of edges    | E     | 1        | E                     | E                                    |
| adjacency matrix | V²    | 1 †      | 1                     | V                                    |
| adjacency lists  | E + V | 1        | outdegree(v)          | outdegree(v)                         |

† disallows parallel edges

### Depth-first search in digraphs

* Problem. find all vertices reachable from s along a directed path

* Same method as for undirected graphs

- Every undirected graph is a digraph (with edges in both directions)

- DFS is a `digraph` algorithm

* Applications:

01. Reachability application: Program control-flow analysis

- Every program is a digraph

+ Vertex = basic block of instructions (straight-line program)

+ Edge = jump

- Dead-code elimination

+ find and remove unreachable code

- Infinite-loop detection

+ Determine whether exit is unreachable

02. Mark-sweep garbage collector

- Every data structure is a digraph

+ Vertex = object
+ Edge = reference

- Memory cost. Uses 1 extra mark bit per object (plus DFS stack)

- Roots. Objects known to be directly accessible by program (e.g. stack)

- Reachable objects. Objects indirectly accessible by program (starting at a root and following a chain of pointers)

- Mark-sweep algorithm. [McCarthy, 1960]
+ Mark: mark all reachable objects
+ Sweep: if object is unmarked, it is garbage (so add to free list)

* DFS enables direct solution of simple digraph problems

01. Reachability
02. Path finding
03. Topological sort
04. Directed cycle detection

* Basis for solving difficult digraph problems

01. 2-satisfiability
02. Directed Euler path
03. Strongly-connected components
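
The mark-sweep idea described in this section can be sketched as a reachability computation: mark everything reachable from the roots, then report the unmarked objects as garbage. Objects are modelled as integer ids and references as an array of arrays; the class name `MarkSweepSketch` and this representation are assumptions for illustration, not how a real collector stores its heap.

``` java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class MarkSweepSketch {

    // refs[o] = ids of the objects that object o references.
    // roots   = ids of objects directly accessible by the program.
    // Returns the ids of unreachable (garbage) objects in increasing order.
    static List<Integer> garbage(int[][] refs, int[] roots) {
        boolean[] marked = new boolean[refs.length]; // one mark bit per object
        Deque<Integer> stack = new ArrayDeque<>();   // explicit DFS stack

        // Mark phase: DFS from every root.
        for (int r : roots) {
            if (!marked[r]) {
                marked[r] = true;
                stack.push(r);
            }
        }
        while (!stack.isEmpty()) {
            int v = stack.pop();
            for (int w : refs[v]) {
                if (!marked[w]) {
                    marked[w] = true;
                    stack.push(w);
                }
            }
        }

        // Sweep phase: every unmarked object is garbage.
        List<Integer> free = new ArrayList<>();
        for (int o = 0; o < refs.length; o++) {
            if (!marked[o]) free.add(o);
        }
        return free;
    }
}
```

Note that two objects referencing each other are still collected when no root reaches them, which is why reachability (and not reference counts) decides what is garbage here.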

### Breadth-first search in digraphs

* BFS is a `digraph` algorithm

* Proposition. BFS computes shortest paths (fewest number of edges) from s to all other vertices in a digraph in time proportional to `E + V`

* Multiple-source shortest paths

- Given a digraph and a set of source vertices, find the shortest path from any vertex in the set to each other vertex

- Q. How to implement multi-source constructor?
+ A. Use BFS, but initialize by `enqueuing all source vertices`

* Web crawler

- Goal. Crawl web, starting from some root web page, say www.princeton.edu

+ Solution. [BFS with implicit digraph]

. Choose root web page as source s

. Maintain a Queue of websites to explore

. Maintain a SET of discovered websites

. Dequeue the next website and enqueue websites to which it links (provided you haven't done so before)

- Q. Why not use DFS?

+ A. DFS can wander arbitrarily far from the root before exploring nearby pages, and some sites generate new pages on every visit, which can trap it

`Code`

``` java
public class BareBonesWebCrawler {

private Queue<String> queue = new ArrayQueue<>();
private Set<String> discovered = new HashSet<>();
private static final String REGEX = "https://(\\w+\\.)*(\\w+)";

public BareBonesWebCrawler(String root) {
queue.enqueue(root);
discovered.add(root);
bfs();
}

private void bfs() {
while (!queue.isEmpty()) {
String v = queue.dequeue();
System.out.println(v);
String input = readRawHtml(v);
Pattern pattern = Pattern.compile(REGEX);
Matcher matcher = pattern.matcher(input);

while (matcher.find()) {
String w = matcher.group();
if (!discovered.contains(w)) {
discovered.add(w);
queue.enqueue(w);
}
}
}
}

private String readRawHtml(String url) {
URL u;
try {
u = new URL(url);
URLConnection conn = u.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
StringBuffer buffer = new StringBuffer();
String inputLine;
while ((inputLine = in.readLine()) != null) {
buffer.append(inputLine);
}
in.close();
// System.out.println(buffer.toString());
return buffer.toString();
} catch (IOException e) {
e.printStackTrace();
}
return "";
}

public static void main(String[] args) {
new BareBonesWebCrawler("https://github.com/openjdk");
}
}
```
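
The multiple-source shortest-paths idea from this section (enqueue all sources before the loop starts) can be sketched as follows. The digraph is given as `adj[v]` = vertices that `v` points to; the class name `MultiSourceBfsSketch` and the `-1` marker for unreachable vertices are choices made for this example.

``` java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

class MultiSourceBfsSketch {

    // Returns distTo[], where distTo[v] is the fewest number of edges on a
    // directed path from any source to v, or -1 if no source reaches v.
    static int[] distances(int[][] adj, int[] sources) {
        int[] distTo = new int[adj.length];
        Arrays.fill(distTo, -1);

        Queue<Integer> q = new ArrayDeque<>();
        // The only change from single-source BFS: enqueue every source at distance 0.
        for (int s : sources) {
            if (distTo[s] == -1) {
                distTo[s] = 0;
                q.add(s);
            }
        }
        while (!q.isEmpty()) {
            int v = q.remove();
            for (int w : adj[v]) {
                if (distTo[w] == -1) {
                    distTo[w] = distTo[v] + 1;
                    q.add(w);
                }
            }
        }
        return distTo;
    }
}
```

Here `distTo` doubles as the `marked` array: a vertex is unvisited exactly when its distance is still `-1`.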

### Topological sort

* Goal. Given a set of tasks to be completed with precedence constraints, in which order should we schedule the tasks?

* Digraph model

- Vertex = task

- Edge = precedence constraint

* DAG
- Directed `acyclic` graph: a digraph with no directed cycle

- If there is a directed cycle, the scheduling problem has no solution

* Topological sort. Redraw DAG so all edges point upwards

* Solution. DFS

* Demo

01. Run depth-first search
02. Return vertices in reverse postorder

`Code`

``` java
public class DepthFirstOrder {

    private boolean[] marked;
    private Stack<Integer> reverseOrder;

    public DepthFirstOrder(Digraph graph) {
        reverseOrder = new ArrayStack<>();
        marked = new boolean[graph.getVertices()];
        for (int v = 0; v < graph.getVertices(); v++) {
            if (!marked[v]) {
                dfs(graph, v);
            }
        }
    }

    private void dfs(Digraph graph, int v) {
        marked[v] = true;
        for (int w : graph.adj(v)) {
            if (!marked[w]) {
                dfs(graph, w);
            }
        }
        // v is done only after every vertex reachable from it, so push it last
        reverseOrder.push(v);
    }

    // return all vertices in reverse DFS post-order
    public Stack<Integer> reversePost() {
        return reverseOrder;
    }
}
```

* Detect a cycle in a directed graph

``` java
public class DirectedCycleDetector {

private boolean[] marked;
private int[] edgeTo;
private Stack<Integer> cycle; // vertices on a cycle
private boolean[] onStack; // vertices on recursive call stack

public DirectedCycleDetector(Digraph graph) {
onStack = new boolean[graph.getVertices()];
marked = new boolean[graph.getVertices()];
edgeTo = new int[graph.getVertices()];

for (int v = 0; v < graph.getVertices(); v++) {
if (!marked[v] && cycle == null)
dfs(graph, v);
}
}

private void dfs(Digraph graph, int v) {
onStack[v] = true;
marked[v] = true;
for (int w : graph.adj(v)) {
if (cycle != null) {
// short circuit if directed cycle found
return;
} else if (!marked[w]) {
// found new vertex, so recur
edgeTo[w] = v;
dfs(graph, w);
} else if (onStack[w]) {
// trace back directed cycle
cycle = new Stack<>();
for (int x = v; x != w; x = edgeTo[x]) {
cycle.push(x);
}
cycle.push(w);
cycle.push(v);
}
}
onStack[v] = false;
}

public boolean hasCycle() {
return cycle != null;
}

public Iterable<Integer> cycle() {
return cycle;
}
}
```

* Check that there is no cycle first, then compute the order

``` java
public class TopologicalSort {

private Stack<Integer> order;

public TopologicalSort(Digraph graph) {
DirectedCycleDetector finder = new DirectedCycleDetector(graph);
if (!finder.hasCycle()) {
DepthFirstOrder dfs = new DepthFirstOrder(graph);
order = dfs.reversePost();
}
}

public Stack<Integer> order() {
return order;
}

public boolean hasOrder() {
return order != null;
}

public boolean isDag() {
return hasOrder();
}
}
```
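
A self-contained sketch of the reverse-postorder idea, independent of the `Digraph` and `Stack` classes in these notes. It assumes the input really is a DAG (no cycle check is done); the class name `TopologicalSketch` and the `int[][]` adjacency format are hypothetical.

``` java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class TopologicalSketch {

    // adj[v] = vertices that v points to.
    // Assumption: the digraph has no directed cycle; this sketch does not check.
    static List<Integer> order(int[][] adj) {
        boolean[] marked = new boolean[adj.length];
        Deque<Integer> reversePost = new ArrayDeque<>();
        for (int v = 0; v < adj.length; v++) {
            if (!marked[v]) dfs(adj, v, marked, reversePost);
        }
        // Iterating the deque from its head gives the last-finished vertex first.
        return new ArrayList<>(reversePost);
    }

    private static void dfs(int[][] adj, int v, boolean[] marked, Deque<Integer> reversePost) {
        marked[v] = true;
        for (int w : adj[v]) {
            if (!marked[w]) dfs(adj, w, marked, reversePost);
        }
        reversePost.push(v); // pushed only after everything reachable from v is done
    }
}
```

A quick sanity check for any result is that every edge `v → w` has `v` earlier in the list than `w`.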

* Used for package management, like maven, brew, etc

* Proposition. Reverse DFS post-order of a DAG is a topological order

- Pf. [correctness] Consider any edge v → w. When dfs(v) is called

+ Case 1: dfs(w) has already been called and returned. Thus, w was done before v

+ Case 2: dfs(w) has not yet been called, dfs(w) will get called directly or indirectly by dfs(v) and will finish before dfs(v). Thus, w will be done before v

+ Case 3: dfs(w) has already been called, but has not yet returned. Can't happen in a DAG: function call stack contains path from w to v, so v→w would complete a cycle

* Directed cycle detection

- Proposition. A digraph has a topological order iff it has no directed cycle

+ Pf.

01. If directed cycle, topological order impossible

02. If no directed cycle, DFS-based algorithm finds a topological order

- Goal. Given a digraph, find a directed cycle

+ Solution. DFS

- Applications

01. Scheduling. Given a set of tasks to be completed with precedence constraints, in what order should we schedule the tasks?

. Remark. A directed cycle implies scheduling problem is infeasible

02. Java compiler does cycle detection (cyclic inheritance)

03. Microsoft Excel does cycle detection (and has a circular reference toolbar!)


### Strong components

* Def. Vertices v and w are `strongly connected` if there is a directed path from v to w and a directed path from w to v

* Key property. Strong connectivity is an `equivalence relation`

- v is strongly connected to v
- If v is strongly connected to w, then w is strongly connected to v
- If v is strongly connected to w and w is strongly connected to x, then v is strongly connected to x

* Compare with undirected graphs: v and w are `connected` if there is a path between v and w

* Applications

- Food web graph

+ Vertex = species
+ Edge = from producer to consumer
+ Strong component. Subset of species with common energy flow

- Software modules

+ Vertex = software module
+ Edge = from module to dependency
+ Strong component. Subset of mutually interacting modules
01. Approach 1. Package strong components together
02. Approach 2. Use to improve design

* Kosaraju-Sharir algorithm

- Reverse graph. Strong components in G are the same as in G^R (the reverse of G)

- Kernel DAG. Contract each strong component into a single vertex

- Idea
+ Compute topological order (reverse post-order) in kernel DAG
+ Run DFS, considering vertices in reverse topological order

- Demo

01. Phase 1. Compute reverse postorder in G^R
02. Phase 2. Run DFS in G, visiting unmarked vertices in reverse postorder of G^R

- Simple (but mysterious) algorithm for computing strong components

- Proposition. Kosaraju-Sharir algorithm computes the strong components of a digraph in time proportional to E + V

+ pf.

01. Running time: bottleneck is running DFS twice (and computing G^R)
02. Correctness: tricky
03. Implementation: easy

`Code`

``` java
public class KosarajuSharirCC {

private boolean[] marked;
private int[] id;
private int count;

public KosarajuSharirCC(Digraph graph) {
marked = new boolean[graph.getVertices()];
id = new int[graph.getVertices()];
count = 0;
// Phase 1 runs on the reverse graph; Phase 2 below runs DFS on the original graph
DepthFirstOrder order = new DepthFirstOrder(graph.reverse());
for (int s : order.reversePost()) {
if (!marked[s]) {
dfs(graph, s);
count++;
}
}
}

private void dfs(Digraph graph, int v) {
marked[v] = true;
id[v] = count;
for (int w : graph.adj(v)) {
if (!marked[w]) {
dfs(graph, w);
}
}
}

public boolean stronglyConnected(int v, int w) {
return id[v] == id[w];
}

public int id(int v) {
return id[v];
}

public int count() {
return count;
}
}
```
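
For experimenting without the `Digraph` and `DepthFirstOrder` classes, here is a self-contained sketch of the same two phases using `java.util` collections. The class name `StrongComponentsSketch`, the `int[][]` input format, and the helper names are hypothetical.

``` java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class StrongComponentsSketch {

    // adj[v] = vertices that v points to. Returns id[], where id[v] == id[w]
    // exactly when v and w are strongly connected.
    static int[] componentIds(int[][] adj) {
        int n = adj.length;
        List<List<Integer>> g = new ArrayList<>();
        List<List<Integer>> gr = new ArrayList<>(); // reverse graph G^R
        for (int v = 0; v < n; v++) {
            g.add(new ArrayList<>());
            gr.add(new ArrayList<>());
        }
        for (int v = 0; v < n; v++) {
            for (int w : adj[v]) {
                g.get(v).add(w);
                gr.get(w).add(v);
            }
        }

        // Phase 1: reverse postorder of G^R (head of the deque = last vertex finished).
        boolean[] marked = new boolean[n];
        Deque<Integer> reversePost = new ArrayDeque<>();
        for (int v = 0; v < n; v++) {
            if (!marked[v]) postOrder(gr, v, marked, reversePost);
        }

        // Phase 2: DFS in G, taking sources in that order; each DFS tree is one strong component.
        int[] id = new int[n];
        Arrays.fill(id, -1);
        int count = 0;
        for (int s : reversePost) {
            if (id[s] == -1) {
                label(g, s, id, count);
                count++;
            }
        }
        return id;
    }

    private static void postOrder(List<List<Integer>> graph, int v, boolean[] marked, Deque<Integer> out) {
        marked[v] = true;
        for (int w : graph.get(v)) {
            if (!marked[w]) postOrder(graph, w, marked, out);
        }
        out.push(v);
    }

    private static void label(List<List<Integer>> graph, int v, int[] id, int component) {
        id[v] = component;
        for (int w : graph.get(v)) {
            if (id[w] == -1) label(graph, w, id, component);
        }
    }
}
```

On the digraph `0→1, 1→2, 2→0, 2→3, 3→4, 4→3`, vertices 0, 1, 2 end up with one id and vertices 3, 4 with another.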

---

## MST (Minimum spanning trees)

* Given. `Undirected graph` G with positive edge weights (connected)

* Def. A `spanning tree` of G is a `subgraph` T that is both a `tree` (connected and `acyclic`) and `spanning` (includes all of the vertices)

* Goal. Find a min weight spanning tree

* Applications

- Dithering
- Cluster analysis
- Real-time face verification
- Image registration with Renyi entropy
- Find road networks in satellite and aerial imagery
- Network design (communication, electrical, computer, road)
- Autoconfig protocol for Ethernet bridging to avoid cycles in a network
- Reducing data storage in sequencing amino acids in a protein

### Greedy algorithm

* General principle of algorithm design

* Simplifying assumptions

- Edge weights are `distinct`
- Graph is `connected`

* Under these assumptions, the MST exists and is unique

* Cut property

- Def. A `cut` in a graph is a partition of its vertices into two non-empty sets
- Def. A `crossing edge` connects a vertex in one set with a vertex in the other

- Cut property. Given any cut, the crossing edge of min weight is in the MST

- Pf. Suppose min-weight crossing edge *e* is not in the MST

+ Adding e to the MST creates a cycle
+ Some other edge *f* in cycle must be a crossing edge
+ Removing *f* and adding *e* is also a spanning tree
+ Since the weight of *e* is less than the weight of *f*, that spanning tree has lower weight, a contradiction

* Greedy MST algorithm demo

- Start with all edges colored gray
- Find cut with no black crossing edges, color its min-weight edge black
- Repeat until V-1 edges are colored black

* Proposition. The greedy algorithm computes the MST

- Pf. [correctness]

+ Any edge colored black is in the MST (via cut property)
+ Fewer than *V-1* black edges → cut with no black crossing edges (consider cut whose vertices are one connected component)

* Efficient implementations. Choose cut? find min-weight edge?
01. Kruskal's algorithm [stay tuned]
02. Prim's algorithm [stay tuned]
03. Boruvka's algorithm

* Q. What if edge weights are not all distinct?
- A. Greedy MST algorithm still correct if equal weights are present!

* Q. What if the graph is not connected?
- A. Compute minimum spanning forest = MST of each component
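
To make the cut property concrete, this sketch finds the minimum-weight crossing edge of a given cut by scanning all edges. Edges are stored as `{v, w, weight}` triples with integer weights, and the cut is a boolean array; both representations and the class name `CutSketch` are assumptions for illustration.

``` java
class CutSketch {

    // edges[i] = {v, w, weight}; inS[v] is true when v is on the first side of the cut.
    // Returns the index of the minimum-weight crossing edge, or -1 if no edge crosses the cut.
    static int minCrossingEdge(int[][] edges, boolean[] inS) {
        int best = -1;
        for (int i = 0; i < edges.length; i++) {
            int[] e = edges[i];
            boolean crossing = inS[e[0]] != inS[e[1]]; // endpoints on different sides
            if (crossing && (best == -1 || e[2] < edges[best][2])) {
                best = i;
            }
        }
        return best;
    }
}
```

By the cut property, whichever edge this returns belongs to the MST (assuming distinct weights), no matter how the cut was chosen.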

### Edge-weight graph API

* Edge abstraction needed for weighted edges

* Idiom for processing an edge e: `int v = e.either(), w = e.other(v); `

``` java
public class Edge implements Comparable<Edge> {

    private final int v, w;
    private final double weight;

    public Edge(int v, int w, double weight) {
        this.v = v;
        this.w = w;
        this.weight = weight;
    }

    /**
     * Either endpoint
     */
    public int either() {
        return v;
    }

    /**
     * Other endpoint
     */
    public int other(int vertex) {
        if (vertex == v)
            return w;
        else
            return v;
    }

    /**
     * Weight of this edge
     */
    public double weight() {
        return weight;
    }

    @Override
    public int compareTo(Edge that) {
        return Double.compare(this.weight, that.weight);
    }
}
```

* Graph API with weighted edges: `EdgeWeightedGraph`

``` java
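public class EdgeWeightedGraph {

    private final int vertices;
    private Bag<Edge>[] adj;

    public EdgeWeightedGraph(int v) {
        this.vertices = v;
        adj = (Bag<Edge>[]) new Bag[v];
        for (int i = 0; i < v; i++) {
            adj[i] = new Bag<>(); // index by the loop variable i, not by v
        }
    }

    public int getVertices() {
        return vertices;
    }

    public void addEdge(Edge edge) {
        int v = edge.either(), w = edge.other(v);
        adj[v].add(edge);
        adj[w].add(edge);
    }

    public Iterable<Edge> adj(int v) {
        return adj[v];
    }

    /**
     * All edges, each reported once (self-loops are skipped in this version)
     */
    public Iterable<Edge> edges() {
        Bag<Edge> all = new Bag<>();
        for (int v = 0; v < vertices; v++) {
            for (Edge e : adj[v]) {
                if (e.other(v) > v) {
                    all.add(e);
                }
            }
        }
        return all;
    }
}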
```

* Edge-weighted graph: adjacency-list representation

- Maintain vertex-indexed array of Edge lists

* Minimum spanning tree API

- Q. How to represent the MST?

``` java
public class MST {
MST(EdgeWeightedGraph G)
Iterable<Edge> edges() // edges in MST
double weight() // weight of MST
}
```

### Kruskal's algorithm

* Classical algorithm for computing the MST
* Steps:

- Consider edges in ascending order of weight
- Add next edge to tree T unless doing so would create a cycle
- Ignore edges that would create a cycle

* Proposition. [Kruskal 1956] Kruskal's algorithm computes the MST

- Pf. Kruskal's algorithm is a special case of the greedy MST algorithm

+ Suppose Kruskal's algorithm colors the edge *e=v-w* black

+ Cut = set of vertices connected to *v* in tree *T*

+ No crossing edge is black

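+ No crossing edge has lower weight. Why? Edges are considered in ascending order of weight, so a lower-weight crossing edge would already have been examined and added (it joins two different components, so it cannot create a cycle), contradicting the previous point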

* Challenge. Would adding edge *v-w* to tree *T* create a cycle? if not, add it

- Time proportional to *V*: run DFS from v and check whether w is reachable (T has at most V-1 edges)

- Time proportional to `log* V`: use the union-find data structure

- Efficient solution. Use the `Union-find` data structure

. Maintain a set for each connected component in T

. If v and w are in the same set, then adding v-w would create a cycle

. To add v-w to T, merge sets containing v and w
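
The union-find cycle check described above can be sketched in a self-contained form. This version stores edges as `{v, w, weight}` triples with integer weights and returns only the total MST weight; the class name `KruskalSketch`, the edge format, and the choice of weighted union with path halving are assumptions for this example, not the classes used elsewhere in these notes.

``` java
import java.util.Arrays;

class KruskalSketch {

    private final int[] parent; // parent[i] = parent of i in its set's tree
    private final int[] size;   // size[r] = number of elements in the tree rooted at r

    KruskalSketch(int n) {
        parent = new int[n];
        size = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
            size[i] = 1;
        }
    }

    int find(int p) {
        while (p != parent[p]) {
            parent[p] = parent[parent[p]]; // path halving
            p = parent[p];
        }
        return p;
    }

    // Merges the sets containing p and q.
    // Returns false if they were already in the same set (edge p-q would create a cycle).
    boolean union(int p, int q) {
        int rootP = find(p), rootQ = find(q);
        if (rootP == rootQ) return false;
        if (size[rootP] < size[rootQ]) {
            int t = rootP;
            rootP = rootQ;
            rootQ = t;
        }
        parent[rootQ] = rootP; // attach the smaller tree under the larger one
        size[rootP] += size[rootQ];
        return true;
    }

    // edges[i] = {v, w, weight}. Returns the total weight of the MST
    // (or of the minimum spanning forest if the graph is not connected).
    static long mstWeight(int n, int[][] edges) {
        int[][] sorted = edges.clone();
        Arrays.sort(sorted, (a, b) -> Integer.compare(a[2], b[2])); // ascending weight
        KruskalSketch uf = new KruskalSketch(n);
        long total = 0;
        for (int[] e : sorted) {
            if (uf.union(e[0], e[1])) total += e[2];
        }
        return total;
    }
}
```

Sorting the whole edge array up front stands in for the priority queue; the cycle test itself is the single `union` call.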

`Code`

``` java
public class KruskalMST {

private double weight;
private Queue<Edge> mst;

public KruskalMST(EdgeWeightedGraph graph) {
MinPQ<Edge> pq = new MinPQ<>(graph.getVertices());
mst = new ArrayQueue<>();
for (Edge e : graph.edges())
pq.insert(e);

UnionFind unionFind = new WeightedQuickUnionPassCompression(graph.getVertices());

while (!pq.isEmpty() && mst.size() < graph.getVertices() - 1) {
Edge e = pq.delMin();
int v = e.either(), w = e.other(v);
if (!unionFind.isConnected(v, w)) {
unionFind.union(v, w);
mst.enqueue(e);
weight += e.weight();
}
}
}

public Iterable edges