https://github.com/robertdavidgraham/wc2

Investigates optimizing 'wc', the Unix word count program

# wc2 - asynchronous state machine parsing

There have been multiple articles lately implementing the
classic `wc` program in various programming *languages*, to
"prove" their favorite language can be "just as fast" as C.

This project does something different.
Instead of a different *language* it uses a different *algorithm*.
The new algorithm is significantly faster: even an implementation in a
slow language like JavaScript is faster than the original
`wc` program written in C.

The algorithm is known as an "asynchronous state-machine parser".
It's a technique for *parsing* that you don't learn in college.
It's more *efficient*, but more importantly, it's more *scalable*.
That's why your browser uses a state-machine to parse GIFs,
and most web servers use state-machines to parse incoming HTTP requests.

This project contains three versions:
* `wc2o.c` is a simplified 25-line version highlighting the idea
* `wc2.c` is the full version in C, supporting Unicode
* `wc2.js` is the version in JavaScript

## The basic algorithm

The algorithm reads input and passes each byte one at a time
to a state-machine. It looks something like:

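```c
length = fread(buf, 1, sizeof(buf), fp);
for (i=0; i<length; i++) {
    c = buf[i];
    state = table[state][column[c]];
    counts[state]++;
}
```

The work of defining the state-machine happens in the lookup tables.
A complete simplified version for ASCII looks like:

```c
#include <stdio.h>

int main(void)
{
    /* table[state][column]: the next state for each class of byte */
    static const unsigned char table[4][4] = {
        {2,0,1,0,}, {2,0,1,0,}, {3,0,1,0,}, {3,0,1,0,}
    };
    /* column[byte]: 0 = word character, 1 = space, 2 = newline */
    static const unsigned char column[256] = {
        0,0,0,0,0,0,0,0,0,1,2,1,1,1,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,
    };
    unsigned long counts[4] = {0,0,0,0};
    int state = 0;
    int c;

    while ((c = getchar()) != EOF) {
        state = table[state][column[c]];
        counts[state]++;
    }

    /* prints: lines, words, characters */
    printf("%lu %lu %lu\n", counts[1], counts[2],
           counts[0] + counts[1] + counts[2] + counts[3]);
    return 0;
}
```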

The key part that does all the word counting is the two lines inside the loop:

```c
while ((c = getchar()) != EOF) {
    state = table[state][column[c]];
    counts[state]++;
}
```

This is only defined for ASCII, so you can see the whole state-machine on a
single line of the code (`table`).
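
To make that single line easier to read, here is a minimal sketch of the same 4x4 table with names in place of the bare numbers. The names (`WAS_SPACE`, `NEW_LINE`, `NEW_WORD`, `WAS_WORD`, `C_WORD`, `C_SPACE`, `C_NEWLINE`) and the helper `count_text` are assumptions made for illustration and do not come from the project. The `column` lookup is filled in at run time here rather than written out as a literal.

```c
#include <stddef.h>

/* Hypothetical names for the four states; the simplified program uses 0..3. */
enum { WAS_SPACE = 0, NEW_LINE = 1, NEW_WORD = 2, WAS_WORD = 3 };

/* Hypothetical names for the byte classes stored in `column`. */
enum { C_WORD = 0, C_SPACE = 1, C_NEWLINE = 2 };

/* Count lines, words, and characters in a NUL-terminated ASCII string. */
static void count_text(const char *text, unsigned long *lines,
                       unsigned long *words, unsigned long *chars)
{
    /* Rows: current state. Columns: class of the next byte (4th is unused). */
    static const unsigned char table[4][4] = {
        /* C_WORD    C_SPACE    C_NEWLINE  unused    */
        { NEW_WORD, WAS_SPACE, NEW_LINE, WAS_SPACE }, /* from WAS_SPACE */
        { NEW_WORD, WAS_SPACE, NEW_LINE, WAS_SPACE }, /* from NEW_LINE  */
        { WAS_WORD, WAS_SPACE, NEW_LINE, WAS_SPACE }, /* from NEW_WORD  */
        { WAS_WORD, WAS_SPACE, NEW_LINE, WAS_SPACE }, /* from WAS_WORD  */
    };
    unsigned char column[256] = {0};   /* everything defaults to C_WORD */
    unsigned long counts[4] = {0, 0, 0, 0};
    const unsigned char *p = (const unsigned char *)text;
    int state = WAS_SPACE;
    size_t i;

    /* Same byte classes as the literal `column` table in the simplified program. */
    column['\t'] = column['\v'] = column['\f'] = column['\r'] = C_SPACE;
    column[' '] = C_SPACE;
    column['\n'] = C_NEWLINE;

    for (i = 0; p[i] != '\0'; i++) {
        state = table[state][column[p[i]]];
        counts[state]++;   /* entering NEW_WORD counts a word, NEW_LINE a line */
    }

    *lines = counts[NEW_LINE];
    *words = counts[NEW_WORD];
    *chars = counts[0] + counts[1] + counts[2] + counts[3];
}
```

Under these assumptions, calling `count_text("hello world\nfoo\n", &lines, &words, &chars)` should yield 2 lines, 3 words, and 16 characters. A word is counted only on the transition into `NEW_WORD`, which is why the loop body needs no `if` statements.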

## Additional tools

This project includes additional tools:
* `wctool` to generate large test files
* `wcdiff` to find differences between two implementations of `wc`
* `wcstream` to fragment input files (demonstrates a bug in macOS's `wc`)
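
Fragmented input is where a state-machine parser is convenient: all of its context lives in one `state` variable, so the counts should not depend on where the input is split. The following is a minimal sketch of that property, reusing the ASCII table from the simplified program. The `struct wc_parser` type and the `wc_init`, `wc_column`, and `wc_feed` helpers are hypothetical names introduced here, not part of the project's tools.

```c
#include <stddef.h>

/* Hypothetical container for everything that must survive between fragments. */
struct wc_parser {
    int state;
    unsigned long counts[4];   /* [1] = lines, [2] = words, sum = characters */
};

/* Same ASCII transition table as the simplified program. */
static const unsigned char wc_table[4][4] = {
    {2,0,1,0}, {2,0,1,0}, {3,0,1,0}, {3,0,1,0}
};

/* Classify a byte: 2 = newline, 1 = other whitespace, 0 = anything else. */
static int wc_column(unsigned char c)
{
    if (c == '\n')
        return 2;
    if (c == ' ' || c == '\t' || c == '\v' || c == '\f' || c == '\r')
        return 1;
    return 0;
}

static void wc_init(struct wc_parser *p)
{
    p->state = 0;
    p->counts[0] = p->counts[1] = p->counts[2] = p->counts[3] = 0;
}

/* Feed one fragment of any size; call again when the next fragment arrives. */
static void wc_feed(struct wc_parser *p, const unsigned char *buf, size_t length)
{
    size_t i;

    for (i = 0; i < length; i++) {
        p->state = wc_table[p->state][wc_column(buf[i])];
        p->counts[p->state]++;
    }
}
```

Feeding `"hello wo"` and then `"rld\n"` to one parser should give the same counts as feeding `"hello world\n"` in a single call, because the split word is never seen as two words: the parser is still in the "inside a word" state when the second fragment begins.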

The full program `wc2.c` has the same logic as the simplified ASCII version,
the difference being that it generates a larger state-machine for parsing UTF-8.

## Pointer arithmetic

C has a peculiar idiom called "pointer arithmetic", where pointers can
be incremented. Looping through a buffer is done with an expression like
`*buf++` instead of `buf[i++]`. Many programmers think pointer arithmetic
is faster.

To test this, the `wc2.c` program has an option `-P` that makes this
small change, so the difference in speed can be measured.
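
For readers unfamiliar with the idiom, here is a minimal sketch of the two loop styles side by side, applied to the ASCII state-machine from the simplified program. It is not the code behind `-P` in `wc2.c`; the function names `count_words_indexed` and `count_words_pointer` and the helper `pa_column` are assumptions made for illustration. The sketch makes no claim about which form is faster.

```c
#include <stddef.h>

/* Same ASCII transition table as the simplified program. */
static const unsigned char pa_table[4][4] = {
    {2,0,1,0}, {2,0,1,0}, {3,0,1,0}, {3,0,1,0}
};

/* Classify a byte: 2 = newline, 1 = other whitespace, 0 = anything else. */
static int pa_column(unsigned char c)
{
    if (c == '\n')
        return 2;
    if (c == ' ' || c == '\t' || c == '\v' || c == '\f' || c == '\r')
        return 1;
    return 0;
}

/* Array-index style: the buffer pointer stays fixed and `i` advances. */
static unsigned long count_words_indexed(const unsigned char *buf, size_t length)
{
    unsigned long counts[4] = {0, 0, 0, 0};
    int state = 0;
    size_t i;

    for (i = 0; i < length; i++) {
        state = pa_table[state][pa_column(buf[i])];
        counts[state]++;
    }
    return counts[2];
}

/* Pointer-arithmetic style: the pointer itself advances until it reaches `end`. */
static unsigned long count_words_pointer(const unsigned char *buf, size_t length)
{
    const unsigned char *end = buf + length;
    unsigned long counts[4] = {0, 0, 0, 0};
    int state = 0;

    while (buf < end) {
        state = pa_table[state][pa_column(*buf++)];
        counts[state]++;
    }
    return counts[2];
}
```

Both functions should return the same word count for any input. Only the way the loop walks the buffer differs, which is the kind of change that can then be timed.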