https://github.com/robertdavidgraham/wc2
Investigates optimizing 'wc', the Unix word count program
- Host: GitHub
- URL: https://github.com/robertdavidgraham/wc2
- Owner: robertdavidgraham
- Created: 2019-11-18T19:42:48.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2024-07-07T19:30:48.000Z (4 months ago)
- Last Synced: 2024-10-14T11:39:42.681Z (about 1 month ago)
- Language: C
- Size: 79.1 KB
- Stars: 251
- Watchers: 5
- Forks: 15
- Open Issues: 11
Metadata Files:
- Readme: README.md
# wc2 - asynchronous state machine parsing
There have been multiple articles lately implementing the
classic `wc` program in various programming *languages*, to
"prove" their favorite language can be "just as fast" as C.

This project does something different.
Instead of a different *language* it uses a different *algorithm*.
The new algorithm is significantly faster: even an implementation in a
slow language like JavaScript is still faster than the original
`wc` program written in C.

The algorithm is known as an "asynchronous state-machine parser".
It's a technique for *parsing* that you don't learn in college.
It's more *efficient*, but more importantly, it's more *scalable*.
That's why your browser uses a state-machine to parse GIFs,
and most web servers use state-machines to parse incoming HTTP requests.

This project contains three versions:
* `wc2o.c` is a simplified 25 line version highlighting the idea
* `wc2.c` is the full version in C, supporting Unicode
* `wc2.js` is the version in JavaScript

## The basic algorithm
The algorithm reads input and passes each byte one at a time
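to a state-machine. It looks something like:

```c
length = fread(buf, 1, sizeof(buf), fp);
for (i=0; i<length; i++) {
    c = buf[i];
    state = table[state][c];
    counts[state]++;
}
```

A complete, simplified version of the program looks like this:

```c
#include <stdio.h>

int main(void)
{
    /* Next state, indexed by [current state][character class].
     * State 1 is entered on a newline and state 2 on the first
     * character of a word, so counts[1] and counts[2] end up
     * holding the line and word totals. */
    static const unsigned char table[4][4] = {
        {2,0,1,0,}, {2,0,1,0,}, {3,0,1,0,}, {3,0,1,0,}
    };
    /* Character class per byte: 0 = word character, 1 = whitespace,
     * 2 = newline. Entries not listed default to 0. */
    static const unsigned char column[256] = {
        0,0,0,0,0,0,0,0,0,1,2,1,1,1,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,
    };
    unsigned long counts[4] = {0,0,0,0};
    int state = 0;
    int c;

    while ((c = getchar()) != EOF) {
        state = table[state][column[c]];
        counts[state]++;
    }

    /* lines, words, characters */
    printf("%lu %lu %lu\n", counts[1], counts[2],
           counts[0] + counts[1] + counts[2] + counts[3]);
    return 0;
}
```

The key part that does all the word counting is in the two lines inside: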
```c
while ((c = getchar()) != EOF) {
state = table[state][column[c]];
counts[state]++;
}
```

This is only defined for ASCII, so you can see the state-machine on a
single line in the code (`table`).

## Additional tools
This project includes additional tools:
* `wctool` to generate large test files
* `wcdiff` to find differences between two implementations of `wc`
* `wcstream` to fragment input files (demonstrates a bug in macOS's `wc`)

The program `wc2.c` has the same logic, the difference being that it
generates a larger state-machine for parsing UTF-8.

## Pointer arithmetic
C has a peculiar idiom called "pointer arithmetic", where pointers can
be incremented. Looping through a buffer is done with an expression like
`*buf++` instead of `buf[i++]`. Many programmers think pointer-arithmetic
is faster.

To test this, the `wc2.c` program has an option `-P` that makes this
small change, so the difference in speed can be measured.