Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/robertdavidgraham/wc2

Investigates optimizing 'wc', the Unix word count program

Host: GitHub
URL: https://github.com/robertdavidgraham/wc2
Owner: robertdavidgraham
Created: 2019-11-18T19:42:48.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2022-11-02T18:42:08.000Z (over 1 year ago)
Last Synced: 2023-05-14T05:05:14.742Z (about 1 year ago)
Language: C
Size: 79.1 KB
Stars: 25
Watchers: 3
Forks: 6
Open Issues: 2
Metadata Files:
- Readme: README.md

Lists

awesome-stars - robertdavidgraham/wc2 - Investigates optimizing 'wc', the Unix word count program (C)

README

# wc2 - asynchronous state machine parsing

There have been multiple articles lately implementing the
classic `wc` program in various programming *languages*, to
"prove" their favorite language can be "just as fast" as C.
This project does something different.
Instead of a different *language* it uses a different *algorithm*.
The new algorithm is significantly faster -- implementing it in a
slow language like JavaScript is still faster than the original
`wc` program written in C.

The algorithm is known as an "asynchronous state-machine parser".
It's a technique for *parsing* that you don't learn in college.
It's more *efficient*, but more importantly, it's more *scalable*.
That's why your browser uses a state-machine to parse GIFs,
and most web servers use state-machines to parse incoming HTTP requests.

This project contains three versions:

* `wc2o.c` is a simplified 25-line version highlighting the idea
* `wc2.c` is the full version in C, supporting Unicode
* `wc2.js` is the version in JavaScript

## The basic algorithm

The algorithm reads input and passes each byte one at a time
to a state-machine. It looks something like:

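```c
    length = fread(buf, 1, sizeof(buf), fp);
    for (i=0; i<length; i++) {
        c = buf[i];
        state = table[state][column[c]];
        counts[state]++;
    }
```

The simplified version, `wc2o.c`, wraps the same loop in a complete
program that reads one byte at a time with `getchar()`:

```c
#include <stdio.h>
int main(void)
{
    /* Next state, indexed by [current state][character class] */
    static const unsigned char table[4][4] = {
        {2,0,1,0,}, {2,0,1,0,}, {3,0,1,0,},  {3,0,1,0,}
    };
    /* Character class per byte: 0 = word, 1 = space, 2 = newline */
    static const unsigned char column[256] = {
        0,0,0,0,0,0,0,0,0,1,2,1,1,1,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,
    };
    unsigned long counts[4] = {0,0,0,0};
    int state = 0;
    int c;
    while ((c = getchar()) != EOF) {
        state = table[state][column[c]];
        counts[state]++;
    }
    /* lines, words, characters */
    printf("%lu %lu %lu\n", counts[1], counts[2], 
                counts[0] + counts[1] + counts[2] + counts[3]);
    return 0;
}
```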

The key part that does all the word counting is the two lines inside the loop:

```c
    while ((c = getchar()) != EOF) {
        state = table[state][column[c]];
        counts[state]++;
    }
```

This is only defined for ASCII, so you can see the state-machine on a
single line in the code (`table`).

## Additional tools

This project includes additional tools:

 * `wctool` to generate large test files
 * `wcdiff` to find differences between two implementations of `wc`
 * `wcstream` to fragment input files (demonstrates a bug in macOS's `wc`)

The program `wc2.c` has the same logic, the difference being that it
generates a larger state-machine for parsing UTF-8.

## Pointer arithmetic

C has a peculiar idiom called "pointer arithmetic", where pointers can
be incremented. Looping through a buffer is done with an expression like
`*buf++` instead of `buf[i++]`. Many programmers think pointer arithmetic
is faster.

To test this, the `wc2.c` program has an option `-P` that makes this
small change, so the difference in speed can be measured.