{"id":17875397,"url":"https://github.com/mbrubeck/wf-recs","last_synced_at":"2026-02-06T02:05:09.989Z","repository":{"id":476887,"uuid":"102261","full_name":"mbrubeck/wf-recs","owner":"mbrubeck","description":"Tim Bray's Wide Finder implemented with Make and RecordStream","archived":false,"fork":false,"pushed_at":"2009-01-06T22:41:12.000Z","size":292,"stargazers_count":5,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2026-02-02T07:38:08.393Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mbrubeck.png","metadata":{"files":{"readme":"README.markdown","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2009-01-06T22:07:57.000Z","updated_at":"2019-08-13T13:51:50.000Z","dependencies_parsed_at":"2022-08-16T10:25:23.054Z","dependency_job_id":null,"html_url":"https://github.com/mbrubeck/wf-recs","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mbrubeck/wf-recs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbrubeck%2Fwf-recs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbrubeck%2Fwf-recs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbrubeck%2Fwf-recs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbrubeck%2Fwf-recs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mbrubeck","download_url":"https://codeload.github.com/mbrubeck/wf-recs/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/mbrubeck%2Fwf-recs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29145565,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-06T01:13:33.096Z","status":"online","status_checked_at":"2026-02-06T02:00:08.092Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-28T11:15:24.823Z","updated_at":"2026-02-06T02:05:09.976Z","avatar_url":"https://github.com/mbrubeck.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"About\n=====\n\nTed Gould's post about [transcoding music files using Make][1] got me thinking\nabout combining Make and [RecordStream][3] into a barebones [MapReduce][2]\nframework.  As a demonstration, I wrote a partial solution to Tim Bray's [Wide\nFinder][6] challenge, to calculate hit counts from web server logs.\n\nRecordStream (also known as \"recs\") is a collection of Unix tools for\nproducing and processing files full of [JSON][4] objects.  It was developed by\nmy friends Ben Bernard and Keith Amling.  Since RecordStream files contain one\nobject per line, the recs tools are easy to script in combination with\ntraditional Unix tools like `head` and `grep`.\n\n\u003e *Aside: Unfortunately, the version of RecordStream on Google Code is very\n\u003e old, and in particular does not have much in the way of usage or\n\u003e installation instructions.  
Hopefully Ben and Keith can get approval to\n\u003e publish the many, many, many improvements they and others have made since\n\u003e they started working at Amazon.com.*\n\nI think of Make as a functional programming environment for shell scripting.\nIt's simple to write a Makefile where each rule's output depends only on the\nspecified input files.  Since all dependencies between rules are declared in\nthe Makefile, we get some nice features for free:\n\n1. The entire process can be parallelized (using `make -j`) with no explicit\n   instructions from the programmer.  Running `make -j 2` on my dual-core\n   workstation resulted in a 1.75\u0026times; speed-up for this program.\n2. By saving intermediate files, we can do incremental MapReduce updates\n   ([like CouchDB][5]).  Just add or remove log files from the data directory\n   and re-run `make`.  The new files will be analyzed, then combined with\n   cached results from any already-processed files.  If you had a huge\n   collection of files, you could cache multiple levels of aggregate results\n   (e.g. daily, weekly, monthly) to make incremental updates even more\n   efficient.\n3. Saving intermediate files also prevents other types of duplicate work.  For\n   example, the first step in my Wide Finder program converts the logs into\n   RecordStream's JSON format.  This is more work than necessary for the Wide\n   Finder problem alone, but in the real world you could then reuse the\n   results to do other calculations.\n4. You can kill the process at any point and later restart it near where it\n   stopped\u0026mdash;or even migrate it to another machine.\n\nHere's my [source code][7]: 16 lines of code, not counting comments or blank\nlines.  It's not as fast as any of the heavily-optimized versions, but it\ndemonstrates a practical way to do parallel data processing with very little\neffort.\n\nInstructions\n============\n\n1. Check out [RecordStream][8].  
Add the commands in the `bin` directory to\n   your `$PATH`, and copy the contents of the lib directory into your Perl\n   include path (e.g. `/usr/local/lib/site_perl`).\n2. Clone this git repository: `git clone git://github.com/mbrubeck/wf-recs.git`\n3. Run `make` inside your cloned repository.  Use `make -j 4` to run four\n   concurrent jobs.  (Adjust for the number of processor cores in your\n   machine.)\n\nExercise for the reader:  Integrate this with a distributed Make program, to\nmake it run across multiple computers on a network.\n\nNotes\n=====\n\nThe sample data in the `data` directory are the 10,000-line\n\u003chttp://www.tbray.org/tmp/o10k.ap\u003e, split into ten files.  I wimped out and\ndid not include file-splitting in the code itself.\n\nEric Wong's [Wide Finder 2 entry][9] also used Make, although his\ncode is currently unavailable.\n\n\n\n[1]: http://gould.cx/ted/blog/Where_music_is_going\n[2]: http://en.wikipedia.org/wiki/MapReduce\n[3]: http://code.google.com/p/recordstream/\n[4]: http://json.org/\n[5]: http://horicky.blogspot.com/2008/10/couchdb-implementation.html\n[6]: http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder\n[7]: Makefile\n[8]: http://code.google.com/p/recordstream/source/checkout\n[9]: http://groups.google.com/group/wide-finder/browse_thread/thread/10b3cc5d3d10e384/fcb8c1ce2f5ef480\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmbrubeck%2Fwf-recs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmbrubeck%2Fwf-recs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmbrubeck%2Fwf-recs/lists"}