{"id":22026830,"url":"https://github.com/timbray/topfew","last_synced_at":"2025-04-09T13:03:41.385Z","repository":{"id":53260100,"uuid":"265160401","full_name":"timbray/topfew","owner":"timbray","description":"Finds the field values (or combinations of values) which appear most often in a stream of records.","archived":false,"fork":false,"pushed_at":"2024-09-05T04:48:25.000Z","size":2595,"stargazers_count":192,"open_issues_count":3,"forks_count":6,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-02T11:49:36.559Z","etag":null,"topics":["go","logging"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/timbray.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-19T06:08:06.000Z","updated_at":"2025-03-30T00:46:51.000Z","dependencies_parsed_at":"2024-04-25T20:29:20.324Z","dependency_job_id":"531deeaa-68c6-4a3b-af60-9a2ae9c15dfe","html_url":"https://github.com/timbray/topfew","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timbray%2Ftopfew","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timbray%2Ftopfew/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timbray%2Ftopfew/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timbray%2Ftopfew/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/timbray","download
_url":"https://codeload.github.com/timbray/topfew/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248045230,"owners_count":21038553,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["go","logging"],"created_at":"2024-11-30T07:32:16.361Z","updated_at":"2025-04-09T13:03:41.359Z","avatar_url":"https://github.com/timbray.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# topfew\n\n[![Tests](https://github.com/timbray/topfew/actions/workflows/tests.yaml/badge.svg)](https://github.com/timbray/topfew/actions/workflows/tests.yaml)\n[![codecov](https://codecov.io/gh/timbray/topfew/branch/main/graph/badge.svg)](https://codecov.io/gh/timbray/topfew)\n[![Go Report Card](https://goreportcard.com/badge/github.com/timbray/topfew)](https://goreportcard.com/report/github.com/timbray/topfew)\n[![timbray/topfew](https://img.shields.io/github/go-mod/go-version/timbray/topfew)](https://github.com/timbray/topfew)\n[![0 dependencies!](https://0dependencies.dev/0dependencies.svg)](https://0dependencies.dev)\n\nA program that finds and prints out the top few records in which a certain field or combination of fields occurs most frequently.\n\nThis is release 2.0 of Topfew.\n\n## Examples\n\nTo find the IP address that most commonly hits your web site, given an Apache logfile named `access_log`.\n\n`topfew --fields 1 access_log`\n\nThe same effect could be achieved with\n\n`awk '{print $1}' access_log | sort | uniq -c | sort -rn | head`\n\nBut **topfew** is usually much faster.\n\nDo the same, but 
exclude high-traffic bots (omitting the filename).\n\n`topfew --fields 1 --vgrep googlebot --vgrep bingbot`\n\nMost popular IP addresses from May 2020.\n\n`topfew --fields 1 --grep '\\[../May/2020'`\n\nMost popular hour/minute of the day for retrievals.\n\n`topfew --fields 4 --sed \"\\\\[\" \"\"  --sed '^[^:]*:' ''  --sed ':..$' ''`\n\n## Usage\n\n```shell\ntopfew\n\t-n, --number (output line count) [default is 10]\n\t-f, --fields (field list) [default is the whole record]\n\t-q, --quotedfields [respect \"-delimited space-separated fields]\n\t-p, --fieldseparator (regexp) [use provided regexp to separate fields]\n\t-g, --grep (regexp) [may repeat, default is accept all]\n\t-v, --vgrep (regexp) [may repeat, default is reject none]\n\t-s, --sed (regexp) (replacement) [may repeat, default is no changes]\n\t-w, --width (segment count) [default is result of runtime.numCPU()]\n\t--sample\n\t-h, -help, --help\n\tfilename [default is stdin]\n\nAll the arguments are optional; if none are provided, topfew will read records \nfrom the standard input and list the 10 which occur most often.\n```\n## Options\n`-n integer`, `--number integer` \n\nHow many of the highest‐occurrence‐count lines to print out. 
\nThe default value is 10.\n\n`-f fieldlist, --fields fieldlist`\n\nSpecifies which fields should be extracted from incoming records and used in computing occurrence counts.\nThe fieldlist must be a comma-separated list of integers identifying field numbers, which start at one, for example 3 or 2,5,6.\nThe fields must be provided in order, so 3,1,7 is an error.\n\nIf no fieldlist is provided, **topfew** treats the whole input record as a single field.\n\n`-p separator, --fieldseparator separator` \n\nProvides a regular expression that is used as a field separator instead of the default white space.\nThis is likely to incur a significant performance cost.\n\n`-q, --quotedfields`\n\nSome files, for example Apache httpd logs, use space-separation but also\nallow spaces within fields which are delimited by `\"`. The -q/--quotedfields\nargument allows **topfew** to process these correctly. It is an error to specify both\n-p and -q.\n\n`-g regexp`, `--grep regexp`\n\nThe initial **g** suggests `grep`.\nThis option applies the provided regular expression to each record as it is read; if the regexp does not match the record, **topfew** skips it.\n\nThis option can be provided multiple times; the provided regular expressions will be applied in the order they appear on the command line.\n\n`-v regexp`, `--vgrep regexp`\n\nThe initial **v** suggests `grep -v`. This operation is the inverse of `-g` and `--grep`, rejecting records that match the provided regular expression.  
\nAs with `grep`, it can be provided multiple times.\n\n`-s regexp replacement`, `--sed regexp replacement`\n\nAs its name suggests, applies sed-style editing by replacing any text that matches the provided regexp with the provided replacement.\nIt works on the fields in the fieldlist after they have been extracted from the record.\n\nIf ()-enclosed capturing groups appear in the regexp, they may be referred to as **$1**, **$2**, and so on, in the replacement.\n\nThis option can be provided many times, and the replacement operations are performed in the order they appear on the command line.\n\n`--sample`\n\nIt can be tricky to get the regular expressions in the `-g`, `-v`, and `-s` options right.\nSpecifying `--sample` causes **topfew** to print lines to the standard output that display the filtering and field-editing logic.\nIt can only be used when processing standard input, not a file.\n\n`-w integer`, `--width integer`\n\nIf a file name is specified then **topfew**, rather than reading it from end to end, will divide it into segments and process it in multiple parallel threads.\nThe optimal number of threads depends in a complicated way on how many cores your CPU has, what kind of cores they are, and the storage architecture.\n\nThe default is the result of the Go `runtime.NumCPU()` call and often produces good results.\n\n`-h`, `-help`, `--help`\n\nDescribes the function and options of **topfew**.\n\n## Records and fields\n\nRecords are separated by newlines, fields within records by white space, defined as one or more space or tab characters.\n\nThe field separator can be overridden with the --fieldseparator option.\n\n## Case study: Apache access_log\n\nHere is a line from an Apache httpd `access_log` file. For readability, the fields are \nseparated by line-breaks and numbered. Note that the fields are mostly space-separated, but that field 6,\nsummarizing the request and its result, is delimited by quote characters `\"`.\n\n```\n1. 
202.113.19.244 \n2. - \n3. - \n4. [12/Mar/2007:08:04:39 \n5. -0800] \n6. \"GET /ongoing/picInfo.xml?o=http://www.tbray.org/ongoing/When/200x/2007/03/10/Beautiful-Code HTTP/1.1\" \n7. 200 \n8. 137 \n9. \"http://www.tbray.org/ongoing/When/200x/2007/03/10/Beautiful-Code\" \n10. \"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2) Gecko/20070219 Firefox/2.0.0.2\"\n```\n\nThe fetch of `picInfo.xml` signals that this is an actual browser request, likely signifying that \na human was involved; the URL following the `o=` is the resource the human looked at. Here is a \n**topfew** invocation that yields a list of the top 5 URLs that were fetched by a human:\n\n```shell\ntopfew -g picInfo.xml -f 6 -q -s '\\?utm.*' '' -s \" HTTP/...\" \"\" -s \"GET .*\\/ongoing\" \"\"\n```\n\nNote the `-g` to select only lines with `picInfo.xml`, the `-q` to request correct processing\nof quote-delimited fields, and the sequence of `-s` patterns to clean up the results.\n\n## Performance issues\n\nSince the effect of topfew can be exactly duplicated with a combination of `awk`, `grep`, `sed` and `sort`, you wouldn’t be using it if you didn’t care about performance. 
\nTopfew is quite highly tuned and pushes your computer’s I/O subsystem and Go runtime hard.\nTherefore, the observed effects of combinations of options can vary dramatically from system to system.\n\nFor example, if I want to list the top records containing the string `example` from a file named `big-file` I could do either of the following:\n\n```shell\ntopfew -g example big-file \ngrep example big-file |topfew\n```\n\nWhen I benchmark topfew on a modern Apple-Silicon Mac and an elderly spinning-rust Linux VPS, I observe that the first option is faster on Mac, the second on Linux.\n\nOnly one performance issue is uncomplicated: Topfew will **always** run faster on a named file than a standard-input stream.\n\n## Credits\n\nTim Bray created version 0.1 of Topfew, and the path toward 1.0 was based chiefly on ideas stolen from Dirkjan Ochtman and contributed by Simon Fell.\nThe GitHub CI was based on Michael Gasch’s implementation from my Quamina repository, and he helped with Topfew’s as well.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftimbray%2Ftopfew","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftimbray%2Ftopfew","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftimbray%2Ftopfew/lists"}