{"id":21620805,"url":"https://github.com/miku/filterline","last_synced_at":"2025-04-11T09:18:04.565Z","repository":{"id":33783080,"uuid":"37449795","full_name":"miku/filterline","owner":"miku","description":"Command line tool to filter file by line number.","archived":false,"fork":false,"pushed_at":"2024-05-22T07:41:32.000Z","size":39,"stargazers_count":12,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-25T06:33:48.697Z","etag":null,"topics":["filter","unix"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/miku.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-06-15T07:20:23.000Z","updated_at":"2024-05-22T07:41:36.000Z","dependencies_parsed_at":"2022-09-08T18:01:26.675Z","dependency_job_id":null,"html_url":"https://github.com/miku/filterline","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Ffilterline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Ffilterline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Ffilterline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Ffilterline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/miku","download_url":"https://codeload.github.com/miku/filterline/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248366766,"owners_count":2109
2031,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["filter","unix"],"created_at":"2024-11-24T23:12:58.055Z","updated_at":"2025-04-11T09:18:04.546Z","avatar_url":"https://github.com/miku.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# README\n\nfilterline filters a file by line numbers.\n\nTaken from [here](http://unix.stackexchange.com/questions/209404/filter-file-by-line-number). There's an [awk version](https://gist.github.com/miku/bc8315b10413203b31de), too.\n\n## Installation\n\nThere are deb and rpm [packages](https://github.com/miku/filterline/releases).\n\nTo build from source:\n\n    $ git clone https://github.com/miku/filterline.git\n    $ cd filterline\n    $ make\n\n## Usage\n\nNote that line numbers (L) **must be sorted** and **must not contain duplicates**.\n\n    $ filterline\n    Usage: filterline FILE1 FILE2\n\n    FILE1: line numbers, FILE2: input file\n\n    $ cat fixtures/L\n    1\n    2\n    5\n    6\n\n    $ cat fixtures/F\n    line 1\n    line 2\n    line 3\n    line 4\n    line 5\n    line 6\n    line 7\n    line 8\n    line 9\n    line 10\n\n    $ filterline fixtures/L fixtures/F\n    line 1\n    line 2\n    line 5\n    line 6\n\n    $ filterline \u003c(echo 1 2 5 6) fixtures/F\n    line 1\n    line 2\n    line 5\n    line 6\n\nSince 0.1.4, there is an `-v` flag to \"invert\" matches.\n\n    $ filterline -v \u003c(echo 1 2 5 6) fixtures/F\n    line 3\n    line 4\n    line 7\n    line 8\n    line 9\n    line 10\n\n## Performance\n\nFiltering out 10 million lines from a 1 billion lines file (14G) takes about 
33\nseconds (dropped caches, i7-2620M):\n\n    $ time filterline 10000000.L 1000000000.F \u003e /dev/null\n    real    0m33.434s\n    user    0m25.334s\n    sys     0m5.920s\n\nA similar [awk script](https://gist.github.com/miku/bc8315b10413203b31de) takes about 2-3 times longer.\n\n## Use case: data compaction\n\nOne use case for such a filter is *data compaction*. Imagine that you harvest\nan API every day and you keep the JSON responses in a log.\n\nWhat is a log?\n\n\u003e A log is perhaps the simplest possible storage abstraction. It is an\n  **append-only**, totally-ordered sequence of records ordered by time.\n\nFrom: [The Log: What every software engineer should know about real-time data's unifying abstraction](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)\n\nFor simplicity let's think of the log as a *file*. So every time you harvest\nthe API, you just *append* to a file:\n\n```sh\n$ cat harvest-2015-06-01.ldj \u003e\u003e log.ldj\n$ cat harvest-2015-06-02.ldj \u003e\u003e log.ldj\n...\n```\n\nThe API responses can contain entries that are *new* and entries which\nrepresent *updates*. If you want to answer the question:\n\n\u003e What is the current state of each record?\n\n... you would have to find the most recent version of each record in that log file. A\ntypical solution would be to switch from a file to a database of sorts and do\nsome kind of\n[upsert](https://wiki.postgresql.org/wiki/UPSERT#.22UPSERT.22_definition).\n\nBut how about logs with 100M, 500M or billions of records? And what if you do\nnot want to run an extra component, like a database?\n\nYou can make this process a shell one-liner, and a reasonably fast one, too.\n\n## Data point: Crossref Snapshot\n\n[Crossref](https://en.wikipedia.org/wiki/Crossref) hosts a constantly evolving\nindex of scholarly metadata, available via\n[API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/). 
We\nuse `filterline` to turn a sequence of hundreds of daily API updates into a\nsingle snapshot, via\n[span-crossref-snapshot](https://github.com/miku/span/blob/master/cmd/span-crossref-snapshot/main.go)\n(more\n[details](https://github.com/datasets/awesome-data/issues/284#issuecomment-405089255)):\n\n```shell\n$ filterline L \u003c(zstd -dc -T0 data.ndj.zst) | zstd -c -T0 \u003e snapshot.ndj.zst\n\n             ^                  ^                             ^\n             |                  |                             |\n       lines to keep       ~1B+ records, 4T+             latest versions, ~140M records\n```\n\nCrunching through ~1B messages takes about 65 minutes, about 1GB/s.\n\n\u003e Look, ma, just [files](http://www.catb.org/~esr/writings/taoup/html/ch01s06.html).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiku%2Ffilterline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmiku%2Ffilterline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiku%2Ffilterline/lists"}