{"id":13528184,"url":"https://github.com/bitly/data_hacks","last_synced_at":"2025-05-15T12:06:03.955Z","repository":{"id":1092003,"uuid":"946824","full_name":"bitly/data_hacks","owner":"bitly","description":"Command line utilities for data analysis","archived":false,"fork":false,"pushed_at":"2024-01-16T09:55:12.000Z","size":50,"stargazers_count":1939,"open_issues_count":20,"forks_count":193,"subscribers_count":135,"default_branch":"master","last_synced_at":"2025-04-14T22:18:34.995Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://github.com/bitly/data_hacks","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"googleapis/google-api-go-client","license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bitly.png","metadata":{"files":{"readme":"README.markdown","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2010-09-28T22:09:22.000Z","updated_at":"2025-04-07T03:54:43.000Z","dependencies_parsed_at":"2024-06-19T22:50:15.213Z","dependency_job_id":"42a6f89c-457e-4375-bee1-5167ce192540","html_url":"https://github.com/bitly/data_hacks","commit_stats":{"total_commits":37,"total_committers":13,"mean_commits":"2.8461538461538463","dds":0.4054054054054054,"last_synced_commit":"c66693bc41299691974868557e9d7f4b30f2b3cd"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitly%2Fdata_hacks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitly%2Fdata_hacks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitly%2Fdata_hacks/releases","manifests_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/bitly%2Fdata_hacks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bitly","download_url":"https://codeload.github.com/bitly/data_hacks/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254337613,"owners_count":22054253,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T06:02:16.759Z","updated_at":"2025-05-15T12:05:58.939Z","avatar_url":"https://github.com/bitly.png","language":"Python","funding_links":[],"categories":["Python","Big data on a single machine / on the command line"],"sub_categories":[],"readme":"data_hacks\n==========\n\nCommand line utilities for data analysis\n\nInstalling: `pip install data_hacks`\n\nInstalling from github `pip install -e git://github.com/bitly/data_hacks.git#egg=data_hacks`\n\nInstalling from source `python setup.py install`\n\ndata_hacks are friendly. 
Ask them for usage information with `--help`\n\nhistogram.py\n------------\n\nA utility that parses input data points and outputs a text histogram\n\nExample:\n\n    $ cat /tmp/data | histogram.py --percentage --max=1000 --min=0\n    # NumSamples = 60; Min = 0.00; Max = 1000.00\n    # 1 value outside of min/max\n    # Mean = 332.666667; Variance = 471056.055556; SD = 686.335236; Median 191.000000\n    # each ∎ represents a count of 1\n        0.0000 -   100.0000 [    28]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (46.67%)\n      100.0000 -   200.0000 [     2]: ∎∎ (3.33%)\n      200.0000 -   300.0000 [     2]: ∎∎ (3.33%)\n      300.0000 -   400.0000 [     8]: ∎∎∎∎∎∎∎∎ (13.33%)\n      400.0000 -   500.0000 [     8]: ∎∎∎∎∎∎∎∎ (13.33%)\n      500.0000 -   600.0000 [     7]: ∎∎∎∎∎∎∎ (11.67%)\n      600.0000 -   700.0000 [     3]: ∎∎∎ (5.00%)\n      700.0000 -   800.0000 [     0]:  (0.00%)\n      800.0000 -   900.0000 [     1]: ∎ (1.67%)\n      900.0000 -  1000.0000 [     0]:  (0.00%)\n\nWith logarithmic scale\n\n    $ printf 'import random\\nfor i in range(1000):\\n print random.randint(0,10000)'|\\\n        python -|./data_hacks/histogram.py -l\n    # NumSamples = 1000; Min = 2.00; Max = 9993.00\n    # Mean = 4951.757000; Variance = 8279390.995951; SD = 2877.393090; Median 4828.000000\n    # each ∎ represents a count of 6\n        2.0000 -    11.7664 [     3]:\n       11.7664 -    31.2991 [     0]:\n       31.2991 -    70.3646 [     5]:\n       70.3646 -   148.4956 [    11]: ∎\n      148.4956 -   304.7576 [    15]: ∎∎\n      304.7576 -   617.2815 [    35]: ∎∎∎∎∎\n      617.2815 -  1242.3294 [    51]: ∎∎∎∎∎∎∎∎\n     1242.3294 -  2492.4252 [   128]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎\n     2492.4252 -  4992.6168 [   269]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎\n     4992.6168 -  9993.0000 [   483]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎\n\nninety_five_percent.py\n----------------------\n\nA utility script that takes a stream of decimal values and 
outputs the 95th percentile value.\n\nThis is useful for finding the 95th percentile response time from access logs.\n\nExample (assuming response time is the last column in your access log):\n\n    $ awk '{print $NF}' /path/to/access.log | ninety_five_percent.py\n    \nsample.py\n---------\n\nFilter a stream to a random sub-sample of the stream\n\nExample:\n\n    $ cat access.log | sample.py 10% | post_process.py\n\nrun_for.py\n----------\n\nPass through data for a specified amount of time\n\nExample:\n\n    $ tail -f access.log | run_for.py 10s | post_process.py\n\nbar_chart.py\n------------\n\nGenerate an ASCII bar chart for input data (this is like a visualization of `uniq -c`)\n\n    $ cat data | bar_chart.py\n    # each ∎ represents a count of 1. total 63\n    14:40 [    49] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎\n    14:41 [    14] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎\n\n`bar_chart.py` and `histogram.py` also support ingesting pre-aggregated values. Simply provide a two-column input of `count\u003cwhitespace\u003evalue` for `-a` or `value\u003cwhitespace\u003ecount` for `-A`:\n\n    $ sort /path/to/data | uniq -c | bar_chart.py -a\n\nThis is very convenient if you pull already-aggregated data out of, say, Hadoop or MySQL.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbitly%2Fdata_hacks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbitly%2Fdata_hacks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbitly%2Fdata_hacks/lists"}