Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/bitly/data_hacks
Command line utilities for data analysis
- Host: GitHub
- URL: https://github.com/bitly/data_hacks
- Owner: bitly
- Created: 2010-09-28T22:09:22.000Z (over 14 years ago)
- Default Branch: master
- Last Pushed: 2024-01-16T09:55:12.000Z (12 months ago)
- Last Synced: 2025-01-03T20:08:17.700Z (8 days ago)
- Language: Python
- Homepage: http://github.com/bitly/data_hacks
- Size: 48.8 KB
- Stars: 1,940
- Watchers: 136
- Forks: 195
- Open Issues: 20
Metadata Files:
- Readme: README.markdown
Awesome Lists containing this project
- my-awesome-github-stars - bitly/data_hacks - Command line utilities for data analysis (Python)
- awesome-repositories - bitly/data_hacks - Command line utilities for data analysis (Python)
- awesome-machine-learning-engineering - Data hacks
- project-awesome - bitly/data_hacks - Command line utilities for data analysis (Python)
README
data_hacks
==========

Command line utilities for data analysis
Installing: `pip install data_hacks`
Installing from GitHub: `pip install -e git+https://github.com/bitly/data_hacks.git#egg=data_hacks`
Installing from source: `python setup.py install`
data_hacks are friendly. Ask them for usage information with `--help`.
histogram.py
------------

A utility that parses input data points and outputs a text histogram.
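
To make the bucketing concrete, here is a minimal sketch of how bucket edges could be computed. It is not taken from the tool's source. The function name `bucket_boundaries` and the default of 10 buckets are assumptions chosen to match the examples in this section, and the doubling-width rule for log mode is inferred from the edges printed in the `-l` example later in this section.

```python
def bucket_boundaries(lo, hi, buckets=10, logscale=False):
    """Return the upper edge of each bucket between lo and hi.

    Linear mode uses equal-width buckets. Log mode makes each bucket
    twice as wide as the previous one (an assumption inferred from the
    printed example output, not from the tool's source).
    """
    if buckets < 1 or hi <= lo:
        raise ValueError("need buckets >= 1 and hi > lo")
    span = float(hi - lo)
    if logscale:
        # Widths w, 2w, 4w, ... must add up to the span:
        # w * (2**buckets - 1) == span
        first_width = span / (2 ** buckets - 1)
        return [lo + first_width * (2 ** (i + 1) - 1) for i in range(buckets)]
    step = span / buckets
    return [lo + step * (i + 1) for i in range(buckets)]


# Print the log-scale edges for the range used in the -l example.
for edge in bucket_boundaries(2, 9993, logscale=True):
    print("%.4f" % edge)
```

Calling `bucket_boundaries(0, 1000)` gives the equal-width edges 100, 200, ..., 1000 used in the `--min=0 --max=1000` example.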
Example:
$ cat /tmp/data | histogram.py --percentage --max=1000 --min=0
# NumSamples = 60; Min = 0.00; Max = 1000.00
# 1 value outside of min/max
# Mean = 332.666667; Variance = 471056.055556; SD = 686.335236; Median 191.000000
# each ∎ represents a count of 1
0.0000 - 100.0000 [ 28]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (46.67%)
100.0000 - 200.0000 [ 2]: ∎∎ (3.33%)
200.0000 - 300.0000 [ 2]: ∎∎ (3.33%)
300.0000 - 400.0000 [ 8]: ∎∎∎∎∎∎∎∎ (13.33%)
400.0000 - 500.0000 [ 8]: ∎∎∎∎∎∎∎∎ (13.33%)
500.0000 - 600.0000 [ 7]: ∎∎∎∎∎∎∎ (11.67%)
600.0000 - 700.0000 [ 3]: ∎∎∎ (5.00%)
700.0000 - 800.0000 [ 0]: (0.00%)
800.0000 - 900.0000 [ 1]: ∎ (1.67%)
900.0000 - 1000.0000 [ 0]: (0.00%)

With logarithmic scale:
$ printf 'import random\nfor i in range(1000):\n print(random.randint(0,10000))' | \
python - | ./data_hacks/histogram.py -l
# NumSamples = 1000; Min = 2.00; Max = 9993.00
# Mean = 4951.757000; Variance = 8279390.995951; SD = 2877.393090; Median 4828.000000
# each ∎ represents a count of 6
2.0000 - 11.7664 [ 3]:
11.7664 - 31.2991 [ 0]:
31.2991 - 70.3646 [ 5]:
70.3646 - 148.4956 [ 11]: ∎
148.4956 - 304.7576 [ 15]: ∎∎
304.7576 - 617.2815 [ 35]: ∎∎∎∎∎
617.2815 - 1242.3294 [ 51]: ∎∎∎∎∎∎∎∎
1242.3294 - 2492.4252 [ 128]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2492.4252 - 4992.6168 [ 269]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
4992.6168 - 9993.0000 [ 483]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎

ninety_five_percent.py
----------------------

A utility script that takes a stream of decimal values and outputs the 95th-percentile value (the "95% time").
This is useful for finding the 95th-percentile response time from access logs.
Example (assuming response time is the last column in your access log):
$ awk '{print $NF}' /path/to/access.log | ninety_five_percent.py
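
For readers unfamiliar with the calculation, the sketch below shows one common way to compute a 95th percentile, the nearest-rank method. The script itself may use a different convention, and the function name `percentile` is hypothetical.

```python
import math


def percentile(values, fraction=0.95):
    """Nearest-rank percentile of an iterable of numbers or numeric strings."""
    ordered = sorted(float(v) for v in values)
    if not ordered:
        raise ValueError("no input values")
    # Nearest-rank: the smallest value with at least `fraction`
    # of the samples at or below it.
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


# Response times as they might arrive from a log, one string per line.
print(percentile(["0.12", "0.30", "0.08", "1.75", "0.22"]))
```

To use it on a real stream, pass `sys.stdin` as `values`; every line must then contain exactly one number.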
sample.py
---------

Filter a stream to a random sub-sample of the stream.
Example:
$ cat access.log | sample.py 10% | post_process.py
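
One simple way to sub-sample a stream is to keep each line independently with the given probability. The sketch below illustrates that idea. It is an assumption about the approach rather than the script's actual code, and `parse_rate` and `sample_lines` are hypothetical names.

```python
import random


def parse_rate(text):
    """Turn a percentage string such as '10%' into a fraction (0.1)."""
    if not text.endswith("%"):
        raise ValueError("expected a percentage such as '10%'")
    rate = float(text[:-1]) / 100.0
    if not 0.0 <= rate <= 1.0:
        raise ValueError("percentage must be between 0% and 100%")
    return rate


def sample_lines(lines, rate, rng=None):
    """Yield each line with probability `rate`, independently of the others."""
    if rng is None:
        rng = random.Random()
    for line in lines:
        if rng.random() < rate:
            yield line


# Keep roughly 10% of 1000 synthetic lines; the exact count varies per run.
kept = list(sample_lines(("line %d" % i for i in range(1000)), parse_rate("10%")))
print(len(kept))
```

Because each line is decided on its own, the generator never buffers the stream, so it also works on unbounded input such as `tail -f`.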
run_for.py
----------

Pass through data for a specified amount of time.
Example:
$ tail -f access.log | run_for.py 10s | post_process.py
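
The idea of a time-limited pass-through can be sketched as a generator that stops once a deadline has passed. This is an illustration under stated assumptions, not the script's implementation: the names `parse_duration` and `run_for` are hypothetical, and the `m` and `h` suffixes are assumed (the example above only shows `10s`). Note that this sketch checks the clock only when a line arrives, so it will not interrupt a read that is blocked waiting for input.

```python
import time

# Assumed suffix table; only "s" appears in the example above.
UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Turn a string such as '10s' into a number of seconds."""
    unit = text[-1:]
    if unit not in UNITS:
        raise ValueError("expected a duration such as '10s'")
    return float(text[:-1]) * UNITS[unit]


def run_for(lines, seconds, clock=time.monotonic):
    """Yield lines until `seconds` have elapsed, then stop."""
    deadline = clock() + seconds
    for line in lines:
        if clock() >= deadline:
            break
        yield line


# A short finite input finishes long before the 10 second deadline.
for line in run_for(["first", "second", "third"], parse_duration("10s")):
    print(line)
```

The `clock` parameter exists so the deadline logic can be exercised with a fake clock instead of real waiting.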
bar_chart.py
------------

Generate an ASCII bar chart for input data (this is like a visualization of `uniq -c`).
$ cat data | bar_chart.py
# each ∎ represents a count of 1. total 63
14:40 [ 49] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
14:41 [ 14] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎

`bar_chart.py` and `histogram.py` also support ingesting pre-aggregated values. Simply provide a two-column input of `count<whitespace>value` for `-a` or `value<whitespace>count` for `-A`:
$ sort /path/to/data | uniq -c | bar_chart.py -a
This is convenient if you pull already-aggregated data out of, say, Hadoop or MySQL.
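
The two column orders can be illustrated with a small parser: count followed by value (the `uniq -c` layout, as for `-a`) or value followed by count (as for `-A`). This is a sketch of the input format only, not the tools' own parsing code, and `read_aggregated` is a hypothetical name.

```python
from collections import Counter


def read_aggregated(lines, count_first=True):
    """Sum pre-aggregated counts per value from two-column text lines.

    count_first=True  expects "count value" (like `uniq -c` output).
    count_first=False expects "value count".
    """
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if count_first:
            count, value = line.split(None, 1)
        else:
            value, count = line.rsplit(None, 1)
        totals[value] += int(count)
    return totals


# `uniq -c` pads counts with leading spaces; strip() handles that.
print(read_aggregated(["     49 14:40", "     14 14:41"]))
```

Splitting only once keeps values that themselves contain spaces intact in either column order.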