https://github.com/librecat/catmandu-stat

Catmandu support for basic statistical data analysis

# NAME

Catmandu::Stat - Catmandu modules for working with statistical data

# SYNOPSIS

    # Calculate statistics on the availability of the ISBN fields in the dataset
    cat data.json | catmandu convert JSON to Stat --fields isbn

    # Preprocess data and calculate statistics
    catmandu convert MARC to Stat --fix 'marc_map(020a,isbn)' --fields isbn < data.mrc

    # Or in fix files

    # Calculate the mean of foo. E.g. foo => [1,2,3,4]
    stat_mean(foo) # foo => '2.5'

    # Calculate the median of foo. E.g. foo => [1,2,3,4]
    stat_median(foo) # foo => '2.5'

    # Calculate the standard deviation of foo. E.g. foo => [1,2,3,4]
    stat_stddev(foo) # foo => '1.12'

    # Calculate the variance of foo. E.g. foo => [1,2,3,4]
    stat_variance(foo) # foo => '1.25'
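
The example values above are consistent with population statistics (dividing by n rather than n - 1). This is inferred from the numbers in the synopsis, not from the module source. A minimal Python sketch that reproduces the same arithmetic with the standard library:

```python
import statistics

# Same sample as in the fix examples: foo => [1,2,3,4]
foo = [1, 2, 3, 4]

mean = statistics.mean(foo)           # arithmetic mean
median = statistics.median(foo)       # average of the two middle values
variance = statistics.pvariance(foo)  # population variance (divide by n)
stddev = statistics.pstdev(foo)       # population standard deviation

print(mean, median, variance, round(stddev, 2))
```

With the sample formulas (`statistics.variance` and `statistics.stdev`, which divide by n - 1) the last two values would differ from those shown in the synopsis.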

# MODULES

- [Catmandu::Exporter::Stat](https://metacpan.org/pod/Catmandu::Exporter::Stat)
- [Catmandu::Fix::stat\_mean](https://metacpan.org/pod/Catmandu::Fix::stat_mean)
- [Catmandu::Fix::stat\_median](https://metacpan.org/pod/Catmandu::Fix::stat_median)
- [Catmandu::Fix::stat\_stddev](https://metacpan.org/pod/Catmandu::Fix::stat_stddev)
- [Catmandu::Fix::stat\_variance](https://metacpan.org/pod/Catmandu::Fix::stat_variance)

# EXAMPLES

The Catmandu::Stat distribution includes a CSV file on the Sacramento crime rate in January 2006,
"t/SacramentocrimeJanuary2006.csv", which is also available at
http://samplecsvs.s3.amazonaws.com/SacramentocrimeJanuary2006.csv

To view statistics on the fields available in this file, type:

    $ catmandu convert CSV to Stat < t/SacramentocrimeJanuary2006.csv

| name | count | zeros | zeros% | min | max | mean | variance | stdev | uniq~ | uniq% | entropy |
|---------------|-------|-------|--------|-----|-----|------|----------|-------|-------|-------|-----------|
| # | 7584 | | | | | | | | | | |
| address | 7584 | 0 | 0.0 | 1 | 1 | 1 | 0.0 | 0.0 | 5425 | 71.5 | 12.4/12.4 |
| beat | 7584 | 0 | 0.0 | 1 | 1 | 1 | 0.0 | 0.0 | 20 | 0.3 | 4.3/12.9 |
| cdatetime | 7584 | 0 | 0.0 | 1 | 1 | 1 | 0.0 | 0.0 | 5071 | 66.9 | 12.3/12.3 |
| crimedescr | 7584 | 0 | 0.0 | 1 | 1 | 1 | 0.0 | 0.0 | 305 | 4.0 | 5.6/12.6 |
| district | 7584 | 0 | 0.0 | 1 | 1 | 1 | 0.0 | 0.0 | 6 | 0.1 | 2.6/12.9 |
| grid | 7584 | 0 | 0.0 | 1 | 1 | 1 | 0.0 | 0.0 | 537 | 7.1 | 7.8/9.9 |
| latitude | 7584 | 0 | 0.0 | 1 | 1 | 1 | 0.0 | 0.0 | 5288 | 69.7 | 12.4/12.4 |
| longitude | 7584 | 0 | 0.0 | 1 | 1 | 1 | 0.0 | 0.0 | 5295 | 69.8 | 12.4/12.4 |
| ucr_ncic_code | 7584 | 0 | 0.0 | 1 | 1 | 1 | 0.0 | 0.0 | 88 | 1.2 | 4.1/12.9 |

The file has 7584 rows, and all the fields from `address` to `ucr_ncic_code` contain a value in every row.
Each field has only one value per row (no arrays are available in a CSV file). The table reports
about 5425 unique addresses in the CSV file. The `district` field has the lowest entropy (2.6):
its 6 distinct values are shared among many rows.
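
To make the `uniq%` and `entropy` columns more concrete, here is a rough Python sketch that computes a count, the number of distinct values, their percentage, and the Shannon entropy in bits for each field. The sample rows and the `field_stats` helper are hypothetical, and the exact formulas used by Catmandu::Exporter::Stat are an assumption here, not taken from its source.

```python
import csv
import io
import math
from collections import Counter

# Hypothetical four-row sample; the real file has 7584 rows.
SAMPLE = """district,beat
1,1A
1,1B
2,2A
2,2A
"""

def field_stats(rows, field):
    """Return count, distinct values, uniq% and Shannon entropy (bits) for one field."""
    values = [row[field] for row in rows if row.get(field) not in (None, "")]
    total = len(values)
    if total == 0:
        return {"count": 0, "uniq": 0, "uniq%": 0.0, "entropy": 0.0}
    counts = Counter(values)
    # Shannon entropy: sum over distinct values of p * log2(1/p)
    entropy = sum((n / total) * math.log2(total / n) for n in counts.values())
    return {
        "count": total,
        "uniq": len(counts),
        "uniq%": round(100 * len(counts) / total, 1),
        "entropy": round(entropy, 1),
    }

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
stats = {field: field_stats(rows, field) for field in ("district", "beat")}
print(stats)
```

A field whose values are spread evenly over few distinct values, like `district` in the table above, yields a low entropy, while a field where almost every row is different, like `address`, yields a high one.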

# SEE ALSO

[Catmandu](https://metacpan.org/pod/Catmandu),
[Catmandu::Breaker](https://metacpan.org/pod/Catmandu::Breaker)

# AUTHOR

Patrick Hochstenbach

# LICENSE AND COPYRIGHT

This program is free software; you can redistribute it and/or modify it
under the terms of either: the GNU General Public License as published
by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.