{"id":26062241,"url":"https://github.com/dpmcmlxxvi/clistats","last_synced_at":"2025-04-11T11:09:45.953Z","repository":{"id":18578891,"uuid":"21781964","full_name":"dpmcmlxxvi/clistats","owner":"dpmcmlxxvi","description":"A command line interface tool to compute statistics from a file or the command line.","archived":false,"fork":false,"pushed_at":"2021-02-08T00:47:22.000Z","size":102,"stargazers_count":40,"open_issues_count":2,"forks_count":4,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-25T07:36:08.310Z","etag":null,"topics":["c-plus-plus","command-line","statistics"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dpmcmlxxvi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-07-13T03:19:11.000Z","updated_at":"2024-04-18T17:49:53.000Z","dependencies_parsed_at":"2022-09-06T08:11:01.338Z","dependency_job_id":null,"html_url":"https://github.com/dpmcmlxxvi/clistats","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpmcmlxxvi%2Fclistats","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpmcmlxxvi%2Fclistats/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpmcmlxxvi%2Fclistats/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpmcmlxxvi%2Fclistats/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dpmcmlxxvi","download_url":"https://codeload.github.com/dpmcmlxxvi/clistats/tar.gz/refs/heads/master","h
ost":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248381789,"owners_count":21094527,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c-plus-plus","command-line","statistics"],"created_at":"2025-03-08T15:55:08.994Z","updated_at":"2025-04-11T11:09:45.917Z","avatar_url":"https://github.com/dpmcmlxxvi.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"clistats\n================================================================================\n\nclistats is a command line interface tool to compute statistics of a set\nof delimited input numbers from a stream such as a Comma Separated Value (.csv)\nor Tab Separated Value (.tsv) file. The default delimiter is a comma. Input data can\nbe a file, a redirected pipe, or manually entered at the console. 
To stop\nprocessing and display the statistics during manual input, enter the EOF signal\n(CTRL-D on POSIX systems like Linux or Cygwin or CTRL-Z on Windows).\n\n### I/O options\n\n  * Input data can be from a file, standard input, or a pipe\n  * Output can be written to a file, standard output, or a pipe\n  * Output uses headers that start with \"#\" to enable piping to gnuplot\n    \n### Parsing options\n\n  * Signal, end-of-file, or blank line based detection to stop processing\n  * Comment and delimiter character can be set\n  * Columns can be filtered out from processing\n  * Rows can be filtered out from processing based on numeric constraint\n  * Rows can be filtered out from processing based on string constraint\n  * Rows can be sampled uniformly or randomly.\n  * Initial header rows can be skipped\n  * Fixed number of rows can be processed\n  * Duplicate delimiters can be ignored\n  * Rows can be reshaped into columns\n  * Strictly enforce that only rows of the same size are processed\n  * A row containing column titles can be used to title output statistics\n    \n### Statistics options\n\n  * Summary statistics (Count, Minimum, Mean, Maximum, Standard deviation)\n  * Covariance\n  * Correlation\n  * Least squares offset\n  * Least squares slope\n  * Histogram\n  * Raw data after filtering\n    \n### Warnings\n - Delimiters are not preserved if in quotes. Any delimiter character will\n   cause the row to be split.\n\n - All statistics are computed on a rolling basis and not by using the\n   entire dataset, so all statistics (except minimum, maximum, and count) are\n   approximations. When row sampling is used all statistics are approximations.\n\n - The histogram is also computed on a rolling basis with a\n   dynamic histogram merging algorithm. New data points that do not fall\n   within the current histogram's bounds are added to a cache. Once the cache\n   is full it is merged with the current histogram. 
Bin sizes are scaled by\n   the smallest integer needed to include the new cache data and maintain the\n   same number of bins. The initial histogram is empty so all data is initially\n   added to the cache. Therefore, the cache size should not be too small and\n   should preferably be set to a size that captures the statistics of the underlying\n   sample. However, the larger the cache size the more memory is required to\n   store the cache values. The default value of the cache size is 1000.\n   \n### Alternatives\nI trolled around online and searched for existing solutions to the same problem.\nThere are several very nice solutions which I've listed below. I still think\nclistats is the most flexible, robust, and easiest to use out-of-the-box but\nI've done no real testing so it's just coder's pride saying that. However,\nsome of the others appear to be faster but are more limited in the\nscope of what kind of input they can process and what output statistics they\ngenerate.\n\n - [|STAT](http://hcibib.org/perlman/stat/)\n - [Average](http://sourceforge.net/projects/average/)\n - [datastat](http://sourceforge.net/projects/datastat/)\n - [qstats](https://github.com/tonyfischetti/qstats)\n - [st](https://github.com/nferraz/st)\n - [sta](https://github.com/simonccarter/sta)\n - [stats](http://web.cs.wpi.edu/~claypool/misc/stats/stats.html)\n - [stats-tools](https://github.com/jweslley/stats-tools)\n\nEXAMPLES\n================================================================================\n\nSince nothing explains how to run a tool better than a few simple examples,\nbelow are some basic use cases to get you started with clistats. 
The\nfollowing show how to provide input data, filter the data, and redirect the\ncomputed statistics to gnuplot.\n\n### Standard Input\n\nInput data is taken from the standard input so a user can input\nnumbers at the console:\n\n    ./clistats\n    1,2,3,4\n    5,6,7,8\n    9,0,1,2\n    3,4,5,6\n    \n    #============================================================================\n    #                            Statistics\n    #============================================================================\n    #     Dimension     Count      Minimum         Mean      Maximum        Stdev\n    #----------------------------------------------------------------------------\n                  1         4     1.000000     4.500000     9.000000     2.958040\n                  2         4     0.000000     3.000000     6.000000     2.236068\n                  3         4     1.000000     4.000000     7.000000     2.236068\n                  4         4     2.000000     5.000000     8.000000     2.236068\n\n### File Input\n\nAn input file can be redirected to process delimited data from a file or by\nspecifying the input file (see -i option):\n\n    ./clistats \u003c file.csv\n\n### Pipe Input\n\nInput data can also be provided using a pipe:\n\n    (echo \"1,2,3,4\"; echo \"5,6,7,8\"; echo \"9,0,1,2\"; echo \"3,4,5,6\") | ./clistats\n    #============================================================================\n    #                            Statistics\n    #============================================================================\n    #     Dimension     Count      Minimum         Mean      Maximum        Stdev\n    #----------------------------------------------------------------------------\n                  1         4     1.000000     4.500000     9.000000     2.958040\n                  2         4     0.000000     3.000000     6.000000     2.236068\n                  3         4     1.000000     4.000000     7.000000     2.236068\n                  4      
   4     2.000000     5.000000     8.000000     2.236068\n\n### Realistic Example\n\nA slightly more realistic example would be to download some actual data. The\nexample below downloads comma delimited raw data from the [Lahman Baseball\nArchive](http://www.seanlahman.com/baseball-archive/statistics/).\nThe results show various batting statistics over the years 1871 to 2013.\n\nColumns that have entirely non-numeric data are displayed with a\nNot-A-Number string \"nan\". The example displays the data's\ncorrelation table, which can highlight trends between variables. For\ninstance, note the very low correlation of 0.253640 between Home Runs (HR)\nand Stolen Bases (SB) as most power hitters tend not to be fast runners.\nSame goes for Triples (3B) as it's hard for those big guys to make it all\nthe way to 3rd base so their correlation with Home Runs is 0.338364.\n\nNote, you may need to install wget and unzip to get this example to work.\nAlternatively, you can download and unzip the files manually. 
Also, I don't\nown any of this archive data so use it within the site's legal provisions.\n\n    wget http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip\n    unzip lahman-csv_2014-02-14.zip\n    ./clistats --titles 1 --filterColumn \"1,8:10,12:13,15\" --correlation \u003c Batting.csv\n    #=============================================================================\n    #                           Correlation\n    #=============================================================================\n    #   playerID         AB          R          H         3B         HR         SB\n    #-----------------------------------------------------------------------------\n             nan        nan        nan        nan        nan        nan        nan\n             nan   1.000000   0.950196   0.987135   0.712319   0.684625   0.603282\n             nan   0.950196   1.000000   0.965945   0.742781   0.719900   0.657723\n             nan   0.987135   0.965945   1.000000   0.736148   0.693786   0.611282\n             nan   0.712319   0.742781   0.736148   1.000000   0.338364   0.609333\n             nan   0.684625   0.719900   0.693786   0.338364   1.000000   0.253640\n             nan   0.603282   0.657723   0.611282   0.609333   0.253640   1.000000\n\n### Plotting with gnuplot\n\nYou can pipe this output to gnuplot:\n\n    ./clistats --titles 1 --filterColumn \"1,8:10,12:13,15\" --correlation \u003c Batting.csv | gnuplot -p -e 'plot \"-\" matrix with image title \"Correlation\"'\n\nwhich will display an image representation of the correlation matrix. You'll\nsee any \"nan\" strings are interpreted as undefined values by gnuplot and\nrendered as black.\n\n![Correlation](/examples/correlation.png?raw=true \"Correlation\")\n\n### Column filtering\n\nColumn filters can be used to remove unwanted columns from the computed\nstatistics using the \"--filterColumn\" option. 
The example above can be\nmodified to additionally filter out the first column and remove the\nunwanted \"nan\" strings.\n\n    ./clistats --titles 1 --filterColumn \"8:10,12:13,15\" --correlation \u003c Batting.csv | gnuplot -p -e 'plot \"-\" matrix with image title \"Correlation\"'\n\n![Column Filtering](/examples/column-filtering.png?raw=true \"Column Filtering\")\n\n### Row filtering\n\nRow filters can also be used to remove unwanted rows from the computed statistics\nusing either numeric or string criteria. Multiple row filters can be used and will\nbe processed in the order provided. However, all string filters are processed\nfirst, then all numeric filters. The following example keeps only\nthose batting statistics from the year 2000 onward by matching the entries in the 2nd\ncolumn to the interval [2000,infinity]:\n\n    ./clistats --titles 1 --filterColumn \"2,8:10,12:13,15\" --filterNumeric \"2,2000,inf\" \u003c Batting.csv\n    #===========================================================================\n    #                           Statistics\n    #===========================================================================\n    #   Dimension   Count       Minimum          Mean       Maximum        Stdev\n    #---------------------------------------------------------------------------\n           yearID   18641   2000.000000   2006.440051   2013.000000     4.028313\n               AB   17377      0.000000    134.062611    716.000000   186.083370\n                R   17377      0.000000     18.102664    152.000000    28.139381\n                H   17377      0.000000     35.189561    262.000000    52.407670\n               3B   17377      0.000000      0.731369     23.000000     1.657640\n               HR   17377      0.000000      4.080566     73.000000     7.865398\n               SB   17377      0.000000      2.308741     78.000000     6.119866\n\nThe following example adds an additional filter to keep only those batting\nstatistics for 
players from Boston by matching the string \"BOS\" to the 4th column:\n\n    ./clistats --titles 1 --filterColumn \"2,8:10,12:13,15\" --filterNumeric \"2,2000,inf\" --filterString \"4,BOS\" \u003c Batting.csv\n    #===========================================================================\n    #                           Statistics\n    #===========================================================================\n    #   Dimension   Count       Minimum          Mean       Maximum        Stdev\n    #---------------------------------------------------------------------------\n           yearID     644   2000.000000   2006.335404   2013.000000     4.014638\n               AB     543      0.000000    145.392265    660.000000   197.596551\n                R     543      0.000000     21.965009    123.000000    32.242187\n                H     543      0.000000     39.930018    213.000000    57.410212\n               3B     543      0.000000      0.720074     13.000000     1.586924\n               HR     543      0.000000      4.974217     54.000000     8.855879\n               SB     543      0.000000      2.123389     70.000000     6.290502\n\nUSAGE\n================================================================================\n\nTo display a full listing of the application options use\n\n    $ ./clistats --help\n\nINSTALL\n================================================================================\n\n### Build with Make\n\nA simple GNU make file is provided to build the code. Just run \"make\".\n\n### Build with CMake\n\nA CMake file is also provided for out-of-source builds and for\nfuture development.\n\n  * Definitions\n    1. \\\u003csource\u003e    Directory where source code was installed\n    2. \\\u003cbuild\u003e     Directory where code will be built\n    3. \\\u003cinstall\u003e   Directory where executable will be installed\n  * Prerequisites:\n    1. CMake 2.8 (or higher)            To build from the CMake files.\n    2. 
Visual Studio 2008 (or higher)   To build on Windows\n  * Instructions:\n    1. Create build directory: mkdir \\\u003cbuild\u003e\n    2. Change to build directory: cd \\\u003cbuild\u003e\n    3. Build\n      - Linux\n        * cmake \\\u003csource\u003e -DCMAKE_INSTALL_PREFIX=\\\u003cinstall\u003e -DCMAKE_BUILD_TYPE=Release\n        * make \u0026\u0026 make install\n      - Windows\n        * cmake \\\u003csource\u003e -DCMAKE_INSTALL_PREFIX=\\\u003cinstall\u003e\n        * Open Visual Studio solution \"clistats.sln\"\n        * Run project \"ALL_BUILD\"\n        * Run project \"INSTALL\"\n\nLICENSE\n================================================================================\n\nCopyright (c) 2014 Daniel Pulido \u003cdpmcmlxxvi@gmail.com\u003e\n\nclistats is released under the [MIT License](http://opensource.org/licenses/MIT)\n\nCHANGELOG\n================================================================================\n\n- Version 0.1\n    \n  * Initial release\n\nAUTHOR\n================================================================================\n\nCopyright 2014 by Daniel Pulido \u003cdpmcmlxxvi@gmail.com\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdpmcmlxxvi%2Fclistats","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdpmcmlxxvi%2Fclistats","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdpmcmlxxvi%2Fclistats/lists"}