https://github.com/antononcube/raku-data-reshapers
Raku package with data reshaping functions for different data structures (full arrays, Red tables, Text::CSV tables.)
- Host: GitHub
- URL: https://github.com/antononcube/raku-data-reshapers
- Owner: antononcube
- License: artistic-2.0
- Created: 2021-08-30T06:39:35.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-06-11T23:13:59.000Z (8 months ago)
- Last Synced: 2024-11-07T03:42:15.591Z (3 months ago)
- Topics: data, data-transformation, data-wrangling, rakulang
- Language: Raku
- Homepage: https://raku.land/zef:antononcube/Data::Reshapers
- Size: 225 KB
- Stars: 4
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README-work.md
- License: LICENSE
# Raku Data::Reshapers
[![MacOS](https://github.com/antononcube/Raku-Data-Reshapers/actions/workflows/macos.yml/badge.svg)](https://github.com/antononcube/Raku-Data-Reshapers/actions/workflows/macos.yml)
[![Linux](https://github.com/antononcube/Raku-Data-Reshapers/actions/workflows/linux.yml/badge.svg)](https://github.com/antononcube/Raku-Data-Reshapers/actions/workflows/linux.yml)
[![Win64](https://github.com/antononcube/Raku-Data-Reshapers/actions/workflows/windows.yml/badge.svg)](https://github.com/antononcube/Raku-Data-Reshapers/actions/workflows/windows.yml)
[![https://raku.land/zef:antononcube/Data::Reshapers](https://raku.land/zef:antononcube/Data::Reshapers/badges/version)](https://raku.land/zef:antononcube/Data::Reshapers)
[![License: Artistic-2.0](https://img.shields.io/badge/License-Artistic%202.0-0298c3.svg)](https://opensource.org/licenses/Artistic-2.0)

This Raku package has data reshaping functions for different data structures that are
coercible to full arrays.

The supported data structures are:

- Positional-of-hashes
- Positional-of-arrays

The most important data reshaping operations provided by the package for those data structures are:

- Cross tabulation, `cross-tabulate`
- Long format conversion, `to-long-format`
- Wide format conversion, `to-wide-format`
- Join across (aka `SQL JOIN`), `join-across`
- Transpose, `transpose`

The first four operations are fundamental in data wrangling and data analysis;
see [AA1, Wk1, Wk2, AAv1-AAv2].

(Transposing of tabular data is, of course, also fundamental, but it can also be seen as a
basic functional programming operation.)

There are other reshaping functions for:

- Flattening and tallying
- Simple and stratified (dataset) splitting
- Taking, renaming, and deleting of table columns
- Table column separation

An overview is given in (some part of) the presentation
["TRC 2022 Implementation of ML algorithms in Raku"](https://youtu.be/efRHfjYebs4?si=-KHucA8exZ8Cxx-w&t=1335),
[AAv4].

More detailed explanations of the data wrangling methodology and workflows are given in the article
["Introduction to data wrangling with Raku"](https://rakuforprediction.wordpress.com/2021/12/31/introduction-to-data-wrangling-with-raku/), [AA2].
(And its Bulgarian version [AA3].)

This package is one of the translation targets of the interpreter(s) provided by the package
["DSL::English::DataQueryWorkflows"](https://github.com/antononcube/Raku-DSL-English-DataQueryWorkflows), [AAp2].

------
## Usage examples
### Cross tabulation
Making contingency tables -- or cross tabulation -- is a fundamental statistics and data analysis operation,
[Wk1, AA1].

Here is an example using the
[Titanic](https://en.wikipedia.org/wiki/Titanic)
dataset (that is provided by this package through the function `get-titanic-dataset`):

```perl6
use Data::Reshapers;

my @tbl = get-titanic-dataset();
my $res = cross-tabulate( @tbl, 'passengerSex', 'passengerClass');
say $res;
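
# Package-independent sketch (made-up records and column names, not part of
# the Titanic data): cross tabulation counts how often each pair of values
# of two columns occurs. This is plain Raku, not the package's implementation.
my @toy-records =
    %( sex => 'female', class => '1st' ),
    %( sex => 'male',   class => '3rd' ),
    %( sex => 'male',   class => '3rd' );
my %toy-counts;
%toy-counts{ $_<sex> }{ $_<class> }++ for @toy-records;
say %toy-counts;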
```

```perl6
to-pretty-table($res);
```

### Long format

Conversion to long format allows column names to be treated as data.
(More precisely, when converting to long format, specified column names of a tabular dataset become values
in a dedicated column, e.g. "Variable" in the long format.)

```perl6
my @tbl1 = @tbl.roll(3);
.say for @tbl1;
```

```perl6
.say for to-long-format( @tbl1 );
```

```perl6
my @lfRes1 = to-long-format( @tbl1, 'id', [], variablesTo => "VAR", valuesTo => "VAL2" );
.say for @lfRes1;
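
# Package-independent sketch (made-up records; "Variable" and "Value" are
# illustrative column names): every non-identifier column of a wide record
# becomes its own row. This is plain Raku, not the package's implementation.
my @toy-wide-records =
    %( id => 'a', height => 170, weight => 65 ),
    %( id => 'b', height => 182, weight => 80 );
my @toy-long-records;
for @toy-wide-records -> %rec {
    for <height weight> -> $var {
        @toy-long-records.push: %( id => %rec<id>, Variable => $var, Value => %rec{$var} );
    }
}
.say for @toy-long-records;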
```

### Wide format

Here we transform the long format result `@lfRes1` above into wide format --
the result has the same records as `@tbl1`:

```perl6
to-pretty-table( to-wide-format( @lfRes1, 'id', 'VAR', 'VAL2' ) );
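
# Package-independent sketch (made-up rows and column names): the pivot behind
# wide format conversion regroups long rows by identifier, so that each
# variable name becomes a column again. Plain Raku, not the package's code.
my @toy-long-rows =
    %( id => 'a', VAR => 'height', VAL => 170 ),
    %( id => 'a', VAR => 'weight', VAL => 65 ),
    %( id => 'b', VAR => 'height', VAL => 182 ),
    %( id => 'b', VAR => 'weight', VAL => 80 );
my %toy-by-id;
%toy-by-id{ $_<id> }{ $_<VAR> } = $_<VAL> for @toy-long-rows;
my @toy-wide-rows = %toy-by-id.sort(*.key).map: { %( id => $_.key, |$_.value.pairs ) };
.say for @toy-wide-rows;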
```

### Transpose

Using the cross tabulation result above:

```perl6
my $tres = transpose( $res );

to-pretty-table($res, title => "Original");
```

```perl6
to-pretty-table($tres, title => "Transposed");
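
# Package-independent sketch (made-up values): for an array of equal-length
# arrays, transposing can be written in plain Raku as a zip reduction.
# This only illustrates the idea; it is not the package's `transpose`.
my @toy-matrix = [1, 2, 3], [4, 5, 6];
my @toy-transposed = ([Z] @toy-matrix).map(*.Array);
say @toy-transposed;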
```

------
## Type system
Earlier versions of the package implemented a type "deduction" system.
Currently, the type system is provided by the package
["Data::TypeSystem"](https://github.com/antononcube/Raku-Data-TypeSystem), [AAp1].

The type system conventions follow those of Mathematica's
[`Dataset`](https://reference.wolfram.com/language/ref/Dataset.html)
-- see the presentation
["Dataset improvements"](https://www.wolfram.com/broadcast/video.php?c=488&p=4&disp=list&v=3264).

Here we get the Titanic dataset, change the "passengerAge" column values to be numeric,
and show the dataset's dimensions:

```perl6
my @dsTitanic = get-titanic-dataset(headers => 'auto');
@dsTitanic = @dsTitanic.map({$_<passengerAge> = $_<passengerAge>.Numeric; $_}).Array;
dimensions(@dsTitanic)
```

Here is a sample of the dataset's records:
```perl6
to-pretty-table(@dsTitanic.pick(5).List)
```

Here is the type of a single record:
```perl6
use Data::TypeSystem;
deduce-type(@dsTitanic[12])
```

Here is the type of a single record's values:

```perl6
deduce-type(@dsTitanic[12].values.List)
```

Here is the type of the whole dataset:

```perl6
deduce-type(@dsTitanic)
```

Here is the type of "values only" records:

```perl6
my @valArr = @dsTitanic>>.values>>.Array;
deduce-type(@valArr)
```

Here is the type of the string values only records:

```perl6
my @valArr = delete-columns(@dsTitanic, 'passengerAge')>>.values>>.Array;
deduce-type(@valArr)
```

------
## TODO
1. [X] DONE Simpler, more convenient interface.
    - ~~Currently, a user has to specify four different namespaces
      in order to be able to use all package functions.~~

2. [ ] TODO More extensive long format tests.

3. [ ] TODO More extensive wide format tests.
4. [X] DONE Implement verifications for:
    - See the type system implementation -- it has all of the functionalities listed here.
    - [X] DONE Positional-of-hashes
    - [X] DONE Positional-of-arrays
    - [X] DONE Positional-of-key-to-array-pairs
    - [X] DONE Positional-of-hashes, each record of which has:
        - [X] Same keys
        - [X] Same type of values of corresponding keys
    - [X] DONE Positional-of-arrays, each record of which has:
        - [X] Same length
        - [X] Same type of values of corresponding elements

5. [X] DONE Implement "nice tabular visualization" using
   [Pretty::Table](https://gitlab.com/uzluisf/raku-pretty-table)
   and/or
   [Text::Table::Simple](https://github.com/ugexe/Perl6-Text--Table--Simple).

6. [X] DONE Document examples using pretty tables.
7. [X] DONE Implement transposing operation for:
    - [X] hash of hashes
    - [X] hash of arrays
    - [X] array of hashes
    - [X] array of arrays
    - [X] array of key-to-array pairs

8. [X] DONE Implement to-pretty-table for:
    - [X] hash of hashes
    - [X] hash of arrays
    - [X] array of hashes
    - [X] array of arrays
    - [X] array of key-to-array pairs

9. [ ] DONE Implement join-across:
    - [X] DONE inner, left, right, outer
    - [X] DONE single key-to-key pair
    - [X] DONE multiple key-to-key pairs
    - [X] DONE optional fill-in of missing values
    - [ ] TODO handling collisions

10. [X] DONE Implement semi- and anti-join
11. [ ] TODO Implement to long format conversion for:
    - [ ] TODO hash of hashes
    - [ ] TODO hash of arrays

12. [ ] TODO Speed/performance profiling.
    - [ ] TODO Come up with profiling tests
    - [ ] TODO Comparison with R
    - [ ] TODO Comparison with Python
13. [ ] TODO Type system.
    - [X] DONE Base type (Int, Str, Numeric)
    - [X] DONE Homogeneous list detection
    - [X] DONE Association detection
    - [X] DONE Struct discovery
    - [ ] TODO Enumeration detection
    - [X] DONE Dataset detection
        - [X] List of hashes
        - [X] Hash of hashes
        - [X] List of lists

14. [X] DONE Refactor the type system into a separate package.

15. [X] DONE "Simple" or fundamental functions
    - [X] `flatten`
    - [X] `take-drop`
    - [X] `tally`
        - Currently in "Data::Summarizers".
        - Can be easily, on the spot, "implemented" with `.BagHash.Hash`.
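
The `.BagHash.Hash` remark in the last item can be illustrated with a minimal sketch in plain Raku. The sample values and variable names are made up for illustration, and no package functions are used:

```perl6
# Hypothetical sample values (not taken from the Titanic dataset).
my @sample-values = <male female female male female>;

# A BagHash counts occurrences of each distinct value;
# .Hash turns those counts into an ordinary hash of value => count.
my %sample-tally = @sample-values.BagHash.Hash;

say %sample-tally<female>;   # 3
say %sample-tally<male>;     # 2
```

The resulting hash maps each distinct value to its number of occurrences, which is the essence of tallying.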
------

## References
### Articles
[AA1] Anton Antonov,
["Contingency tables creation examples"](https://mathematicaforprediction.wordpress.com/2016/10/04/contingency-tables-creation-examples/),
(2016),
[MathematicaForPrediction at WordPress](https://mathematicaforprediction.wordpress.com).

[AA2] Anton Antonov,
["Introduction to data wrangling with Raku"](https://rakuforprediction.wordpress.com/2021/12/31/introduction-to-data-wrangling-with-raku/),
(2021),
[RakuForPrediction at WordPress](https://rakuforprediction.wordpress.com).

[AA3] Anton Antonov,
["Увод в обработката на данни с Raku"](https://rakuforprediction.wordpress.com/2022/05/24/увод-в-обработката-на-данни-с-raku/),
(2022),
[RakuForPrediction at WordPress](https://rakuforprediction.wordpress.com).

[Wk1] Wikipedia entry, [Contingency table](https://en.wikipedia.org/wiki/Contingency_table).

[Wk2] Wikipedia entry, [Wide and narrow data](https://en.wikipedia.org/wiki/Wide_and_narrow_data).

### Functions, repositories
[AAf1] Anton Antonov,
[CrossTabulate](https://resources.wolframcloud.com/FunctionRepository/resources/CrossTabulate),
(2019),
[Wolfram Function Repository](https://resources.wolframcloud.com/FunctionRepository).

[AAf2] Anton Antonov,
[LongFormDataset](https://resources.wolframcloud.com/FunctionRepository/resources/LongFormDataset),
(2020),
[Wolfram Function Repository](https://resources.wolframcloud.com/FunctionRepository).

[AAf3] Anton Antonov,
[WideFormDataset](https://resources.wolframcloud.com/FunctionRepository/resources/WideFormDataset),
(2021),
[Wolfram Function Repository](https://resources.wolframcloud.com/FunctionRepository).

[AAf4] Anton Antonov,
[RecordsSummary](https://resources.wolframcloud.com/FunctionRepository/resources/RecordsSummary),
(2019),
[Wolfram Function Repository](https://resources.wolframcloud.com/FunctionRepository).

[AAp1] Anton Antonov,
[Data::TypeSystem Raku package](https://github.com/antononcube/Raku-Data-TypeSystem),
(2023),
[GitHub/antononcube](https://github.com/antononcube).

[AAp2] Anton Antonov,
[DSL::English::DataQueryWorkflows Raku package](https://github.com/antononcube/Raku-DSL-English-DataQueryWorkflows),
(2022-2024),
[GitHub/antononcube](https://github.com/antononcube).

### Videos

[AAv1] Anton Antonov,
["Multi-language Data-Wrangling Conversational Agent"](https://www.youtube.com/watch?v=pQk5jwoMSxs),
(2020),
[YouTube channel of Wolfram Research, Inc.](https://www.youtube.com/channel/UCJekgf6k62CQHdENWf2NgAQ).
(Wolfram Technology Conference 2020 presentation.)

[AAv2] Anton Antonov,
["Data Transformation Workflows with Anton Antonov, Session #1"](https://www.youtube.com/watch?v=iXrXMQdXOsM),
(2020),
[YouTube channel of Wolfram Research, Inc.](https://www.youtube.com/channel/UCJekgf6k62CQHdENWf2NgAQ).

[AAv3] Anton Antonov,
["Data Transformation Workflows with Anton Antonov, Session #2"](https://www.youtube.com/watch?v=DWGgFsaEOsU),
(2020),
[YouTube channel of Wolfram Research, Inc.](https://www.youtube.com/channel/UCJekgf6k62CQHdENWf2NgAQ).

[AAv4] Anton Antonov,
["TRC 2022 Implementation of ML algorithms in Raku"](https://youtu.be/efRHfjYebs4?si=-KHucA8exZ8Cxx-w),
(2022),
[YouTube/@AAA4Prediction](https://www.youtube.com/@AAA4prediction).