{"id":15713674,"url":"https://github.com/naqvis/crysda","last_synced_at":"2025-06-22T22:34:41.718Z","repository":{"id":42477690,"uuid":"307441672","full_name":"naqvis/CrysDA","owner":"naqvis","description":"Crystal library for Data Analysis, Wrangling, Munging","archived":false,"fork":false,"pushed_at":"2023-03-27T04:55:59.000Z","size":8499,"stargazers_count":22,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-05-12T21:46:06.158Z","etag":null,"topics":["crystal","crystal-lang","crystal-language","crystal-shard","data-a","data-science","data-wrangling"],"latest_commit_sha":null,"homepage":"","language":"Crystal","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/naqvis.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-26T16:49:20.000Z","updated_at":"2024-08-22T04:25:59.000Z","dependencies_parsed_at":"2024-10-24T10:51:55.929Z","dependency_job_id":"5dd61c89-79b6-4834-bdd0-8962dbaab1a6","html_url":"https://github.com/naqvis/CrysDA","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/naqvis/CrysDA","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naqvis%2FCrysDA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naqvis%2FCrysDA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naqvis%2FCrysDA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naqvis%2FCrysDA/manifests","owner_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/owners/naqvis","download_url":"https://codeload.github.com/naqvis/CrysDA/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naqvis%2FCrysDA/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261379838,"owners_count":23149927,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crystal","crystal-lang","crystal-language","crystal-shard","data-a","data-science","data-wrangling"],"created_at":"2024-10-03T21:32:51.938Z","updated_at":"2025-06-22T22:34:36.706Z","avatar_url":"https://github.com/naqvis.png","language":"Crystal","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CrysDA\n\n[![CI](https://github.com/naqvis/CrysDA/workflows/CrysDA%20CI/badge.svg)](https://github.com/naqvis/CrysDA/actions?query=workflow%3ACrysDA%20CI)\n[![Latest release](https://img.shields.io/github/release/naqvis/CrysDA.svg)](https://github.com/naqvis/CrysDA/releases)\n[![Docs](https://img.shields.io/badge/docs-available-brightgreen.svg)](https://naqvis.github.io/CrysDA/)\n\n`CrysDA` is a **Crys**tal shard for **D**ata **A**nalysis. Provides you modern functional-style API for data manipulation to filter, transform, aggregate and reshape tabular data. Core of the library is `Crysda::DataFrame` an immutable data structure interface.\n\n`CrysDA` is heavily inspired by the amazing [`dplyr`](https://github.com/hadley/dplyr) for [R](https://www.r-project.org/). `CrysDA` is written in pure Crystal and have no external dependencies. 
It is mimicking the API of `dplyr`, while carefully adding more typed constructs where possible.\n\n## Features\n\n\n- [X] Filter, transform, aggregate and reshape tabular data\n- [X] Modern, user-friendly and easy-to-learn data-science API\n- [X] Reads from plain and compressed tsv, csv, json, or any delimited format with or without header from local or remote with auto inferring the types of data.\n- [X] Supports reading data from DB\n- [X] Supports grouped operations\n- [X] Tables can contain atomic columns (Number, Float, Bool, String) as well as object columns\n- [X] Reshape tables from wide to long and back\n- [X] Table joins (left, right, semi, inner, outer)\n- [X] Cross tabulation\n- [X] Descriptive statistics (mean, min, max, median, ...)\n- [X] Functional API inspired by [dplyr](http://dplyr.tidyverse.org/), [pandas](http://pandas.pydata.org/)\n\n- [X] many more...\n\n\n## Quick glimpse and comparison with R/dplyr\n\n```crystal\nflights = Crysda.read_csv(\"./spec/data/nycflights.tsv.gz\", separator: '\\t')\n\nflights\n.group_by(\"year\", \"month\", \"day\")\n.select(\n  Crysda.selector { |e| e[\"year\"..\"day\"] }, # columns range\n  Crysda.selector { |e| e.list_of(\"arr_delay\", \"dep_delay\") })\n.summarize(\n  \"mean_arr_delay\".with {|s| s[\"arr_delay\"].mean(remove_na: true)},\n  \"mean_dep_delay\".with {|s| s[\"dep_delay\"].mean(true)})\n.filter {|f| (f[\"mean_arr_delay\"] \u003e 30) .or (f[\"mean_dep_delay\"] \u003e 30)}\n.print(\"Flights mean delay of arrival and departure\")\n```\n**output**\n```shell\nFlights mean delay of arrival and departure: 49 x 5\n     year   month   day   mean_arr_delay   mean_dep_delay\n 1   2013       1    16           34.247           24.613\n 2   2013       1    31           32.603           28.658\n 3   2013      10     7           39.017           39.147\n 4   2013      10    11           18.923           31.232\n 5   2013      12     5           51.666           52.328\n 6   2013      12     8           36.912      
     21.515\n 7   2013      12     9           42.576           34.800\n 8   2013      12    10           44.509           26.465\n 9   2013      12    14           46.398           28.362\n10   2013      12    17           55.872           40.706\nand 39 more rows\n```\n\n**And the same snippet written in `dplyr`**\n```r\nflights %\u003e%\n    group_by(year, month, day) %\u003e%\n    select(year:day, arr_delay, dep_delay) %\u003e%\n    summarise(\n        mean_arr_delay = mean(arr_delay, na.rm = TRUE),\n        mean_dep_delay = mean(dep_delay, na.rm = TRUE)\n    ) %\u003e%\n    filter(mean_arr_delay \u003e 30 | mean_dep_delay \u003e 30)\n```\n---\n## Tutorial - Short 1 minute Introduction\nFor this quick and short tutorial, we will be using [ramen-ratings](https://www.kaggle.com/residentmario/ramen-ratings) dataset from kaggle. You are free to use any of your choice.\n\n```crystal\n# load dataset\ndf = Crysda.read_csv(\"./spec/data/ramen-ratings.csv\")\n```\nShard provide support for loading data from CSV, TSV, JSON, DB, URL etc and auto infer the types of columns by peeking into data and make a best choice of data type. Once we’ve read the data into a DataFrame, we can start poking it to see what it looks like. 
A couple of things one typically look at first are the schema and a few rows.\n```crystal\ndf.print(max_rows: 5) # just show us first 5 rows of data\n```\n```shell\nA DataFrame: 2580 x 7\n    Review #            Brand                                                       Variety   Style   Country   Stars   Top Ten\n1       2580        New Touch                                     T's Restaurant Tantanmen      Cup     Japan    3.75\n2       2579         Just Way   Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles    Pack    Taiwan       1\n3       2578           Nissin                                 Cup Noodles Chicken Vegetable     Cup       USA    2.25\n4       2577          Wei Lih                                 GGE Ramen Snack Tomato Flavor    Pack    Taiwan    2.75\n5       2576   Ching's Secret                                               Singapore Curry    Pack     India    3.75\nand 2575 more rows\n```\nabove output shows that our dataset contains 2580 observations (rows) with 7 variables (or they are called columns here)\n\n```crystal\ndf.schema # show the structure of data.\n```\n\n```shell\nDataFrame with 2580 observations\nReview # [Int32]  2580, 2579, 2578, 2577, 2576, 2575, 2574, 2573, 2572, 2571, 2570, 2569, 2568, 2567, 2566, 2565, 2564...\nBrand    [String] New Touch, Just Way, Nissin, Wei Lih, Ching's Secret, Samyang Foods, Acecook, Ikeda Shoku, Ripe'n'Dr...\nVariety  [String] T's Restaurant Tantanmen , Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles, Cup Noodles ...\nStyle    [String] Cup, Pack, Cup, Pack, Pack, Pack, Cup, Tray, Pack, Pack, Pack, Pack, Pack, Bowl, Pack, Cup, Pack, Pa...\nCountry  [String] Japan, Taiwan, USA, Taiwan, India, South Korea, Japan, Japan, Japan, Singapore, Thailand, USA, South...\nStars    [String] 3.75, 1, 2.25, 2.75, 3.75, 4.75, 4, 3.75, 0.25, 2.5, 5, 5, 4.25, 4.5, 5, 3.5, 3.75, 5, 4, 4, 4.25, 5...\nTop Ten  [String] , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
, , , , , , , , , , , ...\n```\nAlready at this point we can notice that, for some reason, the ratings (`Stars` column) were inferred to be of `String` type. That might be due to some weirdness in the data itself. Exploring various datasets you’ll encounter all sorts of strange things. Some are easy to fix, like in this case. Let's try to see what is causing the problem.\n```crystal\ndf.count(\"Stars\").print(max_rows: 15)\n```\n\n```shell\nA DataFrame: 51 x 2\n       Stars     n\n 1      3.75   350\n 2         1    26\n 3      2.25    21\n 4      2.75    85\n 5      4.75    64\n 6         4   384\n 7      0.25    11\n 8       2.5    67\n 9         5   369\n10      4.25   143\n11       4.5   132\n12       3.5   326\n13   Unrated     3\n14       1.5    37\n15      3.25   170\nand 36 more rows\n```\n\nTurns out three records have a rating of “Unrated”. Since there are so few of them, it’s easiest to just drop those records. Alternatively, we can reload the dataset and set the `na_value` argument to `\"Unrated\"`; entries with this value will then be treated as `nil`. 
Use this approach if you want to treat some values as nil or don't want to lose the other columns' values.\n\n```crystal\ndf = Crysda.read_csv(\"./spec/data/ramen-ratings.csv\", na_value: \"Unrated\") # this will retain all rows, while column values with \"Unrated\" will be treated as `nil`\n```\n\nBut in this tutorial we are just going to drop those 3 rows and add a new column to the dataframe loaded above.\n```crystal\nnew_df = df.filter { |f| f[\"Stars\"].matching { |s| !s.starts_with?(\"Un\") } }\n  .add_column(\"Stars_New\") { |c| c[\"Stars\"].map { |m| m.to_s.to_f } }.tap(\u0026.schema)\n```\n\n```shell\nDataFrame with 2577 observations\nReview #  [Int32]   2580, 2579, 2578, 2577, 2576, 2575, 2574, 2573, 2572, 2571, 2570, 2569, 2568, 2567, 2566, 2565, 2564...\nBrand     [String]  New Touch, Just Way, Nissin, Wei Lih, Ching's Secret, Samyang Foods, Acecook, Ikeda Shoku, Ripe'n'Dr...\nVariety   [String]  T's Restaurant Tantanmen , Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles, Cup Noodles ...\nStyle     [String]  Cup, Pack, Cup, Pack, Pack, Pack, Cup, Tray, Pack, Pack, Pack, Pack, Pack, Bowl, Pack, Cup, Pack, Pa...\nCountry   [String]  Japan, Taiwan, USA, Taiwan, India, South Korea, Japan, Japan, Japan, Singapore, Thailand, USA, South...\nStars     [String]  3.75, 1, 2.25, 2.75, 3.75, 4.75, 4, 3.75, 0.25, 2.5, 5, 5, 4.25, 4.5, 5, 3.5, 3.75, 5, 4, 4, 4.25, 5...\nTop Ten   [String]  , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ...\nStars_New [Float64] 3.750, 1.000, 2.250, 2.750, 3.750, 4.750, 4.000, 3.750, 0.250, 2.500, 5.000, 5.000, 4.250, 4.500, 5....\n```\nSo we added a new column `Stars_New`, and we can see that it is now of `Float64` type. We can now perform simple statistical operations on this column. 
Let's just calculate the average rating.\n```crystal\n# we can either create a summary dataframe\nnew_df.summarize(\"Average Rating\".with {|c| c[\"Stars_New\"].mean}).tap(\u0026.print)\n# or we can compute the value directly from the column\nputs new_df[\"Stars_New\"].mean # =\u003e 3.654675979821498\n```\nOf course, you may have questions about your data that require some data manipulation, like grouping. For example, let’s find out how many unique ramen brands there are per country.\n```crystal\nbrands_per_country = new_df\n.group_by(\"Country\")\n.distinct(\"Brand\")\n.group_by(\"Country\")\n.count.tap(\u0026.print)\n```\n\n```shell\nA DataFrame: 31 x 2\n         Country    n\n 1         Japan   58\n 2        Taiwan   47\n 3           USA   44\n 4         India    7\n 5   South Korea   32\n 6     Singapore    5\n 7      Thailand   22\n 8     Hong Kong    9\n 9       Vietnam   14\n10         Ghana    2\nand 21 more rows\n```\nLet's sort the dataframe by unique brand count in descending order (highest count on top).\n```crystal\nbrands_per_country.sort_desc_by(\"n\").print\n```\n\n```shell\nA DataFrame: 31 x 2\n         Country    n\n 1         Japan   58\n 2        Taiwan   47\n 3           USA   44\n 4   South Korea   32\n 5      Malaysia   28\n 6         China   22\n 7      Thailand   22\n 8     Indonesia   18\n 9       Vietnam   14\n10            UK   11\nand 21 more rows\n```\nThese were just a few basic examples to give you a taste of what you can do with `Crysda`. As every data wrangler’s path is different, I would encourage you to grab a dataset that interests you and explore it.\n\n---\n## Tutorial 2 - Reshaping Data\nData analysis can be divided into three parts:\n- Extraction: First, we need to collect the data from many sources and combine them.\n- Transform: This step involves the data manipulation. 
Once we have consolidated all the sources of data, we can begin to clean the data.\n- Visualize: The last move is to visualize our data to check for irregularities.\n\nOne of the most significant challenges faced by data scientists is data manipulation. Data is rarely available in the desired format, so a data scientist often spends a large share of their time cleaning and manipulating it. That is one of the most critical assignments in the job. If the data manipulation process is not complete, precise and rigorous, the model will not perform correctly.\n\n### Merging(joining) Data\nCrysDA provides a nice and convenient way to combine datasets. We may have many sources of input data, and at some point, we need to combine them. A join with CrysDA adds variables to the right of the original dataset. The beauty of CrysDA is that it handles four types of joins similar to SQL:\n\n- left join\n- right join\n- inner join\n- outer join\n\nWe will study all the join types via an easy example.\n\nFirst of all, we build two datasets. Table 1 contains two variables, ID and y, whereas Table 2 contains ID and z. In each situation, we need a key variable shared by both tables. In our case, ID is our key variable. The join will look for identical key values in both tables and bind the matching values to the right of table 1.\n![Table 1](images/table1.png)\n\n```crystal\ndf_primary = Crysda.dataframe_of(\"ID\",\"y\").values(\n  \"A\", 5,\n   \"B\", 5,\n   \"C\", 8,\n   \"D\", 0,\n  \"F\", 9\n)\n\ndf_secondary = Crysda.dataframe_of(\"ID\",\"z\").values(\n  \"A\", 30,\n   \"B\", 21,\n   \"C\", 22,\n   \"D\", 25,\n   \"E\", 29\n)\n```\n\n#### left_join()\n\nThe most common way to merge two datasets is to use the `left_join` function. We can see from the picture below that the keys A, B, C and D match perfectly in both datasets. However, E and F are left over. How do we treat these two observations? 
With `left_join`, we keep all the rows of the original table and drop the rows of the destination table that have no matching key. In our example, the key E does not exist in table 1, so that row is dropped. The key F comes from the original table; it is kept after the `left_join` and gets NA in column z. The figure below reproduces what happens with a `left_join`.\n![Left Join](images/left_join.png)\n\n```crystal\ndf_primary.left_join(df_secondary, \"ID\").print(\"Left Join\")\n```\nOutput:\n\n```shell\nLeft Join: 5 x 3\n    ID   y      z\n1    A   5     30\n2    B   5     21\n3    C   8     22\n4    D   0     25\n5    F   9   \u003cNA\u003e\n```\n\n#### right_join()\n\nThe `right_join` function works like `left_join` with the roles of the two tables swapped, so the only difference is which row is dropped. The key E, available in the destination data frame, exists in the new table and takes the value NA for column y, while F is dropped.\n![Right Join](images/right_join.png)\n\n```crystal\ndf_primary.right_join(df_secondary, \"ID\").print(\"Right Join\")\n```\n\nOutput:\n\n```shell\nRight Join: 5 x 3\n    ID      y    z\n1    A      5   30\n2    B      5   21\n3    C      8   22\n4    D      0   25\n5    E   \u003cNA\u003e   29\n```\n\n#### inner_join()\n\nWhen we know that the two datasets won't match completely, we can choose to return only the rows that exist in both datasets. This is useful when we need a clean dataset or when we don't want to impute missing values with the mean or median.\n\nThis is where `inner_join` comes in. 
This function excludes the unmatched rows.\n![Inner Join](images/inner_join.png)\n\n```crystal\ndf_primary.inner_join(df_secondary, \"ID\").print(\"Inner Join\")\n```\n\nOutput:\n\n```shell\nInner Join: 4 x 3\n    ID   y    z\n1    A   5   30\n2    B   5   21\n3    C   8   22\n4    D   0   25\n```\n\n#### outer_join()\nFinally, the `outer_join` function keeps all observations and replace missing values with `NA`.\n![Outer Join](images/outer_join.png)\n\n```crystal\ndf_primary.outer_join(df_secondary, \"ID\").print(\"Outer Join\")\n```\nOutput:\n\n```shell\nOuter Join: 6 x 3\n    ID      y      z\n1    A      5     30\n2    B      5     21\n3    C      8     22\n4    D      0     25\n5    E   \u003cNA\u003e     29\n6    F      9   \u003cNA\u003e\n```\n\n#### Multiple keys pairs\n\nWe can have multiple keys in our dataset. Consider the following dataset where we have a years or a list of products bought by the customer.\n![Duplicate keys](images/multikey_join.png)\n\n```crystal\ndf_primary = Crysda.dataframe_of(\"ID\",\"year\",\"items\").values(\n  \"A\", 2015,3,\n  \"A\", 2016,7,\n  \"A\", 2017,6,\n  \"B\", 2015,4,\n  \"B\", 2016,8,\n  \"B\", 2017,7,\n  \"C\", 2015,4,\n  \"C\", 2016,6,\n  \"C\", 2017,6\n)\n\ndf_secondary = Crysda.dataframe_of(\"ID\",\"year\",\"prices\").values(\n  \"A\", 2015,9,\n  \"A\", 2016,8,\n  \"A\", 2017,12,\n  \"B\", 2015,13,\n  \"B\", 2016,14,\n  \"B\", 2017,6,\n  \"C\", 2015,15,\n  \"C\", 2016,15,\n  \"C\", 2017,13\n)\n\ndf_primary.left_join(df_secondary, by: [\"ID\",\"year\"]).print(\"Multikey Join\")\n```\n\nOutput:\n\n```shell\nMultikey Join: 9 x 4\n    ID   year   items   prices\n1    A   2015       3        9\n2    A   2016       7        8\n3    A   2017       6       12\n4    B   2015       4       13\n5    B   2016       8       14\n6    B   2017       7        6\n7    C   2015       4       15\n8    C   2016       6       15\n9    C   2017       6       13\n```\n\n### Data Cleaning functions\nFollowing are four important functions 
to tidy the data:\n\n- gather: Transform the data from wide to long\n- spread: Transform the data from long to wide\n- separate: Split one variable into several\n- unite: Unite multiple variables into one\n\n#### gather()\nThe objective of the `gather` function is to transform the data from wide to long.\n\nBelow we can visualize the concept of reshaping wide to long. We want to create a single column named growth, filled by the rows of the quarter variables.\n![gather](images/gather.png)\n\n```crystal\n# Create a dataset\ndf = Crysda.dataframe_of(\"country\", \"q1_2017\", \"q2_2017\", \"q3_2017\", \"q4_2017\").values(\n  \"A\", 0.03, 0.05, 0.04, 0.03,\n  \"B\", 0.05, 0.07, 0.05, 0.02,\n  \"C\", 0.01, 0.02, 0.01, 0.04)\ndf.print\n```\nOutput:\n\n```shell\nA DataFrame: 3 x 5\n    country   q1_2017   q2_2017   q3_2017   q4_2017\n1         A     0.030     0.050     0.040     0.030\n2         B     0.050     0.070     0.050     0.020\n3         C     0.010     0.020     0.010     0.040\n```\n\nReshape the data:\n\n```crystal\nreshaped = df.gather(\"quarter\",\"growth\", Crysda.selector{|c| c[\"q1_2017\"..\"q4_2017\"]}).tap(\u0026.print(max_rows: 12))\n```\n\nOutput:\n\n```shell\nA DataFrame: 12 x 3\n     country   quarter   growth\n 1         A   q1_2017    0.030\n 2         B   q1_2017    0.050\n 3         C   q1_2017    0.010\n 4         A   q2_2017    0.050\n 5         B   q2_2017    0.070\n 6         C   q2_2017    0.020\n 7         A   q3_2017    0.040\n 8         B   q3_2017    0.050\n 9         C   q3_2017    0.010\n10         A   q4_2017    0.030\n11         B   q4_2017    0.020\n12         C   q4_2017    0.040\n```\n\nIn the `gather` call, we created two new variables, **quarter** and **growth**. The original dataset has one grouping variable, **country**, which is kept as is, while the quarter columns become key-value pairs.\n\n#### spread()\nThe `spread` function does the opposite of `gather`. 
We can reshape the data from the example above back to its original form.\n\n```crystal\nreshaped.spread(\"quarter\",\"growth\").print\n```\n\nOutput:\n\n```shell\nA DataFrame: 3 x 5\n    country   q1_2017   q2_2017   q3_2017   q4_2017\n1         A     0.030     0.050     0.040     0.030\n2         B     0.050     0.070     0.050     0.020\n3         C     0.010     0.020     0.010     0.040\n```\n\n#### separate()\nThe `separate` function splits a column into several columns according to a separator. This is helpful when a variable holds a list of values separated by a separator. For example, our analysis requires focusing on quarter and year, so we want to split the column into two new variables.\n\n```crystal\n# keep the result so it can be reused in the unite example below\nseparated = reshaped.separate(\"quarter\", into: [\"Qtr\",\"Year\"], sep: \"_\").tap(\u0026.print(\"Separated\", max_rows: 12))\n```\n\nOutput:\n\n```shell\nSeparated: 12 x 4\n     country   growth   Qtr   Year\n 1         A    0.030    q1   2017\n 2         B    0.050    q1   2017\n 3         C    0.010    q1   2017\n 4         A    0.050    q2   2017\n 5         B    0.070    q2   2017\n 6         C    0.020    q2   2017\n 7         A    0.040    q3   2017\n 8         B    0.050    q3   2017\n 9         C    0.010    q3   2017\n10         A    0.030    q4   2017\n11         B    0.020    q4   2017\n12         C    0.040    q4   2017\n```\n\n#### unite()\nThe `unite` function concatenates multiple columns into one.\n\n```crystal\nseparated.unite(\"Quarter\",[\"Qtr\",\"Year\"], sep: \"_\").print(\"United\")\n```\n\nOutput:\n\n```shell\n     country   growth   Quarter\n 1         A    0.030   q1_2017\n 2         B    0.050   q1_2017\n 3         C    0.010   q1_2017\n 4         A    0.050   q2_2017\n 5         B    0.070   q2_2017\n 6         C    0.020   q2_2017\n 7         A    0.040   q3_2017\n 8         B    0.050   q3_2017\n 9         C    0.010   q3_2017\n10         A    0.030   q4_2017\nand 2 more rows\n```\n---\n## Installation\n\n1. 
Add the dependency to your `shard.yml`:\n\n```yaml\n   dependencies:\n     crysda:\n       github: naqvis/CrysDA\n```\n\n2. Run `shards install`\n\n## Usage\n\n```crystal\nrequire \"crysda\"\n\n# Read tab-delimited data-frame from disk\ndf = Crysda.read_csv(\"data/iris.txt\", separator: '\\t')\n\n# Read data-frame from URL\ndf = Crysda.read_csv(\"http://url/file.csv\")\n\n# Create data-frame in memory\ndf = Crysda.dataframe_of(\"first_name\", \"last_name\", \"age\", \"weight\", \"adult\").values(\n  \"Max\", \"Doe\", 23, 55.8, true,\n  \"Franz\", \"Smith\", 23, 88.3, true,\n  \"Horst\", \"Keanes\", 12, 82.5, false,\n)\n\n# print rows\ndf.print\n\n# print structure of data-frame\ndf.schema\n\n# Subset columns with select/reject\ndf.select(\"last_name\", \"weight\")\ndf.reject(\"weight\", \"age\")\ndf.select(\u0026.ends_with?(\"name\"))\ndf.select? { |v| v.is_a?(Crysda::Int32Col) }\ndf.select? { |v| v.name.starts_with?(\"foo\") }\n\n# Subset rows with filter\ndf.filter { |e| e.[\"age\"] == 23 }\ndf.filter { |e| e.[\"weight\"] \u003e 50 }\ndf.filter { |e| e[\"first_name\"].matching { |e| e.starts_with?(\"Ho\") } }\n\n# Sort your data\ndf.sort_by(\"age\")\n# and add secondary sorting attribute as variadic param\ndf.sort_by(\"age\", \"weight\")\n# sort in descending order\ndf.sort_desc_by(\"age\")\ndf.sort_by { |e| e[\"weight\"] }\n\n# add columns with mutate\n# by adding constant values as new column\ndf.add_column(\"salary_category\") { 3 }\n\n# by doing basic column arithmetics\ndf.add_column(\"age_3y_later\") { |e| e[\"age\"] + 3 }\n\n# Note: dataframes are immutable so we need to (re)assign results to preserve changes.\nnew_df = df.add_column(\"full_name\") { |e| e[\"first_name\"] + \" \" + e[\"last_name\"] }\n\n# Also feel free to mix types\ndf.add_column(\"user_id\") { |e| e[\"last_name\"] + \"_id\" + e.row_num }\n\n# add multiple columns at once\ndf.add_columns(\n  \"age_plus3\".with { |e| e[\"age\"] + 3 },\n  \"initials\".with { |e| 
e[\"first_name\"].map(\u0026.to_s[0]).concatenate(e[\"last_name\"].map(\u0026.to_s[0])) })\n\n# Summarize\n\n# do simple cross tabulations\ndf.count(\"age\", \"last_name\")\n\n# or calculate single summary statistic\ndf.summarize(\"min_age\") { |e| e[\"age\"].min }\n# or\ndf.summarize(\n  \"min_age\".with { |e| e[\"age\"].min },\n  \"max_age\".with { |e| e[\"age\"].max },\n  \"mean_weight\".with { |e| e[\"weight\"].mean },\n)\n\n# Group operations\ngrouped_df = df.group_by(\"age\") # or provide multiple grouping attributes\ngrouped_df.summarize(\n  \"mean_weight\".with { |e| e[\"weight\"].mean(remove_na: true) },\n  \"num_persons\".with {|e| e.num_row}\n)\n\n# optionally ungroup the data\ngrouped_df.ungroup.print\n\n# Join operations\na = Crysda.dataframe_of(\"name\", \"project_id\").values(\n  \"Max\", \"P1\",\n  \"Max\", \"P2\",\n  \"Tom\", \"P3\"\n)\n\nb = Crysda.dataframe_of(\"title\", \"project_id\").values(\n  \"foo\", \"P1\",\n  \"some_title\", \"P2\",\n  \"alt_title\", \"P2\"\n)\n\na.left_join(b, by: \"project_id\").print\na.outer_join(b).print\n\ndf = Crysda.dataframe_of(\"foo\", \"bar\").values(\n  \"a\", 2,\n  \"b\", 3,\n  \"c\", 4\n)\n\n# join on foo\ndf.inner_join(df, by: \"foo\", suffices: {\"_1\", \"_2\"}).tap do |d|\n  d.print\nend\n\n# again but now join on bar. 
Join columns are expected to come first\ndf.inner_join(df, \"bar\", {\"_1\", \"_2\"})\n\n# again but now join on nothing\ndf.inner_join(df, [] of String, {\"_1\", \"_2\"})\n\n# Reshape data\ndf = Crysda.dataframe_of(\"person\", \"year\", \"weight\", \"sex\").values(\n  \"max\", 2014, 33.1, \"M\",\n  \"max\", 2015, 32.3, \"M\",\n  \"max\", 2016, nil, \"M\",\n  \"anna\", 2013, 33.5, \"F\",\n  \"anna\", 2014, 37.3, \"F\",\n  \"anna\", 2015, 39.2, \"F\",\n  \"anna\", 2016, 39.9, \"F\"\n)\ndf.schema\ndf.spread(\"year\", \"weight\").print\n\ndf = Crysda.dataframe_of(\"person\", \"property\", \"value\", \"sex\").values(\n  \"max\", \"salary\", \"33.1\", \"M\",\n  \"max\", \"city\", \"London\", \"M\",\n  \"anna\", \"salary\", \"33.5\", \"F\",\n  \"anna\", \"city\", \"Berlin\", \"F\"\n)\nwide_df = df.spread(\"property\", \"value\")\n\nwide_df.gather(\"property\", \"value\", Crysda::ColumnSelector.new { |x| (x.except(\"person\")).and x.starts_with?(\"person\") })\n\nwide_df.gather(\"property\", \"value\", Crysda::ColumnSelector.new { |x| x.except(\"person\") })\n\nwide_df.gather(\"property\", \"value\", Crysda::ColumnSelector.new { |x| x.except(\"person\") })\n  .tap do |wf|\n    wf.print\n    annual_salary = wf.filter { |x| (x[\"person\"] == \"anna\").and (x[\"property\"] == \"salary\") }\n    annual_salary.print\n  end\n```\n**.....**\n\nUnable to cover each and every functionality in this README. So refer to `specs` for more sample usages and API documentation for all available functionality.\n\n## Development\n\nTo run all tests:\n\n```\ncrystal spec\n```\n\n## Contributing\n\n1. Fork it (\u003chttps://github.com/naqvis/Crysda/fork\u003e)\n2. Create your feature branch (`git checkout -b my-new-feature`)\n3. Commit your changes (`git commit -am 'Add some feature'`)\n4. Push to the branch (`git push origin my-new-feature`)\n5. 
Create a new Pull Request\n\n## Contributors\n\n- [Ali Naqvi](https://github.com/naqvis) - creator and maintainer\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnaqvis%2Fcrysda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnaqvis%2Fcrysda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnaqvis%2Fcrysda/lists"}