https://github.com/ankane/rover
Simple, powerful data frames for Ruby
- Host: GitHub
- URL: https://github.com/ankane/rover
- Owner: ankane
- License: MIT
- Created: 2020-05-14T02:30:50.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2024-10-08T01:29:29.000Z (27 days ago)
- Last Synced: 2024-10-08T01:38:38.046Z (27 days ago)
- Language: Ruby
- Size: 221 KB
- Stars: 350
- Watchers: 13
- Forks: 17
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-dataframes - rover - Simple, powerful data frames for Ruby. (Libraries)
- data-science-with-ruby - Rover
README
# Rover
Simple, powerful data frames for Ruby
:mountain: Designed for data exploration and machine learning, and powered by [Numo](https://github.com/ruby-numo/numo-narray)
:evergreen_tree: Uses [Vega](https://github.com/ankane/vega) for visualization
[![Build Status](https://github.com/ankane/rover/actions/workflows/build.yml/badge.svg)](https://github.com/ankane/rover/actions)
## Installation
Add this line to your application’s Gemfile:
```ruby
gem "rover-df"
```
## Intro
A data frame is an in-memory table. It’s a useful data structure for data analysis and machine learning. It uses columnar storage for fast operations on columns.
Try it out for forecasting by clicking the button below (it can take a few minutes to start):
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ankane/ml-stack/master?filepath=Forecasting.ipynb)
Use the `Run` button (or `SHIFT` + `ENTER`) to run each line.
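To make the columnar-storage point concrete, here is a minimal plain-Ruby sketch (no Rover required) contrasting a row-oriented array of hashes with a column-oriented hash of arrays. The `rows` and `columns` variables and the sample data are illustrative assumptions; this models the idea only and is not how Rover stores data internally.

```ruby
# Row-oriented: one hash per record
rows = [
  {a: 1, b: "one"},
  {a: 2, b: "two"},
  {a: 3, b: "three"}
]

# Column-oriented: one array per column
columns = rows.first.keys.to_h { |key| [key, rows.map { |row| row[key] }] }

# A column operation reads a single array...
column_sum = columns[:a].sum

# ...whereas the row layout has to visit every record to pull out one field
row_sum = rows.sum { |row| row[:a] }
```

Both sums agree; the difference is only in how much data each layout has to walk to answer a per-column question.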
## Creating Data Frames
From an array
```ruby
Rover::DataFrame.new([
  {a: 1, b: "one"},
  {a: 2, b: "two"},
  {a: 3, b: "three"}
])
```
From a hash
```ruby
Rover::DataFrame.new({
  a: [1, 2, 3],
  b: ["one", "two", "three"]
})
```
From Active Record
```ruby
Rover::DataFrame.new(User.all)
```
From a CSV
```ruby
Rover.read_csv("file.csv")
# or
Rover.parse_csv("CSV,data,string")
```
From Parquet (requires the [red-parquet](https://github.com/apache/arrow/tree/master/ruby/red-parquet) gem)
```ruby
Rover.read_parquet("file.parquet")
# or
Rover.parse_parquet("PAR1...")
```
## Attributes
Get number of rows
```ruby
df.count
```
Get column names
```ruby
df.keys
```
Check if a column exists
```ruby
df.include?(name)
```
## Selecting Data
Select a column
```ruby
df[:a]
```
> Note that strings and symbols are different keys, just like hashes. Creating a data frame from Active Record, a CSV, or Parquet uses strings.
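The same distinction can be seen with a plain Ruby hash, which is the behavior the note compares against. This sketch uses only core Ruby; `columns` is an illustrative stand-in, not a Rover object.

```ruby
# A hash keyed by strings, as you would get from a CSV header
columns = {"a" => [1, 2, 3]}

string_lookup = columns["a"]     # finds the column
symbol_lookup = columns[:a]      # nil, because :a and "a" are different keys
has_symbol = columns.key?(:a)    # false

# Converting the keys explicitly avoids the mismatch
symbolized = columns.transform_keys(&:to_sym)
converted_lookup = symbolized[:a] # finds the column
```

If a lookup unexpectedly comes back empty, checking whether the keys are strings or symbols is a reasonable first step.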
Select multiple columns
```ruby
df[[:a, :b]]
```
Select first rows
```ruby
df.head
# or
df.first(5)
```
Select last rows
```ruby
df.tail
# or
df.last(5)
```
Select rows by index
```ruby
df[1]
# or
df[1..3]
# or
df[[1, 4, 5]]
```
Iterate over rows
```ruby
df.each_row { |row| ... }
```
Iterate over a column
```ruby
df[:a].each { |item| ... }
# or
df[:a].each_with_index { |item, index| ... }
```
## Filtering
Filter on a condition
```ruby
df[df[:a] == 100]
df[df[:a] != 100]
df[df[:a] > 100]
df[df[:a] >= 100]
df[df[:a] < 100]
df[df[:a] <= 100]
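# For intuition only (plain Ruby, hypothetical data; not how Rover implements filtering):
# a comparison yields one boolean per row, and indexing with that mask keeps the rows
# where it is true.
values = [50, 150, 100]
mask = values.map { |v| v > 100 }
kept = values.select.with_index { |_v, i| mask[i] }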
```
In
```ruby
df[df[:a].in?([1, 2, 3])]
df[df[:a].in?(1..3)]
df[df[:a].in?(["a", "b", "c"])]
```
Not in
```ruby
df[!df[:a].in?([1, 2, 3])]
```
And, or, and exclusive or
```ruby
# Parentheses are required: in Ruby, &, |, and ^ bind more tightly than comparison operators
df[(df[:a] > 100) & (df[:b] == "one")] # and
df[(df[:a] > 100) | (df[:b] == "one")] # or
df[(df[:a] > 100) ^ (df[:b] == "one")] # xor
```
## Operations
Basic operations
```ruby
df[:a] + 5
df[:a] - 5
df[:a] * 5
df[:a] / 5
df[:a] % 5
df[:a] ** 2
df[:a].sqrt
df[:a].cbrt
df[:a].abs
```
Rounding
```ruby
df[:a].round
df[:a].ceil
df[:a].floor
```
Logarithm
```ruby
df[:a].ln # or log
df[:a].log(5)
df[:a].log10
df[:a].log2
```
Exponentiation
```ruby
df[:a].exp
df[:a].exp2
```
Trigonometric functions
```ruby
df[:a].sin
df[:a].cos
df[:a].tan
df[:a].asin
df[:a].acos
df[:a].atan
```
Hyperbolic functions
```ruby
df[:a].sinh
df[:a].cosh
df[:a].tanh
df[:a].asinh
df[:a].acosh
df[:a].atanh
```
Error function
```ruby
df[:a].erf
df[:a].erfc
```
Summary statistics
```ruby
df[:a].count
df[:a].sum
df[:a].mean
df[:a].median
df[:a].percentile(90)
df[:a].min
df[:a].max
df[:a].std
df[:a].var
```
Count occurrences
```ruby
df[:a].tally
```
Cross tabulation
```ruby
df[:a].crosstab(df[:b])
```
## Grouping
Group
```ruby
df.group(:a).count
```
Works with all summary statistics
```ruby
df.group(:a).max(:b)
```
Multiple groups
```ruby
df.group(:a, :b).count
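# For intuition only: a plain-Ruby sketch of what a grouped count and a grouped max compute.
# `rows` is hypothetical sample data, and this is not how Rover implements grouping.
rows = [{a: "x", b: 1}, {a: "x", b: 5}, {a: "y", b: 3}]
grouped = rows.group_by { |row| row[:a] }
counts = grouped.transform_values(&:size) # rows per group
maxes = grouped.transform_values { |group| group.map { |row| row[:b] }.max } # largest b per group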
```
## Visualization
Add [Vega](https://github.com/ankane/vega) to your application’s Gemfile:
```ruby
gem "vega"
```
And use:
```ruby
df.plot(:a, :b)
```
Specify the chart type (`line`, `pie`, `column`, `bar`, `area`, or `scatter`)
```ruby
df.plot(:a, :b, type: "pie")
```
Group data
```ruby
df.plot(:a, :b, group: :c)
```
Stacked columns or bars
```ruby
df.plot(:a, :b, group: :c, stacked: true)
```
## Updating Data
Add a new column
```ruby
df[:a] = 1
# or
df[:a] = [1, 2, 3]
```
Update a single element
```ruby
df[:a][0] = 100
```
Update multiple elements
```ruby
df[:a][0..2] = 1
# or
df[:a][0..2] = [1, 2, 3]
```
Update all elements
```ruby
df[:a] = df[:a].map { |v| v.gsub("a", "b") }
# or
df[:a].map! { |v| v.gsub("a", "b") }
```
Update elements matching a condition
```ruby
df[:a][df[:a] > 100] = 0
```
Clamp
```ruby
df[:a].clamp!(0, 100)
```
Delete columns
```ruby
df.delete(:a)
# or
df.except!(:a, :b)
```
Rename columns
```ruby
df.rename(a: :new_a, b: :new_b)
# or
df[:new_a] = df.delete(:a)
```
Sort rows
```ruby
df.sort_by! { |r| r[:a] }
```
Clear all data
```ruby
df.clear
```
## Combining Data Frames
Add rows
```ruby
df.concat(other_df)
```
Add columns
```ruby
df.merge!(other_df)
```
Inner join
```ruby
df.inner_join(other_df)
# or
df.inner_join(other_df, on: :a)
# or
df.inner_join(other_df, on: [:a, :b])
# or
df.inner_join(other_df, on: {df_col: :other_df_col})
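# For intuition only: a plain-Ruby sketch of inner-join semantics on a shared key.
# `left_rows` and `right_rows` are hypothetical sample data; Rover's own implementation may differ.
left_rows = [{a: 1, x: "l1"}, {a: 2, x: "l2"}]
right_rows = [{a: 2, y: "r2"}, {a: 3, y: "r3"}]
joined = left_rows.flat_map do |left|
  right_rows.select { |right| right[:a] == left[:a] }.map { |right| left.merge(right) }
end
# joined keeps only the rows whose key appears on both sides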
```
Left join
```ruby
df.left_join(other_df)
```
## Encoding
One-hot encoding
```ruby
df.one_hot
```
Drop a variable in each category to avoid the dummy variable trap
```ruby
df.one_hot(drop: true)
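# For intuition only: a plain-Ruby sketch of why one indicator column per variable is redundant.
# `colors` is hypothetical sample data, and this is not how Rover implements one_hot.
colors = ["red", "green", "blue", "green"]
categories = colors.uniq
indicators = colors.map { |color| categories.map { |cat| color == cat ? 1 : 0 } }
row_sums = indicators.map(&:sum) # always 1, so any one column is implied by the others
reduced = indicators.map { |row| row.drop(1) } # in this sketch, drop the first category's column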
```
## Conversion
Array of hashes
```ruby
df.to_a
```
Hash of arrays
```ruby
df.to_h
```
Numo array
```ruby
df.to_numo
```
CSV
```ruby
df.to_csv
```
Parquet (requires the [red-parquet](https://github.com/apache/arrow/tree/master/ruby/red-parquet) gem)
```ruby
df.to_parquet
```
## Types
You can specify column types when creating a data frame
```ruby
Rover::DataFrame.new(data, types: {"a" => :int64, "b" => :float64})
```
Or
```ruby
Rover.read_csv("data.csv", types: {"a" => :int64, "b" => :float64})
```
Supported types are:
- boolean - `:bool`
- float - `:float64`, `:float32`
- integer - `:int64`, `:int32`, `:int16`, `:int8`
- unsigned integer - `:uint64`, `:uint32`, `:uint16`, `:uint8`
- object - `:object`

Get column types
```ruby
df.types
```
For a specific column
```ruby
df[:a].type
```
Change the type of a column
```ruby
df[:a].to!(:int32)
```
## History
View the [changelog](https://github.com/ankane/rover/blob/master/CHANGELOG.md)
## Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- [Report bugs](https://github.com/ankane/rover/issues)
- Fix bugs and [submit pull requests](https://github.com/ankane/rover/pulls)
- Write, clarify, or fix documentation
- Suggest or add new features

To get started with development:
```sh
git clone https://github.com/ankane/rover.git
cd rover
bundle install
bundle exec rake test
```