https://github.com/ankane/rover

Simple, powerful data frames for Ruby

Host: GitHub
URL: https://github.com/ankane/rover
Owner: ankane
License: mit
Created: 2020-05-14T02:30:50.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2024-12-29T20:24:22.000Z (5 months ago)
Last Synced: 2025-04-13T13:18:40.049Z (about 2 months ago)
Language: Ruby
Size: 218 KB
Stars: 363
Watchers: 12
Forks: 18
Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt


# Rover

Simple, powerful data frames for Ruby

:mountain: Designed for data exploration and machine learning, and powered by [Numo](https://github.com/ruby-numo/numo-narray)

:evergreen_tree: Uses [Vega](https://github.com/ankane/vega) for visualization

[![Build Status](https://github.com/ankane/rover/actions/workflows/build.yml/badge.svg)](https://github.com/ankane/rover/actions)

## Installation

Add this line to your application’s Gemfile:

```ruby

gem "rover-df"

```

## Intro

A data frame is an in-memory table. It’s a useful data structure for data analysis and machine learning. It uses columnar storage for fast operations on columns.
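
To make "columnar storage" concrete, here is a minimal plain-Ruby sketch that turns row-oriented data into column-oriented data. It is only an illustration of the idea, not Rover's internals (which are backed by Numo arrays), and the `rows`, `columns`, and `total` names are made up for this example.

```ruby
# Row-oriented data: one hash per row.
rows = [
  {a: 1, b: "one"},
  {a: 2, b: "two"},
  {a: 3, b: "three"}
]

# Column-oriented data: one array per column.
columns = rows.first.keys.to_h { |key| [key, rows.map { |row| row[key] }] }

# A column-wise operation now reads a single array
# instead of visiting every row hash.
total = columns[:a].sum
```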

Try it out for forecasting by clicking the button below (it can take a few minutes to start):

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ankane/ml-stack/master?filepath=Forecasting.ipynb)

Use the `Run` button (or `SHIFT` + `ENTER`) to run each line.

## Creating Data Frames

From an array

```ruby
Rover::DataFrame.new([
  {a: 1, b: "one"},
  {a: 2, b: "two"},
  {a: 3, b: "three"}
])
```

From a hash

```ruby
Rover::DataFrame.new({
  a: [1, 2, 3],
  b: ["one", "two", "three"]
})
```

From Active Record

```ruby

Rover::DataFrame.new(User.all)

```

From a CSV

```ruby
Rover.read_csv("file.csv")
# or
Rover.parse_csv("CSV,data,string")
```

From Parquet (requires the [red-parquet](https://github.com/apache/arrow/tree/master/ruby/red-parquet) gem)

```ruby
Rover.read_parquet("file.parquet")
# or
Rover.parse_parquet("PAR1...")
```

## Attributes

Get number of rows

```ruby

df.count

```

Get column names

```ruby

df.keys

```

Check if a column exists

```ruby

df.include?(name)

```

## Selecting Data

Select a column

```ruby

df[:a]

```

> Note that strings and symbols are different keys, just like hashes. Creating a data frame from Active Record, a CSV, or Parquet uses strings.
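
The string-versus-symbol distinction behaves the same way in an ordinary Ruby hash. The sketch below uses a plain hash as a stand-in for a data frame's columns; the variable names are illustrative only.

```ruby
# A plain hash standing in for columns loaded from a CSV (string keys).
columns = {"a" => [1, 2, 3]}

# Looking up the string key finds the column.
string_lookup = columns["a"]

# Looking up the symbol finds nothing, because :a and "a" are different keys.
symbol_lookup = columns[:a]
```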

Select multiple columns

```ruby

df[[:a, :b]]

```

Select first rows

```ruby

df.head

# or

df.first(5)

```

Select last rows

```ruby

df.tail

# or

df.last(5)

```

Select rows by index

```ruby

df[1]

# or

df[1..3]

# or

df[[1, 4, 5]]

```

Iterate over rows

```ruby

df.each_row { |row| ... }

```

Iterate over a column

```ruby

df[:a].each { |item| ... }

# or

df[:a].each_with_index { |item, index| ... }

```

## Filtering

Filter on a condition

```ruby

df[df[:a] == 100]

df[df[:a] != 100]

df[df[:a] > 100]

df[df[:a] >= 100]

df[df[:a] < 100]

df[df[:a] <= 100]

```

In

```ruby

df[df[:a].in?([1, 2, 3])]

df[df[:a].in?(1..3)]

df[df[:a].in?(["a", "b", "c"])]

```

Not in

```ruby

df[!df[:a].in?([1, 2, 3])]

```

And, or, and exclusive or

```ruby

df[(df[:a] > 100) & (df[:b] == "one")] # and

df[(df[:a] > 100) | (df[:b] == "one")] # or

df[(df[:a] > 100) ^ (df[:b] == "one")] # xor

```
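
Conceptually, each comparison yields one boolean per row, and `&`, `|`, and `^` combine those booleans element by element. The plain-Ruby sketch below imitates that with arrays; it is an illustration, not how Rover implements filtering. The parentheses in the Rover examples matter because Ruby gives `&`, `|`, and `^` higher precedence than `>` and `==`.

```ruby
a = [50, 150, 200]
b = ["one", "one", "two"]

# Element-wise comparisons produce one boolean per row.
mask_a = a.map { |v| v > 100 }
mask_b = b.map { |v| v == "one" }

# Combine the masks element-wise; true and false respond to &, |, and ^.
and_mask = mask_a.zip(mask_b).map { |x, y| x & y }
or_mask  = mask_a.zip(mask_b).map { |x, y| x | y }
xor_mask = mask_a.zip(mask_b).map { |x, y| x ^ y }

# Row indexes kept by the "and" filter.
selected = a.each_index.select { |i| and_mask[i] }
```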

## Operations

Basic operations

```ruby

df[:a] + 5

df[:a] - 5

df[:a] * 5

df[:a] / 5

df[:a] % 5

df[:a] ** 2

df[:a].sqrt

df[:a].cbrt

df[:a].abs

```

Rounding

```ruby

df[:a].round

df[:a].ceil

df[:a].floor

```

Logarithm

```ruby

df[:a].ln # or log

df[:a].log(5)

df[:a].log10

df[:a].log2

```

Exponentiation

```ruby

df[:a].exp

df[:a].exp2

```

Trigonometric functions

```ruby

df[:a].sin

df[:a].cos

df[:a].tan

df[:a].asin

df[:a].acos

df[:a].atan

```

Hyperbolic functions

```ruby

df[:a].sinh

df[:a].cosh

df[:a].tanh

df[:a].asinh

df[:a].acosh

df[:a].atanh

```

Error function

```ruby

df[:a].erf

df[:a].erfc

```

Summary statistics

```ruby

df[:a].count

df[:a].sum

df[:a].mean

df[:a].median

df[:a].percentile(90)

df[:a].min

df[:a].max

df[:a].std

df[:a].var

```

Count occurrences

```ruby

df[:a].tally

```

Cross tabulation

```ruby

df[:a].crosstab(df[:b])

```
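
As a rough mental model, counting occurrences and cross-tabulating can be sketched with plain Ruby arrays and `Array#tally` (Ruby 2.7+). The layout of Rover's own `crosstab` result is not shown here; the pair-keyed hash below is just one way to represent the same counts.

```ruby
a = ["x", "y", "x", "x"]
b = ["p", "p", "q", "p"]

# Count how often each value occurs in one column.
counts = a.tally

# Count how often each (a, b) combination occurs across two columns.
pair_counts = a.zip(b).tally
```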

## Grouping

Group

```ruby

df.group(:a).count

```

Works with all summary statistics

```ruby

df.group(:a).max(:b)

```

Multiple groups

```ruby

df.group(:a, :b).count

```
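
Grouping follows the familiar split-then-summarize pattern. The sketch below reproduces it with `Enumerable#group_by` on an array of hashes; it returns plain hashes, whereas the shape and column names of Rover's result are not covered here.

```ruby
rows = [
  {a: "x", b: 1},
  {a: "y", b: 5},
  {a: "x", b: 3}
]

# Split the rows by the value of column :a.
groups = rows.group_by { |row| row[:a] }

# Summarize each group: row count, and the maximum of column :b.
counts = groups.transform_values(&:count)
max_b = groups.transform_values { |group| group.map { |row| row[:b] }.max }
```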

## Visualization

Add [Vega](https://github.com/ankane/vega) to your application’s Gemfile:

```ruby

gem "vega"

```

And use:

```ruby

df.plot(:a, :b)

```

Specify the chart type (`line`, `pie`, `column`, `bar`, `area`, or `scatter`)

```ruby

df.plot(:a, :b, type: "pie")

```

Group data

```ruby

df.plot(:a, :b, group: :c)

```

Stacked columns or bars

```ruby

df.plot(:a, :b, group: :c, stacked: true)

```

## Updating Data

Add a new column

```ruby

df[:a] = 1

# or

df[:a] = [1, 2, 3]

```

Update a single element

```ruby

df[:a][0] = 100

```

Update multiple elements

```ruby

df[:a][0..2] = 1

# or

df[:a][0..2] = [1, 2, 3]

```

Update all elements

```ruby

df[:a] = df[:a].map { |v| v.gsub("a", "b") }

# or

df[:a].map! { |v| v.gsub("a", "b") }

```

Update elements matching a condition

```ruby

df[:a][df[:a] > 100] = 0

```

Clamp

```ruby

df[:a].clamp!(0, 100)

```
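
Clamping limits every value to a range. A plain-Ruby equivalent, using `Comparable#clamp` on each element and updating the array in place, might look like this (the `values` array is illustrative):

```ruby
values = [-5, 50, 150]

# Values below 0 become 0, values above 100 become 100.
values.map! { |v| v.clamp(0, 100) }
```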

Delete columns

```ruby

df.delete(:a)

# or

df.except!(:a, :b)

```

Rename columns

```ruby

df.rename(a: :new_a, b: :new_b)

# or

df[:new_a] = df.delete(:a)

```

Sort rows

```ruby

df.sort_by! { |r| r[:a] }

```

Clear all data

```ruby

df.clear

```

## Combining Data Frames

Add rows

```ruby

df.concat(other_df)

```

Add columns

```ruby

df.merge!(other_df)

```

Inner join

```ruby

df.inner_join(other_df)

# or

df.inner_join(other_df, on: :a)

# or

df.inner_join(other_df, on: [:a, :b])

# or

df.inner_join(other_df, on: {df_col: :other_df_col})

```

Left join

```ruby

df.left_join(other_df)

```
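
The difference between the two joins can be sketched with arrays of hashes. This is a conceptual illustration only: the `left` and `right` data are made up, and filling unmatched rows with `nil` is an assumption of this sketch rather than a statement about Rover's behavior.

```ruby
left = [{a: 1, b: "one"}, {a: 2, b: "two"}, {a: 3, b: "three"}]
right = [{a: 1, c: "x"}, {a: 3, c: "z"}]

# Index the right-hand rows by the join key :a.
right_by_key = right.group_by { |row| row[:a] }

# Inner join: keep only left rows that have at least one match.
inner = left.flat_map do |l|
  (right_by_key[l[:a]] || []).map { |r| l.merge(r) }
end

# Left join: keep every left row; unmatched rows get nil (assumed here)
# for the columns that exist only on the right.
left_joined = left.flat_map do |l|
  matches = right_by_key[l[:a]] || [{c: nil}]
  matches.map { |r| l.merge(r) }
end
```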

## Encoding

One-hot encoding

```ruby

df.one_hot

```

Drop a variable in each category to avoid the dummy variable trap

```ruby

df.one_hot(drop: true)

```
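
To see why dropping one indicator helps, consider this plain-Ruby sketch. The `one_hot` helper is hypothetical, and dropping the first category seen is an assumption; which category Rover drops and how it names the new columns is not specified here.

```ruby
# Hypothetical helper: one 0/1 indicator column per distinct value.
def one_hot(values, drop: false)
  categories = values.uniq
  categories = categories.drop(1) if drop # assumption: drop the first category seen
  categories.to_h do |category|
    [category, values.map { |v| v == category ? 1 : 0 }]
  end
end

colors = ["red", "green", "blue", "green"]

full = one_hot(colors)
reduced = one_hot(colors, drop: true)

# With every category kept, the indicators in each row always sum to 1,
# so any one column is fully determined by the others.
row_sums = full.values.transpose.map(&:sum)
```

In `reduced`, a row of all zeros stands for the dropped category, so no information is lost.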

## Conversion

Array of hashes

```ruby

df.to_a

```

Hash of arrays

```ruby

df.to_h

```

Numo array

```ruby

df.to_numo

```

CSV

```ruby

df.to_csv

```

Parquet (requires the [red-parquet](https://github.com/apache/arrow/tree/master/ruby/red-parquet) gem)

```ruby

df.to_parquet

```

## Types

You can specify column types when creating a data frame

```ruby

Rover::DataFrame.new(data, types: {"a" => :int64, "b" => :float64})

```

Or

```ruby

Rover.read_csv("data.csv", types: {"a" => :int64, "b" => :float64})

```

Supported types are:

- boolean - `:bool`

- float - `:float64`, `:float32`

- integer - `:int64`, `:int32`, `:int16`, `:int8`

- unsigned integer - `:uint64`, `:uint32`, `:uint16`, `:uint8`

- object - `:object`
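
When choosing an integer type, it helps to know the range each width can hold. The sketch below computes those ranges, assuming the types follow the usual fixed-width two's-complement conventions; `integer_range` is a hypothetical helper, not part of Rover.

```ruby
# Hypothetical helper: value range of a fixed-width integer type.
def integer_range(bits, signed:)
  if signed
    (-(2**(bits - 1)))..(2**(bits - 1) - 1)
  else
    0..(2**bits - 1)
  end
end

ranges = {
  int8: integer_range(8, signed: true),
  uint8: integer_range(8, signed: false),
  int32: integer_range(32, signed: true)
}

# Check whether a value fits before picking a narrower type.
fits_int8 = ranges[:int8].cover?(200)
fits_uint8 = ranges[:uint8].cover?(200)
```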

Get column types

```ruby

df.types

```

For a specific column

```ruby

df[:a].type

```

Change the type of a column

```ruby

df[:a].to!(:int32)

```

## History

View the [changelog](https://github.com/ankane/rover/blob/master/CHANGELOG.md)

## Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

- [Report bugs](https://github.com/ankane/rover/issues)

- Fix bugs and [submit pull requests](https://github.com/ankane/rover/pulls)

- Write, clarify, or fix documentation

- Suggest or add new features

To get started with development:

```sh
git clone https://github.com/ankane/rover.git
cd rover
bundle install
bundle exec rake test
```
