https://github.com/rom1mouret/ml-essentials

dataframe library for machine learning
dataframe dataframe-library machine-learning ml one-hot-encoding preprocessing

Host: GitHub
URL: https://github.com/rom1mouret/ml-essentials
Owner: rom1mouret
License: apache-2.0
Created: 2021-01-31T04:29:21.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2021-02-02T06:30:46.000Z (over 5 years ago)
Last Synced: 2024-06-20T11:58:35.201Z (about 2 years ago)
Topics: dataframe, dataframe-library, machine-learning, ml, one-hot-encoding, preprocessing
Language: Go
Homepage:
Size: 110 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

ml-essentials is a data frame library for Go in the same vein as [gota](https://github.com/go-gota/gota) and [qframe](https://github.com/tobgu/qframe).

It draws inspiration from [pandas](https://pandas.pydata.org/) and [numpy](https://numpy.org/).

Unlike [gota](https://github.com/go-gota/gota) and [qframe](https://github.com/tobgu/qframe), ml-essentials doesn't cater to data scientists: it has no functions to load Excel files or SQL databases, nor functions to help with [EDA](https://en.wikipedia.org/wiki/Exploratory_data_analysis).

It is best suited for machine learning engineers who want to serve their models in a safe and predictable manner.

It is also smaller, with a focus on simplicity, stability and clarity.

I hope that ml-essentials is transparent enough for users to glance at their code and get a sense of what ml-essentials does under the hood and how much it is going to cost in CPU and RAM usage.

To illustrate my point, I am enumerating below all the view-returning functions.

Those features are only available through views, so users have no choice but to spell out what their code should do.

```go
(df *DataFrame) IndexView(indices []int) *DataFrame
(df *DataFrame) SliceView(from int, to int) *DataFrame
(df *DataFrame) MaskView(mask []bool) *DataFrame
(df *DataFrame) ColumnView(columns ...string) *DataFrame
(df *DataFrame) ShuffleView() *DataFrame
(df *DataFrame) SampleView(n int, replacement bool) *DataFrame
(df *DataFrame) SplitNView(n int) []*DataFrame
(df *DataFrame) SplitView(batchSize int) []*DataFrame
(df *DataFrame) SplitTrainTestViews(testingRatio float64) (*DataFrame, *DataFrame)
(df *DataFrame) SortedView(byColumn string) *DataFrame
(df *DataFrame) TopView(byColumn string, n int, ascending bool, sorted bool) *DataFrame
(df *DataFrame) ReverseView() *DataFrame
(df *DataFrame) HashStringsView(columns ...string) *DataFrame
(df *DataFrame) DetachedView(columns ...string) *DataFrame
(df *DataFrame) ResetIndexView() *DataFrame
(df *DataFrame) ShallowCopy() *DataFrame
(df *DataFrame) ColumnConcatView(dfs ...*DataFrame) (*DataFrame, error)
```

View-returning functions are guaranteed not to copy any large chunk of data.
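To make the idea of a view concrete, here is a minimal sketch of one way such a guarantee can be met: a view keeps a small slice of row indices and shares the backing data with its parent. The `Frame` type and its fields are hypothetical and are not the ml-essentials implementation; only the method names `IndexView` and `ReverseView` are borrowed from the list above.

```go
package main

import "fmt"

// Frame is a hypothetical single-column stand-in for a data frame.
// It is NOT the ml-essentials DataFrame type.
type Frame struct {
	data    []float64 // backing storage, shared between a frame and its views
	indices []int     // row positions into data; nil means "all rows, in order"
}

// Len returns the number of visible rows.
func (f *Frame) Len() int {
	if f.indices == nil {
		return len(f.data)
	}
	return len(f.indices)
}

// At returns the value of the i-th visible row.
func (f *Frame) At(i int) float64 {
	if f.indices == nil {
		return f.data[i]
	}
	return f.data[f.indices[i]]
}

// IndexView returns a view over the selected rows. Only the index slice
// is allocated; the backing data is shared, not copied.
func (f *Frame) IndexView(rows []int) *Frame {
	resolved := make([]int, len(rows))
	for i, r := range rows {
		if f.indices == nil {
			resolved[i] = r
		} else {
			resolved[i] = f.indices[r]
		}
	}
	return &Frame{data: f.data, indices: resolved}
}

// ReverseView returns a view with the row order reversed.
func (f *Frame) ReverseView() *Frame {
	n := f.Len()
	rows := make([]int, n)
	for i := range rows {
		rows[i] = n - 1 - i
	}
	return f.IndexView(rows)
}

func main() {
	df := &Frame{data: []float64{10, 20, 30, 40}}
	view := df.IndexView([]int{3, 1}).ReverseView()
	fmt.Println(view.At(0), view.At(1)) // 20 40

	// In this sketch, a write to the parent's storage is visible
	// through the view because the storage is shared.
	df.data[1] = 25
	fmt.Println(view.At(0)) // 25
}
```

In this toy design, chaining views only composes index slices, so the cost of a view grows with the number of selected rows rather than with the width of the data.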

### Documentation and examples

- [dataframe package](dataframe/)

- [preprocessing package](preprocessing/)

- [algorithms package](algorithms/)

- [A-to-Z example](examples/linreg.go)

### Benchmarks

dataset: [kddcup98](https://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html)

task: linear regression

|                             | ml-essentials CPU=1 | ml-essentials CPU=16 | python (pandas + pytorch) |
|-----------------------------|---------------------|----------------------|---------------------------|
| reading CSV                 | 18.3                | 3.3                  | 4.3                       |
| shuffling and splitting     | 0.003               | 0.003                | 0.4                       |
| preprocessing fit_transform | 2.4                 | 0.8                  | 2.2                       |
| linreg training (1 epoch)   | 6.9                 | 6.9                  | 3.4                       |
| preprocessor on test data   | 1                   | 0.5                  | 0.77                      |
| writing predictions         | 33                  | 4.7                  | 426                       |
| reading written rows        | 170                 | 71                   | 410                       |

Reading and writing predictions takes so long because one-hot encoding creates over 20,000 columns.
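As an illustration of why the column count grows so quickly, the sketch below one-hot encodes a single string column: every distinct value becomes its own 0/1 column. The `oneHot` function is a hypothetical helper written for this example and is not the encoder from the preprocessing package.

```go
package main

import (
	"fmt"
	"sort"
)

// oneHot returns the sorted category names and one 0/1 column per category.
// Hypothetical helper, not the ml-essentials encoder.
func oneHot(values []string) ([]string, [][]float64) {
	index := map[string]int{}
	var cats []string
	for _, v := range values {
		if _, seen := index[v]; !seen {
			index[v] = 0
			cats = append(cats, v)
		}
	}
	// Sort so that the column order is deterministic.
	sort.Strings(cats)
	for i, c := range cats {
		index[c] = i
	}
	cols := make([][]float64, len(cats))
	for i := range cols {
		cols[i] = make([]float64, len(values))
	}
	for row, v := range values {
		cols[index[v]][row] = 1
	}
	return cats, cols
}

func main() {
	cats, cols := oneHot([]string{"NY", "CA", "NY", "TX"})
	fmt.Println(len(cols), cats) // 3 [CA NY TX]
	fmt.Println(cols[1])         // [1 0 1 0]
}
```

A dataset with many high-cardinality categorical columns multiplies this effect, which is how a modest number of input columns can turn into thousands of output columns.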

Reproduction

```bash
cd examples
go run linreg.go -momentum=0.2 -epochs=1 -testratio=0.33 -batchsize 256 cup98LRN.txt TARGET_B CONTROLN
python3 linreg.py -momentum=0.2 -epochs=1 -testratio=0.33 -batchsize 256 cup98LRN.txt TARGET_B CONTROLN
```

### Design choices

##### Native types

Here are the benchmarks that motivated my decision to use 3 native types alongside `interface{}`.

Each benchmark measures the time to copy the elements of a slice found at the positions given by a slice of indices.

| type            | speed      | storage choice  | missing value  |
|-----------------|------------|-----------------|----------------|
| `[]interface{}` | 4.51 ns/op | `[]interface{}` | nil            |
| `[]string`      | 4.26 ns/op | `[]interface{}` | nil            |
| `[]float64`     | 1.97 ns/op | `[]float64`     | NaN            |
| `[]int`         | 1.80 ns/op | `[]int`         | -1             |
| `[]bool`        | 1.38 ns/op | `[]bool`        | not applicable |

`float64` was chosen over `float32` for the sake of compatibility with [gonum](https://github.com/gonum).

##### `interface{}` type for all the columns

Storing all the data slices as `interface{}` would be a sound design. For one thing, it requires only one `map[string]interface{}`. By contrast, ml-essentials allocates 5 `map[string]T`, even when they are empty. Some functions also become very succinct: `rename`, for instance, can move the data from one column to another without ever knowing its type.

Ultimately, it was decided not to use `interface{}` for everything. Most functions rely on knowing the precise type and casting the values anyway. The first version used `interface{}` everywhere and many type assertion errors popped up. Although they were easy to fix, the new implementation brings more peace of mind.

### Roadmap

- functions to store/retrieve gonum's blas vectors in the df.objects map

- functions to store/retrieve/sort datetime objects in the df.objects map

- functions to create masks, e.g. `mask := df.Test("age").Lower(15).Mask()`

- smarter ColumnSmartConcat function

- ordinal encoder as an alternative to Hash Encoder

- more methods to RawData, like some sort of concat

- optimization of TopView

- more options to CSV reader and writer, such as BOM parsing

- inverse transform for OneHot

- `RepeatView(n int, interleaved bool)`

- more evaluation metrics, such as cross entropy

- reading/writing data in JSON

- release as a Go module

### External Contributions

ml-essentials is not affiliated with any organization.

Contributions are welcome.
