# Memory-Limited Machine Learning (MLML)

This Python3 package offers support for a variety of algorithms on
memory-limited infrastructure. Specifically, this package addresses
two memory-limited scenarios:

1. Dataset is too large to fit in memory.
2. Kernel is too large to fit in memory.

Created by [Alvin Wan](http://alvinwan.com), with the guidance of
[Vaishaal Shankar](http://vaishaal.com), under
[Professor Benjamin Recht](https://people.eecs.berkeley.edu/~brecht/)
at UC Berkeley.

This package is split into two sub-packages:

1. `mlml.ssgd`: **Streaming Stochastic Gradient Descent** handles
datasets too large for memory. Only the necessary portions of
the dataset are loaded into memory at a time; to keep disk I/O fast,
data is shuffled on disk ahead of time and then read sequentially.

2. `mlml.kernel`: To handle kernels too large for memory, this package
generates the kernel matrix part-by-part, performs pre-computation
for common algorithms, and then runs **Kernelized Stochastic Gradient
Descent**, streaming pre-computed matrices into memory as needed.

> Note that this project is backwards-compatible down to Python 2, but
static typing was introduced to comply with PEP 484, which requires Python 3.5.

# Usage

## Data Format

The import script can be found at [`mlml/utils/imports.py`](https://github.com/alvinwan/mlml/blob/master/mlml/utils/imports.py).
Here are its usage details.

    Usage:
        imports.py (mnist|spam|cifar-10) [options]

    Options:
        --dtype=        Datatype of generated memmap [default: uint8]
        --percentage=   Percentage of data for training [default: 0.8]

To extend this script to other datasets, we recommend using `save_inputs_as_data`
or `save_inputs_labels_as_data`.
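
To see what the resulting on-disk layout can look like, the hypothetical sketch below writes a dataset as a flat binary memmap with plain NumPy. The row layout (features followed by the label) and the function name `save_as_memmap` are illustrative assumptions, not `mlml`'s actual API; the real helpers are `save_inputs_as_data` and `save_inputs_labels_as_data`.

    # Hypothetical sketch: laying out a custom dataset as a binary memmap.
    # The exact layout used by mlml's import helpers is an assumption here;
    # this only illustrates the numpy.memmap mechanics.
    import numpy as np

    def save_as_memmap(X, y, path, dtype='uint8'):
        """Store samples X (n x d) and labels y (n,) as one (n, d + 1) memmap."""
        n, d = X.shape
        out = np.memmap(path, dtype=dtype, mode='w+', shape=(n, d + 1))
        out[:, :d] = X   # features
        out[:, d] = y    # label appended as the last column
        out.flush()      # force the buffered data to disk
        return n, d

    # Example: 1,000 random 28x28 "images" with 10 classes.
    X = np.random.randint(0, 256, size=(1000, 784))
    y = np.random.randint(0, 10, size=1000)
    save_as_memmap(X, y, 'train.memmap')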

## Scenario 1: Streaming Stochastic Gradient Descent

With `mlml.py`, this algorithm can be run on several popular datasets.
We recommend using the `--simulated` flag when testing with subsets of
your data, so that train accuracy is evaluated on the entire train
dataset.

    python mlml.py ssgd (mnist|spam|cifar-10) [options]

For example, the following runs streaming sgd on MNIST with simulated
memory constraints. Note that the `--buffer` size is in MB.

    python mlml.py ssgd mnist --buffer=1 --simulated
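
The streaming behavior itself can be summarized as follows: the disk-shuffled training memmap is read sequentially, one buffer-sized block at a time, and SGD updates are applied to each block before the next one is loaded. The sketch below illustrates that loop with plain NumPy and a squared loss; the names and the `uint8` dtype are illustrative assumptions, not `mlml`'s internal classes.

    # Illustrative sketch of streaming SGD over a memory-mapped dataset;
    # not the package's actual implementation.
    import numpy as np

    def streaming_sgd(path, n, d, buffer_rows, epochs=3, eta=1e-6):
        data = np.memmap(path, dtype='uint8', mode='r', shape=(n, d + 1))
        w = np.zeros(d)
        for _ in range(epochs):
            for start in range(0, n, buffer_rows):
                # Load only one buffer-sized block into RAM at a time;
                # data was already shuffled on disk, so reads stay sequential.
                block = np.asarray(data[start:start + buffer_rows], dtype=np.float64)
                X, y = block[:, :d], block[:, d]
                for i in range(X.shape[0]):
                    grad = (X[i] @ w - y[i]) * X[i]  # SGD step on the squared loss
                    w -= eta * grad
        return w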

## Scenario 2: Kernelized Stochastic Gradient Descent

> MLML currently prepackages only Kernelized Ridge Regression. However,
there are generic utilities such as `MemMatrix` and extensible interfaces
such as `Loss` and `Model` that enable the addition of custom kernelized
losses.

With `mlml.py`, there are two steps to solving a kernelized problem; see
the derivation [here](https://github.com/alvinwan/mlml/blob/master/files/ridgeregression.pdf).
First, generate the kernel matrix and pre-computed matrices. Use the
`--subset=` flag to perform computations on a subset of the data.

    python mlml.py generate (mnist|spam|cifar-10) --kernel= [options]

Then, run streaming stochastic gradient descent to compute the inverse of
the kernel matrix, or of a function of the kernel matrix.

    python mlml.py ssgd (mnist|spam|cifar-10) --memId= [options]
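
For reference, the quantity being approximated here is the standard kernel ridge regression solution (see the linked derivation; the notation below is the usual one and may differ from the PDF's):

    \min_\alpha \; \|K\alpha - y\|_2^2 + \lambda\, \alpha^\top K \alpha
    \quad\Longrightarrow\quad
    \alpha = (K + \lambda I)^{-1} y

Since `K` is n-by-n, it can neither be formed nor inverted in RAM for large n, which is why `generate` writes `K` and the pre-computed matrices to disk and `ssgd` streams them back in blocks.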

For example, the following runs kernelized sgd on all samples from
cifar-10, using the radial basis function (RBF) kernel. Note that
the first command will output the `memId` needed for the second
command.

    python mlml.py generate cifar-10 --kernel=RBF
    python mlml.py ssgd cifar-10 --memId=

To run on a subset of your data, use the `--subset` flag.

    python mlml.py generate cifar-10 --kernel=RBF --subset=35000
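
What "generating the kernel matrix part-by-part" amounts to can be sketched as follows: compute the RBF kernel one block of rows at a time and write each block directly into a memory-mapped matrix on disk, so the full n-by-n matrix never has to sit in RAM. The function name and `gamma` parameter below are illustrative assumptions, not `mlml`'s API.

    # Illustrative sketch: build an RBF kernel matrix block-by-block on disk.
    import numpy as np

    def rbf_kernel_to_memmap(X, path, gamma=1e-3, block_rows=1000):
        n = X.shape[0]
        K = np.memmap(path, dtype='float64', mode='w+', shape=(n, n))
        sq = np.sum(X ** 2, axis=1)          # squared norms, reused for every block
        for start in range(0, n, block_rows):
            stop = min(start + block_rows, n)
            # Pairwise squared distances for this block of rows only.
            d2 = sq[start:stop, None] + sq[None, :] - 2.0 * X[start:stop] @ X.T
            K[start:stop] = np.exp(-gamma * np.maximum(d2, 0.0))
            K.flush()                        # push the finished block to disk
        return K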

## Command-Line Utility

To use the command-line utility, run `mlml.py` at the root of the
repository.

    Usage:
        mlml.py closed --n= --d= --train= --test= --nt= [options]
        mlml.py gd --n= --d= --train= --test= --nt= [options]
        mlml.py sgd --n= --d= --train= --test= --nt= [options]
        mlml.py ssgd --n= --d= --buffer= --train= --test= --nt= [options]
        mlml.py hsgd --n= --d= --buffer= --train= --test= --nt= [options]
        mlml.py (closed|gd|sgd|ssgd) (mnist|spam|cifar-10) [options]
        mlml.py generate (mnist|spam|cifar-10) --kernel= [options]

    Options:
        --algo=         Shuffling algorithm to use [default: external_shuffle]
        --buffer=       Size of memory in megabytes (MB) [default: 10]
        --d=            Number of features
        --damp=         Amount to multiply learning rate by per epoch [default: 0.99]
        --dtype=        The numeric type of each sample [default: float64]
        --epochs=       Number of passes over the training data [default: 3]
        --eta0=         The initial learning rate [default: 1e-6]
        --iters=        The number of iterations, used for gd and sgd [default: 5000]
        --k=            Number of classes [default: 10]
        --kernel=       Kernel function to use [default: RBF]
        --loss=         Type of loss to use [default: ridge]
        --logfreq=      Number of iterations between log entries. 0 for no log. [default: 1000]
        --memId=        Id of memory-mapped matrices containing Kernel.
        --momentum=     Momentum to apply to changes in weight [default: 0.9]
        --n=            Number of training samples
        --nt=           Number of testing samples
        --one-hot=      Whether or not to use one hot encoding [default: False]
        --nthreads=     Number of threads [default: 1]
        --reg=          Regularization constant [default: 0.1]
        --step=         Number of iterations between each alpha decay [default: 10000]
        --train=        Path to training data binary [default: data/train]
        --test=         Path to test data [default: data/test]
        --simulated     Mark memory constraints as simulated. Allows full accuracy tests.
        --subset=       Specify subset of data to pick. Ignored if <= 0. [default: 0]
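
For example, the explicit form of the `ssgd` invocation might look like the following for an MNIST-sized binary (60,000 training samples, 10,000 test samples, 784 features each; the paths are the defaults listed above):

    python mlml.py ssgd --n=60000 --d=784 --nt=10000 --buffer=10 --train=data/train --test=data/test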

# Installation

To use the included Python utilities, install from PyPI.

    pip install mlml

To use the command-line utility, clone the repository.

    git clone https://github.com/alvinwan/mlml.git

# References

- A. Krizhevsky. [Learning Multiple Layers of Features from Tiny Images](https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf), 2009.
- F. Niu, B. Recht, C. Ré, S. J. Wright. [Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent](https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf), 2011.