Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/alvinwan/mlml
memory-limited machine learning utility for Python
- Host: GitHub
- URL: https://github.com/alvinwan/mlml
- Owner: alvinwan
- Created: 2016-10-04T05:34:17.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2017-01-09T08:08:53.000Z (almost 8 years ago)
- Last Synced: 2024-10-12T13:44:37.594Z (3 months ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 2.17 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Memory-Limited Machine Learning (MLML)
This Python 3 package offers support for a variety of algorithms on
memory-limited infrastructure. Specifically, this package addresses
two memory-limited scenarios:

1. The dataset is too large to fit in memory.
2. The kernel matrix is too large to fit in memory.

Created by [Alvin Wan](http://alvinwan.com), with guidance from
[Vaishaal Shankar](http://vaishaal.com) under
[Professor Benjamin Recht](https://people.eecs.berkeley.edu/~brecht/)
at UC Berkeley.

This package is split into two sub-packages:
1. `mlml.ssgd`: **Streaming Stochastic Gradient Descent** handles
   datasets too large for memory. Only the necessary portions of
   the dataset are loaded into memory at a time; to minimize time spent
   on disk I/O, data is shuffled on disk and then read sequentially, as
   sketched below this list.
2. `mlml.kernel`: To handle kernels too large for memory, this package
   generates the kernel matrix part-by-part, performs pre-computation
   for common algorithms, and then runs **Kernelized Stochastic Gradient
   Descent**, streaming pre-computed matrices into memory as needed.

> Note that this project is backwards-compatible down to Python 2, but
> static typing was introduced to comply with PEP 484 (Python 3.5).
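The streaming pattern behind `mlml.ssgd` can be pictured with a short
sketch. This is a minimal illustration, not the package's actual
implementation: it assumes hypothetical flat `float64` binaries
`X.bin`/`Y.bin` on disk and a plain ridge-regression update.

```python
import numpy as np

n, d = 60000, 784                # samples, features (MNIST-sized)
block = 4096                     # samples held in memory at once

# Memory-mapped views: only the slices we touch are read from disk.
X = np.memmap('X.bin', dtype=np.float64, mode='r', shape=(n, d))
Y = np.memmap('Y.bin', dtype=np.float64, mode='r', shape=(n,))

w = np.zeros(d)
eta, reg = 1e-6, 0.1

for epoch in range(3):
    # Visit blocks in random order, but read each block sequentially,
    # standing in for MLML's shuffle-on-disk-then-read-sequentially idea.
    for start in np.random.permutation(range(0, n, block)):
        Xb = np.asarray(X[start:start + block])   # one sequential disk read
        Yb = np.asarray(Y[start:start + block])
        for i in np.random.permutation(len(Xb)):  # SGD within the block
            x, y = Xb[i], Yb[i]
            grad = (x @ w - y) * x + reg * w      # ridge gradient
            w -= eta * grad
    eta *= 0.99                                   # damp learning rate per epoch
```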
# Usage
## Data Format
The import script can be found at [`mlml/utils/imports.py`](https://github.com/alvinwan/mlml/blob/master/mlml/utils/imports.py).
Here are its usage details.

```
Usage:
    imports.py (mnist|spam|cifar-10) [options]

Options:
    --dtype=<dtype>            Datatype of generated memmap [default: uint8]
    --percentage=<percentage>  Percentage of data for training [default: 0.8]
```

To extend this script for other datasets, we recommend using `save_inputs_as_data`
or `save_inputs_labels_as_data`.
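For intuition, here is a sketch of what such an import step produces:
raw train/test binaries that `numpy.memmap` can read back later. The
function name, file layout, and split logic below are assumptions for
illustration, not the signatures of `save_inputs_as_data` or
`save_inputs_labels_as_data`.

```python
import numpy as np

def save_as_memmap(X, Y, prefix, dtype=np.uint8, percentage=0.8):
    """Split into train/test and dump each split as a flat binary."""
    split = int(len(X) * percentage)
    for name, lo, hi in [('train', 0, split), ('test', split, len(X))]:
        np.ascontiguousarray(X[lo:hi], dtype=dtype).tofile(
            '%s.%s.X.bin' % (prefix, name))
        np.ascontiguousarray(Y[lo:hi], dtype=dtype).tofile(
            '%s.%s.Y.bin' % (prefix, name))

# Example: 1000 random 28x28 images with integer labels in [0, 10)
X = np.random.randint(0, 256, size=(1000, 784))
Y = np.random.randint(0, 10, size=1000)
save_as_memmap(X, Y, 'mnist', dtype=np.uint8)
```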
## Scenario 1: Streaming Stochastic Gradient Descent

With `mlml.py`, this algorithm can be run on several popular datasets.
We recommend using the `--simulated` flag when testing with subsets of
your data, so that train accuracy is evaluated on the entire training
dataset.

```
python mlml.py ssgd (mnist|spam|cifar-10) [options]
```
For example, the following runs streaming SGD on MNIST with simulated
memory constraints. Note that the `--buffer` size is in MB.

```
python mlml.py ssgd mnist --buffer=1 --simulated
```
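As a back-of-the-envelope check on what `--buffer=1` means for MNIST,
assuming one `float64` sample per row (MLML's exact accounting may
differ):

```python
d = 784                               # features per MNIST sample
bytes_per_sample = d * 8              # float64 = 8 bytes each
samples_per_mb = (1 * 1024 ** 2) // bytes_per_sample
print(samples_per_mb)                 # => 167 samples per 1 MB buffer
```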
## Scenario 2: Kernelized Stochastic Gradient Descent
> MLML currently prepackages only Kernelized Ridge Regression. However,
> there are generic utilities such as `MemMatrix` and extensible interfaces
> such as `Loss` and `Model` that enable the addition of custom kernelized
> losses.

With `mlml.py`, there are two steps to solving a kernelized problem; see
the derivation [here](https://github.com/alvinwan/mlml/blob/master/files/ridgeregression.pdf).
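For reference, textbook kernel ridge regression reduces to solving a
linear system in the kernel matrix; the linked derivation may use
different conventions.

```latex
% Kernel ridge regression over dual coefficients \alpha, kernel K:
\min_{\alpha \in \mathbb{R}^n} \;
    \lVert K\alpha - y \rVert_2^2 + \lambda\, \alpha^\top K \alpha
% Setting the gradient 2K(K\alpha - y) + 2\lambda K\alpha = 0 yields
\alpha^* = (K + \lambda I)^{-1} y
```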
First, generate the kernel matrix and pre-computed matrices. Use the
`--subset=<num>` flag to perform computations on a subset of the data.

```
python mlml.py generate (mnist|spam|cifar-10) --kernel=<kernel> [options]
```
Then, run streaming stochastic gradient descent to compute the inverse of
our kernel matrix, or a function of our kernel matrix.

```
python mlml.py ssgd (mnist|spam|cifar-10) --memId=<memId> [options]
```
For example, the following runs kernelized SGD on all samples from
CIFAR-10, using the radial basis function (RBF) kernel. Note that
the first command will output the `memId` needed for the second
command.

```
python mlml.py generate cifar-10 --kernel=RBF
python mlml.py ssgd cifar-10 --memId=<memId>
```

To run on a subset of your data, use the `--subset` flag.

```
python mlml.py generate cifar-10 --kernel=RBF --subset=35000
```
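The part-by-part generation step can be pictured as computing the RBF
Gram matrix one block at a time and writing each block straight into a
memory-mapped file, so the full matrix never sits in memory. The block
size, file name, and `gamma` below are assumptions for illustration,
not the output format of `mlml.py generate`.

```python
import numpy as np

def rbf_kernel_blocked(X, out_path, gamma=1e-3, block=1024):
    """Write K[i, j] = exp(-gamma * ||x_i - x_j||^2) block-by-block."""
    n = len(X)
    K = np.memmap(out_path, dtype=np.float64, mode='w+', shape=(n, n))
    sq = (X ** 2).sum(axis=1)            # squared norms, computed once
    for i in range(0, n, block):
        Xi, sqi = X[i:i + block], sq[i:i + block]
        for j in range(0, n, block):
            # ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, one block at a time
            d2 = (sqi[:, None]
                  - 2 * Xi @ X[j:j + block].T
                  + sq[j:j + block][None, :])
            K[i:i + block, j:j + block] = np.exp(-gamma * d2)
    K.flush()                            # push buffered blocks to disk
    return K

X = np.random.randn(4096, 32)
K = rbf_kernel_blocked(X, 'kernel.bin')
```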
## Command-Line Utility
To use the command-line utility, run `mlml.py` at the root of the
repository.

```
Usage:
    mlml.py closed --n=<n> --d=<d> --train=<train> --test=<test> --nt=<nt> [options]
    mlml.py gd --n=<n> --d=<d> --train=<train> --test=<test> --nt=<nt> [options]
    mlml.py sgd --n=<n> --d=<d> --train=<train> --test=<test> --nt=<nt> [options]
    mlml.py ssgd --n=<n> --d=<d> --buffer=<buffer> --train=<train> --test=<test> --nt=<nt> [options]
    mlml.py hsgd --n=<n> --d=<d> --buffer=<buffer> --train=<train> --test=<test> --nt=<nt> [options]
    mlml.py (closed|gd|sgd|ssgd) (mnist|spam|cifar-10) [options]
    mlml.py generate (mnist|spam|cifar-10) --kernel=<kernel> [options]

Options:
    --algo=<algo>            Shuffling algorithm to use [default: external_shuffle]
    --buffer=<buffer>        Size of memory in megabytes (MB) [default: 10]
    --d=<d>                  Number of features
    --damp=<damp>            Amount to multiply learning rate by per epoch [default: 0.99]
    --dtype=<dtype>          The numeric type of each sample [default: float64]
    --epochs=<epochs>        Number of passes over the training data [default: 3]
    --eta0=<eta0>            The initial learning rate [default: 1e-6]
    --iters=<iters>          The number of iterations, used for gd and sgd [default: 5000]
    --k=<k>                  Number of classes [default: 10]
    --kernel=<kernel>        Kernel function to use [default: RBF]
    --loss=<loss>            Type of loss to use [default: ridge]
    --logfreq=<logfreq>      Number of iterations between log entries; 0 for no log [default: 1000]
    --memId=<memId>          Id of memory-mapped matrices containing the kernel
    --momentum=<momentum>    Momentum to apply to changes in weight [default: 0.9]
    --n=<n>                  Number of training samples
    --nt=<nt>                Number of testing samples
    --one-hot=<one_hot>      Whether or not to use one-hot encoding [default: False]
    --nthreads=<nthreads>    Number of threads [default: 1]
    --reg=<reg>              Regularization constant [default: 0.1]
    --step=<step>            Number of iterations between each alpha decay [default: 10000]
    --train=<train>          Path to training data binary [default: data/train]
    --test=<test>            Path to test data [default: data/test]
    --simulated              Mark memory constraints as simulated; allows full accuracy tests
    --subset=<num>           Specify subset of data to pick; ignored if <= 0 [default: 0]
```
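One plausible reading of how `--eta0`, `--damp`, and `--step` interact
(the flag descriptions mention both per-epoch and per-iteration decay,
so the package's actual schedule may differ):

```python
def learning_rate(iteration, eta0=1e-6, damp=0.99, step=10000):
    """Learning rate after `iteration` updates: eta0 decayed by damp
    once every `step` iterations."""
    return eta0 * damp ** (iteration // step)

assert learning_rate(0) == 1e-6              # before the first decay
assert learning_rate(10000) == 1e-6 * 0.99   # after one decay step
```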
# Installation

To use the included Python utilities, install from PyPI:

```
pip install mlml
```

To use the command-line utility, clone the repository:

```
git clone https://github.com/alvinwan/mlml.git
```
# References
- A. Krizhevsky. [Learning Multiple Layers of Features from Tiny Images](https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf), 2009.
- F. Niu, B. Recht, C. Ré, S. J. Wright. [Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent](https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf), 2011.