https://github.com/yupbank/on_the_fly

On-the-fly machine learning toolkit

Host: GitHub
URL: https://github.com/yupbank/on_the_fly
Owner: yupbank
Created: 2016-11-01T14:12:01.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2023-07-06T21:06:09.000Z (almost 2 years ago)
Last Synced: 2025-03-17T12:50:45.053Z (3 months ago)
Language: Python
Homepage:
Size: 33.2 KB
Stars: 2
Watchers: 0
Forks: 1
Open Issues: 2
Metadata Files:
- Readme: README.md

README

# The on-the-fly sklearn dictionary vectorizer, SGD classifier, GridSearchCV and RandomSearchCV

[![Build Status](https://travis-ci.org/yupbank/on_the_fly.svg?branch=master)](https://travis-ci.org/yupbank/on_the_fly)

[![Pypi](https://img.shields.io/pypi/v/on_the_fly.svg)](https://pypi.python.org/pypi/on_the_fly)

It is always painful to generate dictionaries for SGD algorithms. Why not build and use them on the fly?
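
To illustrate the idea, here is a minimal pure-Python sketch of a vectorizer that grows its feature dictionary as batches arrive. The class `IncrementalDictVectorizer` and its encoding rules are hypothetical and are not part of `on_the_fly`; they only show what building the dictionary on the fly could look like.

```python
class IncrementalDictVectorizer:
    """Hypothetical illustration, not the on_the_fly API."""

    def __init__(self):
        self.vocabulary = {}  # feature name -> column index

    def _column(self, key, value):
        # Strings are one-hot encoded as "key=value"; numbers keep their value.
        if isinstance(value, str):
            name, weight = f"{key}={value}", 1.0
        else:
            name, weight = key, float(value)
        # Assign the next free column the first time a feature is seen.
        index = self.vocabulary.setdefault(name, len(self.vocabulary))
        return index, weight

    def partial_fit_transform(self, batch):
        """Update the vocabulary with `batch` and return dense rows."""
        sparse_rows = [
            dict(self._column(k, v) for k, v in record.items())
            for record in batch
        ]
        width = len(self.vocabulary)
        return [
            [row.get(col, 0.0) for col in range(width)]
            for row in sparse_rows
        ]


vec = IncrementalDictVectorizer()
first = vec.partial_fit_transform([{"name": "ann", "age": 30}])
second = vec.partial_fit_transform([{"name": "bob", "age": 25}])
```

In this sketch, rows from earlier batches are narrower than rows from later ones, because new columns are appended whenever an unseen feature shows up. A real pipeline would need column selection or a fixed-width scheme before passing the rows to an SGD model.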

Parameters are added to GridSearch and RandomSearch to make the RDD distributed again: the raw data is split/duplicated and the results are computed on the fly.
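
One way to read "split/duplicate raw data" is that each record is copied once per parameter combination and keyed by that combination, so that every candidate model can be trained on its own key in parallel (for example with a Spark `flatMap` followed by a per-key step). The sketch below shows only that keying step in plain Python. The helpers `expand_param_grid` and `duplicate_for_grid` are hypothetical, and this is an assumption about the approach rather than a description of the library's implementation.

```python
from itertools import product


def expand_param_grid(param_grid):
    """Return every parameter combination as a list of dicts."""
    names = sorted(param_grid)
    return [
        dict(zip(names, values))
        for values in product(*(param_grid[n] for n in names))
    ]


def duplicate_for_grid(records, n_candidates):
    """Yield (candidate_index, record) pairs, one copy per candidate."""
    for record in records:
        for candidate in range(n_candidates):
            yield candidate, record


# Hypothetical grid and records, for illustration only.
grid = expand_param_grid({"alpha": [0.0001, 0.001], "loss": ["hinge", "log"]})
keyed = list(duplicate_for_grid([{"age": 30}, {"age": 25}], len(grid)))
```

With two values for each of two parameters there are four candidates, so each of the two records appears four times in `keyed`, once per candidate index.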

Example

----

## For streaming dictionaries/JSON

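```python
from on_the_fly import FlyVectorizer, FlyClassifier

vec = FlyVectorizer()
clf = FlyClassifier()

features = ['name', 'age', 'stuff..']
label = ['gender']

# iterator_of_data_in_dict is a placeholder for your own iterable of batches of dicts
for batch_data_in_dict in iterator_of_data_in_dict:
    # Update the vectorizer with this batch and vectorize it in one step
    batch_data = vec.partial_fit_transform(batch_data_in_dict)

    # Column indices for the feature fields and the label field
    feature_dimension = vec.subset_features(features)
    label_dimension = vec.subset_features(label)

    batch_X = batch_data[:, feature_dimension]
    batch_y = batch_data[:, label_dimension]

    # Incrementally train the classifier on this batch
    clf.partial_fit(batch_X, batch_y)
```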

## For spark rdd of dictionaries

```python
from on_the_fly import FlyClassifier, RddVectorizer, RddClassifier

vec = RddVectorizer(features=['name', 'age', 'stuff'], label='gender')
base_clf = FlyClassifier(loss='log')
clf = RddClassifier(base_clf)

# training_rdd_of_dicts and testing_rdd_of_dicts are placeholders for your own RDDs of dicts
training_design_matrix = vec.fit_transform(training_rdd_of_dicts)
clf.fit(training_design_matrix)

testing_design_matrix = vec.transform(testing_rdd_of_dicts)
clf.score(testing_design_matrix)
```
