https://github.com/yupbank/on_the_fly
On-the-fly machine learning toolkit
- Host: GitHub
- URL: https://github.com/yupbank/on_the_fly
- Owner: yupbank
- Created: 2016-11-01T14:12:01.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2023-07-06T21:06:09.000Z (almost 2 years ago)
- Last Synced: 2025-03-17T12:50:45.053Z (3 months ago)
- Language: Python
- Homepage:
- Size: 33.2 KB
- Stars: 2
- Watchers: 0
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
README
# on_the_fly: an on-the-fly sklearn dictionary vectorizer, SGD classifier, GridSearchCV and RandomSearchCV
[Build status](https://travis-ci.org/yupbank/on_the_fly) [PyPI](https://pypi.python.org/pypi/on_the_fly)

It is always painful to generate dictionaries for SGD algorithms. Why not use them on the fly?

Parameters are added to GridSearch and RandomSearch to make the RDD distributed again (the raw data is split/duplicated and the results are computed on the fly).

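
To make "on the fly" concrete, here is a minimal pure-Python sketch of a vectorizer whose feature dictionary grows as new batches arrive. The class `GrowingDictVectorizer` is hypothetical and is not the `on_the_fly` implementation. It only borrows the method name `partial_fit_transform` for illustration, and it assumes string values are one-hot encoded as `key=value` columns.

```python
class GrowingDictVectorizer:
    """Hypothetical stand-in for illustration, NOT the on_the_fly code."""

    def __init__(self):
        self.vocabulary_ = {}  # column name -> column index

    def _column(self, key, value):
        # Assumption: strings become one-hot columns named "key=value",
        # numeric values keep the plain key as the column name.
        if isinstance(value, str):
            return f"{key}={value}", 1.0
        return key, float(value)

    def partial_fit_transform(self, batch):
        rows = []
        for record in batch:
            row = {}
            for key, value in record.items():
                name, number = self._column(key, value)
                # Unseen columns get the next free index; known ones are reused.
                index = self.vocabulary_.setdefault(name, len(self.vocabulary_))
                row[index] = number
            rows.append(row)
        # Emit dense rows as wide as the vocabulary seen so far.
        width = len(self.vocabulary_)
        return [[row.get(i, 0.0) for i in range(width)] for row in rows]


vec = GrowingDictVectorizer()
first = vec.partial_fit_transform([{"name": "ann", "age": 30}])
second = vec.partial_fit_transform([{"name": "bob", "age": 25}])
# The second batch is one column wider because "name=bob" was new.
```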
Example
----
## For streaming dictionaries/JSONs

```python
from on_the_fly import FlyVectorizer, FlyClassifier

vec = FlyVectorizer()
clf = FlyClassifier()
features = ['name', 'age', 'stuff..']
label = ['gender']

# iterator_of_data_in_dict is a placeholder that yields one batch of dicts at a time
for batch_data_in_dict in iterator_of_data_in_dict:
    # Vectorize the current batch
    batch_data = vec.partial_fit_transform(batch_data_in_dict)
    # Look up the columns that hold the features and the label
    feature_dimension = vec.subset_features(features)
    label_dimension = vec.subset_features(label)
    batch_X = batch_data[:, feature_dimension]
    batch_y = batch_data[:, label_dimension]
    # Update the classifier incrementally with this batch
    clf.partial_fit(batch_X, batch_y)
```

## For Spark RDD of dictionaries
```python
from on_the_fly import FlyClassifier, RddVectorizer, RddClassifier

vec = RddVectorizer(features=['name', 'age', 'stuff'], label='gender')
base_clf = FlyClassifier(loss='log')
clf = RddClassifier(base_clf)

# Fit on the training RDD, then score on the testing RDD
training_design_matrix = vec.fit_transform(training_rdd_of_dicts)
clf.fit(training_design_matrix)
testing_design_matrix = vec.transform(testing_rdd_of_dicts)
clf.score(testing_design_matrix)
```
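
The streaming example leaves `iterator_of_data_in_dict` undefined. One possible way to build it, assuming the input is newline-delimited JSON and that each batch should be a list of dicts, is sketched below. The helper `iter_json_batches` and the sample records are hypothetical and are not part of `on_the_fly`. Whether `FlyVectorizer.partial_fit_transform` accepts a list of dicts in exactly this shape is also an assumption.

```python
import io
import json
from itertools import islice


def iter_json_batches(lines, batch_size):
    """Yield lists of dicts parsed from an iterable of JSON lines."""
    # Skip blank lines and parse lazily, so the whole input is never held in memory.
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            return
        yield batch


# In-memory stand-in for an open file of newline-delimited JSON.
stream = io.StringIO(
    '{"name": "ann", "age": 30, "gender": "f"}\n'
    '{"name": "bob", "age": 25, "gender": "m"}\n'
    '{"name": "cat", "age": 41, "gender": "f"}\n'
)
batches = list(iter_json_batches(stream, batch_size=2))
```

With a real file, passing the open file object as `lines` would give the same behavior, since file objects iterate line by line.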