Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/bartekpog/modelcreator
Simple python package for creating predictive models
https://github.com/bartekpog/modelcreator
automl estimator machine-learning model-selection package predictive-modeling python sklearn
Last synced: 3 months ago
JSON representation
Simple python package for creating predictive models
- Host: GitHub
- URL: https://github.com/bartekpog/modelcreator
- Owner: BartekPog
- License: mit
- Created: 2020-04-03T07:37:39.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2024-06-17T23:31:48.000Z (7 months ago)
- Last Synced: 2024-10-10T08:20:56.938Z (3 months ago)
- Topics: automl, estimator, machine-learning, model-selection, package, predictive-modeling, python, sklearn
- Language: Python
- Homepage: https://pypi.org/project/modelcreator/
- Size: 357 KB
- Stars: 6
- Watchers: 1
- Forks: 0
- Open Issues: 7
-
Metadata Files:
- Readme: readme.md
- License: LICENSE
Awesome Lists containing this project
README
# modelcreator - AutoML package
This package contains a **Machine** which is meant to do the **learning** for you. It can automaticly create a fitting predictive model for given data.
###### Sample output
```
Testing: Gradient Boosting Classifier
[########################################] | 100% Completed | 3.9s
Score: 0.9667Testing: Ada Boost Classifier
[########################################] | 100% Completed | 1.3s
Score: 0.9600Testing: Random Forest Classifier
[########################################] | 100% Completed | 5.0s
Score: 0.9600Testing: Balanced Random Forest Classifier
[########################################] | 100% Completed | 3.5s
Score: 0.9600Testing: SVC
[########################################] | 100% Completed | 1.2s
Score: 0.9667Chosen model: Gradient Boosting Classifier 0.9667
Params:
min_samples_split: 2
n_estimators: 100Results saved to output.csv
```# Table of Contents
1. [Installation](#installation)
1. [Usage](#usage)
- [CSV input](#csv-path-input)
- [Pandas input](#pandas-input)
1. [Saving model](#saving-the-model)
1. [Parameters](#parameters)
- [Machine](#machine)
- [learn](#learn)
- [learnFromDf](#learnfromdf)
- [predict](#predict)
- [predictFromDf](#predictfromdf)
- [saveMachine](#savemachine)
1. [Development](#development)## Installation
To use the package run:
```bash
pip install modelcreator
```## Usage
The input may be either a path to a **csv** file or a **pandas DataFrame** object.
#### CSV path input
The library assumes that the last column of the training dataset contains the expected results. The dataset (both training and predictive) must be provided as a **csv** file.
If the results column contains text the _Machine_ will do its best to learn to _classify_ the data correctly. In case of a number inside, _regression_ will be performed.
If the file contains headers you shall add `header_in_csv=True` parameter to the method.
###### Example 1 _Iris_
```python
from modelcreator import Machine# Create automl machine instance
machine = Machine()# Train machine learning model
machine.learn('example-data/iris.csv')# Predict the outcomes
machine.predict('example-data/iris-pred.csv', 'output.csv')
```This example is also available in the `example.py` file. Consider trying it on your own.
#### Pandas input
But what to do if a result column is not the last in the given csv? It may be inconvenient to rewrite the whole csv just to swap the columns. Because of this problem Machine has `learnFromDf` and `predictFromDf` methods. The _Df_ in method names stands for _DataFrame_ from _pandas_ module. This way you can handle reading the file by yourself.
###### Example 2 _Titanic_
```python
from modelcreator import Machine
import pandas as pd# Create DataFrame object from file
train = pd.read_csv("train.csv")# Get features columns from DataFrame
X_train = train.drop(['Survived'], axis=1)# And labels (results) column
y_train = train["Survived"].astype(str)# Create the instance of Machine
machine = Machine()# Train machine learning model
machine.learnFromDf(X_train, y_train, computation_level='advanced')# Show parameters of the model
machine.showParams()# Load test set from file
X_test = pd.read_csv("test.csv")# Predict the labels
results = machine.predictFromDf(X_test)# Save results to a new file
results.to_csv("results.csv")
```Simple? That's right! Just note that we used `astype(str)` in order to treat data as **classes**, not **numbers** because the [Titanic dataset](https://www.kaggle.com/c/titanic) used in the example above has values _0_ and _1_ in `"Survived"` column to indicate whether a person made it through the disaster.
#### Saving the model
If you want your model to avoid re-learning on the whole dataset just to make a simple prediction you can save the state of _Machine_ to a file.
```python
# Save Machine with a trained model to "machine.pkl"
machine.saveMachine('machine.pkl')# Create a new machine based on a schema file
machine2 = Machine('machine.pkl')
```#### Parameters
The **Machine** can be customized according to the use case. Check the parameters table:
###### Machine
| Param | Type | Default | Description |
| ------ | --------------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| schema | _None_ or _str_ | `None` | A Machine may be created based on a saved, pre-trained machine instance. You may specify the path to the saved instance in this param to recreate it. |###### learn
| Param | Type | Default | Description |
| ----------------- | --------------------------- | ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| dataset_file | _str_ | | Path to a csv file which contains training dataset. |
| header_in_csv | _bool_ | `False` | Whether the csv file contains _headers_ in the first row. |
| metrics | _None_, _str_ or _Callable_ | `'accuracy'` or `'neg_root_mean_squared_error'` | Metrics used for scoring estimators. Many popular scoring functions (such as _f1_, _roc_auc_, _neg_mean_gamma_deviance_). See [here](https://scikit-learn.org/stable/modules/model_evaluation.html) how to make custom scoring functions. |
| verbose | _bool_ | `True` | Whether to print learning logs. |
| cv | _int_ | `3` | a Number of cross-validation subsets. Higher values may increase computation time. |
| computation_level | _str_ | `'medium'` | Can be either `'basic'`, `'medium'` or `'advanced'`. With higher computation level more models and parameters are being tested. |###### learnFromDf
| Param | Type | Default | Description |
| ----------------- | --------------------------- | ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| X | _pandas.DataFrame_ | | DataFrame containing the feature columns. |
| y | _pandas.Series_ | | Label columns of the training data. |
| metrics | _None_, _str_ or _Callable_ | `'accuracy'` or `'neg_root_mean_squared_error'` | Metrics used for scoring estimators. Many popular scoring functions (such as _f1_, _roc_auc_, _neg_mean_gamma_deviance_). See [here](https://scikit-learn.org/stable/modules/model_evaluation.html) how to make custom scoring functions. |
| verbose | _bool_ | `True` | Whether to print learning logs. |
| cv | _int_ | `3` | A number of cross-validation subsets. Higher values may increase computation time. |
| computation_level | _str_ | `'medium'` | Can be either `'basic'`, `'medium'` or `'advanced'`. With higher computation level more models and parameters are being tested. |###### predict
| Param | Type | Default | Description |
| ------------- | ------ | -------------- | ----------------------------------------------------------------------------- |
| features_file | _str_ | | Path to the features **csv** of the data to generate predictions on. |
| header_in_csv | _bool_ | `False` | Whether the csv file contains _headers_ in the first row. |
| output_file | _str_ | `'output.csv'` | Path to the output **csv** file. In this file, the predictions will be saved. |
| verbose | _str_ | `True` | Whether to print logs. |###### predictFromDf
| Param | Type | Default | Description |
| ------------- | ------------------ | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| X_predictions | _pandas.DataFrame_ | | Features columns to generate predictions on. |
| output_file | _str_ | `None` | Predict method returns _pandas.Series_ of the results. Additionally, it can also save the results to a **csv** file. It can be specified here. If the path is other than `None` it will be interpreted as a path to the output file. |
| verbose | _str_ | `True` | Whether to print logs. |###### saveMachine
| Param | Type | Default | Description |
| ---------------- | ----- | --------------- | -------------------------------------------------- |
| output_file_name | _str_ | `'machine.pkl'` | Path to where shall the Machine instance be saved. |### Development
Have a feature idea or just want to help? Take a look at the [issues tab](https://github.com/BartekPog/modelcreator/issues)!