https://github.com/dataideaorg/dataidea-analysis-package

This package helps simplify data analysis work for DATAIDEA students

## What is the `dataidea` package?

This is a package we are currently developing to help new and experienced data analysts (especially DATAIDEA students) work around some of the repetitive and sometimes frustrating tasks that a data analyst does day to day.

This library currently extends and depends mainly on numpy, pandas and sklearn. These, among a few others, will be installed when you install dataidea.

## Installing `dataidea`

- To install dataidea, you must have Python installed on your machine
- It's advised that you install it in a virtual environment
- You can install `dataidea` using the command below

```
pip install dataidea
```

## Learning `dataidea`

The best way to get started with dataidea (and data analysis) is to complete the free course.

To see what’s possible with dataidea, take a look at the Quick Start.

Read through the Tutorials to learn how to load datasets and train your own models on your own datasets. Use the navigation to look through the dataidea documentation, where every class, function, and method is documented.

## Quickstart

```python
from dataidea.tabular import *
```

`dataidea`'s applications all use the same basic steps and code:

- Create appropriate DataLoaders
- Create a Trainer
- Call a fit method
- Make predictions or view results.

In this quick start, we’ll show these steps for classification and regression. As you’ll see, the code in each case is extremely similar, despite the very different models and data being used.

## Loading datasets

The `dataidea` library makes it easy to load the datasets most commonly used in the course, and also lets you load your own datasets with one or two tweaks.

In the line of code below, we load the simple music dataset, which is built into dataidea for learning purposes.

```python
music_data = loadDataset(name='music')
```

We can inspect some of its values using the usual `pandas` DataFrame methods such as `sample()`, `head()` and `tail()`.

```python
music_data.sample(n=5)
```




|    | age | gender | genre     |
|----|-----|--------|-----------|
| 3  | 26  | 1      | Jazz      |
| 2  | 25  | 1      | HipHop    |
| 14 | 30  | 0      | Acoustic  |
| 1  | 23  | 1      | HipHop    |
| 19 | 35  | 1      | Classical |

We can then create a `TabularDataLoader`, which allows for easy data manipulation such as feature scaling, imputation and splitting for training. We use it to quickly prepare the data and load it into a machine learning model.

```python
music_data_loader = TabularDataLoader(
    data=music_data,
    numeric_features=[' age '],
    categorical_features=['gender'],
    outcome='genre',
)
```

We can (optionally) transform the data ourselves. However, this step is done for you automatically when you train a machine learning model.

```python
transformed_data, transformer = music_data_loader.transform()
```

```python
transformed_data[0].head()
```




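|   | num__ age | cat__gender_0 | cat__gender_1 |
|---|-----------|---------------|---------------|
| 0 | 1.637262  | 0.0           | 1.0           |
| 1 | 0.204658  | 0.0           | 1.0           |
| 2 | -0.818631 | 1.0           | 0.0           |
| 3 | -0.613973 | 0.0           | 1.0           |
| 4 | 1.227947  | 0.0           | 1.0           |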
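
To make the transformed columns less mysterious, here is a minimal plain-Python sketch of the two operations that the output suggests: standardizing a numeric column and one-hot encoding a categorical one. It is an illustration under assumptions, not dataidea's implementation. The helper names `standardize` and `one_hot` are made up, and the input is just the five sample rows shown earlier, so the numbers differ from the table, which was computed on the real training data.

```python
from statistics import fmean, pstdev

# Five (age, gender) rows copied from the sample output shown earlier.
rows = [
    {"age": 26, "gender": 1},
    {"age": 25, "gender": 1},
    {"age": 30, "gender": 0},
    {"age": 23, "gender": 1},
    {"age": 35, "gender": 1},
]


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return the sorted categories and one list of 0/1 flags per row."""
    categories = sorted(set(values))
    flags = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, flags


z_scores = standardize([r["age"] for r in rows])
categories, flags = one_hot([r["gender"] for r in rows])

# Column names mimic the num__/cat__ prefixes seen in the output above.
transformed = []
for z, row_flags in zip(z_scores, flags):
    record = {"num__age": z}
    for category, flag in zip(categories, row_flags):
        record[f"cat__gender_{category}"] = flag
    transformed.append(record)

for record in transformed:
    print(record)
```

After standardizing, the age column has mean 0 and standard deviation 1, and each gender value becomes its own 0/1 column.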

Now we can fit a machine learning model. Behind the scenes, the `Trainer` works with the `TabularDataLoader` to process your data quite thoroughly, standardizing and imputing it, and the resulting model is actually a pipeline of these steps.

```python
trainer = Trainer(data_loader=music_data_loader, model=RandomForestClassifier())
```
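
The point of a pipeline is that preprocessing values learned during training (such as the mean used for imputation) are reused unchanged at prediction time. The sketch below shows that idea in plain Python. The classes `MeanImputer`, `Standardizer` and `ToyPipeline` are hypothetical stand-ins written for this example, not dataidea or sklearn classes, and a real pipeline would end with the model itself.

```python
from statistics import mean, pstdev


class MeanImputer:
    """Replace missing values (None) with the mean seen during fit."""

    def fit(self, values):
        self.fill_ = mean(v for v in values if v is not None)
        return self

    def transform(self, values):
        return [self.fill_ if v is None else v for v in values]


class Standardizer:
    """Rescale values using the mean and std learned during fit."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # avoid dividing by zero
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


class ToyPipeline:
    """Run each step in order, feeding its output to the next step."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, values):
        for step in self.steps:
            values = step.fit(values).transform(values)
        return self

    def transform(self, values):
        for step in self.steps:
            values = step.transform(values)
        return values


# Illustrative training ages with one missing value.
train_ages = [20, None, 30, 40]
pipeline = ToyPipeline([MeanImputer(), Standardizer()]).fit(train_ages)

# New data is imputed and scaled with the statistics learned above.
print(pipeline.transform([30, None]))  # [0.0, 0.0]
```

Both new values come out as 0.0 because the missing age is filled with the training mean (30), which standardizes to zero.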

To train our model, we just call the `train()` method on the trainer object

```python
model = trainer.train()
```

We can quickly obtain the accuracy. It is computed on a test set that the `TabularDataLoader` automatically sets aside from your data.

```python
accuracy = trainer.evaluate()
```
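
This README does not show how the test set is picked. A common approach, sketched here in plain Python as an assumption rather than dataidea's actual code, is to shuffle the rows with a fixed seed and hold out a fraction of them. Accuracy is then the share of held-out rows predicted correctly. The helper names `holdout_split` and `accuracy`, the 20% fraction and the example labels are all illustrative.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split off the last part as a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Twenty placeholder row ids stand in for a small dataset.
rows = list(range(20))
train, test = holdout_split(rows, test_fraction=0.2, seed=42)
print(len(train), len(test))  # 16 4

# Made-up labels: three of the four predictions are correct.
y_true = ["Classical", "Classical", "HipHop", "HipHop"]
y_pred = ["Classical", "Acoustic", "HipHop", "HipHop"]
print(accuracy(y_true, y_pred))  # 0.75
```

Fixing the seed makes the split reproducible, so repeated runs evaluate on the same held-out rows.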

Sometimes accuracy isn't the best measure of model performance. For classification problems, we can also use a classification report.

```python
classification_report = trainer.report()
```

```python
classification_report
```




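|              | precision | recall   | f1-score | support |
|--------------|-----------|----------|----------|---------|
| Acoustic     | 0.000000  | 1.000000 | 0.000000 | 0.00    |
| Classical    | 1.000000  | 0.500000 | 0.666667 | 2.00    |
| HipHop       | 1.000000  | 1.000000 | 1.000000 | 2.00    |
| accuracy     | 0.750000  | 0.750000 | 0.750000 | 0.75    |
| macro avg    | 0.666667  | 0.833333 | 0.555556 | 4.00    |
| weighted avg | 1.000000  | 0.750000 | 0.833333 | 4.00    |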
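
The numbers in a report like this can be reproduced by hand. The sketch below makes two assumptions: the four test labels and predictions are hypothetical ones chosen to be consistent with the report (the actual test rows are not shown), and an undefined ratio is reported as 1.0, as the recall for Acoustic (support 0) suggests. The function names are illustrative.

```python
def per_class_metrics(y_true, y_pred, zero_division=1.0):
    """Compute precision, recall, F1 and support for every label."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # A ratio with an empty denominator is undefined; use zero_division.
        precision = tp / (tp + fp) if tp + fp else zero_division
        recall = tp / (tp + fn) if tp + fn else zero_division
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return report


def averages(report, metric):
    """Return the macro (unweighted) and support-weighted mean of a metric."""
    total_support = sum(row["support"] for row in report.values())
    macro = sum(row[metric] for row in report.values()) / len(report)
    weighted = sum(row[metric] * row["support"]
                   for row in report.values()) / total_support
    return macro, weighted


# Hypothetical labels chosen to be consistent with the report above.
y_true = ["Classical", "Classical", "HipHop", "HipHop"]
y_pred = ["Classical", "Acoustic", "HipHop", "HipHop"]

report = per_class_metrics(y_true, y_pred)
for label, row in report.items():
    print(label, row)
print("macro / weighted f1:", averages(report, "f1"))
```

One wrong prediction (a Classical row labelled Acoustic) is enough to explain the 0.5 recall for Classical and the 0.0 precision for Acoustic.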

It's easy to save a model for future use: call the `save()` method on the trainer.

```python
trainer.save(path='music_model.di')
```

Later, we can load the saved model to make predictions.

```python
loaded_model = loadModel(filename='music_model.di')
```
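
This README does not say how `.di` files are stored. As a general illustration of saving and restoring a fitted model, here is a sketch that uses Python's standard `pickle` module on a toy model: a dictionary holding the mean age per genre, computed from the five sample rows shown earlier. The toy model, the `predict` helper and the `.pkl` file name are assumptions made for this example, not dataidea's format.

```python
import os
import pickle
import tempfile

# Toy "model": mean age per genre from the five sample rows shown earlier.
model = {
    "centroids": {
        "Jazz": 26.0,
        "HipHop": 24.0,
        "Acoustic": 30.0,
        "Classical": 35.0,
    }
}


def predict(model, ages):
    """Predict the genre whose mean age is closest to each given age."""
    centroids = model["centroids"]
    return [min(centroids, key=lambda genre: abs(centroids[genre] - age))
            for age in ages]


# Save the model to a file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later (possibly in another session), load it back and predict.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(predict(restored, [20, 35]))  # ['HipHop', 'Classical']
```

Only unpickle files you trust, because loading a pickle can execute arbitrary code.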

Now let's make predictions on some new data.

```python
import pandas as pd  # harmless if the star import above already provides `pd`

data_to_predict = pd.DataFrame(
    data={
        ' age ': [20, 35],
        'gender': [1, 0],
    }
)

data_to_predict
```




|   | age | gender |
|---|-----|--------|
| 0 | 20  | 1      |
| 1 | 35  | 0      |

```python
predicted = loaded_model.predict(X=data_to_predict)
```

```python
data_to_predict['predicted'] = predicted
```

```python
data_to_predict
```




|   | age | gender | predicted |
|---|-----|--------|-----------|
| 0 | 20  | 1      | HipHop    |
| 1 | 35  | 0      | Classical |
