Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/majobasgall/smote-mr

SMOTE-MR: A distributed Synthetic Minority Oversampling Technique (SMOTE) for Big Data which applies a MapReduce based-approach. SMOTE-MR is categorized as an `approximated/ non exact` solution. Also, there is an `exact` solution called SMOTE-BD written by the author (See: https://github.com/majobasgall/smote-bd)
https://github.com/majobasgall/smote-mr

big-data imbalanced-data machile-learning scala smote spark

Last synced: about 1 month ago
JSON representation

SMOTE-MR: A distributed Synthetic Minority Oversampling Technique (SMOTE) for Big Data which applies a MapReduce based-approach. SMOTE-MR is categorized as an `approximated/ non exact` solution. Also, there is an `exact` solution called SMOTE-BD written by the author (See: https://github.com/majobasgall/smote-bd)

Awesome Lists containing this project

README

        

# SMOTE-MR
SMOTE-MR: A distributed Synthetic Minority Oversampling Technique (SMOTE) [1] for Big Data which applies a MapReduce based-approach. SMOTE-MR is categorized as an `approximated/ non exact` solution. Also, there is an `exact` solution called SMOTE-BD written by the author (See: https://github.com/majobasgall/smote-bd)

## How to run it?

A generic example to run it could be:

```spark-submit --master "URL" --executor-memory "XG" "path-to-jar".jar --class "path-to-main" --datasetName="aName" --headerFile="path-to-header" --inputFile="path-to-input" --delimiter=", " --outputPah="path-to-output" --seed="aSeed" --K="number-of-neighbours" --numPartitions="number-of-parts" --nReducers="number-of-reducers" --numIterations="number-of-iterations" --minClassName="min-class-name" -overPercentage=100 ```

- Parameters of spark: ```--master "URL" | --executor-memory "XG" ```. They can be useful for launch with diferent settings and datasets.
- ```--class path.to.the.main aJarFile.jar``` Determine the jar file to be run.
- ```datasetName``` The name of the current dataset.
- ```headerFile``` Full path to header file.
- ```inputFile``` Full path to input file.
- ```delimiter``` Delimiter of each attribute value.
- ```outputPah``` Full path to output directory.
- ```seed``` A seed to generate random numbers.
- ```K``` Number of nearest neighbours.
- ```numPartitions``` Number of partitions to split data.
- ```nReducers``` Number of reducers (required by the K-NN stage).
- ```numIterations``` Number of iterations (required by the K-NN stage).
- ```minClassName``` Name of the minority class (according to the header file).
- ```overPercentage``` Percentage of balancing between classes.

## References
[1] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Int. Res., 16(1), 321–357.