https://github.com/majobasgall/smote-mr




Host: GitHub
URL: https://github.com/majobasgall/smote-mr
Owner: majobasgall
License: apache-2.0
Created: 2018-09-11T17:31:44.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2019-05-03T10:53:38.000Z (almost 6 years ago)
Last Synced: 2025-01-08T18:17:55.179Z (4 months ago)
Topics: big-data, imbalanced-data, machile-learning, scala, smote, spark
Language: Scala
Homepage:
Size: 18.6 KB
Stars: 3
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# SMOTE-MR

SMOTE-MR is a distributed Synthetic Minority Oversampling Technique (SMOTE) [1] for Big Data that follows a MapReduce-based approach. It is categorized as an `approximate / non-exact` solution. The author also provides an `exact` solution called SMOTE-BD (see https://github.com/majobasgall/smote-bd).

## How to run it?

A generic example to run it could be:

```
spark-submit --master "URL" --executor-memory "XG" \
  --class "path-to-main" "path-to-jar".jar \
  --datasetName="aName" --headerFile="path-to-header" --inputFile="path-to-input" \
  --delimiter=", " --outputPah="path-to-output" --seed="aSeed" \
  --K="number-of-neighbours" --numPartitions="number-of-parts" \
  --nReducers="number-of-reducers" --numIterations="number-of-iterations" \
  --minClassName="min-class-name" --overPercentage=100
```

- Spark parameters: ```--master "URL" | --executor-memory "XG" ```. Adjust them to launch with different settings and datasets.

- ```--class path.to.the.main aJarFile.jar``` The main class and the jar file to run.

- ```datasetName``` The name of the current dataset.

- ```headerFile``` Full path to header file.

- ```inputFile``` Full path to input file.

- ```delimiter``` Delimiter that separates the attribute values.

- ```outputPah``` Full path to output directory.

- ```seed``` A seed to generate random numbers.

- ```K``` Number of nearest neighbours.

- ```numPartitions``` Number of partitions to split data.

- ```nReducers``` Number of reducers (required by the K-NN stage).

- ```numIterations``` Number of iterations (required by the K-NN stage).

- ```minClassName``` Name of the minority class (according to the header file).

- ```overPercentage``` Oversampling percentage used to balance the classes.
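
The parameters above can be collected in a small launcher script. The following is only a sketch under stated assumptions, not part of the project. The master URL `local[*]`, the memory size, and the values for `seed`, `K`, `numPartitions`, `nReducers` and `numIterations` are hypothetical; the jar, main class, paths and class name are the placeholders from the generic example. Every application flag is written with two dashes, which is an assumption about how the program parses its arguments. Spark options are placed before the jar because `spark-submit` passes everything after the jar to the application. The script prints the command, one argument per line, instead of running it.

```shell
#!/bin/sh
# Dry-run launcher sketch for SMOTE-MR.
# All values are hypothetical placeholders; replace them with your own.
set -eu

MASTER='local[*]'            # assumption: Spark local mode
EXECUTOR_MEMORY='4G'         # assumption
MAIN_CLASS='path-to-main'    # placeholder from the generic example
JAR='path-to-jar.jar'        # placeholder from the generic example

# Application arguments, one per parameter listed above.
set -- \
  --datasetName=aName \
  --headerFile=path-to-header \
  --inputFile=path-to-input \
  "--delimiter=, " \
  --outputPah=path-to-output \
  --seed=42 \
  --K=5 \
  --numPartitions=8 \
  --nReducers=2 \
  --numIterations=1 \
  --minClassName=min-class-name \
  --overPercentage=100

# Spark options first, then the jar, then the application arguments.
set -- spark-submit --master "$MASTER" --executor-memory "$EXECUTOR_MEMORY" \
  --class "$MAIN_CLASS" "$JAR" "$@"

# Dry run: show each argument on its own line.
printf '%s\n' "$@"
# To launch for real, replace the printf line with: "$@"
```

Printing one argument per line makes it easy to check that values containing spaces, such as the delimiter, stay in a single argument.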

## References

[1] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Int. Res., 16(1), 321–357.
