Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/nikoshet/spark-mlp

Multilayer Perceptron Implementation Using Spark
https://github.com/nikoshet/spark-mlp

hdfs machine-learning mapreduce multilayer-perceptron pyspark python spark

Last synced: about 1 month ago
JSON representation

Multilayer Perceptron Implementation Using Spark

Awesome Lists containing this project

README

        

# Multilayer Perceptron Implementation Using Spark And Python

## Dataset

The dataset is the [Customer Complaint Database](https://catalog.data.gov/dataset/consumer-complaint-database). It includes complains from customers on some financial products and services from 2011 untill today. The data file is a comma-delimited .csv file.

### File Format

```
0 <- date %Y-%m-%d
1 <- category
2 <- comment
```

## Main Goal

The goal of this project is to use [Spark](https://spark.apache.org/) and [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) to implement a [Perceptron Classifier](https://en.wikipedia.org/wiki/Multilayer_perceptron) using the [TFIDF metric](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). Both RDDs and Spark Dataframe API were utilized.

## Requirements
- nltk
- pyspark

## Usage

Assuming that Spark and HDFS are properly installed and working on our system:

- Upload data file in HDFS
```
hadoop fs -put ./customer_complaints.csv hdfs://master:9000/customer_complaints.csv
```

- Install necessary requirements
```
pip install -r requirements.txt
```

- Submit job in a Spark environment
```
spark-submit mlp.py
```

## Algorithm

- Clean the data
- Keep k most common words in all comments
- Remove less often categories
- Compute TFIDF metric for each word in the comments
- Use SparseVector where key is word_index and value is TFIDF metric
- Transform string labels (categories) in integers
- Split dataset in train and test set (stratified split)
- Train a Multilayer Perceptron model
- Compute accuracy of model on test set