Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Multilayer Perceptron Implementation Using Spark
https://github.com/nikoshet/spark-mlp
hdfs machine-learning mapreduce multilayer-perceptron pyspark python spark
- Host: GitHub
- URL: https://github.com/nikoshet/spark-mlp
- Owner: nikoshet
- License: MIT
- Created: 2020-09-12T09:06:53.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2020-11-09T10:07:35.000Z (over 4 years ago)
- Last Synced: 2024-11-09T09:44:29.618Z (3 months ago)
- Topics: hdfs, machine-learning, mapreduce, multilayer-perceptron, pyspark, python, spark
- Language: Python
- Homepage:
- Size: 35.2 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Multilayer Perceptron Implementation Using Spark And Python
## Dataset
The dataset is the [Customer Complaint Database](https://catalog.data.gov/dataset/consumer-complaint-database). It contains complaints from consumers about financial products and services, from 2011 until today. The data file is a comma-delimited .csv file.
### File Format
```
0 <- date %Y-%m-%d
1 <- category
2 <- comment
```
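For illustration, a record in this layout could be parsed as below. This is a sketch, not code from the repository, and it assumes the three-column layout shown above (the raw Consumer Complaint Database export has more columns).
```
import csv
from io import StringIO

def parse_line(line):
    """Parse one comma-delimited record into (date, category, comment).

    Uses the csv module so commas inside quoted comment text
    do not split the record. Assumes the 3-column layout above.
    """
    fields = next(csv.reader(StringIO(line)))
    return fields[0], fields[1], fields[2]

# Hypothetical record:
# parse_line('2020-01-15,Mortgage,"Late fee charged, no notice given"')
# -> ('2020-01-15', 'Mortgage', 'Late fee charged, no notice given')
```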
## Main Goal
The goal of this project is to use [Spark](https://spark.apache.org/) and [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) to implement a [Multilayer Perceptron classifier](https://en.wikipedia.org/wiki/Multilayer_perceptron) on top of the [TF-IDF metric](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). Both RDDs and the Spark DataFrame API were used.
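As a rough sketch of the RDD side, TF-IDF over tokenized comments can be computed like this. The exact code and IDF formula in the repository may differ, and the sample documents below are hypothetical.
```
from math import log
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical tokenized comments; one token list per document.
docs = sc.parallelize([
    ["late", "fee", "charged"],
    ["fee", "dispute", "not", "resolved"],
])

n_docs = docs.count()

# Document frequency: in how many documents each word appears.
df = (docs.flatMap(lambda words: set(words))
          .map(lambda w: (w, 1))
          .reduceByKey(lambda a, b: a + b))

# Inverse document frequency, collected to the driver as a plain dict.
idf = df.mapValues(lambda c: log(n_docs / c)).collectAsMap()

def tfidf(words):
    # Term frequency within one document, scaled by IDF.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return {w: (c / len(words)) * idf[w] for w, c in counts.items()}

print(docs.map(tfidf).collect())
```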
## Requirements
- nltk
- pyspark

## Usage
Assuming that Spark and HDFS are properly installed and running on your system:
- Upload the data file to HDFS
```
hadoop fs -put ./customer_complaints.csv hdfs://master:9000/customer_complaints.csv
```
- Install the necessary requirements
```
pip install -r requirements.txt
```
- Submit the job in a Spark environment
```
spark-submit mlp.py
```

## Algorithm
- Clean the data
- Keep the k most common words across all comments
- Remove infrequent categories
- Compute the TF-IDF metric for each word in the comments
- Use a SparseVector whose keys are word indices and whose values are TF-IDF scores
- Transform string labels (categories) into integers
- Split the dataset into train and test sets (stratified split)
- Train a Multilayer Perceptron model
- Compute the accuracy of the model on the test set (a sketch of these steps follows)
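Put together with the DataFrame API, the steps above could look roughly like the following. Column names, the vocabulary size k, the hidden-layer size, and the HDFS path are illustrative assumptions, not values taken from the repository.
```
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF, StringIndexer
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("spark-mlp-sketch").getOrCreate()

# Assumes a preprocessed 3-column file (date, category, comment).
data = (spark.read.csv("hdfs://master:9000/customer_complaints.csv")
             .toDF("date", "category", "comment")
             .dropna(subset=["category", "comment"]))
# (The repo also removes infrequent categories; omitted here for brevity.)

k = 1000  # keep the k most common words
num_classes = data.select("category").distinct().count()

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="comment", outputCol="words"),
    CountVectorizer(inputCol="words", outputCol="tf", vocabSize=k),
    IDF(inputCol="tf", outputCol="features"),  # SparseVector of TF-IDF values
    StringIndexer(inputCol="category", outputCol="label"),
])
prepared = pipeline.fit(data).transform(data)

# Stratified 80/20 split: sample 80% within each label.
fractions = {row["label"]: 0.8
             for row in prepared.select("label").distinct().collect()}
train = prepared.sampleBy("label", fractions, seed=42)
test = prepared.exceptAll(train)

# One hidden layer; the input size must equal the feature dimension (k)
# and the output size the number of classes.
mlp = MultilayerPerceptronClassifier(layers=[k, 64, num_classes],
                                     featuresCol="features", labelCol="label")
model = mlp.fit(train)

accuracy = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction",
    metricName="accuracy").evaluate(model.transform(test))
print(f"Test accuracy: {accuracy:.3f}")
```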