Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Multilayer Perceptron Implementation Using Spark
https://github.com/nikoshet/spark-mlp
hdfs machine-learning mapreduce multilayer-perceptron pyspark python spark
- Host: GitHub
- URL: https://github.com/nikoshet/spark-mlp
- Owner: nikoshet
- License: MIT
- Created: 2020-09-12T09:06:53.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2020-11-09T10:07:35.000Z (over 4 years ago)
- Last Synced: 2024-11-09T09:44:29.618Z (3 months ago)
- Topics: hdfs, machine-learning, mapreduce, multilayer-perceptron, pyspark, python, spark
- Language: Python
- Homepage:
- Size: 35.2 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Multilayer Perceptron Implementation Using Spark And Python
## Dataset
The dataset is the [Customer Complaint Database](https://catalog.data.gov/dataset/consumer-complaint-database). It contains complaints from consumers about financial products and services, from 2011 until today. The data file is a comma-delimited .csv file.
### File Format
```
0 <- date %Y-%m-%d
1 <- category
2 <- comment
```
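For illustration, a record in this layout could be parsed as below. This is a sketch, not code from the repository, and it assumes the three-column layout shown above (the raw Consumer Complaint Database export has more columns).
```
import csv
from io import StringIO

def parse_line(line):
    """Parse one comma-delimited record into (date, category, comment).

    Uses the csv module so commas inside quoted comment text
    do not split the record. Assumes the 3-column layout above.
    """
    fields = next(csv.reader(StringIO(line)))
    return fields[0], fields[1], fields[2]

# Hypothetical record:
# parse_line('2020-01-15,Mortgage,"Late fee charged, no notice given"')
# -> ('2020-01-15', 'Mortgage', 'Late fee charged, no notice given')
```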
## Main Goal
The goal of this project is to use [Spark](https://spark.apache.org/) and [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) to implement a [Multilayer Perceptron classifier](https://en.wikipedia.org/wiki/Multilayer_perceptron) on top of the [TF-IDF metric](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). Both RDDs and the Spark DataFrame API were used.
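As a rough sketch of the RDD side, TF-IDF over tokenized comments can be computed like this. The exact code and IDF formula in the repository may differ, and the sample documents below are hypothetical.
```
from math import log
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical tokenized comments; one token list per document.
docs = sc.parallelize([
    ["late", "fee", "charged"],
    ["fee", "dispute", "not", "resolved"],
])

n_docs = docs.count()

# Document frequency: in how many documents each word appears.
df = (docs.flatMap(lambda words: set(words))
          .map(lambda w: (w, 1))
          .reduceByKey(lambda a, b: a + b))

# Inverse document frequency, collected to the driver as a plain dict.
idf = df.mapValues(lambda c: log(n_docs / c)).collectAsMap()

def tfidf(words):
    # Term frequency within one document, scaled by IDF.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return {w: (c / len(words)) * idf[w] for w, c in counts.items()}

print(docs.map(tfidf).collect())
```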
## Requirements
- nltk
- pyspark

## Usage
Assuming that Spark and HDFS are properly installed and running on your system:
- Upload the data file to HDFS
```
hadoop fs -put ./customer_complaints.csv hdfs://master:9000/customer_complaints.csv
```
- Install the necessary requirements
```
pip install -r requirements.txt
```
- Submit the job in a Spark environment
```
spark-submit mlp.py
```

## Algorithm
- Clean the data
- Keep the k most common words across all comments
- Remove infrequent categories
- Compute the TF-IDF metric for each word in the comments
- Use a SparseVector whose keys are word indices and whose values are TF-IDF scores
- Transform string labels (categories) into integers
- Split the dataset into train and test sets (stratified split)
- Train a Multilayer Perceptron model
- Compute the accuracy of the model on the test set (a sketch of these steps follows)
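Put together with the DataFrame API, the steps above could look roughly like the following. Column names, the vocabulary size k, the hidden-layer size, and the HDFS path are illustrative assumptions, not values taken from the repository.
```
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF, StringIndexer
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("spark-mlp-sketch").getOrCreate()

# Assumes a preprocessed 3-column file (date, category, comment).
data = (spark.read.csv("hdfs://master:9000/customer_complaints.csv")
             .toDF("date", "category", "comment")
             .dropna(subset=["category", "comment"]))
# (The repo also removes infrequent categories; omitted here for brevity.)

k = 1000  # keep the k most common words
num_classes = data.select("category").distinct().count()

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="comment", outputCol="words"),
    CountVectorizer(inputCol="words", outputCol="tf", vocabSize=k),
    IDF(inputCol="tf", outputCol="features"),  # SparseVector of TF-IDF values
    StringIndexer(inputCol="category", outputCol="label"),
])
prepared = pipeline.fit(data).transform(data)

# Stratified 80/20 split: sample 80% within each label.
fractions = {row["label"]: 0.8
             for row in prepared.select("label").distinct().collect()}
train = prepared.sampleBy("label", fractions, seed=42)
test = prepared.exceptAll(train)

# One hidden layer; the input size must equal the feature dimension (k)
# and the output size the number of classes.
mlp = MultilayerPerceptronClassifier(layers=[k, 64, num_classes],
                                     featuresCol="features", labelCol="label")
model = mlp.fit(train)

accuracy = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction",
    metricName="accuracy").evaluate(model.transform(test))
print(f"Test accuracy: {accuracy:.3f}")
```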