https://github.com/ehsanmok/sparkling-titanic
Training models with Apache Spark, PySpark for Titanic Kaggle competition
- Host: GitHub
- URL: https://github.com/ehsanmok/sparkling-titanic
- Owner: ehsanmok
- Created: 2015-05-27T02:42:41.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2016-09-23T01:20:27.000Z (over 8 years ago)
- Last Synced: 2025-01-06T17:14:31.691Z (16 days ago)
- Topics: kaggle-titanic, pyspark, spark
- Language: Python
- Homepage:
- Size: 40 KB
- Stars: 14
- Watchers: 4
- Forks: 16
- Open Issues: 0
Metadata Files:
- Readme: README.md
Sparkling Titanic
=================

### Introduction
`titanic_logReg.py` trains a logistic regression model and makes predictions on the [Titanic dataset](http://kaggle.com/c/titanic/data) from the Kaggle competition, using Apache Spark ([spark-1.3.1-bin-hadoop2.4](http://spark.apache.org/downloads.html)) through its Python API on a local machine. I used `pyspark_csv.py` to load the data as a Spark DataFrame; for more instructions, see [pyspark-csv](http://github.com/seahboonsiew/pyspark-csv).
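
As a rough illustration of the kind of training step described above, here is a minimal sketch. It is not the code in `titanic_logReg.py`. It assumes the MLlib RDD-based API (`LogisticRegressionWithLBFGS` and `LabeledPoint`), an existing `SparkContext` named `sc`, and passenger records already parsed into dicts of strings with no missing values. The helper names `to_features` and `train_logistic_regression`, and the choice of the `Pclass`, `Sex`, and `Fare` columns, are hypothetical.

```python
def to_features(row):
    """Turn one passenger record (a dict of strings) into a numeric feature list.

    Hypothetical helper: uses only Pclass, Sex and Fare, and assumes none
    of them is missing.
    """
    sex = 1.0 if row["Sex"] == "male" else 0.0
    return [float(row["Pclass"]), sex, float(row["Fare"])]


def train_logistic_regression(sc, rows):
    """Fit a logistic regression on an iterable of passenger dicts.

    `sc` is an existing SparkContext (an assumption of this sketch). The
    pyspark imports are deferred so that `to_features` can be used and
    tested without a Spark installation.
    """
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.regression import LabeledPoint

    # Each training example pairs the Survived label with its feature vector.
    points = sc.parallelize(list(rows)).map(
        lambda r: LabeledPoint(float(r["Survived"]), to_features(r))
    )
    return LogisticRegressionWithLBFGS.train(points, iterations=100)
```

The returned model can then score a new passenger with `model.predict(to_features(row))`.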
The following will be added later:
* Imputing NAs in train and test sets
* Cross-validation
* Using more features and feature engineering
* RandomForest classifier, SVM, etc.

### Running PySpark Script in Shell
Run the script with `$SPARK_HOME/bin/spark-submit scriptDirectoryPath/titanic_logReg.py`. To run on multiple threads, add the option `--master local[N]` before the script path, where N is the number of worker threads.
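
If you prefer to launch the job from Python rather than typing the command by hand, the invocation can be assembled as an argument list. This is only a sketch: the helper name `spark_submit_command` and its parameters are hypothetical, and it assumes `SPARK_HOME` is set (or passed in explicitly).

```python
import os


def spark_submit_command(script_path, threads=None, spark_home=None):
    """Build the spark-submit argument list for a local run.

    Hypothetical helper. `threads` maps to `--master local[N]`; options
    are placed before the script path, because spark-submit treats
    anything after the script as arguments to the script itself.
    """
    if spark_home is None:
        spark_home = os.environ["SPARK_HOME"]  # raises KeyError if unset
    cmd = [os.path.join(spark_home, "bin", "spark-submit")]
    if threads is not None:
        cmd += ["--master", "local[{}]".format(threads)]
    cmd.append(script_path)
    return cmd
```

The resulting list can be passed to `subprocess.check_call(...)` to start the job.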