https://github.com/ehsanmok/sparkling-titanic

Training models with Apache Spark, PySpark for Titanic Kaggle competition

kaggle-titanic pyspark spark

Last synced: 6 months ago


Host: GitHub
URL: https://github.com/ehsanmok/sparkling-titanic
Owner: ehsanmok
Created: 2015-05-27T02:42:41.000Z (about 10 years ago)
Default Branch: master
Last Pushed: 2016-09-23T01:20:27.000Z (almost 9 years ago)
Last Synced: 2025-01-06T17:14:31.691Z (6 months ago)
Topics: kaggle-titanic, pyspark, spark
Language: Python
Homepage:
Size: 40 KB
Stars: 14
Watchers: 4
Forks: 16
Open Issues: 0
Metadata Files:
- Readme: README.md


README

Sparkling Titanic
=================

### Introduction

`titanic_logReg.py` trains a logistic regression model and makes predictions for the [Titanic dataset](http://kaggle.com/c/titanic/data) from the Kaggle competition, using Apache Spark ([spark-1.3.1-bin-hadoop2.4](http://spark.apache.org/downloads.html)) and its Python API on a local machine. I used `pyspark_csv.py` to load the data as a Spark DataFrame; for more instructions, see [pyspark-csv](http://github.com/seahboonsiew/pyspark-csv).
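As a rough illustration of the training step, here is a minimal sketch using Spark's MLlib API. It is not the code in `titanic_logReg.py`: the function names `to_features` and `train_logistic_regression` are hypothetical, the choice of the `Pclass`, `Sex` and `Fare` columns is an assumption made for the example, and the input is assumed to be an RDD of plain dicts keyed by the Kaggle column names.

```python
def to_features(record):
    """Map one Titanic record (a dict) to a list of numeric features.

    The columns used here (Pclass, Sex, Fare) are an illustrative
    assumption, not necessarily what titanic_logReg.py uses.
    """
    sex = 1.0 if record["Sex"] == "male" else 0.0
    return [float(record["Pclass"]), sex, float(record["Fare"])]


def train_logistic_regression(train_rdd, iterations=100):
    """Fit a logistic regression on an RDD of dict records (assumed shape)."""
    # Imported lazily so that to_features can be used without Spark installed.
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.regression import LabeledPoint

    # MLlib expects an RDD of LabeledPoint(label, features).
    points = train_rdd.map(
        lambda r: LabeledPoint(float(r["Survived"]), to_features(r))
    )
    return LogisticRegressionWithLBFGS.train(
        points, iterations=iterations, intercept=True
    )
```

The returned model's `predict` method can then be applied to `to_features(record)` for each record of the test set. Records with missing values would need to be filtered or imputed first, since `float()` fails on empty fields.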

The following will be added later:

*   Imputing NAs in train and test sets

*   Cross-validation

*   Using more features and feature engineering

*   RandomForest classifier, SVM, etc.

### Running PySpark Script in Shell

Run the script with `$SPARK_HOME/bin/spark-submit scriptDirectoryPath/titanic_logReg.py`. To use multiple threads, add the option `--master local[N]`, where `N` is the number of worker threads.
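If you prefer to launch the job from Python, the command above can be assembled programmatically. The helper below is a hypothetical sketch (`build_spark_submit_command` is not part of this repository or of Spark); it only builds the argument list and places `--master` before the script path, since `spark-submit` treats anything after the script as arguments to the script itself.

```python
import os


def build_spark_submit_command(script_path, threads=None, spark_home=None):
    """Return the spark-submit argument list for running a script locally.

    Hypothetical helper: spark_home falls back to the SPARK_HOME
    environment variable when it is not given explicitly.
    """
    spark_home = spark_home or os.environ.get("SPARK_HOME", "")
    command = [os.path.join(spark_home, "bin", "spark-submit")]
    if threads is not None:
        # Options must come before the application script.
        command += ["--master", "local[{}]".format(threads)]
    command.append(script_path)
    return command
```

The resulting list can be passed to `subprocess.call`, for example `subprocess.call(build_spark_submit_command("scriptDirectoryPath/titanic_logReg.py", threads=4))`.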
