https://github.com/vida-nyu/aws_taxi

Sample scripts to analyze taxi data on Amazon AWS

Host: GitHub
URL: https://github.com/vida-nyu/aws_taxi
Owner: VIDA-NYU
License: mit
Created: 2014-11-22T19:54:25.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2015-04-13T19:40:11.000Z (about 10 years ago)
Last Synced: 2025-03-24T17:55:24.875Z (2 months ago)
Language: Python
Size: 385 KB
Stars: 10
Watchers: 28
Forks: 10
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

NYC Taxi Analysis
========

Sample scripts to analyze taxi data on Amazon AWS

Instructions
------------

1. Create an Amazon EMR cluster with the following configuration. The bootstrap action is essential, so do not skip it:

* Termination protection: Yes
* Logging: Enabled (remember to specify the S3 bucket that will store the log files)
* Hadoop distribution: Amazon AMI 3.3.1
* Bootstrap action: This step is essential because the sample scripts
use the Python rtree library, which Amazon AMI 3.3.1 does not include.
Click 'Add bootstrap action' -> Custom action -> Configure and add,
then enter the following in 'S3 location': s3://mda2014/rtree.sh
* Don't add any step at this point
* Cluster Auto-terminate: No
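
For readers who prefer to script the cluster setup, the console settings above could be expressed roughly as follows. This is only a sketch, not part of the repository: it assumes boto3 is installed and configured, and the cluster name, log prefix, instance types, instance count, and IAM role names are placeholders chosen for illustration. AMI 3.3.1 is an old release and may no longer be available for new clusters.

```python
def build_cluster_request(log_bucket, bootstrap_script="s3://mda2014/rtree.sh"):
    """Return run_job_flow keyword arguments mirroring the console settings."""
    return {
        "Name": "nyc-taxi-analysis",                    # hypothetical name
        "LogUri": "s3://{}/logs/".format(log_bucket),   # hypothetical log prefix
        "AmiVersion": "3.3.1",                          # Hadoop distribution
        "Instances": {
            "MasterInstanceType": "m3.xlarge",          # assumption
            "SlaveInstanceType": "m3.xlarge",           # assumption
            "InstanceCount": 3,                         # assumption
            "TerminationProtected": True,               # Termination protection: Yes
            "KeepJobFlowAliveWhenNoSteps": True,        # Auto-terminate: No
        },
        "BootstrapActions": [
            {
                "Name": "Install rtree",
                "ScriptBootstrapAction": {"Path": bootstrap_script, "Args": []},
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",           # assumption: default roles
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request):
    """Submit the request to EMR and return the new cluster id."""
    import boto3  # imported here so the request can be built without boto3

    return boto3.client("emr").run_job_flow(**request)["JobFlowId"]
```

Building the request separately from submitting it makes it easy to inspect the settings before anything is launched (and billed).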

2. Clone this repository and upload the neighborhoods and yearplot scripts to your bucket on S3. For example:

* neighborhoods: s3://mda2014/neighborhoods
* yearplot: s3://mda2014/yearplot

3. To run the neighborhoods script, add a streaming step to your cluster with the settings below.

Replace mda2014 with your bucket name everywhere except in Input:
* Mapper: s3://mda2014/neighborhoods/mapper.py
* Reducer: s3://mda2014/neighborhoods/reducer.py
* Input: s3://mda2014/taxi/trip/
* Output: s3://mda2014/output1
* Arguments: -D mapred.reduce.tasks=1 -files s3://mda2014/neighborhoods/mapper.py,s3://mda2014/neighborhoods/reducer.py,s3://mda2014/neighborhoods/shapefile.py,s3://mda2014/neighborhoods/ZillowNeighborhoods-NY.shp,s3://mda2014/neighborhoods/ZillowNeighborhoods-NY.prj,s3://mda2014/neighborhoods/ZillowNeighborhoods-NY.shp.xml,s3://mda2014/neighborhoods/ZillowNeighborhoods-NY.shx,s3://mda2014/neighborhoods/ZillowNeighborhoods-NY.dbf
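
Since the Arguments value is long and easy to mistype, a small helper can generate the step settings for a given bucket. The sketch below is illustrative only; the function and constant names are made up here, and it assumes the scripts were uploaded under `neighborhoods` in your bucket as in step 2.

```python
# The taxi trip data stays in the original bucket, so Input is never rewritten.
SHARED_INPUT = "s3://mda2014/taxi/trip/"

# Files shipped to every task through -files, in the order listed above.
NEIGHBORHOOD_FILES = [
    "mapper.py",
    "reducer.py",
    "shapefile.py",
    "ZillowNeighborhoods-NY.shp",
    "ZillowNeighborhoods-NY.prj",
    "ZillowNeighborhoods-NY.shp.xml",
    "ZillowNeighborhoods-NY.shx",
    "ZillowNeighborhoods-NY.dbf",
]


def neighborhoods_step_fields(bucket):
    """Return the streaming-step fields with `bucket` substituted for mda2014."""
    base = "s3://{}/neighborhoods".format(bucket)
    files = ",".join("{}/{}".format(base, name) for name in NEIGHBORHOOD_FILES)
    return {
        "Mapper": base + "/mapper.py",
        "Reducer": base + "/reducer.py",
        "Input": SHARED_INPUT,
        "Output": "s3://{}/output1".format(bucket),
        "Arguments": "-D mapred.reduce.tasks=1 -files " + files,
    }


if __name__ == "__main__":
    for field, value in neighborhoods_step_fields("my-bucket").items():
        print("{}: {}".format(field, value))
```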

Wait for the step to finish, then download all output files and merge them into a single file called `output.txt`.
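
One possible way to do the merge, assuming the output files have already been downloaded into a local directory and keep Hadoop's usual `part-*` names (the directory name and the helper itself are hypothetical):

```python
import glob
import os
import shutil


def merge_outputs(download_dir, merged_path="output.txt"):
    """Concatenate every part-* file in download_dir into merged_path."""
    parts = sorted(glob.glob(os.path.join(download_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, merged)  # append this part unchanged
    return parts


if __name__ == "__main__":
    # "output1" is a placeholder for wherever the files were downloaded.
    merged = merge_outputs("output1")
    print("Merged {} file(s) into output.txt".format(len(merged)))
```

With `mapred.reduce.tasks=1` there is normally only one part file, so the merge amounts to a copy, but the same helper works if you raise the reducer count.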

To generate the plot, run:

    python plot_results.py output.txt

4. To run the yearplot script, add a streaming step to your cluster with the settings below.

Replace mda2014 with your bucket name everywhere except in Input:
* Mapper: s3://mda2014/yearplot/mapper.py
* Reducer: s3://mda2014/yearplot/reducer.py
* Input: s3://mda2014/taxi/trip/
* Output: s3://mda2014/output2
* Arguments: -D mapred.reduce.tasks=1
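
For orientation, the settings above correspond roughly to the Hadoop streaming argument list sketched below. This is an approximation rather than the exact command EMR builds; in particular, shipping the scripts with `-files` and referring to them by file name is an assumption about how the step is assembled. Generic options such as `-D` and `-files` have to come before the streaming options.

```python
def yearplot_streaming_args(bucket, reducers=1):
    """Return an approximate hadoop-streaming argument list for yearplot."""
    scripts = "s3://{}/yearplot".format(bucket)
    return [
        # Generic Hadoop options must precede the streaming options.
        "-D", "mapred.reduce.tasks={}".format(reducers),
        "-files", "{0}/mapper.py,{0}/reducer.py".format(scripts),
        # Streaming options.
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-input", "s3://mda2014/taxi/trip/",   # shared data, bucket not replaced
        "-output", "s3://{}/output2".format(bucket),
    ]


if __name__ == "__main__":
    print(" ".join(yearplot_streaming_args("my-bucket")))
```

Setting `mapred.reduce.tasks=1` sends all mapper output to a single reducer, so the step produces a single output part file.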

Wait for the step to finish, then download all output files and merge them into a single file called `output.txt`.

To generate the plot, run:

    python plot_results.py output.txt

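5. Remember to terminate the cluster after use. Because termination protection is enabled, you need to turn it off before the cluster can be terminated.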

Author
======

[Huy T. Vo](http://serv.cusp.nyu.edu/~hvo/)

Contributors
============

[Tuan-Anh Hoang-Vu](http://bigdata.poly.edu/~tuananh/)
