https://github.com/vida-nyu/aws_taxi
Sample scripts to analyze taxi data on Amazon AWS
- Host: GitHub
- URL: https://github.com/vida-nyu/aws_taxi
- Owner: VIDA-NYU
- License: mit
- Created: 2014-11-22T19:54:25.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2015-04-13T19:40:11.000Z (about 10 years ago)
- Last Synced: 2025-03-24T17:55:24.875Z (2 months ago)
- Language: Python
- Size: 385 KB
- Stars: 10
- Watchers: 28
- Forks: 10
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
NYC Taxi Analysis
=================

Sample scripts to analyze taxi data on Amazon AWS
Instructions
------------

1. Create an Amazon EMR cluster with the following configuration (pay close attention to the bootstrap action, which is essential):
* Termination protection: Yes
* Logging: Enabled (remember to specify the S3 bucket where log files should be stored)
* Hadoop distribution: Amazon AMI 3.3.1
* Bootstrap action: This step is essential because the sample scripts
use the Python rtree library, which Amazon AMI 3.3.1 does not include.
Click 'Add bootstrap action' -> Custom action -> Configure and add,
then enter the following in 'S3 location': s3://mda2014/rtree.sh
* Don't add any steps at this point
* Cluster Auto-terminate: No

2. Clone this repository and upload the neighborhoods and yearplot scripts to your bucket on S3. For example:
* neighborhoods: s3://mda2014/neighborhoods
* yearplot: s3://mda2014/yearplot

3. To run the neighborhoods script, add a streaming step to your cluster with the following information.
Replace mda2014 with your bucket name, except in Input, which must keep pointing at s3://mda2014/taxi/trip/:
* Mapper: s3://mda2014/neighborhoods/mapper.py
* Reducer: s3://mda2014/neighborhoods/reducer.py
* Input: s3://mda2014/taxi/trip/
* Output: s3://mda2014/output1
* Arguments: -D mapred.reduce.tasks=1 -files s3://mda2014/neighborhoods/mapper.py,s3://mda2014/neighborhoods/reducer.py,s3://mda2014/neighborhoods/shapefile.py,s3://mda2014/neighborhoods/ZillowNeighborhoods-NY.shp,s3://mda2014/neighborhoods/ZillowNeighborhoods-NY.prj,s3://mda2014/neighborhoods/ZillowNeighborhoods-NY.shp.xml,s3://mda2014/neighborhoods/ZillowNeighborhoods-NY.shx,s3://mda2014/neighborhoods/ZillowNeighborhoods-NY.dbf

Wait for the step to finish, then download and merge all output into one file called `output.txt`.

To generate the plot, execute:

    python plot_results.py output.txt

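The bucket substitution in step 3 is easy to get wrong by hand, because the Arguments field repeats the bucket name eight times while Input must stay on `mda2014`. The following is a minimal sketch of a helper that fills in the fields for a given bucket. The function name `neighborhoods_step`, the constants, and the example bucket `my-bucket` are hypothetical and not part of this repository; the file list is copied from the Arguments line above.

```python
# Hypothetical helper (not part of this repository): fills in the
# streaming-step fields from step 3 for a given bucket name.

SOURCE_BUCKET = "mda2014"  # Input stays on this bucket, per the instructions

# Files shipped to every task via -files, in the order listed in step 3.
SIDE_FILES = [
    "mapper.py",
    "reducer.py",
    "shapefile.py",
    "ZillowNeighborhoods-NY.shp",
    "ZillowNeighborhoods-NY.prj",
    "ZillowNeighborhoods-NY.shp.xml",
    "ZillowNeighborhoods-NY.shx",
    "ZillowNeighborhoods-NY.dbf",
]


def neighborhoods_step(bucket):
    """Return the console fields for the neighborhoods streaming step."""
    base = "s3://{}/neighborhoods".format(bucket)
    files = ",".join("{}/{}".format(base, name) for name in SIDE_FILES)
    return {
        "Mapper": base + "/mapper.py",
        "Reducer": base + "/reducer.py",
        "Input": "s3://{}/taxi/trip/".format(SOURCE_BUCKET),
        "Output": "s3://{}/output1".format(bucket),
        "Arguments": "-D mapred.reduce.tasks=1 -files " + files,
    }


if __name__ == "__main__":
    # Print the fields so they can be pasted into the EMR console.
    for field, value in sorted(neighborhoods_step("my-bucket").items()):
        print("{}: {}".format(field, value))
```

The sketch only builds strings; it does not contact AWS, so the values still have to be entered in the console as described above.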
4. To run the yearplot script, add a streaming step to your cluster with the following information.
Replace mda2014 with your bucket name, except in Input, which must keep pointing at s3://mda2014/taxi/trip/:
* Mapper: s3://mda2014/yearplot/mapper.py
* Reducer: s3://mda2014/yearplot/reducer.py
* Input: s3://mda2014/taxi/trip/
* Output: s3://mda2014/output2
* Arguments: -D mapred.reduce.tasks=1

Wait for the step to finish, then download and merge all output into one file called `output.txt`.

To generate the plot, execute:

    python plot_results.py output.txt

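The "download and merge" part of steps 3 and 4 can be sketched in a few lines of Python. This assumes the step's output has already been downloaded into a local directory and that the files follow Hadoop's usual `part-*` naming (for example `part-00000`); the function name `merge_parts` and the directory name in the usage note are hypothetical. With a single reducer there is typically only one part file, in which case the merge amounts to a copy.

```python
import glob
import os
import shutil


def merge_parts(output_dir, merged_path="output.txt"):
    """Concatenate downloaded part-* files, in name order, into one file.

    Marker files such as _SUCCESS are skipped because they do not match
    the part-* pattern. Returns the list of part files that were merged.
    """
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    if not parts:
        raise ValueError("no part-* files found in " + output_dir)
    with open(merged_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as chunk:
                # Stream the bytes so large outputs are not held in memory.
                shutil.copyfileobj(chunk, merged)
    return parts
```

For example, after downloading the step output into a local directory named `output2`, calling `merge_parts("output2")` writes `output.txt` in the current directory, ready to pass to `plot_results.py`.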
5. Remember to terminate the cluster after use.

Author
======

[Huy T. Vo](http://serv.cusp.nyu.edu/~hvo/)

Contributors
============

[Tuan-Anh Hoang-Vu](http://bigdata.poly.edu/~tuananh/)