Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/sryza/aas

Code to accompany Advanced Analytics with Spark from O'Reilly Media
https://github.com/sryza/aas

Last synced: 30 days ago
JSON representation

Code to accompany Advanced Analytics with Spark from O'Reilly Media

Awesome Lists containing this project

README

        

Advanced Analytics with Spark Source Code
=========================================

Code to accompany [Advanced Analytics with Spark](http://shop.oreilly.com/product/0636920035091.do), by
[Sandy Ryza](https://github.com/sryza), [Uri Laserson](https://github.com/laserson),
[Sean Owen](https://github.com/srowen), and [Josh Wills](https://github.com/jwills).

[![Advanced Analytics with Spark](http://akamaicovers.oreilly.com/images/0636920056591/lrg.jpg)](http://shop.oreilly.com/product/0636920056591.do)

### 3rd edition (current)

The source to accompany the 3rd edition is found in this, the default
[`master` branch](https://github.com/sryza/aas).

### 2nd Edition (current)

The source to accompany the 2nd edition may be found in the
[`2nd-edition` branch](https://github.com/sryza/aas/tree/2nd-edition).

### 1st Edition

The source to accompany the 1st edition may be found in the
[`1st-edition` branch](https://github.com/sryza/aas/tree/1st-edition).

### Build

[Apache Maven](http://maven.apache.org/) 3.2.5+ and Java 8+ are required to build. From the root level of the project,
run `mvn package` to compile artifacts into `target/` subdirectories beneath each chapter's directory.

### Running the Examples
- Install [Apache Spark](https://spark.apache.org) for your platform, following the instructions for the [latest release](https://spark.apache.org/docs/latest/).
- Build the projects according the instructions above.
- Launch the driver program using `spark-submit`
```bash
# working directory should be your Apache Spark installation root
bin/spark-submit /path/to/code/aas/$CHAPTER/target/$CHAPTER-jar-with-dependencies-$VERSION.jar
```
- Some examples might require that URI paths to the data be updated to your own HDFS or local filesystem locations.

### Data Sets

- Chapter 2: https://archive.ics.uci.edu/ml/machine-learning-databases/00210/
- Chapter 3: https://storage.googleapis.com/aas-data-sets/profiledata_06-May-2005.tar.gz
- Chapter 4: https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/
- Chapter 5: https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (do _not_ use http://www.sigkdd.org/kdd-cup-1999-computer-network-intrusion-detection as the copy has a corrupted line)
- Chapter 6: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
- Chapter 7: ftp://ftp.nlm.nih.gov/nlmdata/sample/medline/ (`*.gz`)
- Chapter 8: https://storage.googleapis.com/aas-data-sets/trip_data_1.csv.zip (from http://www.andresmh.com/nyctaxitrips/)
- Chapter 9: See https://github.com/sryza/aas/tree/master/ch09-risk/data ; included download scripts no longer work
- Chapter 10: ftp://ftp.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00103/alignment/HG00103.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
- Chapter 11: https://github.com/thunder-project/thunder/tree/v0.4.1/python/thunder/utils/data/fish/tif-stack

[![Build Status](https://travis-ci.org/sryza/aas.png?branch=master)](https://travis-ci.org/sryza/aas)