Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sryza/aas
Code to accompany Advanced Analytics with Spark from O'Reilly Media
- Host: GitHub
- URL: https://github.com/sryza/aas
- Owner: sryza
- License: other
- Created: 2014-11-08T22:18:11.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2024-09-25T14:40:05.000Z (4 months ago)
- Last Synced: 2025-01-02T06:07:18.769Z (9 days ago)
- Language: Scala
- Size: 69.3 MB
- Stars: 1,523
- Watchers: 146
- Forks: 1,030
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome - aas - Code to accompany Advanced Analytics with Spark from O'Reilly Media (Scala)
README
Advanced Analytics with Spark Source Code
=========================================

Code to accompany [Advanced Analytics with Spark](http://shop.oreilly.com/product/0636920035091.do), by
[Sandy Ryza](https://github.com/sryza), [Uri Laserson](https://github.com/laserson),
[Sean Owen](https://github.com/srowen), and [Josh Wills](https://github.com/jwills).

[![Advanced Analytics with Spark](http://akamaicovers.oreilly.com/images/0636920056591/lrg.jpg)](http://shop.oreilly.com/product/0636920056591.do)

### 3rd Edition (current)

The source to accompany the 3rd edition is found in this, the default
[`master` branch](https://github.com/sryza/aas).

### 2nd Edition

The source to accompany the 2nd edition may be found in the
[`2nd-edition` branch](https://github.com/sryza/aas/tree/2nd-edition).

### 1st Edition

The source to accompany the 1st edition may be found in the
[`1st-edition` branch](https://github.com/sryza/aas/tree/1st-edition).

### Build

[Apache Maven](http://maven.apache.org/) 3.2.5+ and Java 8+ are required to build. From the root level of the project,
run `mvn package` to compile artifacts into `target/` subdirectories beneath each chapter's directory.

### Running the Examples

- Install [Apache Spark](https://spark.apache.org) for your platform, following the instructions for the [latest release](https://spark.apache.org/docs/latest/).
- Build the projects according to the instructions above.
- Launch the driver program using `spark-submit`:
```bash
# working directory should be your Apache Spark installation root
bin/spark-submit /path/to/code/aas/$CHAPTER/target/$CHAPTER-jar-with-dependencies-$VERSION.jar
```
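
The jar name in the command above depends on both the chapter directory and the project version, so it can be convenient to look the artifact up under `target/` instead of typing it out. The following is a minimal sketch under that assumption, not something shipped with the repository: `find_chapter_jar`, `AAS_HOME`, and `CHAPTER` are hypothetical names, and `ch09-risk` is used only because that directory appears in the data set list.

```bash
#!/usr/bin/env bash
# Sketch only: find_chapter_jar, AAS_HOME and CHAPTER are illustrative
# names and are not defined anywhere in the repository.

# Print the path of the first "jar-with-dependencies" artifact found under
# <repo-root>/<chapter>/target; return non-zero if none has been built yet.
find_chapter_jar() {
  local repo_root="$1" chapter="$2" jar
  for jar in "$repo_root/$chapter/target/"*jar-with-dependencies*.jar; do
    # An unmatched glob stays literal, so check that the file really exists.
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  return 1
}

# Example usage, run from the Apache Spark installation root:
#   AAS_HOME=/path/to/code/aas
#   CHAPTER=ch09-risk
#   (cd "$AAS_HOME" && mvn package)
#   bin/spark-submit "$(find_chapter_jar "$AAS_HOME" "$CHAPTER")"
```

Matching on the `jar-with-dependencies` part of the file name keeps the helper independent of the version string, at the cost of picking the first match if several builds are left in `target/`.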
- Some examples might require that URI paths to the data be updated to your own HDFS or local filesystem locations.

### Data Sets

- Chapter 2: https://archive.ics.uci.edu/ml/machine-learning-databases/00210/
- Chapter 3: https://storage.googleapis.com/aas-data-sets/profiledata_06-May-2005.tar.gz
- Chapter 4: https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/
- Chapter 5: https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (do _not_ use http://www.sigkdd.org/kdd-cup-1999-computer-network-intrusion-detection as the copy has a corrupted line)
- Chapter 6: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
- Chapter 7: ftp://ftp.nlm.nih.gov/nlmdata/sample/medline/ (`*.gz`)
- Chapter 8: https://storage.googleapis.com/aas-data-sets/trip_data_1.csv.zip (from http://www.andresmh.com/nyctaxitrips/)
- Chapter 9: see https://github.com/sryza/aas/tree/master/ch09-risk/data (the included download scripts no longer work)
- Chapter 10: ftp://ftp.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00103/alignment/HG00103.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
- Chapter 11: https://github.com/thunder-project/thunder/tree/v0.4.1/python/thunder/utils/data/fish/tif-stack

[![Build Status](https://travis-ci.org/sryza/aas.png?branch=master)](https://travis-ci.org/sryza/aas)