https://github.com/rajeshthallam/edx-cs100.1-big-data-with-apache-spark

# edX-CS100.1-Big-Data-with-Apache-Spark
Introduction to Big Data with Apache Spark

BerkeleyX - CS100.1x (ended Jul 07, 2015)

# COURSE OVERVIEW

Organizations use their data for decision support and to build data-intensive products and services, such as recommendation, prediction, and diagnostic systems. The collection of skills required by organizations to support these functions has been grouped under the term Data Science. This course will attempt to articulate the expected output of Data Scientists and then teach students how to use PySpark (part of Apache Spark) to deliver against these expectations. The course assignments include Log Mining, Textual Entity Recognition, and Collaborative Filtering exercises that teach students how to manipulate datasets using parallel processing with PySpark.

# COURSE CONTENT

## Week 1: Big Data and Data Science

- Introduction to Big Data and Data Science - learn about big data and see examples of how data science can leverage big data
- Performing Data Science and Preparing Data - explore data science definitions and topics, and the process of preparing data
- Setting up the Course Software Environment - download and install the course software, run your first Apache Spark notebook, and submit your first assignment

## Week 2: Introduction to Apache Spark

- Big Data, Hardware Trends, and the History of Apache Spark - discuss big data and hardware trends, and learn about the history of Apache Spark
- Spark Essentials - learn about Spark's Resilient Distributed Datasets (RDDs), transformations, and actions
- Lab 1: Learning Apache Spark - in the first course lab, learn about the Spark data model, transformations, and actions, then write a word-counting program that counts the words in all of Shakespeare's plays
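
The word count in Lab 1 can be sketched without a Spark installation. The block below is a plain-Python stand-in, not PySpark and not the course's lab code: the helpers `flat_map` and `reduce_by_key` are hypothetical names that mimic what the RDD methods `flatMap` and `reduceByKey` do, and the two sample lines are made up for illustration.

```python
import re


def flat_map(func, items):
    """Apply func to each item and flatten the results (mimics RDD.flatMap)."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Merge the values that share a key using func (mimics RDD.reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def word_count(lines):
    """Count word occurrences with a flatMap -> map -> reduceByKey pipeline."""
    # Split each line into lowercase words (one line yields many words).
    words = flat_map(lambda line: re.findall(r"[a-z']+", line.lower()), lines)
    # Pair every word with an initial count of 1.
    pairs = [(word, 1) for word in words]
    # Sum the counts per word.
    return reduce_by_key(lambda a, b: a + b, pairs)


# Made-up sample input standing in for the Shakespeare text.
sample_lines = ["To be, or not to be", "that is the question"]
counts = word_count(sample_lines)

# Most frequent words first, ties broken alphabetically.
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2]
print(top)
```

In Spark, the splitting and pairing steps are lazy transformations, and nothing runs until an action asks for a result. This sketch evaluates every step eagerly, so it shows only the data flow, not the execution model.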

## Week 3: Data Management

- Semi-Structured Data - explore the concept of semi-structured data and how tabular data is handled in Spark
- Structured Data - learn about structured data, the relational data model, SQL, and joins in SQL and Spark
- Lab 2: Web Server Log Analysis with Apache Spark - use Spark to explore a NASA Apache web server log in the second course lab
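
Web server log analysis usually begins by turning each semi-structured log line into a structured record. The block below is a minimal sketch of that step in plain Python, assuming the log uses the Apache Common Log Format. The regular expression, the `parse_log_line` helper, and the sample lines are illustrative assumptions, not the lab's actual code or data.

```python
import re
from collections import Counter

# Assumed layout (Common Log Format):
# host ident user [timestamp] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)$'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    host, _ident, _user, timestamp, method, path, _protocol, status, size = match.groups()
    return {
        "host": host,
        "timestamp": timestamp,
        "method": method,
        "path": path,
        "status": int(status),
        # A size of "-" means no response body was sent.
        "size": 0 if size == "-" else int(size),
    }


# Made-up lines in the assumed format, plus one malformed line.
sample_log = [
    'host1.example.com - - [01/Aug/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1839',
    'host2.example.com - - [01/Aug/1995:00:00:07 -0400] "GET /missing.gif HTTP/1.0" 404 -',
    "not a log line",
]

# Keep only the lines that parsed, then count responses per status code.
parsed = [rec for rec in map(parse_log_line, sample_log) if rec is not None]
status_counts = Counter(rec["status"] for rec in parsed)
print(status_counts)
```

With Spark, the same parsing function could be applied to every line with a `map` transformation and the failures dropped with `filter`. Counting malformed lines separately is a cheap data-quality check.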

## Week 4: Data Quality, Exploratory Data Analysis, and Machine Learning

- Data Quality - learn about the challenges of data quality and cleaning
- Exploratory Data Analysis - understand the statistics of Exploratory Data Analysis and data distributions
- Machine Learning - learn about Spark's machine learning library, MLlib
- Lab 3: Text Analysis and Entity Resolution - perform text analysis and entity resolution on Google and Amazon product listings using Spark in the third course lab
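
Entity resolution means deciding which records in two sources describe the same real-world item. One common way to score a candidate pair is the cosine similarity of their word-count vectors. The sketch below uses that approach on raw counts. The outline above does not say which method the lab uses, so treat the technique, the helper names, and the product strings as assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the token-count vectors of two strings."""
    vec_a, vec_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(vec_a[tok] * vec_b[tok] for tok in vec_a if tok in vec_b)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    norm = norm_a * norm_b
    # An empty string has a zero vector; define its similarity as 0.
    return dot / norm if norm else 0.0


def best_match(record, candidates):
    """Return the candidate most similar to record."""
    return max(candidates, key=lambda cand: cosine_similarity(record, cand))


# Hypothetical listings; the product names are invented for illustration.
listing_a = "Acme Photo Editor 5.0 for Windows"
listings_b = ["Acme Photo Editor 5.0 (Windows)", "Zeta Tax Planner Deluxe"]

match = best_match(listing_a, listings_b)
print(match, round(cosine_similarity(listing_a, match), 3))
```

Raw counts treat every word as equally informative. Weighting tokens by how rare they are across the whole collection (TF-IDF) is a common refinement, because a shared brand name says more than a shared "for".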

## Week 5: Data Management

- Lab 4: Introduction to Machine Learning with Apache Spark - use Spark's MLlib machine learning library to perform collaborative filtering on a movie dataset in the fourth course lab
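
Collaborative filtering predicts a user's rating for an unseen movie from the ratings of other users. MLlib's collaborative filtering is based on alternating least squares (ALS). The sketch below deliberately swaps in a much simpler user-based neighborhood method so the idea fits in a few lines. It is not what MLlib does, and the users, movies, and ratings are invented.

```python
import math


def cosine(ratings_u, ratings_v):
    """Cosine similarity over the movies that both users rated."""
    shared = set(ratings_u) & set(ratings_v)
    if not shared:
        return 0.0
    dot = sum(ratings_u[m] * ratings_v[m] for m in shared)
    norm_u = math.sqrt(sum(ratings_u[m] ** 2 for m in shared))
    norm_v = math.sqrt(sum(ratings_v[m] ** 2 for m in shared))
    return dot / (norm_u * norm_v)


def predict_rating(ratings, user, movie):
    """Similarity-weighted average of other users' ratings for movie.

    Returns None when no other user with a nonzero similarity rated the movie.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        weight = cosine(ratings[user], other_ratings)
        weighted_sum += weight * other_ratings[movie]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


# Invented ratings on a 1-5 scale: user -> {movie: rating}.
ratings = {
    "ann": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 4, "C": 5},
    "cat": {"A": 1, "B": 5, "C": 2},
}

# ann agrees with bob more than with cat, so the prediction leans toward bob's 5.
prediction = predict_rating(ratings, "ann", "C")
print(round(prediction, 2))
```

Neighborhood methods like this compare every pair of users, which becomes expensive on large datasets. Matrix factorization approaches such as ALS instead learn a compact vector for each user and each movie.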