Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/hrolive/big-data-analysis-with-hadoop-and-rhadoop
Foundations of “Big Data” processing, introducing the Hadoop distributed computing architecture and providing an introductory-level tutorial for Big Data analysis using Hadoop, RHadoop, and the R libraries parallel, doParallel, foreach, and Rmpi.
- Host: GitHub
- URL: https://github.com/hrolive/big-data-analysis-with-hadoop-and-rhadoop
- Owner: HROlive
- Created: 2022-10-19T13:36:58.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-09-11T14:31:33.000Z (over 1 year ago)
- Last Synced: 2024-11-09T13:32:31.648Z (3 months ago)
- Topics: big-data, big-data-analytics, hadoop, hdfs, hpc, hpc-clusters, jupyter, mapreduce, mpi, python, r, rstudio, unix
- Language: Jupyter Notebook
- Homepage:
- Size: 27.6 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## Table of Contents
1. [Description](#description)
2. [Information](#information)
3. [File descriptions](#files)
4. [Certificate](#certificate)

## Description

This training course focused on the foundations of “Big Data” processing by introducing the Hadoop distributed computing architecture and providing an introductory-level tutorial for Big Data analysis using Hadoop, RHadoop, and the R libraries parallel, doParallel, foreach, and Rmpi. The course had a hands-on approach, allowing participants to work interactively with real data in the High Performance Computing environment of the University of Ljubljana and on the Vienna Scientific Cluster.
This two-day course was a EuroCC event, jointly organized by EuroCC Slovenia, EuroCC Slovakia, and EuroCC Austria.
The overall goals of this course were the following:
> - move big data efficiently to a cluster and to the Hadoop distributed file system;
> - perform simple big data analysis with Python scripts using MapReduce and Hadoop;
> - work with RStudio and write R scripts using several state-of-the-art libraries for parallel computation, such as parallel, doParallel, foreach, and Rmpi;
> - work with R libraries for Hadoop, such as rmr, rhdfs, and rhbase;
> - run parallel Slurm jobs with R scripts.

All attendees were given access to real data in the High Performance Computing environment of the University of Ljubljana and on the Vienna Scientific Cluster.
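One of the goals above, simple big data analysis with Python scripts using MapReduce and Hadoop, is typically done through Hadoop Streaming, where the mapper and reducer are ordinary programs reading stdin and writing stdout. A minimal word-count sketch of that idea (illustrative only, not taken from the course materials):

```python
# A minimal Hadoop Streaming-style word count sketched in Python
# (illustrative; not from the course materials).
# Hadoop Streaming runs the mapper over input lines, shuffles/sorts the
# emitted "key\tvalue" lines, then feeds them to the reducer.
from itertools import groupby

def mapper(lines):
    """Map step: emit one 'word\t1' pair per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(pairs):
    """Reduce step: sum the counts per word.

    Assumes pairs arrive sorted by key, as Hadoop's shuffle guarantees.
    """
    split_pairs = (pair.split("\t") for pair in pairs)
    for word, group in groupby(split_pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation of the map -> sort -> reduce pipeline:
counts = list(reducer(sorted(mapper(["big data is big", "data moves"]))))
# counts now holds lines such as "big\t2" and "data\t2"
```

On a cluster, these two steps would be wrapped as stdin/stdout executables and submitted via the Hadoop Streaming jar with its `-mapper` and `-reducer` options; the exact invocation depends on the local Hadoop installation.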
## Information

More detailed information, links, and lesson slides for the course can be found on the [course website](https://vsc.ac.at/training/2022/BigData/).
## File descriptions

The notebooks and exercises can be found in this repository, organized into one folder for each day of the course:
- [Day 1 - Hadoop, HDFS, MapReduce](https://github.com/HROlive/Big-Data-analysis-with-Hadoop-and-RHadoop/tree/main/Day%201%20-%20Hadoop%2C%20HDFS%2C%20MapReduce)
- [Day 2 - Big Data management and analysis with Rmpi and RHadoop](https://github.com/HROlive/Big-Data-analysis-with-Hadoop-and-RHadoop/tree/main/Day%202%20-%20Big%20Data%20management%20and%20analysis%20with%20Rmpi%20and%20RHadoop)

## Certificate

The certificate for the workshop can be found below:
["Big Data analysis with Hadoop and RHadoop" - EuroCC Slovenia, EuroCC Slovakia, EuroCC Austria and VSC Research Center](https://github.com/HROlive/Big-Data-analysis-with-Hadoop-and-RHadoop/blob/main/images/certificate.pdf) (Issued On: October 2022)