{"id":18675384,"url":"https://github.com/manuparra/tallerh2s","last_synced_at":"2025-08-28T22:26:06.201Z","repository":{"id":79159436,"uuid":"185332378","full_name":"manuparra/TallerH2S","owner":"manuparra","description":"Taller HDFS, Hadoop y Spark para el Master Profesional de Ingeniería Informática - Universidad de Granada","archived":false,"fork":false,"pushed_at":"2019-05-20T18:26:45.000Z","size":14,"stargazers_count":0,"open_issues_count":0,"forks_count":5,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-25T21:51:12.963Z","etag":null,"topics":["hadoop","hdfs","java","map-reduce","python","spark","wordcount"],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/manuparra.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-05-07T06:11:31.000Z","updated_at":"2019-05-20T18:26:47.000Z","dependencies_parsed_at":null,"dependency_job_id":"bb19f3c0-32d7-4573-9e32-4b26cb960f87","html_url":"https://github.com/manuparra/TallerH2S","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manuparra%2FTallerH2S","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manuparra%2FTallerH2S/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manuparra%2FTallerH2S/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manuparra%2FTallerH2S/manifests","owner_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/owners/manuparra","download_url":"https://codeload.github.com/manuparra/TallerH2S/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248505928,"owners_count":21115354,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hadoop","hdfs","java","map-reduce","python","spark","wordcount"],"created_at":"2024-11-07T09:24:40.610Z","updated_at":"2025-04-12T02:11:23.319Z","avatar_url":"https://github.com/manuparra.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Workshop Material for HDFS, Hadoop and Spark\n\nManuel Parra: manuelparra@decsai.ugr.es\n\nContent:\n\n- [Workshop - HDFS, Hadoop](#workshop-material-for-hdfs--hadoop-and-spark)\n  * [How to connect](#how-to-connect)\n  * [What is hadoop.ugr.es](#what-is-hadoopugres)\n  * [Working with HDFS](#working-with-hdfs)\n    + [HDFS basics](#hdfs-basics)\n    + [HDFS storage space](#hdfs-storage-space)\n    + [Usage HDFS](#usage-hdfs)\n  * [Exercice](#exercice)\n  * [Working with Hadoop Map-Reduce](#working-with-hadoop-map-reduce)\n    + [Structure of M/R code](#structure-of-m-r-code)\n      - [Mapper](#mapper)\n      - [Reducer](#reducer)\n      - [Main](#main)\n    + [Word Count example](#word-count-example)\n    + [WordCount example file](#wordcount-example-file)\n    + [Running Hadoop applications](#running-hadoop-applications)\n    + [Results](#results)\n    + [Datasets](#datasets)\n    + [Calculate MIN of a row in Hadoop](#calculate-min-of-a-row-in-hadoop)\n    + [Compile MIN in 
Hadoop](#compile-min-in-hadoop)\n    + [References](#references-)\n- [Workshop - SparkR](#sparkr)\n  * [How to connect](#how-to-connect-1)\n  * [Start R shell for Spark](#start-r-shell-for-spark)\n  * [Create the Spark Environment](#create-the-spark-environment)\n  * [Close the Spark Session](#close-the-spark-session)\n  * [Spark Session parameters](#spark-session-parameters)\n  * [Creating SparkDataFrames](#creating-sparkdataframes)\n    + [From local data frames](#from-local-data-frames)\n    + [From Data Sources](#from-data-sources)\n    + [How to read/write from/to hdfs](#how-to-read-write-from-to-hdfs)\n  * [SparkDataFrame Operations](#sparkdataframe-operations)\n  * [Grouping and Aggregation](#grouping-and-aggregation)\n  * [Operating on Columns](#operating-on-columns)\n  * [SparkSQL](#sparksql)\n  * [Machine learning](#machine-learning)\n  * [Let see some examples](#let-see-some-examples)\n    + [First example](#first-example)\n\n## How to connect\n\nFrom Linux/macOS machines:\n\n```\nssh \u003cyour account\u003e@hadoop.ugr.es\n```\n\nFrom a Windows machine, use PuTTY (SSH).\n\nDownload link: https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html\n\n- Host: ``hadoop.ugr.es``\n- Port: ``22``\n- Click \"Open\", then enter your login and password.\n\n## What is hadoop.ugr.es\n\nHadoop.ugr.es is a computing cluster with 15 nodes and a head node. It hosts the data processing platforms Hadoop and Spark and their libraries for Data Mining and Machine Learning (Mahout and MLlib). It also has HDFS installed for working with distributed data.\n\n## Working with HDFS\n\n### HDFS basics\n\nFile management in HDFS works differently from file management on the local file system. HDFS stores its files in its own dedicated space. 
The directory structure of HDFS is as follows:\n\n```\n/tmp     Temp storage\n/user    User storage\n/usr     Application storage\n/var     Logs storage\n```\n\n### HDFS storage space\n\nEach user has a folder in HDFS under ``/user/mp2019/`` named after their username. For example, the user with login mcc50600265 has:\n\n```\n/user/mp2019/mpXXXXXX/\n```\n\n!! The HDFS storage space is different from the user's local storage space in hadoop.ugr.es\n\n```\n/user/mp2019/mpXXXXXX/  NOT EQUAL /home/mcc506000265/\n```\n\n### Usage HDFS\n\n```\nhdfs dfs \u003coptions\u003e\n```\n\nor\n\n```\nhadoop fs \u003coptions\u003e\n```\n\nRun the following command to see all options:\n\n```\nhdfs dfs -help\n```\n\nOptions are (simplified):\n\n```\n-ls         List files\n-cp         Copy files\n-rm         Delete files\n-rmdir      Remove folder\n-mv         Move files or rename\n-cat        Print file content (similar to cat)\n-mkdir      Create a folder\n-tail       Last lines of the file\n-get        Get a file from HDFS to local\n-put        Put a file from local to HDFS\n```\n\nList the content of your HDFS folder:\n\n```\nhdfs dfs -ls /user/mp2019/mpXXXXXX/\n```\n\nCreate a test file:\n\n```\necho \"HOLA HDFS\" \u003e fichero.txt\n```\n\nCopy the local file ``fichero.txt`` to HDFS:\n\n```\nhdfs dfs -put fichero.txt /user/mp2019/mpXXXXXX/\n```\n\nor, equivalently, with ``hadoop fs``:\n\n```\nhadoop fs -put fichero.txt /user/mp2019/mpXXXXXX/\n```\n\nList your folder again:\n\n```\nhdfs dfs -ls /user/mp2019/mpXXXXXX/\n```\n\nCreate a folder:\n\n```\nhdfs dfs -mkdir /user/mp2019/mpXXXXXX/test\n```\n\nMove ``fichero.txt`` to the test folder (both paths are in HDFS):\n\n```\nhdfs dfs -mv /user/mp2019/mpXXXXXX/fichero.txt /user/mp2019/mpXXXXXX/test\n```\n\nShow the content:\n\n```\nhdfs dfs -cat /user/mp2019/mpXXXXXX/test/fichero.txt\n```\n\nor, equivalently, with ``hadoop fs``:\n\n```\nhadoop fs -cat /user/mp2019/mpXXXXXX/test/fichero.txt\n```\n\nDelete file and folder:\n\n```\nhdfs dfs -rm /user/mp2019/mpXXXXXX/test/fichero.txt\n```\n\nand \n\n```\nhdfs dfs -rmdir 
/user/mp2019/mpXXXXXX/test\n```\n\nCreate two files:\n\n```\necho \"HOLA HDFS 1\" \u003e f1.txt\n```\n\n```\necho \"HOLA HDFS 2\" \u003e f2.txt\n```\n\nStore them in HDFS:\n\n```\nhdfs dfs -put f1.txt /user/mp2019/mpXXXXXX/\n```\n\n```\nhdfs dfs -put f2.txt /user/mp2019/mpXXXXXX/\n```\n\nConcatenate both files (this option is very useful, because you will need to merge the results of Hadoop job executions):\n\n```\nhdfs dfs -getmerge /user/mp2019/mpXXXXXX/ merged.txt\n```\n\nDelete a folder recursively:\n\n```\nhdfs dfs -rm -r \u003cfolder\u003e\n```\n\n\n\n## Exercice\n\n- Create 5 files in your local account with the following names:\n  - part1.dat, part2.dat, part3.dat, part4.dat, part5.dat\n- Copy the files to HDFS\n- Create the following HDFS folder structure:\n  - /test/p1/\n  - /train/p1/\n  - /train/p2/\n- Copy part1 to /test/p1/ and part2 to /train/p2/\n- Move part3 and part4 to /train/p1/\n- Finally, merge folder /train/p2 and store the result as data_merged.txt\n\n\n\n## Working with Hadoop Map-Reduce\n\nHadoop version: 2.9.3\n\n### Structure of M/R code\n\n#### Mapper\n\nMaps input key/value pairs to a set of intermediate key/value pairs.\nMaps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.\n\nThe Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. 
Mapper implementations can access the Configuration for the job via the JobContext.getConfiguration().\n\n```\npublic class TokenCounterMapper \n     extends Mapper\u003cObject, Text, Text, IntWritable\u003e{\n    \n   private final static IntWritable one = new IntWritable(1);\n   private Text word = new Text();\n   \n   public void map(Object key, Text value, Context context) throws IOException, InterruptedException {\n     StringTokenizer itr = new StringTokenizer(value.toString());\n     while (itr.hasMoreTokens()) {\n       word.set(itr.nextToken());\n       context.write(word, one);\n     }\n   }\n }\n```\n\n#### Reducer\n\nReduces a set of intermediate values which share a key to a smaller set of values.\n\nReducer has 3 primary phases:\n\n- Shuffle: The Reducer copies the sorted output from each Mapper using HTTP across the network.\n- Sort: The framework merge-sorts Reducer inputs by key (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged. To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce. The grouping comparator is specified via Job.setGroupingComparatorClass(Class). The sort order is controlled by Job.setSortComparatorClass(Class). 
\n- Reduce: In this phase the reduce(Object, Iterable, org.apache.hadoop.mapreduce.Reducer.Context) method is called for each \u003ckey, (collection of values)\u003e in the sorted inputs.\n\nThe output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).\n\n```\npublic class IntSumReducer\u003cKey\u003e extends Reducer\u003cKey,IntWritable,\n                                                 Key,IntWritable\u003e {\n   private IntWritable result = new IntWritable();\n \n   public void reduce(Key key, Iterable\u003cIntWritable\u003e values,\n                      Context context) throws IOException, InterruptedException {\n     int sum = 0;\n     for (IntWritable val : values) {\n       sum += val.get();\n     }\n     result.set(sum);\n     context.write(key, result);\n   }\n }\n \n```\n\n#### Main\n\nThe main function configures the job with the Map and Reduce classes and additional job data.\n\n```\n...\n    Configuration conf = new Configuration();\n    Job job = Job.getInstance(conf, \"word count\");\n    job.setJarByClass(WordCount.class);\n    job.setMapperClass(TokenizerMapper.class);\n    job.setCombinerClass(IntSumReducer.class);\n    job.setReducerClass(IntSumReducer.class);\n    job.setOutputKeyClass(Text.class);\n    job.setOutputValueClass(IntWritable.class);\n    FileInputFormat.addInputPath(job, new Path(args[0]));\n    FileOutputFormat.setOutputPath(job, new Path(args[1]));\n    System.exit(job.waitForCompletion(true) ? 
0 : 1);\n...\n```\n\n### Word Count example\n\nFull example of Word Count for Hadoop 2.9.3 :\n\n```\nimport java.io.IOException;\nimport java.util.StringTokenizer;\n\nimport org.apache.hadoop.conf.Configuration;\nimport org.apache.hadoop.fs.Path;\nimport org.apache.hadoop.io.IntWritable;\nimport org.apache.hadoop.io.Text;\nimport org.apache.hadoop.mapreduce.Job;\nimport org.apache.hadoop.mapreduce.Mapper;\nimport org.apache.hadoop.mapreduce.Reducer;\nimport org.apache.hadoop.mapreduce.lib.input.FileInputFormat;\nimport org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;\n\npublic class WordCount {\n\n  // Mapper\n  public static class TokenizerMapper\n       extends Mapper\u003cObject, Text, Text, IntWritable\u003e{\n\n    private final static IntWritable one = new IntWritable(1);\n    private Text word = new Text();\n\n    public void map(Object key, Text value, Context context\n                    ) throws IOException, InterruptedException {\n      StringTokenizer itr = new StringTokenizer(value.toString());\n      while (itr.hasMoreTokens()) {\n        word.set(itr.nextToken());\n        context.write(word, one);\n      }\n    }\n  }\n\n  // Reducer \n  public static class IntSumReducer\n       extends Reducer\u003cText,IntWritable,Text,IntWritable\u003e {\n    private IntWritable result = new IntWritable();\n\n    public void reduce(Text key, Iterable\u003cIntWritable\u003e values,\n                       Context context\n                       ) throws IOException, InterruptedException {\n      int sum = 0;\n      for (IntWritable val : values) {\n        sum += val.get();\n      }\n      result.set(sum);\n      context.write(key, result);\n    }\n  }\n\n  // Main\n  public static void main(String[] args) throws Exception {\n    Configuration conf = new Configuration();\n    Job job = Job.getInstance(conf, \"word count\");\n    job.setJarByClass(WordCount.class);\n    job.setMapperClass(TokenizerMapper.class);\n    
job.setCombinerClass(IntSumReducer.class);\n    job.setReducerClass(IntSumReducer.class);\n    job.setOutputKeyClass(Text.class);\n    job.setOutputValueClass(IntWritable.class);\n    FileInputFormat.addInputPath(job, new Path(args[0]));\n    FileOutputFormat.setOutputPath(job, new Path(args[1]));\n    System.exit(job.waitForCompletion(true) ? 0 : 1);\n  }\n}\n```\n\n\n### WordCount example file\n\nCopy the source code of WordCount.java from HDFS to your home folder:\n\n```\nhdfs dfs -get /user/mp2019/WordCount.java /home/\u003cyourID\u003e/WordCount.java\n```\n\n\n\n\n### Running Hadoop applications\n\n\nFirst, create the classes folder:\n\n```\nmkdir wordcount_classes\n```\n\nCompile the WordCount application (from source code WordCount.java):\n\n```\njavac -classpath `yarn classpath` -d wordcount_classes WordCount.java\n```\n\nThen package the compiled classes into a JAR (*note the space between ``wordcount_classes/`` and ``.``*):\n\n```\njar -cvf WordCount.jar -C wordcount_classes/ .\n```\n\nFinally, the execution template is:\n\n```\nhadoop jar \u003cApplication\u003e \u003cMainClassName\u003e \u003cInput in HDFS\u003e \u003cOutput in HDFS\u003e\n```\n\n**Examples of execution**\n\n*Pay attention: each run requires a different output folder*\n\nWith the file odyssey.txt in /tmp (HDFS):\n\n```\nhadoop jar WordCount.jar WordCount /tmp/odyssey.txt /user/mp2019/\u003cyourID\u003e/\u003cfolder\u003e/\n```\n\nWith a text file in your HDFS folder:\n\n```\nhadoop jar WordCount.jar WordCount /user/mp2019/lorem.txt  /user/mp2019/\u003cyourID\u003e/\u003cfolder\u003e/\n```\n\n\n\n\n### Results \n\nCheck the output folder with:\n\n```\nhdfs dfs -ls /user/mp2019/\u003cyourID\u003e/\u003cfolder\u003e\n```\n\nThis returns something like:\n\n```\nFound 2 items\n-rw-r--r--   2 root mapred          0 2019-05-13 17:23 /user/.../_SUCCESS\n-rw-r--r--   2 root mapred       6713 2019-05-13 17:23 /user/.../part-r-00000\n```\n\nShow the content of ``part-r-00000``:\n\n```\nhdfs dfs -cat /user/mp2019/\u003cyourID\u003e/\u003cfolder\u003e/part-r-00000\n```\n\n\n### 
Datasets\n\nFolder ``/user/mp2019/`` contains several samples of a Big Data dataset named ECBDL.\n\n```\nRows    Bytes     Date                Folder and File\n------- --------- ----------------    ----------------\n5000    177876    2019-05-13 18:14    /user/mp2019/5000_ECBDL14_10tst.data\n20000   711174    2019-05-13 18:13    /user/mp2019/20000_ECBDL14_10tst.data\n500000  17683919  2019-05-13 18:14    /user/mp2019/500000_ECBDL14_10tst.data\n2897918 102747181 2019-05-13 18:14    /user/mp2019/ECBDL14_10tst.data\n```\n\n\n### Calculate MIN of a row in Hadoop\n\nMapper (old ``org.apache.hadoop.mapred`` API):\n\n\n```\npublic class MinMapper extends MapReduceBase implements Mapper\u003cLongWritable, Text, Text, DoubleWritable\u003e {\n\n        private static final int MISSING = 9999;\n\n        // Index of the dataset column in which we look for the minimum value\n        public static int col=5;\n\n        public void map(LongWritable key, Text value, OutputCollector\u003cText, DoubleWritable\u003e output, Reporter reporter) throws IOException {\n\n                // Columns in the data file are separated by the , (comma) character,\n                // so we split each input line on commas to obtain its columns\n                String line = value.toString();\n                String[] parts = line.split(\",\");\n\n                // Emit key=1 together with the value of the column selected above,\n                // so that every value reaches the same reduce call\n                output.collect(new Text(\"1\"), new DoubleWritable(Double.parseDouble(parts[col])));\n        }\n}\n```\n\n\nReducer (old ``org.apache.hadoop.mapred`` API):\n\n````\npublic class MinReducer extends MapReduceBase implements Reducer\u003cText, DoubleWritable, Text, DoubleWritable\u003e {\n\t\n\t\t// Reduce function:\n\t\tpublic void reduce(Text key, Iterator\u003cDoubleWritable\u003e values, OutputCollector\u003cText, DoubleWritable\u003e output, Reporter reporter) 
throws IOException {\n\n\t\t// To find the minimum, start from the largest double value Java can represent\n\t\tdouble minValue = Double.MAX_VALUE;\n\t\t\n\t\t// Read every value and keep the smallest one\n\t\twhile (values.hasNext()) {\n\t\t\tminValue = Math.min(minValue, values.next().get());\n\t\t}\n\t\t\n\t\t// Emit the key together with the minimum value found in this reduce phase\n\t\toutput.collect(key, new DoubleWritable(minValue));\n\t}\n}\n````\n\nMain (old ``org.apache.hadoop.mapred`` API, matching the Mapper and Reducer above):\n\n```\n  public static void main(String[] args) throws Exception {\n    // JobConf, JobClient, FileInputFormat and FileOutputFormat come from org.apache.hadoop.mapred\n    JobConf conf = new JobConf(Min.class);\n    conf.setJobName(\"Min\");\n    conf.setMapperClass(MinMapper.class);\n    conf.setCombinerClass(MinReducer.class);\n    conf.setReducerClass(MinReducer.class);\n    conf.setOutputKeyClass(Text.class);\n    // The reducer emits DoubleWritable values, so the output value class must match\n    conf.setOutputValueClass(DoubleWritable.class);\n    FileInputFormat.addInputPath(conf, new Path(args[0]));\n    FileOutputFormat.setOutputPath(conf, new Path(args[1]));\n    JobClient.runJob(conf);\n  }\n```\n\n### Compile MIN in Hadoop\n\nFirst, create the classes folder:\n\n````\nmkdir min_classes\n````\n\nCompile the Min application (from source code Min.java):\n\n```\njavac -classpath `yarn classpath` -d min_classes Min.java\n```\n\nThen, (*pay attention in part / . 
is separated*) \n```\njar -cvf Min.jar -C min_classes/ .\n```\n\nFinally, the execution template is:\n\n```\nhadoop jar \u003cApplication\u003e \u003cMainClassName\u003e \u003cInput in HDFS\u003e \u003cOutput in HDFS\u003e\n```\n\n**Examples of execution**\n\n*Pay attention: each run requires a different output folder*\n\nWith the 5000-row ECBDL14 sample file in HDFS:\n\n```\nhadoop jar Min.jar Min /user/mp2019/5000_ECBDL14_10tst.data /user/mp2019/\u003cyourID\u003e/\u003cfolder\u003e/\n```\n\nCheck results:\n\n```\nhdfs dfs -ls /user/mp2019/\u003cyourID\u003e/\u003cfolder\u003e/\n```\n\nShow results:\n\n```\nhdfs dfs -cat /user/mp2019/\u003cyourID\u003e/\u003cfolder\u003e/part-....\n```\n\n\n\n# References:\n\n- http://www.glennklockwood.com/data-intensive/hadoop/overview.html\n\n\n\ncp /tmp/lorem.txt /home/mp\u003cDNI\u003e/lorem.txt\nhdfs dfs -put lorem.txt /user/mp2019/mp\u003cDNI\u003e/\nhdfs dfs -put /home/mp\u003cDNI\u003e/lorem.txt /user/mp2019/mp\u003cDNI\u003e/\n\nhdfs dfs -ls /user/mp2019/mp\u003cDNI\u003e/lorem.txt\ncat lorem.txt\nhdfs dfs -cat /user/mp2019/mp\u003cDNI\u003e/lorem.txt\n\n\n\n\n# SparkR\n\nAPI SparkR: https://spark.apache.org/docs/2.2.0/api/R/\n\n## How to connect\n\n\n```\nssh \u003cyourID\u003e@hadoop.ugr.es\n```\n\n## Start R shell for Spark\n\nRun the following command:\n\n```\nR\n```\n\n## Create the Spark Environment\n\n\n```\n\nif (nchar(Sys.getenv(\"SPARK_HOME\")) \u003c 1) {\n  Sys.setenv(SPARK_HOME = \"/opt/spark-2.2.0/\")\n}\n\nlibrary(SparkR, lib.loc = c(file.path(Sys.getenv(\"SPARK_HOME\"), \"R\", \"lib\")))\n\nsparkR.session(master = \"local[*]\", sparkConfig = list(spark.driver.memory = \"1g\"),enableHiveSupport=FALSE)\n\n```\n\n\n\n## Close the Spark Session\n\n\n```\nsparkR.session.stop()\n```\n\n## Spark Session parameters\n\n```\nProperty Name                   Property group          spark-submit equivalent\nspark.master                    Application Properties  --master\nspark.yarn.keytab               Application 
Properties  --keytab\nspark.yarn.principal            Application Properties  --principal\nspark.driver.memory             Application Properties  --driver-memory\nspark.driver.extraClassPath     Runtime Environment     --driver-class-path\nspark.driver.extraJavaOptions   Runtime Environment     --driver-java-options\nspark.driver.extraLibraryPath   Runtime Environment     --driver-library-path\n\n```\n\n## Creating SparkDataFrames\n\nWith a SparkSession, applications can create SparkDataFrames from a local R data frame, from a Hive table, or from other data sources.\n\n### From local data frames\n\nThe simplest way to create a data frame is to convert a local R data frame into a SparkDataFrame. Specifically, we can use as.DataFrame or createDataFrame and pass in the local R data frame to create a SparkDataFrame. As an example, the following creates a SparkDataFrame based on the faithful dataset from R.\n\n```\ndf \u003c- as.DataFrame(faithful)\n```\n\nShow data in df:\n\n```\nhead(df)\n```\n\n### From Data Sources\n\nSparkR supports operating on a variety of data sources through the SparkDataFrame interface.\n\nSparkR supports reading JSON, CSV and Parquet files natively. Through packages available from sources like Third Party Projects, you can find data source connectors for popular file formats like Avro.\n\n```\ndata1 \u003c- read.df(\"my_file.json\", \"json\")\ndata2 \u003c- read.df(\"my_file.csv\", \"csv\")\n...\n```\n\n### How to read/write from/to hdfs\n\n\nRead ``/user/mp2019/5000_ECBDL14_10tst.data`` as a DataFrame:\n\n```\ndf5000 \u003c- read.df(\"hdfs://hadoop-master/user/mp2019/5000_ECBDL14_10tst.data\", source=\"csv\")\n\n```\n\nCheck data:\n\n```\nsummary(df5000)\n```\n\nData layout: columns ``_c0`` to ``_c9`` hold the data; ``_c10`` is the class variable.\n\n## SparkDataFrame Operations\n\nCreate the SparkDataFrame\n\n```\ndf \u003c- as.DataFrame(faithful)\n```\n\nGet basic information about the SparkDataFrame\n\n```\ndf\n```\n\nSelect only the \"eruptions\" 
column\n\n```\nselect(df, df$eruptions)\n```\n\nShow:\n\n```\nhead(select(df, df$eruptions))\n```\n\n\nFilter the SparkDataFrame to only retain rows with wait times shorter than 50 mins\n\n```\nfilter(df, df$waiting \u003c 50)\n```\n\nShow first results\n\n```\nhead(filter(df, df$waiting \u003c 50))\n```\n\n## Grouping and Aggregation\n\nSparkR data frames support a number of commonly used functions to aggregate data after grouping. \n\n\nWe use the `n` operator to count the number of times each waiting time appears\n\n```\nhead(summarize(groupBy(df, df$waiting), count = n(df$waiting)))\n```\n\n## Operating on Columns\n\nSparkR also provides a number of functions that can be applied directly to columns for data processing and during aggregation.\n\n```\ndf$waiting_secs \u003c- df$waiting * 60\n```\n\n## SparkSQL\n\n```\ndf5000 \u003c- read.df(\"hdfs://hadoop-master/user/mp2019/5000_ECBDL14_10tst.data\", source=\"csv\")\n```\n\nCheck summary:\n\n```\nsummary(df5000)\n```\n\nRegister the DataFrame as a temporary view so it can be queried with SQL:\n\n```\ncreateOrReplaceTempView(df5000, \"df5000sql\")\n\n```\n\nRun the following query:\n\n```\nresults \u003c- sql(\"SELECT _c0  FROM df5000sql\")\n```\n\nCheck results:\n\n```\nhead(results)\n```\n\n\n```\nresults \u003c- sql(\"SELECT max(_c0)  FROM df5000sql\")\n\n```\n\n**Question:**\n\nWhich pair of columns is the most correlated?\n\nCheck SQL functions: https://spark.apache.org/docs/2.3.0/api/sql/index.html\n\n**5 minutes**\n\n\n**How many records of each class are there?**\n\n```\nresults \u003c- sql(\"SELECT count(*),_c10  FROM df5000sql group by _c10\")\n```\n\n## Machine learning\n\nSparkR currently supports the following machine learning algorithms:\n\n- Classification\n  - spark.logit: Logistic Regression\n  - spark.mlp: Multilayer Perceptron (MLP)\n  - spark.naiveBayes: Naive Bayes\n  - spark.svmLinear: Linear Support Vector Machine\n- Regression\n  - spark.survreg: Accelerated Failure Time (AFT) Survival Model\n  - spark.glm or glm: Generalized Linear Model 
(GLM)\n  - spark.isoreg: Isotonic Regression\n- Tree\n  - spark.gbt: Gradient Boosted Trees for Regression and Classification\n  - spark.randomForest: Random Forest for Regression and Classification\n- Clustering\n  - spark.bisectingKmeans: Bisecting k-means\n  - spark.gaussianMixture: Gaussian Mixture Model (GMM)\n  - spark.kmeans: K-Means\n  - spark.lda: Latent Dirichlet Allocation (LDA)\n- Collaborative Filtering\n  - spark.als: Alternating Least Squares (ALS)\n- Frequent Pattern Mining\n  - spark.fpGrowth : FP-growth\n- Statistics\n  - spark.kstest: Kolmogorov-Smirnov Test\n\nUnder the hood, SparkR uses MLlib to train the model. Please refer to the corresponding section of MLlib user guide for example code. Users can call summary to print a summary of the fitted model, predict to make predictions on new data, and write.ml/read.ml to save/load fitted models. SparkR supports a subset of the available R formula operators for model fitting, including ‘~’, ‘.’, ‘:’, ‘+’, and ‘-’.\n\n## Let see some examples\n\nMore info: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html\n\nBefore starting, check the column types:\n\n```\nsummary(df5000)\n```\n\nResult:\n\n```\nSparkDataFrame[summary:string, _c0:string, _c1:string, _c2:string, _c3:string, _c4:string, _c5:string, _c6:string, _c7:string, _c8:string, _c9:string, _c10:string]\n```\n\nAll columns are of type string :(\n\nSolution: infer the schema when reading the file.\n\n```\ndf5000 \u003c- read.df(\"hdfs://hadoop-master/user/mp2019/5000_ECBDL14_10tst.data\", source=\"csv\",  inferSchema = \"true\", header=\"true\")\n```\n\nCheck again:\n\n```\nsummary(df5000)\n```\n\n### First example\n\n```\ntraining \u003c- df5000\ntest \u003c- df5000\n```\n\n```\nmodel = spark.logit(training, f1 ~ class, maxIter = 10, regParam = 0.3, elasticNetParam = 0.8)\n\n```\n\nSee the 
model:\n\n```\nsummary(model)\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanuparra%2Ftallerh2s","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmanuparra%2Ftallerh2s","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanuparra%2Ftallerh2s/lists"}