{"id":21078659,"url":"https://github.com/bchoubert/spark-scala-word-processing","last_synced_at":"2025-10-30T23:24:42.756Z","repository":{"id":90191643,"uuid":"79159815","full_name":"bchoubert/spark-scala-word-processing","owner":"bchoubert","description":null,"archived":false,"fork":false,"pushed_at":"2017-01-16T22:13:14.000Z","size":56,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-10-09T11:12:46.438Z","etag":null,"topics":["polytech-lyon"],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bchoubert.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-01-16T21:10:51.000Z","updated_at":"2017-08-28T21:43:06.000Z","dependencies_parsed_at":null,"dependency_job_id":"1583f378-a785-4487-8e39-a927ff8deb06","html_url":"https://github.com/bchoubert/spark-scala-word-processing","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/bchoubert/spark-scala-word-processing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bchoubert%2Fspark-scala-word-processing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bchoubert%2Fspark-scala-word-processing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bchoubert%2Fspark-scala-word-processing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bchoubert%2Fspark-scala-word-processing/manifests"
,"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bchoubert","download_url":"https://codeload.github.com/bchoubert/spark-scala-word-processing/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bchoubert%2Fspark-scala-word-processing/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281897511,"owners_count":26580345,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-30T02:00:06.501Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["polytech-lyon"],"created_at":"2024-11-19T19:41:15.699Z","updated_at":"2025-10-30T23:24:42.717Z","avatar_url":"https://github.com/bchoubert.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"http://spark.apache.org/images/spark-logo-trademark.png\" alt=\"Spark Logo\" height=\"200\"/\u003e\n\n# spark-scala-word-processing\n\nThis repo is an example of Spark Word processing with Scala.\n\n## Input files\n\nThe poeme.txt is a 2978 line-long file separated into sections. 
It contains a foreign poem translated into French.\n\nThe common_words_subset.txt file lists the most common French words (1,102 words).\n\n## Code and Results\n\n### Word count\n\nThe goal is to count the occurrences of each word in a file:\n\n```scala\n%spark\n// Read the poem as an RDD of lines.\nval textFile = sc.textFile(\"hdfs_path_to_poeme.txt\")\n// Split each line on spaces, pair every word with 1, then sum the counts per word.\nval counts = textFile.flatMap(line =\u003e line.split(\" \"))\n                 .map(word =\u003e (word, 1))\n                 .reduceByKey(_ + _)\ncounts.saveAsTextFile(\"outputfile\")\n```\n\nHere are some lines of the output (`Pair\u003cString, Integer\u003e`):\n\n```\n(revêtus,1)\n(francs,1)\n(souvent,8)\n(épais,5)\n(derniers,3)\n(voile,6)\n(dois-je,1)\n(collines;,1)\n(Remplit,1)\n(l'aigle,1)\n(d'ailes,1)\n(Verse,1)\n(Verdissent,1)\n(frappait,1)\n(Viennent,2)\n(saisie,1)\n(guider,2)\n(tristesse!,1)\n(demeure;,1)\n(Dans,6)\n(distrait,1)\n(gentille:,1)\n```\n\n### Word Count Line\n\nThis code finds the line with the most words and saves it:\n\n```scala\n%spark\nval textFile = sc.textFile(\"hdfs_path_to_poeme.txt\")\n\n// Matches runs of whitespace, i.e. the gaps between words.\nval reg = \"\\\\s+\".r\n\n// textFile already yields one element per line, so no extra split is needed.\n// Word count per line = number of whitespace runs + 1; max() compares the count first.\nval counts = textFile\n                .map(line =\u003e (reg.findAllIn(line).length + 1, line))\n                .max()\n\n// Wrap the single (count, line) pair in an RDD so it can be saved.\nval rdd = sc.parallelize(Seq(counts))\n\nrdd.saveAsTextFile(\"outputfile\")\n```\n\nHere is the result:\n\n```\n12,--«Je sais fort peu de chose et fais mieux de me taire)\n```\n\nThe line with the most words is 12 words long.\n\n### Word Anagrams\n\nThis code groups every word with its anagrams found in the file.\n\nTo do this, the letters of each word are sorted alphabetically and used as the key.\n\nWords with the same key are then concatenated.\n\n```scala\n%spark\nval textFile = sc.textFile(\"hdfs_path_to_common_words_subset.txt\")\n\n// Each line holds one entry; key it by its sorted lowercase letters so anagrams share a key.\nval counts = textFile.map(word =\u003e (\n                word.toLowerCase().toCharArray().sortWith(_ \u003c _).mkString,\n                
word.toLowerCase()))\n            .reduceByKey(_ + \"|\" + _)\n\ncounts.saveAsTextFile(\"outputfile\")\n```\n\nResults are formatted as `Pair\u003cText, Text\u003e`, where the second element is \"word1|word2|word3\":\n\n```\n(eeiqssuuv,visqueuse)\n(aeeimnstt,estaminet|tantiemes)\n(acceeirst,circaetes)\n(aaeeillst,allaitees)\n(deegirruv,degivreur)\n(egiilnrss,rieslings)\n(eeimrs,misere|remise|rimees)\n(adenort,tornade|erodant)\n(einrtux,nitreux)\n(eeegir,egerie|erigee)\n(aabeeilln,alienable)\n(ceeegnors,congreees)\n(adeeloprs,leopardes)\n(aeglnos,losange|solange)\n(aiiopptt,pipotait)\n(ademoorst,moderatos)\n( aabcilooppss,pablo picasso|pascal obispo)\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbchoubert%2Fspark-scala-word-processing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbchoubert%2Fspark-scala-word-processing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbchoubert%2Fspark-scala-word-processing/lists"}