{"id":18003713,"url":"https://github.com/ishaansathaye/csc369-introdistributedcomputing","last_synced_at":"2026-05-02T18:41:10.609Z","repository":{"id":258184406,"uuid":"861858355","full_name":"ishaansathaye/CSC369-IntroDistributedComputing","owner":"ishaansathaye","description":"Cal Poly Fall 2024 CSC 369 Intro to Distributed Computing","archived":false,"fork":false,"pushed_at":"2024-12-07T07:23:19.000Z","size":747,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-02T02:03:15.031Z","etag":null,"topics":["distributed-computing","hadoop","java","map-reduce","scala","spark"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ishaansathaye.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-23T16:18:26.000Z","updated_at":"2024-12-07T07:23:22.000Z","dependencies_parsed_at":"2024-10-25T21:30:15.367Z","dependency_job_id":"56142ee1-b3fe-4e3b-bbe3-4ef81e519edb","html_url":"https://github.com/ishaansathaye/CSC369-IntroDistributedComputing","commit_stats":null,"previous_names":["ishaansathaye/csc369-introdistributedcomputing"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ishaansathaye/CSC369-IntroDistributedComputing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ishaansathaye%2FCSC369-IntroDistributedComputing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ishaansathaye%2FCSC369-IntroDistributedComputing/tags","
releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ishaansathaye%2FCSC369-IntroDistributedComputing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ishaansathaye%2FCSC369-IntroDistributedComputing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ishaansathaye","download_url":"https://codeload.github.com/ishaansathaye/CSC369-IntroDistributedComputing/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ishaansathaye%2FCSC369-IntroDistributedComputing/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263061406,"owners_count":23407606,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distributed-computing","hadoop","java","map-reduce","scala","spark"],"created_at":"2024-10-30T00:10:36.327Z","updated_at":"2026-05-02T18:41:10.536Z","avatar_url":"https://github.com/ishaansathaye.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CSC 369 Introduction to Distributed Computing\n\n## [Slides](https://drive.google.com/drive/folders/15f8oNQfrhNaNGEnIE2-QLQ7_O8go3B60)\n\n- [0 - Using Hadoop and Java](https://docs.google.com/presentation/d/1MJ10Xl_4CI0m0sRZV7fgejnmyCAH4qWe/edit#slide=id.p3)\n- [1 - Map/Reduce](https://docs.google.com/presentation/d/1CFfGHUuzZNVUfKejn_E3AVtR1p7046op/edit#slide=id.p4)\n- [2 - Combiner Functions and Custom Classes](https://docs.google.com/presentation/d/1AiSMVQQLVdIh6sGOEFGy96F0YZt6Uzax/edit#slide=id.p1)\n- [3 - Custom Partitioners, 
Sorters, and Grouping Comparators](https://docs.google.com/presentation/d/1r9gLifKq3PrpLUJ2gMmY7KM44d0iOs5d/edit#slide=id.p1)\n- [4 - Secondary Sort Example](https://docs.google.com/presentation/d/1CG13YuVfVTuzRJFCazdTRs2kr2yBp03Z/edit#slide=id.p3)\n- [5 - Top N Example](https://docs.google.com/presentation/d/1HfZVg7Nh81fa1gThNKIdQ_m1zmPcUFIg/edit#slide=id.p1)\n- [6 - Outer Joins and Multiple Jobs](https://docs.google.com/presentation/d/1yorq8VWz3FI8mJmigQXl9FDGNmwfr0vd/edit#slide=id.p1)\n- [7 - Intro to Scala](https://docs.google.com/presentation/d/1R4BPvFmCZU-IzKlDTGnucmuqC9Up9k0w/edit#slide=id.p1)\n- [8 - RDDs in Spark](https://docs.google.com/presentation/d/1sP2jW2tuYeUqczHkS8HJXlLjtBrCADf8/edit#slide=id.p1)\n- [9 - RDDs with Key-Value Pairs](https://docs.google.com/presentation/d/1ivR5WsgxJivx8ltVhxXFgWm3rkmPPnip/edit#slide=id.p1)\n\n## Notes\n\n- [Hadoop and Java](notes/0_HadoopJava.md)\n\n## Labs\n\n- [Lab 1 - Data Generation](labs/lab1/)\n  - [Lab 1 Instructions](https://docs.google.com/document/d/1IZJ3BmwIFJFoxMhJ-pdHfGyYU1rzox7w/edit)\n- [Lab 2 - Total Sales](labs/lab2/)\n  - [Lab 2 Instructions](https://docs.google.com/document/d/1K-T44teE8fGD3-PdRWcSMnv6ewJBGX5b/edit)\n- [Lab 3 - Sorting All Sales](labs/lab3/)\n  - [Lab 3 Instructions](https://docs.google.com/document/d/1ILEF63JqMABhDGTELM9VkUjAXiMnC9AN/edit)\n- [Lab 4 - Top 10 Expensive Products](labs/lab4/)\n  - [Lab 4 Instructions](https://docs.google.com/document/d/1F3ElibL21zv-aZDmF0MCAoTl-26RO-hI/edit#heading=h.gjdgxs)\n- [Lab 5 - Scala](labs/lab5/)\n  - [Lab 5 Instructions](https://docs.google.com/document/d/1tWk_RK40CvqoesQINOo84wVH3D_pPMXL/edit)\n- [Lab 6 - Spark and RDDs](labs/lab6/)\n  - [Lab 6 Instructions](https://docs.google.com/document/d/1FsnPrEl35rMZDPZBhhcHf7eJ8VzCTun7/edit)\n- [Lab 7 - Top 10 Spark and RDD](labs/lab7/)\n  - [Lab 7 Instructions](https://docs.google.com/document/d/17RWxoWmKOL16y-a_VxK53Ube9PoHk1AC/edit)\n\n## Assignments\n\n- [Assignment 
1](assignments/assignment1/assignment1.pdf)\n- [Assignment 2](assignments/assignment2/assignment2.pdf)\n- [Assignment 3](assignments/assignment3/assignment3.pdf)\n- [Assignment 4](assignments/assignment4/assignment4.pdf)\n\n## Project\n\n- [Project](https://github.com/ishaansathaye/CSC369Project-LoanApproval)\n\n## Scala Cluster Commands\n\n- Create a subdirectory for the example, e.g., a directory named `example`.\n- Create your program, e.g., `App.scala`, in this subdirectory. Make sure it starts with `package example`.\n- Type `sbt package` in the main directory (the one that contains the `src` folder). This will compile your program.\n- Run `spark-submit --class example.App --master yarn ./target/scala-2.11/example_2.11-0.1.jar /user/isathaye/input /user/isathaye/output` to execute. The last two arguments (the HDFS input and output directories) are the program parameters.\n- In the above command, `example` is the name of the project as set in `build.sbt`.\n\n## Map/Reduce Java Basic Commands\n\n- Run `hadoop fs -ls /user/lubo` to see what is there.\n\nCommon operations:\n\n- `hadoop fs -ls [directory]` \u003c- lists the contents of a directory.\n- `hadoop fs -copyFromLocal localDataFile /user/lubo/input` \u003c- copies a file from the current local directory to the input directory in HDFS.\n- `hadoop fs -get /user/lubo/input/file` \u003c- copies a file from the input directory in HDFS to the current local directory.\n- `hadoop fs -rm -r /user/lubo/output` \u003c- deletes the output directory.\n- `hadoop fs -mkdir /user/lubo/input` \u003c- creates the input directory.\n- `hadoop fs -cat /user/lubo/output/part-r-00000` \u003c- prints the content of the output file.\n\n## Compiling Program\n\n- Put all your source code in the same folder.\n- Run `hadoop com.sun.tools.javac.Main *.java`. This will create the `.class` files.\n- Run `jar cvf WordCount.jar *.class`. 
This will create a single jar.\n  - Example job submission:\n    - `hadoop jar WordCount.jar WordCountDriver 5 /user/lubo/input /user/lubo/output`\n- `WordCount.jar` is the name of the jar file.\n- `WordCountDriver` is the name of the Java driver class (the class that contains the `main` method).\n- The last three parameters are inputs to the program, e.g., the minimum size of a word to be selected and the locations of the input and output directories.\n- Type `hadoop fs -cat /user/lubo/output/part-r-00000` to see the result.\n- If there are multiple output files, use `part-r-00001`, `part-r-00002`, and so on.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fishaansathaye%2Fcsc369-introdistributedcomputing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fishaansathaye%2Fcsc369-introdistributedcomputing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fishaansathaye%2Fcsc369-introdistributedcomputing/lists"}