{"id":16388204,"url":"https://github.com/COM6012/ScalableML","last_synced_at":"2025-09-08T15:32:01.644Z","repository":{"id":34213816,"uuid":"168848225","full_name":"haipinglu/ScalableML","owner":"haipinglu","description":"COM6012 Scalable Machine Learning - University of Sheffield","archived":false,"fork":false,"pushed_at":"2023-05-17T13:48:11.000Z","size":196844,"stargazers_count":75,"open_issues_count":0,"forks_count":80,"subscribers_count":8,"default_branch":"master","last_synced_at":"2024-11-09T05:38:32.002Z","etag":null,"topics":["machine-learning","scalable-data-analysis"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/haipinglu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-02-02T15:54:05.000Z","updated_at":"2024-08-09T15:13:10.000Z","dependencies_parsed_at":"2024-11-09T05:47:23.517Z","dependency_job_id":null,"html_url":"https://github.com/haipinglu/ScalableML","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haipinglu%2FScalableML","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haipinglu%2FScalableML/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haipinglu%2FScalableML/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haipinglu%2FScalableML/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/haipinglu","download_url":"https://
codeload.github.com/haipinglu/ScalableML/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":232320250,"owners_count":18504974,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","scalable-data-analysis"],"created_at":"2024-10-11T04:28:35.433Z","updated_at":"2025-09-08T15:32:01.634Z","avatar_url":"https://github.com/haipinglu.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# COM6012 Scalable Machine Learning - University of Sheffield\n\n## Spring 2025\n\n**by [Shuo Zhou](https://shuo-zhou.github.io/) and [Haiping Lu](https://haipinglu.github.io/), with [Tahsin Khan](https://www.sheffield.ac.uk/dcs/people/academic/tahsinur-khan) and [Xianyuan Liu](https://xianyuanliu.github.io/)**\n\nIn [this module](http://www.dcs.shef.ac.uk/intranet/teaching/public/modules/msc/com6012.html), we will learn how to do machine learning at large scale using [Apache Spark](https://spark.apache.org/).\nWe will use the [High Performance Computing (HPC) cluster systems](https://docs.hpc.shef.ac.uk/en/latest/hpc/index.html) of our university. To access the HPC clusters, log in using SSH with your university username and the associated password. When connecting while on campus using Eduroam or off campus, you **must** keep the [university's VPN](https://www.sheffield.ac.uk/it-services/vpn) connected all the time. Multifactor authentication (MFA) will be mandatory. 
The standard University [DUO MFA](https://www.sheffield.ac.uk/it-services/mfa/set-mfa#duo) is used.\n\nThis edition uses [**PySpark 3.5.4**](https://spark.apache.org/docs/3.5.4/api/python/index.html), the [latest stable release of Spark](https://spark.apache.org/releases/spark-release-3-5-4.html) (Dec 20, 2024), and has the 10 sessions listed below. You can refer to the [overview slides](https://github.com/COM6012/ScalableML/blob/master/Slides/Overview-COM6012-2025.pdf) for more information, e.g. timetable and assessment information.\n\n* Session 1: Introduction to Spark and HPC [[Slides](Slides/Lecture%201-COM6012-2025.pdf)][[Lab notes](Lab%201%20-%20Introduction%20to%20Spark%20and%20HPC.md)] (Shuo Zhou)\n* Session 2: RDD, DataFrame, ML pipeline, \u0026 parallelization [[Slides](Slides/Lecture%202-COM6012-2025.pdf)][[Lab notes](Lab%202%20-%20RDD,%20DataFrame,%20ML%20pipeline,%20and%20parallelization.md)] (Shuo Zhou)\n* Session 3: Scalable logistic regression and Spark configuration [[Slides](Slides/Lecture%203-COM6012-2025.pdf)][[Lab notes](Lab%203%20-%20Spark%20configuration%20and%20scalable%20logistic%20regression.md)] (Shuo Zhou)\n* Session 4: Scalable generalized linear models and Spark data types [[Slides](Slides/Lecture%204-COM6012-2025.pdf)][[Lab notes](Lab%204%20-%20Scalable%20Generalized%20Linear%20Models.md)] (Shuo Zhou)\n* Session 5: Scalable decision trees and ensemble models [[Slides](Slides/Lecture%205-COM6012-2025.pdf)][[Lab notes](Lab%205-%20Scalable%20Decision%20trees.md)] (Tahsin Khan)\n* Session 6: Scalable neural networks [[Slides](Slides/Lecture%206-COM6012-2025.pdf)][[Lab notes](Lab%206%20-%20Scalable%20neural%20networks.md)] (Tahsin Khan)\n* Session 7: Scalable k-means clustering [[Slides](Slides/Lecture%207-COM6012-2025.pdf)][[Lab notes](Lab%207%20-%20Scalable%20k-means%20clustering.md)] (Tahsin Khan)\n* Session 8: Scalable matrix factorization for collaborative filtering in recommender systems and PCA for dimensionality reduction 
[[Slides](Slides/Lecture%208-COM6012-2025.pdf)][[Lab notes](Lab%208%20-%20Sclable%20matrix%20factorization%20and%20PCA.md)] (Haiping Lu)\n* Session 9: Apache Spark in the Cloud (Xianyuan Liu)\n* Session 10: Reproducible and reusable AI (Xianyuan Liu)\n\nYou can also download the [Spring 2024 version](https://github.com/COM6012/ScalableML/releases/tag/v2024) for preview or reference.\n\nIf you do not have a [GitHub account](https://github.com/join) yet, we recommend signing up for one to learn how to use this popular open-source software development platform.\n\nWe use US spelling in the slides and lab notes for consistency with the naming conventions in Spark.\n\n## An Introduction to Transparent Machine Learning\n\nShuo Zhou and Haiping Lu developed a course on [An Introduction to Transparent Machine Learning](https://pykale.github.io/transparentML/), part of the [Alan Turing Institute’s online learning courses in responsible AI](https://www.turing.ac.uk/funding-call-online-learning-courses-responsible-ai). If you are interested, this introductory course, which emphasizes transparency in machine learning, can support your learning of scalable machine learning.\n\n## Acknowledgement\n\nThe materials are built with references to the following sources:\n\n* The official [Apache Spark documentation](https://spark.apache.org/). *Note: the **latest information** is here.*\n* The [PySpark tutorial](https://runawayhorse001.github.io/LearningApacheSpark/) by [Wenqiang Feng](https://www.linkedin.com/in/wenqiang-feng-ph-d-51a93742/) with [PDF - Learning Apache Spark with Python](https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf). Also see [GitHub Project Page](https://github.com/runawayhorse001/LearningApacheSpark). *Note: last updated in Dec 2022.*\n* The [**Introduction to Apache Spark** course by A. D. Joseph, University of California, Berkeley](https://www.mooc-list.com/course/introduction-apache-spark-edx). 
*Note: archived.*\n* The book [Learning Spark: Lightning-Fast Data Analytics](https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/), 2nd Edition, O'Reilly by Jules S. Damji, Brooke Wenig, Tathagata Das \u0026 Denny Lee, with a [GitHub repository](https://github.com/databricks/LearningSparkV2).\n* The book [**Spark: The Definitive Guide**](https://books.google.co.uk/books/about/Spark.html?id=urjpAQAACAAJ\u0026redir_esc=y) by Bill Chambers and Matei Zaharia. There is also a repository with [code](https://github.com/databricks/Spark-The-Definitive-Guide) from the book.\n\nMany thanks to\n\n* [Robert Loftin](https://www.sheffield.ac.uk/cs/people/academic/robert-loftin) and [Mauricio A Álvarez](https://maalvarezl.github.io/), who contributed to this module in 2024 and from 2016 to 2022, respectively. Their contributions remain reflected in the course materials.\n* Mike Croucher, Neil Lawrence, William Furnass, Twin Karmakharm, Mike Smith, Xianyuan Liu, Desmond Ryan, Steve Kirk, James Moore, and Vamsi Sai Turlapati for their input and inspiration since 2016.\n* Our teaching assistants and students who have contributed in many ways since 2017.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCOM6012%2FScalableML","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FCOM6012%2FScalableML","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCOM6012%2FScalableML/lists"}