{"id":20630512,"url":"https://github.com/pixelbyaj/apache-spark","last_synced_at":"2026-04-19T04:37:28.676Z","repository":{"id":140855277,"uuid":"130630501","full_name":"pixelbyaj/apache-spark","owner":"pixelbyaj","description":"Start Apache Spark with Python - pyspark","archived":false,"fork":false,"pushed_at":"2018-05-05T06:53:18.000Z","size":13555,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-10-21T16:57:08.479Z","etag":null,"topics":["apache-spark","pyspark-python","python","spark","winutils"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pixelbyaj.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-04-23T02:41:03.000Z","updated_at":"2018-05-08T09:23:51.000Z","dependencies_parsed_at":"2023-05-04T22:25:44.187Z","dependency_job_id":null,"html_url":"https://github.com/pixelbyaj/apache-spark","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/pixelbyaj/apache-spark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pixelbyaj%2Fapache-spark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pixelbyaj%2Fapache-spark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pixelbyaj%2Fapache-spark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pixelbyaj%2Fapache-spark/manifests","owner_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/owners/pixelbyaj","download_url":"https://codeload.github.com/pixelbyaj/apache-spark/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pixelbyaj%2Fapache-spark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31995167,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T20:23:30.271Z","status":"online","status_checked_at":"2026-04-19T02:00:07.110Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","pyspark-python","python","spark","winutils"],"created_at":"2024-11-16T14:08:04.317Z","updated_at":"2026-04-19T04:37:28.654Z","avatar_url":"https://github.com/pixelbyaj.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Start Apache Spark with Python\n\n## Windows ##\n1.\tInstall a JDK (Java Development Kit) from http://www.oracle.com/technetwork/java/javase/downloads/index.html . **You must install the \t\tJDK into a path with no spaces**, for example c:\\jdk. Be sure to change the default location for the installation! **DO NOT INSTALL JAVA 9-INSTALL JAVA 8. Spark is not compatible with Java 9.**\n\n2. \tDownload a **pre-built** version of Apache Spark from https://spark.apache.org/downloads.html\n\n3.\tIf necessary, download and install WinRAR so you can extract the .tgz file you downloaded. 
http://www.rarlab.com/download.htm\n\n4.\tExtract the Spark archive, and copy its **contents** into **C:\\spark** after creating that directory. You should end up with directories like c:\\spark\\bin, c:\\spark\\conf, etc.\n\n5.\tDownload winutils.exe from https://sundog-s3.amazonaws.com/winutils.exe and move it into a **C:\\winutils\\bin** folder that you’ve created. (Note: this is a 64-bit application. If you are on a 32-bit version of Windows, you’ll need to search for a 32-bit build of **winutils.exe** for Hadoop.)\n\n6.\tOpen the **c:\\spark\\conf** folder, and make sure “File Name Extensions” is checked in the “View” tab of Windows Explorer. Rename the **log4j.properties.template** file to **log4j.properties**. Edit this file (using WordPad or something similar) and change the logging level from **INFO to ERROR** for log4j.rootCategory.\n\n7.\tRight-click your Windows menu, select Control Panel, System and Security, and then System. Click on “Advanced System Settings” and then the “Environment Variables” button.\n\n8.\tAdd the following new USER variables:\n\t\t\n\t\t1. **SPARK_HOME** c:\\spark\n\t\t2. **JAVA_HOME** (the path you installed the JDK to in step 1, for example C:\\JDK)\n\t\t3. **HADOOP_HOME** c:\\winutils\n\n9.\tAdd the following paths to your PATH user variable:\n\n\t\t1.\t**%SPARK_HOME%\\bin**\n\t\t2.\t**%JAVA_HOME%\\bin**\n\n10.\tClose the environment variable screen and the control panels.\n\n11.\tInstall the latest Enthought Canopy for Python 3.5 from https://store.enthought.com/downloads/#default Don’t install a Python 2.7 version! **If you already have Python 3.5, you don't need to install Canopy. You can also install Python another way and configure it the same way.**\n\n12. 
Test it out!\n\n\t\t1.\tOpen up Canopy and select “Canopy Command Prompt” from the Tools menu.\n\t\t2.\tEnter cd c:\\spark and then dir to get a directory listing.\n\t\t3.\tLook for a text file we can play with, like README.md or CHANGES.txt\n\t\t4.\tEnter pyspark\n\t\t5.\tAt this point you should have a \u003e\u003e\u003e prompt. If not, double-check the steps above.\n\t\t6.\tEnter rdd = sc.textFile(\"README.md\") (or whatever text file you’ve found), then enter rdd.count()\n\t\t7.\tYou should get a count of the number of lines in that file! Congratulations, you just ran your first Spark program!\n\t\t8.\tEnter quit() to exit the Spark shell, and close the console window.\n\t\t9.\tYou’ve got everything set up! Hooray!\n\t\t\n## NOTE : ##\nFor the dataset, please download the data from https://grouplens.org/datasets/movielens/ \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpixelbyaj%2Fapache-spark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpixelbyaj%2Fapache-spark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpixelbyaj%2Fapache-spark/lists"}