{"id":18400671,"url":"https://github.com/databricks/simr","last_synced_at":"2025-04-07T06:33:41.563Z","repository":{"id":11159841,"uuid":"13531617","full_name":"databricks/simr","owner":"databricks","description":"Spark In MapReduce (SIMR) - launching Spark applications on existing Hadoop MapReduce infrastructure","archived":false,"fork":false,"pushed_at":"2022-03-09T16:37:32.000Z","size":6669,"stargazers_count":45,"open_issues_count":4,"forks_count":19,"subscribers_count":359,"default_branch":"master","last_synced_at":"2025-04-03T00:59:00.058Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://databricks.github.io/simr/","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databricks.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-10-13T00:44:49.000Z","updated_at":"2024-06-06T08:44:36.000Z","dependencies_parsed_at":"2022-08-28T15:00:24.269Z","dependency_job_id":null,"html_url":"https://github.com/databricks/simr","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fsimr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fsimr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fsimr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fsimr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/databricks","download_url":"https://codeload.github.com/databricks/simr/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"
https://github.com","kind":"github","repositories_count":247607769,"owners_count":20965945,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T02:35:57.911Z","updated_at":"2025-04-07T06:33:36.547Z","avatar_url":"https://github.com/databricks.png","language":"Java","readme":"# Spark In MapReduce (SIMR) Documentation\n\n## Quick Guide\n\nDownload the `simr` runtime script, as well as the `simr-\u003chadoop-version\u003e.jar` and `spark-assembly-\u003chadoop-version\u003e.jar` that match\nthe version of Hadoop your cluster is running. If jars are not provided for your version of Hadoop, you will have to build them\nyourself. [See below](#advanced-configuration).\n\n* SIMR runtime script\n  + [Download] ()\n* SIMR and Spark jars are provided for the following Hadoop versions:\n  + 1.0.4 (HDP 1.0 - 1.2) [SIMR Hadoop 1.0.4] () / [Spark Hadoop 1.0.4] ()\n  + 1.2.x (HDP 1.3) [SIMR Hadoop 1.2.0] () / [Spark Hadoop 1.2.0] ()\n  + 0.20 (CDH3) [SIMR CDH3] () / [Spark CDH3] ()\n  + 2.0.0 (CDH4) [SIMR CDH4] () / [Spark CDH4] ()\n\nPlace `simr`, `simr-\u003chadoop-version\u003e.jar`, and `spark-assembly-\u003chadoop-version\u003e.jar` in a directory\nand call `simr` to get usage information. Try running the shell! If you get stuck, continue reading.\n```shell\n./simr --shell\n```\n\n## Requirements\n\n* Java v1.6 is required.\n* SIMR will ship Scala 2.9.3 and Spark 0.8.1 to the Hadoop cluster and execute your program with them.\n* Spark jars are provided for Hadoop 1.0.4 (HDP 1.0 - 1.2), 1.2.x (HDP 1.3), 0.20 (CDH3), and 2.0.0 (CDH4).\n\n## Guide\n\nEnsure the `hadoop` executable is in the PATH. 
If it is not, set `$HADOOP` to point to the binary, or\nthe `hadoop/bin` directory. Set `$SIMRJAR` and `$SPARKJAR` to specify which SIMR and Spark jars to\nuse; otherwise, jars will be selected from the current directory.\n\nTo run a Spark application, package it up as a JAR file and execute:\n```shell\n./simr jar_file main_class parameters [--outdir=\u003chdfs_out_dir\u003e] [--slots=N] [--unique]\n```\n\n* `jar_file` is a JAR file containing all your programs, e.g. `spark-examples.jar`\n* `main_class` is the name of the class with a `main` method, e.g. `org.apache.spark.examples.SparkPi`\n* `parameters` is a list of parameters that will be passed to your `main_class`.\n  + _Important_: the special parameter `%spark_url%` will be replaced with the Spark driver URL.\n* `outdir` is an optional parameter which sets the path (absolute or relative) in HDFS where your\n  job's output will be stored, e.g. `/user/alig/myjob11`.\n  + If this parameter is not set, a directory will be created using the current time stamp in the\n    form of `yyyy-MM-dd_kk_mm_ss`, e.g. `2013-12-01_11_12_13`\n* `slots` is an optional parameter that specifies the number of Map slots SIMR should utilize. By\n  default, SIMR sets the value to the number of nodes in the cluster.\n  + This value must be at least 2; otherwise, no executors will be present and the task will never\n    complete.\n* `unique` is an optional parameter which ensures that each node in the cluster will run at most one\n  SIMR executor.\n\nYour output will be placed in the `outdir` in HDFS; this includes stdout/stderr output from the driver and all executors.\n\n**Important**: to ensure that your Spark jobs terminate without\n  errors, you must end your Spark programs by calling `stop()` on\n  `SparkContext`. 
In the case of the Spark examples, this usually\n  means adding `spark.stop()` at the end of `main()`.\n\n## Example\n\nAssuming `spark-examples.jar` exists and contains the Spark examples, the following will execute the example that computes pi in parallel across 100 partitions:\n```shell\n./simr spark-examples.jar org.apache.spark.examples.SparkPi %spark_url% 100\n```\n\nAlternatively, you can launch a Spark shell like this:\n```shell\n./simr --shell\n```\n\n## Configuration\n\nThe `$HADOOP` environment variable should point at the `hadoop` binary or its directory. To specify\nthe SIMR or Spark jar the runtime script should use, set the `$SIMRJAR` and `$SPARKJAR` environment\nvariables, respectively. If these variables are not set, the runtime script will default to a SIMR\nand Spark jar in the current directory.\n\nBy default, SIMR determines the number of task trackers in the cluster\nand launches a job that is the same size as the cluster. This can be\nadjusted by supplying the command-line parameter `--slots=\u003cinteger\u003e`\nto `simr` or setting the Hadoop configuration parameter\n`simr.cluster.slots`.\n\n## Network Configuration\n\nSIMR expects its different components to communicate over the network, which\nrequires opening ports for communication. SIMR does not have a set of static\nports, as this would prevent multiple SIMR jobs from executing simultaneously.\nInstead, the ports are in the [Ephemeral Range](http://en.wikipedia.org/wiki/Ephemeral_port).\nFor SIMR to function properly, ports in the ephemeral range should be open.\n\n## Advanced Configuration\n\nThe following sections are targeted at users who aim to run SIMR on versions of Hadoop for which\njars have not been provided. 
It is necessary to build the appropriate versions of both\n`simr-\u003chadoop-version\u003e.jar` and `spark-assembly-\u003chadoop-version\u003e.jar` and place them in the same\ndirectory as the `simr` runtime script.\n\n## Building Spark\n\nIn order to build SIMR, we must first compile a version of Spark that targets the version of Hadoop\nthat SIMR will be run on.\n\n1. Download Spark v0.8.1 or greater.\n\n2. Unpack and enter the Spark directory.\n\n3. Modify `project/SparkBuild.scala`.\n  + Change the value of `DEFAULT_HADOOP_VERSION` to match the version of Hadoop you are targeting, e.g.\n  `val DEFAULT_HADOOP_VERSION = \"1.2.0\"`\n\n4. Run `sbt/sbt assembly`, which creates a jumbo jar containing all of Spark in\n   `assembly/target/scala*/spark-assembly-\u003cspark-version\u003e-SNAPSHOT-\u003chadoop-version\u003e.jar`.\n\n5. Copy `assembly/target/scala*/spark-assembly-\u003cspark-version\u003e-SNAPSHOT-\u003chadoop-version\u003e.jar` to the\n   same directory as the runtime script `simr` and follow the instructions below to build\n   `simr-\u003chadoop-version\u003e.jar`.\n\n## Building SIMR\n\n1. Check out the SIMR repository from https://github.com/databricks/simr.git.\n\n2. Copy the Spark jumbo jar into the SIMR `lib/` directory.\n  + **Important**: Ensure the Spark jumbo jar is named `spark-assembly.jar` when placed in the `lib/` directory;\n    otherwise, it will be included in the SIMR jumbo jar.\n\n3. Run `sbt/sbt assembly` in the root of the SIMR directory. This will build the SIMR jumbo jar,\n   which will be output as `target/scala*/simr.jar`.\n\n4. Copy `target/scala*/simr.jar` to the same directory as the runtime script `simr` and follow the\n   instructions above to execute SIMR.\n\n## How it works (advanced)\n\nSIMR launches a Hadoop MapReduce job that contains only mappers. It\nensures that a jumbo jar (`simr.jar`), containing Scala and Spark, gets\nuploaded to the machines of the mappers. 
It also ensures that the job\njar you specified gets shipped to those nodes.\n\nOnce the mappers are all running with the right dependencies in place,\nSIMR performs leader election over HDFS to elect one of the mappers as\nthe Spark driver. SIMR then executes your job driver, which uses a new\nSIMR scheduler backend that generates and accepts driver URLs of the\nform `simr://path`. SIMR thereafter communicates the new driver URL\nto all the mappers, which then start Spark executors. The executors\nconnect back to the driver, which executes your program.\n\nAll output to stdout and stderr is redirected to the specified HDFS\ndirectory. Once your job is done, the SIMR scheduler backend shuts\ndown all the executors (hence the required call to `stop()`).\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks%2Fsimr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabricks%2Fsimr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks%2Fsimr/lists"}