{"id":15710285,"url":"https://github.com/aws/sagemaker-spark","last_synced_at":"2025-05-14T17:05:57.260Z","repository":{"id":45825829,"uuid":"111010291","full_name":"aws/sagemaker-spark","owner":"aws","description":"A Spark library for Amazon SageMaker.","archived":false,"fork":false,"pushed_at":"2025-03-08T00:27:49.000Z","size":1002,"stargazers_count":301,"open_issues_count":35,"forks_count":131,"subscribers_count":52,"default_branch":"master","last_synced_at":"2025-05-08T00:08:05.021Z","etag":null,"topics":["amazon-sagemaker","aws","machine-learning","python","sagemaker","scala","spark"],"latest_commit_sha":null,"homepage":"https://aws.github.io/sagemaker-spark/","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aws.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-11-16T18:58:56.000Z","updated_at":"2025-05-05T14:17:05.000Z","dependencies_parsed_at":"2025-01-16T18:06:59.370Z","dependency_job_id":"c81ed1aa-bf3b-4c0e-a51b-ce4ea670f103","html_url":"https://github.com/aws/sagemaker-spark","commit_stats":{"total_commits":129,"total_committers":25,"mean_commits":5.16,"dds":0.6589147286821706,"last_synced_commit":"495f4cec463b951273ea41a5bb3ec7893f9b610b"},"previous_names":[],"tags_count":35,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aws%2Fsagemaker-spark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aws%2Fsagemaker-spark/tags","releases_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/aws%2Fsagemaker-spark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aws%2Fsagemaker-spark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aws","download_url":"https://codeload.github.com/aws/sagemaker-spark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254190396,"owners_count":22029632,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["amazon-sagemaker","aws","machine-learning","python","sagemaker","scala","spark"],"created_at":"2024-10-03T21:05:35.067Z","updated_at":"2025-05-14T17:05:57.236Z","avatar_url":"https://github.com/aws.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# \u003cimg alt=\"SageMaker\" src=\"branding/icon/sagemaker-banner.png\" height=\"100\"\u003e\n\n# SageMaker Spark\n[![codecov](https://codecov.io/gh/aws/sagemaker-spark/branch/master/graph/badge.svg)](https://codecov.io/gh/aws/sagemaker-spark)\n\nSageMaker Spark is an open source Spark library for [Amazon SageMaker](https://aws.amazon.com/sagemaker/). With SageMaker Spark you construct Spark ML `Pipeline`s using Amazon SageMaker stages. 
These pipelines interleave native Spark ML stages and stages that interact with SageMaker training and model hosting.\n\nWith SageMaker Spark, you can train on Amazon SageMaker from Spark `DataFrame`s using **Amazon-provided ML algorithms**\nlike K-Means clustering or XGBoost, and make predictions on `DataFrame`s against\nSageMaker endpoints hosting your trained models, and, if you have **your own ML algorithms** built\ninto SageMaker compatible Docker containers, you can use SageMaker Spark to train and infer on `DataFrame`s with your\nown algorithms -- **all at Spark scale.**\n\n## Table of Contents\n* [Getting SageMaker Spark](#getting-sagemaker-spark)\n  * [Scala](#scala)\n* [Running SageMaker Spark](#running-sagemaker-spark)\n  * [Running SageMaker Spark Applications with spark-shell or \u003ccode\u003espark-submit\u003c/code\u003e](#running-sagemaker-spark-applications-with-spark-shell-or-spark-submit)\n  * [Running SageMaker Spark Applications on EMR](#running-sagemaker-spark-applications-on-emr)\n  * [Python](#python)\n  * [S3 FileSystem Schemes](#s3-filesystem-schemes)\n  * [API Documentation](#api-documentation)\n* [Getting Started: K-Means Clustering on SageMaker with SageMaker Spark SDK](#getting-started-k-means-clustering-on-sagemaker-with-sagemaker-spark-sdk)\n* [Example: Using SageMaker Spark with Any SageMaker Algorithm](#example-using-sagemaker-spark-with-any-sagemaker-algorithm)\n* [Example: Using SageMakerEstimator and SageMakerModel in a Spark Pipeline](#example-using-sagemakerestimator-and-sagemakermodel-in-a-spark-pipeline)\n* [Example: Using Multiple SageMakerEstimators and SageMakerModels in a Spark Pipeline](#example-using-multiple-sagemakerestimators-and-sagemakermodels-in-a-spark-pipeline)\n* [Example: Creating a SageMakerModel](#example-creating-a-sagemakermodel)\n  * [SageMakerModel From an Endpoint](#sagemakermodel-from-an-endpoint)\n  * [SageMakerModel From Model Data in S3](#sagemakermodel-from-model-data-in-s3)\n  * [SageMakerModel 
From a Previously Completed Training Job](#sagemakermodel-from-a-previously-completed-training-job)\n* [Example: Tearing Down Amazon SageMaker Endpoints](#example-tearing-down-amazon-sagemaker-endpoints)\n* [Configuring an IAM Role](#configuring-an-iam-role)\n* [SageMaker Spark: In-Depth](#sagemaker-spark-in-depth)\n  * [The Amazon Record format](#the-amazon-record-format)\n  * [Serializing and Deserializing for Inference](#serializing-and-deserializing-for-inference)\n* [License](#license)\n\n## Getting SageMaker Spark\n\n### Scala\n\nSageMaker Spark for Scala is available in the Maven central repository:\n\n```\n\u003cdependency\u003e\n    \u003cgroupId\u003ecom.amazonaws\u003c/groupId\u003e\n    \u003cartifactId\u003esagemaker-spark_2.11\u003c/artifactId\u003e\n    \u003cversion\u003espark_2.2.0-1.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nOr, if your project depends on Spark 2.1:\n\n```\n\u003cdependency\u003e\n    \u003cgroupId\u003ecom.amazonaws\u003c/groupId\u003e\n    \u003cartifactId\u003esagemaker-spark_2.11\u003c/artifactId\u003e\n    \u003cversion\u003espark_2.1.1-1.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nYou can also build SageMaker Spark from source. See [sagemaker-spark-sdk](sagemaker-spark-sdk) for more on\nbuilding SageMaker Spark from source.\n\n### Python\n\nSee the [sagemaker-pyspark-sdk](sagemaker-pyspark-sdk) for more on installing and running SageMaker PySpark.\n\n## Running SageMaker Spark\n\nSageMaker Spark depends on hadoop-aws-2.8.1. To run Spark applications that depend on SageMaker Spark, you need to\nbuild Spark with Hadoop 2.8. 
However, if you are running Spark applications on EMR, you can use Spark built with Hadoop 2.7.\n\nApache Spark currently distributes binaries built against Hadoop-2.7, but not 2.8.\nSee the [Spark documentation](https://spark.apache.org/docs/2.2.0/hadoop-provided.html) for more on building Spark\nwith Hadoop 2.8.\n\nSageMaker Spark needs to be added to both the driver and executor classpaths.\n\n### Running SageMaker Spark Applications with `spark-shell` or `spark-submit`\n\nYou can submit SageMaker Spark and the AWS Java Client as dependencies with the \"--jars\" flag, or take a dependency\non SageMaker Spark in Maven using the \"--packages\" flag.\n\n1. Install Hadoop-2.8. [https://hadoop.apache.org/docs/r2.8.0/](https://hadoop.apache.org/docs/r2.8.0/)\n2. Build Spark 2.2 with Hadoop-2.8. The [Spark documentation](https://spark.apache.org/docs/2.2.0/hadoop-provided.html)\nhas guidance on building Spark with your own Hadoop installation.\n3. Run `spark-shell` or `spark-submit` with the `--packages` flag:\n\n```\nspark-shell --packages com.amazonaws:sagemaker-spark_2.11:spark_2.2.0-1.0\n```\n\n### Running SageMaker Spark Applications on EMR\n\nYou can run SageMaker Spark applications on an EMR cluster just like any other Spark application by\nsubmitting your Spark application jar and the SageMaker Spark dependency jars with the `--jars` or `--packages` flags.\n\nSageMaker Spark is pre-installed on EMR releases since 5.11.0. You can run your SageMaker Spark application\non EMR by submitting your Spark application jar and any additional dependencies your Spark application uses.\n\nSageMaker Spark applications have also been verified to be compatible with EMR-5.6.0 (which runs Spark 2.1) and EMR-5.8.0\n(which runs Spark 2.2). 
When submitting your Spark application to an earlier EMR release, use the `--packages` flag to\ndepend on a recent version of the AWS Java SDK:  \n\n```\nspark-submit \\\n  --packages com.amazonaws:aws-java-sdk:1.11.613 \\\n  --deploy-mode cluster \\\n  --conf spark.driver.userClassPathFirst=true \\\n  --conf spark.executor.userClassPathFirst=true \\\n  --jars SageMakerSparkApplicationJar.jar,...\n  ...\n```\n\nThe `spark.driver.userClassPathFirst=true` and `spark.executor.userClassPathFirst=true` properties are required so that\nthe Spark cluster will use the AWS Java SDK dependencies with SageMaker, rather than the AWS Java SDK installed on these\nearlier EMR clusters.\n\nFor more on running Spark applications on EMR, see the\n[EMR Documentation](http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-submit-step.html) on submitting a step.\n\n### Python\n\nSee the [sagemaker-pyspark-sdk](sagemaker-pyspark-sdk) for more on installing and running SageMaker PySpark.\n\n### S3 FileSystem Schemes\n\nEMR allows you to read and write data using the EMR FileSystem (EMRFS), accessed through Spark with \"s3://\":\n\n```scala\nspark.read.format(\"libsvm\").load(\"s3://my-bucket/my-prefix\")\n```\n\nIn other execution environments, you can use the S3A scheme \"s3a://\" to read and write data through the S3A FileSystem:\n\n```scala\nspark.read.format(\"libsvm\").load(\"s3a://my-bucket/my-prefix\")\n```\n\nIn the code examples in this README, we use \"s3://\" to use the [EMRFS](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html),\nor \"s3a://\" to use the [S3A system](https://wiki.apache.org/hadoop/AmazonS3), which is recommended over \"s3n://\".\n\n### API Documentation\n\nYou can view the [Scala API Documentation for SageMaker Spark here.](https://aws.github.io/sagemaker-spark/)\n\nYou can view the [PySpark API Documentation for SageMaker Spark here.](http://sagemaker-pyspark.readthedocs.io/en/latest/)\n\n## Getting Started: K-Means Clustering on SageMaker 
with SageMaker Spark SDK\n \nThis example walks through using SageMaker Spark to train on a Spark DataFrame using a SageMaker-provided algorithm,\nhost the resulting model on Amazon SageMaker, and make predictions on a Spark DataFrame using that hosted model.\n\nWe'll cluster handwritten digits in the MNIST dataset, which we've made available in LibSVM format at \n`s3://sagemaker-sample-data-us-east-1/spark/mnist/train/mnist_train.libsvm`.\n\nYou can start a Spark shell with SageMaker Spark:\n\n```\nspark-shell --packages com.amazonaws:sagemaker-spark_2.11:spark_2.1.1-1.0\n```\n\n1. Create your Spark Session and load your training and test data into DataFrames:\n```scala\nval spark = SparkSession.builder.getOrCreate\n\n// load mnist data as a dataframe from libsvm. replace this region with your own.\nval region = \"us-east-1\"\nval trainingData = spark.read.format(\"libsvm\")\n  .option(\"numFeatures\", \"784\")\n  .load(s\"s3://sagemaker-sample-data-$region/spark/mnist/train/\")\n\nval testData = spark.read.format(\"libsvm\")\n  .option(\"numFeatures\", \"784\")\n  .load(s\"s3://sagemaker-sample-data-$region/spark/mnist/test/\")\n```\n\nThe `DataFrame` consists of a column named \"label\" of Doubles, indicating the digit for each example,\nand a column named \"features\" of Vectors:\n\n```scala\ntrainingData.show\n\n+-----+--------------------+\n|label|            features|\n+-----+--------------------+\n|  5.0|(784,[152,153,154...|\n|  0.0|(784,[127,128,129...|\n|  4.0|(784,[160,161,162...|\n|  1.0|(784,[158,159,160...|\n|  9.0|(784,[208,209,210...|\n|  2.0|(784,[155,156,157...|\n|  1.0|(784,[124,125,126...|\n|  3.0|(784,[151,152,153...|\n|  1.0|(784,[152,153,154...|\n|  4.0|(784,[134,135,161...|\n|  3.0|(784,[123,124,125...|\n|  5.0|(784,[216,217,218...|\n|  3.0|(784,[143,144,145...|\n|  6.0|(784,[72,73,74,99...|\n|  1.0|(784,[151,152,153...|\n|  7.0|(784,[211,212,213...|\n|  2.0|(784,[151,152,153...|\n|  8.0|(784,[159,160,161...|\n|  
6.0|(784,[100,101,102...|\n|  9.0|(784,[209,210,211...|\n+-----+--------------------+\n```\n\n2. Construct a `KMeansSageMakerEstimator`, which extends `SageMakerEstimator`, which is a Spark `Estimator`.\nYou need to pass in an Amazon SageMaker-compatible\nIAM Role that Amazon SageMaker will use to make AWS service calls on your behalf (or configure SageMaker Spark\nto [get this from Spark Config](#configuring-an-iam-role)). Consult the API Documentation for a\ncomplete list of parameters.\n\nIn this example, we are setting the \"k\" and \"feature_dim\" hyperparameters, corresponding to the number\nof clusters we want and to the number of dimensions in our training dataset, respectively.\n\n```scala\n\n// Replace this IAM Role ARN with your own.\nval roleArn = \"arn:aws:iam::account-id:role/rolename\"\n\nval estimator = new KMeansSageMakerEstimator(\n  sagemakerRole = IAMRole(roleArn),\n  trainingInstanceType = \"ml.p2.xlarge\",\n  trainingInstanceCount = 1,\n  endpointInstanceType = \"ml.c4.xlarge\",\n  endpointInitialInstanceCount = 1)\n  .setK(10).setFeatureDim(784)\n```\n\n3. To train and host your model, call `fit()` on your training `DataFrame`:\n\n```scala\nval model = estimator.fit(trainingData)\n```\n\nWhat happens in this call to `fit()`?\n\n1. SageMaker Spark serializes your `DataFrame` and uploads the\nserialized training data to S3. For the K-Means algorithm, SageMaker Spark converts the `DataFrame` to the [Amazon Record\nformat](#the-amazon-record-format).\nSageMaker Spark will create an S3 bucket for you that your IAM role can access if you do not provide an S3 Bucket in\nthe constructor.\n2. 
SageMaker Spark sends a `CreateTrainingJobRequest` to Amazon SageMaker to run a Training Job with one `p2.xlarge` on the data in S3, configured with the\nvalues you pass in to the `SageMakerEstimator`, and polls for completion of the Training Job.\nIn this example, we are sending a CreateTrainingJob request to run a k-means clustering Training Job on Amazon SageMaker\non serialized data we uploaded from your `DataFrame`. When training completes, the Amazon SageMaker service puts\na serialized model in an S3 bucket you own (or the default bucket created by SageMaker Spark).\n3. After training completes, SageMaker Spark sends a `CreateModelRequest`, a `CreateEndpointConfigRequest`, and a\n`CreateEndpointRequest` and polls for completion, each configured with the values you pass in to the SageMakerEstimator.\nThis Endpoint will initially be backed by one `c4.xlarge`.\n\n4. To make inferences using the Endpoint hosting our model, call `transform()` on the `SageMakerModel` returned by `fit()`.\n\n```scala\nval transformedData = model.transform(testData)\ntransformedData.show\n+-----+--------------------+-------------------+---------------+\n|label|            features|distance_to_cluster|closest_cluster|\n+-----+--------------------+-------------------+---------------+\n|  5.0|(784,[152,153,154...|  1767.897705078125|            4.0|\n|  0.0|(784,[127,128,129...|  1392.157470703125|            5.0|\n|  4.0|(784,[160,161,162...| 1671.5711669921875|            9.0|\n|  1.0|(784,[158,159,160...| 1182.6082763671875|            6.0|\n|  9.0|(784,[208,209,210...| 1390.4002685546875|            0.0|\n|  2.0|(784,[155,156,157...|  1713.988037109375|            1.0|\n|  1.0|(784,[124,125,126...| 1246.3016357421875|            2.0|\n|  3.0|(784,[151,152,153...|  1753.229248046875|            4.0|\n|  1.0|(784,[152,153,154...|  978.8394165039062|            2.0|\n|  4.0|(784,[134,135,161...|  1623.176513671875|            3.0|\n|  3.0|(784,[123,124,125...|  1533.863525390625|      
      4.0|\n|  5.0|(784,[216,217,218...|  1469.357177734375|            6.0|\n|  3.0|(784,[143,144,145...|  1736.765869140625|            4.0|\n|  6.0|(784,[72,73,74,99...|   1473.69384765625|            8.0|\n|  1.0|(784,[151,152,153...|    944.88720703125|            2.0|\n|  7.0|(784,[211,212,213...| 1285.9071044921875|            3.0|\n|  2.0|(784,[151,152,153...| 1635.0125732421875|            1.0|\n|  8.0|(784,[159,160,161...| 1436.3162841796875|            6.0|\n|  6.0|(784,[100,101,102...| 1499.7366943359375|            7.0|\n|  9.0|(784,[209,210,211...| 1364.6319580078125|            6.0|\n+-----+--------------------+-------------------+---------------+\n\n```\n\nIn this call to `transform()`, the `SageMakerModel` serializes chunks of the input `DataFrame` and sends them to the\nEndpoint using the SageMakerRuntime `InvokeEndpoint` API. The `SageMakerModel` deserializes the Endpoint's responses,\nwhich contain predictions, and appends the prediction columns to the input `DataFrame`.\n\n## Example: Using SageMaker Spark with Any SageMaker Algorithm\n\nThe `SageMakerEstimator` is an `org.apache.spark.ml.Estimator` that trains a model on Amazon SageMaker.\n\nSageMaker Spark provides several classes that extend `SageMakerEstimator` to run particular algorithms, like `KMeansSageMakerEstimator`\nto run the SageMaker-provided k-means algorithm, or `XGBoostSageMakerEstimator` to run the SageMaker-provided XGBoost\nalgorithm. These classes are just `SageMakerEstimator`s with certain default values passed in. 
You can use SageMaker Spark with\nany algorithm that runs on Amazon SageMaker by creating a `SageMakerEstimator`.\n\nInstead of creating a `KMeansSageMakerEstimator`, you can create an equivalent `SageMakerEstimator`:\n\n```scala\nval estimator = new SageMakerEstimator(\n  trainingImage =\n    \"382416733822.dkr.ecr.us-east-1.amazonaws.com/kmeans:1\",\n  modelImage =\n    \"382416733822.dkr.ecr.us-east-1.amazonaws.com/kmeans:1\",\n  requestRowSerializer = new ProtobufRequestRowSerializer(),\n  responseRowDeserializer = new KMeansProtobufResponseRowDeserializer(),\n  hyperParameters = Map(\"k\" -\u003e \"10\", \"feature_dim\" -\u003e \"784\"),\n  sagemakerRole = IAMRole(roleArn),\n  trainingInstanceType = \"ml.p2.xlarge\",\n  trainingInstanceCount = 1,\n  endpointInstanceType = \"ml.c4.xlarge\",\n  endpointInitialInstanceCount = 1,\n  trainingSparkDataFormat = \"sagemaker\")\n```\n\n* `trainingImage` identifies the Docker registry path to the training image containing your custom code. In this case,\nthis points to the us-east-1 k-means image.\n* `modelImage` identifies the Docker registry path to the image containing inference code. Amazon SageMaker k-means \nuses the same image to train and to host trained models.\n* `requestRowSerializer` implements `com.amazonaws.services.sagemaker.sparksdk.transformation.RequestRowSerializer`.\nA `RequestRowSerializer` serializes `org.apache.spark.sql.Row`s in the input `DataFrame` to send them to the model hosted in Amazon SageMaker for inference.\nThis is passed to the `SageMakerModel` returned by `fit`. In this case, we pass in a `RequestRowSerializer` that serializes\n`Row`s to the Amazon Record protobuf format. See [Serializing and Deserializing for Inference](#serializing-and-deserializing-for-inference)\nfor more information on how SageMaker Spark makes inferences. \n* `responseRowDeserializer` implements\n`com.amazonaws.services.sagemaker.sparksdk.transformation.ResponseRowDeserializer`. 
A `ResponseRowDeserializer` deserializes\nresponses containing predictions from the Endpoint back into columns in a `DataFrame`.\n* `hyperParameters` is a `Map[String, String]` that the `trainingImage` will use to set training hyperparameters.\n* `trainingSparkDataFormat` specifies the data format that Spark uses when uploading training data from a `DataFrame`\nto S3.\n\nSageMaker Spark needs the `trainingSparkDataFormat` to tell Spark how to write the `DataFrame` to S3 for the `trainingImage` to\ntrain on. In this example, \"sagemaker\" tells Spark to write the data as\nRecordIO-encoded [Amazon Records](#the-amazon-record-format), but your own algorithm may take another data format.\nYou can pass in any format that Spark supports as long as your `trainingImage` can train using that data format,\nsuch as \"csv\", \"parquet\", \"com.databricks.spark.csv\", or \"libsvm\".\n\nSageMaker Spark also needs a `RequestRowSerializer` to serialize Spark `Row`s to a\ndata format the `modelImage` can deserialize, and a `ResponseRowDeserializer` to deserialize responses that contain\npredictions from the `modelImage` back into Spark `Row`s. See [Serializing and Deserializing for Inference](#serializing-and-deserializing-for-inference)\nfor more details.\n\n## Example: Using SageMakerEstimator and SageMakerModel in a Spark Pipeline\n\n`SageMakerEstimator`s and `SageMakerModel`s can be used in `Pipeline`s. 
In this\nexample, we run `org.apache.spark.ml.feature.PCA` on our Spark cluster, then train and infer using Amazon SageMaker's\nK-Means on the output column from `PCA`:\n\n```scala\nval pcaEstimator = new PCA()\n  .setInputCol(\"features\")\n  .setOutputCol(\"projectedFeatures\")\n  .setK(50)\n\nval kMeansSageMakerEstimator = new KMeansSageMakerEstimator(\n  sagemakerRole = IAMRole(roleArn),\n  requestRowSerializer =\n    new ProtobufRequestRowSerializer(featuresColumnName = \"projectedFeatures\"),\n  trainingSparkDataFormatOptions = Map(\"featuresColumnName\" -\u003e \"projectedFeatures\"),\n  trainingInstanceType = \"ml.p2.xlarge\",\n  trainingInstanceCount = 1,\n  endpointInstanceType = \"ml.c4.xlarge\",\n  endpointInitialInstanceCount = 1)\n  .setK(10).setFeatureDim(50)\n\nval pipeline = new Pipeline().setStages(Array(pcaEstimator, kMeansSageMakerEstimator))\n\n// train\nval pipelineModel = pipeline.fit(trainingData)\n\nval transformedData = pipelineModel.transform(testData)\ntransformedData.show()\n\n+-----+--------------------+--------------------+-------------------+---------------+\n|label|            features|   projectedFeatures|distance_to_cluster|closest_cluster|\n+-----+--------------------+--------------------+-------------------+---------------+\n|  5.0|(784,[152,153,154...|[880.731433034386...|     1500.470703125|            0.0|\n|  0.0|(784,[127,128,129...|[1768.51722024166...|      1142.18359375|            4.0|\n|  4.0|(784,[160,161,162...|[704.949236329314...|  1386.246826171875|            9.0|\n|  1.0|(784,[158,159,160...|[-42.328192193771...| 1277.0736083984375|            5.0|\n|  9.0|(784,[208,209,210...|[374.043902028333...|   1211.00927734375|            3.0|\n|  2.0|(784,[155,156,157...|[941.267714528850...|  1496.157958984375|            8.0|\n|  1.0|(784,[124,125,126...|[30.2848596410594...| 1327.6766357421875|            5.0|\n|  3.0|(784,[151,152,153...|[1270.14374062052...| 1570.7674560546875|            0.0|\n|  
1.0|(784,[152,153,154...|[-112.10792566485...|     1037.568359375|            5.0|\n|  4.0|(784,[134,135,161...|[452.068280676606...| 1165.1236572265625|            3.0|\n|  3.0|(784,[123,124,125...|[610.596447285397...|  1325.953369140625|            7.0|\n|  5.0|(784,[216,217,218...|[142.959601818422...| 1353.4930419921875|            5.0|\n|  3.0|(784,[143,144,145...|[1036.71862533658...| 1460.4315185546875|            7.0|\n|  6.0|(784,[72,73,74,99...|[996.740157435754...| 1159.8631591796875|            2.0|\n|  1.0|(784,[151,152,153...|[-107.26076167417...|   960.963623046875|            5.0|\n|  7.0|(784,[211,212,213...|[619.771820430940...|   1245.13623046875|            6.0|\n|  2.0|(784,[151,152,153...|[850.152101817161...|  1304.437744140625|            8.0|\n|  8.0|(784,[159,160,161...|[370.041887230547...| 1192.4781494140625|            0.0|\n|  6.0|(784,[100,101,102...|[546.674328209335...|    1277.0908203125|            2.0|\n|  9.0|(784,[209,210,211...|[-29.259112927426...| 1245.8182373046875|            6.0|\n+-----+--------------------+--------------------+-------------------+---------------+\n```\n\n* `requestRowSerializer =\n      new ProtobufRequestRowSerializer(featuresColumnName = \"projectedFeatures\")` tells the `SageMakerModel` returned\n      by `fit()` to infer on the features in the \"projectedFeatures\" column\n* `trainingSparkDataFormatOptions = Map(\"featuresColumnName\" -\u003e \"projectedFeatures\")` tells the `SageMakerProtobufWriter`\n that Spark is using to write the `DataFrame` as format \"sagemaker\" to serialize the \"projectedFeatures\" column when\n writing Amazon Records for training.\n\n\n## Example: Using Multiple SageMakerEstimators and SageMakerModels in a Spark Pipeline\n\nWe can use multiple `SageMakerEstimator`s and `SageMakerModel`s in a pipeline. 
Here, we use\nSageMaker's PCA algorithm to reduce a dataset with 50 dimensions to a dataset with 20 dimensions, then\nuse SageMaker's K-Means algorithm to train on the 20-dimension data.\n\n```scala\nval pcaEstimator = new PCASageMakerEstimator(sagemakerRole = IAMRole(sagemakerRole),\n  trainingInstanceType = \"ml.p2.xlarge\",\n  trainingInstanceCount = 1,\n  endpointInstanceType = \"ml.c4.xlarge\",\n  endpointInitialInstanceCount = 1,\n  responseRowDeserializer = new PCAProtobufResponseRowDeserializer(\n    projectionColumnName = \"projectionDim20\"),\n  trainingInputS3DataPath = S3DataPath(trainingBucket, inputPrefix),\n  trainingOutputS3DataPath = S3DataPath(trainingBucket, outputPrefix),\n  endpointCreationPolicy = EndpointCreationPolicy.CREATE_ON_TRANSFORM)\n  .setNumComponents(20).setFeatureDim(50)\n\nval kmeansEstimator = new KMeansSageMakerEstimator(sagemakerRole = IAMRole(sagemakerRole),\n  trainingInstanceType = \"ml.p2.xlarge\",\n  trainingInstanceCount = 1,\n  endpointInstanceType = \"ml.c4.xlarge\",\n  endpointInitialInstanceCount = 1,\n  trainingSparkDataFormatOptions = Map(\"featuresColumnName\" -\u003e \"projectionDim20\"),\n  requestRowSerializer = new ProtobufRequestRowSerializer(\n    featuresColumnName = \"projectionDim20\"),\n  responseRowDeserializer = new KMeansProtobufResponseRowDeserializer(),\n  trainingInputS3DataPath = S3DataPath(trainingBucket, inputPrefix),\n  trainingOutputS3DataPath = S3DataPath(trainingBucket, outputPrefix),\n  endpointCreationPolicy = EndpointCreationPolicy.CREATE_ON_TRANSFORM)\n  .setK(10).setFeatureDim(20)\n\nval pipeline = new Pipeline().setStages(Array(pcaEstimator, kmeansEstimator))\n\nval model = pipeline.fit(dataset)\n\n// For expediency, transforming the training dataset:\nval transformedData = model.transform(dataset)\ntransformedData.show()\n\n+-----+--------------------+--------------------+-------------------+---------------+\n|label|            features|     
projectionDim20|distance_to_cluster|closest_cluster|\n+-----+--------------------+--------------------+-------------------+---------------+\n|  1.0|[-0.7927307,-11.2...|[5.50362682342529...|  45.03189468383789|            1.0|\n|  1.0|[-3.762671,-5.853...|[-2.1558122634887...|  41.79889678955078|            1.0|\n|  1.0|[-2.0988898,-2.40...|[4.53881502151489...| 50.824703216552734|            1.0|\n|  1.0|[-2.81075,-3.6481...|[0.97894239425659...|  52.78211975097656|            1.0|\n|  1.0|[-2.14356,-4.0369...|[2.25758934020996...|  48.99141311645508|            1.0|\n|  1.0|[-5.3773708,-15.3...|[-3.2523036003112...|  21.99374771118164|            1.0|\n|  1.0|[-1.0369565,-16.5...|[-17.643878936767...| 29.127044677734375|            3.0|\n|  1.0|[-2.019725,-3.226...|[1.41068196296691...|   51.7830696105957|            1.0|\n|  1.0|[-4.3821997,-0.98...|[-0.8335087299346...| 53.921058654785156|            1.0|\n|  1.0|[-7.075208,-34.31...|[11.4329795837402...|  35.12031173706055|            3.0|\n|  1.0|[-3.90454,-4.8401...|[-1.4304646253585...|  50.00594711303711|            1.0|\n|  1.0|[0.9607103,-13.50...|[1.13785743713378...|  28.71956443786621|            1.0|\n|  1.0|[-4.5025017,-15.2...|[2.66747045516967...| 25.419822692871094|            1.0|\n|  1.0|[0.041773,-27.148...|[7.58121681213378...| 30.303693771362305|            3.0|\n|  1.0|[-10.1477266,-39....|[-12.086886405944...|   35.9030647277832|            2.0|\n|  1.0|[-3.09143,-6.4892...|[1.79180252552032...|  39.34271240234375|            1.0|\n|  1.0|[-13.5285917,-32....|[7.62783145904541...| 35.040035247802734|            2.0|\n|  1.0|[-4.189806,-16.04...|[1.41141772270202...| 25.123626708984375|            1.0|\n|  1.0|[-12.77831508,-62...|[0.11281073093414...|  63.91242599487305|            2.0|\n|  1.0|[-9.3934507,-12.5...|[-9.4945802688598...| 20.913305282592773|            1.0|\n+-----+--------------------+--------------------+-------------------+---------------+\n\n```\n* 
`responseRowDeserializer = new PCAProtobufResponseRowDeserializer(\nprojectionColumnName = \"projectionDim20\")` tells the `SageMakerModel` attached to the PCA endpoint to deserialize\nresponses (which contain the lower-dimensional projections of the features vectors) into the column named \"projectionDim20\"\n* `endpointCreationPolicy = EndpointCreationPolicy.CREATE_ON_TRANSFORM` tells the `SageMakerEstimator` to delay SageMaker\n Endpoint creation until it is needed to transform a `DataFrame`. \n* `trainingSparkDataFormatOptions = Map(\"featuresColumnName\" -\u003e \"projectionDim20\"),\n   requestRowSerializer = new ProtobufRequestRowSerializer(\n       featuresColumnName = \"projectionDim20\")` these lines tell the `KMeansSageMakerEstimator`\n       to respectively train and infer on the features in the \"projectionDim20\" column.\n\n## Example: Creating a SageMakerModel\n\nSageMaker Spark supports attaching `SageMakerModel`s to an existing SageMaker endpoint, or to an Endpoint created by\nreference to model data in S3, or to a previously completed Training Job.\n\nThis allows you to use SageMaker Spark just for model hosting and inference on Spark-scale `DataFrame`s without running\na new Training Job.\n\n### SageMakerModel From an Endpoint\n\nYou can attach a `SageMakerModel` to an endpoint that has already been created. 
Supposing an endpoint with name\n\"my-endpoint-name\" is already in service and hosting a SageMaker K-Means model:\n\n```scala\nval model = SageMakerModel\n  .fromEndpoint(endpointName = \"my-endpoint-name\",\n                requestRowSerializer = new ProtobufRequestRowSerializer(\n                  featuresColumnName = \"MyFeaturesColumn\"),\n                responseRowDeserializer = new KMeansProtobufResponseRowDeserializer(\n                  distanceToClusterColumnName = \"DistanceToCluster\",\n                  closestClusterColumnName = \"ClusterLabel\"\n                ))\n```\n\nThis `SageMakerModel` will, upon a call to `transform()`, serialize the column named\n\"MyFeaturesColumn\" for inference, and append the columns \"DistanceToCluster\" and \"ClusterLabel\" to the `DataFrame`.\n\n### SageMakerModel From Model Data in S3\n\nYou can create a `SageMakerModel` and an Endpoint by referring directly to your model data in S3:\n\n```scala\nval model = SageMakerModel\n  .fromModelS3Path(modelPath = \"s3://my-model-bucket/my-model-data/model.tar.gz\",\n                   modelExecutionRoleARN = \"arn:aws:iam::account-id:role/rolename\",\n                   modelImage = \"382416733822.dkr.ecr.us-east-1.amazonaws.com/kmeans:1\",\n                   endpointInstanceType = \"ml.c4.xlarge\",\n                   endpointInitialInstanceCount = 1,\n                   requestRowSerializer = new ProtobufRequestRowSerializer(),\n                   responseRowDeserializer = new KMeansProtobufResponseRowDeserializer()\n                  )\n```\n\n### SageMakerModel From a Previously Completed Training Job\n\nYou can create a `SageMakerModel` and an Endpoint by referring to a previously completed training job:\n\n```scala\nval model = SageMakerModel\n  .fromTrainingJob(trainingJobName = \"my-training-job-name\",\n                   modelExecutionRoleARN = \"arn:aws:iam::account-id:role/rolename\",\n                   modelImage = \"
382416733822.dkr.ecr.us-east-1.amazonaws.com/kmeans:1\",\n                   endpointInstanceType = \"ml.c4.xlarge\",\n                   endpointInitialInstanceCount = 1,\n                   requestRowSerializer = new ProtobufRequestRowSerializer(),\n                   responseRowDeserializer = new KMeansProtobufResponseRowDeserializer()\n                  )\n```\n\n## Example: Tearing Down Amazon SageMaker Endpoints\n\nSageMaker Spark provides a utility for deleting Endpoints created by a `SageMakerModel`:\n\n```scala\nimport com.amazonaws.services.sagemaker.AmazonSageMakerClientBuilder\nimport com.amazonaws.services.sagemaker.sparksdk.SageMakerResourceCleanup\n\nval sagemakerClient = AmazonSageMakerClientBuilder.defaultClient\nval cleanup = new SageMakerResourceCleanup(sagemakerClient)\ncleanup.deleteResources(model.getCreatedResources)\n```\n\n## Configuring an IAM Role\n\nSageMaker Spark allows you to add your IAM Role ARN to your Spark Config so that you don't have to keep passing in\n`IAMRole(\"arn:aws:iam::account-id:role/rolename\")`.\n\nAdd an entry to your Spark Config with key `com.amazonaws.services.sagemaker.sparksdk.sagemakerrole` whose value is the ARN of your\nAmazon SageMaker-compatible IAM Role. 
`SageMakerEstimator` will look for this role if it is not supplied in the constructor.\n\n## SageMaker Spark: In-Depth\n\n### The Amazon Record format\n\n`KMeansSageMakerEstimator`, `PCASageMakerEstimator`, and `LinearLearnerSageMakerEstimator` all serialize `DataFrame`s\nto the Amazon Record protobuf format with each Record encoded in\n[RecordIO](https://mxnet.incubator.apache.org/architecture/note_data_loading.html).\nThey do this by passing in \"sagemaker\" to the `trainingSparkDataFormat` constructor argument, which configures Spark\nto use the `SageMakerProtobufWriter` to serialize Spark `DataFrame`s.\n\nWriting a `DataFrame` using the \"sagemaker\"\nformat serializes a column named \"label\", expected to contain\n`Double`s, and a column named \"features\", expected to contain a sparse or dense `org.apache.spark.ml.linalg.Vector`.\nIf the features column contains a `SparseVector`, SageMaker Spark sparsely encodes the `Vector` into the Amazon Record.\nIf the features column contains a `DenseVector`, SageMaker Spark densely encodes the `Vector` into the Amazon Record.\n\nYou can choose which columns the `SageMakerEstimator` uses as its \"label\" and \"features\" columns by passing in\na `trainingSparkDataFormatOptions` `Map[String, String]` with keys \"labelColumnName\" and \"featuresColumnName\" and with\nvalues corresponding to the names of your chosen label and features columns.\n\nYou can also write Amazon Records using SageMaker Spark by using the \"sagemaker\" format directly:\n\n```scala\nmyDataFrame.write\n    .format(\"sagemaker\")\n    .option(\"labelColumnName\", \"myLabelColumn\")\n    .option(\"featuresColumnName\", \"myFeaturesColumn\")\n    .save(\"s3://my-s3-bucket/my-s3-prefix\")\n```\n\nBy default, `SageMakerEstimator` deletes the RecordIO-encoded Amazon Records in S3 following training on Amazon\nSageMaker. 
You can choose to allow the data to persist in S3 by passing in `deleteStagingDataAfterTraining = false` to\n`SageMakerEstimator`.\n\nSee the [AWS Documentation on Amazon Records](https://aws.amazon.com/sagemaker/latest/dg/cdf-training.html) for\nmore information on Amazon Records.\n\n### Serializing and Deserializing for Inference\n\n`SageMakerEstimator.fit()` returns a `SageMakerModel`, which transforms a `DataFrame` by calling `InvokeEndpoint` on\nan Amazon SageMaker Endpoint. `InvokeEndpointRequest`s carry serialized `Row`s as their payload. `Row`s in the `DataFrame`\nare serialized for predictions against an Endpoint using a `RequestRowSerializer`. Responses from an Endpoint containing\npredictions are deserialized into Spark `Row`s and appended as columns in a `DataFrame` using a `ResponseRowDeserializer`.\n\nInternally, `SageMakerModel.transform` calls `mapPartitions` to distribute the work\nof serializing Spark `Row`s, constructing and sending `InvokeEndpointRequest`s to an Endpoint, and deserializing\n`InvokeEndpointResponse`s across a Spark cluster. 
Because each `InvokeEndpointRequest` can carry only 5MB, each\nSpark partition creates a\n`com.amazonaws.services.sagemaker.sparksdk.transformation.util.RequestBatchIterator` to iterate over its partition,\nsending prediction requests to the Endpoint in 5MB increments.\n\n`RequestRowSerializer.serializeRow()` converts a `Row` to an `Array[Byte]`.\nThe `RequestBatchIterator` appends these byte arrays to\nform the request body of an `InvokeEndpointRequest`.\n\nFor example, the\n`com.amazonaws.services.sagemaker.sparksdk.transformation.ProtobufRequestRowSerializer` creates one\nRecordIO-encoded Amazon Record per input row by serializing the \"features\" column in each row, and wrapping each\nAmazon Record in the RecordIO header.\n\n`ResponseRowDeserializer.deserializeResponse()` converts an `Array[Byte]` containing predictions from an Endpoint to\nan `Iterator[Row]` in order to append columns containing these predictions to the `DataFrame` being transformed by the\n`SageMakerModel`.\n\nFor comparison, SageMaker's XGBoost uses LibSVM-formatted data for inference (as well as training), and responds with a comma-delimited list of predictions.\nAccordingly, SageMaker Spark uses `com.amazonaws.services.sagemaker.sparksdk.transformation.LibSVMRequestRowSerializer`\nto serialize rows into LibSVM-formatted data, and uses `com.amazonaws.services.sagemaker.sparksdk.transformation.XGBoostCSVResponseRowDeserializer`\nto deserialize the response into a column of predictions.\n\nTo support your own model image's data formats for inference, you can implement your own `RequestRowSerializer` and `ResponseRowDeserializer`.\n\n## License\n\nSageMaker Spark is licensed under 
[Apache-2.0](https://github.com/aws/sagemaker-spark/LICENSE.txt).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faws%2Fsagemaker-spark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faws%2Fsagemaker-spark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faws%2Fsagemaker-spark/lists"}