{"id":18400676,"url":"https://github.com/databricks/spark-redshift","last_synced_at":"2025-05-15T23:04:31.847Z","repository":{"id":23205625,"uuid":"26562411","full_name":"databricks/spark-redshift","owner":"databricks","description":"Redshift data source for Apache Spark","archived":false,"fork":false,"pushed_at":"2023-08-10T16:12:49.000Z","size":796,"stargazers_count":606,"open_issues_count":150,"forks_count":349,"subscribers_count":171,"default_branch":"master","last_synced_at":"2025-05-08T05:19:08.332Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databricks.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2014-11-13T00:08:13.000Z","updated_at":"2025-01-14T22:01:51.000Z","dependencies_parsed_at":"2024-11-06T03:04:33.280Z","dependency_job_id":"2651f278-52b9-49ff-8721-286708cc464f","html_url":"https://github.com/databricks/spark-redshift","commit_stats":null,"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-redshift","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-redshift/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-redshift/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-redshift/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/databricks
","download_url":"https://codeload.github.com/databricks/spark-redshift/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254436944,"owners_count":22070946,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T02:35:58.799Z","updated_at":"2025-05-15T23:04:31.818Z","avatar_url":"https://github.com/databricks.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Redshift Data Source for Apache Spark\n\n[![Build Status](https://travis-ci.org/databricks/spark-redshift.svg?branch=master)](https://travis-ci.org/databricks/spark-redshift)\n[![codecov.io](http://codecov.io/github/databricks/spark-redshift/coverage.svg?branch=master)](http://codecov.io/github/databricks/spark-redshift?branch=master)\n\n## Note\n\nTo ensure the best experience for our customers, we have decided to inline this connector directly in Databricks Runtime. The latest version of Databricks Runtime (3.0+)  includes an advanced version of the RedShift connector for Spark that features both performance improvements (full query pushdown) as well as security improvements (automatic encryption). For more information, refer to the \u003ca href=\"https://docs.databricks.com/spark/latest/data-sources/aws/amazon-redshift.html\"\u003eDatabricks documentation\u003c/a\u003e. As a result, we will no longer be making releases separately from Databricks Runtime.\n\n\n## Original Readme\n\nA library to load data into Spark SQL DataFrames from Amazon Redshift, and write them back to\nRedshift tables. 
Amazon S3 is used to efficiently transfer data in and out of Redshift, and\nJDBC is used to automatically trigger the appropriate `COPY` and `UNLOAD` commands on Redshift.\n\nThis library is more suited to ETL than interactive queries, since large amounts of data could be extracted to S3 for each query execution. If you plan to perform many queries against the same Redshift tables then we recommend saving the extracted data in a format such as Parquet.\n\n- [Installation](#installation)\n  - [Snapshot builds](#snapshot-builds)\n- Usage:\n  - Data sources API: [Scala](#scala), [Python](#python), [SQL](#sql), [R](#r)\n  - [Hadoop InputFormat](#hadoop-inputformat)\n- [Configuration](#configuration)\n  - [Authenticating to S3 and Redshift](#authenticating-to-s3-and-redshift)\n  - [Encryption](#encryption)\n  - [Parameters](#parameters)\n- [Additional configuration options](#additional-configuration-options)\n  - [Configuring the maximum size of string columns](#configuring-the-maximum-size-of-string-columns)\n  - [Setting a custom column type](#setting-a-custom-column-type)\n  - [Configuring column encoding](#configuring-column-encoding)\n  - [Setting descriptions on columns](#setting-descriptions-on-columns)\n- [Transactional Guarantees](#transactional-guarantees)\n- [Common problems and solutions](#common-problems-and-solutions)\n  - [S3 bucket and Redshift cluster are in different AWS regions](#s3-bucket-and-redshift-cluster-are-in-different-aws-regions)\n- [Migration Guide](#migration-guide)\n\n## Installation\n\nThis library requires Apache Spark 2.0+ and Amazon Redshift 1.0.963+.\n\nFor a version that works with Spark 1.x, see the [1.x branch](https://github.com/databricks/spark-redshift/tree/branch-1.x).\n\nYou may use this library in your applications with the following dependency information:\n\n**Scala 2.10**\n\n```\ngroupId: com.databricks\nartifactId: spark-redshift_2.10\nversion: 3.0.0-preview1\n```\n\n**Scala 2.11**\n```\ngroupId: 
com.databricks\nartifactId: spark-redshift_2.11\nversion: 3.0.0-preview1\n```\n\nYou will also need to provide a JDBC driver that is compatible with Redshift. Amazon recommends that you use [their driver](http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html), which is distributed as a JAR that is hosted on Amazon's website. This library has also been successfully tested using the Postgres JDBC driver.\n\n**Note on Hadoop versions**: This library depends on [`spark-avro`](https://github.com/databricks/spark-avro), which should automatically be downloaded because it is declared as a dependency. However, you may need to provide the corresponding `avro-mapred` dependency that matches your Hadoop distribution. In most deployments, however, this dependency will be automatically provided by your cluster's Spark assemblies and no additional action will be required.\n\n**Note on Amazon SDK dependency**: This library declares a `provided` dependency on components of the AWS Java SDK. In most cases, these libraries will be provided by your deployment environment. However, if you get `ClassNotFoundException`s for Amazon SDK classes then you will need to add explicit dependencies on `com.amazonaws.aws-java-sdk-core` and `com.amazonaws.aws-java-sdk-s3` as part of your build / runtime configuration. See the comments in `project/SparkRedshiftBuild.scala` for more details.\n\n### Snapshot builds\n\nMaster snapshot builds of this library are built using [jitpack.io](https://jitpack.io/). 
In order\nto use these snapshots in your build, you'll need to add the JitPack repository to your build file.\n\n- **In Maven**:\n   ```\n   \u003crepositories\u003e\n      \u003crepository\u003e\n        \u003cid\u003ejitpack.io\u003c/id\u003e\n        \u003curl\u003ehttps://jitpack.io\u003c/url\u003e\n      \u003c/repository\u003e\n   \u003c/repositories\u003e\n   ```\n\n   then\n\n   ```\n   \u003cdependency\u003e\n     \u003cgroupId\u003ecom.github.databricks\u003c/groupId\u003e\n     \u003cartifactId\u003espark-redshift_2.10\u003c/artifactId\u003e  \u003c!-- For Scala 2.11, use spark-redshift_2.11 instead --\u003e\n     \u003cversion\u003emaster-SNAPSHOT\u003c/version\u003e\n   \u003c/dependency\u003e\n   ```\n\n- **In SBT**:\n   ```\n   resolvers += \"jitpack\" at \"https://jitpack.io\"\n   ```\n\n   then\n\n   ```\n   libraryDependencies += \"com.github.databricks\" %% \"spark-redshift\" % \"master-SNAPSHOT\"\n   ```\n\n- **In Databricks**: use the \"Advanced Options\" toggle in the \"Create Library\" screen to specify\n  a custom Maven repository:\n\n  ![](https://cloud.githubusercontent.com/assets/50748/20371277/6c34a8d2-ac18-11e6-879f-d07320d56fa4.png)\n\n  Use `https://jitpack.io` as the repository.\n\n  - For Scala 2.10: use the coordinate `com.github.databricks:spark-redshift_2.10:master-SNAPSHOT`\n  - For Scala 2.11: use the coordinate `com.github.databricks:spark-redshift_2.11:master-SNAPSHOT`\n\n\n## Usage\n\n### Data Sources API\n\nOnce you have [configured your AWS credentials](#authenticating-to-s3-and-redshift), you can use this library via the Data Sources API in Scala, Python, SQL, or R, as follows:\n\n#### Scala\n\n```scala\nimport org.apache.spark.sql._\n\n// Assumes an existing SparkContext named sc\nval sqlContext = new SQLContext(sc)\n\n// Get some data from a Redshift table\nval df: DataFrame = sqlContext.read\n    .format(\"com.databricks.spark.redshift\")\n    .option(\"url\", \"jdbc:redshift://redshifthost:5439/database?user=username\u0026password=pass\")\n    
.option(\"dbtable\", \"my_table\")\n    .option(\"tempdir\", \"s3n://path/for/temp/data\")\n    .load()\n\n// Can also load data from a Redshift query\nval queryDf: DataFrame = sqlContext.read\n    .format(\"com.databricks.spark.redshift\")\n    .option(\"url\", \"jdbc:redshift://redshifthost:5439/database?user=username\u0026password=pass\")\n    .option(\"query\", \"select x, count(*) from my_table group by x\")\n    .option(\"tempdir\", \"s3n://path/for/temp/data\")\n    .load()\n\n// Apply some transformations to the data as per normal, then you can use the\n// Data Source API to write the data back to another table\n\ndf.write\n  .format(\"com.databricks.spark.redshift\")\n  .option(\"url\", \"jdbc:redshift://redshifthost:5439/database?user=username\u0026password=pass\")\n  .option(\"dbtable\", \"my_table_copy\")\n  .option(\"tempdir\", \"s3n://path/for/temp/data\")\n  .mode(\"error\")\n  .save()\n\n// Using IAM Role based authentication\ndf.write\n  .format(\"com.databricks.spark.redshift\")\n  .option(\"url\", \"jdbc:redshift://redshifthost:5439/database?user=username\u0026password=pass\")\n  .option(\"dbtable\", \"my_table_copy\")\n  .option(\"aws_iam_role\", \"arn:aws:iam::123456789000:role/redshift_iam_role\")\n  .option(\"tempdir\", \"s3n://path/for/temp/data\")\n  .mode(\"error\")\n  .save()\n```\n\n#### Python\n\n```python\nfrom pyspark.sql import SQLContext\n\n# Assumes an existing SparkContext named sc\nsql_context = SQLContext(sc)\n\n# Read data from a table\ndf = sql_context.read \\\n    .format(\"com.databricks.spark.redshift\") \\\n    .option(\"url\", \"jdbc:redshift://redshifthost:5439/database?user=username\u0026password=pass\") \\\n    .option(\"dbtable\", \"my_table\") \\\n    .option(\"tempdir\", \"s3n://path/for/temp/data\") \\\n    .load()\n\n# Read data from a query\ndf = sql_context.read \\\n    .format(\"com.databricks.spark.redshift\") \\\n    .option(\"url\", \"jdbc:redshift://redshifthost:5439/database?user=username\u0026password=pass\") \\\n    
.option(\"query\", \"select x, count(*) from my_table group by x\") \\\n    .option(\"tempdir\", \"s3n://path/for/temp/data\") \\\n    .load()\n\n# Write back to a table\ndf.write \\\n  .format(\"com.databricks.spark.redshift\") \\\n  .option(\"url\", \"jdbc:redshift://redshifthost:5439/database?user=username\u0026password=pass\") \\\n  .option(\"dbtable\", \"my_table_copy\") \\\n  .option(\"tempdir\", \"s3n://path/for/temp/data\") \\\n  .mode(\"error\") \\\n  .save()\n\n# Using IAM Role based authentication\ndf.write \\\n  .format(\"com.databricks.spark.redshift\") \\\n  .option(\"url\", \"jdbc:redshift://redshifthost:5439/database?user=username\u0026password=pass\") \\\n  .option(\"dbtable\", \"my_table_copy\") \\\n  .option(\"tempdir\", \"s3n://path/for/temp/data\") \\\n  .option(\"aws_iam_role\", \"arn:aws:iam::123456789000:role/redshift_iam_role\") \\\n  .mode(\"error\") \\\n  .save()\n```\n\n#### SQL\n\nReading data using SQL:\n\n```sql\nCREATE TABLE my_table\nUSING com.databricks.spark.redshift\nOPTIONS (\n  dbtable 'my_table',\n  tempdir 's3n://path/for/temp/data',\n  url 'jdbc:redshift://redshifthost:5439/database?user=username\u0026password=pass'\n);\n```\n\nWriting data using SQL:\n\n```sql\n-- Create a new table, throwing an error if a table with the same name already exists:\nCREATE TABLE my_table\nUSING com.databricks.spark.redshift\nOPTIONS (\n  dbtable 'my_table',\n  tempdir 's3n://path/for/temp/data',\n  url 'jdbc:redshift://redshifthost:5439/database?user=username\u0026password=pass'\n)\nAS SELECT * FROM tabletosave;\n```\n\nNote that the SQL API only supports the creation of new tables and not overwriting or appending; this corresponds to the default save mode of the other language APIs.\n\n#### R\n\nReading data using R:\n\n```R\ndf \u003c- read.df(\n   NULL,\n   \"com.databricks.spark.redshift\",\n   tempdir = \"s3n://path/for/temp/data\",\n   dbtable = \"my_table\",\n   url = 
\"jdbc:redshift://redshifthost:5439/database?user=username\u0026password=pass\")\n```\n\n### Hadoop InputFormat\n\nThe library contains a Hadoop input format for Redshift tables unloaded with the ESCAPE option,\nwhich you may make direct use of as follows:\n\n```scala\nimport com.databricks.spark.redshift.RedshiftInputFormat\n\nval records = sc.newAPIHadoopFile(\n  path,\n  classOf[RedshiftInputFormat],\n  classOf[java.lang.Long],\n  classOf[Array[String]])\n```\n\n## Configuration\n\n### Authenticating to S3 and Redshift\n\nThe use of this library involves several connections which must be authenticated / secured, all of\nwhich are illustrated in the following diagram:\n\n```\n                            ┌───────┐\n       ┌───────────────────▶│  S3   │◀─────────────────┐\n       │    IAM or keys     └───────┘    IAM or keys   │\n       │                        ▲                      │\n       │                        │ IAM or keys          │\n       ▼                        ▼               ┌──────▼────┐\n┌────────────┐            ┌───────────┐         │┌──────────┴┐\n│  Redshift  │            │   Spark   │         ││   Spark   │\n│            │◀──────────▶│  Driver   │◀────────▶┤ Executors │\n└────────────┘            └───────────┘          └───────────┘\n               JDBC with                  Configured\n               username /                     in\n                password                    Spark\n            (can enable SSL)\n```\n\nThis library reads and writes data to S3 when transferring data to/from Redshift. 
As a result, it\nrequires AWS credentials with read and write access to a S3 bucket (specified using the `tempdir`\nconfiguration parameter).\n\n\u003e **:warning: Note**: This library does not clean up the temporary files that it creates in S3.\n\u003e As a result, we recommend that you use a dedicated temporary S3 bucket with an\n\u003e [object lifecycle configuration](http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html)\n\u003e to ensure that temporary files are automatically deleted after a specified expiration period.\n\u003e See the [_Encryption_](#encryption) section of this document for a discussion of how these files\n\u003e may be encrypted.\n\nThe following describes how each connection can be authenticated:\n\n- **Spark driver to Redshift**: The Spark driver connects to Redshift via JDBC using a username and password.\n    Redshift does not support the use of IAM roles to authenticate this connection.\n    This connection can be secured using SSL; for more details, see the Encryption section below.\n\n- **Spark to S3**: S3 acts as a middleman to store bulk data when reading from or writing to Redshift.\n    Spark connects to S3 using both the Hadoop FileSystem interfaces and directly using the Amazon\n    Java SDK's S3 client.\n\n    This connection can be authenticated using either AWS keys or IAM roles (DBFS mountpoints are\n    not currently supported, so Databricks users who do not want to rely on AWS keys should use\n    cluster IAM roles instead).\n\n    There are multiple ways of providing these credentials:\n\n    1. **Default Credential Provider Chain (best option for most users):**\n        AWS credentials will automatically be retrieved through the [DefaultAWSCredentialsProviderChain](http://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html#id6).\n\n        If you use IAM instance roles to authenticate to S3 (e.g. 
on Databricks, EMR, or EC2), then\n        you should probably use this method.\n\n        If another method of providing credentials is used (methods 2 or 3), then that will take\n        precedence over this default.\n\n    2. **Set keys in Hadoop conf:** You can specify AWS keys via\n        [Hadoop configuration properties](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md).\n        For example, if your `tempdir` configuration points to a `s3n://` filesystem then you can\n        set the `fs.s3n.awsAccessKeyId` and `fs.s3n.awsSecretAccessKey` properties in a Hadoop XML\n        configuration file or call `sc.hadoopConfiguration.set()` to mutate Spark's global Hadoop\n        configuration.\n\n        For example, if you are using the `s3n` filesystem then add\n\n        ```scala\n        sc.hadoopConfiguration.set(\"fs.s3n.awsAccessKeyId\", \"YOUR_KEY_ID\")\n        sc.hadoopConfiguration.set(\"fs.s3n.awsSecretAccessKey\", \"YOUR_SECRET_ACCESS_KEY\")\n        ```\n\n        and for the `s3a` filesystem add\n\n        ```scala\n        sc.hadoopConfiguration.set(\"fs.s3a.access.key\", \"YOUR_KEY_ID\")\n        sc.hadoopConfiguration.set(\"fs.s3a.secret.key\", \"YOUR_SECRET_ACCESS_KEY\")\n        ```\n\n        Python users will have to use a slightly different method to modify the `hadoopConfiguration`,\n        since this field is not exposed in all versions of PySpark. Although the following command\n        relies on some Spark internals, it should work with all PySpark versions and is unlikely to\n        break or change in the future:\n\n        ```python\n        sc._jsc.hadoopConfiguration().set(\"fs.s3n.awsAccessKeyId\", \"YOUR_KEY_ID\")\n        sc._jsc.hadoopConfiguration().set(\"fs.s3n.awsSecretAccessKey\", \"YOUR_SECRET_ACCESS_KEY\")\n        ```\n\n    3. 
**Encode keys in `tempdir` URI**:\n     For example, the URI `s3n://ACCESSKEY:SECRETKEY@bucket/path/to/temp/dir` encodes the key pair\n      (`ACCESSKEY`, `SECRETKEY`).\n\n      Due to [Hadoop limitations](https://issues.apache.org/jira/browse/HADOOP-3733), this\n      approach will not work for secret keys which contain forward slash (`/`) characters, even if\n      those characters are urlencoded.\n\n- **Redshift to S3**: Redshift also connects to S3 during `COPY` and `UNLOAD` queries. There are\n    three methods of authenticating this connection:\n\n    1. **Have Redshift assume an IAM role (most secure)**: You can grant Redshift permission to assume\n        an IAM role during `COPY` or `UNLOAD` operations and then configure this library to instruct\n        Redshift to use that role:\n\n        1. Create an IAM role granting appropriate S3 permissions to your bucket.\n        2. Follow the guide\n        [_Authorizing Amazon Redshift to Access Other AWS Services On Your Behalf_](http://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html)\n        to configure this role's trust policy in order to allow Redshift to assume this role.\n        3. Follow the steps in the\n        [_Authorizing COPY and UNLOAD Operations Using IAM Roles_](http://docs.aws.amazon.com/redshift/latest/mgmt/copy-unload-iam-role.html)\n        guide to associate that IAM role with your Redshift cluster.\n        4. Set this library's `aws_iam_role` option to the role's ARN.\n    2. **Forward Spark's S3 credentials to Redshift**: if the `forward_spark_s3_credentials` option is\n        set to `true` then this library will automatically discover the credentials that Spark is\n        using to connect to S3 and will forward those credentials to Redshift over JDBC. If Spark\n        is authenticating to S3 using an IAM instance role then a set of temporary STS credentials\n        will be passed to Redshift; otherwise, AWS keys will be passed. 
These credentials are\n        sent as part of the JDBC query, so therefore it is **strongly recommended** to enable SSL\n        encryption of the JDBC connection when using this authentication method.\n    3. **Use Security Token Service (STS) credentials**: You may configure the\n        `temporary_aws_access_key_id`, `temporary_aws_secret_access_key`, and\n        `temporary_aws_session_token` configuration properties to point to temporary keys created\n        via the AWS\n        [Security Token Service](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp.html).\n        These credentials are sent as part of the JDBC query, so therefore it is\n        **strongly recommended** to enable SSL encryption of the JDBC connection when using this\n        authentication method.\n        If you choose this option then please be aware of the risk that the credentials expire before\n        the read / write operation succeeds.\n\n    These three options are mutually-exclusive and you must explicitly choose which one to use.\n\n\n### Encryption\n\n- **Securing JDBC**: The Redshift and Postgres JDBC drivers both support SSL. To enable SSL support,\n    first configure Java to add the required certificates by following the\n    [_Using SSL and Server Certificates in Java_](http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-ssl-support.html#connecting-ssl-support-java)\n    instructions in the Redshift documentation. 
Then, follow the instructions in\n    [_JDBC Driver Configuration Options_](http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-options.html) to add the appropriate SSL options\n    to the JDBC `url` used with this library.\n\n- **Encrypting `UNLOAD` data stored in S3 (data stored when reading from Redshift)**: According to the Redshift documentation\n    on [_Unloading Data to S3_](http://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html),\n    \"UNLOAD automatically encrypts data files using Amazon S3 server-side encryption (SSE-S3).\"\n\n    Redshift also supports client-side encryption with a custom key\n    (see: [_Unloading Encrypted Data Files_](http://docs.aws.amazon.com/redshift/latest/dg/t_unloading_encrypted_files.html))\n    but this library currently lacks the capability to specify the required symmetric key.\n\n- **Encrypting `COPY` data stored in S3 (data stored when writing to Redshift)**:\n    According to the Redshift documentation on\n    [_Loading Encrypted Data Files from Amazon S3_](http://docs.aws.amazon.com/redshift/latest/dg/c_loading-encrypted-files.html):\n\n    \u003e You can use the COPY command to load data files that were uploaded to Amazon S3 using\n    \u003e server-side encryption with AWS-managed encryption keys (SSE-S3 or SSE-KMS), client-side\n    \u003e encryption, or both. 
COPY does not support Amazon S3 server-side encryption with a customer-supplied key (SSE-C)\n\n    To use this capability, you should configure your Hadoop S3 FileSystem to use encryption by\n    setting the appropriate configuration properties (which will vary depending on whether you\n    are using `s3a`, `s3n`, EMRFS, etc.).\n    Note that the `MANIFEST` file (a list of all files written) will not be encrypted.\n\n\n### Parameters\n\nThe parameter map or \u003ctt\u003eOPTIONS\u003c/tt\u003e provided in Spark SQL supports the following settings.\n\n\u003ctable\u003e\n \u003ctr\u003e\n    \u003cth\u003eParameter\u003c/th\u003e\n    \u003cth\u003eRequired\u003c/th\u003e\n    \u003cth\u003eDefault\u003c/th\u003e\n    \u003cth\u003eNotes\u003c/th\u003e\n \u003c/tr\u003e\n\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003edbtable\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eYes, unless \u003ctt\u003equery\u003c/tt\u003e is specified\u003c/td\u003e\n    \u003ctd\u003eNo default\u003c/td\u003e\n    \u003ctd\u003eThe table to create or read from in Redshift. This parameter is required when saving data back to Redshift.\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003equery\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eYes, unless \u003ctt\u003edbtable\u003c/tt\u003e is specified\u003c/td\u003e\n    \u003ctd\u003eNo default\u003c/td\u003e\n    \u003ctd\u003eThe query to read from in Redshift\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003euser\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003eNo default\u003c/td\u003e\n    \u003ctd\u003eThe Redshift username.  Must be used in tandem with \u003ctt\u003epassword\u003c/tt\u003e option.  
May only be used if the user and password are not passed in the URL; passing both will result in an error.\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003epassword\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003eNo default\u003c/td\u003e\n    \u003ctd\u003eThe Redshift password.  Must be used in tandem with \u003ctt\u003euser\u003c/tt\u003e option.  May only be used if the user and password are not passed in the URL; passing both will result in an error.\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003eurl\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eYes\u003c/td\u003e\n    \u003ctd\u003eNo default\u003c/td\u003e\n    \u003ctd\u003e\n\u003cp\u003eA JDBC URL of the form \u003ctt\u003ejdbc:subprotocol://host:port/database?user=username\u0026password=password\u003c/tt\u003e\u003c/p\u003e\n\n\u003cul\u003e\n \u003cli\u003e\u003ctt\u003esubprotocol\u003c/tt\u003e can be \u003ctt\u003epostgresql\u003c/tt\u003e or \u003ctt\u003eredshift\u003c/tt\u003e, depending on which JDBC driver\n    you have loaded. Note however that one Redshift-compatible driver must be on the classpath and match\n    this URL.\u003c/li\u003e\n \u003cli\u003e\u003ctt\u003ehost\u003c/tt\u003e and \u003ctt\u003eport\u003c/tt\u003e should point to the Redshift master node, so security groups and/or VPC will\nneed to be configured to allow access from your driver application.\u003c/li\u003e\n \u003cli\u003e\u003ctt\u003edatabase\u003c/tt\u003e identifies a Redshift database name\u003c/li\u003e\n \u003cli\u003e\u003ctt\u003euser\u003c/tt\u003e and \u003ctt\u003epassword\u003c/tt\u003e are credentials to access the database, which must be embedded\n    in this URL for JDBC, and your user account should have necessary privileges for the table being referenced. 
\u003c/li\u003e\n\u003c/ul\u003e\n    \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n   \u003ctd\u003e\u003ctt\u003eaws_iam_role\u003c/tt\u003e\u003c/td\u003e\n   \u003ctd\u003eOnly if using IAM roles to authorize Redshift COPY/UNLOAD operations\u003c/td\u003e\n   \u003ctd\u003eNo default\u003c/td\u003e\n   \u003ctd\u003eFully specified ARN of the \u003ca href=\"http://docs.aws.amazon.com/redshift/latest/mgmt/copy-unload-iam-role.html\"\u003eIAM Role\u003c/a\u003e attached to the Redshift cluster, ex: arn:aws:iam::123456789000:role/redshift_iam_role\u003c/td\u003e\n \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003eforward_spark_s3_credentials\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003efalse\u003c/td\u003e\n    \u003ctd\u003e\n        If \u003ctt\u003etrue\u003c/tt\u003e then this library will automatically discover the credentials that Spark is\n        using to connect to S3 and will forward those credentials to Redshift over JDBC.\n        These credentials are sent as part of the JDBC query, so therefore it is strongly\n        recommended to enable SSL encryption of the JDBC connection when using this option.\n    \u003c/td\u003e\n  \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003etemporary_aws_access_key_id\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003eNo default\u003c/td\u003e\n    \u003ctd\u003eAWS access key, must have write permissions to the S3 bucket.\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003etemporary_aws_secret_access_key\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003eNo default\u003c/td\u003e\n    \u003ctd\u003eAWS secret access key corresponding to provided access key.\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003etemporary_aws_session_token\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003eNo 
default\u003c/td\u003e\n    \u003ctd\u003eAWS session token corresponding to provided access key.\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003etempdir\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eYes\u003c/td\u003e\n    \u003ctd\u003eNo default\u003c/td\u003e\n    \u003ctd\u003eA writeable location in Amazon S3, to be used for unloaded data when reading and Avro data to be loaded into\nRedshift when writing. If you're using Redshift data source for Spark as part of a regular ETL pipeline, it can be useful to\nset a \u003ca href=\"http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html\"\u003eLifecycle Policy\u003c/a\u003e on a bucket\nand use that as a temp location for this data.\n    \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003ejdbcdriver\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003eDetermined by the JDBC URL's subprotocol\u003c/td\u003e\n    \u003ctd\u003eThe class name of the JDBC driver to use. This class must be on the classpath. In most cases, it should not be necessary to specify this option, as the appropriate driver classname should automatically be determined by the JDBC URL's subprotocol.\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003ediststyle\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003e\u003ctt\u003eEVEN\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eThe Redshift \u003ca href=\"http://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html\"\u003eDistribution Style\u003c/a\u003e to\nbe used when creating a table. Can be one of \u003ctt\u003eEVEN\u003c/tt\u003e, \u003ctt\u003eKEY\u003c/tt\u003e or \u003ctt\u003eALL\u003c/tt\u003e (see Redshift docs). 
When using \u003ctt\u003eKEY\u003c/tt\u003e, you\nmust also set a distribution key with the \u003ctt\u003edistkey\u003c/tt\u003e option.\n    \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003edistkey\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eNo, unless using \u003ctt\u003eDISTSTYLE KEY\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eNo default\u003c/td\u003e\n    \u003ctd\u003eThe name of a column in the table to use as the distribution key when creating a table.\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003esortkeyspec\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003eNo default\u003c/td\u003e\n    \u003ctd\u003e\n\u003cp\u003eA full Redshift \u003ca href=\"http://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html\"\u003eSort Key\u003c/a\u003e definition.\u003c/p\u003e\n\n\u003cp\u003eExamples include:\u003c/p\u003e\n\u003cul\u003e\n    \u003cli\u003e\u003ctt\u003eSORTKEY(my_sort_column)\u003c/tt\u003e\u003c/li\u003e\n    \u003cli\u003e\u003ctt\u003eCOMPOUND SORTKEY(sort_col_1, sort_col_2)\u003c/tt\u003e\u003c/li\u003e\n    \u003cli\u003e\u003ctt\u003eINTERLEAVED SORTKEY(sort_col_1, sort_col_2)\u003c/tt\u003e\u003c/li\u003e\n\u003c/ul\u003e\n    \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003cdel\u003e\u003ctt\u003eusestagingtable\u003c/tt\u003e\u003c/del\u003e (Deprecated)\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003e\u003ctt\u003etrue\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003e\n    \u003cp\u003e\n    Setting this deprecated option to \u003ctt\u003efalse\u003c/tt\u003e will cause an overwrite operation's destination table to be dropped immediately at the beginning of the write, making the overwrite operation non-atomic and reducing the availability of the destination table. 
This may reduce the temporary disk space requirements for overwrites.\n    \u003c/p\u003e\n\n    \u003cp\u003eSince setting \u003ctt\u003eusestagingtable=false\u003c/tt\u003e risks data loss and unavailability, we have chosen to deprecate it in favor of requiring users to manually drop the destination table themselves.\u003c/p\u003e\n    \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003edescription\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003eNo default\u003c/td\u003e\n    \u003ctd\u003e\n\u003cp\u003eA description for the table. It is set using the SQL \u003ctt\u003eCOMMENT\u003c/tt\u003e command and should show up in most query tools.\nSee also the \u003ctt\u003edescription\u003c/tt\u003e metadata to set descriptions on individual columns.\u003c/p\u003e\n    \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003epreactions\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003eNo default\u003c/td\u003e\n    \u003ctd\u003e\n\u003cp\u003eThis can be a \u003ctt\u003e;\u003c/tt\u003e separated list of SQL commands to be executed before the \u003ctt\u003eCOPY\u003c/tt\u003e command that loads the data.\nIt may be useful to have some \u003ctt\u003eDELETE\u003c/tt\u003e commands or similar run here before loading new data. If the command contains\n\u003ctt\u003e%s\u003c/tt\u003e, the table name will be formatted in before execution (in case you're using a staging table).\u003c/p\u003e\n\n\u003cp\u003eBe warned that if these commands fail, the failure is treated as an error and an exception is thrown. 
If using a staging\ntable, the changes will be reverted and the backup table restored if pre actions fail.\u003c/p\u003e\n    \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003epostactions\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003eNo default\u003c/td\u003e\n    \u003ctd\u003e\n\u003cp\u003eThis can be a \u003ctt\u003e;\u003c/tt\u003e separated list of SQL commands to be executed after a successful \u003ctt\u003eCOPY\u003c/tt\u003e when loading data.\nIt may be useful to have some \u003ctt\u003eGRANT\u003c/tt\u003e commands or similar run here when loading new data. If the command contains\n\u003ctt\u003e%s\u003c/tt\u003e, the table name will be formatted in before execution (in case you're using a staging table).\u003c/p\u003e\n\n\u003cp\u003eBe warned that if these commands fail, the failure is treated as an error and an exception is thrown. If using a staging\ntable, the changes will be reverted and the backup table restored if post actions fail.\u003c/p\u003e\n    \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003eextracopyoptions\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003eNo default\u003c/td\u003e\n    \u003ctd\u003e\n\u003cp\u003eA list of extra options to append to the Redshift \u003ctt\u003eCOPY\u003c/tt\u003e command when loading data, e.g. 
\u003ctt\u003eTRUNCATECOLUMNS\u003c/tt\u003e\nor \u003ctt\u003eMAXERROR n\u003c/tt\u003e (see the \u003ca href=\"http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html#r_COPY-syntax-overview-optional-parameters\"\u003eRedshift docs\u003c/a\u003e\nfor other options).\u003c/p\u003e\n\n\u003cp\u003eNote that since these options are appended to the end of the \u003ctt\u003eCOPY\u003c/tt\u003e command, only options that make sense\nat the end of the command can be used, but that should cover most possible use cases.\u003c/p\u003e\n    \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003etempformat\u003c/tt\u003e  (Experimental)\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003e\u003ctt\u003eAVRO\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003e\n    \u003cp\u003e\n        The format in which to save temporary files in S3 when writing to Redshift.\n        Defaults to \"AVRO\"; the other allowed values are \"CSV\" and \"CSV GZIP\" for CSV\n        and gzipped CSV, respectively.\n    \u003c/p\u003e\n    \u003cp\u003e\n        Redshift is significantly faster when loading CSV than when loading Avro files, so\n        using a CSV \u003ctt\u003etempformat\u003c/tt\u003e may provide a large performance boost when writing\n        to Redshift.\n    \u003c/p\u003e\n    \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n    \u003ctd\u003e\u003ctt\u003ecsvnullstring\u003c/tt\u003e  (Experimental)\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003e\u003ctt\u003e@NULL@\u003c/tt\u003e\u003c/td\u003e\n    \u003ctd\u003e\n    \u003cp\u003e\n        The String value to write for nulls when using the CSV \u003ctt\u003etempformat\u003c/tt\u003e.\n        This should be a value which does not appear in your actual data.\n    \u003c/p\u003e\n    \u003c/td\u003e\n \u003c/tr\u003e\n\u003c/table\u003e\n\n## Additional configuration options\n\n### Configuring the maximum size of string columns\n\nWhen creating Redshift 
tables, this library's default behavior is to create `TEXT` columns for string columns. Redshift stores `TEXT` columns as `VARCHAR(256)`, so these columns have a maximum size of 256 characters ([source](http://docs.aws.amazon.com/redshift/latest/dg/r_Character_types.html)).\n\nTo support larger columns, you can use the `maxlength` column metadata field to specify the maximum length of individual string columns. This can also be done as a space-savings performance optimization in order to declare columns with a smaller maximum length than the default.\n\n\u003e **:warning: Note**: Due to limitations in Spark, metadata modification is unsupported in the Python, SQL, and R language APIs.\n\nHere is an example of updating multiple columns' metadata fields using Spark's Scala API:\n\n```scala\nimport org.apache.spark.sql.types.MetadataBuilder\n\n// Specify the custom width of each column\nval columnLengthMap = Map(\n  \"language_code\" -\u003e 2,\n  \"country_code\" -\u003e 2,\n  \"url\" -\u003e 2083\n)\n\nvar df = ... // the dataframe you'll want to write to Redshift\n\n// Apply each column metadata customization\ncolumnLengthMap.foreach { case (colName, length) =\u003e\n  val metadata = new MetadataBuilder().putLong(\"maxlength\", length).build()\n  df = df.withColumn(colName, df(colName).as(colName, metadata))\n}\n\ndf.write\n  .format(\"com.databricks.spark.redshift\")\n  .option(\"url\", jdbcURL)\n  .option(\"tempdir\", s3TempDirectory)\n  .option(\"dbtable\", sessionTable)\n  .save()\n```\n\n### Setting a custom column type\n\nIf you need to manually set a column type, you can use the `redshift_type` column metadata. 
For example, if you want to override\nthe `Spark SQL Schema -\u003e Redshift SQL` type matcher to assign a user-defined column type, you can do the following:\n\n```scala\nimport org.apache.spark.sql.types.MetadataBuilder\n\n// Specify the custom Redshift type of each column\nval columnTypeMap = Map(\n  \"language_code\" -\u003e \"CHAR(2)\",\n  \"country_code\" -\u003e \"CHAR(2)\",\n  \"url\" -\u003e \"BPCHAR(111)\"\n)\n\nvar df = ... // the dataframe you'll want to write to Redshift\n\n// Apply each column metadata customization\ncolumnTypeMap.foreach { case (colName, colType) =\u003e\n  val metadata = new MetadataBuilder().putString(\"redshift_type\", colType).build()\n  df = df.withColumn(colName, df(colName).as(colName, metadata))\n}\n```\n\n### Configuring column encoding\n\nWhen creating a table, this library can be configured to use a specific compression encoding on individual columns. You can use the `encoding` column metadata field to specify a compression encoding for each column (see [Amazon docs](http://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html) for available encodings).\n\n### Setting descriptions on columns\n\nRedshift allows columns to have descriptions attached that should show up in most query tools (using the `COMMENT` command). You can set the `description` column metadata field to specify a description for individual columns.\n\n## Transactional Guarantees\n\nThis section describes the transactional guarantees of the Redshift data source for Spark.\n\n### General background on Redshift and S3's properties\n\nFor general information on Redshift's transactional guarantees, see the [Managing Concurrent Write Operations](https://docs.aws.amazon.com/redshift/latest/dg/c_Concurrent_writes.html) chapter in the Redshift documentation. 
In a nutshell, Redshift provides [serializable isolation](https://docs.aws.amazon.com/redshift/latest/dg/c_serial_isolation.html) (according to the documentation for Redshift's [`BEGIN`](https://docs.aws.amazon.com/redshift/latest/dg/r_BEGIN.html) command, \"[although] you can use any of the four transaction isolation levels, Amazon Redshift processes all isolation levels as serializable\"). According to its [documentation](https://docs.aws.amazon.com/redshift/latest/dg/c_serial_isolation.html), \"Amazon Redshift supports a default _automatic commit_ behavior in which each separately-executed SQL command commits individually.\" Thus, individual commands like `COPY` and `UNLOAD` are atomic and transactional, while explicit `BEGIN` and `END` should only be necessary to enforce the atomicity of multiple commands / queries.\n\nWhen reading from / writing to Redshift, this library reads and writes data in S3. Both Spark and Redshift produce partitioned output which is stored in multiple files in S3. According to the [Amazon S3 Data Consistency Model](https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel) documentation, S3 bucket listing operations are eventually-consistent, so this library must go to special lengths to avoid missing or incomplete data due to this source of eventual consistency.\n\n### Guarantees of the Redshift data source for Spark\n\n**Appending to an existing table**: When inserting rows into Redshift, this library uses the [`COPY`](https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) command and specifies [manifests](https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html) to guard against certain eventually-consistent S3 operations. 
As a result, appends to existing tables have the same atomic and transactional properties as regular Redshift `COPY` commands.\n\n**Creating a new table (`SaveMode.CreateIfNotExists`)**: Creating a new table is a two-step process, consisting of a `CREATE TABLE` command followed by a [`COPY`](https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) command to append the initial set of rows. Both of these operations are performed in a single transaction.\n\n**Overwriting an existing table**: By default, this library uses transactions to perform overwrites, which are implemented by deleting the destination table, creating a new empty table, and appending rows to it.\n\nIf the deprecated `usestagingtable` setting is set to `false`, then this library will commit the `DROP TABLE` command before appending rows to the new table, sacrificing the atomicity of the overwrite operation but reducing the amount of staging space that Redshift needs during the overwrite.\n\n**Querying Redshift tables**: Queries use Redshift's [`UNLOAD`](https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html) command to execute a query and save its results to S3 and use [manifests](https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html) to guard against certain eventually-consistent S3 operations. 
As a result, queries from the Redshift data source for Spark should have the same consistency properties as regular Redshift queries.\n\n## Common problems and solutions\n\n### S3 bucket and Redshift cluster are in different AWS regions\n\nBy default, S3 \u003c-\u003e Redshift copies will not work if the S3 bucket and Redshift cluster are in different AWS regions.\n\nIf you attempt to perform a read of a Redshift table and the regions are mismatched, then you may see a confusing error, such as\n\n```\njava.sql.SQLException: [Amazon](500310) Invalid operation: S3ServiceException:The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.\n```\n\nSimilarly, attempting to write to Redshift using an S3 bucket in a different region may cause the following error:\n\n```\nerror:  Problem reading manifest file - S3ServiceException:The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.,Status 301,Error PermanentRedirect\n```\n\n**For writes:** Redshift's `COPY` command allows the S3 bucket's region to be explicitly specified, so you can make writes to Redshift work properly in these cases by adding\n\n```\nregion 'the-region-name'\n```\n\nto the `extracopyoptions` setting. For example, with a bucket in the US East (Virginia) region and the Scala API, use\n\n```\n.option(\"extracopyoptions\", \"region 'us-east-1'\")\n```\n\n**For reads:** According to [its documentation](http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html), the Redshift `UNLOAD` command does not support writing to a bucket in a different region:\n\n\u003e **Important**\n\u003e\n\u003e The Amazon S3 bucket where Amazon Redshift will write the output files must reside in the same region as your cluster.\n\nAs a result, this use-case is not supported by this library. 
The only workaround is to use a new bucket in the same region as your Redshift cluster.\n\n## Migration Guide\n\n- Version 3.0 now requires `forward_spark_s3_credentials` to be explicitly set before Spark S3\n  credentials will be forwarded to Redshift. Users who use the `aws_iam_role` or `temporary_aws_*`\n  authentication mechanisms will be unaffected by this change. Users who relied on the old default\n  behavior will now need to explicitly set `forward_spark_s3_credentials` to `true` to continue\n  using their previous Redshift to S3 authentication mechanism. For a discussion of the three\n  authentication mechanisms and their security trade-offs, see the [_Authenticating to S3 and\n  Redshift_](#authenticating-to-s3-and-redshift) section of this README.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks%2Fspark-redshift","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabricks%2Fspark-redshift","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks%2Fspark-redshift/lists"}