{"id":16573317,"url":"https://github.com/nightscape/spark-excel","last_synced_at":"2025-05-15T09:06:10.927Z","repository":{"id":10853943,"uuid":"67264828","full_name":"nightscape/spark-excel","owner":"nightscape","description":"A Spark plugin for reading and writing Excel files","archived":false,"fork":false,"pushed_at":"2025-05-07T15:14:49.000Z","size":1438,"stargazers_count":493,"open_issues_count":95,"forks_count":153,"subscribers_count":39,"default_branch":"main","last_synced_at":"2025-05-09T18:59:58.138Z","etag":null,"topics":["data-frame","etl","excel","scala","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nightscape.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2016-09-03T01:36:09.000Z","updated_at":"2025-05-09T02:43:06.000Z","dependencies_parsed_at":"2023-10-13T02:59:12.924Z","dependency_job_id":"3c5e981c-4f64-4cac-a468-287344d8461e","html_url":"https://github.com/nightscape/spark-excel","commit_stats":{"total_commits":755,"total_committers":39,"mean_commits":"19.358974358974358","dds":0.671523178807947,"last_synced_commit":"3696e11d1b04134ab93fd01d3dd0870e7fc44aca"},"previous_names":["nightscape/spark-excel"],"tags_count":94,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nightscape%2Fspark-excel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nightscape%2Fspark-excel/tags","releases_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/nightscape%2Fspark-excel/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nightscape%2Fspark-excel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nightscape","download_url":"https://codeload.github.com/nightscape/spark-excel/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254310515,"owners_count":22049469,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-frame","etl","excel","scala","spark"],"created_at":"2024-10-11T21:40:55.312Z","updated_at":"2025-05-15T09:06:10.903Z","avatar_url":"https://github.com/nightscape.png","language":"Scala","funding_links":[],"categories":["Scala"],"sub_categories":[],"readme":"# Spark Excel Library\n\nA library for querying Excel files with Apache Spark, for Spark SQL and DataFrames.\n\n[![Build Status](https://github.com/nightscape/spark-excel/actions/workflows/ci.yml/badge.svg)](https://github.com/nightscape/spark-excel/actions/workflows/ci.yml)\n[![Maven Central](https://img.shields.io/maven-central/v/dev.mauch/spark-excel_2.13.svg)](https://search.maven.org/artifact/dev.mauch/spark-excel_2.13)\n\n## Co-maintainers wanted\nDue to personal and professional constraints, the development of this library has been rather slow.\nIf you find value in this library, please consider stepping up as a co-maintainer by leaving a comment [here](https://github.dev/mauch/spark-excel/issues/191).\nHelp is very welcome e.g. 
in the following areas:\n\n* Additional features\n* Code improvements and reviews\n* Bug analysis and fixing\n* Documentation improvements\n* Build / test infrastructure\n\n## Requirements\n\nThis library requires Spark 2.0+.\n\nThe following Spark versions are tested automatically:\n```\nspark: [\"2.4.1\", \"2.4.7\", \"2.4.8\", \"3.0.1\", \"3.0.3\", \"3.1.1\", \"3.1.2\", \"3.2.4\", \"3.3.2\", \"3.4.1\"]\n```\nFor more detail, refer to the project CI configuration: [ci.yml](https://github.dev/mauch/spark-excel/blob/main/.github/workflows/ci.yml#L10)\n\n## Linking\nYou can link against this library in your program at the following coordinates:\n\n### Scala 2.12\n```\ngroupId: dev.mauch\nartifactId: spark-excel_2.12\nversion: \u003cspark-version\u003e_0.18.0\n```\n\n### Scala 2.11\n```\ngroupId: dev.mauch\nartifactId: spark-excel_2.11\nversion: \u003cspark-version\u003e_0.13.7\n```\n\n## Using with Spark shell\nThis package can be added to Spark using the `--packages` command line option. For example, to include it when starting the Spark shell:\n\n### Spark compiled with Scala 2.12\n```\n$SPARK_HOME/bin/spark-shell --packages dev.mauch:spark-excel_2.12:\u003cspark-version\u003e_0.18.0\n```\n\n### Spark compiled with Scala 2.11\n```\n$SPARK_HOME/bin/spark-shell --packages dev.mauch:spark-excel_2.11:\u003cspark-version\u003e_0.13.7\n```\n\n## Features\n* This package allows querying Excel spreadsheets as [Spark DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html).\n* From spark-excel [0.14.0](https://github.dev/mauch/spark-excel/releases/tag/v0.14.0) (August 24, 2021), there are two implementations of spark-excel:\n    * Original Spark-Excel with Spark data source API 1.0\n    * Spark-Excel V2 with data source API V2.0+, which supports loading from multiple files, corrupted record handling and some improvements in data type handling.\n      See below for further details.\n\nTo use the V2 implementation, just change your .format from 
`.format(\"dev.mauch.spark.excel\")` to `.format(\"excel\")`.\nSee [below](#excel-api-based-on-datasourcev2) for some details\n\nSee the [changelog](CHANGELOG.md) for latest features, fixes etc.\n\n### Scala API\n__Spark 2.0+:__\n\n\n#### Create a DataFrame from an Excel file\n\n```scala\nimport org.apache.spark.sql._\n\nval spark: SparkSession = ???\nval df = spark.read\n    .format(\"dev.mauch.spark.excel\") // Or .format(\"excel\") for V2 implementation\n    .option(\"dataAddress\", \"'My Sheet'!B3:C35\") // Optional, default: \"A1\"\n    .option(\"header\", \"true\") // Required\n    .option(\"treatEmptyValuesAsNulls\", \"false\") // Optional, default: true\n    .option(\"setErrorCellsToFallbackValues\", \"true\") // Optional, default: false, where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column's data type.\n    .option(\"usePlainNumberFormat\", \"false\") // Optional, default: false, If true, format the cells without rounding and scientific notations\n    .option(\"inferSchema\", \"false\") // Optional, default: false\n    .option(\"addColorColumns\", \"true\") // Optional, default: false\n    .option(\"timestampFormat\", \"MM-dd-yyyy HH:mm:ss\") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]\n    .option(\"dateFormat\", \"yyyyMMdd\") // Optional, default: yyyy-MM-dd\n    .option(\"maxRowsInMemory\", 20) // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)\n    .option(\"maxByteArraySize\", 2147483647) // Optional, default None. See https://poi.apache.org/apidocs/5.0/org/apache/poi/util/IOUtils.html#setByteArrayMaxOverride-int-\n    .option(\"tempFileThreshold\", 10000000) // Optional, default None. Number of bytes at which a zip entry is regarded as too large for holding in memory and the data is put in a temp file instead\n    .option(\"excerptSize\", 10) // Optional, default: 10. 
If set and if schema inferred, number of rows to infer schema from\n    .option(\"workbookPassword\", \"pass\") // Optional, default None. Requires unlimited strength JCE for older JVMs\n    .schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings\n    .load(\"Worktime.xlsx\")\n```\n\nFor convenience, there is an implicit that wraps the `DataFrameReader` returned by `spark.read`\nand provides a `.excel` method which accepts all possible options and provides default values:\n\n```scala\nimport org.apache.spark.sql._\nimport dev.mauch.spark.excel._\n\nval spark: SparkSession = ???\nval df = spark.read.excel(\n    header = true,  // Required\n    dataAddress = \"'My Sheet'!B3:C35\", // Optional, default: \"A1\"\n    treatEmptyValuesAsNulls = false,  // Optional, default: true\n    setErrorCellsToFallbackValues = false, // Optional, default: false, where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column's data type.\n    usePlainNumberFormat = false,  // Optional, default: false. If true, format the cells without rounding and scientific notations\n    inferSchema = false,  // Optional, default: false\n    addColorColumns = true,  // Optional, default: false\n    timestampFormat = \"MM-dd-yyyy HH:mm:ss\",  // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]\n    maxRowsInMemory = 20,  // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)\n    maxByteArraySize = 2147483647,  // Optional, default None. See https://poi.apache.org/apidocs/5.0/org/apache/poi/util/IOUtils.html#setByteArrayMaxOverride-int-\n    tempFileThreshold = 10000000, // Optional, default None. Number of bytes at which a zip entry is regarded as too large for holding in memory and the data is put in a temp file instead\n    excerptSize = 10,  // Optional, default: 10. 
If set and if schema inferred, number of rows to infer schema from\n    workbookPassword = \"pass\"  // Optional, default None. Requires unlimited strength JCE for older JVMs\n).schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings\n .load(\"Worktime.xlsx\")\n```\n\nIf the sheet name is unavailable, it is possible to pass in an index:\n\n```scala\nval df = spark.read.excel(\n  header = true,\n  dataAddress = \"0!B3:C35\"\n).load(\"Worktime.xlsx\")\n```\n\nor to read in the names dynamically:\n\n```scala\nimport dev.mauch.spark.excel.WorkbookReader\nval sheetNames = WorkbookReader( Map(\"path\" -\u003e \"Worktime.xlsx\")\n                               , spark.sparkContext.hadoopConfiguration\n                               ).sheetNames\nval df = spark.read.excel(\n  header = true,\n  dataAddress = sheetNames(0)\n).load(\"Worktime.xlsx\")\n```\n\n#### Create a DataFrame from an Excel file using custom schema\n```scala\nimport org.apache.spark.sql._\nimport org.apache.spark.sql.types._\n\nval peopleSchema = StructType(Array(\n    StructField(\"Name\", StringType, nullable = false),\n    StructField(\"Age\", DoubleType, nullable = false),\n    StructField(\"Occupation\", StringType, nullable = false),\n    StructField(\"Date of birth\", StringType, nullable = false)))\n\nval spark: SparkSession = ???\nval df = spark.read\n    .format(\"dev.mauch.spark.excel\") // Or .format(\"excel\") for V2 implementation\n    .option(\"dataAddress\", \"'Info'!A1\")\n    .option(\"header\", \"true\")\n    .schema(peopleSchema)\n    .load(\"People.xlsx\")\n```\n\n#### Write a DataFrame to an Excel file\n```scala\nimport org.apache.spark.sql._\n\nval df: DataFrame = ???\ndf.write\n  .format(\"dev.mauch.spark.excel\") // Or .format(\"excel\") for V2 implementation\n  .option(\"dataAddress\", \"'My Sheet'!B3:C35\")\n  .option(\"header\", \"true\")\n  .option(\"dateFormat\", \"yy-mmm-d\") // Optional, default: yy-m-d h:mm\n  .option(\"timestampFormat\", \"mm-dd-yyyy 
hh:mm:ss\") // Optional, default: yyyy-mm-dd hh:mm:ss.000\n  .mode(\"append\") // Optional, default: overwrite.\n  .save(\"Worktime2.xlsx\")\n```\n\n#### Data Addresses\nAs you can see in the examples above,\nthe location of data to read or write can be specified with the `dataAddress` option.\n\nThe data address consists of two portions:\n* The sheet name (optional)\n* The cell range\n\nFor example, `'My Sheet'!B3:F35` will read from the sheet `My Sheet` and the cell range `B3:F35`.\n\nThe following rules apply to the sheet name:\n* The sheet name is optional and can be omitted. In that case data is read from the first sheet (the leftmost sheet).\n* If the sheet name consists of digits only (e.g. `001`), spark-excel will first try to read from a sheet named `001`. If no sheet with this name exists, it will read the sheet with index 1 (zero-based, i.e. the second sheet from the left).\n* If you set the Spark option `sheetNameIsRegex` to `true`, the sheet name will be interpreted as a regex pattern. In this case, data of all sheets matching the regex will be read. The data schema for all such sheets must be the same.\n\nThe following cell range formats are supported:\n* `B3`: Start cell of the data.\n  Reading will return all rows below and all columns to the right.\n  Writing will start here and use as many columns and rows as required.\n* `B3:F35`: Cell range of data.\n  Reading will return only rows and columns in the specified range.\n  Writing will start in the first cell (`B3` in this example) and use only the specified columns and rows.\n  If there are more rows or columns in the DataFrame to write, they will be truncated.\n  Make sure this is what you want.\n* `'My Sheet'!B3:F35`: Same as above, but with a specific sheet.\n* `MyTable[#All]`: Table of data.\n  Reading will return all rows and columns in this table.\n  Writing will only write within the current range of the table.\n  No growing of the table will be performed. 
PRs to change this are welcome.\n\n### Excel API based on DataSourceV2\nThe V2 API offers several improvements in file and folder handling\nand works in much the same way as data sources like csv and parquet.\n\nTo use the V2 implementation, just change your .format from `.format(\"dev.mauch.spark.excel\")` to `.format(\"excel\")`.\n\nThe big difference is that you provide a path to read from or write to,\nrather than only a single file:\n\n```scala\ndataFrame.write\n        .format(\"excel\")\n        .save(\"some/path\")\n```\n\n```scala\nspark.read\n        .format(\"excel\")\n        // ... insert excel read specific options you need\n        .load(\"some/path\")\n```\n\n\nBecause folders are supported, you can read from and write to a \"partitioned\" folder structure, just\nthe same way as with csv or parquet. Note that writing partitioned structures is only\navailable for Spark \u003e=3.0.1.\n\n```scala\ndataFrame.write\n        .partitionBy(\"col1\")\n        .format(\"excel\")\n        .save(\"some/path\")\n```\n\nNeed some more examples? Check out the [test cases](src/test/scala/dev/mauch/spark/excel/v2/DataFrameWriterApiComplianceSuite.scala)\nor have a look at our wiki.\n\n## Building From Source\nThis library is built with [Mill](https://github.com/com-lihaoyi/mill).\nTo build a JAR file simply run e.g. 
`mill spark-excel[2.13.10,3.3.1].assembly` from the project root, where `2.13.10` is the Scala version and `3.3.1` the Spark version.\nTo list all available combinations of Scala and Spark, run `mill resolve spark-excel[__]`.\n\n## Acknowledgements\n\nThis project was originally developed at [crealytics](https://crealytics.com), an award-winning full-funnel digital marketing agency with over 15 years of experience crafting omnichannel media strategies for leading B2C and B2B businesses.\nWe are grateful for their support in the initial development and open-sourcing of this library.\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=nightscape/spark-excel\u0026type=Date)](https://star-history.com/#nightscape/spark-excel\u0026Date)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnightscape%2Fspark-excel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnightscape%2Fspark-excel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnightscape%2Fspark-excel/lists"}