https://github.com/rejeb/netcdf-spark-parser
Scala/Spark Netcdf for reading Netcdf files
- Host: GitHub
- URL: https://github.com/rejeb/netcdf-spark-parser
- Owner: rejeb
- Created: 2025-03-30T17:01:01.000Z (6 months ago)
- Default Branch: master
- Last Pushed: 2025-03-30T17:42:30.000Z (6 months ago)
- Last Synced: 2025-03-30T18:20:39.981Z (6 months ago)
- Topics: netcdf-java, scala, spark, spark-datasource
- Language: Scala
- Homepage:
- Size: 88.9 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# NetCDF Spark Parser
A Spark connector for efficiently parsing and reading **NetCDF** files at scale using **Apache Spark**.
This project leverages the **DataSource V2** API to integrate NetCDF file reading in a distributed and performant way.
This parser uses [NetCDF Java](https://www.unidata.ucar.edu/software/netcdf-java/) to read data from NetCDF files.

---
## 🚀 Features

- **Custom Schema Support**: Define the schema for NetCDF variables.
- **Partition Handling**: Automatically manages partitions for large NetCDF files.
- **Scalable Performance**: Optimized for distributed computing with Spark.
- **Storage Compatibility**: This connector supports reading NetCDF files from:
- Local file systems (tested).
- Amazon S3, see [Dataset URLs](https://docs.unidata.ucar.edu/netcdf-java/5.6/userguide/dataset_urls.html) for configuration (tested).
---

## 📋 Requirements
- **Java**: Version 11+
- **Apache Spark**: Version 3.5.x
- **Scala**: Version 2.12 or 2.13
- **Dependency Management**: SBT, Maven, or similar
- **Unidata repository**: Add the Unidata repository; see [Using netCDF-Java Maven Artifacts](https://docs.unidata.ucar.edu/netcdf-java/current/userguide/using_netcdf_java_artifacts.html)

---
## 🧰 Use Cases
- Transform multi-dimensional data into tabular form.
- Processing climate and oceanographic data.
- Analyzing multi-dimensional scientific datasets.
- Batch processing of NetCDF files.

## 📖 Usage
Loading data from a NetCDF file into a DataFrame requires that the variables to extract share at least one common dimension.
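To make the shared-dimension requirement concrete, here is a hypothetical sketch in plain Scala (not the connector's actual code) of how two variables defined along the same dimension flatten into tabular rows, one row per index of that dimension:

``` scala
// Hypothetical sketch: two NetCDF variables, temperature and humidity, both
// defined along a common "time" dimension of size 3. Because they share that
// dimension, they can be zipped index-by-index into rows of a table.
val time = Seq("2025-01-01", "2025-01-02", "2025-01-03")
val temperature = Seq(11.5f, 12.0f, 9.8f)
val humidity = Seq(0.71f, 0.64f, 0.80f)

// One row per position along the shared dimension.
val rows = time.indices.map(i => (time(i), temperature(i), humidity(i)))

rows.foreach(println)
```

Variables that share no dimension have no such index to join on, which is why the connector requires at least one common dimension.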
### Add Dependency to Your Project
To integrate the **NetCDF Spark** connector into your project, add the following dependency to your preferred build tool configuration.
#### Using SBT
Add the following line to your `build.sbt` file:
``` scala
libraryDependencies += "io.github.rejeb" %% "netcdf-spark-parser" % "1.0.0"
```
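The Unidata repository listed in the requirements must also be visible to the build so that the transitive netCDF-Java artifacts resolve. A minimal sketch for sbt (the repository URL is taken from the netCDF-Java artifact guide linked above; verify it there before use):

``` scala
// build.sbt — hypothetical fragment: declares the Unidata artifact repository
// so that the netCDF-Java dependencies pulled in by this connector resolve.
resolvers += "Unidata All" at "https://artifacts.unidata.ucar.edu/repository/unidata-all/"
```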
#### Using Maven
Include the following dependency in the `<dependencies>` section of your `pom.xml` file:
``` xml
<dependency>
  <groupId>io.github.rejeb</groupId>
  <artifactId>netcdf-spark-parser_2.13</artifactId>
  <version>1.0.0</version>
</dependency>
```
> **Note**: Change `_2.13` to `_2.12` if your project uses Scala 2.12 instead of 2.13.
#### Using Gradle
For Gradle, add this dependency to the `dependencies` block of your `build.gradle` file:
``` groovy
dependencies {
    implementation 'io.github.rejeb:netcdf-spark-parser_2.13:1.0.0'
}
```
> **Hint**: Ensure that the Scala version in the artifact matches your project setup (e.g., `_2.12` or `_2.13`).
---
### Define Your NetCDF Schema
The connector cannot infer a schema, so you must explicitly define one to map NetCDF variables to columns. Here is an example schema definition:
```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("temperature", FloatType),
  StructField("humidity", FloatType),
  StructField("timestamp", StringType),
  StructField("metadata", ArrayType(StringType))
))
```

---
### Load NetCDF Files
Here is how to load a NetCDF file into a DataFrame:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("NetCDF File Reader")
  .master("local[*]")
  .getOrCreate()

val df = spark.read.format("netcdf")
  .schema(schema)
  .option("path", "/path/to/your/netcdf-file.nc")
  .load()

df.show()
```

---
### Configuration Options
| Option | Description | Required | Default |
|---------------------|-------------------------------------------------------|----------|---------------|
| `path` | Path to the NetCDF file | Yes | None |
| `partition.size` | Rows per partition to optimize parallelism | No | 20,000 rows |
| `dimensions.to.ignore` | Comma-separated list of dimensions to ignore | No | None |

Example with options:
```scala
val df = spark
.read
.format("netcdf")
.schema(schema)
.option("path", "/path/to/file.nc")
.option("partition.size", 50000)
.option("dimensions.to.ignore", "dim1,dim2")
.load()
```

---
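The `partition.size` option caps the number of rows per Spark partition, which in turn controls parallelism. As a rough illustration, in plain Scala (this arithmetic is an assumption about the option's intent, not taken from the connector's source), a row cap would translate into a partition count by ceiling division:

``` scala
// Hypothetical illustration: how a per-partition row cap maps to a partition
// count. `totalRows` and `partitionSize` are made-up example values.
val totalRows: Long = 1000000L
val partitionSize: Long = 50000L // as set via .option("partition.size", 50000)

// Ceiling division: each partition holds at most `partitionSize` rows.
val partitionCount: Long = (totalRows + partitionSize - 1) / partitionSize

println(partitionCount)
```

Smaller values of `partition.size` mean more, lighter tasks; larger values mean fewer, heavier ones. Tune it to your cluster's executor count and memory.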
### Full Sample Pipeline Example
Here is a complete example:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("NetCDF File Reader")
  .master("local[*]")
  .getOrCreate()

val schema = StructType(Seq(
  StructField("temperature", FloatType),
  StructField("humidity", FloatType),
  StructField("timestamp", StringType),
  StructField("metadata", ArrayType(StringType))
))

val df = spark
  .read
  .format("netcdf")
  .schema(schema)
  .option("path", "/data/example.nc")
  .load()

df.printSchema()
df.show()
```

---
## ⚠️ Limitations
- **Schema inference**: Schema inference is not supported; you must explicitly define the schema.
- **Write Operations**: Currently, writing to NetCDF files is not supported.
- **Common Dimensions**: Too many shared dimensions, or a large Cartesian product between them,
can cause the parser to fail during partitioning and data reading.

---
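To see why the Cartesian-product limitation matters, a back-of-the-envelope sketch in plain Scala (this assumes one output row per combination of dimension indices, which is how a Cartesian product over dimensions would grow; the sizes are made up):

``` scala
// Hypothetical illustration: when variables span several shared dimensions,
// the flattened row count is the product of the dimension sizes, which grows
// very quickly.
val dimSizes = Seq(365L, 720L, 1440L) // e.g. time, lat, lon (made-up sizes)
val rowCount = dimSizes.product

println(rowCount) // 365 * 720 * 1440 = 378432000 rows
```

The `dimensions.to.ignore` option can help here by excluding dimensions you do not need from the product.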
## 🤝 Contributing
Contributions are welcome! To contribute:
1. Fork the project
2. Create a feature branch (`git checkout -b feature/my-feature`)
3. Commit your changes (`git commit -am 'Add my feature'`)
4. Push to your branch (`git push origin feature/my-feature`)
5. Create a Pull Request

---
## 📄 License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.