{"id":26938741,"url":"https://github.com/rejeb/netcdf-spark-parser","last_synced_at":"2026-04-16T11:02:37.525Z","repository":{"id":285266906,"uuid":"957558797","full_name":"rejeb/netcdf-spark-parser","owner":"rejeb","description":"Scala/Spark Netcdf for reading Netcdf files","archived":false,"fork":false,"pushed_at":"2025-07-14T01:57:18.000Z","size":190,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-09-13T19:43:01.954Z","etag":null,"topics":["netcdf","netcdf-java","parser","scala","spark","spark-connector","spark-datasource"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rejeb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-03-30T17:01:01.000Z","updated_at":"2025-07-14T01:57:22.000Z","dependencies_parsed_at":"2025-07-14T01:11:16.458Z","dependency_job_id":"3d5eb427-8ad5-4eb9-aff4-5bedada91d57","html_url":"https://github.com/rejeb/netcdf-spark-parser","commit_stats":null,"previous_names":["rejeb/netcdf-spark-parser"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rejeb/netcdf-spark-parser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rejeb%2Fnetcdf-spark-parser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rejeb%2Fnetcdf-spark-parser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/rejeb%2Fnetcdf-spark-parser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rejeb%2Fnetcdf-spark-parser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rejeb","download_url":"https://codeload.github.com/rejeb/netcdf-spark-parser/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rejeb%2Fnetcdf-spark-parser/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31882886,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-16T09:23:21.276Z","status":"ssl_error","status_checked_at":"2026-04-16T09:23:15.028Z","response_time":69,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["netcdf","netcdf-java","parser","scala","spark","spark-connector","spark-datasource"],"created_at":"2025-04-02T14:13:41.368Z","updated_at":"2026-04-16T11:02:37.504Z","avatar_url":"https://github.com/rejeb.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NetCDF Spark Parser\n\n[![GitHub 
stars](https://img.shields.io/github/stars/rejeb/netcdf-spark-parser)](https://github.com/rejeb/netcdf-spark-parser/stargazers)\n[![License](https://img.shields.io/github/license/rejeb/netcdf-spark-parser)](https://github.com/rejeb/netcdf-spark-parser/blob/main/LICENSE)\n[![Java](https://img.shields.io/badge/Java-11-blue)](https://www.java.com/fr/)\n[![Scala](https://img.shields.io/badge/Scala-2.12%2F2.13-red)](https://www.scala-lang.org/)\n[![Spark](https://img.shields.io/badge/Spark-3.5.x-orange)](https://spark.apache.org/)\n\nA Spark connector for efficiently parsing and reading **NetCDF** files at scale using **Apache Spark**.\nThis project leverages the **DataSource V2** API to integrate NetCDF file reading in a distributed and performant way.\nThis parser uses [NetCDF Java](https://www.unidata.ucar.edu/software/netcdf-java/) to read data from NetCDF files.\n\n---\n## 🚀 Features\n\n- **Custom Schema Support**: Define the schema for NetCDF variables.\n- **Partition Handling**: Automatically manages partitions for large NetCDF files.\n- **Scalable Performance**: Optimized for distributed computing with Spark.\n- **Storage Compatibility**: This connector supports reading NetCDF files from:\n    - Local file systems (tested).\n    - Amazon S3 (tested); see [Dataset URLs](https://docs.unidata.ucar.edu/netcdf-java/5.6/userguide/dataset_urls.html) for configuration.\n  \n---\n\n## 📋 Requirements\n\n- **Java**: Version 11+\n- **Apache Spark**: Version 3.5.x\n- **Scala**: Version 2.12 or 2.13\n- **Dependency Management**: SBT, Maven, or similar\n- **Unidata repository**: Add the Unidata repository to your build; see [Using netCDF-Java Maven Artifacts](https://docs.unidata.ucar.edu/netcdf-java/current/userguide/using_netcdf_java_artifacts.html)\n\n---\n\n## 🧰 Use Cases\n\n- Transforming multi-dimensional data into tabular form.\n- Processing climate and oceanographic data.\n- Analyzing multi-dimensional scientific datasets.\n- Batch processing of NetCDF files.\n\n## 📖 Usage\n\nLoading data 
from a NetCDF file into a DataFrame requires that the variables to extract share at least one common dimension.\n\n### Add Dependency to Your Project\n\nTo integrate the **NetCDF Spark** connector into your project, add the following dependency to your preferred build tool configuration.\n#### Using SBT\nAdd the following line to your `build.sbt` file:\n``` scala\nlibraryDependencies += \"io.github.rejeb\" %% \"netcdf-spark-parser\" % \"1.0.0\"\n```\n#### Using Maven\nInclude the following dependency in the `\u003cdependencies\u003e` section of your `pom.xml` file:\n``` xml\n\u003cdependency\u003e\n    \u003cgroupId\u003eio.github.rejeb\u003c/groupId\u003e\n    \u003cartifactId\u003enetcdf-spark-parser_2.13\u003c/artifactId\u003e\n    \u003cversion\u003e1.0.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\u003e **Note**: Change `_2.13` to `_2.12` if your project uses Scala 2.12 instead of 2.13.\n\u003e\n\n#### Using Gradle\nFor Gradle, add this dependency to the `dependencies` block of your `build.gradle` file:\n``` groovy\ndependencies {\n    implementation 'io.github.rejeb:netcdf-spark-parser_2.13:1.0.0'\n}\n```\n\u003e **Hint**: Ensure that the Scala version suffix of the artifact matches your project setup (e.g., `_2.12` or `_2.13`).\n\u003e\n\n---\n\n### Define Your NetCDF Schema\n\nThe connector requires an explicitly defined schema that maps NetCDF variables to DataFrame columns. 
Here is an example schema definition:\n```scala\nimport org.apache.spark.sql.types._\n\n// One StructField per NetCDF variable to extract\nval schema = StructType(Seq(\n  StructField(\"temperature\", FloatType),\n  StructField(\"humidity\", FloatType),\n  StructField(\"timestamp\", StringType),\n  StructField(\"metadata\", ArrayType(StringType))\n))\n```\n\n---\n\n### Load NetCDF Files\n\nHere is how to load a NetCDF file into a DataFrame:\n\n```scala\nimport org.apache.spark.sql.SparkSession\n\nval spark = SparkSession.builder().appName(\"NetCDF File Reader\").master(\"local[*]\").getOrCreate()\n\n// Pass the schema explicitly; schema inference is not supported\nval df = spark.read.format(\"netcdf\")\n  .schema(schema)\n  .option(\"path\", \"/path/to/your/netcdf-file.nc\")\n  .load()\ndf.show()\n```\n\n---\n\n### Configuration Options\n\n| Option                 | Description                                  | Required | Default     |\n|------------------------|----------------------------------------------|----------|-------------|\n| `path`                 | Path to the NetCDF file                      | Yes      | None        |\n| `partition.size`       | Number of rows per partition                 | No       | 20,000 rows |\n| `dimensions.to.ignore` | Comma-separated list of dimensions to ignore | No       | None        |\n\nExample with options:\n\n```scala\nval df = spark\n        .read\n        .format(\"netcdf\")\n        .schema(schema)\n        .option(\"path\", \"/path/to/file.nc\")\n        .option(\"partition.size\", 50000)\n        .option(\"dimensions.to.ignore\", \"dim1,dim2\")\n        .load()\n```\n\n---\n\n### Full Sample Pipeline Example\n\nHere is a complete example:\n```scala\nimport org.apache.spark.sql.SparkSession\nimport org.apache.spark.sql.types._\n\nval spark = SparkSession.builder().appName(\"NetCDF File Reader\").master(\"local[*]\").getOrCreate()\n\nval schema = StructType(Seq(\n  StructField(\"temperature\", FloatType),\n  StructField(\"humidity\", FloatType),\n  StructField(\"timestamp\", StringType),\n  StructField(\"metadata\", ArrayType(StringType))\n))\n\nval df = spark\n        .read\n        .format(\"netcdf\")\n        .schema(schema)\n        .option(\"path\", \"/data/example.nc\")\n        .load()\n\ndf.printSchema()\ndf.show()\n```\n\n---\n\n## ⚠️ 
Limitations\n\n- **Schema inference**: Schema inference is not supported; you must explicitly define the schema.\n- **Write Operations**: Writing to NetCDF files is not currently supported.\n- **Common Dimensions**: Too many shared dimensions, or a large Cartesian product between them,\ncan cause the parser to fail during partitioning and data reading.\n\n---\n\n## 🤝 Contributing\n\nContributions are welcome! To contribute:\n\n1. Fork the project\n2. Create a feature branch (`git checkout -b feature/my-feature`)\n3. Commit your changes (`git commit -am 'Add my feature'`)\n4. Push to your branch (`git push origin feature/my-feature`)\n5. Create a Pull Request\n\n---\n\n## 📄 License\n\nThis project is licensed under the Apache License 2.0. See the [LICENSE](LICENSE) file for details.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frejeb%2Fnetcdf-spark-parser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frejeb%2Fnetcdf-spark-parser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frejeb%2Fnetcdf-spark-parser/lists"}