{"id":19309926,"url":"https://github.com/cedadev/s3-netcdf-python","last_synced_at":"2025-04-22T13:33:46.825Z","repository":{"id":48685759,"uuid":"97473106","full_name":"cedadev/S3-netcdf-python","owner":"cedadev","description":"Read / write netCDF files from / to object stores with S3 interface","archived":false,"fork":false,"pushed_at":"2022-10-04T13:54:27.000Z","size":15485,"stargazers_count":45,"open_issues_count":2,"forks_count":10,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-04-02T00:23:30.773Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cedadev.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-07-17T12:21:19.000Z","updated_at":"2024-06-14T03:14:47.000Z","dependencies_parsed_at":"2023-01-19T04:01:00.643Z","dependency_job_id":null,"html_url":"https://github.com/cedadev/S3-netcdf-python","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedadev%2FS3-netcdf-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedadev%2FS3-netcdf-python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedadev%2FS3-netcdf-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedadev%2FS3-netcdf-python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cedadev","download_url":"https://codeload.github.com/cedadev/S3-netcdf-python/tar.gz/refs/heads/main","host":{"name":"
GitHub","url":"https://github.com","kind":"github","repositories_count":248591664,"owners_count":21130087,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T00:21:49.054Z","updated_at":"2025-04-22T13:33:46.519Z","avatar_url":"https://github.com/cedadev.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# S3netCDF4\n\nAn extension package to netCDF4-python to enable reading and writing **netCDF**\nfiles and **CFA-netcdf** files from / to object stores and public cloud with a\nS3 HTTP interface, to disk or to OPeNDAP.\n\n# Contents\n* [Requirements](#requirements)\n* [Installation](#installation)\n* [Configuration](#configuration)\n* [Aliases](#aliases)\n* [Caching](#caching)\n* [Backends](#backends)\n* [Resource Usage](#resource)\n* [Writing files](#writing-files)\n  * [CFA-netCDF files](#cfa-netcdf-files)\n  * [Creating dimensions and variables](#creating-dimensions-and-variables)\n  * [Filenames and file hierarchy of CFA files](#filenames-and-file-hierarchy-of-cfa-files)\n  * [Writing field data](#writing-field-data)\n  * [File splitting algorithm](#file-splitting-algorithm)\n* [Reading files](#reading-files)\n  * [Reading variables](#reading-variables)\n  * [Reading metadata](#reading-metadata)\n  * [Reading field data](#reading-field-data)\n* [List of examples](#list-of-examples)\n\n# Requirements\nS3-netCDF4 requires Python 3.7 or later.\n\nIt also requires the following packages:\n* numpy==1.19.4\n* Cython==0.29.21\n* netCDF4==1.5.5.1\n* botocore==1.19.20\n* aiobotocore==1.1.2\n* psutil==5.7.3\n\n(These are 
fulfilled by a pip installation, so it is not necessary to install\nthem if you are installing the package via pip, as below.)\n\n[[Top]](#contents)\n\n# Installation\n\nS3netCDF4 is designed to be installed in user space, without the user having\n`root` or `sudo` privileges.  System wide installation is also supported.  It\nis recommended to install S3netCDF4 into a virtual environment, rather than\nusing the system Python.  S3netCDF4 does not rely on any external servers,\nbesides the storage systems, it is run entirely on the host machine.\n\ns3netCDF4 can be installed either from PyPi or directly from the GitHub repository.\n\n### From PyPi\n\n1. Create a Python 3 virtual environment:\n\n    `python3 -m venv /path/to/venv`\n\n2. Activate the virtual environment:\n\n    `source /path/to/venv/bin/activate`\n\n3. Installing S3netCDF4 requires a version of `pip` \u003e 10.0.  To install the\nlatest version of pip into the virtual environment use the command:\n\n    `pip install --upgrade pip`\n\n4. Install from PyPi:\n\n    `pip install S3netCDF4`\n\n5. Copy the configuration template file from `config/.s3nc.json.template` to\n    `~/.s3nc.json` and fill in the values for the variables.  See the section\n    [Configuration](#configuration).\n\n6. Run a test to ensure the package has installed correctly:\n\n    `python test/test_s3Dataset.py`\n\n### From GitHub    \n\n0. Users on the STFC/NERC JASMIN system will have to activate Python 3.7 by\nusing the command:\n\n    `module load jaspy`\n\n1. Create a Python 3 virtual environment:\n\n    `python3 -m venv /path/to/venv`\n\n2. Activate the virtual environment:\n\n    `source /path/to/venv/bin/activate`\n\n3. Installing S3netCDF4 requires a version of `pip` \u003e 10.0.  To install the\nlatest version of pip into the virtual environment use the command:\n\n    `pip install --upgrade pip`\n\n4. 
Install the S3netCDF4 library, directly from the github repository:\n\n    `pip install -e git+https://github.com/cedadev/S3-netcdf-python.git#egg=S3netCDF4`\n\n5. Copy the configuration template file from `config/.s3nc.json.template` to\n`~/.s3nc.json` and fill in the values for the variables.  See the section\n[Configuration](#configuration).\n\n6. Run a test to ensure the package has installed correctly:\n\n    `python test/test_s3Dataset.py`\n\n7. Users on the STFC/NERC JASMIN system will have to repeat step 0 every time\nthey wish to use S3netCDF4 via the virtual environment.\n\n[[Top]](#contents)\n\n# Configuration\nS3netCDF4 relies on a configuration file to resolve endpoints for the S3\nservices, and to control various aspects of the way the package operates.  This\nconfig file is a JSON file and is located in the user's home directory:\n\n`~/.s3nc.json`\n\nIn the git repository a templatised example of this configuration file is\nprovided:\n\n`config/.s3nc.json.template`\n\nThis can be copied to the user's home directory, and the template renamed to\n`~/.s3nc.json`.  \n\nAlternatively, an environment variable `S3_NC_CONFIG` can be set to define the\nlocation and name of the configuration file.  This can also be set in code,\nbefore the import of the S3netCDF4 module:\n\n    import os\n    os.environ[\"S3_NC_CONFIG\"] = \"/Users/neil/.s3nc_different_config.json\"\n    from S3netCDF4._s3netCDF4 import s3Dataset\n\nOnce the config file has been copied, the variables in the template should then\nbe filled in.  This file is a [jinja2](http://jinja.pocoo.org/docs/2.10/)\ntemplate of a JSON file, and so can be used within an\n[ansible](https://www.ansible.com/) deployment.  \nEach entry in the file has a key:value pair.  
An example of the file is given\nbelow:\n\n    {\n        \"version\": \"9\",\n        \"hosts\": {\n            \"s3://tenancy-0\": {\n                \"alias\": \"tenancy-0\",\n                    \"url\": \"http://tenancy-0.jc.rl.ac.uk\",\n                    \"credentials\": {\n                        \"accessKey\": \"blank\",\n                        \"secretKey\": \"blank\"\n                    },\n                    \"backend\": \"s3aioFileObject\",\n                    \"api\": \"S3v4\"\n            }\n        },\n        \"backends\": {\n            \"s3aioFileObject\" : {\n                \"maximum_part_size\": \"50MB\",\n                \"maximum_parts\": 8,\n                \"enable_multipart_download\": true,\n                \"enable_multipart_upload\": true,\n                \"connect_timeout\": 30.0,\n                \"read_timeout\": 30.0\n            },\n            \"s3FileObject\" : {\n                \"maximum_part_size\": \"50MB\",\n                \"maximum_parts\": 4,\n                \"enable_multipart_download\": false,\n                \"enable_multipart_upload\": false,\n                \"connect_timeout\": 30.0,\n                \"read_timeout\": 30.0\n            }\n        },\n        \"cache_location\": \"/cache_location/.cache\",\n        \"resource_allocation\" : {\n            \"memory\": \"1GB\",\n            \"filehandles\": 20\n        }\n    }\n\n* `version` indicates which version of the configuration file this is.\n* `hosts` contains a list of named hosts and their respective configuration\ndetails.\n  * `s3://tenancy-0` contains the definition of a single host called\n  `tenancy-0`.  For each host a number of configuration details need to be\n  supplied:\n    * `alias` the alias for the S3 server.  
See the [Aliases](#aliases)\n    section.\n    * `url` the DNS resolvable URL for the S3 server, with optional port\n    number.\n    * `credentials` contains two keys:\n        * `accessKey` the user's access key for the S3 endpoint.\n        * `secretKey` the user's secret key / password for the S3 endpoint.\n    * `backend` which backend to use to write the files to the S3 server.  See\n    the [Backends](#backends) section.\n    * `api` the api version used to access the S3 endpoint.\n* `backends` contains localised configuration information for each of the\nbackends which may be used (if included in a `host` definition) to write the\nfiles to the S3 server.  See the [Backends](#backends) section for more\ndetails on backends.\n    * `enable_multipart_download` allow the backend to split files fetched\n    from S3 into multiple parts when downloading.\n    * `enable_multipart_upload` allow the backend to split files when\n    uploading.\n    The advantage of splitting the files into parts is that they can be\n    uploaded or downloaded asynchronously, when the backend supports\n    asynchronous transfers.\n    * `maximum_part_size` the maximum size for each part of the file can reach\n    before it is uploaded or the size of each part when downloading a file.\n    * `maximum_parts` the maximum number of file parts that are held in memory\n    before they are uploaded or the number of file parts that are downloaded\n    at once, for asynchronous backends.\n    * `connect_timeout` the number of seconds that a connection attempt will\n    be made for before timing out.\n    * `read_timeout` the number of seconds that a read attempt will be made\n    before timing out.\n* `cache_location`  S3netCDF4 can read and write very large arrays that are\nsplit into **sub-arrays**. To enable very large arrays to be read, S3netCDF4\nuses Numpy memory mapped arrays.  `cache_location` contains the location of\nthese memory mapped array files.  
See the [Caching](#caching) section below.\n* `resource_allocation` contains localised information about the\nresources each instance of S3netCDF4 should use on the host machine.  See the\n[Resource Usage](#resource) section below.\nIt contains two keys:\n    * `memory` the amount of RAM to dedicate to this instance of S3netCDF4.\n    * `filehandles` the number of file handles to dedicate to this instance\n    of S3netCDF4.\n\n*Note that sizes can be expressed in units other than bytes by suffixing the\nsize with a magnitude identifier: kilobytes (`kB`), megabytes (`MB`),\ngigabytes (`GB`), terabytes (`TB`), exabytes (`EB`), zettabytes (`ZB`) or\nyottabytes (`YB`).*\n\n[[Top]](#contents)\n\n## Aliases\nTo enable S3netCDF4 to write to disk, OPeNDAP and S3 object store, aliases are\nused to identify S3 servers.  They provide an easy to remember (and type)\nshorthand for the user so that they don't have to use the DNS resolved URL and\nport number for each S3 object access.  When creating a netCDF4 `s3Dataset`\nobject, either to read or write, the user supplies a filename.  To indicate\nthat the file should be written to or read from an S3 server, the string\nmust start with `s3://`.  After this must follow the aliased server name, as\ndefined in the config file above.  
A bucket\nname follows the aliased server name.  For example, to read a netCDF file called `test2.nc` from the\n`test` bucket on the `s3://tenancy-0` server, the user would use this code:\n\n*Example 1: open a netCDF file from S3 storage using the alias \"tenancy-0\"*\u003ca\nname=example-1\u003e\u003c/a\u003e\n\n```\nfrom S3netCDF4._s3netCDF4 import s3Dataset as Dataset\ntest_dataset = Dataset(\"s3://tenancy-0/test/test2.nc\", \"r\")\n```\n\nOn creation of the `s3Dataset` object, the S3netCDF4 package reads the\nfilename, determines that the filename starts with `s3://`, reads the next\npart of the string up to the next `/` (which equates to `tenancy-0` in this\ncase) and searches through the aliases defined in the `~/.s3nc.json` file to\nfind a matching alias.  If one is not found it will return an error message;\nif one is found then it will establish a connection to that S3 server, using the\n`url`, `accessKey` and `secretKey` defined for that server.  It is over this\nconnection that all the data transfers for this `s3Dataset` take place.\n\n[[Top]](#contents)\n\n## Caching\nIf the user requests to read a variable, or a slice of a variable, that is\nlarger than either the host machine's physical memory or the\n`resource_allocation: memory` setting in `~/.s3nc.json`, then S3netCDF4 will\nuse two strategies to enable reading very large arrays:\n* A Numpy memory mapped array is used as the \"target array\", which will\ncontain the data requested by the user.  This is stored in a locally cached\nfile, in the `cache_location` root directory.  These files are deleted in the\ndestructor of S3netCDF4 - i.e. when the program exits, or the S3netCDF4 object\ngoes out of scope.  
However, during processing, this directory has the\npotential to grow quite large so adequate provision should be made on disk for\nit.\n* If the file being read is a **CFA-netCDF** file, referencing **sub-array**\nfiles, then the **sub-array** files are streamed into memory (for files on S3\nstorage) or read from disk.  If the amount of memory used exceeds the\n`resource_allocation: memory` config setting, or the number of open files\nexceeds the `resource_allocation: filehandles` config setting, then the last\naccessed **sub-array** file is closed.  This means it will be removed from\nmemory, or the file handle will be freed, allowing another **sub-array** file\nto be read.\n\nSee the [Resource Usage](#resource) section below for more information on\nthis \"memory and file shuffling\" behaviour.\n\n[[Top]](#contents)\n\n## Backends\n\nIn S3-netCDF4, a backend refers to a set of routines that handles the\ninterface to a storage system.  The interface includes read and write, but\nalso gathering file information and file listings.  S3-netCDF4 has a pluggable\nbackend architecture, and so a new storage system can be supported by writing\na new backend plugin.  The backend plugins are extensions of the\n`io.BufferedIOBase` Python class and implement Python file object methods, such\nas `tell`, `seek`, `read` and `write`.  This enables interaction with the\nbackends as though they were POSIX disks.\nThese backends have to be configured on a host by host basis by setting the\n`host: backend` value in the `~/.s3nc.json` config file.  Currently there are\ntwo backends:\n\n* `_s3aioFileObject`: This backend enables asynchronous transfers to an S3\ncompatible storage system.  It is the fastest backend for S3 and should be used\nin preference to `_s3FileObject`.\n* `_s3FileObject`: This is a simpler, synchronous interface to S3 storage\nsystems.  
It can be used if there is a problem using `_s3aioFileObject`.\n\n[[Top]](#contents)\n\n## Resource Usage\n\nS3netCDF4 has the ability to read and write very large files, much larger than\nthe available, or allocated, memory on a machine.  It also has the ability to\nread and write many files to and from disk, which means the number of open\nfiles may exceed the limit set by the file system, or the settings in `ulimit`.\n\nFiles are accessed when a Dataset is opened, and when a slice operator\n(`[x,y,z]`) is used on a **CFA-netCDF** file.\n\nTo enable very large files, and very many files, to be read and written, S3netCDF4\nemploys a strategy where files are \"shuffled\" out of memory (to free up memory)\nor closed (to free up disk handles).  The triggers for this shuffling are\nconfigured in the `\"resource_allocation\"` section of the `.s3nc.json` config\nfile:\n\n* `resource_allocation: memory`: the amount of memory that S3netCDF4 is allowed\nto use before a shuffle is triggered.  This applies when reading or writing\nfiles from / to remote storage, such as an S3 object store.  S3netCDF4 will\nstream the entire netCDF file, or an entire **sub-array** file, into memory when\nreading.  When writing, it will create an entire netCDF file or **sub-array**\nfile in memory, writing the file to the remote storage upon closing the file.\n\n* `resource_allocation: disk_handles`: the number of files on disk that\nS3netCDF4 is allowed to have open at any one time.  This applies when reading\nor writing files to disk.  S3netCDF4 uses the underlying netCDF4 library to\nread and write files to disk, but it keeps track of the number of open files.\n\n*Note that S3netCDF4 allows full flexibility over the location of the\nmaster-array and sub-array files of CFA-netCDF files.  It allows both to be\nstored on disk or S3 storage.  For example, the master-array file could be\nstored on disk for performance reasons, and the sub-array files stored on S3.  
\nOr the first timestep of the sub-array files could also be stored on disk to\nenable users to quickly perform test analyses*\n\nThe file shuffling procedure is carried out by an internal FileManager, which\nkeeps notes about the files that are open at any time, or have been opened in\nthe past and the last time they were accessed.  The user does not see any of\nthis interaction, they merely interact with the S3Dataset, S3Group, S3Variable\nand S3Dimension objects.\n\n1. When a file is initially opened, a note is made of the mode and whether the\nfile is on disk or remote storage.  They are marked as \"OPEN_NEW\" and then,\n\"OPEN_EXISTS\" when they have been opened successfully.\n    - For reading from remote storage, the file is streamed into memory and\n    then a netCDF Dataset is created from the read in data.\n    - For writing to remote storage, the netCDF Dataset is created in memory.\n    - For reading from disk, the file is opened using the underlying netCDF4\n    library, and the netCDF Dataset is returned.\n    - For writing to disk, the file is created using the netCDF4 library and\n    the Dataset is returned.\n2. If the file is accessed again (e.g. via the slicing operator), then the\nnetCDF Dataset is returned.  The FileManager knows these files are already open\nor present in memory as they are marked as \"OPEN_EXISTS\".\n3. Steps 1 and 2 continue until either the amount of memory used exceeds\n`resource_allocation: memory` or the number of open files exceeds\n`resource_allocation: disk_handles`.\n4. If the amount of memory used exceeds `resource_allocation: memory`:\n  - The size of the next file is determined (read) or calculated (write).  
\n  Files are closed, and the memory they occupy is freed using the Python\n  garbage collector, until there is enough memory free to read in or create the\n  next file.\n  - Files that were opened in \"write\" mode are closed, marked as \"KNOWN_EXISTS\"\n  and written to either the remote storage (S3) or disk.\n  - Files that were opened in \"read\" mode are simply closed and their entry is\n  removed from the FileManager.\n  - The priority for closing files is that the last accessed file is closed\n  first.  The FileManager keeps a note of when each file was accessed last.\n  - If a file is accessed again in \"write\" mode, and it is marked as\n  \"KNOWN_EXISTS\" in the FileManager, then it is opened in \"append\" mode.  In\n  this way, a file can be created, be shuffled in and out of memory, and still\n  be written to so that the end result is the same as if it had been in memory\n  throughout the operation.\n5. If the number of open files exceeds `resource_allocation: disk_handles`:\n  - The procedure for point 4 is followed, except rather than closing files\n  until there is enough memory available, files are closed until there are free\n  file handles.\n  - Files are marked as \"KNOWN_EXISTS\" as in point 4.\n\nThis file shuffling procedure is fundamental to the performance of S3netCDF4,\nas it minimises the number of times a file has to be streamed from remote\nstorage, or opened from disk.  There are also optimisations in the\nFileManager.  For example, if a file has been written to and then read, it will use\nthe copy in memory for all operations, rather than holding two copies, or\nstreaming to and from remote storage repeatedly.\n\n[[Top]](#contents)\n\n## Writing files\nS3netCDF4 has the ability to write **netCDF3**, **netCDF4**, **CFA-netCDF3**\nand **CFA-netCDF4** files to a POSIX filesystem, Amazon S3 object storage (or\npublic cloud) or OPeNDAP.  Files are created in the same way as the standard\nnetCDF4-python package, by creating a `s3Dataset` object.  
However, the\nparameters to the `s3Dataset` constructor can vary in three ways:\n\n1. The `filename` can be an S3 endpoint, i.e. it starts with `s3://`.\n2. The `format` keyword can also, in addition to the formats permitted by\nnetCDF4-python, be `CFA3`, to create a CFA-netCDF3 dataset, or `CFA4`, to\ncreate a CFA-netCDF4 dataset.\n3. If creating a `CFA3` or `CFA4` dataset, then an optional keyword parameter\ncan be set: `cfa_version`.  This can be either `\"0.4\"` or `\"0.5\"`.  See the\n[CFA-netCDF files](#cfa-netcdf-files) section below.\n\n*Example 2: Create a netCDF4 file in the filesystem*\u003ca name=example-2\u003e\u003c/a\u003e\n```\nfrom S3netCDF4._s3netCDF4 import s3Dataset as Dataset\ntest_dataset = Dataset(\"/Users/neil/test_dataset_nc4.nc\", 'w',\n                       format='NETCDF4')\n```\n\n*Example 3: Create a CFA-netCDF4 file in the filesystem with CFA version\n0.5 (the default)*\u003ca name=example-3\u003e\u003c/a\u003e\n```\nfrom S3netCDF4._s3netCDF4 import s3Dataset as Dataset\ncfa_dataset = Dataset(\"/Users/neil/test_dataset_cfa4.nc\", 'w', format='CFA4')\n```\n*Example 4: Create a CFA-netCDF3 file on S3 storage with CFA version 0.4*\u003ca\nname=example-4\u003e\u003c/a\u003e\n```\nfrom S3netCDF4._s3netCDF4 import s3Dataset as Dataset\ncfa_dataset = Dataset(\"s3://tenancy-0/test_bucket/test_dataset_s3_cfa3.nc\",\n                      'w', format='CFA3', cfa_version=\"0.4\")\n```\n\n[[Top]](#contents)\n\n### CFA-netCDF files\nChoosing `format=\"CFA3\"` or `format=\"CFA4\"` when creating a file creates a\nCFA-compliant netCDF file.  This consists of a **master-array** file and a\nnumber of **sub-array** files.\nThe version of CFA to use can also be specified, either `cfa_version=\"0.4\"` or\n`cfa_version=\"0.5\"`.  
`\"0.4\"` follows the\n[CFA conventions](http://www.met.reading.ac.uk/~david/cfa/0.4/), where the\n**sub-array** metadata is written into the attributes of the netCDF variables.\n`\"0.5\"` refactors the **sub-array** metadata into extra groups and variables\nin the **master-array** file.  `\"0.5\"` is the preferred format as it is more\nmemory efficient, relying on netCDF slicing and partial reading of files, and\nis faster as it does not require parsing when the **master-array** file is\nfirst read.  As it uses features of netCDF4, `cfa_version=\"0.5\"` is only\ncompatible with `format=\"CFA4\"`\n\n*Note that `cfa_version=\"0.5\"` and `format=\"CFA3\"` are incompatible, as NETCDF3\ndoes not enable groups to be used*\n\nThe **master-array** file contains:\n\n* the dimension definitions\n* dimension variables\n* scalar variable definitions: variable definitions without reference to the\ndomain it spans\n* variable metadata\n* global metadata\n* It does not contain any field data, but it *does* contain data for the\ndimension variables, and therefore the domain of each variable.  \n* The **master-array** file may contain a single field variable or multiple\nfield variables.  \n\nThe **sub-array** files contain a subdomain of a single variable in the\n**master-array**.  They contain:\n\n* the dimension definitions for the subdomain\n* the dimension variables for the subdomain\n* a single variable definition, complete with reference to the dimensions\n* metadata for the variable\n\nTherefore, each **sub-array** file is a self-describing netCDF file.  If the\n**master-array** file is lost, it can be reconstructed from the **sub-array**\nfiles.\n\nIn CFA v0.4, the *variable metadata* (netCDF attributes) in each variable in\nthe **master-array** file contains a **partition matrix**.  
The **partition\nmatrix** contains information on how to reconstruct the **master-array**\nvariables from the associated **sub-arrays** and, therefore, also contains the\nnecessary information to read or write slices of the **master-array**\nvariables.\n\nIn CFA v0.5, the **partition matrix** is stored in a group.  This group has the\nsame name as the variable, but prefixed with `cfa_`.  The group contains\ndimensions and variables to store the information for the **partition matrix**\nand the **partitions**.\nFull documentation for CFA v0.5 will be forthcoming.\n\nThe **partition matrix** contains:\n\n* The dimensions in the netCDF file that the partition matrix acts over (e.g.\n`[\"time\", \"latitude\", \"longitude\"]`)\n* The shape of the partition matrix (e.g. `[4, 2, 2]`)\n* A list of partitions\n\nEach **partition** in the **partition matrix** contains:\n\n* An index for the partition into the partition matrix - a list the length of\nthe number of dimensions for the variable (e.g. `[3, 1, 0]`)\n* The location of the partition in the **master-array** - a list (the length\nof the number of dimensions) of pairs, each pair giving the range of\nindices in the **master-array** for that dimension\n(e.g. `[[0, 10], [20, 40], [0, 45]]`)\n* A definition of the **sub-array** which contains:\n    * The path or URI of the file containing the **sub-array**.  This may be\n    on the filesystem, an OPeNDAP file or an S3 URI.\n    * The name of the netCDF variable in the **sub-array** file\n    * The format of the file (always `netCDF` for S3netCDF4)\n    * The shape of the variable - i.e. the length of the subdomain in each\n    dimension\n\nFor more information see the\n[CFA conventions 0.4 website](http://www.met.reading.ac.uk/~david/cfa/0.4/).\nThere is also a useful synopsis in the header of the `_CFAClasses.pyx` file in\nthe S3netCDF4 source code.  
Documentation for the `\"0.5\"` version of CFA will\nfollow.\n\n*Note that indices in the partition matrix are indexed from zero, but the\nindices are inclusive for the location of the partition in the master-array.  \nThis is different from Python, where the end index of a slice is exclusive.  The\nconversion between the two indexing methods is handled in the implementation\nof _CFAnetCDFParser, so the user does not have to worry about converting\nindices.*\n\n[[Top]](#contents)\n\n### Creating dimensions and variables\n\nCreating dimensions and variables in the netCDF or CFA-netCDF4 dataset follows\nthe same method as creating variables in the standard netCDF4-python library:\n\n*Example 5: creating dimensions and variables*\u003ca name=example-5\u003e\u003c/a\u003e\n```\nfrom S3netCDF4._s3netCDF4 import s3Dataset as Dataset\ncfa_dataset = Dataset(\"s3://minio/test_bucket/test_dataset_s3_cfa3.nc\", 'w',\nformat='CFA3')\n\ntimed = cfa_dataset.createDimension(\"time\", None)\ntimes = cfa_dataset.createVariable(\"time\", \"f4\", (\"time\",))\n```\n\nWhen creating variables, one of a number of different workflows for writing the files\nis followed.  Which workflow is taken depends on the combination of the filename\npath (`S3`, filesystem or OPeNDAP) and format (`CFA3` and `CFA4` or `NETCDF4`\nand `NETCDF3_CLASSIC`).  These workflows can be summarised as:\n\n* `format=NETCDF4` or `format=NETCDF3_CLASSIC`.  These two options will create\na standard netCDF file.\n  * If the filename is on a remote system (i.e. it contains `s3://`), then the\n  netCDF file will be created in memory and uploaded (PUT) to the S3\n  filesystem when `s3Dataset.close()` is called or the file is \"shuffled\" out\n  of memory (see [Resource Usage](#resource) for more details).\n  * If the filename does not contain `s3://` then the netCDF file will be\n  written out to the filesystem or OPeNDAP, with the behaviour following the\n  standard netCDF4-python library.\n\n* `format=CFA3` or `format=CFA4`.  
These two options will create a\n  **CFA-netCDF file**.\n  * At first only the **master-array** file is created and written to.  The\n  **sub-array** files are created and written to when data is written to the\n  **master-array** variable.\n  * When the variable is created, the dimensions are supplied and this enables\n  the **partition matrix** metadata to be generated:\n    * The [file splitting algorithm](#file-splitting-algorithm) determines how\n    to split the variable into the **sub-arrays**, or the user can supply the\n    shape of the **sub-arrays**\n    * From this information the **partition matrix** shape and **partition\n    matrix** list of dimensions are created.  The **partition matrix** is\n    represented internally by a netCDF dataset, and this is also created.\n  * Only when a variable is written to, via a slice operation on a variable,\n  is each individual **partition** written into the **partition matrix**.\n    * The **sub-array** file is created, either in memory for remote\n    filesystems (S3), or to disk for local filesystems (POSIX).\n    * The filename for the **sub-array** is determined programmatically.\n    * The location in the **master-array** for each **sub-array** (and its\n     shape) is determined by the slice and the **sub-array** shape determined\n     by either the file splitting algorithm, or supplied by the user.\n    * This single **partition** information is written into the **partition-\n    matrix**\n    * The field data is written into the **sub-array** file.\n    * On subsequent slices into the same **sub-array**, the **partition**\n    information is used, rather than rewritten.\n  * When the **master-array** file is closed (by the user calling\n      `s3Dataset.close()`):\n    * The **partition matrix** metadata is written to the **master-array**\n    * If the files are located on a remote filesystem (S3), then they only\n    currently exist in memory (unless they have been \"shuffled\" to storage).\n    They 
are now closed (in memory) and then uploaded to the remote storage.  \n    Any appended files are also uploaded to remote storage.\n    * If the files are not on a remote filesystem, then they are closed: the\n    **sub-array** files in turn, and then the **master-array** file last.\n\n### Filenames and file hierarchy of CFA files\n\nAs noted above, CFA files actually consist of a single **master-array** file\nand many **sub-array** files.  These **sub-array** files are referred to by\ntheir filepath or URI in the partition matrix.  To easily associate the\n**sub-array** files with the **master-array** file, a naming convention and file\nstructure are used:\n\n* The [CFA conventions](http://www.met.reading.ac.uk/~david/cfa/0.4/) dictate\nthat the file extension for a CFA-netCDF file should be `.nca`.\n* A directory is created in the same directory / same root URI as the\n**master-array** file.  This directory has the same name as the **master-array** file without\nthe `.nca` extension.\n* In this directory all of the **sub-array** files are contained.  
These\nsubarray files follow the naming convention:\n\n    `\u003cmaster-array-file-name\u003e.\u003cvariable-name\u003e.[\u003clocation in the partition\n    matrix\u003e].nc`\n\nExample for the **master-array** file `a7tzga.pdl4feb.nca`:\n\n    ├── a7tzga.pdl4feb.nca\n    ├── a7tzga.pdl4feb\n    │   ├── a7tzga.pdl4feb.field16.0.nc\n    │   ├── a7tzga.pdl4feb.field16.1.nc\n    │   ├── a7tzga.pdl4feb.field186.0.nc\n    │   ├── a7tzga.pdl4feb.field186.1.nc\n    │   ├── a7tzga.pdl4feb.field1.0.0.nc\n    │   ├── a7tzga.pdl4feb.field1.0.1.nc\n    │   ├── a7tzga.pdl4feb.field1.1.0.nc\n    │   ├── a7tzga.pdl4feb.field1.1.1.nc\n\nOn an S3 storage system, the **master-array** directory will form part of the\n*prefix* for the **sub-array** objects, as directories do not exist, in a\nliteral sense, on S3 storage systems, only prefixes.\n\n*Note that the metadata in the master-array file informs S3netCDF4 where the\nsub-array files are located.  The above file structure defines the default\nbehaviour, but the specification of S3netCDF4 allows sub-array files to be\nlocated anywhere, be that on S3, POSIX disk or OpenDAP.*\n\n### Writing metadata\n\nMetadata can be written to the variables and the Dataset (global metadata) in\nthe same way as the standard netCDF4 library, by creating a member variable on\nthe Variable or Dataset object:\n\n*Example 6: creating variables with metadata*\u003ca name=example-6\u003e\u003c/a\u003e\n```\nfrom S3netCDF4._s3netCDF4 import s3Dataset as Dataset\nwith Dataset(\"/Users/neil/test_dataset_cfa3.nca\", mode='w', diskless=True,\nformat=\"CFA3\") as s3_data:\n    # create the dimensions\n    latd = s3_data.createDimension(\"lat\", 196)\n    lond = s3_data.createDimension(\"lon\", 256)\n    # create the dimension variables\n    latitudes = s3_data.createVariable(\"lat\", \"f4\", (\"lat\",))\n    longitudes = s3_data.createVariable(\"lon\", \"f4\", (\"lon\",))\n    # create the field variable\n    temp = s3_data.createVariable(\"tmp\", \"f4\", 
(\"lat\", \"lon\"))\n\n    # add some attributes - variable metadata\n    temp.units = \"degrees C\"\n    latitudes.units = \"degrees north\"\n    longitudes.units = \"degrees east\"\n\n    # add some global metadata\n    s3_data.source = \"s3netCDF4 python module tutorial\"\n    s3_data.author = \"Neil Massey\"\n```\n\n### Writing field data\n\nFor netCDF files with `format=NETCDF3_CLASSIC` or `format=NETCDF4`, the\nvariable is created and field data is written to the file (as missing values)\nwhen `createVariable` is called on the `s3Dataset` object.  Calls to the `[]`\noperator (i.e. slicing the array) will write data to the variable and to the\nfile when the operator is called.  This is the same behaviour as\nnetCDF4-python. If an S3 URI is specified (filepath starts with `s3://`) then the file\nis first created in memory and then streamed to S3 on closing the file.\n\nFor netCDF files with `format=CFA3` or `format=CFA4` specified in the\n`s3Dataset` constructor, only the **master-array** file is written to when\n`createDimension`, `createVariable` etc. are called on the `s3Dataset`\nobject.  When `createVariable` is called, a scalar field variable (i.e. with\nno dimensions) is created, the **partition-matrix** is calculated (see [File\nsplitting algorithm](#file-splitting-algorithm)) and written to the scalar\nfield variable.  The **sub-array** files are only created when the `[]`\noperator is called on the `Variable` object returned from the\n`s3Dataset.createVariable` method.  This operator is implemented in S3netCDF4 as\nthe `__setitem__` member function of the `s3Variable` class, and corresponds\nto slicing the array.\n\nWriting a slice of field data to the **master-array** file, via `__setitem__`,\nconsists of six operations:\n\n1. Determining which of the **sub-arrays** overlap with the slice.  This is\ncurrently done via a hypercube overlapping method, i.e. 
the location of the\n**sub-array** can be determined by dividing the dimension index by the length\nof the dimension in the **partition matrix**.  This assumes that the **sub-arrays** are uniform (per dimension) in size.\n\n2. If the size of the **sub-array** file will cause the currently used amount\nof memory to exceed the `resource_allocation: memory` setting in `~/.s3nc.json`\nthen some files may be shuffled out of memory.  See the [Resource Usage](#resource) section above.  This may result in some files being written\nto the remote storage, meaning they will be opened in append mode the next time\nthey are written to.\nIf, even after the file shuffling has occurred, the size of the **sub-array**\ncannot be contained in memory then a memory error will occur.\n\n3. Open or create the file for the **sub-array** according to the filepath or\nURI in the **partition** information.  If an S3 URI is specified (filepath\nstarts with `s3://`) then the file is opened or created in memory, and\nwill be uploaded when `.close()` is called on the `s3Dataset`.  The file will be\nopened in create mode (`w`).\n\n4. The dimensions and variable are created for the **sub-array** file, and the\nmetadata is also written.\n\n5. Calculate the source and target slices.  This calculates the mapping\nbetween the indices in the **master-array** and each **sub-array**.  This is\ncomplicated by allowing the user to choose any slice for the **master-array**\nand so this must be correctly translated to the **sub-array** indices.\n\n6. Copy the data from the source slice to the target slice.\n\nFor those files that have an S3 URI, uploading to S3 object storage is\nperformed when `.close()` is called on the `s3Dataset`.\n\n### Partial writing of field data\n\nThe **partition** information is only written into the **partition-matrix**\nwhen the s3Dataset is in \"write\" mode and the user slices into the part of the\n**master-array** that is covered by that **partition**.  
Consequently, the\n**sub-array** file is only created when the **partition** is written into the\n**partition-matrix**.\n\nThis leads to the situation that a large part of the **partition-matrix** may\nhave undefined data, and a large number of **sub-array** files may not exist.  \nThis makes s3netCDF4 excellent for sparse data, as the **sub-array** size can\nbe optimised so that the sparse data occupies minimal space.\n\nIf, in \"read\" mode, the user specifies a slice that contains a **sub-array**\nthat is not defined, then the **missing value** (`_FillValue`) is returned for\nthe sub-domain of the **master-array** which the **sub-array** occupies.\n\n[[Top]](#contents)\n\n### File splitting algorithm\n\nTo split the **master-array** into its constituent **sub-arrays** a method\nfor splitting a large netCDF file into smaller netCDF files is used.  The\nhigh-level algorithm is:\n\n1. Split the field variables so that there is one field variable per file.  \nnetCDF allows multiple field variables in a single file, so this is an obvious\nand easy way of partitioning the file.  Note that this only splits the field\nvariables up, the dimension variables all remain in the **master-array** file.\n\n2. For each field variable file, split along the `time`, `level`, `latitude` or\n`longitude` dimensions.  Note that, in netCDF files, the order of the\ndimensions is arbitrary, e.g. the order could be `[time, level, latitude,\nlongitude]` or `[longitude, latitude, level, time]` or even `[latitude, time,\nlongitude, level]`.  \nS3netCDF4 uses the metadata and name for each dimension variable to determine\nthe order of the dimensions so that it can split them correctly.  Note that\nany other dimension (`ensemble` or `experiment`) will always have length of 1,\ni.e. 
the dimension will be split into a number of fields equal to its length.\n\nThe maximum size of an object (a **sub-array** file) can be given as a keyword\nargument to `s3Dataset.createVariable` or `s3Group.createVariable`:\n`max_subarray_size=`.  If no `max_subarray_size` keyword is supplied, then it\ndefaults to 50MB.\nTo determine the most optimal number of splits for the `time`, `latitude` or\n`longitude` dimensions, while still staying under this maximum size\nconstraint, two use cases are considered:\n\n1. The user wishes to read all the timesteps for a single latitude-longitude\npoint of data.\n2. The user wishes to read all latitude-longitude points of the data for a\nsingle timestep.\n\nFor case 1, the optimal solution would be to split the **master-array** into\n**sub-arrays** that have length 1 for the `longitude` and `latitude` dimension\nand a length equal to the number of timesteps for the `time` dimension.  For\ncase 2, the optimal solution would be to not split the `longitude` and\n`latitude` dimensions but split each timestep so that the length of the `time`\ndimension is 1.  However, both of these cases have the worst case scenario for\nthe other use case.  \n\nBalancing the number of operations needed to perform both of these use cases,\nwhile still staying under the `max_subarray_size` leads to an optimisation\nproblem where the following two equalities must be balanced:\n\n1. use case 1 = n\u003csub\u003eT\u003c/sub\u003e / d\u003csub\u003eT\u003c/sub\u003e\n2. use case 2 = n\u003csub\u003elat\u003c/sub\u003e / d\u003csub\u003elat\u003c/sub\u003e **X** n\u003csub\u003elon\u003c/sub\u003e /\nd\u003csub\u003elon\u003c/sub\u003e\n\nwhere n\u003csub\u003eT\u003c/sub\u003e is the length of the `time` dimension and d\u003csub\u003eT\u003c/sub\u003e is\nthe number of splits along the `time` dimension.  
n\u003csub\u003elat\u003c/sub\u003e is the\nlength of the `latitude` dimension and d\u003csub\u003elat\u003c/sub\u003e the number of splits\nalong the `latitude` dimension.  n\u003csub\u003elon\u003c/sub\u003e is the length of the\n`longitude` dimension and d\u003csub\u003elon\u003c/sub\u003e the number of splits along the\n`longitude` dimension.\n\nThe following algorithm is used:\n* Calculate the current object size O\u003csub\u003es\u003c/sub\u003e = n\u003csub\u003eT\u003c/sub\u003e / d\u003csub\u003eT\u003c/sub\u003e\n**X** n\u003csub\u003elat\u003c/sub\u003e / d\u003csub\u003elat\u003c/sub\u003e **X** n\u003csub\u003elon\u003c/sub\u003e /\nd\u003csub\u003elon\u003c/sub\u003e\n* **while** O\u003csub\u003es\u003c/sub\u003e \u003e `max_subarray_size`, split a dimension:\n  * **if** d\u003csub\u003elat\u003c/sub\u003e **X** d\u003csub\u003elon\u003c/sub\u003e \u003c= d\u003csub\u003eT\u003c/sub\u003e:\n    * **if** d\u003csub\u003elat\u003c/sub\u003e \u003c= d\u003csub\u003elon\u003c/sub\u003e:\n        split latitude dimension again: d\u003csub\u003elat\u003c/sub\u003e += 1\n    * **else:**\n        split longitude dimension again: d\u003csub\u003elon\u003c/sub\u003e += 1\n  * **else:**\n    split the time dimension again: d\u003csub\u003eT\u003c/sub\u003e += 1\n\nUsing this simple divide and conquer algorithm ensures the `max_subarray_size`\nconstraint is met and the use cases require an equal number of operations.\n\n*Note that in v2.0 of S3netCDF4, the user can specify the sub-array shape in\nthe s3Dataset.createVariable method.  This circumvents the file-splitting\nalgorithm and uses just the sub-array shape specified by the user.*\n\n[[Top]](#contents)\n\n## Reading files\n\nS3netCDF4 has the ability to read normal netCDF4 and netCDF3 files,\n**CFA-netCDF4** and **CFA-netCDF3** files from a POSIX filesystem, Amazon S3 object\nstore and OPeNDAP.  
\nFor files on remote storage, before reading the file, S3netCDF4 will query\nthe file size and determine whether it is greater than the\n`resource_allocation: memory` setting in the `~/.s3nc.json` configuration or\ngreater than the current available memory.  If it is, then some files will be\n\"shuffled\" out of memory until there is enough allocated memory available.  See\n[Resource Usage](#resource) for more details.  If it is less than the\n`resource_allocation: memory` setting then it will stream the file\ndirectly into memory.  Files on local disk (POSIX) are opened in the same way\nas the standard netCDF4 library, i.e. the header, variable and dimension\ninformation and metadata are read in, but no field data is read.\n\nFrom a user perspective, files are read in the same way as the standard\nnetCDF4-python package, by creating a `s3Dataset` object.  As with writing\nfiles, the parameters to the `s3Dataset` constructor can vary in a number of\nways:\n\n1.  The `filename` can be an S3 endpoint, i.e. it starts with `s3://`, or a\nfile on the disk, or an OpenDAP URL.\n2.  The `format` can be `CFA3` or `CFA4` to read in a **CFA-netCDF3** or **CFA-\nnetCDF4** dataset.  **However**, it is not necessary to specify this keyword if\nthe user wishes to read in a CFA file, as S3netCDF4 will determine, from the\nmetadata, whether a netCDF file is a regular netCDF file or a CFA-netCDF file.  \nS3netCDF4 will also determine, from the file header, whether a netCDF file is a\nnetCDF4 or netCDF3 file.  If the file resides on an S3 storage system, then the\nfirst 6 bytes only of the file will be first read to determine whether the file\nis a netCDF4 or netCDF3 file or an invalid file.  As a CFA-netCDF file is just\na netCDF file, determining whether the netCDF file is a CFA-netCDF file is left\nuntil the file is read in, i.e. after the interpretation of the header.\n3. Files that are on remote storage are streamed into memory.  
As files are\nread in, other files may be \"shuffled\" out of memory if the currently used\nmemory exceeds the `resource_allocation: memory` setting in the `~/.s3nc.json`\nconfig file.  See [Resource Usage](#resource).\n\n*Example 7: Read a netCDF file from disk*\u003ca name=example-7\u003e\u003c/a\u003e\n```\nfrom S3netCDF4._s3netCDF4 import s3Dataset as Dataset\nwith Dataset(\"/Users/neil/test_dataset_nc4.nc\", 'r') as nc_data:\n    print(nc_data.variables)\n```\n\n*Example 8: Read a CFA-netCDF file from S3 storage*\u003ca name=example-8\u003e\u003c/a\u003e\n```\nfrom S3netCDF4._s3netCDF4 import s3Dataset as Dataset\nwith Dataset(\"s3://tenancy-0/test_bucket/test_dataset_s3_cfa3.nc\", 'r') as nc_data:\n    print(nc_data.variables)\n```\n\nUpon reading a CFA-netCDF file, the **master-array** file is interpreted to\ntransform the metadata in the file (for CFA `\"v0.4\"`), or the information in\nthe CFA group for the variable (for CFA `\"v0.5\"`) into the **partition\nmatrix**.  See\n[CFA-netCDF files](#cfa-netcdf-files) for more information.  \nPart of this transformation involves creating an instance of the `s3Variable`\nclass for each variable in the CFA-netCDF file.  The `s3Variable` class\ncontains `_nc_var`: the instance of the standard `netCDF4.Variable` object;\n`_cfa_var`: an instance of `CFAVariable`, containing information about the CFA\n**sub-array** associated with this variable; and `_cfa`: an instance of\n`CFADataset`, containing information about the CFA **master-array** file that\ncontains this variable.\nThe metadata, or CFA group, in the **master-array** file is parsed to generate\nthe two CFA objects (`_cfa_var` and `_cfa`).  These two objects will be used when a user calls a slice\noperation on an `s3Variable` object.\n\n[[Top]](#contents)\n\n### Reading variables\n\nIn v2.0.x, the s3netCDF4 API now matches the standard netCDF4 python API in\nreading variable names and variables.  
Previously, two extra functions were used\n(`variables()`, and `getVariable()`).  During the rework, a way was found to\nprovide 100% compatibility with the netCDF4 python API.  This is reflected in\nthe method of handling variables:\n\n1. `s3Dataset.variables`, or `s3Group.variables` : returns a list of variables\nin the Dataset.\n2. `s3Dataset.variables[\u003cvariable_name\u003e]`, or\n`s3Group.variables[\u003cvariable_name\u003e]` : returns the `s3netCDF4.s3Variable`\ninstance for `\u003cvariable_name\u003e` if the variable is a **master-array** in a\nCFA-netCDF file, or a `netCDF4.Variable` instance if it is a dimension variable, or\na variable in a standard netCDF file.\n\n*Example 9: Read a netCDF file from disk and get the \"field8\" variable*\u003ca name=example-9\u003e\u003c/a\u003e\n```\nfrom S3netCDF4 import s3Dataset as Dataset\nwith Dataset(\"/Users/neil/test_dataset_nc4.nc\") as src_file:\n    print(src_file.variables)\n    src_var = src_file.variables[\"field8\"]\n    print(type(src_var))\n```\n[[Top]](#contents)\n\n### Reading metadata\n\nReading metadata from the Variables or Dataset (global metadata) is done in\nexactly the same way as in the standard netCDF4 python package, by querying the\nmember variable of either a Variable or Dataset.  The `ncattrs` and `getncattr`\nmember functions of the `Dataset` and `Variable` classes are also supported.\n\n*Example 10: Read a netCDF file, a variable and its metadata*\u003ca name=example-10\u003e\u003c/a\u003e\n```\nfrom S3netCDF4 import s3Dataset as Dataset\nwith Dataset(\"/Users/neil/test_dataset_nc4.nc\") as src_file:\n    print(src_file.ncattrs())\n    src_var = src_file.variables[\"field8\"]\n    print(src_var.ncattrs())\n    print(src_var.units)\n    print(src_var.getncattr(\"units\"))\n    print(src_file.author)\n    print(src_file.getncattr(\"author\"))\n```\n\n[[Top]](#contents)\n\n### Reading field data\n\nReading field data in S3netCDF4 follows the same principles as writing the\ndata:\n1. 
If the file is determined to have `format=NETCDF3_CLASSIC` or\n`format=NETCDF4` then it is read in and the field data is made available in\nthe same manner as the standard netCDF4-python package.  If the file\nresides on S3 storage, then the entire file will be streamed to memory.  If\nit is larger than the `resource_allocation: memory` setting in\n`~/.s3nc.json`, or larger than the available memory, then a memory error will\nbe returned.\n2. If the file is determined to have `format=CFA3` or `format=CFA4` then just\nthe **master-array** file is read in and any field data will only be read\nwhen the `[]` operator (`__getitem__`) is called on a `s3Variable` instance.  \nUpon opening the **master-array** file:\n3. if the file is `\"v0.4\"` of the CFA conventions, the CFA metadata is taken\nfrom the variable metadata. The **partition-matrix** is constructed (see [File\nsplitting algorithm](#file-splitting-algorithm)) internally as a netCDF group\nwith dimensions and variables containing the partition information.\n4. if the file is CFA `\"v0.5\"`, then the **partition-matrix** is read in\ndirectly from the Groups, Dimensions and Variables in the file, without any\nparsing having to take place.\n5. the `_cfa`, `_cfa_grp`, `_cfa_dim` and `_cfa_var` objects are created as\nmember variables of the `s3Dataset`, `s3Group`, `s3Dimension` and `s3Variable`\nobjects respectively.  These are instances of `CFADataset`, `CFAGroup`,\n`CFADimension` and `CFAVariable` respectively.  
The **partition-matrix** is\ncontained within a `netCDF4.Group` within the `_cfa_var` instance of\n`CFAVariable`\n\nInternally, the **partition-matrix** consists of a netCDF group, which itself\ncontains the dimensions of the partition-matrix, and variables containing the\npartition information.\nWithin the s3Dataset, s3Variable and s3Group objects, there are objects that\ncontain higher level CFA data, and the methods to operate on that data.\nThis information is used when a user slices the field data to determine which\n**sub-array** files are read and which portion of the **sub-array** files are\nincluded in the slice:\n\n1. A `CFADataset` as the top level container:\n    1. A number of `CFAGroup`s: information about groups in the file.  There\n    is always at least one group: the `root` group is explicit in its\n    representation in the `CFADataset`.  Within the `CFAGroup` there are:\n        1. A number of `CFADim`s : information about the dimensions in the\n        Dataset\n        2. A number of `CFAVariable`s : information about the variables in\n        the Dataset, which contains:\n            1. The **partition-matrix** which consists of a netCDF group\n            containing:\n                1. the scalar dimensions, with no units or associated\n            dimension variable\n                2. the variables containing the partition information:\n                    1. `pmshape` : the shape of the **partition-matrix**\n                    2. `pmdimensions` : the dimensions in the **master-array**\n                    file which the partition matrix acts over.\n                    3. `index` : the index in the **partition-matrix**.  This\n                    is implied by the location in the **partition-matrix** but\n                    it is retained to detect erroneous lookups by the slicing\n                    algorithm.\n                    4. `location` : the location in the **master-array file**\n                    5. 
`ncvar` : the name of the variable in the **sub-array**\n                    file.\n                    6. `file` : the URL or path of the **sub-array** file.\n                    7. `format` : the format of the **sub-array** file.\n                    8. `shape` : the shape of the **sub-array** file.\n\n            2. Methods to act upon the variable and its **partition-matrix**,\n            including:\n                1. `__getitem__` : returns the necessary information to\n                read and write **sub-array** files.\n                2. `getPartition` : return a user-readable version of a\n                partition (a single element in the **partition-matrix**) as a\n                Python named tuple, rather than a netCDF Group or Variable.\n\nReading a slice of field data from a variable in the **master-array** file,\nvia `__getitem__`, consists of six operations:\n\n1. If the total size of the requested slice is greater than\n`resource_allocation: memory` (or the available memory) then a Numpy memory\nmapped array is created in the location indicated by the `cache_location:`\nsetting in the `~/.s3nc.json` config file.\n\n2. Determine which of the **sub-arrays** overlap with the slice, by querying\nthe **partition-matrix**.  This is currently done by a simple arithmetic\noperation that relies on the **partitions** all being the same size.\n\n3. Calculate the source and target slices, the source being the **sub-array**\nand the target (the **master-array**) a memory-mapped Numpy array with a shape\nequal to the user supplied slice.  The location of this **sub-array** in the\n**master-array** is given by the **partition** containing the **sub-array**,\nwhich gives the slice into the **master-array**.  However, both the slice of\nthe **sub-array** and **master-array** may need to be altered if the user\nsupplied slice does not encapsulate the whole **sub-array**, for example if a\nrange of timesteps is taken.\n\n4. 
For each of the **sub-arrays** the file specified by the `file` variable in\nthe **partition** information is opened.  If the file is on disk, it is simply\nopened in the same way as a standard netCDF4 python file.  If it is on a remote\nfile system, such as S3, then it is streamed into memory.  If the size of the\n**sub-array** file will cause the currently used amount\nof memory to exceed the `resource_allocation: memory` setting in `~/.s3nc.json`\nthen some files may be shuffled out of memory.  See the [Resource\nUsage](#resource) section above.\nIf, even after the file shuffling has occurred, the size of the **sub-array**\ncannot be contained in memory then a memory error will occur.\n\n5. A netCDF4-python `Dataset` object is opened from the downloaded file or\nstreamed memory.\n\n6. The values in the **sub-array** are copied to the **master-array** (the\nmemory-mapped Numpy array) using the source (**sub-array**) slice and the\ntarget (**master-array**) slice.\n\n*Currently the reading of data is performed asynchronously, using\naiobotocore.  S3netCDF4 allows parallel workflows using multi-processing or\nDask, by using the CFA information stored in the CFADataset, CFAGroup,\nCFADimension and CFAVariable classes.  
Examples of this will follow*\n\n## List of examples\n\n* [Example 1: open a netCDF file from a S3 storage using the alias \"tenancy-0\"](#example-1)\n* [Example 2: Create a netCDF4 file in the filesystem](#example-2)\n* [Example 3: Create a CFA-netCDF4 file in the filesystem with CFA version 0.5](#example-3)\n* [Example 4: Create a CFA-netCDF4 file in the filesystem](#example-4)\n* [Example 5: Create a CFA-netCDF3 file on S3 storage](#example-5)\n* [Example 6: creating dimensions and variables](#example-6)\n* [Example 7: creating variables with metadata](#example-7)\n* [Example 8: Read a netCDF file from disk](#example-8)\n* [Example 9: Read a netCDF file from disk and get the \"field8\" variable](#example-9)\n* [Example 10: Read a netCDF file, a variable and its metadata](#example-10)\n\n[[Top]](#contents)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcedadev%2Fs3-netcdf-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcedadev%2Fs3-netcdf-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcedadev%2Fs3-netcdf-python/lists"}