{"id":29639172,"url":"https://github.com/neotomadb/databus_ostracode","last_synced_at":"2025-07-21T20:08:23.608Z","repository":{"id":268025747,"uuid":"846206388","full_name":"NeotomaDB/DataBUS_Ostracode","owner":"NeotomaDB","description":null,"archived":false,"fork":false,"pushed_at":"2025-04-22T00:23:03.000Z","size":893,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-22T01:25:02.683Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NeotomaDB.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":"code_of_conduct.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-08-22T18:25:09.000Z","updated_at":"2025-04-22T00:22:51.000Z","dependencies_parsed_at":null,"dependency_job_id":"fe20c1fa-3e2d-4c4c-966e-9a45150c60e4","html_url":"https://github.com/NeotomaDB/DataBUS_Ostracode","commit_stats":null,"previous_names":["neotomadb/databus_ostracode"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/NeotomaDB/DataBUS_Ostracode","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2FDataBUS_Ostracode","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2FDataBUS_Ostracode/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2FDataBUS_Ostracode/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2FDataBUS_Ostracode/manifests","owner_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NeotomaDB","download_url":"https://codeload.github.com/NeotomaDB/DataBUS_Ostracode/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2FDataBUS_Ostracode/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266371486,"owners_count":23918862,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-21T11:47:31.412Z","response_time":64,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-21T20:08:18.970Z","updated_at":"2025-07-21T20:08:23.596Z","avatar_url":"https://github.com/NeotomaDB.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- badges: start --\u003e\n\n![lifecycle](https://img.shields.io/badge/lifecycle-active-green.svg)\n[![NSF-1948926](https://img.shields.io/badge/NSF-1948926-blue.svg)](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1948926)\n[![NSF-2410961](https://img.shields.io/badge/NSF-2410961-blue.svg)](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2410961)\n\n\u003c!-- badges: end --\u003e\n\n## Working with the Python Data Upload Template\n\nThis set of Python scripts is intended to support the bulk upload of a set of records to Neotoma. The workflow consists of three key steps:\n\n1. Development of a data template (YAML and CSV)\n2. Template validation\n3. 
Data upload\n\nOnce these three steps are completed, the uploader pushes the template files to the `neotomaholding` database. This is a temporary database that holds data within the Neotoma Paleoecology Database system for access by Tilia. Tilia is then used to provide a final data check and upload of data to Neotoma proper.\n\n![The process of uploading records using the bulk uploader. Individuals follow the steps outlined above and described further in this README file.](img/BulkUploaderSchema.svg)\n\n## Template Development\n\nThe template is a YAML file with the following general structure for each data element:\n\n```yaml\napiVersion: neotoma v2.0\nkind: Development\nmetadata:\n  - column: Site.name\n    neotoma: ndb.sites.sitename\n    vocab: False\n    repeat: True\n    type: string\n    ordered: False\n```\n\nThe YAML file links the template CSV file (the file that will be generated by the upload team) to the Neotoma database. It acts as a crosswalk between the upload team's column names and the existing database structure.\n\nAll YAML files should begin with an `apiVersion` header that indicates we are using `neotoma v2.0`. This is the current API version for Neotoma (accessible through [api.neotomadb.org](https://api.neotomadb.org)). This field is intended to support future development of the Neotoma API.\n\nThe `kind` field indicates whether we are prepared to work with the production version of the database. Options are `development` and `production`. 
For testing purposes, all YAML files should set `kind` to `development`.\n\n## `metadata`\n\nEach entry in the `metadata` section can have the following fields:\n\n* `column`: The column of the spreadsheet that is being described.\n* `neotoma`: A database table and column combination from the database schema.\n* `vocab`: If there is a fixed vocabulary for the column, include the possible terms here.\n* `repeat`: [`true`, `false`] Is each entry unique and tied to the row (`false`, this isn't a set of repeated values), or is this a set of entries associated with the site (`true`, there is only a single value that repeats throughout)?\n* `type`: [`integer`, `numeric`, `date`] The variable type for the field.\n* `ordered`: [`true`, `false`] Does the order of the column matter?\n\n```yaml\nmetadata:\n  - column: Coordinate.precision\n    neotoma: ndb.collectionunits.location\n    vocab: ['core-site','GPS','core-site approximate','lake center']\n    repeat: True\n    type: character\n    ordered: False\n```\n\nIn this case we see that the team has chosen to create a column in their spreadsheet called `Coordinate.precision`, which is linked to the Neotoma table/column `ndb.collectionunits.location`. 
We state that it requires one term from a fixed vocabulary, the value repeats within the column, it is expected to be a `character` (as opposed to an `integer` or `numeric` value) and the order of the values does not matter.\n\nA complete list of Neotoma tables and columns is included in [`tablecolumns.csv`](docs/tablecolumns.csv), and additional support for table concepts and content can be found either in the [Neotoma Paleoecology Database Manual](https://open.neotomadb.org/manual) or in the [online database schema](https://open.neotomadb.org/dbschema).\n\nUsing the YAML template we can create complex relationships between existing data models for particular sets of records coming from individual researcher labs or data consortiums and the Neotoma database.\n\nOn completion of the YAML file, each column of the CSV will have an entry that fully describes the content of the data within that column. At that point we can validate the CSV files intended for upload.\n\n## Validation\n\nWe execute the validation process by running:\n\n```bash\n\u003e python3 template_validate.py FILEFOLDER\n```\n\nThis searches the folder provided in `FILEFOLDER` for CSV files and checks each one for validity.\n\nThe set of tests for validity depends on the data content within the YAML file, but must at least include:\n\n* Site Validation\n* Collection Unit Validation\n* Analysis Unit Validation\n* Dataset Validation\n* Dataset PI Validation\n* Sample Validation\n* Data Validation\n\nTemplates with more elements will be tested depending on the data content provided.\n\nEach file will receive an associated `log` file that contains a report of potential issues:\n\n```txt\n53f0a3feb956a4fa590a9d45b657f76e\nValidating data/FILENAME.csv\nReport for data/FILENAME.csv\n=== Checking Template Unit Definitions ===\n✔ All units validate.\n. . .\n. . 
.\n=== Checking the Dating Horizon is Valid ===\n✔  The dating horizon is in the reported depths.\n```\n\nThe log files begin with an [MD5 hash](https://en.wikipedia.org/wiki/MD5) of the CSV template file. The hash is a string of numbers and letters that records the state of the file at the time of validation. It is used to identify whether or not files have changed since validation.\n\nThe validation step identifies each element of the template being validated, provides a visual reference as to whether or not the element passes validation (**✔**, **?** or **✗**) and provides guidance as to whether changes need to be made.\n\n## Upload\n\nThe upload process is initiated using the command:\n\n```bash\n\u003e python3 template_upload.py\n```\n\nThe upload process will return the distinct `siteid`s and the related data identifiers for the uploads.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneotomadb%2Fdatabus_ostracode","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fneotomadb%2Fdatabus_ostracode","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneotomadb%2Fdatabus_ostracode/lists"}