{"id":29639207,"url":"https://github.com/neotomadb/interoperability_deepdive","last_synced_at":"2025-07-21T20:08:29.116Z","repository":{"id":266147783,"uuid":"897536521","full_name":"NeotomaDB/Interoperability_DeepDive","owner":"NeotomaDB","description":"A repository for managing the FAIROS Quaternary Interoperability project.","archived":false,"fork":false,"pushed_at":"2025-04-18T03:52:34.000Z","size":17914,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-18T16:50:34.181Z","etag":null,"topics":["neotoma","neotoma-database","network-analysis","networkx","paleoecology","xdeepdive"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NeotomaDB.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-02T19:54:56.000Z","updated_at":"2025-04-18T03:52:28.000Z","dependencies_parsed_at":"2024-12-03T02:45:34.094Z","dependency_job_id":"074ff8da-4b8d-4166-af8e-f31e1b6fceba","html_url":"https://github.com/NeotomaDB/Interoperability_DeepDive","commit_stats":null,"previous_names":["neotomadb/interoperability_deepdive"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/NeotomaDB/Interoperability_DeepDive","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2FInteroperability_DeepDive","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2FInteroperability_DeepDive/tags","releases_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2FInteroperability_DeepDive/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2FInteroperability_DeepDive/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NeotomaDB","download_url":"https://codeload.github.com/NeotomaDB/Interoperability_DeepDive/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2FInteroperability_DeepDive/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266371486,"owners_count":23918862,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-21T11:47:31.412Z","response_time":64,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["neotoma","neotoma-database","network-analysis","networkx","paleoecology","xdeepdive"],"created_at":"2025-07-21T20:08:28.942Z","updated_at":"2025-07-21T20:08:29.103Z","avatar_url":"https://github.com/NeotomaDB.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- badges: start --\u003e\n\n![lifecycle](https://img.shields.io/badge/lifecycle-active-green.svg)\n\u003c!-- badges: end --\u003e\n\n\n# Examining Interoperability Among Databases\n\nThis project leverages xDeepDive to understand the intersection between existing paleo-data community resources such as the Neotoma Paleoecology Database, the Global Paleofire Database or WorldClim. 
We use a list of terms compiled from researcher interviews that indicate likely sources of data used by researchers working in Holocene/Quaternary studies across a range of disciplines. Included in this set of terms are various tools associated with these different data resources.\n\nThe full list of terms and links used in the xDeepDive snippets search is in the [data folder](./data/merged_records.csv). We have 51 different resources identified from the interviews and 148 unique terms associated with these resources. Terms include URLs (e.g., `neotomadb.org` for the Neotoma Paleoecology Database), programming libraries (e.g., `rgbif` for the Global Biodiversity Information Facility) and alternate names, including initialisms (e.g., APD for the African Pollen Database).\n\nUsing the xDeepDive snippets API we [search for these terms](./src/interop_dd.py) to build a large table of DOIs, text snippets and database terms. Initial testing shows this table to be quite large (\u003e100k rows), in part because some resources have low specificity in their naming.\n\n## **Contributors**\n\nThis is an open project, and contributions are welcome from anyone. All contributors to this project are bound by a [code of conduct](./CODE_OF_CONDUCT.md). Please review and follow this code of conduct as part of your contribution.\n\n- [![ORCID](https://img.shields.io/badge/orcid-0000--0002--2700--4605-brightgreen.svg)](https://orcid.org/0000-0002-2700-4605) Simon Goring\n\n### Tips for Contributing\n\nIssues and bug reports are always welcome. 
Code clean-up and feature additions can be done either through pull requests to [project forks](https://github.com/NeotomaDB/Interoperability_DeepDive/network/members) or [project branches](https://github.com/NeotomaDB/Interoperability_DeepDive/branches).\n\nAll products of the Neotoma Paleoecology Database are licensed under an [MIT License](LICENSE) unless otherwise noted.\n\n## Using the Repository\n\nThe repository is coded in Python and managed using the [uv Package Manager](https://docs.astral.sh/uv/). To run a script, first clone the repository and then, at the command line, enter:\n\n```bash\nuv sync\n```\n\nto install all necessary dependencies.\n\nTo run one of the two main scripts, run either:\n\n```bash\nuv run src/interop_dd.py\n```\n\nor\n\n```bash\nuv run src/networkgraph.py\n```\n\n### `interop_dd`: Harvesting Snippets\n\nThe scripts to obtain text snippets are within the [src/interoperability_deepdive](./src/interoperability_deepdive) folder. These scripts:\n\n1. Build the API URL -- [gdd_snippets](./src/interoperability_deepdive/gdd_snippets.py)\n2. Page through the results list -- [gddURLcall](./src/interoperability_deepdive/gddURLcall.py)\n3. Process the JSON response from the API -- [process_hits](./src/interoperability_deepdive/process_hits.py)\n\nThe resulting object is a `list` of `dict` items, structured to be written to a CSV file with the following structure:\n\n| DOI | highlight | title | resource |\n| --- | --------- | ----- | -------- |\n| 10.1016/j.epsl.2018.10.016 | \"identiﬁed using several publications (see supplementary information), the African Pollen Database, and\" | The roles of climate and human land-use in the late Holocene rainforest crisis of Central Africa | African Pollen Database |\n| 10.1016/j.crte.2008.12.009 | cm2 pe r year. 
The determination of 116 pollen taxa was made using the African Pollen Database reference | Climate and environmental change at the end of the Holocene Humid Period: A pollen record off Pakistan | African Pollen Database |\n\nFrom this table we can manually examine records to assess match quality.\n\n## Processing Results\n\nTo effectively build the network model for these resources we look to _co-occurrence_ of resources in publications. For example, the rows:\n\n| DOI | highlight | title | resource |\n| --- | --------- | ----- | -------- |\n| 10.1016/j.tree.2010.10.007 | databases, most notably those of North American Pollen Database (NAPD), European Pollen Database (EPD) | Exploring vegetation in the fourth dimension | European Pollen Database |\n| 10.1016/j.tree.2010.10.007 | reviewed data from 36 beetle assemblages from Britain that are held in the BugsCEP database (http://www.bugscep.com) and exploited the specific | Exploring vegetation in the fourth dimension | BugsCEP |\n\nwould indicate co-citation of the European Pollen Database and BugsCEP. A significant challenge in this analysis is knowing whether co-citation explicitly includes co-analysis of data (which may involve cross-walking and data translation). 
This work is challenging because of the structure of the xDeepDive API (which only returns individual \"snippets\", or sentences) and because of citation and data use patterns in publication.\n\nUltimately, the scripts first search for text strings, and then process the returned results (in the `data/xdd_results` folder) into a single CSV file containing only DOIs that report more than one data resource.\n\nThe output data -- [`data/doi_centric.json`](./data/doi_centric.json) -- is a JSON array in which each entry has the following structure:\n\n```json\n[\n  {\n    \"doi\": {\n        \"type\": \"string\",\n        \"pattern\": \"/^10.\\d{4,9}/[-._;()/:A-Z0-9]+$/i\"\n    },\n    \"resources\": {\n        \"type\": \"array\",\n        \"minItems\": 2,\n        \"items\": {\n            \"type\": \"string\"\n        }\n    }\n  }\n]\n```\n\n## Statistical Analysis\n\nWe care about several key measures:","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneotomadb%2Finteroperability_deepdive","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fneotomadb%2Finteroperability_deepdive","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneotomadb%2Finteroperability_deepdive/lists"}