{"id":29639205,"url":"https://github.com/neotomadb/metaextractor","last_synced_at":"2025-07-21T20:08:27.174Z","repository":{"id":166943054,"uuid":"638558780","full_name":"NeotomaDB/MetaExtractor","owner":"NeotomaDB","description":"A repository for the UBC MDS Capstone team to develop a metadata extractor for Neotoma","archived":false,"fork":false,"pushed_at":"2025-03-26T01:37:21.000Z","size":54299,"stargazers_count":8,"open_issues_count":6,"forks_count":3,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-13T04:49:19.574Z","etag":null,"topics":["crossref","doi","machine-learning","metadata-extraction","neotoma","nlp","relevance","xdd"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NeotomaDB.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-05-09T15:53:30.000Z","updated_at":"2024-10-14T15:23:25.000Z","dependencies_parsed_at":null,"dependency_job_id":"4223ecdc-a8f4-46c0-9fb3-3470103447bb","html_url":"https://github.com/NeotomaDB/MetaExtractor","commit_stats":null,"previous_names":["neotomadb/metaextractor"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/NeotomaDB/MetaExtractor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2FMetaExtractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2FMetaExtractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2FMetaE
xtractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2FMetaExtractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NeotomaDB","download_url":"https://codeload.github.com/NeotomaDB/MetaExtractor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2FMetaExtractor/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266371486,"owners_count":23918862,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-21T11:47:31.412Z","response_time":64,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crossref","doi","machine-learning","metadata-extraction","neotoma","nlp","relevance","xdd"],"created_at":"2025-07-21T20:08:26.985Z","updated_at":"2025-07-21T20:08:27.166Z","avatar_url":"https://github.com/NeotomaDB.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Contributors][contributors-shield]][contributors-url]\n[![Forks][forks-shield]][forks-url]\n[![Stargazers][stars-shield]][stars-url]\n[![Issues][issues-shield]][issues-url]\n[![MIT License][license-shield]][license-url]\n[![codecov][codecov-shield]][codecov-url]\n\n![Banner](assets/ffossils-logo-text.png)\n# **MetaExtractor: Finding Fossils in the Literature**\n\nThis project aims to identify research articles which are relevant to the [_Neotoma Paleoecological 
Database_](http://neotomadb.org) (Neotoma), extract data relevant to Neotoma from the article, and provide a mechanism for the data to be reviewed by Neotoma data stewards then submitted to Neotoma. It is being completed as part of the _University of British Columbia (UBC)_ [_Masters of Data Science (MDS)_](https://masterdatascience.ubc.ca/) program in partnership with the [_Neotoma Paleoecological Database_](http://neotomadb.org).\n\n**Table of Contents**\n\n- [**MetaExtractor: Finding Fossils in the Literature**](#metaextractor-finding-fossils-in-the-literature)\n  - [About](#about)\n    - [Article Relevance Prediction](#article-relevance-prediction)\n    - [Data Extraction Pipeline](#data-extraction-pipeline)\n    - [Data Review Tool](#data-review-tool)\n  - [How to use this repository](#how-to-use-this-repository)\n    - [Data Review Tool](#data-review-tool-1)\n    - [Article Relevance \\\u0026 Entity Extraction Model](#article-relevance--entity-extraction-model)\n    - [Data Requirements](#data-requirements)\n      - [Article Relevance Prediction](#article-relevance-prediction-1)\n      - [Data Extraction Pipeline](#data-extraction-pipeline-1)\n    - [System Requirements](#system-requirements)\n  - [Directory Structure and Description](#directory-structure-and-description)\n  - [Contributors](#contributors)\n    - [Tips for Contributing](#tips-for-contributing)\n\nThere are 3 primary components to this project:\n\n1. **Article Relevance Prediction** - get the latest articles published, predict which ones are relevant to Neotoma and submit for processing.\n2. **Data Extraction Pipeline** - extract relevant entities from the article including geographic locations, taxa, etc.\n3. 
**Data Review Tool** - take the extracted data and allow the user to review and correct it for submission to Neotoma.\n\n\u003cp align=\"center\"\u003e\n   \u003cimg src=\"assets/project-flow-diagram.png\"  width=\"800\"\u003e  \n\u003c/p\u003e\n\n## **About**\n\nInformation on each component is outlined below.\n\n### **Article Relevance Prediction**\n\nThe goal of this component is to monitor and identify new articles that are relevant to Neotoma. This is done by using the public [xDD API](https://geodeepdive.org/) to regularly get recently published articles. Article metadata is queried from the [CrossRef API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) to obtain data such as journal name, title, abstract and more. The article metadata is then used to predict whether the article is relevant to Neotoma or not.\n\nThe model was trained on ~900 positive examples (a sample of articles currently contributing to Neotoma) and ~3500 negative examples (a sample of articles unrelated or closely related to Neotoma). A logistic regression model was chosen for its strong performance and interpretability.\n\nArticles predicted to be relevant will then be submitted to the Data Extraction Pipeline for processing.\n\n\u003cp align=\"center\"\u003e\n   \u003cimg src=\"assets/article_prediction_flow.png\"  width=\"800\"\u003e  \n\u003c/p\u003e\n\nTo run the Docker image for the article relevance prediction pipeline, please refer to the instructions [here](docker/article-relevance/README.md).\n\nThe model can be retrained using reviewed article data. 
Please refer to the instructions [here](docker/article-relevance-retrain/README.md).\n\n### **Data Extraction Pipeline**\n\nThe full text is provided by the xDD team for the articles that are deemed to be relevant, and a custom-trained **Named Entity Recognition (NER)** model is used to extract entities of interest from each article.\n\nThe entities extracted by this model are:\n\n- **SITE**: name of the excavation site\n- **REGION**: more general region names to provide context for where sites are located\n- **TAXA**: plant or animal fossil names\n- **AGE**: historical age of the fossils, e.g. 1234 AD, 4567 BP\n- **GEOG**: geographic coordinates indicating the location of the site, e.g. 12'34\"N 34'23\"W\n- **EMAIL**: researcher emails referenced in the articles\n- **ALTI**: altitudes of sites, e.g. 123 m a.s.l. (above sea level)\n\nThe model was trained on ~40 existing paleoecology articles, manually annotated by the team, comprising **~60,000 tokens** with **~4,500 tagged entities**.\n\nThe trained model is available for inference and further development on huggingface.co [here](https://huggingface.co/finding-fossils/metaextractor).\n\n\u003cp align=\"center\"\u003e\n   \u003cimg src=\"assets/hugging-face-metaextractor.png\"  width=\"1000\"\u003e  \n\u003c/p\u003e\n\n### **Data Review Tool**\n\nFinally, the extracted data is loaded into the Data Review Tool where members of the Neotoma community can review the data and make any corrections necessary before submitting to Neotoma. The Data Review Tool is a web application built using the [Plotly Dash](https://dash.plotly.com/) framework. 
The tool allows users to view the extracted data, make corrections, and submit the data to be entered into Neotoma.\n\n\u003cp align=\"center\"\u003e\n   \u003cimg src=\"assets/data-review-tool.png\"  width=\"1000\"\u003e  \n\u003c/p\u003e\n\n## How to use this repository\n\nBegin by installing the requirements.\n\nFor pip:\n\n```bash\npip install -r requirements.txt\n```\n\nFor conda:\n\n```bash\nconda env create -f environment.yml\n```\n\nIf you plan to use the pre-built Docker images, install Docker following these [instructions](https://docs.docker.com/get-docker/).\n\nTo launch the app, run the following command from the root directory of this repository:\n\n```bash\ndocker-compose up --build data-review-tool\n```\n\nOnce the image is built and the container is running, the Data Review Tool can be accessed at \u003chttp://0.0.0.0:8050/\u003e. Sample `article-relevance-output.parquet` and `entity-extraction-output.zip` files are provided for demo purposes.\n\n### **Article Relevance \u0026 Entity Extraction Model**\n\nPlease refer to the project wiki for the development and analysis workflow details: [MetaExtractor Wiki](https://github.com/NeotomaDB/MetaExtractor/wiki)\n\n### **Data Requirements**\n\nEach component of this project has different data requirements, which are outlined below.\n\n#### **Article Relevance Prediction**\n\nThe article relevance prediction component requires a list of journals that are relevant to Neotoma. The dataset used to train and develop the model is available for download [HERE](https://drive.google.com/drive/folders/1NpOO7vSnVY0Wi0rvkuwNiSo3sqq-5AkY?usp=sharing). Download all files and extract the contents into `MetaExtractor/data/article-relevance/raw/`.\n\nThe prediction pipeline requires the trained model object. The model is available [HERE](https://drive.google.com/drive/folders/1NpOO7vSnVY0Wi0rvkuwNiSo3sqq-5AkY?usp=sharing). 
Download the `.joblib` model file and place it in `MetaExtractor/models/article-relevance/`.\n\n#### **Data Extraction Pipeline**\n\nAs the full-text articles provided by the xDD team are not publicly available, we cannot create a public link to download the labelled training data. For access requests please contact Simon Goring at \u003cgoring@wisc.edu\u003e or Ty Andrews at \u003cty.elgin.andrews@gmail.com\u003e.\n\n#### **Data Review Tool**\n\nOnce the article relevance prediction and data extraction pipelines have been run, the output files can be used as input for the Data Review Tool. The Data Review Tool requires the following files:\n\n- `article-relevance-output.parquet` - output file from the article relevance prediction pipeline\n- `entity-extraction-output.zip` - output file from the data extraction pipeline\n\nThese files should be placed in a single folder. The path to that folder can be updated in the `docker-compose.yml` file; the default location is the `data/data-review-tool` directory.\n\n### **System Requirements**\n\nThe project has been developed and tested on the following systems:\n\n- macOS Monterey 12.5.1\n- Windows 11 Pro Version: 22H2\n- Ubuntu 22.04.2 LTS\n\n\nThe pre-built Docker images were built using Docker version 4.20.0 but should work with any Docker version from 4 onward.\n\n## **Directory Structure and Description**\n\n```\n├── .github/                            \u003c- Directory for GitHub files\n│   ├── workflows/                      \u003c- Directory for workflows\n├── assets/                             \u003c- Directory for assets\n├── docker/                             \u003c- Directory for docker files\n│   ├── article-relevance/              \u003c- Directory for docker files related to article relevance prediction\n│   ├── article-relevance-retrain/      \u003c- Directory for docker files related to article relevance retraining\n│   ├── data-review-tool/               \u003c- Directory for docker files related to data 
review tool\n│   ├── entity-extraction/              \u003c- Directory for docker files related to named entity recognition\n├── data/                               \u003c- Directory for data\n│   ├── entity-extraction/              \u003c- Directory for named entity extraction data\n│   │   ├── raw/                        \u003c- Raw unprocessed data\n│   │   ├── processed/                  \u003c- Processed data\n│   │   └── interim/                    \u003c- Temporary data location\n│   ├── article-relevance/              \u003c- Directory for data related to article relevance prediction\n│   │   ├── raw/                        \u003c- Raw unprocessed data\n│   │   ├── processed/                  \u003c- Processed data\n│   │   └── interim/                    \u003c- Temporary data location\n│   ├── data-review-tool/               \u003c- Directory for data related to data review tool\n├── results/                            \u003c- Directory for results\n│   ├── article-relevance/              \u003c- Directory for results related to article relevance prediction\n│   ├── ner/                            \u003c- Directory for results related to named entity recognition\n│   └── data-review-tool/               \u003c- Directory for results related to data review tool\n├── models/                             \u003c- Directory for models\n│   ├── entity-extraction/              \u003c- Directory for named entity recognition models\n│   ├── article-relevance/              \u003c- Directory for article relevance prediction models\n├── notebooks/                          \u003c- Directory for notebooks\n├── src/                                \u003c- Directory for source code\n│   ├── entity_extraction/              \u003c- Directory for named entity recognition code\n│   ├── article_relevance/              \u003c- Directory for article relevance prediction code\n│   └── data_review_tool/               \u003c- Directory for data review tool code\n├── reports/          
                  \u003c- Directory for reports\n├── tests/                              \u003c- Directory for tests\n├── Makefile                            \u003c- Makefile with commands to perform analysis\n└── README.md                           \u003c- The top-level README for developers using this project.\n```\n\n## **Contributors**\n\nThis project is an open project, and contributions are welcome from any individual. All contributors to this project are bound by a [code of conduct](https://github.com/NeotomaDB/MetaExtractor/blob/main/CODE_OF_CONDUCT.md). Please review and follow this code of conduct as part of your contribution.\n\nThe UBC MDS project team consists of:\n\n- [![ORCID](https://img.shields.io/badge/orcid-0009--0003--0699--5838-brightgreen.svg)](https://orcid.org/0009-0003-0699-5838) [Ty Andrews](http://www.ty-andrews.com)\n- [![ORCID](https://img.shields.io/badge/orcid-0009--0004--2508--4746-brightgreen.svg)](https://orcid.org/0009-0004-2508-4746) Kelly Wu\n- [![ORCID](https://img.shields.io/badge/orcid-0009--0007--1998--3392-brightgreen.svg)](https://orcid.org/0009-0007-1998-3392) Shaun Hutchinson\n- [![ORCID](https://img.shields.io/badge/orcid-0009--0007--8913--2403-brightgreen.svg)](https://orcid.org/0009-0007-8913-2403) [Jenit Jain](https://www.linkedin.com/in/jenit-jain-0b31b0160/)\n\nSponsors from Neotoma supporting the project are:\n\n- [![ORCID](https://img.shields.io/badge/orcid-0000--0002--7926--4935-brightgreen.svg)](https://orcid.org/0000-0002-7926-4935) [Socorro Dominguez Vidana](https://ht-data.com/)\n- [![ORCID](https://img.shields.io/badge/orcid-0000--0002--2700--4605-brightgreen.svg)](https://orcid.org/0000-0002-2700-4605) [Simon Goring](http://www.goring.org)\n\n### Tips for Contributing\n\nIssues and bug reports are always welcome. 
Code clean-up and feature additions can be done either through pull requests to [project forks](https://github.com/NeotomaDB/MetaExtractor/network/members) or [project branches](https://github.com/NeotomaDB/MetaExtractor/branches).\n\nAll products of the Neotoma Paleoecological Database are licensed under an [MIT License](LICENSE) unless otherwise noted.\n\n[contributors-shield]: https://img.shields.io/github/contributors/NeotomaDB/MetaExtractor.svg?style=for-the-badge\n[contributors-url]: https://github.com/NeotomaDB/MetaExtractor/graphs/contributors\n[forks-shield]: https://img.shields.io/github/forks/NeotomaDB/MetaExtractor.svg?style=for-the-badge\n[forks-url]: https://github.com/NeotomaDB/MetaExtractor/network/members\n[stars-shield]: https://img.shields.io/github/stars/NeotomaDB/MetaExtractor.svg?style=for-the-badge\n[stars-url]: https://github.com/NeotomaDB/MetaExtractor/stargazers\n[issues-shield]: https://img.shields.io/github/issues/NeotomaDB/MetaExtractor.svg?style=for-the-badge\n[issues-url]: https://github.com/NeotomaDB/MetaExtractor/issues\n[license-shield]: https://img.shields.io/github/license/NeotomaDB/MetaExtractor.svg?style=for-the-badge\n[license-url]: https://github.com/NeotomaDB/MetaExtractor/blob/main/LICENSE\n[codecov-shield]: https://img.shields.io/codecov/c/github/NeotomaDB/MetaExtractor?style=for-the-badge\n[codecov-url]: https://codecov.io/gh/NeotomaDB/MetaExtractor\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneotomadb%2Fmetaextractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fneotomadb%2Fmetaextractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneotomadb%2Fmetaextractor/lists"}