{"id":15014100,"url":"https://github.com/jonasrenault/cprex","last_synced_at":"2026-02-23T01:32:42.125Z","repository":{"id":245307401,"uuid":"768101109","full_name":"jonasrenault/cprex","owner":"jonasrenault","description":"Chemical Properties Relation Extraction","archived":false,"fork":false,"pushed_at":"2024-07-03T12:06:24.000Z","size":1928,"stargazers_count":1,"open_issues_count":3,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-07T02:23:19.813Z","etag":null,"topics":["chemistry","crawler","deep-learning","information-extraction","machine-learning","named-entity-recognition","nlp","pubchem","relation-extraction","scientific-articles","spacy","transformers"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jonasrenault.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-06T13:24:11.000Z","updated_at":"2024-07-03T12:06:28.000Z","dependencies_parsed_at":"2024-09-30T05:40:45.604Z","dependency_job_id":"3e507774-b153-4de1-a833-205e3f8a859a","html_url":"https://github.com/jonasrenault/cprex","commit_stats":{"total_commits":69,"total_committers":1,"mean_commits":69.0,"dds":0.0,"last_synced_commit":"35923196aa5fd438ae6c89f5450c90d590bddbe1"},"previous_names":["jonasrenault/cprex"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/jonasrenault/cprex","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonasrenault%2Fcprex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/jonasrenault%2Fcprex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonasrenault%2Fcprex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonasrenault%2Fcprex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jonasrenault","download_url":"https://codeload.github.com/jonasrenault/cprex/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonasrenault%2Fcprex/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29734468,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-22T20:09:16.275Z","status":"ssl_error","status_checked_at":"2026-02-22T20:09:13.750Z","response_time":110,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chemistry","crawler","deep-learning","information-extraction","machine-learning","named-entity-recognition","nlp","pubchem","relation-extraction","scientific-articles","spacy","transformers"],"created_at":"2024-09-24T19:45:11.364Z","updated_at":"2026-02-23T01:32:42.095Z","avatar_url":"https://github.com/jonasrenault.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CPREx - Chemical Properties Relation 
Extraction\n\n[![License](https://img.shields.io/badge/License-MIT-yellow)](LICENSE)\n![python_version](https://img.shields.io/badge/Python-%3E=3.11-blue)\n\nCPREx is an end-to-end tool for Named Entity Recognition (NER) and Relation Extraction (RE) specifically designed for chemical compounds and their properties. The goal of the tool is to identify, extract and link chemical compounds and their properties from scientific literature. For ease of use, CPREx provides a custom [spaCy](https://spacy.io/) pipeline to perform NER and RE.\n\nThe pipeline performs the following steps:\n\n```mermaid\nflowchart LR\n    crawler(\"`**crawler**\n    fetch PDF articles\n    from online archives`\")\n    parser(\"`**parser**\n    Extract text\n    from PDF`\")\n    crawler --\u003e parser\n    parser --\u003e ner\n    ner(\"`**NER**\n    extract named\n    entities`\")\n    ner --\u003e chem[\"`**Chem**\n    *1,3,5-Triazine*\n    *Zinc bromide*\n    *C₃H₄N₂*`\"] --\u003e rel\n    ner --\u003e prop[\"`**Property**\n    *fusion enthalpy*\n    *Tc*`\"] --\u003e rel\n    ner --\u003e quantity[\"`**Value**\n    *169°C*\n    *21.49 kJ/mol*`\"] --\u003e rel\n    rel(\"`**Relation Extraction**\n    link entities`\")\n    rel --\u003e res\n    res(\"`**(Chem, Property, Value)**\n    *2,2'-Binaphthalene, ΔHfus, 38.9 kJ/mol*`\")\n```\n\n## Installation\n\nCPREx requires a recent version of Python (**Python \u003e=3.11**). Make sure to install CPREx in a virtual environment of your choice.\n\nCPREx depends on [GROBID](https://github.com/kermitt2/grobid) and its extension [grobid-quantities](https://github.com/lfoppiano/grobid-quantities) for parsing PDF documents and extracting quantities from their text. To install and run GROBID, a JDK must also be installed on your machine. 
[GROBID currently supports](https://grobid.readthedocs.io/en/latest/Install-Grobid/) JDK versions **11 to 17**.\n\n### Install via PyPI\n\nYou can install CPREx directly with pip:\n\n```console\npip install cprex\n```\n\n### Install from GitHub\n\nThis installation is recommended for users who want to customize the pipeline or train models on their own dataset.\n\nClone the repository and install the project in your Python environment, either using `pip`\n\n```console\ngit clone git@github.com:jonasrenault/cprex.git\ncd cprex\npip install --editable .\n```\n\nor [poetry](https://python-poetry.org/)\n\n```console\ngit clone git@github.com:jonasrenault/cprex.git\ncd cprex\npoetry install\n```\n\n### Install grobid and models\n\n#### Installing and running grobid\n\nCPREx depends on [GROBID](https://github.com/kermitt2/grobid) and its extension [grobid-quantities](https://github.com/lfoppiano/grobid-quantities) for parsing PDF documents and extracting quantities from their text. For convenience, CPREx provides a command line interface (CLI) to install grobid and start a grobid server.\n\nRun\n\n```console\ncprex install-grobid\n```\n\nto install a grobid server and the grobid-quantities extension (by default, grobid and models required by CPREx are installed in a `.cprex` directory in your home directory).\n\nRun\n\n```console\ncprex start-grobid\n```\n\nto start a grobid server and enable parsing of PDF documents from CPREx.\n\n#### Installing NER and RE models\n\nTo perform Named Entity Recognition (NER) of chemical compounds and Relation Extraction (RE), CPREx requires some pretrained models. These models can be installed by running\n\n```console\ncprex install-models\n```\n\nThis will install a [PubMedBERT model](https://ftp.ncbi.nlm.nih.gov/pub/lu/BC7-NLM-Chem-track/) fine-tuned on the NLM-CHEM corpus for extraction of chemical named entities. 
This model was fine-tuned by the [BioCreative VII track](https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-2/).\n\nIt will also install a [RE model](https://github.com/jonasrenault/cprex/releases/tag/v0.4.0) pre-trained on our own annotated dataset.\n\n#### Installing a base spaCy model\n\nA base [spaCy model](https://github.com/explosion/spacy-models/releases), such as `en_core_web_sm`, is required for tokenization, lemmatization, and all other features offered by spaCy. To install a spaCy model, you can run\n\n```console\npython -m spacy download en_core_web_sm\n```\n\nor directly install it with pip by specifying the model's URL\n\n```console\npip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz\n```\n\n## Run CPREx\n\n### Using Docker\n\nThe easiest way to run CPREx is to use [Docker](https://docs.docker.com/) to start a container running CPREx. Refer to the [official documentation](https://docs.docker.com/get-docker/) for instructions on how to install Docker on your system.\n\nYou can then pull the CPREx image from GitHub's container registry with\n\n```console\ndocker pull ghcr.io/jonasrenault/cprex:latest\n```\n\nYou can start a container running this image with\n\n```console\ndocker run -t --rm -p 80:8501 ghcr.io/jonasrenault/cprex:latest\n```\n\nNote that the image is only built for amd64 architectures. Add `--platform=linux/amd64` if running on an ARM architecture.\n\nOnce the container is started, you can access CPREx's streamlit UI by opening a browser at the [http://localhost](http://localhost) URL.\n\n### Run streamlit locally\n\nCPREx provides a User Interface built with [streamlit](https://streamlit.io/). The UI lets you upload a PDF file and see the results of running CPREx on it. 
If you've cloned the CPREx repository locally and installed the project in a Python environment, you can run the UI by executing the following command, in the Python environment and from CPREx's root directory\n\n```console\nstreamlit run cprex/ui/streamlit.py\n```\n\nThis will start a web server exposing the UI at [http://localhost:8501](http://localhost:8501). Note that for CPREx's pipeline to work, you must have installed the models with the `cprex install-models` command, and started the grobid services with `cprex start-grobid`.\n\n### Notebook examples\n\nNotebook examples showing how to use CPREx directly in a Python script are available in the [notebooks](./notebooks/) directory. To run the notebooks, install [jupyterlab](https://jupyter.org/install) in your Python environment, start it with `jupyter lab`, and open one of the example notebooks. Note that for CPREx's pipeline to work, you must have installed the models with the `cprex install-models` command, and started the grobid services with `cprex start-grobid`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjonasrenault%2Fcprex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjonasrenault%2Fcprex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjonasrenault%2Fcprex/lists"}