{"id":17202863,"url":"https://github.com/kermitt2/softdata_mentions_client","last_synced_at":"2026-02-16T16:38:47.316Z","repository":{"id":148636385,"uuid":"505563403","full_name":"kermitt2/softdata_mentions_client","owner":"kermitt2","description":"Python client for software and dataset mention recognizer in scholarly publications, using the Softcite and Datastet services","archived":false,"fork":false,"pushed_at":"2022-11-13T13:43:05.000Z","size":66,"stargazers_count":2,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-07-27T21:45:04.115Z","etag":null,"topics":["dataset","pdf","python-client","scholarly-articles","software","text"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kermitt2.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-06-20T18:56:59.000Z","updated_at":"2023-02-11T06:02:28.000Z","dependencies_parsed_at":null,"dependency_job_id":"39239da9-07fb-49e3-a586-9fd7074f5fe9","html_url":"https://github.com/kermitt2/softdata_mentions_client","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/kermitt2/softdata_mentions_client","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kermitt2%2Fsoftdata_mentions_client","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kermitt2%2Fsoftdata_mentions_client/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/kermitt2%2Fsoftdata_mentions_client/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kermitt2%2Fsoftdata_mentions_client/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kermitt2","download_url":"https://codeload.github.com/kermitt2/softdata_mentions_client/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kermitt2%2Fsoftdata_mentions_client/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":276235477,"owners_count":25608050,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-21T02:00:07.055Z","response_time":72,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","pdf","python-client","scholarly-articles","software","text"],"created_at":"2024-10-15T02:16:14.788Z","updated_at":"2025-09-21T11:47:55.810Z","avatar_url":"https://github.com/kermitt2.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Software and Dataset mention recognizer client\n\nSimple Python client for using the Softcite software mention recognition service and the DataStet dataset mention recognition service. 
It can be applied to: \n\n* individual PDF files\n\n* a local directory, recursively, processing every PDF file encountered \n\n* a collection of documents harvested by [biblio-glutton-harvester](https://github.com/kermitt2/biblio-glutton-harvester) and [article-dataset-builder](https://github.com/kermitt2/article-dataset-builder), with the benefit of reusing the collection manifest for injecting metadata and keeping track of progress. The collection can be stored locally or on S3 storage. \n\nThe client can call either one of the two services or both, parallelizing queries efficiently for individual or combined services. \n\n## Requirements\n\nThe client has been tested with Python 3.6-3.8. \n\nThe client requires a working [Softcite software mention recognition service](https://github.com/ourresearch/software-mentions) and/or a working [DataStet dataset mention recognition service](https://github.com/kermitt2/datastet). Service host and port can be changed in the `config.json` file of the client. \n\nThe easiest way to run these services is to use Docker images: \n\n* Softcite software mention recognition service: \n\n* DataStet dataset mention recognition service: \n\nFor acceptable performance, these two services must typically be deployed on two different servers. For good performance, GPUs are required to speed up the Deep Learning models involved. 
\n\n## Install\n\n```console\n\u003e git clone https://github.com/softcite/softdata_mentions_client.git\n\u003e cd softdata_mentions_client/\n```\n\nIt is advised to first set up a virtual environment to avoid falling into one of these gloomy Python dependency marshlands:\n\n```console\n\u003e virtualenv --system-site-packages -p python3 env\n```\n\n```console\n\u003e source env/bin/activate\n```\n\nTo install the dependencies, use:\n\n```console\n\u003e pip3 install -r requirements.txt\n```\n\nFinally, install the project in editable mode:\n\n```console\n\u003e pip3 install -e .\n```\n\n\n## Usage and options\n\n```\nusage: client.py [-h] [--repo-in REPO_IN] [--file-in FILE_IN] [--file-out FILE_OUT]\n                 [--data-path DATA_PATH] [--config CONFIG] [--reprocess] [--reset] [--load]\n                 [--diagnostic] [--scorched-earth]\n                 target\n\nSoftware and Dataset mention recognizer client for Softcite and Datastet services\n\npositional arguments:\n  target                one of [software, dataset, all], mandatory\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --repo-in REPO_IN     path to a directory of PDF files to be processed by the Softcite\n                        software mention recognizer\n  --file-in FILE_IN     a single PDF input file to be processed by the Softcite software\n                        mention recognizer\n  --file-out FILE_OUT   path to a single output the software mentions in JSON format, extracted\n                        from the PDF file-in\n  --data-path DATA_PATH\n                        path to the resource files created/harvested by biblio-glutton-\n                        harvester\n  --config CONFIG       path to the config file, default is ./config.json\n  --reprocess           reprocessed failed PDF\n  --reset               ignore previous processing states and re-init the annotation process\n                        from the beginning\n  --load                load json 
files into the MongoDB instance, the --repo-in or --data-path\n                        parameter must indicate the path to the directory of resulting json\n                        files to be loaded, --dump must indicate the path to the json dump file\n                        of document metadata\n  --diagnostic          perform a full count of annotations and diagnostic using MongoDB\n                        regarding the harvesting and transformation process\n  --scorched-earth      remove a PDF file after its sucessful processing in order to save\n                        storage space, careful with this!\n```\n\nBy default, the logs are written to the file `./client.log`, but the location of the logs can be changed in the configuration file (default `./config.json`).\n\n### Processing local PDF files\n\nTo process a single file for both software and dataset mentions, with the resulting JSON written to the indicated output path:\n\n\u003e python3 softdata_mentions_client/client.py all --file-in toto.pdf --file-out toto.json\n\nWhen recursively processing a directory of PDF files, the results will be:\n\n* written to the MongoDB server and database indicated in the config file\n\n* *and* written as JSON files in the directory of PDF files, alongside each processed PDF\n\n\u003e python3 softdata_mentions_client/client.py all --repo-in /mnt/data/biblio/pmc_oa_dir/\n\nThe default config file is `./config.json`, but another one can be specified via the parameter `--config`: \n\n\u003e python3 softdata_mentions_client/client.py all --repo-in /mnt/data/biblio/pmc_oa_dir/ --config ./my_config.json\n\nTo process a document for software mentions only:\n\n\u003e python3 softdata_mentions_client/client.py software --file-in toto.pdf --file-out toto.json\n\nand for dataset mentions only:\n\n\u003e python3 softdata_mentions_client/client.py dataset --file-in toto.pdf --file-out toto.json\n\n\n### Processing a collection of PDF harvested by 
biblio-glutton-harvester\n\n[biblio-glutton-harvester](https://github.com/kermitt2/biblio-glutton-harvester) and [article-dataset-builder](https://github.com/kermitt2/article-dataset-builder) create a collection manifest as an LMDB database to keep track of the harvesting of large collections of files. The resources can be stored on a local file system or on AWS S3 storage. The client will use the collection manifest to process these harvested documents. \n\n* locally:\n\n\u003e python3 softdata_mentions_client/client.py all --data-path /mnt/data/biblio-glutton-harvester/data/\n\n`--data-path` indicates the path to the repository of data harvested by [biblio-glutton-harvester](https://github.com/kermitt2/biblio-glutton-harvester).\n\nThe resulting JSON files will be enriched with the metadata records of the processed PDF and will be stored together with each processed PDF in the data repository. \n\nIf the harvested collection is located on S3 storage, the access information must be indicated in the configuration file of the client `config.json`. The extracted software mentions will be written to a file with extension `.software.json` and the extracted dataset mentions to a file with extension `.dataset.json`, for example:\n\n```\n-rw-rw-r-- 1 lopez lopez 1.1M Aug  8 03:26 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.pdf\n-rw-rw-r-- 1 lopez lopez  485 Aug  8 03:41 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.software.json\n-rw-rw-r-- 1 lopez lopez  485 Aug  8 03:41 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.dataset.json\n```\n\nIf MongoDB server access information is indicated in the configuration file `config.json`, the extracted information will additionally be written to MongoDB. \n\n## License and contact\n\nDistributed under [Apache 2.0 license](http://www.apache.org/licenses/LICENSE-2.0). The dependencies used in the project are either themselves also distributed under Apache 2.0 license or distributed under a compatible license. 
\n\nIf you contribute to this project, you agree to share your contribution following these licenses. \n\nMain author and contact: Patrice Lopez (\u003cpatrice.lopez@science-miner.com\u003e)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkermitt2%2Fsoftdata_mentions_client","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkermitt2%2Fsoftdata_mentions_client","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkermitt2%2Fsoftdata_mentions_client/lists"}