{"id":15460522,"url":"https://github.com/izuna385/pubtator-multiprocess-parser","last_synced_at":"2026-05-04T02:33:05.468Z","repository":{"id":113374965,"uuid":"266479710","full_name":"izuna385/PubTator-Multiprocess-Parser","owner":"izuna385","description":"Specifically for Entity Linking. Quick demo with MedMentions and NCBI datasets is also included. ","archived":false,"fork":false,"pushed_at":"2021-03-19T16:26:20.000Z","size":1372,"stargazers_count":1,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-28T10:50:34.338Z","etag":null,"topics":["allennlp","bioinformatics","entity-disambiguation","entity-linking","natural-language-processing","pubtator","spacy"],"latest_commit_sha":null,"homepage":"https://qiita.com/izuna385/items/d673694d25b2cf4efb89","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/izuna385.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-24T05:58:00.000Z","updated_at":"2021-12-05T13:14:11.000Z","dependencies_parsed_at":"2023-06-15T18:00:29.053Z","dependency_job_id":null,"html_url":"https://github.com/izuna385/PubTator-Multiprocess-Parser","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/izuna385/PubTator-Multiprocess-Parser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/izuna385%2FPubTator-Multiprocess-Parser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/izuna385%2FPubTator-Multiprocess-Parser/tags","relea
ses_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/izuna385%2FPubTator-Multiprocess-Parser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/izuna385%2FPubTator-Multiprocess-Parser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/izuna385","download_url":"https://codeload.github.com/izuna385/PubTator-Multiprocess-Parser/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/izuna385%2FPubTator-Multiprocess-Parser/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32592518,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-03T22:12:39.696Z","status":"online","status_checked_at":"2026-05-04T02:00:06.625Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["allennlp","bioinformatics","entity-disambiguation","entity-linking","natural-language-processing","pubtator","spacy"],"created_at":"2024-10-01T23:22:20.810Z","updated_at":"2026-05-04T02:33:05.459Z","avatar_url":"https://github.com/izuna385.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Multiprocessing PubTator Parsing for Entity Linking\n## Quick Starts with MedMentions, BC5CDR and NCBI-dataset\n```\n$ git clone https://github.com/izuna385/PubTator-Multiprocess-Parser.git\n$ cd PubTator-Multiprocess-Parser\n$ docker build -t multiprocess_pubtator 
.\n$ docker run -itd multiprocess_pubtator /bin/bash\n\n# In container\n$ sh ./scripts/quick_start_Med_full.sh # for MedMentions\n```\n* You can also run `quick_start_NCBI_full.sh`. Before doing so, empty `pickled_doc_dir`.\n\n* Note: On a Mac, run `brew install wget` before running the script above.\n\n## Description\n* Preprocesses PubTator-format documents into individual mentions.\n\n* If you read Japanese, this article might be useful for you.\n  \n  https://qiita.com/izuna385/items/d673694d25b2cf4efb89\n\n# How to run\n* Note: The following steps are entirely automated. \n\n  After building the container, run `sh ./scripts/quick_start_[dataset_name]_full.sh`\n\n## 1. Place PubTator-format files in `./dataset/`\n\n* `corpus_pubtator.txt`, `corpus_pubtator_pmids_trng.txt`, `corpus_pubtator_pmids_dev.txt`, \n  and `corpus_pubtator_pmids_test.txt` must be placed there.\n  \n## 2. Run\n\n`python3 main.py`\n\n## 3. Check\n\n* Each PubTator document is preprocessed and dumped to `./dataset/**pmid**.pkl`\n\n  The format is as follows.\n  \n  ```\n  {'title':title,  \n   'abst':abst,\n   'title_plus_abst': title_plus_abst,\n   'pubmed_id': pubmed_id,\n   'entities': entities,\n   'split_sentence': splitted_sentence,\n   'if_txt_length_is_changed_flag':if_txt_lenght_is_changed_flag,\n   'lines':lines,\n   'lines_lemma':lines_lemma\n  }\n  ```\n  \n  * The key component is 'lines', which contains all the information needed for entity linking.\n\n* Each document takes about 100 seconds to preprocess with the `en_core_sci_md` model.\n\n* With 24 CPU cores and the `en_core_sci_md` model, ~10GB of RAM is needed.\n\n# LICENSE\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fizuna385%2Fpubtator-multiprocess-parser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fizuna385%2Fpubtator-multiprocess-parser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fizuna385%2Fpubtator-multiprocess-parser/lists"}