https://github.com/izuna385/pubtator-multiprocess-parser
A PubTator parser built specifically for entity linking. A quick demo with the MedMentions and NCBI datasets is also included.
- Host: GitHub
- URL: https://github.com/izuna385/pubtator-multiprocess-parser
- Owner: izuna385
- License: mit
- Created: 2020-05-24T05:58:00.000Z
- Default Branch: master
- Last Pushed: 2021-03-19T16:26:20.000Z
- Last Synced: 2025-03-28T10:50:34.338Z
- Topics: allennlp, bioinformatics, entity-disambiguation, entity-linking, natural-language-processing, pubtator, spacy
- Language: Python
- Homepage: https://qiita.com/izuna385/items/d673694d25b2cf4efb89
- Size: 1.31 MB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Multiprocessing PubTator Parsing for Entity Linking
## Quick Start with MedMentions, BC5CDR, and the NCBI dataset
```
$ git clone https://github.com/izuna385/PubTator-Multiprocess-Parser.git
$ cd PubTator-Multiprocess-Parser
$ docker build -t multiprocess_pubtator .
$ docker run -itd multiprocess_pubtator /bin/bash
# In container
$ sh ./scripts/quick_start_Med_full.sh # for MedMentions
```
* You can also run `quick_start_NCBI_full.sh`. Before doing so, empty `pickled_doc_dir`.
* Note: On macOS, run `brew install wget` before running the script above.
## Description
* Preprocesses PubTator-format documents into individual mentions.
* If you read Japanese, this article may be useful:
https://qiita.com/izuna385/items/d673694d25b2cf4efb89
# How to run
* Note: The quick-start scripts automate all of the following steps.
After building the container, run `sh ./scripts/quick_start_[dataset_name]_full.sh`.
## 1. Place PubTator-format files in `./dataset/`
* `corpus_pubtator.txt`, `corpus_pubtator_pmids_trng.txt`, `corpus_pubtator_pmids_dev.txt`,
and `corpus_pubtator_pmids_test.txt` must be placed there.
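
For readers unfamiliar with the PubTator format, here is a minimal sketch of how such a file can be read. It is not the code in this repository's `main.py`. It assumes the common layout: a `PMID|t|title` line, a `PMID|a|abstract` line, then tab-separated annotation lines with six columns (PMID, start, end, mention text, type, concept ID). The names `Mention`, `Document`, `parse_pubtator`, and the sample record are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Mention:
    start: int        # character offset into title + one separator char + abstract
    end: int          # exclusive end offset
    text: str         # surface form of the mention
    sem_type: str     # semantic type column
    concept_id: str   # knowledge-base identifier column


@dataclass
class Document:
    pmid: str
    title: str = ""
    abstract: str = ""
    mentions: List[Mention] = field(default_factory=list)


def parse_pubtator(text: str) -> Dict[str, Document]:
    """Parse PubTator-format text into documents keyed by PMID (illustrative)."""
    docs: Dict[str, Document] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        head = line.split("|", 2)
        # Title/abstract lines look like "PMID|t|..." or "PMID|a|...".
        if len(head) == 3 and head[1] in ("t", "a") and "\t" not in head[0]:
            pmid, kind, body = head
            doc = docs.setdefault(pmid, Document(pmid))
            if kind == "t":
                doc.title = body
            else:
                doc.abstract = body
            continue
        cols = line.split("\t")
        if len(cols) >= 6:  # shorter lines (e.g. relation lines) are skipped
            pmid, start, end, surface, sem_type, concept_id = cols[:6]
            docs.setdefault(pmid, Document(pmid)).mentions.append(
                Mention(int(start), int(end), surface, sem_type, concept_id)
            )
    return docs


# Made-up sample record, for illustration only.
SAMPLE = (
    "12345|t|Aspirin and headache\n"
    "12345|a|Aspirin relieves headache in adults.\n"
    "12345\t0\t7\tAspirin\tChemical\tD001241\n"
    "12345\t12\t20\theadache\tDisease\tD006261\n"
    "\n"
)

if __name__ == "__main__":
    for doc in parse_pubtator(SAMPLE).values():
        full_text = doc.title + " " + doc.abstract
        for m in doc.mentions:
            print(doc.pmid, m.concept_id, repr(full_text[m.start:m.end]))
```

Mention offsets index into the title and abstract joined by a single character, which is why the usage lines rebuild `full_text` before slicing.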
## 2. Run
`python3 main.py`
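
The general idea of processing documents in parallel can be sketched with the standard library's `multiprocessing.Pool`. This is only an illustration of the technique, not the contents of `main.py`: `preprocess` is a hypothetical stand-in for the real per-document work, and its naive sentence split is a placeholder for the spaCy pipeline.

```python
from multiprocessing import Pool
from typing import Dict, List, Tuple


def preprocess(doc: Tuple[str, str]) -> Dict[str, object]:
    """Hypothetical per-document step; a real pipeline would run an NLP model here."""
    pmid, text = doc
    # Naive sentence split, used only to give the worker something to do.
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    return {"pubmed_id": pmid, "n_sentences": len(sentences)}


def preprocess_all(docs: List[Tuple[str, str]], processes: int = 2) -> List[Dict[str, object]]:
    """Distribute documents over worker processes; results keep the input order."""
    with Pool(processes=processes) as pool:
        return pool.map(preprocess, docs)


if __name__ == "__main__":
    # Made-up documents, for illustration only.
    docs = [
        ("1", "First sentence. Second sentence."),
        ("2", "Only one sentence."),
    ]
    for result in preprocess_all(docs):
        print(result)
```

Because each document is independent, one worker per document is a natural unit of parallelism. Memory use grows with the number of workers when each one loads its own model.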
## 3. Check
* Each PubTator document is preprocessed and dumped to `./dataset/**pmid**.pkl`.
The format is as follows.
```
{'title':title,
'abst':abst,
'title_plus_abst': title_plus_abst,
'pubmed_id': pubmed_id,
'entities': entities,
'split_sentence': splitted_sentence,
'if_txt_length_is_changed_flag':if_txt_lenght_is_changed_flag,
'lines':lines,
'lines_lemma':lines_lemma
}
```
* The key component is `lines`, which contains all the information needed for entity linking.
* Preprocessing takes about 100 seconds per document with the `en_core_sci_md` model.
* With 24 CPU cores and the `en_core_sci_md` model, about 10 GB of RAM is needed.
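
To inspect a dumped document, something like the following should work. It is a small sketch that assumes the file is a plain pickled `dict` with the keys listed above. `load_document` and the example path are hypothetical, and the inner structure of `lines` is not assumed here.

```python
import pickle
from pathlib import Path
from typing import Any, Dict

# Keys taken from the format description above.
EXPECTED_KEYS = {
    "title", "abst", "title_plus_abst", "pubmed_id", "entities",
    "split_sentence", "if_txt_length_is_changed_flag", "lines", "lines_lemma",
}


def load_document(path) -> Dict[str, Any]:
    """Load one pickled document and check that the expected keys are present."""
    with open(path, "rb") as f:
        doc = pickle.load(f)
    missing = EXPECTED_KEYS - set(doc)
    if missing:
        raise KeyError(f"missing keys: {sorted(missing)}")
    return doc


if __name__ == "__main__":
    path = Path("./dataset/12345.pkl")  # hypothetical PMID, for illustration
    if path.exists():
        doc = load_document(path)
        print(doc["pubmed_id"], doc["title"])
        print(type(doc["lines"]))  # inspect the structure before relying on it
```

Only unpickle files you produced yourself, since `pickle.load` can execute arbitrary code from untrusted input.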
# LICENSE
MIT