https://github.com/titipata/detecting-scientific-claim

Extracting scientific claims from biomedical abstracts (powered by AllenNLP)
https://github.com/titipata/detecting-scientific-claim

allennlp deep-learning natural-language-processing sentence-classification

Last synced: about 1 year ago
JSON representation

Extracting scientific claims from biomedical abstracts (powered by AllenNLP)

Host: GitHub
URL: https://github.com/titipata/detecting-scientific-claim
Owner: titipata
Created: 2018-06-07T18:45:50.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2021-06-02T13:37:07.000Z (about 5 years ago)
Last Synced: 2025-04-07T20:06:22.589Z (over 1 year ago)
Topics: allennlp, deep-learning, natural-language-processing, sentence-classification
Language: Python
Homepage:
Size: 1.66 MB
Stars: 142
Watchers: 9
Forks: 18
Open Issues: 8
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          # Claim Extraction for Scientific Publications

Detecting claim from scientific publication using [discourse model](https://github.com/Franck-Dernoncourt/pubmed-rct) and transfer learning. 

Models are trained using [AllenNLP](https://github.com/allenai/allennlp) library.

## Installing as a package

You can install the package using PIP, which will help you use the `discourse` classes inside a module

```bash

pip install git+https://github.com/titipata/detecting-scientific-claim.git

```

you will be able to use them as

```python

import discourse

predictor = discourse.DiscourseCRFClassifierPredictor()

```

## Training discourse model

Running AllenNLP to train a discourse model using [PubmMedRCT dataset](https://github.com/Franck-Dernoncourt/pubmed-rct) as follows

```bash

allennlp train experiments/pubmed_rct.json -s output --include-package discourse

```

We point data location to Amazon S3 directly in `pubmed_rct.json`

so you do not need to download the data locally. Change `cuda_device` to `-1` in `pubmed_rct.json`

if you want to run on CPU. There are more experiments available in `experiments` folder.

**Note** that you have to remove `output` folder first before running.

## Predicting discourse

We trained the Bidirectional LSTM model on structured abstracts from Pubmed to predict

discourse probability (`RESULTS`, `METHODS`, `CONCLUSIONS`, `BACKGROUND`, `OBJECTIVE`)

of a given sentence. You can download trained model from Amazon S3

```bash

wget https://s3-us-west-2.amazonaws.com/pubmed-rct/model.tar.gz # or model_crf.tar.gz for pretrained model with CRF layer

```

and run web service for discourse prediction task as follow

```bash

bash web_service.sh

```

To test the train model with provided examples [`fixtures.json`](pubmed-rct/PubMed_200k_RCT/fixtures.json),

simply run the following to predict labels.

```bash

allennlp predict model.tar.gz \

    pubmed-rct/PubMed_200k_RCT/fixtures.json \

    --include-package discourse \

    --predictor discourse_predictor

```

or run the following for 

```

allennlp predict model_crf.tar.gz \

    pubmed-rct/PubMed_200k_RCT/fixtures_crf.json \

    --include-package discourse \

    --predictor discourse_crf_predictor

```

To evaluate discourse model, you can run the following command

```bash

allennlp evaluate model.tar.gz \

  https://s3-us-west-2.amazonaws.com/pubmed-rct/test.json \

  --include-package discourse

```

## Predicting claim (web service)

We use transfer learning with fine tuning to train claim extraction model 

from pre-trained discourse model. The schematic of the training can be seen below.



  



You can run the demo web application to detect claims as follows

```bash

export FLASK_APP=main.py

flask run --host=0.0.0.0 # this will serve at port 5000

```

The interface will look something like this



  



And output will look something like the following (highlight means claim,

  tag behind the sentence is discourse prediction)



  



**Expertly annotated dataset** We release the dataset of annotated 1,500 abstracts containing 11,702 sentences (2,276 annotated as claim sentences) 

sampled from 110 biomedical journals. The final dataset are the majority vote from three experts. The annotations are hosted on Amazon S3 and 

can be found from these given [URLs](https://github.com/titipata/detecting-scientific-claim/blob/master/scripts/transfer_learning_crf.py#L48-L50).

## Requirements

- [Python 3.6](https://www.python.org/downloads/release/python-360/)

- [AllenNLP](https://github.com/allenai/allennlp) >= 0.6.1

- [spacy](https://github.com/explosion/spaCy)

- [fastText](https://github.com/facebookresearch/fastText)

- [Pubmed RCT](https://github.com/Franck-Dernoncourt/pubmed-rct) - dataset

## Citing the repository

You can cite our paper available on arXiv as

Achakulvisut, Titipat, Chandra Bhagavatula, Daniel Acuna, and Konrad Kording. _"Claim Extraction in Biomedical Publications using Deep Discourse Model and Transfer Learning."_ arXiv preprint arXiv:1907.00962 (2019).

or using BibTeX

```

@article{achakulvisut2019claim,

  title={Claim Extraction in Biomedical Publications using Deep Discourse Model and Transfer Learning},

  author={Achakulvisut, Titipat and Bhagavatula, Chandra and Acuna, Daniel and Kording, Konrad},

  journal={arXiv preprint arXiv:1907.00962},

  year={2019}

}

```

## Acknowledgement

This project is done at the [Allen Institute for Artificial Intelligence](https://allenai.org/)

and [Konrad Kording lab, University of Pennsylvania](http://kordinglab.com/)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/titipata/detecting-scientific-claim

Awesome Lists containing this project

README