https://github.com/saattrupdan/scholarly
Classification of scientific papers
- Host: GitHub
- URL: https://github.com/saattrupdan/scholarly
- Owner: saattrupdan
- Archived: true
- Created: 2019-04-25T08:32:56.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2023-02-15T23:07:51.000Z (over 2 years ago)
- Last Synced: 2025-07-11T15:30:29.235Z (3 months ago)
- Topics: arxiv, arxiv-api, classification, multilabel, neural-network, scientific-papers
- Language: Python
- Homepage:
- Size: 9.37 MB
- Stars: 24
- Watchers: 3
- Forks: 3
- Open Issues: 6
Metadata Files:
- Readme: README.md
README
# Scholarly
Category classification of scientific papers. Given the title and abstract of a paper, the model will predict a list of categories to which the paper belongs. These categories are the 148 categories used on [arXiv](https://arxiv.org).
## Usage
A demonstration of the model can be found at [ttilt.dk/scholarly](https://www.ttilt.dk/scholarly). Note that you're also free to write LaTeX equations like $\frac{1}{5}$.

A REST API is available at the same endpoint, taking the arguments `title` and `abstract`. You will then receive a JSON response containing a list of lists, with each inner list containing the category id, the category description and the probability. Only results with a probability of at least 50% are included, and the list is sorted by probability in descending order. [Here](https://saattrupdan.pythonanywhere.com/scholarly?title="test"&abstract="test") is an example of a query.
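This makes the API easy to call programmatically. Here is a minimal sketch using the `requests` library, with the endpoint taken from the example link above:

```python
# Query the Scholarly REST API; the endpoint is taken from the
# example link in this README.
import requests

response = requests.get(
    "https://saattrupdan.pythonanywhere.com/scholarly",
    params={"title": "test", "abstract": "test"},
)

# Each entry is [category id, category description, probability],
# sorted by probability in descending order.
for cat_id, description, probability in response.json():
    print(f"{cat_id} ({description}): {probability:.0%}")
```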
## Performance
The score that I was using was the *sample-average F1 score*: for every sample I compute the F1 score of that sample's predictions (note that we are in a [multilabel](https://en.wikipedia.org/wiki/Multi-label_classification) setup), and then average that over all the samples. In a [multiclass](https://en.wikipedia.org/wiki/Multiclass_classification) setup (in particular binary classification) this would simply correspond to accuracy. The difference is that in a multilabel setup the model can be *partially* correct, if it correctly predicts some of the categories.

The model ended up achieving a validation sample-average F1 score of ~93% on the master categories and ~65% on all the categories. Training the model requires ~17GB of memory and takes roughly a day on an Nvidia P100 GPU. The model was trained on the [BlueCrystal Phase 4 compute cluster](https://www.acrc.bris.ac.uk/acrc/phase4.htm) at the University of Bristol, UK.
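For reference, here is a minimal sketch of the sample-average F1 score using scikit-learn (an illustration of the metric, not necessarily the tooling used in this repo):

```python
# Sample-average F1: compute an F1 score per sample (row) and
# average over samples. scikit-learn is an assumption here.
import numpy as np
from sklearn.metrics import f1_score

# Rows are samples, columns are binary category indicators
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0]])

# average="samples" computes an F1 score per row, then averages
print(f1_score(y_true, y_pred, average="samples"))  # ~0.83
```

This also makes the "partially correct" point concrete: the first sample above misses one of its two true categories but still scores 2/3 rather than 0.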
## Documentation and data
This model was trained on the titles and abstracts of all [arXiv](https://arxiv.org) papers up to and including 2019, all scraped from the arXiv API. The scraping script can be found in `arxiv_scraper.py`, and a sketch of the API itself follows the data link below. All the data can be found at
https://filedn.com/lRBwPhPxgV74tO0rDoe8SpH/scholarly_data
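As a reference for what the scraper consumes, the arXiv API can be queried like this (a minimal sketch using `feedparser`; the query parameters are just an example, and this is not necessarily how `arxiv_scraper.py` works):

```python
# Minimal sketch of the arXiv API, which returns Atom feeds.
import feedparser

url = (
    "http://export.arxiv.org/api/query"
    "?search_query=cat:math.CT&start=0&max_results=3"
)
for entry in feedparser.parse(url).entries:
    categories = [tag["term"] for tag in entry.tags]
    print(entry.title, categories)
```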
The main data file is the SQLite file `arxiv_data.db`, which contains 6 tables:
- `cats`, containing the 148 arXiv categories
- `master_cats`, containing the arXiv "master categories", which are 6 aggregates of the categories into things like Mathematics and Physics
- `papers`, containing the id, date, title and abstract of papers
- `papers_cats`, which links papers to their categories
- `authors`, containing author names
- `papers_authors`, which links authors to their papers

From this database I extracted the dataset `arxiv_data.tsv`, which contains the title and abstract of every paper in the database, along with a binary column for every category, denoting whether the paper belongs to that category. LaTeX equations in titles and abstracts have been replaced by "-EQN-" at this stage. Everything related to the database can be found in the `db.py` script.
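As a quick way to get going with the database, here is a minimal sketch using the built-in `sqlite3` module; the table names come from the list above, but the column names used in the joins are assumptions:

```python
# Fetch a few (title, category) pairs. Table names are from the
# README; the join columns (id, paper_id, cat_id, cat) are guesses.
import sqlite3

conn = sqlite3.connect("arxiv_data.db")
rows = conn.execute("""
    SELECT papers.title, cats.cat
    FROM papers
    JOIN papers_cats ON papers.id = papers_cats.paper_id
    JOIN cats ON papers_cats.cat_id = cats.id
    LIMIT 5
""").fetchall()
for title, category in rows:
    print(f"{title}: {category}")
conn.close()
```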
A preprocessed dataset is also available, `arxiv_data_pp.tsv`, in which the titles and abstracts have been merged as "-TITLE_START- {title} -TITLE_END- -ABSTRACT_START- {abstract} -ABSTRACT_END-", and the texts have been tokenised using the spaCy `en_core_web_sm` tokeniser. The resulting texts have all tokens separated by spaces, so simply splitting the texts by whitespace will yield the tokenised versions. The preprocessing is done in the `data.py` script.
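Concretely, the merged format looks like this, and whitespace splitting recovers the tokens (the title and abstract below are made up):

```python
title = "A result on widgets"
abstract = "We prove a result about widgets ."
merged = f"-TITLE_START- {title} -TITLE_END- -ABSTRACT_START- {abstract} -ABSTRACT_END-"

# Tokens are separated by single spaces, so splitting on whitespace
# yields the tokenised text.
tokens = merged.split()
print(tokens[:3])  # ['-TITLE_START-', 'A', 'result']
```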
Two JSON files, `cats.json` and `mcat_dict.json`, are also available: these are essentially the `cats` table from the database and a dictionary mapping each category to its master category, respectively.
I trained FastText vectors on the entire corpus, and the resulting model can be found as `fasttext_model.bin`, with the vectors themselves stored in the text file `fasttext`. The script I used to train these can be found in `train_fasttext.py`.
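The binary model can be loaded with the official `fasttext` Python package (a minimal sketch; the package choice is an assumption, as any FastText-compatible loader should work):

```python
# Load the trained FastText model and look up a word vector.
import fasttext

model = fasttext.load_model("fasttext_model.bin")
vector = model.get_word_vector("-TITLE_START-")
print(vector.shape)
```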
In case you're like me and are having trouble working with the entire dataset, there's also the `arxiv_data_mini_pp.tsv` dataset, which simply consists of 100,000 randomly sampled papers from `arxiv_data_pp.tsv`. You can make your own versions of these using the `make_mini.py` script, which constructs the smaller datasets without loading the larger one into memory.
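One standard way to subsample a large TSV without holding it in memory is reservoir sampling; here is a minimal sketch (an illustration only, not necessarily what `make_mini.py` does):

```python
# Reservoir-sample k data lines from a TSV, keeping the header,
# without ever loading the full file into memory.
import random

def sample_tsv(path, out_path, k=100_000, seed=42):
    rng = random.Random(seed)
    reservoir = []
    with open(path, encoding="utf-8") as f:
        header = next(f)
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line)
            else:
                j = rng.randrange(i + 1)
                if j < k:
                    reservoir[j] = line
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(header)
        f.writelines(reservoir)

sample_tsv("arxiv_data_pp.tsv", "arxiv_data_mini_pp.tsv")
```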
The model itself is a simplified version of the new [SHA-RNN model](https://arxiv.org/abs/1911.11423), trained from scratch on the [BlueCrystal Phase 4 compute cluster](https://www.acrc.bris.ac.uk/acrc/phase4.htm) at the University of Bristol, UK. All scripts pertaining to the model are `modules.py`, `training.py` and `inference.py`.
## Contact
If you have any questions regarding the data or model, please contact me at saattrupdan at gmail dot com.