https://github.com/saattrupdan/scholarly
Classification of scientific papers
- Host: GitHub
- URL: https://github.com/saattrupdan/scholarly
- Owner: saattrupdan
- Archived: true
- Created: 2019-04-25T08:32:56.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2023-02-15T23:07:51.000Z (over 2 years ago)
- Last Synced: 2025-07-11T15:30:29.235Z (3 months ago)
- Topics: arxiv, arxiv-api, classification, multilabel, neural-network, scientific-papers
- Language: Python
- Homepage:
- Size: 9.37 MB
- Stars: 24
- Watchers: 3
- Forks: 3
- Open Issues: 6
Metadata Files:
- Readme: README.md
README
# Scholarly
Category classification of scientific papers. Given the title and abstract of a paper, the model will predict a list of categories to which the paper belongs. These categories are the 148 categories used on [arXiv](https://arxiv.org).
## Usage
A demonstration of the model can be found at [ttilt.dk/scholarly](https://www.ttilt.dk/scholarly). Note that you're also free to write LaTeX equations like $\frac{1}{5}$.

A REST API is available at the same endpoint, taking the arguments `title` and `abstract`. You will then receive a JSON response containing a list of lists, with each inner list containing the category id, the category description and the probability. Only results with a probability of at least 50% are included, and the list is sorted by probability in descending order. [Here](https://saattrupdan.pythonanywhere.com/scholarly?title="test"&abstract="test") is an example of a query.
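This makes the API easy to call programmatically. Here is a minimal sketch using the `requests` library, with the endpoint taken from the example link above:

```python
# Query the Scholarly REST API; the endpoint is taken from the
# example link in this README.
import requests

response = requests.get(
    "https://saattrupdan.pythonanywhere.com/scholarly",
    params={"title": "test", "abstract": "test"},
)

# Each entry is [category id, category description, probability],
# sorted by probability in descending order.
for cat_id, description, probability in response.json():
    print(f"{cat_id} ({description}): {probability:.0%}")
```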
## Performance
The score that I was using was the *sample-average F1 score*: for every sample I compute the F1 score of that sample's predictions (note that we are in a [multilabel](https://en.wikipedia.org/wiki/Multi-label_classification) setup), and then average that over all the samples. In a [multiclass](https://en.wikipedia.org/wiki/Multiclass_classification) setup (in particular binary classification) this would simply correspond to accuracy. The difference is that in a multilabel setup the model can be *partially* correct, if it correctly predicts some of the categories.

The model ended up achieving a validation sample-average F1 score of ~93% on the master categories and ~65% on all the categories. Training the model requires ~17GB of memory and takes roughly a day on an Nvidia P100 GPU. The model was trained on the [BlueCrystal Phase 4 compute cluster](https://www.acrc.bris.ac.uk/acrc/phase4.htm) at the University of Bristol, UK.
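For reference, here is a minimal sketch of the sample-average F1 score using scikit-learn (an illustration of the metric, not necessarily the tooling used in this repo):

```python
# Sample-average F1: compute an F1 score per sample (row) and
# average over samples. scikit-learn is an assumption here.
import numpy as np
from sklearn.metrics import f1_score

# Rows are samples, columns are binary category indicators
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0]])

# average="samples" computes an F1 score per row, then averages
print(f1_score(y_true, y_pred, average="samples"))  # ~0.83
```

This also makes the "partially correct" point concrete: the first sample above misses one of its two true categories but still scores 2/3 rather than 0.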
## Documentation and data
This model was trained on the titles and abstracts of all [arXiv](https://arxiv.org) papers up to and including 2019, all scraped from the arXiv API. The scraping script can be found in `arxiv_scraper.py`, and a sketch of the API itself follows the data link below. All the data can be found at
https://filedn.com/lRBwPhPxgV74tO0rDoe8SpH/scholarly_data
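As a reference for what the scraper consumes, the arXiv API can be queried like this (a minimal sketch using `feedparser`; the query parameters are just an example, and this is not necessarily how `arxiv_scraper.py` works):

```python
# Minimal sketch of the arXiv API, which returns Atom feeds.
import feedparser

url = (
    "http://export.arxiv.org/api/query"
    "?search_query=cat:math.CT&start=0&max_results=3"
)
for entry in feedparser.parse(url).entries:
    categories = [tag["term"] for tag in entry.tags]
    print(entry.title, categories)
```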
The main data file is the SQLite file `arxiv_data.db`, which contains 6 tables:
- `cats`, containing the 148 arXiv categories
- `master_cats`, containing the arXiv "master categories", which are 6 aggregates of the categories into things like Mathematics and Physics
- `papers`, containing the id, date, title and abstract of papers
- `papers_cats`, which links papers to their categories
- `authors`, containing author names
- `papers_authors`, which links authors to their papers

From this database I extracted the dataset `arxiv_data.tsv`, which contains the title and abstract of every paper in the database, along with a binary column for every category, denoting whether the paper belongs to that category. LaTeX equations in titles and abstracts have been replaced by "-EQN-" at this stage. Everything related to the database can be found in the `db.py` script.
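As a quick way to get going with the database, here is a minimal sketch using the built-in `sqlite3` module; the table names come from the list above, but the column names used in the joins are assumptions:

```python
# Fetch a few (title, category) pairs. Table names are from the
# README; the join columns (id, paper_id, cat_id, cat) are guesses.
import sqlite3

conn = sqlite3.connect("arxiv_data.db")
rows = conn.execute("""
    SELECT papers.title, cats.cat
    FROM papers
    JOIN papers_cats ON papers.id = papers_cats.paper_id
    JOIN cats ON papers_cats.cat_id = cats.id
    LIMIT 5
""").fetchall()
for title, category in rows:
    print(f"{title}: {category}")
conn.close()
```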
A preprocessed dataset is also available, `arxiv_data_pp.tsv`, in which the titles and abstracts have been merged as "-TITLE_START- {title} -TITLE_END- -ABSTRACT_START- {abstract} -ABSTRACT_END-", and the texts have been tokenised using the spaCy `en_core_web_sm` tokeniser. The resulting texts have all tokens separated by spaces, so simply splitting the texts by whitespace will yield the tokenised versions. The preprocessing is done in the `data.py` script.
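Concretely, the merged format looks like this, and whitespace splitting recovers the tokens (the title and abstract below are made up):

```python
title = "A result on widgets"
abstract = "We prove a result about widgets ."
merged = f"-TITLE_START- {title} -TITLE_END- -ABSTRACT_START- {abstract} -ABSTRACT_END-"

# Tokens are separated by single spaces, so splitting on whitespace
# yields the tokenised text.
tokens = merged.split()
print(tokens[:3])  # ['-TITLE_START-', 'A', 'result']
```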
Two JSON files, `cats.json` and `mcat_dict.json`, are also available: these are essentially the `cats` table from the database and a dictionary mapping each category to its master category, respectively.
I trained FastText vectors on the entire corpus, and the resulting model can be found as `fasttext_model.bin`, with the vectors themselves stored in the text file `fasttext`. The script I used to train these can be found in `train_fasttext.py`.
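The binary model can be loaded with the official `fasttext` Python package (a minimal sketch; the package choice is an assumption, as any FastText-compatible loader should work):

```python
# Load the trained FastText model and look up a word vector.
import fasttext

model = fasttext.load_model("fasttext_model.bin")
vector = model.get_word_vector("-TITLE_START-")
print(vector.shape)
```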
In case you're like me and are having trouble working with the entire dataset, there's also the `arxiv_data_mini_pp.tsv` dataset, which simply consists of 100,000 randomly sampled papers from `arxiv_data_pp.tsv`. You can make your own versions of these using the `make_mini.py` script, which constructs the smaller datasets without loading the larger one into memory.
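One standard way to subsample a large TSV without holding it in memory is reservoir sampling; here is a minimal sketch (an illustration only, not necessarily what `make_mini.py` does):

```python
# Reservoir-sample k data lines from a TSV, keeping the header,
# without ever loading the full file into memory.
import random

def sample_tsv(path, out_path, k=100_000, seed=42):
    rng = random.Random(seed)
    reservoir = []
    with open(path, encoding="utf-8") as f:
        header = next(f)
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line)
            else:
                j = rng.randrange(i + 1)
                if j < k:
                    reservoir[j] = line
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(header)
        f.writelines(reservoir)

sample_tsv("arxiv_data_pp.tsv", "arxiv_data_mini_pp.tsv")
```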
The model itself is a simplified version of the new [SHA-RNN model](https://arxiv.org/abs/1911.11423), trained from scratch on the [BlueCrystal Phase 4 compute cluster](https://www.acrc.bris.ac.uk/acrc/phase4.htm) at the University of Bristol, UK. All scripts pertaining to the model are `modules.py`, `training.py` and `inference.py`.
## Contact
If you have any questions regarding the data or model, please contact me at saattrupdan at gmail dot com.