https://github.com/dfm/data.arxiv.io
Code and website for my arxiv abstract dataset
- Host: GitHub
- URL: https://github.com/dfm/data.arxiv.io
- Owner: dfm
- License: mit
- Created: 2013-11-27T17:43:41.000Z (over 12 years ago)
- Default Branch: main
- Last Pushed: 2020-06-12T18:17:21.000Z (almost 6 years ago)
- Last Synced: 2025-04-13T19:54:08.634Z (about 1 year ago)
- Language: Python
- Size: 7.81 KB
- Stars: 9
- Watchers: 2
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
A little script that scrapes the metadata from [the arXiv](http://arxiv.org)
and saves it in a form that is useful for statistical analysis.
Usage
-----
You'll need to install [NLTK](http://nltk.org) first and then run
```
python scrape.py
```
The script takes many hours to run. It saves gzipped files of the form
`data/astro-ph/2007-05-10.txt.gz` with one abstract per line. Each line has
four tab-separated columns: arXiv id, space-separated categories, tokenized
title, and tokenized abstract.
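To read the output back for analysis, something like the following minimal sketch should work. It is not part of `scrape.py`: the names `Record`, `parse_line`, and `read_abstracts` are made up for illustration, and it assumes the files are UTF-8 encoded and that every line has exactly the four columns described above.

```python
import gzip
from typing import Iterator, List, NamedTuple


class Record(NamedTuple):
    """One line of a scraped file (hypothetical container, not from scrape.py)."""
    arxiv_id: str
    categories: List[str]
    title: str      # tokenized title, kept as a single string
    abstract: str   # tokenized abstract, kept as a single string


def parse_line(line: str) -> Record:
    # Columns are tab-separated: id, categories, title, abstract.
    # maxsplit=3 keeps any stray tab inside the abstract column.
    arxiv_id, categories, title, abstract = line.rstrip("\n").split("\t", 3)
    # Categories are space-separated within their column.
    return Record(arxiv_id, categories.split(), title, abstract)


def read_abstracts(path: str) -> Iterator[Record]:
    # Assumption: files are UTF-8 text compressed with gzip.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield parse_line(line)
```

Looping over `read_abstracts("data/astro-ph/2007-05-10.txt.gz")` then yields one `Record` per abstract without loading the whole file into memory.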
Credits
-------
Copyright 2013 Dan Foreman-Mackey
Licensed under the terms of the MIT License (see `LICENSE`).