https://github.com/mohamedelashri/arxivabscraper
A python module for scraping arxiv abstracts for NLP testing purpose.
https://github.com/mohamedelashri/arxivabscraper
arxiv natural-language-processing scraper
Last synced: 4 months ago
JSON representation
A python module for scraping arxiv abstracts for NLP testing purpose.
- Host: GitHub
- URL: https://github.com/mohamedelashri/arxivabscraper
- Owner: MohamedElashri
- License: mit
- Created: 2020-12-27T06:42:40.000Z (almost 5 years ago)
- Default Branch: main
- Last Pushed: 2022-06-26T07:30:12.000Z (over 3 years ago)
- Last Synced: 2025-03-03T08:26:34.576Z (7 months ago)
- Topics: arxiv, natural-language-processing, scraper
- Language: Jupyter Notebook
- Homepage: https://mohamedelashri.github.io/arxivabscraper/
- Size: 45.9 KB
- Stars: 1
- Watchers: 3
- Forks: 1
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
[](https://doi.org/10.5281/zenodo.4395835)
# arxivabscraper
An ArXiV scraper to retrieve abstracts from given categories and date range.[website](https://mohamedelashri.github.io/arxivabscraper/)
## Install
Use `pip` (or `pip3` for python3):
```bash
$ pip install arxivabscraper
```or download the source and use `setup.py`:
```bash
$ python setup.py install
```or if you do not want to install the module, copy `arxivabscraper.py` into your working
directory.To update the module using `pip`:
```bash
pip install arxivabscraper --upgrade
```## Examples
There is a tutorial on how to use the package directly on google colab [here](https://github.com/MohamedElashri/Arxiv-Aabstract-scraper/blob/main/arxivabscraper_tutorial.ipynb)
.
it provides the basic usage to the package and can be run directly on the notebookYou can directly use `arxivabscraper` in your scripts. Let's import `arxivabscraper`
and create a scraper to fetch all preprints in condensed matter physics category
from 2 May 2018 until 2 June 2020 (for other categories, see below):```python
import arxivabscraper
scraper = arxivabscraper.Scraper(category='physics:hep-th', date_from='2010-05-27',date_until='2020-06-07')
```
Once we built an instance of the scraper, we can start the scraping:```python
output = scraper.scrape()
```
While scraper is running, it prints its status:```
fetching up to 1000 records...
fetching up to 2000 records...
Got 503. Retrying after 30 seconds.
fetching up to 3000 records...
fetching is complete.
```Finally you can save the output in your favorite format or readily convert it into a pandas dataframe:
```python
import pandas as pd
cols = ('categories', 'abstract')
df = pd.DataFrame(output,columns=cols)
```## Categories
Here is a list of all categories available on ArXiv.| Category | Code |
| --- | --- |
| Computer Science | `cs` |
| Economics | `econ` |
| Electrical Engineering and Systems Science | `eess` |
| Mathematics | `math` |
| Physics | `physics` |
| Astrophysics | `physics:astro-ph` |
| Condensed Matter | `physics:cond-mat` |
| General Relativity and Quantum Cosmology | `physics:gr-qc` |
| High Energy Physics - Experiment | `physics:hep-ex` |
| High Energy Physics - Lattice | `physics:hep-lat` |
| High Energy Physics - Phenomenology | `physics:hep-ph` |
| High Energy Physics - Theory | `physics:hep-th` |
| Mathematical Physics | `physics:math-ph` |
| Nonlinear Sciences | `physics:nlin` |
| Nuclear Experiment | `physics:nucl-ex` |
| Nuclear Theory | `physics:nucl-th` |
| Physics (Other) | `physics:physics` |
| Quantum Physics | `physics:quant-ph` |
| Quantitative Biology | `q-bio` |
| Quantitative Finance | `q-fin` |
| Statistics | `stat` |## Contributing
Ideas/bugs/comments? Please open an issue or submit a pull request on Github.## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.## Acknowledgments
This work is based on the arxivscraper from
Mahdi Sadjadi (2017). arxivscraper: Zenodo. http://doi.org/10.5281/zenodo.889853