https://github.com/chartgerink/journal-spiders
Repository with tools to spider journal websites for links to articles
- Host: GitHub
- URL: https://github.com/chartgerink/journal-spiders
- Owner: chartgerink
- Created: 2015-08-09T21:22:33.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2016-04-18T08:38:00.000Z (about 9 years ago)
- Last Synced: 2025-04-28T16:13:21.871Z (about 1 month ago)
- Language: Python
- Size: 6.3 MB
- Stars: 7
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# journal-spider: facilitating the spidering of journal articles to scrape
---
ContentMine facilitates scraping journals via [`getpapers`](https://github.com/ContentMine/getpapers), [`quickscrape`](https://github.com/ContentMine/quickscrape), and [`journal-scrapers`](https://github.com/ContentMine/journal-scrapers), but finding the links to feed into `quickscrape` remains tedious if done manually. This repository provides a way of spidering journals that requires only minimal user adjustment.

The main workhorse, [`get_links.py`](https://github.com/jabbalaci/Bash-Utils/blob/master/get_links.py), was written by Laszlo Szathmary in 2011. It returns all links on a webpage, which is all we really need. Link extraction currently works for SAGE, Springer, and Elsevier journals. To run this:
1. Import `spiderer` as a module into Python (make sure the `BeautifulSoup` module is installed: `pip install BeautifulSoup`).
2. Run `spiderer.sage(journal = '')`, `spiderer.springer(journal = '')`, or `spiderer.elsevier(journal = '')` to download all links for that specific journal:
   - For `sage()` you only need the first three letters of the web URL (e.g., `pss` for [Psychological Science](http://pss.sagepub.com)).
   - For `springer()` you need the unique journal identifier (e.g., 13428 for [Behavior Research Methods](http://link.springer.com/journal/volumesAndIssues/13428)).
   - For `elsevier()` you need the unique journal identifier (e.g., 2212683X for [Biologically Inspired Cognitive Architectures](http://www.sciencedirect.com/science/journal/2212683X)).

If you want to collect the links for all journals available in `journal_list.csv`, simply run `python run_all.py` from the command line.
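To illustrate the core idea, here is a minimal sketch of the kind of "return all links on a webpage" extraction that `get_links.py` performs. It uses the modern `beautifulsoup4` package rather than the older `BeautifulSoup` release the README mentions; the function names and the example URL are illustrative, not this repository's actual API.

```python
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup  # pip install beautifulsoup4


def extract_links(html, base_url):
    """Return absolute URLs for every <a href=...> in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors without an href attribute;
    # urljoin resolves relative links against the page's URL.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def get_links(url):
    """Fetch a page and return all links found on it."""
    return extract_links(urlopen(url).read(), url)
```

A spider over a journal's archive pages then only needs to call `get_links()` on each issue listing and filter the results for article URLs to feed into `quickscrape`.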
# To-do
- [x] Incorporate some form of selection mechanism into the journal_list
- [x] Incorporate a date checker to prevent re-spidering of recently spidered journals (what is a reasonable timeframe for this?)
- [x] Incorporate Elsevier
- [ ] Incorporate Taylor & Francis
- [ ] Incorporate Wiley