https://github.com/chartgerink/journal-spiders
Repository with tools to spider journal websites for links to articles
- Host: GitHub
- URL: https://github.com/chartgerink/journal-spiders
- Owner: chartgerink
- Created: 2015-08-09T21:22:33.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2016-04-18T08:38:00.000Z (about 9 years ago)
- Last Synced: 2025-04-28T16:13:21.871Z (about 1 month ago)
- Language: Python
- Size: 6.3 MB
- Stars: 7
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# journal-spider: facilitating the spidering of journal articles to scrape
---
ContentMine facilitates scraping journals via [`getpapers`](https://github.com/ContentMine/getpapers), [`quickscrape`](https://github.com/ContentMine/quickscrape), and [`journal-scrapers`](https://github.com/ContentMine/journal-scrapers), but finding the links to feed into `quickscrape` remains tedious if done manually. This repository provides a way of spidering journals that requires only minimal user adjustment.

The main workhorse, [`get_links.py`](https://github.com/jabbalaci/Bash-Utils/blob/master/get_links.py), was written by Laszlo Szathmary in 2011. It returns all links on a webpage, which is all we really need. Link extraction currently works for SAGE, Springer, and Elsevier journals. To run this:
1. Import `spiderer` as a module into Python (make sure the `BeautifulSoup` module is installed: `pip install BeautifulSoup`).
2. Run `spiderer.sage(journal = '')`, `spiderer.springer(journal = '')`, or `spiderer.elsevier(journal = '')` to download all links for that specific journal:
   - For `sage()` you only need the first three letters of the web URL (e.g., `pss` for [Psychological Science](http://pss.sagepub.com)).
   - For `springer()` you need the unique journal identifier (e.g., 13428 for [Behavior Research Methods](http://link.springer.com/journal/volumesAndIssues/13428)).
   - For `elsevier()` you need the unique journal identifier (e.g., 2212683X for [Biologically Inspired Cognitive Architectures](http://www.sciencedirect.com/science/journal/2212683X)).

If you want to collect the links for all journals available in `journal_list.csv`, simply run `python run_all.py` from the command line.
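To illustrate the core idea, here is a minimal sketch of the kind of "return all links on a webpage" extraction that `get_links.py` performs. It uses the modern `beautifulsoup4` package rather than the older `BeautifulSoup` release the README mentions; the function names and the example URL are illustrative, not this repository's actual API.

```python
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup  # pip install beautifulsoup4


def extract_links(html, base_url):
    """Return absolute URLs for every <a href=...> in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors without an href attribute;
    # urljoin resolves relative links against the page's URL.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def get_links(url):
    """Fetch a page and return all links found on it."""
    return extract_links(urlopen(url).read(), url)
```

A spider over a journal's archive pages then only needs to call `get_links()` on each issue listing and filter the results for article URLs to feed into `quickscrape`.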
# To-do
- [x] Incorporate some form of selection mechanism into the journal_list
- [x] Incorporate a date checker to prevent re-spidering of recently spidered journals (what is a reasonable timeframe for this?)
- [x] Incorporate Elsevier
- [ ] Incorporate Taylor & Francis
- [ ] Incorporate Wiley