Web crawler to collect snapshots of articles to a web archive.
- Host: GitHub
- URL: https://github.com/alan-turing-institute/misinformation-crawler
- Owner: alan-turing-institute
- License: MIT
- Created: 2018-06-29T10:29:40.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2021-08-02T13:49:55.000Z (over 3 years ago)
- Last Synced: 2025-01-12T22:41:47.079Z (3 months ago)
- Topics: hut23, hut23-134
- Language: HTML
- Size: 6.32 MB
- Stars: 5
- Watchers: 19
- Forks: 2
- Open Issues: 41
Metadata Files:
- Readme: README.md
- License: LICENSE
# Misinformation crawler
Web crawler to collect snapshots of articles to a web archive. See the [main project](https://github.com/alan-turing-institute/misinformation) for the project board and issues.
## Prerequisites
- `chromedriver`
- needed by Selenium. See [installation instructions](https://selenium-python.readthedocs.io/installation.html); a quick smoke test appears after this list
- `node.js`
- needed both for some `ReadabiliPy` tests and also to avoid Cloudflare protections. See [installation instructions](https://nodejs.org/en/download/)
- (optional) Microsoft SQL drivers
- Needed only if recording the crawl in the Azure database, not if writing output to a local file
- See the sections below on `Using pyodbc on macOS` and `How to install Microsoft SQL Server drivers on macOS`
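As a quick check that Selenium can find `chromedriver`, here is a minimal headless-Chrome smoke test (the URL is a placeholder; this snippet is not part of the crawler itself):

```python
# Smoke test: confirm Selenium can locate chromedriver and drive headless Chrome.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")
    print(driver.title)  # prints the page title if everything is wired up
finally:
    driver.quit()
```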
##### Using pyodbc on macOS
If you are not using the latest version of macOS, you may get an `sql.h not found` error when installing the `pyodbc` dependency via pip. This is because there is no compiled wheel for your version of macOS. The options are to either (i) upgrade to the latest version of macOS or (ii) install the `unixodbc` driver libraries with `brew install unixodbc`.
##### How to install Microsoft SQL Server drivers on macOS
1. Install Homebrew if you have not already: `/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"`
2. Add the Microsoft Tap: `brew tap microsoft/mssql-release https://github.com/Microsoft/homebrew-mssql-release`
3. Update Homebrew: `brew update`
4. Install the Microsoft SQL ODBC driver: `brew install openssl msodbcsql17 mssql-tools`
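After installing, you can confirm the driver is visible from Python (a quick check, assuming `pyodbc` is already installed):

```python
# List the ODBC drivers visible to pyodbc; "ODBC Driver 17 for SQL Server"
# should appear in the output after a successful install.
import pyodbc

print(pyodbc.drivers())
```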
## Installation
- Check out the code from GitHub
- Install the Python dependencies for this package by running:
- As a user of this project `pip install -r requirements.txt`
- As a developer of this project `pip install -r requirements-dev.txt`
- Ensure that [`ReadabiliPy`](https://github.com/alan-turing-institute/ReadabiliPy) is installed by running:
```
git submodule update --init --recursive
```
- Install the Python dependencies for `ReadabiliPy` by typing `pip install -r ReadabiliPy/requirements.txt`
- (Optional) Install the `node.js` dependencies for `ReadabiliPy` by entering the `ReadabiliPy` directory and typing `npm install`
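For context, `ReadabiliPy` is the component that turns raw HTML into structured article content. A minimal sketch of its use (the import path may differ when the package is vendored as a submodule, and the output keys shown are illustrative):

```python
# Extract structured article content from an HTML string with ReadabiliPy.
from readabilipy import simple_json_from_html_string

html = "<html><head><title>Headline</title></head><body><p>Body text.</p></body></html>"
article = simple_json_from_html_string(html, use_readability=False)
print(article["title"])          # the extracted title, if one was found
print(article["plain_content"])  # a simplified HTML version of the content
```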
## Usage
Site configurations for 107 sites are included in `misinformation/site_configs.yml`.
Crawled articles are saved one file per site in `articles/`.
The actual number of articles returned may be slightly higher than any limit set with `-n`, due to the number of parallel requests Scrapy has open at any time.
### Crawling all sites
Usage: `python crawl.py --all -n <limit>` (the limit is optional; all articles will be crawled if it is left off)
### Crawling a single site
Usage: `python crawl.py --site <site_name> -n <limit>` (the limit is optional; all articles will be crawled if it is left off)
### Crawling a list of URLs
Usage: `python crawl.py --list <csv_file>` (the file must be in CSV format with an `article_url` column and a `site_name` column)
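For example, a list file might look like the following (the exact column headers expected by `crawl.py` are an assumption here; check the code if in doubt):

```
article_url,site_name
https://example.com/story-one,example-site
https://example.com/story-two,example-site
```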
## Testing
To run tests, run `python -m pytest` from the repository root.
## Running the crawler with the Azure backend
In order to run the crawler you will need to create a file at `secrets/db_config.yaml` inside the top-level `misinformation-crawler` directory. This should look like the following:
```
driver: ODBC Driver 17 for SQL Server
server: misinformation.database.windows.net
database: misinformation
user: database-crawler-user
password:
```
where the password is obtained from the Azure keyvault for the database, using
```
az keyvault secret show --vault-name misinformation-user --name database-crawler-user
```
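A config in this shape maps directly onto a `pyodbc` connection string; here is a sketch of how it could be consumed (illustrative only; the crawler's own database code may differ):

```python
# Sketch: build a pyodbc connection from secrets/db_config.yaml.
import pyodbc
import yaml

with open("secrets/db_config.yaml") as f:
    cfg = yaml.safe_load(f)

conn = pyodbc.connect(
    f"DRIVER={{{cfg['driver']}}};"
    f"SERVER={cfg['server']};"
    f"DATABASE={cfg['database']};"
    f"UID={cfg['user']};"
    f"PWD={cfg['password']}"
)
```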
The crawler can then be run using `python crawl.py --all -e blob`.
## Developing
To update to the latest version of [ReadabiliPy](https://github.com/martintoreilly/ReadabiliPy/blob/features/14-plain-content-structure/README.md):
- Navigate to the `ReadabiliPy` folder with `cd ReadabiliPy`
- Ensure you are on the `master` branch with `git checkout master`
- Pull the latest version with `git pull`
- Install the dependencies with `pip install -r requirements.txt`