https://github.com/theden/bbc-crawl
Spider that crawls bbc.com articles and builds an API to query them
- Host: GitHub
- URL: https://github.com/theden/bbc-crawl
- Owner: TheDen
- Created: 2017-02-05T08:20:15.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2018-05-28T16:35:41.000Z (about 8 years ago)
- Last Synced: 2024-05-30T01:18:20.359Z (about 2 years ago)
- Language: Python
- Homepage: https://bbc-crawl.herokuapp.com/
- Size: 35.2 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# BBC-Crawl
Scrapes articles from www.bbc.com using `scrapy` and creates an API using `flask` to query the articles for keywords.
Live version: https://bbc-crawl.herokuapp.com/
Example query: https://bbc-crawl.herokuapp.com/api/v1/articles/?query=sydney
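To illustrate what a keyword query over scraped articles involves, here is a minimal sketch in plain Python. It is not the project's implementation: the `ARTICLES` records, their `title` and `body` fields, and the `search_articles` helper are all hypothetical, and the real service reads its articles from MongoDB rather than from an in-memory list.

```python
# Hypothetical article records; the real project stores scraped articles in
# MongoDB, and its actual field names are not shown in this README.
ARTICLES = [
    {"title": "Sydney weather warning", "body": "Storms are expected across the city."},
    {"title": "Markets update", "body": "Shares rose in Sydney and Tokyo."},
    {"title": "Football results", "body": "A late goal decided the match."},
]


def search_articles(articles, query):
    """Return articles whose title or body contains the query, ignoring case."""
    needle = query.strip().lower()
    if not needle:
        # An empty query matches nothing rather than everything.
        return []
    return [
        article
        for article in articles
        if needle in article["title"].lower() or needle in article["body"].lower()
    ]


matches = search_articles(ARTICLES, "sydney")
print([article["title"] for article in matches])
```

A plain substring match like this is the simplest possible choice; a database-backed service would more likely push the matching into the database query.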
### Build
Requirements:
`mongo`: https://docs.mongodb.com/getting-started/shell/installation/
`pip`: `wget https://bootstrap.pypa.io/get-pip.py && python get-pip.py`
Pip modules (`make install` should also install these):
```
pip install -r requirements.txt
```
On Ubuntu, some prerequisite packages might need to be installed first:
```
sudo apt-get update
sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
```
### Test
For a very simple test on localhost, run `./flaskr_tests.py`.
### Run
`MONGODB_URL` needs to be exported so the app can connect to the mLab database:
`export MONGODB_URL=mongodb://$user:$pass@$db.mlab.com:$port/bbc`
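Before starting the spider or the server, it can help to confirm that `MONGODB_URL` is set and well formed. The sketch below uses only the standard library; the `read_mongodb_url` helper and the example host `ds012345.mlab.com`, port, and credentials are placeholders for illustration, not values from the project.

```python
import os
from urllib.parse import urlsplit


def read_mongodb_url(environ=os.environ):
    """Return the parsed MONGODB_URL, or raise a clear error if it is unusable."""
    url = environ.get("MONGODB_URL")
    if not url:
        raise RuntimeError("MONGODB_URL is not set; export it before running")
    parts = urlsplit(url)
    if parts.scheme != "mongodb":
        raise RuntimeError("MONGODB_URL must start with mongodb://")
    return {
        "host": parts.hostname,
        "port": parts.port,
        "user": parts.username,
        "database": parts.path.lstrip("/"),
    }


# Placeholder values only; real credentials should come from the environment.
example = {"MONGODB_URL": "mongodb://user:secret@ds012345.mlab.com:12345/bbc"}
print(read_mongodb_url(example))
```

Passing a dictionary instead of `os.environ` keeps the check easy to try without touching the real environment.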
To run the spider:
`make run-spider`
To start the server:
`make run-server`
To import the output file to a remote db:
`db-import`
### API
Once the server is running, articles can be queried by keyword. For example, a query for the word `sydney` can be made at the endpoint:
https://bbc-crawl.herokuapp.com/api/v1/articles/?query=sydney
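
The same query can be made from a script. The following is a minimal sketch using only the Python standard library; `build_query_url` and `fetch_articles` are hypothetical helper names, and because the response format is not documented here, the body is returned as raw text rather than parsed. The hosted instance may not always be reachable, so `base_url` can be pointed at a locally running server instead.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example above.
BASE_URL = "https://bbc-crawl.herokuapp.com/api/v1/articles/"


def build_query_url(keyword, base_url=BASE_URL):
    """Build the article search URL for a keyword, percent-encoding as needed."""
    return base_url + "?" + urlencode({"query": keyword})


def fetch_articles(keyword, base_url=BASE_URL, timeout=10):
    """Fetch the raw response body for a keyword query (makes a network request)."""
    with urlopen(build_query_url(keyword, base_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only the URLs are printed here; call fetch_articles("sydney") to hit the API.
print(build_query_url("sydney"))
print(build_query_url("new york"))
```

Using `urlencode` rather than string concatenation keeps keywords with spaces or special characters valid in the query string.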