Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ecprice/newsdiffs
Automatic scraper that tracks changes in news articles over time.
- Host: GitHub
- URL: https://github.com/ecprice/newsdiffs
- Owner: ecprice
- License: other
- Created: 2012-06-16T21:54:51.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2020-09-19T13:01:50.000Z (about 4 years ago)
- Last Synced: 2024-08-02T14:12:18.570Z (3 months ago)
- Language: Python
- Size: 1.97 MB
- Stars: 493
- Watchers: 40
- Forks: 135
- Open Issues: 36
Metadata Files:
- Readme: README.md
- License: LICENSE
README
NewsDiffs
=========

A website and framework that tracks changes in online news articles over time.
Original installation at newsdiffs.org.
A product of the Knight Mozilla MIT news hackathon in June 2012.
Authors: Eric Price ([email protected]), Greg Price ([email protected]),
and Jennifer 8. Lee ([email protected])

This is free software under the MIT/Expat license; see LICENSE.
The project's source code lives at http://github.com/ecprice/newsdiffs .

Requirements
------------

You need to have installed on your local machine
* Git
* Python 2.6 or later
* Django and other Python libraries

On a Debian- or Ubuntu-based system, it may suffice (untested) to run

    $ sudo apt-get install git-core python-django python-django-south python-simplejson

On Mac OS, the easiest way may be to install pip:
http://www.pip-installer.org/en/latest/installing.html
and then

    $ pip install Django

Initial setup
-------------

    $ python website/manage.py syncdb && python website/manage.py migrate
    $ mkdir articles

Running NewsDiffs Locally
-------------------------

Do the initial setup above. Then to start the webserver for testing:

    $ python website/manage.py runserver

and visit http://localhost:8000/
Running the scraper
-------------------

Do the initial setup above. You will also need additional Python
libraries; on a Debian- or Ubuntu-based system, it may suffice
(untested) to run
    $ sudo apt-get install python-bs4 python-beautifulsoup

On a Mac, you will want something like

    $ pip install beautifulsoup4
    $ pip install beautifulsoup
    $ pip install html5lib

Note that we need two versions of BeautifulSoup, both 3.2 and 4.0;
some websites are parsed correctly in only one version.
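
As a quick sanity check that both versions are installed (this assumes the usual module names: BeautifulSoup 3 imports as "BeautifulSoup" and BeautifulSoup 4 as "bs4", so the two coexist without conflict), you can run:

    $ python -c "import BeautifulSoup, bs4; print BeautifulSoup.__version__, bs4.__version__"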
Then run

    $ python website/manage.py scraper

This will populate the articles repository with a list of current news
articles. This is a snapshot at a single time, so the website will
not yet have any changes. To get changes, wait some time (say, 3
hours) and run 'python website/manage.py scraper' again. If any of
the articles have changed in the intervening time, the website should
display the associated changes.

The scraper will log progress to /tmp/newsdiffs_logging (which is
overwritten each run) and errors to /tmp/newsdiffs/logging_errs (which
is cumulative).
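
To watch a run in progress, you can follow the progress log with tail:

    $ tail -f /tmp/newsdiffs_logging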
To run the scraper every hour, run something like:

    $ while true; do python website/manage.py scraper; sleep 60m; done

or make a cron job.
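
If you use cron, an entry along the following lines runs the scraper at the top of every hour (the checkout path is a placeholder for wherever you cloned NewsDiffs):

    0 * * * * cd /path/to/newsdiffs && python website/manage.py scraper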
Adding new sites to the scraper
-------------------------------

The procedure for adding new sites to the scraper is outlined in
parsers/__init__.py . You need to

(1) Create a new parser module in parsers/ . This should be a
    subclass of BaseParser (in parsers/baseparser.py). Model it off
    the other parsers in that directory (a rough sketch is shown
    after this list). You can test the parser by running, e.g.,

        $ python parsers/test_parser.py bbc.BBCParser

    which will output a list of URLs to track, and

        $ python parsers/test_parser.py bbc.BBCParser http://www.bbc.co.uk/news/uk-21649494

    which will output the text that NewsDiffs would store.

(2) Add the parser to 'parsers' in parsers/__init__.py
This should cause the scraper to start tracking the site.
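
For orientation, here is a rough, hypothetical sketch of what such a parser module tends to look like. Every name in it is an assumption based on the pattern described above, so copy the actual structure from an existing parser such as parsers/bbc.py rather than from this sketch:

    # parsers/example.py -- hypothetical sketch, not a drop-in parser.
    # Attribute and method names are assumptions; check parsers/baseparser.py
    # and the existing parsers for the real interface.
    import bs4
    from baseparser import BaseParser

    class ExampleParser(BaseParser):
        # Domains this parser handles, index pages to crawl for article
        # links, and a pattern article URLs must match (assumed names).
        domains = ['www.example.com']
        feeder_pages = ['http://www.example.com/news/']
        feeder_pat = '^http://www.example.com/news/'

        def _parse(self, html):
            # Extract the fields NewsDiffs stores from the article HTML
            # (assumed field names).
            soup = bs4.BeautifulSoup(html)
            self.title = soup.find('h1').getText()
            self.byline = ''
            self.date = ''
            self.body = '\n'.join(p.getText() for p in soup.findAll('p'))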
To make the source display properly on the website, you will need
minor edits to two other files: website/frontend/models.py and
website/frontend/views.py (to define the display name and create a tab
for the source, respectively). Search for 'bbc' to find the locations
to edit.