https://github.com/jplusplus/statscraper
A base library for building web scrapers for statistical data, and a helper ontology for (primarily Swedish) statistical data.
Statscraper is a base library for building web scrapers for statistical data, with a helper ontology for (primarily Swedish) statistical data. A set of ready-to-use scrapers is included.
For users
=========
You can use Statscraper as a foundation for your next scraper, or try out any of the included scrapers. With Statscraper comes a unified interface for scraping, and some useful helper methods for scraper authors.
Full documentation: ReadTheDocs_
For updates and discussion: Facebook_
By Journalism++ Stockholm and Robin Linderborg.
Installing
----------
.. code:: bash

    pip install statscraper
Using a scraper
---------------
Scrapers act like “cursors” that move around a hierarchy of datasets and collections of datasets. Collections and datasets are referred to as “items”.
::

          ┏━ Collection ━━━ Collection ━┳━ Dataset
    ROOT ━╋━ Collection ━┳━ Dataset     ┣━ Dataset
          ┗━ Collection  ┣━ Dataset     ┗━ Dataset
                         ┗━ Dataset

          ╰───────────────────┬──────────────────╯
                            items
Here's a simple example, with a scraper that returns only a single dataset: the number of cranes spotted at Hornborgarsjön each day, as scraped from Länsstyrelsen i Västra Götalands län.
.. code:: python

    >>> from statscraper.scrapers import Cranes
    >>> scraper = Cranes()
    >>> scraper.items  # List available datasets (here only "Number of cranes")
    >>> dataset = scraper["Number of cranes"]
    >>> dataset.dimensions  # List the dimensions of this dataset
    >>> row = dataset.data[0]  # First row in this dataset
    >>> row.dict
    {'value': '7', u'date': u'7', u'month': u'march', u'year': u'2015'}
    >>> df = dataset.data.pandas  # Get this dataset as a Pandas dataframe
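
The cursor idea can also be illustrated without the library. The following is a minimal sketch in plain Python and not Statscraper's implementation: the `Node` and `Cursor` classes, their method names and the item names other than "Number of cranes" are hypothetical, and exist only to show a cursor moving down into a collection and back up again.

.. code:: python

    class Node:
        """Hypothetical item: a collection if it has children, else a dataset."""

        def __init__(self, name, children=()):
            self.name = name
            self.children = list(children)


    class Cursor:
        """Hypothetical cursor that remembers its path from the root."""

        def __init__(self, root):
            self.path = [root]

        @property
        def items(self):
            # Names of the items directly below the current position
            return [child.name for child in self.path[-1].children]

        def move_to(self, name):
            # Move down one level to the child with the given name
            for child in self.path[-1].children:
                if child.name == name:
                    self.path.append(child)
                    return self
            raise KeyError(name)

        def move_up(self):
            # Move up one level, staying put if already at the root
            if len(self.path) > 1:
                self.path.pop()
            return self


    root = Node("ROOT", [
        Node("Birds", [Node("Number of cranes"), Node("Number of swans")]),
        Node("Rainfall"),
    ])
    cursor = Cursor(root)
    cursor.move_to("Birds")
    print(cursor.items)  # ['Number of cranes', 'Number of swans']
    cursor.move_up()
    print(cursor.items)  # ['Birds', 'Rainfall']
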
Building a scraper
------------------
Scrapers are built by extending a base scraper, or a derivative of it. You need to provide a method for listing datasets or collections of datasets, and one for fetching data.
Statscraper is built for statistical data, meaning that it's most useful when the data you are scraping/fetching can be organized with a numerical value in each row:
======== ====== =======
city     year   value
======== ====== =======
Voi      2009   45483
Kabarnet 2006   10191
Taveta   2009   67505
======== ====== =======
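
When a source publishes a wide table instead, with one column per year for instance, it can be reshaped into this one-value-per-row form first. Below is a minimal sketch in plain Python, independent of Statscraper. The `to_rows` helper and the nested-dict input layout are assumptions made for illustration; the figures are taken from the table above.

.. code:: python

    def to_rows(wide, key="city", dimension="year"):
        """Turn {city: {year: value}} into one dict per numerical value."""
        rows = []
        for city, values in wide.items():
            for year, value in values.items():
                rows.append({key: city, dimension: year, "value": value})
        return rows


    wide = {"Voi": {2009: 45483}, "Kabarnet": {2006: 10191}}
    for row in to_rows(wide):
        print(row)
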
A scraper can override these methods:
* `_fetch_itemslist(item)` to yield collections or datasets at the current cursor position
* `_fetch_data(dataset)` to yield rows from the currently selected dataset
* `_fetch_dimensions(dataset)` to yield dimensions available for the currently selected dataset
* `_fetch_allowed_values(dimension)` to yield allowed values for a dimension
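
As a rough sketch of how these methods fit together, the example below uses a tiny stand-in base class written in plain Python, so that it runs without Statscraper installed. `MiniScraper`, `fetch_all`, `CityScraper` and the dataset name "Example" are hypothetical; only the four method names come from the list above, and the behaviour of the real `BaseScraper` is not reproduced here.

.. code:: python

    class MiniScraper:
        """Hypothetical stand-in for a base scraper; not Statscraper's code."""

        def _fetch_itemslist(self, item):
            raise NotImplementedError

        def _fetch_dimensions(self, dataset):
            return iter(())

        def _fetch_allowed_values(self, dimension):
            return iter(())

        def _fetch_data(self, dataset):
            raise NotImplementedError

        def fetch_all(self):
            # Hypothetical helper: collect the rows of every top-level item
            return {name: list(self._fetch_data(name))
                    for name in self._fetch_itemslist(None)}


    class CityScraper(MiniScraper):
        def _fetch_itemslist(self, item):
            # A single dataset at the top level
            yield "Example"

        def _fetch_dimensions(self, dataset):
            yield "city"
            yield "year"

        def _fetch_data(self, dataset):
            # Each row carries one numerical value plus its dimensions
            yield {"city": "Voi", "year": 2009, "value": 45483}
            yield {"city": "Taveta", "year": 2009, "value": 67505}


    scraper = CityScraper()
    print(list(scraper._fetch_dimensions("Example")))
    print(scraper.fetch_all())

Writing the methods as generators keeps each one short and lets the caller decide how many items or rows to consume.
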
A number of hooks are available for more advanced scrapers. These are used by adding the `on` decorator to a method:
.. code:: python

    @BaseScraper.on("up")
    def my_method(self):
        # Do something when the user moves up one level
        pass
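
To show how a decorator-based hook of this kind can work in principle, here is a self-contained sketch. It is not how Statscraper implements `on`: the `HookedBase` class, its `_fire` and `move_up` methods and the shared `_hooks` registry are assumptions made for illustration.

.. code:: python

    class HookedBase:
        """Hypothetical base class with a decorator-based hook registry."""

        _hooks = {}  # event name -> list of registered functions

        @classmethod
        def on(cls, event):
            # Return a decorator that registers a function under an event name
            def register(func):
                cls._hooks.setdefault(event, []).append(func)
                return func
            return register

        def _fire(self, event):
            # Call every function registered for this event
            for func in self._hooks.get(event, []):
                func(self)

        def move_up(self):
            self._fire("up")


    class MyScraper(HookedBase):
        visited = 0

        @HookedBase.on("up")
        def count_moves(self):
            # Runs each time the user moves up one level
            self.visited += 1


    scraper = MyScraper()
    scraper.move_up()
    scraper.move_up()
    print(scraper.visited)  # 2

In this simplified version the registry is shared by every subclass of `HookedBase`; a fuller design would probably keep one registry per scraper class.
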
For developers
==============
These instructions are for developers working on the BaseScraper. See above for instructions for developing a scraper using the BaseScraper.
Downloading
-----------
.. code:: bash

    git clone https://github.com/jplusplus/statscraper
    cd statscraper
    python setup.py install
This repo includes `statscraper-datatypes` as a subtree. To update this, do:
.. code:: bash

    git subtree pull --prefix statscraper/datatypes git@github.com:jplusplus/statscraper-datatypes.git master --squash
Tests
-----
Since 2.0.0 we are using pytest. To run an individual test file:

.. code:: bash

    python3 -m pytest tests/test-datatypes.py
Changelog
---------
The changelog has been moved to `CHANGELOG.md`.
.. _Facebook: https://www.facebook.com/groups/skrejperpark
.. _ReadTheDocs: http://statscraper.readthedocs.io