https://github.com/jplusplus/statscraper

A base library for building web scrapers for statistical data, and a helper ontology for (primarily Swedish) statistical data.

Host: GitHub
URL: https://github.com/jplusplus/statscraper
Owner: jplusplus
License: mit
Created: 2017-03-07T09:32:54.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2023-05-23T06:25:53.000Z (about 3 years ago)
Last Synced: 2024-12-13T21:59:41.888Z (over 1 year ago)
Topics: data-journalism, scraping
Language: Python
Homepage: https://www.facebook.com/groups/skrejperpark
Size: 283 KB
Stars: 13
Watchers: 14
Forks: 4
Open Issues: 2
Metadata Files:
- Readme: README.rst
- Changelog: CHANGELOG.md
- License: LICENSE


Statscraper is a base library for building web scrapers for statistical data, with a helper ontology for (primarily Swedish) statistical data. A set of ready-to-use scrapers is included.

For users
=========

You can use Statscraper as a foundation for your next scraper, or try out any of the included scrapers. With Statscraper comes a unified interface for scraping, and some useful helper methods for scraper authors.

Full documentation: ReadTheDocs_

For updates and discussion: Facebook_

By Journalism++ Stockholm and Robin Linderborg.

Installing
----------

.. code:: bash

  pip install statscraper

Using a scraper
---------------

Scrapers act like “cursors” that move around a hierarchy of datasets and collections of datasets. Collections and datasets are referred to as “items”.

::

        ┏━ Collection ━━━ Collection ━┳━ Dataset
  ROOT ━╋━ Collection ━┳━ Dataset     ┣━ Dataset
        ┗━ Collection  ┣━ Dataset     ┗━ Dataset
                       ┗━ Dataset
  ╰─────────────────────────┬───────────────────────╯
                          items
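
The cursor idea can be illustrated with a small, self-contained model. This is only a sketch: the ``Item`` and ``Cursor`` classes and the ``move_to`` and ``move_up`` methods are hypothetical names invented for this illustration, and do not reproduce Statscraper's own classes.

.. code:: python

  class Item:
      """A node in the hierarchy: a collection or a dataset (hypothetical)."""

      def __init__(self, name, children=None):
          self.name = name
          # An item created without a child list is treated as a dataset
          self.is_dataset = children is None
          self.children = list(children or [])
          self.parent = None
          for child in self.children:
              child.parent = self


  class Cursor:
      """Keeps track of the current position in the hierarchy (hypothetical)."""

      def __init__(self, root):
          self.current = root

      @property
      def items(self):
          # Names of the items directly below the current position
          return [child.name for child in self.current.children]

      def move_to(self, name):
          for child in self.current.children:
              if child.name == name:
                  self.current = child
                  return self
          raise KeyError(name)

      def move_up(self):
          # Moving up from the root is a no-op
          if self.current.parent is not None:
              self.current = self.current.parent
          return self


  root = Item("ROOT", [
      Item("Collection A", [Item("Dataset 1"), Item("Dataset 2")]),
      Item("Collection B", []),
  ])
  cursor = Cursor(root)
  cursor.move_to("Collection A")
  print(cursor.items)  # ['Dataset 1', 'Dataset 2']
  cursor.move_up()
  print(cursor.items)  # ['Collection A', 'Collection B']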

Here's a simple example, with a scraper that returns only a single dataset: the number of cranes spotted at Hornborgarsjön each day, as scraped from Länsstyrelsen i Västra Götalands län.

.. code:: python

  >>> from statscraper.scrapers import Cranes
  >>> scraper = Cranes()
  >>> scraper.items  # list available datasets; here a single one, "Number of cranes"
  >>> dataset = scraper["Number of cranes"]
  >>> dataset.dimensions  # the dimensions of this dataset: date, month and year
  >>> row = dataset.data[0]  # first row in this dataset
  >>> row  # a single result row
  >>> row.dict
  {'value': '7', u'date': u'7', u'month': u'march', u'year': u'2015'}
  >>> df = dataset.data.pandas  # get this dataset as a Pandas dataframe
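
Note that every field in the ``row.dict`` output above is a string. The following sketch shows one way such a dictionary could be turned into a proper date and an integer count. It is plain Python and does not use Statscraper; the ``parse_row`` helper is a hypothetical name, and it assumes that months are always given as lowercase English names, which is only known for the single row shown above.

.. code:: python

  from datetime import date

  # Assumption: months appear as English names, as in the 'march' example
  MONTHS = (
      "january", "february", "march", "april", "may", "june", "july",
      "august", "september", "october", "november", "december",
  )


  def parse_row(row_dict):
      """Return (date, count) for a dict shaped like the row.dict output."""
      month = MONTHS.index(row_dict["month"].lower()) + 1
      day = date(int(row_dict["year"]), month, int(row_dict["date"]))
      return day, int(row_dict["value"])


  example = {"value": "7", "date": "7", "month": "march", "year": "2015"}
  day, count = parse_row(example)
  print(day.isoformat(), count)  # 2015-03-07 7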

Building a scraper
------------------

Scrapers are built by extending a base scraper, or a derivative of it. You need to provide a method for listing datasets or collections of datasets, and a method for fetching data.

Statscraper is built for statistical data, meaning that it's most useful when the data you are scraping/fetching can be organized with a numerical value in each row:

========  ======  =======
city      year    value
========  ======  =======
Voi       2009    45483
Kabarnet  2006    10191
Taveta    2009    67505
========  ======  =======

A scraper can override these methods:

* `_fetch_itemslist(item)` to yield collections or datasets at the current cursor position

* `_fetch_data(dataset)` to yield rows from the currently selected dataset

* `_fetch_dimensions(dataset)` to yield dimensions available for the currently selected dataset

* `_fetch_allowed_values(dimension)` to yield allowed values for a dimension
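
The division of labour between these methods can be sketched without the library itself. In the example below, ``ToyBaseScraper`` is a hypothetical stand-in written for this illustration, not Statscraper's ``BaseScraper``, and it only mimics the method names listed above; the real base class does considerably more. The dataset name ``"Example dataset"`` is also made up, and the rows are those from the table earlier in this section.

.. code:: python

  class ToyBaseScraper:
      """Hypothetical stand-in for a base scraper; not the real BaseScraper."""

      def _fetch_itemslist(self, item):
          raise NotImplementedError

      def _fetch_dimensions(self, dataset):
          raise NotImplementedError

      def _fetch_data(self, dataset):
          raise NotImplementedError

      # Public helpers that collect whatever the subclass yields
      def list_items(self, item=None):
          return list(self._fetch_itemslist(item))

      def list_dimensions(self, dataset):
          return list(self._fetch_dimensions(dataset))

      def fetch(self, dataset):
          return list(self._fetch_data(dataset))


  class CityScraper(ToyBaseScraper):
      """Yields the rows from the example table, one value per row."""

      ROWS = [
          ("Voi", 2009, 45483),
          ("Kabarnet", 2006, 10191),
          ("Taveta", 2009, 67505),
      ]

      def _fetch_itemslist(self, item):
          yield "Example dataset"  # made-up dataset name

      def _fetch_dimensions(self, dataset):
          yield "city"
          yield "year"

      def _fetch_data(self, dataset):
          for city, year, value in self.ROWS:
              yield {"city": city, "year": year, "value": value}


  scraper = CityScraper()
  print(scraper.list_items())
  print(scraper.list_dimensions("Example dataset"))
  for row in scraper.fetch("Example dataset"):
      print(row)

Writing the fetch methods as generators means a scraper can hand over rows as soon as they are parsed, instead of building the full list first.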

A number of hooks are available for more advanced scrapers. To use one, add the `on` decorator to a method:

.. code:: python

  @BaseScraper.on("up")
  def my_method(self):
      # Do something when the user moves up one level
      pass
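
For readers unfamiliar with this pattern, here is a rough sketch of how a decorator-based hook registry can work in general. It is not Statscraper's implementation: ``ToyScraper``, its ``_hooks`` dictionary and its ``move_up`` method are hypothetical, and only the ``on("up")`` call shape is taken from the example above.

.. code:: python

  class ToyScraper:
      """Hypothetical scraper with a minimal hook registry."""

      _hooks = {}  # maps an event name to a list of callbacks

      @classmethod
      def on(cls, event):
          """Return a decorator that registers a callback for ``event``."""
          def decorator(func):
              cls._hooks.setdefault(event, []).append(func)
              return func  # leave the decorated function unchanged
          return decorator

      def move_up(self):
          # Run every callback registered for the "up" event
          for hook in self._hooks.get("up", []):
              hook(self)


  calls = []


  @ToyScraper.on("up")
  def log_move(scraper):
      calls.append("moved up")


  ToyScraper().move_up()
  print(calls)  # ['moved up']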

For developers
==============

These instructions are for developers working on the BaseScraper. See above for instructions for developing a scraper using the BaseScraper.

Downloading
-----------

.. code:: bash

  git clone https://github.com/jplusplus/statscraper
  cd statscraper
  python setup.py install

This repo includes `statscraper-datatypes` as a subtree. To update this, do:

.. code:: bash

  git subtree pull --prefix statscraper/datatypes git@github.com:jplusplus/statscraper-datatypes.git master --squash

Tests
-----

Since 2.0.0 we are using pytest. To run an individual test:

.. code:: bash

  python3 -m pytest tests/test-datatypes.py

Changelog
---------

The changelog has been moved to `CHANGELOG.md`.

.. _Facebook: https://www.facebook.com/groups/skrejperpark

.. _ReadTheDocs: http://statscraper.readthedocs.io
