{"id":23290995,"url":"https://github.com/jplusplus/statscraper","last_synced_at":"2025-08-21T22:31:53.571Z","repository":{"id":57471246,"uuid":"84180734","full_name":"jplusplus/statscraper","owner":"jplusplus","description":"A base library for building web scrapers for statistical data, and a helper ontology for (primarily Swedish) statistical data.","archived":false,"fork":false,"pushed_at":"2023-05-23T06:25:53.000Z","size":290,"stargazers_count":13,"open_issues_count":2,"forks_count":4,"subscribers_count":14,"default_branch":"master","last_synced_at":"2024-12-13T21:59:41.888Z","etag":null,"topics":["data-journalism","scraping"],"latest_commit_sha":null,"homepage":"https://www.facebook.com/groups/skrejperpark","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jplusplus.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-03-07T09:32:54.000Z","updated_at":"2023-11-14T15:21:28.000Z","dependencies_parsed_at":"2023-01-22T18:00:29.770Z","dependency_job_id":"a12f0b71-78ab-4f02-812a-02cd4ecf0008","html_url":"https://github.com/jplusplus/statscraper","commit_stats":{"total_commits":445,"total_committers":4,"mean_commits":111.25,"dds":"0.24943820224719104","last_synced_commit":"35f9db2b96fbf6b3dd646e91ea459d27f67ef065"},"previous_names":["jplusplus/skrejperpark"],"tags_count":20,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jplusplus%2Fstatscraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jplusplus%2Fstatscraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/jplusplus%2Fstatscraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jplusplus%2Fstatscraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jplusplus","download_url":"https://codeload.github.com/jplusplus/statscraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230537084,"owners_count":18241519,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-journalism","scraping"],"created_at":"2024-12-20T05:13:49.068Z","updated_at":"2024-12-20T05:13:49.641Z","avatar_url":"https://github.com/jplusplus.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Statscraper is a base library for building web scrapers for statistical data, with a helper ontology for (primarily Swedish) statistical data. A set of ready-to-use scrapers is included.\n\nFor users\n=========\n\nYou can use Statscraper as a foundation for your next scraper, or try out any of the included scrapers. Statscraper provides a unified interface for scraping and some useful helper methods for scraper authors.\n\nFull documentation: ReadTheDocs_\n\nFor updates and discussion: Facebook_\n\nBy `Journalism++ Stockholm \u003chttp://jplusplus.org/sv\u003e`_, and Robin Linderborg.\n\nInstalling\n----------\n\n.. code:: bash\n\n  pip install statscraper\n\nUsing a scraper\n---------------\nScrapers act like “cursors” that move around a hierarchy of datasets and collections of datasets. 
Collections and datasets are referred to as “items”.\n\n::\n\n        ┏━ Collection ━━━ Collection ━┳━ Dataset\n  ROOT ━╋━ Collection ━┳━ Dataset     ┣━ Dataset\n        ┗━ Collection  ┣━ Dataset     ┗━ Dataset\n                       ┗━ Dataset\n\n  ╰─────────────────────────┬───────────────────────╯\n                       items\n\nHere's a simple example, with a scraper that returns only a single dataset: the number of cranes spotted at Hornborgarsjön each day, as scraped from `Länsstyrelsen i Västra Götalands län \u003chttp://web05.lansstyrelsen.se/transtat_O/transtat.asp\u003e`_.\n\n.. code:: python\n\n  \u003e\u003e\u003e from statscraper.scrapers import Cranes\n\n  \u003e\u003e\u003e scraper = Cranes()\n  \u003e\u003e\u003e scraper.items  # List available datasets\n  [\u003cDataset: Number of cranes\u003e]\n\n  \u003e\u003e\u003e dataset = scraper[\"Number of cranes\"]\n  \u003e\u003e\u003e dataset.dimensions\n  [\u003cDimension: date (Day of the month)\u003e, \u003cDimension: month\u003e, \u003cDimension: year\u003e]\n\n  \u003e\u003e\u003e row = dataset.data[0]  # first row in this dataset\n  \u003e\u003e\u003e row\n  \u003cResult: 7 (value)\u003e\n  \u003e\u003e\u003e row.dict\n  {'value': '7', u'date': u'7', u'month': u'march', u'year': u'2015'}\n\n  \u003e\u003e\u003e df = dataset.data.pandas  # get this dataset as a Pandas dataframe\n\nBuilding a scraper\n------------------\nScrapers are built by extending a base scraper, or a derivative of it. 
You need to provide a method for listing datasets or collections of datasets, and for fetching data.\n\nStatscraper is built for statistical data, meaning that it's most useful when the data you are scraping/fetching can be organized with a numerical value in each row:\n\n========  ======  =======\n  city     year    value\n========  ======  =======\nVoi       2009    45483\nKabarnet  2006    10191\nTaveta    2009    67505\n========  ======  =======\n\nA scraper can override these methods:\n\n* `_fetch_itemslist(item)` to yield collections or datasets at the current cursor position\n* `_fetch_data(dataset)` to yield rows from the currently selected dataset\n* `_fetch_dimensions(dataset)` to yield dimensions available for the currently selected dataset\n* `_fetch_allowed_values(dimension)` to yield allowed values for a dimension\n\nA number of hooks are available for more advanced scrapers. To register a hook, add the `on` decorator to a method:\n\n.. code:: python\n\n  @BaseScraper.on(\"up\")\n  def my_method(self):\n    # Do something when the user moves up one level\n    pass\n\nFor developers\n==============\nThese instructions are for developers working on the BaseScraper. See above for instructions for developing a scraper using the BaseScraper.\n\nDownloading\n-----------\n\n.. code:: bash\n\n  git clone https://github.com/jplusplus/statscraper\n  cd statscraper\n  python setup.py install\n\nThis repo includes `statscraper-datatypes` as a subtree. To update this, do:\n\n.. code:: bash\n\n  git subtree pull --prefix statscraper/datatypes git@github.com:jplusplus/statscraper-datatypes.git master --squash\n\n\nTests\n-----\n\nSince 2.0.0 we are using pytest. To run an individual test:\n\n.. code:: bash\n\n  python3 -m pytest tests/test-datatypes.py\n\n\nChangelog\n---------\nThe changelog has been moved to `CHANGELOG.md \u003cCHANGELOG.md\u003e`_.\n\n.. _Facebook: https://www.facebook.com/groups/skrejperpark\n.. 
_ReadTheDocs: http://statscraper.readthedocs.io\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjplusplus%2Fstatscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjplusplus%2Fstatscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjplusplus%2Fstatscraper/lists"}