https://github.com/devinit/iati-covid19-first-prototype

Extracts COVID-19 data from D-Portal and reprocesses it nightly (not used for the prod visual).

# COVID-19 Data

## Note that this data is not used for the prod visual

The scraper and data for the prod visual can be found here: https://github.com/OCHA-DAP/hdx-scraper-iati-viz

This scraper extracts data from D-Portal nightly and reprocesses it:

* selects certain fields and exports them in a nice clean JSON format
* converts financial data to USD
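The USD conversion amounts to dividing the transaction value by an exchange rate looked up for the transaction's currency and date. A minimal sketch, assuming a cached rates table keyed by currency and month (the `RATES` table, the rate convention, and the `to_usd` helper are all illustrative, not this repository's actual API):

```python
# Hypothetical cached exchange-rate table: units of currency per 1 USD,
# keyed by (currency, month). Real rates come from the Code for IATI file.
RATES = {
    ("EUR", "2020-04"): 0.92,
    ("GBP", "2020-04"): 0.81,
}

def to_usd(value, currency, month, rates=RATES):
    """Convert `value` from `currency` to USD for the given month."""
    if currency == "USD":
        return value
    rate = rates[(currency, month)]  # raises KeyError if the rate is missing
    return round(value / rate, 2)

print(to_usd(1000.0, "EUR", "2020-04"))  # → 1086.96
```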

The scripts in this repository automatically generate fresh data every day (using GitHub Actions), which can be seen in (and downloaded from) [the gh-pages branch](https://github.com/OCHA-DAP/covid19-data/tree/gh-pages).

For more detail on how the data was processed, see the [data notes](https://github.com/OCHA-DAP/covid19-data/blob/master/DATA-NOTES.md).

### Installing

```
git clone git@github.com:OCHA-DAP/covid19-data.git
virtualenv ./pyenv
source ./pyenv/bin/activate
pip install -r requirements.txt
```

### Running

Download and reprocess data using the following script. Add `--help` to see optional arguments.

```
python run.py
```

#### Running with cached rates (saves downloading a new file)

```
python run.py --cached-rates
```

#### Running and deploying to gh-pages

```
python run.py --deploy
```

### Overview

The code in this repository runs at 15:00 UTC every day, using GitHub Actions. Files are pushed to the `gh-pages` branch and made available through GitHub Pages. The data is then visualised using software stored in the [OCHA-DAP/viz-covid19-visualisation](https://github.com/OCHA-DAP/viz-covid19-visualisation) repository, which is also served from GitHub Pages.
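The daily 15:00 UTC schedule could be expressed in a GitHub Actions workflow along these lines (a hedged sketch; the repository's actual workflow file may differ in names and steps):

```yaml
# Hypothetical .github/workflows/nightly.yml sketch
name: Nightly data refresh
on:
  schedule:
    - cron: "0 15 * * *"   # 15:00 UTC every day
jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      - run: pip install -r requirements.txt
      - run: python run.py --deploy   # pushes output to gh-pages
```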

#### Data sources

Data is downloaded from a few places:

* IATI data: D-Portal
* FTS data: UNOCHA FTS
* Codelists: CodeforIATI
* Exchange Rates: CodeforIATI

These downloads are now reasonably stable, though there are a few things to be aware of:

* **IATI data**: D-Portal fairly frequently fails to respond with relevant data. Downloads appear to be more reliable now that we request fewer activities at once and run at 15:00 UTC rather than early in the morning (when D-Portal is itself collecting and updating source data). One option would be to switch to the new IATI Datastore (though see discussion below).
* **FTS data**: FTS now seems to be pretty stable; occasionally the FTS API is unavailable.
* **Codelists**: these endpoints are very stable now, as flat files are hosted on GitHub Pages. These files are generally much faster to download than the official IATI codelists, and they are also often more up to date.
* **Exchange rates**: this file is also now very stable, again because a single compiled flat file is hosted on GitHub Pages. Previously this data was hosted only on morph.io, which had a lot of stability issues; there no longer appear to be any significant problems here.
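Given the occasional flakiness noted above, downloads of this kind are commonly wrapped in a small retry helper. A generic sketch (the `fetch_with_retry` name is illustrative, not taken from this repository):

```python
import time

def fetch_with_retry(fetch, attempts=3, delay=1.0):
    """Call `fetch()` up to `attempts` times, backing off between tries."""
    for i in range(attempts):
        try:
            return fetch()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay * (i + 1))  # simple linear backoff

# Example with a flaky callable that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("temporary failure")
    return "ok"

print(fetch_with_retry(flaky, attempts=3, delay=0))  # prints "ok"
```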

#### Process

The basic process is as follows:

* `run.py`:
  * either download or load in a list of exchange rates
  * download data from D-Portal (`get_activities_from_urls()`)
  * filter out activities that have certain problems (`activities_filter()`)
  * filter out activities that don't conform to the IATI COVID-19 Publishing Guidance
  * extract relevant data from each activity (`process_activity()`)
  * write XML data for all activities (`write_xml_files()`)
    * up to 3000 activities per file, labelled `activities-N.xml`, where N is the page
  * write XML data for each reporting organisation
  * write out the list of sectors and countries used in the data (so that the user interface doesn't display countries or sectors with no activities)
  * download and process FTS data
  * run `traceability.py` (see below)
  * remove `activities.xml` (it is used by `traceability.py`, but it is a very large file and exceeds GitHub usage limits)
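The guidance-conformance filter above is, at its core, a keyword check: the IATI COVID-19 Publishing Guidance asks publishers to flag relevant activities with the "COVID-19" keyword (among other markers). A very simplified sketch, with dict-shaped activities and field names that are illustrative rather than this repository's actual code:

```python
def is_covid19_activity(activity):
    """Return True if the activity's title or any description mentions
    the 'COVID-19' keyword (case-insensitive). The real guidance also
    covers transaction descriptions and humanitarian-scope codes."""
    texts = [activity.get("title", "")] + activity.get("descriptions", [])
    return any("COVID-19" in text.upper() for text in texts)

activities = [
    {"title": "Covid-19 emergency response", "descriptions": []},
    {"title": "Rural roads programme", "descriptions": ["Feeder roads"]},
]
kept = [a for a in activities if is_covid19_activity(a)]
print(len(kept))  # 1
```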
* `traceability.py`:
  * read in the list of exchange rates
  * download the `TransactionType` codelist
  * read in the activities XML (from `activities.xml`)
  * identify which activities contain explicit COVID-19 transactions
  * extract relevant data from each transaction (`make_transaction()`)
  * export transactions to Excel
  * disaggregate transactions by sector and country (`make_sector_country_transactions_data()`)
  * export disaggregated data to JSON and Excel
  * make grouped traceability data for the Sankey diagram
  * export grouped traceability data to JSON and Excel
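The grouping step for the Sankey diagram is essentially an aggregation of transaction values by (provider, receiver) pair, so each link's width reflects the total flow between two organisations. A minimal sketch with hypothetical dict-shaped transactions (the function and field names are assumptions, not this repository's code):

```python
from collections import defaultdict

def group_for_sankey(transactions):
    """Sum USD values per (provider, receiver) link for a Sankey diagram."""
    links = defaultdict(float)
    for t in transactions:
        links[(t["provider"], t["receiver"])] += t["value_usd"]
    # One dict per link, in a stable order for reproducible output
    return [{"source": p, "target": r, "value": v}
            for (p, r), v in sorted(links.items())]

transactions = [
    {"provider": "Donor A", "receiver": "NGO X", "value_usd": 100.0},
    {"provider": "Donor A", "receiver": "NGO X", "value_usd": 50.0},
    {"provider": "Donor B", "receiver": "NGO X", "value_usd": 75.0},
]
print(group_for_sankey(transactions))
```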