# Pandemic-Knowledge

![Pandemic Knowledge logo](./pandemic_knowledge.png)

[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

A fully-featured multi-source data pipeline for continuously extracting knowledge from COVID-19 data.

- Contamination figures
- Vaccination figures
- Death figures
- COVID-19-related news (Google News, Twitter)

## What you can achieve

| Live contamination map + latest news | Last 7 days of news |
| :---------------------------------------------------------------------------------: | :----------------------------------------------------: |
| ![Live contamination and vaccination world map](./illustrations/live_dashboard.png) | ![Latest news, live!](./illustrations/latest_news.png) |

| France 3-weeks live map (Kibana Canvas) | Live vaccinations map |
| :-----------------------------------------------------------: | :-----------------------------------------------------------: |
| ![France Live Status](./illustrations/france_live_status.png) | ![World vaccination map](./illustrations/vaccination_map.png) |

## Context

This project was built in 4 days as part of an MSc hackathon at [ETNA](https://etna.io), a French computer science school.

The goals were both to prototype a big data pipeline and to contribute to an open-source project.

## Install

Below, you'll find the procedure to process COVID-related files and news into the Pandemic Knowledge database (Elasticsearch).

The process is **scheduled** to run every 24 hours so you can keep the files up to date and always get the latest news.

- [Pandemic-Knowledge](#pandemic-knowledge)
  - [What you can achieve](#what-you-can-achieve)
  - [Context](#context)
  - [Install](#install)
    - [Env file](#env-file)
    - [Initialize elasticsearch](#initialize-elasticsearch)
    - [Initialize Prefect](#initialize-prefect)
    - [Run Prefect workers](#run-prefect-workers)
    - [COVID-19 data](#covid-19-data)
    - [News data](#news-data)
    - [News web app](#news-web-app)

### Env file

Running this project on your local computer? Just copy the `.env.example` file:

```bash
cp .env.example .env
```

Open this `.env` file and edit password-related variables.
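
For instance, assuming the password-related variables contain `PASSWORD` in their name (check the file, the exact names may differ), you can list them with:

```bash
# List the variables you most likely need to change
grep -i password .env
```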

### Initialize elasticsearch

Raise your host's virtual memory limits so Elasticsearch can handle high I/O:

```bash
sudo sysctl -w vm.max_map_count=500000
```
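
This change is not persistent across reboots; if you want to keep it, you can also write it to a sysctl configuration file, for example:

```bash
# Persist the setting (optional); the file name is arbitrary
echo "vm.max_map_count=500000" | sudo tee /etc/sysctl.d/99-elasticsearch.conf
sudo sysctl --system
```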

Then:

```bash
docker-compose -f create-certs.yml run --rm create_certs
docker-compose up -d es01 es02 es03 kibana
```
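
To check that the three-node cluster is up once the containers have started, you can query the cluster health endpoint. This assumes security is enabled with the `elastic` superuser and the password you set in `.env` (the certificate is self-signed, hence `-k`):

```bash
# Expect a "green" or "yellow" status once all nodes have joined
curl -k -u elastic:<your_elastic_password> "https://localhost:9200/_cluster/health?pretty"
```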

### Initialize Prefect

Create a `~/.prefect/config.toml` file with the following content:

```toml
# debug mode
debug = true

# base configuration directory (typically you won't change this!)
home_dir = "~/.prefect"

backend = "server"

[server]
host = "http://172.17.0.1"
port = "4200"
host_port = "4200"
endpoint = "${server.host}:${server.port}"
```
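
The `host` value `http://172.17.0.1` is the default address of the Docker bridge on Linux, which lets containers reach the Prefect server exposed on the host. If your bridge uses a different address, adjust the config accordingly; you can check it with:

```bash
# Show the Docker bridge (docker0) address on the host
ip addr show docker0 | grep "inet "
```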

Run Prefect:

```bash
docker-compose up -d prefect_postgres prefect_hasura prefect_graphql prefect_towel prefect_apollo prefect_ui
```

We need to create a _tenant_. Execute on your host:

```bash
pip3 install prefect
prefect backend server
prefect server create-tenant --name default --slug default
```

Access the web UI at [localhost:8081](http://localhost:8081)

### Run Prefect workers

Agents are services that run your scheduled flows.

1. Open and optionally edit the [`agent/config.toml`](./agent/config.toml) file.

2. Let's instantiate 3 workers:

```bash
docker-compose -f agent/docker-compose.yml up -d --build --scale agent=3 agent
```

> :information_source: You can run the agents on a machine other than the one hosting the Prefect server. Edit the [`agent/config.toml`](./agent/config.toml) file accordingly.
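
To confirm the agents started and registered against the Prefect server, you can inspect their containers and logs:

```bash
# List the agent containers and tail their logs
docker-compose -f agent/docker-compose.yml ps
docker-compose -f agent/docker-compose.yml logs --tail=20 agent
```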

### COVID-19 data

Injection scripts are scheduled in Prefect so they automatically inject the latest data (delete + inject).

There are several data sources supported by Pandemic Knowledge:

- [Our World In Data](https://ourworldindata.org/coronavirus-data); used by Google
  - docker-compose slug: `insert_owid`
  - MinIO bucket: `contamination-owid`
  - Format: CSV
- [OpenCovid19-Fr](https://github.com/opencovid19-fr/data)
  - docker-compose slug: `insert_france`
  - Format: CSV (downloaded from the Internet)
- [Public Health France - Virological test results](https://www.data.gouv.fr/en/datasets/donnees-relatives-aux-resultats-des-tests-virologiques-covid-19/) (official source)
  - docker-compose slug: `insert_france_virtests`
  - Format: CSV (downloaded from the Internet)

1. Start MinIO and import your files into the buckets mentioned above.

For _Our World In Data_, create the `contamination-owid` bucket and import the CSV file inside.

```bash
docker-compose up -d minio
```

> MinIO is available at `localhost:9000`
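
If you prefer the command line to the MinIO web console, here is a sketch using the `mc` client; the access/secret keys are the MinIO credentials from your `.env` file, and `owid-covid-data.csv` is a placeholder for the CSV you downloaded:

```bash
# Register the local MinIO instance, create the bucket and upload the CSV
mc alias set pandemic http://localhost:9000 <minio_access_key> <minio_secret_key>
mc mb pandemic/contamination-owid
mc cp owid-covid-data.csv pandemic/contamination-owid/
```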

2. Download the dependencies and start the injection service of your choice. For instance:

```bash
pip3 install -r ./flow/requirements.txt
docker-compose -f insert.docker-compose.yml up --build insert_owid
```

3. In [Kibana](https://localhost:5601), create an index pattern `contamination_owid_*`

4. Once injected, we recommend adjusting the number of replicas [in the Dev Tools console](https://localhost:5601/app/dev_tools#/console):

```json
PUT /contamination_owid_*/_settings
{
  "index": {
    "number_of_replicas": "2"
  }
}
```
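
The same setting can be applied from the command line; as above, this assumes the `elastic` user and the password from your `.env` file:

```bash
# Set the number of replicas on all OWID contamination indices
curl -k -u elastic:<your_elastic_password> -X PUT "https://localhost:9200/contamination_owid_*/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": "2"}}'
```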

5. Start making your dashboards in [Kibana](https://localhost:5601)!

### News data

There are two sources for news:

- Google News (elasticsearch index: `news_googlenews`)
- Twitter (elasticsearch index: `news_tweets`)

1. Run the Google News crawler:

```bash
docker-compose -f crawl.docker-compose.yml up --build crawl_google_news # and/or crawl_tweets
```
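
To verify that documents are being indexed, you can count them once a crawler has run for a while (same credential assumptions as above):

```bash
# Count the crawled news documents across both indices
curl -k -u elastic:<your_elastic_password> "https://localhost:9200/news_*/_count?pretty"
```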

2. In Kibana, create a `news_*` index pattern

3. **Edit** the index pattern fields:

| Name | Type | Format |
| ---- | ----------------------------------------------------- | ------- |
| img | string | **Url** |
| link | string **with Type: Image** with empty _URL template_ | **Url** |

4. Create your visualisation

### News web app

Browse through the news with our web application.

![News web app](./illustrations/news_web_app.png)

1. Make sure you've accepted the self-signed certificate of Elasticsearch at [`https://localhost:9200`](https://localhost:9200)

2. Start up the app:

```bash
docker-compose -f news_app/docker-compose.yml up --build -d
```

3. Discover the app at [`localhost:8080`](http://localhost:8080)

---

## TODOs

Possible improvements:

- [ ] [Use Dask to parallelize](https://docs.prefect.io/core/idioms/parallel.html) the processing of CSV lines in batches of 1000
- [ ] Remove indices only when the source processing succeeds (add the new index, then remove the old one)
- [ ] Remove indices only when crawling succeeds (add the new index, then remove the old one)

## Useful commands

To stop everything:

```bash
docker-compose down
docker-compose -f agent/docker-compose.yml down
docker-compose -f insert.docker-compose.yml down
docker-compose -f crawl.docker-compose.yml down
```
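
If you also want to remove the Docker volumes (and therefore the indexed data), you can append the `-v` flag, for example:

```bash
# Stops the main stack and deletes its named volumes
docker-compose down -v
```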

To start each service, step by step:

```bash
docker-compose up -d es01 es02 es03 kibana
docker-compose up -d minio
docker-compose up -d prefect_postgres prefect_hasura prefect_graphql prefect_towel prefect_apollo prefect_ui
docker-compose -f agent/docker-compose.yml up -d --build --scale agent=3 agent
```