{"id":22018446,"url":"https://github.com/flavienbwk/pandemic-knowledge","last_synced_at":"2025-05-07T03:27:48.214Z","repository":{"id":52270000,"uuid":"361675165","full_name":"flavienbwk/Pandemic-Knowledge","owner":"flavienbwk","description":"A fully-featured multi-source data pipeline for continuously extracting knowledge from COVID-19 data.","archived":false,"fork":false,"pushed_at":"2021-06-20T10:17:34.000Z","size":3255,"stargazers_count":21,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-31T05:51:07.663Z","etag":null,"topics":["crawl","data-pipeline","data-visualization","elasticsearch","kibana","minio","prefect","python3","s3"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/flavienbwk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-04-26T08:31:25.000Z","updated_at":"2022-04-16T04:21:52.000Z","dependencies_parsed_at":"2022-09-11T12:20:27.344Z","dependency_job_id":null,"html_url":"https://github.com/flavienbwk/Pandemic-Knowledge","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flavienbwk%2FPandemic-Knowledge","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flavienbwk%2FPandemic-Knowledge/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flavienbwk%2FPandemic-Knowledge/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flavienbwk%2FPandemic-Knowledge/manifests","owner_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/owners/flavienbwk","download_url":"https://codeload.github.com/flavienbwk/Pandemic-Knowledge/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252805984,"owners_count":21807129,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawl","data-pipeline","data-visualization","elasticsearch","kibana","minio","prefect","python3","s3"],"created_at":"2024-11-30T05:12:13.672Z","updated_at":"2025-05-07T03:27:48.194Z","avatar_url":"https://github.com/flavienbwk.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pandemic-Knowledge\n\n![Pandemic Knowledge logo](./pandemic_knowledge.png)\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://travis-ci.com/flavienbwk/Pandemic-Knowledge\" target=\"_blank\"\u003e\n        \u003cimg src=\"https://travis-ci.org/flavienbwk/Pandemic-Knowledge.svg?branch=main\"/\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/psf/black\"\u003e\u003cimg alt=\"Code style: black\" src=\"https://img.shields.io/badge/code%20style-black-000000.svg\"\u003e\u003c/a\u003e\n    \u003ca href=\"./LICENSE\"\u003e\u003cimg alt=\"Repo license MIT\" src=\"https://img.shields.io/badge/License-MIT-yellow.svg\"/\u003e\u003c/a\u003e\n\u003c/p\u003e\n\nA fully-featured multi-source data pipeline for continuously extracting knowledge from COVID-19 data.\n\n- Contamination figures\n- Vaccination figures\n- Death figures\n- COVID-19-related news (Google News, Twitter)\n\n## What you can achieve\n\n|                        Live 
contaminations map + Latest news                        |                   Last 7 days news                    |\n| :---------------------------------------------------------------------------------: | :---------------------------------------------------: |\n| ![Live contamination and vaccination world map](./illustrations/live_dashboard.png) | ![Last news, live !](./illustrations/latest_news.png) |\n\n|            France 3-weeks live map (Kibana Canvas)            |                     Live vaccinations map                     |\n| :-----------------------------------------------------------: | :-----------------------------------------------------------: |\n| ![France Live Status](./illustrations/france_live_status.png) | ![World vaccination map](./illustrations/vaccination_map.png) |\n\n## Context\n\nThis project was built over 4 days as part of an MSc hackathon at [ETNA](https://etna.io), a French computer science school.\n\nThe goals were both to prototype a big data pipeline and to contribute to an open source project.\n\n## Install\n\nBelow, you'll find the procedure to ingest COVID-related files and news into the Pandemic Knowledge database (Elasticsearch).\n\nThe process is **scheduled** to run every 24 hours, so you can update the files and obtain the latest news.\n\n- [Pandemic-Knowledge](#pandemic-knowledge)\n  - [What you can achieve](#what-you-can-achieve)\n  - [Context](#context)\n  - [Install](#install)\n    - [Env file](#env-file)\n    - [Initialize elasticsearch](#initialize-elasticsearch)\n    - [Initialize Prefect](#initialize-prefect)\n    - [Run Prefect workers](#run-prefect-workers)\n    - [COVID-19 data](#covid-19-data)\n    - [News data](#news-data)\n    - [News web app](#news-web-app)\n\n### Env file\n\nRunning this project on your local computer ? 
Just copy the `.env.example` file :\n\n```bash\ncp .env.example .env\n```\n\nOpen this `.env` file and edit the password-related variables.\n\n### Initialize elasticsearch\n\nRaise your host's `vm.max_map_count` kernel setting, which Elasticsearch needs for its memory-mapped index files :\n\n```bash\nsudo sysctl -w vm.max_map_count=500000\n```\n\nThen :\n\n```bash\ndocker-compose -f create-certs.yml run --rm create_certs\ndocker-compose up -d es01 es02 es03 kibana\n```\n\n### Initialize Prefect\n\nCreate a `~/.prefect/config.toml` file with the following content :\n\n```toml\n# debug mode\ndebug = true\n\n# base configuration directory (typically you won't change this!)\nhome_dir = \"~/.prefect\"\n\nbackend = \"server\"\n\n[server]\nhost = \"http://172.17.0.1\"\nport = \"4200\"\nhost_port = \"4200\"\nendpoint = \"${server.host}:${server.port}\"\n```\n\nRun Prefect :\n\n```bash\ndocker-compose up -d prefect_postgres prefect_hasura prefect_graphql prefect_towel prefect_apollo prefect_ui\n```\n\nWe need to create a _tenant_. Run the following on your host :\n\n```bash\npip3 install prefect\nprefect backend server\nprefect server create-tenant --name default --slug default\n```\n\nAccess the web UI at [localhost:8081](http://localhost:8081).\n\n### Run Prefect workers\n\nAgents are services that run your scheduled flows.\n\n1. Open and optionally edit the [`agent/config.toml`](./agent/config.toml) file.\n\n2. Let's instantiate 3 workers :\n\n  ```bash\n  docker-compose -f agent/docker-compose.yml up -d --build --scale agent=3 agent\n  ```\n\n  \u003e :information_source: You can run the agent on a different machine from the one hosting the Prefect server. 
Edit the [`agent/config.toml`](./agent/config.toml) file for that.\n\n### COVID-19 data\n\nInjection scripts are scheduled in Prefect so they automatically inject the latest data (delete + inject).\n\nPandemic Knowledge supports several data sources :\n\n- [Our World In Data](https://ourworldindata.org/coronavirus-data); used by Google\n  - docker-compose slug : `insert_owid`\n  - MinIO bucket : `contamination-owid`\n  - Format : CSV\n- [OpenCovid19-Fr](https://github.com/opencovid19-fr/data)\n  - docker-compose slug : `insert_france`\n  - Format : CSV (download from Internet)\n- [Public Health France - Virological test results](https://www.data.gouv.fr/en/datasets/donnees-relatives-aux-resultats-des-tests-virologiques-covid-19/) (official source)\n  - docker-compose slug : `insert_france_virtests`\n  - Format : CSV (download from Internet)\n\n1. Start MinIO and import your files into the buckets listed above.\n\n    For _Our World In Data_, create the `contamination-owid` bucket and import the CSV file inside.\n\n    ```bash\n    docker-compose up -d minio\n    ```\n\n    \u003e MinIO is available at `localhost:9000`\n\n2. Download dependencies and start the injection service of your choice. For instance :\n\n    ```bash\n    pip3 install -r ./flow/requirements.txt\n    docker-compose -f insert.docker-compose.yml up --build insert_owid\n    ```\n\n3. In [Kibana](https://localhost:5601), create an index pattern `contamination_owid_*`\n\n4. Once injected, we recommend adjusting the number of replicas [in Dev Tools](https://localhost:5601/app/dev_tools#/console) :\n\n    ```json\n    PUT /contamination_owid_*/_settings\n    {\n        \"index\" : {\n            \"number_of_replicas\" : \"2\"\n        }\n    }\n    ```\n\n5. 
Start making your dashboards in [Kibana](https://localhost:5601) !\n\n### News data\n\nThere are two sources for news :\n\n- Google News (elasticsearch index: `news_googlenews`)\n- Twitter (elasticsearch index: `news_tweets`)\n\n1. Run the Google News crawler :\n\n  ```bash\n  docker-compose -f crawl.docker-compose.yml up --build crawl_google_news # and/or crawl_tweets\n  ```\n\n2. In Kibana, create a `news_*` index pattern\n\n3. **Edit** the index pattern fields :\n\n  | Name | Type                                                  | Format  |\n  | ---- | ----------------------------------------------------- | ------- |\n  | img  | string                                                | **Url** |\n  | link | string **with Type: Image** with empty _URL template_ | **Url** |\n\n4. Create your visualisation\n\n### News web app\n\nBrowse through the news with our web application.\n\n![News web app](./illustrations/news_web_app.png)\n\n1. Make sure you've accepted the self-signed certificate of Elasticsearch at [`https://localhost:9200`](https://localhost:9200)\n\n2. Start the app\n\n    ```bash\n    docker-compose -f news_app/docker-compose.yml up --build -d\n    ```\n\n3. 
Discover the app at [`localhost:8080`](http://localhost:8080)\n\n---\n\n\u003cdetails\u003e\n\u003csummary\u003eTODOs\u003c/summary\u003e\n\nPossible improvements :\n\n- [ ] [Using Dask for parallelizing](https://docs.prefect.io/core/idioms/parallel.html) process of CSV lines by batch of 1000\n- [ ] Removing indices only when source process is successful (adding new index, then remove old index)\n- [ ] Removing indices only when crawling is successful (adding new index, then remove old index)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eUseful commands\u003c/summary\u003e\n\nTo stop everything :\n\n```bash\ndocker-compose down\ndocker-compose -f agent/docker-compose.yml down\ndocker-compose -f insert.docker-compose.yml down\ndocker-compose -f crawl.docker-compose.yml down\n```\n\nTo start each service, step by step :\n\n```bash\ndocker-compose up -d es01 es02 es03 kibana\ndocker-compose up -d minio\ndocker-compose up -d prefect_postgres prefect_hasura prefect_graphql prefect_towel prefect_apollo prefect_ui\ndocker-compose -f agent/docker-compose.yml up -d --build --scale agent=3 agent\n```\n\n\u003c/details\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fflavienbwk%2Fpandemic-knowledge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fflavienbwk%2Fpandemic-knowledge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fflavienbwk%2Fpandemic-knowledge/lists"}