{"id":21070675,"url":"https://github.com/pyronear/pyro-dataset","last_synced_at":"2025-07-24T16:40:52.483Z","repository":{"id":235153609,"uuid":"459154263","full_name":"pyronear/pyro-dataset","owner":"pyronear","description":null,"archived":false,"fork":false,"pushed_at":"2022-12-07T23:18:22.000Z","size":141,"stargazers_count":3,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-08T14:21:29.354Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pyronear.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-02-14T12:32:00.000Z","updated_at":"2025-01-23T10:23:36.000Z","dependencies_parsed_at":"2024-04-22T12:56:17.895Z","dependency_job_id":"fc6f6d79-02af-48b0-93dc-147652784383","html_url":"https://github.com/pyronear/pyro-dataset","commit_stats":null,"previous_names":["pyronear/pyro-dataset"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pyronear%2Fpyro-dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pyronear%2Fpyro-dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pyronear%2Fpyro-dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pyronear%2Fpyro-dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pyronear","download_url":"https://codeload.github.com/pyronear/py
ro-dataset/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243278946,"owners_count":20265664,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-19T18:47:47.597Z","updated_at":"2025-07-24T16:40:52.465Z","avatar_url":"https://github.com/pyronear.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pyro Dataset\n\nThis repository contains all the code and data necessary to build the wildfire\ndataset. This dataset is then used to train our ML models.\n\n## Setup\n\n### 🐍 Python dependencies\n\nInstall `uv` with `pipx`:\n\n```sh\npipx install uv\n```\n\nCreate a virtualenv and install the dependencies with `uv`:\n\n```sh\nuv sync\n```\n\nActivate the `uv` virtualenv:\n\n```sh\nsource .venv/bin/activate\n```\n\n### 🍜 Data dependencies\n\nGet the wildfire datasets with `dvc`:\n\n```sh\ndvc get . data/processed\n```\n\nPull all the data with `dvc`:\n\n```sh\ndvc pull\n```\n\n__Note__: One needs to configure their dvc remote and get access to our remote\ndata storage. 
Please ask somebody from the team to give you access.\n\nRun the pipeline to build the dataset:\n\n```sh\ndvc repro\n```\n\n## Data Pipeline\n\nThe whole repository is organized as a data pipeline that can be run to\ngenerate the different datasets.\n\nThe data pipeline is defined in the [dvc.yaml](./dvc.yaml) file.\n\n### DVC stages\n\nThis section lists and describes all the DVC stages that are defined in the\n[dvc.yaml](./dvc.yaml) file:\n\n#### ⛱️ Data Preparation\n\n- __data_pyro_sdis_testset__: Turn the parquet files of the\n__pyro-sdis-testset__ dataset into a regular ultralytics folder structure.\n\n#### 🧠 Model Inference\n\n- __predictions_wise_wolf_pyro_sdis_val__: Run inference on all images from the\npyro-sdis val split with the `wise_wolf` model.\n- __predictions_legendary_field_pyro_sdis_val__: Run inference on all images\nfrom the pyro-sdis val split with the `legendary_field` model.\n- __predictions_wise_wolf_FP_2024__: Run inference on all images from the\nFP_2024 dataset with the `wise_wolf` model.\n- __crops_wise_wolf_pyro_sdis_val__: Generate crops from the predictions of the\n`wise_wolf` model on the pyro-sdis val split.\n- __crops_wise_wolf_FP_2024__: Generate crops from the predictions of the\n`wise_wolf` model on the FP_2024 dataset.\n\n#### 🚭 Filtering\n\n- __filter_data_pyrosdis_smoke__: Keep only the fire smokes from the\n`pyro-sdis` dataset - remove the background images.\n- __filter_data_figlib_smoke__: Keep only the fire smokes from the\n`FIGLIB_ANNOTATED_RESIZED` dataset - remove the background images.\n- __filter_data_pyronear_ds_smoke__: Keep only the fire smokes from the\n`pyronear-ds-03-2024` dataset - remove the background images.\n- __filter_data_false_positives_FP_2024__: Keep only the false positives that\nthe `wise_wolf` model has made on the `FP_2024` dataset.\n\n#### 🍞 Data Splitting\n\n- __split_data_figlib__: Split the `FIGLIB_ANNOTATED_RESIZED` dataset into\ntrain/val/test sets.\n- __split_data_false_positives_FP_2024__: 
Split the false positives dataset into\ntrain/val/test sets.\n- __merge_smoke_datasets__: Merge the different data sources of fire smokes and\nsplit into the train/val/test sets.\n\n#### 🧬 Dataset Creation\n\n##### Wildfire Dataset\n\n- __make_train_val_wildfire_dataset__: Make the train/val `wildfire` dataset\nusing the previous stages.\n- __make_test_wildfire_dataset__: Make the test `wildfire` dataset using the\nprevious stages.\n\n##### Temporal Dataset\n\n- __make_temporal_train_val_dataset__: Make the train/val temporal `wildfire`\ndataset using the previous stages.\n- __make_temporal_test_dataset__: Make the test temporal `wildfire` dataset\nusing the previous stages.\n\n#### 🔎 Dataset Analysis\n\n- __analyze_wildfire_dataset__: Run some analyses on the generated dataset to\ncheck for data leakage, data distribution, and background images. Some\ninteractive plots are also generated and exported.\n\n## Data\n\n### Raw\n\nThe datasets below are the foundation of our data pipeline and are the source\nof truth.\n\n- __FIGLIB_ANNOTATED_RESIZED__: re-annotated dataset from the [Fire Ignition\nimages Library](https://www.hpwren.ucsd.edu/FIgLib/).\n- __DS_fp__: All the collected false positives of the Pyronear System before 2024.\n- __FP_2024__: All the collected false positives of the Pyronear System in 2024.\n- [__pyro-sdis__](https://huggingface.co/datasets/pyronear/pyro-sdis):\nPyro-SDIS is a dataset designed for wildfire smoke detection using AI models.\nIt is developed in collaboration with the Fire and Rescue Services (SDIS) in\nFrance and the dedicated volunteers of the Pyronear association. It contains\nonly fires detected by the Pyronear System.\n- __pyronear-ds-03-2024__: Dataset of fire smokes as a mix of different public\ndatasets and synthetic images. 
It also includes temporal sequences of fire\nevents.\n- [__pyro-sdis-testset__](https://huggingface.co/datasets/pyronear/pyro-sdis-testset):\nPrivate dataset used for evaluating the final performances of the ML models.\n- __Test_dataset_2025__: built from Test_DS by adding extra false positives.\n- __Test_DS__: The initial and curated test dataset.\n\n### Interim\n\nAll the folders located in `./data/interim/` are intermediary results needed to\nbuild up the final datasets. They are versioned with DVC.\n\nMany artifacts and datasets can be found here: from cropped image areas, to\nfiltered datasets to focus on false positives for instance.\n\n- __false_positives__: curated and annotated dataset containing false positives\nfrom the pyronear systems.\n\n### Processed\n\nThe final datasets are located in `./data/processed/`:\n\n- 🔥 __wildfire__: the train/val dataset used to train our ML models. It\nfollows the ultralytics format.\n- 🔥 __wildfire_test__: the test dataset used to evaluate the performance of\nour ML models.\n- ⏰ __wildfire_temporal__: the train/val dataset used to train our temporal ML\nmodels.\n- ⏰ __wildfire_temporal_test__: the test dataset used to evaluate the performance of\nour temporal ML models and the pyronear engine.\n\n### Reporting\n\nOnce the datasets are generated and stored in the `./data/processed/`\ndirectory, various reports are created to visualize the data. 
These reports\nbreak down the datasets across different dimensions, allowing for a quick\nassessment of whether the various data splits are logical and meaningful.\n\nThese reports live under `./data/reporting/`.\n\n## Scripts\n\nScripts are located in the `./scripts` folder.\n\nMost scripts are connected via the [dvc.yaml](./dvc.yaml) configuration file.\nOthers are utility scripts that can be used to perform various tasks.\n\n### [fetch_platform_sequence_id.py](./scripts/fetch_platform_sequence_id.py)\n\nFetch a detection sequence by its sequence-id directly from the Pyronear\nplatform API.\n\n```bash\nexport PLATFORM_API_ENDPOINT=\"https://alertapi.pyronear.org\"\nexport PLATFORM_LOGIN=sdis-07\nexport PLATFORM_PASSWORD=XXX\nexport PLATFORM_ADMIN_LOGIN=XXX\nexport PLATFORM_ADMIN_PASSWORD=XXX\n\nuv run python ./scripts/platform_train_loop/fetch_platform_sequence_id.py \\\n  --save-dir ./data/raw/pyronear-platform/sequences/my-sequence-5347/ \\\n  --sequence-id 5347\n```\n\n__Note__: Make sure to use an admin login/password as well as a regular\nlogin/password. The admin level access is needed to fetch information about the\norganizations and properly name the detection images locally.\n\n### [fetch_platform_sequences.py](./scripts/fetch_platform_sequences.py)\n\nFetch detection sequences directly from the Pyronear platform API.\n\nFetch all the detection sequences for `sdis-07` and save them in the specified\ndirectory:\n\n```bash\nexport PLATFORM_API_ENDPOINT=\"https://alertapi.pyronear.org\"\nexport PLATFORM_LOGIN=sdis-07\nexport PLATFORM_PASSWORD=XXX\nexport PLATFORM_ADMIN_LOGIN=XXX\nexport PLATFORM_ADMIN_PASSWORD=XXX\n\nuv run python ./scripts/platform_train_loop/fetch_platform_sequences.py \\\n  --save-dir ./data/raw/pyronear-platform/sequences/sdis-07/ \\\n  --date-from 2025-05-01 \\\n  --date-end 2025-06-01\n```\n\n__Note__: Make sure to use an admin login/password as well as a regular\nlogin/password. 
The admin level access is needed to fetch information about the\norganizations and properly name the detection images locally.\n\n\n## 🧠 Models\n\n- 🌈 __legendary_field__: yolov8s object detection model, first performant model\ntrained in 2019 to detect fire smoke.\n- 🐺 __wise_wolf__: yolov11s object detection model, trained on `2024-04-26` using\na larger dataset.\n\n## 🌎 Release the datasets\n\nThe script to release a new version of the datasets is located in\n`./scripts/release.py`.\nMake sure to set your `GITHUB_ACCESS_TOKEN` as an env variable in your shell\nbefore running the following script:\n\n```sh\nexport GITHUB_ACCESS_TOKEN=XXX\nuv run python ./scripts/release.py \\\n  --version v1.3.5 \\\n  --github-owner earthtoolsmaker \\\n  --github-repo pyro-dataset\n```\n\nThis will create a new release in the github repository and upload an\narchive of the datasets to a private S3 repository. The link to the dataset is\ndisplayed in the release summary.\n\n## Run the tests\n\n```bash\nuv run pytest tests/\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpyronear%2Fpyro-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpyronear%2Fpyro-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpyronear%2Fpyro-dataset/lists"}