{"id":13568462,"url":"https://github.com/catalyst-team/classification","last_synced_at":"2025-04-04T04:31:03.309Z","repository":{"id":52414849,"uuid":"174105047","full_name":"catalyst-team/classification","owner":"catalyst-team","description":"Catalyst.Classification","archived":true,"fork":false,"pushed_at":"2021-09-13T06:01:38.000Z","size":1146,"stargazers_count":36,"open_issues_count":5,"forks_count":9,"subscribers_count":8,"default_branch":"master","last_synced_at":"2024-10-29T18:41:51.817Z","etag":null,"topics":["augmentation","catalyst","classification","classification-pipeline","deep-learning","docker","focal-loss","image-classification","image-processing","machine-learning","pipeline","python","pytorch","reproducibility"],"latest_commit_sha":null,"homepage":"https://github.com/catalyst-team/catalyst","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/catalyst-team.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null},"funding":{"github":null,"patreon":"catalyst_team","open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"custom":null}},"created_at":"2019-03-06T08:36:39.000Z","updated_at":"2024-01-04T16:31:27.000Z","dependencies_parsed_at":"2022-09-06T04:50:47.436Z","dependency_job_id":null,"html_url":"https://github.com/catalyst-team/classification","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/catalyst-team%2Fclassification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cata
lyst-team%2Fclassification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/catalyst-team%2Fclassification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/catalyst-team%2Fclassification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/catalyst-team","download_url":"https://codeload.github.com/catalyst-team/classification/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247123072,"owners_count":20887259,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["augmentation","catalyst","classification","classification-pipeline","deep-learning","docker","focal-loss","image-classification","image-processing","machine-learning","pipeline","python","pytorch","reproducibility"],"created_at":"2024-08-01T14:00:26.260Z","updated_at":"2025-04-04T04:31:03.280Z","avatar_url":"https://github.com/catalyst-team.png","language":"Shell","funding_links":["https://patreon.com/catalyst_team"],"categories":["Tutorials and Pipelines"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n[![Catalyst logo](https://raw.githubusercontent.com/catalyst-team/catalyst-pics/master/pics/catalyst_logo.png)](https://github.com/catalyst-team/catalyst)\n\n**Accelerated DL \u0026 RL!**\n\n[![Build 
Status](http://66.248.205.49:8111/app/rest/builds/buildType:id:Catalyst_Deploy/statusIcon.svg)](http://66.248.205.49:8111/project.html?projectId=Catalyst\u0026tab=projectOverview\u0026guest=1)\n[![CodeFactor](https://www.codefactor.io/repository/github/catalyst-team/catalyst/badge)](https://www.codefactor.io/repository/github/catalyst-team/catalyst)\n[![PyPI version](https://img.shields.io/pypi/v/catalyst.svg)](https://pypi.org/project/catalyst/)\n[![Docs](https://img.shields.io/badge/dynamic/json.svg?label=docs\u0026url=https%3A%2F%2Fpypi.org%2Fpypi%2Fcatalyst%2Fjson\u0026query=%24.info.version\u0026colorB=brightgreen\u0026prefix=v)](https://catalyst-team.github.io/catalyst/index.html)\n[![PyPI Status](https://pepy.tech/badge/catalyst)](https://pepy.tech/project/catalyst)\n\n[![Twitter](https://img.shields.io/badge/news-twitter-499feb)](https://twitter.com/CatalystTeam)\n[![Telegram](https://img.shields.io/badge/channel-telegram-blue)](https://t.me/catalyst_team)\n[![Slack](https://img.shields.io/badge/Catalyst-slack-success)](https://join.slack.com/t/catalyst-team-devs/shared_invite/zt-d9miirnn-z86oKDzFMKlMG4fgFdZafw)\n[![Github contributors](https://img.shields.io/github/contributors/catalyst-team/catalyst.svg?logo=github\u0026logoColor=white)](https://github.com/catalyst-team/catalyst/graphs/contributors)\n\n\n\u003c/div\u003e\n\nPyTorch framework for Deep Learning research and development.\nIt was developed with a focus on reproducibility,\nfast experimentation and code/ideas reuse.\nIt lets you research and develop something new,\nrather than write another regular train loop. \u003cbr/\u003e\nBreak the cycle - use the Catalyst!\n\nProject [manifest](https://github.com/catalyst-team/catalyst/blob/master/MANIFEST.md). Part of [PyTorch Ecosystem](https://pytorch.org/ecosystem/). 
Part of [Catalyst Ecosystem](https://docs.google.com/presentation/d/1D-yhVOg6OXzjo9K_-IS5vSHLPIUxp1PEkFGnpRcNCNU/edit?usp=sharing):\n- [Alchemy](https://github.com/catalyst-team/alchemy) - Experiments logging \u0026 visualization\n- [Catalyst](https://github.com/catalyst-team/catalyst) - Accelerated Deep Learning Research and Development\n- [Reaction](https://github.com/catalyst-team/reaction) - Convenient Deep Learning models serving\n\n[Catalyst at AI Landscape](https://landscape.lfai.foundation/selected=catalyst).\n\n---\n\n# Catalyst.Classification [![Build Status](http://66.248.205.49:8111/app/rest/builds/buildType:id:Classification_Tests/statusIcon.svg)](http://66.248.205.49:8111/project.html?projectId=Classification\u0026tab=projectOverview\u0026guest=1) [![Github contributors](https://img.shields.io/github/contributors/catalyst-team/classification.svg?logo=github\u0026logoColor=white)](https://github.com/catalyst-team/classification/graphs/contributors)\n\n\u003e *Note: this repo uses the advanced Catalyst Config API and could be a bit out of date right now. \n\u003e Please use [Catalyst's minimal examples section](https://github.com/catalyst-team/catalyst#minimal-examples) as a starting point and for up-to-date use cases.*\n\nYou will learn how to build an image classification pipeline with transfer learning using the Catalyst framework to get reproducible results.\n\n## Goals\n1. Install requirements\n2. Prepare data\n3. **Run: raw data → production-ready model**\n4. **Get results**\n5. Customize own pipeline\n\n## 1. Install requirements\n\n### Using local environment:\n\n```bash\npip install -r requirements/requirements.txt\n```\n\n### Using docker:\n\nThis builds a Docker image, `catalyst-classification`, with the necessary libraries:\n```bash\nmake docker-build\n```\n\n## 2. 
Get Dataset\n\n### Try on open datasets\n\n\u003cdetails\u003e\n\u003csummary\u003eYou can use one of the open datasets \u003c/summary\u003e\n\u003cp\u003e\n\n```bash\nexport DATASET=\"artworks\"\n\nrm -rf data/\nmkdir -p data\n\nif [[ \"$DATASET\" == \"ants_bees\" ]]; then\n    # https://www.kaggle.com/ajayrana/hymenoptera-data\n    download-gdrive 1czneYKcE2sT8dAMHz3FL12hOU7m1ZkE7 ants_bees_cleared_190806.tar.gz\n    tar -xf ants_bees_cleared_190806.tar.gz \u0026\u003e/dev/null\n    mv ants_bees_cleared_190806 ./data/origin\nelif [[ \"$DATASET\" == \"flowers\" ]]; then\n    # https://www.kaggle.com/alxmamaev/flowers-recognition\n    download-gdrive 1rvZGAkdLlbR_MEd4aDvXW11KnLaVRGFM flowers.tar.gz\n    tar -xf flowers.tar.gz \u0026\u003e/dev/null\n    mv flowers ./data/origin\nelif [[ \"$DATASET\" == \"artworks\" ]]; then\n    # https://www.kaggle.com/ikarus777/best-artworks-of-all-time\n    download-gdrive 1eAk36MEMjKPKL5j9VWLvNTVKk4ube9Ml artworks.tar.gz\n    tar -xf artworks.tar.gz \u0026\u003e/dev/null\n    mv artworks ./data/origin\nfi\n\n```\n\n\u003c/p\u003e\n\u003c/details\u003e\n\n\n### Use your own dataset\n\n\n\u003cdetails\u003e\n\u003csummary\u003ePrepare your dataset\u003c/summary\u003e\n\u003cp\u003e\n\n#### Data structure\nMake sure that the final data folder has the required structure:\n```bash\n/path/to/your_dataset/\n        class_name_1/\n            images\n        class_name_2/\n            images\n        ...\n        class_name_100500/\n            ...\n```\n\n#### Data location\n\n* The easiest way is to move your data:\n    ```bash\n    mv /path/to/your_dataset/* /catalyst.classification/data/origin\n    ```\n    That way you can run the pipeline with default settings.\n\n* If you prefer to leave the data in `/path/to/your_dataset/`:\n    * In a local environment:\n        * Link the directory\n            ```bash\n            ln -s /path/to/your_dataset $(pwd)/data/origin\n            ```\n         * Or just set the path to your dataset 
`DATADIR=/path/to/your_dataset` when you start the pipeline.\n\n    * Using docker\n\n        You need to set:\n        ```bash\n           # instead of the default -v $(pwd)/data/origin:/data\n           -v /path/to/your_dataset:/data \\\n         ```\n        in the script below to start the pipeline.\n\u003c/p\u003e\n\u003c/details\u003e\n\n## 3. Classification pipeline\n### Fast\u0026Furious: raw data → production-ready model\n\nThe pipeline will automatically guide you from raw data to the production-ready model.\n\nWe will initialize a ResNet-18 model from a pre-trained network. In the current pipeline the model is trained sequentially in two stages; in the first stage, several heads are trained simultaneously.\n\n#### Run in local environment:\n\n```bash\n# --max-image-size: 224 or 448 works well\n# --balance-strategy: images in epoch per class, 1024 works well\nCUDA_VISIBLE_DEVICES=0 \\\nCUDNN_BENCHMARK=\"True\" \\\nCUDNN_DETERMINISTIC=\"True\" \\\nbash ./bin/catalyst-classification-pipeline.sh \\\n  --workdir ./logs \\\n  --datadir ./data/origin \\\n  --max-image-size 224 \\\n  --balance-strategy 256 \\\n  --config-template ./configs/templates/main.yml \\\n  --num-workers 4 \\\n  --batch-size 256 \\\n  --criterion CrossEntropyLoss  # one of CrossEntropyLoss, BCEWithLogits, FocalLossMultiClass\n```\n\n#### Run in docker:\n\n```bash\n# --max-image-size: 224 or 448 works well\n# --balance-strategy: images in epoch per class, 1024 works well\ndocker run -it --rm --shm-size 8G --runtime=nvidia \\\n  -v $(pwd):/workspace/ \\\n  -v $(pwd)/logs:/logdir/ \\\n  -v $(pwd)/data/origin:/data \\\n  -e \"CUDA_VISIBLE_DEVICES=0\" \\\n  -e \"CUDNN_BENCHMARK='True'\" \\\n  -e \"CUDNN_DETERMINISTIC='True'\" \\\n  catalyst-classification ./bin/catalyst-classification-pipeline.sh \\\n    --workdir /logdir \\\n    --datadir /data \\\n    --max-image-size 224 \\\n    --balance-strategy 256 \\\n    --config-template ./configs/templates/main.yml \\\n    --num-workers 4 \\\n    --batch-size 256 \\\n    --criterion CrossEntropyLoss  # one of CrossEntropyLoss, BCEWithLogits, 
FocalLossMultiClass\n```\nThe pipeline is now running and you don’t have to do anything else; just wait for the best model!\n\n#### Visualizations\n\nYou can use a [W\u0026B](https://www.wandb.com/) account for visualisation right after running `pip install wandb`:\n\n```\nwandb: (1) Create a W\u0026B account\nwandb: (2) Use an existing W\u0026B account\nwandb: (3) Don't visualize my results\n```\n\u003cimg src=\"/pics/wandb_metrics.png\" title=\"w\u0026b classification metrics\"  align=\"left\"\u003e\n\nTensorBoard can also be used for visualisation:\n\n```bash\ntensorboard --logdir=/catalyst.classification/logs\n```\n\u003cimg src=\"/pics/tf_metrics.png\" title=\"tf classification metrics\"  align=\"left\"\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eConfusion matrix\u003c/summary\u003e\n\u003cp\u003e\n\u003cimg src=\"/pics/cm.png\" title=\"tf classification metrics\" width=\"700\"\u003e\n\u003c/p\u003e\n\u003c/details\u003e\n\n## 4. Results\nThe results of all experiments can be found locally in `WORKDIR`, by default `catalyst.classification/logs`. The results of a single experiment, for instance `catalyst.classification/logs/logdir-191010-141450-c30c8b84`, contain:\n\n#### checkpoints\n*  The directory contains all checkpoints: best, last, and those of every stage.\n* `best.pth` and `last.pth` can also be found in the corresponding experiment in your W\u0026B account.\n\n#### configs\n*  The directory contains the experiment's configs for reproducibility.\n\n#### logs\n* The directory contains all logs of the experiment.\n* Metrics and logs can also be viewed in the corresponding experiment in your W\u0026B account.\n\n#### code\n*  The directory contains the code on which the calculations were performed. This is necessary for complete reproducibility.\n\n## 5. 
Customize own pipeline\n\nFor your future experiments, the framework provides powerful configs that let you tune the configuration of the whole classification pipeline in a controlled and reproducible way.\n\n\u003cdetails\u003e\n\u003csummary\u003eConfigure your experiments\u003c/summary\u003e\n\u003cp\u003e\n\n* Common settings for the training stages and model parameters can be found in `catalyst.classification/configs/_common.yml`.\n    * `model_params`: detailed configuration of models, including:\n        * model, for instance `MultiHeadNet`\n        * detailed architecture description\n        * whether to use a pretrained model\n    * `stages`: you can configure training or inference in several stages with different hyperparameters. In our example:\n        * optimizer params\n        * first learn the head(s), then train the whole network\n\n* The `CONFIG_TEMPLATE` with the experiment's other hyperparameters, such as `data_params`, is here: `catalyst.classification/configs/templates/main.yml`.  The config allows you to define:\n    * `data_params`: path, batch size, number of workers and so on\n    * `callbacks_params`: Callbacks are used to execute code during training, for example, to get metrics or save checkpoints. Catalyst provides a wide variety of helpful callbacks, and you can also use custom ones.\n\n\nYou can find many more options for configuring experiments in the [Catalyst documentation](https://catalyst-team.github.io/catalyst/).\n\n\u003c/p\u003e\n\u003c/details\u003e\n\n## 6. Autolabel\n\n#### Goals\n\nThe classical way to reduce the amount of unlabeled data with a trained model is to run the unlabeled dataset through the model and automatically label the images whose prediction confidence is above a threshold. 
The automatically labeled data is then added to the training process to improve prediction accuracy.\n\nTo run the iterative process, we need to specify the number of iterations, `n-trials`, and the confidence `threshold` for labeling an image.\n\n- tune ResNetEncoder\n- train MultiHeadNet for image classification\n- predict the unlabeled dataset\n- use the most confident predictions as true labels\n- repeat\n\n\n#### Preparation\n\n```bash\ncatalyst.classification/data/\n    raw/\n        all/\n            ...\n    clean/\n        0/\n            ...\n        1/\n            ...\n```\n\n#### Model training\n\n##### Run in local environment:\n\n```bash\nCUDA_VISIBLE_DEVICES=0 \\\nCUDNN_BENCHMARK=\"True\" \\\nCUDNN_DETERMINISTIC=\"True\" \\\nbash ./bin/catalyst-autolabel-pipeline.sh \\\n  --workdir ./logs \\\n  --datadir-clean ./data/clean \\\n  --datadir-raw ./data/raw \\\n  --n-trials 10 \\\n  --threshold 0.8 \\\n  --config-template ./configs/templates/autolabel.yml \\\n  --max-image-size 224 \\\n  --num-workers 4 \\\n  --batch-size 256\n```\n\n##### Run in docker:\n\n```bash\ndocker run -it --rm --shm-size 8G --runtime=nvidia \\\n  -v $(pwd):/workspace/ \\\n  -e \"CUDA_VISIBLE_DEVICES=0\" \\\n  -e CUDNN_BENCHMARK=\"True\" \\\n  -e CUDNN_DETERMINISTIC=\"True\" \\\n  catalyst-classification bash ./bin/catalyst-autolabel-pipeline.sh \\\n    --workdir ./logs \\\n    --datadir-clean ./data/clean \\\n    --datadir-raw ./data/raw \\\n    --n-trials 10 \\\n    --threshold 0.8 \\\n    --config-template ./configs/templates/autolabel.yml \\\n    --max-image-size 224 \\\n    --num-workers 4 \\\n    --batch-size 256\n```\n\n#### Results of autolabeling\nOut:\n```\nPredicted: 23 (100.00%)\n...\nPseudo Lgabeling done. 
Nothing more to label.\n```\nLogs for visualising the training runs can be found here: `./logs/autolabel`\n\nThe labeled raw data can be found here: `/data/data_clean/dataset.csv`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcatalyst-team%2Fclassification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcatalyst-team%2Fclassification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcatalyst-team%2Fclassification/lists"}