{"id":15175870,"url":"https://github.com/akrielz/vision_models_playground","last_synced_at":"2025-10-26T11:31:35.182Z","repository":{"id":41824684,"uuid":"507501903","full_name":"Akrielz/vision_models_playground","owner":"Akrielz","description":"Playground for testing and implementing various Vision Models","archived":false,"fork":false,"pushed_at":"2024-05-30T12:51:35.000Z","size":190612,"stargazers_count":13,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-10-30T04:49:31.605Z","etag":null,"topics":["artificial-intelligence","attention-mechanism","computer-vision","deep-learning","pytorch","transformer"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/vision-models-playground/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Akrielz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-26T07:03:47.000Z","updated_at":"2024-05-30T12:51:39.000Z","dependencies_parsed_at":"2024-11-15T12:56:00.189Z","dependency_job_id":"1ba38b93-07d2-4148-aac5-b6268ec45df7","html_url":"https://github.com/Akrielz/vision_models_playground","commit_stats":{"total_commits":88,"total_committers":2,"mean_commits":44.0,"dds":"0.011363636363636354","last_synced_commit":"767aba98776ea98a64b3fb8fef5ed69967c541cb"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Akrielz%2Fvision_models_playground","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/Akrielz%2Fvision_models_playground/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Akrielz%2Fvision_models_playground/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Akrielz%2Fvision_models_playground/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Akrielz","download_url":"https://codeload.github.com/Akrielz/vision_models_playground/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238319459,"owners_count":19452340,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","attention-mechanism","computer-vision","deep-learning","pytorch","transformer"],"created_at":"2024-09-27T12:43:22.924Z","updated_at":"2025-10-26T11:31:32.923Z","avatar_url":"https://github.com/Akrielz.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Vision Models - Playground\n\n## Table of Contents\n\n- [Description](#description)\n- [Install](#install)\n\n## Models\n- [YoloV1](#yolov1)\n- [ResNet](#resnet)\n- [Vision Transformer](#vision-transformer-vit)\n- [Generative Adversarial Networks](#generative-adversarial-networks-gan)\n- [Perceiver](#perceiver)\n- [Vision Perceiver](#vision-perceiver-vip)\n- [Convolutional Vision Transformer](#convolutional-vision-transformer-cvt)\n- [UNet](#unet)\n\n## Description\n\nThis playground is a collection of vision models implemented by me from scratch\nin PyTorch, with the purpose of getting a better understanding of the 
specific\npapers and techniques used.\n\n## Install\n\nTo install this package, run the following command:\n\n```bash\n$ pip install vision-models-playground\n```\n\n\n## YoloV1\n\nA detector based on the [YoloV1](https://arxiv.org/abs/1506.02640) architecture. Also known as Darknet.\n\n### Usage\n\nModels can be initialized with pre-built or custom versions.\n\nCode example to initialize and use the pre-built YoloV1\n\n```python\nimport torch\nfrom vision_models_playground.models.segmentation import build_yolo_v1\n\nmodel = build_yolo_v1(num_classes=20, in_channels=3, grid_size=7, num_bounding_boxes=2)\n\nimg = torch.randn(1, 3, 448, 448)  # \u003cbatch_size, in_channels, height, width\u003e\npreds = model(img)  # (1, 7, 7, 30) \u003cbatch_size, grid_size, grid_size, num_classes + 5 * num_bounding_boxes\u003e\n```\n\nOr if you want to use the custom YoloV1\n\n```python\nimport torch\nfrom vision_models_playground.models.segmentation import YoloV1\n\ndims = [[64], [192], [128, 256, 256, 512], [256, 512, 256, 512, 256, 512, 256, 512, 512, 1024], [512, 1024, 512, 1024]]\nkernel_size = [[7], [3], [1, 3, 1, 3], [1, 3, 1, 3, 1, 3, 1, 3, 1, 3], [1, 3, 1, 3]]\nstride = [[2], [1], [1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1]]\nmax_pools = [True, True, True, True, False]\n\nmodel = YoloV1(\n    dims=dims,\n    kernel_size=kernel_size,\n    stride=stride,\n    max_pools=max_pools,\n\n    in_channels=3,\n    num_classes=20,\n    num_bounding_boxes=2,\n    grid_size=7,\n\n    mlp_size=1024,\n)\n\nimg = torch.randn(1, 3, 448, 448)  # \u003cbatch_size, in_channels, height, width\u003e\npreds = model(img)  # (1, 7, 7, 30) \u003cbatch_size, grid_size, grid_size, num_classes + 5 * num_bounding_boxes\u003e\n```\n\n### Parameters\n\n- `dims`: List[List[int]].  \nThe number of channels in each layer.\n\n\n- `kernel_size`: List[List[int]].  \nThe kernel size in each layer.\n\n\n- `stride`: List[List[int]].  \nThe stride in each layer.\n\n\n- `max_pools`: List[bool].  
\nIf enabled, a max pool layer is applied after the convolutional layer.\n\n\n- `in_channels`: int.  \nThe number of channels in the input image.\n\n\n- `num_classes`: int.  \nThe number of classes to classify.\n\n\n- `num_bounding_boxes`: int.  \nThe number of bounding boxes to predict per grid cell.\n\n\n- `grid_size`: int.  \nThe size of the grid that the image is split into.\n\n\n- `mlp_size`: int.  \nThe size of the MLP layer that is applied after the convolutional layers.\n\n\n### PreTrained Weights\n\nTo use the pretrained weights, use the following code:\n\n```py\nimport torch\nfrom vision_models_playground.utility.hub import load_vmp_model_from_hub\n\n\nmodel = load_vmp_model_from_hub(\"Akriel/ResNetYoloV1\")\n\nimg = torch.randn(1, 3, 448, 448)  # \u003cbatch_size, in_channels, height, width\u003e\npreds = model(img)  # (1, 7, 7, 30) \u003cbatch_size, grid_size, grid_size, num_classes + 5 * num_bounding_boxes\u003e\n```\n\nTo use the pretrained model within a pipeline on raw images, use the following code:\n\n```py\nfrom PIL import Image\n\nfrom vision_models_playground.utility.hub import load_vmp_pipeline_from_hub\n\npipeline = load_vmp_pipeline_from_hub(\"Akriel/ResNetYoloV1\")\n\n# Load image\nx = Image.open(\"path/to/image.jpg\")\ny = pipeline(x)  # A dictionary containing the predictions\n```\n\nFor more information about the pretrained models, check the [hub](https://huggingface.co/Akriel/ResNetYoloV1) page.  
\nOr check the demo results from `explore_models/yolo_trained_model.ipynb`\n\n## ResNet\n\nA classifier based on the [ResNet](https://arxiv.org/abs/1512.03385) architecture.\n\n### Usage\n\nModels can be initialized with pre-built or custom versions.\n\nPre-built models:\n\n- ResNet18\n- ResNet34\n- ResNet50\n- ResNet101\n- ResNet152\n\nCode example to initialize and use the pre-built ResNet34\n\n```python\nimport torch\nfrom vision_models_playground.models.classifiers import build_resnet_34\n\nmodel = build_resnet_34(num_classes=10, in_channels=3)\n\nimg = torch.randn(1, 3, 256, 256)  # \u003cbatch_size, in_channels, height, width\u003e\npreds = model(img)  # (1, 10) \u003cbatch_size, num_classes\u003e\n```\n\nCode example to initialize ResNet34 using the custom ResNet\n\n```python\nimport torch\nfrom vision_models_playground.models.classifiers import ResNet\nfrom vision_models_playground.components.convolutions import ResidualBlock\n\nmodel = ResNet(\n    in_channels=3,\n    num_classes=10,\n    num_layers=[3, 4, 6, 3],\n    num_channels=[64, 128, 256, 512],\n    block=ResidualBlock\n)\n\nimg = torch.randn(1, 3, 256, 256)  # \u003cbatch_size, in_channels, height, width\u003e\npreds = model(img)  # (1, 10) \u003cbatch_size, num_classes\u003e\n```\n\n\n### Parameters\n\n- `in_channels`: int.  \nThe number of channels in the input image.\n\n\n- `num_classes`: int.  \nThe number of predicted classes.\n\n\n- `num_layers`: List[int]  \nThe number of block layers in each stage.\n\n\n- `num_channels`: List[int]  \nThe number of channels in each stage.  \nEach stage will start with a stride of 2, connecting the previous stage channels \nwith the current stage channels.\n\n\n- `block`: Union[ResidualBlock, BottleneckBlock]  \nThe block type used.  \nThere are two pre-implemented block types: ResidualBlock and BottleneckBlock.  
\nCan be replaced with any custom block that has the following params in the constructor:\n`in_channels`, `out_channels`, `stride`.\n\n## Vision Transformer (ViT)\n\nA classifier based on the [Vision Transformer](https://openreview.net/pdf?id=YicbFdNTTy) \narchitecture.\n\n### Usage\n\nCode example to initialize and use Vision Transformer\n\n```python\nimport torch\nfrom vision_models_playground.models.classifiers import VisionTransformer\n\nmodel = VisionTransformer(\n    image_size=256,\n    patch_size=32,\n    num_classes=1000,\n    projection_dim=1024,\n    depth=6,\n    heads=16,\n    mlp_dim=2048,\n    dropout=0.1,\n    emb_dropout=0.1,\n    apply_rotary_emb=True,\n)\n\nimg = torch.randn(1, 3, 256, 256)\npreds = model(img)  # (1, 1000)\n```\n\n### Parameters\n\n- `image_size`: int.  \nImage size. If you have rectangular images, make sure your image size is the maximum of the width and height.\n\n\n- `patch_size`: int.  \nSize of each square patch, in pixels. `image_size` must be divisible by `patch_size`.  \nThe number of patches is: ` n = (image_size // patch_size) ** 2` and `n` **must be greater than 16**.\n\n\n- `num_classes`: int.  \nNumber of classes to classify.\n\n\n- `projection_dim`: int.  \nLast dimension of output tensor after linear transformation `nn.Linear(..., dim)`.\n\n\n- `depth`: int.  \nNumber of Transformer blocks.\n\n\n- `heads`: int.  \nNumber of heads in Multi-head Attention layer. \n\n\n- `mlp_dim`: int.  \nDimension of the MLP (FeedForward) layer. \n\n\n- `channels`: int, default `3`.  \nNumber of image channels. \n\n\n- `dropout`: float between `[0, 1]`, default `0`.  \nDropout rate. \n\n\n- `emb_dropout`: float between `[0, 1]`, default `0`.   \nEmbedding dropout rate.\n\n\n- `dim_head`: int, default to `64`.  \nThe dim for each head for Multi-Head Attention.\n\n\n- `pool`: string, either `cls` or `mean`, default to `mean`  \nDetermines whether `cls` token pooling or mean pooling is applied.\n\n\n- `apply_rotary_emb`: bool, default `False`.  
\nIf enabled, applies rotary_embedding in Attention blocks.\n\n## Generative Adversarial Networks (GAN)\n\nA generative model based on the [GAN](https://arxiv.org/abs/1406.2661) architecture.\n\n### Usage\n\nSince the generated images must have a certain shape, the GAN model receives\nboth the Generator and the Discriminator as input.\n\nThe GAN takes care of the training process by computing the loss and\nupdating the weights of the Generator and Discriminator.\n\nHere is a code example that shows how to use the GAN interface to train on the\nMNIST dataset.\n\n```python\nimport torch\n\nfrom torchvision import datasets, transforms\n\nfrom vision_models_playground.models.generative import GAN\n\n# Import custom Generator and Discriminator suited to the problem\nfrom vision_models_playground.models.generative.adverserial.gan import Discriminator\nfrom vision_models_playground.models.generative.adverserial.gan import Generator\n\n# Create GAN\ngenerator = Generator()\ndiscriminator = Discriminator()\ngan = GAN(generator, discriminator)\n\n# Put model on cuda\ngan.cuda()\n\n# Create the data loader\ntrain_loader = torch.utils.data.DataLoader(\n    datasets.MNIST('./data', train=True, download=True, transform=transforms.Compose([\n        transforms.ToTensor(),\n    ])),\n    batch_size=64,\n    shuffle=True\n)\n\n# Train the GAN\ngan.train_epochs(train_loader, epochs=100, print_every=100)\n```\n\n### Parameters\n\n- `generator`: nn.Module  \nThe Generator model.  \nMust have self.noise_dim set to the dimension of the noise vector used by the \nGenerator in the forward step.\n\n\n- `discriminator`: nn.Module  \nThe Discriminator model.\nThe output of the Discriminator must have shape (batch_size, 1), holding the\nprobability of the image being real.\n\n### Results\n\nThis is a sample of the results of the GAN on MNIST.  
\n\n\u003cimg src=\"./readme_assets/fake_images.png\" width=\"500px\"\u003e\u003c/img\u003e\n\nFor reference, this is a sample of the original MNIST dataset.  \n\n\u003cimg src=\"./readme_assets/real_images.png\" width=\"500px\"\u003e\u003c/img\u003e\n\n### Known issues\n\nAt this moment, the GAN is coded to operate only on CUDA devices.\nIn the future, the code will be refactored to allow the use of CPU devices too.\n\n## Perceiver\n\nA classifier based on the [Perceiver](https://arxiv.org/abs/2103.03206) \narchitecture.\n\n### Usage\n\nCode example to initialize and use Perceiver\n\n```python\nimport torch\nfrom vision_models_playground.models.classifiers import Perceiver\n\nmodel = Perceiver(\n    input_dim=3,\n    input_axis=2,\n    final_classifier_head=True,\n    num_classes=1000,\n    apply_rotary_emb=True,\n    apply_fourier_encoding=True,\n    max_freq=10,\n    num_freq_bands=6,\n    constant_mapping=False,\n    max_position=1600,\n    num_layers=4,\n    num_latents=16,\n    latent_dim=32,\n    cross_num_heads=4,\n    cross_head_dim=32,\n    self_attend_heads=4,\n    self_attend_dim=32,\n    transformer_depth=2,\n    attention_dropout=0.,\n    ff_hidden_dim=64,\n    ff_dropout=0.,\n    activation=None,\n)\n\nimg = torch.randn(1, 256, 256, 3)\npreds = model(img)  # (1, 1000)\n```\n\n### Parameters\n\n- `input_dim`: int.  \nNumber of channels of the input.\n\n\n- `input_axis`: int.  \nNumber of axes of the input.  \nIf the input is a sequence, the input_axis is 1.  \nIf the input is an image, the input_axis is 2.  \nIf the input is a video, the input_axis is 3.  \n\n\n- `final_classifier_head`: bool.  \nIf enabled, the final classifier head is applied, and logits are returned.  \nIf disabled, the final classifier head is not applied, and latents are returned.\n\n\n- `num_classes`: int.  \nNumber of classes to classify.\n\n\n- `apply_rotary_emb`: bool.  \nIf enabled, applies rotary_embedding in Attention blocks.\n\n\n- `apply_fourier_encoding`: bool.  
\nIf enabled, applies fourier_encoding over the input\n\n\n- `max_freq`: int.  \nMaximum frequency to be used in fourier_encoding.\n\n\n- `num_freq_bands`: int.  \nNumber of frequency bands to be used in fourier_encoding.\n\n\n- `constant_mapping`: bool.\nIf enabled, uses a constant mapping for the axis of the fourier_encoding.\n\n\n- `max_position`: int.  \nMaximum position to be used in the positional fourier encoding.  \nWorks only if `constant_mapping` is enabled.\n\n\n- `num_layers`: int.  \nNumber of layers\n\n\n- `num_latents`: int.  \nNumber of latents\n\n\n- `latent_dim`: int.  \nDimension of the latent vector\n\n\n- `cross_num_heads`: int.  \nNumber of heads in the cross attention blocks\n\n\n- `cross_head_dim`: int.  \nDimension of the heads in the cross attention blocks\n\n\n- `self_attend_heads`: int.  \nNumber of heads in the self attention blocks\n\n\n- `self_attend_dim`: int.  \nDimension of the heads in the self attention blocks\n\n\n- `transformer_depth`: int.  \nNumber of layers in the transformer\n\n\n- `attention_dropout`: float.  \nDropout probability for the attention layers\n\n\n- `ff_hidden_dim`: int.  \nDimension of the hidden layers in the feed forward blocks\n\n\n- `ff_dropout`: float.  \nDropout probability for the feed forward layers\n\n\n- `activation`: Callable.  \nActivation function to be used in the feed forward blocks.  
\nIf left as None, GEGLU is used.\n\n\n## Vision Perceiver (ViP)\n\nA classifier based on the [Perceiver](https://arxiv.org/abs/2103.03206) architecture, \nbut adapted to work with the technique of the \n[Vision Transformer](https://openreview.net/pdf?id=YicbFdNTTy) by splitting the\nimage into patches, and projecting them into a sequence.\n\n### Usage\n\nCode example to initialize and use Vision Perceiver\n\n```python\nimport torch\nfrom vision_models_playground.models.classifiers import VisionPerceiver\n\nmodel = VisionPerceiver(\n    patch_size=4,\n    projection_dim=1024,\n    num_classes=1000,\n    apply_rotary_emb=True,\n    max_position=1600,\n    num_layers=2,\n    num_latents=16,\n    latent_dim=32,\n    cross_num_heads=8,\n    cross_head_dim=64,\n    self_attend_heads=8,\n    self_attend_dim=64,\n    transformer_depth=2,\n    attention_dropout=0.0,\n    ff_hidden_dim=512,\n    ff_dropout=0.0,\n    activation=None,\n)\n\nimg = torch.randn(1, 256, 256, 3)\npreds = model(img)  # (1, 1000)\n```\n\n### Parameters\n\n- `patch_size`: int.  \nSize of the patches the image is split into.\n\n\n- `projection_dim`: int.  \nDimension of the projection layer.\n\n\n- `num_classes`: int.  \nNumber of classes to classify.\n\n\n- `apply_rotary_emb`: bool.  \nIf enabled, applies rotary_embedding in Attention blocks.\n\n\n- `apply_fourier_encoding`: bool.  \nIf enabled, applies fourier_encoding over the input\n\n\n- `max_freq`: int.  \nMaximum frequency to be used in fourier_encoding.\n\n\n- `num_freq_bands`: int.  \nNumber of frequency bands to be used in fourier_encoding.\n\n\n- `constant_mapping`: bool, default `False`.  \nIf enabled, uses a constant mapping for the axis of the fourier_encoding.\n\n\n- `max_position`: int.  \nMaximum position to be used in the positional fourier encoding.  \nWorks only if `constant_mapping` is enabled.\n\n\n- `num_layers`: int.  \nNumber of layers\n\n\n- `num_latents`: int.  \nNumber of latents\n\n\n- `latent_dim`: int.  
\nDimension of the latent vector\n\n\n- `cross_num_heads`: int.  \nNumber of heads in the cross attention blocks\n\n\n- `cross_head_dim`: int.  \nDimension of the heads in the cross attention blocks\n\n\n- `self_attend_heads`: int.  \nNumber of heads in the self attention blocks\n\n\n- `self_attend_dim`: int.  \nDimension of the heads in the self attention blocks\n\n\n- `transformer_depth`: int.  \nNumber of layers in the transformer\n\n\n- `attention_dropout`: float.  \nDropout probability for the attention layers\n\n\n- `ff_hidden_dim`: int.  \nDimension of the hidden layers in the feed forward blocks\n\n\n- `ff_dropout`: float.  \nDropout probability for the feed forward layers\n\n\n- `activation`: callable.  \nActivation function to be used in the feed forward blocks.  \nIf left as None, GEGLU is used.\n\n\n## Convolutional Vision Transformer (CvT)\n\nA classifier based on the \n[Convolutional Vision Transformer](https://arxiv.org/abs/2103.15808) \narchitecture.\n\n### Usage\n\nModels can be initialized with pre-built or custom versions.\n\nPre-built models:\n\n- CvT13\n- CvT21\n- CvTW24\n\nCode example to initialize and use the pre-built CvT13\n\n```python\nimport torch\nfrom vision_models_playground.models.classifiers import build_cvt_13\n\nmodel = build_cvt_13(num_classes=1000, in_channels=3)\n\nimg = torch.randn(1, 256, 256, 3)\npreds = model(img)  # (1, 1000)\n```\n\nCode example to initialize CvT13 using the custom Convolutional Vision Transformer\n\n```python\nimport torch\nfrom vision_models_playground.models.classifiers import ConvVisionTransformer\n\nmodel = ConvVisionTransformer(\n    in_channels=3,\n    num_classes=1000,\n    patch_size=[7, 3, 3],\n    patch_stride=[4, 2, 2],\n    patch_padding=[2, 1, 1],\n    embedding_dim=[64, 192, 384],\n    depth=[1, 2, 10],\n    num_heads=[1, 3, 6],\n    ff_hidden_dim=[256, 768, 1536],\n    qkv_bias=[True, True, True],\n    drop_rate=[0.0, 0.0, 0.0],\n    attn_drop_rate=[0.0, 0.0, 0.0],\n    drop_path_rate=[0.0, 
0.0, 0.1],\n    kernel_size=[3, 3, 3],\n    stride_kv=[2, 2, 2],\n    stride_q=[1, 1, 1],\n    padding_kv=[1, 1, 1],\n    padding_q=[1, 1, 1],\n    method=['conv', 'conv', 'conv'],\n)\n\nimg = torch.randn(1, 256, 256, 3)\npreds = model(img)  # (1, 1000)\n```\n\n### Parameters\n\n- `in_channels`: int.  \nNumber of input channels.\n\n\n- `num_classes`: int.  \nNumber of classes to classify.\n\n\n- `activation`: callable.  \nActivation function to be used in the feed forward blocks.  \nIf left as None, QuickGELU is used.\n\n\n- `final_classifier_head`: bool.  \nIf enabled, uses a final classifier head.  \nIf disabled, returns the image features.\n\n\n- `patch_size`: List[int].  \nSize of the patches the image is split into, per each Vision Transformer layer.\nThe patches can overlap\n\n\n- `patch_stride`: List[int].  \nStride of the patches, per each Vision Transformer layer.\n\n\n- `patch_padding`: List[int].  \nPadding of the patches, per each Vision Transformer layer.\n\n\n- `embedding_dim`: List[int].  \nDimension of the embedding layers, per each Vision Transformer layer.\n\n\n- `depth`: List[int].  \nThe depth of each Vision Transformer layer.\n\n\n- `num_heads`: List[int].  \nThe number of heads in each Attention block, per each Vision Transformer layer.\n\n\n- `ff_hidden_dim`: List[int].  \nDimension of the hidden layers in the feed forward blocks, \nper each Vision Transformer layer.\n\n\n- `qkv_bias`: List[bool].  \nIf enabled, adds a bias to the query, key and value vectors,\nper each Vision Transformer layer.\n\n\n- `drop_rate`: List[float].  \nThe dropout rate for the dropout layers in the Vision Transformer, \nFeed Forward and the output of the Attention layers.\n\n\n- `attn_drop_rate`: List[float].  \nThe dropout rate for the dropout layers in the Attention layers.\n\n\n- `drop_path_rate`: List[float].  \nThe DropPath rate for the DropPath layers, per each Vision Transformer.  
\nThe DropPath rate that is applied inside each Attend block for the residual\nconnections is computed dynamically based on the depth of the Vision Transformer.\n\n\n- `kernel_size`: List[int].  \nThe kernel size of the convolutional layers, per each Vision Transformer layer.\n\n\n- `stride_kv`: List[int].  \nThe stride of the convolutional layers, used in the projection of the Keys and Values.\n\n\n- `stride_q`: List[int].  \nThe stride of the convolutional layers, used in the projection of the Queries.\n\n\n- `padding_kv`: List[int].  \nThe padding of the convolutional layers, used in the projection of the Keys and Values.\n\n\n- `padding_q`: List[int].  \nThe padding of the convolutional layers, used in the projection of the Queries.\n\n\n- `method`: List[Literal['conv', 'avg', 'linear']].  \nThe method of computing the projections of the Keys, Values and Queries.  \n`conv` stands for a normalized convolutional layer, followed by a linear projection.  \n`avg` stands for an average pool layer, followed by a linear projection.  \n`linear` stands for a linear projection.\n\n## UNet\n\nA pixel-level classifier based on the \n[UNet](https://arxiv.org/abs/1505.04597v1) \narchitecture.\n\n### Usage\n\nCode example to initialize and use UNet\n\n```python\nimport torch\nfrom vision_models_playground.models.segmentation import UNet\n\nmodel = UNet(\n    in_channels=1,\n    out_channels=2,\n    channels=[64, 128, 256, 512, 1024],\n    pooling_type=\"max\",\n    scale=2,\n    conv_kernel_size=3,\n    conv_padding=0,\n    method=\"conv\",\n    crop=True\n)\nx = torch.randn(1, 1, 572, 572)\ny = model(x)  # (1, 2, 388, 388)\n```\n\n### Parameters\n\n- `in_channels`: int.  \nNumber of input channels.\n\n\n- `out_channels`: int.  \nNumber of output channels.  \nCan be used as number of classes per pixel.  \nIn case of segmentation, the number of classes can be 2 for example.\n\n\n- `channels`: List[int].  
\nList of the number of channels in each layer.\n\n\n- `pooling_type`: Literal['max', 'avg'].  \nType of pooling to be used for the DownScale layers.\n\n\n- `scale`: int.  \nScale of the image for each stage.\n\n\n- `conv_kernel_size`: int.  \nKernel size of the convolutional layers.\n\n\n- `conv_padding`: int.  \nPadding of the convolutional layers.\n\n\n- `method`: Literal[\"nearest\", \"linear\", \"bilinear\", \"bicubic\", \"trilinear\", \"conv\"].  \nMethod of computing the initial upscale of the image.\nIf `conv` is selected, uses a transposed convolutional layer.\nOtherwise, uses the `nn.functional.upsample` function with the corresponding method.\n\n\n- `crop`: bool.  \nIf enabled, the output of each upscale layer will be cropped to the native size\nof the UpScaled image.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fakrielz%2Fvision_models_playground","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fakrielz%2Fvision_models_playground","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fakrielz%2Fvision_models_playground/lists"}