{"id":21427876,"url":"https://github.com/andravin/spio","last_synced_at":"2025-07-14T10:31:34.706Z","repository":{"id":264094260,"uuid":"880627377","full_name":"andravin/spio","owner":"andravin","description":"Memory-Efficient CUDA kernels for training ConvNets with PyTorch.","archived":false,"fork":false,"pushed_at":"2025-02-25T21:29:06.000Z","size":228,"stargazers_count":41,"open_issues_count":1,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-06-09T14:56:49.817Z","etag":null,"topics":["convolutional-neural-networks","cuda","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/andravin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-30T03:45:23.000Z","updated_at":"2025-06-01T11:29:31.000Z","dependencies_parsed_at":"2025-02-25T22:34:23.627Z","dependency_job_id":null,"html_url":"https://github.com/andravin/spio","commit_stats":null,"previous_names":["andravin/spio"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/andravin/spio","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andravin%2Fspio","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andravin%2Fspio/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andravin%2Fspio/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andravin%2Fspio/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/andravin","dow
nload_url":"https://codeload.github.com/andravin/spio/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andravin%2Fspio/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265280608,"owners_count":23739851,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["convolutional-neural-networks","cuda","pytorch"],"created_at":"2024-11-22T22:07:49.499Z","updated_at":"2025-07-14T10:31:34.698Z","avatar_url":"https://github.com/andravin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spio\n\n![Benchmark Result on NVIDIA GeForce RTX 3090](figures/batch_size_vs_eff_bandwidth__nvidia_geforce_rtx_3090__convfirst_64c_3r_3s_8gw.png)\n\n## Introduction\n\nThe goal of the Spio project is to improve training efficiency for convolutional neural networks (ConvNets). While there has been a lot of progress in the design of ConvNet models, the performance of ConvNet kernels has languished. Today, the performance of a ConvNet is often limited by the efficiency of its implementation.\n\nOur [paper](https://arxiv.org/abs/2404.03617) implemented efficient GPU kernels for ConvNet inference. Spio implements kernels for training.\n\nThe first Spio kernel is for grouped convolution, a promising layer that has fallen into disuse because of the inefficiency of the current implementation. 
We focus on group width equal to eight and stride 1, as used in our ConvFirst model, and support NVIDIA Ampere ([sm_80](https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf) and [sm_86](https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf)) and Ada ([sm_89](https://images.nvidia.com/aem-dam/Solutions/Data-Center/l4/nvidia-ada-gpu-architecture-whitepaper-v2.1.pdf)) GPUs.\n\n## Benchmarks\n\nThe cuDNN Conv2d kernels use an \"implicit GEMM\" algorithm that tiles the input tensor with horizontal strips. The support halo for the convolution kernel causes overlapping reads of the input tensor, and when the tile is a 1D strip, the overlap is larger than the tile. This results in excess global memory traffic.\n\nThe Spio Conv2d kernel uses 2D tiles. This reduces the overlap between tiles and reduces global memory traffic. It processes the 2D tile one row at a time, convolving each input row with every filter row while updating a circular buffer of output rows. The circular buffer is implemented in registers by unrolling the input-row loop by the number of filter rows. This overlap-add style algorithm minimizes the kernel's local memory footprint, which increases occupancy and maximizes utilization of the global memory bandwidth.\n\nGroup width 8 matches the accumulation depth of the Float16 tensor core (through AD102, sm_89). Therefore, the grouped convolution is implemented just like regular planar convolution, but with scalar input elements\nreplaced by 8-element vectors, scalar filter elements replaced by 8x8 matrices, and scalar multiplication replaced by matrix-vector multiplication. 
Processing 16 columns of the input row at once turns the input vectors into input matrices, so that the algorithm can use the [mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32](https://docs.nvidia.com/cuda/parallel-thread-execution/#warp-level-matrix-instructions-mma) instruction.\n\nOn the NVIDIA RTX 3090 GPU (above), Spio approaches the DRAM memory bandwidth limit for the FProp, DGrad (gradient with respect to inputs), and WGrad (gradient with respect to weights) kernels, while the PyTorch / cuDNN kernels struggle with excess data transfers.\n\nOn the NVIDIA RTX 4090 GPU, Spio exceeds the DRAM memory bandwidth limit for small batch sizes by exploiting the fact that the activation tensors fit in the GPU's large (72 MB) L2 cache:\n\n![Benchmark Result on NVIDIA GeForce RTX 4090](figures/batch_size_vs_eff_bandwidth__nvidia_geforce_rtx_4090__convfirst_64c_3r_3s_8gw.png)\n\n### Benchmarking Methodology\n\nOur benchmarks use [torch.profiler](https://pytorch.org/docs/stable/profiler.html), which uses NVIDIA's [libcupti](https://developer.nvidia.com/cupti-ctk12_0) internally for precise\nkernel timing. We benchmark layers *in situ*, placing a grouped convolution layer inside a\nConvFirst or MBConv building block and constructing a stack of several blocks. This creates a realistic environment for the target kernel, where the memory hierarchy is exercised similarly to a real-world use case.\n\n## Implementation Notes\n\nSpio uses several strategies to simplify the development of high-performance CUDA kernels that\nintegrate with PyTorch.\n\n### Named Tensors\n\nSpio uses named tensors to simplify tensor indexing in CUDA source code. 
In Python, you specify the tensor\nand indexing dimensions like this:\n\n```python\n        TensorSpec(\"Output\", \"uint4\", {\"n\": n, \"p\": p, \"q\": q, \"k8\": c8}),\n        TensorSpec(\n            \"ConstSmemOutput\",\n            \"const uint4\",\n            {\"q\": block_q, \"n\": block_n, \"k8\": block_c8 + 1},\n        ),\n        IndexSpec(\"OutputStoreIdx\", {\"n\": block_n, \"q\": block_q, \"k8\": block_c8}),\n```\n\nwhich generates CUDA/C++ classes that you use in your kernel like this:\n\n```c++\n    // Output-smem to output.\n    ConstSmemOutput smem_output_load(smem_output_buf);\n    Output output(dst);\n    bool thread_stores_output;\n    {\n        OutputStoreIdx idx(threadIdx.x);\n        auto q = block_q + idx.q();\n        auto n = block_n + idx.n();\n        auto k8 = block_c8 + idx.k8();\n        smem_output_load = smem_output_load.n(idx.n()).q(idx.q()).k8(idx.k8());\n        output = output.n(n).p(block_p).q(q).k8(k8);\n        thread_stores_output = n \u003c Output::N \u0026\u0026 q \u003c Output::Q \u0026\u0026 k8 \u003c Output::K8 \u0026\u0026\n            threadIdx.x \u003c OutputStoreIdx::size;\n    }\n\n    // ...\n\n    if (thread_stores_output)\n    {\n        *output = *smem_output_load;\n    }\n    output = output.p(1);\n\n```\n\n### Run Time Compilation\n\nSpio compiles kernels at runtime using [libnvrtc](https://docs.nvidia.com/cuda/nvrtc/index.html) and launches them with [libcuda](https://docs.nvidia.com/cuda/cuda-driver-api/index.html). Unlike other packages that offer runtime compilation, Spio does not depend on the CUDA toolkit. We simply use the same NVIDIA [libnvrtc](https://pypi.org/project/nvidia-cuda-nvrtc-cu12/) and [cuda-runtime](https://pypi.org/project/nvidia-cuda-runtime-cu12/) Python packages on which PyTorch already [depends](https://github.com/pytorch/pytorch/blob/bae3426af77be643af83f1527fb430e9ca09b058/.github/scripts/generate_binary_build_matrix.py#L71). 
This minimizes software dependencies and simplifies installation.\n\n### Kernel Performance Models\n\nSpio predicts the best kernel configuration for each layer with a performance model trained on thousands of offline benchmarking samples. Prediction takes just a few milliseconds, so startup is much faster than in frameworks that rely on a time-consuming auto-tuning step.\n\n### Integration with torch.compile\n\nWe integrate with `torch.compile` using the [Python Custom Operators](https://pytorch.org/tutorials/advanced/python_custom_ops.html) interface from PyTorch 2.4. This functionality passes basic tests but is still experimental. See this [PyTorch issue](https://github.com/pytorch/pytorch/issues/137033).\n\n## Installation from Source\n\nFirst, ensure you have a C compiler installed. On Ubuntu:\n\n```bash\nsudo apt update\nsudo apt install build-essential\n```\n\nClone the repository:\n\n```bash\ngit clone https://github.com/andravin/spio.git\ncd spio\n```\n\nOptionally, create a virtual environment and activate it:\n\n```bash\npython3 -m venv .venv\nsource .venv/bin/activate\n```\n\nInstall the package from source using pip:\n\n```bash\npip install --upgrade pip\npip install .\n```\n\nOptionally, run the unit tests. This can take a while,\nbecause Spio tests every configuration of each kernel. It goes a bit faster\nif you set the SPIO_WORKERS environment variable to use all CPU cores for compiling kernels:\n\n```bash\ncd tests\nSPIO_WORKERS=$(nproc) pytest .\n```\n\nNote: the tests and scripts cannot be run from the top-level spio directory because\nthat would cause Python to find the local spio package instead of the installed package.\nOnly the installed package includes the compiled spio.cuda.driver Cython extension, so using\nthe local package would result in an import error. 
Therefore, running `cd tests` before `pytest .` is essential.\n\n## Using Spio with Timm\n\nSpio is integrated with [our fork](https://github.com/andravin/pytorch-image-models.git) of pytorch-image-models (timm) on the `spio_dev` branch. Add the `--spio` option to the command line of `benchmark.py`, `validate.py`, or `train.py`, and timm will use the Spio implementation for any supported operations.\n\nSet the `SPIO_LOGGER` environment variable (`export SPIO_LOGGER=1`) to make Spio print diagnostic info to the console.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandravin%2Fspio","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandravin%2Fspio","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandravin%2Fspio/lists"}