{"id":30041461,"url":"https://github.com/pytorch/helion","last_synced_at":"2026-05-21T01:13:42.280Z","repository":{"id":291488917,"uuid":"970899748","full_name":"pytorch/helion","owner":"pytorch","description":"A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.","archived":false,"fork":false,"pushed_at":"2026-01-13T06:03:49.000Z","size":8291,"stargazers_count":710,"open_issues_count":113,"forks_count":92,"subscribers_count":17,"default_branch":"main","last_synced_at":"2026-01-13T07:50:55.295Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pytorch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2025-04-22T17:50:13.000Z","updated_at":"2026-01-13T07:25:29.000Z","dependencies_parsed_at":"2025-05-05T00:30:13.777Z","dependency_job_id":"171a429a-c862-4804-97bf-1747a96dd41a","html_url":"https://github.com/pytorch/helion","commit_stats":null,"previous_names":["pytorch-labs/helion","pytorch/helion"],"tags_count":30,"template":false,"template_full_name":null,"purl":"pkg:github/pytorch/helion","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pytorch%2Fhelion","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pytorch%2Fhelion/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pytorch%2Fhel
ion/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pytorch%2Fhelion/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pytorch","download_url":"https://codeload.github.com/pytorch/helion/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pytorch%2Fhelion/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28442608,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-15T00:55:22.719Z","status":"online","status_checked_at":"2026-01-15T02:00:08.019Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-07T02:45:02.250Z","updated_at":"2026-05-21T01:13:42.273Z","avatar_url":"https://github.com/pytorch.png","language":"Python","funding_links":[],"categories":["Python","其他_机器学习与深度学习"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"docs/_static/helion_nobackground.png\" alt=\"Helion Logo\" width=\"250\"/\u003e\n\u003c/div\u003e\n\n# Events\n\n- **June 15, 2026**: Helion Tutorial, [Writing Performance-Portable Kernels Simplified with Helion](https://pldi26.sigplan.org/details/pldi-2026-tutorials/1/Writing-Performance-Portable-Kernels-Simplified-with-Helion) @ PLDI 2026, Boulder, CO\n\n# About\n\n📚 **[View Documentation](https://helionlang.com)** 📚 |\n🎥 **[Watch Talk](https://youtu.be/BW-Ht-5IxgM)** 🎥 |\n🚀 **[Try 
In Colab](https://colab.research.google.com/github/pytorch/helion/blob/main/notebooks/softmax.ipynb)** 🚀 |\n**[Try In AMD DevCloud](https://amd-ai-academy.com/github/pytorch/helion/blob/main/notebooks/softmax.ipynb)**\n\n**Helion** is a Python-embedded domain-specific language (DSL) for\nauthoring machine learning kernels, designed to compile down to [Triton],\na performant backend for programming GPUs and other devices. Helion aims\nto raise the level of abstraction compared to Triton, making it easier\nto write correct and efficient kernels while enabling more automation\nin the autotuning process.\n\n[Triton]: https://github.com/triton-lang/triton\n\nThe name *Helion* refers to the nucleus of a helium-3 atom, while *Triton*\nrefers to hydrogen-3.\n\nHelion can be viewed either as *PyTorch with tiles* or as *a higher-level Triton*. Compared to\nTriton, Helion reduces manual coding effort through autotuning. Helion spends more time (approx\n10 min) autotuning as it evaluates hundreds of potential Triton implementations generated\nfrom a single Helion kernel. This larger search space also makes kernels more performance\nportable between different hardware. Helion automates and autotunes over:\n\n1. **Tensor Indexing:**\n\n   * Automatically calculates strides and indices.\n   * Autotunes choices among various indexing methods (pointers, block pointers, TensorDescriptors).\n   * Supports per-operation indexing strategies for fine-grained memory access control of loads and stores.\n\n2. **Masking:**\n\n   * Most masking is implicit in Helion, and is optimized away when not needed.\n\n3. **Grid Sizes and PID Calculations:**\n\n   * Automatically determines grid sizes.\n   * Autotunes multiple mappings from Program IDs (PIDs) to data tiles.\n\n4. **Implicit Search Space Definition:**\n\n   * Eliminates the need to manually define search configurations.\n   * Automatically generates configuration flags and exploration spaces.\n\n5. 
**Kernel Arguments Management:**\n\n   * Automates the handling of kernel arguments, including tensor sizes and strides.\n   * Lifts global variables and (nested) closures into kernel arguments, allowing better templating.\n\n6. **Looping Reductions:**\n\n   * Can automatically convert large reductions into looped implementations.\n\n7. **Automated Optimizations:**\n\n   * PID swizzling for improved L2 cache reuse.\n   * Loop reordering.\n   * Persistent kernel strategies.\n   * Warp specialization choices, unrolling, and more.\n\n## Example\n\nA minimal matrix multiplication kernel in Helion looks like this:\n\n```python\nimport torch, helion, helion.language as hl\n\n@helion.kernel()\ndef matmul(x: torch.Tensor, y: torch.Tensor) -\u003e torch.Tensor:\n    m, k = x.size()\n    k, n = y.size()\n    out = torch.empty([m, n], dtype=x.dtype, device=x.device)\n\n    for tile_m, tile_n in hl.tile([m, n]):\n        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)\n        for tile_k in hl.tile(k):\n            acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n])\n        out[tile_m, tile_n] = acc\n\n    return out\n```\n\nThe code outside the `for` loops is standard PyTorch code executed on\nthe CPU. It is typically used for tasks like allocating output tensors\nand performing shape computations.\n\nThe code inside the `for` loops is compiled into a Triton kernel,\nresulting in a single GPU kernel.  A single Helion kernel is always\ncompiled to exactly one GPU kernel.\n\nThe `hl.tile` function subdivides the iteration space (in this case `m` by\n`n`) into tiles. These tiles are executed in parallel on the GPU. Tiling\ndetails, such as dimensionality (1D vs 2D), tile sizes, and loop ordering,\nare automatically determined by Helion's autotuner. Alternatively, these\ndetails can be explicitly specified using the `config=` argument in\n`helion.kernel`.\n\n* The outer `for` loop is mapped onto the grid of the generated\nkernel. 
The grid size is determined automatically based on the chosen\ntile size.\n\n* The inner `for` loop translates into a loop within the generated kernel,\nand its tile size is also determined automatically.\n\nWithin a Helion kernel, standard PyTorch operators (like\n`torch.addmm`) are automatically mapped to Triton operations using\n[TorchInductor](https://github.com/pytorch/pytorch/tree/main/torch/_inductor).\nThus, familiarity with PyTorch means you already know most of\nHelion. Helion supports a wide range of operations including pointwise\n(`add`, `sigmoid`, etc.), reductions (`sum`, `softmax`, etc.), views,\nand matrix multiplication operations.  Arbitrary function calls\nwithin a Helion kernel are supported, but must be traceable with\n[make_fx](https://pytorch.org/docs/stable/generated/torch.fx.experimental.proxy_tensor.make_fx.html).\n\n## Autotuning\n\nThe above example can be executed with:\n\n```python\nout = matmul(torch.randn([2048, 2048], device=\"cuda\"),\n             torch.randn([2048, 2048], device=\"cuda\"))\n```\n\nWhen a kernel runs for the first time, Helion initiates autotuning. 
A\ntypical autotuning session produces output similar to:\n\n```\n[0s] Starting DifferentialEvolutionSearch with population=40, generations=20, crossover_rate=0.8\n[20s] Initial population: failed=4 min=0.0266 mid=0.1577 max=1.2390 best=Config(block_sizes=[64, 32, 64], loop_orders=[[1, 0]], l2_groupings=[8], range_unroll_factors=[3, 1], range_warp_specializes=[True, False], range_num_stages=[1, 0], range_multi_buffers=[True, True], range_flattens=[None, False], num_warps=4, num_stages=7, indexing='block_ptr', pid_type='persistent_blocked')\n[51s] Generation 2: replaced=17 min=0.0266 mid=0.0573 max=0.1331 best=Config(block_sizes=[64, 32, 64], loop_orders=[[1, 0]], l2_groupings=[8], range_unroll_factors=[3, 1], range_warp_specializes=[True, False], range_num_stages=[1, 0], range_multi_buffers=[True, True], range_flattens=[None, False], num_warps=4, num_stages=7, indexing='block_ptr', pid_type='persistent_blocked')\n[88s] Generation 3: replaced=18 min=0.0225 mid=0.0389 max=0.1085 best=Config(block_sizes=[64, 64, 16], loop_orders=[[0, 1]], l2_groupings=[4], range_unroll_factors=[0, 1], range_warp_specializes=[None, None], range_num_stages=[0, 0], range_multi_buffers=[None, False], range_flattens=[None, None], num_warps=4, num_stages=6, indexing='pointer', pid_type='flat')\n...\n[586s] Generation 19: replaced=3 min=0.0184 mid=0.0225 max=0.0287 best=Config(block_sizes=[64, 64, 64], loop_orders=[[0, 1]], l2_groupings=[4], range_unroll_factors=[0, 1], range_warp_specializes=[None, False], range_num_stages=[0, 3], range_multi_buffers=[None, False], range_flattens=[None, None], num_warps=8, num_stages=6, indexing='block_ptr', pid_type='flat')\n[586s] Autotuning complete in 586.6s after searching 1520 configs.\nOne can hardcode the best config and skip autotuning with:\n    @helion.kernel(config=helion.Config(block_sizes=[64, 64, 64], loop_orders=[[0, 1]], l2_groupings=[4], range_unroll_factors=[0, 1], range_warp_specializes=[None, False], range_num_stages=[0, 3], 
range_multi_buffers=[None, False], range_flattens=[None, None], num_warps=8, num_stages=6, indexing='block_ptr', pid_type='flat'))\n```\n\nBecause autotuning can be time-consuming (around 10 minutes in the above\nexample), you may want to manually specify the best configuration found from\nautotuning to avoid repeated tuning:\n\n```python\n@helion.kernel(config=helion.Config(\n    block_sizes=[64, 64, 64],\n    loop_orders=[[0, 1]],\n    l2_groupings=[4],\n    range_unroll_factors=[0, 1],\n    range_warp_specializes=[None, False],\n    range_num_stages=[0, 3],\n    range_multi_buffers=[None, False],\n    range_flattens=[None, None],\n    num_warps=8,\n    num_stages=6,\n    indexing='block_ptr',\n    pid_type='flat'\n))\ndef matmul(x: torch.Tensor, y: torch.Tensor) -\u003e torch.Tensor:\n    ...\n```\n\nThis explicit configuration skips autotuning on subsequent runs.\n\nYou can also specify multiple configurations, prompting Helion to perform\na more lightweight autotuning process:\n\n```python\n@helion.kernel(configs=[\n    helion.Config(...),\n    helion.Config(...),\n])\ndef matmul(x: torch.Tensor, y: torch.Tensor) -\u003e torch.Tensor:\n    ...\n```\n\nIn this case, Helion evaluates the provided configurations and selects the fastest one.\n\nAdditionally, Helion provides programmatic APIs to manage autotuning\nand configurations directly from your code.\n\n**For production deployment**, we recommend using ahead-of-time tuned configurations rather than relying on runtime autotuning. The autotuning process can be time-consuming and resource-intensive, making it unsuitable for production environments where predictable performance and startup times are critical.\n\n### Static shapes and autotuning keys\n\nBy default Helion uses static shapes (`static_shapes=True`). This means each unique input shape/stride signature is treated as its own specialization and will be autotuned separately. 
This typically yields the best performance, but may increase autotuning time when many shapes are encountered.\n\nIf you want to reduce autotuning time by sharing configurations between different shapes, set `static_shapes=False`. In this mode, the autotuning key ignores exact sizes, allowing a single tuned config to be reused across multiple shapes. This can come with a performance penalty compared to fully specialized static shapes.\n\n```python\n@helion.kernel(static_shapes=False)\ndef my_kernel(x: torch.Tensor) -\u003e torch.Tensor:\n    ...\n```\n\n## Configurations\n\nHelion configurations include the following options:\n\n* **block\\_sizes** (`list[int]`):\nControls tile sizes corresponding to each dimension passed to `hl.tile` or each call\nto `hl.register_block_size` in the kernel.\n\n* **loop\\_orders** (`list[list[int]]`):\nContains one entry per `hl.tile` call with two or more dimensions,\nallowing you to permute the iteration order of the tiles.\n\n* **flatten_loops** (`list[bool]`):\nContains one entry per `hl.tile` call with two or more dimensions,\nallowing you to flatten the iteration space into a single dimension.\n\n* **range\\_unroll\\_factors** (`list[int]`):\nContains one entry per loop dimension, specifying the unroll factor for\n`tl.range()` calls. Values less than 1 omit the `loop_unroll_factor` parameter.\n\n* **range\\_num\\_stages** (`list[int]`):\nContains one entry per loop dimension, specifying the number of stages for\n`tl.range()` calls. Values less than 1 omit the `num_stages` parameter.\n\n* **range\\_multi\\_buffers** (`list[bool | None]`):\nContains one entry per loop dimension, controlling the `disallow_acc_multi_buffer`\nparameter for `tl.range()` calls. 
`True` allows multi-buffer (sets `disallow_acc_multi_buffer=False`),\n`False` disallows multi-buffer (sets `disallow_acc_multi_buffer=True`), and `None` omits the parameter.\n\n* **range\\_flattens** (`list[bool | None]`):\nContains one entry per loop dimension, controlling the `flatten`\nparameter for `tl.range()` calls. `True` sets `flatten=True`,\n`False` sets `flatten=False`, and `None` omits the parameter.\n\n* **range\\_warp\\_specializes** (`list[bool | None]`):\nContains one entry per loop dimension, controlling the `warp_specialize`\nparameter for `tl.range()` calls. `True` sets `warp_specialize=True`,\n`False` sets `warp_specialize=False`, and `None` omits the parameter.\nOnly available on CUDA devices with Blackwell or newer architectures\nwhen `allow_warp_specialize` setting is enabled.\n\n* **static\\_ranges** (`list[bool]`):\nContains one entry per loop dimension with static bounds, controlling whether to use\n`tl.static_range()` calls. `True` generates `tl.static_range()` and ignores range_* configs for that loop. `False` generates `tl.range()`.\n\n* **reduction\\_loops** (`list[int | None]`):\nContains one entry per reduction dimension (see\n`examples/softmax.py`). Using `None` triggers a persistent reduction,\nwhere the entire reduction is processed in a single tile. Specifying an\ninteger block size converts the reduction into a loop, beneficial for\nlarger reductions that exceed the registers available.\n\n* **l2\\_groupings** (`list[int]`):\nReorders the program IDs (PIDs) of the generated kernel for improved L2\ncache behavior. A value of `1` disables this optimization, while higher\nvalues specify the grouping size.\n\n* **indexing** (`\"pointer\"`, `\"tensor_descriptor\"`, `\"block_ptr\"`, or a list of these):\nSpecifies the memory indexing strategy for load and store operations. 
Can be:\n  - A single strategy (applies to all loads and stores): `indexing=\"block_ptr\"`\n  - A list of strategies (one per load/store in execution order): `indexing=[\"pointer\", \"pointer\", \"block_ptr\"]`\n  - Empty/omitted (defaults to `\"pointer\"` for all operations)\n  - When using a list, provide strategies in order: `[load1, load2, ..., store1, store2, ...]`\n\n  The `\"tensor_descriptor\"` option uses Tensor Memory Accelerators (TMAs) but\n  requires a Hopper or newer GPU and the latest development version of Triton.\n\n* **pid\\_type** (`\"flat\"`, `\"xyz\"`, `\"persistent_blocked\"`, or `\"persistent_interleaved\"`):\n  Specifies the program ID mapping strategy. `\"flat\"` uses only the x-dimension,\n  `\"xyz\"` utilizes multiple grid dimensions, and persistent strategies enable\n  persistent kernels for improved SM utilization.\n\n* **num\\_warps** (`int`):\nSets the number of warps the kernel will use.\n\n* **num\\_stages** (`int`):\nDefines the number of pipeline stages to be passed to Triton.\n\n* **load_eviction_policies** (`list[str]`):\nControls the eviction policy used for loads discovered in device loops. Each entry\ncorresponds to a load site; allowed values are `\"\"` (no policy), `\"first\"`\n(maps to Triton `evict_first`), and `\"last\"` (maps to Triton `evict_last`).\nExplicit `eviction_policy=...` on `hl.load` overrides this config.\n\nChanging these options often results in significantly different\noutput Triton code, allowing the autotuner to explore a wide range of\nimplementations from a single Helion kernel.\n\n## TileIR Backend (Blackwell GPUs)\n\nHelion supports the [Triton-TileIR backend](https://github.com/triton-lang/Triton-to-tile-IR) for NVIDIA Blackwell GPUs (compute capability 10.x/12.x). This backend provides optimized code generation targeting [TileIR](https://docs.nvidia.com/cuda/tile-ir/latest/index.html) with additional tuning knobs.\n\nTo enable the TileIR backend:\n\n1. 
Install the [Triton-TileIR backend](https://github.com/triton-lang/Triton-to-tile-IR)\n2. Set the environment variable:\n\n```bash\nexport ENABLE_TILE=1\n```\n\nFor detailed documentation, see the [TileIR Backend Guide](docs/tileir_backend.md).\n\n## Settings for Development and Debugging\n\nWhen developing kernels with Helion, you might prefer skipping autotuning for faster iteration. To\ndo this, set the environment variable `HELION_AUTOTUNE_EFFORT=none` or use the decorator argument\n`@helion.kernel(autotune_effort=\"none\")`. **Warning:** The default configuration is slow and not intended for\nproduction or performance testing.\n\nTo view the generated Triton code, set the environment variable `HELION_PRINT_OUTPUT_CODE=1` or include\n`print_output_code=True` in the `@helion.kernel` decorator. This prints the Triton code to `stderr`, which is\nhelpful for debugging and understanding Helion's compilation process.  One can also use\n`foo_kernel.bind(args).to_triton_code(config)` to get the Triton code as a string.\n\nTo emit a repro script that includes the Helion kernel definition, the config decorator, and a\n`helion_repro_caller()` helper that recreates the runtime inputs before invoking the Helion kernel, set\n`HELION_PRINT_REPRO=1` or include `print_repro=True` in the `@helion.kernel` decorator. 
This prints\nthe repro script to `stderr`, which is helpful for debugging and for sharing a minimal repro on the GitHub issue tracker.\n\nWithin an `hl.tile`/`hl.grid` device loop, if you want to print intermediate results using `print(\"x\", ...)` syntax,\nor pause execution using Python's built-in `breakpoint()`, set either `TRITON_INTERPRET=1` (runs Triton's CPU interpreter)\nor `HELION_INTERPRET=1` (runs the Helion kernel in eager mode).\n\nTo force autotuning, bypassing provided configurations, set `HELION_FORCE_AUTOTUNE=1` or invoke `foo_kernel.autotune(args,\nforce=True)`.\n\nAdditional settings are available in\n[settings.py](https://github.com/pytorch/helion/blob/main/helion/runtime/settings.py).  If both an environment\nvariable and a kernel decorator argument are set, the kernel decorator argument takes precedence, and the environment\nvariable will be ignored.\n\nEnable logging by setting the environment variable `HELION_LOGS=all` for INFO-level logs, or `HELION_LOGS=+all`\nfor DEBUG-level logs. Alternatively, you can specify logging for specific modules using a comma-separated list\n(e.g., `HELION_LOGS=+helion.runtime.kernel`).\n\n\n## Requirements\n\nHelion currently targets Linux systems and requires a recent Python and PyTorch environment:\n\n- Linux-based OS\n- Python 3.10–3.14\n- [PyTorch] 2.9 or later\n- [Triton] 3.5 or later\n  *(Older versions may work, but will lack support for features like\n  TMA on Hopper/Blackwell GPUs and may exhibit lower performance.)*\n- [Triton-to-tile-IR](https://github.com/triton-lang/Triton-to-tile-IR) *(Optional)* 3.5 or later\n\n[PyTorch]: https://github.com/pytorch/pytorch\n\n## Installation\n\nWe recommend using [uv] to manage an isolated virtual environment. 
First,\ninstall compatible versions of [PyTorch] and [Triton].\n\n[uv]: https://docs.astral.sh/uv/\n\nOnce your environment is set up, you can install Helion:\n\n```bash\npip install helion\n```\n\nAlternatively, you may install from source for development purposes. If using `uv`, create and activate a virtual environment first:\n```bash\ngit clone https://github.com/pytorch/helion.git\ncd helion\n\n# Create and activate a virtual environment with uv (one-time)\nuv venv .venv\nsource .venv/bin/activate\n\n# To install in editable w/ required dev packages\npip install -e .'[dev]'\n```\nThis installs Helion in \"editable\" mode so that changes to the source\ncode take effect without needing to reinstall.\n\n## Linting\n\nWe use `pre-commit` to run ruff, pyrefly, and other checks automatically.\n\n– One-time setup (installs the git hook):\n```bash\npip install pre-commit\npre-commit install\n```\n\n– Run all checks across the repository:\n```bash\npre-commit run --all-files\n```\n\nNote: You can still run the underlying tools directly via `./lint.sh [fix|check|unsafe]`.\n\n## Community\n\nQuestions or feedback? Join us on the [GPU MODE Discord](https://discord.gg/gpumode) in the `#helion` channel.\n\n## License\n\nHelion is BSD-style licensed, as found in the LICENSE file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpytorch%2Fhelion","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpytorch%2Fhelion","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpytorch%2Fhelion/lists"}