{"id":15642460,"url":"https://github.com/andersy005/tvm-in-action","last_synced_at":"2025-04-30T09:48:43.640Z","repository":{"id":49373448,"uuid":"132387068","full_name":"andersy005/tvm-in-action","owner":"andersy005","description":"TVM stack: exploring the incredible explosion of deep-learning frameworks and how to bring them together ","archived":false,"fork":false,"pushed_at":"2018-05-22T02:52:12.000Z","size":32,"stargazers_count":64,"open_issues_count":0,"forks_count":7,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-01-01T13:35:51.764Z","etag":null,"topics":["deep-learning","tvm"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/andersy005.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-05-07T00:30:09.000Z","updated_at":"2024-11-18T02:36:06.000Z","dependencies_parsed_at":"2022-09-11T21:32:43.717Z","dependency_job_id":null,"html_url":"https://github.com/andersy005/tvm-in-action","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andersy005%2Ftvm-in-action","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andersy005%2Ftvm-in-action/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andersy005%2Ftvm-in-action/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andersy005%2Ftvm-in-action/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/andersy005","download_url":"https://codeload.github.com/and
ersy005/tvm-in-action/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":232802574,"owners_count":18578685,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","tvm"],"created_at":"2024-10-03T11:56:18.999Z","updated_at":"2025-01-07T00:17:47.070Z","avatar_url":"https://github.com/andersy005.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"![](http://tvmlang.org/images/main/stack_tvmlang.png) (Image Source: http://tvmlang.org/)\n\n# TVM in Action\n\n[TVM: End-to-End Optimization Stack for Deep Learning](https://github.com/dmlc/tvm)\n\nThis repo hosts my notes and tutorial materials (source code) for the TVM stack as I explore the incredible explosion of deep-learning frameworks and how to bring them together. 
\n\n# [Summary of TVM: End-to-End Optimization Stack for Deep Learning](https://arxiv.org/abs/1802.04799)\n\n## Abstract\n\n- Scalable frameworks, such as TensorFlow, MXNet, Caffe, and PyTorch, are optimized for a narrow range of server-class GPUs.\n- Deploying workloads to other platforms such as mobile phones, IoT, and specialized accelerators (FPGAs, ASICs) requires laborious manual effort.\n- TVM is an end-to-end optimization stack that exposes:\n  - graph-level\n  - operator-level optimizations\n  ---\u003e to provide performance portability to deep learning workloads across diverse hardware back-ends.\n\n## Introduction\n\n- The number and diversity of specialized deep learning (DL) accelerators pose an adoption challenge\n  - They introduce new hardware abstractions that modern compilers and frameworks are ill-equipped to deal with.\n\n- Providing support in various DL frameworks for diverse hardware back-ends in the present ad-hoc fashion is **unsustainable**.\n\n- Hardware targets significantly diverge in terms of memory organization, compute, etc.\n\n![](https://i.imgur.com/XRSZMt0.png)\n\n- *The Goal*: **easily deploy DL workloads to all kinds of hardware targets, including embedded devices, GPUs, FPGAs, and ASICs (e.g., the TPU).**\n\n- Current DL frameworks rely on a **computational graph intermediate representation** to implement optimizations such as:\n  - auto differentiation\n  - dynamic memory management\n\n- **Graph-level optimizations** are often too high-level to handle hardware back-end-specific **operator transformations**.\n- **Current operator-level libraries** that DL frameworks rely on are:\n  - too rigid\n  - specialized\n\n  ---\u003e to be easily ported **across hardware devices**\n\n- To address these weaknesses, we need a **compiler framework** that can expose optimization opportunities across both\n  - graph-level and\n  - operator-level\n\n  ---\u003e to deliver competitive performance across hardware back-ends.\n\n### Four fundamental 
challenges at the computation graph level and tensor operator level\n\n1. **High-level dataflow rewriting:**\n    - Different hardware devices may have vastly different memory hierarchies.\n\n    - Strategies to fuse operators and optimize data layouts are crucial for optimizing memory access.\n\n2. **Memory reuse across threads:**\n   - Modern GPUs and specialized accelerators have memory that can be shared across compute cores.\n   - The traditional shared-nothing nested parallel model is no longer optimal.\n   - Cooperation among threads on data loaded into shared memory is required for optimized kernels. \n\n3. **Tensorized compute intrinsics:**\n   - The latest hardware provides new instructions that go beyond vector operations, like the GEMM operator in the TPU or the tensor core in NVIDIA's Volta.\n   - Consequently, the scheduling procedure must break computation into tensor arithmetic intrinsics instead of scalar or vector code.\n\n4. **Latency hiding:**\n    - Traditional architectures with simultaneous multithreading and automatically managed caches implicitly hide latency in modern CPUs/GPUs.\n    - Specialized accelerator designs favor leaner control and offload most of the scheduling complexity to the compiler stack.\n    - Still, scheduling must be performed carefully to hide memory access latency.\n\n\n### TVM: An End-to-End Optimization Stack\n\n- An end-to-end optimizing compiler stack to lower and fine-tune DL workloads to diverse hardware back-ends. \n- Designed to separate:\n  - the algorithm description\n  - schedule\n  - hardware interface\n- This separation enables **support for novel specialized accelerators** and **their corresponding new intrinsics**. 
\n- TVM presents **two optimization layers**:\n  - a computation graph optimization layer to address:\n    - high-level dataflow rewriting\n  - a tensor optimization layer with new schedule primitives to address:\n    - memory reuse across threads\n    - tensorized compute intrinsics\n    - latency hiding\n\n## Optimizing Computational Graphs\n\n### Computational Graph\n\n- Computational graphs are a common way to represent programs in DL frameworks. \n- They provide a global view of computation tasks, yet avoid specifying how each computation task needs to be implemented. \n\n\n\n### Operator Fusion\n\n- An optimization that can greatly reduce execution time, particularly on GPUs and specialized accelerators.\n- The idea is to **combine multiple operators together into a single kernel without saving the intermediate results back into global memory**\n \n![](https://i.imgur.com/mlNhoDT.png)\n\n**Four categories of graph operators**:\n\n- Injective (one-to-one map)\n- Reduction\n- Complex-out-fusable (can fuse element-wise map to output)\n- Opaque (cannot be fused)\n\n![](https://i.imgur.com/XnhSWVN.png)\n\n### Data Layout Transformation\n\n- Tensor operations are the basic operators of computational graphs\n- They can have divergent layout requirements across different operations\n- Optimizing data layout starts with specifying the preferred data layout of each operator given the constraints dictating their implementation in hardware.\n\n![](https://i.imgur.com/0J5QxGs.png)\n\n### Limitations of Graph-Level Optimizations\n\n- They are only as effective as what the operator library provides.\n- Currently, the few DL frameworks that support operator fusion require the operator library to provide an implementation of the fused patterns.\n    - With more network operators introduced on a regular basis, this approach is no longer sustainable when targeting an increasing number of hardware back-ends.\n- It is not feasible to handcraft operator kernels for this massive 
space of back-end-specific operators\n    - TVM provides a code-generation approach that can generate tensor operators. \n\n## Optimizing Tensor Operations\n\n### Tensor Expression Language\n\n- TVM introduces a dataflow tensor expression language to support automatic code generation.\n- Unlike high-level computation graph languages, where the implementation of tensor operations is opaque, *each operation is described in an index formula expression language*.\n\n![](https://i.imgur.com/LG1pguT.png)\n\n- TVM's tensor expression language supports the common arithmetic and math operations found in languages like C. \n- TVM explicitly introduces a **commutative reduction** operator to easily schedule commutative reductions across multiple threads. \n- TVM further introduces a **high-order scan operator** that can combine basic compute operators to form recurrent computations over time. \n\n### Schedule Space \n\n- Given a tensor expression, it is challenging to create high-performance implementations for each hardware back-end. \n- Each optimized low-level program is the result of different combinations of scheduling strategies, imposing a large burden on the kernel writer.\n- TVM adopts the **principle of decoupling compute descriptions from schedule optimizations**.\n- Schedules are the specific rules that lower compute descriptions down to back-end-optimized implementations. \n\n![](https://i.imgur.com/JUikGQz.png)\n\n![](https://i.imgur.com/BCg6gCz.png)\n\n\n### Nested Parallelism with Cooperation\n\n- Parallel programming is key to improving the efficiency of compute-intensive kernels in deep learning workloads. 
\n- Modern GPUs offer massive parallelism \n    \n    ---\u003e Requiring TVM to bake parallel programming models into schedule transformations\n\n- Most existing solutions adopt a parallel programming model referred to as [nested parallel programs](https://youtu.be/4lS_WThsFoM), which is a form of [fork-join parallelism](https://en.wikipedia.org/wiki/Fork%E2%80%93join_model). \n- TVM uses a parallel schedule primitive to parallelize a data-parallel task\n  - Each parallel task can be further recursively subdivided into subtasks to exploit the multi-level thread hierarchy on the target architecture (e.g., thread groups on a GPU)\n- This model is called **shared-nothing nested parallelism**\n  - One working thread cannot look at the data of its sibling within the same parallel computation stage.\n  - Interactions between sibling threads happen at the join stage, when the subtasks are done and the next stage can consume the data produced by the previous stage. \n  - This programming model **does not enable threads to cooperate with each other in order to perform a collective task within the same parallel stage**.\n\n- A better alternative to the shared-nothing approach is to **fetch data cooperatively across threads**\n    - This pattern is well known in GPU programming using languages like CUDA, OpenCL and Metal.\n    - **It has not been implemented into a schedule primitive.**\n- TVM introduces the **concept of memory scopes to the schedule space**, so that a stage can be marked as shared.\n    - Without memory scopes, automatic scope inference will mark the relevant stage as thread-local.\n    - Memory scopes are useful for GPUs.\n    - Memory scopes allow us to tag special memory buffers and create special lowering rules when targeting specialized deep learning accelerators. 
\n\n![](https://i.imgur.com/HHYtujL.png)\n\n\n### Tensorization: Generalizing the Hardware Interface\n\n- The **tensorization** problem is analogous to the **vectorization** problem for [SIMD architectures](https://en.wikipedia.org/wiki/SIMD). \n- Tensorization differs significantly from vectorization\n    - The inputs to the tensor compute primitives are multi-dimensional, with fixed or variable lengths, and dictate different data layouts.\n    - A compiler cannot resort to a fixed set of primitives, as new DL accelerators are emerging with their own flavors of tensor instructions. \n- To solve this challenge, TVM **separates the hardware interface from the schedule**:\n    - TVM introduces a tensor intrinsic declaration mechanism\n    - TVM uses the tensor expression language to declare the behavior of each new hardware intrinsic, as well as the lowering rule associated with it. \n    - TVM introduces a **tensorize** schedule primitive to replace a unit of computation with the corresponding tensor intrinsics. \n    - The compiler matches the computation pattern with a hardware declaration, and lowers it to the corresponding hardware intrinsic. \n   \n\n### Compiler Support for Latency Hiding\n\n- **Latency hiding** refers to the process of overlapping memory operations with computation to maximize memory and compute utilization. \n- It requires different strategies depending on the hardware back-end that is being targeted. \n- On CPUs, memory latency hiding is achieved **implicitly with simultaneous multithreading** or **hardware prefetching techniques**. \n- GPUs rely on **rapid context switching of many warps of threads** to maximize the utilization of functional units. \n- TVM provides a virtual threading schedule primitive that lets the programmer specify a high-level data parallel program that TVM automatically lowers to a low-level explicit data dependence program. 
\n\n\n## Code Generation and Runtime Support \n\n### Code Generation\n\n- For a specific tuple of data-flow declaration, axis relation hypergraph, and schedule tree, TVM can generate lowered code by:\n  - iteratively traversing the schedule tree\n  - inferring the dependent bounds of the input tensors (using the axis relation hypergraph)\n  - generating the loop nest in the low-level code\n- The code is lowered to an in-memory representation of an imperative C-style loop program. \n- TVM reuses a variant of Halide's loop program data structure in this process. \n- TVM reuses passes from Halide for common lowering primitives like storage flattening and unrolling, \n  - and adds GPU/accelerator-specific transformations such as:\n    - *synchronization point detection*\n    - *virtual thread injection*\n    - *module generation*\n- Finally, the loop program is transformed into **LLVM** or **CUDA/Metal/OpenCL** source code.\n\n### Runtime Support\n\n- For GPU programs, TVM builds the host and device modules **separately** and provides a runtime module system that launches kernels using the corresponding driver APIs. \n\n### Remote Deployment Profiling\n\n- TVM includes infrastructure to make profiling and autotuning easier on embedded devices. \n- Traditionally, targeting an embedded device for tuning requires:\n  - cross-compiling on the host side, \n  - copying to the target device, \n  - and timing the execution\n\n- TVM provides remote function call support. Through the **RPC interface**:\n  - TVM compiles the program on a host compiler\n  - it uploads it to remote embedded devices\n  - it runs the function remotely, \n  - and it accesses the results in the same script on the host. 
\n\n![](https://i.imgur.com/oL0Z9pp.png)\n\n\n## Conclusion\n\n- TVM provides an end-to-end stack to solve fundamental optimization challenges across a diverse set of hardware back-ends.\n- TVM can encourage more studies of programming languages and compilation, and open new opportunities for hardware co-design techniques for deep learning systems. \n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandersy005%2Ftvm-in-action","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandersy005%2Ftvm-in-action","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandersy005%2Ftvm-in-action/lists"}