{"id":23039227,"url":"https://github.com/1ytic/warp-rnnt","last_synced_at":"2025-04-05T21:06:10.242Z","repository":{"id":47259431,"uuid":"199127460","full_name":"1ytic/warp-rnnt","owner":"1ytic","description":"CUDA-Warp RNN-Transducer","archived":false,"fork":false,"pushed_at":"2023-02-22T18:22:40.000Z","size":158,"stargazers_count":212,"open_issues_count":17,"forks_count":41,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-03-29T20:07:04.788Z","etag":null,"topics":["cuda","forward-backward","pytorch","rnn-transducer","tensorflow","warp"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/1ytic.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-07-27T06:44:30.000Z","updated_at":"2025-02-14T13:57:41.000Z","dependencies_parsed_at":"2023-12-21T16:44:59.742Z","dependency_job_id":"6c33cc6a-54b6-4782-a58b-e4dd71b0bc0c","html_url":"https://github.com/1ytic/warp-rnnt","commit_stats":{"total_commits":28,"total_committers":6,"mean_commits":4.666666666666667,"dds":0.3928571428571429,"last_synced_commit":"66dfe053f236d86320c09cd5d1238828cc78b8fa"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1ytic%2Fwarp-rnnt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1ytic%2Fwarp-rnnt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1ytic%2Fwarp-rnnt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1ytic%2Fwarp-rnnt/mani
fests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/1ytic","download_url":"https://codeload.github.com/1ytic/warp-rnnt/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247399871,"owners_count":20932876,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","forward-backward","pytorch","rnn-transducer","tensorflow","warp"],"created_at":"2024-12-15T18:29:13.965Z","updated_at":"2025-04-05T21:06:10.226Z","avatar_url":"https://github.com/1ytic.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"![PyPI](https://img.shields.io/pypi/v/warp-rnnt.svg)\n[![Downloads](https://pepy.tech/badge/warp-rnnt)](https://pepy.tech/project/warp-rnnt)\n\n# CUDA-Warp RNN-Transducer\nA GPU implementation of RNN Transducer (Graves [2012](https://arxiv.org/abs/1211.3711), [2013](https://arxiv.org/abs/1303.5778)).\nThis code is ported from the [reference implementation](https://github.com/awni/transducer/blob/master/ref_transduce.py) (by Awni Hannun)\nand fully utilizes the CUDA warp mechanism.\n\nThe main bottleneck in the loss is the forward/backward pass, which is based on a dynamic programming algorithm.\nIn particular, there is a nested loop to populate a lattice with shape (T, U),\nand each value in this lattice depends on the two previous cells, one along each dimension (e.g. 
[forward pass](https://github.com/awni/transducer/blob/6b37e98c21551c7ed2181e2f526053bae8ae94d2/ref_transduce.py#L56)).\n\nCUDA executes threads in groups of 32 parallel threads called [warps](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture).\nFull efficiency is realized when all 32 threads of a warp agree on their execution path.\nThis is exactly what is used to optimize the RNN Transducer. The lattice is split into warps in the T dimension.\nIn each warp, threads exchange variables using fast operations.\nAs soon as the current warp fills the last value, the next two warps (t+32, u) and (t, u+1) start running.\nA schematic procedure for the forward pass is shown in the figure below, where T is the number of frames, U is the number of labels, and W is the warp size.\nA similar procedure for the backward pass runs in parallel.\n\n![](lattice.gif)\n\n\n## Performance\nNVIDIA Profiler shows the advantage of the _warp_ implementation over the _non-warp_ implementation.\n\nThis warp implementation:\n![](warp-rnnt.nvvp.png)\n\nNon-warp implementation [warp-transducer](https://github.com/HawkAaron/warp-transducer):\n![](warp-transducer.nvvp.png)\n\nUnfortunately, in practice this advantage disappears because the memory operations take much longer. 
Especially if you synchronize memory on each iteration.\n\n|                         |    warp_rnnt (gather=False)    |    warp_rnnt (gather=True)    | [warprnnt_pytorch](https://github.com/HawkAaron/warp-transducer/tree/master/pytorch_binding) | [transducer (CPU)](https://github.com/awni/transducer) |\n| :---------------------- | ------------------: | ------------------: | ------------------: | ------------------: |\n|  **T=150, U=40, V=28**  | \n|         N=1             |       0.50 ms       |       0.54 ms       |       0.63 ms       |       1.28 ms       |\n|         N=16            |       1.79 ms       |       1.72 ms       |       1.85 ms       |       6.15 ms       |\n|         N=32            |       3.09 ms       |       2.94 ms       |       2.97 ms       |      12.72 ms       |\n|         N=64            |       5.83 ms       |       5.54 ms       |       5.23 ms       |      23.73 ms       |\n|         N=128           |      11.30 ms       |      10.74 ms       |       9.99 ms       |      47.93 ms       |\n| **T=150, U=20, V=5000** |\n|         N=1             |       0.95 ms       |       0.80 ms       |       1.74 ms       |      21.18 ms       |\n|         N=16            |       8.74 ms       |       6.24 ms       |      16.20 ms       |     240.11 ms       |\n|         N=32            |      17.26 ms       |      12.35 ms       |      31.64 ms       |     490.66 ms       |\n|         N=64            |    out-of-memory    |    out-of-memory    |    out-of-memory    |     944.73 ms       |\n|         N=128           |    out-of-memory    |    out-of-memory    |    out-of-memory    |    1894.93 ms       |\n| **T=1500, U=300, V=50** |\n|         N=1             |       5.89 ms       |       4.99 ms       |      10.02 ms       |     121.82 ms       |\n|         N=16            |      95.46 ms       |      78.88 ms       |      76.66 ms       |     732.50 ms       |\n|         N=32            |    out-of-memory    |     157.86 ms       |     165.38 ms 
      |    1448.54 ms       |\n|         N=64            |    out-of-memory    |    out-of-memory    |    out-of-memory    |    2767.59 ms       |\n\n[Benchmarked](pytorch_binding/benchmark.py) on a GeForce RTX 2070 Super GPU, Intel i7-10875H CPU @ 2.30GHz.\n\n## Note\n\n- This implementation assumes that the input has already been passed through log_softmax.\n\n- In addition to the alphas/betas arrays, a counts array is allocated with shape (N, U * 2), which is used as a scheduling mechanism.\n\n- [core_gather.cu](core_gather.cu) is a memory-efficient version that expects log_probs with the shape (N, T, U, 2), containing only the blank and label values. It shows excellent performance with a large vocabulary.\n\n- Do not expect that this implementation will greatly reduce the training time of an RNN Transducer model. The main bottleneck will probably be the trainable joint network with an output of shape (N, T, U, V).\n\n- Also, there is a restricted version, called [Recurrent Neural Aligner](https://github.com/1ytic/warp-rna), with the assumption that the length of the input sequence is equal to or greater than the length of the target sequence.\n\n\n## Install\nThere are two bindings for the core algorithm:\n- [pytorch_binding](pytorch_binding)\n- [tensorflow_binding](tensorflow_binding)\n\n\n## Reference\n- Awni Hannun [transducer](https://github.com/awni/transducer)\n\n- Mingkun Huang [warp-transducer](https://github.com/HawkAaron/warp-transducer)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F1ytic%2Fwarp-rnnt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F1ytic%2Fwarp-rnnt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F1ytic%2Fwarp-rnnt/lists"}