{"id":21264878,"url":"https://github.com/maghoumi/pytorch-softdtw-cuda","last_synced_at":"2025-04-04T10:09:41.942Z","repository":{"id":41392794,"uuid":"260793611","full_name":"Maghoumi/pytorch-softdtw-cuda","owner":"Maghoumi","description":"Fast CUDA implementation of (differentiable) soft dynamic time warping for PyTorch","archived":false,"fork":false,"pushed_at":"2024-04-03T14:17:16.000Z","size":34,"stargazers_count":659,"open_issues_count":19,"forks_count":60,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-03-28T09:09:31.132Z","etag":null,"topics":["cuda","deep-learning","dynamic-time-warping","pytorch","soft-dtw"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Maghoumi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-02T23:28:24.000Z","updated_at":"2025-03-24T15:46:05.000Z","dependencies_parsed_at":"2025-02-23T18:10:20.518Z","dependency_job_id":"c87ed59c-727c-4728-b0c9-dad81b240062","html_url":"https://github.com/Maghoumi/pytorch-softdtw-cuda","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Maghoumi%2Fpytorch-softdtw-cuda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Maghoumi%2Fpytorch-softdtw-cuda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Maghoumi%2Fpytorch-softdtw-cuda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/Maghoumi%2Fpytorch-softdtw-cuda/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Maghoumi","download_url":"https://codeload.github.com/Maghoumi/pytorch-softdtw-cuda/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247157283,"owners_count":20893220,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","deep-learning","dynamic-time-warping","pytorch","soft-dtw"],"created_at":"2024-11-21T05:04:13.791Z","updated_at":"2025-04-04T10:09:41.918Z","avatar_url":"https://github.com/Maghoumi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Soft DTW for PyTorch in CUDA\n===\nFast CUDA implementation of [soft-DTW](https://github.com/mblondel/soft-dtw) for PyTorch.\nBased on [pytorch-softdtw](https://github.com/Sleepwalking/pytorch-softdtw) but can run up to 100x faster!\nBoth `forward()` and `backward()` passes are implemented using CUDA.\n\nMy implementation is partly inspired by\n[_\"Developing a pattern discovery method in time series data and its GPU acceleration\"_](https://ieeexplore.ieee.org/document/8400444)\nwherein a diagonal-based implementation of the Bellman recursion is proposed.\n\n## Getting Started\n\nThis code depends on [PyTorch](https://pytorch.org/) and [Numba](http://numba.pydata.org/).\nJust include `soft_dtw_cuda.py` in your projects, and you should be good to go!\n\nYou can also run the included profiler/test (tested with Python v3.6), and see the speedups you'd get:\n\n```\ngit clone 
https://github.com/Maghoumi/pytorch-softdtw-cuda\ncd pytorch-softdtw-cuda\npython soft_dtw_cuda.py\n```\n\n### Example Usage\nSample code is already provided in the script. Here's a quick example:\n\n```python\nimport torch\nfrom soft_dtw_cuda import SoftDTW\n\n# Create the sequences\nbatch_size, len_x, len_y, dims = 8, 15, 12, 5\nx = torch.rand((batch_size, len_x, dims), requires_grad=True)\ny = torch.rand((batch_size, len_y, dims))\n# Transfer tensors to the GPU\nx = x.cuda()\ny = y.cuda()\n\n# Create the \"criterion\" object\nsdtw = SoftDTW(use_cuda=True, gamma=0.1)\n\n# Compute the loss value\nloss = sdtw(x, y)  # Just like any torch.nn.xyzLoss()\n\n# Aggregate and call backward()\nloss.mean().backward()\n```\n\n### Demo Project\n\nCheck out [DeepNAG](https://github.com/Maghoumi/DeepNAG), our deep non-adversarial gesture generator.\nWe show that an RNN-based gesture generator trained with soft DTW can outperform the same generator\ntrained using a GAN framework.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"400\" src=\"https://github.com/Maghoumi/DeepNAG/raw/master/images/kick.gif\"/\u003e\n  \u003cimg width=\"400\" src=\"https://github.com/Maghoumi/DeepNAG/raw/master/images/uppercut.gif\"/\u003e\n\u003c/p\u003e\n\n## Citation\nIf you use this code in your research, please cite the following publications:\n\n```\n@phdthesis{maghoumi2020dissertation,\n  title={{Deep Recurrent Networks for Gesture Recognition and Synthesis}},\n  author={Mehran Maghoumi},\n  year={2020},\n  school={University of Central Florida Orlando, Florida}\n}\n\n@inproceedings{maghoumi2021deepnag,\n  title={DeepNAG: Deep Non-Adversarial Gesture Generation},\n  author={Maghoumi, Mehran and Taranta, Eugene Matthew and LaViola, Joseph},\n  booktitle={26th International Conference on Intelligent User Interfaces},\n  pages={213--223},\n  year={2021}\n}\n```\n\n## FAQ:\n\n### This is awesome! What can I do to help?\nConsider starring this repository if you find it helpful. 
Also, don't forget to thank the author of\n[pytorch-softdtw](https://github.com/Sleepwalking/pytorch-softdtw) for his CPU implementation.\n\nAlso, please consider contributing to this project by improving the performance, addressing existing\nlimitations, etc. PRs are greatly welcome!\n\n### Does it support pruning?\nYes! Use the `bandwidth` argument to specify the Sakoe-Chiba bandwidth to use for pruning.\n\n### How fast does it run?\nIt depends on your batch size and sequence length. The longer the sequences and the larger the batch size,\nthe faster this code runs.\n\nHere's what I get with Intel Core-i7 12700K and Titan RTX:\n\n```\nProfiling forward() + backward() times for batch_size=128, seq_len_a=17, seq_len_b=15, dims=2...\n    CPU:      0.004228143487125635\n    GPU:      0.0014472737908363341\n    Speedup:  2.9214537801325924\n\nProfiling forward() + backward() times for batch_size=512, seq_len_a=64, seq_len_b=64, dims=2...\n    CPU:      0.023894597217440604\n    GPU:      0.003414902277290821\n    Speedup:  6.997154025853163\n\nProfiling forward() + backward() times for batch_size=512, seq_len_a=256, seq_len_b=256, dims=2...\n    CPU:      0.5894654761068523\n    GPU:      0.0343648319132626\n    Speedup:  17.153160463425888\n```\n\nNote that there are tons of opportunities for optimizing this code further (e.g. various\nCUDA optimizations such as the use of shared memory, etc.). Contributions/improvements are greatly appreciated!\n\n### How accurate are the results?\nDepends on the length of your inputs. Because of the sequential nature of this code, the longer your input\nsequences are, the higher numerical errors become due to accumulation. 
Especially in the `backward()` call,\nyou could see floating point errors of up to `1e-3` on uniform random inputs in the range `[0, 1)` in the\nresulting derivative tensor.\n\nThe unit tests included in `soft_dtw_cuda.py` verify the results against the CPU implementation.\n\n### What are the limitations?\nSome limitations are:\n\n1. All sequences in the same batch should have the same length / number of features.\n2. Inputs cannot have lengths longer than 1024 (due to CUDA limitations on the maximum block size).\n   The code will warn if your sequence length is too long, and will fall back to the CPU implementation.\n3. You may run out of CUDA resources if your inputs are long (but still less than 1024). See below.\n\n### I'm seeing `CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES`. Help!\nThis means the length of your sequences is too long, and your GPU cannot spawn a sufficient number of threads.\nThis is related to point 3 above in the \"limitations\". I'm not sure if it's possible to query the CUDA device\nin Numba to see if launching the kernel is possible given the number of necessary threads. In these cases\nconsider using the CPU implementation.\n\nLicense\n---\nThis project is licensed under the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaghoumi%2Fpytorch-softdtw-cuda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaghoumi%2Fpytorch-softdtw-cuda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaghoumi%2Fpytorch-softdtw-cuda/lists"}