{"id":15600971,"url":"https://github.com/lucidrains/itransformer","last_synced_at":"2025-05-15T04:04:50.552Z","repository":{"id":199732519,"uuid":"703596504","full_name":"lucidrains/iTransformer","owner":"lucidrains","description":"Unofficial implementation of iTransformer - SOTA Time Series Forecasting using Attention networks, out of Tsinghua / Ant group","archived":false,"fork":false,"pushed_at":"2024-12-26T19:02:31.000Z","size":225,"stargazers_count":490,"open_issues_count":9,"forks_count":38,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-15T01:58:41.450Z","etag":null,"topics":["artificial-intelligence","attention-mechanisms","deep-learning","time-series-forecasting","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lucidrains.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-11T14:35:58.000Z","updated_at":"2025-04-14T05:06:34.000Z","dependencies_parsed_at":"2023-11-07T00:39:19.146Z","dependency_job_id":"82c397cb-57f3-4566-af53-bb6322c65990","html_url":"https://github.com/lucidrains/iTransformer","commit_stats":{"total_commits":45,"total_committers":1,"mean_commits":45.0,"dds":0.0,"last_synced_commit":"fa2773a60f7c9d087640c95ba9ba909531f70b44"},"previous_names":["lucidrains/itransformer"],"tags_count":29,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2FiTransformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidra
ins%2FiTransformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2FiTransformer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2FiTransformer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lucidrains","download_url":"https://codeload.github.com/lucidrains/iTransformer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254270641,"owners_count":22042858,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","attention-mechanisms","deep-learning","time-series-forecasting","transformers"],"created_at":"2024-10-03T02:10:39.489Z","updated_at":"2025-05-15T04:04:50.527Z","avatar_url":"https://github.com/lucidrains.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"./itransformer.png\" width=\"400px\"\u003e\u003c/img\u003e\n\n## iTransformer\n\nImplementation of \u003ca href=\"https://arxiv.org/abs/2310.06625\"\u003eiTransformer\u003c/a\u003e - SOTA Time Series Forecasting using Attention networks, out of Tsinghua / Ant group\n\nAll that remains is tabular data (xgboost still champion here) before one can truly declare \"Attention is all you need\"\n\nIn before Apple gets the authors to change the name.\n\nThe official implementation has been released \u003ca href=\"https://github.com/thuml/iTransformer\"\u003ehere\u003c/a\u003e!\n\n## Appreciation\n\n- \u003ca href=\"https://stability.ai/\"\u003eStabilityAI\u003c/a\u003e and 
\u003ca href=\"https://huggingface.co/\"\u003e🤗 Huggingface\u003c/a\u003e for the generous sponsorship, as well as my other sponsors, for affording me the independence to open source current artificial intelligence techniques.\n\n- \u003ca href=\"https://github.com/gdevos010\"\u003eGreg DeVos\u003c/a\u003e for sharing \u003ca href=\"https://github.com/lucidrains/iTransformer/issues/20\"\u003eexperiments\u003c/a\u003e he ran on `iTransformer` and some of the improvised variants\n\n## Install\n\n```bash\n$ pip install iTransformer\n```\n\n## Usage\n\n```python\nimport torch\nfrom iTransformer import iTransformer\n\n# using solar energy settings\n\nmodel = iTransformer(\n    num_variates = 137,\n    lookback_len = 96,                  # or the lookback length in the paper\n    dim = 256,                          # model dimensions\n    depth = 6,                          # depth\n    heads = 8,                          # attention heads\n    dim_head = 64,                      # head dimension\n    pred_length = (12, 24, 36, 48),     # can be one prediction, or many\n    num_tokens_per_variate = 1,         # experimental setting that projects each variate to more than one token. the idea is that the network can learn to divide up into time tokens for more granular attention across time. thanks to flash attention, you should be able to accommodate long sequence lengths just fine\n    use_reversible_instance_norm = True # use reversible instance normalization, proposed here https://openreview.net/forum?id=cGDAkQo1C0p . may be redundant given the layernorms within iTransformer (and whatever else attention learns emergently on the first layer, prior to the first layernorm). if i come across some time, i'll gather up all the statistics across variates, project them, and condition the transformer a bit further. 
that makes more sense\n)\n\ntime_series = torch.randn(2, 96, 137)  # (batch, lookback len, variates)\n\npreds = model(time_series)\n\n# preds -\u003e Dict[int, Tensor[batch, pred_length, variate]]\n#       -\u003e (12: (2, 12, 137), 24: (2, 24, 137), 36: (2, 36, 137), 48: (2, 48, 137))\n```\n\nFor an improvised version that does granular attention across time tokens (as well as the original per-variate tokens), just import `iTransformer2D` and set the additional `num_time_tokens`\n\nUpdate: It works! Thanks go out to \u003ca href=\"https://github.com/gdevos010\"\u003eGreg DeVos\u003c/a\u003e for running the experiment \u003ca href=\"https://github.com/lucidrains/iTransformer/issues/6#issuecomment-1794989685\"\u003ehere\u003c/a\u003e!\n\nUpdate 2: Got an email. Yes, you are free to write a paper on this, if the architecture holds up for your problem. I have no skin in the game\n\n```python\nimport torch\nfrom iTransformer import iTransformer2D\n\n# using solar energy settings\n\nmodel = iTransformer2D(\n    num_variates = 137,\n    num_time_tokens = 16,               # number of time tokens (patch size will be (lookback length // num_time_tokens))\n    lookback_len = 96,                  # the lookback length in the paper\n    dim = 256,                          # model dimensions\n    depth = 6,                          # depth\n    heads = 8,                          # attention heads\n    dim_head = 64,                      # head dimension\n    pred_length = (12, 24, 36, 48),     # can be one prediction, or many\n    use_reversible_instance_norm = True # use reversible instance normalization\n)\n\ntime_series = torch.randn(2, 96, 137)  # (batch, lookback len, variates)\n\npreds = model(time_series)\n\n# preds -\u003e Dict[int, Tensor[batch, pred_length, variate]]\n#       -\u003e (12: (2, 12, 137), 24: (2, 24, 137), 36: (2, 36, 137), 48: (2, 48, 137))\n```\n\n## Experimental\n\n### iTransformer with fourier tokens\n\nAn `iTransformer` but also with fourier 
tokens (the FFT of the time series is projected into tokens of its own, which are attended alongside the variate tokens and spliced out at the end)\n\n```python\nimport torch\nfrom iTransformer import iTransformerFFT\n\n# using solar energy settings\n\nmodel = iTransformerFFT(\n    num_variates = 137,\n    lookback_len = 96,                  # or the lookback length in the paper\n    dim = 256,                          # model dimensions\n    depth = 6,                          # depth\n    heads = 8,                          # attention heads\n    dim_head = 64,                      # head dimension\n    pred_length = (12, 24, 36, 48),     # can be one prediction, or many\n    num_tokens_per_variate = 1,         # experimental setting that projects each variate to more than one token. the idea is that the network can learn to divide up into time tokens for more granular attention across time. thanks to flash attention, you should be able to accommodate long sequence lengths just fine\n    use_reversible_instance_norm = True # use reversible instance normalization, proposed here https://openreview.net/forum?id=cGDAkQo1C0p . may be redundant given the layernorms within iTransformer (and whatever else attention learns emergently on the first layer, prior to the first layernorm). if i come across some time, i'll gather up all the statistics across variates, project them, and condition the transformer a bit further. 
that makes more sense\n)\n\ntime_series = torch.randn(2, 96, 137)  # (batch, lookback len, variates)\n\npreds = model(time_series)\n\n# preds -\u003e Dict[int, Tensor[batch, pred_length, variate]]\n#       -\u003e (12: (2, 12, 137), 24: (2, 24, 137), 36: (2, 36, 137), 48: (2, 48, 137))\n```\n\n## Todo\n\n- [x] beef up the transformer with latest findings\n- [x] improvise a 2d version across both variates and time\n- [x] improvise a version that includes fft tokens\n- [x] improvise a variant that uses adaptive normalization conditioned on statistics across all variates\n\n## Citation\n\n```bibtex\n@misc{liu2023itransformer,\n  title   = {iTransformer: Inverted Transformers Are Effective for Time Series Forecasting}, \n  author  = {Yong Liu and Tengge Hu and Haoran Zhang and Haixu Wu and Shiyu Wang and Lintao Ma and Mingsheng Long},\n  year    = {2023},\n  eprint  = {2310.06625},\n  archivePrefix = {arXiv},\n  primaryClass = {cs.LG}\n}\n```\n\n```bibtex\n@misc{shazeer2020glu,\n    title   = {GLU Variants Improve Transformer},\n    author  = {Noam Shazeer},\n    year    = {2020},\n    url     = {https://arxiv.org/abs/2002.05202}\n}\n```\n\n```bibtex\n@misc{burtsev2020memory,\n    title   = {Memory Transformer},\n    author  = {Mikhail S. Burtsev and Grigory V. Sapunov},\n    year    = {2020},\n    eprint  = {2006.11527},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n\n```bibtex\n@inproceedings{Darcet2023VisionTN,\n    title   = {Vision Transformers Need Registers},\n    author  = {Timoth{\\'e}e Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},\n    year    = {2023},\n    url     = {https://api.semanticscholar.org/CorpusID:263134283}\n}\n```\n\n```bibtex\n@inproceedings{dao2022flashattention,\n    title   = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},\n    author  = {Dao, Tri and Fu, Daniel Y. 
and Ermon, Stefano and Rudra, Atri and R{\\'e}, Christopher},\n    booktitle = {Advances in Neural Information Processing Systems},\n    year    = {2022}\n}\n```\n\n```bibtex\n@Article{AlphaFold2021,\n    author  = {Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and {\\v{Z}}{\\'\\i}dek, Augustin and Potapenko, Anna and Bridgland, Alex and Meyer, Clemens and Kohl, Simon A A and Ballard, Andrew J and Cowie, Andrew and Romera-Paredes, Bernardino and Nikolov, Stanislav and Jain, Rishub and Adler, Jonas and Back, Trevor and Petersen, Stig and Reiman, David and Clancy, Ellen and Zielinski, Michal and Steinegger, Martin and Pacholska, Michalina and Berghammer, Tamas and Bodenstein, Sebastian and Silver, David and Vinyals, Oriol and Senior, Andrew W and Kavukcuoglu, Koray and Kohli, Pushmeet and Hassabis, Demis},\n    journal = {Nature},\n    title   = {Highly accurate protein structure prediction with {AlphaFold}},\n    year    = {2021},\n    doi     = {10.1038/s41586-021-03819-2},\n    note    = {(Accelerated article preview)},\n}\n```\n\n```bibtex\n@inproceedings{kim2022reversible,\n    title   = {Reversible Instance Normalization for Accurate Time-Series Forecasting against Distribution Shift},\n    author  = {Taesung Kim and Jinhee Kim and Yunwon Tae and Cheonbok Park and Jang-Ho Choi and Jaegul Choo},\n    booktitle = {International Conference on Learning Representations},\n    year    = {2022},\n    url     = {https://openreview.net/forum?id=cGDAkQo1C0p}\n}\n```\n\n```bibtex\n@inproceedings{Katsch2023GateLoopFD,\n    title   = {GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling},\n    author  = {Tobias Katsch},\n    year    = {2023},\n    url     = {https://api.semanticscholar.org/CorpusID:265018962}\n}\n```\n\n```bibtex\n@article{Zhou2024ValueRL,\n    title   = {Value Residual Learning For Alleviating Attention Concentration In 
Transformers},\n    author  = {Zhanchao Zhou and Tianyi Wu and Zhiyun Jiang and Zhenzhong Lan},\n    journal = {ArXiv},\n    year    = {2024},\n    volume  = {abs/2410.17897},\n    url     = {https://api.semanticscholar.org/CorpusID:273532030}\n}\n```\n\n```bibtex\n@article{Zhu2024HyperConnections,\n    title   = {Hyper-Connections},\n    author  = {Defa Zhu and Hongzhi Huang and Zihao Huang and Yutao Zeng and Yunyao Mao and Banggu Wu and Qiyang Min and Xun Zhou},\n    journal = {ArXiv},\n    year    = {2024},\n    volume  = {abs/2409.19606},\n    url     = {https://api.semanticscholar.org/CorpusID:272987528}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fitransformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucidrains%2Fitransformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fitransformer/lists"}