{"id":13687619,"url":"https://github.com/ELS-RD/kernl","last_synced_at":"2025-05-01T13:30:39.583Z","repository":{"id":62096183,"uuid":"521535271","full_name":"ELS-RD/kernl","owner":"ELS-RD","description":"Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.","archived":false,"fork":false,"pushed_at":"2024-02-16T00:58:35.000Z","size":2395,"stargazers_count":1564,"open_issues_count":32,"forks_count":96,"subscribers_count":27,"default_branch":"main","last_synced_at":"2025-04-24T10:52:51.595Z","etag":null,"topics":["cuda","cuda-kernel","pytorch","transformer","triton"],"latest_commit_sha":null,"homepage":"http://www.kernl.ai","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ELS-RD.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-08-05T06:52:49.000Z","updated_at":"2025-04-22T20:27:46.000Z","dependencies_parsed_at":"2024-12-14T20:01:53.680Z","dependency_job_id":"e90beb0f-3dff-485c-81dc-9616d96f1dce","html_url":"https://github.com/ELS-RD/kernl","commit_stats":{"total_commits":122,"total_committers":7,"mean_commits":"17.428571428571427","dds":0.4180327868852459,"last_synced_commit":"91e2cd92db44d503874d39a9f6dec42c9f481a8e"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ELS-RD%2Fkernl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ELS-RD%2Fkernl/tags","rele
ases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ELS-RD%2Fkernl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ELS-RD%2Fkernl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ELS-RD","download_url":"https://codeload.github.com/ELS-RD/kernl/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251881589,"owners_count":21659125,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","cuda-kernel","pytorch","transformer","triton"],"created_at":"2024-08-02T15:00:57.620Z","updated_at":"2025-05-01T13:30:39.572Z","avatar_url":"https://github.com/ELS-RD.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook"],"sub_categories":[],"readme":"![Kernl logo](./resources/images/logo-readme.svg)\n\n---\n[![Tests](https://github.com/ELS-RD/kernl/actions/workflows/test.yaml/badge.svg)](https://github.com/ELS-RD/kernl/actions/workflows/test.yaml)\n\n**Kernl lets you run Pytorch transformer models several times faster on GPU with a single line of code,** \n**and is designed to be easily hackable.**\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./resources/images/speedup.png\"\u003e\n\u003c/p\u003e\n\n*benchmarks ran on a 3090 RTX*\n\nKernl is the first OSS inference engine written in ~~CUDA C~~ [OpenAI Triton](https://openai.com/blog/triton/), \na new language designed by OpenAI to make it easier to write GPU kernels.  
\nEach kernel is less than 200 lines of code, and is **easy to understand** and modify.\n\n## Tutorials - End to End Use Cases\n\nThe examples below show how to use Kernl with PyTorch.\n\n| Topic                                                                                                         | Notebook                                                                                   |\n|---------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|\n| **Tiled matmul**: matrix multiplication implementation in `CUDA` style                                        | [link](https://github.com/ELS-RD/kernl/blob/main/tutorial/1%20-%20tiled%20matmul.ipynb)    |\n| **Matmul offsets**: detailed explanations related to a performance trick used in Triton matmul implementation | [link](https://github.com/ELS-RD/kernl/blob/main/tutorial/2%20-%20matmul%20offsets.ipynb)  |\n| **Online softmax**: parallelized softmax computation, a key ingredient of `Flash Attention`                   | [link](https://github.com/ELS-RD/kernl/blob/main/tutorial/3%20-%20online%20softmax.ipynb)  |\n| **`Flash Attention`**: attention computation without saving attention matrix to global memory                 | [link](https://github.com/ELS-RD/kernl/blob/main/tutorial/4%20-%20flash%20attention.ipynb) |\n| **XNLI classification**: classification with / without optimizations (`Roberta` + `XNLI` classification task) | [link](https://github.com/ELS-RD/kernl/blob/main/tutorial/bert%20e2e.ipynb)                |\n| **Text generation**: with/without optimizations (`T5`)                                                        | [link](https://github.com/ELS-RD/kernl/blob/main/tutorial/t5%20e2e.ipynb)                  |\n| **Transcription generation**: with/without optimizations (`Whisper`)                                          | 
[link](https://github.com/ELS-RD/kernl/blob/main/experimental/whisper/speedup.ipynb)       |\n| **Llama version 2**: optimization by kernel fusion                                                            | [link](https://github.com/ELS-RD/kernl/blob/main/experimental/llama-v2)                    |\n\n## Installation\n\n**IMPORTANT**: This package requires `pytorch` to be installed.  \nPlease install it first.\n\n```shell\npip install 'git+https://github.com/ELS-RD/kernl'\n# or for local dev, after git clone ...\npip install -e .\n```\n\nThis project requires `Python` \u003e= 3.9.\nFurthermore, the library requires an Ampere GPU and CUDA to be installed. \n\nIf you prefer `Docker`:\n\n```shell\n# build\nDOCKER_BUILDKIT=1 docker build -t kernl .\n# run\ndocker run --rm -it --gpus all -v $(pwd):/kernl kernl\n```\n\n## Getting started\n\n```python\nimport torch\nfrom transformers import AutoModel\nfrom kernl.model_optimization import optimize_model\n\nmodel = AutoModel.from_pretrained(\"model_name\").eval().cuda()\noptimize_model(model)\n\ninputs = ...\n\nwith torch.inference_mode(), torch.cuda.amp.autocast():\n    outputs = model(**inputs)\n```\n\nFor end-to-end use cases, you may want to check:\n\n* [XNLI classification with Roberta](./tutorial/bert%20e2e.ipynb)\n* [text generation with T5](./tutorial/t5%20e2e.ipynb)\n\n## Test and Benchmark\n\n### Conventions\n\n- A test function using benchmark features must have a name that starts with `test_benchmark_`\n- A benchmark function must have a parameter called `implementation` when benchmarking the same operation with different\n  strategies\n\n### Run tests and benchmarks\n\n```shell\n# tada!\npytest\n```\n\nThere are over 2K benchmarks, and they take a while to run.\n\nSome rules on how `PyTest` works, in particular for benchmarks:\n\n- add `-k` to filter tests/benchmarks by name, e.g. `pytest -k benchmark` runs only tests with `benchmark`\n  in their name\n- you can combine expressions in the filter: `pytest -k 
\"benchmark and not bert\"` if you want to run all benchmarks\n  except those related to BERT\n- to group and compare benchmark measures, use `pytest -k benchmark --benchmark-group-by ...`:\n  - grouping by names: `pytest -k benchmark --benchmark-group-by fullfunc`\n  - grouping by names of parameters: `pytest -k benchmark --benchmark-group-by param:implementation,param:shape`\n    - in `param:x`, `x` is the parameter name in `@pytest.mark.parametrize`\n  - combining both: `pytest -k benchmark --benchmark-group-by fullfunc,param:implementation`\n- add `-s` to see the output of the tests (print, etc.)\n- add `-v` to see the verbose output of the tests\n\n*WARNING*: `param:X` will make PyTest crash if `X` is not a parameter of at least one of the functions run.\n\nSome useful commands:\n\n```shell\n# only benchmarks\npytest -k benchmark\n# no benchmarks\npytest -k \"not benchmark\"\n# only linear layer benchmarks, grouped by shape and by whether the input is contiguous\npytest test/test_linear_layer.py --benchmark-group-by fullfunc,param:shape,param:contiguous\n```\n\n## Create new patterns to replace fx graph nodes\n\nThe first step to replace function/module calls in the graph is to create the pattern that will be replaced.\nThe easiest way to do this is to [convert the model to a fx graph](https://pytorch.org/docs/stable/fx.html), and then\nprint it with `utils.graph_report` or by printing the code: `print(your_graph_module.code)`.\n\nThen you can use [replace_pattern](https://pytorch.org/docs/stable/fx.html#torch.fx.replace_pattern) to replace the\npattern in the graph. We have our own version of `replace_pattern` with some enhancements to work with modules, for\nexample. You can find examples of that in the `optimizer` folder.\n\n## Code Formatting\n\nWe use `black` / `isort` / `flake8` to format the code. 
You can run them with:\n\n```shell\nmake source_code_format\nmake source_code_check_format\n```\n\n## Why?\n\nAt Lefebvre Sarrut, we run several transformers in production, some of them being latency sensitive (search and recsys mostly).\n\nWe are using OnnxRuntime and TensorRT and even created \n[transformer-deploy](https://github.com/ELS-RD/transformer-deploy), an OSS library to share our knowledge with the community.  \nRecently, we were testing generative language models, and we tried to accelerate them. It proved very difficult with traditional tools.\n\nBasically, and to make it short, it seems to us that Onnx (the main format to feed those tools) is an interesting \nformat with broad hardware support. \n\nHowever, its ecosystem (and mostly inference engines) has several limitations when we deal with new LLM architectures:\n\n* Export to Onnx is simple for models without control flow because we can rely on tracing, \n  but dynamic behaviors are harder to obtain (see https://ppwwyyxx.com/blog/2022/TorchScript-Tracing-vs-Scripting/ for \n  more info; it’s about TorchScript but applies equally to Onnx).\n* Unlike PyTorch, neither ONNX Runtime nor TensorRT has native support yet for multi-GPU tasks enabling tensor parallelism.\n* TensorRT is not able to manage 2 dynamic axes for transformer models with the same profile. \n  Because we usually want to be able to provide inputs of different lengths, we need to build 1 model per batch size.\n* Very large models are common and Onnx (as a protobuf file) has some limitations regarding its file size, \n  requiring weights to be stored outside of the model as a workaround.\n\nOne very annoying thing is that new models are never accelerated: you need to wait for someone to write custom CUDA kernels for them.\n\nThis is not to say these solutions are bad: one big strength of OnnxRuntime is its multi-hardware support.  
\nRegarding TensorRT, it’s really fast.\n\nSo we wanted something as fast as TensorRT but in Python / PyTorch; that’s why we built Kernl.\n\n## How?\n\nThe basic rule is that memory bandwidth is often the bottleneck in deep learning, so to accelerate inference, reducing memory \naccesses is usually a good strategy. \nOn short input sequences, the bottleneck is often the CPU overhead, which has to be removed too. \nCounterintuitively, to make things faster, you don’t need to be faster in computation.\n\nWe leverage mostly 3 technologies:\n\n* [OpenAI Triton](https://triton-lang.org/): it’s a language for writing GPU kernels, like CUDA (not to be confused with \n  Nvidia Triton inference server), but much more productive (at least for us). \n  The improvement comes from the fusion of several ops, which lets us chain computations without\n  saving intermediate results in GPU memory. We are using it to rewrite:\n\n  * Attention (replaced by Flash Attention),\n  * Linear layers and their activations,\n  * and finally Layernorm/Rmsnorm.\n\n* [CUDA graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/): you may have heard that Python is slow,\n  blah blah blah, and that to limit overhead C++/Rust should be the solution.\n  That is true, but better than low overhead is no overhead at all. That’s CUDA graphs!\n  During a warmup step, it saves every kernel launched and its parameters, and then, with a single GPU instruction,\n  we can replay the whole inference.\n\n* [TorchDynamo](https://github.com/pytorch/torchdynamo/): this prototype from Meta helps us to cope with dynamic\n  behavior. 
It’s described [here](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747),\n  and in a few words, during a warmup step it traces the model and provides an FX graph (a static computation graph).\n  We replace some operations of this graph with our kernels and recompile it in Python.\n  We do that for every dynamic behavior we expect to have. During inference, inputs are analyzed, and the correct\n  static graph is used. It’s really an awesome project, check their repo to know more.\n\n## Acknowledgments\n\nThe code of our OpenAI Triton kernels takes inspiration from examples in the OpenAI Triton tutorials or the xformers library.  \n\n## Contributing\n\nIf you would like to contribute, for example to code or documentation, please see our [contribution guide](https://www.kernl.ai/contribution-guide/contributing/).\n\n## Code of Conduct\n\nPlease see our [Code of Conduct](https://www.kernl.ai/contribution-guide/code-of-conduct/) for any questions about the community we are trying to build and what to do if you need help with someone who is acting unprofessionally.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FELS-RD%2Fkernl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FELS-RD%2Fkernl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FELS-RD%2Fkernl/lists"}