{"id":22943705,"url":"https://github.com/andrewkchan/yalm","last_synced_at":"2025-04-12T22:31:40.899Z","repository":{"id":266436767,"uuid":"866356299","full_name":"andrewkchan/yalm","owner":"andrewkchan","description":"Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O","archived":false,"fork":false,"pushed_at":"2025-01-15T07:22:42.000Z","size":405,"stargazers_count":279,"open_issues_count":1,"forks_count":28,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-04-04T03:03:32.542Z","etag":null,"topics":["cpp","cuda","inference-engine","llama","llamacpp","llm","llm-inference","machine-learning","mistral"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/andrewkchan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-02T05:28:08.000Z","updated_at":"2025-04-01T17:28:51.000Z","dependencies_parsed_at":"2024-12-04T09:29:55.620Z","dependency_job_id":"d78e53a6-ae78-4f0b-bbe4-c4915069da0a","html_url":"https://github.com/andrewkchan/yalm","commit_stats":null,"previous_names":["andrewkchan/yalm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrewkchan%2Fyalm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrewkchan%2Fyalm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrewkchan%2Fyalm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrewkchan%2Fyalm/manif
ests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/andrewkchan","download_url":"https://codeload.github.com/andrewkchan/yalm/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248640204,"owners_count":21137996,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","cuda","inference-engine","llama","llamacpp","llm","llm-inference","machine-learning","mistral"],"created_at":"2024-12-14T14:14:01.737Z","updated_at":"2025-04-12T22:31:40.876Z","avatar_url":"https://github.com/andrewkchan.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"yalm (Yet Another Language Model) is an LLM inference implementation in C++/CUDA, using no libraries except to load and save frozen LLM weights.\n- This project is intended as an **educational exercise** in performance engineering and LLM inference implementation. \n- The codebase therefore emphasizes documentation, whether external or in comments, scientific understanding of optimizations, and readability where possible. \n- It is not meant to be run in production. See [limitations](#limitations) section at bottom.\n- See my blog post [Fast LLM Inference From Scratch](https://andrewkchan.dev/posts/yalm.html) for more.\n\nLatest benchmarks with Mistral-7B-Instruct-v0.2 in FP16 with 4k context, on RTX 4090 + EPYC 7702P:\n\n| Engine      | Avg. throughput (~120 tokens) tok/s | Avg. 
throughput (~4800 tokens) tok/s |\n| ----------- | ----------- | ----------- |\n| huggingface transformers, GPU | 25.9 | 25.7 |\n| llama.cpp, GPU | 61.0 | 58.8 |\n| calm, GPU | 66.0 | 65.7 |\n| yalm, GPU | 63.8 | 58.7 |\n\n# Instructions\n\nyalm requires a computer with a C++20-compatible compiler and the CUDA toolkit (including `nvcc`) to be installed. You'll also need a directory containing LLM safetensor weights and configuration files in huggingface format, which you'll need to convert into a `.yalm` file. Follow the below to download Mistral-7B-v0.2, build `yalm`, and run it:\n\n```\n# install git LFS\ncurl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash\nsudo apt-get -y install git-lfs\n# download Mistral\ngit clone git@hf.co:mistralai/Mistral-7B-Instruct-v0.2\n# clone this repository\ngit clone git@github.com:andrewkchan/yalm.git\n\ncd yalm\npip install -r requirements.txt\npython convert.py --dtype fp16 mistral-7b-instruct-fp16.yalm ../Mistral-7B-Instruct-v0.2/\n./build/main mistral-7b-instruct-fp16.yalm -i \"What is a large language model?\" -m c\n```\n\n# Usage\n\nSee the CLI help documentation below for `./build/main`:\n\n```\nUsage:   main \u003ccheckpoint\u003e [options]\nExample: main model.yalm -i \"Q: What is the meaning of life?\" -m c\nOptions:\n  -h Display this help message\n  -d [cpu,cuda] which device to use (default - cuda)\n  -m [completion,passkey,perplexity] which mode to run in (default - completion)\n  -T \u003cint\u003e sliding window context length (0 - max)\n\nPerplexity mode options:\n  Choose one:\n    -i \u003cstring\u003e input prompt\n    -f \u003cfilepath\u003e input file with prompt\nCompletion mode options:\n  -n \u003cint\u003e    number of steps to run for in completion mode, default 256. 
0 = max_seq_len, -1 = infinite\n  -t \u003cfloat\u003e temperature (default - 1.0)\n  Choose one:\n    -i \u003cstring\u003e input prompt\n    -f \u003cfilepath\u003e input file with prompt\nPasskey mode options:\n  -n \u003cint\u003e    number of junk lines to insert (default - 250)\n  -l \u003cint\u003e    passkey position (-1 - random)\n```\n\n# Tests and benchmarks\n\nyalm comes with a basic test suite that checks implementations of attention, matrix multiplications, feedforward nets in the CPU and GPU backends. Build and run it like so:\n\n```\nmake test\n./build/test\n```\n\nThe test binary also includes benchmarks for individual kernels (useful for profiling with `ncu`) and broader system tools such as 2 benchmarks to determine main memory bandwidth:\n\n```\n# Memory benchmarks\n./build/test -b\n./build/test -b2\n\n# Kernel benchmarks\n./build/test -k [matmul,mha,ffn]\n```\n\n# Limitations\n\n- Only completions may be performed (in addition to some testing modes like computing perplexity on a prompt or performing a [passkey test](https://github.com/ggerganov/llama.cpp/pull/3856)). Chat interface has not been implemented.\n- An NVIDIA GPU is required.\n- The GPU backend only works with a single GPU and the entire model must fit into VRAM.\n- As of Dec 31, 2024 only the following models have been tested:\n  - Mistral-v0.2 \n  - Mixtral-v0.1 (CPU only)\n  - Llama-3.2\n\n# Acknowledgements\n\n- [calm](https://github.com/zeux/calm) - Much of my implementation is inspired by Arseny Kapoulkine’s inference engine. 
In a way, this project was kicked off by “understand calm and what makes it so fast.” I’ve tried to keep my code more readable for myself though, and to understand each optimization scientifically as far as possible, which means forgoing some advanced techniques used in calm, such as dynamic parallelism.\n- [llama2.c](https://github.com/karpathy/llama2.c) - Parts of the CPU backend come from Andrej Karpathy’s excellent C implementation of Llama inference.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandrewkchan%2Fyalm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandrewkchan%2Fyalm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandrewkchan%2Fyalm/lists"}