{"id":13497020,"url":"https://github.com/Noeda/rllama","last_synced_at":"2025-03-28T21:31:57.786Z","repository":{"id":142670303,"uuid":"612538130","full_name":"Noeda/rllama","owner":"Noeda","description":"Rust+OpenCL+AVX2 implementation of LLaMA inference code","archived":false,"fork":false,"pushed_at":"2024-02-12T00:45:39.000Z","size":1728,"stargazers_count":537,"open_issues_count":12,"forks_count":29,"subscribers_count":9,"default_branch":"master","last_synced_at":"2024-10-29T10:08:22.618Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Noeda.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-11T08:41:15.000Z","updated_at":"2024-10-23T23:34:58.000Z","dependencies_parsed_at":"2024-01-14T07:06:20.807Z","dependency_job_id":"e6491a2a-8162-4526-af8a-704553c6a562","html_url":"https://github.com/Noeda/rllama","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Noeda%2Frllama","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Noeda%2Frllama/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Noeda%2Frllama/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Noeda%2Frllama/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Noeda","download_url":"https://codeload.github.com/Noeda/rllama/tar.gz/refs/heads/master","host":{
"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246063772,"owners_count":20717880,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T20:00:21.062Z","updated_at":"2025-03-28T21:31:57.378Z","avatar_url":"https://github.com/Noeda.png","language":"Rust","funding_links":[],"categories":["Core Libraries","Rust","Summary","Frameworks","Machine Learning","Model Inference"],"sub_categories":[],"readme":"# RLLaMA\n\nRLLaMA is a pure Rust implementation of [LLaMA large language model inference.](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/).\n\n## Supported features\n\n  * Uses either `f16` and `f32` weights.\n  * LLaMA-7B, LLaMA-13B, LLaMA-30B, LLaMA-65B all confirmed working\n  * Hand-optimized AVX2 implementation\n  * OpenCL support for GPU inference.\n  * Load model only partially to GPU with `--percentage-to-gpu` command line switch to run hybrid-GPU-CPU inference.\n  * Simple HTTP API support, with the possibility of doing token sampling on\n    client side\n  * It can load `Vicuna-13B` instruct-finetuned model (although currently there is no nice UX).\n\n## Performance\n\nThe current performance is as follows:\n\n```\nPure Rust implementations:\n\nLLaMA-7B:  AMD Ryzen 3950X:                       552ms / token     f16    (pure Rust)\nLLaMA-7B:  AMD Ryzen 3950X:                       1008ms / token    f32    (pure Rust)\nLLaMA-13B: AMD Ryzen 3950X:                       1029ms / token    f16    (pure Rust)\nLLaMA-13B: AMD Ryzen 3950X:                       1930ms / token    f32    (pure Rust)\nLLaMA-30B: 
AMD Ryzen 5950X:                       2112ms / token    f16    (pure Rust)\nLLaMA-65B: AMD Ryzen 5950X:                       4186ms / token    f16    (pure Rust)\n\nOpenCL (all use f16):\n\nLLaMA-7B:  AMD Ryzen 3950X + OpenCL RTX 3090 Ti:  216ms / token            (OpenCL on GPU)\nLLaMA-7B:  AMD Ryzen 3950X + OpenCL Ryzen 3950X:  680ms / token            (OpenCL on CPU)\nLLaMA-13B: AMD Ryzen 3950X + OpenCL RTX 3090 Ti:  420ms / token            (OpenCL on GPU)\nLLaMA-13B: AMD Ryzen 3950X + OpenCL Ryzen 3950X:  1232ms / token           (OpenCL on CPU)\nLLaMA-30B: AMD Ryzen 5950X + OpenCL Ryzen 5950X:  4098ms / token           (OpenCL on CPU)\n```\n\nScroll to the bottom of this README.md to see benchmarks over time.\n\n## Screenshot\n\n![Screenshot of RLLaMA in action](rllama.gif)\n\n## Install\n\nYou can install with the `cargo` tool. RLLaMA uses intrinsics extensively and you\nlikely need to enable them to install the executable.\n\n```\nRUSTFLAGS=\"-C target-feature=+sse2,+avx,+fma,+avx2\" cargo install rllama\n```\n\nThere is a `.cargo/config.toml` inside this repository that will enable these\nfeatures if you install manually from this Git repository instead.\n\n## Install (Docker path)\n\nThere is a Dockerfile you can use if you'd rather just get started quickly and\nyou are familiar with `docker`. 
You still need to download the models yourself.\n\n\n### For CPU-only docker support:\n```\ndocker build -f ./.docker/cpu.dockerfile -t rllama .\n```\n\n```\ndocker run -v /models/LLaMA:/models:z -it rllama \\\n    rllama --model-path /models/7B \\\n           --param-path /models/7B/params.json \\\n           --tokenizer-path /models/tokenizer.model \\\n           --prompt \"hi I like cheese\"\n```\n\nReplace `/models/LLaMA` with the directory you've downloaded your models to.\nThe `:z` in the `-v` flag may or may not be needed depending on your distribution\n(I needed it on Fedora Linux).\n\n### For GPU-enabled docker support with nvidia:\nFollow the instructions [here](.docker/nvidia.md).\n\n## LLaMA weights\n\nRefer to https://github.com/facebookresearch/llama/ for the weights. As of now, you need to be\napproved to get them.\n\nFor LLaMA-7B, make sure you have these files:\n\n```shell\n* 7B/consolidated.00.pth\n* 7B/params.json\n* tokenizer.model\n```\n\nThe `consolidated.00.pth` is actually a zip file. You need to unzip it:\n\n```shell\n$ cd 7B\n$ unzip consolidated.00.pth\n$ mv consolidated consolidated.00\n```\n\nIf you are using a larger model like LLaMA-13B, then you can skip the last step\nof renaming the `consolidated` directory.\n\nYou should now be ready to generate some text.\n\n## Example\n\nRun LLaMA-7B with some weights cast to 16-bit floats:\n\n```shell\nrllama --tokenizer-path /path/to/tokenizer.model \\\n       --model-path /path/to/LLaMA/7B \\\n       --param-path /path/to/LLaMA/7B/params.json \\\n       --f16 \\\n       --prompt \"The meaning of life is\"\n```\n\nUse `rllama --help` to see all the options.\n\n## Partially load model to GPU\n\n`rllama` can load only some of the transformer blocks to GPU. There is a\ncommand line argument:\n\n`--percentage-to-gpu \u003cvalue between 0 and 1, defaults to 1\u003e`\n\n1 means 100% and 0 means 0%. 
Values in between load the model partially to GPU.\n\nYou can use this to load LLaMA-13B or Vicuna-13B on a consumer GPU with 24\ngigabytes of memory at around `--percentage-to-gpu 0.9`; higher values fail with an\nout-of-memory error (assuming no other programs on the computer are using GPU memory).\n\n## Interactive mode\n\nThere is a simple experimental interactive mode to try to force a type of\nback-and-forth discussion with the model.\n\n```shell\n# The three --interactive-* flags below are optional.\nrllama ... --start-interactive \\\n           --interactive-system-prompt \"Helpful assistant helps curious human.\" \\\n           --interactive-prompt-postfix \" ###Assistant:\" \\\n           --interactive-stop \"###Human: \"\n```\n\nIn this mode, you need to type your prompt before the AI starts doing its work.\nIf the AI outputs the token sequence given in `--interactive-stop` (defaults to\n`###Human:`) then it will ask for another input.\n\nThe defaults match the Vicuna-13B model:\n\n```\n  --interactive-system-prompt    \"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\"\n  --interactive-prompt-postfix   \" ###Assissant:\"\n  --interactive-prompt-prefix    \" \"\n  --interactive-stop             \"###Human:\"\n```\n\n`--interactive-prompt-postfix` is appended automatically to your typed text and\n`--interactive-prompt-prefix` is prepended to it. Here\nis an example of interactive mode command line with the default settings:\n\n```shell\nrllama --f16 \\\n       --param-path /models/vicuna13b/params.json \\\n       --model-path /models/vicuna13b \\\n       --tokenizer-path /stonks/LLaMA/tokenizer.model \\\n       --start-interactive\n```\n\nAs of this writing, the output is not formatted prettily for chat and there\nis no visual indication of when you are supposed to be typing. 
That will come\nlater.\n\n## Inference server\n\n`rllama` can run in an inference server mode with a simple HTTP JSON API. You\nneed to enable the `server` feature for this.\n\n```\ncargo build --release --features server\n```\n\nThe command line flags for this are:\n\n  * `--inference-server` turns on the inference server.\n  * `--inference-server-port` sets the port. Default port is 8080.\n  * `--inference-server-host` sets the host. The default host is 127.0.0.1.\n  * `--inference-server-max-concurrent-inferences` sets how many concurrent\n    requests are allowed to be actively doing inference at the same time. The\n    default is 5.\n  * `--inference-server-api-path` sets which path serves the API requests. The\n    default path is `/rllama/v1/inference`\n  * `--inference-server-prompt-cache-size` sets how many previous prompt\n    calculations should be cached. Default is 50. This speeds up token\n    generation for prompts that were already requested before, but it also\n    increases memory use as the cache fills up.\n  * `--inference-server-exit-after-one-query` will make the server exit with\n    exit code 0 after it has served one HTTP query. This is used for\n    troubleshooting and experiments.\n\nPrompts and flags related to token sampling are all ignored in inference server\nmode. 
Instead, they are obtained from each HTTP JSON API request.\n\n### Inference server API\n\nThere is an `examples/api_hello_world.py` for a minimal API use example.\n\n```\nPOST /rllama/v1/inference\n```\n\nExpects a JSON body and `Accept: application/json` or `Accept: text/jsonl`.\n\nThe expected JSON is as follows:\n\n```\n  {\n     \"temperature\":        \u003cnumber, optional\u003e\n     \"top_k\":              \u003cinteger, optional, default 20\u003e\n     \"top_p\":              \u003cnumber, optional, default: 1.0\u003e\n     \"repetition_penalty\": \u003cnumber, optional, default: 1.0\u003e\n     \"stop_at_end_token\":  \u003cbool, optional, default: true\u003e\n     \"max_seq_len\":        \u003cinteger, optional, default: 1024. Clamped to\n                            be at highest the same as --max-seq-len command line option.\u003e\n     \"max_new_tokens\":     \u003cinteger, optional, default: 1024\u003e\n     \"no_token_sampling\":  \u003cbool, optional, default: false\u003e\n     \"prompt\":             \u003cstring, required\u003e\n  }\n```\n\nThe form of the response depends on whether `no_token_sampling` is set to true or false. The\nresponse is in JSONL, i.e. multiple JSON dictionaries, separated by newlines.\n\n`no_token_sampling` can turn off `rllama`'s own token sampling. In this case,\nthe probabilities for every token are returned instead.\n\nWhen no\\_token\\_sampling = false:\n\n```\n{\u003ctoken string\u003e: {\"p\": \u003cnumber\u003e, \"is_end_token\": bool, might not be present}}\n```\n\n  * `token` contains the new token to be appended to output. It does not\n    include the string you fed to the system originally.\n  * `p` is the probability that this token was chosen. For example, if this\n    value is 0.1, it means that this particular token had a 10% chance of being\n    selected with the current token sampling settings.\n  * `is_end_token` is `true` if the given token signifies the end of output. 
This\n    field is not present otherwise.\n\nWhen no\\_token\\_sampling = true:\n\n```\n{\u003ctoken string\u003e: {\"p\": \u003cnumber\u003e, \"is_end_token\": bool, might not be present} \\\n,\u003ctoken string\u003e: {\"p\": \u003cnumber\u003e, \"is_end_token\": bool, might not be present} \\\n,...}\n```\n\nIf you want to implement your own token sampling, you may want to set\n`max_new_tokens=1` and `stop_at_end_token=false` to suppress rllama's own\nsampling behavior entirely.\n\n`rllama` internally caches recently queried prompts and the intermediate\ncomputations so that it's able to continue off quickly if you issue a query\nthat is either the same as a previous query or a continuation of one.\n\n## How to turn on OpenCL\n\nUse the `opencl` Cargo feature.\n\n```\nRUSTFLAGS=\"-C target-feature=+sse2,+avx,+fma,+avx2\" cargo install rllama --features opencl\n```\n\n```\nrllama --tokenizer-path /path/to/tokenizer.model \\\n       --model-path /path/to/LLaMA/7B \\\n       --param-path /path/to/LLaMA/7B/params.json \\\n       --opencl-device 0 \\\n       --prompt \"The meaning of life is\"\n```\n\nWith the `opencl` feature, there is another argument, `--opencl-device`, that\ntakes a number. That number selects the Nth OpenCL device found on the system. You\ncan see the devices in the output when you run the program (e.g. see the\nscreenshot above).\n\nWeights are always cast to 16-bit floats for OpenCL.\n\n## Notes and future plans\n\nThis is a hobby thing for me so don't expect updates or help.\n\n* There are various BLAS libraries like CLBlast to speed up matrix\n  multiplication that probably outperform my handwritten code.\n* I've heard there is something called Tensor Cores on nVidia GPUs. Not\n  accessible with OpenCL. But might be accessible on Vulkan with an\n  extension. 
Or with cuBLAS.\n\n## Benchmarks\n\nI track these numbers to check that I'm making this faster and not slower.\n\nFor 50-length sequence generation:\n\n```\ncargo run --release -- \\\n          --model-path /LLaMA/13B \\\n          --param-path /LLaMA/13B/params.json \\\n          --tokenizer-path /LLaMA/tokenizer.model \\\n          --prompt \"Computers are pretty complica\" --max-seq-len 50\n\n# commit c9c861d199bd2d87d7e883e3087661c1e287f6c4  (13 March 2023)\n\nLLaMA-7B:  AMD Ryzen 3950X: 1058ms / token\nLLaMA-13B: AMD Ryzen 3950X: 2005ms / token\n\n# commit 63d27dba9091823f8ba11a270ab5790d6f597311  (13 March 2023)\n# This one has one part of the transformer moved to GPU as a type of smoke test\n\nLLaMA-7B:  AMD Ryzen 3950X + OpenCL RTX 3090 Ti:  567ms / token\nLLaMA-7B:  AMD Ryzen 3950X + OpenCL Ryzen 3950X:  956ms / token\nLLaMA-13B: AMD Ryzen 3950X + OpenCL RTX 3090 Ti:  987ms / token\nLLaMA-13B: AMD Ryzen 3950X + OpenCL Ryzen 3950X:  1706ms / token\n\n# commit 35b0c372a87192761e17beb421699ea5ad4ac1ce  (13 March 2023)\n# I moved some attention stuff to OpenCL too.\n\nLLaMA-7B:  AMD Ryzen 3950X + OpenCL RTX 3090 Ti:  283ms / token\nLLaMA-7B:  AMD Ryzen 3950X + OpenCL Ryzen 3950X:  679ms / token\nLLaMA-13B: AMD Ryzen 3950X + OpenCL RTX 3090 Ti:  \u003cran out of GPU memory\u003e\nLLaMA-13B: AMD Ryzen 3950X + OpenCL Ryzen 3950X:  1226ms / token\n\n# commit de5dd592777b3a4f5a9e8c93c8aeef25b9294364  (15 March 2023)\n# The matrix multiplication on GPU is now much faster. 
It didn't have much\n# effect overall, but I got a modest improvement on LLaMA-7B GPU.\n\nLLaMA-7B:  AMD Ryzen 3950X + OpenCL RTX 3090 Ti:  247ms / token\nLLaMA-7B:  AMD Ryzen 3950X + OpenCL Ryzen 3950X:  680ms / token\nLLaMA-13B: AMD Ryzen 3950X + OpenCL RTX 3090 Ti:  \u003cran out of GPU memory\u003e\nLLaMA-13B: AMD Ryzen 3950X + OpenCL Ryzen 3950X:  1232ms / token\nLLaMA-30B: AMD Ryzen 5950X + OpenCL Ryzen 5950X:  4098ms / token\n\n# commit 3d0afcf24309f28ec540ed7645c35400a865ad6f  (17 March 2023)\n# I've been focusing on making the ordinary non-OpenCL CPU implementation\n# faster and I got some gains, most importantly from multithreading.\n# There is Float16 support now, so I've added f16/f32 to these tables:\n#\n# I also managed to run LLaMA-65B for the first time.\n\nLLaMA-7B:  AMD Ryzen 3950X: 552ms / token     f16\nLLaMA-7B:  AMD Ryzen 3950X: 1008ms / token    f32\nLLaMA-13B: AMD Ryzen 3950X: 1029ms / token    f16\nLLaMA-13B: AMD Ryzen 3950X: 1930ms / token    f32\nLLaMA-30B: AMD Ryzen 5950X: 2112ms / token    f16\nLLaMA-65B: AMD Ryzen 5950X: 4186ms / token    f16\n\n# commit f5328ab5bd62fe9bd930539382b13e9033434a0b (5 April 2023)\n# I've worked on making Vicuna-13B runnable and added an option to only\n# partially use GPU. Improved one of the OpenCL kernels:\n\nLLaMA-7B:   AMD Ryzen 3950X + OpenCL RTX 3090 Ti:    216ms (at 100% GPU)\nLLaMA-13B:  AMD Ryzen 3950X + OpenCL RTX 3090 Ti:    420ms (at 90%/10% GPU/CPU split)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNoeda%2Frllama","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FNoeda%2Frllama","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNoeda%2Frllama/lists"}