{"id":13907437,"url":"https://github.com/jankais3r/LLaMA_MPS","last_synced_at":"2025-07-18T05:31:48.367Z","repository":{"id":135434575,"uuid":"611156003","full_name":"jankais3r/LLaMA_MPS","owner":"jankais3r","description":"Run LLaMA (and Stanford-Alpaca) inference on Apple Silicon GPUs.","archived":true,"fork":false,"pushed_at":"2023-03-25T00:46:31.000Z","size":9867,"stargazers_count":582,"open_issues_count":9,"forks_count":47,"subscribers_count":16,"default_branch":"main","last_synced_at":"2024-10-18T21:59:19.420Z","etag":null,"topics":["alpaca","apple-silicon","chat","chatbot","chatgpt","llama","llms","macos","metal","ml","mps","stanford-alpaca","torch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jankais3r.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-08T08:30:47.000Z","updated_at":"2024-10-11T15:51:31.000Z","dependencies_parsed_at":"2024-05-01T23:16:07.338Z","dependency_job_id":"c755eb35-790e-4956-b8e2-7b2f55350875","html_url":"https://github.com/jankais3r/LLaMA_MPS","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jankais3r%2FLLaMA_MPS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jankais3r%2FLLaMA_MPS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jankais3r%2FLLaMA_MPS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jankais3r%2FLL
aMA_MPS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jankais3r","download_url":"https://codeload.github.com/jankais3r/LLaMA_MPS/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226353694,"owners_count":17611748,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alpaca","apple-silicon","chat","chatbot","chatgpt","llama","llms","macos","metal","ml","mps","stanford-alpaca","torch"],"created_at":"2024-08-06T23:01:56.334Z","updated_at":"2024-11-25T15:31:55.659Z","avatar_url":"https://github.com/jankais3r.png","language":"Python","funding_links":[],"categories":["HarmonyOS"],"sub_categories":["Windows Manager"],"readme":"# LLaMA_MPS\nRun LLaMA (and Stanford-Alpaca) inference on Apple Silicon GPUs.\n\n![Demo](demo.gif)\n\nAs you can see, unlike other LLMs, LLaMA is not biased in any way 😄\n\n### Initial setup steps\n\n**1. Clone this repo**\n\n`git clone https://github.com/jankais3r/LLaMA_MPS`\n\n**2. Install Python dependencies**\n\n```bash\ncd LLaMA_MPS\npip3 install virtualenv\npython3 -m venv env\nsource env/bin/activate\npip3 install -r requirements.txt\npip3 install -e .\n```\n\n### LLaMA-specific setup\n\n**3. [Download the model weights](https://github.com/facebookresearch/llama/pull/73/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R4) and put them into a folder called** `models` (e.g., `LLaMA_MPS/models/7B`)\n\n**4. 
_(Optional)_ Reshard the model weights (13B/30B/65B)**\n\nSince we are running the inference on a single GPU, we need to merge the larger models' weights into a single file.\n\n```bash\nmv models/13B models/13B_orig\nmkdir models/13B\npython3 reshard.py 1 models/13B_orig models/13B\n```\n\n**5. Run the inference**\n\n`python3 chat.py --ckpt_dir models/13B --tokenizer_path models/tokenizer.model --max_batch_size 8 --max_seq_len 256`\n\nThe above steps will let you run inference on the raw LLaMA model in an 'auto-complete' mode.\n\nIf you would like to try the 'instruction-response' mode similar to ChatGPT using the fine-tuned weights of [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca), continue the setup with the following steps:\n\n### Alpaca-specific setup\n\n![Alpaca demo](alpaca.gif)\n\n**3. Download the fine-tuned weights (available for 7B/13B)**\n\n```bash\npython3 export_state_dict_checkpoint.py 7B\npython3 clean_hf_cache.py\n```\n\n**4. Run the inference**\n\n`python3 chat.py --ckpt_dir models/7B-alpaca --tokenizer_path models/tokenizer.model --max_batch_size 8 --max_seq_len 256`\n\n### Memory requirements\n\n| Model | Starting memory during inference | Peak memory during checkpoint conversion | Peak memory during resharding |\n| ------------- | ------------- | ------------- | ------------- |\n| 7B | 16 GB | 14 GB | N/A |\n| 13B | 32 GB | 37 GB | 45 GB |\n| 30B | 66 GB | 76 GB | 125 GB |\n| 65B | ?? GB | ?? GB | ?? GB |\n\n**Min specs per model (slow due to swapping):**\n\n* 7B - 16 GB RAM\n* 13B - 32 GB RAM\n* 30B - 64 GB RAM\n* 65B - needs testing\n\n**Recommended specs per model:**\n\n* 7B - 24 GB RAM\n* 13B - 48 GB RAM\n* 30B - 96 GB RAM\n* 65B - needs testing\n\n### Parameters to experiment with\n**- max_batch_size**\n\nIf you have spare memory (e.g., when running the 13B model on a 64 GB Mac), you can increase the batch size by using the `--max_batch_size=32` argument. 
Default value is `1`.\n\n**- max_seq_len**\n\nTo increase/decrease the maximum length of generated text, use the `--max_seq_len=256` argument. Default value is `512`.\n\n**- use_repetition_penalty**\n\nThe example script penalizes the model for generating repetitive content. This should lead to higher-quality output, but it slightly slows down the inference. Run the script with the `--use_repetition_penalty=False` argument to disable the penalty algorithm.\n\n### Alternatives\n\nThe best alternative to LLaMA_MPS for Apple Silicon users is [llama.cpp](https://github.com/ggerganov/llama.cpp), which is a C/C++ re-implementation that runs the inference purely on the CPU part of the SoC. Because compiled C code is so much faster than Python, it can actually beat this MPS implementation in speed, but at the cost of much worse power and heat efficiency.\n\nSee the comparison below when deciding which implementation better fits your use case.\n\n| Implementation | Total run time - 256 tokens | Tokens/s | Peak memory use | Peak SoC temperature | Peak SoC power consumption | Tokens per 1 Wh |\n| -------------- | ------------------------------- | ----------------------------- | ------------- | ------------------------- | ------------------------------ | --------------------------- |\n| LLaMA_MPS (13B fp16) | 75 s | 3.41 | 30 GB | 79 °C | 10 W | 1,228.80 |\n| llama.cpp (13B fp16) | 70 s | 3.66 | 25 GB | 106 °C | 35 W | 376.16 |\n\n### Credits\n\n- facebookresearch ([original code](https://github.com/facebookresearch/llama))\n- markasoftware ([cpu optimizations](https://github.com/markasoftware/llama-cpu))\n- remixer-dec ([mps optimizations](https://github.com/remixer-dec/llama-mps))\n- venuatu ([continuous token printing](https://github.com/venuatu/llama/commit/25c84973f71877677547453dab77eeaea9a86376) / [loading optimizations](https://github.com/venuatu/llama/commit/0d2bb5a552114b69db588175edd3e55303f029be))\n- benob ([reshard 
script](https://gist.github.com/benob/4850a0210b01672175942203aa36d300))\n- tloen ([repetition penalty](https://github.com/tloen/llama-int8) / [LoRA merge script](https://github.com/tloen/alpaca-lora/blob/main/export_state_dict_checkpoint.py))\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjankais3r%2FLLaMA_MPS","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjankais3r%2FLLaMA_MPS","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjankais3r%2FLLaMA_MPS/lists"}