{"id":13625119,"url":"https://github.com/efeslab/fiddler","last_synced_at":"2025-08-21T07:14:39.772Z","repository":{"id":222217000,"uuid":"752856939","full_name":"efeslab/fiddler","owner":"efeslab","description":"[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration","archived":false,"fork":false,"pushed_at":"2024-11-18T00:25:45.000Z","size":1807,"stargazers_count":210,"open_issues_count":2,"forks_count":20,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-05-17T06:03:18.202Z","etag":null,"topics":["llm","llm-inference","local-inference","mixtral-8x7b","mixture-of-experts"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2402.07033","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/efeslab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-05T01:01:27.000Z","updated_at":"2025-05-16T06:58:58.000Z","dependencies_parsed_at":"2024-05-21T08:29:55.567Z","dependency_job_id":"81c69119-6cc9-45a1-a78c-38d5d8897019","html_url":"https://github.com/efeslab/fiddler","commit_stats":null,"previous_names":["efeslab/fiddler"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/efeslab/fiddler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/efeslab%2Ffiddler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/efeslab%2Ffiddler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/efeslab%2Ffiddler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/efeslab%2Ffiddler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/efeslab","download_url":"https://codeload.github.com/efeslab/fiddler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/efeslab%2Ffiddler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271442049,"owners_count":24760353,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-21T02:00:08.990Z","response_time":74,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","llm-inference","local-inference","mixtral-8x7b","mixture-of-experts"],"created_at":"2024-08-01T21:01:51.077Z","updated_at":"2025-08-21T07:14:39.745Z","avatar_url":"https://github.com/efeslab.png","language":"Python","funding_links":[],"categories":["llm"],"sub_categories":[],"readme":"# 🎻 Fiddler: CPU-GPU Orchestration for Fast Local Inference of MoE Models [[paper]](https://arxiv.org/abs/2402.07033)\n\n(This repository is a proof-of-concept and still under heavy construction)\n\nFiddler is a fast inference system for LLMs based on Mixture-of-Experts (MoE) architecture at local devices. 
It allows you to run **the unquantized Mixtral-8x7B model (\u003e90GB of parameters) with \u003e3 tokens/s on a single 24GB GPU**.\n\n## Update\n- [2024/02] We published an [arXiv preprint](https://arxiv.org/abs/2402.07033)\n- [2024/02] We released the repository.\n\n## Usage\n```bash\npip install -r requirements.txt\npython src/fiddler/infer.py --model \u003cpath/to/mixtral/model\u003e --input \u003cprompt\u003e\n```\n\n## Key Idea\nFiddler is an inference system to run MoE models larger than the GPU memory capacity in a local setting (i.e., latency-oriented, single batch).\nThe key idea behind Fiddler is to use the CPU’s computation power.\n\n![](./asset/key-idea.png)\n\nExisting offloading systems (e.g., [Eliseev \u0026 Mazur, 2023](https://github.com/dvmazur/mixtral-offloading)) primarily utilize the memory resources available on the CPU, while the computation mainly occurs on the GPU. The typical process is as follows: ① when some expert weights are missing from the GPU memory, ② they are copied from the CPU memory to the GPU memory, and then ③ the GPU executes the expert layer.\nAlthough GPU execution is faster, the data movement introduces significant overhead. \n\nOn the other hand, **Fiddler uses CPU computation resources in addition to memory resources**. The process is as follows: ① when some expert weights are missing from the GPU memory, ② we copy the activation values from the GPU memory to the CPU memory, instead of copying the weights. \n③ The computation of the expert layer then happens on the CPU, and ④ the output activation after the expert is copied back to the GPU.\n\nThis approach significantly reduces the latency of CPU-GPU communication because, for a small batch size, the activations are considerably smaller than the weights (`batch_size x 4096` versus `3 x 4096 x 14336` per expert for Mixtral-8x7B). Although computation is slower on the CPU than on the GPU, avoiding the weight copy makes this approach more efficient overall. 
\n\n### Motivation\nWhy is Fiddler important?\n- MoE models are showing promising performance. For instance, Mixtral-8x7B is the best open-source model at [LMSys Chatbot Arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) as of 2024/02.\n- MoE models are sparse, meaning there are fewer computations per parameter. As a result, investing in more GPUs is less cost-effective (they have high computation power but small memory), especially for local inference purposes.\n- MoE models can grow extremely large, making it even more challenging to get enough GPUs. For instance, [Switch Transformer](https://arxiv.org/abs/2101.03961) has 2,048 experts per layer and \u003e1T parameters in total.\n\nTherefore, there is a huge benefit in being able to efficiently run large MoE models with limited GPU resources.\n\nFor more technical details, please refer to our arXiv preprint.\n\n## Benchmarks\n\nWe evaluate the performance of Fiddler in two environments: (1) Quadro RTX 6000 GPU (24GB) + 48-core Intel Skylake CPU and (2) L4 GPU (24GB) + 32-core Intel Cascade Lake CPU.\nHere is the single-batch generation speed (measured in tokens/s) compared against [DeepSpeed-MII](https://github.com/microsoft/DeepSpeed-MII) and [Mixtral offloading](https://github.com/dvmazur/mixtral-offloading) (Eliseev \u0026 Mazur, 2023):\n\n![](./asset/results.png)\n\nFiddler shows **an order of magnitude speedup** over existing methods.\nCompared with DeepSpeed-MII and Mixtral offloading, Fiddler is on average 19.4 and 8.2 times faster in Environment 1, and 22.5 and 10.1 times faster in Environment 2.\n\n## Roadmap\nFiddler is a research prototype and currently supports only the 16-bit Mixtral-8x7B model.\nWe are working on supporting the following features.\n- [ ] Support for other MoE models ([DeepSeek-MoE](https://github.com/deepseek-ai/DeepSeek-MoE), [OpenMoE](https://github.com/XueFuzhao/OpenMoE), [Switch Transformer](https://huggingface.co/docs/transformers/model_doc/switch_transformers), 
etc.)\n- [ ] Support for quantized models\n- [ ] Support for AVX512_BF16\n\n## Known Limitations\nFiddler currently relies on the PyTorch implementation for expert processing on the CPU, which is slow if your CPU does not support AVX512.\n\n## Citation\nIf you use Fiddler in your research, please cite the following paper. \n```\n@misc{kamahori2024fiddler,\n      title={Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models}, \n      author={Keisuke Kamahori and Yile Gu and Kan Zhu and Baris Kasikci},\n      year={2024},\n      eprint={2402.07033},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fefeslab%2Ffiddler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fefeslab%2Ffiddler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fefeslab%2Ffiddler/lists"}