{"id":24251339,"url":"https://github.com/powerserve-project/powerserve","last_synced_at":"2025-04-06T10:10:30.287Z","repository":{"id":272485995,"uuid":"916693314","full_name":"powerserve-project/PowerServe","owner":"powerserve-project","description":"High-speed and easy-use LLM serving framework for local deployment","archived":false,"fork":false,"pushed_at":"2025-03-18T07:55:27.000Z","size":1163,"stargazers_count":98,"open_issues_count":6,"forks_count":9,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-04-06T10:09:41.276Z","etag":null,"topics":["llama","llm","llm-inference","llm-serving","npu","qwen","smallthinker","smartphone"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/powerserve-project.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-14T15:38:50.000Z","updated_at":"2025-04-06T07:34:30.000Z","dependencies_parsed_at":"2025-01-14T19:34:34.236Z","dependency_job_id":"405c4fec-6e6c-454d-a223-115493b16abe","html_url":"https://github.com/powerserve-project/PowerServe","commit_stats":null,"previous_names":["powerserve-project/powerserve"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/powerserve-project%2FPowerServe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/powerserve-project%2FPowerServe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/powerserve-project%2FPowerServe/releases","manifests_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/powerserve-project%2FPowerServe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/powerserve-project","download_url":"https://codeload.github.com/powerserve-project/PowerServe/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247464220,"owners_count":20942970,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llama","llm","llm-inference","llm-serving","npu","qwen","smallthinker","smartphone"],"created_at":"2025-01-15T02:50:46.449Z","updated_at":"2025-04-06T10:10:30.261Z","avatar_url":"https://github.com/powerserve-project.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PowerServe\nPowerServe is a high-speed and easy-use LLM serving framework for local deployment.\n\n## Features\n- [One-click compilation and deployment](./docs/end_to_end.md)\n- NPU speculative inference support\n- Achieves 40 tokens/s running Smallthinker on mobile devices\n- Support Android and HarmonyOS NEXT\n\n## Supported Models\n\nHere's the list of models that PowerServe supports:\n\n| Model Name | Hugging Face Link | Speculation Support(Draft model) | Soc Setting | Prefill Speed (tokens/s) | Decode Speed (tokens/s) | Speculative Decode Speed (tokens/s) |\n|---|---|---|---|---|---|---|\n| smallthinker-3b | [SmallThinker-3B](https://huggingface.co/PowerServe/SmallThinker-3B-PowerServe-QNN29-8G3) | Yes(smallthinker-0.5b) | 8G3 | 975.00 | 19.71 | 38.75 |\n| llama-3.2-1b | 
[Llama-3.2-1B](https://huggingface.co/PowerServe/Llama-3.2-1B-PowerServe-QNN29-8G3) | No | 8G3 | 1876.58 | 58.99 | / |\n| llama-3.1-8b | [Llama-3.1-8B](https://huggingface.co/PowerServe/Llama-3.1-8B-PowerServe-QNN29-8G3) | Yes(llama-3.2-1b) | 8G3 | 468.35 | 12.03 | 21.02 |\n| qwen-2-0.5b | [Qwen-2-0.5B](https://huggingface.co/PowerServe/Qwen-2-0.5B-PowerServe-QNN29-8G3) | No | 8G3 | 3590.91 | 104.53 | / |\n| qwen-2.5-3b | [Qwen-2.5-3B](https://huggingface.co/PowerServe/Qwen-2.5-3B-PowerServe-QNN29-8G3) | No | 8G3 | 906.98 | 21.01 | / |\n| internlm-3-8b | [InternLM-3-8B](https://huggingface.co/PowerServe/InternLM-3-8B-PowerServe-QNN29-8G3) | No | 8G3 | TBC | TBC | / |\n| deepseek-r1-llama-8b | [DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/PowerServe/DeepSeek-R1-Distill-Llama-8B-PowerServe-QNN29-8G3/tree/main) | Yes(llama-3.2-1b) | 8G3 | TBC | TBC | / |\n| smallthinker-3b | [SmallThinker-3B](https://huggingface.co/PowerServe/SmallThinker-3B-PowerServe-QNN29-8G4) | Yes(smallthinker-0.5b) | 8G4(8Elite) | 1052.63 | 20.90 | 43.25 |\n| llama-3.2-1b | [Llama-3.2-1B](https://huggingface.co/PowerServe/Llama-3.2-1B-PowerServe-QNN29-8G4) | No | 8G4(8Elite) | 1952.38 | 59.00 | / |\n| llama-3.1-8b | [Llama-3.1-8B](https://huggingface.co/PowerServe/Llama-3.1-8B-PowerServe-QNN29-8G4) | Yes(llama-3.2-1b) | 8G4(8Elite) | 509.09 | 12.48 | 22.83 |\n| qwen-2-0.5b | [Qwen-2-0.5B](https://huggingface.co/PowerServe/Qwen-2-0.5B-PowerServe-QNN29-8G4) | No | 8G4(8Elite) | 4027.30 | 109.49 | / |\n| qwen-2.5-3b | [Qwen-2.5-3B](https://huggingface.co/PowerServe/Qwen-2.5-3B-PowerServe-QNN29-8G4) | No | 8G4(8Elite) | 981.69 | 22.19 | / |\n| internlm-3-8b | [InternLM-3-8B](https://huggingface.co/PowerServe/InternLM-3-8B-PowerServe-QNN29-8G4) | No | 8G4(8Elite) | 314.80 | 7.62 | / |\n| deepseek-r1-llama-8b | [DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/PowerServe/DeepSeek-R1-Distill-Llama-8B-PowerServe-QNN29-8G4/tree/main) | Yes(llama-3.2-1b) | 8G4(8Elite) | 336.37 | 10.21 | / 
|\n\nWe measured these speeds using the files in `./assets/prompts` as input prompts. More tests on multiple datasets will be conducted in the future.\n\n## News\n- [2025/1/14] We release PowerServe 🎉\n\n## Table of Contents\n\n1. [End to End Deployment](#end-to-end-deployment)\n2. [Prerequisites](#prerequisites)\n3. [Directory Structure](#directory-structure)\n4. [Model Preparation](#model-preparation)\n5. [Compile PowerServe](#compile-powerserve)\n6. [Prepare PowerServe Workspace](#prepare-powerserve-workspace)\n7. [Execution](#execution)\n8. [Known Issues](#known-issues)\n\n## End to End Deployment\n\nWe provide a nearly one-click [end-to-end deployment guide](./docs/end_to_end.md) that covers model downloading, compiling, deploying, and running.\n\nWhatever operating system you use, you can follow the instructions in that document to run supported models on your phone with PowerServe.\n\nFor details, please refer to [End to End Deployment](./docs/end_to_end.md).\n\n\n## Prerequisites\n\n```bash\npip install -r requirements.txt\ngit submodule update --init --recursive\n```\n\nTo deploy on aarch64 with a Qualcomm NPU using QNN, the [**NDK**](https://developer.android.google.cn/ndk/downloads) and [**QNN**](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/linux_setup.html) must be installed.\n\n```shell\nexport NDK=\u003cpath-to-ndk\u003e\nexport QNN_SDK_ROOT=\u003cpath-to-QNN\u003e\n```\n\n## Directory Structure\n\n```\npowerserve\n├── app\n├── assets               # Prompt files.\n├── CMakeLists.txt\n├── docs\n├── libs                 # External dependencies.\n├── LICENSE\n├── powerserve           # Python script to create work directory.\n├── pyproject.toml\n├── README.md\n├── requirements.txt\n├── src\n│   ├── backend          # Backend implementations, including ggml and qnn.\n│   ├── CMakeLists.txt\n│   ├── core             # Core structures used across all levels of the runtime, such as type definitions, config, tensor and buffer.\n│   ├── executor         
# Tensor execution.\n│   ├── graph            # Computing Graph.\n│   ├── model            # Various model implementations.\n│   ├── sampler          # Token sampler.\n│   ├── speculative      # Speculative decoding.\n│   ├── storage          # File loader.\n│   └── tokenizer\n├── tests\n└── tools\n    ├── add_license.py\n    ├── CMakeLists.txt\n    ├── convert_hf_to_gguf   # Convert huggingface to gguf, based on llama.cpp\n    ├── cos_sim.py\n    ├── end_to_end\n    ├── extract_embd_from_vl\n    ├── format.py\n    ├── gen_flame_graph.sh\n    ├── gguf_config_to_json  # Export config.json from gguf.\n    ├── gguf_export.py\n    ├── mmlu\n    ├── mmmu_test\n    ├── parameter_search\n    ├── qnn_converter\n    └── simple_qnn_test\n```\n\n## Model Preparation\n\nFor CPU-only execution, only `Models For CPU` is required. For NPU execution, both `Models For CPU` and `Models For NPU` are required.\n\nTaking the llama3.1-8b-instruct model as an example, the model folder is structured as follows:\n```shell\n-- models                       # Level-1 dir, where the server searches for models and the CLI searches for runtime configurations\n    -- hparams.json                 # Hyper params, containing #threads, #batch_size and sampler configurations.\n    -- workspace.json               # The definition of the model workspace structure, which determines the main model and the target model (if any).\n    -- bin                          # The binaries for execution\n        -- powerserve-config-generator\n        -- powerserve-perplexity-test\n        -- powerserve-run\n        -- powerserve-server\n    -- qnn_libs                     # Dependent libraries of QNN\n        -- libQNNSystem.so\n        -- libQNNHtp.so\n        -- libQNNHtpV79.so\n        -- libQNNHtpV79Skel.so\n        -- libQNNHtpV79Stub.so\n    -- llama3.1-8b-instruct         # The model weights of GGUF and QNN\n        -- model.json\n        -- vocab.gguf               # The vocabulary table of the model\n        -- ggml                     # GGUF model 
binaries\n            -- weights.gguf\n        -- qnn                      # QNN model binaries\n            -- kv\n                -- *.raw\n                -- ...\n            -- config.json          # The information of QNN models and QNN backend configurations\n            -- llama3_1_8b_0.bin\n            -- llama3_1_8b_1.bin\n            -- llama3_1_8b_2.bin\n            -- llama3_1_8b_3.bin\n            -- lmhead.bin\n    -- qwen2_7b_instruct            # another model\n        -- ...\n\n```\n\n### Convert Models For CPU\n\n```shell\n# Under the root directory of PowerServe\npython ./tools/gguf_export.py -m \u003chf-model\u003e -o models/llama3.1-8b-instruct\n```\n\n\n### Convert Models For NPU\n\nIf you just want to run PowerServe on CPUs, this step can be skipped. For more details, please refer to [QNN Model Conversion](./tools/qnn_converter/README.md).\n\n```shell\n# Under the root directory of PowerServe\ncd powerserve/tools/qnn_converter\n\n# This may take a long time...\npython converter.py                                 \\\n    --model-folder Llama-3.1-8B-Instruct            \\\n    --model-name llama3_1_8b                        \\\n    --system-prompt-file system_prompt_llama.txt    \\\n    --prompt-file lab_intro_llama.md                \\\n    --batch-sizes 1 128                             \\\n    --artifact-name llama3_1_8b                     \\\n    --n-model-chunk 4                               \\\n    --output-folder ./llama3.1-8b-QNN               \\\n    --build-folder ./llama3.1-8b-QNN-tmp            \\\n    --soc 8gen4\n\n```\nConvert GGUF models and integrate them with QNN models\n\nNote: this script can only create fp32 and q8_0 weights in ./llama3.1-8b-instruct-model/ggml/weights.gguf.\nIf you want to use q4_0, please run llama-quantize from llama.cpp, for example: `./build/bin/llama-quantize --pure /\u003cpath\u003e/llama3.1-fp32.gguf Q4_0`, then replace the weight file: `cp /\u003cpath\u003e/ggml-model-Q4_0.gguf 
./llama3.1-8b-instruct-model/ggml/weights.gguf`\n\n```shell\n# Under the root directory of PowerServe\npython ./tools/gguf_export.py -m \u003chf-llama3.1-model\u003e --qnn-path tools/qnn_converter/llama3.1-8b-QNN -o ./llama3.1-8b-instruct-model\n```\n\n## Compile PowerServe\n\nThe options of platform and ABI vary when deploying on different devices. DO CARE about the configuration.\n\n### Build for Linux cpu\n```shell\n# Under the root directory of PowerServe\ncmake -B build -DCMAKE_BUILD_TYPE=Release\ncmake --build build\n```\n\n### Build for Android cpu\n```shell\n# Under the root directory of PowerServe\ncmake -B build                                                      \\\n    -DCMAKE_BUILD_TYPE=Release                                      \\\n    -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \\\n    -DANDROID_ABI=arm64-v8a                                         \\\n    -DANDROID_PLATFORM=android-35                                   \\\n    -DGGML_OPENMP=OFF                                               \\\n    -DPOWERSERVE_WITH_QNN=OFF\n\ncmake --build build\n```\n\n### Build for Android qnn\n- ❗️ Because the llama3.1-8b model is too large, qnn needs to open multiple sessions when loading. We conducted tests on 4 mobile phones. 
Among them, the OnePlus 12, OnePlus 13 and Xiaomi 14 need to be updated to Android 15 to request additional sessions in non-root mode, while running in non-root mode on the HONOR Magic6 after updating to Android 15 causes an error.\n\n```shell\n# Under the root directory of PowerServe\ncmake -B build                                                      \\\n    -DCMAKE_BUILD_TYPE=Release                                      \\\n    -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \\\n    -DANDROID_ABI=arm64-v8a                                         \\\n    -DANDROID_PLATFORM=android-35                                   \\\n    -DGGML_OPENMP=OFF                                               \\\n    -DPOWERSERVE_WITH_QNN=ON\n\ncmake --build build\n```\n\n\n## Prepare PowerServe Workspace\n\n```shell\n# Under the root directory of PowerServe\nmkdir -p models\n\n# Generate PowerServe Workspace\n./powerserve create -m ./llama3.1-8b-instruct-model --exe-path ./build/out -o ./models/llama3.1-8b-instruct\n```\n\n## Execution\n\n### CLI\nFor more details, please refer to [CLI App](./app/run/README.md).\n\nFor pure CPU execution:\n```shell\n# Under the root directory of PowerServe\n./models/llama3.1-8b-instruct/bin/powerserve-run --work-folder ./models/llama3.1-8b-instruct --prompt \"Once upon a time, there was a little girl named Lucy\" --no-qnn\n```\nFor NPU execution:\n```shell\n# Under the root directory of PowerServe\nexport LD_LIBRARY_PATH=/system/lib64:/vendor/lib64 \u0026\u0026 ./models/llama3.1-8b-instruct/bin/powerserve-run --work-folder ./models/llama3.1-8b-instruct --prompt \"Once upon a time, there was a little girl named Lucy\"\n```\n\n### Server\nFor more details, please refer to [Server App](./app/server/README.md).\n```shell\n# Under the root directory of PowerServe\nexport LD_LIBRARY_PATH=/system/lib64:/vendor/lib64 \u0026\u0026 ./models/llama3.1-8b-instruct/bin/powerserve-server --work-folder ./models --host \u003cip-addr\u003e --port \u003cport\u003e\n```\n\n## 
Known Issues\n\n### Model Conversion\n\n1. **When exporting a model to ONNX**: RuntimeError: The serialized model is larger than the 2GiB limit imposed by the protobuf library. Therefore the output file must be a file path, so that the ONNX external data can be written to the same directory. Please specify the output file name.\n\n    \u003e The PyTorch version should be lower than **2.5.1**. Please reinstall PyTorch, for example:\n    \u003e ```shell\n    \u003e pip install torch==2.4.1\n    \u003e ```\n\n### Execution\n\n1. **When running inference with QNN**: Failed to open lib /vendor/lib64/libcdsprpc.so: dlopen failed: library \"/vendor/lib64/libcdsprpc.so\" needed or dlopened by \"/data/data/com.termux/files/home/workspace/qnn/llama-3.2-1b-instruct/bin/powerserve-run\" is not accessible for the namespace \"(default)\n\n    \u003e Use `export LD_LIBRARY_PATH=/system/lib64:/vendor/lib64` before executing the program.\n    \u003e\n    \u003e This is needed because `libcdsprpc.so` depends on `/system/lib64/libbinder.so` rather than `/vendor/lib64/libbinder.so`. If the linker searches `/vendor/lib64` first, it may find and link `/vendor/lib64/libbinder.so`, which does not contain the required function definitions.\n\n2. **Some mobile phones cannot run large models**: Some mobile phones cannot run larger models due to different security policies.\n\n    **Known affected phones and models are listed below:**\n\n    | Phone    | Models that cannot run |\n    |----------|------------------------|\n    | All HONOR smartphones | LLMs larger than 3B |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpowerserve-project%2Fpowerserve","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpowerserve-project%2Fpowerserve","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpowerserve-project%2Fpowerserve/lists"}