{"id":14964802,"url":"https://github.com/intel/neural-speed","last_synced_at":"2025-10-25T10:30:45.589Z","repository":{"id":208574992,"uuid":"720968026","full_name":"intel/neural-speed","owner":"intel","description":"An innovative library for efficient LLM inference via low-bit quantization","archived":true,"fork":false,"pushed_at":"2024-08-30T22:53:13.000Z","size":17002,"stargazers_count":352,"open_issues_count":26,"forks_count":38,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-01-20T21:12:48.572Z","etag":null,"topics":["cpu","fp4","fp8","gaudi2","gpu","int1","int2","int3","int4","int5","int6","int7","int8","llamacpp","llm-fine-tuning","llm-inference","low-bit","mxformat","nf4","sparsity"],"latest_commit_sha":null,"homepage":"https://github.com/intel/neural-speed","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/intel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"security.md","support":"docs/supported_models.md","governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-11-20T04:33:41.000Z","updated_at":"2025-01-04T02:30:58.000Z","dependencies_parsed_at":"2024-01-29T09:11:03.498Z","dependency_job_id":"1c8dee8a-b758-4503-816d-d22e154ba89b","html_url":"https://github.com/intel/neural-speed","commit_stats":null,"previous_names":["intel/neural-speed"],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/intel%2Fneural-speed","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/intel%2Fneural-speed/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/intel%2Fneural-speed/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/intel%2Fneural-speed/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/intel","download_url":"https://codeload.github.com/intel/neural-speed/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238120374,"owners_count":19419761,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpu","fp4","fp8","gaudi2","gpu","int1","int2","int3","int4","int5","int6","int7","int8","llamacpp","llm-fine-tuning","llm-inference","low-bit","mxformat","nf4","sparsity"],"created_at":"2024-09-24T13:33:48.322Z","updated_at":"2025-10-25T10:30:39.548Z","avatar_url":"https://github.com/intel.png","language":"C++","readme":"# PROJECT NOT UNDER ACTIVE MANAGEMENT\nThis project will no longer be maintained by Intel.  \n\nIntel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.  \n\nIntel no longer accepts patches to this project.  \n\n## Please refer to https://github.com/intel/intel-extension-for-pytorch as an alternative\n\n\n# Neural Speed\n\nNeural Speed is an innovative library designed to support the efficient inference of large language models (LLMs) on Intel platforms through the state-of-the-art (SOTA) low-bit quantization powered by [Intel Neural Compressor](https://github.com/intel/neural-compressor). 
The work is inspired by [llama.cpp](https://github.com/ggerganov/llama.cpp) and further optimized for Intel platforms with our innovations presented at [NeurIPS 2023](https://arxiv.org/abs/2311.00502).\n\n## Key Features\n- Highly optimized kernels on CPUs with ISAs (AMX, VNNI, AVX512F, AVX_VNNI and AVX2) for N-bit weights (int1, int2, int3, int4, int5, int6, int7 and int8). See [details](neural_speed/core/README.md)\n- Up to 40x performance speedup on popular LLMs compared with llama.cpp. See [details](https://medium.com/@NeuralCompressor/llm-performance-of-intel-extension-for-transformers-f7d061556176)\n- Tensor parallelism across sockets/nodes on CPUs. See [details](./docs/tensor_parallelism.md)\n\n\u003e Neural Speed is under active development, so APIs are subject to change.\n\n## Supported Hardware\n| Hardware | Supported |\n|-------------|:-------------:|\n|Intel Xeon Scalable Processors | ✔ |\n|Intel Xeon CPU Max Series | ✔ |\n|Intel Core Processors | ✔ |\n\n## Supported Models\nSupports almost all LLMs in PyTorch format from Hugging Face, such as Llama2, ChatGLM2, Baichuan2, Qwen, Mistral, Whisper, etc. File an [issue](https://github.com/intel/neural-speed/issues) if your favorite LLM does not work.\n\nSupports typical LLMs in GGUF format, such as Llama2, Falcon, MPT and Bloom, with more coming. 
Check out the [details](./docs/supported_models.md).\n\n## Installation\n\n### Install from Binary\n```shell\npip install -r requirements.txt\npip install neural-speed\n```\n\n### Build from Source\n```shell\npip install .\n```\n\n\u003e**Note**: GCC version 10+ is required.\n\n\n## Quick Start (Transformer-like usage)\n\nInstall [Intel Extension for Transformers](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/installation.md) to use Transformer-like APIs.\n\n\n### PyTorch Model from Hugging Face\n\n```python\nfrom transformers import AutoTokenizer, TextStreamer\nfrom intel_extension_for_transformers.transformers import AutoModelForCausalLM\nmodel_name = \"Intel/neural-chat-7b-v3-1\"     # Hugging Face model_id or local model\nprompt = \"Once upon a time, there existed a little girl,\"\n\ntokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\ninputs = tokenizer(prompt, return_tensors=\"pt\").input_ids\nstreamer = TextStreamer(tokenizer)\n\nmodel = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)\noutputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)\n```\n\n### GGUF Model from Hugging Face\n\n```python\nfrom transformers import AutoTokenizer, TextStreamer\nfrom intel_extension_for_transformers.transformers import AutoModelForCausalLM\n\n# Specify the GGUF repo on Hugging Face\nmodel_name = \"TheBloke/Llama-2-7B-Chat-GGUF\"\n# Download the specific GGUF model file from the above repo\ngguf_file = \"llama-2-7b-chat.Q4_0.gguf\"\n# Make sure you have been granted access to this model on Hugging Face.\ntokenizer_name = \"meta-llama/Llama-2-7b-chat-hf\"\n\nprompt = \"Once upon a time\"\ntokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)\ninputs = tokenizer(prompt, return_tensors=\"pt\").input_ids\nstreamer = TextStreamer(tokenizer)\nmodel = AutoModelForCausalLM.from_pretrained(model_name, gguf_file=gguf_file)\noutputs = model.generate(inputs, 
streamer=streamer, max_new_tokens=300)\n```\n\n### PyTorch Model from ModelScope\n\n```python\nfrom transformers import TextStreamer\nfrom modelscope import AutoTokenizer\nfrom intel_extension_for_transformers.transformers import AutoModelForCausalLM\nmodel_name = \"qwen/Qwen-7B\"     # ModelScope model_id or local model\nprompt = \"Once upon a time, there existed a little girl,\"\n\nmodel = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub=\"modelscope\")\ntokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\ninputs = tokenizer(prompt, return_tensors=\"pt\").input_ids\nstreamer = TextStreamer(tokenizer)\noutputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)\n```\n\n### As an Inference Backend in Neural Chat Server\n`Neural Speed` can be used in the [Neural Chat Server](https://github.com/intel/intel-extension-for-transformers/tree/main/intel_extension_for_transformers/neural_chat/server) of `Intel Extension for Transformers`. 
You can enable it by adding `use_neural_speed: true` to `config.yaml`.\n\n- Add an `optimization` section to use `Neural Speed` and its RTN quantization ([example](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/examples/deployment/codegen/backend/pc/woq/codegen.yaml)).\n```yaml\ndevice: \"cpu\"\n\n# itrex int4 llm runtime optimization\noptimization:\n    use_neural_speed: true\n    optimization_type: \"weight_only\"\n    compute_dtype: \"fp32\"\n    weight_dtype: \"int4\"\n```\n- Add the keys `use_neural_speed` and `use_gptq` to use `Neural Speed` and load a `GPTQ` model ([example](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/examples/deployment/codegen/backend/pc/gptq/codegen.yaml)).\n\n```yaml\ndevice: \"cpu\"\nuse_neural_speed: true\nuse_gptq: true\n```\n\nFor more details, please refer to [Neural Chat](https://github.com/intel/intel-extension-for-transformers/tree/main/intel_extension_for_transformers/neural_chat).\n\n\n## Quick Start (llama.cpp-like usage)\n\n### Single (One-click) Step\n\n```bash\npython scripts/run.py model-path --weight_dtype int4 -p \"She opened the door and see\"\n```\n\n### Multiple Steps\n\n#### Convert and Quantize\n\n```bash\n# Skip this step if the GGUF model is from Hugging Face or generated by llama.cpp\npython scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b\n# Using the quantize script requires a binary installation of Neural Speed\nmkdir build \u0026\u0026 cd build\ncmake .. \u0026\u0026 make -j\ncd ..\npython scripts/quantize.py --model_name gptj --model_file ne-f32.bin --out_file ne-q4_j.bin --build_dir ./build --weight_dtype int4 --alg sym\n```\n\n#### Inference\n\n```bash\n# Linux and WSL\nOMP_NUM_THREADS=\u003cphysical_cores\u003e numactl -m 0 -C 0-\u003cphysical_cores-1\u003e python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t 
\u003cphysical_cores\u003e --color -p \"She opened the door and see\"\n```\n\n```bash\n# Windows\npython scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t \u003cphysical_cores|P-cores\u003e --color -p \"She opened the door and see\"\n```\n\n\u003e Please refer to [Advanced Usage](./docs/advanced_usage.md) for more details.\n\n## Advanced Topics\n\n### New model enabling\nTo add your own models, please follow the [graph developer document](./developer_document.md).\n\n### Performance profiling\nSet the `NEURAL_SPEED_VERBOSE` environment variable to enable performance profiling.\n\nAvailable modes:\n- 0: Print full information (evaluation time and operator profiling). Requires setting `NS_PROFILING` to ON and recompiling.\n- 1: Print the evaluation time, i.e. the time taken for each evaluation.\n- 2: Profile individual operators to identify performance bottlenecks within the model. Requires setting `NS_PROFILING` to ON and recompiling.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fintel%2Fneural-speed","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fintel%2Fneural-speed","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fintel%2Fneural-speed/lists"}