{"id":31971335,"url":"https://github.com/vllm-project/vllm-neuron","last_synced_at":"2026-02-26T20:26:52.652Z","repository":{"id":317460431,"uuid":"1059856577","full_name":"vllm-project/vllm-neuron","owner":"vllm-project","description":"Community maintained hardware plugin for vLLM on AWS Neuron","archived":false,"fork":false,"pushed_at":"2026-02-11T20:14:15.000Z","size":353,"stargazers_count":21,"open_issues_count":2,"forks_count":8,"subscribers_count":2,"default_branch":"release-0.3.0","last_synced_at":"2026-02-12T04:28:18.159Z","etag":null,"topics":["aws","aws-neuron","inferentia","trainium"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vllm-project.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-19T03:45:30.000Z","updated_at":"2026-02-11T20:14:00.000Z","dependencies_parsed_at":"2025-10-01T03:22:13.693Z","dependency_job_id":null,"html_url":"https://github.com/vllm-project/vllm-neuron","commit_stats":null,"previous_names":["vllm-project/vllm-neuron"],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/vllm-project/vllm-neuron","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fvllm-neuron","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fvllm-neuron/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fvllm-neuron
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fvllm-neuron/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vllm-project","download_url":"https://codeload.github.com/vllm-project/vllm-neuron/tar.gz/refs/heads/release-0.3.0","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fvllm-neuron/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29871011,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-26T18:42:30.764Z","status":"ssl_error","status_checked_at":"2026-02-26T18:41:47.936Z","response_time":89,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","aws-neuron","inferentia","trainium"],"created_at":"2025-10-14T19:45:24.065Z","updated_at":"2026-02-26T20:26:52.646Z","avatar_url":"https://github.com/vllm-project.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# vLLM User Guide for AWS Neuron\n\n[vLLM](https://docs.vllm.ai/en/latest/) is a popular library for LLM inference and serving utilizing advanced inference features such as continuous batching.\nThis guide describes how to utilize AWS Inferentia and AWS Trainium AI accelerators in vLLM by using NxD Inference (`neuronx-distributed-inference`).\n\n## Table of Contents\n\n- 
[Overview](#overview)\n- [Supported Models](#supported-models)\n- [Setup](#setup)\n  - [Prerequisite: Launch an instance and install drivers and tools](#prerequisite-launch-an-instance-and-install-drivers-and-tools)\n  - [Installing the vllm-neuron Plugin](#installing-the-vllm-neuron-plugin)\n- [Usage](#usage)\n- [Feature Support](#feature-support)\n- [Feature Configuration](#feature-configuration)\n- [Examples](#examples)\n- [Known Issues](#known-issues)\n- [Support](#support)\n\n## Overview\n\nNxD Inference integrates with vLLM by using [vLLM's Plugin System](https://docs.vllm.ai/en/latest/design/plugin_system.html) to extend the model execution components responsible for loading and invoking models within vLLM's LLMEngine (see [vLLM architecture](https://docs.vllm.ai/en/latest/design/arch_overview.html#llm-engine) for more details). This means input processing, scheduling and output processing follow the default vLLM behavior.\n\n### Versioning\n\nPlugin Version: `0.3.0`\n\nNeuron SDK Version: `2.27.1`\n\nvLLM Version: `0.13.0`\n\nPyTorch Version: `2.9.0`\n\n\n## Supported Models\n\nThe following models are supported on vLLM with NxD Inference:\n\n- Llama 2/3.1/3.3\n- Llama 4 Scout, Maverick\n- Qwen 2.5\n- Qwen 3\n- Pixtral (limited, see Known Issues)\n\n\n## Setup\n\n### Prerequisite: Launch an instance and install drivers and tools\n\nBefore installing vLLM with the instructions below, you must launch a Trainium or an Inferentia instance and install the necessary Neuron SDK dependency libraries. 
Refer to [these setup instructions](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/get-started/quickstart-configure-deploy-dlc.html) to prepare your environment.\n\n**Prerequisites:**\n\n- Latest AWS Neuron SDK ([Neuron SDK 2.27.1](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/2.27.1.html))\n- Python 3.10+ (compatible with vLLM requirements)\n- Supported AWS instances: Inf2, Trn1/Trn1n, Trn2\n\n### Installing the vllm-neuron Plugin\n\nAWS Neuron maintains a vLLM-Neuron Plugin that supports the latest features for NxD Inference. Follow the instructions below to obtain and configure it.\n\n#### Quickstart using Docker\n\nYou can use a preconfigured Deep Learning Container (DLC) with the AWS vLLM-Neuron plugin pre-installed.\nRefer to the [vllm-neuron DLC guide](https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#vllm-inference-neuronx) to get started.\n\nFor a complete step-by-step tutorial on deploying the vLLM Neuron DLC, see the [Quickstart Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/get-started/quickstart-configure-deploy-dlc.html#quickstart-vllm-dlc-deploy).\n\n#### Manually install from source\n\nInstall the plugin from GitHub sources using the following commands. 
The plugin will automatically install the correct version of vLLM along with other required dependencies.\n\n```bash\ngit clone --branch \"0.3.0\" https://github.com/vllm-project/vllm-neuron.git\ncd vllm-neuron\npip install --extra-index-url=https://pip.repos.neuron.amazonaws.com -e .\n```\n\n## Usage\n\n### Quickstart\n\nHere is a very basic example to get started:\n\n```python\nfrom vllm import LLM, SamplingParams\n\n# Initialize the model\nllm = LLM(\n    model=\"TinyLlama/TinyLlama-1.1B-Chat-v1.0\",\n    max_num_seqs=4,\n    max_model_len=128,\n    tensor_parallel_size=2,\n    block_size=32,\n    num_gpu_blocks_override=16\n)\n\n# Generate text\nprompts = [\n    \"Hello, my name is\",\n    \"The president of the United States is\",\n    \"The capital of France is\",\n]\nsampling_params = SamplingParams(temperature=0.0)\noutputs = llm.generate(prompts, sampling_params)\n\nfor output in outputs:\n    print(f\"Prompt: {output.prompt}\")\n    print(f\"Generated: {output.outputs[0].text}\")\n```\n\n## Feature Support\n\n| Feature                 | Status | Notes                             |\n|:------------------------|:------:|-----------------------------------|\n| Continuous batching     |   🟢   |                                   |\n| Prefix Caching          |   🟢   |                                   |\n| Multi-LORA              |   🟢   |                                   |\n| Speculative Decoding    |   🟢   | Only Eagle V1 is supported        |\n| Quantization            |   🟢   | INT8/FP8 quantization support     |\n| Dynamic sampling        |   🟢   |                                   |\n| Tool calling            |   🟢   |                                   |\n| CPU Sampling            |   🟢   |                                   |\n| Structured Outputs      |   🟢   |                                   |\n| Chunked Prefill         |   🚧   |                                   |\n| Multimodal              |   🚧   | Llama 4 and Pixtral are supported |\n\n- 🟢 
Functional: Fully operational, with ongoing optimizations.\n- 🚧 WIP: Under active development.\n\n## Feature Configuration\n\nNxD Inference models provide many configuration options. When using NxD Inference through vLLM, you configure the model with a default configuration that sets the required fields from vLLM settings.\n\n```python\nneuron_config = dict(\n    tp_degree=parallel_config.tensor_parallel_size,\n    ctx_batch_size=1,\n    batch_size=scheduler_config.max_num_seqs,\n    max_context_length=scheduler_config.max_model_len,\n    seq_len=scheduler_config.max_model_len,\n    enable_bucketing=True,\n    is_continuous_batching=True,\n    quantized=False,\n    torch_dtype=TORCH_DTYPE_TO_NEURON_AMP[model_config.dtype],\n    padding_side=\"right\"\n)\n```\n\nUse the `additional_config` field to provide an `override_neuron_config` dictionary that specifies your desired NxD Inference configuration settings. You provide the settings you want to override as a dictionary (or JSON object when starting vLLM from the CLI) containing basic types. 
For example, to enable prefix caching:\n\n```python\nadditional_config=dict(\n    override_neuron_config=dict(\n        is_prefix_caching=True,\n        is_block_kv_layout=True,\n        pa_num_blocks=4096,\n        pa_block_size=32,\n    )\n)\n```\n\nor when launching vLLM from the CLI:\n\n```bash\n--additional-config '{\n    \"override_neuron_config\": {\n        \"is_prefix_caching\": true,\n        \"is_block_kv_layout\": true,\n        \"pa_num_blocks\": 4096,\n        \"pa_block_size\": 32\n    }\n}'\n```\n\nFor more information on NxD Inference features, see [NxD Inference Features Configuration Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html) and [NxD Inference API Reference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/api-guides/api-guide.html).\n\n### Scheduling and K/V Cache\n\nNxD Inference uses a contiguous memory layout for the K/V cache instead of PagedAttention. It integrates into vLLM's block manager by setting the block size to the maximum length supported by the model and allocating one block for each of the configured maximum number of sequences. However, the vLLM scheduler currently does not introspect the blocks associated with each sequence when (re-)scheduling running sequences. The scheduler requires an additional free block regardless of the space available in the current block, which results in preemption. This would lead to a large increase in latency for the preempted sequence because it would be rescheduled in the context encoding phase. Since NxD Inference's implementation ensures each block is big enough to fit the maximum model length, preemption is never needed in our current integration. As a result, AWS Neuron disabled the preemption checks done by the scheduler in our fork. 
This significantly improves E2E performance of the Neuron integration.\n\n### Decoding\n\nOn-device sampling is enabled by default. It performs the sampling logic on the Neuron devices rather than passing the generated logits back to the CPU and sampling through vLLM. This allows you to use Neuron hardware to accelerate sampling and reduces the amount of data transferred between devices, which improves latency.\n\nHowever, on-device sampling comes with some limitations. Currently, only the `temperature`, `top_k`, and `top_p` sampling parameters are supported. Other sampling parameters are not supported through on-device sampling.\n\nWhen on-device sampling is enabled, we handle the following special cases:\n\n* When `top_k` is set to -1, we limit `top_k` to 256 instead.\n* When `temperature` is set to 0, we use greedy decoding to remain compatible with existing conventions. This is the same as setting `top_k` to 1.\n\nBy default, on-device sampling uses a greedy decoding strategy that selects the tokens with the highest probabilities. You can enable a different on-device sampling strategy by passing an `on_device_sampling_config` through `override_neuron_config`. We strongly recommend setting `global_top_k`, which caps the maximum value of `top_k` a user can request, for improved performance.\n\n### Quantization\n\nNxD Inference supports quantization, but it has not yet been integrated with vLLM's quantization configuration. If you want to use quantization, **do not** set vLLM's `--quantization` setting to `neuron_quant`. Keep it unset and use the Neuron configuration of the model to configure quantization of the NxD Inference model directly. For more information on how to configure and use quantization with NxD Inference, including 
requirements on checkpoints, refer to [Quantization](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#quantization) in the NxD Inference Feature Guide.\n\n### Loading pre-compiled models / Serialization Support\n\nTracing and compiling the model can take a non-trivial amount of time depending on model size. For example, a smaller model of around 15 GB might take about 15 minutes to compile. Exact times depend on multiple factors. Doing this on each server start would lead to unacceptable application startup times. Therefore, we support storing and loading the traced and compiled models.\n\nBoth are controlled through the `NEURON_COMPILED_ARTIFACTS` environment variable. When it points to a path that contains a pre-compiled model, we load the pre-compiled model directly, and any differing model configurations passed in to the vLLM API will not trigger re-compilation. If loading from the `NEURON_COMPILED_ARTIFACTS` path fails, we recompile the model with the provided configurations and store the results in the provided location. If `NEURON_COMPILED_ARTIFACTS` is not set, we compile the model and store it under a `neuron-compiled-artifacts` subdirectory in the directory of your model checkpoint.\n\n### Prefix Caching\n\nStarting in Neuron SDK 2.24, prefix caching is supported on the AWS Neuron fork of vLLM. Prefix caching allows developers to improve time to first token (TTFT) by re-using the KV cache of prompt prefixes shared across inference requests. 
See [Prefix Caching](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#prefix-caching-support) for more information on how to enable prefix caching with vLLM.\n\n## Examples\n\nFor more in-depth NxD Inference tutorials that include vLLM deployment steps, refer to [Tutorials](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/index.html).\n\nThe following examples use [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0).\n\nIf you have access to the model checkpoint locally, replace `TinyLlama/TinyLlama-1.1B-Chat-v1.0` with the path to your local copy.\n\nIf you use an instance type that supports a higher tensor parallel size, adjust `--tensor-parallel-size` according to the number of Neuron Cores available on your instance type. For more information, see [Tensor-parallelism support](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/app-notes/parallelism.html).\n\n### Offline Inference Example\n\nFor offline inference, refer to the code example in the [Quickstart](#quickstart) section above.\n\n### Online Inference Example\n\nYou can start an OpenAI API compatible server with the same settings as the offline example by running the following command:\n\n```bash\npython3 -m vllm.entrypoints.openai.api_server \\\n    --model \"TinyLlama/TinyLlama-1.1B-Chat-v1.0\" \\\n    --tensor-parallel-size 2 \\\n    --max-model-len 128 \\\n    --max-num-seqs 4 \\\n    --block-size 32 \\\n    --num-gpu-blocks-override 16 \\\n    --port 8000\n```\n\nIn addition to the sampling parameters supported by OpenAI, we also support `top_k`. 
You can change the sampling parameters and enable or disable streaming.\n\n```python\nfrom openai import OpenAI\n\n# Client Setup\nopenai_api_key = \"EMPTY\"\nopenai_api_base = \"http://localhost:8000/v1\"\n\nclient = OpenAI(\n    api_key=openai_api_key,\n    base_url=openai_api_base,\n)\n\nmodels = client.models.list()\nmodel_name = models.data[0].id\n\n# Sampling Parameters\nmax_tokens = 64\ntemperature = 1.0\ntop_p = 1.0\ntop_k = 50\nstream = False\n\n# Chat Completion Request\nprompt = \"Hello, my name is Llama \"\nresponse = client.chat.completions.create(\n    model=model_name,\n    messages=[{\"role\": \"user\", \"content\": prompt}],\n    max_tokens=int(max_tokens),\n    temperature=float(temperature),\n    top_p=float(top_p),\n    stream=stream,\n    extra_body={'top_k': top_k}\n)\n\n# Parse the response\ngenerated_text = \"\"\nif stream:\n    for chunk in response:\n        if chunk.choices[0].delta.content is not None:\n            generated_text += chunk.choices[0].delta.content\nelse:\n    generated_text = response.choices[0].message.content\n    \nprint(generated_text)\n```\n\n## Known Issues\n\n1. Chunked prefill is disabled by default on Neuron for optimal performance. To enable chunked prefill, set the environment variable `DISABLE_NEURON_CUSTOM_SCHEDULER=\"1\"`.\n\n2. You must provide `num_gpu_blocks_override` to avoid out-of-bounds (OOB) errors. This override ensures vLLM's scheduler uses the same block count that was compiled into the model. Currently NxDI does not support using different kv cache sizes at compile vs. runtime.\n\n   - With either chunked prefill or prefix caching: NxDI will internally use blockwise kv cache layout. Set `num_gpu_blocks_override` to at least `ceil(max_model_len / block_size) * max_num_seqs`\n   - With neither chunked prefill nor prefix caching: NxDI will internally use contiguous kv cache layout, and overwrite `block_size` to `max_model_len`. Set `num_gpu_blocks_override` to exactly `max_num_seqs`\n\n3. 
When you combine [shard on load](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/weights-sharding-guide.html#shard-on-load) with a Hugging Face model ID for a model that has `tie_word_embeddings` set to `true` in its config (such as [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B/blob/main/config.json#L24)), you may encounter the error `NotImplementedError: Cannot copy out of meta tensor; no data!`. To resolve this, download the model checkpoint locally from Hugging Face and serve it from the local path instead of using the Hugging Face model ID.\n\n4. Async tokenization in vLLM V1 can increase request preprocessing time for small inputs and batch sizes. The Neuron team is investigating potential solutions.\n\n5. Pixtral has out-of-bounds issues for batch sizes greater than 4. The maximum sequence length is 10240.\n\n## Support\n\n- **Documentation**: [AWS Neuron Documentation](https://awsdocs-neuron.readthedocs-hosted.com/)\n- **Issues**: [GitHub Issues](https://github.com/vllm-project/vllm-neuron/issues)\n- **Community**: [AWS Neuron Forum](https://repost.aws/tags/TAjy-krivRTDqDPWNNBmV9lA)\n\n## License\n\nThis project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvllm-project%2Fvllm-neuron","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvllm-project%2Fvllm-neuron","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvllm-project%2Fvllm-neuron/lists"}