https://github.com/ggml-org/llama.vscode
VS Code extension for LLM-assisted code/text completion
- Host: GitHub
- URL: https://github.com/ggml-org/llama.vscode
- Owner: ggml-org
- License: mit
- Created: 2025-01-06T11:54:52.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2026-02-08T21:31:13.000Z (3 months ago)
- Last Synced: 2026-02-17T05:26:24.887Z (2 months ago)
- Topics: copilot, developer-tool, extension, llama, llm, vscode
- Language: TypeScript
- Homepage:
- Size: 944 KB
- Stars: 1,160
- Watchers: 11
- Forks: 105
- Open Issues: 44
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# llama.vscode
Local LLM-assisted text completion, AI chat, and agentic coding extension for VS Code

---

## Features
- Auto-suggest on input
- Accept a suggestion with `Tab`
- Accept the first line of a suggestion with `Shift + Tab`
- Accept the next word with `Ctrl/Cmd + Right`
- Toggle the suggestion manually by pressing `Ctrl + L`
- Control max text generation time
- Configure scope of context around the cursor
- Ring context with chunks from open and edited files and yanked text
- [Supports very large contexts even on low-end hardware via smart context reuse](https://github.com/ggerganov/llama.cpp/pull/9787)
- Display performance stats
- Llama Agent for agentic coding
- Add/remove/export/import for models - completion, chat, embeddings and tools
- Model selection - for completion, chat, embeddings and tools
- Env (a group of models) - selecting/deselecting an env selects/deselects all the models in it
- Add/remove/export/import for env
- Predefined models (including OpenAI's gpt-oss 20B as a local model)
- Predefined envs for different use cases - completion only, chat + completion, chat + agent, local full package (with gpt-oss 20B), etc.
- MCP tools selection for the agent (from VS Code installed MCP Servers)
- Search for and download models from Hugging Face directly from llama-vscode
## Installation
### VS Code extension setup
Install the [llama-vscode](https://marketplace.visualstudio.com/items?itemName=ggml-org.llama-vscode) extension from the VS Code extension marketplace:

Note: also available at [Open VSX](https://open-vsx.org/extension/ggml-org/llama-vscode)
### `llama.cpp` setup
Open the llama-vscode menu by clicking llama-vscode in the status bar (or press `Ctrl+Shift+M`) and select "Install/Upgrade llama.cpp". This installs llama.cpp automatically on Mac and Windows. On Linux, download the [latest binaries](https://github.com/ggerganov/llama.cpp/releases) and add the bin folder to your PATH.
Once llama.cpp is installed, you can select an env for your needs from the llama-vscode menu entry "Select/start env...".
Below are details on how to install llama.cpp manually, if you prefer that.
#### Mac OS
```bash
brew install llama.cpp
```
#### Windows
```bash
winget install llama.cpp
```
#### Any other OS
Either use the [latest binaries](https://github.com/ggerganov/llama.cpp/releases) or [build llama.cpp from source](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md). For more information on how to run the `llama.cpp` server, please refer to the [Wiki](https://github.com/ggml-org/llama.vscode/wiki).
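For the binary route, a typical manual setup is to extract the release archive and put the folder containing the binaries on your PATH. A rough sketch - the archive name and extracted layout below are placeholders, so use the build that matches your OS and CPU and adjust the path if the binaries land elsewhere:

```bash
# The filename is a placeholder - pick the archive matching your OS/CPU
ARCHIVE=~/Downloads/llama-bXXXX-bin-ubuntu-x64.zip
mkdir -p ~/llama.cpp
unzip "$ARCHIVE" -d ~/llama.cpp

# Add the extracted bin folder to your PATH
# (append this line to ~/.bashrc or ~/.profile to make it permanent)
export PATH="$HOME/llama.cpp/bin:$PATH"

# Verify llama-server is now found
llama-server --help
```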
### llama.cpp settings
Here are recommended settings, depending on the amount of VRAM that you have:
- More than 64GB VRAM:
```bash
llama-server --fim-qwen-30b-default
```
- More than 16GB VRAM:
```bash
llama-server --fim-qwen-7b-default
```
- Less than 16GB VRAM:
```bash
llama-server --fim-qwen-3b-default
```
- Less than 8GB VRAM:
```bash
llama-server --fim-qwen-1.5b-default
```
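Whichever preset you start, you can sanity-check the server from a terminal before connecting the extension. An optional check, assuming the server listens on port 8012 (adjust if you started `llama-server` with a different `--port`):

```bash
# Confirm the server is up (assumes port 8012)
curl http://127.0.0.1:8012/health

# Optionally request a small FIM completion directly from the /infill endpoint
curl -s http://127.0.0.1:8012/infill -d '{
  "input_prefix": "def fib(n):\n    ",
  "input_suffix": "\nprint(fib(10))\n",
  "n_predict": 32
}'
```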
#### CPU-only configs
These are `llama-server` settings for CPU-only hardware. Note that the quality will be significantly lower:
```bash
llama-server \
-hf ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF \
--port 8012 -ub 512 -b 512 --ctx-size 0 --cache-reuse 256
```
```bash
llama-server \
-hf ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF \
--port 8012 -ub 1024 -b 1024 --ctx-size 0 --cache-reuse 256
```
You can use any other FIM-compatible model that your system can handle. By default, the models downloaded with the `-hf` flag are stored in:
- Mac OS: `~/Library/Caches/llama.cpp/`
- Linux: `~/.cache/llama.cpp`
- Windows: `LOCALAPPDATA`
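If you prefer to manage model files yourself, `llama-server` can also load a GGUF directly from disk with `-m` instead of downloading via `-hf`. A minimal sketch - the path and filename below are placeholders:

```bash
# Load a local GGUF file directly (the path is a placeholder)
llama-server -m ~/models/qwen2.5-coder-1.5b-q8_0.gguf \
  --port 8012 --ctx-size 0 --cache-reuse 256
```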
### Recommended LLMs
The plugin requires FIM-compatible models: [HF collection](https://huggingface.co/collections/ggml-org/llamavim-6720fece33898ac10544ecf9)
## Llama Agent
The extension includes Llama Agent for agentic coding tasks.
### Features
- Llama Agent UI in Explorer view
- Works with local models - gpt-oss 20B is the best choice for now
- Can also work with external models (for example, from OpenRouter)
- MCP support - can use tools from MCP servers that are installed and started in VS Code
- 9 internal tools available for use
  - custom_tool - returns the content of a file or a web page
  - custom_eval_tool - write your own tool in JavaScript (a function that takes a string input and returns a string)
- Attach the selection to the context
- Configure maximum loops for Llama Agent
### Usage
1. Open Llama Agent with `Ctrl+Shift+A` or from the llama-vscode menu entry "Show Llama Agent"
2. Select an env with an agent if you haven't done so before.
3. Write a query and attach files with the @ button if needed
More details: [llama.vscode Wiki](https://github.com/ggml-org/llama.vscode/wiki)
## Examples
Speculative FIMs running locally on an M2 Studio:
https://github.com/user-attachments/assets/cab99b93-4712-40b4-9c8d-cf86e98d4482
## Implementation details
The extension aims to be very simple and lightweight and at the same time to provide high-quality and performant local FIM completions, even on consumer-grade hardware.
- The initial implementation was done by Ivaylo Gardev [@igardev](https://github.com/igardev) using the [llama.vim](https://github.com/ggml-org/llama.vim) plugin as a reference
- Technical description: https://github.com/ggerganov/llama.cpp/pull/9787
## Other IDEs
- Vim/Neovim: https://github.com/ggml-org/llama.vim