https://github.com/muhtasham/char-prefix-conditioning

A minimal, efficient implementation of character prefix conditioning for code completion.

code-generation

Last synced: 11 months ago
Host: GitHub
URL: https://github.com/muhtasham/char-prefix-conditioning
Owner: Muhtasham
Created: 2025-07-17T23:18:09.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-07-18T19:48:53.000Z (11 months ago)
Last Synced: 2025-07-20T09:01:27.957Z (11 months ago)
Topics: code-generation
Language: Python
Homepage: https://www.cursor.so/blog/cpc
Size: 35.2 KB
Stars: 2
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.MD

# Character Prefix Conditioning

A minimal, efficient implementation of character prefix conditioning (CPC) for code completion, inspired by the [Cursor blog](https://cursor.com/blog/cpc).

## Overview

When using a language model for code completion, we typically want the model to produce a completion that begins with what the user has typed. However, modern language models operate on sequences of tokens, not characters, so naively tokenizing the user's input and sending it to the model produces wrong results if the user's cursor doesn't happen to lie on a token boundary.

**CPC** is an algorithm for sampling a sequence of tokens conditioned on a character prefix, ensuring completions always start with the user's typed prefix—even if it doesn't align with token boundaries.
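
To make the token-boundary problem concrete, here is a minimal sketch with a toy longest-match tokenizer. The vocabulary and the `greedy_tokenize` helper are hypothetical and exist only for illustration; real tokenizers such as GPT-2's BPE behave differently in detail, but the mismatch shown is of the same kind.

```python
# Toy vocabulary (hypothetical; not the vocabulary of any real model).
VOCAB = ["import", " numpy", " num", " n", "um", "py", " ", "n", "u", "m", "p", "y"]


def greedy_tokenize(text: str) -> list[str]:
    """Longest-match-first tokenization over the toy vocabulary."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at the current position.
        match = max(
            (tok for tok in VOCAB if text.startswith(tok, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
        tokens.append(match)
        pos += len(match)
    return tokens


typed = greedy_tokenize("import num")     # what the user has typed so far
wanted = greedy_tokenize("import numpy")  # what the user is about to type

print(typed)   # ['import', ' num']
print(wanted)  # ['import', ' numpy']
# The typed tokens are not a prefix of the wanted tokens:
print(wanted[: len(typed)] == typed)  # False
```

Feeding `['import', ' num']` to the model as context asks it to continue after the token `' num'`, whereas the text the user wants is reached through the different token `' numpy'`.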

## Mathematical Foundation

### Problem Statement

We want to sample a sequence of tokens $s = t_1, t_2, \ldots, t_n$ from a distribution specified by an autoregressive model $p(s)$ given by:

$$p(s) = p(t_1, t_2, \ldots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_1, \ldots, t_{k-1})$$

subject to the constraint that $s$ starts with a character prefix $P$, i.e., $P$ is a prefix of $\text{repr}(t_1) + \text{repr}(t_2) + \cdots + \text{repr}(t_n)$, where $+$ means string concatenation and $\text{repr}$ maps a token to the characters it represents.

We define $q(s) = p(s \mid s \text{ starts with } P)$. It's sufficient to find a way to sample autoregressively from $q(s)$, that is, to sample from $q(t_k \mid t_1, \ldots, t_{k-1})$ for each $k$.
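
Applying Bayes' rule shows what the exact conditional involves. This short derivation is not taken from the source; $C$ is notation introduced here for the event "$s$ starts with $P$":

$$q(t_k \mid t_1, \ldots, t_{k-1}) = \frac{p(t_k \mid t_1, \ldots, t_{k-1}) \cdot p(C \mid t_1, \ldots, t_k)}{p(C \mid t_1, \ldots, t_{k-1})}$$

The factor $p(C \mid t_1, \ldots, t_k)$ equals 1 once $\text{repr}(t_1) + \cdots + \text{repr}(t_k)$ starts with $P$, and 0 when that text disagrees with $P$ at some character. In the remaining case, where the text is a proper prefix of $P$, it is a sum over all future continuations that complete $P$, which is what makes exact sampling expensive. A scheme that replaces this factor with a 0/1 mask and renormalizes is therefore, in general, an approximation of $q$ in that partially matched case.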

### Algorithm

For each step $k$, we need to sample from $q(t_k \mid t_1, \ldots, t_{k-1})$. Here's the efficient algorithm:

1. **Get model predictions**: Compute $p(t_k \mid t_1, \ldots, t_{k-1})$ from the language model for all possible tokens $t_k$

2. **Apply constraint mask**: For each token $t_k$, check whether appending it to the current sequence keeps the text consistent with the character prefix $P$. Create a binary mask $M(t_k)$ where:

   - $M(t_k) = 1$ if $\text{repr}(t_1) + \cdots + \text{repr}(t_{k-1}) + \text{repr}(t_k)$ starts with $P$, or is itself a prefix of $P$ (the token covers only part of the remaining prefix)

   - $M(t_k) = 0$ otherwise

3. **Renormalize probabilities**: Compute the constrained distribution:

   $$q(t_k) = \frac{p(t_k) \cdot M(t_k)}{\sum_{t'} p(t') \cdot M(t')}$$

4. **Sample from constrained distribution**: Sample $t_k \sim q(t_k)$

5. **Terminate when constraint is satisfied**: Stop applying the mask once the generated sequence starts with the prefix $P$
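
The steps above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the repository's implementation: `toy_next_token_probs` is a hypothetical stand-in for the model's forward pass, the vocabulary is made up, and a token is treated as valid when the resulting text either starts with $P$ or is still a prefix of $P$, so that the prefix can be covered over several tokens.

```python
import random

# Hypothetical toy vocabulary; a real setup would use the model's tokenizer.
VOCAB = ["import", " num", " numpy", " n", "um", "py", " os", " as", " np"]


def toy_next_token_probs(tokens: list[str]) -> dict[str, float]:
    """Stand-in for p(t_k | t_1..t_{k-1}): uniform over the vocabulary."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}


def is_compatible(text: str, prefix: str) -> bool:
    """True if text already starts with prefix, or is still a prefix of it."""
    return text.startswith(prefix) or prefix.startswith(text)


def sample_until_prefix(
    prefix: str, rng: random.Random, max_steps: int = 32
) -> list[str]:
    tokens: list[str] = []
    text = ""
    for _ in range(max_steps):
        if text.startswith(prefix):
            break  # step 5: the constraint is satisfied
        probs = toy_next_token_probs(tokens)  # step 1: one model call
        # Step 2: keep only tokens that stay consistent with the prefix.
        masked = {
            tok: p for tok, p in probs.items() if is_compatible(text + tok, prefix)
        }
        total = sum(masked.values())
        if total == 0.0:
            raise RuntimeError("no token is compatible with the prefix")
        candidates = list(masked)
        weights = [masked[tok] / total for tok in candidates]  # step 3
        choice = rng.choices(candidates, weights=weights, k=1)[0]  # step 4
        tokens.append(choice)
        text += choice
    return tokens


tokens = sample_until_prefix("import num", random.Random(0))
print(tokens, "->", "".join(tokens))
```

With this toy vocabulary the first token is always `"import"`, and the constrained phase ends with one of `" num"`, `" numpy"`, or `" n"` followed by `"um"`.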

### Key Insights

- **Efficiency**: The algorithm requires only one forward pass through the language model per generated token, minimizing model calls.

- **Vectorization**: The constraint checking (for all possible next tokens) is vectorized across the vocabulary, making it efficient despite being O(|V|) per step.

- **Early termination**: Constraint checking can stop as soon as the generated text covers the prefix; from then on, generation continues with ordinary unconstrained sampling.

- **Fallback strategies**: For edge cases where no valid tokens have sufficient probability, the algorithm can fall back to the most probable valid token or even violate the constraint with a retry mechanism.
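
As one possible reading of the fallback idea, the sketch below renormalizes over valid tokens, switches to the single most probable valid token when their combined mass is tiny, and returns `None` when nothing is valid so the caller can retry or relax the constraint. The function name, the `min_mass` threshold, and the `None` convention are assumptions made for illustration; the source does not specify them.

```python
from typing import Optional


def constrained_distribution(
    probs: dict[str, float],
    valid: set[str],
    min_mass: float = 1e-6,  # hypothetical threshold, not from the source
) -> Optional[dict[str, float]]:
    """Renormalize probs over valid tokens, with a greedy fallback.

    Returns None when no token is valid, so the caller can retry or relax
    the constraint.
    """
    masked = {tok: p for tok, p in probs.items() if tok in valid}
    if not masked:
        return None  # nothing satisfies the constraint
    mass = sum(masked.values())
    if mass < min_mass:
        # Valid tokens exist but carry almost no probability mass:
        # pick the most probable one outright instead of renormalizing.
        best = max(masked, key=masked.get)
        return {best: 1.0}
    return {tok: p / mass for tok, p in masked.items()}


probs = {"a": 0.7, "b": 0.2, "c": 0.1}
normal = constrained_distribution(probs, {"b", "c"})
greedy = constrained_distribution({"a": 1.0, "b": 1e-9}, {"b"})
empty = constrained_distribution(probs, set())

print(normal)  # "b" gets about 2/3 of the mass, "c" about 1/3
print(greedy)  # {'b': 1.0}
print(empty)   # None
```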

### Complexity Analysis

- **Per step**: O(|V|) constraint checking (vectorized), **1 model call**

- **Total**: O(n · |V|) constraint checks and O(n) model calls for generating n tokens

- **Memory**: O(|V|) for storing token representations and masks

- **Optimizations**: KV caching reduces repeated computations, early termination reduces total steps

## Setup

**Install dependencies**:

```sh
uv sync
```

**Run the main script**:

```sh
uv run main.py
```

## Usage

```python
from main import ModelManager, character_prefix_sample

# Initialize and load model
model_manager = ModelManager("gpt2")
model_manager.load_model()

# Generate with character prefix constraint
result = character_prefix_sample(
    model_manager=model_manager,
    prompt_text="import",
    character_prefix="import num",
    max_new_tokens=15,
)

print(result)  # Example output: "import numpy as np"
```

## Examples

The implementation includes test cases covering the scenarios below. Each is listed as prompt → character prefix → completion:

- **Simple prefix matching**: `"import"` → `"import num"` → `"import numpy as np"`

- **Mid-token completion**: `"The model's behav"` → `"The model's behavi"` → `"The model's behavior"`

- **F-string completion**: `'print(f"The result is {re'` → `'print(f"The result is {res'` → `'print(f"The result is {result}"'`

- **Empty prompt generation**: `""` → `"Once upon a ti"` → `"Once upon a time"`

- **JSON completion**: `'{"data": {"user'` → `'{"data": {"username": "test'` → `'{"data": {"username": "test"}}'`
