https://github.com/marella/ctransformers

Python bindings for the Transformer models implemented in C/C++ using the GGML library.
ai ctransformers llm transformers
Last synced: 2 months ago
Host: GitHub
URL: https://github.com/marella/ctransformers
Owner: marella
License: mit
Created: 2023-05-14T22:06:24.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-01-28T07:37:48.000Z (over 1 year ago)
Last Synced: 2025-05-13T10:58:25.508Z (2 months ago)
Topics: ai, ctransformers, llm, transformers
Language: C
Homepage:
Size: 62.4 MB
Stars: 1,864
Watchers: 17
Forks: 141
Open Issues: 112
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

awesome-local-ai - CTransformers - Python bindings for the Transformer models implemented in C/C++ using GGML library | GGML/GPTQ | Both | ❌ | C/C++ | Text-Gen | (Inference Engine)
README

# [CTransformers](https://github.com/marella/ctransformers) [![PyPI](https://img.shields.io/pypi/v/ctransformers)](https://pypi.org/project/ctransformers/) [![tests](https://github.com/marella/ctransformers/actions/workflows/tests.yml/badge.svg)](https://github.com/marella/ctransformers/actions/workflows/tests.yml) [![build](https://github.com/marella/ctransformers/actions/workflows/build.yml/badge.svg)](https://github.com/marella/ctransformers/actions/workflows/build.yml)

Python bindings for the Transformer models implemented in C/C++ using the [GGML](https://github.com/ggerganov/ggml) library.

> Also see [ChatDocs](https://github.com/marella/chatdocs)

- [Supported Models](#supported-models)
- [Installation](#installation)
- [Usage](#usage)
  - [🤗 Transformers](#transformers)
  - [LangChain](#langchain)
  - [GPU](#gpu)
  - [GPTQ](#gptq)
- [Documentation](#documentation)
- [License](#license)

## Supported Models

| Models              | Model Type    | CUDA | Metal |
| :------------------ | ------------- | :--: | :---: |
| GPT-2               | `gpt2`        |      |       |
| GPT-J, GPT4All-J    | `gptj`        |      |       |
| GPT-NeoX, StableLM  | `gpt_neox`    |      |       |
| Falcon              | `falcon`      |  ✅  |       |
| LLaMA, LLaMA 2      | `llama`       |  ✅  |  ✅   |
| MPT                 | `mpt`         |  ✅  |       |
| StarCoder, StarChat | `gpt_bigcode` |  ✅  |       |
| Dolly V2            | `dolly-v2`    |      |       |
| Replit              | `replit`      |      |       |

## Installation

```sh
pip install ctransformers
```

## Usage

It provides a unified interface for all models:

```py
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")

print(llm("AI is going to"))
```

[Run in Google Colab](https://colab.research.google.com/drive/1GMhYMUAv_TyZkpfvUI1NirM8-9mCXQyL)

To stream the output, set `stream=True`:

```py
for text in llm("AI is going to", stream=True):
    print(text, end="", flush=True)
```
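When streaming, it is often useful to show chunks as they arrive and still keep the complete text. The sketch below assumes only what is described above: the model object is callable with `stream=True` and yields strings. The names `stream_and_collect` and `fake_llm` are hypothetical and exist only for illustration; `fake_llm` stands in for a loaded model so the example runs without model files.

```python
import sys
from typing import Callable, Iterable


def stream_and_collect(llm: Callable[..., Iterable[str]], prompt: str, out=sys.stdout) -> str:
    """Write streamed chunks to `out` as they arrive and return the full text."""
    chunks = []
    for text in llm(prompt, stream=True):
        out.write(text)
        out.flush()
        chunks.append(text)
    return "".join(chunks)


def fake_llm(prompt, stream=False):
    """Hypothetical stand-in for a loaded model (not part of ctransformers)."""
    pieces = [" change", " the", " world"]
    return iter(pieces) if stream else "".join(pieces)


if __name__ == "__main__":
    full_text = stream_and_collect(fake_llm, "AI is going to")
    print()
    print(repr(full_text))
```

With a real model, pass the `llm` object returned by `AutoModelForCausalLM.from_pretrained` in place of `fake_llm`.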

You can load models from Hugging Face Hub directly:

```py
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")
```

If a model repo has multiple model files (`.bin` or `.gguf` files), specify a model file using:

```py
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")
```
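The same choice comes up for a local directory that holds several model files. As a minimal sketch, the helper below lists the `.bin` and `.gguf` files in a directory so one of them can be passed as `model_file`. The helper name `list_model_files` and the file names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List

# File extensions mentioned above for model files.
MODEL_SUFFIXES = {".bin", ".gguf"}


def list_model_files(directory: str) -> List[str]:
    """Return the sorted names of `.bin` and `.gguf` files found in `directory`."""
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in MODEL_SUFFIXES
    )


if __name__ == "__main__":
    # Demo with empty placeholder files; real model files would be used in practice.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("ggml-model-q4_0.bin", "ggml-model-q8_0.gguf", "README.md"):
            (Path(tmp) / name).touch()
        print(list_model_files(tmp))
```

A name chosen from this list can then be passed along with the directory, for example `AutoModelForCausalLM.from_pretrained(directory, model_file=name, model_type="gpt2")`.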



### 🤗 Transformers

> **Note:** This is an experimental feature and may change in the future.

To use it with 🤗 Transformers, create the model and tokenizer using:

```py
from ctransformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)

tokenizer = AutoTokenizer.from_pretrained(model)
```

[Run in Google Colab](https://colab.research.google.com/drive/1FVSLfTJ2iBbQ1oU2Rqz0MkpJbaB_5Got)

You can use 🤗 Transformers text generation pipeline:

```py
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

print(pipe("AI is going to", max_new_tokens=256))
```

You can use 🤗 Transformers generation [parameters](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig):

```py
pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)
```

You can use 🤗 Transformers tokenizers:

```py
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)  # Load model from GGML model repo.

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load tokenizer from original model repo.
```

### LangChain

It is integrated into LangChain. See [LangChain docs](https://python.langchain.com/docs/ecosystem/integrations/ctransformers).

### GPU

To run some of the model layers on GPU, set the `gpu_layers` parameter:

```py
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)
```

[Run in Google Colab](https://colab.research.google.com/drive/1Ihn7iPCYiqlTotpkqa1tOhUIpJBrJ1Tp)

#### CUDA

Install CUDA libraries using:

```sh
pip install ctransformers[cuda]
```

#### ROCm

To enable ROCm support, install the `ctransformers` package using:

```sh
CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers
```

#### Metal

To enable Metal support, install the `ctransformers` package using:

```sh
CT_METAL=1 pip install ctransformers --no-binary ctransformers
```

### GPTQ

> **Note:** This is an experimental feature and only LLaMA models are supported using [ExLlama](https://github.com/turboderp/exllama).

Install additional dependencies using:

```sh
pip install ctransformers[gptq]
```

Load a GPTQ model using:

```py
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
```

[Run in Google Colab](https://colab.research.google.com/drive/1SzHslJ4CiycMOgrppqecj4VYCWFnyrN0)

> If the model name or path doesn't contain the word `gptq`, specify `model_type="gptq"`.

It can also be used with LangChain. Low-level APIs are not fully supported.

## Documentation

### Config

| Parameter            | Type        | Description                                                     | Default |
| :------------------- | :---------- | :-------------------------------------------------------------- | :------ |
| `top_k`              | `int`       | The top-k value to use for sampling.                            | `40`    |
| `top_p`              | `float`     | The top-p value to use for sampling.                            | `0.95`  |
| `temperature`        | `float`     | The temperature to use for sampling.                            | `0.8`   |
| `repetition_penalty` | `float`     | The repetition penalty to use for sampling.                     | `1.1`   |
| `last_n_tokens`      | `int`       | The number of last tokens to use for repetition penalty.        | `64`    |
| `seed`               | `int`       | The seed value to use for sampling tokens.                      | `-1`    |
| `max_new_tokens`     | `int`       | The maximum number of new tokens to generate.                   | `256`   |
| `stop`               | `List[str]` | A list of sequences to stop generation when encountered.        | `None`  |
| `stream`             | `bool`      | Whether to stream the generated text.                           | `False` |
| `reset`              | `bool`      | Whether to reset the model state before generating text.        | `True`  |
| `batch_size`         | `int`       | The batch size to use for evaluating tokens in a single prompt. | `8`     |
| `threads`            | `int`       | The number of threads to use for evaluating tokens.             | `-1`    |
| `context_length`     | `int`       | The maximum context length to use.                              | `-1`    |
| `gpu_layers`         | `int`       | The number of layers to run on GPU.                             | `0`     |

> **Note:** Currently only LLaMA, MPT and Falcon models support the `context_length` parameter.
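One way to keep generation settings in a single place is a small container whose defaults mirror the table above. This is a sketch, not part of ctransformers: `GenerationSettings` and `as_kwargs` are hypothetical names. It covers only the parameters that the `LLM.__call__` signature documented in this README accepts, so `context_length` and `gpu_layers` are left out.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class GenerationSettings:
    """Hypothetical container; default values copied from the Config table."""

    top_k: int = 40
    top_p: float = 0.95
    temperature: float = 0.8
    repetition_penalty: float = 1.1
    last_n_tokens: int = 64
    seed: int = -1
    max_new_tokens: int = 256
    stop: Optional[List[str]] = None
    stream: bool = False
    reset: bool = True
    batch_size: int = 8
    threads: int = -1

    def as_kwargs(self, **overrides) -> dict:
        """Return the settings as a dict, with per-call overrides applied."""
        unknown = set(overrides) - set(asdict(self))
        if unknown:
            # Fail early on misspelled parameter names.
            raise TypeError(f"unknown generation parameters: {sorted(unknown)}")
        return {**asdict(self), **overrides}


if __name__ == "__main__":
    settings = GenerationSettings(max_new_tokens=64)
    print(settings.as_kwargs(temperature=0.2))
```

With a loaded model, the result can be unpacked into a call such as `llm("AI is going to", **settings.as_kwargs(temperature=0.2))`.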

### class `AutoModelForCausalLM`

---

#### classmethod `AutoModelForCausalLM.from_pretrained`

```python
from_pretrained(
    model_path_or_repo_id: str,
    model_type: Optional[str] = None,
    model_file: Optional[str] = None,
    config: Optional[ctransformers.hub.AutoConfig] = None,
    lib: Optional[str] = None,
    local_files_only: bool = False,
    revision: Optional[str] = None,
    hf: bool = False,
    **kwargs
) → LLM
```

Loads the language model from a local file or remote repo.

**Args:**

- `model_path_or_repo_id`: The path to a model file or directory or the name of a Hugging Face Hub model repo.

- `model_type`: The model type.

- `model_file`: The name of the model file in repo or directory.

- `config`: `AutoConfig` object.

- `lib`: The path to a shared library or one of `avx2`, `avx`, `basic`.

- `local_files_only`: Whether or not to only look at local files (i.e., do not try to download the model).

- `revision`: The specific model version to use. It can be a branch name, a tag name, or a commit id.

- `hf`: Whether to create a Hugging Face Transformers model.

**Returns:**

`LLM` object.

### class `LLM`

#### method `LLM.__init__`

```python
__init__(
    model_path: str,
    model_type: Optional[str] = None,
    config: Optional[ctransformers.llm.Config] = None,
    lib: Optional[str] = None
)
```

Loads the language model from a local file.

**Args:**

- `model_path`: The path to a model file.

- `model_type`: The model type.

- `config`: `Config` object.

- `lib`: The path to a shared library or one of `avx2`, `avx`, `basic`.

---

##### property LLM.bos_token_id

The beginning-of-sequence token.

---

##### property LLM.config

The config object.

---

##### property LLM.context_length

The context length of the model.

---

##### property LLM.embeddings

The input embeddings.

---

##### property LLM.eos_token_id

The end-of-sequence token.

---

##### property LLM.logits

The unnormalized log probabilities.

---

##### property LLM.model_path

The path to the model file.

---

##### property LLM.model_type

The model type.

---

##### property LLM.pad_token_id

The padding token.

---

##### property LLM.vocab_size

The number of tokens in the vocabulary.

---

#### method `LLM.detokenize`

```python
detokenize(tokens: Sequence[int], decode: bool = True) → Union[str, bytes]
```

Converts a list of tokens to text.

**Args:**

- `tokens`: The list of tokens.

- `decode`: Whether to decode the text as a UTF-8 string.

**Returns:**

The combined text of all tokens.

---

#### method `LLM.embed`

```python
embed(
    input: Union[str, Sequence[int]],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → List[float]
```

Computes embeddings for a text or list of tokens.

> **Note:** Currently only LLaMA and Falcon models support embeddings.

**Args:**

- `input`: The input text or list of tokens to get embeddings for.

- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`

- `threads`: The number of threads to use for evaluating tokens. Default: `-1`

**Returns:**

The input embeddings.
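A common use of embeddings is comparing two texts. The sketch below computes cosine similarity between two vectors of floats, which is the shape `embed` is documented to return. The function name `cosine_similarity` is hypothetical and the demo uses hand-written vectors rather than real model output.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Placeholder vectors; with a loaded model these would come from llm.embed(...).
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

With a model that supports embeddings, this could be called as `cosine_similarity(llm.embed("first text"), llm.embed("second text"))`.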

---

#### method `LLM.eval`

```python
eval(
    tokens: Sequence[int],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → None
```

Evaluates a list of tokens.

**Args:**

- `tokens`: The list of tokens to evaluate.

- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`

- `threads`: The number of threads to use for evaluating tokens. Default: `-1`

---

#### method `LLM.generate`

```python
generate(
    tokens: Sequence[int],
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    reset: Optional[bool] = None
) → Generator[int, NoneType, NoneType]
```

Generates new tokens from a list of tokens.

**Args:**

- `tokens`: The list of tokens to generate tokens from.

- `top_k`: The top-k value to use for sampling. Default: `40`

- `top_p`: The top-p value to use for sampling. Default: `0.95`

- `temperature`: The temperature to use for sampling. Default: `0.8`

- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`

- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`

- `seed`: The seed value to use for sampling tokens. Default: `-1`

- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`

- `threads`: The number of threads to use for evaluating tokens. Default: `-1`

- `reset`: Whether to reset the model state before generating text. Default: `True`

**Returns:**

The generated tokens.
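The token-level methods can be combined into a manual generation loop. The sketch below uses only `tokenize`, `generate`, `is_eos_token` and `detokenize` as documented in this README. The helper `generate_text` and the stand-in class `FakeLLM` are hypothetical; `FakeLLM` exists so the example runs without a model file. The explicit end-of-sequence check is defensive, since the README does not say whether `generate` stops on its own.

```python
from typing import Iterator, List, Sequence


def generate_text(llm, prompt: str, max_new_tokens: int = 32) -> str:
    """Tokenize `prompt`, generate up to `max_new_tokens` tokens, and return the new text."""
    tokens = llm.tokenize(prompt)
    pieces = []
    for count, token in enumerate(llm.generate(tokens), start=1):
        if llm.is_eos_token(token):
            break
        # detokenize is documented to take a sequence of tokens.
        pieces.append(llm.detokenize([token]))
        if count >= max_new_tokens:
            break
    return "".join(pieces)


class FakeLLM:
    """Hypothetical stand-in: one token per character, token 0 marks end of sequence."""

    EOS = 0

    def tokenize(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]

    def detokenize(self, tokens: Sequence[int]) -> str:
        return "".join(chr(t) for t in tokens)

    def is_eos_token(self, token: int) -> bool:
        return token == self.EOS

    def generate(self, tokens: Sequence[int]) -> Iterator[int]:
        yield from self.tokenize(" help")
        yield self.EOS


if __name__ == "__main__":
    print(repr(generate_text(FakeLLM(), "AI is going to")))
```

Decoding one token at a time may split multi-byte characters with some tokenizers; collecting tokens and calling `detokenize` once on the whole list avoids that at the cost of incremental output.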

---

#### method `LLM.is_eos_token`

```python
is_eos_token(token: int) → bool
```

Checks if a token is an end-of-sequence token.

**Args:**

- `token`: The token to check.

**Returns:**

`True` if the token is an end-of-sequence token else `False`.

---

#### method `LLM.prepare_inputs_for_generation`

```python
prepare_inputs_for_generation(
    tokens: Sequence[int],
    reset: Optional[bool] = None
) → Sequence[int]
```

Removes input tokens that are evaluated in the past and updates the LLM context.

**Args:**

- `tokens`: The list of input tokens.

- `reset`: Whether to reset the model state before generating text. Default: `True`

**Returns:**

The list of tokens to evaluate.
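The idea of skipping tokens that were already evaluated can be illustrated with a prefix comparison. This is a conceptual sketch only, not the library's implementation: `split_cached_prefix` is a hypothetical name, and keeping at least one token to evaluate is a design choice of this sketch so that a fresh evaluation always happens.

```python
from typing import List, Sequence, Tuple


def split_cached_prefix(past: Sequence[int], tokens: Sequence[int]) -> Tuple[int, List[int]]:
    """Return (count of reusable past tokens, tokens that still need evaluation)."""
    shared = 0
    for old, new in zip(past, tokens):
        if old != new:
            break
        shared += 1
    # Sketch-level choice: if the whole input was seen before, re-evaluate the last token.
    if shared == len(tokens) and shared > 0:
        shared -= 1
    return shared, list(tokens[shared:])


if __name__ == "__main__":
    past_tokens = [1, 2, 3]
    print(split_cached_prefix(past_tokens, [1, 2, 3, 4, 5]))  # shares the first three tokens
    print(split_cached_prefix(past_tokens, [1, 9]))  # diverges after the first token
```

In a chat-style loop where each prompt extends the previous one, only the newly appended tokens would need evaluation under this scheme.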

---

#### method `LLM.reset`

```python
reset() → None
```

Deprecated since 0.2.27.

---

#### method `LLM.sample`

```python
sample(
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None
) → int
```

Samples a token from the model.

**Args:**

- `top_k`: The top-k value to use for sampling. Default: `40`

- `top_p`: The top-p value to use for sampling. Default: `0.95`

- `temperature`: The temperature to use for sampling. Default: `0.8`

- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`

- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`

- `seed`: The seed value to use for sampling tokens. Default: `-1`

**Returns:**

The sampled token.
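To make the sampling parameters concrete, here is a pure-Python sketch of how `temperature`, `top_k` and `top_p` are commonly combined when picking a token from logits. It is an illustration of the general technique, not ctransformers' implementation (which lives in C/C++), and it leaves out the repetition penalty. The function name `sample_token` and the example logits are hypothetical.

```python
import math
import random
from typing import Optional, Sequence


def sample_token(
    logits: Sequence[float],
    top_k: int = 40,
    top_p: float = 0.95,
    temperature: float = 0.8,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick a token index using temperature scaling, then top-k, then top-p filtering."""
    if not logits:
        raise ValueError("logits must not be empty")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()

    # Temperature: lower values sharpen the distribution, higher values flatten it.
    scaled = [value / temperature for value in logits]

    # Top-k: keep only the k highest-scoring candidates.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Softmax over the remaining candidates (shifted by the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest set of candidates whose cumulative probability reaches top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for index, prob in zip(order, probs):
        kept_ids.append(index)
        kept_probs.append(prob)
        cumulative += prob
        if cumulative >= top_p:
            break

    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]


if __name__ == "__main__":
    example_logits = [0.1, 2.0, -1.0, 1.5]
    print(sample_token(example_logits, rng=random.Random(0)))
```

Setting `top_k=1` reduces this sketch to greedy decoding, because only the highest-scoring token survives the filter.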

---

#### method `LLM.tokenize`

```python
tokenize(text: str, add_bos_token: Optional[bool] = None) → List[int]
```

Converts text into a list of tokens.

**Args:**

- `text`: The text to tokenize.

- `add_bos_token`: Whether to add the beginning-of-sequence token.

**Returns:**

The list of tokens.
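Token counts are useful for checking whether a prompt leaves room for the requested output. The sketch below relies on `tokenize` and the `context_length` property documented in this README. The helper `fits_context` and the stand-in class `WhitespaceLLM` are hypothetical, and treating a non-positive `context_length` as "unknown" is an assumption of this sketch.

```python
from typing import List


def fits_context(llm, prompt: str, max_new_tokens: int = 256) -> bool:
    """Return True if the prompt plus `max_new_tokens` fits in the model's context."""
    limit = llm.context_length
    if limit is None or limit <= 0:
        # Assumption: a non-positive value means the limit is not known, so nothing to check.
        return True
    return len(llm.tokenize(prompt)) + max_new_tokens <= limit


class WhitespaceLLM:
    """Hypothetical stand-in: one token per whitespace-separated word."""

    context_length = 8

    def tokenize(self, text: str) -> List[int]:
        return list(range(len(text.split())))


if __name__ == "__main__":
    fake = WhitespaceLLM()
    print(fits_context(fake, "AI is going to", max_new_tokens=4))
    print(fits_context(fake, "AI is going to change everything", max_new_tokens=4))
```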

---

#### method `LLM.__call__`

```python
__call__(
    prompt: str,
    max_new_tokens: Optional[int] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    stop: Optional[Sequence[str]] = None,
    stream: Optional[bool] = None,
    reset: Optional[bool] = None
) → Union[str, Generator[str, NoneType, NoneType]]
```

Generates text from a prompt.

**Args:**

- `prompt`: The prompt to generate text from.

- `max_new_tokens`: The maximum number of new tokens to generate. Default: `256`

- `top_k`: The top-k value to use for sampling. Default: `40`

- `top_p`: The top-p value to use for sampling. Default: `0.95`

- `temperature`: The temperature to use for sampling. Default: `0.8`

- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`

- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`

- `seed`: The seed value to use for sampling tokens. Default: `-1`

- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`

- `threads`: The number of threads to use for evaluating tokens. Default: `-1`

- `stop`: A list of sequences to stop generation when encountered. Default: `None`

- `stream`: Whether to stream the generated text. Default: `False`

- `reset`: Whether to reset the model state before generating text. Default: `True`

**Returns:**

The generated text.
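The `stop` parameter is handled by the library itself. To illustrate the idea, the sketch below applies stop sequences on the client side to any iterator of text chunks, such as the one returned with `stream=True`. It is not the library's implementation: `stop_at` is a hypothetical name, and excluding the stop text from the output is an assumption of this sketch. A short tail of the buffer is held back because a stop sequence may be split across chunks.

```python
from typing import Iterable, Iterator, Sequence


def stop_at(chunks: Iterable[str], stop: Sequence[str]) -> Iterator[str]:
    """Yield text from `chunks` until any stop sequence appears; stop text is not emitted."""
    stop = [s for s in stop if s]  # ignore empty stop strings
    # Hold back enough characters to catch a stop sequence that spans two chunks.
    hold = max((len(s) for s in stop), default=1) - 1
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        hits = [buffer.find(s) for s in stop if s in buffer]
        if hits:
            head = buffer[: min(hits)]
            if head:
                yield head
            return
        safe = len(buffer) - hold
        if safe > 0:
            yield buffer[:safe]
            buffer = buffer[safe:]
    if buffer:
        yield buffer


if __name__ == "__main__":
    pieces = ["Hello", " wor", "ld\nUs", "er: hi"]
    print(repr("".join(stop_at(pieces, ["\nUser:"]))))
```

With a loaded model, this could wrap a stream as `stop_at(llm(prompt, stream=True), ["\nUser:"])`, although passing `stop=[...]` directly to the call is the documented route.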

## License

[MIT](https://github.com/marella/ctransformers/blob/main/LICENSE)