Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/marella/ctransformers
Python bindings for the Transformer models implemented in C/C++ using GGML library.
- Host: GitHub
- URL: https://github.com/marella/ctransformers
- Owner: marella
- License: mit
- Created: 2023-05-14T22:06:24.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-01-28T07:37:48.000Z (12 months ago)
- Last Synced: 2024-10-29T15:32:54.980Z (3 months ago)
- Topics: ai, ctransformers, llm, transformers
- Language: C
- Homepage:
- Size: 62.4 MB
- Stars: 1,809
- Watchers: 19
- Forks: 137
- Open Issues: 110
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-local-ai - CTransformers - Python bindings for the Transformer models implemented in C/C++ using GGML library | GGML/GPTQ | Both | ❌ | C/C++ | Text-Gen | (Inference Engine)
README
# [CTransformers](https://github.com/marella/ctransformers) [![PyPI](https://img.shields.io/pypi/v/ctransformers)](https://pypi.org/project/ctransformers/) [![tests](https://github.com/marella/ctransformers/actions/workflows/tests.yml/badge.svg)](https://github.com/marella/ctransformers/actions/workflows/tests.yml) [![build](https://github.com/marella/ctransformers/actions/workflows/build.yml/badge.svg)](https://github.com/marella/ctransformers/actions/workflows/build.yml)
Python bindings for the Transformer models implemented in C/C++ using [GGML](https://github.com/ggerganov/ggml) library.
> Also see [ChatDocs](https://github.com/marella/chatdocs)
- [Supported Models](#supported-models)
- [Installation](#installation)
- [Usage](#usage)
- [🤗 Transformers](#transformers)
- [LangChain](#langchain)
- [GPU](#gpu)
- [GPTQ](#gptq)
- [Documentation](#documentation)
- [License](#license)

## Supported Models
| Models | Model Type | CUDA | Metal |
| :------------------ | ------------- | :--: | :---: |
| GPT-2 | `gpt2` | | |
| GPT-J, GPT4All-J | `gptj` | | |
| GPT-NeoX, StableLM | `gpt_neox` | | |
| Falcon | `falcon` | ✅ | |
| LLaMA, LLaMA 2 | `llama` | ✅ | ✅ |
| MPT | `mpt` | ✅ | |
| StarCoder, StarChat | `gpt_bigcode` | ✅ | |
| Dolly V2 | `dolly-v2` | | |
| Replit              | `replit`      |      |       |

## Installation
```sh
pip install ctransformers
```

## Usage
It provides a unified interface for all models:
```py
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")
print(llm("AI is going to"))
```

[Run in Google Colab](https://colab.research.google.com/drive/1GMhYMUAv_TyZkpfvUI1NirM8-9mCXQyL)
To stream the output, set `stream=True`:
```py
for text in llm("AI is going to", stream=True):
print(text, end="", flush=True)
```

You can load models from Hugging Face Hub directly:
```py
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")
```

If a model repo has multiple model files (`.bin` or `.gguf` files), specify a model file using:
```py
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")
```

### 🤗 Transformers
> **Note:** This is an experimental feature and may change in the future.
To use it with 🤗 Transformers, create model and tokenizer using:
```py
from ctransformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)
```

[Run in Google Colab](https://colab.research.google.com/drive/1FVSLfTJ2iBbQ1oU2Rqz0MkpJbaB_5Got)
You can use the 🤗 Transformers text generation pipeline:
```py
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256))
```

You can use 🤗 Transformers generation [parameters](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig):
```py
pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)
```

You can use 🤗 Transformers tokenizers:
```py
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)  # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2") # Load tokenizer from original model repo.
```

### LangChain
It is integrated into LangChain. See [LangChain docs](https://python.langchain.com/docs/ecosystem/integrations/ctransformers).
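A minimal sketch based on the linked integration docs; in newer LangChain releases the `CTransformers` class may live in `langchain_community.llms` instead:

```py
from langchain.llms import CTransformers

llm = CTransformers(model="marella/gpt-2-ggml")

# In newer LangChain versions, prefer llm.invoke("AI is going to").
print(llm("AI is going to"))
```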
### GPU
To run some of the model layers on GPU, set the `gpu_layers` parameter:
```py
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)
```

[Run in Google Colab](https://colab.research.google.com/drive/1Ihn7iPCYiqlTotpkqa1tOhUIpJBrJ1Tp)
#### CUDA
Install CUDA libraries using:
```sh
pip install ctransformers[cuda]
```

#### ROCm
To enable ROCm support, install the `ctransformers` package using:
```sh
CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers
```

#### Metal
To enable Metal support, install the `ctransformers` package using:
```sh
CT_METAL=1 pip install ctransformers --no-binary ctransformers
```

### GPTQ
> **Note:** This is an experimental feature and only LLaMA models are supported using [ExLlama](https://github.com/turboderp/exllama).
Install additional dependencies using:
```sh
pip install ctransformers[gptq]
```

Load a GPTQ model using:
```py
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
```

[Run in Google Colab](https://colab.research.google.com/drive/1SzHslJ4CiycMOgrppqecj4VYCWFnyrN0)
> If the model name or path doesn't contain the word `gptq`, specify `model_type="gptq"`.
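For example (the repo id below is a placeholder for any GPTQ repo whose name does not include `gptq`):

```py
# Placeholder repo id; replace with the actual GPTQ model repo.
llm = AutoModelForCausalLM.from_pretrained("username/llama-2-7b-quantized", model_type="gptq")
```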
It can also be used with LangChain. Low-level APIs are not fully supported.
## Documentation
### Config
| Parameter | Type | Description | Default |
| :------------------- | :---------- | :-------------------------------------------------------------- | :------ |
| `top_k` | `int` | The top-k value to use for sampling. | `40` |
| `top_p` | `float` | The top-p value to use for sampling. | `0.95` |
| `temperature` | `float` | The temperature to use for sampling. | `0.8` |
| `repetition_penalty` | `float` | The repetition penalty to use for sampling. | `1.1` |
| `last_n_tokens` | `int` | The number of last tokens to use for repetition penalty. | `64` |
| `seed` | `int` | The seed value to use for sampling tokens. | `-1` |
| `max_new_tokens` | `int` | The maximum number of new tokens to generate. | `256` |
| `stop` | `List[str]` | A list of sequences to stop generation when encountered. | `None` |
| `stream` | `bool` | Whether to stream the generated text. | `False` |
| `reset` | `bool` | Whether to reset the model state before generating text. | `True` |
| `batch_size` | `int` | The batch size to use for evaluating tokens in a single prompt. | `8` |
| `threads` | `int` | The number of threads to use for evaluating tokens. | `-1` |
| `context_length` | `int` | The maximum context length to use. | `-1` |
| `gpu_layers`         | `int`       | The number of layers to run on GPU.                              | `0`     |

> **Note:** Currently only LLaMA, MPT and Falcon models support the `context_length` parameter.
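Config values can be passed as keyword arguments to `from_pretrained()` (as with `gpu_layers` in the GPU example above), and generation options can be passed per call to the model. A minimal sketch; the model path is a placeholder:

```py
from ctransformers import AutoModelForCausalLM

# Config options such as `context_length` and `gpu_layers` are accepted as
# keyword arguments when loading the model (placeholder path).
llm = AutoModelForCausalLM.from_pretrained(
    "/path/to/ggml-model.bin",
    model_type="llama",
    context_length=2048,
    gpu_layers=50,
)

# Generation parameters can be overridden for a single call.
print(llm("AI is going to", max_new_tokens=128, temperature=0.7, stop=["\n\n"]))
```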
### class `AutoModelForCausalLM`
---
#### classmethod `AutoModelForCausalLM.from_pretrained`
```python
from_pretrained(
model_path_or_repo_id: str,
model_type: Optional[str] = None,
model_file: Optional[str] = None,
config: Optional[ctransformers.hub.AutoConfig] = None,
lib: Optional[str] = None,
local_files_only: bool = False,
revision: Optional[str] = None,
hf: bool = False,
**kwargs
) → LLM
```

Loads the language model from a local file or remote repo.
**Args:**
- `model_path_or_repo_id`: The path to a model file or directory or the name of a Hugging Face Hub model repo.
- `model_type`: The model type.
- `model_file`: The name of the model file in repo or directory.
- `config`: `AutoConfig` object.
- `lib`: The path to a shared library or one of `avx2`, `avx`, `basic`.
- `local_files_only`: Whether or not to only look at local files (i.e., do not try to download the model).
- `revision`: The specific model version to use. It can be a branch name, a tag name, or a commit id.
- `hf`: Whether to create a Hugging Face Transformers model.

**Returns:**
`LLM` object.

### class `LLM`
### method `LLM.__init__`
```python
__init__(
model_path: str,
model_type: Optional[str] = None,
config: Optional[ctransformers.llm.Config] = None,
lib: Optional[str] = None
)
```

Loads the language model from a local file.
**Args:**
- `model_path`: The path to a model file.
- `model_type`: The model type.
- `config`: `Config` object.
- `lib`: The path to a shared library or one of `avx2`, `avx`, `basic`.

---
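For illustration, `LLM` can also be constructed directly from a local model file, a minimal sketch where the path is a placeholder:

```py
from ctransformers import LLM

# Load a GGML model file directly with the low-level class.
llm = LLM(model_path="/path/to/ggml-model.bin", model_type="gpt2")
print(llm("AI is going to"))
```

---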
##### property LLM.bos_token_id
The beginning-of-sequence token.
---
##### property LLM.config
The config object.
---
##### property LLM.context_length
The context length of model.
---
##### property LLM.embeddings
The input embeddings.
---
##### property LLM.eos_token_id
The end-of-sequence token.
---
##### property LLM.logits
The unnormalized log probabilities.
---
##### property LLM.model_path
The path to the model file.
---
##### property LLM.model_type
The model type.
---
##### property LLM.pad_token_id
The padding token.
---
##### property LLM.vocab_size
The number of tokens in vocabulary.
---
#### method `LLM.detokenize`
```python
detokenize(tokens: Sequence[int], decode: bool = True) → Union[str, bytes]
```

Converts a list of tokens to text.
**Args:**
- `tokens`: The list of tokens.
- `decode`: Whether to decode the text as UTF-8 string.

**Returns:**
The combined text of all tokens.

---
#### method `LLM.embed`
```python
embed(
input: Union[str, Sequence[int]],
batch_size: Optional[int] = None,
threads: Optional[int] = None
) → List[float]
```

Computes embeddings for a text or list of tokens.
> **Note:** Currently only LLaMA and Falcon models support embeddings.
**Args:**
- `input`: The input text or list of tokens to get embeddings for.
- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`

**Returns:**
The input embeddings.

---
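A minimal sketch of computing embeddings, assuming a LLaMA-family GGML model; the path is a placeholder:

```py
from ctransformers import AutoModelForCausalLM

# Per the note above, only LLaMA and Falcon models support embeddings.
llm = AutoModelForCausalLM.from_pretrained("/path/to/llama-ggml-model.bin", model_type="llama")

embeddings = llm.embed("Hello, world!")
print(len(embeddings))  # Number of embedding dimensions.
```

---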
#### method `LLM.eval`
```python
eval(
tokens: Sequence[int],
batch_size: Optional[int] = None,
threads: Optional[int] = None
) → None
```

Evaluates a list of tokens.
**Args:**
- `tokens`: The list of tokens to evaluate.
- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`

---
#### method `LLM.generate`
```python
generate(
tokens: Sequence[int],
top_k: Optional[int] = None,
top_p: Optional[float] = None,
temperature: Optional[float] = None,
repetition_penalty: Optional[float] = None,
last_n_tokens: Optional[int] = None,
seed: Optional[int] = None,
batch_size: Optional[int] = None,
threads: Optional[int] = None,
reset: Optional[bool] = None
) → Generator[int, NoneType, NoneType]
```

Generates new tokens from a list of tokens.
**Args:**
- `tokens`: The list of tokens to generate tokens from.
- `top_k`: The top-k value to use for sampling. Default: `40`
- `top_p`: The top-p value to use for sampling. Default: `0.95`
- `temperature`: The temperature to use for sampling. Default: `0.8`
- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`
- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`
- `seed`: The seed value to use for sampling tokens. Default: `-1`
- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`
- `reset`: Whether to reset the model state before generating text. Default: `True`

**Returns:**
The generated tokens.

---
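A minimal sketch combining `tokenize`, `generate`, `detokenize` and `is_eos_token` into a generation loop, assuming `llm` was loaded with `from_pretrained()` as shown above:

```py
tokens = llm.tokenize("AI is going to")

# Generate up to 64 new tokens, stopping early at the end-of-sequence token.
count = 0
for token in llm.generate(tokens):
    if llm.is_eos_token(token) or count >= 64:
        break
    print(llm.detokenize([token]), end="", flush=True)
    count += 1
```

---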
#### method `LLM.is_eos_token`
```python
is_eos_token(token: int) → bool
```

Checks if a token is an end-of-sequence token.
**Args:**
- `token`: The token to check.
**Returns:**
`True` if the token is an end-of-sequence token else `False`.

---
#### method `LLM.prepare_inputs_for_generation`
```python
prepare_inputs_for_generation(
tokens: Sequence[int],
reset: Optional[bool] = None
) → Sequence[int]
```

Removes input tokens that are evaluated in the past and updates the LLM context.
**Args:**
- `tokens`: The list of input tokens.
- `reset`: Whether to reset the model state before generating text. Default: `True`

**Returns:**
The list of tokens to evaluate.

---
#### method `LLM.reset`
```python
reset() → None
```

Deprecated since 0.2.27.
---
#### method `LLM.sample`
```python
sample(
top_k: Optional[int] = None,
top_p: Optional[float] = None,
temperature: Optional[float] = None,
repetition_penalty: Optional[float] = None,
last_n_tokens: Optional[int] = None,
seed: Optional[int] = None
) → int
```

Samples a token from the model.
**Args:**
- `top_k`: The top-k value to use for sampling. Default: `40`
- `top_p`: The top-p value to use for sampling. Default: `0.95`
- `temperature`: The temperature to use for sampling. Default: `0.8`
- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`
- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`
- `seed`: The seed value to use for sampling tokens. Default: `-1`

**Returns:**
The sampled token.

---
#### method `LLM.tokenize`
```python
tokenize(text: str, add_bos_token: Optional[bool] = None) → List[int]
```

Converts a text into a list of tokens.
**Args:**
- `text`: The text to tokenize.
- `add_bos_token`: Whether to add the beginning-of-sequence token.

**Returns:**
The list of tokens.

---
#### method `LLM.__call__`
```python
__call__(
prompt: str,
max_new_tokens: Optional[int] = None,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
temperature: Optional[float] = None,
repetition_penalty: Optional[float] = None,
last_n_tokens: Optional[int] = None,
seed: Optional[int] = None,
batch_size: Optional[int] = None,
threads: Optional[int] = None,
stop: Optional[Sequence[str]] = None,
stream: Optional[bool] = None,
reset: Optional[bool] = None
) → Union[str, Generator[str, NoneType, NoneType]]
```

Generates text from a prompt.
**Args:**
- `prompt`: The prompt to generate text from.
- `max_new_tokens`: The maximum number of new tokens to generate. Default: `256`
- `top_k`: The top-k value to use for sampling. Default: `40`
- `top_p`: The top-p value to use for sampling. Default: `0.95`
- `temperature`: The temperature to use for sampling. Default: `0.8`
- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`
- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`
- `seed`: The seed value to use for sampling tokens. Default: `-1`
- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`
- `stop`: A list of sequences to stop generation when encountered. Default: `None`
- `stream`: Whether to stream the generated text. Default: `False`
- `reset`: Whether to reset the model state before generating text. Default: `True`

**Returns:**
The generated text.

## License
[MIT](https://github.com/marella/ctransformers/blob/main/LICENSE)