https://github.com/kelvich/pg_tiktoken
tiktoken tokenizer for postgres
- Host: GitHub
- URL: https://github.com/kelvich/pg_tiktoken
- Owner: kelvich
- License: apache-2.0
- Created: 2023-03-08T21:20:34.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-10-29T15:40:46.000Z (7 months ago)
- Last Synced: 2025-03-23T18:54:17.030Z (3 months ago)
- Language: Rust
- Size: 12.7 KB
- Stars: 44
- Watchers: 2
- Forks: 5
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# pg_tiktoken
Postgres extension that does input tokenization using OpenAI's tiktoken.
## Usage
```sql
db=> create extension pg_tiktoken;
CREATE EXTENSION
db=> select tiktoken_count('p50k_edit', 'A long time ago in a galaxy far, far away');
tiktoken_count
----------------
11
(1 row)

db=> select tiktoken_encode('cl100k_base', 'A long time ago in a galaxy far, far away');
tiktoken_encode
----------------------------------------------------
{32,1317,892,4227,304,264,34261,3117,11,3117,3201}
(1 row)
```

## Supported models
| Encoding name           | OpenAI models                                                |
|-------------------------|--------------------------------------------------------------|
| `cl100k_base`           | ChatGPT models, `text-embedding-ada-002`                     |
| `p50k_base`             | Code models, `text-davinci-002`, `text-davinci-003`          |
| `p50k_edit`             | Edit models like `text-davinci-edit-001`, `code-davinci-edit-001` |
| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci`                                  |

The `tiktoken_count` and `tiktoken_encode` functions accept either an encoding name or an OpenAI model name as the first argument.
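
As a rough sketch of passing a model name instead of an encoding name, the queries below use `text-embedding-ada-002`, which the table above maps to `cl100k_base`. The `documents` table, its columns, its sample rows, and the 8000-token budget are hypothetical and only there for illustration; they are not part of the extension.

```sql
-- Pass an OpenAI model name where an encoding name would normally go.
-- Per the table above, this model should resolve to cl100k_base.
select tiktoken_count('text-embedding-ada-002', 'A long time ago in a galaxy far, far away');
select tiktoken_encode('text-embedding-ada-002', 'A long time ago in a galaxy far, far away');

-- Hypothetical table used only for this example.
create table documents (id bigint primary key, body text);
insert into documents values
    (1, 'A long time ago in a galaxy far, far away'),
    (2, 'It is a period of civil war.');

-- Per-row token counts, keeping only rows above an illustrative budget.
select id, tokens
from (
    select id, tiktoken_count('text-embedding-ada-002', body) as tokens
    from documents
) counted
where tokens > 8000;
```

Computing the count once in a subquery avoids tokenizing each row twice when the same value is both selected and filtered on.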
## Installation
Assuming the Rust toolchain is already installed:
```sh
# install pgrx
cargo install --locked cargo-pgrx
cargo pgrx init
# build and install pg_tiktoken
git clone https://github.com/kelvich/pg_tiktoken
cd pg_tiktoken
cargo pgrx install
```

## Kudos
- https://github.com/zurawiki/tiktoken-rs
- https://github.com/openai/tiktoken