https://github.com/helena-intel/test-prompt-generator
Create prompts with a given token length for testing LLMs and other transformers text models.
- Host: GitHub
- URL: https://github.com/helena-intel/test-prompt-generator
- Owner: helena-intel
- Created: 2024-02-26T10:33:18.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-29T20:46:28.000Z (4 months ago)
- Last Synced: 2025-01-29T21:29:19.435Z (4 months ago)
- Topics: benchmarking, llm, llm-inference, nlp, tokenizers, transformers
- Language: Python
- Homepage:
- Size: 172 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
# Test Prompt Generator
Create prompts with a given token length for testing LLMs and other transformers text models.
Pre-created prompts for popular model architectures are provided in .jsonl files in the [prompts](./prompts) directory.
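If the pre-created prompts are all you need, you can load them directly with standard .jsonl parsing. A minimal sketch; the file name and the `prompt` field name below are illustrative assumptions, not confirmed by this README (check the actual files in [prompts](./prompts) for the exact schema):

```python
import json

# Read one of the pre-created prompt files.
# NOTE: "prompts/llama.jsonl" is an assumed file name for illustration.
with open("prompts/llama.jsonl") as f:
    records = [json.loads(line) for line in f]

# Each line is one JSON record; the "prompt" key is an assumed field name.
print(records[0].get("prompt"))
```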
To generate one or a few prompts, or to test the functionality, you can use the
[Test Prompt Generator](https://huggingface.co/spaces/helenai/test-prompt-generator) Space on Hugging Face.

## Install
```shell
pip install git+https://github.com/helena-intel/test-prompt-generator.git transformers
```

Some tokenizers may require additional dependencies, for example `sentencepiece` or `protobuf`.
## Usage
Specify a tokenizer, and the number of tokens the prompt should have. A prompt will be returned that, when tokenized with
the given tokenizer, contains the requested number of tokens.

For the tokenizer, use a model_id from the Hugging Face Hub, a path to a local directory with tokenizer files, or one of the preset tokenizers:
`['bert', 'blenderbot', 'bloom', 'bloomz', 'chatglm3', 'falcon', 'gemma', 'gpt-neox', 'llama', 'magicoder', 'mistral', 'mpt', 'opt', 'phi-2', 'pythia', 'qwen', 'redpajama', 'roberta', 'starcoder', 't5', 'vicuna', 'zephyr']`. The preset tokenizers should work for most models with that architecture,
but if you want to be sure, use an exact model_id. [This list](./prompts/README.md) shows the exact tokenizers used for the presets.

Prompts are generated by truncating a given source text at the provided number of tokens. By default,
[Alice in Wonderland](https://archive.org/stream/alicesadventures19033gut/19033.txt) is used; you can also provide your own source.
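As a quick sanity check, a generated prompt can be re-tokenized and its tokens counted. A minimal sketch, assuming the `facebook/opt-2.7b` tokenizer and that the generator counts tokens the same way as a plain tokenizer call:

```python
from transformers import AutoTokenizer
from test_prompt_generator import generate_prompt

prompt = generate_prompt(tokenizer_id="facebook/opt-2.7b", num_tokens=32)

# Re-tokenize with the same tokenizer and count the tokens.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")
num_tokens = len(tokenizer(prompt)["input_ids"])
print(num_tokens)  # expected: 32 (exact count may depend on special-token handling)
```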
A prefix can optionally be prepended to the text, to create prompts like "Please summarize the following text: [text]". The
prompts are returned by the function/command line app, and can also optionally be saved to a .jsonl file.

### Python API
#### Basic usage
```python
from test_prompt_generator import generate_prompt

# use preset value for opt tokenizer
prompt = generate_prompt(tokenizer_id="opt", num_tokens=32)
# use model_id
prompt = generate_prompt(tokenizer_id="facebook/opt-2.7b", num_tokens=32)
```

#### Slightly less basic usage
Add a `source_text_file` and `prefix`. Instead of `source_text_file`, you can also pass `source_text`, a string containing the source text.
```python
from test_prompt_generator import generate_prompt

prompt = generate_prompt(
tokenizer_id="mistral",
num_tokens=32,
source_text_file="source.txt",
prefix="Please translate to Dutch:",
output_file="prompt_32.jsonl",
)
```

Use multiple token sizes. When using multiple token sizes, `output_file` is required, and the `generate_prompt` function does not return anything. The `output_file` will contain one line for each token size.

```python
prompt = generate_prompt(
tokenizer_id="mistral",
num_tokens=[32,64,128],
output_file="prompts.jsonl",
)
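# prompts.jsonl now contains one JSON line for each requested token size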
```

> NOTE: When specifying one token size, the prompt will be returned as a string, making it easy to copy and use in a test scenario
> where you need one prompt. When specifying multiple token sizes, a dictionary with the prompts will be returned. The output file
> is always in .jsonl format, regardless of the number of generated prompts.

### Command Line App
```shell
test-prompt-generator -t mistral -n 32
```

Use `test-prompt-generator --help` to see all options:
```shell
usage: test-prompt-generator [-h] -t TOKENIZER -n NUM_TOKENS [-p PREFIX] [-o OUTPUT_FILE] [--overwrite] [-v] [-f FILE]

options:
-h, --help show this help message and exit
-t TOKENIZER, --tokenizer TOKENIZER
preset tokenizer id, model_id from Hugging Face hub, or path to local directory with tokenizer files. Options for presets are: ['bert', 'bloom', 'gemma', 'chatglm3', 'falcon', 'gpt-neox',
'llama', 'magicoder', 'mistral', 'opt', 'phi-2', 'pythia', 'roberta', 'qwen', 'starcoder', 't5']
-n NUM_TOKENS, --num_tokens NUM_TOKENS
Number of tokens the generated prompt should have. To specify multiple token sizes, use e.g. `-n 16 32`
-p PREFIX, --prefix PREFIX
Optional: prefix that the prompt should start with. Example: 'Translate to Dutch:'
-o OUTPUT_FILE, --output_file OUTPUT_FILE
Optional: Path to store the prompt as .jsonl file
--overwrite Overwrite output_file if it already exists.
-v, --verbose
-f FILE, --file FILE Optional: path to text file to generate prompts from. Default text_files/alice.txt
```

## Disclaimer
This software is provided "as is" and for testing purposes only. The author makes no warranties, express or implied, regarding the software's operation, accuracy, or reliability.