https://github.com/helena-intel/test-prompt-generator
Create prompts with a given token length for testing LLMs and other transformers text models.
- Host: GitHub
- URL: https://github.com/helena-intel/test-prompt-generator
- Owner: helena-intel
- Created: 2024-02-26T10:33:18.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-29T20:46:28.000Z (4 months ago)
- Last Synced: 2025-01-29T21:29:19.435Z (4 months ago)
- Topics: benchmarking, llm, llm-inference, nlp, tokenizers, transformers
- Language: Python
- Homepage:
- Size: 172 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
# Test Prompt Generator
Create prompts with a given token length for testing LLMs and other transformers text models.
Pre-created prompts for popular model architectures are provided in .jsonl files in the [prompts](./prompts) directory.
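If the pre-created prompts are all you need, you can load them directly with standard .jsonl parsing. A minimal sketch; the file name and the `prompt` field name below are illustrative assumptions, not confirmed by this README (check the actual files in [prompts](./prompts) for the exact schema):

```python
import json

# Read one of the pre-created prompt files.
# NOTE: "prompts/llama.jsonl" is an assumed file name for illustration.
with open("prompts/llama.jsonl") as f:
    records = [json.loads(line) for line in f]

# Each line is one JSON record; the "prompt" key is an assumed field name.
print(records[0].get("prompt"))
```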
To generate one or a few prompts, or to test the functionality, you can use the
[Test Prompt Generator](https://huggingface.co/spaces/helenai/test-prompt-generator) Space on Hugging Face.

## Install
```shell
pip install git+https://github.com/helena-intel/test-prompt-generator.git transformers
```

Some tokenizers may require additional dependencies, for example `sentencepiece` or `protobuf`.
## Usage
Specify a tokenizer, and the number of tokens the prompt should have. A prompt will be returned that, when tokenized with
the given tokenizer, contains the requested number of tokens.

For the tokenizer, use a model_id from the Hugging Face Hub, a path to a local directory with tokenizer files, or one of the preset tokenizers:
`['bert', 'blenderbot', 'bloom', 'bloomz', 'chatglm3', 'falcon', 'gemma', 'gpt-neox', 'llama', 'magicoder', 'mistral', 'mpt', 'opt', 'phi-2', 'pythia', 'qwen', 'redpajama', 'roberta', 'starcoder', 't5', 'vicuna', 'zephyr']`. The preset tokenizers should work for most models with that architecture,
but if you want to be sure, use an exact model_id. [This list](./prompts/README.md) shows the exact tokenizers used for the presets.

Prompts are generated by truncating a given source text at the provided number of tokens. By default,
[Alice in Wonderland](https://archive.org/stream/alicesadventures19033gut/19033.txt) is used; you can also provide your own source.
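As a quick sanity check, a generated prompt can be re-tokenized and its tokens counted. A minimal sketch, assuming the `facebook/opt-2.7b` tokenizer and that the generator counts tokens the same way as a plain tokenizer call:

```python
from transformers import AutoTokenizer
from test_prompt_generator import generate_prompt

prompt = generate_prompt(tokenizer_id="facebook/opt-2.7b", num_tokens=32)

# Re-tokenize with the same tokenizer and count the tokens.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")
num_tokens = len(tokenizer(prompt)["input_ids"])
print(num_tokens)  # expected: 32 (exact count may depend on special-token handling)
```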
A prefix can optionally be prepended to the text, to create prompts like "Please summarize the following text: [text]". The
prompts are returned by the function/command line app, and can also optionally be saved to a .jsonl file.

### Python API
#### Basic usage
```python
from test_prompt_generator import generate_prompt

# use preset value for opt tokenizer
prompt = generate_prompt(tokenizer_id="opt", num_tokens=32)
# use model_id
prompt = generate_prompt(tokenizer_id="facebook/opt-2.7b", num_tokens=32)
```

#### Slightly less basic usage
Add a `source_text_file` and `prefix`. Instead of `source_text_file`, you can also pass `source_text`, a string containing the source text.
```python
from test_prompt_generator import generate_prompt

prompt = generate_prompt(
tokenizer_id="mistral",
num_tokens=32,
source_text_file="source.txt",
prefix="Please translate to Dutch:",
output_file="prompt_32.jsonl",
)
```

Use multiple token sizes. When using multiple token sizes, `output_file` is required, and the `generate_prompt` function does not return anything. The `output_file` will contain one line for each token size.

```python
prompt = generate_prompt(
tokenizer_id="mistral",
num_tokens=[32,64,128],
output_file="prompts.jsonl",
)
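# prompts.jsonl now contains one JSON line for each requested token size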
```

> NOTE: When specifying one token size, the prompt will be returned as a string, making it easy to copy and use in a test scenario
> where you need one prompt. When specifying multiple token sizes, a dictionary with the prompts will be returned. The output file
> is always in .jsonl format, regardless of the number of generated prompts.

### Command Line App
```shell
test-prompt-generator -t mistral -n 32
```

Use `test-prompt-generator --help` to see all options:
```shell
usage: test-prompt-generator [-h] -t TOKENIZER -n NUM_TOKENS [-p PREFIX] [-o OUTPUT_FILE] [--overwrite] [-v] [-f FILE]

options:
-h, --help show this help message and exit
-t TOKENIZER, --tokenizer TOKENIZER
preset tokenizer id, model_id from Hugging Face hub, or path to local directory with tokenizer files. Options for presets are: ['bert', 'bloom', 'gemma', 'chatglm3', 'falcon', 'gpt-neox',
'llama', 'magicoder', 'mistral', 'opt', 'phi-2', 'pythia', 'roberta', 'qwen', 'starcoder', 't5']
-n NUM_TOKENS, --num_tokens NUM_TOKENS
Number of tokens the generated prompt should have. To specify multiple token sizes, use e.g. `-n 16 32`
-p PREFIX, --prefix PREFIX
Optional: prefix that the prompt should start with. Example: 'Translate to Dutch:'
-o OUTPUT_FILE, --output_file OUTPUT_FILE
Optional: Path to store the prompt as .jsonl file
--overwrite Overwrite output_file if it already exists.
-v, --verbose
-f FILE, --file FILE Optional: path to text file to generate prompts from. Default text_files/alice.txt
```

## Disclaimer
This software is provided "as is" and for testing purposes only. The author makes no warranties, express or implied, regarding the software's operation, accuracy, or reliability.