Simple command-line tools for low-barrier reproducible experiments with [[https://huggingface.co/docs/transformers/en/model_doc/gpt2][GPT2]], [[https://huggingface.co/docs/transformers/en/model_doc/xglm][XGLM]], and [[https://huggingface.co/docs/transformers/en/model_doc/bloom][Bloom]]. Primarily intended for use in education and designed to be easy to use for non-technical users.

There are currently two tools:
1. ~llm_generate.py~: Given a text, generate the next N tokens. Annotate all tokens with their surprisal.
2. ~llm_topn.py~: Given a text, list the N most highly ranked next tokens with their surprisal.

Surprisal is the negative log probability of a token given its preceding context, i.e. -log2 P(token | context), measured in bits: the less expected a token is, the higher its surprisal.

*Key features:*
- Assumes basic familiarity with the command line, but no programming skills.
- Runs open source pre-trained LLMs locally, without the need for an API key or internet connection.
- GPT2 for English ([[https://huggingface.co/openai-community/gpt2][124M]], [[https://huggingface.co/openai-community/gpt2-large][774M]])
- GPT2 for German ([[https://huggingface.co/dbmdz/german-gpt2][137M]])
- [[https://huggingface.co/bigscience/bloom][Bloom]] for 46 languages ([[https://huggingface.co/bigscience/bloom-560m][560M]], [[https://huggingface.co/bigscience/bloom-1b7][1.7B]], [[https://huggingface.co/bigscience/bloom-3b][3B]])
- XGLM for 30 languages ([[https://huggingface.co/facebook/xglm-564M][564M]], [[https://huggingface.co/facebook/xglm-1.7B][1.7B]], [[https://huggingface.co/facebook/xglm-2.9B][2.9B]])
- Reasonably fast on CPU.
- Reproducible results via random seeds.
- Batch mode that processes multiple items in one go.
- Output is either ASCII-art bar charts for quick experiments or ~.csv~ files for easy processing in R, Python, spreadsheet editors, etc.

Developed and tested on Ubuntu Linux, but it may also work out of the box on macOS and Windows.

* Install

** Prerequisites
It is assumed that the system has a recent version of Python 3 and the pip package manager. On Ubuntu Linux, both can be installed via ~sudo apt install python3-pip~.

Then install PyTorch and Hugging Face’s ~transformers~ package, along with some supporting packages needed by some models:

#+BEGIN_SRC sh :eval no
pip3 install torch transformers tiktoken sentencepiece protobuf
#+END_SRC
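
To verify the installation, one can check that the key packages import cleanly (an optional sanity check, not part of the tools themselves):

#+BEGIN_SRC python :eval no
import torch, transformers
print(torch.__version__, transformers.__version__)
#+END_SRC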

* Supported models
The languages listed below each make up 5% or more of the respective model’s training data and are sorted in descending order of representation. For more details, refer to the model descriptions on Hugging Face.
| Model class | Models                                | Languages                                                                                       |
|-------------+---------------------------------------+-------------------------------------------------------------------------------------------------|
| GPT2        | ~gpt2~, ~gpt2-large~                  | English                                                                                         |
| GPT2        | ~german-gpt2~, ~german-gpt2-larger~   | German                                                                                          |
| Bloom       | ~bloom-560m~, ~bloom-1b7~, ~bloom-3b~ | English, Chinese, French, Code, Indic, Indonesian, Niger-Congo, Spanish, Portuguese, Vietnamese |
| XGLM        | ~xglm-564M~, ~xglm-1.7B~, ~xglm-2.9B~ | English, Russian, Chinese, German, Spanish, …                                                   |

Additional models can be added in the file ~models.py~.

Note that we do not vouch for the quality of any of these models. Do your own research to find out if they are suitable for your task.

* Use

** Generation with surprisal

*** Display help text
#+BEGIN_SRC sh :exports both :results verbatim
python3 llm_generate.py -h
#+END_SRC

#+RESULTS:
#+begin_example
usage: llm_generate.py [-h] [-m MODEL] [-n NUMBER] [-s SEED] [-t TEMPERATURE]
                       [-k TOPK] [-c] [-i INPUT] [-o OUTPUT]
                       [text]

Use GPT2 to generate tokens and calculate per-token surprisal.

positional arguments:
  text                  The string of text to be processed.

options:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        The model that should be used. One of: gpt2,
                        gpt2-large, bloom-560m, bloom-1b7, bloom-3b,
                        xglm-564M, xglm-1.7B, xglm-2.9B (default gpt2)
  -n NUMBER, --number NUMBER
                        The number of tokens to generate (default is n=0).
  -s SEED, --seed SEED  Seed used for sampling (to force reproducible
                        results)
  -t TEMPERATURE, --temperature TEMPERATURE
                        Temperature when sampling tokens (default is 1.0).
  -k TOPK, --topk TOPK  Only the top k probabilities are considered for
                        sampling the next token (default is k=50)
  -c, --csv             Output in csv format
  -i INPUT, --input INPUT
                        The path to the file from which the input should be
                        read.
  -o OUTPUT, --output OUTPUT
                        The path to the file to which the results should be
                        written (default is stdout).
#+end_example

*** Simple generation of tokens
The following command generates four additional tokens using GPT2 (the default model) and calculates the surprisal of each token.
#+BEGIN_SRC sh :exports code :eval no
python3 llm_generate.py "The key to the cabinets" -n 4
#+END_SRC

#+BEGIN_SRC sh :exports results :results output
python3 llm_generate.py "The key to the cabinets" -n 4 -s 2
#+END_SRC

#+RESULTS:
#+begin_example
Item Idx Token:    Surprisal (bits)
   1   1 The:      nan
   1   2 key:      ██████████ 10.4
   1   3 to:       ██ 2.0
   1   4 the:      ████ 3.8
   1   5 cabinets: █████████████████████ 21.0
   1   6 is:       ██ 1.5
   1   7 that:     ███ 3.3
   1   8 the:      ███ 2.5
   1   9 doors:    ████████ 7.6
#+end_example

*** Multilingual models
Generation with XGLM 564M:
#+BEGIN_SRC sh :exports code :eval no
python3 llm_generate.py "Der Polizist sagte, dass man nicht mehr ermitteln kann," -n 5 -m xglm-564M
#+END_SRC

#+BEGIN_SRC sh :exports results :results output
python3 llm_generate.py "Der Polizist sagte, dass man nicht mehr ermitteln kann," -n 5 -s 2 -m xglm-564M
#+END_SRC

#+RESULTS:
#+begin_example
Item Idx Token:       Surprisal (bits)
   1   1 :            nan
   1   2 :            █████ 4.8
   1   3 Der:         ████████████ 11.6
   1   4 Polizi:      █████████████ 13.0
   1   5 st:          0.2
   1   6 sagte:       ███████████ 10.7
   1   7 ,:           ██ 1.7
   1   8 dass:        ██ 2.0
   1   9 man:         █████ 5.5
   1  10 nicht:       █████ 4.5
   1  11 mehr:        ████ 4.2
   1  12 er:          ████████ 7.8
   1  13 mitteln:     ████ 4.1
   1  14 kann:        ███ 3.1
   1  15 ,:           █ 1.2
   1  16 da:          ████ 4.3
   1  17 nicht:       ███████ 7.1
   1  18 alle:        ██ 2.4
   1  19 Daten:       ██████ 5.7
   1  20 gespeichert: ███ 3.3
#+end_example

Note the initial special tokens (shown as empty strings above) that are generated by default when tokenizing text for XGLM. These tokens do have an impact on subsequent tokens’ surprisal values, but it’s not clear whether they can be safely dropped. Generation of these tokens can be suppressed by passing the tokenizer the optional argument ~add_special_tokens=False~.
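
For illustration, here is a minimal Python sketch of that tokenizer argument, using the ~transformers~ API directly (outside the command-line tools):

#+BEGIN_SRC python :eval no
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/xglm-564M")

# By default, the tokenizer prepends special tokens:
print(tok("Der Polizist sagte").input_ids)

# With add_special_tokens=False, they are suppressed:
print(tok("Der Polizist sagte", add_special_tokens=False).input_ids)
#+END_SRC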

Multilingual generation with Bloom 560M:
#+BEGIN_SRC sh :exports code :eval no
python3 llm_generate.py "Der Polizist sagte, dass man nicht mehr ermitteln kann," -n 5 -m bloom-560m
#+END_SRC

*** Sampling parameters
Two sampling parameters are currently supported: temperature (default 1.0) and top-k (default 50). To use different sampling parameters:

#+BEGIN_SRC sh :exports code :eval no
python3 llm_generate.py "This is a" -t 1000 -k 1
#+END_SRC

#+BEGIN_SRC sh :exports results :results output
python3 llm_generate.py "This is a" -t 1000 -k 1 -s 2
#+END_SRC

#+RESULTS:
#+begin_example
Item Idx Token:   Surprisal (bits)
   1   1 This:    nan
   1   2 is:      ████ 4.4
   1   3 a:       ███ 2.7
   1   4 very:    ████ 4.2
   1   5 good:    ████ 3.8
   1   6 example: ████ 4.2
   1   7 of:      0.4
   1   8 how:     ██ 2.2
   1   9 to:      ███ 2.7
   1  10 use:     ███ 2.9
   1  11 the:     ███ 3.2
   1  12 ":       ██████ 6.4
   1  13 I:       ████████ 7.8
#+end_example

The repetition penalty is fixed at 1.0, on the assumption that larger values are not desirable when studying the behaviour of the model. Nucleus sampling is currently not supported but could be added if needed.
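
For intuition, the following sketch shows roughly how temperature and top-k interact during sampling. It is an illustrative approximation, not the tools’ actual code; ~logits~ is assumed to be a 1-D tensor of next-token logits:

#+BEGIN_SRC python :eval no
import torch

def sample_next_token(logits, temperature=1.0, top_k=50):
    # Higher temperatures flatten the distribution, lower ones sharpen it.
    scaled = logits / temperature
    # Keep only the k most likely tokens and renormalize.
    values, indices = torch.topk(scaled, top_k)
    probs = torch.softmax(values, dim=-1)
    # Sample one token id from the truncated distribution.
    choice = torch.multinomial(probs, num_samples=1)
    return indices[choice]
#+END_SRC

This also explains the example above: with ~-k 1~ only the single most likely token survives, so even an extreme temperature like 1000 cannot change the outcome.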

*** Output in CSV format
CSV format in shell output can be obtained with the ~-c~ option:

#+BEGIN_SRC sh :exports code :eval no
python3 llm_generate.py "The key to the cabinets" -n 4 -c
#+END_SRC

#+BEGIN_SRC sh :exports results :results output
python3 llm_generate.py "The key to the cabinets" -n 4 -c -s 2
#+END_SRC

#+RESULTS:
#+begin_example
item,idx,token,surprisal
1,1,The,nan
1,2,key,10.35491943359375
1,3,to,2.019094467163086
1,4,the,3.7583045959472656
1,5,cabinets,21.04239845275879
1,6,is,1.5308449268341064
1,7,that,3.2748565673828125
1,8,the,2.5106589794158936
1,9,doors,7.590230464935303
#+end_example

*** Store results in a ~.csv~ file
To store results in a ~.csv~ file which can be easily loaded in R, Excel, Google Sheets, and similar:
#+BEGIN_SRC sh :eval no
python3 llm_generate.py "The key to the cabinets" -n 4 -o output.csv
#+END_SRC

When writing results to a file, there’s no need to specify ~-c~; CSV format is used by default.
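
The resulting file can then be processed with standard tools. For example, a minimal Python sketch using pandas (which must be installed separately, e.g. via ~pip3 install pandas~):

#+BEGIN_SRC python :eval no
import pandas as pd

# Columns: item, idx, token, surprisal
df = pd.read_csv("output.csv")

# For example, compute the mean surprisal per item:
print(df.groupby("item")["surprisal"].mean())
#+END_SRC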

*** Reproducible generation
To obtain reproducible results (i.e. the same random draws on every run), the ~-s~ option can be used to set a random seed:
#+BEGIN_SRC sh :eval no
python3 llm_generate.py "The key to the cabinets" -n 4 -s 1
#+END_SRC

*** Batch mode generation
To process multiple items in batch mode, create a ~.csv~ file following this example:

#+BEGIN_SRC sh :exports results :results output
cat input_generate.csv
#+END_SRC

#+RESULTS:
: item,text,n
: 1,John saw the man who the card catalog had confused a great deal.,0
: 2,No head injury is too trivial to be ignored.,0
: 3,The key to the cabinets were on the table.,0
: 4,How many animals of each kind did Moses take on the ark?,0
: 5,The horse raced past the barn fell.,0
: 6,The first thing the new president will do is,10

Columns:
1. Item number
2. Text
3. Number of additional tokens that should be generated

Then run:
#+BEGIN_SRC sh :exports code :eval no
python3 llm_generate.py -i input_generate.csv -o output_generate.csv
#+END_SRC

#+BEGIN_SRC sh :exports none
python3 llm_generate.py -i input_generate.csv -o output_generate.csv -s 1
#+END_SRC

Result:

#+BEGIN_SRC sh :exports results
cat output_generate.csv
#+END_SRC

| item | idx | token | surprisal |
|------+-----+-------+-----------|
| 1 | 1 | John | nan |
| 1 | 2 | saw | 12.686095237731934 |
| 1 | 3 | the | 2.5510218143463135 |
| 1 | 4 | man | 6.69647216796875 |
| 1 | 5 | who | 4.4374775886535645 |
| 1 | 6 | the | 9.218789100646973 |
| 1 | 7 | card | 12.91416072845459 |
| 1 | 8 | catalog | 13.132523536682129 |
| 1 | 9 | had | 5.045916557312012 |
| 1 | 10 | confused | 12.417732238769531 |
| 1 | 11 | a | 8.445308685302734 |
| 1 | 12 | great | 8.923978805541992 |
| 1 | 13 | deal | 0.5196788311004639 |
| 1 | 14 | . | 2.855055093765259 |
| 2 | 1 | No | nan |
| 2 | 2 | head | 12.043790817260742 |
| 2 | 3 | injury | 7.169843673706055 |
| 2 | 4 | is | 3.976238965988159 |
| 2 | 5 | too | 6.11444616317749 |
| 2 | 6 | trivial | 10.36826229095459 |
| 2 | 7 | to | 1.1925396919250488 |
| 2 | 8 | be | 3.6252267360687256 |
| 2 | 9 | ignored | 5.360403060913086 |
| 2 | 10 | . | 1.3230934143066406 |
| 3 | 1 | The | nan |
| 3 | 2 | key | 10.35491943359375 |
| 3 | 3 | to | 2.019094467163086 |
| 3 | 4 | the | 3.7583045959472656 |
| 3 | 5 | cabinets | 21.04239845275879 |
| 3 | 6 | were | 6.044715404510498 |
| 3 | 7 | on | 9.186738967895508 |
| 3 | 8 | the | 1.0266693830490112 |
| 3 | 9 | table | 6.743055820465088 |
| 3 | 10 | . | 2.8487112522125244 |
| 4 | 1 | How | nan |
| 4 | 2 | many | 8.747537612915039 |
| 4 | 3 | animals | 10.349991798400879 |
| 4 | 4 | of | 7.982310771942139 |
| 4 | 5 | each | 7.254271984100342 |
| 4 | 6 | kind | 3.8629841804504395 |
| 4 | 7 | did | 6.853036880493164 |
| 4 | 8 | Moses | 11.290939331054688 |
| 4 | 9 | take | 6.513387680053711 |
| 4 | 10 | on | 5.387193202972412 |
| 4 | 11 | the | 2.429086208343506 |
| 4 | 12 | ar | 8.29068660736084 |
| 4 | 13 | k | 0.001733059762045741 |
| 4 | 14 | ? | 1.3717999458312988 |
| 5 | 1 | The | nan |
| 5 | 2 | horse | 13.856287002563477 |
| 5 | 3 | raced | 10.928426742553711 |
| 5 | 4 | past | 5.529265880584717 |
| 5 | 5 | the | 1.912912130355835 |
| 5 | 6 | barn | 6.164068222045898 |
| 5 | 7 | fell | 18.577974319458008 |
| 5 | 8 | . | 6.4461774826049805 |
| 6 | 1 | The | nan |
| 6 | 2 | first | 7.707244873046875 |
| 6 | 3 | thing | 3.870574712753296 |
| 6 | 4 | the | 5.894345760345459 |
| 6 | 5 | new | 7.025041580200195 |
| 6 | 6 | president | 6.4177327156066895 |
| 6 | 7 | will | 4.513916492462158 |
| 6 | 8 | do | 0.641898512840271 |
| 6 | 9 | is | 0.6119055151939392 |
| 6 | 10 | introduce | 6.937398910522461 |
| 6 | 11 | some | 5.374466896057129 |
| 6 | 12 | sort | 5.1832194328308105 |
| 6 | 13 | of | 0.0006344764260575175 |
| 6 | 14 | " | 5.472208499908447 |
| 6 | 15 | Make | 6.435114860534668 |
| 6 | 16 | America | 0.20164340734481812 |
| 6 | 17 | Great | 0.06291275471448898 |
| 6 | 18 | Again | 0.01570785976946354 |
| 6 | 19 | " | 0.08896449953317642 |

** Top N next tokens with surprisal

*** Display help text
#+BEGIN_SRC sh :exports both :results verbatim
python3 llm_topn.py -h
#+END_SRC

#+RESULTS:
#+begin_example
usage: llm_topn.py [-h] [-n NUMBER] [-m MODEL] [-c] [-i INPUT] [-o OUTPUT]
                   [text]

Use GPT2 to generate a ranking of the N most likely next tokens.

positional arguments:
  text                  The string of text to be processed.

options:
  -h, --help            show this help message and exit
  -n NUMBER, --number NUMBER
                        The number of top-ranking tokens to list (default is
                        n=10)
  -m MODEL, --model MODEL
                        The model that should be used. One of: gpt2,
                        gpt2-large, bloom-560m, bloom-1b7, bloom-3b,
                        xglm-564M, xglm-1.7B, xglm-2.9B (default gpt2)
  -c, --csv             Output in csv format
  -i INPUT, --input INPUT
                        The path to the file from which the input should be
                        read.
  -o OUTPUT, --output OUTPUT
                        The path to the file to which the results should be
                        written (default is stdout).
#+end_example

*** Simple top N
Top 5 next tokens:
#+BEGIN_SRC sh :exports both :results output
python3 llm_topn.py "The key to the cabinets" -n 5
#+END_SRC

#+RESULTS:
#+begin_example
Item Text                    Token Rank: Surprisal (bits)
   1 The key to the cabinets is       1: ██ 1.5
   1 The key to the cabinets are      2: ████ 4.1
   1 The key to the cabinets ,        3: ████ 4.2
   1 The key to the cabinets was      4: ████ 4.2
   1 The key to the cabinets and      5: ████ 4.5
#+end_example
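
Since surprisal is measured in bits, each value can be converted back into the probability that the model assigns to that continuation, p = 2^-surprisal. A quick check for the rounded values above:

#+BEGIN_SRC python :eval no
# p = 2 ** -surprisal; for "is" with roughly 1.5 bits:
print(2 ** -1.5)  # ≈ 0.35
#+END_SRC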

*** Multilingual top N
#+BEGIN_SRC sh :exports both :results output
python3 llm_topn.py "Der Schlüssel zu den Schränken" -n 10 -m xglm-564M
#+END_SRC

#+RESULTS:
#+begin_example
Item Text                           Token Rank: Surprisal (bits)
   1 Der Schlüssel zu den Schränken          1: ██ 2.3
   1 Der Schlüssel zu den Schränken ist      2: ███ 2.8
   1 Der Schlüssel zu den Schränken ,        3: ████ 4.0
   1 Der Schlüssel zu den Schränken und      4: ████ 4.4
   1 Der Schlüssel zu den Schränken im       5: █████ 4.5
   1 Der Schlüssel zu den Schränken in       6: █████ 4.6
   1 Der Schlüssel zu den Schränken des      7: █████ 4.9
   1 Der Schlüssel zu den Schränken :        8: █████ 5.0
   1 Der Schlüssel zu den Schränken der      9: █████ 5.4
   1 Der Schlüssel zu den Schränken .       10: ██████ 6.0
#+end_example

*** Force CSV format in shell output
#+BEGIN_SRC sh :results output verbatim
python3 llm_topn.py "The key to the cabinets" -n 5 -c
#+END_SRC

#+RESULTS:
: item,text,token,rank,surprisal
: 1,The key to the cabinets,is,1,1.530847191810608
: 1,The key to the cabinets,are,2,4.100262641906738
: 1,The key to the cabinets,",",3,4.1611528396606445
: 1,The key to the cabinets,was,4,4.206236839294434
: 1,The key to the cabinets,and,5,4.458767890930176

*** Store results in a file (CSV format)
#+BEGIN_SRC sh :eval no
python3 llm_topn.py "The key to the cabinets" -n 5 -o output.csv
#+END_SRC

*** Batch mode top N
To process multiple items in batch mode, create a ~.csv~ file following this example:

#+BEGIN_SRC sh :exports results :results output
cat input_topn.csv
#+END_SRC

#+RESULTS:
: item,text,n
: 1,The key to the cabinets,10
: 2,The key to the cabinet,10
: 3,The first thing the new president will do is to introduce,10
: 4,"After moving into the Oval Office, one of the first things that",10

Columns:
1. Item number
2. Text
3. Number of top tokens that should be reported

Then run:
#+BEGIN_SRC sh :exports code
python3 llm_topn.py -i input_topn.csv -o output_topn.csv
#+END_SRC

Result:
#+BEGIN_SRC sh :exports results
cat output_topn.csv
#+END_SRC

#+RESULTS:
| item | text | token | rank | surprisal |
|------+------+-------+------+-----------|
| 1 | The key to the cabinets | is | 1 | 1.530847191810608 |
| 1 | The key to the cabinets | are | 2 | 4.100262641906738 |
| 1 | The key to the cabinets | , | 3 | 4.1611528396606445 |
| 1 | The key to the cabinets | was | 4 | 4.206236839294434 |
| 1 | The key to the cabinets | and | 5 | 4.458767890930176 |
| 1 | The key to the cabinets | in | 6 | 4.966185569763184 |
| 1 | The key to the cabinets | of | 7 | 5.340408802032471 |
| 1 | The key to the cabinets | ' | 8 | 5.369940280914307 |
| 1 | The key to the cabinets | being | 9 | 5.823633193969727 |
| 1 | The key to the cabinets | that | 10 | 6.032191753387451 |
| 2 | The key to the cabinet | 's | 1 | 1.8515361547470093 |
| 2 | The key to the cabinet | is | 2 | 2.9451916217803955 |
| 2 | The key to the cabinet | , | 3 | 4.270960807800293 |
| 2 | The key to the cabinet | was | 4 | 4.756969928741455 |
| 2 | The key to the cabinet | meeting | 5 | 5.037260055541992 |
| 2 | The key to the cabinet | being | 6 | 5.4005866050720215 |
| 2 | The key to the cabinet | resh | 7 | 6.193490028381348 |
| 2 | The key to the cabinet | has | 8 | 6.257472991943359 |
| 2 | The key to the cabinet | and | 9 | 6.363502502441406 |
| 2 | The key to the cabinet | of | 10 | 6.371416091918945 |
| 3 | The first thing the new president will do is to introduce | a | 1 | 1.717236042022705 |
| 3 | The first thing the new president will do is to introduce | legislation | 2 | 3.0158398151397705 |
| 3 | The first thing the new president will do is to introduce | the | 3 | 3.788292407989502 |
| 3 | The first thing the new president will do is to introduce | his | 4 | 4.383864402770996 |
| 3 | The first thing the new president will do is to introduce | an | 5 | 4.400935649871826 |
| 3 | The first thing the new president will do is to introduce | new | 6 | 4.592444896697998 |
| 3 | The first thing the new president will do is to introduce | some | 7 | 5.393261909484863 |
| 3 | The first thing the new president will do is to introduce | himself | 8 | 6.188421726226807 |
| 3 | The first thing the new president will do is to introduce | more | 9 | 7.121828079223633 |
| 3 | The first thing the new president will do is to introduce | and | 10 | 7.167385578155518 |
| 4 | After moving into the Oval Office, one of the first things that | came | 1 | 4.16267204284668 |
| 4 | After moving into the Oval Office, one of the first things that | I | 2 | 4.3133015632629395 |
| 4 | After moving into the Oval Office, one of the first things that | Trump | 3 | 4.36268949508667 |
| 4 | After moving into the Oval Office, one of the first things that | President | 4 | 4.635979652404785 |
| 4 | After moving into the Oval Office, one of the first things that | he | 5 | 4.925130367279053 |
| 4 | After moving into the Oval Office, one of the first things that | the | 6 | 5.133755207061768 |
| 4 | After moving into the Oval Office, one of the first things that | was | 7 | 5.245244026184082 |
| 4 | After moving into the Oval Office, one of the first things that | happened | 8 | 5.386913299560547 |
| 4 | After moving into the Oval Office, one of the first things that | Obama | 9 | 6.018731117248535 |
| 4 | After moving into the Oval Office, one of the first things that | Mr | 10 | 6.0303544998168945 |