https://github.com/xu-song/bert_as_language_model

BERT as language model, fork from https://github.com/google-research/bert
https://github.com/xu-song/bert_as_language_model

bert language-model tensorflow

Host: GitHub
URL: https://github.com/xu-song/bert_as_language_model
Owner: xu-song
License: apache-2.0
Created: 2018-11-30T03:43:38.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2024-03-06T06:20:47.000Z (9 months ago)
Last Synced: 2024-08-16T21:08:52.481Z (4 months ago)
Topics: bert, language-model, tensorflow
Language: Python
Homepage:
Size: 175 KB
Stars: 247
Watchers: 9
Forks: 68
Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

awesome-bert - xu-song/bert_as_language_model - research/bert, (BERT language model and embedding:)

README

**[🤗Demo](#demo)** | **[📖cases-en](#test-case)** | **[📖cases-zh](cases/test.zh.md)**

## BERT as Language Model

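For a sentence $S = w_1, w_2, \ldots, w_k$, we have

$$p(S) = \prod_{i=1}^{k} p(w_i \mid \text{context})$$

In a traditional language model, such as an RNN, $\text{context} = w_1, \ldots, w_{i-1}$, so

$$p(S) = \prod_{i=1}^{k} p(w_i \mid w_1, \ldots, w_{i-1})$$

In a bidirectional language model, the context is larger: $\text{context} = w_1, \ldots, w_{i-1}, w_{i+1}, \ldots, w_k$.

In this implementation, we simply adopt the following approximation:

$$p(S) \approx \prod_{i=1}^{k} p(w_i \mid w_1, \ldots, w_{i-1}, w_{i+1}, \ldots, w_k)$$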
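As a rough illustration of scoring each token given all the other tokens in the sentence, the sketch below masks one position at a time and sums the log-probabilities of the original tokens. The names `pseudo_log_likelihood`, `predict_masked`, and `uniform_model` are hypothetical and are not taken from this repository; the toy model is a stand-in for BERT's masked-token prediction, so the numbers it produces are not meaningful.

```python
import math
from typing import Callable, Dict, List

MASK = "[MASK]"  # BERT's mask token

def pseudo_log_likelihood(
    tokens: List[str],
    predict_masked: Callable[[List[str], int], Dict[str, float]],
) -> float:
    """Sum of log p(w_i | all other tokens), masking one position at a time."""
    total = 0.0
    for i, target in enumerate(tokens):
        # Replace only position i with the mask token; keep the rest as context.
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        distribution = predict_masked(masked, i)
        total += math.log(distribution[target])
    return total

def uniform_model(masked_tokens: List[str], position: int) -> Dict[str, float]:
    """Hypothetical stand-in for a masked LM: uniform over a tiny vocabulary."""
    vocab = ["there", "is", "a", "book", "on", "the", "desk"]
    return {word: 1.0 / len(vocab) for word in vocab}

tokens = "there is a book on the desk".split()
log_p = pseudo_log_likelihood(tokens, uniform_model)
print(log_p)
```

With a real masked language model in place of `uniform_model`, the same loop would yield one probability per token, and their product would give the approximate sentence probability.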

### Demo

Try out the [Web Demo](https://huggingface.co/spaces/eson/bert-perplexity) at [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/eson/bert-perplexity)

### test-case

> [More cases: Chinese](cases/test.zh.md)

```bash
export BERT_BASE_DIR=model/uncased_L-12_H-768_A-12
export INPUT_FILE=data/lm/test.en.tsv

python run_lm_predict.py \
  --input_file=$INPUT_FILE \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --output_dir=/tmp/lm_output/
```

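The input file contains one sentence per line. For the following test case, the results are written to `test_result.json` in the output directory:

```bash
$ cat data/lm/test.en.tsv
there is a book on the desk
there is a plane on the desk
there is a book in the desk

$ cat /tmp/lm_output/test_result.json
```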

Output:

```yml
# prob: probability
# ppl:  perplexity
[
  {
    "tokens": [
      {
        "token": "there",
        "prob": 0.9988962411880493
      },
      {
        "token": "is",
        "prob": 0.013578361831605434
      },
      {
        "token": "a",
        "prob": 0.9420605897903442
      },
      {
        "token": "book",
        "prob": 0.07452250272035599
      },
      {
        "token": "on",
        "prob": 0.9607976675033569
      },
      {
        "token": "the",
        "prob": 0.4983428418636322
      },
      {
        "token": "desk",
        "prob": 4.040586190967588e-06
      }
    ],
    "ppl": 17.69329728285426
  },
  {
    "tokens": [
      {
        "token": "there",
        "prob": 0.996775209903717
      },
      {
        "token": "is",
        "prob": 0.03194097802042961
      },
      {
        "token": "a",
        "prob": 0.8877727389335632
      },
      {
        "token": "plane",
        "prob": 3.4907534427475184e-05   # low probability
      },
      {
        "token": "on",
        "prob": 0.1902322769165039
      },
      {
        "token": "the",
        "prob": 0.5981084704399109
      },
      {
        "token": "desk",
        "prob": 3.3164762953674654e-06
      }
    ],
    "ppl": 59.646456254851806
  },
  {
    "tokens": [
      {
        "token": "there",
        "prob": 0.9969795942306519
      },
      {
        "token": "is",
        "prob": 0.03379646688699722
      },
      {
        "token": "a",
        "prob": 0.9095568060874939
      },
      {
        "token": "book",
        "prob": 0.013939591124653816
      },
      {
        "token": "in",
        "prob": 0.000823647016659379  # low probability
      },
      {
        "token": "the",
        "prob": 0.5844194293022156
      },
      {
        "token": "desk",
        "prob": 3.3361218356731115e-06
      }
    ],
    "ppl": 54.65941516205144
  }
]
```
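
The reported `ppl` values appear consistent with the usual definition of perplexity as the exponential of the negative mean log-probability over the tokens of a sentence. The sketch below recomputes it for the first sentence from the `prob` values listed above; the helper name `perplexity` is illustrative and is not taken from the repository's code.

```python
import math
from typing import List

def perplexity(probs: List[float]) -> float:
    """exp of the negative mean natural-log probability of the tokens."""
    mean_log_prob = sum(math.log(p) for p in probs) / len(probs)
    return math.exp(-mean_log_prob)

# Token probabilities for "there is a book on the desk", copied from the output above.
probs_first_sentence = [
    0.9988962411880493,     # there
    0.013578361831605434,   # is
    0.9420605897903442,     # a
    0.07452250272035599,    # book
    0.9607976675033569,     # on
    0.4983428418636322,     # the
    4.040586190967588e-06,  # desk
]

ppl = perplexity(probs_first_sentence)
print(ppl)  # close to the reported 17.693
```

Because perplexity is a geometric-mean style measure, a single very unlikely token (such as "plane" or "in" in the other two sentences) is enough to raise the sentence-level score noticeably.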