https://github.com/k2-fsa/divide_lm

Last synced: 14 days ago
JSON representation

Host: GitHub
URL: https://github.com/k2-fsa/divide_lm
Owner: k2-fsa
Created: 2023-04-25T08:58:10.000Z (about 2 years ago)
Default Branch: master
Last Pushed: 2023-04-25T09:30:08.000Z (about 2 years ago)
Last Synced: 2025-04-11T18:04:32.902Z (14 days ago)
Language: Python
Size: 6.84 KB
Stars: 4
Watchers: 6
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

        # Divide a higher order language model with a lower order language model

## Usage

#### Installation

```shell script

git clone [email protected]:k2-fsa/divide_lm.git

python3 setup.py install

```

#### Example with fake code for demonstating functionality only. (See real code example below)

```python

# Full example is in tests/divide_test.py

# Following statements are not real code.

# Express in this way to emphasize the functionality.

l4f = "tests/arpa_files/4gram.arpa"

l2f = "tests/arpa_files/2gram.arpa"

w4 = 0.4

w2 = 0.2

# Not real code.

divided_lm4 = processes_to_divide_lms(l4f, l2f, w4, w2)

lm4 = kenlm.Model(l4f)

lm2 = kenlm.Model(l2f)

sentence = "a d b c d c b a"

# Not real code.

score4_for_each_token = lm4.full_scores(sentence)

score2_for_each_token = lm2.full_scores(sentence)

ref = w4 * score4_for_each_token - w2 * score2_for_each_token

hyp = divided_lm4.full_scores(sentence)

assert np.allclose(ref, hyp), f"{ref} {hyp}"

```

#### Real code of previous example.

```python

# Copied from tests/divide_test.py

# kenlm and numpy is only needed by this test.

# Package divide_lm itself does not depends them.

import kenlm

import numpy as np

from divide_lm import ARPAModel, LMConfig, Weight, divide

l4f = "tests/arpa_files/4gram.arpa"

l2f = "tests/arpa_files/2gram.arpa"

wnum = 0.4

wden = 0.2

ret_f = f"tests/wnum_{wnum}-wden_{wden}-divided.arpa"

unk_lgp = -100.0

unk_lgbo = 0.0

num_config = LMConfig(model_path=l4f, unk_weight=Weight(unk_lgp, unk_lgbo))

den_config = LMConfig(model_path=l2f, unk_weight=Weight(unk_lgp, unk_lgbo))

# No return value.

# The divided language model will be saved to disk directly.

divide(

    numerator_config=num_config,

    denominator_config=den_config,

    weight_numerator=wnum,

    weight_denominator=wden,

    saved_path=ret_f,

)

sentence = "a d b c d c b a"

lm4 = kenlm.Model(l4f)

lm2 = kenlm.Model(l2f)

divided_lm4 = ARPAModel(ret_f)

s4 = np.array(

    [entry[0] for entry in lm4.full_scores(sentence, bos=True, eos=True)]

)

s2 = np.array(

    [entry[0] for entry in lm2.full_scores(sentence, bos=True, eos=True)]

)

ref = wnum * s4 - wden * s2

hyp = divided_lm4.full_scores(sentence)

assert np.allclose(ref, hyp), f"{ref} {hyp}"

print("Done.")

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/k2-fsa/divide_lm

Awesome Lists containing this project

README