https://github.com/salesforce/bite

Code for "Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding" (EMNLP 2020).

Host: GitHub
URL: https://github.com/salesforce/bite
Owner: salesforce
License: bsd-3-clause
Created: 2020-09-29T02:42:35.000Z (about 4 years ago)
Default Branch: master
Last Pushed: 2023-03-27T19:18:34.000Z (over 1 year ago)
Last Synced: 2024-04-08T00:12:29.074Z (8 months ago)
Topics: nlp, python, tokenizer
Language: Python
Homepage: https://arxiv.org/abs/2004.14870
Size: 112 KB
Stars: 11
Watchers: 5
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Codeowners: CODEOWNERS

# Base-Inflection Encoding

This repository contains code for the paper "[Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding](https://arxiv.org/abs/2004.14870)" (EMNLP 2020).

Authors: [Samson Tan](https://samsontmr.github.io), [Shafiq Joty](https://raihanjoty.github.io), [Lav Varshney](http://www.varshney.csl.illinois.edu), and [Min-Yen Kan](https://comp.nus.edu.sg/~kanmy)

# Installation

```
pip install git+https://github.com/salesforce/bite
```

# Usage

```
from bite import BITETokenizer

bite = BITETokenizer('moses')
print(bite.tokenize('I was going to the engine room!'))
```

We also include a script you can use to tokenize entire files (`run_bite.py`). The parser arguments (`--argument_name`) will give you an idea of the options supported by the script.

If you are using HuggingFace's BERT model, you may want to use the `BiteWordpieceTokenizer` instead. This is the implementation we use in our BERT-based experiments.

## Pretokenization modes

Three types of pretokenizers are supported out of the box:

1. BertPreTokenizer (HuggingFace)

2. Moses (sacremoses)

3. Whitespace splitting
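
To make the difference between these modes concrete, here is a minimal sketch in plain Python. It does not call BITE, sacremoses, or HuggingFace: the function names and the regex are illustrative assumptions that only approximate what a punctuation-aware pretokenizer does.

```python
import re

def whitespace_pretokenize(text):
    # Split on runs of whitespace only; punctuation stays attached to words.
    return text.split()

def punctuation_aware_pretokenize(text):
    # Rough stand-in for a punctuation-aware pretokenizer (NOT the actual
    # Moses or BertPreTokenizer rules): emit runs of word characters and
    # each punctuation mark as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "I was going to the engine room!"

print(whitespace_pretokenize(sentence))
# ['I', 'was', 'going', 'to', 'the', 'engine', 'room!']

print(punctuation_aware_pretokenize(sentence))
# ['I', 'was', 'going', 'to', 'the', 'engine', 'room', '!']
```

Whitespace splitting leaves `room!` as a single token, whereas a punctuation-aware pretokenizer separates the word from the trailing `!`.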

## Inflection symbols

Since subword tokenizers often operate on individual characters, running them on BITE-processed input with human-readable inflection tags (e.g., `[VBD]`) would skew the character/subword statistics of the training corpus and occupy unnecessary slots in the subword vocabulary. Therefore, we recommend using single-character inflection symbols (by passing `map_to_single_char=True` to `tokenize`) when using BITE with such tokenizers.
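
The idea behind single-character symbols can be sketched without the library. The tag list, the use of Unicode Private Use Area code points, the helper names, and the sample token sequence below are all assumptions made for illustration; the mapping that `map_to_single_char=True` actually applies may differ.

```python
# Hypothetical tag inventory; BITE's real tag set and symbols may differ.
TAGS = ["[VBD]", "[VBG]", "[VBN]", "[VBZ]", "[NNS]"]

# Give each tag one code point from the Unicode Private Use Area (U+E000...),
# which is unlikely to collide with characters in ordinary text.
TAG_TO_CHAR = {tag: chr(0xE000 + i) for i, tag in enumerate(TAGS)}
CHAR_TO_TAG = {char: tag for tag, char in TAG_TO_CHAR.items()}

def to_single_char(tokens):
    # Replace human-readable tags with their one-character symbols.
    return [TAG_TO_CHAR.get(tok, tok) for tok in tokens]

def to_readable(tokens):
    # Invert the mapping, e.g. for inspection or debugging.
    return [CHAR_TO_TAG.get(tok, tok) for tok in tokens]

# Illustrative base-form + inflection-tag sequence (not actual library output).
tokens = ["I", "be", "[VBD]", "go", "[VBG]", "to", "the", "engine", "room", "!"]
compact = to_single_char(tokens)

# Each tag now contributes one character instead of five to the corpus
# that a subword tokenizer would be trained on.
print(sum(len(t) for t in tokens), "->", sum(len(t) for t in compact))
```

Because every tag becomes a single symbol, a character-level or subword learner sees it as one unit and does not spend vocabulary slots on pieces such as `[`, `VB`, or `]`.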

# Dialectal Data

The scripts for cleaning the [CORAAL](https://oraal.uoregon.edu/coraal) data and scraping the Colloquial Singapore English data can be found in `paper_scripts`. Please be considerate when scraping and do not flood the site's servers with requests :)

# Citation

Please cite the following if you use the code in this repository:

```
@inproceedings{tan-etal-2020-mind,
    title = "Mind Your Inflections! {I}mproving {NLP} for Non-Standard {E}nglishes with {B}ase-{I}nflection {E}ncoding",
    author = "Tan, Samson  and
      Joty, Shafiq  and
      Varshney, Lav  and
      Kan, Min-Yen",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.455",
    pages = "5647--5663",
}
```