https://github.com/oneflow-inc/text

Data loaders and abstractions for text and NLP


# text
Models, Datasets, Metrics and Utils for NLP.

## Installation
...

## Usage

### Models

- **Supported models and model types**

**bert** : {"bert-base-cased", "bert-base-uncased", "bert-large-cased", "bert-large-uncased", "bert-base-chinese"}

**elmo** : {"elmo-simplified-chinese", "elmo-traditional-chinese", "elmo-english"}

- **Load the pretrained model**

```python
# Load a pretrained model together with its tokenizer and config.
from flowtext.models import bert
model, tokenizer, bert_config = bert(pretrained=True, model_type='bert-base-uncased', checkpoint_path=None)

# Alternatively, build a model from a config without loading pretrained weights.
from flowtext.models import BertConfig, BertModel
config = BertConfig()
model = BertModel(config)
```
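Because the model type is passed as a plain string, a typo only surfaces when loading fails. The sketch below checks a name against the supported list given earlier in this section before any loading is attempted. `SUPPORTED_MODEL_TYPES` and `check_model_type` are hypothetical helpers written for this example; they are not part of flowtext.

```python
# Names copied from the supported model list in this README.
# This table and the helper below are illustrative, not flowtext APIs.
SUPPORTED_MODEL_TYPES = {
    "bert": {
        "bert-base-cased",
        "bert-base-uncased",
        "bert-large-cased",
        "bert-large-uncased",
        "bert-base-chinese",
    },
    "elmo": {
        "elmo-simplified-chinese",
        "elmo-traditional-chinese",
        "elmo-english",
    },
}


def check_model_type(family, model_type):
    """Raise ValueError unless model_type is listed for the given model family."""
    known = SUPPORTED_MODEL_TYPES.get(family)
    if known is None:
        raise ValueError(f"unknown model family: {family!r}")
    if model_type not in known:
        raise ValueError(
            f"{model_type!r} is not a listed {family} model type; "
            f"expected one of {sorted(known)}"
        )


check_model_type("bert", "bert-base-uncased")  # passes silently
```

Calling `check_model_type("bert", "bert-base-uncase")` would raise a `ValueError` that names the valid options.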

### Datasets

- **The dataset module currently contains:**

Language modeling: [WikiText2, WikiText103, PennTreebank]

Machine translation: [IWSLT2016, IWSLT2017, Multi30k]

Sequence tagging (e.g. POS/NER): [UDPOS, CoNLL2000Chunking]

Question answering: [SQuAD1, SQuAD2]

Text classification: [AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB]


- **Load NLP-related datasets and build a DataLoader**
```python
from flowtext.datasets import AG_NEWS
train_iter = AG_NEWS(split='train')
next(train_iter)
# Or iterate with a for loop
for label, line in train_iter:
    print(label, line)
# Or pass the dataset to a DataLoader
from oneflow.utils.data import DataLoader
train_iter = AG_NEWS(split='train')
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False)
```
```
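To make the effect of `batch_size` concrete, here is a minimal standard-library sketch of grouping `(label, line)` pairs from an iterator into fixed-size batches. It is a stand-in for illustration only, not how `DataLoader` is implemented. The `batches` helper and the `toy_samples` data are hypothetical.

```python
from itertools import islice


def batches(samples, batch_size):
    """Yield lists of up to batch_size items from any iterable, in order."""
    iterator = iter(samples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stand-in for the (label, line) pairs a dataset iterator yields.
toy_samples = [
    (1, "first headline"),
    (2, "second headline"),
    (3, "third headline"),
]

for batch in batches(toy_samples, batch_size=2):
    labels = [label for label, _ in batch]
    lines = [line for _, line in batch]
    print(labels, lines)
```

The last batch may be shorter than `batch_size` when the number of samples is not a multiple of it. Samples come out in their original order, which is the behavior `shuffle=False` asks for.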

### Metrics

- **The metrics module currently contains:**

Bleu_score

Ngram_counter

- **NLP-related evaluation metrics**
```python
>>> from flowtext.data.metrics import bleu_score
>>> candidate_corpus = [['This', 'is', 'a', 'oneflow', 'bleu','test'], ['Another', 'Sentence']]
>>> references_corpus = [[['This', 'is', 'a', 'oneflow', 'bleu','test'], ['Completely', 'Different']], [['No', 'Match']]]
>>> bleu_score(candidate_corpus, references_corpus)
0.889139711856842
```
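To show where a score like this comes from, the following is a from-scratch sketch of corpus-level BLEU. It assumes the standard definition: clipped n-gram precision for n = 1 to 4, uniform weights, and a brevity penalty. It is not flowtext's implementation, and `ngram_counts` and `corpus_bleu` are names chosen for this example.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with uniform n-gram weights and a brevity penalty."""
    clipped = [0] * max_n  # candidate n-grams matched, clipped by reference counts
    total = [0] * max_n    # all candidate n-grams
    cand_len = ref_len = 0

    for cand, refs in zip(candidates, references):
        cand_len += len(cand)
        # Use the reference length closest to the candidate length.
        ref_len += min((len(r) for r in refs), key=lambda n: abs(n - len(cand)))

        # For each n-gram, the most times it occurs in any single reference.
        max_ref = Counter()
        for ref in refs:
            max_ref |= ngram_counts(ref, max_n)

        cand_counts = ngram_counts(cand, max_n)
        for ngram, count in (cand_counts & max_ref).items():
            clipped[len(ngram) - 1] += count
        for ngram, count in cand_counts.items():
            total[len(ngram) - 1] += count

    if min(clipped) == 0:
        return 0.0
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    brevity = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity * math.exp(log_precision)


candidate_corpus = [['This', 'is', 'a', 'oneflow', 'bleu', 'test'], ['Another', 'Sentence']]
references_corpus = [[['This', 'is', 'a', 'oneflow', 'bleu', 'test'], ['Completely', 'Different']], [['No', 'Match']]]
print(corpus_bleu(candidate_corpus, references_corpus))  # approximately 0.8891
```

For this input the clipped precisions are 6/8, 5/6, 4/4 and 3/3, and the candidate and reference lengths are both 8, so the brevity penalty is 1. The score is therefore the geometric mean of the four precisions, (0.75 × 0.8333…)^(1/4) ≈ 0.8891.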

### Utils

- **Load tokenizer**
```python
>>> from flowtext.data import get_tokenizer
# Besides "basic_english", the `tokenizer` parameter also supports spacy, moses, toktok, revtok, subword, and jieba.
>>> tokenizer = get_tokenizer(tokenizer="basic_english", language="en")
>>> tokens = tokenizer("Today is a good day!")
>>> tokens
['today', 'is', 'a', 'good', 'day', '!']
```
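The output above suggests that the `basic_english` tokenizer lowercases the input and separates punctuation from words. The sketch below approximates that behavior with one regular expression. It is an assumption based only on this single example, not the library's actual normalization rules, and `basic_english_like` is a hypothetical name.

```python
import re


def basic_english_like(text):
    """Lowercase the text, then split it into words and single punctuation marks."""
    # Words are runs of letters, digits, or apostrophes; any other
    # non-space character becomes a token of its own.
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())


print(basic_english_like("Today is a good day!"))
# ['today', 'is', 'a', 'good', 'day', '!']
```

For anything beyond a quick experiment, prefer the tokenizer returned by `get_tokenizer`, since its handling of edge cases may differ from this approximation.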

## Disclaimer on Datasets

`flowtext.datasets` is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, we do not guarantee their quality or fairness, and we do not claim to hold a license for them. It is your responsibility to determine whether you have permission to use a dataset under that dataset's license.

If you are a dataset owner and want to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please contact us through a GitHub issue.

## License

OneFlow has a BSD-style license, as found in the [LICENSE](https://github.com/Oneflow-Inc/text/blob/main/LICENSE) file.