https://github.com/oneflow-inc/text
Data loaders and abstractions for text and NLP
- Host: GitHub
- URL: https://github.com/oneflow-inc/text
- Owner: Oneflow-Inc
- License: bsd-3-clause
- Created: 2021-10-22T02:45:49.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2022-11-29T13:07:34.000Z (over 3 years ago)
- Last Synced: 2025-04-22T17:50:39.844Z (about 1 year ago)
- Language: Python
- Size: 111 KB
- Stars: 3
- Watchers: 6
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# text
Models, Datasets, Metrics and Utils for NLP.
## Installation
...
## Usage
### Models
- **Supported models and model types**
  - **bert**: {"bert-base-cased", "bert-base-uncased", "bert-large-cased", "bert-large-uncased", "bert-base-chinese"}
  - **elmo**: {"elmo-simplified-chinese", "elmo-traditional-chinese", "elmo-english"}
- **Load the pretrained model**
```python
# Load a pretrained model together with its tokenizer and config.
from flowtext.models import bert
# Bind the result to `model` so the imported `bert` function is not shadowed.
model, tokenizer, bert_config = bert(pretrained=True, model_type='bert-base-uncased', checkpoint_path=None)
# You can also construct a model directly from a config.
from flowtext.models import BertConfig, BertModel
config = BertConfig()
model = BertModel(config)
```
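Because `model_type` is a plain string, a typo only surfaces when the loader runs. A minimal sketch of an up-front check follows. The helper `check_model_type` and the `SUPPORTED_MODEL_TYPES` table are hypothetical and not part of flowtext; the names in the table are copied from the list above.

```python
# Supported names copied from the README list above.
# This table and the helper below are illustrative, not flowtext APIs.
SUPPORTED_MODEL_TYPES = {
    "bert": {
        "bert-base-cased",
        "bert-base-uncased",
        "bert-large-cased",
        "bert-large-uncased",
        "bert-base-chinese",
    },
    "elmo": {
        "elmo-simplified-chinese",
        "elmo-traditional-chinese",
        "elmo-english",
    },
}


def check_model_type(family, model_type):
    """Return model_type if it is listed for the family, else raise ValueError."""
    supported = SUPPORTED_MODEL_TYPES.get(family)
    if supported is None:
        raise ValueError(f"unknown model family: {family!r}")
    if model_type not in supported:
        raise ValueError(
            f"{model_type!r} is not a listed {family} model type; "
            f"choose one of {sorted(supported)}"
        )
    return model_type


checked = check_model_type("bert", "bert-base-uncased")
```

The checked name can then be passed as the `model_type` argument when loading a pretrained model.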
### Datasets
- **The datasets module currently contains:**
  - Language modeling: [WikiText2, WikiText103, PennTreebank]
  - Machine translation: [IWSLT2016, IWSLT2017, Multi30k]
  - Sequence tagging (e.g. POS/NER): [UDPOS, CoNLL2000Chunking]
  - Question answering: [SQuAD1, SQuAD2]
  - Text classification: [AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB]
- **Load NLP-related datasets and build a DataLoader**
```python
from flowtext.datasets import AG_NEWS
train_iter = AG_NEWS(split='train')
# Fetch a single (label, line) sample
next(train_iter)
# Or iterate with a for loop
for (label, line) in train_iter:
    print(label, line)
# Or pass the dataset to a DataLoader
from oneflow.utils.data import DataLoader
train_iter = AG_NEWS(split='train')
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False)
```
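The raw samples are `(label, line)` pairs with text of varying length, so batching usually needs a step that maps tokens to ids and pads each row. Below is a minimal pure-Python sketch of that step. The sample sentences are made up for illustration (they are not AG_NEWS data), and `build_vocab` and `collate_batch` are hypothetical helpers rather than flowtext functions.

```python
from collections import Counter

# Stand-in samples in the (label, line) shape used by the loop above.
samples = [
    (3, "Stocks rally as markets open"),
    (4, "New chip doubles battery life"),
    (3, "Markets close lower"),
]

PAD, UNK = "<pad>", "<unk>"


def build_vocab(pairs):
    """Map each lowercased whitespace token to an integer id."""
    counts = Counter(tok for _, line in pairs for tok in line.lower().split())
    itos = [PAD, UNK] + [tok for tok, _ in counts.most_common()]
    return {tok: idx for idx, tok in enumerate(itos)}


def collate_batch(batch, vocab):
    """Turn (label, line) pairs into a label list and equally long id rows."""
    labels = [label for label, _ in batch]
    rows = [
        [vocab.get(tok, vocab[UNK]) for tok in line.lower().split()]
        for _, line in batch
    ]
    width = max(len(row) for row in rows)
    # Right-pad shorter rows with the padding id.
    padded = [row + [vocab[PAD]] * (width - len(row)) for row in rows]
    return labels, padded


vocab = build_vocab(samples)
labels, padded = collate_batch(samples, vocab)
```

Assuming `oneflow.utils.data.DataLoader` accepts a `collate_fn` argument in the same way as its PyTorch counterpart, a function like `collate_batch` (with the vocabulary bound in advance and the rows converted to tensors) could be supplied there.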
### Metrics
- **The metrics module currently contains:**
  - Bleu_score
  - Ngram_counter
- **Compute NLP-related evaluation metrics**
```python
>>> from flowtext.data.metrics import bleu_score
>>> candidate_corpus = [['This', 'is', 'a', 'oneflow', 'bleu','test'], ['Another', 'Sentence']]
>>> references_corpus = [[['This', 'is', 'a', 'oneflow', 'bleu','test'], ['Completely', 'Different']], [['No', 'Match']]]
>>> bleu_score(candidate_corpus, references_corpus)
0.889139711856842
```
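To see where that number comes from, here is a minimal pure-Python sketch of the standard corpus-level BLEU definition: clipped n-gram precision up to 4-grams with uniform weights, multiplied by a brevity penalty. It is not flowtext's implementation, and `ngram_counts` and `corpus_bleu` are hypothetical names used only for this illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with uniform n-gram weights and a brevity penalty."""
    clipped = [0] * max_n  # candidate n-grams that also occur in a reference
    total = [0] * max_n    # all candidate n-grams
    cand_len = ref_len = 0
    for cand, refs in zip(candidates, references):
        cand_len += len(cand)
        # Use the reference length closest to the candidate length.
        ref_len += min((len(r) for r in refs), key=lambda n: abs(n - len(cand)))
        # The clip limit for an n-gram is its maximum count in any reference.
        ref_max = Counter()
        for ref in refs:
            ref_max |= ngram_counts(ref, max_n)
        cand_counts = ngram_counts(cand, max_n)
        for gram, count in (cand_counts & ref_max).items():
            clipped[len(gram) - 1] += count
        for gram, count in cand_counts.items():
            total[len(gram) - 1] += count
    if cand_len == 0 or min(clipped) == 0:
        return 0.0
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    brevity = math.exp(min(1 - ref_len / cand_len, 0))
    return brevity * math.exp(log_precision)


candidate_corpus = [['This', 'is', 'a', 'oneflow', 'bleu', 'test'], ['Another', 'Sentence']]
references_corpus = [[['This', 'is', 'a', 'oneflow', 'bleu', 'test'], ['Completely', 'Different']], [['No', 'Match']]]
score = corpus_bleu(candidate_corpus, references_corpus)
```

For this corpus the n-gram precisions are 6/8, 5/6, 4/4 and 3/3 and the brevity penalty is 1, so the score is 0.625 ** 0.25, which agrees with the value shown above to about six decimal places.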
### Utils
- **Load tokenizer**
```python
>>> from flowtext.data import get_tokenizer
# The `tokenizer` parameter also accepts spacy, moses, toktok, revtok, subword, and jieba.
>>> tokenizer = get_tokenizer(tokenizer="basic_english", language="en")
>>> tokens = tokenizer("Today is a good day!")
>>> tokens
['today', 'is', 'a', 'good', 'day', '!']
```
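As a rough picture of what a tokenizer of this kind does, the sketch below lowercases the text and splits it into words and single punctuation marks. It is an approximation written for illustration, not the code behind `basic_english`, and `simple_english_tokenize` is a hypothetical name.

```python
import re


def simple_english_tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())


tokens = simple_english_tokenize("Today is a good day!")
```

On the sentence used above, this yields the same six tokens, with the exclamation mark as its own token.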
## Disclaimer on Datasets
`flowtext.datasets` is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, we do not guarantee their quality or fairness, and we do not claim to hold a license for them. It is your responsibility to determine whether you have permission to use a dataset under its license.
If you are a dataset owner and want to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please contact us through a GitHub issue.
## License
OneFlow has a BSD-style license, as found in the [LICENSE](https://github.com/Oneflow-Inc/text/blob/main/LICENSE) file.