https://github.com/prakhar21/TextAugmentation-GPT2

Fine-tuned pre-trained GPT2 for custom topic-specific text generation. Such a system can be used for text augmentation.

gpt-2 natural-language-generation natural-language-processing nlp-machine-learning text-augmentation textclassification transformer-architecture

Last synced: 3 months ago

Host: GitHub
URL: https://github.com/prakhar21/TextAugmentation-GPT2
Owner: prakhar21
License: mit
Created: 2020-01-29T16:39:13.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2023-07-14T15:52:06.000Z (over 1 year ago)
Last Synced: 2024-08-04T06:03:53.740Z (7 months ago)
Topics: gpt-2, natural-language-generation, natural-language-processing, nlp-machine-learning, text-augmentation, textclassification, transformer-architecture
Language: Python
Homepage:
Size: 655 KB
Stars: 187
Watchers: 7
Forks: 43
Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# TextAugmentation-GPT2
![GPT2 model size representation](https://github.com/prakhar21/TextAugmentation-GPT2/blob/master/gpt2-sizes.png)
Fine-tuned pre-trained GPT2 for topic-specific text generation. Such a system can be used for text augmentation.

## Getting Started
1. `git clone https://github.com/prakhar21/TextAugmentation-GPT2.git`
2. Move your data into the __data/__ directory.

_* Refer to data/SMSSpamCollection for an example of the expected file format._
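
For orientation, the public SMS Spam Collection corpus stores one message per line: a label (`ham` or `spam`), a tab, then the message text. The sketch below assumes the copy in `data/SMSSpamCollection` keeps that layout. `read_labeled_lines` is a hypothetical helper for checking your own file, not part of this repository, and the sample lines are made up.

```python
from typing import Iterable, List, Tuple


def read_labeled_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'label<TAB>text' lines into (label, text) pairs.

    Assumption: one example per line, with the label and the text
    separated by the first tab. Blank or malformed lines are skipped.
    """
    pairs = []
    for raw in lines:
        line = raw.rstrip("\n")
        if "\t" not in line:
            continue  # no tab, so not a 'label<TAB>text' line
        label, text = line.split("\t", 1)
        label, text = label.strip(), text.strip()
        if label and text:
            pairs.append((label, text))
    return pairs


# Made-up lines in the assumed format, for illustration only.
sample = ["ham\tSee you at lunch\n", "spam\tYou have won a prize\n", "\n"]
for label, text in read_labeled_lines(sample):
    print(label, "->", text)
```

To check a real file, pass an open file handle instead of the list, for example `read_labeled_lines(open("data/SMSSpamCollection", encoding="utf-8"))`.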

## Tuning for own Corpus
1. Make sure you have completed step 2 under __Getting Started__.
2. Run the training script, passing a value after each flag you want to set:
```
python3 train.py --data_file --epoch --warmup --model_name --max_len --learning_rate --batch
```

## Generating Text
```
python3 generate.py --model_name --sentences --label
```

_* Tune these parameters for your task. Any flag you omit falls back to its default value, which may give sub-optimal performance._

## Quick Testing
I fine-tuned the model on the __SPAM/HAM dataset__. You can download it from [here](https://drive.google.com/open?id=1lDMFdcSsmWuzHIW8ceEgDnuJHzxX8Hiw) and follow the steps under the __Generating Text__ section.

_Sample Results_
```
SPAM: You have 2 new messages. Please call 08719121161 now. £3.50. Limited time offer. Call 090516284580.<|endoftext|>
SPAM: Want to buy a car or just a drink? This week only 800p/text betta...<|endoftext|>
SPAM: FREE Call Todays top players, the No1 players and their opponents and get their opinions on www.todaysplay.co.uk Todays Top Club players are in the draw for a chance to be awarded the £1000 prize. TodaysClub.com<|endoftext|>
SPAM: you have been awarded a £2000 cash prize. call 090663644177 or call 090530663647<|endoftext|>
```

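Each sample line above has the shape `LABEL: text<|endoftext|>`. If you want to feed generated lines back into a classifier's training set, one possible post-processing step is sketched below. `parse_generated` is a hypothetical helper, not part of this repository, and it assumes every line follows the shape shown in the samples.

```python
from typing import List, Tuple

EOS = "<|endoftext|>"  # end-of-text marker seen in the sample output


def parse_generated(lines: List[str]) -> List[Tuple[str, str]]:
    """Turn 'LABEL: text<|endoftext|>' lines into (label, text) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if ": " not in line:
            continue  # not in the assumed 'LABEL: text' shape
        label, text = line.split(": ", 1)
        # Keep only what precedes the first end-of-text marker.
        text = text.split(EOS, 1)[0].strip()
        if text:
            pairs.append((label.lower(), text))
    return pairs


generated = [
    "SPAM: you have been awarded a £2000 cash prize. "
    "call 090663644177 or call 090530663647<|endoftext|>",
]
for label, text in parse_generated(generated):
    print(label, "->", text)
```
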
## Important Points to Note
* _Top-k and top-p sampling_ (top-p is also known as __nucleus sampling__) are used while decoding the sequence token by token. You can read more about it [here](https://arxiv.org/pdf/1904.09751.pdf).
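
As a rough illustration of the two filters, the sketch below applies them to a toy next-token distribution using only the standard library. It is not the repository's decoding code. The function names, the toy vocabulary, and the choice to apply top-k first and then top-p over the renormalised remainder are all assumptions made for this example.

```python
import random
from typing import Dict, Optional


def top_k_top_p_filter(
    probs: Dict[str, float], top_k: int = 0, top_p: float = 1.0
) -> Dict[str, float]:
    """Keep the top_k most likely tokens (0 disables the limit), then the
    smallest prefix of those whose cumulative probability reaches top_p,
    and renormalise what is left."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total
        if cumulative >= top_p:
            break  # the nucleus is complete
    norm = sum(p for _, p in kept)
    return {token: p / norm for token, p in kept}


def sample_next(
    probs: Dict[str, float],
    top_k: int = 0,
    top_p: float = 1.0,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one token from the filtered distribution."""
    rng = rng or random.Random()
    filtered = top_k_top_p_filter(probs, top_k, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy distribution over four made-up tokens.
toy = {"call": 0.5, "now": 0.3, "free": 0.15, "the": 0.05}
print(top_k_top_p_filter(toy, top_k=3, top_p=0.8))
print(sample_next(toy, top_k=3, top_p=0.8))
```

With `top_k=3` the token `the` is dropped first. With `top_p=0.8` the nucleus then closes after `call` and `now`, so sampling only ever picks between those two.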

__Note:__ The first run takes a considerable amount of time for the following reasons:
1. It downloads the pre-trained gpt2-medium model _(depends on your network speed)_.
2. It fine-tunes GPT2 on your dataset _(depends on the size of the data, epochs, hyperparameters, etc.)_.

All experiments were run on [IntelDevCloud machines](https://software.intel.com/en-us/devcloud).