{"id":13747186,"url":"https://github.com/prakhar21/TextAugmentation-GPT2","last_synced_at":"2025-05-09T08:31:49.078Z","repository":{"id":43460879,"uuid":"237031914","full_name":"prakhar21/TextAugmentation-GPT2","owner":"prakhar21","description":"Fine-tuned pre-trained GPT2 for custom topic specific text generation. Such system can be used for Text Augmentation.","archived":false,"fork":false,"pushed_at":"2023-07-14T15:52:06.000Z","size":671,"stargazers_count":187,"open_issues_count":6,"forks_count":43,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-08-04T06:03:53.740Z","etag":null,"topics":["gpt-2","natural-language-generation","natural-language-processing","nlp-machine-learning","text-augmentation","textclassification","transformer-architecture"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/prakhar21.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-01-29T16:39:13.000Z","updated_at":"2024-05-20T13:36:11.000Z","dependencies_parsed_at":"2022-08-02T23:30:43.968Z","dependency_job_id":"f9e0b262-9d7b-4dfd-8933-2d9550ba2326","html_url":"https://github.com/prakhar21/TextAugmentation-GPT2","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prakhar21%2FTextAugmentation-GPT2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prakhar21%2FTextAugmentation-GPT2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prakhar21%2FTextAugmentation-GPT2/releases","mani
fests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prakhar21%2FTextAugmentation-GPT2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/prakhar21","download_url":"https://codeload.github.com/prakhar21/TextAugmentation-GPT2/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224842611,"owners_count":17378999,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gpt-2","natural-language-generation","natural-language-processing","nlp-machine-learning","text-augmentation","textclassification","transformer-architecture"],"created_at":"2024-08-03T06:01:19.589Z","updated_at":"2024-11-15T20:31:09.377Z","avatar_url":"https://github.com/prakhar21.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# TextAugmentation-GPT2\n![GPT2 model size representation](https://github.com/prakhar21/TextAugmentation-GPT2/blob/master/gpt2-sizes.png)\nFine-tuned pre-trained GPT2 for topic-specific text generation. Such a system can be used for Text Augmentation.\n\n## Getting Started\n1. git clone https://github.com/prakhar21/TextAugmentation-GPT2.git\n2. Move your data to __data/ dir__.\n\n_* Please refer to data/SMSSpamCollection to get an idea of the file format._\n\n## Tuning for own Corpus\n1. Make sure you are done with step 2 under __Getting Started__.\n```\n2. 
Run python3 train.py --data_file \u003cfilename\u003e --epoch \u003cnumber_of_epochs\u003e --warmup \u003cwarmup_steps\u003e --model_name \u003cmodel_name\u003e --max_len \u003cmax_seq_length\u003e --learning_rate \u003clearning_rate\u003e --batch \u003cbatch_size\u003e\n```\n## Generating Text\n```\n1. python3 generate.py --model_name \u003cmodel_name\u003e --sentences \u003cnumber_of_sentences\u003e --label \u003cclass_of_training_data\u003e\n```\n\n_* It is recommended that you tune the parameters for your task. Otherwise the default parameters are used, which may give sub-optimal performance._\n\n## Quick Testing\nI have fine-tuned the model on the __SPAM/HAM dataset__. You can download it from [here](https://drive.google.com/open?id=1lDMFdcSsmWuzHIW8ceEgDnuJHzxX8Hiw) and follow the steps mentioned under the __Generating Text__ section.\n\n_Sample Results_\n```\nSPAM: You have 2 new messages. Please call 08719121161 now. £3.50. Limited time offer. Call 090516284580.\u003c|endoftext|\u003e\nSPAM: Want to buy a car or just a drink? This week only 800p/text betta...\u003c|endoftext|\u003e\nSPAM: FREE Call Todays top players, the No1 players and their opponents and get their opinions on www.todaysplay.co.uk Todays Top Club players are in the draw for a chance to be awarded the £1000 prize. TodaysClub.com\u003c|endoftext|\u003e\nSPAM: you have been awarded a £2000 cash prize. call 090663644177 or call 090530663647\u003c|endoftext|\u003e\n\nHAM: Do you remember me?\u003c|endoftext|\u003e\nHAM: I don't think so. You got anything else?\u003c|endoftext|\u003e\nHAM: Ugh I don't want to go to school.. Cuz I can't go to exam..\u003c|endoftext|\u003e\nHAM: K.,k:)where is my laptop?\u003c|endoftext|\u003e\n```\n\n## Important Points to Note\n* _Top-k and Top-p Sampling_ (the latter is also known as __Nucleus Sampling__) are used while decoding the sequence token-by-token. 
You can read more about it [here](https://arxiv.org/pdf/1904.09751.pdf)\n\n\n__Note:__ The first run will take a considerable amount of time, for the following reasons:\n1. It downloads the pre-trained gpt2-medium model _(depends on your network speed)_\n2. It fine-tunes GPT2 on your dataset _(depends on the size of the data, number of epochs, hyperparameters, etc.)_\n\nAll the experiments were done on [IntelDevCloud Machines](https://software.intel.com/en-us/devcloud)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprakhar21%2FTextAugmentation-GPT2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprakhar21%2FTextAugmentation-GPT2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprakhar21%2FTextAugmentation-GPT2/lists"}