{"id":18256337,"url":"https://github.com/taishan1994/genius_for_your_data","last_synced_at":"2025-04-04T17:32:02.057Z","repository":{"id":63704587,"uuid":"569642146","full_name":"taishan1994/genius_for_your_data","owner":"taishan1994","description":"使用GENIUS文本生成模型训练自己的数据集。","archived":false,"fork":false,"pushed_at":"2022-12-03T05:36:07.000Z","size":4487,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2023-03-04T13:53:02.098Z","etag":null,"topics":["augmentation","bart","chinese"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/taishan1994.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-11-23T09:43:10.000Z","updated_at":"2023-02-21T02:46:55.000Z","dependencies_parsed_at":"2023-01-22T20:30:10.542Z","dependency_job_id":null,"html_url":"https://github.com/taishan1994/genius_for_your_data","commit_stats":null,"previous_names":[],"tags_count":null,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taishan1994%2Fgenius_for_your_data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taishan1994%2Fgenius_for_your_data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taishan1994%2Fgenius_for_your_data/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taishan1994%2Fgenius_for_your_data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/taishan1994","download_url":"https://codeload.github.com/taishan1994/genius_for_your_dat
a/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223151744,"owners_count":17096165,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["augmentation","bart","chinese"],"created_at":"2024-11-05T10:21:23.557Z","updated_at":"2024-11-05T10:21:24.227Z","avatar_url":"https://github.com/taishan1994.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# genius_for_your_data\nTrain the GENIUS text generation model on your own dataset, based on the original GENIUS code.\n\nOriginal code: https://github.com/beyondguo/genius\n\nOriginal paper: https://arxiv.org/abs/2211.10330v1\n\nDemo: https://huggingface.co/spaces/beyond/genius\n\n# Dependencies\n\n```python\npip install transformers\npip install datasets\npip install nltk\npip install rouge_score\npip install rouge\n```\n\n# Instructions\n\n- Step 1: Download the genius-base-chinese model from Hugging Face and place it under model_hub/genius-base-chinese/.\n\n- Step 2: For the data format, see train.json under data; its format is:\n\n\t```python\n\t[(\"text\", \"label\")]\n\t```\n\n\tOnly the text is actually needed, so any format works as long as you define your own data-loading method in prepare_genius_pretrain_data_chinese_mine.py.\n\n- Step 3: Run ```python genius_utils_mine.py``` to test a single example.\n\n- Step 4: In pre_training/prepare_genius_pretrain_data_chinese_mine.py, replace the data loading with your own, adjust the relevant code under ```__main__```, then run ```python prepare_genius_pretrain_data_chinese_mine.py``` to generate and save the [MASK]ed data.\n\n- Step 5: Adjust the settings in pre_training/genius_pretrain_chinese.py; the main ones are:\n\n\t```python\n\tdataset_path = '../data/data_with_sketch'  # path where the [MASK]ed data is stored\n\tN = 40133\n\t# N is the total number of examples; the dataset is small, so we use all of it\n\ttokenized_dataset = dataset_with_sketch.select(random.sample(range(40133),k=N)).map(preprocess_function, batched=True, \n\t 
                                       \t   remove_columns=dataset_with_sketch.column_names,\n\t                                         \t   batch_size=64,num_proc=8)  \n\tbatch_size = 32 \n\ttraining_args = Seq2SeqTrainingArguments(\n\t    output_dir=output_dir,\n\t    evaluation_strategy=\"steps\",\n\t    eval_steps = 500,  # adjust this to your setup\n\t    save_strategy = 'epoch',\n\t    save_total_limit = num_train_epochs,\n\t    fp16 = True,\n\t    learning_rate=5.6e-5,\n\t    per_device_train_batch_size=batch_size,\n\t    per_device_eval_batch_size=batch_size,\n\t    weight_decay=0.01,\n\t    num_train_epochs=num_train_epochs,\n\t    predict_with_generate=True,\n\t    logging_steps=logging_steps,\n\t)\n\t\n\t# use the first 1000 examples for validation\n\tval_dataset = tokenized_dataset.select(range(1000))\n\t```\n\n\tFinally, run ```python genius_pretrain_chinese.py```.\n\n- Step 6: Finally, run prediction with the saved model:\n\n\t```python\n\t# sega-chinese\n\tfrom transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline\n\t# checkpoint = '../model_hub/genius-base-chinese'\n\tcheckpoint = '../saved_models/genius-base-chinese-data_with_sketch-40133/checkpoint-3765/'\n\ttokenizer = BertTokenizer.from_pretrained(checkpoint)\n\tsega_model = BartForConditionalGeneration.from_pretrained(checkpoint)\n\tsega_generator = Text2TextGenerationPipeline(sega_model, tokenizer, device=0)\n\tsega_generator\n\t\n\t\"\"\"\n\t银色的罗马高跟鞋，圆球吊饰耳饰单带，个性十足，有非常抢眼！\n\t稳吾到嘛？\n\t以后打死也不吃了\n\t来广州两天都没能织围脖,一直都在忙,再加上又感冒了,好痛苦[泪]不过广州给我的感觉灰常好!\n\t对骂我从来没怕过，你们也就只能考虑暗杀了，这样就充分保护动物了，臭傻逼们[打哈气]\n\t你这么说的我都不好意思呢\n\t我到了，文，好惨啊…\n\t\"\"\"\n\t\n\tsketches = [\n\t  \"银色的罗马高跟鞋，圆球吊饰耳饰单带，个性十足[MASK]抢眼[MASK]\",\n\t  \"稳吾到[MASK]\",\n\t  \"以后打死也不吃[MASK]\",\n\t  \"[MASK]广州两天都没能织围脖,一直[MASK]加上又感冒[MASK]痛苦[MASK]广州[MASK]感觉灰常好[MASK]\",\n\t  \"对骂我从来没怕[MASK]只能[MASK]暗杀[MASK]充分保护动物[MASK]逼们[MASK]哈气[MASK]\",\n\t  \"[MASK]这么[MASK]不好意思[MASK]\",\n\t  \"[MASK]好惨[MASK]\",\n\t]\n\tfor sketch in sketches:\n\t    print('input sketch:\\n\u003e\u003e\u003e ', sketch)\n\t    
print('SEGA-chinese output:\\n\u003e\u003e\u003e ',sega_generator(sketch, max_length=100, do_sample=True, num_beams=3)[0]['generated_text'].replace(' ',''),'\\n')\n\t    \n\tinput sketch:\n\t\u003e\u003e\u003e  银色的罗马高跟鞋，圆球吊饰耳饰单带，个性十足[MASK]抢眼[MASK]\n\tSEGA-chinese output:\n\t\u003e\u003e\u003e  银色的罗马高跟鞋，圆球吊饰耳饰单带，个性十足，很抢眼的一件装饰，很有女人味道，很喜欢，很好看，很实用，很时尚，很潮流。 \n\t\n\tinput sketch:\n\t\u003e\u003e\u003e  稳吾到[MASK]\n\tSEGA-chinese output:\n\t\u003e\u003e\u003e  稳吾到家了！ \n\t\n\tinput sketch:\n\t\u003e\u003e\u003e  以后打死也不吃[MASK]\n\tSEGA-chinese output:\n\t\u003e\u003e\u003e  以后打死也不吃了！！！ \n\t\n\tinput sketch:\n\t\u003e\u003e\u003e  [MASK]广州两天都没能织围脖,一直[MASK]加上又感冒[MASK]痛苦[MASK]广州[MASK]感觉灰常好[MASK]\n\tSEGA-chinese output:\n\t\u003e\u003e\u003e  我在广州两天都没能织围脖,一直在忙,再加上又感冒又咳又痛苦,所以我只能去北京,去了广州就去了,感觉灰常好!!! \n\t\n\tinput sketch:\n\t\u003e\u003e\u003e  对骂我从来没怕[MASK]只能[MASK]暗杀[MASK]充分保护动物[MASK]逼们[MASK]哈气[MASK]\n\tSEGA-chinese output:\n\t\u003e\u003e\u003e  对骂我从来没怕过，只能说：我想暗杀那些没有充分保护动物的傻逼们，我也想打他们，可是我还是怕他们打我，给他们一个哈气。 \n\t\n\tinput sketch:\n\t\u003e\u003e\u003e  [MASK]这么[MASK]不好意思[MASK]\n\tSEGA-chinese output:\n\t\u003e\u003e\u003e  我也这么说，可是我还是不好意思说。 \n\t\n\tinput sketch:\n\t\u003e\u003e\u003e  [MASK]好惨[MASK]\n\tSEGA-chinese output:\n\t\u003e\u003e\u003e  我好惨啊！ \n\t```\n\n# Additional notes\n\nI have not gone further into how to use this for data augmentation; from a rough look:\n\n- For named entity recognition, the keywords are those extracted by jieba plus the entities, so these words are never [MASK]ed. The model then generates the surrounding context, and finally the entity positions are recomputed.\n- For text classification there are many options. You can [MASK] tokens at random and replace each [MASK] with the predicted text, or, as above, first select keywords, generate context around them, and assemble the final text.\n- Data augmentation examples may be added later.\n\nFinally, thanks to the original authors for their work. If you are interested, read the paper and try it out.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftaishan1994%2Fgenius_for_your_data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftaishan1994%2Fgenius_for_your_data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftaishan1994%2Fgenius_for_your_data/lists"}