Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/nilboy/gaic_track3_pair_sim
Global AI Technology Innovation Competition, Track 3: First-Place Solution
- Host: GitHub
- URL: https://github.com/nilboy/gaic_track3_pair_sim
- Owner: nilboy
- Created: 2021-03-30T13:02:56.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-07-12T08:45:07.000Z (over 3 years ago)
- Last Synced: 2024-08-03T09:07:06.046Z (5 months ago)
- Topics: text-pair
- Language: Python
- Homepage:
- Size: 159 KB
- Stars: 235
- Watchers: 2
- Forks: 59
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- StarryDivineSky - nilboy/gaic_track3_pair_sim - Track 3 first-place solution (text matching, text retrieval, text similarity / other: text generation, text dialogue)
README
# gaic_track3_pair_sim
[Global AI Technology Innovation Competition, Track 3: First-Place Solution](https://yiwise-algo.yuque.com/docs/share/5a1e3b76-4d04-4127-979a-496d7bc8c1b8?#%20%E3%80%8A%E7%9F%AD%E6%96%87%E6%9C%AC%E7%9B%B8%E4%BC%BC%E5%8C%B9%E9%85%8D%E3%80%8B)
## Competition homepage
https://tianchi.aliyun.com/competition/entrance/531851/introduction

## Data
This project does not ship the competition data. If you need it, please download it from the Tianchi competition homepage.

## Pretrained model preparation
* Download the pretrained models:
- nezha-base:
https://drive.google.com/file/d/1HmwMG2ldojJRgMVN0ZhxqOukhuOBOKUb/view?usp=sharing
- nezha-large:
https://drive.google.com/file/d/1EtahNvdjEpugm8juFuPIN_Fs2skFmeMU/view?usp=sharing
- uer/bert-base:
https://share.weiyun.com/5QOzPqq
- uer/bert-large:
https://share.weiyun.com/5G90sMJ
- macbert, chinese-bert-wwm-ext, chinese-roberta-wwm-ext-large:
https://huggingface.co/models
* Open-source repositories of the pretrained models:
- https://github.com/dbiir/UER-py
- https://github.com/huawei-noah/Pretrained-Language-Model
* Download and extract the models into the `data` directory; the directory structure is as follows (a short loading check is sketched after this list):
```
data/
└── official_model
└── download
├── chinese-bert-wwm-ext
│ ├── added_tokens.json
│ ├── config.json
│ ├── pytorch_model.bin
│ ├── special_tokens_map.json
│ ├── tokenizer_config.json
│ └── vocab.txt
├── chinese-roberta-wwm-ext-large
│ ├── config.json
│ ├── pytorch_model.bin
│ ├── special_tokens_map.json
│ ├── tokenizer.json
│ ├── tokenizer_config.json
│ └── vocab.txt
├── macbert-base
│ ├── added_tokens.json
│ ├── config.json
│ ├── pytorch_model.bin
│ ├── special_tokens_map.json
│ ├── tokenizer.json
│ ├── tokenizer_config.json
│ └── vocab.txt
├── macbert-large
│ ├── added_tokens.json
│ ├── config.json
│ ├── pytorch_model.bin
│ ├── special_tokens_map.json
│ ├── tokenizer.json
│ ├── tokenizer_config.json
│ └── vocab.txt
├── mixed_corpus_bert_base_model.bin
├── mixed_corpus_bert_large_model.bin
└── nezha-cn-base
├── bert_config.json
├── pytorch_model.bin
└── vocab.txt
```
* Pretrained model [md5](user_data/md5.txt) checksums
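
As a quick sanity check that the extracted layout matches the tree above, one of the Hugging Face-format checkpoints can be loaded with the `transformers` API. This is only an illustrative sketch, not part of the repository's pipeline; `chinese-bert-wwm-ext` is used purely as an example:

```
# Illustrative check: load one extracted checkpoint to confirm the layout above.
from transformers import BertModel, BertTokenizer

model_dir = "data/official_model/download/chinese-bert-wwm-ext"
tokenizer = BertTokenizer.from_pretrained(model_dir)  # reads vocab.txt / tokenizer_config.json
model = BertModel.from_pretrained(model_dir)          # reads config.json / pytorch_model.bin

inputs = tokenizer("短文本相似匹配", "文本对相似度", return_tensors="pt")
print(model(**inputs).last_hidden_state.shape)        # (1, seq_len, 768) for a base model
```

Note that, as the tree shows, the single-file UER weights (`mixed_corpus_bert_*.bin`) and the `nezha-cn-base` folder (with `bert_config.json`) are not in this format and are handled by the repositories linked above rather than loaded this way.
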
## Environment setup
* torch==1.7.0
* transformers==4.3.0.rc1
* simpletransformers==0.51.15
* TensorRT-7.2.1.6 (a quick check of the Python package pins is sketched below)
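
Since the reported results depend on these exact pins, the installed versions can be confirmed with a small illustrative check (not part of the repository):

```
# Print the installed versions of the pinned Python dependencies listed above.
from importlib.metadata import version  # Python 3.8+

for pkg in ("torch", "transformers", "simpletransformers"):
    print(pkg, version(pkg))
# TensorRT ships as a separate tarball (TensorRT-7.2.1.6) and is not checked here.
```
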
## End-to-end training script
```
cd code
bash ./run.sh
```
## Solution variants

* Solution 1: pretraining (multiple models) + finetuning classifiers (multiple models) + generating soft labels + training a regression model on the soft labels (single model)
```
cd code
bash ./train.sh
```
This solution was used in the preliminary round, scoring 0.9220.
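
The soft-label step of Solution 1 amounts to averaging the finetuned classifiers' probabilities into per-pair targets and regressing a single model onto them. Below is a minimal sketch of that idea; `teacher_models`, `student` and the MSE objective are illustrative names and do not mirror the repository's actual training code:

```
# Sketch of Solution 1's distillation step: teachers -> soft labels -> regression student.
import torch
import torch.nn.functional as F

@torch.no_grad()
def soft_labels(teacher_models, input_ids, attention_mask):
    """Average the teachers' positive-class probabilities into one soft target per pair."""
    probs = [torch.softmax(m(input_ids=input_ids, attention_mask=attention_mask).logits, -1)[:, 1]
             for m in teacher_models]
    return torch.stack(probs).mean(dim=0)

def distill_step(student, optimizer, batch, teacher_models):
    """One training step: regress the student's score onto the averaged soft label."""
    targets = soft_labels(teacher_models, batch["input_ids"], batch["attention_mask"])
    scores = student(input_ids=batch["input_ids"],
                     attention_mask=batch["attention_mask"]).logits.squeeze(-1)
    loss = F.mse_loss(scores, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
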
* Solution 2: pretraining (multiple models) + loading the pretrained parameters to initialize one large model + training a classification model (single model)
```
pipeline/pipeline_b.py
```
Trains a 144-layer model (6 × 12 + 24 × 3 layers).
This single model scored 0.9561 on semifinal leaderboard A; average inference time is 15 ms.
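
The 144-layer model is built by concatenating the encoder layers of several pretrained checkpoints (6 base models × 12 layers + 3 large models × 24 layers) into one deep encoder. The sketch below shows the stacking idea for same-sized checkpoints only; how base (768-dim) and large (1024-dim) layers are actually combined is part of `pipeline/pipeline_b.py` and is not reproduced here, and the paths are illustrative:

```
# Sketch: concatenate encoder layers from several same-sized BERT checkpoints
# into one deep model. Mixing base and large layers needs extra handling omitted here.
import torch.nn as nn
from transformers import BertConfig, BertModel

def stack_checkpoints(checkpoint_dirs):
    donors = [BertModel.from_pretrained(d) for d in checkpoint_dirs]
    layers = [layer for donor in donors for layer in donor.encoder.layer]

    config = BertConfig.from_pretrained(checkpoint_dirs[0])
    config.num_hidden_layers = len(layers)

    deep = BertModel(config)
    deep.embeddings = donors[0].embeddings      # reuse one donor's embedding layer
    deep.encoder.layer = nn.ModuleList(layers)  # 12 layers per base donor, 24 per large donor
    return deep

# Two base checkpoints -> a 24-layer encoder (illustrative paths).
model = stack_checkpoints([
    "data/official_model/download/chinese-bert-wwm-ext",
    "data/official_model/download/macbert-base",
])
```
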
* Solution 3: pretraining (multiple models) + finetuning classifiers (multiple models) + average fusion
```
pipeline/pipeline_d.py
```
Fuses 6 bert-base + 3 bert-large models.
This ensemble was not evaluated on semifinal leaderboard A; it scored 0.9593 on leaderboard B; average inference time is 15 ms.
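
Solution 3's average fusion is probability-level: each finetuned model scores the pair and the positive-class probabilities are averaged before thresholding. A minimal illustrative sketch (the model list and the 0.5 threshold are assumptions, not taken from the repository):

```
# Sketch of Solution 3's average fusion: mean of the models' P(similar), then threshold.
import torch

@torch.no_grad()
def fused_score(models, input_ids, attention_mask):
    probs = [torch.softmax(m(input_ids=input_ids, attention_mask=attention_mask).logits, -1)[:, 1]
             for m in models]              # e.g. 6 bert-base + 3 bert-large classifiers
    return torch.stack(probs).mean(dim=0)  # averaged probability per pair

# predictions = (fused_score(models, input_ids, attention_mask) > 0.5).long()
```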