https://github.com/lonePatient/bert-sentence-similarity-pytorch

This repo contains a PyTorch implementation of a pretrained BERT model for the sentence similarity task.

bert nlp pytorch sentence-similarity text-classification

Host: GitHub
URL: https://github.com/lonePatient/bert-sentence-similarity-pytorch
Owner: lonePatient
Created: 2019-02-14T13:32:20.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2019-02-14T13:34:15.000Z (over 6 years ago)
Last Synced: 2025-03-24T08:21:22.317Z (4 months ago)
Topics: bert, nlp, pytorch, sentence-similarity, text-classification
Language: Python
Size: 26.4 KB
Stars: 48
Watchers: 1
Forks: 6
Open Issues: 3
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-bert - lonePatient/bert-sentence-similarity-pytorch

README

# BERT sentence similarity with PyTorch

This repo contains a PyTorch implementation of a pretrained BERT model for the sentence similarity task.

## Structure of the code

At the root of the project, you will see:

## Dependencies

- csv
- tqdm
- numpy
- pickle
- scikit-learn
- PyTorch 1.0
- matplotlib
- pandas
- pytorch_pretrained_bert (load bert model)
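
Before running anything, it can help to confirm that the third-party packages listed above are importable. The following is a minimal sketch, not part of the repo; the helper name `missing_modules` is hypothetical, and the import names (`sklearn` for scikit-learn, `torch` for PyTorch) are the usual ones for those packages.

```python
import importlib.util

# Import names for the third-party packages listed above. `csv` and `pickle`
# ship with Python itself, so they are not checked here.
REQUIRED_MODULES = [
    "tqdm",
    "numpy",
    "sklearn",  # import name of scikit-learn
    "torch",    # import name of PyTorch
    "matplotlib",
    "pandas",
    "pytorch_pretrained_bert",
]


def missing_modules(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```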

## How to use the code

You need to download the pretrained Chinese BERT model (`chinese_L-12_H-768_A-12.zip`).

1. Download the pretrained Chinese BERT model from [Google](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip) and place it in the `/pybert/model/pretrain` directory.
2. Install `pytorch-pretrained-bert` from [GitHub](https://github.com/huggingface/pytorch-pretrained-BERT) with `pip install pytorch-pretrained-bert`.
3. Run `python convert_tf_checkpoint_to_pytorch.py` to convert the pretrained model (TensorFlow version) into PyTorch format.
4. Prepare the [ATEC NLP data](https://dc.cloud.alipay.com/index#/topic/data?id=8). You can modify `io.data_transformer.py` to adapt it to your own data.
5. Modify the configuration in `pybert/config/basic_config.py` (data paths, etc.).
6. Run `python data_join.py`.
7. Run `python train_bert_atec_nlp.py`.
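
A quick sanity check before step 3 is to confirm that the unpacked archive is complete and to build the checkpoint path without the `.index` suffix. The sketch below is an illustration only: the helper name `checkpoint_prefix` is hypothetical, the file names are those shipped in Google's BERT-Base archives, and where the archive is unpacked under `pybert/model/pretrain` is an assumption about your layout.

```python
from pathlib import Path

# File names found in Google's BERT-Base release archives.
EXPECTED_FILES = (
    "bert_config.json",
    "vocab.txt",
    "bert_model.ckpt.index",
    "bert_model.ckpt.meta",
    "bert_model.ckpt.data-00000-of-00001",
)


def checkpoint_prefix(model_dir):
    """Check that `model_dir` holds a complete TF checkpoint and return its prefix.

    The returned path ends in "bert_model.ckpt" (no ".index" suffix).
    """
    model_dir = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"{model_dir} is missing: {', '.join(missing)}")
    return model_dir / "bert_model.ckpt"
```

The returned path is the value to hand to the conversion step, in whatever form that script or the config file expects it.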

## Tips

- When converting the TensorFlow checkpoint to PyTorch, choose `bert_model.ckpt` as the input file, not `bert_model.ckpt.index`. Otherwise the model learns nothing and gives almost the same random output for any input, which means the real checkpoint was never loaded.
- When using multiple GPUs, non-tensor calculations such as accuracy and f1_score are not supported by the `DataParallel` instance.
- As recommended by Jacob Devlin et al. in their paper https://arxiv.org/pdf/1810.04805.pdf, the hyperparameters for fine-tuning tasks are expected to be set as follows: **Batch_size**: 16 or 32, **learning_rate**: 5e-5, 3e-5, or 2e-5, **num_train_epoch**: 3 or 4.
- The pretrained model limits the input sequence to at most 512 tokens, the maximum number of position embeddings. The data flows into the model as: Raw_data -> WordPieces -> Model. Note that the WordPiece sequence is generally longer than the raw text, so a safe maximum length for the raw data is roughly 128 to 256.
- In our tests, fine-tuning all layers gave much better results than fine-tuning only the last classifier layer. The latter is effectively a feature-based approach.
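
For sentence pairs, the length limit applies to both sentences together plus the special tokens. The sketch below shows one common way to enforce it, modeled on the pair-truncation heuristic in Google's reference BERT classifier code; it is not taken from this repo, whose preprocessing may differ. The function name `truncate_pair` and the default of 128 are assumptions for illustration.

```python
def truncate_pair(tokens_a, tokens_b, max_seq_length=128):
    """Trim two WordPiece token lists so the pair fits in `max_seq_length`.

    Three positions are reserved for [CLS] and the two [SEP] markers.
    Tokens are removed one at a time from the end of the longer list.
    """
    budget = max_seq_length - 3
    if budget < 0:
        raise ValueError("max_seq_length must be at least 3")
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    while len(tokens_a) + len(tokens_b) > budget:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()
    return ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
```

Trimming the longer side first keeps more of the shorter sentence, which matters when the two sentences differ a lot in length.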

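The recommended fine-tuning values in the tips above form a small search grid. As a rough sketch, the combinations can be enumerated as follows; the dictionary keys are hypothetical and are not the names used in `pybert/config/basic_config.py`.

```python
from itertools import product

# Values quoted in the tips above.
BATCH_SIZES = (16, 32)
LEARNING_RATES = (5e-5, 3e-5, 2e-5)
NUM_TRAIN_EPOCHS = (3, 4)


def fine_tuning_grid():
    """Yield one dict per combination of the recommended values."""
    for batch_size, lr, epochs in product(BATCH_SIZES, LEARNING_RATES, NUM_TRAIN_EPOCHS):
        yield {"batch_size": batch_size, "learning_rate": lr, "num_train_epochs": epochs}


if __name__ == "__main__":
    for config in fine_tuning_grid():
        print(config)
```
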