Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/xiongma/chinese-law-bert-similarity

bert chinese similarity

bert deep-learning nlp sentence-similarity tensorflow

Host: GitHub
URL: https://github.com/xiongma/chinese-law-bert-similarity
Owner: xiongma
License: mit
Archived: true
Created: 2019-01-29T02:51:43.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2019-11-07T05:18:54.000Z (over 4 years ago)
Last Synced: 2024-05-31T20:19:02.641Z (about 1 month ago)
Topics: bert, deep-learning, nlp, sentence-similarity, tensorflow
Language: Python
Homepage:
Size: 48.8 KB
Stars: 137
Watchers: 6
Forks: 29
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

Lists

awesome-bert - policeme/chinese_bert_similarity

README

# How to use

## Prediction

I have improved the trained model for this project, so you can download it and use it for prediction.

* This project only supports sentences of up to 45 characters each.

* [Download](https://pan.baidu.com/s/1CbKiY8GBGaF2dnMioLDU5Q) the model file (password: vv1k).

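* Use it like this:

    * First, create a `BertSim` instance:

        ````python
        bs = BertSim(gpu_no=0, log_dir='log/', bert_sim_dir='bert_sim_model\\', verbose=True)
        ````

    * Second, call `predict` with a list of sentence pairs:

        > Similar sentences

        ````python
        # "Technical investigation measures may only be taken after a case is filed"
        text_a = '技术侦查措施只能在立案后采取'
        # "Technical investigation may not be carried out before a case is filed"
        text_b = '未立案不可以进行技术侦查'
        bs.predict([[text_a, text_b]])
        ````

        > You will get a result like this:

        [[0.00942544 0.99057454]]

        > Dissimilar sentences

        ````python
        # "Huawei is also preparing to sue the US government"
        text_a = '华为还准备起诉美国政府'
        # "The aircraft showed an aft cargo hold fire warning"
        text_b = '飞机出现后货舱火警信息'
        bs.predict([[text_a, text_b]])
        ````

        > You will get a result like this:

        [[0.98687243 0.01312758]]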
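
The two results above suggest how to read the output: in the similar example the second value is close to 1, and in the dissimilar example the first value is close to 1. The sketch below turns such rows into a yes/no decision. It assumes that index 1 is the probability of "similar", which the README does not state explicitly. `interpret` and its `threshold` argument are hypothetical names for illustration, not part of `BertSim`.

```python
# Sample outputs copied from the two examples above.
similar_output = [[0.00942544, 0.99057454]]
dissimilar_output = [[0.98687243, 0.01312758]]


def interpret(probs, threshold=0.5):
    """Return True for each sentence pair judged similar.

    Assumption (inferred from the two examples, not documented):
    index 1 of each row is the probability that the pair is similar,
    index 0 the probability that it is dissimilar.
    """
    return [row[1] >= threshold for row in probs]


print(interpret(similar_output))     # [True]
print(interpret(dissimilar_output))  # [False]
```

In practice the rows would come from `bs.predict(...)` rather than hard-coded lists.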

        

### Parameter

| name | type | detail |
|--------------------|------|-------------|
| gpu_no | int | which GPU to use when initializing the BERT graph |
| log_dir | str | log directory |
| verbose | bool | whether to show TensorFlow logs |
| bert_sim_dir | str | path to the BERT similarity model |
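
Because the model only supports sentences of limited length, it may help to check inputs before calling `predict`. The following is a minimal sketch, assuming the 45-character limit is a per-sentence maximum. `MAX_CHARS` and `split_by_length` are hypothetical names and are not part of this project.

```python
MAX_CHARS = 45  # limit stated in this README; assumed to be a per-sentence maximum


def split_by_length(pairs, max_chars=MAX_CHARS):
    """Separate sentence pairs that fit the limit from those that do not."""
    ok, too_long = [], []
    for text_a, text_b in pairs:
        if len(text_a) <= max_chars and len(text_b) <= max_chars:
            ok.append([text_a, text_b])
        else:
            too_long.append([text_a, text_b])
    return ok, too_long


pairs = [
    ['技术侦查措施只能在立案后采取', '未立案不可以进行技术侦查'],
    ['华为还准备起诉美国政府' * 5, '飞机出现后货舱火警信息'],  # first sentence has 55 characters
]
ok, too_long = split_by_length(pairs)
print(len(ok), len(too_long))  # 1 1
```

Only the pairs in `ok` would then be passed to `bs.predict(ok)`. How to handle the rest (truncate, split, or skip) is left to the caller.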

## Train

### Code

In this project I only fine-tune the pre-trained BERT model, so I use the original BERT code. I tried writing my own version, but it turned out the same as the original code, so I gave up.

### Dataset

Because my work is in judicial examination education, I did not use a common public dataset. My dataset was labeled manually and contains 80000+ examples: 50000+ are similar and 30000+ are dissimilar. For privacy reasons, I cannot open-source this dataset.

### Suggestion

The original code only uses the model's pooled output. I thought there might be other ways to increase accuracy, so I tried several and found one that works: concatenate the [CLS] embeddings from the last four layers of the encoder output list. If you want to use this approach, do the following.

* Delete the following code:

````python
output_layer = model.get_pooled_output()
````

* Use the following code instead; it can **increase accuracy by 1%**.

````python
# Take the [CLS] vector (position 0) from each of the last four encoder layers
# and concatenate them along the hidden dimension.
output_layer = tf.concat([tf.squeeze(model.all_encoder_layers[i][:, 0:1, :], axis=1) for i in range(-4, 0, 1)], axis=-1)
````
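
To make the effect of this change on shapes concrete, here is a small sketch that mimics the same indexing with plain Python lists, so it runs without TensorFlow. The toy sizes (12 layers, batch of 2, 3 tokens, hidden size 2) and the helper names `make_layer` and `concat_last_cls` are assumptions for illustration only. The point is that the resulting feature vector is four times the hidden size, so whatever layer consumes `output_layer` must accept that width.

```python
def make_layer(layer_id, batch=2, seq_len=3, hidden=2):
    """Toy encoder output shaped [batch][seq_len][hidden], filled with the layer index."""
    return [[[float(layer_id)] * hidden for _ in range(seq_len)] for _ in range(batch)]


# Toy stand-in for model.all_encoder_layers: one entry per encoder layer.
all_encoder_layers = [make_layer(i) for i in range(12)]


def concat_last_cls(layers, num_layers=4):
    """Concatenate the position-0 ([CLS]) vector of the last `num_layers` layers."""
    batch = len(layers[0])
    output = []
    for b in range(batch):
        features = []
        for layer in layers[-num_layers:]:
            features.extend(layer[b][0])  # position 0 is the [CLS] token
        output.append(features)
    return output


output_layer = concat_last_cls(all_encoder_layers)
print(len(output_layer), len(output_layer[0]))  # 2 8
print(output_layer[0])  # [8.0, 8.0, 9.0, 9.0, 10.0, 10.0, 11.0, 11.0]
```

The layer order matches `range(-4, 0, 1)` in the TensorFlow snippet: the fourth-from-last layer comes first and the final layer last.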