Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/xiongma/chinese-law-bert-similarity
bert chinese similarity
bert deep-learning nlp sentence-similarity tensorflow
- Host: GitHub
- URL: https://github.com/xiongma/chinese-law-bert-similarity
- Owner: xiongma
- License: mit
- Archived: true
- Created: 2019-01-29T02:51:43.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2019-11-07T05:18:54.000Z (about 5 years ago)
- Last Synced: 2024-08-11T16:09:20.802Z (4 months ago)
- Topics: bert, deep-learning, nlp, sentence-similarity, tensorflow
- Language: Python
- Homepage:
- Size: 48.8 KB
- Stars: 139
- Watchers: 6
- Forks: 29
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-bert - policeme/chinese_bert_similarity
README
# How to use
## Prediction
For this project I have already fine-tuned the model, so you can download it and use it directly for prediction!
* this project only supports sentences up to 45 characters long
* [download](https://pan.baidu.com/s/1CbKiY8GBGaF2dnMioLDU5Q) the model file, pwd: vv1k
* use it like this (a reusable helper sketch follows the second example below)
* first, initialize the model
```python
# BertSim is provided by this project's source code
bs = BertSim(gpu_no=0, log_dir='log/', bert_sim_dir='bert_sim_model\\', verbose=True)
```
* second, run prediction
> similar sentences
```python
text_a = '技术侦查措施只能在立案后采取'  # "technical investigation measures may only be taken after a case is filed"
text_b = '未立案不可以进行技术侦查'  # "technical investigation is not allowed before a case is filed"
bs.predict([[text_a, text_b]])
```
> you will get a result like this; the second value is the predicted probability that the pair is similar:

[[0.00942544 0.99057454]]
> dissimilar sentences
```python
text_a = '华为还准备起诉美国政府'  # "Huawei is also preparing to sue the US government"
text_b = '飞机出现后货舱火警信息'  # "a fire warning message appeared in the aircraft's rear cargo hold"
bs.predict([[text_a, text_b]])
```
> you will get a result like this; here the first value (the dissimilar-class probability) is high:

[[0.98687243 0.01312758]]
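
Putting the steps together, here is a minimal helper sketch. It assumes `BertSim` is importable from this repository's source, that inputs should be clipped to the 45-character limit noted above, and that `predict` returns `[p_dissimilar, p_similar]` rows, as the two examples suggest; the helper name `are_similar` and the threshold are illustrative, not part of the project:

```python
MAX_CHARS = 45  # sentence-length limit stated above

def are_similar(bs, pairs, threshold=0.5):
    """Return a True/False similarity decision per sentence pair.

    `pairs` is a list of [text_a, text_b]; texts longer than MAX_CHARS
    are clipped, since the model only supports 45-character inputs.
    """
    clipped = [[a[:MAX_CHARS], b[:MAX_CHARS]] for a, b in pairs]
    probs = bs.predict(clipped)  # rows look like [p_dissimilar, p_similar]
    return [bool(row[1] >= threshold) for row in probs]

# Usage, with bs initialized as in the first step:
# are_similar(bs, [['技术侦查措施只能在立案后采取', '未立案不可以进行技术侦查']])
# -> [True]
```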
### Parameter

| name | type | detail |
|--------------|------|--------------------------------------------------|
| gpu_no | int | which GPU will be used to initialize the BERT graph |
| log_dir | str | log directory |
| verbose | bool | whether to show TensorFlow logs |
| bert_sim_dir | str | path to the BERT similarity model (named `bert_sim_dir` in the constructor call above) |

## Train
### Code
In this project I just fine-tune the pretrained BERT model, so I reuse the original BERT code. I tried to write a new implementation, but it turned out to be the same as the original code, so I gave it up.
### Dataset
My work is in judicial examination education, so I didn't use a common public dataset; my dataset was labeled by hand. It contains 80,000+ sentence pairs, of which 50,000+ are similar and 30,000+ are dissimilar. For privacy reasons, I can't open-source this dataset.
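
As an illustration only (the real dataset is private and its exact format isn't described here), a manually labeled pair dataset like this is often stored as a TSV with `text_a`, `text_b`, and a 0/1 label; a hypothetical loader might look like:

```python
import csv

def load_pairs(path):
    """Hypothetical loader: one pair per line, tab-separated,
    label 1 = similar, 0 = dissimilar."""
    pairs, labels = [], []
    with open(path, encoding='utf-8') as f:
        for text_a, text_b, label in csv.reader(f, delimiter='\t'):
            pairs.append([text_a, text_b])
            labels.append(int(label))
    return pairs, labels
```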
### Suggestion
In the original code, they only take the model's pooled output. I thought there might be other ways to increase the accuracy; I tried several and found one that works: concatenate the [CLS] embeddings from the last four layers of the encoder output list. If you want to use my approach, do the following.
* Delete the following code:
```python
output_layer = model.get_pooled_output()
```
* Use the following code instead; it can **increase the accuracy by 1%**:
```python
# Concatenate the [CLS] vectors (token position 0) of the last four encoder layers.
output_layer = tf.concat(
    [tf.squeeze(model.all_encoder_layers[i][:, 0:1, :], axis=1) for i in range(-4, 0)],
    axis=-1)
```
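
Note that the concatenated [CLS] vector is now four hidden layers wide, so the downstream classifier weights must be sized to match. A minimal sketch, assuming the standard TF1 BERT `run_classifier.py` structure (the names `num_labels`, `output_weights`, and `output_bias` come from that reference code, not this repository):

```python
# output_layer is now 4 * hidden_size wide; size the classifier to match.
hidden_size = output_layer.shape[-1].value

output_weights = tf.get_variable(
    "output_weights", [num_labels, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))
output_bias = tf.get_variable(
    "output_bias", [num_labels], initializer=tf.zeros_initializer())

logits = tf.matmul(output_layer, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
probabilities = tf.nn.softmax(logits, axis=-1)
```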