Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jacksoncakes/chinese_keybert
A minimal chinese keywords extraction with BERT
- Host: GitHub
- URL: https://github.com/jacksoncakes/chinese_keybert
- Owner: JacksonCakes
- License: mit
- Created: 2022-11-01T10:23:04.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2022-11-01T15:46:59.000Z (about 2 years ago)
- Last Synced: 2024-10-01T17:22:57.925Z (about 2 months ago)
- Topics: bert, chinese, keyword-extraction, nlp, python, pytorch
- Language: Python
- Homepage:
- Size: 23.4 KB
- Stars: 8
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Chinese_keyBERT
Chinese_keyBERT is a minimal Chinese keyword extraction library that leverages the contextual embeddings generated by BERT models to extract relevant keywords from a given text.

## Installation
```
pip install chinese_keybert
```

## Get started
```
from chinese_keybert import Chinese_Extractor

kw_extractor = Chinese_Extractor()
text = ['''
深度學習是機器學習的分支, 透過模仿人類大腦及受其啟發的演算法改進、演化自己。 這些演算法的官方名稱是人工神經網絡。
程式員可以用文字、圖片及聲音等大量不同形式且複雜的無標籤數據來「訓練」這些神經網絡。然後這些模型便會不時「從經驗中學習」,
最終達至前所未有的準確度,甚至超越人類所及。雖然難以置信,但深度學習模式這個概念其實可追溯至1943年。
當時就已有兩名科學家利用數學及演算法,創建出模仿人類大腦的多層神經網絡。
''']
result = kw_extractor.generate_keywords(text, top_k=5, rank_methods="mmr", diversity=0.6)
>> [['深度', '演算法', '前所未有', '模型', '網絡']]
```

## How it works
The core idea behind Chinese_keyBERT is to use a word segmentation model to segment a piece of text into smaller n-grams and filter those n-grams according to their part-of-speech tags (since some parts of speech are not suitable as keywords). An embedding model (e.g. BERT) then encodes the text and the filtered n-grams, and a ranking method (e.g. maximal sum or maximal marginal relevance) computes the cosine distances between the text and n-gram embeddings to rank the keywords by score.
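The ranking step can be sketched roughly as follows. This is a minimal illustration, not the library's internal code: it assumes the candidate n-grams have already been produced by the segmentation and POS-filtering step, uses sentence-transformers with an arbitrarily chosen multilingual encoder, and the `mmr_keywords` helper is hypothetical.
```
# Hypothetical sketch of MMR-style ranking, not Chinese_keyBERT's actual code.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Any multilingual sentence encoder would do; this model name is an assumption.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def mmr_keywords(doc, candidates, top_k=5, diversity=0.6):
    # Encode the document and the candidate n-grams into the same vector space.
    doc_emb = model.encode([doc])        # shape (1, dim)
    cand_emb = model.encode(candidates)  # shape (n, dim)

    doc_sim = cosine_similarity(cand_emb, doc_emb)[:, 0]  # relevance of each candidate
    cand_sim = cosine_similarity(cand_emb)                 # pairwise candidate similarity

    selected = [int(np.argmax(doc_sim))]  # start with the most relevant candidate
    remaining = [i for i in range(len(candidates)) if i != selected[0]]
    while remaining and len(selected) < top_k:
        # MMR: trade off relevance to the document against redundancy
        # with the keywords already selected.
        redundancy = cand_sim[np.ix_(remaining, selected)].max(axis=1)
        scores = (1 - diversity) * doc_sim[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]
```
A `diversity` close to 1 favours varied keywords, while a value close to 0 simply returns the candidates most similar to the document.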
## To-do
- [ ] Documentation
- [ ] Vectorization operations to speed-up processing of multiple documents
- [ ] Add support for other word segmentation, part-of-speech and embedding models

## Credit
Chinese_keyBERT was largely inspired by [KeyBERT](https://github.com/MaartenGr/KeyBERT), a minimal library for embedding-based keyword extraction. Chinese_keyBERT also relies heavily on the Chinese word segmentation and POS library from [CKIP](https://github.com/ckiplab/ckip-transformers), as well as [sentence-transformers](https://github.com/UKPLab/sentence-transformers) for generating quality embeddings.
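As a rough illustration of how the CKIP drivers fit into the candidate-generation step described above, here is a hedged sketch. It assumes the drivers are constructed as in the ckip-transformers documentation, and the POS filter (keep noun- and verb-like tags) is an illustrative choice, not the library's exact rule.
```
# Hypothetical sketch of candidate generation with CKIP, not the library's code.
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger

ws_driver = CkipWordSegmenter(model="bert-base")   # word segmentation
pos_driver = CkipPosTagger(model="bert-base")      # part-of-speech tagging

def candidate_keywords(texts):
    # Segment each text, tag the tokens, and keep tokens whose CKIP POS tag
    # starts with "N" (nouns) or "V" (verbs) as keyword candidates.
    tokens = ws_driver(texts)   # e.g. [["深度", "學習", "是", ...], ...]
    tags = pos_driver(tokens)   # POS tags aligned with the tokens
    return [
        [tok for tok, tag in zip(toks, pos) if tag.startswith(("N", "V"))]
        for toks, pos in zip(tokens, tags)
    ]

candidates = candidate_keywords(["深度學習是機器學習的分支。"])
```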