Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Kyubyong/bert-token-embeddings
https://github.com/Kyubyong/bert-token-embeddings
Last synced: about 1 month ago
JSON representation
- Host: GitHub
- URL: https://github.com/Kyubyong/bert-token-embeddings
- Owner: Kyubyong
- License: apache-2.0
- Created: 2019-01-16T00:35:13.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2019-01-22T07:27:09.000Z (almost 6 years ago)
- Last Synced: 2024-08-02T08:10:07.024Z (4 months ago)
- Language: Jupyter Notebook
- Size: 101 KB
- Stars: 97
- Watchers: 8
- Forks: 18
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-bert - Kyubyong/bert-token-embeddings
README
# Bert Pretrained Token Embeddings
BERT([BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)) yields pretrained token (=subword) embeddings. Let's extract and save them in the [word2vec](https://code.google.com/archive/p/word2vec/) format so that they can be used for downstream tasks.
## Requirements
* pytorch_pretrained_bert
* NumPy
* tqdm## Extraction
* Check `extract.py`.## Bert (Pretrained) Token Embeddings in word2vec format
| Models | # Vocab | # Dim | Notes |
|--|--|--|--|
|[bert-base-uncased](https://dl.dropboxusercontent.com/s/hcj1oh8ltqhqrk4/bert-base-uncased.30522.768d.vec.tar.gz) | 30,522 | 768 ||
|[bert-large-uncased](https://dl.dropboxusercontent.com/s/kdf932v3taa22zp/bert-large-uncased.30522.1024d.vec.tar.gz) | 30,522 | 1024 ||
|[bert-base-cased](https://dl.dropboxusercontent.com/s/7wb92gvc40jxvz9/bert-base-cased.28996.768d.vec.tar.gz) | 28,996 | 768 ||
| [bert-large-cased](https://dl.dropboxusercontent.com/s/ke68kjw7yui39s7/bert-large-cased.28996.1024d.vec.tar.gz) | 28,996 | 1024 ||
| [bert-base-multilingual-cased](https://dl.dropboxusercontent.com/s/kmyh9cxmu6hb1jx/bert-base-multilingual-cased.119547.768d.vec.tar.gz)| 119,547 | 768 | Recommended |
|[bert-base-multilingual-uncased](https://dl.dropboxusercontent.com/s/j2vz6hue7pze0ae/bert-base-multilingual-uncased.105879.768d.vec.tar.gz)| 30,522 | 768| Not recommended|
|[bert-base-chinese](https://dl.dropboxusercontent.com/s/gzb6kfjipw591vz/bert-base-chinese.21128.768d.vec.tar.gz)|21,128 | 768 ||## Example
* Check `example.ipynb` to see how to load (sub-)word vectors with [gensim](https://github.com/RaRe-Technologies/gensim) and plot them in 2d space using [tSNE](https://lvdmaaten.github.io/tsne/).* Related tokens to look
* Related tokens to ##go