{"id":13535287,"url":"https://github.com/imgarylai/bert-embedding","last_synced_at":"2025-04-02T01:30:28.497Z","repository":{"id":57414689,"uuid":"169742086","full_name":"imgarylai/bert-embedding","owner":"imgarylai","description":"🔡 Token level embeddings from BERT model on mxnet and gluonnlp","archived":true,"fork":false,"pushed_at":"2019-11-13T04:30:57.000Z","size":123,"stargazers_count":452,"open_issues_count":31,"forks_count":67,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-03-03T22:41:34.255Z","etag":null,"topics":["bert","gluonnlp","mxnet","natural-language-processing","nlp","word-embeddings"],"latest_commit_sha":null,"homepage":"http://bert-embedding.readthedocs.io/","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/imgarylai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-02-08T13:52:38.000Z","updated_at":"2024-11-06T03:13:30.000Z","dependencies_parsed_at":"2022-09-16T08:52:41.642Z","dependency_job_id":null,"html_url":"https://github.com/imgarylai/bert-embedding","commit_stats":null,"previous_names":["imgarylai/bert_embedding"],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/imgarylai%2Fbert-embedding","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/imgarylai%2Fbert-embedding/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/imgarylai%2Fbert-embedding/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/imgarylai%2Fbert-embedding/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/owners/imgarylai","download_url":"https://codeload.github.com/imgarylai/bert-embedding/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246738366,"owners_count":20825772,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","gluonnlp","mxnet","natural-language-processing","nlp","word-embeddings"],"created_at":"2024-08-01T08:00:52.789Z","updated_at":"2025-04-02T01:30:28.263Z","avatar_url":"https://github.com/imgarylai.png","language":"Python","readme":"# Bert Embeddings\n\n### [Deprecated] Thank you for checking out this project. Unfortunately, I don't have time to maintain this project anymore. If you are interested in maintaining this project, please create an issue and let me know.\n\n[![Build Status](https://travis-ci.org/imgarylai/bert-embedding.svg?branch=master)](https://travis-ci.org/imgarylai/bert-embedding) [![codecov](https://codecov.io/gh/imgarylai/bert-embedding/branch/master/graph/badge.svg)](https://codecov.io/gh/imgarylai/bert-embedding) [![PyPI version](https://badge.fury.io/py/bert-embedding.svg)](https://pypi.org/project/bert-embedding/) [![Documentation Status](https://readthedocs.org/projects/bert-embedding/badge/?version=latest)](https://bert-embedding.readthedocs.io/en/latest/?badge=latest)\n\n[BERT](https://arxiv.org/abs/1810.04805), published by [Google](https://github.com/google-research/bert), is a new way to obtain pre-trained language model word representations. 
Many NLP tasks benefit from BERT to achieve state-of-the-art results.\n\nThe goal of this project is to obtain token embeddings from BERT's pre-trained model. In this way, instead of building and fine-tuning an end-to-end NLP model, you can build your model by simply utilizing the token embeddings.\n\nThis project is implemented with [@MXNet](https://github.com/apache/incubator-mxnet). Special thanks to the [@gluon-nlp](https://github.com/dmlc/gluon-nlp) team.\n\n## Install\n\n```\npip install bert-embedding\n# If you want to run on a GPU machine, also install `mxnet-cu92`:\npip install mxnet-cu92\n```\n\n## Usage\n\n```python\nfrom bert_embedding import BertEmbedding\n\nbert_abstract = \"\"\"We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers.\n Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.\n As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. \nBERT is conceptually simple and empirically powerful. 
\nIt obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.\"\"\"\nsentences = bert_abstract.split('\\n')\nbert_embedding = BertEmbedding()\nresult = bert_embedding(sentences)\n```\nIf you want to use a GPU, please import mxnet and set the context:\n\n```python\nimport mxnet as mx\nfrom bert_embedding import BertEmbedding\n\n...\n\nctx = mx.gpu(0)\nbert = BertEmbedding(ctx=ctx)\n```\n\nThe result is a list of tuples, each containing (tokens, token embeddings).\n\nFor example:\n\n```python\nfirst_sentence = result[0]\n\nfirst_sentence[0]\n# ['we', 'introduce', 'a', 'new', 'language', 'representation', 'model', 'called', 'bert', ',', 'which', 'stands', 'for', 'bidirectional', 'encoder', 'representations', 'from', 'transformers']\nlen(first_sentence[0])\n# 18\n\nlen(first_sentence[1])\n# 18\n\nfirst_sentence_embeddings = first_sentence[1]\n# embedding of the token at index 1 ('introduce')\nfirst_sentence_embeddings[1]\n# array([ 0.4805648 ,  0.18369392, -0.28554988, ..., -0.01961522,\n#        1.0207764 , -0.67167974], dtype=float32)\nfirst_sentence_embeddings[1].shape\n# (768,)\n```\n\n## OOV\n\nThere are three ways to handle OOV tokens: avg (default), sum, and last. The strategy can be specified when encoding.
\n\n```python\n...\nbert_embedding = BertEmbedding()\nbert_embedding(sentences, 'sum')\n...\n```\n\n## Available pre-trained BERT models\n\n| |book_corpus_wiki_en_uncased|book_corpus_wiki_en_cased|wiki_multilingual|wiki_multilingual_cased|wiki_cn|\n|---|---|---|---|---|---|\n|bert_12_768_12|✓|✓|✓|✓|✓|\n|bert_24_1024_16|✗|✓|✗|✗|✗|\n\nExample of using the large pre-trained BERT model from Google:\n\n```python\nfrom bert_embedding import BertEmbedding\n\nbert_embedding = BertEmbedding(model='bert_24_1024_16', dataset_name='book_corpus_wiki_en_cased')\n```\n\nSource: [gluonnlp](http://gluon-nlp.mxnet.io/model_zoo/bert/index.html)\n","funding_links":[],"categories":["BERT language model and embedding:","\u003ca name=\"NLP\"\u003e\u003c/a\u003e3. NLP","Python"],"sub_categories":["2.14 Misc"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fimgarylai%2Fbert-embedding","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fimgarylai%2Fbert-embedding","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fimgarylai%2Fbert-embedding/lists"}