{"id":13562681,"url":"https://github.com/VinAIResearch/PhoBERT","last_synced_at":"2025-04-03T19:31:20.103Z","repository":{"id":43082835,"uuid":"244607156","full_name":"VinAIResearch/PhoBERT","owner":"VinAIResearch","description":"PhoBERT: Pre-trained language models for Vietnamese (EMNLP-2020 Findings)","archived":false,"fork":false,"pushed_at":"2024-07-23T00:20:02.000Z","size":68,"stargazers_count":637,"open_issues_count":2,"forks_count":92,"subscribers_count":22,"default_branch":"master","last_synced_at":"2024-08-01T13:27:34.956Z","etag":null,"topics":["bert","bert-embeddings","deep-learning","fairseq","language-models","named-entity-recognition","natural-language-inference","ner","nli","part-of-speech-tagging","phobert","pos-tagging","python3","rdrsegmenter","roberta","transformers","transformers-library","vietnamese","vietnamese-nlp","vncorenlp"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VinAIResearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-03-03T10:29:48.000Z","updated_at":"2024-07-28T15:14:43.000Z","dependencies_parsed_at":"2024-08-01T13:18:34.113Z","dependency_job_id":null,"html_url":"https://github.com/VinAIResearch/PhoBERT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VinAIResearch%2FPhoBERT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VinAIResearch%2FPhoBERT/tags","releases_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/VinAIResearch%2FPhoBERT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VinAIResearch%2FPhoBERT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VinAIResearch","download_url":"https://codeload.github.com/VinAIResearch/PhoBERT/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223019752,"owners_count":17074673,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","bert-embeddings","deep-learning","fairseq","language-models","named-entity-recognition","natural-language-inference","ner","nli","part-of-speech-tagging","phobert","pos-tagging","python3","rdrsegmenter","roberta","transformers","transformers-library","vietnamese","vietnamese-nlp","vncorenlp"],"created_at":"2024-08-01T13:01:11.143Z","updated_at":"2024-11-04T15:30:28.501Z","avatar_url":"https://github.com/VinAIResearch.png","language":null,"funding_links":[],"categories":["Others","NLP per Language"],"sub_categories":["Models and Embeddings"],"readme":"\n#### Table of contents\n1. [Introduction](#introduction)\n2. [Using PhoBERT with `transformers`](#transformers)\n\t- [Installation](#install2)\n\t- [Pre-trained models](#models2)\n\t- [Example usage](#usage2)\n3. [Using PhoBERT with `fairseq`](#fairseq)\n4. 
[Notes](#vncorenlp)\n\n# \u003ca name=\"introduction\"\u003e\u003c/a\u003e PhoBERT: Pre-trained language models for Vietnamese \n\nPre-trained PhoBERT models are the state-of-the-art language models for Vietnamese ([Pho](https://en.wikipedia.org/wiki/Pho), i.e. \"Phở\", is a popular food in Vietnam): \n\n - The two PhoBERT versions, \"base\" and \"large\", are the first public large-scale monolingual language models pre-trained for Vietnamese. The PhoBERT pre-training approach is based on [RoBERTa](https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md), which optimizes the [BERT](https://github.com/google-research/bert) pre-training procedure for more robust performance.\n - PhoBERT outperforms previous monolingual and multilingual approaches, obtaining new state-of-the-art performance on four downstream Vietnamese NLP tasks: part-of-speech tagging, dependency parsing, named-entity recognition and natural language inference.\n\nThe general architecture and experimental results of PhoBERT can be found in our [paper](https://www.aclweb.org/anthology/2020.findings-emnlp.92/):\n\n    @inproceedings{phobert,\n    title     = {{PhoBERT: Pre-trained language models for Vietnamese}},\n    author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},\n    booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},\n    year      = {2020},\n    pages     = {1037--1042}\n    }\n\n**Please CITE** our paper when PhoBERT is used to help produce published results or is incorporated into other software.\n\n## \u003ca name=\"transformers\"\u003e\u003c/a\u003e Using PhoBERT with `transformers` \n\n### Installation \u003ca name=\"install2\"\u003e\u003c/a\u003e\n- Install `transformers` with pip: `pip install transformers`, or [install `transformers` from source](https://huggingface.co/docs/transformers/installation#installing-from-source).  \u003cbr /\u003e \nNote that we merged a slow tokenizer for PhoBERT into the main `transformers` branch. 
Merging a fast tokenizer for PhoBERT is still under discussion, as mentioned in [this pull request](https://github.com/huggingface/transformers/pull/17254#issuecomment-1133932067). Users who would like to use the fast tokenizer can install `transformers` as follows:\n\n\n```\ngit clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git\ncd transformers\npip3 install -e .\n```\n\n- Install `tokenizers` with pip: `pip3 install tokenizers`\n\n### Pre-trained models \u003ca name=\"models2\"\u003e\u003c/a\u003e\n\n\nModel | #params | Arch. | Max length | Pre-training data | License\n---|---|---|---|---|---\n[`vinai/phobert-base-v2`](https://huggingface.co/vinai/phobert-base-v2) | 135M | base | 256 | 20GB of Wikipedia and News texts + 120GB of texts from OSCAR-2301 | [GNU Affero GPL v3](https://github.com/VinAIResearch/PhoBERT/blob/master/LICENSE_for_PhoBERT_v2)\n[`vinai/phobert-base`](https://huggingface.co/vinai/phobert-base) | 135M | base | 256 | 20GB of Wikipedia and News texts | [MIT License](https://github.com/VinAIResearch/PhoBERT/blob/master/LICENSE)\n[`vinai/phobert-large`](https://huggingface.co/vinai/phobert-large) | 370M | large | 256 | 20GB of Wikipedia and News texts | [MIT License](https://github.com/VinAIResearch/PhoBERT/blob/master/LICENSE)\n\n\n### Example usage \u003ca name=\"usage2\"\u003e\u003c/a\u003e\n\n```python\nimport torch\nfrom transformers import AutoModel, AutoTokenizer\n\nphobert = AutoModel.from_pretrained(\"vinai/phobert-base-v2\")\ntokenizer = AutoTokenizer.from_pretrained(\"vinai/phobert-base-v2\")\n\n# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!\nsentence = 'Chúng_tôi là những nghiên_cứu_viên .'  
\n\ninput_ids = torch.tensor([tokenizer.encode(sentence)])\n\nwith torch.no_grad():\n    features = phobert(input_ids)  # Model outputs are tuple-like; features[0] holds the last hidden states\n\n## With TensorFlow 2.0+:\n# from transformers import TFAutoModel\n# phobert = TFAutoModel.from_pretrained(\"vinai/phobert-base\")\n```\n\n\n## \u003ca name=\"fairseq\"\u003e\u003c/a\u003e Using PhoBERT with `fairseq`\n\nSee the details [here](https://github.com/VinAIResearch/PhoBERT/blob/master/README_fairseq.md).\n\n## \u003ca name=\"vncorenlp\"\u003e\u003c/a\u003e Notes \n\nIf the input texts are `raw`, i.e. not yet word-segmented, a word segmenter must be applied to produce word-segmented texts before feeding them to PhoBERT. Because PhoBERT used the [RDRSegmenter](https://github.com/datquocnguyen/RDRsegmenter) from [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) to pre-process its pre-training data (including [Vietnamese tone normalization](https://github.com/VinAIResearch/BARTpho/blob/main/VietnameseToneNormalization.md) and word and sentence segmentation), we recommend applying the same word segmenter to raw input texts in PhoBERT-based downstream applications.\n\n#### Installation\n\n    pip install py_vncorenlp\n\n#### Example usage \u003ca name=\"example\"\u003e\u003c/a\u003e\n\n```python\nimport py_vncorenlp\n\n# Automatically download VnCoreNLP components from the original repository\n# and save them in a local folder\npy_vncorenlp.download_model(save_dir='/absolute/path/to/vncorenlp')\n\n# Load the word and sentence segmentation component\nrdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=[\"wseg\"], save_dir='/absolute/path/to/vncorenlp')\n\ntext = \"Ông Nguyễn Khắc Chúc  đang làm việc tại Đại học Quốc gia Hà Nội. 
Bà Lan, vợ ông Chúc, cũng làm việc tại đây.\"\n\noutput = rdrsegmenter.word_segment(text)\n\nprint(output)\n# ['Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội .', 'Bà Lan , vợ ông Chúc , cũng làm_việc tại đây .']\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVinAIResearch%2FPhoBERT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FVinAIResearch%2FPhoBERT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVinAIResearch%2FPhoBERT/lists"}