{"id":13856412,"url":"https://github.com/dccuchile/beto","last_synced_at":"2025-07-13T17:31:44.985Z","repository":{"id":46048501,"uuid":"178470457","full_name":"dccuchile/beto","owner":"dccuchile","description":"BETO - Spanish version of the BERT model","archived":false,"fork":false,"pushed_at":"2023-10-21T22:05:29.000Z","size":283,"stargazers_count":492,"open_issues_count":8,"forks_count":63,"subscribers_count":38,"default_branch":"master","last_synced_at":"2024-11-22T13:36:40.509Z","etag":null,"topics":["bert","bert-model","nlp","spanish","transformers","transformers-library"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dccuchile.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-03-29T20:18:49.000Z","updated_at":"2024-11-04T06:46:20.000Z","dependencies_parsed_at":"2022-09-14T13:10:30.422Z","dependency_job_id":"ee438714-b2b4-4839-b40c-47fb55e64c0c","html_url":"https://github.com/dccuchile/beto","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/dccuchile/beto","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dccuchile%2Fbeto","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dccuchile%2Fbeto/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dccuchile%2Fbeto/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dccuchile%2Fbeto/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/dccuchile","download_url":"https://codeload.github.com/dccuchile/beto/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dccuchile%2Fbeto/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265178801,"owners_count":23723336,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","bert-model","nlp","spanish","transformers","transformers-library"],"created_at":"2024-08-05T03:00:59.789Z","updated_at":"2025-07-13T17:31:44.660Z","avatar_url":"https://github.com/dccuchile.png","language":null,"funding_links":[],"categories":["Others","NLP per Language"],"sub_categories":["Models and Embeddings"],"readme":"# BETO: Spanish BERT\n\nBETO is a [BERT model](https://github.com/google-research/bert) trained on a [big Spanish corpus](https://github.com/josecannete/spanish-corpora). BETO is of size similar to a BERT-Base and was trained with the Whole Word Masking technique. 
Below you will find TensorFlow and PyTorch checkpoints for the uncased and cased versions, as well as results on Spanish benchmarks comparing BETO with [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) and with other (non-BERT-based) models.\n\n## Download\n\n|              |                  HuggingFace Model Repository                  |\n|:------------:|:--------------------------------------------------------------:|\n| BETO uncased | [dccuchile/bert-base-spanish-wwm-uncased](https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased) |\n|  BETO cased  |  [dccuchile/bert-base-spanish-wwm-cased](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased)  |\n\nAll models use a vocabulary of about 31k BPE subwords constructed using SentencePiece and were trained for 2M steps. \n\n## Benchmarks\n\nThe following table shows some BETO results on the Spanish version of each task. \nWe compare BETO (cased and uncased) with the best Multilingual BERT results that \nwe found in the literature (as of October 2019). 
\nThe table also shows some alternative methods for the same tasks (not necessarily BERT-based methods).\nReferences for all methods can be found [here](#references).\n\n|Task   | BETO-cased    | BETO-uncased  | Best Multilingual BERT    | Other results                  |\n|-------|--------------:|--------------:|--------------------------:|-------------------------------:|\n|[POS](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1827)    | **98.97**     | 98.44     | 97.10 [2]                 | 98.91 [6], 96.71 [3]           |\n|[NER-C](https://www.kaggle.com/nltkdata/conll-corpora)  | [**88.43**](https://github.com/gchaperon/beto-benchmarks/blob/master/conll2002/dev_results_beto-cased_conll2002.txt)         | 82.67         | 87.38 [2]                 | 87.18 [3]                      |\n|[MLDoc](https://github.com/facebookresearch/MLDoc)  | [95.60](https://github.com/gchaperon/beto-benchmarks/blob/master/MLDoc/dev_results_beto-cased_mldoc.txt)        | [**96.12**](https://github.com/gchaperon/beto-benchmarks/blob/master/MLDoc/dev_results_beto-uncased_mldoc.txt)     | 95.70 [2]                 | 88.75 [4]                      |\n|[PAWS-X](https://github.com/google-research-datasets/paws/tree/master/pawsx) | 89.05         | 89.55         | 90.70 [8]                 |                                |\n|[XNLI](https://github.com/facebookresearch/XNLI)   | **82.01**         | 80.15     | 78.50 [2]                 | 80.80 [5], 77.80 [1], 73.15 [4]|\n\n## Example of use\n\nFor further details on how to use BETO you can visit the [🤗Huggingface Transformers library](https://github.com/huggingface/transformers), starting with the [Quickstart section](https://huggingface.co/docs/transformers/tasks/sequence_classification). 
\nBETO models can be accessed simply as [`'dccuchile/bert-base-spanish-wwm-cased'`](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) and [`'dccuchile/bert-base-spanish-wwm-uncased'`](https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased) by using the Transformers library. \nAn example of how to use the models on this page can be found in [this colab notebook](https://colab.research.google.com/drive/1pYOYsCU59GBOwztkWCw5PTsqBiJbRy4S?usp=sharing).\n\n\n## Acknowledgments\n\nWe thank [Adereso](https://www.adere.so/) for kindly providing support for training BETO-uncased, and the [Millennium Institute for Foundational Research on Data](https://imfd.cl/en/)\nthat provided support for training BETO-cased. We also thank Google for helping us with the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc) program.\n\n## Citation\n\n[Spanish Pre-Trained BERT Model and Evaluation Data](https://arxiv.org/abs/2308.02976)\n\nTo cite this resource in a publication please use the following:\n\n```\n@inproceedings{CaneteCFP2020,\n  title={Spanish Pre-Trained BERT Model and Evaluation Data},\n  author={Cañete, José and Chaperon, Gabriel and Fuentes, Rodrigo and Ho, Jou-Hui and Kang, Hojin and Pérez, Jorge},\n  booktitle={PML4DC at ICLR 2020},\n  year={2020}\n}\n```\n\n\n## License Disclaimer\nThe license CC BY 4.0 best describes our intentions for our work. However, we are not sure that all the datasets used to train BETO have licenses compatible with CC BY 4.0 (especially for commercial use). 
Please use at your own discretion and verify that the licenses of the original text resources match your needs.\n\n\n## References\n\n* [1] [Original Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md)\n* [2] [Multilingual BERT on \"Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT\"](https://arxiv.org/pdf/1904.09077.pdf)\n* [3] [Multilingual BERT on \"How Multilingual is Multilingual BERT?\"](https://arxiv.org/pdf/1906.01502.pdf)\n* [4] [LASER](https://arxiv.org/abs/1812.10464)\n* [5] [XLM (MLM+TLM)](https://arxiv.org/pdf/1901.07291.pdf)\n* [6] [UDPipe on \"75 Languages, 1 Model: Parsing Universal Dependencies Universally\"](https://arxiv.org/pdf/1904.02099.pdf)\n* [7] [Multilingual BERT on \"Sequence Tagging with Contextual and Non-Contextual Subword Representations: A Multilingual Evaluation\"](https://arxiv.org/pdf/1906.01569.pdf)\n* [8] [Multilingual BERT on \"PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification\"](https://arxiv.org/abs/1908.11828)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdccuchile%2Fbeto","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdccuchile%2Fbeto","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdccuchile%2Fbeto/lists"}