{"id":50244822,"url":"https://github.com/CAMeL-Lab/CAMeLBERT","last_synced_at":"2026-06-29T20:00:50.786Z","repository":{"id":93673099,"uuid":"340757224","full_name":"CAMeL-Lab/CAMeLBERT","owner":"CAMeL-Lab","description":"Code and models for \"The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models\". EACL 2021, WANLP.","archived":false,"fork":false,"pushed_at":"2024-06-21T05:12:11.000Z","size":94,"stargazers_count":55,"open_issues_count":2,"forks_count":13,"subscribers_count":4,"default_branch":"master","last_synced_at":"2026-01-25T22:46:17.214Z","etag":null,"topics":["arabic-nlp","deep-learning","nlp"],"latest_commit_sha":null,"homepage":"https://aclanthology.org/2021.wanlp-1.10","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CAMeL-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2021-02-20T21:29:06.000Z","updated_at":"2025-12-31T12:13:42.000Z","dependencies_parsed_at":"2025-09-09T20:33:37.551Z","dependency_job_id":"0f9c3ccd-8555-4f61-b626-491d831ebd6a","html_url":"https://github.com/CAMeL-Lab/CAMeLBERT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/CAMeL-Lab/CAMeLBERT","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2FCAMeLBERT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2FCAMeLBERT/tags","releases_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/CAMeL-Lab%2FCAMeLBERT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2FCAMeLBERT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CAMeL-Lab","download_url":"https://codeload.github.com/CAMeL-Lab/CAMeLBERT/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2FCAMeLBERT/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34941027,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-29T02:00:05.398Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arabic-nlp","deep-learning","nlp"],"created_at":"2026-05-26T23:00:19.772Z","updated_at":"2026-06-29T20:00:50.761Z","avatar_url":"https://github.com/CAMeL-Lab.png","language":"Python","funding_links":[],"categories":["NLP per Language"],"sub_categories":["Models and Embeddings"],"readme":"# CAMeLBERT: A collection of pre-trained models for Arabic NLP tasks:\n\nThis repo contains code for the experiments presented in our paper: [The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models](https://arxiv.org/pdf/2103.06678.pdf).\n\n## Requirements:\n\nThis code was written for python\u003e=3.7, pytorch 1.5.1, and transformers 3.1.0. You will also need a few additional packages. 
Here's how you can set up the environment using conda (assuming you have conda and CUDA installed):\n\n```bash\ngit clone https://github.com/CAMeL-Lab/CAMeLBERT.git\ncd CAMeLBERT\n\nconda create -n CAMeLBERT python=3.7\nconda activate CAMeLBERT\n\npip install -r requirements.txt\n```\n\n## CAMeLBERT:\n\n### Pretrained Models\nOur eight CAMeLBERT models are available on Hugging Face's [model hub](https://huggingface.co/CAMeL-Lab) along with their detailed descriptions. Note: to download our models as described in the model hub, you need transformers\u003e=3.5.0. Otherwise, you can download the models manually.\n\n### Arabic Frequency Lists\nWe also provide a frequency lists dataset derived from the pretraining datasets (17.3B tokens) used to pretrain the family of CAMeLBERT models.\nThe frequency dataset is available at https://github.com/CAMeL-Lab/Camel_Arabic_Frequency_Lists.\n\n\n## Fine-tuning Experiments:\n\nAll fine-tuned models can be found [here](https://drive.google.com/drive/folders/15feD46cPcRBybdUUKKrzR9zTxj2QBJ5w?usp=sharing).\n\n## Text Classification:\n\n### Sentiment Analysis:\n\nFor the sentiment analysis experiments, we combined four datasets: 1) [ArSAS](http://lrec-conf.org/workshops/lrec2018/W30/pdf/22_W30.pdf); 2) [ASTD](https://www.aclweb.org/anthology/D15-1299.pdf); 3) [SemEval-2017 4A](https://www.aclweb.org/anthology/S17-2088.pdf); 4) [ArSenTD](https://arxiv.org/pdf/1906.01830.pdf).\u003cbr/\u003e\nThe models were fine-tuned on ArSenTD and the train splits of ArSAS, ASTD, and SemEval-2017. We then evaluated all the checkpoints on a single dev split from ArSAS, ASTD, and SemEval-2017 and picked the best checkpoint to report the results on the test splits of ArSAS, ASTD, and SemEval-2017, respectively. 
To run the fine-tuning:\n\n```bash\nexport DATA_DIR=/path/to/data\nexport TASK_NAME=arabic_sentiment\n\n# --model_name_or_path takes a local path or a Hugging Face model id\npython run_text_classification.py \\\n  --model_type bert \\\n  --model_name_or_path /path/to/pretrained_model/ \\\n  --task_name $TASK_NAME \\\n  --do_train \\\n  --do_eval \\\n  --eval_all_checkpoints \\\n  --save_steps 500 \\\n  --data_dir $DATA_DIR \\\n  --max_seq_length 128 \\\n  --per_gpu_train_batch_size 32 \\\n  --per_gpu_eval_batch_size 32 \\\n  --learning_rate 3e-5 \\\n  --num_train_epochs 3.0 \\\n  --overwrite_output_dir \\\n  --overwrite_cache \\\n  --output_dir /path/to/output_dir \\\n  --seed 12345\n```\n\n### Dialect Identification:\n\nFor the dialect identification experiments, we fine-tuned the models on four different dialect identification datasets: 1) [MADAR Corpus 26](https://www.aclweb.org/anthology/C18-1113.pdf); 2) [MADAR Corpus 6](https://www.aclweb.org/anthology/C18-1113.pdf); 3) [MADAR Twitter-5](https://www.aclweb.org/anthology/W19-4622.pdf); 4) [NADI Country-level](https://www.aclweb.org/anthology/2020.wanlp-1.9.pdf). We fine-tuned the models across the four datasets and picked the best checkpoints on the dev sets to report results on the test sets. 
To run the fine-tuning:\n\n\n```bash\nexport DATA_DIR=/path/to/data\nexport TASK_NAME=arabic_did_madar_26 # or arabic_did_madar_6, arabic_did_madar_twitter, arabic_did_nadi_country\n\n# --model_name_or_path takes a local path or a Hugging Face model id\npython run_text_classification.py \\\n  --model_type bert \\\n  --model_name_or_path /path/to/pretrained_model/ \\\n  --task_name $TASK_NAME \\\n  --do_train \\\n  --do_eval \\\n  --eval_all_checkpoints \\\n  --save_steps 500 \\\n  --data_dir $DATA_DIR \\\n  --max_seq_length 128 \\\n  --per_gpu_train_batch_size 32 \\\n  --per_gpu_eval_batch_size 32 \\\n  --learning_rate 3e-5 \\\n  --num_train_epochs 10.0 \\\n  --overwrite_output_dir \\\n  --overwrite_cache \\\n  --output_dir /path/to/output_dir \\\n  --seed 12345\n```\n\n### Poetry Classification:\n\nFor the poetry classification experiments, we fine-tuned the models on the [APCD](https://arxiv.org/pdf/1905.05700.pdf) dataset. For each model, we picked the best checkpoint based on the dev set to report results on the test set. To run the fine-tuning:\n\n```bash\nexport DATA_DIR=/path/to/data\nexport TASK_NAME=arabic_poetry\n\n# --model_name_or_path takes a local path or a Hugging Face model id\npython run_text_classification.py \\\n  --model_type bert \\\n  --model_name_or_path /path/to/pretrained_model/ \\\n  --task_name $TASK_NAME \\\n  --do_train \\\n  --do_eval \\\n  --eval_all_checkpoints \\\n  --save_steps 5000 \\\n  --data_dir $DATA_DIR \\\n  --max_seq_length 128 \\\n  --per_gpu_train_batch_size 32 \\\n  --per_gpu_eval_batch_size 32 \\\n  --learning_rate 3e-5 \\\n  --num_train_epochs 3.0 \\\n  --overwrite_output_dir \\\n  --overwrite_cache \\\n  --output_dir /path/to/output_dir \\\n  --seed 12345\n```\n\nBash scripts to run text-classification fine-tuning and evaluation can be found in `text-classification/scripts/`.\n\n\n## Token Classification:\n\n### NER:\n\nFor the NER experiments, we used the [ANERCorp](https://link.springer.com/chapter/10.1007/978-3-540-70939-8_13) dataset and followed the splits defined by [Obeid et al., 
2020](https://camel.abudhabi.nyu.edu/anercorp/).\nThe dataset doesn't have a dev split, so we fine-tuned the models on the train split and evaluated the last checkpoint on the test split.\nTo run the fine-tuning:\n\n\n```bash\nexport DATA_DIR=/path/to/data                 # Should contain train/dev/test/labels files\nexport MAX_LENGTH=512\nexport BERT_MODEL=/path/to/pretrained_model/  # Or huggingface model id\nexport OUTPUT_DIR=/path/to/output_dir\nexport BATCH_SIZE=32\nexport NUM_EPOCHS=3\nexport SAVE_STEPS=750\nexport SEED=12345\n\npython run_token_classification.py \\\n  --data_dir $DATA_DIR \\\n  --labels $DATA_DIR/labels.txt \\\n  --model_name_or_path $BERT_MODEL \\\n  --output_dir $OUTPUT_DIR \\\n  --max_seq_length $MAX_LENGTH \\\n  --num_train_epochs $NUM_EPOCHS \\\n  --per_device_train_batch_size $BATCH_SIZE \\\n  --save_steps $SAVE_STEPS \\\n  --seed $SEED \\\n  --overwrite_output_dir \\\n  --overwrite_cache \\\n  --do_train \\\n  --do_predict\n```\n\n### POS Tagging:\n\nFor the POS tagging experiments, we fine-tuned the models on three different datasets:\u003cbr/\u003e\n\n1. Penn Arabic Treebank ([PATB](https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/nemlar2004-penn-arabic-treebank.pdf)): in MSA and has 32 POS tags\n2. Egyptian Arabic Treebank ([ARZATB](https://catalog.ldc.upenn.edu/LDC2018T23)): in EGY and has 33 POS tags\n3. [GUMAR](https://www.aclweb.org/anthology/L18-1607.pdf) corpus: in GLF and has 35 POS tags\n\nWe used the same hyperparameters for all three datasets and report results on the test sets using the best checkpoints on the dev sets. 
To run the fine-tuning:\n\n```bash\nexport DATA_DIR=/path/to/data                 # Should contain train/dev/test/labels files\nexport MAX_LENGTH=512\nexport BERT_MODEL=/path/to/pretrained_model/  # Or huggingface model id\nexport OUTPUT_DIR=/path/to/output_dir\nexport BATCH_SIZE=32\nexport NUM_EPOCHS=10\nexport SAVE_STEPS=500\nexport SEED=12345\n\npython run_token_classification.py \\\n  --data_dir $DATA_DIR \\\n  --labels $DATA_DIR/labels.txt \\\n  --model_name_or_path $BERT_MODEL \\\n  --output_dir $OUTPUT_DIR \\\n  --max_seq_length $MAX_LENGTH \\\n  --num_train_epochs $NUM_EPOCHS \\\n  --per_device_train_batch_size $BATCH_SIZE \\\n  --save_steps $SAVE_STEPS \\\n  --seed $SEED \\\n  --overwrite_output_dir \\\n  --overwrite_cache \\\n  --do_train \\\n  --do_eval\n```\n\nBash scripts to run token-classification fine-tuning and evaluation can be found in `token-classification/scripts/`.\n\n## Citation:\n\nIf you find any of the CAMeLBERT models or the fine-tuned models useful in your work, please cite [our paper](https://arxiv.org/pdf/2103.06678.pdf):\n```bibtex\n@inproceedings{inoue-etal-2021-interplay,\n    title = \"The Interplay of Variant, Size, and Task Type in {A}rabic Pre-trained Language Models\",\n    author = \"Inoue, Go  and\n      Alhafni, Bashar  and\n      Baimukan, Nurpeiis  and\n      Bouamor, Houda  and\n      Habash, Nizar\",\n    booktitle = \"Proceedings of the Sixth Arabic Natural Language Processing Workshop\",\n    month = apr,\n    year = \"2021\",\n    address = \"Kyiv, Ukraine (Online)\",\n    publisher = \"Association for Computational Linguistics\",\n    abstract = \"In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. 
We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.\",\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCAMeL-Lab%2FCAMeLBERT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FCAMeL-Lab%2FCAMeLBERT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCAMeL-Lab%2FCAMeLBERT/lists"}