{"id":41847071,"url":"https://github.com/camel-lab/barec_analyzer","last_synced_at":"2026-01-25T10:03:04.969Z","repository":{"id":303969443,"uuid":"996218390","full_name":"CAMeL-Lab/barec_analyzer","owner":"CAMeL-Lab","description":null,"archived":false,"fork":false,"pushed_at":"2025-07-10T16:20:18.000Z","size":25,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-09T22:06:16.051Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CAMeL-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-04T16:17:22.000Z","updated_at":"2025-07-19T04:38:36.000Z","dependencies_parsed_at":"2025-07-10T19:19:11.924Z","dependency_job_id":null,"html_url":"https://github.com/CAMeL-Lab/barec_analyzer","commit_stats":null,"previous_names":["camel-lab/barec_analyzer"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/CAMeL-Lab/barec_analyzer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2Fbarec_analyzer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2Fbarec_analyzer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2Fbarec_analyzer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2Fbarec_analyzer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CAMeL-Lab","downl
oad_url":"https://codeload.github.com/CAMeL-Lab/barec_analyzer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2Fbarec_analyzer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28751065,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-25T09:58:17.166Z","status":"ssl_error","status_checked_at":"2026-01-25T09:55:56.104Z","response_time":113,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-25T10:03:04.397Z","updated_at":"2026-01-25T10:03:04.960Z","avatar_url":"https://github.com/CAMeL-Lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BAREC Analyzer\n\nThis repository contains scripts for preprocessing, training, and evaluating the models in our paper [A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment](https://arxiv.org/abs/2502.13520).\nThe BAREC corpus is available on [Hugging Face](https://huggingface.co/datasets/CAMeL-Lab/BAREC-Corpus-v1.0).\n\n## Repository Structure\n- `scripts/preprocess.py`: Processes raw texts into our tokenized input variants (`Word`, `D3Tok`, `Lex`, and `D3Lex`). 
You **DO NOT** need to run this script to process the BAREC corpus as we already provide these input variants for the full corpus.\n- `scripts/train.py`: Script for fine-tuning pre-trained models using the BAREC corpus. The script supports different loss functions and input variants. It also generates results and saves trained models.\n- `scripts/collect_results.py`: Aggregates evaluation results from multiple trained models and exports them as CSV files for further analysis.\n\n## Setup\n\n### Install Dependencies\n\nTo run `scripts/preprocess.py`, you need to install [CAMeL Tools](https://github.com/CAMeL-Lab/camel_tools) and get the CAMeLBERT MSA morphosyntactic tagger from `camel_data`:\n\n```sh\ngit clone https://github.com/CAMeL-Lab/camel_tools.git\ncd camel_tools\n\nconda create -n camel_tools python=3.9\nconda activate camel_tools\n\npip install -e .\ncamel_data -i disambig-bert-unfactored-msa\n```\n\nTo run `scripts/train.py` and `scripts/collect_results.py`:\n\n```sh\ngit clone https://github.com/CAMeL-Lab/barec_analyzer.git\ncd barec_analyzer\n\nconda create -n barec python=3.9\nconda activate barec\n\npip install -r requirements.txt\n```\n\n\n## Usage\n\n### Preprocessing\n\nPreprocess raw text to different input variants.\nYou **DO NOT** need this script if you want to train on the BAREC corpus as we already provide these input variants for the full corpus.\n\n```sh\npython scripts/preprocess.py \\\n  --input \u003cINPUT_TXT_PATH\u003e \\\n  --input_var \u003cINPUT_VARIANT\u003e \\\n  --db \u003cMORPHOLOGY_DATABASE\u003e \\\n  --output \u003cOUTPUT_TXT_PATH\u003e\n```\n\n- `--input`: Path to input text file containing raw text data\n- `--input_var`: Input variant (`Word`, `D3Tok`, `Lex`, or `D3Lex`)\n- `--db` (**optional**): Path to morphological database to use for processing\n- `--output`: Path to output file to save processed text data\n\n**Important Note**: The default morphological analyzer used in the preprocessing script is not the same as the 
one in the paper, which is licensed by LDC. To obtain the analyzer used in the paper:\n\n1. Obtain the morphological analyzer from LDC ([LDC2010L01](https://catalog.ldc.upenn.edu/LDC2010L01)).\n2. Download the muddled version of the analyzer from [here](https://github.com/CAMeL-Lab/CAMeLBERT_morphosyntactic_tagger/releases/download/v0.0.1/analyzer-msa.muddle).\n3. Install [Muddler](https://github.com/CAMeL-Lab/muddler), a tool for sharing derived data, and use it to unmuddle the encrypted file.\n   ```sh\n   pip install muddler\n   muddler unmuddle -s /PATH/TO/LDC2010L01.tgz -m /PATH/TO/analyzer-msa.muddle /PATH/TO/almor-s31.db.utf8\n   ```\n\n4. To use this analyzer in `scripts/preprocess.py`, pass its path with the `--db` option (`--db \"/PATH/TO/almor-s31.db.utf8\"`).\n\n\n### Training a Model\n\nRun the training script on the BAREC corpus with configurable parameters:\n\n```sh\npython scripts/train.py \\\n  --loss \u003cLOSS_TYPE\u003e \\\n  --model \u003cMODEL_CHECKPOINT\u003e \\\n  --input_var \u003cINPUT_TYPE\u003e \\\n  --save_dir \u003cMODEL_SAVE_BASE_DIR\u003e \\\n  --output_path \u003cOUTPUT_XLSX_DIR\u003e\n```\n\n- `--loss`: Loss function (e.g., `CE`, `EMD`, or `OLL1`)\n- `--model`: Model checkpoint (e.g., a Hugging Face model name or a local path)\n- `--input_var`: Input variant (`Word`, `D3Tok`, `Lex`, or `D3Lex`)\n- `--save_dir`: Base directory for saving trained model folders\n- `--output_path`: Directory to save output XLSX files\n\n### Collecting Results\n\nAfter training multiple models, aggregate their results:\n\n```sh\npython scripts/collect_results.py \\\n  --models_path \u003cMODELS_DIR\u003e \\\n  --output_path \u003cRESULTS_CSV_DIR\u003e\n```\n\n- `--models_path`: Directory containing all trained model folders\n- `--output_path`: Directory to save the aggregated CSV files\n\n## Citation\n```\n@inproceedings{elmadani-etal-2025-readability,\n    title = \"A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment\",\n    author = 
\"Elmadani, Khalid N.  and\n      Habash, Nizar  and\n      Taha-Thomure, Hanada\",\n    booktitle = \"Findings of the Association for Computational Linguistics: ACL 2025\",\n    year = \"2025\",\n    address = \"Vienna, Austria\",\n    publisher = \"Association for Computational Linguistics\"\n}\n```\n\n## License\nSee the `LICENSE` file for license information.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcamel-lab%2Fbarec_analyzer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcamel-lab%2Fbarec_analyzer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcamel-lab%2Fbarec_analyzer/lists"}