{"id":18759556,"url":"https://github.com/sail-sg/sailcraft","last_synced_at":"2025-10-05T00:43:37.906Z","repository":{"id":237133183,"uuid":"781754593","full_name":"sail-sg/sailcraft","owner":"sail-sg","description":"🚢 Data Toolkit for Sailor Language Models","archived":false,"fork":false,"pushed_at":"2025-02-24T07:03:17.000Z","size":224,"stargazers_count":91,"open_issues_count":0,"forks_count":10,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-06-09T15:52:58.170Z","etag":null,"topics":["data-cleaning","data-deduplication"],"latest_commit_sha":null,"homepage":"https://sea-sailor.github.io/blog/sailcraft/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sail-sg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-04-04T01:06:31.000Z","updated_at":"2025-05-21T01:40:23.000Z","dependencies_parsed_at":null,"dependency_job_id":"3dec7f2d-12d5-41c4-baaf-357b6d90ea12","html_url":"https://github.com/sail-sg/sailcraft","commit_stats":null,"previous_names":["sail-sg/sailcraft"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sail-sg/sailcraft","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sail-sg%2Fsailcraft","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sail-sg%2Fsailcraft/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sail-sg%2Fsailcraft/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sail-sg%2Fsailcraft/manifests","owner_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/owners/sail-sg","download_url":"https://codeload.github.com/sail-sg/sailcraft/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sail-sg%2Fsailcraft/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278395913,"owners_count":25979691,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-04T02:00:05.491Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-cleaning","data-deduplication"],"created_at":"2024-11-07T18:00:39.644Z","updated_at":"2025-10-05T00:43:37.876Z","avatar_url":"https://github.com/sail-sg.png","language":"Python","funding_links":[],"categories":["Projects \u0026 Blogs"],"sub_categories":[],"readme":"# SailCraft: Data Toolkit for Sailor Language Models\n\n[![Homepage](https://img.shields.io/badge/🏠-Homepage-3C47EB.svg)](https://sea-sailor.github.io/) \u0026nbsp;\u0026nbsp; [![HuggingFace](https://img.shields.io/badge/🤗-HuggingFace-E87948.svg)](https://huggingface.co/sail/Sailor-7B) \u0026nbsp;\u0026nbsp; [![Technical Report](https://img.shields.io/badge/arXiv-2404.03608-b31b1b.svg)](https://arxiv.org/pdf/2404.03608.pdf)\n\n\n\n\nThis repository provides a data processing pipeline for large language model training. 
\nIt consists of four stages: initial data cleaning, near deduplication, exact deduplication, and a second round of data cleaning.\nThe data cleaning part is especially optimized for Southeast Asian languages (e.g., Thai).\n\n## Requirements\n\nInstall the packages and download the models for data cleaning. Here we only download the models for English, Chinese, Thai, Vietnamese, Indonesian, Malay, and Lao. You can add more languages by modifying the `--used_language_ids` parameter. The full language list can be found [here](data_cleaning/languages_id.py).\n\n```\npip install -r requirements.txt\nmkdir lm_resource\nwget https://huggingface.co/datasets/sail/sailcraft_lm_resource/resolve/main/lid.176.bin -P lm_resource\npython code/data_cleaning/download_sentencepiece_kenlm_models.py --used_language_ids en zh th vi id ms lo --output_dir_path lm_resource\n```\n\nInstall Rust for exact deduplication; refer to [this guide](https://github.com/google-research/deduplicate-text-datasets#installing) for more details.\n\n```\ncurl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\n. \"$HOME/.cargo/env\"\n```\n\n## Quickstart\n\nWe sample 1,000 lines from the [cc100 Indonesian subset](https://data.statmt.org/cc-100/) for a preliminary analysis.\n\nExecute the script by running:\n```\nbash run_example.sh\n```\n\nUpon successful execution, you should observe the following logs indicating the processing stages:\n\n```\nCounting lines in cleaned data output: 987\nCounting lines in near deduplication output: 974\nCounting lines in exact deduplication output: 963\nCounting lines in final output: 949\n```\n\nThese counts show how many lines remain after each cleaning and deduplication stage.\nThe final output can be accessed at `data/data_output/final_output/sample/data_clean.jsonl`.\n\n## Running with Your Own Dataset\n\nTo integrate your own dataset into the project, follow these steps:\n\n1. 
**Prepare Your Dataset**: Place your dataset file, named `ALIAS.jsonl`, in the `./data/data_input/` directory.\n2. **Configure Script Variables**: Adjust the `ALIAS` and `LANGUAGE` variables in the `./run_example.sh` script to correspond with your dataset details.\n\n### Parameter Settings\nConfigure each stage by setting the following parameters:\n\n1. **Data Cleaning**: Set the parameters for each filter. Detailed configuration can be found [here](https://github.com/sail-sg/sailcraft/blob/main/code/data_cleaning/parameters_filtering.py).\n2. **Near Deduplication**: Specify the number of permutations to use in MinHash by referring to the example [here](https://github.com/sail-sg/sailcraft/blob/c98a10458a92514d9922fa01a5f3ede631c546ac/code/near_dedup/run_example.sh#L22).\n3. **Exact Deduplication**: Set the length of the duplicated substrings to identify, as shown in the example [here](https://github.com/sail-sg/sailcraft/blob/c98a10458a92514d9922fa01a5f3ede631c546ac/code/exact_dedup/run_example.sh#L18).\n\n\n## Case Studies\n\n1. For data cleaning, check the logs for each filter in `code/data_cleaning/filtering_logs`.\n2. 
Run `code/exact_dedup/scripts/count_topk_occurrences.py` to obtain the top-k occurrences.\n\n```shell\npython code/exact_dedup/scripts/count_topk_occurrences.py \\\n--data_alias sample \\\n--split train \\\n--top_k_number 100 \\\n--threshold 2 \\\n--cache_dir cache/exact_dedup_cache\n```\n\nThis script displays the top 100 most frequent text spans that occur more than twice in the dataset.\n\n| Count | Span |\n|-------|---------------------------------------------------------------------------------------------------|\n| 4 | 'pernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan' |\n| 4 | 'k pernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) da' |\n| 4 | 'nah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan tid' |\n| 4 | 'sentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan tidak pul' |\n| 4 | 'uh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan tidak pula ol' |\n| 4 | 'ah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan tida' |\n| 4 | 'ak pernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) d' |\n| 4 | 'ernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan t' |\n| 3 | 'manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan tidak pula oleh jin.' 
|\n| 3 | 'tidak pernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka' |\n| 3 | 'tidak pernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami merek' |\n\n## Acknowledgment\n\nThanks to the contributors of the following projects:\n\n- [text-dedup](https://github.com/ChenghaoMou/text-dedup)\n- [exact-dedup](https://github.com/google-research/deduplicate-text-datasets)\n- [bigscience-data-preparation](https://github.com/bigscience-workshop/data-preparation)\n- [bigscience-data-tooling](https://github.com/bigscience-workshop/data_tooling)\n\n## Citing this work\n\nIf you use this repository or the Sailor models, please cite:\n\n```\n@article{sailor2report,\ntitle  = {Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLM},\nauthor = {Longxu Dou and Qian Liu and Fan Zhou and Changyu Chen and Zili Wang and Ziqi Jin and Zichen Liu and Tongyao Zhu and Cunxiao Du and Penghui Yang and Haonan Wang and Jiaheng Liu and Yongchi Zhao and Xiachong Feng and Xin Mao and Man Tsung Yeung and Kunat Pipatanakul and Fajri Koto and Min Si Thu and Hynek Kydl{\\'\\i}{\\v{c}}ek and Zeyi Liu and Qunshu Lin and Sittipong Sripaisarnmongkol and Kridtaphad Sae-Khow and Nirattisai Thongchim and Taechawat Konkaew and Narong Borijindargoon and Anh Dao and Matichon Maneegard and Phakphum Artkaew and Zheng-Xin Yong and Quan Nguyen and Wannaphong Phatthiyaphaibun and Hoang H. 
Tran and Mike Zhang and Shiqi Chen and Tianyu Pang and Chao Du and Xinyi Wan and Wei Lu and Min Lin},\njournal={arXiv preprint arXiv:2502.12982},\nyear   = {2025}\n}\n```\n\n```\n@inproceedings{sailor1report,\n    title = \"Sailor: Open Language Models for South-{E}ast {A}sia\",\n    author = \"Dou, Longxu and Liu, Qian and Zeng, Guangtao and Guo, Jia  and Zhou, Jiahui and Mao, Xin and Jin, Ziqi and Lu, Wei and Lin, Min\",\n    booktitle = \"Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations\",\n    year = \"2024\",\n}\n```\n\n## Contact\n\nIf you have any questions, please raise an issue on our GitHub repository or contact \u003ca href=\"mailto:doulx@sea.com\"\u003edoulx@sea.com\u003c/a\u003e.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsail-sg%2Fsailcraft","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsail-sg%2Fsailcraft","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsail-sg%2Fsailcraft/lists"}