{"id":16677653,"url":"https://github.com/shjwudp/c4-dataset-script","last_synced_at":"2025-07-27T08:32:24.825Z","repository":{"id":37962254,"uuid":"496950762","full_name":"shjwudp/c4-dataset-script","owner":"shjwudp","description":"Inspired by google c4, here is a series of colossal clean data cleaning scripts focused on CommonCrawl data processing. Including Chinese data processing and cleaning methods in MassiveText.","archived":false,"fork":false,"pushed_at":"2023-06-07T14:13:15.000Z","size":600,"stargazers_count":119,"open_issues_count":0,"forks_count":14,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-11-19T20:23:05.791Z","etag":null,"topics":["commoncrawl","dataset","massivetext","nlp","python","spark"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shjwudp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-05-27T10:15:11.000Z","updated_at":"2024-09-22T10:43:52.000Z","dependencies_parsed_at":"2024-10-28T11:28:38.355Z","dependency_job_id":"0b977115-15bb-4304-83a0-795cb72185d9","html_url":"https://github.com/shjwudp/c4-dataset-script","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shjwudp%2Fc4-dataset-script","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shjwudp%2Fc4-dataset-script/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shjwudp%2Fc4-dataset-script/relea
ses","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shjwudp%2Fc4-dataset-script/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shjwudp","download_url":"https://codeload.github.com/shjwudp/c4-dataset-script/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227782487,"owners_count":17819276,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["commoncrawl","dataset","massivetext","nlp","python","spark"],"created_at":"2024-10-12T13:27:08.386Z","updated_at":"2024-12-02T18:45:52.681Z","avatar_url":"https://github.com/shjwudp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# C4 Dataset Script\n\n[C4](https://www.tensorflow.org/datasets/catalog/c4) is a great way to get a colossal cleaned web corpus. Unfortunately, Google's open-source C4 script depends heavily on GCP, and its code is mixed into a large repository, which makes it hard to develop independently. This repository extracts the processing logic and implements it to run on Spark. In addition, some helpful data processing methods from MassiveText are implemented in massivetext_utils.py.\n\n## Run c4 script on Spark\n\nSet up the c4 work environment.\n\n```bash\n# 1. Create an independent Anaconda environment and install Python dependencies\nconda create -y -n c4-env conda-pack \u0026\u0026 conda activate c4-env\npip install git+https://github.com/shjwudp/c4-dataset-script\n\n# 2. Download punkt tokenizer\npython -m nltk.downloader -d $(which python | xargs dirname)/../nltk_data punkt\n\n# 3. 
Running pyspark requires Java to be installed in your environment, so\n#    make sure you have a JDK installed and JAVA_HOME configured.\n```\n\nIf everything goes well, you can make the C4 dataset on localhost.\n\n```bash\npython -m c4_dataset_script.c4_script --wet-file-paths $PATH_TO_YOUR_CC_WET_FILE\n```\n\nOr submit to a Spark cluster.\n\n```bash\n# 1. Before submitting to the cluster, you need to package the conda environment\nconda pack --name c4-env -o c4-env.tar.gz\n\n# 2. Submit to the Spark cluster\nPYSPARK_DRIVER_PYTHON=python \\\nPYSPARK_PYTHON=./environment/bin/python \\\npython c4_dataset_script/c4_script.py \\\n    --wet-file-paths $PATH_TO_YOUR_CC_WET_FILE \\\n    --c4-save-path $PATH_TO_YOUR_C4_OUTPUT \\\n    --spark-master $SPARK_MASTER_ADDR \\\n    --spark-archives c4-env.tar.gz#environment\n```\n\n## Make a colossal cleaned Chinese web corpus\n\nFollowing the C4 method, this repository provides a data processing pipeline for building a cleaned Chinese web corpus. It includes web page download, Chinese language identification, heuristic text filtering, toxic content detection and filtering, and the Repetition Removal used in Google/DeepMind MassiveText.\n\n## 1. Download the WET crawl archive index file\n\nCommon Crawl organizes its crawled data into archives. You can browse the list of archives [here](https://commoncrawl.org/the-data/get-started/). In the next step, we will download text data (WET) as the input for processing. First, download the WET crawl archive index file.\n\n```bash\ncd c4_dataset_script\nwget -r --no-parent https://data.commoncrawl.org/crawl-data/${CRAWL_ARCHIVE_ID}/wet.paths.gz\n```\n\n*You can get CRAWL_ARCHIVE_ID [here](https://commoncrawl.org/the-data/get-started/). For instance: CC-MAIN-2022-49.*\n\n## 2. 
Run the download and Chinese screening script on Spark\n\n```bash\nspark-submit --master ${SPARK_MASTER_ADDR} \\\n    Chinese/download_web_docs.py \\\n        --wet-paths ./data.commoncrawl.org/crawl-data/${CRAWL_ARCHIVE_ID}/wet.paths.gz \\\n        --output ./download-docs\n```\n\n## 3. Filter out non-sentence lines and toxic documents\n\nFollowing the C4 heuristics, I used the following strategies for cleaning up Common Crawl's web-extracted text:\n\n- Only retained lines that ended in a terminal punctuation mark or colon.\n- Discarded any page with fewer than five sentences and only retained lines that\ncontained at least five words.\n- Removed any page that contained any word on the \"List of Dirty, Naughty, Obscene\nor Otherwise Bad Words.\"\n- Many of the scraped pages contained garbled Chinese text, so I removed any line containing garbled characters. For example: \"[-]|□|■|�\".\n\n```bash\ncat ./download-docs/*/part-* | \\\n    python Chinese/filter_out_bad_lines.py \\\n        --badwords_filepath ./badwords/zh \\\n         \u003e clean_docs.jsonl\n```\n\n*About 93.57% of documents are filtered out in this stage. You can see samples of filtered documents [here](data/Chinese_bad-lines_samples.jsonl).*\n\n## 4. Remove duplicated text\n\nTo eliminate duplicate text, I use the text deduplication strategy from C4. The algorithm divides each document into lines, hashes them, and removes any duplicate lines from the dataset. This approach is particularly useful for removing repeated header and footer content.\n\n```bash\nspark-submit --master ${SPARK_MASTER_ADDR} \\\n    Chinese/remove_duplicate_text.py \\\n        --input clean_docs.jsonl \\\n        --output ./deduplicated_text\n```\n\n*About 62.67% of documents are filtered out in this stage. You can see samples of filtered lines [here](data/Chinese_Remove-Duplicated-Text_samples.jsonl).*\n\n## 5. 
Remove documents that are overly self-repetitive - Repetition Removal in DeepMind MassiveText\n\nThe program checks the percentage of duplicate content in each web document and removes documents whose duplicate proportion exceeds a preset threshold. This function implements \"Repetition Removal\" as described in [Gopher](https://arxiv.org/abs/2112.11446).\n\n```bash\nspark-submit --master ${SPARK_MASTER_ADDR} \\\n    Chinese/repetition_removal.py \\\n        --input clean_docs.jsonl \\\n        --output ./repetition_removal_output\n```\n\n*About 21.21% of documents are filtered out in this stage. You can see samples of filtered documents [here](data/Chinese_Repetition-Removal_samples.jsonl).*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshjwudp%2Fc4-dataset-script","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshjwudp%2Fc4-dataset-script","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshjwudp%2Fc4-dataset-script/lists"}