{"id":15688258,"url":"https://github.com/graykode/mlm-pipeline","last_synced_at":"2025-05-07T21:01:37.433Z","repository":{"id":110173104,"uuid":"219958726","full_name":"graykode/mlm-pipeline","owner":"graykode","description":"mlm-pipeline is a cloud architecture that preprocesses the masked language model (mlm)","archived":false,"fork":false,"pushed_at":"2019-11-07T15:17:00.000Z","size":953,"stargazers_count":10,"open_issues_count":1,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-31T14:06:11.476Z","etag":null,"topics":["ansible","aws","bert","cloud","mlm","natural-language-processing","nlp","terraform"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/graykode.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-11-06T09:25:11.000Z","updated_at":"2024-02-12T09:53:37.000Z","dependencies_parsed_at":"2023-03-16T14:31:03.257Z","dependency_job_id":null,"html_url":"https://github.com/graykode/mlm-pipeline","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/graykode%2Fmlm-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/graykode%2Fmlm-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/graykode%2Fmlm-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/graykode%2Fmlm-pipeline/manifests","owner_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/owners/graykode","download_url":"https://codeload.github.com/graykode/mlm-pipeline/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252954376,"owners_count":21830902,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ansible","aws","bert","cloud","mlm","natural-language-processing","nlp","terraform"],"created_at":"2024-10-03T17:56:48.758Z","updated_at":"2025-05-07T21:01:37.377Z","avatar_url":"https://github.com/graykode.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# mlm-pipeline\n\n`mlm-pipeline` is a cloud architecture that preprocesses text for masked language model (MLM) pretraining.\n\n[\u003cimg width=\"400\"\nsrc=\"https://user-images.githubusercontent.com/32828768/49876264-ff2e4180-fdf0-11e8-9512-06ffe3ede9c5.png\"\u003e](https://jalammar.github.io/illustrated-bert/)\n\n\u003cimg src=\"image/readme.png\" height=\"400px;\" /\u003e\n\nIn NLP, pretraining masked language models (MLMs) such as BERT, XLM, RoBERTa, and ALBERT on input sentences with `[MASK]` tokens is a state-of-the-art approach.\n\n`Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head`\n`Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil [MASK] ##mon ' s head`\n\nHowever, tokenizing and masking a few hundred GB of text takes a long time on a single node. 
We use a multi-node architecture that distributes preprocessing across the cloud, with a pipeline designed around the **pull-push pattern**.\n\n- `ventilator`: reads the large text corpus and delivers messages to a ZeroMQ (zmq) queue. The ventilator is a single node.\n- `worker`: 1) runs the BERT tokenizer, 2) creates masks on sentences, 3) pushes the preprocessed tfrecord to S3.\n- `worker controller`: uses Terraform and Ansible to dynamically provision and control all EC2 instances on AWS.\n\n\n\n## Usage\n\n### Prepare wiki dump data\n\n(If you are not using the wiki data, you can skip this step.) Extract the text with\n[WikiExtractor.py](https://github.com/attardi/wikiextractor). It took about an hour on a 96-core EC2 instance.\n\n```shell\nwget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2\nbzip2 -dk enwiki-latest-pages-articles.xml.bz2\n\npython3 WikiExtractor.py -o \\\n  ../output --processes 80 \\\n  ../enwiki-latest-pages-articles.xml\n```\n\n\n\n### Dynamically provision EC2 workers on AWS\n\n- [Installing Terraform](https://learn.hashicorp.com/terraform/getting-started/install.html)\n- [Installation Guide Ansible](https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html)\n\n```shell\n# run on your local PC\nchmod +x init.sh\n./init.sh\ngit clone https://github.com/graykode/mlm-pipeline\n\nexport AWS_ACCESS_KEY_ID='xxxxxxx'\nexport AWS_SECRET_ACCESS_KEY='xxxxxx'\n\ncd worker_controller/terraform\nterraform init\n```\n\nChange the following variables in [variables.tf](https://github.com/graykode/mlm-pipeline/blob/master/worker_controller/terraform/variables.tf):\n\n- region, zone, **number_of_worker**, client_instance_type, volume_size, **client_subnet**, **client_security_groups**, default_keypair_name\n- You must **open the ventilator port** when creating client_security_groups.\n\nThen run:\n\n```shell\n# run on your local PC\nterraform apply\n```\n\nTo destroy all provisioned resources, run `terraform destroy`.\n\n\n\n\n### command to all ec2 nodes in one 
time.\n\n```shell\n# run on your local PC\ncd ../ansible\n\n# ping test\nansible-playbook -i ./inventory/ec2.py \\\n      --limit \"tag_type_worker\" \\\n      -u ubuntu \\\n      --private-key ~/.ssh/SoRT.pem ping.yaml\n\n# install Python packages (e.g. tensorflow, boto, zmq, ...)\nansible-playbook -i ./inventory/ec2.py \\\n      --limit \"tag_type_worker\" \\\n      -u ubuntu \\\n      --private-key ~/.ssh/SoRT.pem init.yaml \\\n      --extra-vars \"aws_access_key_id=\u003ckey_id\u003e aws_secret_access_key=\u003caccess_key\u003e\" -vvvv\n```\n\n`aws_access_key_id` and `aws_secret_access_key` are written to the environment variables file (`/etc/environment`) so that boto can access S3. Replace `\u003ckey_id\u003e` and `\u003caccess_key\u003e` with your own values.\n\n\n\n\n### Ventilator setup\n\n```shell\n# run on the ventilator EC2 instance\nwget https://raw.githubusercontent.com/graykode/mlm-pipeline/master/init.sh\n# init shell for ventilator\nsudo apt update \u0026\u0026 sudo apt install -y python3 \u0026\u0026 \\\n      sudo apt install -y python3-pip \u0026\u0026 \\\n      pip3 install zmq\n```\n\n\n\n### Start the workers first, then the ventilator\n\n```shell\n# run on your local PC\nansible-playbook -i ./inventory/ec2.py \\\n      --limit \"tag_type_worker\" \\\n      -u ubuntu \\\n      --private-key ~/.ssh/SoRT.pem working.yaml \\\n      --extra-vars \"bucket_name=\u003cbucket_name\u003e vserver=\u003cventilator_ip\u003e\"\n\n# run on the ventilator EC2 instance\npython3 ventilator.py \\\n  --data 'data folder path' \\\n  --vport 5557 \\\n  --time 0.88\n```\n\n\n\n## License\n\nMIT\n\n\n\n## Author\n\n- Tae Hwan Jung (Jeff Jung) @graykode, Kyung Hee Univ CE (Undergraduate).\n- Author Email : 
[nlkey2022@gmail.com](mailto:nlkey2022@gmail.com)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgraykode%2Fmlm-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgraykode%2Fmlm-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgraykode%2Fmlm-pipeline/lists"}