https://github.com/graykode/mlm-pipeline
mlm-pipeline is a cloud architecture that preprocesses the masked language model (mlm)
https://github.com/graykode/mlm-pipeline
ansible aws bert cloud mlm natural-language-processing nlp terraform
Last synced: about 1 month ago
JSON representation
mlm-pipeline is a cloud architecture that preprocesses the masked language model (mlm)
- Host: GitHub
- URL: https://github.com/graykode/mlm-pipeline
- Owner: graykode
- License: mit
- Created: 2019-11-06T09:25:11.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-11-07T15:17:00.000Z (over 5 years ago)
- Last Synced: 2025-03-31T14:06:11.476Z (2 months ago)
- Topics: ansible, aws, bert, cloud, mlm, natural-language-processing, nlp, terraform
- Language: Python
- Homepage:
- Size: 931 KB
- Stars: 10
- Watchers: 2
- Forks: 1
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# mlm-pipeline
`mlm-pipeline` is a cloud architecture that preprocesses the masked language model (mlm).
[
](https://jalammar.github.io/illustrated-bert/)
In NLP, a masked Languge Model (MLM) such as BERT, XLM, RoBERTa, and ALBERT, pretraining the sentence's input with `[MASK]` is a state-of-a-art.
`Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head`
`Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil
[MASK] ##mon ' s head`However, the preprocessing process of tokenizing and masking a few hundred GB of large text takes a lot of time with a single node. We use a multi-node architecture that distributes preprocessing through the cloud architecture's pipeline design with **pull-push pattern**.
- `ventilator` : Read large text and deliver message to zmq's queue. ventilator is a single node.
- `worker` : 1) BERT Tokenizer, 2) Create Masked on sentences, 3) push preprocessed tfrecord to S3
- `worker controller` : using Terraform, Ansible, we can control all ec2s and dynamic provisioning ec2 on AWS.## Usage
### prepare wiki dump data
(If you don't use this wiki data, you can cancel this step). extract the text with
[WikiExtractor.py](https://github.com/attardi/wikiextractor). It took about an hour using 96 core ec2.```shell
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
bzip2 -dk enwiki-latest-pages-articles.xml.bz2python3 WikiExtractor.py -o \
../output --processes 80 \
../enwiki-latest-pages-articles.xml
```#
### dynamic provisioning EC2(worker) on AWS
- [Installing Terraform](https://learn.hashicorp.com/terraform/getting-started/install.html)
- [Installation Guide Ansible](https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html)```shell
(run on your local pc)
chmod +x init.sh
./init.sh
git clone https://github.com/graykode/mlm-pipelineexport AWS_ACCESS_KEY_ID='xxxxxxx'
export AWS_SECRET_ACCESS_KEY='xxxxxx'cd worker_controller/terraform
terraform init
```change some variable in [variables.tf](https://github.com/graykode/mlm-pipeline/blob/master/worker_controller/terraform/variables.tf)
- region, zone, **number_of_worker** , client_instance_type, volume_size, **client_subnet**, **client_security_groups**, default_keypair_name
- you must **open ventilation port** when create client_security_groups.Then run below:
```shell
(run on your local pc)
terraform apply
```or if you want to destroy all, type `terraform destroy`
### command to all ec2 nodes in one time.
```shell
(run on your local pc)
cd ../ansible# ping test
ansible-playbook -i ./inventory/ec2.py \
--limit "tag_type_worker" \
-u ubuntu \
--private-key ~/.ssh/SoRT.pem ping.yaml
# install python packagement(ex tensorflow, boto, zmq, ..)
ansible-playbook -i ./inventory/ec2.py \
--limit "tag_type_worker" \
-u ubuntu \
--private-key ~/.ssh/SoRT.pem init.yaml \
--extra-vars "aws_access_key_id= aws_secret_access_key=" -vvvv
````aws_access_key_id` and `aws_secret_access_key` will be in environment variable (`/etc/environment`) to using boto s3. change as your ``, ``.
### ventilation setting
```shell
(in ventilation ec2)
wget https://raw.githubusercontent.com/graykode/mlm-pipeline/master/init.sh
# init shell for ventilator
sudo apt update && sudo apt install -y python3 && \
sudo apt install -y python3-pip && \
pip3 install zmq
```### Start in workers and ventilation order
```shell
(run on your local pc)
ansible-playbook -i ./inventory/ec2.py \
--limit "tag_type_worker" \
-u ubuntu \
--private-key ~/.ssh/SoRT.pem working.yaml \
--extra-vars "bucket_name= vserver="
(in ventilation ec2)
python3 ventilator.py \
--data 'data folder path' \
--vport 5557 \
--time 0.88
```## License
MIT
## Author
- Tae Hwan Jung(Jeff Jung) @graykode, Kyung Hee Univ CE(Undergraduate).
- Author Email : [[email protected]](mailto:[email protected])