{"id":19932142,"url":"https://github.com/amazon-science/embert","last_synced_at":"2025-05-03T11:31:27.824Z","repository":{"id":139011536,"uuid":"393853247","full_name":"amazon-science/embert","owner":"amazon-science","description":"Code for EmBERT, a transformer model for embodied, language-guided visual task completion.","archived":false,"fork":false,"pushed_at":"2024-01-30T07:03:51.000Z","size":602,"stargazers_count":56,"open_issues_count":3,"forks_count":11,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-01-30T08:25:15.313Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/amazon-science.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2021-08-08T04:08:00.000Z","updated_at":"2024-01-30T08:25:19.353Z","dependencies_parsed_at":"2024-01-30T08:25:17.557Z","dependency_job_id":"8357e017-454a-44f4-9639-dce8349d118d","html_url":"https://github.com/amazon-science/embert","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fembert","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fembert/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fembert/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fembert/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
owners/amazon-science","download_url":"https://codeload.github.com/amazon-science/embert/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224360233,"owners_count":17298319,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T23:09:12.710Z","updated_at":"2024-11-12T23:09:13.354Z","avatar_url":"https://github.com/amazon-science.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# EmBERT: A Transformer Model for Embodied, Language-guided Visual Task Completion\n\nWe present Embodied BERT (EmBERT), a transformer-based model which can attend to high-dimensional, multi-modal inputs\nacross long temporal horizons for language-conditioned task completion. Additionally, we bridge the gap between\nsuccessful object-centric navigation models used for non-interactive agents and the language-guided visual task\ncompletion benchmark, ALFRED, by introducing object navigation targets for EmBERT training. We achieve competitive\nperformance on the ALFRED benchmark, and EmBERT marks the first transformer-based model to successfully handle the\nlong-horizon, dense, multi-modal histories of ALFRED, and the first ALFRED model to utilize object-centric navigation\ntargets.\n\nIn this repository, we provide the entire codebase which is used for training and evaluating EmBERT performance on the\nALFRED dataset. 
It's mostly based on [AllenNLP](https://allennlp.org/)\nand [PyTorch-Lightning](https://pytorch-lightning.readthedocs.io/en/latest/),\nso it's inherently easy to extend.\n\n## Setup\n\nWe used Anaconda for our experiments. Please create an Anaconda environment and then install the project dependencies\nwith the following command:\n\n```bash\npip install -r requirements.txt\n```\n\nAs the next step, we will download the ALFRED data using the script `scripts/download_alfred_data.sh` as follows:\n\n```bash\nsh scripts/download_alfred_data.sh json_feat\n```\n\nBefore doing so, make sure that you have installed `p7zip` because it is used to extract the trajectory files.\n\n## MaskRCNN fine-tuning\n\nWe provide the code to fine-tune a MaskRCNN model on the ALFRED dataset. To create the vision dataset, use the script\n`scripts/generate_vision_dataset.sh`. This will create the dataset splits required by the training process. After this,\nit's possible to run the model fine-tuning using:\n\n```bash\nPYTHONPATH=. python vision/finetune.py --batch_size 8 --gradient_clip_val 5 --lr 3e-4 --gpus 1 --accumulate_grad_batches 2 --num_workers 4 --save_dir storage/models/vision/maskrcnn_bs_16_lr_3e-4_epochs_46_7k_batches --max_epochs 46 --limit_train_batches 7000\n```\n\nWe provide this code for reference; however, in our experiments we used the MaskRCNN model from MOCA, which applies more\nsophisticated data augmentation techniques to improve performance on the ALFRED dataset.\n\n## ALFRED Visual Features extraction\n\n### MaskRCNN\n\nThe visual feature extraction script is responsible for generating the MaskRCNN features as well as orientation\ninformation for every bounding box. For the MaskRCNN model, we use the pretrained model from MOCA. You can download it\nfrom their GitHub page. 
First, we create the directory structure and then download the model weights:\n\n```bash\nmkdir -p storage/models/vision/moca_maskrcnn;\nwget https://alfred-colorswap.s3.us-east-2.amazonaws.com/weight_maskrcnn.pt -O storage/models/vision/moca_maskrcnn/weight_maskrcnn.pt; \n```\n\nWe extract visual features for training trajectories using the following command:\n\n```bash\nsh scripts/generate_moca_maskrcnn.sh\n```\n\nYou can refer to the actual extraction script `scripts/generate_maskrcnn_horizon0.py` for additional parameters. We\nexecuted this command on a `p3.2xlarge` instance with an NVIDIA V100 GPU. This command will populate the directory\n`storage/data/alfred/json_feat_2.1.0/` with the visual features for each trajectory step. In particular, the parameter\n`--features_folder` specifies the subdirectory (for each trajectory) that will contain the compressed NumPy files\nconstituting the features. Each NumPy file has the following structure:\n\n```python\ndict(\n    box_features=np.array,\n    roi_angles=np.array,\n    boxes=np.array,\n    masks=np.array,\n    class_probs=np.array,\n    class_labels=np.array,\n    num_objects=int,\n    pano_id=int\n)\n\n```\n\n## Data-augmentation procedure\n\nIn our paper, we describe a procedure to augment the ALFRED trajectories with object and corresponding receptacle\ninformation. In particular, we replay the trajectories and track each object and its receptacle during a\nsubgoal. The data augmentation script will create a new trajectory file called `ref_traj_data.json` that mimics the\ndata structure of the original ALFRED dataset but adds a few fields for each action.\n\nTo start generating the refined data, use the following script:\n\n```bash\nPYTHONPATH=. python scripts/generate_landmarks.py \n```\n\n## EmBERT Training\n\n### Vocabulary creation\n\nWe use `AllenNLP` for training our models. 
Before starting the training, we generate the vocabulary for the model\nusing the following command:\n\n```bash\nallennlp build-vocab training_configs/embert/embert_oscar.jsonnet storage/models/embert/vocab.tar.gz --include-package grolp\n```\n\n### Training\n\nFirst, we need to download the OSCAR checkpoint before starting the training process. We used a version of OSCAR that\ndoesn't use object labels; it can be freely downloaded by following the instructions\non [GitHub](https://github.com/microsoft/Oscar/blob/master/DOWNLOAD.md). Make sure to download this file into the\nfolder `storage/models/pretrained` using the following commands:\n\n```bash\nmkdir -p storage/models/pretrained/;\nwget https://biglmdiag.blob.core.windows.net/oscar/pretrained_models/base-no-labels.zip -O storage/models/pretrained/oscar.zip;\nunzip storage/models/pretrained/oscar.zip -d storage/models/pretrained/;\nmv storage/models/pretrained/base-no-labels/ep_67_588997/pytorch_model.bin storage/models/pretrained/oscar-base-no-labels.bin;\nrm storage/models/pretrained/oscar.zip;\n```\n\nA new model can be trained using the following command:\n\n```bash\nallennlp train training_configs/embert/embert_widest.jsonnet -s storage/models/alfred/embert --include-package grolp\n```\n\nWhen training for the first time, make sure to add the following parameters to the previous command:\n`--preprocess --num_workers 4`. This will make sure that the dataset is preprocessed and cached in order to speed up\ntraining. We ran training on AWS EC2 `p3.8xlarge` instances with `16` workers on a single GPU per configuration.\n\nThe configuration file `training_configs/embert/embert_widest.jsonnet` contains all the parameters that you might be\ninterested in if you want to change the way the model works, as well as any reference to the actual feature files. If you're\ninterested in how to change the model itself, please refer to [the model definition](grolp/models/alfred.py). 
The\nparameters in the constructor of the class reflect the ones reported in the configuration file. In general, this\nproject has been developed using AllenNLP as a reference framework. We refer the reader to the official\n[AllenNLP documentation](http://docs.allennlp.org/main/) for more details about how to structure a project.\n\n## EmBERT evaluation\n\nWe modified the original ALFRED evaluation script to make sure that the results are completely reproducible. Refer to\nthe original repository for more information.\n\nTo run the evaluation on the `valid_seen` and `valid_unseen` splits, you can use the provided script `scripts/run_eval.sh`.\nThe EmBERT trainer has different ways of saving checkpoints. At the end of the training,\nit will automatically save the best model in an archive named `model.tar.gz` in the destination folder (the one\nspecified with `-s`). To evaluate it, run the following command:\n\n```bash\nsh scripts/run_eval.sh \u003cyour_model_path\u003e/model.tar.gz \n```\n\nIt's also possible to run the evaluation of a specific checkpoint. This can be done by running the previous command as\nfollows:\n\n```bash\nsh scripts/run_eval.sh \u003cyour_model_path\u003e/model-epoch=6.ckpt\n```\n\nIn this way the evaluation script will load the checkpoint at epoch 6 in the path `\u003cyour_model_path\u003e`. 
When specifying a\ncheckpoint directly, make sure that the folder `\u003cyour_model_path\u003e` contains both the `config.json` file and the `vocabulary`\ndirectory because they are required by the script to load all the correct model parameters.\n\n## Citation\n\nIf you're using this codebase, please cite our work:\n\n```bibtex\n@article{suglia:embert,\n  title={Embodied {BERT}: A Transformer Model for Embodied, Language-guided Visual Task Completion},\n  author={Alessandro Suglia and Qiaozi Gao and Jesse Thomason and Govind Thattai and Gaurav Sukhatme},\n  journal={arXiv},\n  year={2021},\n  url={https://arxiv.org/abs/2108.04927}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Fembert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famazon-science%2Fembert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Fembert/lists"}