{"id":14991134,"url":"https://github.com/adl-x/llavidal","last_synced_at":"2025-04-12T03:25:51.464Z","repository":{"id":269318683,"uuid":"814569881","full_name":"ADL-X/LLAVIDAL","owner":"ADL-X","description":"This is the offical repository of LLAVIDAL","archived":false,"fork":false,"pushed_at":"2025-03-21T16:52:12.000Z","size":33570,"stargazers_count":11,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-25T23:04:56.758Z","etag":null,"topics":["action-recognition","activities-of-daily-living","large-vision-language-model","llvm"],"latest_commit_sha":null,"homepage":"https://adl-x.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ADL-X.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-13T09:09:30.000Z","updated_at":"2025-03-21T16:52:15.000Z","dependencies_parsed_at":null,"dependency_job_id":"29fb9125-7cda-4717-b583-d0dcb3be16bf","html_url":"https://github.com/ADL-X/LLAVIDAL","commit_stats":null,"previous_names":["adl-x/llavidal"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ADL-X%2FLLAVIDAL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ADL-X%2FLLAVIDAL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ADL-X%2FLLAVIDAL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ADL-X%2FLLAVIDAL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/owners/ADL-X","download_url":"https://codeload.github.com/ADL-X/LLAVIDAL/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248511302,"owners_count":21116387,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["action-recognition","activities-of-daily-living","large-vision-language-model","llvm"],"created_at":"2024-09-24T14:21:32.309Z","updated_at":"2025-04-12T03:25:51.441Z","avatar_url":"https://github.com/ADL-X.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# \u003cimg src=\"./llavidal/static/llavidal.ico\"  style=\"vertical-align:middle;\"/\u003e LLAVIDAL: Benchmarking Large LAnguage VIsion Models for Daily Activities of Living 🏃👩‍🦯‍➡️🗨️\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./llavidal/static/web-teaser.jpg\" alt=\"LLAVIDAL Approach Overview\"\u003e\n\u003c/p\u003e   \n\n\n-----\n## Available resources\n| **Resource**               | **Link**                                                                                                                                                                                            |\n|----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| **Paper**                  | [![Paper](https://img.shields.io/badge/Read-Paper-blue.svg)](https://arxiv.org/pdf/2406.09390)                                                                                
                      |\n| **LLAVIDAL Weights**          | [![Weights](https://img.shields.io/badge/Download-Model_Weights-green.svg)](https://huggingface.co/datasets/dreilly/ADL-X/tree/main/model_weights)     |\n| **(ADL-X) Multi-modal Features**         | [![Multi-modal Features](https://img.shields.io/badge/Download-Multimodal_Features-orange.svg)](https://huggingface.co/datasets/dreilly/ADL-X/tree/main/multimodal_features) |\n| **(ADL-X) Instruction Dataset**    | [![Instruction Dataset](https://img.shields.io/badge/Download-Instruction_Dataset-yellowgreen.svg)](https://huggingface.co/datasets/dreilly/ADL-X/tree/main/instruction_data) |\n| **Data Curation**          | [![Data Curation](https://img.shields.io/badge/Read-Data_Curation-aquamarine.svg)](#adl-x-data-curation-pipeline-) |\n| **Training**               | [![Training](https://img.shields.io/badge/Start-Training-crimson.svg)](#training-) |\n| **Offline Demo**           | [![Offline Demo](https://img.shields.io/badge/Run-Offline_Demo-teal.svg)](#running-demo-) |\n| **Quantitative Evaluation**| [![Quantitative Evaluation](https://img.shields.io/badge/View-Quantitative_Evaluation-lightgrey.svg)](#quantitative-evaluation-) |\n\n---\n\n## What is your goal?\n\u003ctable\u003e\n    \u003cthead\u003e\u003ctr\u003e\u003cth colspan=4\u003e\u003ca href=\"#installation-wrench\"\u003eStart with installation (click here)\u003c/a\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\n    \u003ctbody\u003e\u003ctr\u003e\n        \u003ctd\u003e\u003ca href=\"#running-demo-\"\u003eI want to try the demo\u003c/a\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003ca href=\"#training-\"\u003eI want to train LLAVIDAL\u003c/a\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003ca href=\"#quantitative-evaluation-\"\u003eI want to reproduce LLAVIDAL results\u003c/a\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003ca href=\"#adl-x-data-curation-pipeline-\"\u003eI want to generate the ADL-X dataset\u003c/a\u003e\u003c/td\u003e\n    
\u003c/tr\u003e\u003c/tbody\u003e\n\u003c/table\u003e\n\n---\n\n## Installation :wrench:\nOur Python environment is identical to that of [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT); we recommend following their installation instructions:\n\n```shell\nconda create --name=llavidal python=3.10\nconda activate llavidal\n\ngit clone https://github.com/ADL-X/LLAVIDAL.git\ncd LLAVIDAL\npip install -r requirements.txt\n\nexport PYTHONPATH=\"./:$PYTHONPATH\"\n```\n\nAdditionally, if you are using A100/H100 GPUs, you can install [FlashAttention](https://github.com/HazyResearch/flash-attention):\n```shell\npip install ninja\n\ngit clone https://github.com/HazyResearch/flash-attention.git\ncd flash-attention\ngit checkout v1.0.7\npython setup.py install\n```\n---\n\n## Running Demo 🚗\nWe provide a Gradio demo to run LLAVIDAL on your local machine. For the best performance, use a CUDA-enabled machine with at least 18GB of VRAM.\n\n1. Activate the llavidal Conda environment: `conda activate llavidal`\n2. Download the following LLaVA weights: [LLaVA-7B-Lightening-v1-1](https://huggingface.co/mmaaz60/LLaVA-7B-Lightening-v1-1)\n3. Download the LLAVIDAL weights from [Available Resources](#available-resources)\n\nFinally, run the demo by executing the following command:\n\n```shell\npython llavidal/demo/video_demo.py \\\n    --model-name \u003cpath to the LLaVA-7B-Lightening-v1-1 weights downloaded in step 2\u003e \\\n    --projection_path \u003cpath to the LLAVIDAL weights (llavidal_weights.bin) downloaded in step 3\u003e\n```\n\nAfter running the command, a URL will be provided. Click this URL and follow the on-screen instructions to use the demo.\n\n---\n\n## Training 💪🦾\n\nLLAVIDAL is trained on ADL-X, an ADL dataset of over 100K video-instruction pairs. The model weights are initialized from LLaVA, and the model is trained for 3 epochs on eight 48GB NVIDIA RTX A6000 GPUs. 
To begin, download the LLaVA weights from this link: [LLaVA-7B-Lightening-v1-1](https://huggingface.co/mmaaz60/LLaVA-7B-Lightening-v1-1).\n\nWe provide two methods to prepare the ADL-X dataset for training:\n1. **Download the pre-extracted RGB/Object/Skeleton features (recommended)**:\n   - Download the multi-modal features (`video_features.zip`, `object_features.zip`, `pose_features.zip`) and the Instruction Dataset (`NTU_QA-for-training.json`) from [Available Resources](#available-resources)\n   - This should result in a separate directory for each modality and a JSON file for training\n2. **Curate the ADL-X RGB videos and generate RGB features, then download the Object/Skeleton features**:\n   - Follow the steps in [Data Curation Pipeline](#adl-x-data-curation-pipeline-) to obtain the ADL-X videos. Extract features using any feature extractor (e.g., CLIP/SigLIP)\n   - Download the multi-modal Object/Skeleton features (`object_features.zip`, `pose_features.zip`)\n\n### Standard training (this is not the MMPro training proposed in the paper)\nThe command below will train the LLAVIDAL architecture for 3 epochs on all three modalities. This command is modular and will only train LLAVIDAL with the modalities whose folders are passed. For example, if only `--object_folder` and `--pose_folder` are passed, LLAVIDAL will drop the video modality and will only train with the object and pose modalities.\n```shell\ntorchrun --nproc_per_node=8 --master_port 29001 llavidal/train/train_mem.py \\\n          --version v1 \\\n          --tune_mm_mlp_adapter True \\\n          --mm_use_vid_start_end \\\n          --bf16 True \\\n          --num_train_epochs 3 \\\n          --per_device_train_batch_size 4 \\\n          --per_device_eval_batch_size 4 \\\n          --gradient_accumulation_steps 1 \\\n          --evaluation_strategy \"no\" \\\n          --save_strategy \"steps\" \\\n          --save_steps 3000 \\\n          --save_total_limit 3 \\\n          --learning_rate 2e-5 \\\n          --weight_decay 0. 
\\\n          --warmup_ratio 0.03 \\\n          --lr_scheduler_type \"cosine\" \\\n          --logging_steps 100 \\\n          --tf32 True \\\n          --model_max_length 2048 \\\n          --gradient_checkpointing True \\\n          --lazy_preprocess True \\\n          --output_dir ./work_dirs/LLAVIDAL_video-object-pose-text_3epochs \\\n          --model_name_or_path /path/to/LLaVA-7B-Lightening-v1-1/ \\\n          --data_path /path/to/NTU_QA-for-training.json \\\n          --video_folder /path/to/video_features/ \\\n          --object_folder /path/to/object_features/ \\\n          --pose_folder /path/to/pose_features/\n```\n\n### MMPro training\nThis is the suggested method to train LLAVIDAL, in which we use a curriculum learning approach to sequentially introduce modalities into LLAVIDAL. In the implementation, this consists of training many models independently in stage 1, merging their weights, and propagating those weights to the next stage. The following command can be used to train LLAVIDAL with MMPro training (**UPDATE THE PATHS in mmpro_training.sh BEFORE RUNNING**):\n```shell\nbash scripts/mmpro_training.sh\n```\n\nThe final model will be available, relative to the directory in which you ran the above command, at `./work_dirs/mmpose_training/stage3_video-pose-object-text/`.\n\n---\n\n## ADL-X Data Curation Pipeline 📖\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./llavidal/static/adlx-curation-web.jpg\" alt=\"LLAVIDAL Architecture Overview\"\u003e\n\u003c/p\u003e   \n\n**NOTE: You only need to follow the steps below if you want to generate the ADL-X RGB videos. If you are only interested in training LLAVIDAL, you can skip this process entirely and directly download the RGB/Object/Skeleton features of the ADL-X dataset in the [Available Resources](#available-resources) section above**. \n\nFollow the steps below to recreate ADL-X using the data curation pipeline proposed in the paper. 
You'll need to obtain access to and download the [NTU-RGB+D dataset](https://rose1.ntu.edu.sg/dataset/actionRecognition/). \n\n**1.** Make sure your NTU data directory is structured as follows:\n```\nNTU120\n├── rgb\n│   ├── S001C001P001R001A001_rgb.avi\n│   ├── S001C001P001R001A002_rgb.avi\n│   ...\n├── skeletons\n│   ├── S001C001P001R001A001.skeleton.npy\n│   ├── S001C001P001R001A002.skeleton.npy\n│   ...\n```\n\n**2.** Modify the paths in the following script and run it:\n```shell\nbash adlx_curation/gen_adlx_videos_and_QA.sh\n```\n\n\n**3.** Prepare spatio-temporal features using CLIP.\nFor training efficiency, we pre-compute the spatio-temporal video features used during training. Run the following command to generate spatio-temporal features with CLIP; it saves one pickle file per video in the directory specified by the `--clip_feat_path` argument:\n```shell\npython scripts/save_spatio_temporal_clip_features.py \\\n    --video_dir_path /directory/to/save/stitched/videos/ \\\n    --clip_feat_path \u003cThe output dir where features should be saved\u003e\n```\n\n**4.** Download the pre-computed pose (`pose_features.zip`) and object features (`object_features.zip`) from [Available Resources](#available-resources) and extract them.\n\n---\n\n## Quantitative Evaluation 🧪\n\nWe introduce two new evaluations for ADL-centric tasks, [ADLMCQ-AR \u0026 ADLMCQ-AF](https://huggingface.co/datasets/dreilly/ADL-X/tree/main/evaluation), which are multiple-choice question (MCQ) benchmarks containing Action Recognition and Action Forecasting tasks.\nWe also release [SmartHome Untrimmed Descriptions](https://huggingface.co/datasets/dreilly/ADL-X/blob/main/evaluation/Video_description_Smarthome_Untrimmed.json) for the first time.\n\nStep 1: Download all the datasets: [Charades](https://prior.allenai.org/projects/charades), [LEMMA](https://sites.google.com/view/lemma-activity) (we use the exo-view), and [SMARTHOME UNTRIMMED and 
TRIMMED](https://project.inria.fr/toyotasmarthome/).\n\nStep 2: For Action Forecasting, access the JSON files and slice the videos from the start frame to the end frame. For Action Recognition, no preprocessing is needed.\n\n\nStep 3: Arrange the data as in the provided JSON file and run the following commands:\n```shell\ncd llavidal/eval/\n```\n```shell\npython run_inference_action_recognition_charades.py \\\n  --video_dir /path/to/videos \\\n  --qa_file /path/to/qa_file.json \\\n  --output_dir /path/to/output \\\n  --output_name results \\\n  --model-name \u003cLLAVA model path\u003e \\\n  --conv-mode llavidal_v1 \\\n  --projection_path \u003cpath to LLAVIDAL WEIGHTS\u003e\n```\n\n\nStep 4: Evaluate using the GPT-3.5 Turbo API:\n```shell\ncd quantitative_evaluation/\n```\n```shell\npython evaluate_action_recognition_charades.py\n```\nPass the results produced in Step 3 to this script.\n\nFor other methods, the above steps are the same.\n\n-----------------\nFor video descriptions on Charades, run the following commands:\n\n```shell\ncd llavidal/eval\n```\n```shell\npython run_inference_benchmark_general.py\n```\nPass the appropriate paths to get the results JSON.\n\nFor video descriptions on Smarthome Untrimmed, slice the videos into 1-minute clips and generate a dense description as in the data curation process.\n\nTo get the individual descriptions, run:\n\n```shell\ncd llavidal/eval\n```\n```shell\npython run_inference_descriptions_smarthome.py\n```\n\n---\n\n## ADL-X Dataset Details 📂\n\nWe introduce ADL-X, the first ADL-centric video instruction dataset. Due to licensing restrictions we cannot share the original videos, but we share the [Video Features](https://huggingface.co/datasets/dreilly/ADL-X/tree/main/multimodal_features), [Pose Features](https://huggingface.co/datasets/dreilly/ADL-X/blob/main/multimodal_features/pose_features.zip), and [Object Features](https://huggingface.co/datasets/dreilly/ADL-X/blob/main/multimodal_features/object_features.zip).\n\nThe video features are structured as follows:\n```\nVideo 
Features\n├── 001_video_001.pkl\n├── 001_video_002.pkl\n├── 001_video_003.pkl\n├── 001_video_004.pkl\n├── 001_video_005.pkl\n├── 001_video_006.pkl\n...\n\n```\nEach video feature has dimension 356 x 1024.\n\nThe pose features are structured as follows:\n```\nPose Features\n├── 001_001_video_001_pose.pickle\n├── 001_001_video_002_pose.pickle\n├── 001_001_video_003_pose.pickle\n├── 001_001_video_004_pose.pickle\n├── 001_001_video_005_pose.pickle\n├── 001_001_video_006_pose.pickle\n├── 001_001_video_007_pose.pickle\n├── 001_001_video_008_pose.pickle\n...\n```\nEach pose feature has dimension 216 x 256.\n\nThe object features are structured as follows:\n```\nObject Features\n├── 001_001_video_001_object.pkl\n├── 001_001_video_002_object.pkl\n├── 001_001_video_003_object.pkl\n├── 001_001_video_004_object.pkl\n├── 001_001_video_005_object.pkl\n├── 001_001_video_006_object.pkl\n├── 001_001_video_007_object.pkl\n├── 001_001_video_008_object.pkl\n...\n```\nEach object feature has dimension n x 8 x 512, where n is the number of objects present in the video.\n\n---\n\n## Qualitative Analysis 🎬\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./llavidal/static/QA_example.png\" alt=\"Qualitative Evaluation\"\u003e\n\u003c/p\u003e  \n\n---\n\n## Acknowledgements 🙏\n+ [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT?tab=readme-ov-file)\n+ [CogVLM](https://github.com/THUDM/CogVLM)\n\nIf you're using LLAVIDAL in your research or application, please consider citing it using the following BibTeX:\n```bibtex\n@inproceedings{llavidal2024,\n  title={LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living},\n  author={Dominick Reilly and Rajatsubhra Chakraborty and Arkaprava Sinha and Manish Kumar Govind and Pu Wang and Francois Bremond and Le Xue and Srijan Das},\n  booktitle={Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},\n  year={2025}\n}\n```\n----------\n\n## Usage LICENSE :\n\nThe dataset is protected under the CC-BY 
license of Creative Commons, which allows users to distribute, remix, adapt, and build upon the material in any medium or format, as long as the creator is attributed. The license permits commercial use of ADL-X. As the authors of this manuscript and collectors of this dataset, we reserve the right to distribute the data.\n\n------\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadl-x%2Fllavidal","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadl-x%2Fllavidal","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadl-x%2Fllavidal/lists"}