{"id":15634764,"url":"https://github.com/fcakyon/video-transformers","last_synced_at":"2026-03-02T16:36:00.334Z","repository":{"id":57748882,"uuid":"524105926","full_name":"fcakyon/video-transformers","owner":"fcakyon","description":"Easiest way of fine-tuning HuggingFace video classification models","archived":false,"fork":false,"pushed_at":"2023-03-20T20:43:24.000Z","size":74,"stargazers_count":141,"open_issues_count":1,"forks_count":13,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-09T22:12:57.437Z","etag":null,"topics":["accelerate","classification","deep-learning","evaluate","huggingface","layer","machine-learning","neptune","onnx","onnxruntime","python","pytorch","pytorch-video","tensorboard","transformers","video","video-classification","video-transformer","vision","wandb"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fcakyon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"fcakyon"}},"created_at":"2022-08-12T13:52:28.000Z","updated_at":"2025-04-08T09:38:58.000Z","dependencies_parsed_at":"2024-10-23T03:23:35.748Z","dependency_job_id":null,"html_url":"https://github.com/fcakyon/video-transformers","commit_stats":{"total_commits":26,"total_committers":2,"mean_commits":13.0,"dds":"0.038461538461538436","last_synced_commit":"8ada60b5a01964d813f11cd491d1bc22653e7303"},"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/f
cakyon%2Fvideo-transformers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fcakyon%2Fvideo-transformers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fcakyon%2Fvideo-transformers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fcakyon%2Fvideo-transformers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fcakyon","download_url":"https://codeload.github.com/fcakyon/video-transformers/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248119292,"owners_count":21050755,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["accelerate","classification","deep-learning","evaluate","huggingface","layer","machine-learning","neptune","onnx","onnxruntime","python","pytorch","pytorch-video","tensorboard","transformers","video","video-classification","video-transformer","vision","wandb"],"created_at":"2024-10-03T10:56:39.920Z","updated_at":"2026-03-02T16:35:55.283Z","avatar_url":"https://github.com/fcakyon.png","language":"Python","funding_links":["https://github.com/sponsors/fcakyon"],"categories":["Python"],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://user-images.githubusercontent.com/34196005/180642397-1f56d9c7-dee2-48d4-acbf-c3bc62f36150.png\" width=\"500\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    Easiest way of fine-tuning HuggingFace video classification models.\n\u003c/p\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003ca 
href=\"https://badge.fury.io/py/video-transformers\"\u003e\u003cimg src=\"https://badge.fury.io/py/video-transformers.svg\" alt=\"pypi version\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://pepy.tech/project/video-transformers\"\u003e\u003cimg src=\"https://pepy.tech/badge/video-transformers\" alt=\"total downloads\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://twitter.com/fcakyon\"\u003e\u003cimg src=\"https://img.shields.io/twitter/follow/fcakyon?color=blue\u0026logo=twitter\u0026style=flat\" alt=\"fcakyon twitter\"\u003e\u003c/a\u003e\n\u003c/div\u003e\n\n## 🚀 Features\n\n`video-transformers` uses:\n\n- 🤗 [accelerate](https://github.com/huggingface/accelerate) for distributed training,\n\n- 🤗 [evaluate](https://github.com/huggingface/evaluate) for evaluation,\n\n- [pytorchvideo](https://github.com/facebookresearch/pytorchvideo) for data loading\n\nand supports:\n\n- creating and fine-tuning video models using [transformers](https://github.com/huggingface/transformers) and [timm](https://github.com/rwightman/pytorch-image-models) vision models\n\n- experiment tracking with [neptune](https://neptune.ai/), [tensorboard](https://www.tensorflow.org/tensorboard) and other trackers\n\n- exporting fine-tuned models in [ONNX](https://onnx.ai/) format\n\n- pushing fine-tuned models into [HuggingFace Hub](https://huggingface.co/models?pipeline_tag=image-classification\u0026sort=downloads)\n\n- loading pretrained models from [HuggingFace Hub](https://huggingface.co/models?pipeline_tag=image-classification\u0026sort=downloads)\n\n- automated [Gradio app](https://gradio.app/) and [Space](https://huggingface.co/spaces) creation\n\n## 🏁 Installation\n\n- Install `PyTorch`:\n\n```bash\nconda install pytorch=1.11.0 torchvision=0.12.0 cudatoolkit=11.3 -c pytorch\n```\n\n- Install pytorchvideo and transformers from their main branches:\n\n```bash\npip install git+https://github.com/facebookresearch/pytorchvideo.git\npip install 
git+https://github.com/huggingface/transformers.git\n```\n\n- Install `video-transformers`:\n\n```bash\npip install video-transformers\n```\n\n## 🔥 Usage\n\n- Prepare video classification dataset in such folder structure (.avi and .mp4 extensions are supported):\n\n```bash\ntrain_root\n    label_1\n        video_1\n        video_2\n        ...\n    label_2\n        video_1\n        video_2\n        ...\n    ...\nval_root\n    label_1\n        video_1\n        video_2\n        ...\n    label_2\n        video_1\n        video_2\n        ...\n    ...\n```\n\n- Fine-tune Timesformer (from HuggingFace) video classifier:\n\n```python\nfrom torch.optim import AdamW\nfrom video_transformers import VideoModel\nfrom video_transformers.backbones.transformers import TransformersBackbone\nfrom video_transformers.data import VideoDataModule\nfrom video_transformers.heads import LinearHead\nfrom video_transformers.trainer import trainer_factory\nfrom video_transformers.utils.file import download_ucf6\n\nbackbone = TransformersBackbone(\"facebook/timesformer-base-finetuned-k400\", num_unfrozen_stages=1)\n\ndownload_ucf6(\"./\")\ndatamodule = VideoDataModule(\n    train_root=\"ucf6/train\",\n    val_root=\"ucf6/val\",\n    batch_size=4,\n    num_workers=4,\n    num_timesteps=8,\n    preprocess_input_size=224,\n    preprocess_clip_duration=1,\n    preprocess_means=backbone.mean,\n    preprocess_stds=backbone.std,\n    preprocess_min_short_side=256,\n    preprocess_max_short_side=320,\n    preprocess_horizontal_flip_p=0.5,\n)\n\nhead = LinearHead(hidden_size=backbone.num_features, num_classes=datamodule.num_classes)\nmodel = VideoModel(backbone, head)\n\noptimizer = AdamW(model.parameters(), lr=1e-4)\n\nTrainer = trainer_factory(\"single_label_classification\")\ntrainer = Trainer(datamodule, model, optimizer=optimizer, max_epochs=8)\n\ntrainer.fit()\n\n```\n\n- Fine-tune ConvNeXT (from HuggingFace) + Transformer based video classifier:\n\n```python\nfrom torch.optim import 
AdamW\nfrom video_transformers import TimeDistributed, VideoModel\nfrom video_transformers.backbones.transformers import TransformersBackbone\nfrom video_transformers.data import VideoDataModule\nfrom video_transformers.heads import LinearHead\nfrom video_transformers.necks import TransformerNeck\nfrom video_transformers.trainer import trainer_factory\nfrom video_transformers.utils.file import download_ucf6\n\nbackbone = TimeDistributed(TransformersBackbone(\"facebook/convnext-small-224\", num_unfrozen_stages=1))\nneck = TransformerNeck(\n    num_features=backbone.num_features,\n    num_timesteps=8,\n    transformer_enc_num_heads=4,\n    transformer_enc_num_layers=2,\n    dropout_p=0.1,\n)\n\ndownload_ucf6(\"./\")\ndatamodule = VideoDataModule(\n    train_root=\"ucf6/train\",\n    val_root=\"ucf6/val\",\n    batch_size=4,\n    num_workers=4,\n    num_timesteps=8,\n    preprocess_input_size=224,\n    preprocess_clip_duration=1,\n    preprocess_means=backbone.mean,\n    preprocess_stds=backbone.std,\n    preprocess_min_short_side=256,\n    preprocess_max_short_side=320,\n    preprocess_horizontal_flip_p=0.5,\n)\n\nhead = LinearHead(hidden_size=neck.num_features, num_classes=datamodule.num_classes)\nmodel = VideoModel(backbone, head, neck)\n\noptimizer = AdamW(model.parameters(), lr=1e-4)\n\nTrainer = trainer_factory(\"single_label_classification\")\ntrainer = Trainer(\n    datamodule,\n    model,\n    optimizer=optimizer,\n    max_epochs=8\n)\n\ntrainer.fit()\n\n```\n\n- Fine-tune Resnet18 (from HuggingFace) + GRU based video classifier:\n\n```python\nfrom video_transformers import TimeDistributed, VideoModel\nfrom video_transformers.backbones.transformers import TransformersBackbone\nfrom video_transformers.data import VideoDataModule\nfrom video_transformers.heads import LinearHead\nfrom video_transformers.necks import GRUNeck\nfrom video_transformers.trainer import trainer_factory\nfrom video_transformers.utils.file import download_ucf6\n\nbackbone = 
TimeDistributed(TransformersBackbone(\"microsoft/resnet-18\", num_unfrozen_stages=1))\nneck = GRUNeck(num_features=backbone.num_features, hidden_size=128, num_layers=2, return_last=True)\n\ndownload_ucf6(\"./\")\ndatamodule = VideoDataModule(\n    train_root=\"ucf6/train\",\n    val_root=\"ucf6/val\",\n    batch_size=4,\n    num_workers=4,\n    num_timesteps=8,\n    preprocess_input_size=224,\n    preprocess_clip_duration=1,\n    preprocess_means=backbone.mean,\n    preprocess_stds=backbone.std,\n    preprocess_min_short_side=256,\n    preprocess_max_short_side=320,\n    preprocess_horizontal_flip_p=0.5,\n)\n\nhead = LinearHead(hidden_size=neck.hidden_size, num_classes=datamodule.num_classes)\nmodel = VideoModel(backbone, head, neck)\n\nTrainer = trainer_factory(\"single_label_classification\")\ntrainer = Trainer(\n    datamodule,\n    model,\n    max_epochs=8\n)\n\ntrainer.fit()\n\n```\n\n- Run prediction on a single file or a folder of videos:\n\n```python\nfrom video_transformers import VideoModel\n\nmodel = VideoModel.from_pretrained(model_name_or_path)\n\nmodel.predict(video_or_folder_path=\"video.mp4\")\n# \u003e\u003e [{'filename': \"video.mp4\", 'predictions': {'class1': 0.98, 'class2': 0.02}}]\n```\n\n\n## 🤗 Full HuggingFace Integration\n\n- Push your fine-tuned model to the hub:\n\n```python\nfrom video_transformers import VideoModel\n\nmodel = VideoModel.from_pretrained(\"runs/exp/checkpoint\")\n\nmodel.push_to_hub('model_name')\n```\n\n- Load any pretrained video-transformer model from the hub:\n\n```python\nfrom video_transformers import VideoModel\n\nmodel = VideoModel.from_pretrained('account_name/model_name')\n```\n\n- Push your model to HuggingFace hub with auto-generated model cards:\n\n```python\nfrom video_transformers import VideoModel\n\nmodel = VideoModel.from_pretrained(\"runs/exp/checkpoint\")\nmodel.push_to_hub('account_name/app_name')\n```\n\n- (Incoming feature) Push your model as a 
Gradio app to HuggingFace Space:\n\n```python\nfrom video_transformers import VideoModel\n\nmodel = VideoModel.from_pretrained(\"runs/exp/checkpoint\")\nmodel.push_to_space('account_name/app_name')\n```\n\n## 📈 Multiple tracker support\n\n- Tensorboard tracker is enabled by default.\n\n- To add Neptune/Layer ... tracking:\n\n```python\nfrom video_transformers.tracking import NeptuneTracker\nfrom accelerate.tracking import WandBTracker\n\ntrackers = [\n    NeptuneTracker(EXPERIMENT_NAME, api_token=NEPTUNE_API_TOKEN, project=NEPTUNE_PROJECT),\n    WandBTracker(project_name=WANDB_PROJECT)\n]\n\ntrainer = Trainer(\n    datamodule,\n    model,\n    trackers=trackers\n)\n\n```\n\n## 🕸️ ONNX support\n\n- Convert your trained models into ONNX format for deployment:\n\n```python\nfrom video_transformers import VideoModel\n\nmodel = VideoModel.from_pretrained(\"runs/exp/checkpoint\")\nmodel.to_onnx(quantize=False, opset_version=12, export_dir=\"runs/exports/\", export_filename=\"model.onnx\")\n```\n\n## 🤗 Gradio support\n\n- Convert your trained models into Gradio App for deployment:\n\n```python\nfrom video_transformers import VideoModel\n\nmodel = VideoModel.from_pretrained(\"runs/exp/checkpoint\")\nmodel.to_gradio(examples=['video.mp4'], export_dir=\"runs/exports/\", export_filename=\"app.py\")\n```\n\n\n## Contributing\n\nBefore opening a PR:\n\n- Install required development packages:\n\n```bash\npip install -e .\"[dev]\"\n```\n\n- Reformat with black and isort:\n\n```bash\npython -m tests.run_code_style format\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffcakyon%2Fvideo-transformers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffcakyon%2Fvideo-transformers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffcakyon%2Fvideo-transformers/lists"}