{"id":15652041,"url":"https://github.com/lromul/ball-action-spotting","last_synced_at":"2025-04-16T03:45:45.878Z","repository":{"id":174953839,"uuid":"600746239","full_name":"lRomul/ball-action-spotting","owner":"lRomul","description":"SoccerNet@CVPR | 1st place solution for Ball Action Spotting Challenge 2023","archived":false,"fork":false,"pushed_at":"2023-07-02T19:58:15.000Z","size":634,"stargazers_count":109,"open_issues_count":1,"forks_count":13,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-12T04:23:37.120Z","etag":null,"topics":["action-recognition","action-spotting","deep-learning","football","pytorch","soccer","soccernet","video","video-processing"],"latest_commit_sha":null,"homepage":"https://www.soccer-net.org/tasks/ball-action-spotting","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lRomul.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-02-12T13:30:28.000Z","updated_at":"2025-04-04T12:17:02.000Z","dependencies_parsed_at":null,"dependency_job_id":"58c5db89-da58-4b82-83ac-a58ae5c3fef7","html_url":"https://github.com/lRomul/ball-action-spotting","commit_stats":null,"previous_names":["lromul/ball-action-spotting"],"tags_count":308,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lRomul%2Fball-action-spotting","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lRomul%2Fball-action-spotting/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/lRomul%2Fball-action-spotting/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lRomul%2Fball-action-spotting/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lRomul","download_url":"https://codeload.github.com/lRomul/ball-action-spotting/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249192211,"owners_count":21227748,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["action-recognition","action-spotting","deep-learning","football","pytorch","soccer","soccernet","video","video-processing"],"created_at":"2024-10-03T12:41:06.185Z","updated_at":"2025-04-16T03:45:45.857Z","avatar_url":"https://github.com/lRomul.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Solution for SoccerNet Ball Action Spotting Challenge 2023\n\n![header](data/readme_images/header.png)\n\nThis repo contains the 1st place solution for the [SoccerNet Ball Action Spotting Challenge 2023](https://www.soccer-net.org/challenges/2023#h.vverf0niv2is).\nThe challenge goal is to develop an algorithm for spotting passes and drives occurring in videos of soccer matches.\nUnlike the [SoccerNet Action Spotting Challenge](https://www.soccer-net.org/challenges/2023#h.x9nb4u9m441u), the actions are much more densely allocated and should be predicted more accurately (with a 1-second precision).\n\n## Solution\n\nKey points:\n* Efficient model architecture for extracting information from video data\n* Multi-stage training (transfer learning, 
fine-tuning with long sequences)\n* Fast video loading for training (GPU-based, no need to extract frames to images beforehand)\n\n### Model\n\nThe model architecture employed in this solution uses a slow fusion approach, with 2D convolutions in the early stages and 3D convolutions in the later stages.\nThis architectural choice made one of the most significant contributions to the overall metric outcome.\nIt improved the mAP@1 metric on both the test and challenge sets by approximately 15 percentage points (from 65% to 80%) compared to an approach using 2D CNN early fusion.\n\n![model](data/readme_images/model.png)\n\nThe model consumes sequences of grayscale frames. Neighboring frames are stacked as channels for input to the 2D convolutional encoder.\nFor example, if fifteen frames are stacked in sets of three, the outcome would be five input tensors, each consisting of three channels.\nThe shared 2D encoder independently processes these input tensors, producing visual features.\nThe subsequent 3D encoder processes the visual features, producing temporal features.\nThe concatenated temporal features pass through global pooling to compress the spatial dimensions.\nThen, a linear classifier predicts the presence of actions in the middle frame of the original input sequence.\n\nI chose the following model hyperparameters as a result of the experiments:\n* Stacks of three built from 15 grayscale 1280x736 frames, taking every second frame of the original 25 FPS video (equivalent to 15 neighboring frames at 12.5 FPS, a window of about 1.16 seconds)\n* EfficientNetV2 B0 as 2D encoder\n* 4 inverted residual 3D blocks as 3D encoder (ported from 2D EfficientNet version)\n* GeM as global pooling\n* Multilabel classification with positive labels applied in a 0.6-second window (15 frames at 25 FPS) around the action timestamp from annotation\n\nYou can find more details in the [model implementation](src/models/multidim_stacker.py) and [experiment 
configs](configs/ball_action).\n\n### Training\n\nI trained in several stages to obtain 86.47% mAP@1 on the challenge set (87.03% on the test):\n1. **Basic training ([config](configs/ball_action/sampling_weights_001.py)).** The 2D encoder is initialized with ImageNet-pre-trained weights; other parts start from scratch.\n2. **Training on Action Spotting Challenge dataset ([config](configs/action/action_sampling_weights_002.py)).** Same weight initialization as in the first stage.\n3. **Transfer learning ([config](configs/ball_action/ball_tuning_001.py)).** 2D and 3D encoders are initialized with weights obtained in the second stage. Out-of-fold predictions from the first stage were used for data sampling (more details later).\n4. **Fine-tuning with long sequences ([config](configs/ball_action/ball_finetune_long_004.py)).** 2D and 3D encoders are initialized with weights obtained in the third stage. 2D encoder weights are frozen.\n\n#### Basic training\n\nIn this challenge, I employed 7-fold cross-validation to tune the training pipeline more precisely.\nEach labeled game from the dataset is an individual fold.\n\nIn short, the resulting training pipeline hyperparameters are:\n* Learning rate warmup for the first 6 epochs from 0 to 3e-4, cosine annealing over the last 30 epochs to 3e-6\n* Batch size 4, one training epoch comprises 6000 samples\n* Optimizer AdamW with weight decay 0.01\n* Focal loss with gamma 1.2\n* Model EMA with decay 0.999\n* Initial weights for 2D encoder ImageNet pretrained\n* Architecture hyperparameters listed above in the model part\n\nThe sampling technique used during training is worth describing because it significantly affects the results.\nFor basic training, a simple but effective sampling algorithm was used.\nFor each training sample, randomly take a video index from a uniform distribution.\nThen randomly choose a frame index with an equal chance to sample frames near actions and the remaining frames.\nA frame counts as near an action if it lies in a 0.36 seconds 
window (9 frames at 25 FPS) around the action.\nI tried different ratios, but an equal chance to show empty and event frames worked best.\nI will introduce a more advanced sampling scheme in the section on transfer learning.\n\nI applied the usual augmentations like horizontal flip, rotation, random resized crop, brightness, motion blur, etc.\nThe only notable ones are two temporal augmentations:\n* Simulating camera movement (changing translation, scale, and angle over time)\n* Randomly shaking the frames in sequences (a 40% chance of shifting the frame index by +-1)\n\nThe models from this training reach 79.06% on CV (cross-validation) and 84.26% mAP@1 on the test set (the metric on the test split was calculated from the out-of-fold predictions of the two folds that include test games).\nI did not evaluate these models on the challenge set.\n\n#### Training on Action Spotting Challenge dataset\n\nI built a similar pipeline for videos and classes from the Action Spotting Challenge to get good initial weights for the next experiment.\n\nBriefly, here are the changes from the previous stage:\n* 377 games in training, 18 in validation split\n* 15 classes (all card-related classes were merged because the model consumes grayscale frames)\n* 4 warmup epochs and 20 training epochs, one epoch is 36000 samples\n* Weight the frame sampling by the effective number of class samples\n\n#### Transfer learning\n\nThis training uses the results of the previous two.\nThe second one gives excellent initial weights for 2D and 3D encoders. 
It provides a significant boost (~2% mAP@1 on test and CV).\nThat is understandable because the same models were trained on many games with similar input frames and some similar actions.\n\nBasic training gives out-of-fold predictions that I use for sampling in the following way:\n\n![sampling](data/readme_images/sampling.png)\n\nTake the element-wise maximum between the sampling distribution (introduced above) and predictions, then renormalize so that empty and action frames again have equal total probability.\nThe intuition is that there are some hard negative examples in the dataset. Because negative samples are so numerous, such hard examples are rarely sampled during training.\nThis technique acts as a form of hard negative mining/sampling.\n\nOther minor changes compared to basic training:\n* 7 warmup epochs and 35 training epochs\n* Focal loss with gamma 1.2 and alpha 0.4\n\nModels achieve 81.04% on CV, 86.51% on the test, and 86.35% mAP@1 on the challenge set.\n\n#### Fine-tuning with long sequences\n\nUntil this stage, I trained all models on relatively short clips (15 frames at 12.5 FPS, 1.16 seconds) to fit in the VRAM of a single GPU and to decrease training time. Models obtained good 2D and 3D encoder weights in the previous experiment. So I tried to freeze 2D weights and fine-tune 3D weights on long sequences to provide more context.\n\nChanges compared to the transfer learning experiment:\n* 33 frames at 12.5 FPS (2.6 seconds)\n* LR warmup over the first 2 epochs from 0 to 1e-3, cosine annealing over the last 7 epochs to 5e-05\n* SGD with Nesterov momentum 0.9\n\nModels scored 80.49% on CV, 87.04% on the test, and 86.47% mAP@1 on the challenge set. The score is lower on cross-validation, but it's my best submission on the test and the challenge set `¯\\_(ツ)_/¯`.\n\n### Prediction and postprocessing\n\nModels predict each possible sequence of frames from the videos. Additionally, I apply test-time augmentation with a horizontal flip. 
On the challenge set, I used the arithmetic mean of predictions from all fold models.\n\nPostprocessing is very simple. I just used a combination of a Gaussian filter and peak detection from `SciPy` with the following parameters: standard deviation for Gaussian kernel 3.0, peak detection minimal height 0.2, and minimal distance between neighboring peaks 15 frames.\n\n### Training and prediction accelerations\n\nI optimized the training pipeline to iterate experiments faster and to test more hypotheses.\n* Custom multiprocessing video loader that simultaneously uses `VideoProcessingFramework` (GPU decoding) and `OpenCV` (CPU decoding) workers to optimize hardware utilization\n* FP16 with Automatic Mixed Precision\n* `torch.compile` using TorchDynamo backend\n* Augmentation on the GPU with `kornia`\n\nThese accelerations allow running one epoch (train + val) of basic training in 7 minutes and 10 seconds on a single RTX 3090 Ti.\nIt is impressive because one epoch is 6000 training and approximately 2600 validation examples, each of which is 15 frames in 1280x736 resolution.\nAlso, reading source videos directly, without first extracting images, allows using any video frame during training and saves disk space.\n\nI applied a caching strategy that exploits the architecture to speed up inference.\nIf the most recent visual features are cached, the 2D encoder only needs to process the one new stack of frames when it arrives.\nThe 2D encoder is the most time-consuming part of the model. Predicting 3D features takes a short time. 
So this strategy speeds up prediction several times over.\n\n### Progress\n\nYou can see detailed progress of the solution development during the challenge in [spreadsheets](https://docs.google.com/spreadsheets/d/1mGnTdrVnhoQ8PJKNN539ZzhZxSowc4GpN9NdyDJlqYo/edit?usp=sharing) (the document consists of multiple sheets).\n\nMy solution is heavily inspired by the top solutions of the DFL - Bundesliga Data Shootout competition:\n* Team Hydrogen ([link](https://www.kaggle.com/competitions/dfl-bundesliga-data-shootout/discussion/359932))\n* K_mat ([link](https://www.kaggle.com/competitions/dfl-bundesliga-data-shootout/discussion/360097))\n* Camaro ([link](https://www.kaggle.com/competitions/dfl-bundesliga-data-shootout/discussion/360236))\n* ohkawa3 ([link](https://www.kaggle.com/competitions/dfl-bundesliga-data-shootout/discussion/360331))\n\nThanks to them, I found a good base approach quickly. Thanks for sharing well-written and detailed reports :)\nThanks to the SoccerNet organizers for the excellent datasets. Thanks to the participants for a good competition. 
Thanks to my family and friends who supported me during the challenge!\n\n## Quick setup and start\n\n### Requirements\n\n* Linux (tested on Ubuntu 20.04 and 22.04)\n* NVIDIA GPU (pipeline tuned for RTX 3090)\n* NVIDIA Drivers \u003e= 520, CUDA \u003e= 11.8\n* [Docker](https://docs.docker.com/engine/install/)\n* [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)\n\n### Run\n\nClone the repo and enter the folder.\n\n```bash\ngit clone git@github.com:lRomul/ball-action-spotting.git\ncd ball-action-spotting\n```\n\nBuild a Docker image and run a container.\n\n\u003cdetails\u003e\u003csummary\u003eHere is a small guide on how to use the provided Makefile\u003c/summary\u003e\n\n```bash\nmake  # stop, build, run\n\n# do the same\nmake stop\nmake build\nmake run\n\nmake  # by default all GPUs passed\nmake GPUS=all  # do the same\nmake GPUS=none  # without GPUs\n\nmake run GPUS=2  # pass the first two GPUs\nmake run GPUS='\\\"device=1,2\\\"'  # pass GPUs numbered 1 and 2\n\nmake logs\nmake exec  # run a new command in a running container\nmake exec COMMAND=\"bash\"  # do the same\nmake stop\n```\n\n\u003c/details\u003e\n\n```bash\nmake\n```\n\nFrom now on, you should run all commands inside the docker container.\n\nDownload the Ball Action Spotting dataset (9 GB).\nTo get the password, you must fill NDA ([link](https://www.soccer-net.org/data)).\n\n```bash\npython download_ball_data.py --password_videos \u003cpassword\u003e\n```\n\nDownload the Action Spotting dataset (791.5 GB). 
You can skip this step, but then you cannot train the model on the action dataset.\n\n```bash\npython download_action_data.py --only_train_valid --password_videos \u003cpassword\u003e\n```\n\nNow you can train models and use them to predict games.\nTo reproduce the final solution, you can use the following commands:\n\n```bash\n# Train and predict basic experiment on all folds\npython scripts/ball_action/train.py --experiment sampling_weights_001\npython scripts/ball_action/predict.py --experiment sampling_weights_001\n\n# Training on Action Spotting Challenge dataset\npython scripts/action/train.py --experiment action_sampling_weights_002\n\n# Transfer learning\npython scripts/ball_action/train.py --experiment ball_tuning_001\n\n# Fine-tune with long sequences, evaluate on CV, and predict challenge set\npython scripts/ball_action/train.py --experiment ball_finetune_long_004\npython scripts/ball_action/predict.py --experiment ball_finetune_long_004\npython scripts/ball_action/evaluate.py --experiment ball_finetune_long_004\npython scripts/ball_action/predict.py --experiment ball_finetune_long_004 --challenge\npython scripts/ball_action/ensemble.py --experiments ball_finetune_long_004 --challenge\n\n# Spotting results will be there\ncd data/ball_action/predictions/ball_finetune_long_004/challenge/ensemble/\nzip results_spotting.zip ./*/*/*/results_spotting.json\n```\n\n### Trained models\n\nYou can skip any step of training the final solution by downloading model weights and predictions from [Google Drive](https://drive.google.com/drive/folders/1mIu62cIdsRn3W4o1E5vRR8V5Q1B6HHoz?usp=sharing).\n\nCopy the files to the [data](data) directory so that the folder structure is as follows:\n\n```\ndata\n├── action\n│   ├── experiments\n│   │   └── action_sampling_weights_002\n│   └── predictions\n│       └── action_sampling_weights_002\n├── ball_action\n│   ├── experiments\n│   │   ├── ball_finetune_long_004\n│   │   ├── ball_tuning_001\n│   │   └── sampling_weights_001\n│  
 └── predictions\n│       ├── ball_finetune_long_004\n│       ├── ball_tuning_001\n│       └── sampling_weights_001\n├── readme_images\n└── soccernet\n    └── spotting-ball-2023\n        └── england_efl\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flromul%2Fball-action-spotting","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flromul%2Fball-action-spotting","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flromul%2Fball-action-spotting/lists"}