Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-cvpr-2024

🤩 An AWESOME Curated List of Papers, Workshops, Datasets, and Challenges from CVPR 2024
https://github.com/harpreetsahota204/awesome-cvpr-2024

👁️💬 Vision-Language
- Let’s Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation - [![GitHub](https://img.shields.io/github/stars/sail-sg/CLoT?style=social)](https://github.com/sail-sg/CLoT) [![arXiv](https://img.shields.io/badge/arXiv-2312.02439-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.02439) | |
- A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models - Rodríguez, Sina Hajimiri, Ismail Ben Ayed | [![GitHub](https://img.shields.io/github/stars/jusiro/CLAP?style=social)](https://github.com/jusiro/CLAP/) [![arXiv](https://img.shields.io/badge/arXiv-2312.12730-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.12730) | CLIP is a powerful vision-language model for visual recognition. However, fine-tuning it for small downstream tasks with limited labeled samples is challenging. Efficient transfer learning (ETL) methods adapt VLMs with few parameters, but require careful per-task hyperparameter tuning using large validation sets. To overcome this, the authors propose CLAP, a principled approach that adapts linear probing for few-shot learning. CLAP consistently outperforms ETL methods, providing an efficient and robust approach for few-shot adaptation of large vision-language models in realistic settings where hyperparameter tuning with large validation sets is not feasible. |
- Link-Context Learning (LCL) - [![arXiv](https://img.shields.io/badge/arXiv-2308.07891-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2308.07891) | The paper presents Link-Context Learning (LCL), a new approach that enables Multimodal Large Language Models (MLLMs) to learn new concepts from limited examples in a single conversation. The proposed training strategy fine-tunes MLLMs using contrastive learning and balanced sampling from LCL and original tasks. The ISEKAI dataset is introduced to evaluate MLLMs' performance on LCL tasks. Experiments show that LCL-MLLM outperforms vanilla MLLMs on the ISEKAI dataset. The paper presents LCL as a promising paradigm for expanding MLLMs' abilities and paving the way for more human-like learning in multimodal models. |
- Vlogger: Make Your Dream A Vlog - [![arXiv](https://img.shields.io/badge/arXiv-2401.09414-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2401.09414) | Vlogger is an AI system that generates minute-level video blogs from user descriptions. It uses a Large Language Model (LLM) to break down the task into four stages: Script, Actor, ShowMaker, and Voicer. The ShowMaker uses a Spatial-Temporal Enhanced Block (STEB) to enhance spatial-temporal coherence. Vlogger can generate 5+ minute vlogs, surpassing previous long video generation methods. |
- Alpha-CLIP: A CLIP Model Focusing on Wherever You Want - [![arXiv](https://img.shields.io/badge/arXiv-2312.03818-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.03818) | Alpha-CLIP is an improved version of the CLIP model that focuses on specific regions of interest in images through an auxiliary alpha channel. It can enhance CLIP in different image-related tasks, including 2D and 3D image generation, captioning, and detection. Alpha-CLIP preserves CLIP's visual recognition ability and boosts zero-shot classification accuracy by 4.1% when using foreground masks. |
- CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update - [![GitHub](https://img.shields.io/github/stars/clova-tool/CLOVA-tool?style=social)](https://github.com/clova-tool/CLOVA-tool) [![arXiv](https://img.shields.io/badge/arXiv-2312.10908-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.10908) | CLOVA is a system that leverages large language models (LLMs) to generate programs that can accomplish various visual tasks using off-the-shelf visual tools. To overcome the limitation of fixed tools, CLOVA has a closed-loop framework that includes an inference phase, reflection phase, and learning phase. It also uses a multimodal global-local reflection scheme and three flexible methods to collect real-time training data. CLOVA's learning capability enables it to adapt to new environments, resulting in 5-20% better performance on VQA, multiple-image reasoning, knowledge tagging, and image editing tasks. |
- Convolutional Prompting meets Language Models for Continual Learning - [![arXiv](https://img.shields.io/badge/arXiv-2403.20317v1-b31b1b.svg?style=for-the-badge)](https://arxiv.org/html/2403.20317v1) | The paper introduces ConvPrompt, a novel approach for continual learning in vision transformers. ConvPrompt leverages convolutional prompts and large language models to maintain layer-wise shared embeddings and improve knowledge sharing across tasks. The method improves state-of-the-art by around 3% with significantly fewer parameters. In summary, ConvPrompt is an efficient and effective prompt-based continual learning approach that adapts the model capacity based on task similarity. |
- Improved Visual Grounding through Self-Consistent Explanations - Bonilla, Ziyan Yang | [![GitHub](https://img.shields.io/github/stars/uvavision/SelfEQ?style=social)](https://github.com/uvavision/SelfEQ) [![arXiv](https://img.shields.io/badge/arXiv-2312.04554-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.04554) | This paper presents a strategy called SelfEQ. The aim of SelfEQ is to improve the ability of vision-and-language models to locate specific objects in an image. The proposed strategy involves adding paraphrases generated by a large language model to existing text-image datasets. The model is then fine-tuned to ensure that a phrase and its paraphrase map to the same region in the image. This promotes self-consistency in visual explanations, expands the model's vocabulary, and enhances the quality of object locations highlighted by gradient-based visual explanation methods like GradCAM. |
- Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation - Tuong Do-Tran, Tuan-Ngoc Nguyen | [![GitHub](https://img.shields.io/github/stars/dotrannhattuong/ECB?style=social)](https://github.com/dotrannhattuong/ECB) [![arXiv](https://img.shields.io/badge/arXiv-2403.18360v2-b31b1b.svg?style=for-the-badge)](https://arxiv.org/html/2403.18360v2) | The paper introduces a new approach called Explicitly Class-specific Boundaries (ECB) for domain adaptation, which combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) by training CNN on ViT. ECB uses ViT to determine class-specific decision boundaries and CNN to group target features based on those boundaries. This improves the quality of pseudo labels and reduces knowledge disparities. The paper also provides visualizations to demonstrate the effectiveness of the proposed ECB method. |
- Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations - [![arXiv](https://img.shields.io/badge/arXiv-2403.02090v2-b31b1b.svg?style=for-the-badge)](https://arxiv.org/html/2403.02090v2) | The paper "Modeling Multimodal Social Interactions" introduces three new tasks to model multi-party social interactions. The authors propose a novel multimodal baseline that leverages densely aligned language-visual representations to address these challenges. Experiments demonstrate the effectiveness of the proposed approach in modeling social interactions. |
- mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration - [![GitHub](https://img.shields.io/github/stars/X-PLUG/mPLUG-Owl?style=social)](https://github.com/X-PLUG/mPLUG-Owl) [![arXiv](https://img.shields.io/badge/arXiv-2311.04257-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2311.04257) | mPLUG-Owl2 is a multi-modal language model that improves text and multi-modal task performance. It uses a modularized network design with a language decoder as a universal interface for managing different modalities. It incorporates shared functional modules and a modality-adaptive module. It uses a two-stage training paradigm consisting of vision-language pre-training and joint vision-language instruction tuning. Experiments show it achieves SOTA results on multiple vision-language and pure-text benchmarks. It introduces novel architecture designs and training methods to enable modality collaboration, leading to strong performance in text-only and multi-modal tasks. |
- OneLLM: One Framework to Align All Modalities with Language - [![arXiv](https://img.shields.io/badge/arXiv-2312.03700-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.03700) | OneLLM aligns 8 modalities to language using a unified framework. It uses a frozen CLIP-ViT and a universal projection module (UPM) that mixes image projection experts. OneLLM progressively aligns modalities to the LLM, starting with image-text alignment and expanding to video, audio, point cloud, depth/normal map, IMU, and fMRI data. The authors curated a large multimodal instruction dataset to fine-tune OneLLM's multimodal understanding and reasoning capabilities. OneLLM performs excellently on 25 diverse multimodal benchmarks, including captioning, question answering, and reasoning tasks. OneLLM pioneers a unified and scalable MLLM framework that can align a wide range of modalities with language and achieve strong multimodal understanding through instruction finetuning. |
- OPERA - [![arXiv](https://img.shields.io/badge/arXiv-2311.17911-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2311.17911) | OPERA is a novel solution to alleviate hallucination in MLLMs. It introduces a penalty term and rollback strategy during beam-search decoding, targeting the self-attention patterns it identifies as the root cause. It doesn't require additional data or training and has been shown to be effective in experiments. |
- Describing Differences in Image Sets with Natural Language - [![GitHub](https://img.shields.io/github/stars/Understanding-Visual-Datasets/VisDiff?style=social)](https://github.com/Understanding-Visual-Datasets/VisDiff) [![arXiv](https://img.shields.io/badge/arXiv-2312.02974-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.02974) | |
- Osprey: Pixel Understanding with Visual Instruction Tuning - [![arXiv](https://img.shields.io/badge/arXiv-2312.10032-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.10032) | Osprey is an approach that improves multimodal large language models' accuracy in understanding visual information. It uses fine-grained mask regions and a convolutional CLIP backbone to extract precise visual mask features from high-resolution inputs efficiently. The authors curated the Osprey-724K dataset with 724K samples to facilitate mask-based instruction tuning. Osprey outperforms previous state-of-the-art methods in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. |
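
Several entries above (for example Alpha-CLIP and CLAP) build on CLIP-style zero-shot classification, where an image embedding is compared against one text embedding per class. The sketch below illustrates only that comparison step. The three-dimensional vectors and the names `zero_shot_classify` and `TOY_TEXT_EMBEDDINGS` are hypothetical stand-ins chosen for illustration; a real system would obtain embeddings from trained image and text encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings):
    """Return the label whose text embedding is closest to the image embedding."""
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical toy embeddings; real ones would come from a text encoder.
TOY_TEXT_EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}

# Hypothetical image embedding that points mostly in the "cat" direction.
toy_image_embedding = [0.9, 0.1, 0.0]

label, scores = zero_shot_classify(toy_image_embedding, TOY_TEXT_EMBEDDINGS)
print(label)
```

No labeled training samples are needed for this step, which is why the few-shot and region-focused methods listed above can treat it as a starting point.
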
👩🏾‍🏫 Tutorial
- From Multimodal LLM to Human-level AI: Modality, Instruction, Reasoning and Beyond - edge research in multimodal large language models (MLLMs). These models integrate various modalities to enable AI systems to understand, reason, and plan. The tutorial focuses on MLLM architecture design, instructional learning, and multimodal reasoning. The organizers have compiled an extensive reading list for LLMs, MLLMs, instruction tuning, and reasoning. The tutorial aims to summarize technical advancements, challenges, and future research directions in the evolving field of MLLMs. |
- Diffusion-based Video Generative Models - depth exploration of diffusion-based video generative models, a cutting-edge field that is transforming video creation. It aims to help students, researchers, practitioners, video creators and enthusiasts gain the necessary knowledge to enter and contribute to this domain. The tutorial will cover three main topics: (1) Fundamentals: diffusion models, video foundation models, pre-training; (2) Applications: fine-tuning, editing, controls, personalization, motion customization; (3) Evaluation & Safety: benchmarks, metrics, attacks, watermarks, copyright protection. |
- Generalist Agent AI - multimodality, robotics, gaming, and healthcare. The schedule includes lectures, Q&A sessions, and panel discussions on knowledge agents, agent robotics, and agent foundation models. |
- Robustness at Inference: Towards Explainability, Uncertainty, and Intervenability
- Unlearning in Computer Vision: Foundations and Applications - trained models. The tutorial aims to provide a comprehensive understanding of MU techniques, algorithmic foundations and applications in computer vision. It also emphasizes the importance of MU from an industry perspective and discusses metrics to verify the unlearning process. |
- Recent Advances in Vision Foundation Models - purpose vision systems, or vision foundation models, for various downstream tasks at different levels of granularity. It explores the synergy of tasks and the versatility of transformers for building models for multimodal understanding and generation. |
📊 Datasets/Benchmarks
- VBench: Comprehensive Benchmark Suite for Video Generative Models - [![arXiv](https://img.shields.io/badge/arXiv-2311.17982-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2311.17982) | VBench is a tool that evaluates video generation models across 16 quality dimensions. These dimensions fall under Video Quality and Video-Condition Consistency. VBench provides valuable insights by evaluating models across multiple dimensions and content categories, and by comparing video vs image generation. The tool's authors plan to expand VBench to more models and video generation tasks. Check out the leaderboard on HF here: https://huggingface.co/spaces/Vchitect/VBench_Leaderboard |
- ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object - [![GitHub](https://img.shields.io/github/stars/chenshuang-zhang/imagenet_d?style=social)](https://github.com/chenshuang-zhang/imagenet_d) [![arXiv](https://img.shields.io/badge/arXiv-2403.18775v1-b31b1b.svg?style=for-the-badge)](https://arxiv.org/html/2403.18775v1) | ImageNet-D is a new benchmark for evaluating neural network robustness in visual perception tasks. It generates synthetic images with diverse backgrounds, textures, and materials, making it more challenging than other synthetic datasets. Key features include diversified image generation, high visual fidelity, and significant accuracy reduction of various vision models. The benchmark is created by combining object categories and refining through human verification. ImageNet-D is effective in evaluating neural network robustness, as accuracy on it improves with accuracy on ImageNet. |
- SoccerNet Game State Reconstruction - [![GitHub](https://img.shields.io/github/stars/SoccerNet/sn-gamestate?style=social)](https://github.com/SoccerNet/sn-gamestate) [![arXiv](https://img.shields.io/badge/arXiv-2404.11335-b31b1b.svg?style=for-the-badge)](https://arxiv.org/abs/2404.11335) | SoccerNet Game State Reconstruction (GSR) is a novel computer vision task involving the tracking and identification of sports players from a single moving camera to construct a video game-like minimap, without any specific hardware worn by the players. SoccerNet-GSR, the released dataset, includes 200 clips with 9.37M pitch localization annotations and 2.36M athlete positions on the pitch with their role, team & jersey number. Furthermore, a new performance metric 'GS-HOTA' is introduced to evaluate GSR methods. |
- Benchmarking and Evaluating Large Video Generation Models - [![arXiv](https://img.shields.io/badge/arXiv-2310.11440-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2310.11440) | The paper proposes a comprehensive evaluation framework for the rapidly growing class of large video generation models. Existing academic metrics are inadequate for evaluating these models trained on massive datasets. The proposed evaluation pipeline comprises prompt curation, objective evaluation, subjective studies, and opinion alignment. The models are evaluated based on 17 objective metrics covering visual quality, content quality, motion quality, and text-caption alignment. Additionally, it provides a comparison table of various video generation models across different metrics and capabilities. |
- LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs - [![arXiv](https://img.shields.io/badge/arXiv-2312.04372-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.04372) | The LaMPilot dataset consists of 4,900 human-annotated traffic scenes, each with an instruction (I), an initial state (b), and a set of goal state criteria (G). The dataset is classified by maneuver and scenario types and is divided into training, validation, and testing sets. |
- MAPLM: A Real-World Large-Scale Vision-Language Dataset for Map and Traffic Scene Understanding - [![GitHub](https://img.shields.io/github/stars/LLVM-AD/MAPLM?style=social)](https://github.com/LLVM-AD/MAPLM) | The dataset contains 3D point cloud Bird's Eye View and high-resolution panoramic images of various traffic scenarios. It also includes detailed annotations at the feature, lane, and road levels. The dataset is designed for a Q&A task, where models will be evaluated based on their ability to answer questions about the traffic scenes such as the number of lanes, presence of intersections, and data quality. |
- Polos: Multimodal Metric Learning from Human Feedback for Image Captioning - [![GitHub](https://img.shields.io/github/stars/keio-smilab24/polos?style=social)](https://github.com/keio-smilab24/polos) [![arXiv](https://img.shields.io/badge/arXiv-2402.18091-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2402.18091) | The Polaris dataset, used to train the model, contains 131,020 human judgments from 550 evaluators on the appropriateness of image captions. The dataset is much larger than existing ones and is capable of training image captioning metrics. The captions in Polaris are more diverse, collected from humans and generated by 10 modern image captioning models. This demonstrates the effectiveness and robustness of Polos compared to previous metrics. |
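
The LaMPilot entry above describes each scene as an instruction (I), an initial state (b), and a set of goal state criteria (G). One way such a record could be represented is sketched below. The class name `Scene`, the field names, and the state keys (`lane`, `speed`) are all hypothetical and chosen for illustration; they are not taken from the LaMPilot release, whose actual schema may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical state representation: a flat mapping of named quantities.
State = Dict[str, float]


@dataclass
class Scene:
    """Illustrative record with an instruction (I), initial state (b), and goals (G)."""

    instruction: str                      # I: natural-language driving command
    initial_state: State                  # b: state at the start of the scene
    goal_criteria: List[Callable[[State], bool]] = field(default_factory=list)  # G

    def is_solved(self, final_state: State) -> bool:
        """A scene counts as solved only if every goal criterion holds."""
        return all(criterion(final_state) for criterion in self.goal_criteria)


# Hypothetical example scene, not drawn from the dataset.
example = Scene(
    instruction="Change to the left lane.",
    initial_state={"lane": 2.0, "speed": 20.0},
    goal_criteria=[
        lambda state: state["lane"] == 1.0,   # ended up in the target lane
        lambda state: state["speed"] > 0.0,   # still moving
    ],
)

print(example.is_solved({"lane": 1.0, "speed": 18.0}))
```

Expressing the goal criteria as predicates over a final state keeps the evaluation independent of how a model produced that state.
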
🛠️ Workshop
- Representation Learning with Very Limited Images - modal models with limited data resources. It aims to bring together diverse communities that work on approaches such as self-supervised learning with a single image or synthetic pre-training with generated images. The workshop's organizers include researchers from various institutions. |
- Urban Scene Modeling: Where Vision Meets Photogrammetry and Graphics
- What is Next in Multimodal Foundation Models? - to-image/video/3D generation, zero-shot classification, and cross-modal retrieval. It brings together leaders to discuss different aspects of these models, including their design, efficiency, ethics, and open availability. |
- Dataset Distillation - day workshop on June 17. The workshop will explore the potential of Dataset Distillation (DD) in computer vision applications like face recognition, object detection, image segmentation, and video understanding. DD has the potential to reduce training costs, make AI eco-friendly, and enable research groups with limited resources to engage in state-of-the-art research. The workshop will also cover related topics such as active learning, few-shot learning, generative models, and learning from synthetic data. |
- Gaze Estimation and Prediction in the Wild - based interaction techniques, eye tracking technologies, applications of gaze interaction in various domains, and methodological considerations in gaze-based research. The main objective is to enhance the field of gaze interaction by providing a platform for researchers and practitioners to present their work, exchange ideas, and explore future directions. |
- Large Scale Holistic Video Understanding - [![GitHub](https://img.shields.io/github/stars/holistic-video-understanding/HVU-Dataset?style=social)](https://github.com/holistic-video-understanding/HVU-Dataset) | The main objective of the workshop is to establish a video benchmark integrating joint recognition of all the semantic concepts, as a single class label per task is often not sufficient to describe the holistic content of a video. The planned panel discussion with the world’s leading experts on this problem will be a fruitful input and source of ideas for all participants. The community is invited to help extend the HVU dataset, which will spur research in video understanding as a comprehensive, multi-faceted problem. |
- Responsible Data - driven dataset development, best practices for data collectors and annotators, responsible datasets for AI models, measuring dataset responsibility, transparency, data privacy and accountability, and engaging the open-source community. |
- Synthetic Data for Computer Vision
- Computer Vision for Mixed Reality - 3 allow for deeply immersive mixed reality experiences. We focus on capturing real environments with cameras and using AI to augment them with virtual objects. Our call for papers invites research on novel methods for Mixed Reality. Topics include real-time view synthesis, scene understanding, 3D capture, and more. |
- Multimodal Algorithmic Reasoning - world tasks, and will also encourage the vision community to build neural networks with human-like intelligence abilities |
- Vision Datasets Understanding - level analysis, representations of and similarities between vision datasets, improving vision dataset quality, and evaluating model accuracy under various test environments. |
🧨 Diffusion
- SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors - Ying Lee | [![GitHub](https://img.shields.io/github/stars/daveredrum/SceneTex?style=social)](https://github.com/daveredrum/SceneTex) [![arXiv](https://img.shields.io/badge/arXiv-2311.17261-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2311.17261) | SceneTex generates high-quality indoor scene textures using depth-to-image diffusion priors. Key features include optimization in RGB space, multiresolution texture field, and cross-attention decoder for global style consistency. Experiments show it outperforms prior methods, but limitations include occasional artifacts and inability to handle complex geometry. |
- DemoFusion: Democratising High-Resolution Image Generation With No $$$ - Zhe Song, Zhanyu Ma | [![GitHub](https://img.shields.io/github/stars/PRIS-CV/DemoFusion?style=social)](https://github.com/PRIS-CV/DemoFusion) [![arXiv](https://img.shields.io/badge/arXiv-2311.16973-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2311.16973) | DemoFusion is an extension that enables the generation of high-res images through an accessible and efficient inference procedure. It uses global-local denoising paths and introduces three techniques for coherent high-res generation: progressive upscaling, skip residual, and dilated sampling. DemoFusion unlocks the potential in existing open-source text-to-image models without additional training or prohibitive costs, democratizing high-res image synthesis. |
- DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing - [![GitHub](https://img.shields.io/github/stars/Yujun-Shi/DragDiffusion?style=social)](https://github.com/Yujun-Shi/DragDiffusion) [![arXiv](https://img.shields.io/badge/arXiv-2306.14435-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2306.14435) | DragDiffusion is a novel method for interactive point-based image editing that enhances the applicability and versatility of the DragGAN framework by extending it to diffusion models. It optimizes the latent of a single diffusion step and introduces techniques to preserve the identity of the original image. The authors present DragBench, the first benchmark dataset for evaluating interactive point-based image editing methods. Experiments demonstrate the effectiveness of DragDiffusion compared to DragGAN, and an ablation study explores key factors. |
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models - [![GitHub](https://img.shields.io/github/stars/shivangi-aneja/FaceTalk?style=social)](https://github.com/shivangi-aneja/FaceTalk) [![arXiv](https://img.shields.io/badge/arXiv-2312.08459-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.08459) | FaceTalk is a novel method to generate 3D motion sequences of talking human heads from audio signals. It employs neural parametric head models with speech signals and a new latent diffusion model. The approach denoises Gaussian noise sequences iteratively and extracts mesh sequences using marching cubes from the frozen NPHM model. FaceTalk outperforms existing methods by 75% in perceptual user study evaluations and produces visually natural motion with diverse facial expressions and styles. |
- RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models - [![GitHub](https://img.shields.io/github/stars/rehg-lab/RAVE?style=social)](https://github.com/rehg-lab/RAVE) [![arXiv](https://img.shields.io/badge/arXiv-2312.04524-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.04524) | RAVE is a fast and innovative method for zero-shot video editing that uses pre-trained text-to-image diffusion models. It preserves the original motion and structure of the input video while producing high-quality, temporally consistent edited videos. RAVE edits videos 25% faster than existing methods by efficiently leveraging spatio-temporal interactions between frames. It outperforms existing methods across diverse editing scenarios and requires no extra training or manual inputs. However, there are some limitations such as flickering issues for extreme shape edits in very long videos and fine detail flickering. Try the demo here: https://huggingface.co/spaces/ozgurkara/RAVE |
- Relightful Harmonization: Lighting-aware Portrait Background Replacement - [![arXiv](https://img.shields.io/badge/arXiv-2312.06886-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.06886) | The paper presents Relightful Harmonization, a technique for harmonizing portrait lighting with a new background image. The method encodes lighting information from the target background image and aligns it with features from panoramic environment maps. Relightful Harmonization outperforms existing benchmarks in visual fidelity and lighting coherence. The technique only requires an arbitrary background image during inference and expands the training data using a novel data simulation pipeline. This approach enables realistic, lighting-aware portrait background replacement using just a single target background image, without requiring HDR environment maps. |
🏆 Challenges
- Agriculture-Vision Prize Challenge - The Agriculture-Vision Prize Challenge 2024 encourages the development of algorithms for recognizing agricultural patterns from aerial images and aims to promote sustainable agriculture practices. Semi-supervised learning techniques will be used to merge two datasets and assess model performance. Prizes are $2,500 for 1st place, $1,500 for 2nd place, and $1,000 for 3rd place. |
- Building3D Challenge - [![arXiv](https://img.shields.io/badge/arXiv-2307.11914-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2307.11914) | This challenge utilizes the Building3D dataset, an urban-scale publicly available dataset with over 160,000 buildings from 16 cities in Estonia. Participants must develop algorithms that take point clouds as input and generate wireframe models. |
- Chalearn Face Anti-spoofing Workshop - digital Attack dataset, called UniAttackData, with 1,800 participations, 2 physical and 12 digital attacks, and 29,706 videos. |
- Structured Semantic 3D Reconstruction (S23DR) Challenge
- Pixel-level Video Understanding in the Wild - level understanding of video scenes in complex environments and realistic scenarios. |
- DataCV Challenge
- Grocery Vision - world retail data collected in typical grocery store environments. Track 1 focuses on Video and Spatial Temporal Action Localization (TAL and STAL). Participants are provided with 73,683 image-annotation pairs for training, and their performance is evaluated based on frame-mAP for TAL and tube-mAP for STAL. Track 2 is the Multi-modal Product Retrieval (MPR) challenge. Participants must design methods to accurately retrieve product identity by measuring similarity between images and descriptions. |
- SyntaGen Competition - quality synthetic datasets using Stable Diffusion and the 20 class names from PASCAL VOC 2012 for semantic segmentation. The datasets will be evaluated by training a DeepLabv3 model and assessing its performance on a private test set, with submissions ranked based on the mIoU metric. The top 2 teams will receive cash prizes and the opportunity to present their work at the workshop. |
- SMART-101 CVPR 2024 Challenge - domain conversational AI systems based on their ability to engage in helpful, harmless, and honest conversations with humans. The challenge comprises a multi-turn dialogue between a human and an AI assistant, where the human can ask the AI to perform open-ended tasks or engage in open-ended conversation. The AI systems are evaluated on various metrics, including helpfulness, harmlessness, honesty, groundedness, and role consistency. |
- Snapshot Spectral Imaging Face Anti-spoofing Challenge - the first snapshot spectral face anti-spoofing dataset with 6760 hyperspectral images, each containing 30 spectral channels. This competition aims to encourage research on new spectroscopic sensor face anti-spoofing algorithms suitable for SSI images. |
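
The SyntaGen entry above ranks submissions by mIoU (mean intersection over union), a standard semantic segmentation metric. The sketch below shows how mIoU is commonly computed from flat lists of per-pixel class labels. The function name `mean_iou` and the choice to skip classes that appear in neither the ground truth nor the prediction are assumptions made here; the competition's own evaluation code may handle absent classes or ignore labels differently.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection over union across classes, from flat label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    intersections = [0] * num_classes
    unions = [0] * num_classes
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == pred_label:
            # Pixel belongs to both the true and predicted region of this class.
            intersections[true_label] += 1
            unions[true_label] += 1
        else:
            # Pixel is in the true region of one class and the predicted
            # region of another, so it extends the union of both.
            unions[true_label] += 1
            unions[pred_label] += 1

    # Assumption: classes absent from both inputs are left out of the mean.
    ious = [i / u for i, u in zip(intersections, unions) if u > 0]
    if not ious:
        raise ValueError("no class is present in y_true or y_pred")
    return sum(ious) / len(ious)


# Toy example: class 0 has IoU 1/2, class 1 has IoU 2/3.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(round(score, 4))
```

Because each class contributes equally to the mean, rare classes affect the score as much as frequent ones, which plain pixel accuracy would not capture.
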
👁️ Vision Transformers
- Point Transformer V3 (PTv3) - [![arXiv](https://img.shields.io/badge/arXiv-2312.10035-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.10035) | The Point Transformer V3 (PTv3) is a 3D point cloud transformer architecture that prioritizes simplicity and efficiency to enable scalability, overcoming the traditional trade-off between accuracy and speed in point cloud processing. It uses point cloud serialization, serialized attention, enhanced conditional positional encoding (xCPE), and simplified designs to improve efficiency. PTv3 achieves state-of-the-art performance across over 20 downstream tasks spanning indoor and outdoor scenarios, while offering superior speed and memory efficiency compared to previous point transformers. |
- RepViT - [![arXiv](https://img.shields.io/badge/arXiv-2307.09283-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2307.09283) | RepViT is a new lightweight convolutional neural network series for mobile devices. It combines efficient architectural choices from Vision Transformers with a standard CNN, MobileNetV3-L. Key steps include separating token mixer and channel mixer, reducing expansion ratio, using early convolutions as stem, employing a deeper downsampling layer, replacing the classifier with a simpler one, and using only 3x3 convolutions. RepViT outperforms existing CNNs and ViTs on vision tasks while maintaining favorable latency on mobile devices. |
📦 3D Vision
- ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering - [![arXiv](https://img.shields.io/badge/arXiv-2312.05941-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.05941) | ASH generates real-time photorealistic renderings of animatable human avatars using Gaussian splats attached to a deformable mesh template. The skeletal motion is encoded using pose-dependent normal maps, and the dynamic Gaussian parameters are learned using 2D convolutional architectures. This approach surpasses existing real-time human avatar rendering methods and represents a significant step towards producing real-time, high-fidelity, controllable human avatars. |
- Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes - [![arXiv](https://img.shields.io/badge/arXiv-2312.04043-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.04043) | This paper introduces a new method for generating precise 3D shapes from abstract freehand sketches, without the need for paired sketch-3D data. The approach uses a part-level modeling and alignment framework, which enables sketch modeling and in-position editing. By operating in a low-dimensional implicit latent space and using diffusion models, the approach significantly reduces computational demands and processing time. Overall, the method offers a novel solution for enabling accurate 3D generation from abstract sketches. |
- GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians - [![arXiv](https://img.shields.io/badge/arXiv-2312.02069-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.02069) | GaussianAvatars is a new technique for creating customizable photorealistic head avatars using a dynamic 3D representation based on 3D Gaussian splats. This approach allows for precise animation control while maintaining photorealistic rendering. The technique has shown impressive animation capabilities in challenging scenarios, such as reenactments from a driving video, where it outperforms existing techniques by a significant margin. |
- Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships - vocabulary 3D scene graphs from point clouds, combining vision-language and large language models. Key ideas include constructing a 3D graph with a GNN, aligning features with CLIP, and using an LLM. It allows querying arbitrary objects and relationships at inference time, and enables open-vocabulary prediction not limited to fixed labels. |
🧨 Diffusion
- Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features - [![GitHub](https://img.shields.io/github/stars/niladridutt/Diffusion-3D-Features?style=social)](https://github.com/niladridutt/Diffusion-3D-Features) [![arXiv](https://img.shields.io/badge/arXiv-2311.17024-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2311.17024) | Diff3F is a feature descriptor for untextured 3D shapes. It computes 3D semantic features using pre-trained 2D diffusion models, rendering depth and normal maps from multiple views, and lifting the 2D diffusion features back to the 3D surface. This produces semantic descriptors on the 3D shape without requiring additional training data or part segmentation. |
- One-step Diffusion with Distribution Matching Distillation - [![arXiv](https://img.shields.io/badge/arXiv-2311.18828-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2311.18828) | Distribution Matching Distillation (DMD) distills multi-step diffusion models into a one-step generator without compromising image quality. DMD matches the distribution of the original diffusion model by minimizing KL divergence and using two score functions: one for the actual data distribution and one for the generated distribution. A regression loss matches the large-scale structure of the multi-step diffusion outputs. |
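
The DMD entry above refers to minimizing KL divergence between a generated distribution and a target distribution. As a reminder of what that quantity measures, the sketch below computes KL(p || q) for two small discrete distributions. This is an illustration of the definition only and an assumption-laden simplification: DMD operates on continuous image distributions through score functions rather than by evaluating this sum directly, and the function name `kl_divergence` is chosen here for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of outcomes")

    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # by convention, 0 * log(0 / q) contributes nothing
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


generated = [0.5, 0.5]   # toy "generated" distribution
target = [0.9, 0.1]      # toy "target" distribution

print(kl_divergence(generated, target))
print(kl_divergence(target, generated))
```

The two printed values differ because KL divergence is not symmetric, so the direction in which it is minimized is part of the design of any distribution-matching objective.
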
🧩 Segmentation
- Amodal Ground Truth and Completion in the Wild - [![GitHub](https://img.shields.io/github/stars/Championchess/Amodal-Completion-in-the-Wild?style=social)](https://github.com/Championchess/Amodal-Completion-in-the-Wild) [![arXiv](https://img.shields.io/badge/arXiv-2312.17247-b31b1b.svg?style=for-the-badge)](https://ar5iv.labs.arxiv.org/html/2312.17247) | The paper introduces amodal image segmentation which predicts masks for entire objects, including occluded parts. Previous methods used manual annotation, but the authors use 3D data to construct the MP3D-Amodal dataset with authentic amodal ground truth masks. Two architecture variants are explored: a two-stage OccAmodal model and a one-stage SDAmodal model. Their method achieves state-of-the-art performance on amodal segmentation datasets, including COCOA and the new MP3D-Amodal dataset. |
