# Awesome-Transformer-based-SLAM

A paper survey of Transformer-based SLAM.

https://github.com/KwanWaiPang/Awesome-Transformer-based-SLAM
## Transformer-based SLAM
- Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM
- SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors
- Multi-Agent Monocular Dense SLAM With 3D Reconstruction Priors
- 3R-GS: Best Practice in Optimizing Camera Poses Along with 3DGS ([website](https://zsh523.github.io/3R-GS/); MASt3R-SfM + 3DGS)
- AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos
- PROFusion: Robust and Accurate Dense Reconstruction via Camera Pose Regression and Optimization
- SLAM-Former: Putting SLAM into One Transformer ([code](https://github.com/Tsinghua-MARS-Lab/SLAM-Former), [website](https://tsinghua-mars-lab.github.io/SLAM-Former/))
- AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend
- LiDAR-VGGT: Cross-Modal Coarse-to-Fine Fusion for Globally Consistent and Metric-Scale Dense Mapping ([code](https://github.com/NorwegianSmokedSalmon/Color-Map-Evaluation))
- SAIL-Recon: Large SfM by Augmenting Scene Regression with Localization ([code](https://github.com/HKUST-SAIL/sail-recon), [website](https://hkust-sail.github.io/sail-recon/))
- DINO-SLAM: DINO-informed RGB-D SLAM for Neural Implicit and Explicit Representations
- SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos ([code](https://github.com/PKU-VCL-3DV/SLAM3R), [test](https://kwanwaipang.github.io/SLAM3R/))
- MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors ([code](https://github.com/rmurai0610/MASt3R-SLAM), [website](https://edexheim.github.io/mast3r-slam/), [test](https://kwanwaipang.github.io/MASt3R-SLAM/))
- Jperceiver: Joint perception network for depth, pose and layout estimation in driving scenes
- VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold
- ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association ([code](https://github.com/zhangganlin/vista-slam), [website](https://ganlinzhang.xyz/vista-slam/))
- VGGT-Long: Chunk it, Loop it, Align it – Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences ([code](https://github.com/DengKaiCQ/VGGT-Long))
- EC3R-SLAM: Efficient and Consistent Monocular Dense SLAM with Feed-Forward 3D Reconstruction ([code](https://github.com/hulxgit/EC3R-SLAM), [website](https://h0xg.github.io/ec3r/))
- GRS-SLAM3R: Real-Time Dense SLAM with Gated Recurrent State
## Transformer-based Mapping
- Human3R: Everyone Everywhere All at Once
- Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation
- Multi-frame self-supervised depth with transformers
- Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation
- CompletionFormer: Depth Completion with Convolutions and Vision Transformers ([code](https://github.com/youmi-zym/CompletionFormer), [website](https://youmi-zym.github.io/projects/CompletionFormer/))
- Lightweight monocular depth estimation via token-sharing transformer
- ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation
- TODE-Trans: Transparent Object Depth Estimation with Transformer
- Deep digging into the generalization of self-supervised monocular depth estimation
- Egformer: Equirectangular geometry-biased transformer for 360 depth estimation
- PanoFormer: Panorama Transformer for Indoor 360 Depth Estimation ([code](https://github.com/zhijieshen-bjtu/PanoFormer))
- Improving 360 monocular depth estimation via non-local dense prediction transformer and joint supervised and self-supervised learning
- MVSFormer: Multi-view stereo by learning robust image features and temperature-based depth
- M3: 3D-Spatial Multimodal Memory ([code](https://github.com/MaureenZOU/m3-spatial), [website](https://m3-spatial-memory.github.io/); compression & Gaussian Memory Attention)
- Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos ([code](https://github.com/Stereo4d/stereo4d-code), [website](https://stereo4d.github.io/))
- Continuous 3D Perception Model with Persistent State
- SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction
- SplatVoxel: History-Aware Novel View Streaming without Temporal Training
- GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
- DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers
- Sonata: Self-Supervised Learning of Reliable Point Representations
- Point transformer v3: Simpler faster stronger
- Point transformer v2: Grouped vector attention and partition-based pooling
- Point transformer ([unofficial implementation](https://github.com/POSTECH-CVLab/point-transformer))
- Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction ([website](https://www.robots.ox.ac.uk/~vgg/research/dynamic-pointmaps/); Dynamic DUSt3R, DPM)
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion ([test](https://kwanwaipang.github.io/MonST3R/))
- MUSt3R: Multi-view Network for Stereo 3D Reconstruction
- Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass ([test](https://kwanwaipang.github.io/Fast3R/))
- Depth anything: Unleashing the power of large-scale unlabeled data ([code](https://github.com/LiheYoung/Depth-Anything), [website](https://depth-anything.github.io/))
- DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions
- Learning to adapt clip for few-shot monocular depth estimation
- 3d reconstruction with spatial memory
- Towards zero-shot scale-aware monocular depth estimation ([code](https://github.com/tri-ml/vidar), [website](https://sites.google.com/view/tri-zerodepth))
- DUSt3R: Geometric 3D Vision Made Easy ([test](https://kwanwaipang.github.io/File/Blogs/Poster/MASt3R-SLAM.html))
- Gs-lrm: Large reconstruction model for 3d gaussian splatting ([website](https://sai-bi.github.io/project/gs-lrm/); 3DGS + Transformer)
- BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation ([code](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox))
- GLPanoDepth: Global-to-Local Panoramic Depth Estimation
- Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics
- Objcavit: improving monocular depth estimation using natural language models and image-object cross-attention
- Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion
- Sidert: A real-time pure transformer architecture for single image depth estimation
- Hybrid transformer based feature fusion for self-supervised monocular depth estimation
- Spike transformer: Monocular depth estimation for spiking camera ([code](https://github.com/Leozhangjiyuan/MDE-SpikingCamera))
- Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers (STTR; stereo matching)
- Transformer-based Monocular Depth Estimation with Attention Supervision ([code](https://github.com/WJ-Chang-42/ASTransformer))
- Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction
- Vision transformers for dense prediction ([code](https://github.com/isl-org/DPT); DPT)
- MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer
- DEST: Depth Estimation with Simplified Transformer
- SparseFormer: Attention-based Depth Completion Network
- GuideFormer: Transformers for Image Guided Depth Completion
- MapAnything: Universal Feed-Forward Metric 3D Reconstruction ([code](https://github.com/facebookresearch/map-anything), [website](https://map-anything.github.io/))
- PanoSplatt3R: Leveraging Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction
- Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
- LONG3R: Long Sequence Streaming 3D Reconstruction
- Dens3R: A Foundation Model for 3D Geometry Prediction
- StreamVGGT: Streaming 4D Visual Geometry Transformer
- Test3R: Learning to Reconstruct 3D at Test Time
- Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory
- 4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos ([website](https://4dgt.github.io/))
- Rig3R: Rig-Aware Conditioning for Learned 3D Reconstruction
- SAB3R: Semantic-Augmented Backbone in 3D Reconstruction ([website](https://uva-computer-vision-lab.github.io/sab3r/))
- Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles ([code](https://github.com/WU-CVGL/Styl3R), [website](https://nickisdope.github.io/Styl3R/))
- MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models ([code](https://github.com/CUHK-AIM-Group/MonoSplat))
- Recollection from Pensieve: Novel View Synthesis via Learning from Uncalibrated Videos
- STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes
- DELTA: Dense Depth from Events and LiDAR using Transformer's Attention
- MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds
- MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision
- Regist3R: Incremental Registration with Stereo Foundation Model
- St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World ([website](https://st4rtrack.github.io/))
- AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis ([code](https://github.com/kvuong2711/aerial-megadepth), [website](https://aerial-megadepth.github.io/))
- Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction
- MonSter: Marry Monodepth to Stereo Unleashes Power
- D2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes ([website](https://cvlab-kaist.github.io/DDUSt3R/))
- FlowR: Flowing from Sparse to Dense 3D Reconstructions ([website](https://tobiasfshr.github.io/pub/flowr/))
- Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
- SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors (DUSt3R + Diffusion + 3DGS)
- MVSAnywhere: Zero-Shot Multi-View Stereo
- CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis ([website](https://youngkyoonjang.github.io/projects/comapgs/))
- Pow3R: empowering unconstrained 3D reconstruction with camera and scene priors ([website](https://europe.naverlabs.com/pow3r); DUSt3R + multi-information input)
- Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding
- UniK3D: Universal Camera Monocular 3D Estimation ([code](https://github.com/lpiccinelli-eth/unik3d), [website](https://lpiccinelli-eth.github.io/pub/unik3d/))
- Depth anything v2 ([code](https://github.com/DepthAnything/Depth-Anything-V2), [website](https://depth-anything-v2.github.io/))
## Transformer-based Pose Tracking
- A lightweight sensor fusion for neural visual inertial odometry
- MASt3R-Fusion: Integrating Feed-Forward Visual Model with IMU, GNSS for High-Functionality SLAM ([code](https://github.com/GREAT-WHU/MASt3R-Fusion))
- Dense-depth map guided deep Lidar-Visual Odometry with Sparse Point Clouds and Images
- DINO-VO: A Feature-Based Visual Odometry Leveraging a Visual Foundation Model
- XIRVIO: Critic-guided Iterative Refinement for Visual-Inertial Odometry with Explainable Adaptive Weighting
- Transformer-based model for monocular visual odometry: a video understanding approach ([code](https://github.com/aofrancani/TSformer-VO))
- Light3R-SfM: Towards Feed-forward Structure-from-Motion
- MASt3R-SfM: a Fully-Integrated Solution for Unconstrained Structure-from-Motion
- VGGSfM: Visual Geometry Grounded Deep Structure From Motion
- End-to-End Learned Visual Odometry Based on Vision Transformer
- Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry
- DDETR-SLAM: A Transformer-Based Approach to Pose Optimization in Dynamic Environments
- ViT VO - A Visual Odometry technique Using CNN-Transformer Hybrid Architecture
- Ema-vio: Deep visual–inertial odometry with external memory attention
- TransFusionOdom: interpretable transformer-based LiDAR-inertial fusion odometry estimation ([code](https://github.com/RakugenSon/Multi-modal-dataset-for-odometry-estimation))
- Modality-invariant Visual Odometry for Embodied Vision ([code](https://github.com/memmelma/VO-Transformer), [website](https://memmelma.github.io/vot/))
- ViTVO: Vision Transformer based Visual Odometry with Attention Supervision
- AFT-VO: Asynchronous fusion transformers for multi-view visual odometry estimation
- Dense prediction transformer for scale estimation in monocular visual odometry
- Transformer guided geometry model for flow-based unsupervised visual odometry
- ZeroVO: Visual Odometry with Minimal Assumptions ([website](https://zvocvpr.github.io/))
- BotVIO: A Lightweight Transformer-Based Visual-Inertial Odometry for Robotics ([code](https://github.com/wenhuiwei-ustc/BotVIO))
- Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance
- Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization
## Transformer-based Optical Flow
- Cotracker: It is better to track together ([code](https://github.com/facebookresearch/co-tracker))
- Win-win: Training high-resolution vision transformers from two windows
- Flowformer: A transformer architecture and its masked cost volume autoencoding for optical flow
- FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation
- Transflow: Transformer as flow learner
- Croco v2: Improved cross-view completion pre-training for stereo matching and optical flow
- Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion
- Unifying flow, stereo and depth estimation
- Flowformer: A transformer architecture for optical flow ([code](https://github.com/drinkingcoder/FlowFormer-Official))
- Gmflow: Learning optical flow via global matching
- Craft: Cross-attentional flow transformer for robust optical flow
- Learning optical flow with kernel patch attention ([code](https://github.com/megvii-research/KPAFlow))
- Global Matching with Overlapping Attention for Optical Flow Estimation
- Grounding Image Matching in 3D with MASt3R ([test](https://kwanwaipang.github.io/File/Blogs/Poster/MASt3R-SLAM.html))
- RoMa: Robust dense feature matching
- Tlcd: A transformer based loop closure detection for robotic visual slam
- Cotr: Correspondence transformer for matching across images ([code](https://github.com/ubc-vision/COTR))
- LoFTR: Detector-free local feature matching with transformers
- Superglue: Learning feature matching with graph neural networks
- TAPIP3D: Tracking Any Point in Persistent 3D Geometry
- DEFOM-Stereo: Depth Foundation Model Based Stereo Matching ([code](https://github.com/Insta360-Research-Team/DEFOM-Stereo), [website](https://insta360-research-team.github.io/DEFOM-Stereo/); Depth Anything V2 + RAFT-Stereo)
- MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training
- POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction
- CoMatcher: Multi-View Collaborative Feature Matching
- CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching
- Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better
- FG2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching ([code](https://github.com/vita-epfl/FG2))
- ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration
- Efficient LoFTR: Semi-dense local feature matching with sparse-like speed
- Rotation-invariant transformer for point cloud matching
- Aspanformer: Detector-free image matching with adaptive span transformer
## Other Resources
- Awesome-Transformer-Attention
- Dense-Prediction-Transformer-Based-Visual-Odometry
- Visual SLAM with Vision Transformers(ViT)
- Awesome-Learning-based-VO-VIO
- Dinov2: Learning robust visual features without supervision
- Is space-time attention all you need for video understanding?
- Taming transformers for high-resolution image synthesis ([code](https://github.com/CompVis/taming-transformers); high-resolution CNN + Transformer)
- Emerging properties in self-supervised vision transformers
- Vivit: A video vision transformer ([unofficial implementation](https://github.com/lucidrains/vit-pytorch))
- An image is worth 16x16 words: Transformers for image recognition at scale ([code](https://github.com/google-research/vision_transformer); ViT)
- DINOv3