# Awesome-Transformer-based-SLAM

A paper survey of Transformer-based SLAM.

https://github.com/KwanWaiPang/Awesome-Transformer-based-SLAM
## Transformer-based SLAM
- Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM
- SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors
- Multi-Agent Monocular Dense SLAM With 3D Reconstruction Priors
- 3R-GS: Best Practice in Optimizing Camera Poses Along with 3DGS ([website](https://zsh523.github.io/3R-GS/); MASt3R-SfM + 3DGS)
- AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos
- PROFusion: Robust and Accurate Dense Reconstruction via Camera Pose Regression and Optimization
- SLAM-Former: Putting SLAM into One Transformer ([code](https://github.com/Tsinghua-MARS-Lab/SLAM-Former), [website](https://tsinghua-mars-lab.github.io/SLAM-Former/))
- AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend
- LiDAR-VGGT: Cross-Modal Coarse-to-Fine Fusion for Globally Consistent and Metric-Scale Dense Mapping ([code](https://github.com/NorwegianSmokedSalmon/Color-Map-Evaluation))
- SAIL-Recon: Large SfM by Augmenting Scene Regression with Localization ([code](https://github.com/HKUST-SAIL/sail-recon), [website](https://hkust-sail.github.io/sail-recon/))
- DINO-SLAM: DINO-informed RGB-D SLAM for Neural Implicit and Explicit Representations
- SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos ([code](https://github.com/PKU-VCL-3DV/SLAM3R), [test](https://kwanwaipang.github.io/SLAM3R/))
- MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors ([code](https://github.com/rmurai0610/MASt3R-SLAM), [website](https://edexheim.github.io/mast3r-slam/), [test](https://kwanwaipang.github.io/MASt3R-SLAM/))
- Jperceiver: Joint perception network for depth, pose and layout estimation in driving scenes
- VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold
- ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association ([code](https://github.com/zhangganlin/vista-slam), [website](https://ganlinzhang.xyz/vista-slam/))
- VGGT-Long: Chunk it, Loop it, Align it – Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences ([code](https://github.com/DengKaiCQ/VGGT-Long))
- EC3R-SLAM: Efficient and Consistent Monocular Dense SLAM with Feed-Forward 3D Reconstruction ([code](https://github.com/hulxgit/EC3R-SLAM), [website](https://h0xg.github.io/ec3r/))
- GRS-SLAM3R: Real-Time Dense SLAM with Gated Recurrent State
## Transformer-based Mapping
- Human3R: Everyone Everywhere All at Once
- Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation
- Multi-frame self-supervised depth with transformers
- Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation
- CompletionFormer: Depth Completion with Convolutions and Vision Transformers ([code](https://github.com/youmi-zym/CompletionFormer), [website](https://youmi-zym.github.io/projects/CompletionFormer/))
- Lightweight monocular depth estimation via token-sharing transformer
- ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation
- TODE-Trans: Transparent Object Depth Estimation with Transformer
- Deep digging into the generalization of self-supervised monocular depth estimation
- Egformer: Equirectangular geometry-biased transformer for 360 depth estimation
- PanoFormer: Panorama Transformer for Indoor 360 Depth Estimation ([code](https://github.com/zhijieshen-bjtu/PanoFormer))
- Improving 360 monocular depth estimation via non-local dense prediction transformer and joint supervised and self-supervised learning
- MVSFormer: Multi-view stereo by learning robust image features and temperature-based depth
- M3: 3D-Spatial Multimodal Memory ([code](https://github.com/MaureenZOU/m3-spatial), [website](https://m3-spatial-memory.github.io/); compression & Gaussian Memory Attention)
- Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos ([code](https://github.com/Stereo4d/stereo4d-code), [website](https://stereo4d.github.io/))
- Continuous 3D Perception Model with Persistent State
- SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction
- SplatVoxel: History-Aware Novel View Streaming without Temporal Training
- GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
- DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers
- Sonata: Self-Supervised Learning of Reliable Point Representations
- Point transformer v3: Simpler faster stronger
- Point transformer v2: Grouped vector attention and partition-based pooling
- Point transformer ([unofficial implementation](https://github.com/POSTECH-CVLab/point-transformer))
- Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction ([website](https://www.robots.ox.ac.uk/~vgg/research/dynamic-pointmaps/); Dynamic DUSt3R, DPM)
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion ([test](https://kwanwaipang.github.io/MonST3R/))
- MUSt3R: Multi-view Network for Stereo 3D Reconstruction
- Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass ([test](https://kwanwaipang.github.io/Fast3R/))
- Depth anything: Unleashing the power of large-scale unlabeled data ([code](https://github.com/LiheYoung/Depth-Anything), [website](https://depth-anything.github.io/))
- DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions
- Learning to adapt clip for few-shot monocular depth estimation
- 3d reconstruction with spatial memory
- Towards zero-shot scale-aware monocular depth estimation ([code](https://github.com/tri-ml/vidar), [website](https://sites.google.com/view/tri-zerodepth))
- DUSt3R: Geometric 3D Vision Made Easy ([test](https://kwanwaipang.github.io/File/Blogs/Poster/MASt3R-SLAM.html))
- Gs-lrm: Large reconstruction model for 3d gaussian splatting ([website](https://sai-bi.github.io/project/gs-lrm/); 3DGS + Transformer)
- BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation ([code](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox))
- GLPanoDepth: Global-to-Local Panoramic Depth Estimation
- Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics
- Objcavit: improving monocular depth estimation using natural language models and image-object cross-attention
- Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion
- Sidert: A real-time pure transformer architecture for single image depth estimation
- Hybrid transformer based feature fusion for self-supervised monocular depth estimation
- Spike transformer: Monocular depth estimation for spiking camera ([code](https://github.com/Leozhangjiyuan/MDE-SpikingCamera))
- Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers (STTR; stereo matching)
- Transformer-based Monocular Depth Estimation with Attention Supervision ([code](https://github.com/WJ-Chang-42/ASTransformer))
- Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction
- Vision transformers for dense prediction ([code](https://github.com/isl-org/DPT); DPT)
- MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer
- DEST: Depth Estimation with Simplified Transformer
- SparseFormer: Attention-based Depth Completion Network
- GuideFormer: Transformers for Image Guided Depth Completion
- MapAnything: Universal Feed-Forward Metric 3D Reconstruction ([code](https://github.com/facebookresearch/map-anything), [website](https://map-anything.github.io/))
- PanoSplatt3R: Leveraging Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction
- Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
- LONG3R: Long Sequence Streaming 3D Reconstruction
- Dens3R: A Foundation Model for 3D Geometry Prediction
- StreamVGGT: Streaming 4D Visual Geometry Transformer
- Test3R: Learning to Reconstruct 3D at Test Time
- Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory
- 4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos ([website](https://4dgt.github.io/))
- Rig3R: Rig-Aware Conditioning for Learned 3D Reconstruction
- SAB3R: Semantic-Augmented Backbone in 3D Reconstruction ([website](https://uva-computer-vision-lab.github.io/sab3r/))
- Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles ([code](https://github.com/WU-CVGL/Styl3R), [website](https://nickisdope.github.io/Styl3R/))
- MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models ([code](https://github.com/CUHK-AIM-Group/MonoSplat))
- Recollection from Pensieve: Novel View Synthesis via Learning from Uncalibrated Videos
- STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes
- DELTA: Dense Depth from Events and LiDAR using Transformer's Attention
- MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds
- MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision
- Regist3R: Incremental Registration with Stereo Foundation Model
- St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World ([website](https://st4rtrack.github.io/))
- AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis ([code](https://github.com/kvuong2711/aerial-megadepth), [website](https://aerial-megadepth.github.io/))
- Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction
- MonSter: Marry Monodepth to Stereo Unleashes Power
- D2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes ([website](https://cvlab-kaist.github.io/DDUSt3R/))
- FlowR: Flowing from Sparse to Dense 3D Reconstructions ([website](https://tobiasfshr.github.io/pub/flowr/))
- Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
- SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors (DUSt3R + Diffusion + 3DGS)
- MVSAnywhere: Zero-Shot Multi-View Stereo
- CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis ([website](https://youngkyoonjang.github.io/projects/comapgs/))
- Pow3R: empowering unconstrained 3D reconstruction with camera and scene priors ([website](https://europe.naverlabs.com/pow3r); DUSt3R + multi-information input)
- Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding
- UniK3D: Universal Camera Monocular 3D Estimation ([code](https://github.com/lpiccinelli-eth/unik3d), [website](https://lpiccinelli-eth.github.io/pub/unik3d/))
- Depth anything v2 ([code](https://github.com/DepthAnything/Depth-Anything-V2), [website](https://depth-anything-v2.github.io/))
## Transformer-based Pose Tracking
- A lightweight sensor fusion for neural visual inertial odometry
- MASt3R-Fusion: Integrating Feed-Forward Visual Model with IMU, GNSS for High-Functionality SLAM ([code](https://github.com/GREAT-WHU/MASt3R-Fusion))
- Dense-depth map guided deep Lidar-Visual Odometry with Sparse Point Clouds and Images
- DINO-VO: A Feature-Based Visual Odometry Leveraging a Visual Foundation Model
- XIRVIO: Critic-guided Iterative Refinement for Visual-Inertial Odometry with Explainable Adaptive Weighting
- Transformer-based model for monocular visual odometry: a video understanding approach ([code](https://github.com/aofrancani/TSformer-VO))
- Light3R-SfM: Towards Feed-forward Structure-from-Motion
- MASt3R-SfM: a Fully-Integrated Solution for Unconstrained Structure-from-Motion
- VGGSfM: Visual Geometry Grounded Deep Structure From Motion
- End-to-End Learned Visual Odometry Based on Vision Transformer
- Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry
- DDETR-SLAM: A Transformer-Based Approach to Pose Optimization in Dynamic Environments
- ViT VO - A Visual Odometry technique Using CNN-Transformer Hybrid Architecture
- Ema-vio: Deep visual–inertial odometry with external memory attention
- TransFusionOdom: interpretable transformer-based LiDAR-inertial fusion odometry estimation ([code](https://github.com/RakugenSon/Multi-modal-dataset-for-odometry-estimation))
- Modality-invariant Visual Odometry for Embodied Vision ([code](https://github.com/memmelma/VO-Transformer), [website](https://memmelma.github.io/vot/))
- ViTVO: Vision Transformer based Visual Odometry with Attention Supervision
- AFT-VO: Asynchronous fusion transformers for multi-view visual odometry estimation
- Dense prediction transformer for scale estimation in monocular visual odometry
- Transformer guided geometry model for flow-based unsupervised visual odometry
- ZeroVO: Visual Odometry with Minimal Assumptions ([website](https://zvocvpr.github.io/))
- BotVIO: A Lightweight Transformer-Based Visual-Inertial Odometry for Robotics ([code](https://github.com/wenhuiwei-ustc/BotVIO))
- Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance
- Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization
## Transformer-based Optical Flow
- Cotracker: It is better to track together ([code](https://github.com/facebookresearch/co-tracker))
- Win-win: Training high-resolution vision transformers from two windows
- Flowformer: A transformer architecture and its masked cost volume autoencoding for optical flow
- FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation
- Transflow: Transformer as flow learner
- Croco v2: Improved cross-view completion pre-training for stereo matching and optical flow
- Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion
- Unifying flow, stereo and depth estimation
- Flowformer: A transformer architecture for optical flow ([code](https://github.com/drinkingcoder/FlowFormer-Official))
- Gmflow: Learning optical flow via global matching
- Craft: Cross-attentional flow transformer for robust optical flow
- Learning optical flow with kernel patch attention ([code](https://github.com/megvii-research/KPAFlow))
- Global Matching with Overlapping Attention for Optical Flow Estimation
- Grounding Image Matching in 3D with MASt3R ([test](https://kwanwaipang.github.io/File/Blogs/Poster/MASt3R-SLAM.html))
- RoMa: Robust dense feature matching
- Tlcd: A transformer based loop closure detection for robotic visual slam
- Cotr: Correspondence transformer for matching across images ([code](https://github.com/ubc-vision/COTR))
- LoFTR: Detector-free local feature matching with transformers
- Superglue: Learning feature matching with graph neural networks
- TAPIP3D: Tracking Any Point in Persistent 3D Geometry
- DEFOM-Stereo: Depth Foundation Model Based Stereo Matching ([code](https://github.com/Insta360-Research-Team/DEFOM-Stereo), [website](https://insta360-research-team.github.io/DEFOM-Stereo/); Depth Anything V2 + RAFT-Stereo)
- MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training
- POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction
- CoMatcher: Multi-View Collaborative Feature Matching
- CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching
- Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better
- FG2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching ([code](https://github.com/vita-epfl/FG2))
- ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration
- Efficient LoFTR: Semi-dense local feature matching with sparse-like speed
- Rotation-invariant transformer for point cloud matching
- Aspanformer: Detector-free image matching with adaptive span transformer
## Other Resources
- Awesome-Transformer-Attention
- Dense-Prediction-Transformer-Based-Visual-Odometry
- Visual SLAM with Vision Transformers(ViT)
- Awesome-Learning-based-VO-VIO
- Dinov2: Learning robust visual features without supervision
- Is space-time attention all you need for video understanding?
- Taming transformers for high-resolution image synthesis ([code](https://github.com/CompVis/taming-transformers); high-resolution CNN + Transformer)
- Emerging properties in self-supervised vision transformers
- Vivit: A video vision transformer ([unofficial implementation](https://github.com/lucidrains/vit-pytorch))
- An image is worth 16x16 words: Transformers for image recognition at scale ([code](https://github.com/google-research/vision_transformer); ViT)
- DINOv3