https://github.com/liruilong940607/prope
Cameras as Relative Positional Encoding
https://github.com/liruilong940607/prope
multi-view positional-encoding transformer
Last synced: 10 months ago
JSON representation
Cameras as Relative Positional Encoding
- Host: GitHub
- URL: https://github.com/liruilong940607/prope
- Owner: liruilong940607
- Created: 2025-05-08T17:35:07.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-07-22T23:55:24.000Z (11 months ago)
- Last Synced: 2025-07-23T00:23:18.203Z (11 months ago)
- Topics: multi-view, positional-encoding, transformer
- Language: Python
- Homepage: http://www.liruilong.cn/prope/
- Size: 31.7 MB
- Stars: 376
- Watchers: 7
- Forks: 3
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# PRoPE
https://www.liruilong.cn/prope/
This is the official repo for the paper "Cameras as Relative Positional Encoding"

**TL;DR**: Language models and multi-view transformers must both bind “positional” information to input tokens, in terms of sequence position for LLMs and camera parameters for multi-view transformers. We present a study on camera conditioning that includes absolute positional encodings (e.g, raymaps), relative pose encodings (e.g., GTA), and a new method (PRoPE) uses *relative projective* transformation to capture 3D relationship between image tokens.
## Implementations
The implementation of PRoPE is extremely simple and efficient. We provide standalone, single-file implementations for both JAX and PyTorch in [`prope/jax.py`](prope/jax.py) and [`prope/torch.py`](prope/torch.py).
## Example of Usages
Here we demo with PyTorch version:
```python
# Say we have C images, each carries with camera infomation, which would be used for cross-view understanding.
viewmats: Tensor # (B, C, 4, 4) camera world-to-camera matrix
Ks: Tensor # (B, C, 3, 3) camera intrinsic matrix
# In transformer we typically patchify the images into tokens. Say
# the image size is (256, 384) and patch size is 16.
image_width, image_height = 256, 384
patches_x, patches_y = image_width / 16, image_height / 16
# And our attention layer has mapped the images from pixels (B, C, 384, 256) to Q/K/V tokens with shape (B, num_heads, seqlen, head_dim), where `seqlen = C * patches_x * patches_y`
Q, K, V: Tensor = ... # (B, num_heads, seqlen, head_dim)
# Injecting the camera information is simply replacing the native torch attention with our impl:
output = torch.nn.functional.scaled_dot_product_attention(Q, K, V)
# -->
output = prope_dot_product_attention(
Q, K, V,viewmats=viewmats, Ks=Ks, patches_x=patches_x, patches_y=patches_y, image_width=image_width, image_height=image_height
)
```
## Experiments
- Improve LVSM on the task of Novel View Syntheis: [Checkout `nvs` branch](https://github.com/liruilong940607/prope/tree/nvs)
- Improve UniMatch on the task of Stereo Depth Estimation: To be released