https://github.com/liruilong940607/prope

Cameras as Relative Positional Encoding
https://github.com/liruilong940607/prope

multi-view positional-encoding transformer

Last synced: 12 months ago
JSON representation

Cameras as Relative Positional Encoding

Host: GitHub
URL: https://github.com/liruilong940607/prope
Owner: liruilong940607
Created: 2025-05-08T17:35:07.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-07-22T23:55:24.000Z (about 1 year ago)
Last Synced: 2025-07-23T00:23:18.203Z (about 1 year ago)
Topics: multi-view, positional-encoding, transformer
Language: Python
Homepage: http://www.liruilong.cn/prope/
Size: 31.7 MB
Stars: 376
Watchers: 7
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# PRoPE
https://www.liruilong.cn/prope/

This is the official repo for the paper "Cameras as Relative Positional Encoding"

**TL;DR**: Language models and multi-view transformers must both bind “positional” information to input tokens, in terms of sequence position for LLMs and camera parameters for multi-view transformers. We present a study on camera conditioning that includes absolute positional encodings (e.g, raymaps), relative pose encodings (e.g., GTA), and a new method (PRoPE) uses *relative projective* transformation to capture 3D relationship between image tokens.

## Implementations

The implementation of PRoPE is extremely simple and efficient. We provide standalone, single-file implementations for both JAX and PyTorch in [`prope/jax.py`](prope/jax.py) and [`prope/torch.py`](prope/torch.py).

## Example of Usages

Here we demo with PyTorch version:

```python
# Say we have C images, each carries with camera infomation, which would be used for cross-view understanding.
viewmats: Tensor # (B, C, 4, 4) camera world-to-camera matrix
Ks: Tensor # (B, C, 3, 3) camera intrinsic matrix

# In transformer we typically patchify the images into tokens. Say
# the image size is (256, 384) and patch size is 16.
image_width, image_height = 256, 384
patches_x, patches_y = image_width / 16, image_height / 16

# And our attention layer has mapped the images from pixels (B, C, 384, 256) to Q/K/V tokens with shape (B, num_heads, seqlen, head_dim), where `seqlen = C * patches_x * patches_y`
Q, K, V: Tensor = ... # (B, num_heads, seqlen, head_dim)

# Injecting the camera information is simply replacing the native torch attention with our impl:
output = torch.nn.functional.scaled_dot_product_attention(Q, K, V)
# -->
output = prope_dot_product_attention(
Q, K, V,viewmats=viewmats, Ks=Ks, patches_x=patches_x, patches_y=patches_y, image_width=image_width, image_height=image_height
)
```

## Experiments

- Improve LVSM on the task of Novel View Syntheis: [Checkout `nvs` branch](https://github.com/liruilong940607/prope/tree/nvs)
- Improve UniMatch on the task of Stereo Depth Estimation: To be released

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/liruilong940607/prope

Awesome Lists containing this project

README