https://github.com/commaai/commavq

commaVQ is a dataset of compressed driving video
commaVQ challenge


Leaderboard · comma.ai/jobs · Discord · X

| Source Video | Compressed Video | Future Prediction |
| --------------- | ---------------- |------------------ |
| | | |

A world model is a model that can predict the next state of the world given the observed previous states and actions.

World models are essential to training all kinds of intelligent agents, especially self-driving models.

commaVQ contains:
- encoder/decoder models used to heavily compress driving scenes
- a world model trained on 3,000,000 minutes of driving videos
- a dataset of 100,000 minutes of compressed driving videos

# Task

## Lossless compression challenge: make me smaller! $500 challenge
Losslessly compress 5,000 minutes of driving video "tokens". Go to [./compression/](./compression/) to start.

**Prize: highest compression rate on 5,000 minutes of driving video (~915MB). The challenge ended July 1st, 2024, 11:59pm AoE.**

Using [this form](https://forms.gle/US88Hg7UR6bBuW3BA), submit a single zip file containing the compressed data and a Python script that decompresses it into its original form. Top solutions are listed on [comma's official leaderboard](https://comma.ai/leaderboard).






| score | name | method |
| ----- | ---- | ------ |
| 3.4 | szabolcs-cs | self-compressing neural network |
| 2.9 | BradyWynn | arithmetic coding with GPT |
| 2.6 | pkourouklidis 👑 | arithmetic coding with GPT |
| 2.3 | anonymous | zpaq |
| 2.3 | rostislav | zpaq |
| 2.2 | anonymous | zpaq |
| 2.2 | anonymous | zpaq |
| 2.2 | 0x41head | zpaq |
| 2.2 | tillinf | zpaq |
| 2.2 | nuniesmith | zpaq |
| 1.6 | baseline | lzma |


## Overview
A VQ-VAE [1,2] was used to heavily compress each video frame into 128 "tokens" of 10 bits each. Each entry of the dataset is a "segment" of compressed driving video, i.e. 1 minute of frames at 20 FPS (1,200 frames). Each file has shape 1200x8x16 (frames x token rows x token columns) and is saved as int16.
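The sizes quoted in this README follow from simple arithmetic, worked through in the sketch below. The variable names are illustrative. The reading that the ~915MB figure counts 10 bits per token over 5,000 segments (rather than the 2 bytes per token that int16 storage uses) is an assumption inferred from the numbers, not something stated here.

```python
# Sizes implied by the description above (all names are illustrative).
FPS = 20
SECONDS_PER_SEGMENT = 60
TOKEN_GRID = (8, 16)   # tokens per frame, laid out as 8 x 16
BITS_PER_TOKEN = 10    # 2**10 = 1024 possible token values

frames_per_segment = FPS * SECONDS_PER_SEGMENT               # 1200
tokens_per_frame = TOKEN_GRID[0] * TOKEN_GRID[1]             # 128
tokens_per_segment = frames_per_segment * tokens_per_frame   # 153600

# Size as stored: int16 spends 16 bits on every 10-bit token.
int16_bytes_per_segment = tokens_per_segment * 2             # 307200

# Size if each token were packed into exactly 10 bits.
packed_bytes_per_segment = tokens_per_segment * BITS_PER_TOKEN // 8  # 192000

# Assumption: the challenge size counts 10-bit tokens over 5,000 segments.
challenge_bytes = 5_000 * packed_bytes_per_segment           # 960000000
challenge_mib = challenge_bytes / 2**20                      # about 915.5

print(frames_per_segment, tokens_per_frame, tokens_per_segment)
print(int16_bytes_per_segment, packed_bytes_per_segment)
print(f"{challenge_mib:.1f} MiB")
```

Under that assumption, 5,000 segments come to 960,000,000 bytes, or about 915.5 MiB, which matches the quoted figure.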

A world model [3] was trained to predict the next token given a context of past tokens. This world model is a Generative Pre-trained Transformer (GPT) [4] trained on 3,000,000 minutes of driving videos following a similar recipe to [5].
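A next-token predictor can also drive lossless compression, which is the idea behind the "arithmetic coding with GPT" leaderboard entries: an arithmetic coder spends roughly -log2(p) bits on a token to which the model assigned probability p. The sketch below estimates that ideal cost. It uses a simple adaptive frequency count as a stand-in predictor, not the GPT, and `ideal_code_length_bits` and the synthetic tokens are hypothetical, introduced only for illustration.

```python
import numpy as np

VOCAB_SIZE = 1024  # 10-bit tokens


def ideal_code_length_bits(tokens: np.ndarray) -> float:
    """Bits an ideal entropy coder would spend under an adaptive count model.

    Stand-in predictor: P(next = v) is proportional to 1 + (times v was seen
    so far). It ignores context entirely, unlike a GPT.
    """
    counts = np.ones(VOCAB_SIZE, dtype=np.float64)  # Laplace smoothing
    total = float(VOCAB_SIZE)
    bits = 0.0
    for t in tokens.ravel():
        p = counts[t] / total   # probability assigned before seeing t
        bits -= np.log2(p)      # ideal cost of coding t
        counts[t] += 1.0        # update the model; a decoder can do the same
        total += 1.0
    return float(bits)


rng = np.random.default_rng(0)
# Synthetic, deliberately skewed tokens; real segments come from the dataset.
skewed = rng.integers(0, 32, size=(20, 8, 16)).astype(np.int16)

coded_bits = ideal_code_length_bits(skewed)
raw_bits = skewed.size * 10
print(f"raw: {raw_bits} bits, ideal coded: {coded_bits:.0f} bits")
print(f"implied rate: {raw_bits / coded_bits:.2f}")
```

The better the predictor matches the data, the larger the probabilities it assigns to the tokens that actually occur, and the fewer bits the coder needs.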

## Examples
See [./notebooks/encode.ipynb](./notebooks/encode.ipynb) and [./notebooks/decode.ipynb](./notebooks/decode.ipynb) for an example of how to visualize the dataset using a segment of driving video from [comma's drive to Taco Bell](https://blog.comma.ai/taco-bell/).

See [./notebooks/gpt.ipynb](./notebooks/gpt.ipynb) for an example of how to use the world model to imagine future frames.
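As a rough picture of what "imagining" frames involves, the sketch below rolls out tokens one at a time, feeding each sampled token back in as context. It is not the notebook's code: `stand_in_logits` is a hypothetical placeholder for the world model, `imagine_frames` and `context_frames` are illustrative names, and details of the real model (such as how frames are delimited in the token sequence) are not represented.

```python
import numpy as np

VOCAB_SIZE = 1024
TOKENS_PER_FRAME = 8 * 16


def stand_in_logits(context: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for the world model.

    Favors repeating the token at the same position in the previous frame.
    """
    logits = np.zeros(VOCAB_SIZE)
    logits[context[-TOKENS_PER_FRAME]] = 5.0
    return logits


def imagine_frames(past_frames, n_future, rng, context_frames=2):
    """Autoregressively sample n_future frames after past_frames."""
    seq = [int(t) for t in np.asarray(past_frames).reshape(-1)]
    for _ in range(n_future * TOKENS_PER_FRAME):
        # Only the most recent frames are visible to the predictor.
        context = np.asarray(seq[-context_frames * TOKENS_PER_FRAME:])
        logits = stand_in_logits(context)
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        seq.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    future = np.asarray(seq[-n_future * TOKENS_PER_FRAME:], dtype=np.int16)
    return future.reshape(n_future, 8, 16)


rng = np.random.default_rng(0)
past = rng.integers(0, VOCAB_SIZE, size=(2, 8, 16))  # synthetic context
future = imagine_frames(past, n_future=2, rng=rng)
print(future.shape)
```

The predicted token grids would then go through the decoder to produce viewable frames.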

See [./compression/compress.py](./compression/compress.py) for an example of how to compress the tokens using lzma.
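For orientation, here is a minimal lossless round trip with Python's standard `lzma` module. It is a sketch, not the repository's script: the tokens are synthetic random values standing in for one segment, and the ratio it prints is measured against the int16 byte size, which may differ from how the leaderboard score is computed.

```python
import lzma

import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for one segment; real tokens come from the dataset.
tokens = rng.integers(0, 1024, size=(1200, 8, 16)).astype(np.int16)

raw = tokens.tobytes()
compressed = lzma.compress(raw)

# Decompress and rebuild the array to confirm nothing was lost.
restored = np.frombuffer(lzma.decompress(compressed), dtype=np.int16)
restored = restored.reshape(tokens.shape)
assert np.array_equal(tokens, restored)

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
print(f"ratio vs int16 storage: {len(raw) / len(compressed):.2f}")
```

Random tokens carry no temporal or spatial structure, so real driving segments should behave differently from this synthetic example.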

## Download the dataset
- Using the Hugging Face `datasets` library:
```python
import numpy as np
from datasets import load_dataset
# load the first shard
data_files = {'train': ['data-0000.tar.gz']}
ds = load_dataset('commaai/commavq', data_files=data_files)
tokens = np.array(ds['train'][0]['token.npy'])
poses = np.array(ds['train'][0]['pose.npy'])
```
- Manually, by downloading from the Hugging Face datasets repository: https://huggingface.co/datasets/commaai/commavq

## References
[1] Van Den Oord, Aaron, and Oriol Vinyals. "Neural discrete representation learning." Advances in neural information processing systems 30 (2017).

[2] Esser, Patrick, Robin Rombach, and Bjorn Ommer. "Taming transformers for high-resolution image synthesis." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.

[3] https://worldmodels.github.io/

[4] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).

[5] Micheli, Vincent, Eloi Alonso, and François Fleuret. "Transformers are Sample-Efficient World Models." The Eleventh International Conference on Learning Representations. 2022.