https://github.com/voxel51/papers-with-data
A curated list of papers that released datasets along with their work
- Host: GitHub
- URL: https://github.com/voxel51/papers-with-data
- Owner: voxel51
- License: apache-2.0
- Created: 2023-06-28T02:29:32.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-10-22T13:40:41.000Z (12 months ago)
- Last Synced: 2025-07-03T19:47:59.348Z (3 months ago)
- Topics: ai, artificial-intelligence, computer-vision, data-science, datasets, deep-learning, machine-learning, papers
- Language: Python
- Homepage:
- Size: 63.5 KB
- Stars: 125
- Watchers: 17
- Forks: 9
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Papers with Data
Data reigns supreme 🥇
Every day it becomes more evident that *data* is the limiting factor for
state-of-the-art 📈 machine learning. Your model architecture may be
revolutionary, but without high-quality data 📊 to train on, it will be doomed
to mediocrity.

Pair idea with execution and use top-notch data in your next project!
## NeurIPS 2023
We've combed through the **2384** papers accepted to NeurIPS in 2023 and compiled
a short-list of papers introducing exciting new datasets.

| **Title** | **Tags** | **Paper** | **Dataset** | **Code** |
|:---------:|:---------:|:---------:|:-----------:|:--------:|
| DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data | `perceptual similarity`, `image`, `synthetic`, `diffusion`, `JND`, `2AFC` | [Paper](https://arxiv.org/abs/2306.09344) | [Dataset](https://try.fiftyone.ai/datasets/NIGHTS/samples) | [Code](https://github.com/ssundaram21/dreamsim) |
| Visual Instruction Tuning | `vision-language`, `llm`, `instruction-tuning`, `image`, `multimodal` | [Paper](https://arxiv.org/abs/2304.08485) | [Dataset](https://try.fiftyone.ai/datasets/LLaVA-Instruct/samples) | [Code](https://github.com/haotian-liu/LLaVA) |
| ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation | `reward-model`, `image`, `text-to-image`, `synthetic`, `human-preference`, `alignment` | [Paper](https://arxiv.org/abs/2304.05977) | [Dataset](https://try.fiftyone.ai/datasets/ImageRewardDB-clean/samples) | [Code](https://github.com/THUDM/ImageReward) |
| MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing | `image-editing`, `synthetic`, `image`, `instruction` | [Paper](https://arxiv.org/abs/2306.10012) | [Dataset](https://try.fiftyone.ai/datasets/MagicBrish/samples) | [Code](https://github.com/OSU-NLP-Group/MagicBrush) |
| REAL3D-AD | `3D`, `point-cloud`, `anomaly-detection` | [Paper](https://arxiv.org/abs/2309.13226) | [Dataset](https://try.fiftyone.ai/datasets/REAL3D-AD/samples) | [Code](https://github.com/M-3LAB/Real3D-AD) |

## WACV 2024

| **Title** | **Tags** | **Paper** | **Dataset** | **Code** |
|:---------:|:---------:|:---------:|:-----------:|:--------:|
| dacl10k: Benchmark for Semantic Bridge Damage Segmentation | `image`, `semantic segmentation`, `classification`, `construction`, `defect` | [Paper](https://arxiv.org/abs/2309.00460) | [Dataset](https://try.fiftyone.ai/datasets/dacl10k/samples) | [Code](https://github.com/phiyodr/dacl10k-toolkit) |

## ICCV 2023

| **Title** | **Tags** | **Paper** | **Dataset** | **Code** |
|:---------:|:---------:|:---------:|:-----------:|:--------:|
| Satlas: A Large-Scale, Multi-Task Dataset for Remote Sensing Image Understanding | `image`, `SAR`, `satellite`, `detection`, `climate` | [Paper](https://arxiv.org/abs/2211.15660) | [Dataset](https://try.fiftyone.ai/datasets/SATLAS%20Marine%20Infrastructure/samples) | [Code](https://github.com/allenai/satlas) |
| Building3D: An Urban-Scale Dataset and Benchmarks for Learning Roof Structures from Point Clouds | `3D`, `point cloud` | [Paper](https://arxiv.org/abs/2307.11914) | [Dataset](https://try.fiftyone.ai/datasets/Building3D/samples) | |
| EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding | `image`, `object`, `ego` | [Paper](https://arxiv.org/abs/2309.08816) | [Dataset](https://try.fiftyone.ai/datasets/egoobjects-val/samples) | [Code](https://github.com/facebookresearch/EgoObjects) |
| Equivariant Similarity for Vision-Language Foundation Models | `image`, `similarity`, `caption` | [Paper](https://arxiv.org/abs/2303.14465) | [Dataset](https://try.fiftyone.ai/datasets/eqben-test/samples) | [Code](https://github.com/Wangt-CN/EqBen) |
| MOSE: A New Dataset for Video Object Segmentation in Complex Scenes | `video`, `segmentation`, `tracking` | [Paper](https://arxiv.org/abs/2302.01872) | [Dataset](https://try.fiftyone.ai/datasets/mose/samples) | |
| SportsMOT: A Large Multi-Object Tracking Dataset in Multiple Sports Scenes | `multi-object tracking`, `sports` | [Paper](https://arxiv.org/abs/2304.05170) | [Dataset](https://try.fiftyone.ai/datasets/sportsmot-validation/samples) | [Code](https://github.com/MCG-NJU/SportsMOT) |

## CVPR 2023

We've combed through the **2359** papers accepted to CVPR in 2023 and compiled
a short-list of papers introducing exciting new datasets.

| **Title** | **Tags** | **Paper** | **Dataset** | **Code** |
|:---------:|:---------:|:---------:|:-----------:|:--------:|
| MVImgNet: A Large-scale Dataset of Multi-view Images | `multi-view`, `image` | [Paper](https://arxiv.org/abs/2303.06042) | [Dataset](https://try.fiftyone.ai/datasets/MVImgNet/samples) | [Code](https://github.com/GAP-LAB-CUHK-SZ/MVImgNet) |
| GeoNet: Benchmarking Unsupervised Adaptation across Geographies | `geolocation`, `image` | [Paper](https://arxiv.org/abs/2303.15443) | [Dataset](https://try.fiftyone.ai/datasets/GeoNet/samples) | |
| Joint HDR Denoising and Fusion: A Real-World Mobile HDR Image Dataset | `denoising`, `image` | | [Dataset](https://try.fiftyone.ai/datasets/Mobile-HDR/samples) | [Code](https://github.com/shuaizhengliu/joint-hdrdn) |
| Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo | `optical flow`, `stereo`, `image` | [Paper](https://arxiv.org/abs/2303.01943) | [Dataset](https://try.fiftyone.ai/datasets/Spring/samples) | |
| ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing | `image`, `editing` | [Paper](https://arxiv.org/abs/2303.17096) | [Dataset](https://try.fiftyone.ai/datasets/ImageNet-E/samples) | [Code](https://github.com/alibaba/easyrobust) |
| ARKitTrack: A New Diverse Dataset for Tracking Using Mobile RGB-D Data | `RGB-D`, `segmentation`, `video` | [Paper](https://arxiv.org/abs/2303.13885) | [Dataset](https://try.fiftyone.ai/datasets/ARKitTrack/samples) | [Code](https://github.com/lawrence-cj/ARKitTrack) |
| Diverse Embedding Expansion Network and Low-Light Cross-Modality Benchmark for Visible-Infrared Person Re-identification | `low-light`, `cross-modal`, `IR` | [Paper](https://arxiv.org/abs/2303.14481) | [Dataset](https://try.fiftyone.ai/datasets/LLCM/samples) | [Code](https://github.com/ZYK100/LLCM) |
| JRDB-Pose: A Large-scale Dataset for Multi-Person Pose Estimation and Tracking | `pose estimation`, `image`, `keypoint`, `tracking` | [Paper](https://arxiv.org/abs/2210.11940v2) | [Dataset](https://try.fiftyone.ai/datasets/JRDB-Pose/samples) | |
| A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation | `synthetic`, `domain adaptation`, `supervised` | [Paper](https://arxiv.org/abs/2303.09165) | [Dataset](https://try.fiftyone.ai/datasets/SynSL-120K/samples) | [Code](https://github.com/huitangtang/On_the_Utility_of_Synthetic_Data) |

## Papers from 2022

| **Title** | **Tags** | **Paper** | **Dataset** | **Code** |
|:---------:|:---------:|:---------:|:-----------:|:--------:|
| Calving fronts and where to find them: a benchmark dataset and methodology for automatic glacier calving front extraction from synthetic aperture radar imagery | `glacier`, `climate`, `SAR`, `satellite`, `image`, `semantic segmentation` | [Paper](https://essd.copernicus.org/articles/14/4287/2022/) | [Dataset](https://try.fiftyone.ai/datasets/CaFFe/samples) | [Code](https://doi.pangaea.de/10.1594/PANGAEA.940950) |
| The Caltech Fish Counting Dataset: A Benchmark for Multiple-Object Tracking and Counting | `conservation`, `detection`, `SONAR`, `video`, `tracking`, `counting` | [Paper](https://arxiv.org/abs/2207.09295) | [Dataset](https://try.fiftyone.ai/datasets/CFC/samples) | [Code](https://github.com/visipedia/caltech-fish-counting) |

## Classics

| **Title** | **Tags** | **Paper** | **Dataset** | **Code** |
|:---------:|:---------:|:---------:|:-----------:|:--------:|
| ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases | `x-ray`, `image`, `healthcare`, `detection` | [Paper](https://arxiv.org/abs/1705.02315v5) | [Dataset](https://try.fiftyone.ai/datasets/ChestX-ray14/samples) | |

## Contributing 👋

We would love your help in making this repository even better! If we missed a
paper that introduced a new dataset, or if you can think of any ways to improve
the repository, feel free to open an issue or a pull request.

## Note

This repository is inspired by [paperswithcode](https://paperswithcode.com),
and the template was adapted from
[top-cvpr-2023-papers](https://github.com/SkalskiP/top-cvpr-2023-papers).
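
Every table in this README uses the same five columns (Title, Tags, Paper, Dataset, Code), so the list can be filtered by tag with a few lines of standard-library Python. The snippet below is only a rough sketch: the helper names `parse_rows` and `with_tag` are made up for illustration and are not part of this repository, and the sample text is a two-row excerpt copied from the ICCV 2023 and NeurIPS 2023 tables.

```python
import re
from typing import Optional

# Captures the URL of a markdown link; the link text is allowed to be empty.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Captures each backtick-quoted tag in the Tags column.
TAG_RE = re.compile(r"`([^`]+)`")


def first_url(cell: str) -> Optional[str]:
    """Return the first link URL in a table cell, or None if the cell is empty."""
    match = LINK_RE.search(cell)
    return match.group(1) if match else None


def parse_rows(markdown: str) -> list:
    """Parse table rows laid out as Title | Tags | Paper | Dataset | Code."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        # Cut at the last pipe so trailing text after the row is ignored.
        cells = [c.strip() for c in line[1:line.rfind("|")].split("|")]
        if len(cells) != 5:
            continue
        # Skip the header row and the alignment row (":-----:").
        if cells[0].startswith("**") or set(cells[0]) <= set(":-"):
            continue
        rows.append({
            "title": cells[0],
            "tags": TAG_RE.findall(cells[1]),
            "paper": first_url(cells[2]),
            "dataset": first_url(cells[3]),
            "code": first_url(cells[4]),
        })
    return rows


def with_tag(rows: list, tag: str) -> list:
    """Keep only the rows whose tag list contains `tag` exactly."""
    return [row for row in rows if tag in row["tags"]]


SAMPLE = """
| **Title** | **Tags** | **Paper** | **Dataset** | **Code** |
|:---------:|:---------:|:---------:|:-----------:|:--------:|
| DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data | `perceptual similarity`, `image`, `synthetic`, `diffusion`, `JND`, `2AFC` | [Paper](https://arxiv.org/abs/2306.09344) | [Dataset](https://try.fiftyone.ai/datasets/NIGHTS/samples) | [Code](https://github.com/ssundaram21/dreamsim) |
| MOSE: A New Dataset for Video Object Segmentation in Complex Scenes | `video`, `segmentation`, `tracking` | [Paper](https://arxiv.org/abs/2302.01872) | [Dataset](https://try.fiftyone.ai/datasets/mose/samples) | |
"""

rows = parse_rows(SAMPLE)
for row in with_tag(rows, "video"):
    print(row["title"], "->", row["dataset"])
```

To run it over the whole list, pass the contents of `README.md` to `parse_rows` instead of `SAMPLE`. Rows with an empty Paper or Code cell come back with `None` in that field.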