{"id":13563767,"url":"https://github.com/google-research/scenic","last_synced_at":"2025-05-13T20:15:28.380Z","repository":{"id":37079132,"uuid":"385275507","full_name":"google-research/scenic","owner":"google-research","description":"Scenic: A Jax Library for Computer Vision Research and Beyond","archived":false,"fork":false,"pushed_at":"2025-05-05T15:11:45.000Z","size":66799,"stargazers_count":3529,"open_issues_count":286,"forks_count":454,"subscribers_count":36,"default_branch":"main","last_synced_at":"2025-05-05T16:29:32.870Z","etag":null,"topics":["attention","computer-vision","deep-learning","jax","research","transformers","vision-transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-07-12T14:27:08.000Z","updated_at":"2025-05-05T15:48:36.000Z","dependencies_parsed_at":"2023-02-16T06:00:40.672Z","dependency_job_id":"895c83bb-4cef-4cc0-b9f6-c6a05309f12d","html_url":"https://github.com/google-research/scenic","commit_stats":{"total_commits":640,"total_committers":78,"mean_commits":8.205128205128204,"dds":0.746875,"last_synced_commit":"4181ac18e2fb48e1e013ddef9b742e5f590eba4e"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fscenic","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fscenic/tags","release
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fscenic/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fscenic/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google-research","download_url":"https://codeload.github.com/google-research/scenic/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254020638,"owners_count":22000755,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention","computer-vision","deep-learning","jax","research","transformers","vision-transformer"],"created_at":"2024-08-01T13:01:23.103Z","updated_at":"2025-05-13T20:15:23.338Z","avatar_url":"https://github.com/google-research.png","language":"Python","readme":"# Scenic\n\u003cdiv style=\"text-align: left\"\u003e\n\u003cimg align=\"right\" src=\"https://raw.githubusercontent.com/google-research/scenic/main/images/scenic_logo.png\" width=\"200\" alt=\"scenic logo\"\u003e\u003c/img\u003e\n\u003c/div\u003e\n\n*Scenic* is a codebase with a focus on research around attention-based models\nfor computer vision. Scenic has been successfully used to develop\nclassification, segmentation, and detection models for multiple modalities\nincluding images, video, audio, and multimodal combinations of them.\n\nMore precisely, *Scenic* is (i) a set of shared light-weight libraries solving\ntasks commonly encountered when training large-scale (i.e. 
multi-device,\nmulti-host) vision models; and (ii) several *projects* containing fully\nfleshed out problem-specific training and evaluation loops using these\nlibraries.\n\nScenic is developed in [JAX](https://github.com/jax-ml/jax) and uses\n[Flax](https://github.com/google/flax).\n\n### Contents\n* [What we offer](#what-we-offer)\n* [SOTA models and baselines in Scenic](#sota-models-and-baselines-in-scenic)\n* [Philosophy](#philosophy)\n* [Getting started](#getting-started)\n* [Scenic component design](#scenic-component-design)\n* [Citing Scenic](#citing-scenic)\n\n## What we offer\nAmong other things, *Scenic* provides\n\n* Boilerplate code for launching experiments, summary writing, logging,\n  profiling, etc.;\n* Optimized training and evaluation loops, losses, metrics, bi-partite matchers,\n  etc.;\n* Input pipelines for popular vision datasets;\n* [Baseline models](https://github.com/google-research/scenic/tree/main/scenic/projects/baselines#scenic-baseline-models),\nincluding strong non-attentional baselines.\n\n\n## SOTA models and baselines in *Scenic*\nScenic contains a number of SOTA models and baselines that were either developed\nusing Scenic or reimplemented in it:\n\nProjects that were developed in Scenic or used it for their experiments:\n\n* [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691)\n* [OmniNet: Omnidirectional Representations from Transformers](https://arxiv.org/abs/2103.01075)\n* [Attention Bottlenecks for Multimodal Fusion](https://arxiv.org/abs/2107.00135)\n* [TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?](https://arxiv.org/abs/2106.11297)\n* [Exploring the Limits of Large Scale Pre-training](https://arxiv.org/abs/2110.02095)\n* [The Efficiency Misnomer](https://arxiv.org/abs/2110.12894)\n* [Discrete Representations Strengthen Vision Transformer Robustness](https://arxiv.org/abs/2111.10493)\n* [Pyramid Adversarial Training Improves ViT Performance](https://arxiv.org/abs/2111.15121)\n* [VUT: 
Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling](https://arxiv.org/abs/2112.05692)\n* [CLAY: Learning to Denoise Raw Mobile UI Layouts for Improving Datasets at Scale](https://arxiv.org/abs/2201.04100)\n* [Zero-Shot Text-Guided Object Generation with Dream Fields](https://arxiv.org/abs/2112.01455)\n* [Multiview Transformers for Video Recognition](https://arxiv.org/abs/2201.04288)\n* [PolyViT: Co-training Vision Transformers on Images, Videos and Audio](https://arxiv.org/abs/2111.12993)\n* [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230)\n* [Learning with Neighbor Consistency for Noisy Labels](https://arxiv.org/abs/2202.02200)\n* [Token Turing Machines](https://arxiv.org/pdf/2211.09119.pdf)\n* [Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning](https://arxiv.org/pdf/2302.14115.pdf)\n* [AVATAR: Unconstrained Audiovisual Speech Recognition](https://arxiv.org/abs/2206.07684)\n* [Adaptive Computation with Elastic Input Sequence](https://arxiv.org/abs/2301.13195)\n* [Location-Aware Self-Supervised Transformers for Semantic Segmentation](https://arxiv.org/abs/2212.02400)\n* [How can objects help action recognition?](https://openaccess.thecvf.com/content/CVPR2023/html/Zhou_How_Can_Objects_Help_Action_Recognition_CVPR_2023_paper.html)\n* [Verbs in Action: Improving verb understanding in video-language models](https://arxiv.org/abs/2304.06708)\n* [Unified Visual Relationship Detection with Vision and Language Models](https://arxiv.org/abs/2303.08998)\n* [UnLoc: A Unified Framework for Video Localization Tasks](https://arxiv.org/abs/2308.11062)\n* [REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory](https://arxiv.org/abs/2212.05221)\n* [Audiovisual Masked Autoencoders](https://arxiv.org/abs/2212.05922)\n* [MatFormer: Nested Transformer for Elastic Inference](https://arxiv.org/abs/2310.07707)\n* [Pixel Aligned 
Language Models](https://arxiv.org/abs/2312.09237)\n* [A Generative Approach for Wikipedia-Scale Visual Entity Recognition](https://arxiv.org/abs/2403.02041)\n* [Streaming Dense Video Captioning](https://arxiv.org/abs/2404.01297)\n* [Dense Video Object Captioning from Disjoint Supervision](https://arxiv.org/abs/2306.11729)\n\nMore information can be found in [projects](https://github.com/google-research/scenic/tree/main/scenic/projects#list-of-projects-hosted-in-scenic).\n\nBaselines that were reproduced in Scenic:\n\n* [(ViT) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929)\n* [(DETR) End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872)\n* [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159)\n* [(CLIP) Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)\n* [MLP-Mixer: An all-MLP Architecture for Vision](https://arxiv.org/abs/2105.01601)\n* [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)\n* [How to train your ViT? 
Data, Augmentation, and Regularization in Vision Transformers](https://arxiv.org/abs/2106.10270)\n* [Big Transfer (BiT): General Visual Representation Learning](https://arxiv.org/abs/1912.11370)\n* [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385)\n* [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597)\n* [PCT: Point Cloud Transformer](https://arxiv.org/abs/2012.09688)\n* [Universal Transformers](https://arxiv.org/abs/1807.03819)\n* [PonderNet](https://arxiv.org/abs/2107.05407)\n* [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)\n* [Rethinking Attention with Performers](https://arxiv.org/abs/2009.14794)\n* [(CenterNet) Objects as Points](https://arxiv.org/abs/1904.07850)\n* [(SAM) Segment Anything](https://arxiv.org/abs/2304.02643)\n\n\nMore information can be found in [baseline models](https://github.com/google-research/scenic/tree/main/scenic/projects/baselines#scenic-baseline-models).\n\n\u003ca name=\"philosophy\"\u003e\u003c/a\u003e\n## Philosophy\n*Scenic* aims to facilitate rapid prototyping of large-scale vision models. To\nkeep the code simple to understand and extend, we prefer *forking and\ncopy-pasting over adding complexity or increasing abstraction*. Only when\nfunctionality proves to be widely useful across many models and tasks may it be\nupstreamed to Scenic's shared libraries.\n\n\n\u003ca name=\"getting-started\"\u003e\u003c/a\u003e\n## Getting started\n* See `projects/baselines/README.md` for a walk-through of baseline models and\n  instructions on how to run the code.\n* If you would like to contribute to *Scenic*, please check out the\n  [Philosophy](#philosophy), [Code structure](#code_structure) and\n  [Contributing](CONTRIBUTING.md) sections.\n  Should your contribution be a part of the shared libraries, please send us a\n  pull request!\n\n\n### Quickstart\nYou will need Python 3.9 or later. 
Download the code from GitHub\n\n```shell\n$ git clone https://github.com/google-research/scenic.git\n$ cd scenic\n$ pip install .\n```\n\nand run training for ViT on ImageNet:\n\n```shell\n$ python scenic/main.py -- \\\n  --config=scenic/projects/baselines/configs/imagenet/imagenet_vit_config.py \\\n  --workdir=./\n```\n\nNote that for specific projects and baselines, you might need to install extra\npackages that are mentioned in their `README.md` or `requirements.txt` files.\n\n[Here](https://colab.research.google.com/github/google-research/scenic/blob/main/scenic/common_lib/colabs/scenic_playground.ipynb)\nis also a minimal colab to train a simple feed-forward model using Scenic.\n\n\u003ca name=\"code_structure\"\u003e\u003c/a\u003e\n## Scenic component design\nScenic is designed to offer different levels of abstraction, supporting\nprojects that range from those requiring only hyper-parameter changes via config\nfiles to those that need to customize the input pipeline, model\narchitecture, losses and metrics, and the training loop. To make this happen,\nthe code in Scenic is organized as either _project-level_ code,\nwhich refers to customized code for specific projects or baselines, or\n_library-level_ code, which refers to common functionalities and general\npatterns that are adopted by the majority of projects. The project-level\ncode lives in the `projects` directory.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/google-research/scenic/main/images/scenic_design.jpg\" width=\"900\" alt=\"scenic design\"\u003e\u003c/img\u003e\n\u003c/div\u003e\n\n### Library-level code\nThe goal is to keep the library-level code minimal and well-tested and to avoid\nintroducing extra abstractions to support minor use-cases. 
Shared libraries\nprovided by *Scenic* are split into:\n\n*   `dataset_lib`: Implements IO pipelines for loading and pre-processing data\n    for common Computer Vision tasks and benchmarks (see \"Tasks and Datasets\"\n    section). All pipelines are designed to be scalable and support multi-host\n    and multi-device setups, taking care of dividing data among multiple hosts,\n    incomplete batches, caching, pre-fetching, etc.\n*   `model_lib`: Provides\n    *   several abstract model interfaces (e.g. `ClassificationModel` or\n        `SegmentationModel` in `model_lib.base_models`) with task-specific\n        losses and metrics;\n    *   neural network layers in `model_lib.layers`, focusing on efficient\n        implementation of attention and transformer layers;\n    *   accelerator-friendly implementations of bipartite matching\n        algorithms in `model_lib.matchers`.\n*   `train_lib`: Provides tools for constructing training loops and implements\n    several optimized trainers (classification trainer and segmentation trainer)\n    that can be forked for customization.\n*   `common_lib`: General utilities, like logging and debugging modules,\n    functionalities for processing raw data, etc.\n\n### Project-level code\nScenic supports the development of customized solutions for customized tasks and\ndata via the concept of a \"project\". There is no one-size-fits-all recipe for how\nmuch code should be re-used by a project. Projects can consist of only configs and\nuse the common models, trainers, and tasks/data that live in library-level code, or\nthey can fork any of the mentioned functionalities and redefine layers,\nlosses, metrics, logging methods, tasks, architectures, as well as training and\nevaluation loops. 
The modularity of library-level code allows\nprojects to fall anywhere on the \"run-as-is\" to \"fully customized\"\nspectrum.\n\nCommon baselines such as ResNet and the Vision Transformer (ViT) are implemented\nin the [`projects/baselines`](https://github.com/google-research/scenic/tree/main/scenic/projects/baselines)\nproject. Forking models in this directory is a good starting point for new\nprojects.\n\n\n## Citing Scenic\nIf you use Scenic, you can cite our [white paper](https://openaccess.thecvf.com/content/CVPR2022/html/Dehghani_Scenic_A_JAX_Library_for_Computer_Vision_Research_and_Beyond_CVPR_2022_paper.html).\nHere is an example BibTeX entry:\n\n```bibtex\n@InProceedings{dehghani2021scenic,\n    author    = {Dehghani, Mostafa and Gritsenko, Alexey and Arnab, Anurag and Minderer, Matthias and Tay, Yi},\n    title     = {Scenic: A JAX Library for Computer Vision Research and Beyond},\n    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},\n    year      = {2022},\n    pages     = {21393-21398}\n}\n```\n\n_Disclaimer: This is not an official Google product._\n","funding_links":[],"categories":["Python","Computer Vision","其他_机器视觉","Libraries"],"sub_categories":["General Purpose CV","网络服务_其他"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Fscenic","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle-research%2Fscenic","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Fscenic/lists"}