https://github.com/onnx/models

A collection of pre-trained, state-of-the-art models in the ONNX format
Host: GitHub
URL: https://github.com/onnx/models
Owner: onnx
License: apache-2.0
Created: 2017-10-06T00:03:03.000Z (over 8 years ago)
Default Branch: main
Last Pushed: 2024-04-30T20:44:50.000Z (about 2 years ago)
Last Synced: 2025-04-24T03:49:42.359Z (about 1 year ago)
Topics: deep-learning, download, models, onnx, pretrained
Language: Jupyter Notebook
Homepage: http://onnx.ai/models/
Size: 297 MB
Stars: 8,495
Watchers: 196
Forks: 1,456
Open Issues: 208
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

awesome-computer-vision-resources - [github
awesome-mobile-ai - ONNX Model Zoo - ONNX format models (🛠️ Tools & Utilities / 🎨 Pre-trained Model Hubs)
awesome-oneapi - models - The ONNX Model Zoo is a collection of pre-trained, state-of-the-art machine learning models in the ONNX format. These models are contributed by community members and accompanied by Jupyter notebooks for model training and running inference with the trained model. (Table of Contents / AI - Frameworks and Toolkits)
awesome-approximate-dnn - ONNX Model Zoo - Collection of pre-trained onnx models (Others / Model ZOO)
awesome-ai-models - onnx/models
awesome-edge-ai-models - ONNX Model Zoo
awesome-opensource-ai - ONNX Model Zoo - Collection of pre-trained, state-of-the-art models in the ONNX format. 80+ models spanning vision, NLP, and audio with validation data and reference implementations. Apache 2.0 licensed. ![GitHub stars](https://img.shields.io/github/stars/onnx/models?style=social) (8. MLOps / LLMOps & Production)
awesome-github-projects - models - A collection of pre-trained, state-of-the-art models in the ONNX format ⭐9,674 `Jupyter Notebook` ⚡ (🤖 AI & Machine Learning)

# ONNX Model Zoo

## Introduction

Welcome to the ONNX Model Zoo! The Open Neural Network Exchange (ONNX) is an open standard format created to represent machine learning models. Supported by a robust community of partners, ONNX defines a common set of operators and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers.

This repository is a curated collection of pre-trained, state-of-the-art models in the ONNX format. These models are sourced from prominent open-source repositories and have been contributed by a diverse group of community members. Our aim is to facilitate the spread and usage of machine learning models among a wider audience of developers, researchers, and enthusiasts.

To handle ONNX model files, which can be large, we use Git LFS (Large File Storage). 

## Models

We are currently expanding the ONNX Model Zoo with additional models from the categories listed below.

While these new models are still being validated for accuracy, refer to the [validated models](#validated-models) section, which lists the models whose accuracy has already been verified. The categories being expanded are:

- Computer Vision

- Natural Language Processing (NLP)

- Generative AI

- Graph Machine Learning

These models are sourced from prominent open-source repositories such as [timm](https://github.com/huggingface/pytorch-image-models), [torchvision](https://github.com/pytorch/vision), [torch_hub](https://pytorch.org/hub/), and [transformers](https://github.com/huggingface/transformers), and exported into the ONNX format using the open-source [TurnkeyML toolchain](https://github.com/onnx/turnkeyml).

## Validated Models

#### Vision

* [Image Classification](#image_classification)

* [Object Detection & Image Segmentation](#object_detection)

* [Body, Face & Gesture Analysis](#body_analysis)

* [Image Manipulation](#image_manipulation)

#### Language

* [Machine Comprehension](#machine_comprehension)

* [Machine Translation](#machine_translation)

* [Language Modelling](#language_modelling)

#### Other

* [Visual Question Answering & Dialog](#visual_qna)

* [Speech & Audio Processing](#speech)

* [Other interesting models](#others)

Read the [Usage](#usage-) section below for more details on the file formats in the ONNX Model Zoo (.onnx, .pb, .npz), downloading multiple ONNX models through [Git LFS command line](#gitlfs-), and starter Python code for validating your ONNX model using test data.

INT8 models are generated by [Intel® Neural Compressor](https://github.com/intel/neural-compressor), an open-source Python library that supports automatic accuracy-driven tuning strategies to help users quickly find the best quantized model. It implements dynamic and static quantization for ONNX models and can represent quantized ONNX models in either an operator-oriented or a tensor-oriented (QDQ) format. Users can perform quantization through a web-based UI service or through Python code. Read the [Introduction](https://github.com/intel/neural-compressor/blob/master/README.md) for more details.
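
To make the tensor-oriented (QDQ) idea more concrete, the following is a minimal pure-Python sketch of the quantize/dequantize arithmetic for an 8-bit affine scheme. It is not Intel® Neural Compressor code: the function names, the unsigned 8-bit range, and the min/max calibration rule are all assumptions chosen for illustration.

```python
def choose_qparams(values, qmin=0, qmax=255):
    """Derive a scale and zero point from the observed range (assumed min/max calibration)."""
    lo = min(min(values), 0.0)  # include 0 in the range so 0.0 stays exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers, clamping to [qmin, qmax]."""
    return [min(qmax, max(qmin, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, zero_point = choose_qparams(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
```

In this sketch each restored value differs from the original by at most one quantization step (`scale`), which is the kind of error an accuracy-driven tuning strategy has to keep within an acceptable budget.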

### Image Classification 

This collection of models takes images as input and classifies the major objects in them into 1000 object categories such as keyboard, mouse, pencil, and many animals.

|Model Class |Reference |Description |Hugging Face Spaces|
|-|-|-|-|
|[MobileNet](validated/vision/classification/mobilenet)|[Sandler et al.](https://arxiv.org/abs/1801.04381)|Light-weight deep neural network best suited for mobile and embedded vision applications. Top-5 error from paper - ~10%|
|[ResNet](validated/vision/classification/resnet)|[He et al.](https://arxiv.org/abs/1512.03385)|A CNN model (up to 152 layers). Uses shortcut connections to achieve higher accuracy when classifying images. Top-5 error from paper - ~3.6%| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/ResNet) |
|[SqueezeNet](validated/vision/classification/squeezenet)|[Iandola et al.](https://arxiv.org/abs/1602.07360)|A light-weight CNN model providing AlexNet level accuracy with 50x fewer parameters. Top-5 error from paper - ~20%| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/SqueezeNet) |
|[VGG](validated/vision/classification/vgg)|[Simonyan et al.](https://arxiv.org/abs/1409.1556)|Deep CNN model (up to 19 layers). Similar to AlexNet but uses multiple smaller kernel-sized filters that provide more accuracy when classifying images. Top-5 error from paper - ~8%| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/VGG) |
|[AlexNet](validated/vision/classification/alexnet)|[Krizhevsky et al.](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)|A Deep CNN model (up to 8 layers) where the input is an image and the output is a vector of 1000 numbers. Top-5 error from paper - ~15%| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/AlexNet) |
|[GoogleNet](validated/vision/classification/inception_and_googlenet/googlenet)|[Szegedy et al.](https://arxiv.org/pdf/1409.4842.pdf)|Deep CNN model (up to 22 layers). Comparatively smaller and faster than VGG and more accurate in detailing than AlexNet. Top-5 error from paper - ~6.7%| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/GoogleNet) |
|[CaffeNet](validated/vision/classification/caffenet)|[Krizhevsky et al.](https://ucb-icsi-vision-group.github.io/caffe-paper/caffe.pdf)|Deep CNN variation of AlexNet for Image Classification in Caffe where the max pooling precedes the local response normalization (LRN) so that the LRN takes less compute and memory.| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/CaffeNet) |
|[RCNN_ILSVRC13](validated/vision/classification/rcnn_ilsvrc13)|[Girshick et al.](https://arxiv.org/abs/1311.2524)|Pure Caffe implementation of R-CNN for image classification. This model uses localization of regions to classify and extract features from images.|
|[DenseNet-121](validated/vision/classification/densenet-121)|[Huang et al.](https://arxiv.org/abs/1608.06993)|Model that has every layer connected to every other layer and passes on its own feature maps, providing strong gradient flow and more diversified features.| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/DenseNet-121) |
|[Inception_V1](validated/vision/classification/inception_and_googlenet/inception_v1)|[Szegedy et al.](https://arxiv.org/abs/1409.4842)|This model is the same as GoogLeNet, implemented through Caffe2. It has improved utilization of the computing resources inside the network and helps with the vanishing gradient problem. Top-5 error from paper - ~6.7%| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/Inception_v1) |
|[Inception_V2](validated/vision/classification/inception_and_googlenet/inception_v2)|[Szegedy et al.](https://arxiv.org/abs/1512.00567)|Deep CNN model for Image Classification as an adaptation to Inception v1 with batch normalization. This model has reduced computational cost and improved image resolution compared to Inception v1. Top-5 error from paper - ~4.82%|
|[ShuffleNet_V1](validated/vision/classification/shufflenet)|[Zhang et al.](https://arxiv.org/abs/1707.01083)|Extremely computation-efficient CNN model that is designed specifically for mobile devices. This model greatly reduces the computational cost and provides a ~13x speedup over AlexNet on ARM-based mobile devices. Compared to MobileNet, ShuffleNet achieves superior performance by a significant margin due to its efficient structure. Top-1 error from paper - ~32.6%|
|[ShuffleNet_V2](validated/vision/classification/shufflenet)|[Ma et al.](https://arxiv.org/abs/1807.11164)|Extremely computation-efficient CNN model that is designed specifically for mobile devices. This network architecture design considers direct metrics such as speed instead of indirect metrics like FLOPs. Top-1 error from paper - ~30.6%|
|[ZFNet-512](validated/vision/classification/zfnet-512)|[Zeiler et al.](https://arxiv.org/abs/1311.2901)|Deep CNN model (up to 8 layers) that increases the number of features the network is capable of detecting, which helps to pick out image features at a finer level of resolution. Top-5 error from paper - ~14.3%| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/ZFNet-512) |
|[EfficientNet-Lite4](validated/vision/classification/efficientnet-lite4)|[Tan et al.](https://arxiv.org/abs/1905.11946)|CNN model with an order of magnitude fewer computations and parameters, while still achieving state-of-the-art accuracy and better efficiency than previous ConvNets. Top-5 error from paper - ~2.9%| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/EfficientNet-Lite4) |
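
The "Top-1" and "Top-5" error figures quoted in the table count a prediction as correct when the true label is among the k highest-scoring classes. A minimal sketch of that metric, assuming each model output is a plain list of per-class scores (the function name and the toy data are illustrative only):

```python
def top_k_error(scores_per_image, true_labels, k=5):
    """Fraction of images whose true label is NOT among the k highest-scoring classes."""
    misses = 0
    for scores, label in zip(scores_per_image, true_labels):
        # Indices of the k largest scores, highest first
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(true_labels)


# Toy example with 4 classes and 3 images
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1: found by top-1
    [0.4, 0.3, 0.2, 0.1],  # true class 1: missed by top-1, found by top-2
    [0.1, 0.2, 0.3, 0.4],  # true class 0: missed even by top-2
]
labels = [1, 1, 0]
top1 = top_k_error(scores, labels, k=1)  # 2 of 3 images missed
top2 = top_k_error(scores, labels, k=2)  # 1 of 3 images missed
```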



#### Domain-based Image Classification 

This subset of models classifies images for specific domains and datasets.

|Model Class |Reference |Description |
|-|-|-|
|[MNIST-Handwritten Digit Recognition](validated/vision/classification/mnist)|[Convolutional Neural Network with MNIST](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_103D_MNIST_ConvolutionalNeuralNetwork.ipynb)|Deep CNN model for handwritten digit identification|



### Object Detection & Image Segmentation 

Object detection models detect the presence of multiple objects in an image and segment out areas of the image where the objects are detected. Semantic segmentation models partition an input image by labeling each pixel into a set of pre-defined categories.

|Model Class |Reference |Description |Hugging Face Spaces |
|-|-|-|-|
|[Tiny YOLOv2](validated/vision/object_detection_segmentation/tiny-yolov2)|[Redmon et al.](https://arxiv.org/pdf/1612.08242.pdf)|A real-time CNN for object detection that detects 20 different classes. A smaller version of the more complex full YOLOv2 network.|
|[SSD](validated/vision/object_detection_segmentation/ssd)|[Liu et al.](https://arxiv.org/abs/1512.02325)|Single Shot Detector: real-time CNN for object detection that detects 80 different classes.|
|[SSD-MobileNetV1](validated/vision/object_detection_segmentation/ssd-mobilenetv1)|[Howard et al.](https://arxiv.org/abs/1704.04861)|A variant of MobileNet that uses the Single Shot Detector (SSD) model framework. The model detects 80 different object classes and locates up to 10 objects in an image.|
|[Faster-RCNN](validated/vision/object_detection_segmentation/faster-rcnn)|[Ren et al.](https://arxiv.org/abs/1506.01497)|Increases efficiency from R-CNN by connecting an RPN with a CNN to create a single, unified network for object detection that detects 80 different classes.| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/faster-rcnn) |
|[Mask-RCNN](validated/vision/object_detection_segmentation/mask-rcnn)|[He et al.](https://arxiv.org/abs/1703.06870)|A real-time neural network for object instance segmentation that detects 80 different classes. Extends Faster R-CNN as each of the 300 selected ROIs goes through 3 parallel branches of the network: label prediction, bounding box prediction and mask prediction.| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/mask-rcnn) |
|[RetinaNet](validated/vision/object_detection_segmentation/retinanet)|[Lin et al.](https://arxiv.org/abs/1708.02002)|A real-time dense detector network for object detection that addresses class imbalance through Focal Loss. RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of the two-stage detectors of its time (such as the R-CNN family).|
|[YOLO v2-coco](validated/vision/object_detection_segmentation/yolov2-coco)|[Redmon et al.](https://arxiv.org/abs/1612.08242)|A CNN model for a real-time object detection system that can detect over 9000 object categories. It uses a single network evaluation, enabling it to be more than 1000x faster than R-CNN and 100x faster than Faster R-CNN. This model is trained with the COCO dataset and contains 80 classes.|
|[YOLO v3](validated/vision/object_detection_segmentation/yolov3)|[Redmon et al.](https://arxiv.org/pdf/1804.02767.pdf)|A deep CNN model for real-time object detection that detects 80 different classes. A little bigger than YOLOv2 but still very fast. As accurate as SSD but 3 times faster.|
|[Tiny YOLOv3](validated/vision/object_detection_segmentation/tiny-yolov3)|[Redmon et al.](https://arxiv.org/pdf/1804.02767.pdf)|A smaller version of the YOLOv3 model.|
|[YOLOv4](validated/vision/object_detection_segmentation/yolov4)|[Bochkovskiy et al.](https://arxiv.org/abs/2004.10934)|Optimizes the speed and accuracy of object detection. Two times faster than EfficientDet. It improves YOLOv3's AP and FPS by 10% and 12%, respectively, with mAP50 of 52.32 on the COCO 2017 dataset and FPS of 41.7 on a Tesla V100.| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/yolov4) |
|[DUC](validated/vision/object_detection_segmentation/duc)|[Wang et al.](https://arxiv.org/abs/1702.08502)|Deep CNN based pixel-wise semantic segmentation model with >80% [mIOU](/models/semantic_segmentation/DUC/README.md/#metric) (mean Intersection Over Union). Trained on the Cityscapes dataset, so it can be effectively applied in self-driving vehicle systems.| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/DUC) |
|[FCN](validated/vision/object_detection_segmentation/fcn)|[Long et al.](https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf)|Deep CNN based segmentation model trained end-to-end, pixel-to-pixel that produces efficient inference and learning. Built off of AlexNet, VGG net, GoogLeNet classification methods. [contribute](contribute.md)| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/FCN) |
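
Detection and segmentation quality is commonly reported with Intersection over Union (IoU), the overlap measure behind the mIOU figure quoted for DUC. As an illustrative sketch for axis-aligned bounding boxes, assuming boxes are given as `(x_min, y_min, x_max, y_max)` tuples (a convention chosen here, not one mandated by the models in this table):

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at 0 when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))   # intersection 1, union 7
disjoint = box_iou((0, 0, 1, 1), (2, 2, 3, 3))  # no overlap
```

For semantic segmentation the same ratio is computed per class over pixel masks rather than boxes, and the per-class values are averaged to obtain mIOU.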



### Body, Face & Gesture Analysis 

Face detection models identify and/or recognize human faces and emotions in given images. Body and Gesture Analysis models identify gender and age in a given image.

|Model Class |Reference |Description |Hugging Face Spaces |
|-|-|-|-|
|[ArcFace](validated/vision/body_analysis/arcface)|[Deng et al.](https://arxiv.org/abs/1801.07698)|A CNN based model for face recognition which learns discriminative features of faces and produces embeddings for input face images.| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/ArcFace) |
|[UltraFace](validated/vision/body_analysis/ultraface)|[Ultra-lightweight face detection model](https://github.com/Linzaer/Ultra-Light-Fast-Generic-Face-Detector-1MB)|This model is a lightweight face detection model designed for edge computing devices.| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/ultraface) |
|[Emotion FerPlus](validated/vision/body_analysis/emotion_ferplus)|[Barsoum et al.](https://arxiv.org/abs/1608.01041)|Deep CNN for emotion recognition trained on images of faces.|
|[Age and Gender Classification using Convolutional Neural Networks](validated/vision/body_analysis/age_gender)|[Rothe et al.](https://data.vision.ee.ethz.ch/cvl/publications/papers/proceedings/eth_biwi_01229.pdf)|This model accurately classifies gender and age even when the amount of training data is limited.|
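
Face recognition models such as ArcFace output one embedding vector per face, and two faces are then commonly compared by the similarity of their embeddings. A minimal sketch using cosine similarity follows; the three-element vectors and the 0.5 decision threshold are made-up illustrative values, not outputs or settings of the ArcFace model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings (real ones have far more dimensions)
person_a_photo1 = [0.9, 0.1, 0.4]
person_a_photo2 = [0.8, 0.2, 0.5]
person_b_photo = [-0.3, 0.9, -0.2]

THRESHOLD = 0.5  # hypothetical value; a real system would tune this on validation data
same_person = cosine_similarity(person_a_photo1, person_a_photo2) > THRESHOLD
different_person = cosine_similarity(person_a_photo1, person_b_photo) > THRESHOLD
```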



### Image Manipulation 

Image manipulation models use neural networks to transform input images to modified output images. Some popular models in this category involve style transfer or enhancing images by increasing resolution.

|Model Class |Reference |Description |Hugging Face Spaces |
|-|-|-|-|
|Unpaired Image to Image Translation using Cycle consistent Adversarial Network|[Zhu et al.](https://arxiv.org/abs/1703.10593)|The model uses learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. [contribute](contribute.md)|
|[Super Resolution with sub-pixel CNN](validated/vision/super_resolution/sub_pixel_cnn_2016)|[Shi et al.](https://arxiv.org/abs/1609.05158)|A deep CNN that uses sub-pixel convolution layers to upscale the input image.| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/sub_pixel_cnn_2016) |
|[Fast Neural Style Transfer](validated/vision/style_transfer/fast_neural_style)|[Johnson et al.](https://arxiv.org/abs/1603.08155)|This method uses a loss network pretrained for image classification to define perceptual loss functions that measure perceptual differences in content and style between images. The loss network remains fixed during the training process.|
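
The upscaling step of a sub-pixel convolution layer can be pictured as a rearrangement: the convolution produces `r * r` low-resolution feature maps, which are then interleaved into one image that is `r` times larger in each dimension. The sketch below shows only that rearrangement for a single output channel, using nested lists. The function name is hypothetical, and the channel ordering (channel `i * r + j` supplies the pixel at offset `(i, j)` inside each `r x r` cell) is an assumption made for illustration.

```python
def pixel_shuffle(channels, r):
    """Interleave r*r low-resolution maps (each H x W) into one (H*r) x (W*r) image."""
    assert len(channels) == r * r, "expected exactly r*r input channels"
    height, width = len(channels[0]), len(channels[0][0])
    out = [[0] * (width * r) for _ in range(height * r)]
    for i in range(r):          # row offset inside each r x r output cell
        for j in range(r):      # column offset inside each r x r output cell
            plane = channels[i * r + j]
            for h in range(height):
                for w in range(width):
                    out[h * r + i][w * r + j] = plane[h][w]
    return out


# Four 2x2 feature maps become a single 4x4 image (upscale factor 2)
low_res = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
    [[13, 14], [15, 16]],
]
high_res = pixel_shuffle(low_res, 2)
```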



### Speech & Audio Processing 

This class of models uses audio data to train models that can identify voice, generate music, or even read text out loud.

|Model Class |Reference |Description |
|-|-|-|
|Speech recognition with deep recurrent neural networks|[Graves et al.](https://www.cs.toronto.edu/~fritz/absps/RNN13.pdf)|An RNN model for sequential data for speech recognition. Labels problems where the input-output alignment is unknown. [contribute](contribute.md)|
|Deep voice: Real time neural text to speech|[Arik et al.](https://arxiv.org/abs/1702.07825)|A DNN model that performs end-to-end neural speech synthesis. Requires fewer parameters and is faster than other systems. [contribute](contribute.md)|
|Sound Generative models|[WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499)|A CNN model that generates raw audio waveforms. Has predictive distribution for each audio sample. Generates realistic music fragments. [contribute](contribute.md)|



### Machine Comprehension 

This subset of natural language processing models answers questions about a given context paragraph.

|Model Class |Reference |Description |Hugging Face Spaces|
|-|-|-|-|
|[Bidirectional Attention Flow](validated/text/machine_comprehension/bidirectional_attention_flow)|[Seo et al.](https://arxiv.org/pdf/1611.01603)|A model that answers a query about a given context paragraph.| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/BiDAF) |
|[BERT-Squad](validated/text/machine_comprehension/bert-squad)|[Devlin et al.](https://arxiv.org/pdf/1810.04805.pdf)|This model answers questions based on the context of the given input paragraph.| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/BERT-Squad) |
|[RoBERTa](validated/text/machine_comprehension/roberta)|[Liu et al.](https://arxiv.org/pdf/1907.11692.pdf)|A large transformer-based model that predicts sentiment based on given input text.| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/RoBERTa) |
|[GPT-2](validated/text/machine_comprehension/gpt-2)|[Radford et al.](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)|A large transformer-based language model that, given a sequence of words within some text, predicts the next word.| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/GPT-2) |
|[T5](validated/text/machine_comprehension/t5)|[Raffel et al.](https://arxiv.org/abs/1910.10683)|A large transformer-based language model trained on multiple tasks at once to achieve better semantic understanding of the prompt, capable of sentiment-analysis, question-answering, similarity-detection, translation, summarization, etc.| [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/onnx/T5) |



### Machine Translation 

This class of natural language processing models learns how to translate input text to another language.

|Model Class |Reference |Description |
|-|-|-|
|Neural Machine Translation by jointly learning to align and translate|[Bahdanau et al.](https://arxiv.org/abs/1409.0473)|Aims to build a single neural network that can be jointly tuned to maximize the translation performance. [contribute](contribute.md)|
|Google's Neural Machine Translation System|[Wu et al.](https://arxiv.org/abs/1609.08144)|Addresses issues faced by Neural Machine Translation (NMT) systems, for example by improving parallelism to accelerate the final translation speed. [contribute](contribute.md)|



### Language Modelling 

This subset of natural language processing models learns representations of language from large corpora of text.

|Model Class |Reference |Description |
|-|-|-|
|Deep Neural Network Language Models|[Arisoy et al.](https://pdfs.semanticscholar.org/a177/45f1d7045636577bcd5d513620df5860e9e5.pdf)|A DNN acoustic model. Used in many natural language technologies. Represents a probability distribution over all possible word strings in a language. [contribute](contribute.md)|



### Visual Question Answering & Dialog 

This subset of natural language processing models uses input images to answer questions about those images.

|Model Class |Reference |Description |
|-|-|-|
|VQA: Visual Question Answering|[Agrawal et al.](https://arxiv.org/pdf/1505.00468v6.pdf)|A model that takes an image and a free-form, open-ended natural language question about the image and outputs a natural-language answer. [contribute](contribute.md)|
|Yin and Yang: Balancing and Answering Binary Visual Questions|[Zhang et al.](https://arxiv.org/pdf/1511.05099.pdf)|Addresses VQA by converting the question to a tuple that concisely summarizes the visual concept to be detected in the image. Next, if the concept can be found in the image, it provides a “yes” or “no” answer. Its performance matches the traditional VQA approach on unbalanced dataset, and outperforms it on the balanced dataset. [contribute](contribute.md)|
|Making the V in VQA Matter|[Goyal et al.](https://arxiv.org/pdf/1612.00837.pdf)|Balances the VQA dataset by collecting complementary images such that every question is associated with a pair of similar images that result in two different answers to the question, providing a unique interpretable model that provides a counter-example based explanation. [contribute](contribute.md)|
|Visual Dialog|[Das et al.](https://arxiv.org/abs/1611.08669)|An AI agent that holds a meaningful dialog with humans in natural, conversational language about visual content. Curates a large-scale Visual Dialog dataset (VisDial). [contribute](contribute.md)|



### Other interesting models 

There are many interesting deep learning models that do not fit into the categories described above. The ONNX team would like to highly encourage users and researchers to [contribute](contribute.md) their models to the growing model zoo.

|Model Class |Reference |Description |
|-|-|-|
|Text to Image|[Generative Adversarial Text to image Synthesis](https://arxiv.org/abs/1605.05396)|Effectively bridges the advances in text and image modeling, translating visual concepts from characters to pixels. Generates plausible images of birds and flowers from detailed text descriptions. [contribute](contribute.md)|
|Time Series Forecasting|[Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks](https://arxiv.org/pdf/1703.07015.pdf)|The model extracts short-term local dependency patterns among variables and discovers long-term patterns for time series trends. It helps to predict solar plant energy output, electricity consumption, and traffic jam situations. [contribute](contribute.md)|
|Recommender systems|[DropoutNet: Addressing Cold Start in Recommender Systems](http://www.cs.toronto.edu/~mvolkovs/nips2017_deepcf.pdf)|A collaborative filtering method that makes predictions about an individual’s preference based on preference information from other users. [contribute](contribute.md)|
|Collaborative filtering|[Neural Collaborative Filtering](https://arxiv.org/pdf/1708.05031.pdf)|A DNN model based on the interaction between user and item features using matrix factorization. [contribute](contribute.md)|
|Autoencoders|[A Hierarchical Neural Autoencoder for Paragraphs and Documents](https://arxiv.org/abs/1506.01057)|An LSTM (long short-term memory) auto-encoder to preserve and reconstruct multi-sentence paragraphs. [contribute](contribute.md)|



## Usage 

Every ONNX backend should support running the models out of the box. After downloading and extracting the tarball of each model, you will find:

- A protobuf file `model.onnx` that represents the serialized ONNX model.

- Test data (in the form of serialized protobuf TensorProto files or serialized NumPy archives).

### Usage - Test data starter code

The test data files can be used to validate ONNX models from the Model Zoo. We have provided the following interface examples for you to get started. Please replace `onnx_backend` in your code with the appropriate framework of your choice that provides ONNX inferencing support, and likewise replace `backend.run_model` with the framework's model evaluation logic.

There are two different formats for the test data files:

- Serialized protobuf TensorProtos (.pb), stored in folders with the naming convention `test_data_set_*`.

```python
import numpy as np
import onnx
import os
import glob
import onnx_backend as backend
from onnx import numpy_helper

model = onnx.load('model.onnx')
test_data_dir = 'test_data_set_0'

# Load inputs
inputs = []
inputs_num = len(glob.glob(os.path.join(test_data_dir, 'input_*.pb')))
for i in range(inputs_num):
    input_file = os.path.join(test_data_dir, 'input_{}.pb'.format(i))
    tensor = onnx.TensorProto()
    with open(input_file, 'rb') as f:
        tensor.ParseFromString(f.read())
    inputs.append(numpy_helper.to_array(tensor))

# Load reference outputs
ref_outputs = []
ref_outputs_num = len(glob.glob(os.path.join(test_data_dir, 'output_*.pb')))
for i in range(ref_outputs_num):
    output_file = os.path.join(test_data_dir, 'output_{}.pb'.format(i))
    tensor = onnx.TensorProto()
    with open(output_file, 'rb') as f:
        tensor.ParseFromString(f.read())
    ref_outputs.append(numpy_helper.to_array(tensor))

# Run the model on the backend
outputs = list(backend.run_model(model, inputs))

# Compare the results with reference outputs.
for ref_o, o in zip(ref_outputs, outputs):
    np.testing.assert_almost_equal(ref_o, o)
```
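
As one possible substitution for the `onnx_backend` placeholder, the sketch below runs a model with ONNX Runtime (`pip install onnxruntime`), which is only one of several runtimes that could be used. The helper names `build_feed` and `run_with_onnxruntime` are hypothetical; `InferenceSession`, `get_inputs`, and `run` belong to ONNX Runtime's Python API. The runtime is imported inside the function so that the helpers can be defined even where it is not installed.

```python
def build_feed(input_names, arrays):
    """Pair the model's input names with the loaded test arrays, in order."""
    if len(input_names) != len(arrays):
        raise ValueError(
            "model expects {} inputs, got {}".format(len(input_names), len(arrays))
        )
    return dict(zip(input_names, arrays))


def run_with_onnxruntime(model_path, arrays):
    """Run the model stored at model_path on a list of NumPy arrays and return its outputs."""
    import onnxruntime as ort  # lazy import: assumed to be installed only when this is called

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_names = [model_input.name for model_input in session.get_inputs()]
    # Passing None as the first argument requests all model outputs
    return session.run(None, build_feed(input_names, arrays))
```

With the loading code above, `outputs = run_with_onnxruntime('model.onnx', inputs)` would stand in for `backend.run_model(model, inputs)`. Because runtimes can differ slightly in floating-point results, a tolerance-based check such as `np.testing.assert_allclose(ref_o, o, rtol=1e-3, atol=1e-5)` may be more forgiving than the default 7-decimal comparison of `assert_almost_equal`; the tolerances shown are illustrative, not recommended values.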

- Serialized NumPy archives, stored in files with the naming convention `test_data_*.npz`. Each file contains one set of test inputs and outputs.

```python
import numpy as np
import onnx
import onnx_backend as backend

# Placeholder paths: point these at the extracted model and one of its test archives
model_pb_path = 'model.onnx'
npz_path = 'test_data_0.npz'

# Load the model and sample inputs and outputs
model = onnx.load(model_pb_path)
sample = np.load(npz_path, encoding='bytes')
inputs = list(sample['inputs'])
outputs = list(sample['outputs'])

# Run the model with an onnx backend and verify the results
np.testing.assert_almost_equal(outputs, backend.run_model(model, inputs))
```

### Usage - Model quantization

You can get quantized ONNX models by using [Intel® Neural Compressor](https://github.com/intel/neural-compressor). It provides a web-based UI service to make quantization easier and supports code-based usage for finer-grained quantization settings. Refer to the [bench document](https://github.com/intel/neural-compressor/blob/master/docs/bench.md) for how to use the web-based UI service and to the [example document](./resource/docs/INC_code.md) for a simple code-based demo.

![image](./resource/images/INC_GUI.gif)

## Usage

There are multiple ways to access the ONNX Model Zoo:

### Git Clone (Not Recommended)

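Cloning the repository using git won't automatically download the ONNX models due to their size. To manage these files, first install [Git LFS](https://git-lfs.com) (for example, through your system's package manager), then enable it for your user account by running:

```bash
git lfs install
```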

To download a specific model:

```bash

git lfs pull --include="[path to model].onnx" --exclude=""

```

To download all models:

```bash

git lfs pull --include="*" --exclude=""

```
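
After a plain `git clone` without Git LFS, each `.onnx` path holds a small text pointer rather than the model itself. The sketch below is one way to tell the two apart before trying to load a file. It relies on Git LFS pointer files beginning with the line `version https://git-lfs.github.com/spec/v1`; the function name `is_lfs_pointer` and the temporary demo files are hypothetical.

```python
import os
import tempfile

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like a Git LFS pointer instead of real model data."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


# Demonstration on two temporary files standing in for a pointer and a downloaded model
with tempfile.TemporaryDirectory() as tmp:
    pointer_path = os.path.join(tmp, "pointer.onnx")
    model_path = os.path.join(tmp, "model.onnx")
    with open(pointer_path, "wb") as f:
        f.write(LFS_POINTER_PREFIX + b"\noid sha256:<hash>\nsize <bytes>\n")
    with open(model_path, "wb") as f:
        f.write(b"\x08\x07\x12\x04demo")  # arbitrary binary bytes, not a valid model
    pointer_detected = is_lfs_pointer(pointer_path)
    model_detected = is_lfs_pointer(model_path)
```

If the check returns `True` for a model path, running the `git lfs pull` command for that file should replace the pointer with the actual model.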

### GitHub UI

Alternatively, you can download models directly from GitHub. Navigate to the model's page and click the "Download" button on the top right corner.

## Model Visualization

For a graphical representation of each model's architecture, we recommend using [Netron](https://github.com/lutzroeder/netron).

## Contributions

Contributions to the ONNX Model Zoo are welcome! Please check our [contribution guidelines](contribute.md) for more information on how you can contribute to the growth and improvement of this resource.

Thank you for your interest in the ONNX Model Zoo, and we look forward to your participation in our community!

# License

[Apache License v2.0](LICENSE)