ONNX Runtime

ONNX Runtime (ORT) is a comprehensive open-source ecosystem that accelerates the performance of Machine Learning (ML) models built on the Open Neural Network Exchange (ONNX) standard, a common file format and operator set that allows interoperability among dedicated ML frameworks such as PyTorch, TensorFlow/Keras, TFLite, scikit-learn, and others. By supporting a broad range of programming languages and platforms, ONNX Runtime enables efficient, optimized deployment of ML models on diverse hardware, including CPUs, GPUs, and mobile devices.

Using ONNX Runtime, developers can switch between frameworks at different stages of the development process, such as rapid training, flexible network design, and inferencing. This flexibility allows developers to choose the most appropriate framework for their specific use case and integrate it with the rest of their ML pipeline. The standardization provided by the ONNX format reduces the complexity of the ML development process and ensures that models can be easily ported across environments.

Fundamentals

ORT for Inference

In machine learning (as opposed to the statistical sense of the term), inference refers to the process of applying a trained model to new input data to make predictions or decisions. The input data is fed into the model, which uses its learned parameters to generate an output representing the predicted outcome for that input. This output can take many forms, depending on the type of model and the nature of the problem being solved.

Inference is a critical step in the machine learning workflow, as it enables models to be used in real-world applications to make decisions or provide insights. ONNX Runtime speeds up the inference process by leveraging Execution Providers (EPs), hardware-specific optimization modules for targets such as CPUs, GPUs, and FPGAs. Some use cases include:

  • Accelerating inference in ML models.
  • Running models across different hardware and OS configurations.
  • Training models in Python and deploying them into a C#, C++, or Java app (see the export sketch after this list).
  • Training and executing models created in different frameworks.
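
As a brief sketch of that cross-framework workflow (the model, file name, and input shape below are hypothetical examples, not taken from the ONNX Runtime documentation), a model trained in PyTorch can be exported to the ONNX format and then served by ONNX Runtime from Python, C#, C++, or Java:

import torch
import torch.nn as nn

# Hypothetical trained model: a single linear layer used only for illustration.
model = nn.Linear(in_features=4, out_features=2)
model.eval()

# Example input that defines the graph's input shape during tracing.
dummy_input = torch.randn(1, 4)

# Export to ONNX; the resulting file can be loaded by ONNX Runtime
# in any of its supported languages.
torch.onnx.export(
    model,
    dummy_input,
    "linear_model.onnx",
    input_names=["input"],
    output_names=["output"],
)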

Hardware Acceleration

ONNX Runtime supports hardware acceleration through Execution Providers, which enable developers to optimize the execution of their ONNX models by leveraging the specific compute capabilities of the hardware platform. It offers a flexible and extensible architecture, allowing developers to deploy their ONNX models in different environments, including on-premises, cloud, and edge devices, and take advantage of the available hardware resources to accelerate the model inference.

Figure: ONNX Runtime Execution Providers. Source: ONNX Runtime.

Developers can take advantage of a wide selection of EPs optimized for various hardware architectures. ONNX Runtime calls each registered EP's GetCapability() interface to determine which parts of the model graph that provider can execute and assigns the remaining nodes to other providers. This lets developers customize and configure their machine learning models to execute on the most suitable hardware platform for their specific use case, optimizing performance, minimizing latency, and reducing energy consumption.
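
As a minimal sketch of how provider selection looks in Python (the model file name is a placeholder, and the CUDA provider is only available with the GPU package), a priority-ordered list of EPs can be passed when the inference session is created:

import onnxruntime as ort

# Priority-ordered list: ONNX Runtime falls back to the CPU provider for any
# part of the graph the CUDA provider cannot handle.
session = ort.InferenceSession(
    "your_model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Execution providers actually enabled for this session.
print(session.get_providers())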

Execution Providers List

  • Intel DNNL: Uses Intel’s Deep Neural Network Library (DNNL) to accelerate inference on Intel CPUs. It contains vectorized and threaded building blocks that you can use to implement deep neural networks (DNNs) with C and C++ interfaces.
  • TVM: Enables ONNX Runtime users to leverage Apache TVM model optimizations. It has been tested on a handful of models on Linux and Windows, but not on macOS.
  • Intel OpenVINO: Uses Intel’s OpenVINO toolkit to optimize inference performance on Intel CPUs, GPUs, and VPUs.
  • XNNPACK: Enables users to accelerate ONNX models on Android/iOS devices and WebAssembly. It is a highly optimized library of floating-point neural network inference operators for ARM and x86 platforms.
  • NVIDIA CUDA: Uses NVIDIA’s CUDA toolkit to execute models on GPUs.
  • NVIDIA TensorRT: Leverages NVIDIA’s TensorRT deep learning inference engine to accelerate inference on NVIDIA GPUs.
  • DirectML: Uses Microsoft’s DirectML API to accelerate inference on Windows devices that support DirectX 12. It provides GPU acceleration for ML tasks across a broad range of supported hardware and drivers.
  • AMD MIGraphX: Uses AMD’s deep learning graph optimization engine to accelerate inference on AMD GPUs.
  • AMD ROCm: Enables hardware-accelerated computation on AMD ROCm-enabled GPUs.
  • ARM Compute Library: Accelerates the performance of ONNX model workloads across Armv8 cores.
  • Android Neural Networks API: Uses Android’s Neural Networks API to execute models on Android devices.
  • ARM-NN: Accelerates the performance of ONNX model workloads across Armv8 cores.
  • CoreML: Uses the CPU, GPU, and Apple Neural Engine to improve performance on Apple devices.
  • Rockchip NPU: Enables deep learning inference on Rockchip NPUs via the RKNPU DDK.
  • Xilinx Vitis AI: Xilinx’s development stack for hardware-accelerated AI inference on Xilinx platforms, including edge devices and Alveo cards.
  • Huawei CANN: Helps users quickly build AI applications and services based on the Ascend platform.
Source: ONNX Runtime Execution Providers.

Quick Installation Guide in Python

In this quick guide, we describe the process for Python; please refer to the ORT documentation for other programming languages. With a model saved in the ONNX format, you can perform inferencing with ONNX Runtime. Use the command below to install the GPU package:

pip install onnxruntime-gpu

Or use the CPU package below if your code runs on Arm CPUs and/or macOS:

pip install onnxruntime
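
As a quick sanity check (a minimal sketch, assuming the package installed correctly), you can print the installed version and the execution providers your build exposes:

import onnxruntime as ort

# The GPU package should list CUDAExecutionProvider alongside CPUExecutionProvider.
print(ort.__version__)
print(ort.get_available_providers())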

Now, create an inference session to work with your model.

import onnxruntime as ort

# Create an inference session that loads the ONNX model.
ort_session = ort.InferenceSession("your_model.onnx")

Run the session with the desired output names (passing None returns all outputs) and a dictionary mapping input names to values to get the prediction.

# None requests all outputs; the dictionary maps input names to input data.
prediction = ort_session.run(None, {"input1": value})
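
If you do not know the model’s input names in advance ("input1" above is just a placeholder), the session can report them; a minimal sketch:

# Inspect the model's declared inputs to build the feed dictionary.
for model_input in ort_session.get_inputs():
    print(model_input.name, model_input.shape, model_input.type)

# Use the first input's reported name; value is the input data (e.g., a NumPy array).
input_name = ort_session.get_inputs()[0].name
prediction = ort_session.run(None, {input_name: value})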

For more information, you can explore the ORT tutorials GitHub repo. These tutorials offer detailed explanations and practical examples of using ORT with ML frameworks and cloud services.

Highlights

Project Background

  • Project: ONNX Runtime
  • Author: Microsoft (the ONNX format itself was co-developed by Microsoft and Facebook)
  • Initial Release: December 2018
  • Type: Machine Learning Accelerator
  • License: MIT
  • Contains: Set of libraries, executable installers, and NuGet packages.
  • Language: C#, C++, C, Python, Assembly, CUDA, Java
  • GitHub: microsoft/onnxruntime
  • Runs On: Linux, Windows, and macOS
  • Twitter: ONNXRuntime
  • Stack Overflow: ONNXruntime
  • Youtube: ONNX Runtime

Main Features

  • It optimizes inference tasks of ML models for CPU and GPU.
  • It supports various hardware and software platforms, including Windows, Linux, and macOS.
  • It integrates various frameworks, such as PyTorch, TensorFlow, scikit-learn, and more.
  • It is available in several programming languages.
  • It supports a wide range of custom execution providers.

Prior Knowledge Requirements

  • Knowledge of machine learning principles, neural networks, and deep learning architectures.
  • Familiarity with the Open Neural Network Exchange (ONNX) format.
  • Basic knowledge of programming languages, such as C#, C++, C, Python, Assembly, Cuda, and Java.
  • Familiarity with data structures, such as arrays, tensors, and DataFrames.
  • Familiarity with platforms on which you intend to deploy your models.

Projects and Organizations Using ORT

  • Adobe: “With ONNX Runtime, Adobe Target got flexibility and standardization in one package: flexibility for our customers to train ML models in the frameworks of their choice, and standardization to robustly deploy those models at scale for fast inference, to deliver true, real-time personalized experiences.” –Georgiana Copil, Senior Computer Scientist, Adobe.
  • AMD: “The ONNX Runtime integration with AMD’s ROCm open software ecosystem helps our customers leverage the power of AMD Instinct GPUs to accelerate and scale their large machine learning models with flexibility across multiple frameworks.” –Andrew Dieckmann, Corporate Vice President and General Manager, AMD Data Center GPU & Accelerated Processing.
  • Ant Group: “Using ONNX Runtime, we have improved the inference performance of many computer vision (CV) and natural language processing (NLP) models trained by multiple deep learning frameworks. These are part of the Alipay production system. We plan to use ONNX Runtime as the high-performance inference backend for more deep learning models in broad applications, such as click-through rate prediction and cross-modal prediction.” –Xiaoming Zhang, Head of Inference Team, Ant Group.
  • Intel: “We are excited to support ONNX Runtime on the Intel® Distribution of OpenVINO™. This accelerates machine learning inference across Intel hardware and gives developers the flexibility to choose the combination of Intel hardware that best meets their needs from CPU to VPU or FPGA.” –Jonathan Ballon, Vice President and General Manager, Intel Internet of Things Group.
  • Oracle: “The ONNX Runtime API for Java enables Java developers and Oracle customers to seamlessly consume and execute ONNX machine-learning models, taking advantage of the expressive power, high performance, and scalability of Java.” –Stephen Green, Director of Machine Learning Research Group, Oracle.
  • Source: ORT community

Community Benchmarks

  • 8,300 Stars
  • 1,900 Forks
  • 460+ Code contributors
  • Source: GitHub

Releases

  • ONNX Runtime v1.14.0 (2-2023); Update: Improves memory efficiency by enabling GPU memory reuse across different streams. 
  • ONNX Runtime v1.13.1 (10-24-2022); Update: Exposes all arena configs in the Python API in an extensible way.
  • ONNX Runtime v1.12.0 (7-21-2022); Update: Support for invoking individual ops without the need to create a separate graph.
  • ONNX Runtime v1.11.0 (3-26-2022); Update: Memory utilization-related performance improvements.
  • Source: releases.

References

[1] GitHub.

[2] onnxruntime.ai
