Compare the differences between the NXP i.MX8 Graphic Accelerators

Thursday, 25 Aug 2022 | Diego Dorta

1. i.MX 8 Overview

Variscite’s i.MX8 family of System on Modules (SoMs) provides pin-to-pin scalability across the i.MX 8, i.MX 8X, and i.MX 8M families of NXP SoCs. Each SoM features a GPU that enables hardware acceleration of both graphical and computational applications. This article introduces Variscite customers to the GPU and provides practical guidance on selecting the correct SoM. It covers:

A comparison of the GPUs of each SoM in Variscite’s i.MX8 family
An introduction to the software APIs available to accelerate graphical and computational tasks
OpenGL ES benchmark results using GPU acceleration
Benchmark results of various calculations using CPU and GPU
Helpful tips and guidelines for getting the best results
Next steps for selecting the correct SoM for your application

1.1 CPU and GPU

Before getting started, it is worth reviewing the difference between the Central Processing Unit (CPU) and the Graphics Processing Unit (GPU), and why the GPU matters so much for graphical applications.
The CPU is designed to handle a wide range of tasks quickly, but it is limited in how many tasks it can execute simultaneously. The GPU, on the other hand, is designed for massive parallelism and is well suited to graphics workloads such as rendering high-resolution images and video.
CPUs and GPUs differ in their overall architecture: the CPU is “narrow and deep”, while the GPU is “shallow and wide”. Generally, the CPU has large memory caches, performs branch prediction, and runs at a much higher clock speed than the GPU. The GPU contains several compute units, each much simpler than a CPU core, that can execute a single instruction in parallel on multiple data elements. The GPU does not perform branch prediction.

For example, code with a lot of branching, such as “if-else” statements, executes more efficiently on a CPU because of its branch prediction capabilities, whereas code dominated by Single Instruction Multiple Data (SIMD) operations executes more efficiently on a GPU because of its parallel processing capabilities.
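To make the contrast concrete, here is a minimal Python sketch. It is illustrative only: the function names are hypothetical, and nothing here runs on a GPU. `binary_search` takes a data-dependent branch at every step, so each iteration depends on the previous comparison. `scale_and_add` applies the same operation to every element independently, which is the pattern that maps well onto a GPU's compute units.

```python
def binary_search(sorted_values, target):
    """Branch-heavy: the next step depends on the result of each comparison."""
    lo, hi = 0, len(sorted_values) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_values[mid] == target:
            return mid
        elif sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1  # target not present


def scale_and_add(a, x, y):
    """Data-parallel: the same instruction on every element, with no
    dependency between elements, so all of them could run at once."""
    return [a * xi + yi for xi, yi in zip(x, y)]


print(binary_search([1, 3, 5, 7, 9, 11], 7))
print(scale_and_add(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```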

i.MX8 System on Module with 3D GPU - GC7000 XSVX (x2)

2. GPU on i.MX 8 Application Processors

The i.MX 8 applications processors family from NXP incorporates GPUs from Vivante (VeriSilicon). These GPUs accelerate different user-level Application Programming Interfaces (APIs). See the following table with a general overview of the processors and their current support:

i.MX 8 Series Comparison Table

NXP SoC	Variscite System on Module	3D GPU	2D GPU	3D API	2D API	Compute API	Other APIs
i.MX 8	VAR-SOM-MX8, SPEAR-MX8	GC7000 XSVX (x2)	High Perf 2D Blit Engine	GLES, Vulkan	OpenVG, G2D	OpenCL	OpenVX
i.MX 8X	VAR-SOM-MX8X	GC7000 Lite	High Perf 2D Blit Engine	GLES, Vulkan	OpenVG, G2D	OpenCL	N/A
i.MX 8M	DART-MX8M	GC7000 Lite	N/A	GLES, Vulkan	OpenVG	OpenCL	N/A
i.MX 8M Mini	DART-MX8M-MINI, VAR-SOM-MX8M-MINI	GCNano Ultra	GC320	GLES	OpenVG, G2D	N/A	N/A
i.MX 8M Nano	VAR-SOM-MX8M-NANO	GC7000 Ultra Lite	N/A	GLES, Vulkan	OpenVG, G2D	OpenCL	N/A
i.MX 8M Plus	DART-MX8M-PLUS, VAR-SOM-MX8M-PLUS	GC7000 Ultra Lite	GC520L	GLES, Vulkan	OpenVG, G2D	OpenCL	OpenVX

The listed APIs are defined by the Khronos Group, an open, non-profit, member-driven consortium of over 160 industry-leading companies that creates advanced, royalty-free interoperability standards for 3D graphics, augmented and virtual reality, parallel programming, vision acceleration, and machine learning:

OpenCL: used for cross-platform, parallel programming of modern processors.
- https://www.khronos.org/opencl
OpenGL ES (OpenGL for Embedded Systems): used for rendering advanced 2D and 3D graphics on embedded systems.
- https://www.khronos.org/opengles
OpenVG: used for advanced user interfaces and vector graphics libraries such as SVG.
- https://www.khronos.org/openvg
OpenVX: enables performance- and power-optimized computer vision processing, which is important in embedded and real-time use cases such as face, body, and gesture tracking, and intelligent video surveillance.
- https://www.khronos.org/openvx
Vulkan: enables developers to target a wide range of devices with the same graphics API.
- https://www.vulkan.org
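As a first filter when choosing a SoM, the comparison table can be turned into a small lookup. The following Python sketch is only an illustration: the `SOC_TABLE` structure and the `soms_supporting` helper are hypothetical names, and the data is copied from the API columns of the table above.

```python
# SoC -> (Variscite SoMs, APIs listed for that SoC in the comparison table).
SOC_TABLE = {
    "i.MX 8": (["VAR-SOM-MX8", "SPEAR-MX8"],
               {"GLES", "Vulkan", "OpenVG", "G2D", "OpenCL", "OpenVX"}),
    "i.MX 8X": (["VAR-SOM-MX8X"],
                {"GLES", "Vulkan", "OpenVG", "G2D", "OpenCL"}),
    "i.MX 8M": (["DART-MX8M"],
                {"GLES", "Vulkan", "OpenVG", "OpenCL"}),
    "i.MX 8M Mini": (["DART-MX8M-MINI", "VAR-SOM-MX8M-MINI"],
                     {"GLES", "OpenVG", "G2D"}),
    "i.MX 8M Nano": (["VAR-SOM-MX8M-NANO"],
                     {"GLES", "Vulkan", "OpenVG", "G2D", "OpenCL"}),
    "i.MX 8M Plus": (["DART-MX8M-PLUS", "VAR-SOM-MX8M-PLUS"],
                     {"GLES", "Vulkan", "OpenVG", "G2D", "OpenCL", "OpenVX"}),
}


def soms_supporting(*required_apis):
    """Return the SoMs whose SoC lists every one of the required APIs."""
    required = set(required_apis)
    return [som
            for soms, apis in SOC_TABLE.values()
            if required <= apis
            for som in soms]


print(soms_supporting("OpenCL", "OpenVX"))
```

A lookup like this only narrows the candidates by API support; it says nothing about performance, which the benchmark sections address.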

For more information about the clock speeds, number of shaders, GFLOPS, etc. please refer to:

NXP i.MX Graphics User’s Guide

i.MX8-based evaluation kit with 2D/3D graphics acceleration

3. 2D/3D Graphics Acceleration

Variscite’s Board Support Package (BSP) comes with several examples for the most common APIs, such as OpenCL, OpenVG, and OpenGL ES. These examples can be found in the /opt folder of recent Yocto releases. For example, to run the Bloom OpenGL ES demo, run:

# cd /opt/imx-gpu-sdk/GLES3/Bloom

# ./GLES3.Bloom_Wayland

If the application requires a Graphical User Interface (GUI), use OpenGL ES (a.k.a. GLES), a subset of the OpenGL API for rendering 2D and 3D graphics on embedded systems. An application can use GLES acceleration either directly or through a higher-level framework built on top of it, so there are basically two options for creating a hardware-accelerated application:

OpenGL ES: writing the application directly against the native API, which gives the user the most control. However, this is also the most demanding way to achieve the expected results.
Library/Framework: using a framework that already supports OpenGL ES graphics acceleration, for example, Qt.

3.1 Tests with the glmark2 Benchmark

To estimate GPU performance, the following tests were run with the glmark2 benchmark tool, with the CPU scaling governor set to performance and a heatsink on the SoC.
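The article does not show how the governor was set. One common way on Linux is through the standard cpufreq sysfs files, sketched below in Python. This assumes the BSP kernel exposes `/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor` and that the script runs as root; the helper name `set_scaling_governor` is hypothetical.

```python
import glob
import os


def set_scaling_governor(governor="performance",
                         cpu_root="/sys/devices/system/cpu"):
    """Write `governor` to every cpuN/cpufreq/scaling_governor under cpu_root.

    Returns the list of files written. Needs root on a real target.
    """
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    written = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as f:
            f.write(governor)
        written.append(path)
    return written
```

On the target, calling `set_scaling_governor()` before the benchmark and reading the same files back afterwards is a quick way to confirm the setting took effect.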

Running the Tests with glmark2

The tests were executed with the following command:

# glmark2-es2-wayland -s 640x480

The taskset command can be used to run the benchmark on specific CPU cores. For example, on SoMs powered by the i.MX 8, which has 4x A53 and 2x A72 cores, you can select a set of cores of the same type (A53 or A72) by specifying the core numbers:

# taskset -c 0-3 glmark2-es2-wayland -s 640x480

# taskset -c 4,5 glmark2-es2-wayland -s 640x480

NOTE: Even though the examples above use sets of the same core types (A53 or A72), you may use any combination of cores.
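To repeat the run over several core sets, the command line can be generated programmatically. This Python sketch builds the same `taskset -c` invocations shown above; the helper names are hypothetical, and the core numbering (0-3 for A53, 4-5 for A72) is taken from those commands. `run_pinned` is only meant to be called on a target where `taskset` and `glmark2-es2-wayland` are installed.

```python
import subprocess


def taskset_command(cores, benchmark=("glmark2-es2-wayland", "-s", "640x480")):
    """Build the argv that runs `benchmark` pinned to the given core numbers."""
    core_list = ",".join(str(c) for c in cores)
    return ["taskset", "-c", core_list, *benchmark]


def run_pinned(cores):
    """Run the pinned benchmark on the target and return its exit code."""
    return subprocess.run(taskset_command(cores), check=False).returncode


# Print the commands for a few core sets without running them.
core_sets = {"4x A53": [0, 1, 2, 3], "2x A72": [4, 5], "1x A72": [4]}
for label, cores in core_sets.items():
    print(label, "->", " ".join(taskset_command(cores)))
```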

Test Score Results

Variscite System on Module	BSP Version	Module Version	DRAM	GL Renderer	GL Version	Score
VAR-SOM-MX8 4x A53	5.4.85	V1.3	4GB	GC7000XSVX	OpenGL ES 3.2	2151
VAR-SOM-MX8 2x A72	5.4.85	V1.3	4GB	GC7000XSVX	OpenGL ES 3.2	2315
VAR-SOM-MX8 6x cores	5.4.85	V1.3	4GB	GC7000XSVX	OpenGL ES 3.2	2121
SPEAR-MX8 4x A53	5.4.85	V1.2	4GB	GC7000XSVX	OpenGL ES 3.2	2070
SPEAR-MX8 2x A72	5.4.85	V1.2	4GB	GC7000XSVX	OpenGL ES 3.2	2256
SPEAR-MX8 6x cores	5.4.85	V1.2	4GB	GC7000XSVX	OpenGL ES 3.2	2060
VAR-SOM-MX8X	4.14	V1.2	2GB	GC7000L	OpenGL ES 3.1	952
DART-MX8M	5.4.142	V1.3	4GB	GC7000L	OpenGL ES 3.1	946
DART-MX8M-MINI	5.4.142	V1.2	2GB	GC7000 NanoUltra	OpenGL ES 2.0	389
VAR-SOM-MX8M-MINI	5.4.142	V1.3	2GB	GC7000 NanoUltra	OpenGL ES 2.0	381
VAR-SOM-MX8M-NANO	5.4.142	V1.3	1GB	GC7000UL	OpenGL ES 3.1	361
DART-MX8M-PLUS	5.4.70	V1.1	4GB	GC7000UL	OpenGL ES 3.1	992
VAR-SOM-MX8M-PLUS	5.4.70	V1.1	4GB	GC7000UL	OpenGL ES 3.1	987

Single Core vs Multicore Results

An interesting observation from the table above is that, on both the SPEAR-MX8 and the VAR-SOM-MX8, running on the 2x A72 cores alone scored higher than running on all 6 cores (2x A72 + 4x A53). The following table shows the results on the SPEAR-MX8 when running glmark2-es2-wayland on one or more specific CPU cores:

Variscite System on Module	Used CPU Cores	CPU Max Usage	Score
SPEAR-MX8	1x A53	95%	1462
SPEAR-MX8	2x A53	130%	2039
SPEAR-MX8	3x A53	126%	2042
SPEAR-MX8	4x A53	124%	2070
SPEAR-MX8	1x A72	85%	1843
SPEAR-MX8	2x A72	112%	2256

Although the test benchmarks the GPU, the score also depends on the CPU. For this specific application, using the 2x A72 cores produces the best results, because the A72 cores are faster than the A53 cores. The task gains from a second core but very little beyond that: CPU usage peaks at about 130%, and the score barely changes from 2x to 4x A53. Since the task cannot make use of many CPUs, assigning it to the fastest cores produces the best results.
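The scaling behavior is easier to see when the scores are normalized. This short Python sketch uses the SPEAR-MX8 scores from the table above and expresses each one relative to the 1x A53 result; the variable names are arbitrary.

```python
# glmark2 scores on the SPEAR-MX8, copied from the table above.
scores = {
    "1x A53": 1462,
    "2x A53": 2039,
    "3x A53": 2042,
    "4x A53": 2070,
    "1x A72": 1843,
    "2x A72": 2256,
}

baseline = scores["1x A53"]
relative = {cores: round(score / baseline, 2) for cores, score in scores.items()}

for cores, ratio in relative.items():
    print(f"{cores}: {ratio:.2f}x the 1x A53 score")
```

Going from one to two A53 cores raises the score by roughly 40%, while the third and fourth cores add almost nothing, which matches the CPU usage figures in the table.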

3.2 OpenCL Acceleration

OpenCL (Open Computing Language) is an open, royalty-free standard for cross-platform, parallel programming of diverse accelerators found in supercomputers, cloud servers, personal computers, mobile devices and embedded platforms. Below are benchmark results and links to the source code of two examples written both for the CPU and GPU cores using OpenCL:

Matrix Multiplication (512×512)

Variscite System on Module	Used CPU Cores	CPU Time (s)	GPU Time (s)
VAR-SOM-MX8	1x A72@600MHz	3.64	0.67
DART-MX8M-PLUS	1x A53@1200MHz	15.67	1.66

The source code example for this test can be found at Matrix Multiplication Example.

Binary Search

Variscite System on Module	Used CPU Cores	CPU Time (ms)	GPU Time (ms)
VAR-SOM-MX8	1x A72@600MHz	0.008	0.109
DART-MX8M-PLUS	1x A53@1200MHz	0.007	0.125

The source code example for this test can be found at Binary Search Example.

The matrix multiplication example runs faster on the GPU, while the binary search example runs faster on the CPU. This makes sense, because:

a. The binary search example uses a lot of branches (if-else statements), which makes it more suitable for the CPU. The matrix multiplication example has almost none.
b. The matrix multiplication example applies the same instructions to many data elements, which makes it more suitable for the GPU.
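The size of the effect can be read directly from the two tables. The Python sketch below computes the CPU-to-GPU time ratio for each row; a value above 1 means the GPU was faster. The timings are copied from the tables above (seconds for matrix multiplication, milliseconds for binary search, which does not matter for a ratio within one row), and `gpu_speedup` is a hypothetical helper name.

```python
# (example, SoM) -> (CPU time, GPU time), copied from the tables above.
results = {
    ("Matrix multiplication", "VAR-SOM-MX8"): (3.64, 0.67),
    ("Matrix multiplication", "DART-MX8M-PLUS"): (15.67, 1.66),
    ("Binary search", "VAR-SOM-MX8"): (0.008, 0.109),
    ("Binary search", "DART-MX8M-PLUS"): (0.007, 0.125),
}


def gpu_speedup(cpu_time, gpu_time):
    """Ratio of CPU time to GPU time; above 1 means the GPU was faster."""
    return cpu_time / gpu_time


for (example, som), (cpu_time, gpu_time) in results.items():
    print(f"{example} on {som}: {gpu_speedup(cpu_time, gpu_time):.2f}x")
```

For matrix multiplication the GPU is several times faster on both SoMs, while for binary search the ratio falls well below 1, meaning the CPU is more than ten times faster.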

4. Closing Thoughts

By now, you’ve probably recognized that there are many variables that can affect the performance you can expect from GPU-accelerated graphical and computational tasks. Each SoC has a unique CPU, GPU, and memory configuration with varying performance and features. Offloading CPU work to a coprocessor is beneficial not only for faster calculation but also for reducing the load on the CPU, freeing it to do something else.
For a single-threaded task, you can get better results by assigning it to a higher-performance CPU core, like the A72.

Each application has unique requirements. While specifications and numerical benchmarks can help point you in the right direction, it’s best to benchmark your application on the target hardware. Variscite recommends using an EVK and pin-to-pin compatible SoMs to evaluate your application and use real-world results to help you identify the correct SoM for your application.

For more information and examples, please visit the OpenCL Examples directory in our var-demos repository on GitHub.
