1. iMX8 overview

Variscite’s i.MX8 family of System on Modules (SoMs) provides pin-to-pin scalability between the i.MX 8, i.MX 8X, and i.MX 8M families of NXP SoCs. Each SoM features a GPU that enables hardware acceleration of both graphical and computational applications. This article introduces Variscite customers to the GPU and provides practical steps for selecting the correct SoM, as follows:

  • A comparison of the GPUs of each SoM in Variscite’s i.MX8 family
  • An introduction to the software APIs available to accelerate graphical and computational tasks
  • OpenGL ES benchmark results using GPU acceleration
  • Benchmark results of various calculations using CPU and GPU
  • Helpful tips and guidelines for getting the best results
  • Next steps for selecting the correct SoM for your application

1.1 CPU and GPU

Before getting started, it helps to review the difference between the Central Processing Unit (CPU) and the Graphics Processing Unit (GPU), and why the GPU is so important for graphical applications.
The CPU is designed to handle a wide range of tasks quickly, but it can execute only a limited number of tasks simultaneously. The GPU, on the other hand, is designed for highly parallel work and is well suited to graphics workloads such as rendering high-resolution images and videos.
CPUs and GPUs also differ in their overall architecture: the CPU is “narrow and deep”, while the GPU is “shallow and wide”. Generally, the CPU has a large memory cache, performs branch prediction, and runs at a much higher clock speed than the GPU. The GPU contains several compute units, each much simpler than a regular CPU core, that execute a single instruction in parallel on multiple data elements. The GPU does not perform branch prediction.

For example, code with a lot of branching, such as “if-else” statements, runs more efficiently on a CPU because of its branch prediction capabilities, whereas code dominated by Single Instruction Multiple Data (SIMD) operations runs more efficiently on a GPU because of its parallel processing capabilities.
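As a rough illustration of these two workload shapes, here is a CPU-only Python sketch. It is not taken from Variscite's examples, and the function names are hypothetical. The binary search takes a data-dependent branch at every step, so each iteration depends on the previous one. The element-wise scaling applies the same operation to every element independently, which is the pattern a GPU's compute units can run in parallel.

```python
def binary_search(sorted_values, target):
    """Branch-heavy: every step depends on the outcome of the previous comparison."""
    low, high = 0, len(sorted_values) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_values[mid] == target:
            return mid
        elif sorted_values[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1  # target not present


def scale_all(values, factor):
    """Data-parallel: the same multiply is applied to each element independently."""
    return [v * factor for v in values]


data = list(range(0, 100, 2))  # sorted even numbers 0, 2, ..., 98
print(binary_search(data, 42))
print(scale_all([1, 2, 3], 10))
```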

 

i.MX8 System on Module with 3D GPU - GC7000 XSVX (x2)

 

2. GPU on i.MX 8 Application Processors

The i.MX 8 family of applications processors from NXP incorporates GPUs from Vivante (VeriSilicon). These GPUs accelerate several user-level Application Programming Interfaces (APIs). The following table gives a general overview of the processors and the APIs they currently support:

| NXP SoC | Variscite System on Module | 3D GPU | 2D GPU | 3D API | 2D API | Compute API | Other APIs |
|---|---|---|---|---|---|---|---|
| i.MX 8 | VAR-SOM-MX8, SPEAR-MX8 | GC7000 XSVX (x2) | High Perf 2D Blit Engine | GLES, Vulkan | OpenVG, G2D | OpenCL | OpenVX |
| i.MX 8X | VAR-SOM-MX8X | GC7000 Lite | High Perf 2D Blit Engine | GLES, Vulkan | OpenVG, G2D | OpenCL | N/A |
| i.MX 8M | DART-MX8M | GC7000 Lite | N/A | GLES, Vulkan | OpenVG | OpenCL | N/A |
| i.MX 8M Mini | DART-MX8M-MINI, VAR-SOM-MX8M-MINI | GCNano Ultra | GC320 | GLES | OpenVG, G2D | N/A | N/A |
| i.MX 8M Nano | VAR-SOM-MX8M-NANO | GC7000 Ultra Lite | N/A | GLES, Vulkan | OpenVG, G2D | OpenCL | N/A |
| i.MX 8M Plus | DART-MX8M-PLUS, VAR-SOM-MX8M-PLUS | GC7000 Ultra Lite | GC520L | GLES, Vulkan | OpenVG, G2D | OpenCL | OpenVX |
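When shortlisting a SoM by the APIs an application needs, the table can be treated as a simple lookup. The sketch below transcribes the API columns of the table into a Python dictionary and filters it. The names `SOC_APIS` and `socs_supporting` are hypothetical and only serve the illustration.

```python
# API support per SoC, transcribed from the 3D, 2D, Compute, and Other API
# columns of the table above (N/A entries are simply left out).
SOC_APIS = {
    "i.MX 8":       {"GLES", "Vulkan", "OpenVG", "G2D", "OpenCL", "OpenVX"},
    "i.MX 8X":      {"GLES", "Vulkan", "OpenVG", "G2D", "OpenCL"},
    "i.MX 8M":      {"GLES", "Vulkan", "OpenVG", "OpenCL"},
    "i.MX 8M Mini": {"GLES", "OpenVG", "G2D"},
    "i.MX 8M Nano": {"GLES", "Vulkan", "OpenVG", "G2D", "OpenCL"},
    "i.MX 8M Plus": {"GLES", "Vulkan", "OpenVG", "G2D", "OpenCL", "OpenVX"},
}


def socs_supporting(*required_apis):
    """Return the SoCs (in table order) whose API set contains every required API."""
    needed = set(required_apis)
    return [soc for soc, apis in SOC_APIS.items() if needed <= apis]


print(socs_supporting("OpenCL", "G2D"))
print(socs_supporting("OpenVX"))
```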

The listed APIs are defined by the Khronos Group, an open, non-profit, member-driven consortium of over 160 industry-leading companies that creates advanced, royalty-free interoperability standards for 3D graphics, augmented and virtual reality, parallel programming, vision acceleration, and machine learning.

For more information about the clock speeds, number of shaders, GFLOPS, etc. please refer to:

 

iMX8 based evaluation kit with 2D/3D graphics acceleration

 

3. 2D/3D Graphics Acceleration

Variscite’s Board Support Package (BSP) comes with several examples for the most common APIs, such as OpenCL, OpenVG, OpenGL, and more. These examples can be found in the /opt folder of recent Yocto releases. For example, to run the Bloom OpenGL ES demo, run:

# cd /opt/imx-gpu-sdk/GLES3/Bloom
# ./GLES3.Bloom_Wayland

If the application requires a Graphical User Interface (GUI), use OpenGL ES (a.k.a. GLES), a subset of the OpenGL API for 2D and 3D rendering of computer graphics. GLES can provide graphics acceleration either directly or through an abstracted high-level framework. To create an application using hardware-accelerated graphics, there are two options:

  • OpenGL ES: writing the application using the native framework, which gives the user more control. However, this is the most challenging way to achieve the expected results.
  • Library/Framework: using a framework that already supports OpenGL ES graphics acceleration, for example, Qt.

3.1 Tests with the glmark2 Benchmark

To estimate the performance of the GPU, the following tests were run with the glmark2 benchmark tool, with the CPU scaling governor set to performance and with a heatsink on the SoC.
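The article does not show how the governor was set. On Linux it is normally exposed through the cpufreq sysfs files, so one possible way to inspect and change it is sketched below. This assumes the standard `/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor` layout is present on the target and that the script runs as root when writing. The helper names are hypothetical and not part of Variscite's BSP.

```python
import glob
import os

# Standard Linux cpufreq location (assumed to exist on the target).
CPU_SYSFS = "/sys/devices/system/cpu"


def governor_files(base=CPU_SYSFS):
    """Return the scaling_governor file of every CPU that exposes cpufreq."""
    pattern = os.path.join(base, "cpu[0-9]*", "cpufreq", "scaling_governor")
    return sorted(glob.glob(pattern))


def read_governors(base=CPU_SYSFS):
    """Map each governor file to the governor currently selected in it."""
    governors = {}
    for path in governor_files(base):
        with open(path) as f:
            governors[path] = f.read().strip()
    return governors


def set_governor(name, base=CPU_SYSFS):
    """Write the governor name to every CPU (needs root on a real system)."""
    for path in governor_files(base):
        with open(path, "w") as f:
            f.write(name)
```

On the target, calling `set_governor("performance")` as root and then `read_governors()` should confirm that every core reports the performance governor before the benchmark is started.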

Running the Tests with glmark2

The tests were executed with the following command:

# glmark2-es2-wayland -s 640x480

The taskset command can be used to run the benchmark on specific CPU cores. For example, on the SoMs powered by the i.MX 8 SoC, which has 4x A53 and 2x A72 cores, you can select a set of cores of the same type (A53 or A72) by specifying the core numbers with the following commands:

# taskset -c 0-3 glmark2-es2-wayland -s 640x480
# taskset -c 4,5 glmark2-es2-wayland -s 640x480

NOTE: Even though the examples above use sets of the same core types (A53 or A72), you may use any combination of cores.
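If the benchmark is launched from a script rather than from the shell, the same pinning can be done with the Linux CPU affinity calls. The sketch below assumes Linux and Python's `os.sched_setaffinity`. The core numbering (0-3 for the A53 cores, 4 and 5 for the A72 cores) is taken from the taskset examples above, and `run_pinned` is a hypothetical helper name.

```python
import os
import subprocess

# Core numbering taken from the taskset examples above.
A53_CORES = {0, 1, 2, 3}
A72_CORES = {4, 5}


def run_pinned(command, cores):
    """Run `command` with its CPU affinity restricted to `cores` (Linux only)."""
    def pin():
        # Executed in the child process just before exec, similar to `taskset -c`.
        os.sched_setaffinity(0, cores)

    return subprocess.run(command, preexec_fn=pin, check=False)


# Example, to be run on the target:
# run_pinned(["glmark2-es2-wayland", "-s", "640x480"], A72_CORES)
```

The affinity is set only in the child process, so the calling script itself can still be scheduled on any core.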

Test Score Results

| Variscite System on Module | BSP Version | Module Version | DRAM | GL Renderer | GL Version | Score |
|---|---|---|---|---|---|---|
| VAR-SOM-MX8 4x A53 | 5.4.85 | V1.3 | 4GB | GC7000XSVX | OpenGL ES 3.2 | 2151 |
| VAR-SOM-MX8 2x A72 | 5.4.85 | V1.3 | 4GB | GC7000XSVX | OpenGL ES 3.2 | 2315 |
| VAR-SOM-MX8 6x cores | 5.4.85 | V1.3 | 4GB | GC7000XSVX | OpenGL ES 3.2 | 2121 |
| SPEAR-MX8 4x A53 | 5.4.85 | V1.2 | 4GB | GC7000XSVX | OpenGL ES 3.2 | 2070 |
| SPEAR-MX8 2x A72 | 5.4.85 | V1.2 | 4GB | GC7000XSVX | OpenGL ES 3.2 | 2256 |
| SPEAR-MX8 6x cores | 5.4.85 | V1.2 | 4GB | GC7000XSVX | OpenGL ES 3.2 | 2060 |
| VAR-SOM-MX8X | 4.14 | V1.2 | 2GB | GC7000L | OpenGL ES 3.1 | 952 |
| DART-MX8M | 5.4.142 | V1.3 | 4GB | GC7000L | OpenGL ES 3.1 | 946 |
| DART-MX8M-MINI | 5.4.142 | V1.2 | 2GB | GC7000 NanoUltra | OpenGL ES 2.0 | 389 |
| VAR-SOM-MX8M-MINI | 5.4.142 | V1.3 | 2GB | GC7000 NanoUltra | OpenGL ES 2.0 | 381 |
| VAR-SOM-MX8M-NANO | 5.4.142 | V1.3 | 1GB | GC7000UL | OpenGL ES 3.1 | 361 |
| DART-MX8M-PLUS | 5.4.70 | V1.1 | 4GB | GC7000UL | OpenGL ES 3.1 | 992 |
| VAR-SOM-MX8M-PLUS | 5.4.70 | V1.1 | 4GB | GC7000UL | OpenGL ES 3.1 | 987 |


Single Core vs Multicore Results

An interesting observation from the table above is that, on both the SPEAR-MX8 and the VAR-SOM-MX8, the 2x A72 cores performed better than all 6x cores (2x A72 + 4x A53). The following table shows the results on the SPEAR-MX8 when running glmark2-es2-wayland on one or more specific CPU cores:

| Variscite System on Module | Used CPU Cores | CPU Max Usage | Score |
|---|---|---|---|
| SPEAR-MX8 | 1x A53 | 95% | 1462 |
| SPEAR-MX8 | 2x A53 | 130% | 2039 |
| SPEAR-MX8 | 3x A53 | 126% | 2042 |
| SPEAR-MX8 | 4x A53 | 124% | 2070 |
| SPEAR-MX8 | 1x A72 | 85% | 1843 |
| SPEAR-MX8 | 2x A72 | 112% | 2256 |

Although the test benchmarks the GPU, the score also depends on the CPU. For this specific application, using 2x A72 cores produces the best results because the A72 cores are faster than the A53 cores. The benchmark also makes little use of additional cores: CPU usage peaks at about 130% and the score barely changes beyond two A53 cores, so assigning the task to the fastest cores produces the best results.
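To make the scaling behavior concrete, the short sketch below computes the score gain of each configuration relative to a single core of the same type. The scores are copied from the SPEAR-MX8 table above, and the function name `gain_percent` is hypothetical.

```python
# glmark2 scores copied from the SPEAR-MX8 table above:
# (core type, number of cores) -> score
scores = {
    ("A53", 1): 1462, ("A53", 2): 2039, ("A53", 3): 2042, ("A53", 4): 2070,
    ("A72", 1): 1843, ("A72", 2): 2256,
}


def gain_percent(core_type, cores):
    """Score gain of `cores` cores over one core of the same type, in percent."""
    base = scores[(core_type, 1)]
    return 100.0 * (scores[(core_type, cores)] - base) / base


for core_type, cores in sorted(scores):
    gain = gain_percent(core_type, cores)
    print(f"{cores}x {core_type}: {gain:+.1f}% vs 1x {core_type}")
```

The second A53 core adds roughly 39% to the score, while the third and fourth cores together add only about 2 percentage points more, which is consistent with a workload that keeps little more than one core busy.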

3.2 OpenCL Acceleration

OpenCL (Open Computing Language) is an open, royalty-free standard for cross-platform, parallel programming of diverse accelerators found in supercomputers, cloud servers, personal computers, mobile devices and embedded platforms. Below are benchmark results and links to the source code of two examples written both for the CPU and GPU cores using OpenCL:

Matrix Multiplication (512×512)

| Variscite System on Module | Used CPU Cores | CPU Time (s) | GPU Time (s) |
|---|---|---|---|
| VAR-SOM-MX8 | 1x A72 @ 600 MHz | 3.64 | 0.67 |
| DART-MX8M-PLUS | 1x A53 @ 1200 MHz | 15.67 | 1.66 |

Binary Search

| Variscite System on Module | Used CPU Cores | CPU Time (ms) | GPU Time (ms) |
|---|---|---|---|
| VAR-SOM-MX8 | 1x A72 @ 600 MHz | 0.008 | 0.109 |
| DART-MX8M-PLUS | 1x A53 @ 1200 MHz | 0.007 | 0.125 |

The matrix multiplication example runs faster on the GPU while the binary search example runs faster on the CPU. This makes sense, because:

a. The binary search example is dominated by branches (if-else statements), which makes it more suitable for the CPU; the matrix multiplication example has almost none.
b. The matrix multiplication example uses a lot of the same instructions on multiple data, and this is more suitable for the GPU.
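The size of the difference is easier to see as a ratio. The sketch below divides the CPU time by the GPU time for each row of the two tables above. The timings are copied from those tables, and the name `gpu_speedup` is hypothetical.

```python
# Timings copied from the two tables above:
# (module, workload) -> (CPU time, GPU time); units match within each pair.
timings = {
    ("VAR-SOM-MX8", "matrix multiplication"): (3.64, 0.67),       # seconds
    ("DART-MX8M-PLUS", "matrix multiplication"): (15.67, 1.66),   # seconds
    ("VAR-SOM-MX8", "binary search"): (0.008, 0.109),             # milliseconds
    ("DART-MX8M-PLUS", "binary search"): (0.007, 0.125),          # milliseconds
}


def gpu_speedup(cpu_time, gpu_time):
    """CPU time divided by GPU time: above 1 the GPU is faster, below 1 it is slower."""
    return cpu_time / gpu_time


for (module, workload), (cpu, gpu) in timings.items():
    print(f"{module:15s} {workload:22s} GPU speedup: {gpu_speedup(cpu, gpu):.2f}x")
```

For matrix multiplication the GPU is about 5.4 times faster on the VAR-SOM-MX8 and about 9.4 times faster on the DART-MX8M-PLUS. For binary search the ratio falls well below 1, meaning the GPU is roughly 14 to 18 times slower than the CPU.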

 

4. Closing Thoughts

As the results above show, many variables affect the performance you can expect from GPU-accelerated graphical and computational tasks. Each SoC has a unique CPU, GPU, and memory configuration with varying performance and features. Offloading work from the CPU to a coprocessor not only speeds up suitable calculations but also reduces the load on the CPU, freeing it for other tasks.
For a single-threaded task, you can get better results by assigning it to a higher-performance CPU core, such as the A72.

Each application has unique requirements. While specifications and numerical benchmarks can help point you in the right direction, it’s best to benchmark your application on the target hardware. Variscite recommends using an evaluation kit (EVK) and pin-to-pin compatible SoMs to evaluate your application, and using the real-world results to identify the correct SoM for your application.

For more information and examples, please visit the OpenCL Examples directory in our var-demos repository on GitHub.