1. i.MX8 Overview
Variscite’s i.MX8 family of System on Modules (SoMs) provides pin-to-pin scalability between the i.MX 8, i.MX 8X, and i.MX 8M families of NXP SoCs. Each SoM features a GPU that enables hardware acceleration of both graphical and computational applications. This article introduces Variscite customers to the GPU and offers practical guidance for selecting the correct SoM. It covers:
- A comparison of the GPUs of each SoM in Variscite’s i.MX8 family
- An introduction to the software APIs available to accelerate graphical and computational tasks
- OpenGL ES benchmark results using GPU acceleration
- Benchmark results of various calculations using CPU and GPU
- Helpful tips and guidelines for getting the best results
- Next steps for selecting the correct SoM for your application
1.1 CPU and GPU
Before getting started, it helps to distinguish the Central Processing Unit (CPU) from the Graphics Processing Unit (GPU) and to see why the GPU matters so much for graphical applications.
The CPU is designed to handle a wide range of tasks quickly, but the number of tasks it can execute simultaneously is limited. The GPU, by contrast, is designed for parallelism and is well suited to graphics workloads such as rendering high-resolution images and video.
CPUs and GPUs differ in their overall architecture: the CPU is “narrow and deep”, while the GPU is “shallow and wide”. Generally, the CPU has a large memory cache, performs branch prediction, and runs at a much higher clock speed than the GPU. The GPU contains several compute units, each much simpler than a CPU core, that can execute a single instruction on multiple data elements in parallel. The GPU does not perform branch prediction.
For example, code with a lot of branching, such as “if-else” statements, runs more efficiently on a CPU because of its branch prediction, whereas code dominated by Single Instruction Multiple Data (SIMD) operations runs more efficiently on a GPU because of its parallel processing capabilities.
2. GPU on i.MX 8 Application Processors
The i.MX 8 applications processor family from NXP incorporates GPUs from Vivante (VeriSilicon). These GPUs accelerate several user-level Application Programming Interfaces (APIs). The following table gives a general overview of the processors and their current support:
|NXP SoC||Variscite System on Module||3D GPU||2D GPU||3D API||2D API||Compute API||Other APIs|
| || ||GC7000 XSVX (x2)||High Perf 2D Blit Engine||GLES, Vulkan||OpenVG, G2D||OpenCL||OpenVX|
|i.MX 8X||VAR-SOM-MX8X||GC7000 Lite||High Perf 2D Blit Engine||GLES, Vulkan||OpenVG, G2D||OpenCL||N/A|
|i.MX 8M||DART-MX8M||GC7000 Lite||N/A||GLES, Vulkan||OpenVG||OpenCL||N/A|
|i.MX 8M Mini||DART-MX8M-MINI,||GCNano Ultra||GC320||GLES||OpenVG, G2D||N/A||N/A|
|i.MX 8M Nano||VAR-SOM-MX8M-NANO||GC7000 Ultra Lite||N/A||GLES, Vulkan||OpenVG, G2D||OpenCL||N/A|
|i.MX 8M Plus||DART-MX8M-PLUS,||GC7000 Ultra Lite||GC520L||GLES, Vulkan||OpenVG, G2D||OpenCL||OpenVX|
The listed APIs are defined by the Khronos Group, which is an open, non-profit, member-driven consortium of over 160 industry-leading companies, creating advanced, royalty-free interoperability standards for 3D graphics, augmented and virtual reality, parallel programming, vision acceleration and Machine Learning:
- OpenCL: used for cross-platform, parallel programming of modern processors.
- OpenGL ES (OpenGL for Embedded Systems): used for rendering advanced 2D and 3D graphics on embedded systems.
- OpenVG: used for advanced user interfaces and vector graphics libraries such as SVG.
- OpenVX: enables performance- and power-optimized computer vision processing, which is important in embedded and real-time use cases such as face, body, and gesture tracking or intelligent video surveillance.
- Vulkan: enables developers to target a wide range of devices with the same graphics API.
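As a quick sanity check on a target board, the user-space libraries for these APIs can be looked up on the root filesystem. The sketch below is not part of the Variscite BSP: the function name list_gpu_apis is hypothetical, and the library base names (libGLESv2, libOpenVG, libOpenCL, libOpenVX, libvulkan) and the default /usr/lib location are assumptions about how the GPU driver package is installed. A library being present does not by itself prove that the SoC accelerates that API.

```shell
# Report whether a shared library for each Khronos API exists in a directory.
# $1: library directory (assumed default: /usr/lib).
# The library base names below are assumptions, not taken from the article.
list_gpu_apis() {
    libdir="${1:-/usr/lib}"
    for entry in "OpenGL ES:libGLESv2" "OpenVG:libOpenVG" "OpenCL:libOpenCL" \
                 "OpenVX:libOpenVX" "Vulkan:libvulkan"; do
        api="${entry%%:*}"   # text before the colon
        lib="${entry##*:}"   # text after the colon
        status="not found"
        for f in "$libdir/$lib".so*; do
            # An unmatched glob stays literal, so test that the file exists.
            if [ -e "$f" ]; then
                status="found"
                break
            fi
        done
        echo "$api: $status"
    done
}
```

Calling list_gpu_apis with no argument checks /usr/lib; pass another directory if your image installs the libraries elsewhere.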
For more information about the clock speeds, number of shaders, GFLOPS, etc. please refer to:
3. 2D/3D Graphics Acceleration
Variscite’s Board Support Package (BSP) ships with examples for the most common APIs, such as OpenCL, OpenVG, OpenGL and more. These examples are located in the /opt folder of recent Yocto releases. For example, to run the Bloom OpenGL ES demo, start from its directory:
# cd /opt/imx-gpu-sdk/GLES3/Bloom
If the application requires a Graphical User Interface (GUI), use OpenGL ES (a.k.a. GLES), a subset of the OpenGL API for 2D and 3D rendering of computer graphics. GLES can be used for graphics acceleration either directly or through a higher-level framework. There are essentially two options for creating an application with hardware-accelerated graphics:
- OpenGL ES: writing the application using the native framework, which gives the user more control. However, this is the most challenging way to achieve the expected results.
- Library/Framework: using a framework that already supports OpenGL ES graphics acceleration, for example, Qt.
3.1 Tests with the glmark2 Benchmark
To estimate GPU performance, the following tests were run with the glmark2 benchmark tool, with the CPU scaling governor set to performance and a heatsink on the SoC.
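The governor setting mentioned above is not shown in the article. One way to apply it, assuming the standard Linux cpufreq sysfs interface is available on the image, is sketched below. The function name set_governor and its optional second argument (an alternative sysfs root, useful for dry runs) are hypothetical.

```shell
# Write the requested governor (default: performance) to every cpufreq policy.
# $2 optionally overrides the sysfs root; this is a hypothetical convenience
# for dry runs against a scratch directory. Writing to the real sysfs needs root.
set_governor() {
    governor="${1:-performance}"
    root="${2:-/sys/devices/system/cpu/cpufreq}"
    for policy in "$root"/policy*; do
        # Skip unmatched globs and policies we are not allowed to change.
        [ -w "$policy/scaling_governor" ] || continue
        echo "$governor" > "$policy/scaling_governor"
        # Read the value back so the active governor is visible.
        echo "$(basename "$policy"): $(cat "$policy/scaling_governor")"
    done
}

# Example (on the target, as root):
# set_governor performance
```

On SoCs with two core types there is typically one policy directory per cluster, which is why the function loops over all of them instead of writing to a single path.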
Running the Tests with glmark2
The tests were executed with the following command:
# glmark2-es2-wayland -s 640x480
The taskset command can be used to run the benchmark on specific CPU cores. For example, on the SoMs powered by the i.MX 8 family, which has 4x A53 and 2x A72 cores, you can select a set of cores of the same type (A53 or A72) by specifying the core numbers:
# taskset -c 0-3 glmark2-es2-wayland -s 640x480
# taskset -c 4,5 glmark2-es2-wayland -s 640x480
NOTE: Even though the examples above use sets of the same core types (A53 or A72), you may use any combination of cores.
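The core numbers passed to taskset depend on how the kernel enumerates the cores, so it can be worth confirming which CPU numbers belong to which core type before pinning. The sketch below assumes an arm64 /proc/cpuinfo that lists a "CPU part" line for each processor; 0xd03 and 0xd08 are the Arm part numbers for Cortex-A53 and Cortex-A72. The function name cores_by_part is hypothetical.

```shell
# Print a comma-separated list of CPU numbers whose "CPU part" matches $2.
# $1: a cpuinfo-format file (normally /proc/cpuinfo)
# $2: Arm part number, e.g. 0xd03 (Cortex-A53) or 0xd08 (Cortex-A72)
cores_by_part() {
    awk -F': *' -v part="$2" '
        /^processor/ { cpu = $2 }                 # remember the current CPU number
        /^CPU part/  {
            # Force a string comparison so hex values are never parsed as numbers.
            if (($2 "") == (part ""))
                list = list (list == "" ? "" : ",") cpu
        }
        END { print list }
    ' "$1"
}

# Example (on the target): run the benchmark on the Cortex-A72 cores only
# taskset -c "$(cores_by_part /proc/cpuinfo 0xd08)" glmark2-es2-wayland -s 640x480
```

The output is already in the list format that taskset -c accepts, so it can be substituted directly into the commands above.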
Test Scores Results
|Variscite System on Module||BSP Version||Module Version||DRAM||GL Renderer||GL Version||Score|
|VAR-SOM-MX8 4x A53||5.4.85|| || ||GC7000XSVX||OpenGL ES 3.2||2151|
|VAR-SOM-MX8 2x A72||5.4.85|| || ||GC7000XSVX||OpenGL ES 3.2||2315|
|VAR-SOM-MX8 6x cores||5.4.85||V1.3||4GB||GC7000XSVX||OpenGL ES 3.2||2121|
|SPEAR-MX8 4x A53||5.4.85||V1.2||4GB||GC7000XSVX||OpenGL ES 3.2||2070|
|SPEAR-MX8 2x A72||5.4.85|| || ||GC7000XSVX||OpenGL ES 3.2||2256|
|SPEAR-MX8 6x cores||5.4.85||V1.2||4GB||GC7000XSVX||OpenGL ES 3.2||2060|
| || || || ||GC7000L||OpenGL ES 3.1||952|
|DART-MX8M||5.4.142||V1.3||4GB||GC7000L||OpenGL ES 3.1||946|
| || || || || ||OpenGL ES 2.0||389|
| || || || || ||OpenGL ES 2.0||381|
|VAR-SOM-MX8M-NANO||5.4.142||V1.3||1GB||GC7000UL||OpenGL ES 3.1||361|
|DART-MX8M-PLUS||5.4.70||V1.1||4GB||GC7000UL||OpenGL ES 3.1||992|
|VAR-SOM-MX8M-PLUS||5.4.70||V1.1||4GB||GC7000UL||OpenGL ES 3.1||987|
Single Core vs Multicore Results
An interesting observation from the table above is that, on both the SPEAR-MX8 and the VAR-SOM-MX8, the 2x A72 cores alone scored higher than all six cores (2x A72 + 4x A53). The following table shows the results when using the SPEAR-MX8 and running glmark2-es2-wayland on one or more specific CPU cores:
|Variscite System on Module||Used CPU Cores||CPU Max Usage||Score|
Although the test benchmarks the GPU, the score also depends on the CPU. For this specific application, using the 2x A72 cores produces the best results because the A72 cores are faster than the A53 cores. Since the task does not take advantage of multiple CPUs, assigning it to the fastest cores gives the highest score.
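To repeat this comparison on your own hardware, the benchmark can be run once per core set and the final score collected. The sketch below assumes glmark2 ends its output with a line of the form "glmark2 Score: N"; the function names extract_score and compare_core_sets are hypothetical, and the core sets in the example follow the taskset commands shown earlier in this article.

```shell
# Print the number from the "glmark2 Score: N" line read on stdin.
extract_score() {
    awk '/glmark2 Score:/ { print $NF }'
}

# Run the benchmark once per core set and print one "cores: score" line each.
# Intended to be called on the target, where taskset and glmark2 are installed.
compare_core_sets() {
    for cores in "$@"; do
        score=$(taskset -c "$cores" glmark2-es2-wayland -s 640x480 | extract_score)
        echo "cores $cores: ${score:-no score}"
    done
}

# Example (on the target):
# compare_core_sets 0-3 4,5 0-5
```

Running each core set several times and comparing the spread gives a better picture than a single run, since small score differences can fall within run-to-run variation.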
3.2 OpenCL Acceleration
OpenCL (Open Computing Language) is an open, royalty-free standard for cross-platform, parallel programming of diverse accelerators found in supercomputers, cloud servers, personal computers, mobile devices and embedded platforms. Below are benchmark results and links to the source code of two examples, each implemented for both the CPU and the GPU (using OpenCL):
Matrix Multiplication (512×512)
|Variscite System on Module||Used CPU Cores||CPU Time (s)||GPU Time (s)|
- The source code example for this test can be found at Matrix Multiplication Example.
Binary Search
|Variscite System on Module||Used CPU Cores||CPU Time (ms)||GPU Time (ms)|
- The source code example for this test can be found at Binary Search Example.
The matrix multiplication example runs faster on the GPU, while the binary search example runs faster on the CPU. This makes sense, because:
a. The binary search example uses a lot of branches (if-else statements), which makes it more suitable for the CPU. The matrix multiplication example has almost none.
b. The matrix multiplication example applies the same instructions to many data elements, which makes it more suitable for the GPU.
4. Closing Thoughts
By now, you’ve probably recognized that there are many variables that can affect the performance you can expect from GPU-accelerated graphical and computational tasks. Each SoC has a unique CPU, GPU, and memory configuration with varying performance and features. Offloading CPU work to a coprocessor is beneficial not only for faster calculation but also for reducing the load on the CPU, freeing it to do something else.
For a single-threaded task, you can get better results by assigning it to a higher-performance CPU core, like the A72.
Each application has unique requirements. While specifications and numerical benchmarks can help point you in the right direction, it’s best to benchmark your application on the target hardware. Variscite recommends using an EVK and pin-to-pin compatible SoMs to evaluate your application and use real-world results to help you identify the correct SoM for your application.
For more information and examples, please visit the OpenCL Examples directory in our var-demos repository on GitHub.