STM32MP1 Browser Performance

Latest revision as of 08:56, 23 August 2021

Performance Tests with hardware acceleration (etnaviv)

"Infragistics Ignite UI" Demo Application


The "Infragistics Ignite UI" is a commercial WebUI framework based on Angular.
To test the browser performance, a WebUI based on that framework was used.



Opening the screen shown in the screenshot takes approx. 3 seconds from the menu click until the content is fully visible.

To get some insight, the Chromium profiler was used.

The Chrome profiler leads to the following conclusions:

  • The CPU does most of the work.
  • There is no obvious point where CPU cycles are spent excessively (perf doesn't indicate anything).
  • Approx. 70% of the time is spent running JavaScript. This is to be expected, since Angular relies on a lot of complex JavaScript.
  • This means the slowness is caused by the CPU and the website itself.
  • There is not much the GPU can do about this.

Line-Chart Demo Application

https://www.dropbox.com/s/323nv90lhp9wh02/mit_GPU.MP4?dl=0

  • with GPU support
  • 20 - 30 fps
  • works very well

DH demo from tradeshow

https://www.dropbox.com/s/rzeu2qk95oxy4lw/sensorless_demo_filtered_idastroem_DH%20electronics.mp4?dl=0

WebGL example (aquarium)

http://webglsamples.org/aquarium/aquarium.html?numFish=1&canvasWidth=800&canvasHeight=480

Running a WebGL example page (fishtank) on the STM32MP1 (etnaviv) with Chromium at a resolution of 800x600 results in 3 fps.

3D-animation and video as texture

https://www.dropbox.com/s/323nv90lhp9wh02/mit_GPU.MP4?dl=0

The cube runs very smoothly at 40 fps or more.

Conclusion

Web applications must be optimized for the particular embedded system or SoC. Heavy WebGL applications are not suited to this platform.
Nevertheless, appealing and responsive web pages are possible on the STM32MP1.

Performance Tests with software rendering (without GPU)

Line-Chart Demo Application

https://www.dropbox.com/s/kprvgowf8kzod6a/ohne_GPU.MP4?dl=0

  • without GPU (Software Rendering)
  • < 1 fps
  • too slow to be usable

3D-animation and video as texture

https://www.dropbox.com/s/323nv90lhp9wh02/mit_GPU.MP4?dl=0

The cube does not even show up on the screen.

Conclusion

Software rendering (with no GPU) is not an option on the STM32MP1 for graphical web interfaces where responsiveness is required.

drawElements Quality Program (deqp)

The drawElements Quality Program (deqp) is a benchmarking system for measuring the quality of GPUs and their drivers.

It can be started from the command line with the following script (no display needed):

 #!/bin/sh
 
 # Test cases to run; quoted so that deqp, not the shell, expands the pattern.
 export GPU_TESTS='dEQP-GLES2.performance*'
 
 export ETNA_MESA_DEBUG=nir
 export EGL_PLATFORM=surfaceless
 export XDG_RUNTIME_DIR="/var/run/user/$(id -u)/"
 
 # Observe that GPU is rendering:
 # dstat -i -I $(grep gpu /proc/interrupts | cut -d : -f 1)
 
 cd /usr/share/deqp/gles2/ || exit 1
 
 # Make sure software rendering is not forced.
 unset LIBGL_ALWAYS_SOFTWARE
 unset GBM_ALWAYS_SOFTWARE
 
 ./deqp-gles2 \
          --deqp-surface-width=256 \
          --deqp-surface-height=256 \
          --deqp-surface-type=pbuffer \
          --deqp-gl-config-name=rgba8888d24s8ms0 \
          --deqp-visibility=hidden \
          --deqp-log-images=enable \
          --deqp-crashhandler=enable \
          --deqp-log-filename=hardware.qpa \
          -n "${GPU_TESTS}" 2>&1 | tee hardware.log
 
 # Convert the test log to XML.
 ../tools/testlog-to-xml hardware.qpa hardware.xml
 
 export LIBGL_ALWAYS_SOFTWARE=true
 
 # Observe results using a browser, run the following:
 # $ ln -s ../tools/testlog.* .
 # $ python3 -m http.server \
 #       --bind 0.0.0.0 \
 #       --directory /usr/share/deqp/gles2/ 8080
 # Navigate to http://IP-of-the-board:8080
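
The "Observe that GPU is rendering" hint in the script comments can be approximated without dstat. The following is only a minimal sketch, assuming the GPU interrupt line in /proc/interrupts contains the string "gpu" (as the grep in the script suggests). irq_totals is a hypothetical helper name, not part of deqp or Mesa.

```shell
#!/bin/sh
# Print "<irq> <total>" for every line of an interrupts table (file $2)
# that matches the pattern $1, summing the per-CPU counters.
irq_totals() {
    awk -v pat="$1" '
        NR == 1 { ncpu = NF; next }   # header line: one column per CPU
        $0 ~ pat {
            total = 0
            for (i = 2; i <= ncpu + 1; i++) total += $i
            irq = $1
            sub(/:$/, "", irq)        # strip the trailing colon of the IRQ number
            print irq, total
        }' "$2"
}

# Sample twice, one second apart. A growing total suggests the GPU is active.
if [ -r /proc/interrupts ]; then
    irq_totals gpu /proc/interrupts
    sleep 1
    irq_totals gpu /proc/interrupts
fi
```

Run it in a second shell while the deqp script is running. If nothing is printed, the interrupt is probably named differently on that kernel and the pattern needs to be adjusted.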

The deqp run on the STM32MP1 (etnaviv driver) gives the following results:

Test run totals:
       Passed:          1729/1783       97.0%
       Failed:            54/1783        3.0%
       Not supported:      0/1783        0.0%
       Warnings:           0/1783        0.0%
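
The percentages can be cross-checked from the raw counts. The following is a small sketch using awk; percent is a hypothetical helper name chosen for this example. With the counts above, 1729 of 1783 is 97.0% and 54 of 1783 is 3.0%.

```shell
#!/bin/sh
# Print COUNT ($1) as a percentage of TOTAL ($2) with one decimal place.
percent() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * n / t }'
}

echo "Passed: $(percent 1729 1783)%"
echo "Failed: $(percent 54 1783)%"
```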

The full logs can be found here:
STM32MP1_deqp_logs.zip: https://www.dropbox.com/s/iyro4ulsb0x96uo/STM32MP1_deqp_logs.zip?dl=0

Glmark2 Benchmark

The glmark2 benchmark was run on the STM32MP1 at a resolution of 800x600 with the following command:

glmark2-es2-drm --off-screen -s 800x600 --annotate

With the following results:

=======================================================
    glmark2 2017.07
=======================================================
    OpenGL Information
    GL_VENDOR:     etnaviv
    GL_RENDERER:   Vivante GC400 rev 4652
    GL_VERSION:    OpenGL ES 2.0 Mesa 20.2.2 (git-df2977f871)
=======================================================
[build] use-vbo=false: FPS: 182 FrameTime: 5.495 ms
[build] use-vbo=true: FPS: 221 FrameTime: 4.525 ms
[texture] texture-filter=nearest: FPS: 176 FrameTime: 5.682 ms
[texture] texture-filter=linear: FPS: 174 FrameTime: 5.747 ms
[texture] texture-filter=mipmap: FPS: 177 FrameTime: 5.650 ms
[shading] shading=gouraud: FPS: 149 FrameTime: 6.711 ms
[shading] shading=blinn-phong-inf: FPS: 89 FrameTime: 11.236 ms
[shading] shading=phong: FPS: 51 FrameTime: 19.608 ms
[shading] shading=cel: FPS: 30 FrameTime: 33.333 ms
[bump] bump-render=high-poly: FPS: 60 FrameTime: 16.667 ms
[bump] bump-render=normals: FPS: 141 FrameTime: 7.092 ms
[bump] bump-render=height: FPS: 90 FrameTime: 11.111 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 25 FrameTime: 40.000 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 4 FrameTime: 250.000 ms
[pulsar] light=false:quads=5:texture=false: FPS: 106 FrameTime: 9.434 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 11 FrameTime: 90.909 ms
[desktop] effect=shadow:windows=4: FPS: 54 FrameTime: 18.519 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 31 FrameTime: 32.258 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 30 FrameTime: 33.333 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 39 FrameTime: 25.641 ms
[ideas] speed=duration: FPS: 41 FrameTime: 24.390 ms
[jellyfish] <default>: FPS: 21 FrameTime: 47.619 ms
Error: SceneTerrain requires Vertex Texture Fetch support, but GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS is 0
[terrain] <default>: Unsupported
[shadow] <default>: FPS: 42 FrameTime: 23.810 ms
[refract] <default>: FPS: 9 FrameTime: 111.111 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 103 FrameTime: 9.709 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 33 FrameTime: 30.303 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 97 FrameTime: 10.309 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 60 FrameTime: 16.667 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 27 FrameTime: 37.037 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 57 FrameTime: 17.544 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 57 FrameTime: 17.544 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 29 FrameTime: 34.483 ms
=======================================================
                                  glmark2 Score: 75
=======================================================
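
To spot the scenes that fall below a target frame rate, the glmark2 output can be filtered. The following is a minimal sketch, assuming the output has the line format shown above; slow_scenes is a hypothetical helper name and the limit of 30 fps is only an example value.

```shell
#!/bin/sh
# Read glmark2 output on stdin and print "<fps> <scene>" for every
# result line whose FPS value is below the limit given as $1.
slow_scenes() {
    awk -v limit="$1" '
        {
            for (i = 1; i < NF; i++) {
                if ($i == "FPS:" && $(i + 1) + 0 < limit + 0) {
                    print $(i + 1), $1
                    break
                }
            }
        }'
}

# Example with a few lines taken from the results above.
slow_scenes 30 <<'EOF'
[shading] shading=phong: FPS: 51 FrameTime: 19.608 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 4 FrameTime: 250.000 ms
[refract] <default>: FPS: 9 FrameTime: 111.111 ms
[terrain] <default>: Unsupported
EOF
```

For the sample input this lists the effect2d and refract scenes. Lines without an FPS value, such as the unsupported terrain scene, are skipped.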

Tasks of the GPU

  • For "simple" web pages without 3D features, the GPU is only used for "blitting" during a process step called "Raster(ization) and Compositing".
  • "Blitting" = fast copying and moving of memory objects.
  • This can relieve the CPU considerably.
  • This should be well within the capabilities of the STM32MP1.

Some Thoughts about JavaScript performance

Is it correct that, since JS is interpreted and not thread safe, the JS of a single web page does not make use of multithreading/multiprocessing?

JS is interpreted, yes. However, both Chromium's V8 engine and Firefox's SpiderMonkey engine include JIT compilers that turn the JS into code native to the architecture on which they are running. This leads to a huge performance boost compared to simple interpreting. The JS JIT compilation can be parallelized.

JS is synchronous and has no concept of threads. However, there are Promises and events (like timer callbacks). The JS engines generally have separate threads to deal with such events, so the JS code can set up an "async" operation by calling a function (e.g. load something from somewhere) that triggers a JS function on completion. What happens is that the JS itself runs on CPU0, while a thread doing the loading runs on CPU1, and suddenly your JavaScript uses two CPU cores.

So if you add those two points together, contemporary JS engines and the way contemporary JS is written end up using multiple CPUs. It is just not done in the traditional manner, and it does look like a lot of crutches indeed.

Profiling with Chrome DevTools

Chrome DevTools is a set of web developer tools built directly into the Google Chrome browser. The runtime performance of a website can be analysed directly on the target or remotely from a PC connected over Ethernet. With a remote connection you can measure the performance without the measurement itself interfering. The Chrome DevTools can also be used with Qt WebEngine.

How to use Chrome DevTools with a remote connection

To debug Chromium on the target remotely, start Chromium or Qt WebEngine with the following parameters:
--remote-debugging-port=9222 --user-data-dir=remote-profile


To access it from a different computer, forward the port with ssh on the target:
ssh -L 0.0.0.0:9223:localhost:9222 localhost -N
This must be done before the web browser is started.


With a Chrome browser on a computer connected to the target over Ethernet, you can now open the DevTools using the IP address of the target and port 9223.
This clip shows how to open the Chrome DevTools within Chrome: devtools-open.mp4
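
Before opening the DevTools, it can help to check from the PC that the forwarded port answers. The following is a minimal sketch using curl against the /json/version endpoint of the Chromium remote debugging interface. devtools_url and devtools_reachable are hypothetical helper names, and the default address 192.0.2.10 is only a placeholder for the IP address of the board.

```shell
#!/bin/sh
# Build the DevTools discovery URL for a given host ($1) and port ($2).
devtools_url() {
    printf 'http://%s:%s/json/version\n' "$1" "$2"
}

# Return success if the DevTools endpoint answers within 2 seconds.
devtools_reachable() {
    curl --silent --fail --max-time 2 "$(devtools_url "$1" "$2")" >/dev/null
}

# TARGET_IP is a placeholder; set it to the address of the board.
TARGET_IP="${TARGET_IP:-192.0.2.10}"
if devtools_reachable "$TARGET_IP" 9223; then
    echo "DevTools reachable at $TARGET_IP:9223"
else
    echo "DevTools not reachable at $TARGET_IP:9223"
fi
```

If the check fails, verify that the ssh port forwarding is active on the target and that the browser was started with the remote debugging port.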

DevTools analysis

The following tools can be used for performance analysis:

  • CPU Usage:
devtools-performance-monitor-cpu.mp4
This stat makes it easy to see when the CPU is under full load.


  • GPU Usage:
The graph shows when the GPU is in use, but it does not show the GPU load in percent.
It is a useful tool to see whether the GPU is fully occupied.
If the GPU is disabled (software rendering), this graph is absent.


  • Frame Rendering Stats:
Newer Chrome versions no longer show the live FPS. Instead, you can use "dropped" or "delayed" frames as an indicator of good or bad performance. The higher the percentage of frames rendered in time, the better.
A good explanation can be found here:
https://groups.google.com/a/chromium.org/g/blink-dev/c/iHULoSyUxOQ


  • Performance Profile:
devtools-start-performance-profiling.mp4: https://www.dropbox.com/s/fdpj4fp4oecxfkk/devtools-start-performance-profiling.mp4?dl=0
The performance profile gives an overview of all important stats at once. You can record it while browsing through a website on the target.
You can save such profiles and import them into any other Chrome browser to analyse them offline.