Analyzing and optimizing lag and frame drops in the boot animation

On a new device, testers noticed a bug: during boot the system stutters heavily and the boot animation skips frames, so it looks jerky. Analysis showed that the cause is a new feature in Android that is active only on recent versions of the system, which makes this problem different from those we have seen before.

Preliminary analysis

When the boot animation lags, the first thing to check is whether the animation files have changed: the animation may have become more resource-intensive, or its resolution may have increased. It is also worth checking whether too many logs are generated during kernel startup and whether driver initialization adds unnecessary delays. Comparing the old and new versions, the default AOSP boot animation, and other variants gives a rough idea of which change might have caused the delay.

Clarification of the problem: FPS drop occurs only when zygote is restarted

Booting Android goes through three main phases: kernel startup, zygote startup, and system_server startup. These stages can have different effects on the smoothness of the animation. Three experiments were carried out:

  1. Running BootAnimation manually

With the device booted and sitting in the Launcher, start BootAnimation so that the boot animation is drawn on top of the current interface. This can be done with the command setprop service.bootanim.exit 0; setprop ctl.start bootanim. Launching it this way follows the same path as the native AOSP boot animation.

If the animation drops frames in this experiment, the problem is in BootAnimation itself. If no frames are dropped, the animation is being slowed down by other modules.

Experiment result: The animation is smooth and there are no frame drops, indicating that the problem is caused by other components and not the BootAnimation itself.

  2. Restarting system_server

Restart only system_server with the command am restart or killall system_server, and watch the boot animation during the restart. If the FPS drops at this point, then, given the result of the first experiment, the lag is connected to the system_server startup process.

Experiment result: The animation is smooth, with no drop in FPS, which means that the lag is not related to the system_server startup process.

  3. Restarting zygote

Restart zygote by first running the command stop to terminate it, then the command start to start it again. If the FPS drops during this time, the lag is strongly related to the zygote startup process.

Experiment result: There are lags and a drop in FPS, which confirms that the problem is related to the zygote launch process. For now, the problem has been narrowed down to the zygote startup process, but since zygote performs many tasks, more experimentation is needed to further narrow down the search.
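The three experiments above can be collected into a small script. This is a minimal sketch rather than the tooling actually used: the helper names (on_device, exp_bootanim, exp_system_server, exp_zygote) and the DRY_RUN switch are made up for illustration, and a rooted device reachable through adb is assumed. The device commands themselves are the ones quoted in the text.

```shell
#!/bin/sh
# Dry-run by default: commands are printed instead of executed.
# Set DRY_RUN=0 to run them on a connected, rooted device through adb.
on_device() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "adb shell $*"
  else
    adb shell "$@"
  fi
}

# Experiment 1: draw the boot animation on top of the running Launcher.
exp_bootanim() {
  on_device setprop service.bootanim.exit 0
  on_device setprop ctl.start bootanim
}

# Experiment 2: restart only system_server.
exp_system_server() {
  on_device killall system_server
}

# Experiment 3: restart zygote (stop it, then start it again).
exp_zygote() {
  on_device stop
  on_device start
}
```

Calling exp_zygote with the default settings only prints the two adb commands, which makes it easy to review what will run before setting DRY_RUN=0.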

Eliminating hardware bottlenecks: the problem is not related to CPU and IO limitations

Before continuing to analyze how the zygote startup process affects BootAnimation's frame rate, we adjust the hardware resource allocation policy for BootAnimation. This gives a clear picture of how the lag relates to hardware performance. The two processes to tune are BootAnimation and zygote: adjust their load and their share of resources such as CPU, RAM and IO, and check how the frame rate responds.

  1. Reduce the load on BootAnimation and increase its priority during operation.

  2. Reduce BootAnimation frame rate to 24 fps.

  3. Raise the BootAnimation process priority in its init configuration (bootanim.rc).

  4. Set up the cgroup resource configuration for BootAnimation in the init configuration bootanim.rc, placing the BootAnimation process under the cgroup node with the highest performance. Extend task_profiles in bootanim.rc to achieve maximum performance (ProcessCapacityMax).

  5. Adjust the I/O priority configuration for BootAnimation in the init configuration bootanim.rc, setting it to the highest I/O priority (add iopriority 0 in bootanim.rc).

  6. Reduce the resources consumed by the Zygote process. Restart Zygote with reduced priority through a modified rc file, which avoids restarting the media, camera, network and other services along with it.

  7. Remove the restart declarations for these services so that they do not start unintentionally and consume hardware resources.

  8. Adjust zygote's init rc file: remove its high-priority task profile, and so on. Consider reducing the number of classes zygote preloads, since loading them is resource-intensive, consumes a significant amount of CPU and IO, and can lock DDR at its maximum frequency.

  9. Record real-time CPU and IO usage while the lag occurs.

This series of modifications reduces some of the overhead of Zygote startup and reallocates resources in favor of BootAnimation. After applying them and restarting zygote, we still saw frame drops and lag. CPU and I/O utilization, however, remained low, indicating that the lag was not caused by a general hardware performance bottleneck.
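The measurement in step 9 is not described in detail, so the following is only one possible way to take it: sample the aggregate "cpu" line of /proc/stat twice and compute how much of the interval was spent busy and how much waiting on I/O. The function name cpu_io_pct is hypothetical, and iowait is used here as a rough proxy for I/O pressure, which is an assumption and not necessarily the metric that was actually recorded.

```shell
#!/bin/sh
# Usage: cpu_io_pct "<first cpu line>" "<second cpu line>"
# Prints "<busy %> <iowait %>" for the interval between two samples of the
# first line of /proc/stat.
# Field layout: cpu user nice system idle iowait irq softirq steal ...
cpu_io_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9; i++) total += $i      # user .. steal
      if (NR == 1) { t0 = total; idle0 = $5; wait0 = $6 }
      else         { t1 = total; idle1 = $5; wait1 = $6 }
    }
    END {
      dt = t1 - t0
      if (dt <= 0) { print "0.0 0.0"; exit }
      busy = dt - (idle1 - idle0) - (wait1 - wait0)
      printf "%.1f %.1f\n", 100 * busy / dt, 100 * (wait1 - wait0) / dt
    }'
}
```

On a device, the two samples could be taken with adb shell head -n 1 /proc/stat, a second or so apart, while zygote restarts. Low values in both columns during a visible stutter would match the observation that CPU and I/O are not the bottleneck.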

Breaking down the BootAnimation rendering flow: time-consuming operations and a strong GPU correlation

Having narrowed the problem down to the stripped-down Zygote startup flow, can we isolate it at the function level, as a flame graph would? To analyze the BootAnimation display flow we can use a simple approach: add logs that measure the execution time of functions. The basic flow of BootAnimation is straightforward: it reads and parses the boot animation file, which supplies a sequence of images that are decoded and then displayed using OpenGLES. The amount of code involved is small.

Therefore, we add logs before and after key calls to record the execution time of functions. The logs cover the following key steps:

  • Receiving screen update events (processDisplayEvents)

  • Completing buffer allocation and decoding the boot animation frame (initTexture and glClear)

  • Loading onto the GPU and starting rendering (calling draw, done draw)

After rendering the frame, we display the total time spent on it and the number of milliseconds it took to render it without delays (past and delay). The log results are shown below.

As the analysis shows, frames that drop are delayed by 100-400 ms compared to the normal value. The main results are as follows:

  • initTexture: Execution time is stable at around 50ms.

  • Drawing: This is a very time-consuming process with unstable execution time (100-300 ms).

Since more than 99% of the time-consuming operations are related to OpenGL, we can conclude that the lag correlates strongly with GPU performance and utilization. It also shows that the previous step of the analysis did not include performance data for this dimension.
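To put these numbers in perspective, it helps to compare them with the time available per frame. The sketch below assumes the 24 fps rate from the earlier experiment; the function names frame_budget_ms and frames_spanned are made up for illustration.

```shell
#!/bin/sh
# Time available for one frame, in milliseconds, at the given frame rate.
frame_budget_ms() {
  awk -v fps="$1" 'BEGIN { printf "%.1f\n", 1000 / fps }'
}

# Number of frame intervals that a call lasting <ms> occupies at <fps>
# (rounded up to whole frames).
frames_spanned() {
  awk -v ms="$1" -v fps="$2" 'BEGIN {
    n = ms * fps / 1000
    c = int(n)
    if (c < n) c++
    print c
  }'
}
```

At 24 fps, frame_budget_ms 24 gives 41.7 ms per frame, so a draw call of 100-300 ms occupies between frames_spanned 100 24 = 3 and frames_spanned 300 24 = 8 frame intervals. Even the stable 50 ms spent in initTexture is already above a single frame's budget at that rate.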

Narrowing down the problem: delay in the SurfaceFlinger restart process

The delay occurs when SurfaceFlinger is restarted, which is part of the Zygote restart process. In the previous steps we identified two key factors: one related to the GPU, the other to Zygote. In Android, the GPU is driven by the SurfaceFlinger, RenderEngine, and HWC components, which interact with it at boot time. In addition, SurfaceFlinger is started when Zygote starts (the init rc file sets up a dependency: when Zygote is restarted, SurfaceFlinger is restarted too).

Given this, SurfaceFlinger is likely to have a large impact on the lag. While testing the stripped-down Zygote we did not prevent SurfaceFlinger from restarting, because that would blank the screen and make it impossible to observe dropped frames in the boot animation.

We suggest an experiment with the following commands, which start only SurfaceFlinger without restarting Zygote, to see whether the boot animation still lags.

stop # Stop Zygote: this command terminates the Zygote process.
start surfaceflinger # Start SurfaceFlinger: this command starts the SurfaceFlinger process.

The experiment shows that the animation still freezes and drops frames. The problem can now be narrowed down to the SurfaceFlinger startup process.

Determining the root cause of the problem: SkiaRenderEngine is competing for GPU access with BootAnimation

SurfaceFlinger interacts with the GPU through RenderEngine. In newer versions of Android, SkiaRenderEngine has replaced the GLESRenderEngine used in previous versions. Therefore, most operations with the GPU are associated with RenderEngine.
Analysis of the code, review of logs and experience with the startup processes of SurfaceFlinger and RenderEngine allowed us to quickly determine that the shader caching operations performed by RenderEngine during startup took almost 5 seconds.

After shader caching was completed, BootAnimation's frame rate increased, indicating a reduction in GPU load.

What is shader caching? A short explanation will be provided at the end of the article. First, let's look at how it affects GPU resource usage.

During SkiaRenderEngine startup, shaders are cached using multiple layers of nested for loops, as shown in the following image. These operations require significant GPU resources.

By adding “check” logs to the shader caching process, you can verify that it really takes a lot of time.

Since these caching operations are intended solely for “performance optimization”, disabling them will not cause functional problems. We conducted an experiment: by removing all caching calls, we noticed that the loading animation no longer lost frames and ran smoothly.

Now we can confidently say that the freezes are caused by insufficient GPU capacity during boot, and that the culprit is the code responsible for caching shaders.

How to optimize: disabling Prime Shader Cache

This shader cache is created during the SurfaceFlinger boot process. It simulates typical application drawing operations by writing them to an empty buffer, allowing the GPU to build a cache of OpenGLES resources. These caches reduce the time it takes to generate resources the first time you draw, which improves performance to a certain extent.

In the code this process is called Prime Shader Cache — filling/initializing the shader cache. Since it is only for optimization purposes, we can simply remove this code.

In fact, the code provides a parameter, service.sf.prime_shader_cache, which allows you to enable or disable this feature. If this parameter is set to 0, SurfaceFlinger will not call SkiaRenderEngine to perform the caching operations.

If the cache is not created at boot, it is still generated later, where possible, when applications launch. Cache information can be retrieved using the dump command in SurfaceFlinger.

Thus, the final solution to the freezing problem turned out to be quite simple: you just had to set the property service.sf.prime_shader_cache to a value that disables shader caching.
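For a quick check on a test device, the property can be set at runtime before SurfaceFlinger is restarted. This is a hedged sketch: the helper names and the DRY_RUN switch are made up for illustration, a rooted device reachable through adb is assumed, and a property set this way does not survive a reboot, so a real fix has to ship the value in the device's build-time properties (not shown here).

```shell
#!/bin/sh
# Dry-run by default: commands are printed instead of executed.
# Set DRY_RUN=0 to run them on a connected, rooted device through adb.
on_device() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "adb shell $*"
  else
    adb shell "$@"
  fi
}

# Disable shader cache priming, then restart the framework so that
# SurfaceFlinger starts again and picks up the new value.
disable_prime_shader_cache() {
  on_device setprop service.sf.prime_shader_cache 0
  on_device stop
  on_device start
}

# Read the property back to confirm what SurfaceFlinger will see.
check_prime_shader_cache() {
  on_device getprop service.sf.prime_shader_cache
}
```

After the restart, the boot animation can be watched again as in the earlier zygote experiment. If the dump mentioned above is dumpsys SurfaceFlinger, which is an assumption, its output can be compared before and after the change.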
