RESOLVED DUPLICATE of bug 227226
ANGLE Metal perf regression: sdf-sapp.html
https://bugs.webkit.org/show_bug.cgi?id=226841
Andre Weissflog
Reported 2021-06-09 12:47:41 PDT
One of my WASM + WebGL samples with a somewhat complex pixel shader (shadertoy-like signed-distance-field rendering) is seeing a fairly serious regression in the new Metal backend.

My config:
- mid-2014 13" MBP
- macOS version 11.5 Beta (20G5033c)
- Safari Technology Preview Release 125 (Safari 14.2, WebKit 16612.1.15.1.12)

To reproduce:
- first save and close any open files and applications, a forced reboot may be required!
- in Safari Technology Preview enable the following experimental features: "WebGL 2.0" and "WebGL via Metal" (I suspect that only "WebGL via Metal" is needed though)
- navigate to https://floooh.github.io/sokol-html5/sdf-sapp.html

Expected behaviour:
- the demo should run smoothly at 60fps

Observed behaviour:
- the demo only runs at an extremely low frame rate (less than 1 fps) and causes the entire macOS UI to become unresponsive
- in some cases the entire macOS desktop is replaced with a random pixel pattern (this is when a forced reboot is needed)
Attachments
Patch (4.46 KB, patch)
2021-06-10 13:42 PDT, Kyle Piddington
no flags
Patch (14.51 KB, patch)
2021-06-15 07:14 PDT, Kimmo Kinnunen
ews-feeder: commit-queue-
Patch (15.49 KB, patch)
2021-06-15 07:41 PDT, Kimmo Kinnunen
no flags
Kenneth Russell
Comment 1 2021-06-09 12:55:42 PDT
Kyle, would it be possible for you to triage this, since it's apparently a problem specific to the Metal backend?
Kyle Piddington
Comment 2 2021-06-09 13:49:58 PDT
This appears to be an issue with Metal on the 11.5 beta, rather than with Safari Tech Preview 125. I'll do my best to track down what's happening here, but I had no performance issues when running against 11.4 on a similar system.
Kyle Piddington
Comment 3 2021-06-09 16:59:39 PDT
Perhaps I was mistaken, or lucky earlier. I apologize, I'm still seeing the system slowdown on a 11.4 release.
Kyle Piddington
Comment 4 2021-06-09 18:08:34 PDT
A proposed workaround: the current issue seems to be due to us queueing up an excessive number of frames. Where the OpenGL backend may have a cap, the Metal backend does not. @Andre Weissflog: while we're working to address the underlying issue, can you try adding a glFinish() to the end of your render loop, instead of a glFlush()?
Kimmo Kinnunen
Comment 5 2021-06-10 00:46:35 PDT
So there are two problems:

1) When the UI is running slow, the compositor might skip the frame, and we will reuse the back buffer but not cancel the work. The work on that back buffer might still be executing, in case the UI slowness is due to slow WebGL content. Drawing a new slow frame on top of this only makes it slower.

2) Regardless of, or in conjunction with, 1), there is the problem of not waiting on a fence for the completion of a certain number of frames (2-3?). Discussed in passing in bug 222731, but I didn't file a specific bug about the work yet (we can use this bug).

I'd imagine conceptually we could fix 1) by implementing glDiscardFramebuffer() and calling that first thing for the new frame in the preserveDrawingBuffer=false case, but I don't know how effective that would be. The other option would be to make sure sync points work, add a sync point per frame, and wait for the i-2 sync point.
Kyle Piddington
Comment 6 2021-06-10 13:42:53 PDT
EWS Watchlist
Comment 7 2021-06-10 13:43:43 PDT
Note that there are important steps to take when updating ANGLE. See https://trac.webkit.org/wiki/UpdatingANGLE
Kenneth Russell
Comment 8 2021-06-10 16:59:30 PDT
Comment on attachment 431119 [details]
Patch

Chromium implemented this kind of backpressure at a higher level - allow no more than a certain number of outstanding frames before throttling the application. See these BackpressureFenceCreate and BackpressureFenceWait methods:
https://source.chromium.org/chromium/chromium/src/+/main:ui/gl/gl_context.h;l=248?q=BackpressureFenceCreate

How can we know how many command buffers an application will use per frame? Is it possible to implement this throttling at a higher level, using APIs that ANGLE already exposes (like glFenceSync / glGetSync / glWaitSync)?
Kyle Piddington
Comment 9 2021-06-10 17:26:45 PDT
(In reply to Kenneth Russell from comment #8)
> Comment on attachment 431119 [details]
> Patch
> How can we know how many command buffers an application will use per frame?
> Is it possible to implement this throttling at a higher level, using APIs
> that ANGLE already exposes (like glFenceSync / glGetSync / glWaitSync)?

This is the exact problem I'm facing right now: we don't know how many command buffers will be running per frame. '6 enqueued at any time' is a poor metric. I'll look into the higher-level APIs, but we do need a solution, as this can, as Andre mentions, lock up the rest of the system by exhausting Metal resources.
Kenneth Russell
Comment 10 2021-06-11 11:31:22 PDT
The boundaries between rendered frames are best understood at the WebKit level rather than inside ANGLE. Could we try maintaining a queue of FenceSync objects in GraphicsContextGLOpenGL::prepareTextureImpl (Source/WebCore/platform/graphics/angle/GraphicsContextGLANGLE.cpp) for WebGL 2.0 contexts only, and wait on them if they haven't completed within 3-4 frames? https://source.chromium.org/chromium/chromium/src/+/main:ui/gl/gl_context.cc;l=229?q=BackpressureFenceCreate
Dean Jackson
Comment 11 2021-06-11 18:59:01 PDT
Comment on attachment 431119 [details]
Patch

I'd like other people to weigh in on this. This would effectively be blocking the Web Process while waiting for the GPU - that's probably OK, and maybe better than simply pushing too much at it.
Radar WebKit Bug Importer
Comment 12 2021-06-11 19:39:08 PDT
Myles C. Maxfield
Comment 13 2021-06-12 23:18:48 PDT
Comment on attachment 431119 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=431119&action=review

> Source/ThirdParty/ANGLE/ChangeLog:9
> +

What is the performance result of this change? Why does submitting a lot of work cause a performance drop?
Myles C. Maxfield
Comment 15 2021-06-12 23:22:41 PDT
Comment on attachment 431119 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=431119&action=review

> Source/ThirdParty/ANGLE/ChangeLog:6
> + WebGL content was submitting an unbounded amount of GPU work and frames to ANGLE. In OpenGL, this is handled by internal logic. However, for Metal, we need to directly manage how much in-flight work we have at once.

What about non-Apple ports that still use an OpenGL backend? (Do those exist?) Was the behavior of the old OpenGL codepath to block the main thread until there is capacity available?

> Source/ThirdParty/ANGLE/src/libANGLE/renderer/metal/mtl_command_buffer.mm:542
> + dispatch_semaphore_wait(mMetalEnqueueSemaphore, DISPATCH_TIME_FOREVER);

Forever sounds a little extreme.
Myles C. Maxfield
Comment 16 2021-06-12 23:29:19 PDT
I think this is reasonable, given the semantics of WebGL - if the application drew something, we can’t just ignore the draws if there’s a lot of work in flight. I guess we could defer/record their work, return early, and submit it again later on their behalf, but that sounds a little crazy to do. I can’t think of another solution other than blocking. Curious what the old OpenGL codepath did. Given Kimmo’s comment above, I think he should be the final one to review this, so I won’t r+ just yet.
Myles C. Maxfield
Comment 17 2021-06-12 23:31:13 PDT
(In reply to Myles C. Maxfield from comment #16)
> submit it again later on their behalf

This could also get you into the situation where the application submits 10 minutes’ worth of work during the first frame and now the content is no longer interactive.
Kimmo Kinnunen
Comment 18 2021-06-14 05:34:32 PDT
The semaphore leaks as per my reading? (Or do we have some sort of objc arc/releasepool enabled here?)

While something like this is probably needed, I think this is maybe the wrong place to push this particular change in. However, we've done "ends justify the means" type work before, so from that perspective this could be landed as is (with the semaphore leak fixed?).

Slowdowns due to single shader complexity are due to single draw call slowness. The change here implies that the measure of badness is the amount of stuff sent to the GPU via command buffers. E.g. the statement is: 7 command buffers == bad. However, in this case it can be argued that even 1, 2 or 3 command buffers == bad.

Said differently: we see a large number of command buffers because we just keep generating simple command buffers that take a huge amount of time to execute. It's not because we would be generating too many command buffers.

So I think for _this particular case_, the correct fix is to create back pressure on frame generation by waiting so that we don't get ahead of the finished work by more than 2 frames. I can implement this if we have time to wait a few days.

There's a separate concern where a single frame would send, say, 1000 command buffers' worth of data. In those cases, a fix similar to this one would probably be needed. My concern is that by using the wrong solution for the current problem, we don't end up analysing the right problem at all. In other words, it'd be good if we had a test case of 100 command buffers per frame which we could associate with the fix.

The overt caution is due to a scenario where 2 years down the line I'd be debugging some sort of perf problem and failing to understand that there's an implicit fuzz factor that limits the amount of work done. Explicit configuration knobs are easier to find in cases like that, but I don't see this being very configurable through the abstraction.

If I understand correctly, the objects that we can induce to be allocated during frame rendering are fairly lightweight. So in general Metal rendering, having 7 command buffers in flight is not in itself taxing the system? It's just that if any of the buffers contain shaders that complete only in 1 hour, that's when the system has problems?
Kyle Piddington
Comment 19 2021-06-14 09:27:26 PDT
(In reply to Kimmo Kinnunen from comment #18)
> The semaphore leaks as per my reading? (Or do we have some sort of objc
> arc/releasepool enabled here?)

I'll double check; I was reading the man pages and didn't spot dispatch_release.

> While something like this is probably needed, I think this context is maybe
> the wrong context to push this particular in. However, we've done "ends
> justify the means" type work before, so from that perspective this could be
> landed as is (with the semaphore leak fixed?)
>
> Slowdowns due to single shader complexity are due to single draw call
> slowness.
> The change here implies that the measure of badness is the amount of stuff
> sent to GPU via command buffers. E.g. the statement is: 7 command buffers ==
> bad. However, in this case it can be argued that that even 1, 2 or 3 command
> buffers == bad.

You are correct in this case: the measure here is how many in-flight command buffers we have at any time. This could hinder apps with 6/7 simple draw calls in order to save apps with 1, 2, or 3 very large command buffers.

> Said differently: We see major amount of command buffers because we just
> keep generating simple command buffers that take huge amount of time. It's
> not because we would be generating overly much command buffers.
>
> So I think for _this particular case_, the correct fix is to create back
> pressure to the frame generation by waiting that we don't get ahead of the
> finished work more than 2 frames.
>
> I can implement this if we have time to wait for few days.

Let's chat offline.

> There's a separate concern where single frame would send, say, 1000 command
> buffers worth of data. In these cases, a fix similar to this would probably
> be needed. My concern is that since we're using wrong solution for the
> current problem, we don't end up analysing the right problem at all. In
> other words, it'd be good if we had the test case of 100 command buffers per
> frame which we could associate with the fix.

Metal does have a system cap on the maximum number of in-flight command buffers. I believe most of the issue here isn't actually us using up all of the command buffers, but rather exhausting GPU resources by creating frames.

> The overt caution is due to a scenario where 2 years down the line I'd be
> debugging some sort of perf problem and failing to understand that there's a
> implicit fuzz factor that limits the amount of work done. Explicit
> configuration knobs are easier to find in cases like that, but I don't see
> this being very configurable through the abstraction.

Agreed.

> If I understand correctly the objects that we can induce to be allocated
> during frame rendering are fairly lightweight. So in general Metal
> rendering, if I understand correctly, having 7 command buffers in flight is
> not in general taxing the system? It's just that if the any of the buffers
> contain shaders that complete only in 1 hour, that's when the system has
> problems?

From my sampling, the app is spending most of the time waiting for Metal to hand it a command buffer. The system slowdown might be due to other issues, such as us exhausting GPU memory by creating excessive amounts of IOSurfaces. Either way, back pressure is the correct solution.
Kimmo Kinnunen
Comment 20 2021-06-15 07:14:50 PDT
Kimmo Kinnunen
Comment 21 2021-06-15 07:41:46 PDT
Kenneth Russell
Comment 22 2021-06-15 16:42:06 PDT
Thanks for trying this direction, Kimmo. Assuming the conformance test failures are due to bug 227024 - I've linked to that.
Kimmo Kinnunen
Comment 23 2021-06-16 02:06:21 PDT
Comment on attachment 431437 [details]
Patch

Moving the "back pressure" patch to the blocking bug 227059. This bug should continue to investigate why sdf-sapp.html has a performance problem with the Metal backend. The perf is OK on OpenGL, but poor on Metal.
Marcin Ignac
Comment 24 2021-06-22 02:25:37 PDT
I would like to provide an additional test case which, although unaffected by the "WebGL via Metal" setting, has identical symptoms. Here are 250k instanced quad particles with a somewhat complicated GPGPU curl noise fragment shader pass for animation:
http://projects.variable.io/webgl/shader-performance/

MacBook Pro 16", macOS Big Sur 11.4:
- Chrome 91.0.4472.106 + Intel UHD Graphics 630 = 30fps
- Chrome 91.0.4472.106 + AMD Radeon Pro 5600M = 60fps
- Safari 14.1.1 + Apple GPU = 30fps, feels like 5fps
- Safari TP Release 126 + Apple GPU = 28fps, feels like 1-2fps

If I'm unlucky, it completely DDOSes my Mac, and I have to wait e.g. 30s after clicking for the close-tab button to work.
Kyle Piddington
Comment 25 2021-07-07 14:10:18 PDT
*** This bug has been marked as a duplicate of bug 227226 ***
Kyle Piddington
Comment 26 2021-07-07 14:12:32 PDT
Enabling fastmath on integrated graphics recovers the majority of the performance. Remaining frame-pacing work is tracked here: https://bugs.webkit.org/show_bug.cgi?id=227059