| Summary: | [GTK][WPE] Expensive atomic operations, overabundant semaphore signaling in GPUProcess streaming IPC | | |
|---|---|---|---|
| Product: | WebKit | Reporter: | Zan Dobersek <zan> |
| Component: | New Bugs | Assignee: | Nobody <webkit-unassigned> |
| Status: | NEW --- | | |
| Severity: | Normal | CC: | kdwkleung, kkinnunen |
| Priority: | P2 | | |
| Version: | WebKit Nightly Build | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Bug Depends on: | 239895 | | |
| Bug Blocks: | 238593 | | |

Description
Zan Dobersek
2022-05-03 06:26:19 PDT
CPU load for the same WebGL workload, first with GPUProcess disabled, then enabled:
Disabled GPUProcess:
Performance counter stats for process id '3599069' (WebProcess):
5,035.64 msec task-clock # 0.629 CPUs utilized
7,635 context-switches # 1.516 K/sec
35 cpu-migrations # 6.950 /sec
180 page-faults # 35.745 /sec
6,841,684,581 cycles # 1.359 GHz
8,039,991,760 instructions # 1.18 insn per cycle
1,787,173,079 branches # 354.905 M/sec
52,361,168 branch-misses # 2.93% of all branches
8.003704734 seconds time elapsed
Enabled GPUProcess:
Performance counter stats for process id '3601229' (WebProcess):
4,166.83 msec task-clock # 0.521 CPUs utilized
4,042 context-switches # 970.042 /sec
18 cpu-migrations # 4.320 /sec
209 page-faults # 50.158 /sec
5,013,351,640 cycles # 1.203 GHz
6,129,645,128 instructions # 1.22 insn per cycle
1,387,168,345 branches # 332.907 M/sec
50,932,209 branch-misses # 3.67% of all branches
8.004450885 seconds time elapsed
Performance counter stats for process id '3601322' (GPUProcess):
2,795.15 msec task-clock # 0.349 CPUs utilized
149,970 context-switches # 53.654 K/sec
17 cpu-migrations # 6.082 /sec
105 page-faults # 37.565 /sec
5,308,481,612 cycles # 1.899 GHz
7,038,762,193 instructions # 1.33 insn per cycle
1,542,811,483 branches # 551.959 M/sec
16,204,399 branch-misses # 1.05% of all branches
8.003696078 seconds time elapsed
Created attachment 458738 [details]
Flattened WebProcess perf report
Created attachment 458739 [details]
Flattened GPUProcess perf report
(In reply to Zan Dobersek from comment #2)
> Created attachment 458738 [details]
> Flattened WebProcess perf report

(In reply to Zan Dobersek from comment #3)
> Created attachment 458739 [details]
> Flattened GPUProcess perf report

These show, for each process, where time is spent when GPUProcess mode is active. StreamClientConnection and StreamServerConnection methods operating on the buffer offset atomics are never-inlined to isolate those atomic ops as much as possible.

In WebProcess:
3.44% 3.42% WPEWebProcess libWPEWebKit-1.0.so.3.17.0 [.] IPC::StreamClientConnection::release
2.34% 2.32% WPEWebProcess libWPEWebKit-1.0.so.3.17.0 [.] IPC::StreamClientConnection::tryAcquire

In GPUProcess:
8.27% 8.13% xtGL work queue libWPEWebKit-1.0.so.3.17.0 [.] IPC::StreamServerConnection::release
2.62% 2.59% xtGL work queue libWPEWebKit-1.0.so.3.17.0 [.] IPC::StreamServerConnection::tryAcquire

Then, for semaphore signalling, in the WebProcess:
17.96% 0.33% WPEWebProcess libWPEWebKit-1.0.so.3.17.0 [.] IPC::Semaphore::signal

I suspect semaphore signalling could be improved more easily than the atomics, but both could use improvements. On Linux there are futexes, which kind of fit this use case, but not completely and not without a large amount of changes around this code.

You can see whether bug 239895 takes care of some of the semaphore signal overhead. I believe perf can also show the slow paths inside the acquire, release, etc. functions, so it would be interesting to see which parts it thinks are slow.
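
To make the "overabundant semaphore signaling" concern from the discussion above concrete: one general way to reduce signals in a producer/consumer stream is to signal only when the consumer has announced that it is about to sleep. The following is a minimal in-process sketch of that idea. It is not WebKit's StreamClientConnection/StreamServerConnection code and not necessarily what bug 239895 does. All names (`WakeUpSemaphore`, `StreamSketch`, `runThreadedDemo`) are hypothetical, and the semaphore is a mutex/condition-variable stand-in rather than a real cross-process primitive.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical in-process stand-in for a cross-process semaphore.
// It counts signal() calls so the effect of skipping signals can be observed.
class WakeUpSemaphore {
public:
    void signal()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            ++m_value;
        }
        m_signalCount.fetch_add(1);
        m_condition.notify_one();
    }

    void wait()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_condition.wait(lock, [this] { return m_value > 0; });
        --m_value;
    }

    uint64_t signalCount() const { return m_signalCount.load(); }

private:
    std::mutex m_mutex;
    std::condition_variable m_condition;
    uint64_t m_value { 0 };
    std::atomic<uint64_t> m_signalCount { 0 };
};

// Hypothetical single-producer, single-consumer stream that tracks item counts only.
class StreamSketch {
public:
    // Producer: publish one item; signal only if the consumer announced it is sleeping.
    void publish()
    {
        m_published.fetch_add(1);
        if (m_consumerSleeping.exchange(false))
            m_wakeUp.signal();
    }

    // Consumer: block until one item is available, then consume it.
    void consumeOne()
    {
        for (;;) {
            if (m_published.load() > m_consumed) {
                ++m_consumed;
                return;
            }
            m_consumerSleeping.store(true);
            // Re-check after announcing the sleep, otherwise a wake-up could be lost.
            // If an item arrived and we clear the flag ourselves, no signal is pending.
            if (m_published.load() > m_consumed && m_consumerSleeping.exchange(false))
                continue;
            // Either nothing is published yet, or the producer cleared the flag
            // and will signal; in both cases a signal is on its way.
            m_wakeUp.wait();
        }
    }

    uint64_t consumedCount() const { return m_consumed; }
    uint64_t signalCount() const { return m_wakeUp.signalCount(); }

private:
    // Default sequentially consistent ordering is relied on for the
    // "store flag, then load counter" handshake between the two sides.
    std::atomic<uint64_t> m_published { 0 };
    std::atomic<bool> m_consumerSleeping { false };
    uint64_t m_consumed { 0 }; // Touched by the consumer only.
    WakeUpSemaphore m_wakeUp;
};

struct DemoResult {
    uint64_t consumed;
    uint64_t signals;
};

// Runs a producer thread against a consumer on the calling thread.
DemoResult runThreadedDemo(uint64_t itemCount)
{
    StreamSketch stream;
    std::thread producer([&stream, itemCount] {
        for (uint64_t i = 0; i < itemCount; ++i)
            stream.publish();
    });
    for (uint64_t i = 0; i < itemCount; ++i)
        stream.consumeOne();
    producer.join();
    assert(stream.signalCount() <= itemCount);
    return { stream.consumedCount(), stream.signalCount() };
}
```

Calling `runThreadedDemo(20000)` returns the number of items consumed together with the number of signals actually sent, which can never exceed the number of published items. How much such a scheme would help in the streaming IPC case depends on how often the receiving side actually goes to sleep.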