Bug 237988

Summary: Stop returning NodeVector from functions
Product: WebKit Reporter: Chris Dumez <cdumez>
Component: WebCore Misc.Assignee: Chris Dumez <cdumez>
Status: RESOLVED FIXED    
Severity: Normal CC: cmarcelo, darin, esprehn+autocc, ews-watchlist, ggaren, kangil.han, mifenton, webkit-bug-importer
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=238096
Attachments:
Description Flags
Patch
none
Patch ews-feeder: commit-queue-

Description Chris Dumez 2022-03-16 16:19:55 PDT
Stop returning NodeVector from functions and use a out-parameter instead. While this doesn't look as modern, this is actually more efficient. This is because NodeVector has a fairly large inline buffer and is thus not that cheap to "move".
This was causing functions like `ContainerNode::parserAppendChild(Node&)` to spend unnecessary time under `VectorBuffer<Ref<Node, RawPtrTraits<Node>>, 11, FastMalloc>::swap(VectorBuffer<Ref<Node, RawPtrTraits<Node>>, 11, FastMalloc>&, unsigned long, unsigned long)`
Comment 1 Chris Dumez 2022-03-16 16:21:36 PDT
Created attachment 454908 [details]
Patch
Comment 2 Darin Adler 2022-03-16 16:27:13 PDT
Comment on attachment 454908 [details]
Patch

This is definitely a measurable speedup? The return-value optimization doesn’t kick in?
Comment 3 Chris Dumez 2022-03-16 16:40:40 PDT
(In reply to Darin Adler from comment #2)
> Comment on attachment 454908 [details]
> Patch
> 
> This is definitely a measurable speedup? The return-value optimization
> doesn’t kick in?

I am still waiting on A/B testing but I can clearly see the cost in traces when running Speedometer.
Comment 4 Chris Dumez 2022-03-16 16:41:37 PDT
(In reply to Chris Dumez from comment #3)
> (In reply to Darin Adler from comment #2)
> > Comment on attachment 454908 [details]
> > Patch
> > 
> > This is definitely a measurable speedup? The return-value optimization
> > doesn’t kick in?
> 
> I am still waiting on A/B testing but I can clearly see the cost in traces
> when running Speedometer.

Sample Count, Samples %, Normalized CPU %, Symbol
41, 0.0%, 0.0%, WTF::VectorBuffer<WTF::Ref<WebCore::Node, WTF::RawPtrTraits<WebCore::Node> >, 11ul, WTF::FastMalloc>::swap(WTF::VectorBuffer<WTF::Ref<WebCore::Node, WTF::RawPtrTraits<WebCore::Node> >, 11ul, WTF::FastMalloc>&, unsigned long, unsigned long) (in WebCore)
33, 0.0%, 0.0%,     WebCore::ContainerNode::parserAppendChild(WebCore::Node&) (in WebCore)
4, 0.0%, 0.0%,     WebCore::ContainerNode::appendChildWithoutPreInsertionValidityCheck(WebCore::Node&) (in WebCore)
1, 0.0%, 0.0%,     WebCore::executeTask(WebCore::HTMLConstructionSiteTask&) (in WebCore)
1, 0.0%, 0.0%,     WebCore::Node::insertBefore(WebCore::Node&, WebCore::Node*) (in WebCore)
1, 0.0%, 0.0%,     WebCore::ContainerNode::appendChild(WebCore::Node&) (in WebCore)
1, 0.0%, 0.0%,     WebCore::ContainerNode::replaceAll(WebCore::Node*) (in WebCore)
Comment 5 Darin Adler 2022-03-16 16:42:39 PDT
I’m really surprised that this involves a call to swap. Maybe we can fix that some other way. The return-value optimization should not rely on move semantics; it predates move semantics and maybe the another solution is to carefully write these functions so they get return-value optimization.
Comment 6 Chris Dumez 2022-03-16 16:44:50 PDT
(In reply to Darin Adler from comment #5)
> I’m really surprised that this involves a call to swap. Maybe we can fix
> that some other way. The return-value optimization should not rely on move
> semantics; it predates move semantics and maybe the another solution is to
> carefully write these functions so they get return-value optimization.

I believe it involves swap() because that's have the Vector move constructor is implemented:
```
template<typename T, size_t inlineCapacity, typename OverflowHandler, size_t minCapacity, typename Malloc>
inline Vector<T, inlineCapacity, OverflowHandler, minCapacity, Malloc>::Vector(Vector<T, inlineCapacity, OverflowHandler, minCapacity, Malloc>&& other)
{
    swap(other);
}
```

I can find dig deeper though. It is true that one would hope to get RVO.
Comment 7 Darin Adler 2022-03-16 16:44:55 PDT
Comment on attachment 454908 [details]
Patch

You’re welcome to fix it this way. I won’t stand in the way. But return-value optimization should have prevented this swap code from running; I want to understand why it didn’t kick in.
Comment 8 Darin Adler 2022-03-16 16:46:59 PDT
Separately, we should check and see if the move constructor can be written more efficiently than swap; seems likely that it could be.
Comment 9 Chris Dumez 2022-03-16 16:48:14 PDT
Comment on attachment 454908 [details]
Patch

Hmm, in some cases at least, it seems we call swap() explicitly:
e.g. `nodesForInsertion.swap(removedChildNodes);` in `ContainerNode::removeSelfOrChildNodesForInsertion()`.

Maybe I am wrong about the cause for the swap(), I'll double check the other instances.
Comment 10 Darin Adler 2022-03-16 16:50:29 PDT
Please also keep in mind that those functions likely use swap because the code predates move semantics; they could move instead if that helped.
Comment 11 Chris Dumez 2022-03-16 17:51:49 PDT
Created attachment 454922 [details]
Patch
Comment 12 Chris Dumez 2022-03-16 18:26:11 PDT
With this patch (using an out-parameter for NodeVector), I have verified in traces that the CPU time spent in Vector::swap() is completely gone under this Node functions.

However, using returning a NodeVector (avoid avoiding the explicit swap in ContainerNode::removeSelfOrChildNodesForInsertion(), I still see the CPU time under Vector::swap:
Sample Count, Samples %, Normalized CPU %, Symbol
16, 0.0%, 0.0%, WTF::VectorBuffer<WTF::Ref<WebCore::Node, WTF::RawPtrTraits<WebCore::Node> >, 11ul, WTF::FastMalloc>::swap(WTF::VectorBuffer<WTF::Ref<WebCore::Node, WTF::RawPtrTraits<WebCore::Node> >, 11ul, WTF::FastMalloc>&, unsigned long, unsigned long) (in WebCore)
10, 0.0%, 0.0%,     WebCore::ContainerNode::parserAppendChild(WebCore::Node&) (in WebCore)
5, 0.0%, 0.0%,     WebCore::ContainerNode::appendChildWithoutPreInsertionValidityCheck(WebCore::Node&) (in WebCore)
1, 0.0%, 0.0%,     WebCore::ContainerNode::removeSelfOrChildNodesForInsertion(WebCore::Node&, std::__1::optional<WebCore::Exception>&) (in WebCore)

Seems to confirm my suspicion that returning a NodeVector does call the Vector's move constructor (which is slow in the NodeVector case due to the large inline buffer).
I haven't been able to get rid of that CPU time while returning NodeVector in functions yet.

Note that many of the functions were already using an out-parameter for NodeVector, only the couple that I fixed were using a return value.
Comment 13 Chris Dumez 2022-03-16 19:49:30 PDT
From some quick googling it seems that copy elision is only guaranteed in C++17 for prvalues. Therefore, there is no guarantee for named variables.

In those functions, we are definitely returning named variables (and sometimes from multiple code paths, not a single return statement). In such cases, NRVO is apparently not guaranteed, even in C++17. There might be a way to refactor the code to get NRVO but since there is no guarantee, it sounds like it would be fragile.

Given this, I think using an out-parameter is probably the way to go?
It definitely gets rid of the move constructor / Vector::swap() according to the profiler (still waiting on A/B testing to see if the win is actually measurable on Speedometer).
Comment 14 Darin Adler 2022-03-17 07:28:03 PDT
OK
Comment 15 Darin Adler 2022-03-17 07:28:40 PDT
Still think Vector’s move constructor can do better than swap.
Comment 16 Chris Dumez 2022-03-17 07:36:18 PDT
(In reply to Darin Adler from comment #15)
> Still think Vector’s move constructor can do better than swap.

Yes, I believe so as well. However, I think this can be addressed separately?
Even if the Vector's move constructor could be better than a swap, it would still be costly for vectors with an inline buffer, compared to vectors that use an out-of-line buffer (that we can simply adopt).

The A/B bots have not even started processing this patch :/ From local testing I see a 0.5-0.6% progression on Speedometer 2 (iMac Pro / Intel). I also see from the profiler that the CPU time under NodeVector::swap() is completely gone.
Comment 17 Darin Adler 2022-03-17 08:01:27 PDT
(In reply to Chris Dumez from comment #16)
> (In reply to Darin Adler from comment #15)
> > Still think Vector’s move constructor can do better than swap.
> 
> Yes, I believe so as well. However, I think this can be addressed separately?

Yes.

> Even if the Vector's move constructor could be better than a swap, it would
> still be costly for vectors with an inline buffer, compared to vectors that
> use an out-of-line buffer (that we can simply adopt).

Agreed.
Comment 18 Darin Adler 2022-03-17 08:02:13 PDT
Sorry, thought I did review+ yesterday
Comment 19 EWS 2022-03-17 09:55:25 PDT
Committed r291416 (248543@main): <https://commits.webkit.org/248543@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 454922 [details].
Comment 20 Radar WebKit Bug Importer 2022-03-17 09:56:19 PDT
<rdar://problem/90437212>