Bug 209804 - [WPE][GTK] Avoid copying JSON source in webkit_user_content_filter_store_save() and friends
Summary: [WPE][GTK] Avoid copying JSON source in webkit_user_content_filter_store_save...
Status: ASSIGNED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKit API (show other bugs)
Version: WebKit Local Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Adrian Perez
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-03-31 05:56 PDT by Adrian Perez
Modified: 2020-04-10 09:45 PDT (History)
10 users (show)

See Also:


Attachments
Patch (2.84 KB, patch)
2020-03-31 07:46 PDT, Adrian Perez
no flags Details | Formatted Diff | Diff
Patch (3.07 KB, patch)
2020-03-31 14:31 PDT, Adrian Perez
aperez: review?
aperez: commit-queue?
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Adrian Perez 2020-03-31 05:56:13 PDT
API functions webkit_user_content_filter_store_save() and
webkit_user_content_filter_store_save_from_file() result in
passing a GBytes to webkitUserContentFilterStoreSaveBytes(),
the the later instantiates a WTF::String with the contents
of the GBytes. This last step creates a copy of the data,
which can be potentially big (in the megabytes magnitude).

Moreover, webkit_user_content_filter_store_save_from_file()
will try to mmap() local files (good ☺) but to the data copy
to create the WTF::String will make the UI thread block on
page faults to load the data from disk, defeating the purpose
of the optimization (bad ☹).

It should be possible to use a WTF::ExternalStringImpl to
create a WTF::String pointing to the data from the GBytes
without copying, keeping a ref so the GBytes stays alive
and dropping the ref in ExternalStringImplFreeFunction
callback.
Comment 1 Adrian Perez 2020-03-31 07:46:39 PDT
Created attachment 395037 [details]
Patch
Comment 2 EWS Watchlist 2020-03-31 07:47:21 PDT
Thanks for the patch. If this patch contains new public API please make sure it follows the guidelines for new WebKit2 GTK+ API. See http://trac.webkit.org/wiki/WebKitGTK/AddingNewWebKit2API
Comment 3 Charlie Turner 2020-03-31 09:07:42 PDT
Comment on attachment 395037 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=395037&action=review

> Source/WebKit/UIProcess/API/glib/WebKitUserContentFilterStore.cpp:193
> +    RefPtr<StringImpl> stringImpl = ExternalStringImpl::create(reinterpret_cast<const LChar*>(sourceData), sourceSize, [sourceBytes = WTFMove(source)](ExternalStringImpl*, void*, unsigned) { });

It seems like the rule list compiler will copy one way or the other though. Before this patch, we're getting a copy in the call to the compiler, after this patch, we get a copy when the rule list is dispatched to the compile queue?
Comment 4 Adrian Perez 2020-03-31 09:25:35 PDT
(In reply to Charlie Turner from comment #3)
> Comment on attachment 395037 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=395037&action=review
> 
> > Source/WebKit/UIProcess/API/glib/WebKitUserContentFilterStore.cpp:193
> > +    RefPtr<StringImpl> stringImpl = ExternalStringImpl::create(reinterpret_cast<const LChar*>(sourceData), sourceSize, [sourceBytes = WTFMove(source)](ExternalStringImpl*, void*, unsigned) { });
> 
> It seems like the rule list compiler will copy one way or the other though.
> Before this patch, we're getting a copy in the call to the compiler, after
> this patch, we get a copy when the rule list is dispatched to the compile
> queue?

Well, you do have a point: there is a call to String::isolatedCopy() in
ContentRuleListStore::compileContentRuleList(). But this patch still
saves one memory copy, which is a good thing.

Note that it String::isolatedCopy() calls ::isSafeToSendToAnotherThread()
and if that returns ”true” then it just moves the object instead of making
a copy… In this case the String should have only one ref, because we are
creating it and moving it into the call to ::compileContentRuleList() so
I *think* that the copy in this case is being avoided.
Comment 5 Charlie Turner 2020-03-31 09:41:19 PDT
(In reply to Adrian Perez from comment #4)
> (In reply to Charlie Turner from comment #3)
> > Comment on attachment 395037 [details]
> > Patch
> > 
> > View in context:
> > https://bugs.webkit.org/attachment.cgi?id=395037&action=review
> > 
> > > Source/WebKit/UIProcess/API/glib/WebKitUserContentFilterStore.cpp:193
> > > +    RefPtr<StringImpl> stringImpl = ExternalStringImpl::create(reinterpret_cast<const LChar*>(sourceData), sourceSize, [sourceBytes = WTFMove(source)](ExternalStringImpl*, void*, unsigned) { });
> > 
> > It seems like the rule list compiler will copy one way or the other though.
> > Before this patch, we're getting a copy in the call to the compiler, after
> > this patch, we get a copy when the rule list is dispatched to the compile
> > queue?
> 
> Well, you do have a point: there is a call to String::isolatedCopy() in
> ContentRuleListStore::compileContentRuleList(). But this patch still
> saves one memory copy, which is a good thing.
> 
> Note that it String::isolatedCopy() calls ::isSafeToSendToAnotherThread()
> and if that returns ”true” then it just moves the object instead of making
> a copy… In this case the String should have only one ref, because we are
> creating it and moving it into the call to ::compileContentRuleList() so
> I *think* that the copy in this case is being avoided.

I started reading the const& overload of isolatedCopy, not the &&-version...

An ASSERT() at the callsite about the refCount could be useful to easily catch a performance regression.

I wonder if the compiler could be adjusted to do its work from a immutable StringView, don't see why it'd need to modify the source.  Anyway, drive-by over. Informal r+ from me :)
Comment 6 Adrian Perez 2020-03-31 14:30:16 PDT
(In reply to Charlie Turner from comment #5)
> (In reply to Adrian Perez from comment #4)
> > (In reply to Charlie Turner from comment #3)
> > > Comment on attachment 395037 [details]
> > > Patch
> > > 
> > > View in context:
> > > https://bugs.webkit.org/attachment.cgi?id=395037&action=review
> > > 
> > > > Source/WebKit/UIProcess/API/glib/WebKitUserContentFilterStore.cpp:193
> > > > +    RefPtr<StringImpl> stringImpl = ExternalStringImpl::create(reinterpret_cast<const LChar*>(sourceData), sourceSize, [sourceBytes = WTFMove(source)](ExternalStringImpl*, void*, unsigned) { });
> > > 
> > > It seems like the rule list compiler will copy one way or the other though.
> > > Before this patch, we're getting a copy in the call to the compiler, after
> > > this patch, we get a copy when the rule list is dispatched to the compile
> > > queue?
> > 
> > Well, you do have a point: there is a call to String::isolatedCopy() in
> > ContentRuleListStore::compileContentRuleList(). But this patch still
> > saves one memory copy, which is a good thing.
> > 
> > Note that it String::isolatedCopy() calls ::isSafeToSendToAnotherThread()
> > and if that returns ”true” then it just moves the object instead of making
> > a copy… In this case the String should have only one ref, because we are
> > creating it and moving it into the call to ::compileContentRuleList() so
> > I *think* that the copy in this case is being avoided.
> 
> I started reading the const& overload of isolatedCopy, not the &&-version...
>
> An ASSERT() at the callsite about the refCount could be useful to easily
> catch a performance regression.

Even better, we can construct the String first, then write the assertion
as “ASSERT(jsonString.isSafeToSendToAnotherThread())”, and then move the
string. Thanks for the idea!

> I wonder if the compiler could be adjusted to do its work from a immutable
> StringView, don't see why it'd need to modify the source.  Anyway, drive-by
> over. Informal r+ from me :)

Maybe it would make sense to change this, but that would a separate bug :)
Note in case that gets done: Instead of capturing the GRefPtr<GBytes> object
we would need to keep it around in the SaveTaskData struct.
Comment 7 Adrian Perez 2020-03-31 14:31:45 PDT
Created attachment 395091 [details]
Patch
Comment 8 Carlos Garcia Campos 2020-04-02 01:48:10 PDT
Comment on attachment 395091 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=395091&action=review

> Source/WebKit/UIProcess/API/glib/WebKitUserContentFilterStore.cpp:192
> +    RefPtr<StringImpl> stringImpl = ExternalStringImpl::create(reinterpret_cast<const LChar*>(sourceData), sourceSize, [sourceBytes = WTFMove(source)](ExternalStringImpl*, void*, unsigned) { });

Bytes data is UTF-8 and String expects UTF-16. Aren't we losing the conversion here?
Comment 9 Adrian Perez 2020-04-02 05:51:21 PDT
(In reply to Carlos Garcia Campos from comment #8)
> Comment on attachment 395091 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=395091&action=review
> 
> > Source/WebKit/UIProcess/API/glib/WebKitUserContentFilterStore.cpp:192
> > +    RefPtr<StringImpl> stringImpl = ExternalStringImpl::create(reinterpret_cast<const LChar*>(sourceData), sourceSize, [sourceBytes = WTFMove(source)](ExternalStringImpl*, void*, unsigned) { });
> 
> Bytes data is UTF-8 and String expects UTF-16. Aren't we losing the
> conversion here?

String can be *both* 8-bit (ASCII, UTF-8) or 16-bit (UTF-16). It depends
on how the StringImpl backing it is constructed. In the case of this patch,
note that the following factory function is used:

  ExternalStringImpl::create(const LChar* characters, unsigned length,
                             ExternalStringImplFreeFunction&&)

This takes an 8-bit string (LChar*, I think the “L” there is for “Latin1
Character” but don't take my word for it), and ends up calling:

  StringImpl::createWithoutCopying(const LChar*, unsigned length)

which constructs with:

  StringImpl::StringImpl(const LChar* characters, unsigned length, 
                         ConstructWithoutCopyingTag)

and ends up in with:

  inline StringImplShape::StringImplShape(unsigned refCount,
         unsigned length, const LChar* data8, unsigned hashAndFlags)
      : m_refCount(refCount)
      , m_length(length)
      , m_data8(data8)
      , m_hashAndFlags(hashAndFlags)
  {
  }

Note how this sets the m_data8 member, which means the string has
8-bit character data.

So… yes, the GBytes is always 8-bit data, but we are constructing a
WTF::String which *knows* that the data is 8-bit. Side note: all this
is fine and dandy, and no UTF-16 conversions are done, ever, because
the JSON spec mandates UTF-8 and therefore the content extensions
compiler expects UTF-8 as input as well. Life is always better on
the UTF-8 side of life 😎️
Comment 10 Adrian Perez 2020-04-02 05:52:32 PDT
…not to mention that the content extensions layout tests are passing
with this patch applied, too!
Comment 11 Darin Adler 2020-04-06 18:44:29 PDT
Comment on attachment 395091 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=395091&action=review

>>> Source/WebKit/UIProcess/API/glib/WebKitUserContentFilterStore.cpp:192
>>> +    RefPtr<StringImpl> stringImpl = ExternalStringImpl::create(reinterpret_cast<const LChar*>(sourceData), sourceSize, [sourceBytes = WTFMove(source)](ExternalStringImpl*, void*, unsigned) { });
>> 
>> Bytes data is UTF-8 and String expects UTF-16. Aren't we losing the conversion here?
> 
> String can be *both* 8-bit (ASCII, UTF-8) or 16-bit (UTF-16). It depends
> on how the StringImpl backing it is constructed. In the case of this patch,
> note that the following factory function is used:
> 
>   ExternalStringImpl::create(const LChar* characters, unsigned length,
>                              ExternalStringImplFreeFunction&&)
> 
> This takes an 8-bit string (LChar*, I think the “L” there is for “Latin1
> Character” but don't take my word for it), and ends up calling:
> 
>   StringImpl::createWithoutCopying(const LChar*, unsigned length)
> 
> which constructs with:
> 
>   StringImpl::StringImpl(const LChar* characters, unsigned length, 
>                          ConstructWithoutCopyingTag)
> 
> and ends up in with:
> 
>   inline StringImplShape::StringImplShape(unsigned refCount,
>          unsigned length, const LChar* data8, unsigned hashAndFlags)
>       : m_refCount(refCount)
>       , m_length(length)
>       , m_data8(data8)
>       , m_hashAndFlags(hashAndFlags)
>   {
>   }
> 
> Note how this sets the m_data8 member, which means the string has
> 8-bit character data.
> 
> So… yes, the GBytes is always 8-bit data, but we are constructing a
> WTF::String which *knows* that the data is 8-bit. Side note: all this
> is fine and dandy, and no UTF-16 conversions are done, ever, because
> the JSON spec mandates UTF-8 and therefore the content extensions
> compiler expects UTF-8 as input as well. Life is always better on
> the UTF-8 side of life 😎️

That’s close, but incorrect. WTF::String can be 8-bit (ASCII, Latin-1) or 16-bit (UTF-16). It cannot be UTF-8. That’s unfortunate because we all love UTF-8.
Comment 12 Darin Adler 2020-04-06 18:44:55 PDT
And yes, LChar means "Latin-1 character".
Comment 13 Darin Adler 2020-04-06 18:48:27 PDT
Comment on attachment 395091 [details]
Patch

This patch creates a bug where we no longer correctly handle non-ASCII characters. We can do this optimization of the string is all ASCII, but if there are any non-ASCII characters, then we need to use String::fromUTF8 to translate them to UTF-16.

You should construct a test case to demonstrate this in case I am wrong.
Comment 14 Adrian Perez 2020-04-10 05:55:23 PDT
(In reply to Darin Adler from comment #13)
> Comment on attachment 395091 [details]
> Patch
> 
> This patch creates a bug where we no longer correctly handle non-ASCII
> characters. We can do this optimization of the string is all ASCII, but if
> there are any non-ASCII characters, then we need to use String::fromUTF8 to
> translate them to UTF-16.
> 
> You should construct a test case to demonstrate this in case I am wrong.

You are indeed right: so far I had been testing only with JSON content
extension rulesets which are all ASCII. As you suggested to me by chat,
an option would be to check whether the input JSON source string is all
ASCII and avoid the copy in that case and otherwise use String::fromUTF8().

Another option would be to change ContentRuleListStore to use data buffers,
likely WebCore::SharedBuffer, which can wrap GBytes/NSData/CFData/etc.—but
this would be a bigger rework, so let's repurpose this bug for the smaller
optimization for now.
Comment 15 Darin Adler 2020-04-10 08:51:53 PDT
(In reply to Adrian Perez from comment #14)
> Another option would be to change ContentRuleListStore to use data buffers,
> likely WebCore::SharedBuffer, which can wrap GBytes/NSData/CFData/etc.—but
> this would be a bigger rework, so let's repurpose this bug for the smaller
> optimization for now.

Please note that the character set issue gets even more challenging if we use data buffers, unless we are deciding they are UTF-8 on all platforms.
Comment 16 Adrian Perez 2020-04-10 09:38:26 PDT
(In reply to Darin Adler from comment #15)
> (In reply to Adrian Perez from comment #14)
> > Another option would be to change ContentRuleListStore to use data buffers,
> > likely WebCore::SharedBuffer, which can wrap GBytes/NSData/CFData/etc.—but
> > this would be a bigger rework, so let's repurpose this bug for the smaller
> > optimization for now.
> 
> Please note that the character set issue gets even more challenging if we
> use data buffers, unless we are deciding they are UTF-8 on all platforms.

I think we should not assume any encoding for data buffers: they are
just a pile of bytes.

In this particular case, my thinking is that it would be fine to assume
that JSON must be UTF-8 because RFC-8295 mandates it for exchange—and
loading a JSON rule set which may have been generated by some other
software or even by hand qualifies as “JSON text exchanged between
systems that are not part of a closed ecosystem” (quoting the RFC,
see https://tools.ietf.org/html/rfc8259#section-8.1).

;-)
Comment 17 Darin Adler 2020-04-10 09:45:33 PDT
(In reply to Adrian Perez from comment #16)
> I think we should not assume any encoding for data buffers: they are
> just a pile of bytes.

My point exactly.

We can’t replace strings with data buffers because strings come with information about character encoding and buffers do not, we have to replace strings with data buffers along with information about character encoding.

> In this particular case, my thinking is that it would be fine to assume
> that JSON must be UTF-8

OK.