Bug 212393

Summary: UTF-8 encode strings of invalid URLs when converting WTF::URL to NSURL instead of truncating the UTF-16 encoding
Product: WebKit Reporter: Alex Christensen <achristensen>
Component: New BugsAssignee: Alex Christensen <achristensen>
Status: RESOLVED FIXED    
Severity: Normal CC: benjamin, cdumez, cmarcelo, darin, ews-watchlist, thorton, webkit-bug-importer
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
add more tests darin: review+, achristensen: commit-queue+

Description Alex Christensen 2020-05-26 17:34:33 PDT
UTF-8 encode strings of invalid URLs when converting WTF::URL to NSURL instead of truncating the UTF-16 encoding
Comment 1 Alex Christensen 2020-05-26 17:41:50 PDT
Created attachment 400289 [details]
Patch
Comment 2 EWS 2020-05-26 18:49:00 PDT
Committed r262171: <https://trac.webkit.org/changeset/262171>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 400289 [details].
Comment 3 Radar WebKit Bug Importer 2020-05-26 18:49:14 PDT
<rdar://problem/63655013>
Comment 4 Darin Adler 2020-05-27 11:49:10 PDT
Comment on attachment 400289 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=400289&action=review

> Source/WTF/wtf/cf/URLCF.cpp:59
> +    RetainPtr<CFURLRef> cfURL;
> +    if (LIKELY(m_string.is8Bit()))
> +        cfURL = WTF::createCFURLFromBuffer(reinterpret_cast<const char*>(m_string.characters8()), m_string.length());
> +    else {
> +        CString utf8 = m_string.utf8();
> +        cfURL = WTF::createCFURLFromBuffer(utf8.data(), utf8.length());
> +    }

On reflection, this code is really not quite right.

If we have non-ASCII Latin-1 characters, they will great treated as Latin-1, not UTF-8. But only if the string happens to be stored 8-bit, which is an implementation detail. Handling of those characters should *not* be different based on whether the string is 8-bit or not. Sadly, I think we need the more expensive "isAllASCII" check.

Test string would be "http://\xb6", both as an 8-bit string and a 16-bit string. Needs to give the same results in both cases.

> Source/WTF/wtf/cocoa/URLCocoa.mm:77
>      if (LIKELY(m_string.is8Bit()))
>          cfURL = WTF::createCFURLFromBuffer(reinterpret_cast<const char*>(m_string.characters8()), m_string.length());
>      else {
> -        // Slower path.
> -        WTF::URLCharBuffer buffer;
> -        copyToBuffer(buffer);
> -        cfURL = WTF::createCFURLFromBuffer(buffer.data(), buffer.size());
> +        CString utf8 = m_string.utf8();
> +        cfURL = WTF::createCFURLFromBuffer(utf8.data(), utf8.length());
>      }

Same problem.
Comment 5 Alex Christensen 2020-05-27 15:51:54 PDT
"http://\xb6" would be canonicalized to "http://xn--tba/" then it would be ASCII as all valid URLs are.  This change only applied to invalid URLs containing non-ASCII characters.  If you'd like, we can add the test I'm about to upload to verify correct behavior, but I think they behave correctly.
Comment 6 Alex Christensen 2020-05-27 15:52:36 PDT
Created attachment 400398 [details]
add more tests
Comment 7 Darin Adler 2020-05-27 18:24:21 PDT
Comment on attachment 400398 [details]
add more tests

I don’t understand why this works, since the code does not do the same thing with 8-bit and 16-bit strings.
Comment 8 Darin Adler 2020-05-27 18:25:55 PDT
In one case we pass a string with one byte, 0xB6, in it. In the other we pass a string with two bytes, 0xC2 0xB6, in it. Does WTF::createCFURLFromBuffer make the same URL in both cases? If so, how does it do that?
Comment 9 Alex Christensen 2020-05-27 21:55:22 PDT
Comment on attachment 400398 [details]
add more tests

We do pass one code point as input, U+00B6.  The output of NSURL parsing that is the percent-encoded UTF-8 encoding of that code point, %C2%B6, which is the same in both cases.  bugs.webkit.org's review tool is showing the UTF-8 encoding of my test case on line 216, but that's one character.
Comment 10 Darin Adler 2020-05-28 08:21:04 PDT
Sorry. I still don’t understand. WTF::createCFURLFromBuffer creates the same URL if you pass it:

    { 0xB6 }

And if you pass it:

    { 0xC2, 0xB6 }

Really?
Comment 11 Alex Christensen 2020-05-28 10:54:33 PDT
If you pass createCFURLFromBuffer { 0xB6 }, then the CFURLRef constructor will UTF-8 encode then percent encode that, resulting in %C2%B6.
This happens if you call createCFURL() on a URL that has an 8-bit or a 16-bit String containing only the character U+00B6. This is what you see in the patch named "add more tests".  The reason you think you see 2 characters in my code is because the code review tool UTF-8 encodes 0xB6 also.
If you pass createCFURLFromBuffer { 0xC2, 0xB6 }, then it will UTF-8 encode then percent encode that, resulting in %C3%82%C2%B6
Comment 12 Darin Adler 2020-05-28 10:57:04 PDT
(In reply to Alex Christensen from comment #11)
> If you pass createCFURLFromBuffer { 0xC2, 0xB6 }, then it will UTF-8 encode
> then percent encode that, resulting in %C3%82%C2%B6

OK, so then if I create a WTF::String containing the character U+00B6, but it’s a 16-bit string, then this code will run:

    CString utf8 = m_string.utf8();
    cfURL = WTF::createCFURLFromBuffer(utf8.data(), utf8.length());

The value of 'utf8' will be { 0xC2, 0xB6 }, and it will call WTF::createCFURLFromBuffer and result in %C3%82%C2%B6.

But that should not be handled any differently from an 8-bit string containing the character U+00B6.

What’s wrong in the story above? Why doesn’t a test case show this problem?
Comment 13 Alex Christensen 2020-05-28 10:59:58 PDT
Ah, this is because createCFURLFromBuffer first tries to decode the string as UTF-8 if it can then as Latin1.  We should pass createCFURLFromBuffer an encoding to use instead of having it try one then the other.  I think I would have to write a fuzzer to come up with a test case that produced different results if there is one.
Comment 14 Darin Adler 2020-05-28 12:35:38 PDT
Got it! So to show the bug, what we need is a Latin-1 sequence that is also a valid UTF-8 sequence. The test case would be U+00C2, U+00B6. That would give different CFURL results if passed as an 8-bit string and a 16-bit string.
Comment 15 Alex Christensen 2020-05-28 17:46:21 PDT
Fixing in https://bugs.webkit.org/show_bug.cgi?id=212486