Bug 208998

Summary: JavaScript identifier grammar supports unescaped astral symbols, but JSC doesn’t
Product: WebKit Reporter: Mathias Bynens <mathias>
Component: JavaScriptCoreAssignee: Keith Miller <keith_miller>
Status: RESOLVED FIXED    
Severity: Normal CC: benjamin, cdumez, claude.pache, cmarcelo, commit-queue, ews-watchlist, keith_miller, mark.lam, mathias, msaboff, ross.kirsling, saam, tzagallo, webkit-bug-importer, ysuzuki
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
WIP
none
Patch
none
Patch
none
Patch for landing none

Description Mathias Bynens 2020-03-12 08:58:42 PDT
Since ES2015, ID_Start or ID_Continue symbols _outside of the BMP_ can be used in identifiers. Example using U+102A7 CARIAN LETTER A2 (taken from https://mathiasbynens.be/notes/javascript-identifiers-es6):

    var 𐊧; // 1
    var \u{102A7}; // 2

Expected result:

The above program evaluates without throwing an exception.

Actual result:

Although JavaScriptCore handles line 2 correctly, it throws on line 1:

    SyntaxError: Invalid character '\ud800'

Note that JSC is the only engine with this bug:

    $ eshost -sx 'var 𐊧'
    #### Chakra, SpiderMonkey, V8, V8 --harmony, XS
    
    
    #### JavaScriptCore
    
    SyntaxError: Invalid character '\ud800'
Comment 1 Radar WebKit Bug Importer 2020-03-12 12:22:25 PDT
<rdar://problem/60381814>
Comment 2 Keith Miller 2020-03-13 11:35:31 PDT
Interesting our implementation seems semantically the same as v8's:

static inline bool isIdentStart(UChar32 c)
{
    return isLatin1(c) ? isIdentStart(static_cast<LChar>(c)) : isNonLatin1IdentStart(c);
}

static NEVER_INLINE bool isNonLatin1IdentStart(UChar c)
{
    return u_hasBinaryProperty(c, UCHAR_ID_START);
}

and similarly for non-start: 

static ALWAYS_INLINE bool isIdentPart(UChar32 c)
{
    return isLatin1(c) ? isIdentPart(static_cast<LChar>(c)) : isNonLatin1IdentPart(c);
}

static NEVER_INLINE bool isNonLatin1IdentPart(UChar32 c)
{
    return u_hasBinaryProperty(c, UCHAR_ID_CONTINUE) || c == 0x200C || c == 0x200D;
}

My guess is this is a bug in the system ICU. Does that analysis seem correct to you? My Unicode/ICU knowledge is very limited...
Comment 3 Yusuke Suzuki 2020-03-13 13:04:58 PDT
I think we should have a fallback path for surrogate pairs. Something like this.

if (!isIdentStart(code)) {
    // Check whether the code is leading surrogate.
    if (!isLeadSurrogate(code))
        goto fail;
    codeNext = getOneCharacter();
    if (!isTrailSurrogate(codeNext))
        goto fail;
    codePoint = codePointFromSurrogatePair(code, codeNext);
    if (!isIdentStart(codePoint))
        goto fail;
}
Comment 4 Mathias Bynens 2020-03-13 16:28:32 PDT
The error message mentions U+D800 which is the high surrogate half of the character in my example ('\u{102A7}' === '\uD800\uDEA7'), so I think the code paths Keith mentions are hit for each surrogate half (and surrogate code points are never valid identifier characters, hence the exception) as opposed to considering surrogate pairs where applicable.
Comment 5 Keith Miller 2020-03-14 00:27:00 PDT
(In reply to Mathias Bynens from comment #4)
> The error message mentions U+D800 which is the high surrogate half of the
> character in my example ('\u{102A7}' === '\uD800\uDEA7'), so I think the
> code paths Keith mentions are hit for each surrogate half (and surrogate
> code points are never valid identifier characters, hence the exception) as
> opposed to considering surrogate pairs where applicable.

Yeah, I think you're right. I just didn't know what was going on... >.>
Comment 6 Keith Miller 2020-03-15 15:57:01 PDT
Created attachment 393627 [details]
WIP
Comment 7 Keith Miller 2020-03-16 15:11:00 PDT
Created attachment 393691 [details]
Patch
Comment 8 Keith Miller 2020-03-16 15:42:09 PDT
Created attachment 393695 [details]
Patch
Comment 9 Michael Saboff 2020-03-16 16:17:36 PDT
Comment on attachment 393695 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=393695&action=review

r=me

> Source/JavaScriptCore/parser/Lexer.cpp:2625
> +    // FIXME: This should probably not be a lex error but dealing with surrogate pairs here is annoying and it's going to be an error anyway...

Add a bug for this FIXME.
Comment 10 Keith Miller 2020-03-16 16:19:07 PDT
Comment on attachment 393695 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=393695&action=review

>> Source/JavaScriptCore/parser/Lexer.cpp:2625
>> +    // FIXME: This should probably not be a lex error but dealing with surrogate pairs here is annoying and it's going to be an error anyway...
> 
> Add a bug for this FIXME.

Ehh, This should probably not be a FIXME. I don't think we're actually going to fix it.
Comment 11 Keith Miller 2020-03-16 16:28:14 PDT
Created attachment 393704 [details]
Patch for landing
Comment 12 WebKit Commit Bot 2020-03-16 17:12:22 PDT
Comment on attachment 393704 [details]
Patch for landing

Clearing flags on attachment: 393704

Committed r258531: <https://trac.webkit.org/changeset/258531>
Comment 13 WebKit Commit Bot 2020-03-16 17:12:24 PDT
All reviewed patches have been landed.  Closing bug.
Comment 14 Alexey Shvayka 2020-03-23 10:19:48 PDT
*** Bug 167940 has been marked as a duplicate of this bug. ***