WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
178174
Add support in named capture group identifiers for direct surrogate pairs
https://bugs.webkit.org/show_bug.cgi?id=178174
Summary
Add support in named capture group identifiers for direct surrogate pairs
Michael Saboff
Reported
2017-10-11 11:24:28 PDT
Currently, a named capture group can have a backslash escaped RegExpUnicodeEscapeSequence (e.g. \{12345}). It also looks like it should take approriate unicode characters directly, including those made up of surrogate pairs. Failing Test262 - JSTests/test262/test/built-ins/RegExp/named-groups/unicode-property-names.js Exception: SyntaxError: Invalid regular expression: invalid group specifier name at JSTests/test262/test/built-ins/RegExp/named-groups/unicode-property-names.js:16  This test is a UTF-8 source file with embedded non-BMP Unicode characters as both named group identifiers and property names. I made a JS source file with a couple of the Unicode properties from the original test file, and the jsc command throws SyntaxError: Invalid character '\ud801' See attached test source file. I think we need the jsc command to handle UTF-8 source files before we can fix this bug.
Attachments
Patch
(8.48 KB, patch)
2020-03-23 09:33 PDT
,
Alexey Shvayka
no flags
Details
Formatted Diff
Diff
Patch
(24.44 KB, patch)
2020-03-26 20:28 PDT
,
Alexey Shvayka
no flags
Details
Formatted Diff
Diff
Patch
(24.52 KB, patch)
2020-03-27 16:52 PDT
,
Alexey Shvayka
no flags
Details
Formatted Diff
Diff
Patch
(24.80 KB, patch)
2020-03-30 17:31 PDT
,
Alexey Shvayka
no flags
Details
Formatted Diff
Diff
Show Obsolete
(3)
View All
Add attachment
proposed patch, testcase, etc.
Michael Saboff
Comment 1
2017-10-11 11:25:09 PDT
<
rdar://problem/34843961
>
Alexey Shvayka
Comment 2
2020-03-23 09:33:09 PDT
Created
attachment 394267
[details]
Patch
Alexey Shvayka
Comment 3
2020-03-26 20:28:24 PDT
Created
attachment 394700
[details]
Patch
Alexey Shvayka
Comment 4
2020-03-26 20:41:35 PDT
While this patch is functionally the same, I've: a) Moved isIdentityEscapeAnError() out of tryConsumeUnicodeEscape() as RegExpUnicodeEscapeSequence [1] does not contain IdentityEscape [2] production. b) Tweaked error messages to report invalid \u escapes instead of identity escapes (which are Annex B), matching other error messages in other engines. c) Added a few error messages tests. I am kindly asking for re-review. Thank you. [1]:
https://tc39.es/ecma262/#prod-RegExpUnicodeEscapeSequence
[2]:
https://tc39.es/ecma262/#prod-IdentityEscape
Darin Adler
Comment 5
2020-03-27 09:59:59 PDT
Comment on
attachment 394700
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=394700&action=review
This looks fine to me. Michael may also want to review. The key is really test coverage including performance test results, not necessarily exactly how we write the code
> Source/JavaScriptCore/yarr/YarrParser.h:493 > + int escapeChar = tryConsumeUnicodeEscape();
Normally we don’t use abbreviations like "char" in WebKit code, preferring words like "character". So this local variable would be either named "escape" or "escapeCharacter" or "escaped" or "escapedCharacter" or "codePoint".
> Source/JavaScriptCore/yarr/YarrParser.h:-1044 > - if (u == -1) > - return -1;
What’s the rationale for removing this? I don’t think it’s great to rely on what U16_IS_LEAD will return when passed something that is not a 16-bit code unit. So I would have left it in.
> Source/JavaScriptCore/yarr/YarrParser.h:1031 > + int escapeChar = tryConsumeUnicodeEscape();
Same comment about variable naming here. Names like "ch" and "u" are definitely frowned upon in our general coding style, but in a way "escapeChar" is half-way there. Our underlying concept is that people are better at reading words than abbreviations.
Darin Adler
Comment 6
2020-03-27 10:02:30 PDT
Comment on
attachment 394700
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=394700&action=review
>> Source/JavaScriptCore/yarr/YarrParser.h:-1044 >> - return -1; > > What’s the rationale for removing this? I don’t think it’s great to rely on what U16_IS_LEAD will return when passed something that is not a 16-bit code unit. So I would have left it in.
Or we could remove the code, but add this: // <... some amazing comment explaining why this assertion is relevant ...> static_assert(!U16_IS_LEAD(-1));
Michael Saboff
Comment 7
2020-03-27 13:14:24 PDT
Comment on
attachment 394700
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=394700&action=review
>> Source/JavaScriptCore/yarr/YarrParser.h:493 >> + int escapeChar = tryConsumeUnicodeEscape(); > > Normally we don’t use abbreviations like "char" in WebKit code, preferring words like "character". So this local variable would be either named "escape" or "escapeCharacter" or "escaped" or "escapedCharacter" or "codePoint".
I would prefer codePoint as that is what a Unicode escape represents.
> Source/JavaScriptCore/yarr/YarrParser.h:999 > + ASSERT(!hasError(m_errorCode));
Isn't this a redundant ASSERT() given that we have the same assert at the beginning of the function and everywhere we set m_errorCode here we return?
> Source/JavaScriptCore/yarr/YarrParser.h:1032 > + if (escapeChar == -1 && !hasError(m_errorCode))
Seems a little odd that we are using different combinations of a -1 return value from tryConsumeUnicodeEscape() and m_errorCode to signal an error depending on the call site. Maybe you could make tryConsumeUnicodeEscape() a templatized function with the template parameter changing in what cases tryConsumeUnicodeEscape() sets m_errorCode.
Alexey Shvayka
Comment 8
2020-03-27 16:52:52 PDT
Created
attachment 394770
[details]
Patch Set reviewers, rename vars, bring back -1 check, drop redundant ASSERT, make tryConsumeUnicodeEscape() a template<>.
Alexey Shvayka
Comment 9
2020-03-27 17:16:54 PDT
(In reply to Michael Saboff from
comment #7
)
> Seems a little odd that we are using different combinations of a -1 return > value from tryConsumeUnicodeEscape() and m_errorCode to signal an error > depending on the call site. Maybe you could make tryConsumeUnicodeEscape() > a templatized function with the template parameter changing in what cases > tryConsumeUnicodeEscape() sets m_errorCode.
This is great improvement, thank you. If it is not too much to ask, I would appreciate your feedback on
https://bugs.webkit.org/show_bug.cgi?id=178175
.
Michael Saboff
Comment 10
2020-03-30 10:02:37 PDT
Comment on
attachment 394770
[details]
Patch r-me This looks great. The only change I recommend is changing the template parameter for tryConsumeUnicodeEscape() from a boolean to an enum. That way the template parameter at the call site describes the behavior. Just a suggestion, but use enum values something like AsCodePoint and AsIdentifierCharacter.
Michael Saboff
Comment 11
2020-03-30 10:16:11 PDT
Comment on
attachment 394770
[details]
Patch That is "r=me".
Michael Saboff
Comment 12
2020-03-30 10:30:09 PDT
(In reply to Alexey Shvayka from
comment #9
)
> (In reply to Michael Saboff from
comment #7
)
> ... If it is not too much to ask, I would > appreciate your feedback on
https://bugs.webkit.org/show_bug.cgi?id=178175
.
I looked at it earlier and didn't have any comments to add.
Alexey Shvayka
Comment 13
2020-03-30 17:31:19 PDT
Created
attachment 394992
[details]
Patch Introduce UnicodeEscapeContext.
Alexey Shvayka
Comment 14
2020-03-30 17:41:31 PDT
(In reply to Michael Saboff from
comment #10
)
> This looks great. The only change I recommend is changing the template > parameter for tryConsumeUnicodeEscape() from a boolean to an enum. That way > the template parameter at the call site describes the behavior. Just a > suggestion, but use enum values something like AsCodePoint and > AsIdentifierCharacter.
I went with UnicodeEscapeContext::{CharacterEscape,IdentifierName} because: a) As* usually points to a return type, while what we are passing to template<> is merely a parsing parameter that is used only in non-Unicode patterns; b) it was quite hard to pick a name that matched with As*, while UnicodeEscapeContext was a good fit.
Michael Saboff
Comment 15
2020-03-30 18:12:16 PDT
Comment on
attachment 394992
[details]
Patch r=me. I like it.
EWS
Comment 16
2020-03-30 18:27:17 PDT
Committed
r259262
: <
https://trac.webkit.org/changeset/259262
> All reviewed patches have been landed. Closing bug and clearing flags on
attachment 394992
[details]
.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug