RESOLVED FIXED 178174
Add support in named capture group identifiers for direct surrogate pairs
https://bugs.webkit.org/show_bug.cgi?id=178174
Summary Add support in named capture group identifiers for direct surrogate pairs
Michael Saboff
Reported 2017-10-11 11:24:28 PDT
Currently, a named capture group can have a backslash escaped RegExpUnicodeEscapeSequence (e.g. \{12345}). It also looks like it should take approriate unicode characters directly, including those made up of surrogate pairs. Failing Test262 - JSTests/test262/test/built-ins/RegExp/named-groups/unicode-property-names.js Exception: SyntaxError: Invalid regular expression: invalid group specifier name at JSTests/test262/test/built-ins/RegExp/named-groups/unicode-property-names.js:16  This test is a UTF-8 source file with embedded non-BMP Unicode characters as both named group identifiers and property names. I made a JS source file with a couple of the Unicode properties from the original test file, and the jsc command throws SyntaxError: Invalid character '\ud801' See attached test source file. I think we need the jsc command to handle UTF-8 source files before we can fix this bug.
Attachments
Patch (8.48 KB, patch)
2020-03-23 09:33 PDT, Alexey Shvayka
no flags
Patch (24.44 KB, patch)
2020-03-26 20:28 PDT, Alexey Shvayka
no flags
Patch (24.52 KB, patch)
2020-03-27 16:52 PDT, Alexey Shvayka
no flags
Patch (24.80 KB, patch)
2020-03-30 17:31 PDT, Alexey Shvayka
no flags
Michael Saboff
Comment 1 2017-10-11 11:25:09 PDT
Alexey Shvayka
Comment 2 2020-03-23 09:33:09 PDT
Alexey Shvayka
Comment 3 2020-03-26 20:28:24 PDT
Alexey Shvayka
Comment 4 2020-03-26 20:41:35 PDT
While this patch is functionally the same, I've: a) Moved isIdentityEscapeAnError() out of tryConsumeUnicodeEscape() as RegExpUnicodeEscapeSequence [1] does not contain IdentityEscape [2] production. b) Tweaked error messages to report invalid \u escapes instead of identity escapes (which are Annex B), matching other error messages in other engines. c) Added a few error messages tests. I am kindly asking for re-review. Thank you. [1]: https://tc39.es/ecma262/#prod-RegExpUnicodeEscapeSequence [2]: https://tc39.es/ecma262/#prod-IdentityEscape
Darin Adler
Comment 5 2020-03-27 09:59:59 PDT
Comment on attachment 394700 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=394700&action=review This looks fine to me. Michael may also want to review. The key is really test coverage including performance test results, not necessarily exactly how we write the code > Source/JavaScriptCore/yarr/YarrParser.h:493 > + int escapeChar = tryConsumeUnicodeEscape(); Normally we don’t use abbreviations like "char" in WebKit code, preferring words like "character". So this local variable would be either named "escape" or "escapeCharacter" or "escaped" or "escapedCharacter" or "codePoint". > Source/JavaScriptCore/yarr/YarrParser.h:-1044 > - if (u == -1) > - return -1; What’s the rationale for removing this? I don’t think it’s great to rely on what U16_IS_LEAD will return when passed something that is not a 16-bit code unit. So I would have left it in. > Source/JavaScriptCore/yarr/YarrParser.h:1031 > + int escapeChar = tryConsumeUnicodeEscape(); Same comment about variable naming here. Names like "ch" and "u" are definitely frowned upon in our general coding style, but in a way "escapeChar" is half-way there. Our underlying concept is that people are better at reading words than abbreviations.
Darin Adler
Comment 6 2020-03-27 10:02:30 PDT
Comment on attachment 394700 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=394700&action=review >> Source/JavaScriptCore/yarr/YarrParser.h:-1044 >> - return -1; > > What’s the rationale for removing this? I don’t think it’s great to rely on what U16_IS_LEAD will return when passed something that is not a 16-bit code unit. So I would have left it in. Or we could remove the code, but add this: // <... some amazing comment explaining why this assertion is relevant ...> static_assert(!U16_IS_LEAD(-1));
Michael Saboff
Comment 7 2020-03-27 13:14:24 PDT
Comment on attachment 394700 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=394700&action=review >> Source/JavaScriptCore/yarr/YarrParser.h:493 >> + int escapeChar = tryConsumeUnicodeEscape(); > > Normally we don’t use abbreviations like "char" in WebKit code, preferring words like "character". So this local variable would be either named "escape" or "escapeCharacter" or "escaped" or "escapedCharacter" or "codePoint". I would prefer codePoint as that is what a Unicode escape represents. > Source/JavaScriptCore/yarr/YarrParser.h:999 > + ASSERT(!hasError(m_errorCode)); Isn't this a redundant ASSERT() given that we have the same assert at the beginning of the function and everywhere we set m_errorCode here we return? > Source/JavaScriptCore/yarr/YarrParser.h:1032 > + if (escapeChar == -1 && !hasError(m_errorCode)) Seems a little odd that we are using different combinations of a -1 return value from tryConsumeUnicodeEscape() and m_errorCode to signal an error depending on the call site. Maybe you could make tryConsumeUnicodeEscape() a templatized function with the template parameter changing in what cases tryConsumeUnicodeEscape() sets m_errorCode.
Alexey Shvayka
Comment 8 2020-03-27 16:52:52 PDT
Created attachment 394770 [details] Patch Set reviewers, rename vars, bring back -1 check, drop redundant ASSERT, make tryConsumeUnicodeEscape() a template<>.
Alexey Shvayka
Comment 9 2020-03-27 17:16:54 PDT
(In reply to Michael Saboff from comment #7) > Seems a little odd that we are using different combinations of a -1 return > value from tryConsumeUnicodeEscape() and m_errorCode to signal an error > depending on the call site. Maybe you could make tryConsumeUnicodeEscape() > a templatized function with the template parameter changing in what cases > tryConsumeUnicodeEscape() sets m_errorCode. This is great improvement, thank you. If it is not too much to ask, I would appreciate your feedback on https://bugs.webkit.org/show_bug.cgi?id=178175.
Michael Saboff
Comment 10 2020-03-30 10:02:37 PDT
Comment on attachment 394770 [details] Patch r-me This looks great. The only change I recommend is changing the template parameter for tryConsumeUnicodeEscape() from a boolean to an enum. That way the template parameter at the call site describes the behavior. Just a suggestion, but use enum values something like AsCodePoint and AsIdentifierCharacter.
Michael Saboff
Comment 11 2020-03-30 10:16:11 PDT
Comment on attachment 394770 [details] Patch That is "r=me".
Michael Saboff
Comment 12 2020-03-30 10:30:09 PDT
(In reply to Alexey Shvayka from comment #9) > (In reply to Michael Saboff from comment #7) > ... If it is not too much to ask, I would > appreciate your feedback on https://bugs.webkit.org/show_bug.cgi?id=178175. I looked at it earlier and didn't have any comments to add.
Alexey Shvayka
Comment 13 2020-03-30 17:31:19 PDT
Created attachment 394992 [details] Patch Introduce UnicodeEscapeContext.
Alexey Shvayka
Comment 14 2020-03-30 17:41:31 PDT
(In reply to Michael Saboff from comment #10) > This looks great. The only change I recommend is changing the template > parameter for tryConsumeUnicodeEscape() from a boolean to an enum. That way > the template parameter at the call site describes the behavior. Just a > suggestion, but use enum values something like AsCodePoint and > AsIdentifierCharacter. I went with UnicodeEscapeContext::{CharacterEscape,IdentifierName} because: a) As* usually points to a return type, while what we are passing to template<> is merely a parsing parameter that is used only in non-Unicode patterns; b) it was quite hard to pick a name that matched with As*, while UnicodeEscapeContext was a good fit.
Michael Saboff
Comment 15 2020-03-30 18:12:16 PDT
Comment on attachment 394992 [details] Patch r=me. I like it.
EWS
Comment 16 2020-03-30 18:27:17 PDT
Committed r259262: <https://trac.webkit.org/changeset/259262> All reviewed patches have been landed. Closing bug and clearing flags on attachment 394992 [details].
Note You need to log in before you can comment on or make changes to this bug.