| Summary: | [YARR] Allow for Unicode named capture group identifiers in non-Unicode regular expressions | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | WebKit | Reporter: | Michael Saboff <msaboff> | ||||||||
| Component: | JavaScriptCore | Assignee: | Michael Saboff <msaboff> | ||||||||
| Status: | RESOLVED FIXED | ||||||||||
| Severity: | Normal | CC: | ashvayka, ews-watchlist, keith_miller, mark.lam, ross.kirsling, saam, tzagallo, webkit-bug-importer | ||||||||
| Priority: | P2 | Keywords: | InRadar | ||||||||
| Version: | WebKit Nightly Build | ||||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Attachments: |
|
||||||||||
Created attachment 396024 [details]
Patch
Seems like a good informal review opportunity for Alexey. ;) (In reply to Ross Kirsling from comment #3) > Seems like a good informal review opportunity for Alexey. ;) Thank you, Ross, I appreciate it. (In reply to Michael Saboff from comment #0) > JavaScriptCore has a bug where it doesn’t even allow regex2a. It is also part of recent change: please see https://github.com/tc39/ecma262/pull/1869#issuecomment-607742063. V8 doesn’t allow regex2a as well, which is covered by JSTests/test262/test/language/literals/regexp/named-groups/invalid-non-id-continue-groupspecifier.js. Comment on attachment 396024 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=396024&action=review LGTM, a pure joy to informally review. > JSTests/ChangeLog:8 > + New test. We might want to skip failing & no longer correct tests as test262 wasn't updated for recent spec change: - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier-2.js - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-6.js - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-3.js - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier.js - test/language/literals/regexp/named-groups/invalid-non-id-continue-groupspecifier.js > JSTests/stress/regexp-named-capture-groups.js:16 > + shouldBe(String(error), "SyntaxError: Invalid regular expression: invalid group specifier name"); I really like error message check, and the whole test is crafted with love. Michael, would you object if I export this file to test262 with your & Apple Inc. credentials? > Source/JavaScriptCore/ChangeLog:8 > + Update YARR pattern processing to allow for non-BMP unicode identifier characters in named capture groups. Should we include spec PR link? https://github.com/tc39/ecma262/pull/1869 > Source/JavaScriptCore/yarr/YarrParser.h:543 > + if (U16_IS_LEAD(ch) && unicodePatternOrIdentifierName && (patternRemaining() > 0)) { Since we are changing this line, we can use `!atEndOfPattern()` equivalent instead of `(patternRemaining() > 0)`. Comment on attachment 396024 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=396024&action=review > Source/JavaScriptCore/yarr/YarrParser.h:537 > + template<UnicodeEscapeContext context> consumePossibleSurrogatePair() having notion of UnicodeEscapeContext does't feel right to me. Surrogate pairs are consumed by many productions, not only character escapes & identifiers. `bool inIdentifierName` seems to be a better fit (like we have `bool inCharacterClass`). (In reply to Alexey Shvayka from comment #5) > Comment on attachment 396024 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=396024&action=review > > LGTM, a pure joy to informally review. Thank you for taking a look. > > JSTests/ChangeLog:8 > > + New test. > > We might want to skip failing & no longer correct tests as test262 wasn't updated for recent spec change: > - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier-2.js > - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-6.js > - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-3.js > - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier.js > - test/language/literals/regexp/named-groups/invalid-non-id-continue-groupspecifier.js I updated JSTests/test262/expectations.yaml for the now failing tests. > > JSTests/stress/regexp-named-capture-groups.js:16 > > + shouldBe(String(error), "SyntaxError: Invalid regular expression: invalid group specifier name"); > > I really like error message check, and the whole test is crafted with love. > Michael, would you object if I export this file to test262 with your & Apple Inc. credentials? Fine by me. > > Source/JavaScriptCore/ChangeLog:8 > > + Update YARR pattern processing to allow for non-BMP unicode identifier characters in named capture groups. > > Should we include spec PR link? https://github.com/tc39/ecma262/pull/1869 > > > Source/JavaScriptCore/yarr/YarrParser.h:543 > > + if (U16_IS_LEAD(ch) && unicodePatternOrIdentifierName && (patternRemaining() > 0)) { > > Since we are changing this line, we can use `!atEndOfPattern()` equivalent > instead of `(patternRemaining() > 0)`. Done. > > Source/JavaScriptCore/yarr/YarrParser.h:537 > > + template<UnicodeEscapeContext context> > > > consumePossibleSurrogatePair() having notion of UnicodeEscapeContext does't feel right to me. > Surrogate pairs are consumed by many productions, not only character escapes & identifiers. > `bool inIdentifierName` seems to be a better fit (like we have `bool inCharacterClass`). In the two other places that consumePossibleSurrogatePair() is used, one is for individual characters and the other as part of a character class parsing, it seems appropriate to use UnicodeEscapeContext::CharacterEscape. The character class parsing uses characters as ends of a range or as individual to include in the class. Created attachment 396102 [details]
Patch with updates from review
(In reply to Michael Saboff from comment #7) > In the two other places that consumePossibleSurrogatePair() is used, one is > for individual characters and the other as part of a character class > parsing, it seems appropriate to use UnicodeEscapeContext::CharacterEscape. I might be putting a little different meaning into UnicodeEscapeContext: for me, it reads as "parent production of Unicode escape we are currently parsing", making it a bit awkward to use in `default` cases of parseTokens() and parseCharacterClass() as they don't parse any escape. If I'd seen just the signature of consumePossibleSurrogatePair(): template<UnicodeEscapeContext context> UChar32 consumePossibleSurrogatePair() I would have assume that it parses \uLEAD\uTRAIL escapes since it has UnicodeEscapeContext as a parameter, and not literal astral pair. (In reply to Alexey Shvayka from comment #9) > (In reply to Michael Saboff from comment #7) > > In the two other places that consumePossibleSurrogatePair() is used, one is > > for individual characters and the other as part of a character class > > parsing, it seems appropriate to use UnicodeEscapeContext::CharacterEscape. > > I might be putting a little different meaning into UnicodeEscapeContext: for > me, it reads as "parent production of Unicode escape we are currently > parsing", making it a bit awkward to use in `default` cases of parseTokens() > and parseCharacterClass() as they don't parse any escape. If we renamed consumePossibleSurrogatePair() to consumePossibleSurrogatePairOrCharacter() would that make things a little clearer? (In reply to Michael Saboff from comment #11) > If we renamed consumePossibleSurrogatePair() to > consumePossibleSurrogatePairOrCharacter() would that make things a little > clearer? Sorry, please disregard comment #10, consumePossibleSurrogatePair() is a good name and it surely needs Unicode escape context. I am not a huge fan of boolean parameters (https://ariya.io/2011/08/hall-of-api-shame-boolean-trap) myself, yet passing UnicodeEscapeContext::CharacterEscape in 2 call sites that are irrelevant to \unicode escapes is a bit misleading IMO. What if we pass UnicodeEscapeContext::None from those? (In reply to Alexey Shvayka from comment #12) > (In reply to Michael Saboff from comment #11) > > If we renamed consumePossibleSurrogatePair() to > > consumePossibleSurrogatePairOrCharacter() would that make things a little > > clearer? > > Sorry, please disregard comment #10, consumePossibleSurrogatePair() is a good name and it surely needs Unicode escape context. > > I am not a huge fan of boolean parameters (https://ariya.io/2011/08/hall-of-api-shame-boolean-trap) myself, yet passing UnicodeEscapeContext::CharacterEscape in 2 call sites that are irrelevant to \unicode escapes is a bit misleading IMO. What if we pass UnicodeEscapeContext::None from those? How about we rename the enum UnicodeEscapeContext to UnicodeParseContext with the two values PatternCodePoint and GroupName? Then we use UnicodeParseContext::PatternCodePoint and UnicodeParseContext::GroupName. Both of these values make sense in the current and new usage and are closer to the ECMAScript spec. Created attachment 396135 [details]
Patch with suggested naming changes and additional named back reference testing
(In reply to Michael Saboff from comment #14) > Created attachment 396135 [details] > Patch with suggested naming changes and additional named back reference > testing Looks perfect, thank you for taking time to address the feedback. Committed r260033: <https://trac.webkit.org/changeset/260033> |
During the March/April 2020 TC-39 meeting, it was agreed that named capture group identifiers can contain unicode escape characters even for non-unicode flagged regular expressions. This change is part of the EcmaScript 2020 draft standard that was approved by TC-39 at the same meeting and should be ratified by the Ecma General Assembly in June 2020. The current 2019 standard allows the following constructs for named capture group identifiers with non-BMP codepoints let regex1a = /(?<𝒜>A)/u; let regex1b = /(?<\u{1d49c}>A)/u; let regex1c = /(?<\ud835\udc9c>A)/u; let regex2a = /(?<𝒜>A)/; // no u flag But didn’t allow non-BMP unicode escapes in named capture group identifiers for non unicode regex’s let regex2b = /(?<\u{1d49c}>A)/; let regex2c = /(?<\ud835\udc9c>A)/; JavaScriptCore has a bug where it doesn’t even allow regex2a. This is to track fixing the JSC bug for regex2a, and adding the two other non-unicode forms to bring JSC into compliance with the 2020 standard.