WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
229393
Start teaching WidthIterator about character clusters
https://bugs.webkit.org/show_bug.cgi?id=229393
Summary
Start teaching WidthIterator about character clusters
Myles C. Maxfield
Reported
2021-08-23 01:31:50 PDT
Start teaching WidthIterator about character clusters
Attachments
Patch
(8.12 KB, patch)
2021-08-23 01:33 PDT
,
Myles C. Maxfield
no flags
Details
Formatted Diff
Diff
View All
Add attachment
proposed patch, testcase, etc.
Myles C. Maxfield
Comment 1
2021-08-23 01:33:32 PDT
Created
attachment 436165
[details]
Patch
Radar WebKit Bug Importer
Comment 2
2021-08-30 01:32:22 PDT
<
rdar://problem/82514688
>
Fujii Hironori
Comment 3
2021-09-10 00:29:20 PDT
Comment on
attachment 436165
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=436165&action=review
> Source/WebCore/ChangeLog:16 > + * platform/graphics/ComposedCharacterClusterTextIterator.h: Added.
Unicode is using the term "grapheme cluster". Is "grapheme cluster" a better word?
https://unicode.org/reports/tr29/
> Source/WebCore/platform/graphics/ComposedCharacterClusterTextIterator.h:46 > + bool consume(UChar32& character, unsigned& clusterLength)
Why does 'consume' return a single character?
Myles C. Maxfield
Comment 4
2021-09-10 02:10:30 PDT
Comment on
attachment 436165
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=436165&action=review
>> Source/WebCore/ChangeLog:16 >> + * platform/graphics/ComposedCharacterClusterTextIterator.h: Added. > > Unicode is using the term "grapheme cluster". Is "grapheme cluster" a better word? >
https://unicode.org/reports/tr29/
"Grapheme cluster" has a formal and rigorous definition defined by Unicode, but that definition doesn't actually always match what users view of as a conceptual "letter." It's close, but there are some tweaks that text engines use to better match user expectation. There's a more general, fuzzy term "composed character cluster" that matches this idea. It's similar to what css-text calls a "typographic letter unit" which has an intentionally fuzzy definition. The concept of a composed character cluster does not have a formal rigorous definition; representing it by a grapheme cluster is a good approximation, but many text engines go beyond this with their own custom tweaks.
>> Source/WebCore/platform/graphics/ComposedCharacterClusterTextIterator.h:46 >> + bool consume(UChar32& character, unsigned& clusterLength) > > Why does 'consume' return a single character?
The |UChar32& character| is an out param that returns the "base" character in the cluster. The "base" character is always the first character. For example, if the cluster is U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT (to form "é"), the "base" character is the "e". I believe the term "base character" in a cluster is a term defined by Unicode, but I'm not sure.
Fujii Hironori
Comment 5
2021-09-10 05:13:56 PDT
Comment on
attachment 436165
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=436165&action=review
>>> Source/WebCore/platform/graphics/ComposedCharacterClusterTextIterator.h:46 >>> + bool consume(UChar32& character, unsigned& clusterLength) >> >> Why does 'consume' return a single character? > > The |UChar32& character| is an out param that returns the "base" character in the cluster. The "base" character is always the first character. For example, if the cluster is U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT (to form "é"), the "base" character is the "e". I believe the term "base character" in a cluster is a term defined by Unicode, but I'm not sure.
'WidthIterator::advanceInternal' advances by 'advanceLength' which is 'clusterLength' returned by the 'consume' method.
https://github.com/WebKit/WebKit/blob/e6953d368b55b5daba4805cc9bbc9383d79e1375/Source/WebCore/platform/graphics/WidthIterator.cpp#L340
> textIterator.advance(advanceLength);
If the complex text code path is removed, I think only base characters of clusters are shown. For example, family emoji 👨👩👧👦 shows only the first person 👨. This is unhappy.
Myles C. Maxfield
Comment 6
2021-09-10 10:21:02 PDT
Comment on
attachment 436165
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=436165&action=review
>>>> Source/WebCore/platform/graphics/ComposedCharacterClusterTextIterator.h:46 >>>> + bool consume(UChar32& character, unsigned& clusterLength) >>> >>> Why does 'consume' return a single character? >> >> The |UChar32& character| is an out param that returns the "base" character in the cluster. The "base" character is always the first character. For example, if the cluster is U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT (to form "é"), the "base" character is the "e". I believe the term "base character" in a cluster is a term defined by Unicode, but I'm not sure. > > 'WidthIterator::advanceInternal' advances by 'advanceLength' which is 'clusterLength' returned by the 'consume' method. > >
https://github.com/WebKit/WebKit/blob/e6953d368b55b5daba4805cc9bbc9383d79e1375/Source/WebCore/platform/graphics/WidthIterator.cpp#L340
Right. The next patch I’m planning to write is to iterate across the whole cluster, instead of only considering the base glyph. Font fallback has to be done for the cluster as a whole, and then after font fallback is complete, each code point in the cluster gets mapped to a glyph in the (one) font. This patch is the first step in that process.
Myles C. Maxfield
Comment 7
2021-09-10 10:24:38 PDT
Comment on
attachment 436165
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=436165&action=review
>>>>> Source/WebCore/platform/graphics/ComposedCharacterClusterTextIterator.h:46 >>>>> + bool consume(UChar32& character, unsigned& clusterLength) >>>> >>>> Why does 'consume' return a single character? >>> >>> The |UChar32& character| is an out param that returns the "base" character in the cluster. The "base" character is always the first character. For example, if the cluster is U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT (to form "é"), the "base" character is the "e". I believe the term "base character" in a cluster is a term defined by Unicode, but I'm not sure. >> >> 'WidthIterator::advanceInternal' advances by 'advanceLength' which is 'clusterLength' returned by the 'consume' method. >> >>
https://github.com/WebKit/WebKit/blob/e6953d368b55b5daba4805cc9bbc9383d79e1375/Source/WebCore/platform/graphics/WidthIterator.cpp#L340
> > Right. The next patch I’m planning to write is to iterate across the whole cluster, instead of only considering the base glyph. Font fallback has to be done for the cluster as a whole, and then after font fallback is complete, each code point in the cluster gets mapped to a glyph in the (one) font. This patch is the first step in that process.
(That’s why this new code is behind a USE(CLUSTER_AWARE_WIDTH_ITERATOR) which is off by default. Additional patches will be needed before we can flip the switch and enable the functionality. Also, I’m confident there will need to be performance work to make this cluster stuff as fast as the current behavior.)
EWS
Comment 8
2021-09-10 17:38:00 PDT
Committed
r282304
(
241576@main
): <
https://commits.webkit.org/241576@main
> All reviewed patches have been landed. Closing bug and clearing flags on
attachment 436165
[details]
.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug