WebKit Bugzilla
New
Browse
Search+
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
NEW
211476
TextManipulationController should merge pieces from the same token
https://bugs.webkit.org/show_bug.cgi?id=211476
Summary
TextManipulationController should merge pieces from the same token
Sihui Liu
Reported
2020-05-05 16:15:39 PDT
There are cases where one token is split into multiple tokens with the same token identifier when completing text manipulation. In
https://bugs.webkit.org/show_bug.cgi?id=210750
, Wenson fixed that for title and option element, by merging adjacent tokens with same identifier into one token. We are seeing other elements having the same issue, so extend that to all elements.
Attachments
Patch
(12.76 KB, patch)
2020-05-05 16:29 PDT
,
Sihui Liu
no flags
Details
Formatted Diff
Diff
Patch
(12.71 KB, patch)
2020-05-05 17:55 PDT
,
Sihui Liu
no flags
Details
Formatted Diff
Diff
Show Obsolete
(2)
View All
Add attachment
proposed patch, testcase, etc.
Sihui Liu
Comment 1
2020-05-05 16:29:40 PDT
Created
attachment 398565
[details]
Patch
Sihui Liu
Comment 2
2020-05-05 17:55:19 PDT
Created
attachment 398574
[details]
Patch
Ryosuke Niwa
Comment 3
2020-05-06 15:27:00 PDT
Comment on
attachment 398574
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=398574&action=review
> Source/WebCore/editing/TextManipulationController.cpp:564 > + if (replacementTokens.isEmpty()) > + return ManipulationFailureType::InvalidToken;
This seems like untested code. Also, it's incorrect to say we had an invalid token just because the replacement is empty.
> Source/WebCore/editing/TextManipulationController.cpp:572 > + if (!token.content.isEmpty() && !currentToken.content.isEmpty()) > + currentToken.content.append(' ');
This doesn't seem right. Why should we always assume to insert a space between two tokens? That may be correct in Latin languages but definitely not in Chinese or Japanese.
Sihui Liu
Comment 4
2020-05-06 15:57:30 PDT
Comment on
attachment 398574
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=398574&action=review
>> Source/WebCore/editing/TextManipulationController.cpp:564 >> + return ManipulationFailureType::InvalidToken; > > This seems like untested code. Also, it's incorrect to say we had an invalid token just because the replacement is empty.
We use InvalidToken when replacementTokens.size() > 1 so I think the InvalidToken means invalid replacement tokens too. Or should I use ManipulationFailureType::InvalidItem?
>> Source/WebCore/editing/TextManipulationController.cpp:572 >> + currentToken.content.append(' '); > > This doesn't seem right. Why should we always assume to insert a space between two tokens? > That may be correct in Latin languages but definitely not in Chinese or Japanese.
I am following the pattern of
http://trac.webkit.org/changeset/260393/webkit
. I can remove it.
Ryosuke Niwa
Comment 5
2020-05-06 16:32:25 PDT
Comment on
attachment 398574
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=398574&action=review
>>> Source/WebCore/editing/TextManipulationController.cpp:564 >>> + return ManipulationFailureType::InvalidToken; >> >> This seems like untested code. Also, it's incorrect to say we had an invalid token just because the replacement is empty. > > We use InvalidToken when replacementTokens.size() > 1 so I think the InvalidToken means invalid replacement tokens too. Or should I use ManipulationFailureType::InvalidItem?
Well, we use that because in the case of a single element replacement, we can't possibly have multiple tokens of different identifiers.
>>> Source/WebCore/editing/TextManipulationController.cpp:572 >>> + currentToken.content.append(' '); >> >> This doesn't seem right. Why should we always assume to insert a space between two tokens? >> That may be correct in Latin languages but definitely not in Chinese or Japanese. > > I am following the pattern of
http://trac.webkit.org/changeset/260393/webkit
. I can remove it.
I don't know what the story of that change log is (perhaps you should talk to Wenson) but I don't think always inserting a whitespace is right even if it worked for some languages. This is something whatever application / framework using WebKit should take care of.
Wenson Hsieh
Comment 6
2020-05-06 16:44:38 PDT
Comment on
attachment 398574
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=398574&action=review
>>>> Source/WebCore/editing/TextManipulationController.cpp:572 >>>> + currentToken.content.append(' '); >>> >>> This doesn't seem right. Why should we always assume to insert a space between two tokens? >>> That may be correct in Latin languages but definitely not in Chinese or Japanese. >> >> I am following the pattern of
http://trac.webkit.org/changeset/260393/webkit
. I can remove it. > > I don't know what the story of that change log is (perhaps you should talk to Wenson) > but I don't think always inserting a whitespace is right even if it worked for some languages. > This is something whatever application / framework using WebKit should take care of.
This was admittedly a hack to deal with how the relevant system framework /sometimes/ splits results into multiple tokens around spaces, and then removes the space from each of the tokens that it split into. For instance, a single token “foo bar” might be returned as two tokens “foo” and “bar”. I hadn’t tested with CJK, and if the relevant system framework splits a single word (“foo") into multiple tokens (“fo” “o”), then the logic I added to try and stitch the tokens back together is indeed wrong :/
Ryosuke Niwa
Comment 7
2020-05-06 16:55:46 PDT
(In reply to Wenson Hsieh from
comment #6
)
> Comment on
attachment 398574
[details]
> Patch > > View in context: >
https://bugs.webkit.org/attachment.cgi?id=398574&action=review
> > >>>> Source/WebCore/editing/TextManipulationController.cpp:572 > >>>> + currentToken.content.append(' '); > >>> > >>> This doesn't seem right. Why should we always assume to insert a space between two tokens? > >>> That may be correct in Latin languages but definitely not in Chinese or Japanese. > >> > >> I am following the pattern of
http://trac.webkit.org/changeset/260393/webkit
. I can remove it. > > > > I don't know what the story of that change log is (perhaps you should talk to Wenson) > > but I don't think always inserting a whitespace is right even if it worked for some languages. > > This is something whatever application / framework using WebKit should take care of. > > This was admittedly a hack to deal with how the relevant system framework > /sometimes/ splits results into multiple tokens around spaces, and then > removes the space from each of the tokens that it split into. For instance, > a single token “foo bar” might be returned as two tokens “foo” and “bar”. > > I hadn’t tested with CJK, and if the relevant system framework splits a > single word (“foo") into multiple tokens (“fo” “o”), then the logic I added > to try and stitch the tokens back together is indeed wrong :/
I mean... we really don't want WebKit to be doing these kinds of text manipulations. We don't want to be in the business of detecting what kind of language the text is in in insert or not insert a whitespace based on that...
Sihui Liu
Comment 8
2020-05-07 09:08:10 PDT
(In reply to Ryosuke Niwa from
comment #7
)
> (In reply to Wenson Hsieh from
comment #6
) > > Comment on
attachment 398574
[details]
> > Patch > > > > View in context: > >
https://bugs.webkit.org/attachment.cgi?id=398574&action=review
> > > > >>>> Source/WebCore/editing/TextManipulationController.cpp:572 > > >>>> + currentToken.content.append(' '); > > >>> > > >>> This doesn't seem right. Why should we always assume to insert a space between two tokens? > > >>> That may be correct in Latin languages but definitely not in Chinese or Japanese. > > >> > > >> I am following the pattern of
http://trac.webkit.org/changeset/260393/webkit
. I can remove it. > > > > > > I don't know what the story of that change log is (perhaps you should talk to Wenson) > > > but I don't think always inserting a whitespace is right even if it worked for some languages. > > > This is something whatever application / framework using WebKit should take care of. > > > > This was admittedly a hack to deal with how the relevant system framework > > /sometimes/ splits results into multiple tokens around spaces, and then > > removes the space from each of the tokens that it split into. For instance, > > a single token “foo bar” might be returned as two tokens “foo” and “bar”. > > > > I hadn’t tested with CJK, and if the relevant system framework splits a > > single word (“foo") into multiple tokens (“fo” “o”), then the logic I added > > to try and stitch the tokens back together is indeed wrong :/ > > I mean... we really don't want WebKit to be doing these kinds of text > manipulations. We don't want to be in the business of detecting what kind of > language the text is in in insert or not insert a whitespace based on that...
I agree with this. I am confused by this behavior too. But currently we don't have clear rules for replacement token. Once we decide that (e.g. replacement token must have a 1-to-n mapping to tokens in original item.), we can communicate with API clients and ask them to change. Before that, we need some hacks to make things work.
Ryosuke Niwa
Comment 9
2020-05-07 10:24:19 PDT
(In reply to Sihui Liu from
comment #8
)
> > I mean... we really don't want WebKit to be doing these kinds of text > > manipulations. We don't want to be in the business of detecting what kind of > > language the text is in in insert or not insert a whitespace based on that... > > I agree with this. I am confused by this behavior too. > But currently we don't have clear rules for replacement token. Once we > decide that (e.g. replacement token must have a 1-to-n mapping to tokens in > original item.), we can communicate with API clients and ask them to change. > Before that, we need some hacks to make things work.
That’s fine but whatever clients relying on this behavior should fix their code before this stuff ships. We don’t want to be maintaining half broken code for years to come.
Wenson Hsieh
Comment 10
2020-05-07 10:28:39 PDT
(In reply to Ryosuke Niwa from
comment #9
)
> (In reply to Sihui Liu from
comment #8
) > > > I mean... we really don't want WebKit to be doing these kinds of text > > > manipulations. We don't want to be in the business of detecting what kind of > > > language the text is in in insert or not insert a whitespace based on that... > > > > I agree with this. I am confused by this behavior too. > > But currently we don't have clear rules for replacement token. Once we > > decide that (e.g. replacement token must have a 1-to-n mapping to tokens in > > original item.), we can communicate with API clients and ask them to change. > > Before that, we need some hacks to make things work. > > That’s fine but whatever clients relying on this behavior should fix their > code before this stuff ships. We don’t want to be maintaining half broken > code for years to come.
Agreed. Maybe we should add a FIXME here explaining that this “stitching” code is a temporary workaround until the relevant client fixes their behavior (for instance, by sending back at most 1 result token, if it is given an item with a single token).
Ryosuke Niwa
Comment 11
2020-05-07 13:39:33 PDT
(In reply to Wenson Hsieh from
comment #10
)
> (In reply to Ryosuke Niwa from
comment #9
) > > (In reply to Sihui Liu from
comment #8
) > > > > I mean... we really don't want WebKit to be doing these kinds of text > > > > manipulations. We don't want to be in the business of detecting what kind of > > > > language the text is in in insert or not insert a whitespace based on that... > > > > > > I agree with this. I am confused by this behavior too. > > > But currently we don't have clear rules for replacement token. Once we > > > decide that (e.g. replacement token must have a 1-to-n mapping to tokens in > > > original item.), we can communicate with API clients and ask them to change. > > > Before that, we need some hacks to make things work. > > > > That’s fine but whatever clients relying on this behavior should fix their > > code before this stuff ships. We don’t want to be maintaining half broken > > code for years to come. > > Agreed. Maybe we should add a FIXME here explaining that this “stitching” > code is a temporary workaround until the relevant client fixes their > behavior (for instance, by sending back at most 1 result token, if it is > given an item with a single token).
I don’t see why clients can’t give multiple consecutive tokens with the same identifier. That’s okay. What’s not okay is for those clients to rely on WebKit to insert whitespaces. That’s simply not how this API was ever designed to work. White space and other text manipulations have to happen in the clients, not WebKit.
Ryosuke Niwa
Comment 12
2020-05-07 13:43:27 PDT
By the way, there is a related case where we’d currently strip away whitespaces. Namely, whitespaces between elements’s open/start tags are supposed to be rendered. With the last algorithm, unfortunately, some of those whitespaces can be stripped away. If you’re investigating bugs in this area, that’s something else to pay attention to. P.S. I’m not trying to review this patch so if anyone can review it, please go ahead. I was just leaving a drive-by comment.
Ryosuke Niwa
Comment 13
2020-05-07 13:45:55 PDT
(In reply to Ryosuke Niwa from
comment #12
)
> By the way, there is a related case where we’d currently strip away > whitespaces. Namely, whitespaces between elements’s open/start tags
And between two end/close tags. See CSS2.1’s appendix where they talk about whitespaces collapsing. Or you can ask Antti / Zalan / Simon about that too :)
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug