Bug 216052 - Align ISO-2022-JP and Shift_JIS encodings with Chrome, Firefox, and the specification
Summary: Align ISO-2022-JP and Shift_JIS encodings with Chrome, Firefox, and the speci...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Alex Christensen
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2020-09-01 14:06 PDT by Alex Christensen
Modified: 2020-09-01 19:55 PDT (History)
2 users (show)

See Also:


Attachments
Patch (478.89 KB, patch)
2020-09-01 14:08 PDT, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (479.51 KB, patch)
2020-09-01 17:52 PDT, Alex Christensen
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alex Christensen 2020-09-01 14:06:27 PDT
Align ISO-2022-JP and Shift_JIS encodings with Chrome, Firefox, and the specification
Comment 1 Alex Christensen 2020-09-01 14:08:59 PDT
Created attachment 407712 [details]
Patch
Comment 2 Darin Adler 2020-09-01 15:05:43 PDT
Comment on attachment 407712 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=407712&action=review

> Source/WebCore/platform/text/TextCodecCJK.cpp:244
> +    auto characters = string.upconvertedCharacters();
> +    for (WTF::CodePointIterator<UChar> iterator(characters.get(), characters.get() + string.length()); !iterator.atEnd(); ++iterator) {

I didn’t notice this on previous passes. Why not just do this?

    for (auto codePoint : string.codePoints())

I don’t understand why it is better to use StringView::upconvertedCharacters and WTF::CodePointIterator<UChar> instead.

> Source/WebCore/platform/text/TextCodecCJK.cpp:262
> +    auto characters = string.upconvertedCharacters();
> +    for (WTF::CodePointIterator<UChar> iterator(characters.get(), characters.get() + string.length()); !iterator.atEnd(); ++iterator) {

Ditto.

> Source/WebCore/platform/text/TextCodecCJK.h:40
>      enum class Encoding : uint8_t {
>          EUC_JP,
> +        ISO2022JP,
> +        Shift_JIS,
>          Big5
>      };

I guess you don’t agree, but I find that "all on one line" nicer for simple enumerations like this one.

> Source/WebCore/platform/text/TextCodecCJK.h:60
> +    enum class ISO2022JPEncoderState : uint8_t {
> +        ASCII,
> +        Roman,
> +        jis0208
> +    };

Ditto.
Comment 3 Alex Christensen 2020-09-01 15:20:02 PDT
Looks like I've got a few regressions to fix still.  Will fix before landing.

(In reply to Darin Adler from comment #2)
> Comment on attachment 407712 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=407712&action=review
> 
> > Source/WebCore/platform/text/TextCodecCJK.cpp:244
> > +    auto characters = string.upconvertedCharacters();
> > +    for (WTF::CodePointIterator<UChar> iterator(characters.get(), characters.get() + string.length()); !iterator.atEnd(); ++iterator) {
> 
> I didn’t notice this on previous passes. Why not just do this?
> 
>     for (auto codePoint : string.codePoints())
> 
> I don’t understand why it is better to use StringView::upconvertedCharacters
> and WTF::CodePointIterator<UChar> instead.
StringView::CodePoints::Iterator has to check whether the StringView is 8-bit or 16-bit every time it increments to the next code unit.  It also stores a pointer to the StringView which has a pointer to the code units, so there is lots of dereferencing.  We should probably optimize it, but here CodePointIterator is faster.
StringView::upconvertedCharacters is probably free if you are using CJK encodings because the string almost certainly contains at least one non-Latin1 code point.

> > Source/WebCore/platform/text/TextCodecCJK.h:40
> >      enum class Encoding : uint8_t {
> >          EUC_JP,
> > +        ISO2022JP,
> > +        Shift_JIS,
> >          Big5
> >      };
> 
> I guess you don’t agree, but I find that "all on one line" nicer for simple
> enumerations like this one.
This one is being incrementally added to.  This allows the version control history to be nicer.

> > Source/WebCore/platform/text/TextCodecCJK.h:60
> > +    enum class ISO2022JPEncoderState : uint8_t {
> > +        ASCII,
> > +        Roman,
> > +        jis0208
> > +    };
> 
> Ditto.
This one should probably be all on one line because it is simple and finished.
Comment 4 Alex Christensen 2020-09-01 17:49:20 PDT
Comment on attachment 407712 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=407712&action=review

> Source/WebCore/platform/text/TextCodecCJK.cpp:164
> +    auto statefulUnencodableHandler = [&] (UChar32 codePoint, Vector<uint8_t>& result) {
> +        changeStateToASCIIIfNeeded();

Here I need to not switch the state to ASCII if the state is ISO2022JPEncoderState::Roman.  I'll make that change to not regress anything.

> Source/WebCore/platform/text/TextCodecCJK.cpp:169
> +        if ((m_iso2022JPEncoderState == ISO2022JPEncoderState::ASCII || m_iso2022JPEncoderState == ISO2022JPEncoderState::Roman) && (codePoint == 0x000E || codePoint == 0x000F || codePoint == 0x001B)) {

It looks like this line of the specification does not match what Chrome, Firefox, and Safari all do with these code points.  I'll remove this statement to match behavior of browsers.
Comment 5 Alex Christensen 2020-09-01 17:52:38 PDT
Created attachment 407720 [details]
Patch
Comment 6 EWS 2020-09-01 19:54:37 PDT
Committed r266448: <https://trac.webkit.org/changeset/266448>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 407720 [details].
Comment 7 Radar WebKit Bug Importer 2020-09-01 19:55:16 PDT
<rdar://problem/68182742>