WebKit Bugzilla
New
Browse
Search+
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
60786
[Qt] Better complex language support: Switch to ICU library (libicu)
https://bugs.webkit.org/show_bug.cgi?id=60786
Summary
[Qt] Better complex language support: Switch to ICU library (libicu)
Siddharth Mathur
Reported
2011-05-13 12:25:36 PDT
As Qt WebKit based applications and browsers reach more corners of the globe (which speak "complex"/non-European scripts), the large number of text encoding and international domain names (IDN) related failing tests in Qt's Skipped list are painful to watch. I did a quick switch from QT4_UNICODE -> ICU_UNICODE and turned on CONFIG+=text_breaking_with_icu to maximize use of the ICU library, and the following list of Skipped tests .. fast/encoding/mailto-always-utf-8.html fast/encoding/yentest.html fast/encoding/yentest2.html fast/encoding/char-encoding.html fast/encoding/idn-security.html fast/encoding/xmacroman-encoding-test.html fast/encoding/GBK/EUC-CN.html fast/encoding/GBK/chinese.html fast/encoding/GBK/cn-gb.html fast/encoding/GBK/csgb2312.html fast/encoding/GBK/csgb231280.html fast/encoding/GBK/gb2312.html fast/encoding/GBK/gb_2312-80.html fast/encoding/GBK/gbk.html fast/encoding/GBK/iso-ir-58.html fast/encoding/GBK/x-euc-cn.html fast/encoding/GBK/x-gbk.html fast/encoding/char-encoding-mac.html fast/encoding/hebrew/8859-8-e.html fast/encoding/hebrew/8859-8-i.html fast/encoding/hebrew/logical.html fast/encoding/char-decoding.html fast/encoding/frame-default-enc.html fast/encoding/invalid-xml.html fast/encoding/char-decoding-mac.html fast/encoding/hebrew/csISO88598I.html fast/encoding/denormalised-voiced-japanese-chars.html fast/encoding/invalid-UTF-8.html ... shrinks to: fast/encoding/denormalised-voiced-japanese-chars.html expected actual diff pretty diff fast/encoding/idn-security.html expected actual diff pretty diff fast/encoding/invalid-xml.html expected actual diff pretty diff fast/encoding/mailto-always-utf-8.html I am also adding some bugs in the "Blocks" list which I think are solvable by this migration. Some may be dups of the above test failures.
Attachments
Experimental patch for others to spin a build
(5.27 KB, patch)
2011-05-13 12:47 PDT
,
Siddharth Mathur
no flags
Details
Formatted Diff
Diff
updated patch
(8.10 KB, patch)
2011-05-23 13:16 PDT
,
Siddharth Mathur
no flags
Details
Formatted Diff
Diff
Better patch
(10.94 KB, patch)
2011-05-25 11:40 PDT
,
Siddharth Mathur
no flags
Details
Formatted Diff
Diff
Patch
(10.76 KB, patch)
2011-05-25 13:58 PDT
,
Siddharth Mathur
no flags
Details
Formatted Diff
Diff
ready for review
(10.70 KB, patch)
2011-05-26 07:47 PDT
,
Siddharth Mathur
no flags
Details
Formatted Diff
Diff
Rebaselined patch
(10.70 KB, patch)
2011-05-26 09:10 PDT
,
Siddharth Mathur
no flags
Details
Formatted Diff
Diff
the perfect patch (pun intended)
(10.70 KB, patch)
2011-05-26 09:42 PDT
,
Siddharth Mathur
benjamin
: review-
Details
Formatted Diff
Diff
Benchmark test to time layout
(33.37 KB, patch)
2011-05-26 13:48 PDT
,
Siddharth Mathur
no flags
Details
Formatted Diff
Diff
review comments incorporated
(10.72 KB, patch)
2011-05-31 15:09 PDT
,
Siddharth Mathur
no flags
Details
Formatted Diff
Diff
Show Obsolete
(7)
View All
Add attachment
proposed patch, testcase, etc.
Chang Shu
Comment 1
2011-05-13 12:28:41 PDT
This should also help some editing tests.
Siddharth Mathur
Comment 2
2011-05-13 12:47:10 PDT
Created
attachment 93495
[details]
Experimental patch for others to spin a build
Benjamin Poulain
Comment 3
2011-05-15 07:23:52 PDT
Uh. You might want to talk with the release guys...Because last few dozen of times this was suggested, shipping ICU with Qt was totally out of the question.
Siddharth Mathur
Comment 4
2011-05-23 13:16:20 PDT
Created
attachment 94470
[details]
updated patch This experimental patch: - removes the "use ICU for text breaking only" option - consolidates all Qt WebKit use of ICU behind the existing existing USE(ICU_UNICODE) flag - adds a "use_system_icu" qmake CONFIG parameter to override default text handling (Qt's Unicode) - adds a new TextBreakIteratorInternalICUQt.cpp similar to other ports which use ICU (and move out a function from TextBreakIteratorQt.cpp) - guards TextCodecQt.cpp to be only compiled for USE(QT4_UNICODE) configuration Pass CONFIG+=use_system_icu to build-webkit to use this.
Siddharth Mathur
Comment 5
2011-05-25 11:40:57 PDT
Created
attachment 94818
[details]
Better patch
Siddharth Mathur
Comment 6
2011-05-25 13:58:17 PDT
Created
attachment 94851
[details]
Patch
Siddharth Mathur
Comment 7
2011-05-26 07:47:39 PDT
Created
attachment 94971
[details]
ready for review Please note that the default is still the "use Qt's unicode" . I am not proposing to change that in this patch.
Siddharth Mathur
Comment 8
2011-05-26 09:10:02 PDT
Created
attachment 94985
[details]
Rebaselined patch
WebKit Review Bot
Comment 9
2011-05-26 09:12:29 PDT
Attachment 94985
[details]
did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1 Source/WebCore/platform/text/qt/TextBreakIteratorInternalICUQt.cpp:39: Extra space between return and cachedSystemLocale [whitespace/declaration] [3] Total errors found: 1 in 10 files If any of these errors are false positives, please file a bug against check-webkit-style.
Siddharth Mathur
Comment 10
2011-05-26 09:42:18 PDT
Created
attachment 94990
[details]
the perfect patch (pun intended)
Siddharth Mathur
Comment 11
2011-05-26 13:48:49 PDT
Created
attachment 95030
[details]
Benchmark test to time layout Also attached is a speed test I wrote for timing layout of a Russian and a Hindi all-text page. When run 100 times (tst_painting complexScript -iterations 100), the benchmark shows almost no timing differences between the QT4_UNICODE and ICU_UNICODE builds of QtWebKit. Here is what is being timed. Is this correct? QBENCHMARK { QWebView* view = new QWebView; QWebFrame* frame = view->page()->mainFrame(); frame->setContent(data); // Layout if one is required. frame->toPlainText(); delete view; }
Benjamin Poulain
Comment 12
2011-05-29 08:16:47 PDT
Comment on
attachment 94990
[details]
the perfect patch (pun intended) View in context:
https://bugs.webkit.org/attachment.cgi?id=94990&action=review
> Source/WebCore/platform/text/qt/TextBreakIteratorInternalICUQt.cpp:30 > +Q_GLOBAL_STATIC_WITH_INITIALIZER(QByteArray, cachedSystemLocale, {*x = QLocale::system().name().toAscii();})
toAscii() is almost never correct in library context. It is not correct here. The function return the string encoded with the default encoding, not the ascii encoding. You should use toLatin1() like the original code. This caching is a change of behavior since we no longer use the system local when it changes. So I become curious: how important is the gain?
> Source/WebCore/platform/text/qt/TextBreakIteratorInternalICUQt.cpp:39 > + return cachedSystemLocale()->data(); > +} > + > +const char* currentTextBreakLocaleID() > +{ > + return cachedSystemLocale()->data();
You should use constData, otherwise the implementation have to detach() if needed.
Siddharth Mathur
Comment 13
2011-05-31 15:08:11 PDT
(In reply to
comment #12
)
> (From update of
attachment 94990
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=94990&action=review
> > > Source/WebCore/platform/text/qt/TextBreakIteratorInternalICUQt.cpp:30 > > +Q_GLOBAL_STATIC_WITH_INITIALIZER(QByteArray, cachedSystemLocale, {*x = QLocale::system().name().toAscii();}) > > toAscii() is almost never correct in library context. It is not correct here. > The function return the string encoded with the default encoding, not the ascii encoding. You should use toLatin1() like the original code. >
Thanks for the review. I went back to Latin1() in upcoming patch.
> This caching is a change of behavior since we no longer use the system local when it changes. So I become curious: how important is the gain?
I am not shooting for any gains here at all. I noted that the old code passes a pointer managed by a transient stack object to the caller, hence causing a invalid read in my Valgrind tests, so I changed it to use a global static so that a valid memory location is always passed around. And Q_GLOBAL_STATIC_WITH_INITIALIZER() provides a clean way to do that.
> > + > > +const char* currentTextBreakLocaleID() > > +{ > > + return cachedSystemLocale()->data(); > > You should use constData, otherwise the implementation have to detach() if needed.
Fixed to use constData() in the upcoming patch. Thanks!
Siddharth Mathur
Comment 14
2011-05-31 15:09:02 PDT
Created
attachment 95491
[details]
review comments incorporated
Simon Hausmann
Comment 15
2011-05-31 15:34:19 PDT
I can see the motivation of using ICU, but I'd like to raise the point that this introduces fragmentation between different platforms on what I would call rather basic behaviour. Note that in the past we have done changes in Qt to move it closer to ICU behavior, for example commit e6ac173991223dbf3b1b6f7213550ebca4608cb6 in Qt 4.7. I'm pretty sure that Lars would welcome similar changes in Qt to match the behaviour, given WebKit's dominant role in content rendering these days and our desire to make Qt applications and QtWebKit match other browsers where it makes sense. I don't think we should blindly switch without knowing the exact impact on behaviour and how we can maximize the deployment/availability of ICU for people using QtWebKit. Imagine bug reports coming in about line breaking and us having to guess if QtWebKit was built with or without ICU support. I think that can become a maintenance burden.
Simon Hausmann
Comment 16
2011-05-31 15:35:44 PDT
On the topic of locales, please also see recent improvements in QLocale. Perhaps they will have a positive impact on the layout test results (I see locale specific stuff in the patch)
http://labs.qt.nokia.com/2011/01/03/qlocale-its-about-time-and-dates-and-languages-and/
Simon Hausmann
Comment 17
2011-05-31 15:42:17 PDT
(In reply to
comment #16
)
> On the topic of locales, please also see recent improvements in QLocale. Perhaps they will have a positive impact on the layout test results (I see locale specific stuff in the patch) > >
http://labs.qt.nokia.com/2011/01/03/qlocale-its-about-time-and-dates-and-languages-and/
Sorry, that link wasn't correct. However there has been a fair amount of work in QLocale in Qt 4.8, including ICU support for collation/toLower/toUpper. It would be good to know the exact shortcomings in Qt that explain the individual test failures we see.
Siddharth Mathur
Comment 18
2011-06-01 10:09:36 PDT
Simon, thanks for your comments :) A minor note that the above patch does _not_ switch the default build option Qt WebKit, but provides the ability to do so with an CONFIG flag. I agree that as far as Qt WebKit as a portable framework is concerned, I would personally consider it a nightmare if I were to triage and fix lots of bug reports from multiple OSes using different configurations. However, my admittedly limited experience with Qt WebKit bug inflow suggests that only a small number of bugs originate from Windows or Linux QtWebKit API users who are _also_ looking for functional exactness between OSes. For example, a KDE developer may wish for quality on desktop Linux with standards support on par with Chrome, Safari etc. I don't think think he/she compares a QtWebKit Windows build with a QtWebKit Linux build? My understanding is that both desktop Linux and Windows users of Qt and Qt WebKit ship their own shared objects with custom bug fixes applied and build flags toggled, so they accept the cost of being "on their own". Nokia's mobile end-users and developers follow a different trajectory as Nokia internally absorbs the majority of costs of maturing Meego or Symbian QtWebKit binaries that are pre-loaded on devices. To simplify matters, Symbian is certainly becoming a less important platform for future trunk deliveries. On the other hands, the following costs _are_ significant IMHO: - cost to current Qt WebKit developers of maintaining a list of dozens of Skipped tests for basic functionality. - cost to Apple, Google and other non Qt contributors of keeping Qt bots happy - cost to Nokia of shipping frameworks or browsers in 2011 that can't render Thai, Arabic, Hebrew or Hindi correctly. - cost to mankind of using a web that doesn't speak their language correctly. More specifically to your points: a) Is this the Qt tracker bug for QLocale and collation work that you were looking for?
http://bugreports.qt.nokia.com/browse/QTBUG-16178
(collation isn't done yet, and lot of the other changes aren't important for the failures in our Skipped list) b) The IDN related failures (idn-security.html) are due to ICANN specified blacklisted glyphs and characters which cannot be in IDN domain names to prevent confusion or phishing attacks. This is already in ICU, but missing in Qurl::toAce(). Note this IDN/QUrl problems are separate from QtCore's Unicode and Locale related issues, but a (lazy man's) solution happens to be the same. I haven't done a root cause analysis of the dozens of failures in the current configuration, but I thought that the presence of objective data from peer-reviewed Layout Tests cases and long-open bugs would help.
Benjamin Poulain
Comment 19
2011-06-03 07:13:21 PDT
Comment on
attachment 95491
[details]
review comments incorporated Ok, let's give this a try. Could you look on the wiki if this flag can be documented anywhere?
WebKit Commit Bot
Comment 20
2011-06-03 07:47:16 PDT
Comment on
attachment 95491
[details]
review comments incorporated Clearing flags on attachment: 95491 Committed
r88016
: <
http://trac.webkit.org/changeset/88016
>
WebKit Commit Bot
Comment 21
2011-06-03 07:47:24 PDT
All reviewed patches have been landed. Closing bug.
Laszlo Gombos
Comment 22
2011-06-27 13:31:05 PDT
Re-purpose this bug to track the progress of switch to ICU and to track the impacted LayoutTests. The following tests are also passing with ICU turned on: sputnik/Unicode/Unicode_510/S15.5.4.16_A1.html sputnik/Unicode/Unicode_510/S15.5.4.18_A1.html sputnik/Unicode/Unicode_510/S7.6_A1.1_T1.html sputnik/Unicode/Unicode_510/S7.6_A1.1_T2.html sputnik/Unicode/Unicode_510/S7.6_A1.1_T4.html sputnik/Unicode/Unicode_510/S7.6_A2.2_T1.html sputnik/Unicode/Unicode_510/S7.6_A2.2_T2.html sputnik/Unicode/Unicode_510/S7.6_A2.3.html sputnik/Unicode/Unicode_510/S7.6_A5.2_T1.html sputnik/Unicode/Unicode_510/S7.6_A5.2_T2.html sputnik/Unicode/Unicode_510/S7.6_A5.2_T4.html sputnik/Unicode/Unicode_510/S7.6_A5.2_T7.html sputnik/Unicode/Unicode_510/S7.6_A5.2_T8.html sputnik/Unicode/Unicode_510/S7.6_A5.2_T9.html
Jesus Sanchez-Palencia
Comment 23
2012-01-26 09:32:48 PST
(In reply to
comment #22
)
> Re-purpose this bug to track the progress of switch to ICU and to track the impacted LayoutTests.
Closing this bug after
http://trac.webkit.org/changeset/105997
.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug