40347 – Fix fast/parser/entity-surrogate-pairs.html in HTML5 parser

RESOLVED WONTFIX 40347

Fix fast/parser/entity-surrogate-pairs.html in HTML5 parser

https://bugs.webkit.org/show_bug.cgi?id=40347

Summary Fix fast/parser/entity-surrogate-pairs.html in HTML5 parser

Tony Gentilcore

Reported 2010-06-08 19:51:44 PDT

Fix fast/parser/entity-surrogate-pairs.html in HTML5 parser

Attachments
Patch (2.01 KB, patch) 2010-06-08 19:55 PDT, Tony Gentilcore	no flags	Details Formatted Diff Diff
View All Add attachment proposed patch, testcase, etc.

Tony Gentilcore

Comment 1 2010-06-08 19:55:02 PDT

Created attachment 58206 [details] Patch

Adam Barth

Comment 2 2010-06-08 19:57:55 PDT

Comment on attachment 58206 [details] Patch WebCore/html/HTML5Lexer.cpp:101 + if (value == 0 || value > 0x10FFFF) Did I just dream up the condition on the right? I thought I got it out of the spec....

WebKit Review Bot

Comment 3 2010-06-08 20:00:22 PDT

Attachment 58206 [details] did not pass style-queue: Failed to run "['WebKitTools/Scripts/check-webkit-style', '--no-squash']" exit_code: 1 WebCore/html/HTML5Lexer.cpp:101: Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons. [readability/comparison_to_zero] [5] Total errors found: 1 in 3 files If any of these errors are false positives, please file a bug against check-webkit-style.

Tony Gentilcore

Comment 4 2010-06-08 20:29:23 PDT

(In reply to comment #2) > (From update of attachment 58206 [details]) > WebCore/html/HTML5Lexer.cpp:101 > + if (value == 0 || value > 0x10FFFF) > Did I just dream up the condition on the right? I thought I got it out of the spec.... That condition is here for sure: http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#preprocessing-the-input-stream And it is enforced in InputStreamPreprocessor::peek(). However, I can't find the part of the spec that talks about entity processing and enforcing it here certainly prevents surrogate pair entities from working.

Eric Seidel (no email)

Comment 5 2010-06-09 00:45:14 PDT

Fixing the test is good. I think our suragate pair handling is still confused. The spec: All U+0000 NULL characters and code points in the range U+D800 to U+DFFF in the input must be replaced by U+FFFD REPLACEMENT CHARACTERs. Any occurrences of such characters and code points are parse errors. That seems wrong. I guess I need to read the HTML5 spec more closely. U+D800–U+DBFF is the range used by the leading surrogate.

Eric Seidel (no email)

Comment 6 2010-06-09 00:49:07 PDT

I think the spec is written in terms of characters, so I'm confused by the use of "code points" in "All U+0000 NULL characters and code points in the range U+D800 to U+DFFF". But I think the "code points" bit is just a mistake. I think that this pre-processing should already have been done at a layer below the InputStreamPreprocessor, so we can just hack out that code.

Tony Gentilcore

Comment 7 2010-06-09 09:55:11 PDT

(In reply to comment #6) > I think the spec is written in terms of characters, so I'm confused by the use of "code points" in "All U+0000 NULL characters and code points in the range U+D800 to U+DFFF". But I think the "code points" bit is just a mistake. > > I think that this pre-processing should already have been done at a layer below the InputStreamPreprocessor, so we can just hack out that code. Yes. Code below InputStreamPreprocessor is handling surrogate pairs. That's why they work when the filter in legalEntityFor() is removed. Adam, let me know if this is off, but my assessment was that your checks in legalEntityFor were based on the spec in #preprocessing-the-input-stream. The bug was that those checks should only be enforced in peek() rather than prior to adding the entities to the stream.

Adam Barth

Comment 8 2010-06-09 09:59:05 PDT

Comment on attachment 58206 [details] Patch Ok. I'm still not sure I understand the issue 100%, but I'm starting to be convinced.

Tony Gentilcore

Comment 9 2010-06-09 10:02:44 PDT

(In reply to comment #8) > (From update of attachment 58206 [details]) > Ok. I'm still not sure I understand the issue 100%, but I'm starting to be convinced. I've flipped the cq? flag. I'm assuming: 1. This is an appropriate style violation. 2. There isn't another part of the spec that I'm missing which legalEntityFor() was based upon (ie. it was based on #preprocessing-the-input-stream).

Adam Barth

Comment 10 2010-06-09 10:11:20 PDT

http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references [[ Otherwise, if the number is in the range 0xD800 to 0xDFFF or is greater than 0x10FFFF, then this is a parse error. Return a U+FFFD REPLACEMENT CHARACTER. ]]

Tony Gentilcore

Comment 11 2010-06-09 10:16:08 PDT

(In reply to comment #10) > http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references > > [[ > Otherwise, if the number is in the range 0xD800 to 0xDFFF or is greater than 0x10FFFF, then this is a parse error. Return a U+FFFD REPLACEMENT CHARACTER. > ]] OK. So that means surrogate pair entities are expressly forbidden in HTML5. Since they were supported before, that leaves a compatibility question. How do we handle these LayoutTests which test for legacy behavior that is expressly forbidden in HTML5?

Adam Barth

Comment 12 2010-06-09 10:20:39 PDT

We note them in the spreadsheet. Can you test for how other browser (especially IE, Firefox, and Firefox nightly) behave?

Tony Gentilcore

Comment 13 2010-06-09 10:42:54 PDT

(In reply to comment #12) > We note them in the spreadsheet. Can you test for how other browser (especially IE, Firefox, and Firefox nightly) behave? We are following the spec, so I'm marking this issue as WONTFIX, and I've noted it in the spreadsheet. In answer to your question, here's how various browsers perform on entity-surrogate-pairs.html: IE8: PASS Firefox 3.6: FAIL Minefield: FAIL WK legacy: PASS WK HTML5: FAIL It is interesting that comment in the test says "Firefox allows these". I suppose that means FF did at some point but that has now been removed. It is a better scary for compat that IE8 allows surrogate pair entities.

Adam Barth

Comment 14 2010-06-09 10:52:39 PDT

We should record all this information in a doc and send it to the HTML5 WG once we've got everything sorted out.

Tony Gentilcore

Comment 15 2010-06-09 10:59:20 PDT

(In reply to comment #14) > We should record all this information in a doc and send it to the HTML5 WG once we've got everything sorted out. Presumably, we are going to end up with a spreadsheet of only lines that have an "N" in needs code change. Those will be the list of issues we'll have to resolve with the spec. I've added this bug number there so we can reference it when we prepare the list of issues.

Alexey Proskuryakov

Comment 16 2010-06-09 13:25:29 PDT

See bug 6446 for why this was done (no known Web compatibility issues, just compatibility with other browsers). But thinking about this now, I like our forgiving behavior. There are processes on the Web that blindly encode all non-ASCII "characters" in input stream as numeric entities. I don't have a specific example. but depending on their definition of "character", they can end up with exactly this. And I'm not aware of any downside in supporting such split entities.

Adam Barth

Comment 17 2010-06-09 13:57:05 PDT

I've added your comment to our new document that's tracking these issues: https://docs.google.com/document/edit?id=1as5xYjyMSCph4960iz0-Kb7hZKf_L6f2vts57NMcVBI&hl=en Our current plan is to have the new parser 100% match the spec. If we want to do something differently, we probably want to change the spec and then move the implementation to match the updated spec.

Alexey Proskuryakov

Comment 18 2012-11-27 12:52:53 PST

For future reference: <rdar://problem/12747226>and <rdar://problem/10447418> track two regressions from switching to the new behavior.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution WONTFIX

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Other

OS OS X 10.5

Product WebKit

Component DOM

Assignee

Tony Gentilcore

Reported

2010-06-08 19:51 PDT

Modified

2012-11-27 12:52 PST History

CC List

6 users Show

URL

Keywords

Depends on

Blocks

39259

Dependencies

tree graph