RESOLVED WONTFIX 40347
Fix fast/parser/entity-surrogate-pairs.html in HTML5 parser
https://bugs.webkit.org/show_bug.cgi?id=40347
Summary Fix fast/parser/entity-surrogate-pairs.html in HTML5 parser
Tony Gentilcore
Reported 2010-06-08 19:51:44 PDT
Fix fast/parser/entity-surrogate-pairs.html in HTML5 parser
Attachments
Patch (2.01 KB, patch)
2010-06-08 19:55 PDT, Tony Gentilcore
no flags
Tony Gentilcore
Comment 1 2010-06-08 19:55:02 PDT
Adam Barth
Comment 2 2010-06-08 19:57:55 PDT
Comment on attachment 58206 [details] Patch WebCore/html/HTML5Lexer.cpp:101 + if (value == 0 || value > 0x10FFFF) Did I just dream up the condition on the right? I thought I got it out of the spec....
WebKit Review Bot
Comment 3 2010-06-08 20:00:22 PDT
Attachment 58206 [details] did not pass style-queue: Failed to run "['WebKitTools/Scripts/check-webkit-style', '--no-squash']" exit_code: 1 WebCore/html/HTML5Lexer.cpp:101: Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons. [readability/comparison_to_zero] [5] Total errors found: 1 in 3 files If any of these errors are false positives, please file a bug against check-webkit-style.
Tony Gentilcore
Comment 4 2010-06-08 20:29:23 PDT
(In reply to comment #2) > (From update of attachment 58206 [details]) > WebCore/html/HTML5Lexer.cpp:101 > + if (value == 0 || value > 0x10FFFF) > Did I just dream up the condition on the right? I thought I got it out of the spec.... That condition is here for sure: http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#preprocessing-the-input-stream And it is enforced in InputStreamPreprocessor::peek(). However, I can't find the part of the spec that talks about entity processing and enforcing it here certainly prevents surrogate pair entities from working.
Eric Seidel (no email)
Comment 5 2010-06-09 00:45:14 PDT
Fixing the test is good. I think our suragate pair handling is still confused. The spec: All U+0000 NULL characters and code points in the range U+D800 to U+DFFF in the input must be replaced by U+FFFD REPLACEMENT CHARACTERs. Any occurrences of such characters and code points are parse errors. That seems wrong. I guess I need to read the HTML5 spec more closely. U+D800–U+DBFF is the range used by the leading surrogate.
Eric Seidel (no email)
Comment 6 2010-06-09 00:49:07 PDT
I think the spec is written in terms of characters, so I'm confused by the use of "code points" in "All U+0000 NULL characters and code points in the range U+D800 to U+DFFF". But I think the "code points" bit is just a mistake. I think that this pre-processing should already have been done at a layer below the InputStreamPreprocessor, so we can just hack out that code.
Tony Gentilcore
Comment 7 2010-06-09 09:55:11 PDT
(In reply to comment #6) > I think the spec is written in terms of characters, so I'm confused by the use of "code points" in "All U+0000 NULL characters and code points in the range U+D800 to U+DFFF". But I think the "code points" bit is just a mistake. > > I think that this pre-processing should already have been done at a layer below the InputStreamPreprocessor, so we can just hack out that code. Yes. Code below InputStreamPreprocessor is handling surrogate pairs. That's why they work when the filter in legalEntityFor() is removed. Adam, let me know if this is off, but my assessment was that your checks in legalEntityFor were based on the spec in #preprocessing-the-input-stream. The bug was that those checks should only be enforced in peek() rather than prior to adding the entities to the stream.
Adam Barth
Comment 8 2010-06-09 09:59:05 PDT
Comment on attachment 58206 [details] Patch Ok. I'm still not sure I understand the issue 100%, but I'm starting to be convinced.
Tony Gentilcore
Comment 9 2010-06-09 10:02:44 PDT
(In reply to comment #8) > (From update of attachment 58206 [details]) > Ok. I'm still not sure I understand the issue 100%, but I'm starting to be convinced. I've flipped the cq? flag. I'm assuming: 1. This is an appropriate style violation. 2. There isn't another part of the spec that I'm missing which legalEntityFor() was based upon (ie. it was based on #preprocessing-the-input-stream).
Adam Barth
Comment 10 2010-06-09 10:11:20 PDT
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references [[ Otherwise, if the number is in the range 0xD800 to 0xDFFF or is greater than 0x10FFFF, then this is a parse error. Return a U+FFFD REPLACEMENT CHARACTER. ]]
Tony Gentilcore
Comment 11 2010-06-09 10:16:08 PDT
(In reply to comment #10) > http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references > > [[ > Otherwise, if the number is in the range 0xD800 to 0xDFFF or is greater than 0x10FFFF, then this is a parse error. Return a U+FFFD REPLACEMENT CHARACTER. > ]] OK. So that means surrogate pair entities are expressly forbidden in HTML5. Since they were supported before, that leaves a compatibility question. How do we handle these LayoutTests which test for legacy behavior that is expressly forbidden in HTML5?
Adam Barth
Comment 12 2010-06-09 10:20:39 PDT
We note them in the spreadsheet. Can you test for how other browser (especially IE, Firefox, and Firefox nightly) behave?
Tony Gentilcore
Comment 13 2010-06-09 10:42:54 PDT
(In reply to comment #12) > We note them in the spreadsheet. Can you test for how other browser (especially IE, Firefox, and Firefox nightly) behave? We are following the spec, so I'm marking this issue as WONTFIX, and I've noted it in the spreadsheet. In answer to your question, here's how various browsers perform on entity-surrogate-pairs.html: IE8: PASS Firefox 3.6: FAIL Minefield: FAIL WK legacy: PASS WK HTML5: FAIL It is interesting that comment in the test says "Firefox allows these". I suppose that means FF did at some point but that has now been removed. It is a better scary for compat that IE8 allows surrogate pair entities.
Adam Barth
Comment 14 2010-06-09 10:52:39 PDT
We should record all this information in a doc and send it to the HTML5 WG once we've got everything sorted out.
Tony Gentilcore
Comment 15 2010-06-09 10:59:20 PDT
(In reply to comment #14) > We should record all this information in a doc and send it to the HTML5 WG once we've got everything sorted out. Presumably, we are going to end up with a spreadsheet of only lines that have an "N" in needs code change. Those will be the list of issues we'll have to resolve with the spec. I've added this bug number there so we can reference it when we prepare the list of issues.
Alexey Proskuryakov
Comment 16 2010-06-09 13:25:29 PDT
See bug 6446 for why this was done (no known Web compatibility issues, just compatibility with other browsers). But thinking about this now, I like our forgiving behavior. There are processes on the Web that blindly encode all non-ASCII "characters" in input stream as numeric entities. I don't have a specific example. but depending on their definition of "character", they can end up with exactly this. And I'm not aware of any downside in supporting such split entities.
Adam Barth
Comment 17 2010-06-09 13:57:05 PDT
I've added your comment to our new document that's tracking these issues: https://docs.google.com/document/edit?id=1as5xYjyMSCph4960iz0-Kb7hZKf_L6f2vts57NMcVBI&hl=en Our current plan is to have the new parser 100% match the spec. If we want to do something differently, we probably want to change the spec and then move the implementation to match the updated spec.
Alexey Proskuryakov
Comment 18 2012-11-27 12:52:53 PST
For future reference: <rdar://problem/12747226>and <rdar://problem/10447418> track two regressions from switching to the new behavior.
Note You need to log in before you can comment on or make changes to this bug.