RESOLVED FIXED 64132
Implement WebVTT Cue Text Parsing rules and DOM construction for WebVTT elements
https://bugs.webkit.org/show_bug.cgi?id=64132
Summary Implement WebVTT Cue Text Parsing rules and DOM construction for WebVTT elements
Anna Cavender
Reported 2011-07-07 16:26:06 PDT
WebVTT Cue Text Parsing rules : http://www.whatwg.org/specs/web-apps/current-work/multipage/the-iframe-element.html#webvtt-cue-text-parsing-rules WebVTT Cue Text DOM construction rules : http://www.whatwg.org/specs/web-apps/current-work/multipage/the-iframe-element.html#webvtt-cue-text-dom-construction-rules Thanks to WebKit, we get some of these for free: WebVTT Italics Objects (<i></i>) WebVTT Bold Objects (<b></b>) WebVTT Underline Objects (<u></u>) WebVTT Ruby Objects (<ruby></ruby>) WebVTT Ruby Text Objects (<rt></rt>) But there are some we will need to implement: WebVTT Class Objects (<c></c>) WebVTT Voice Objects (<v></v>) WebVTT Leaf Node Objects WebVTT Text Objects WebVTT Timestamp Objects (<[time]>)
Attachments
first pass at WebVTT cue text parser/tokenizer (33.75 KB, patch)
2011-08-08 16:01 PDT, Anna Cavender
no flags
adding classes and title found during parsing to fragment tree (34.20 KB, patch)
2011-08-09 11:25 PDT, Anna Cavender
no flags
addressing Adam's comments (34.93 KB, patch)
2011-08-09 16:27 PDT, Anna Cavender
no flags
using DataVector typedef and fixing a few bugs (37.71 KB, patch)
2011-08-24 12:00 PDT, Anna Cavender
no flags
updating to ToT (37.70 KB, patch)
2011-08-29 13:04 PDT, Anna Cavender
no flags
Anna Cavender
Comment 1 2011-08-08 16:01:01 PDT
Created attachment 103308 [details] first pass at WebVTT cue text parser/tokenizer
Anna Cavender
Comment 2 2011-08-09 11:25:53 PDT
Created attachment 103376 [details] adding classes and title found during parsing to fragment tree
WebKit Review Bot
Comment 3 2011-08-09 11:29:24 PDT
Attachment 103376 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/CMakeLists.txt', u'Source/W..." exit_code: 1 Source/WebCore/platform/track/WebVTTTokenizer.cpp:86: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 15 files If any of these errors are false positives, please file a bug against check-webkit-style.
Adam Barth
Comment 4 2011-08-09 12:41:49 PDT
Comment on attachment 103376 [details] adding classes and title found during parsing to fragment tree View in context: https://bugs.webkit.org/attachment.cgi?id=103376&action=review This looks great, but I'd like to see another round. There seems to be a bunch of AtomicString/String confusion. AtomicStrings are strings where we've hashed the contents of the string in a giant hash table to use the same storage as other instances of the same string. They're useful when comparing strings for equality (because you can just compare their addresses) and when you want lots of common strings to share the same storage. The trade-off is that they're slower to construct (because you need to hash their contents). > Source/WebCore/platform/track/WebVTTParser.cpp:232 > + switch (m_token.type()) { > + case WebVTTTokenTypes::Character: I'd put this switch statement into its own function to keep the loop small and tight. For HTML and XML, we put this function into it's own class, but that's probably not necessary here. > Source/WebCore/platform/track/WebVTTParser.cpp:234 > + AtomicString content(m_token.characters().data(), m_token.characters().size()); Why AtomicString? String is probably more appropriate if you're going to just put this in a Text node. > Source/WebCore/platform/track/WebVTTParser.cpp:240 > + case WebVTTTokenTypes::StartTag: > + { These { braces aren't in the correct style. These two lines should be merged and the trailing } should be aligned with the "c" in case. > Source/WebCore/platform/track/WebVTTParser.cpp:248 > + else if (AtomicString(m_token.name().data()) == "c") Why bother to malloc a string if it's just one character long? You can just compare the length and the value of the character. Also, what happened to the length? > Source/WebCore/platform/track/WebVTTParser.cpp:271 > + if (tokenTagName == iTag > + || tokenTagName == bTag > + || tokenTagName == uTag > + || tokenTagName == rubyTag > + || tokenTagName == rtTag > + || AtomicString(m_token.name().data()) == "c" > + || AtomicString(m_token.name().data()) == "v") Rather than copy/pasting this code, I would create a free inline function. > Source/WebCore/platform/track/WebVTTParser.cpp:276 > + unsigned position = 0; You sure this shouldn't be size_t ? > Source/WebCore/platform/track/WebVTTParser.cpp:279 > + m_currentNode->parserAddChild(ProcessingInstruction::create(document, "timestamp", AtomicString(m_token.characters().data(), m_token.characters().size()))); AtomicString should probably be String again. > Source/WebCore/platform/track/WebVTTTokenizer.cpp:82 > + UChar cc = m_inputStreamPreprocessor.nextInputCharacter(); I presume the spec says to do the same preprocessing that we do for HTML (e.g., \r\n collapsing, etc).
Vicki Pfau
Comment 5 2011-08-09 13:46:19 PDT
Comment on attachment 103376 [details] adding classes and title found during parsing to fragment tree View in context: https://bugs.webkit.org/attachment.cgi?id=103376&action=review > Source/WebCore/platform/track/WebVTTToken.h:130 > + String m_annotation; > + String m_classes; > + String m_currentBuffer; These probably shouldn't be Strings. When you append a single character to a string, as you do in this tokenizer, it resizes (and therefore memcpys) the string. Since it does this in a loop, the loop becomes very inefficient. The reason that the other tokenizers use a Vector of UChars is because Vectors are more optimized for small resizes.
Anna Cavender
Comment 6 2011-08-09 16:26:58 PDT
Comment on attachment 103376 [details] adding classes and title found during parsing to fragment tree Thanks for the fast response, Adam! And thank you for the explanation too. View in context: https://bugs.webkit.org/attachment.cgi?id=103376&action=review >> Source/WebCore/platform/track/WebVTTParser.cpp:232 >> + case WebVTTTokenTypes::Character: > > I'd put this switch statement into its own function to keep the loop small and tight. For HTML and XML, we put this function into it's own class, but that's probably not necessary here. Done. >> Source/WebCore/platform/track/WebVTTParser.cpp:234 >> + AtomicString content(m_token.characters().data(), m_token.characters().size()); > > Why AtomicString? String is probably more appropriate if you're going to just put this in a Text node. Done. >> Source/WebCore/platform/track/WebVTTParser.cpp:240 >> + { > > These { braces aren't in the correct style. These two lines should be merged and the trailing } should be aligned with the "c" in case. Done. >> Source/WebCore/platform/track/WebVTTParser.cpp:248 >> + else if (AtomicString(m_token.name().data()) == "c") > > Why bother to malloc a string if it's just one character long? You can just compare the length and the value of the character. Also, what happened to the length? Done. Good point. >> Source/WebCore/platform/track/WebVTTParser.cpp:271 >> + || AtomicString(m_token.name().data()) == "v") > > Rather than copy/pasting this code, I would create a free inline function. Done. Nice. >> Source/WebCore/platform/track/WebVTTParser.cpp:276 >> + unsigned position = 0; > > You sure this shouldn't be size_t ? collectTimeStamp takes an unsigned, but I wrote that function too, so I'm open to suggestions :). >> Source/WebCore/platform/track/WebVTTParser.cpp:279 >> + m_currentNode->parserAddChild(ProcessingInstruction::create(document, "timestamp", AtomicString(m_token.characters().data(), m_token.characters().size()))); > > AtomicString should probably be String again. Done. >> Source/WebCore/platform/track/WebVTTToken.h:130 >> + String m_currentBuffer; > > These probably shouldn't be Strings. When you append a single character to a string, as you do in this tokenizer, it resizes (and therefore memcpys) the string. Since it does this in a loop, the loop becomes very inefficient. The reason that the other tokenizers use a Vector of UChars is because Vectors are more optimized for small resizes. Done. Changed to UChar Vectors. Thanks for the explanation. >> Source/WebCore/platform/track/WebVTTTokenizer.cpp:82 >> + UChar cc = m_inputStreamPreprocessor.nextInputCharacter(); > > I presume the spec says to do the same preprocessing that we do for HTML (e.g., \r\n collapsing, etc). Hmm. It doesn't say that specifically, but using the existing preprocessor made sense to me. >> Source/WebCore/platform/track/WebVTTTokenizer.cpp:86 >> + WEBVTT_BEGIN_STATE(DataState) { > > Non-label code inside switch statements should be indented. [whitespace/indent] [4] Funny, these are labels, but the style checker can't tell that.
Anna Cavender
Comment 7 2011-08-09 16:27:37 PDT
Created attachment 103415 [details] addressing Adam's comments
Vicki Pfau
Comment 8 2011-08-10 18:11:22 PDT
Comment on attachment 103415 [details] addressing Adam's comments View in context: https://bugs.webkit.org/attachment.cgi?id=103415&action=review > Source/WebCore/platform/track/WebVTTToken.h:97 > + Vector<UChar, 32> classes() This should be const Vector<UChar, 32>& classes() const, or const DataVector& classes() const (see below) to avoid copying > Source/WebCore/platform/track/WebVTTToken.h:115 > + Vector<UChar, 32> annotation() This should be const Vector<UChar, 32>& annotation() const, or const DataVector& annotation() const (see below) to avoid copying > Source/WebCore/platform/track/WebVTTToken.h:129 > + Vector<UChar, 32> m_annotation; > + Vector<UChar, 32> m_classes; If these are about the same length as a tag name or an attribute name, this is what the MarkupTokenBase::DataVector type (which is just a typedef'd Vector) is for, so it might make sense to use those here.
Anna Cavender
Comment 9 2011-08-24 11:59:51 PDT
Comment on attachment 103415 [details] addressing Adam's comments Thanks Jeffrey. View in context: https://bugs.webkit.org/attachment.cgi?id=103415&action=review >> Source/WebCore/platform/track/WebVTTToken.h:97 >> + Vector<UChar, 32> classes() > > This should be const Vector<UChar, 32>& classes() const, or const DataVector& classes() const (see below) to avoid copying Done. >> Source/WebCore/platform/track/WebVTTToken.h:115 >> + Vector<UChar, 32> annotation() > > This should be const Vector<UChar, 32>& annotation() const, or const DataVector& annotation() const (see below) to avoid copying Done. >> Source/WebCore/platform/track/WebVTTToken.h:129 >> + Vector<UChar, 32> m_classes; > > If these are about the same length as a tag name or an attribute name, this is what the MarkupTokenBase::DataVector type (which is just a typedef'd Vector) is for, so it might make sense to use those here. Good idea, thanks. Done.
Anna Cavender
Comment 10 2011-08-24 12:00:10 PDT
Created attachment 105034 [details] using DataVector typedef and fixing a few bugs
WebKit Review Bot
Comment 11 2011-08-24 12:02:59 PDT
Attachment 105034 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/CMakeLists.txt', u'Source/W..." exit_code: 1 Source/WebCore/platform/track/WebVTTTokenizer.cpp:86: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 15 files If any of these errors are false positives, please file a bug against check-webkit-style.
Anna Cavender
Comment 12 2011-08-29 13:04:09 PDT
Created attachment 105513 [details] updating to ToT
WebKit Review Bot
Comment 13 2011-08-29 13:07:27 PDT
Attachment 105513 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/CMakeLists.txt', u'Source/W..." exit_code: 1 Source/WebCore/platform/track/WebVTTTokenizer.cpp:86: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 15 files If any of these errors are false positives, please file a bug against check-webkit-style.
Adam Barth
Comment 14 2011-08-29 13:14:43 PDT
Comment on attachment 105513 [details] updating to ToT View in context: https://bugs.webkit.org/attachment.cgi?id=105513&action=review > Source/WebCore/platform/track/WebVTTParser.cpp:355 > + default: > + break; Generally we skip the default case in WebKit so the compiler tells us if we forgot any cases.
WebKit Review Bot
Comment 15 2011-08-29 13:47:20 PDT
Comment on attachment 105513 [details] updating to ToT Clearing flags on attachment: 105513 Committed r94011: <http://trac.webkit.org/changeset/94011>
WebKit Review Bot
Comment 16 2011-08-29 13:47:26 PDT
All reviewed patches have been landed. Closing bug.
Note You need to log in before you can comment on or make changes to this bug.