RESOLVED FIXED 29344
REGRESSION: fast/workers/dedicated-worker-lifecycle.html failing intermittently on leopard bot
https://bugs.webkit.org/show_bug.cgi?id=29344
Summary REGRESSION: fast/workers/dedicated-worker-lifecycle.html failing intermittent...
Eric Seidel (no email)
Reported 2009-09-17 13:39:06 PDT
fast/workers/dedicated-worker-lifecycle.html failing intermittently on leopard bot http://build.webkit.org/results/Leopard%20Intel%20Release%20(Tests)/r48488%20(5093)/results.html --- layout-test-results/fast/workers/dedicated-worker-lifecycle-expected.txt 2009-09-17 13:27:57.000000000 -0700 +++ layout-test-results/fast/workers/dedicated-worker-lifecycle-actual.txt 2009-09-17 13:27:57.000000000 -0700 @@ -1,12 +1,8 @@ +FAIL: Timed out waiting for notifyDone to be called This test checks whether orphaned workers exit under various conditions On success, you will see a series of "PASS" messages, followed by "TEST COMPLETE". PASS Orphaned worker thread created. -PASS Orphaned worker thread exited. -PASS Orphaned timeout worker thread created. -PASS Orphaned timeout worker thread exited. - -TEST COMPLETE This might be a one-time failure, but I'll file this so we have record for now.
Attachments
Proposal to skip the test until a fix can be found (1.26 KB, patch)
2009-09-28 16:24 PDT, Eric Seidel (no email)
no flags
proposed patch that may address the flakiness (3.43 KB, patch)
2009-09-28 18:32 PDT, Andrew Wilson
no flags
Crash log. (30.78 KB, text/plain)
2009-12-07 14:49 PST, David Levin
no flags
Normal crash log (-- had incorrect define for jsc zombies). (27.47 KB, text/plain)
2009-12-07 15:39 PST, David Levin
no flags
Another JSC_ZOMBIES crash log (in case it's helpful) (26.10 KB, text/plain)
2009-12-07 16:16 PST, Eric Seidel (no email)
no flags
Add dedicated-worker-lifecycle.html to the appropriate Skipped files (2.07 KB, patch)
2009-12-11 11:00 PST, Andrew Wilson
no flags
Disabling on leopard only (1.31 KB, patch)
2009-12-11 11:33 PST, Andrew Wilson
no flags
Eric Seidel (no email)
Comment 1 2009-09-28 16:19:25 PDT
Sigh. This just caused bug 29743 to fail to land too. I'll cook up a patch to skip this test too.
Eric Seidel (no email)
Comment 2 2009-09-28 16:24:33 PDT
Created attachment 40269 [details] Proposal to skip the test until a fix can be found
Andrew Wilson
Comment 3 2009-09-28 16:45:25 PDT
(In reply to comment #2) > Created an attachment (id=40269) [details] > Proposal to skip the test until a fix can be found These lifecycle tests are all somewhat dodgy I suspect. The problem is that the worker threads only exit once the Worker object itself is GC'd, and the worker object will not be GC'd if there's something on the stack that looks like a reference to it. I've seen odd errors in the past due to this (objects not getting GC'd even though there are no references). It's probably reasonable to just leave these disabled since there's no ironclad way for JS code to force something to get garbage collected.
David Levin
Comment 4 2009-09-28 16:50:35 PDT
Based on discussion with Drew, I r+ this. It sounds like it may need to be removed completely.
Eric Seidel (no email)
Comment 5 2009-09-28 16:52:41 PDT
We can force a collection with GCController.collect() in DumpRenderTree. If there are no references to it, it will be collected. Would that help you make this non-flakey?
Maciej Stachowiak
Comment 6 2009-09-28 16:53:32 PDT
(In reply to comment #3) > (In reply to comment #2) > > Created an attachment (id=40269) [details] [details] > > Proposal to skip the test until a fix can be found > > These lifecycle tests are all somewhat dodgy I suspect. > > The problem is that the worker threads only exit once the Worker object itself > is GC'd, and the worker object will not be GC'd if there's something on the > stack that looks like a reference to it. I've seen odd errors in the past due > to this (objects not getting GC'd even though there are no references). It's > probably reasonable to just leave these disabled since there's no ironclad way > for JS code to force something to get garbage collected. We have found ways before to get objects to be GC'd with 100% reliability in tests. Usually this involves doing lots of allocation and calling after the last reference should be dropped, and using a timer to return to the event loop and clear out any temporary stack.
Eric Seidel (no email)
Comment 7 2009-09-28 16:56:36 PDT
I'm in no hurry to check this in. I'll wait until there is some amount of consensus about what should be done to this test. If the vote is it should be Skipped for all platforms, or removed entirely, please just r- the existing patch and I'll be happy to post a different one (or someone else is welcome to). Thanks again for looking!
Andrew Wilson
Comment 8 2009-09-28 17:02:44 PDT
(In reply to comment #6) > We have found ways before to get objects to be GC'd with 100% reliability in > tests. Usually this involves doing lots of allocation and calling after the > last reference should be dropped, and using a timer to return to the event loop > and clear out any temporary stack. When I looked at this, I also tried those ways (including the exhaustive gc() function from js-test-pre.js, also happening from a setTimeout()): function gc() { if (typeof GCController !== "undefined") GCController.collect(); else { function gcRec(n) { if (n < 1) return {}; var temp = {i: "ab" + i + (i / 100000)}; temp += "foo"; gcRec(n-1); } for (var i = 0; i < 1000; i++) gcRec(10) } } And yet the GC would not free the object. In my previous case, it turned out that allocating a Date() object (for whatever reason) would allow the object to get GC'd (so I know the problem really was an issue of GC, and not some kind of lingering reference) - if you look at the body of dedicated-worker-lifecycle.js, you'll see this: // For some reason, the worker object does not get GC'd unless we allocate a new object here. // The conjecture is that there's a value on the stack that appears to point to the worker which this clobbers. new Date(); Anyhow, I'd be happy to make whatever changes are necessary to keep this test live - but I've tried all the suggestions I've heard so far and none of them have worked :(
Andrew Wilson
Comment 9 2009-09-28 17:04:51 PDT
(In reply to comment #8) > When I looked at this, I also tried those ways (including the exhaustive gc() > function from js-test-pre.js, also happening from a setTimeout()): To clarify, I was forcing the "slow/exhaustive" branch in the gc() routine - it wasn't just calling GCCollect.collect(). Anyhow, I could definitely re-enable that code again - maybe at least it would make it stable enough that we could keep the test enabled...
Andrew Wilson
Comment 10 2009-09-28 17:25:52 PDT
Here's my proposal: I'll re-enable the code that does what Maciej suggests. We'll leave the test disabled on mac-leopard, and watch to see if we see failures on any other platform. If not, we can try re-enabling on mac-leopard as well.
Andrew Wilson
Comment 11 2009-09-28 18:32:44 PDT
Created attachment 40277 [details] proposed patch that may address the flakiness Here's a patch that makes the changes Maciej suggests. We can either land this patch and wait and see if it addresses the flakiness, or land both this and Eric's patch.
Eric Seidel (no email)
Comment 12 2009-09-30 15:51:21 PDT
Comment on attachment 40277 [details] proposed patch that may address the flakiness Where did you get the gc() changes? I'd love to try something since this is still flakey on the bots: http://build.webkit.org/results/Leopard%20Intel%20Release%20(Tests)/r48919%20(5566)/results.html
Andrew Wilson
Comment 13 2009-09-30 16:38:31 PDT
(In reply to comment #12) > (From update of attachment 40277 [details]) > Where did you get the gc() changes? > From fast/js/resources/js_test_pre.js, but with minor logic changes to force them to happen even if GCController is defined.
Eric Seidel (no email)
Comment 14 2009-10-01 12:15:46 PDT
Comment on attachment 40277 [details] proposed patch that may address the flakiness I certainly think this is worth a try. I'm not your best reviewer for this, but the others have been silent and this looks non-harmful.
Eric Seidel (no email)
Comment 15 2009-10-01 12:16:06 PDT
Comment on attachment 40269 [details] Proposal to skip the test until a fix can be found Removing Levin's r+ while we wait to see if Andrew's fix works.
Andrew Wilson
Comment 16 2009-10-01 13:28:57 PDT
Committed as r48996. I'll keep this bug open for a bit to see if this addresses the flakiness at all.
Eric Seidel (no email)
Comment 17 2009-10-05 11:09:01 PDT
Comment on attachment 40277 [details] proposed patch that may address the flakiness Clearing my r+ from this patch now that it's been committed. (This way the to-be-committed query doesn't list it.)
Eric Seidel (no email)
Comment 18 2009-10-11 10:18:39 PDT
Alexey Proskuryakov
Comment 19 2009-10-11 15:33:53 PDT
On my home machine (and a debug build), this test isn't just flaky - it's always failing. Worker activity management appears to be working well, so this looks like a general GC failure. I see the worker wrapper marked by Heap::markConservatively(), as if there were references to it remaining on the stack. The test already attempts black magic to fight this. As far as I know, it shouldn't be necessary, and it doesn't work anyway. // Orphan our worker (no more references to it) and wait for it to exit. worker.onmessage = 0; worker = 0; // For some reason, the worker object does not get GC'd unless we allocate a new object here. // The conjecture is that there's a value on the stack that appears to point to the worker which this clobbers. new Date(); waitUntilWorkerThreadsExit(orphanedTimeoutWorkerExited); Here, waitUntilWorkerThreadsExit invokes garbage collector, each 10 ms, waiting for GC to collect the unused Worker object, which never happens.
Andrew Wilson
Comment 20 2009-10-11 15:40:50 PDT
OK, it sounds like the only thing to do here is just skip the test (eric's original patch) since my black magic to try to force the object off the stack isn't working. This really befuddles me, since retrying via setTimeout() should leave the stack totally empty.
Eric Seidel (no email)
Comment 24 2009-12-03 14:13:09 PST
Crashed again: http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51657%20(7847)/results.html I'll see if I can find cycles to find a locally reproducible crash.
Oliver Hunt
Comment 25 2009-12-03 14:18:46 PST
(In reply to comment #24) > Crashed again: > http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51657%20(7847)/results.html > > I'll see if I can find cycles to find a locally reproducible crash. try enabling JSC_ZOMBIES
Eric Seidel (no email)
Comment 26 2009-12-07 12:00:00 PST
Another crash on the bots: http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51782%20(7961)/results.html I would like to disable/skip this test soon if it remains flakey like this. No sense in keeping red tests around.
David Levin
Comment 27 2009-12-07 14:49:51 PST
Created attachment 44434 [details] Crash log. I was unable to repro the crash when I ran the worker directory 1000 times but I ran the fast directory 10 times ("run-webkit-tests --iterations 10 fast/") and I got the crash once.
David Levin
Comment 28 2009-12-07 15:39:28 PST
Created attachment 44438 [details] Normal crash log (-- had incorrect define for jsc zombies). It still looks like some memory corruption issue (that JSC_ZOMBIES didn't detect). Time to run with malloc guard on.
Eric Seidel (no email)
Comment 29 2009-12-07 16:16:58 PST
Created attachment 44442 [details] Another JSC_ZOMBIES crash log (in case it's helpful) Thanks for looking dave! I also ran a JSC_ZOMBIES run this morning, and now that I'm finally back at my desk, I see that it too crashed.
Eric Seidel (no email)
Comment 30 2009-12-07 16:19:29 PST
The ASSERT that it crashed at was: ASSERTION FAILED: primaryHeap.numLiveObjects == primaryHeap.numZombies (/Users/eseidel/Projects/WebKit/JavaScriptCore/runtime/Collector.cpp:199 void JSC::Heap::destroy())
David Levin
Comment 31 2009-12-09 11:22:12 PST
Notes on the crash: * The assert at JavaScriptCore/runtime/Collector.cpp:199 isn't too useful, so it is good to comment it out (I think there may be a bug in the counting done in that function of num live objects.) * Use r51657 for repros. I couldn't repro with tip of tree, r51782 had build errors for me, r51657 built and repros. * The smallest repro that I got so far is two test: run-webkit-tests --iterations 20 LayoutTests/fast/websockets/ fast/workers/dedicated-worker-lifecycle.html If you run ~20 times, you'll very likely see the crash. * Running with malloc guard causes the test to time out but if you turn off the brute force gc in worker-util.js, the test will pass (and the above crashes were done like this). * It turns out that the crash was due to WebSockets and was recently fixed with https://bugs.webkit.org/show_bug.cgi?id=32226 in r51790. (Note no more buildbot crashes have been listed after that revision). Case closed for the crashes!
Andrew Wilson
Comment 32 2009-12-11 10:02:38 PST
Now that Dave has fixed the crash, I'm going to disable this flaky test unless someone objects in short order.
Andrew Wilson
Comment 33 2009-12-11 11:00:42 PST
Created attachment 44697 [details] Add dedicated-worker-lifecycle.html to the appropriate Skipped files Don't need to disable this in Qt because it's already disabled there (missing DRT functionality). The similarly named shared-worker-lifecycle.html and worker-lifecycle.html tests don't rely on GC (they test whether explicitly calling close() shuts down the thread), so dedicated-worker-lifecycle.html is the only one that needs to be disabled.
WebKit Review Bot
Comment 34 2009-12-11 11:04:17 PST
style-queue ran check-webkit-style on attachment 44697 [details] without any errors.
Darin Adler
Comment 35 2009-12-11 11:11:51 PST
Comment on attachment 44697 [details] Add dedicated-worker-lifecycle.html to the appropriate Skipped files If a test is not working on any platform, we normally disable it by renaming the file to add a "-disabled" suffix. Try this command to see the 70 tests currently disabled in this fashion: find LayoutTests -name '*-disabled' Would you consider disabling this test that way? Or is this test still useful on some platform?
David Levin
Comment 36 2009-12-11 11:15:58 PST
Why are disabling this as opposed to deleting it completely? From what I've seen fixing this seems like it depends on getting gc of a *particular* object to happen in a more deterministic manner which despite repeated attempts hasn't been possible (and it is hard to see what is going to change in that regard).
Andrew Wilson
Comment 37 2009-12-11 11:17:53 PST
Because the test itself passes 95%+ of the time (it always passes locally for me - it just fails sporadically on the bots). I'm resubmitting this with the -disabled rename shortly, although just putting it in the skipped file makes it easier to run the test (with --skipped=ignore).
Andrew Wilson
Comment 38 2009-12-11 11:23:03 PST
Rethinking this, I'm just going to skip it on leopard for now, since that's the only platform we've been seeing flakiness on. It could, in theory, fail anywhere since it relies on GC, but in practice it isn't failing anywhere but leopard and the test itself has significant value as it's the only thing that checks to make sure that the complex worker reachability mechanism actually works.
Andrew Wilson
Comment 39 2009-12-11 11:33:00 PST
Created attachment 44699 [details] Disabling on leopard only
Darin Adler
Comment 40 2009-12-11 11:35:08 PST
Comment on attachment 44699 [details] Disabling on leopard only Seems OK as a short term measure. I think we can probably find a way to force the GC to be more deterministic in the test harness. If we start seeing failures on other platforms we might need to come up with a better solution.
WebKit Review Bot
Comment 41 2009-12-11 11:35:20 PST
style-queue ran check-webkit-style on attachment 44699 [details] without any errors.
WebKit Commit Bot
Comment 42 2009-12-11 12:38:38 PST
Comment on attachment 44699 [details] Disabling on leopard only Clearing flags on attachment: 44699 Committed r52014: <http://trac.webkit.org/changeset/52014>
WebKit Commit Bot
Comment 43 2009-12-11 12:38:49 PST
All reviewed patches have been landed. Closing bug.
Darin Adler
Comment 44 2009-12-11 13:04:28 PST
(In reply to comment #37) > just putting it in the skipped file makes it easier to run the test (with --skipped=ignore). It might be good to replace the -disabled technique with a global Skipped file. Although that means there will be 70 other tests that --skipped=ignore will now include, which might not be good.
Note You need to log in before you can comment on or make changes to this bug.