WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
29344
REGRESSION: fast/workers/dedicated-worker-lifecycle.html failing intermittently on leopard bot
https://bugs.webkit.org/show_bug.cgi?id=29344
Summary
REGRESSION: fast/workers/dedicated-worker-lifecycle.html failing intermittent...
Eric Seidel (no email)
Reported
2009-09-17 13:39:06 PDT
fast/workers/dedicated-worker-lifecycle.html failing intermittently on leopard bot
http://build.webkit.org/results/Leopard%20Intel%20Release%20(Tests)/r48488%20(5093)/results.html
--- layout-test-results/fast/workers/dedicated-worker-lifecycle-expected.txt 2009-09-17 13:27:57.000000000 -0700 +++ layout-test-results/fast/workers/dedicated-worker-lifecycle-actual.txt 2009-09-17 13:27:57.000000000 -0700 @@ -1,12 +1,8 @@ +FAIL: Timed out waiting for notifyDone to be called This test checks whether orphaned workers exit under various conditions On success, you will see a series of "PASS" messages, followed by "TEST COMPLETE". PASS Orphaned worker thread created. -PASS Orphaned worker thread exited. -PASS Orphaned timeout worker thread created. -PASS Orphaned timeout worker thread exited. - -TEST COMPLETE This might be a one-time failure, but I'll file this so we have record for now.
Attachments
Proposal to skip the test until a fix can be found
(1.26 KB, patch)
2009-09-28 16:24 PDT
,
Eric Seidel (no email)
no flags
Details
Formatted Diff
Diff
proposed patch that may address the flakiness
(3.43 KB, patch)
2009-09-28 18:32 PDT
,
Andrew Wilson
no flags
Details
Formatted Diff
Diff
Crash log.
(30.78 KB, text/plain)
2009-12-07 14:49 PST
,
David Levin
no flags
Details
Normal crash log (-- had incorrect define for jsc zombies).
(27.47 KB, text/plain)
2009-12-07 15:39 PST
,
David Levin
no flags
Details
Another JSC_ZOMBIES crash log (in case it's helpful)
(26.10 KB, text/plain)
2009-12-07 16:16 PST
,
Eric Seidel (no email)
no flags
Details
Add dedicated-worker-lifecycle.html to the appropriate Skipped files
(2.07 KB, patch)
2009-12-11 11:00 PST
,
Andrew Wilson
no flags
Details
Formatted Diff
Diff
Disabling on leopard only
(1.31 KB, patch)
2009-12-11 11:33 PST
,
Andrew Wilson
no flags
Details
Formatted Diff
Diff
Show Obsolete
(6)
View All
Add attachment
proposed patch, testcase, etc.
Eric Seidel (no email)
Comment 1
2009-09-28 16:19:25 PDT
Sigh. This just caused
bug 29743
to fail to land too. I'll cook up a patch to skip this test too.
Eric Seidel (no email)
Comment 2
2009-09-28 16:24:33 PDT
Created
attachment 40269
[details]
Proposal to skip the test until a fix can be found
Andrew Wilson
Comment 3
2009-09-28 16:45:25 PDT
(In reply to
comment #2
)
> Created an attachment (id=40269) [details] > Proposal to skip the test until a fix can be found
These lifecycle tests are all somewhat dodgy I suspect. The problem is that the worker threads only exit once the Worker object itself is GC'd, and the worker object will not be GC'd if there's something on the stack that looks like a reference to it. I've seen odd errors in the past due to this (objects not getting GC'd even though there are no references). It's probably reasonable to just leave these disabled since there's no ironclad way for JS code to force something to get garbage collected.
David Levin
Comment 4
2009-09-28 16:50:35 PDT
Based on discussion with Drew, I r+ this. It sounds like it may need to be removed completely.
Eric Seidel (no email)
Comment 5
2009-09-28 16:52:41 PDT
We can force a collection with GCController.collect() in DumpRenderTree. If there are no references to it, it will be collected. Would that help you make this non-flakey?
Maciej Stachowiak
Comment 6
2009-09-28 16:53:32 PDT
(In reply to
comment #3
)
> (In reply to
comment #2
) > > Created an attachment (id=40269) [details] [details] > > Proposal to skip the test until a fix can be found > > These lifecycle tests are all somewhat dodgy I suspect. > > The problem is that the worker threads only exit once the Worker object itself > is GC'd, and the worker object will not be GC'd if there's something on the > stack that looks like a reference to it. I've seen odd errors in the past due > to this (objects not getting GC'd even though there are no references). It's > probably reasonable to just leave these disabled since there's no ironclad way > for JS code to force something to get garbage collected.
We have found ways before to get objects to be GC'd with 100% reliability in tests. Usually this involves doing lots of allocation and calling after the last reference should be dropped, and using a timer to return to the event loop and clear out any temporary stack.
Eric Seidel (no email)
Comment 7
2009-09-28 16:56:36 PDT
I'm in no hurry to check this in. I'll wait until there is some amount of consensus about what should be done to this test. If the vote is it should be Skipped for all platforms, or removed entirely, please just r- the existing patch and I'll be happy to post a different one (or someone else is welcome to). Thanks again for looking!
Andrew Wilson
Comment 8
2009-09-28 17:02:44 PDT
(In reply to
comment #6
)
> We have found ways before to get objects to be GC'd with 100% reliability in > tests. Usually this involves doing lots of allocation and calling after the > last reference should be dropped, and using a timer to return to the event loop > and clear out any temporary stack.
When I looked at this, I also tried those ways (including the exhaustive gc() function from js-test-pre.js, also happening from a setTimeout()): function gc() { if (typeof GCController !== "undefined") GCController.collect(); else { function gcRec(n) { if (n < 1) return {}; var temp = {i: "ab" + i + (i / 100000)}; temp += "foo"; gcRec(n-1); } for (var i = 0; i < 1000; i++) gcRec(10) } } And yet the GC would not free the object. In my previous case, it turned out that allocating a Date() object (for whatever reason) would allow the object to get GC'd (so I know the problem really was an issue of GC, and not some kind of lingering reference) - if you look at the body of dedicated-worker-lifecycle.js, you'll see this: // For some reason, the worker object does not get GC'd unless we allocate a new object here. // The conjecture is that there's a value on the stack that appears to point to the worker which this clobbers. new Date(); Anyhow, I'd be happy to make whatever changes are necessary to keep this test live - but I've tried all the suggestions I've heard so far and none of them have worked :(
Andrew Wilson
Comment 9
2009-09-28 17:04:51 PDT
(In reply to
comment #8
)
> When I looked at this, I also tried those ways (including the exhaustive gc() > function from js-test-pre.js, also happening from a setTimeout()):
To clarify, I was forcing the "slow/exhaustive" branch in the gc() routine - it wasn't just calling GCCollect.collect(). Anyhow, I could definitely re-enable that code again - maybe at least it would make it stable enough that we could keep the test enabled...
Andrew Wilson
Comment 10
2009-09-28 17:25:52 PDT
Here's my proposal: I'll re-enable the code that does what Maciej suggests. We'll leave the test disabled on mac-leopard, and watch to see if we see failures on any other platform. If not, we can try re-enabling on mac-leopard as well.
Andrew Wilson
Comment 11
2009-09-28 18:32:44 PDT
Created
attachment 40277
[details]
proposed patch that may address the flakiness Here's a patch that makes the changes Maciej suggests. We can either land this patch and wait and see if it addresses the flakiness, or land both this and Eric's patch.
Eric Seidel (no email)
Comment 12
2009-09-30 15:51:21 PDT
Comment on
attachment 40277
[details]
proposed patch that may address the flakiness Where did you get the gc() changes? I'd love to try something since this is still flakey on the bots:
http://build.webkit.org/results/Leopard%20Intel%20Release%20(Tests)/r48919%20(5566)/results.html
Andrew Wilson
Comment 13
2009-09-30 16:38:31 PDT
(In reply to
comment #12
)
> (From update of
attachment 40277
[details]
) > Where did you get the gc() changes? >
From fast/js/resources/js_test_pre.js, but with minor logic changes to force them to happen even if GCController is defined.
Eric Seidel (no email)
Comment 14
2009-10-01 12:15:46 PDT
Comment on
attachment 40277
[details]
proposed patch that may address the flakiness I certainly think this is worth a try. I'm not your best reviewer for this, but the others have been silent and this looks non-harmful.
Eric Seidel (no email)
Comment 15
2009-10-01 12:16:06 PDT
Comment on
attachment 40269
[details]
Proposal to skip the test until a fix can be found Removing Levin's r+ while we wait to see if Andrew's fix works.
Andrew Wilson
Comment 16
2009-10-01 13:28:57 PDT
Committed as
r48996
. I'll keep this bug open for a bit to see if this addresses the flakiness at all.
Eric Seidel (no email)
Comment 17
2009-10-05 11:09:01 PDT
Comment on
attachment 40277
[details]
proposed patch that may address the flakiness Clearing my r+ from this patch now that it's been committed. (This way the to-be-committed query doesn't list it.)
Eric Seidel (no email)
Comment 18
2009-10-11 10:18:39 PDT
This doesn't seem to be gone, sadly:
http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r49424%20(5891)/results.html
Alexey Proskuryakov
Comment 19
2009-10-11 15:33:53 PDT
On my home machine (and a debug build), this test isn't just flaky - it's always failing. Worker activity management appears to be working well, so this looks like a general GC failure. I see the worker wrapper marked by Heap::markConservatively(), as if there were references to it remaining on the stack. The test already attempts black magic to fight this. As far as I know, it shouldn't be necessary, and it doesn't work anyway. // Orphan our worker (no more references to it) and wait for it to exit. worker.onmessage = 0; worker = 0; // For some reason, the worker object does not get GC'd unless we allocate a new object here. // The conjecture is that there's a value on the stack that appears to point to the worker which this clobbers. new Date(); waitUntilWorkerThreadsExit(orphanedTimeoutWorkerExited); Here, waitUntilWorkerThreadsExit invokes garbage collector, each 10 ms, waiting for GC to collect the unused Worker object, which never happens.
Andrew Wilson
Comment 20
2009-10-11 15:40:50 PDT
OK, it sounds like the only thing to do here is just skip the test (eric's original patch) since my black magic to try to force the object off the stack isn't working. This really befuddles me, since retrying via setTimeout() should leave the stack totally empty.
Eric Seidel (no email)
Comment 21
2009-11-30 22:14:59 PST
Crashed tonight:
http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51524%20(7729)/results.html
Eric Seidel (no email)
Comment 22
2009-11-30 22:16:17 PST
Looks like it failed earlier tonight as well:
http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51523%20(7728)/fast/workers/dedicated-worker-lifecycle-pretty-diff.html
Eric Seidel (no email)
Comment 23
2009-12-02 13:04:49 PST
Failed again just now:
http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51610%20(7808)/fast/workers/dedicated-worker-lifecycle-pretty-diff.html
Eric Seidel (no email)
Comment 24
2009-12-03 14:13:09 PST
Crashed again:
http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51657%20(7847)/results.html
I'll see if I can find cycles to find a locally reproducible crash.
Oliver Hunt
Comment 25
2009-12-03 14:18:46 PST
(In reply to
comment #24
)
> Crashed again: >
http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51657%20(7847)/results.html
> > I'll see if I can find cycles to find a locally reproducible crash.
try enabling JSC_ZOMBIES
Eric Seidel (no email)
Comment 26
2009-12-07 12:00:00 PST
Another crash on the bots:
http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51782%20(7961)/results.html
I would like to disable/skip this test soon if it remains flakey like this. No sense in keeping red tests around.
David Levin
Comment 27
2009-12-07 14:49:51 PST
Created
attachment 44434
[details]
Crash log. I was unable to repro the crash when I ran the worker directory 1000 times but I ran the fast directory 10 times ("run-webkit-tests --iterations 10 fast/") and I got the crash once.
David Levin
Comment 28
2009-12-07 15:39:28 PST
Created
attachment 44438
[details]
Normal crash log (-- had incorrect define for jsc zombies). It still looks like some memory corruption issue (that JSC_ZOMBIES didn't detect). Time to run with malloc guard on.
Eric Seidel (no email)
Comment 29
2009-12-07 16:16:58 PST
Created
attachment 44442
[details]
Another JSC_ZOMBIES crash log (in case it's helpful) Thanks for looking dave! I also ran a JSC_ZOMBIES run this morning, and now that I'm finally back at my desk, I see that it too crashed.
Eric Seidel (no email)
Comment 30
2009-12-07 16:19:29 PST
The ASSERT that it crashed at was: ASSERTION FAILED: primaryHeap.numLiveObjects == primaryHeap.numZombies (/Users/eseidel/Projects/WebKit/JavaScriptCore/runtime/Collector.cpp:199 void JSC::Heap::destroy())
David Levin
Comment 31
2009-12-09 11:22:12 PST
Notes on the crash: * The assert at JavaScriptCore/runtime/Collector.cpp:199 isn't too useful, so it is good to comment it out (I think there may be a bug in the counting done in that function of num live objects.) * Use
r51657
for repros. I couldn't repro with tip of tree,
r51782
had build errors for me,
r51657
built and repros. * The smallest repro that I got so far is two test: run-webkit-tests --iterations 20 LayoutTests/fast/websockets/ fast/workers/dedicated-worker-lifecycle.html If you run ~20 times, you'll very likely see the crash. * Running with malloc guard causes the test to time out but if you turn off the brute force gc in worker-util.js, the test will pass (and the above crashes were done like this). * It turns out that the crash was due to WebSockets and was recently fixed with
https://bugs.webkit.org/show_bug.cgi?id=32226
in
r51790
. (Note no more buildbot crashes have been listed after that revision). Case closed for the crashes!
Andrew Wilson
Comment 32
2009-12-11 10:02:38 PST
Now that Dave has fixed the crash, I'm going to disable this flaky test unless someone objects in short order.
Andrew Wilson
Comment 33
2009-12-11 11:00:42 PST
Created
attachment 44697
[details]
Add dedicated-worker-lifecycle.html to the appropriate Skipped files Don't need to disable this in Qt because it's already disabled there (missing DRT functionality). The similarly named shared-worker-lifecycle.html and worker-lifecycle.html tests don't rely on GC (they test whether explicitly calling close() shuts down the thread), so dedicated-worker-lifecycle.html is the only one that needs to be disabled.
WebKit Review Bot
Comment 34
2009-12-11 11:04:17 PST
style-queue ran check-webkit-style on
attachment 44697
[details]
without any errors.
Darin Adler
Comment 35
2009-12-11 11:11:51 PST
Comment on
attachment 44697
[details]
Add dedicated-worker-lifecycle.html to the appropriate Skipped files If a test is not working on any platform, we normally disable it by renaming the file to add a "-disabled" suffix. Try this command to see the 70 tests currently disabled in this fashion: find LayoutTests -name '*-disabled' Would you consider disabling this test that way? Or is this test still useful on some platform?
David Levin
Comment 36
2009-12-11 11:15:58 PST
Why are disabling this as opposed to deleting it completely? From what I've seen fixing this seems like it depends on getting gc of a *particular* object to happen in a more deterministic manner which despite repeated attempts hasn't been possible (and it is hard to see what is going to change in that regard).
Andrew Wilson
Comment 37
2009-12-11 11:17:53 PST
Because the test itself passes 95%+ of the time (it always passes locally for me - it just fails sporadically on the bots). I'm resubmitting this with the -disabled rename shortly, although just putting it in the skipped file makes it easier to run the test (with --skipped=ignore).
Andrew Wilson
Comment 38
2009-12-11 11:23:03 PST
Rethinking this, I'm just going to skip it on leopard for now, since that's the only platform we've been seeing flakiness on. It could, in theory, fail anywhere since it relies on GC, but in practice it isn't failing anywhere but leopard and the test itself has significant value as it's the only thing that checks to make sure that the complex worker reachability mechanism actually works.
Andrew Wilson
Comment 39
2009-12-11 11:33:00 PST
Created
attachment 44699
[details]
Disabling on leopard only
Darin Adler
Comment 40
2009-12-11 11:35:08 PST
Comment on
attachment 44699
[details]
Disabling on leopard only Seems OK as a short term measure. I think we can probably find a way to force the GC to be more deterministic in the test harness. If we start seeing failures on other platforms we might need to come up with a better solution.
WebKit Review Bot
Comment 41
2009-12-11 11:35:20 PST
style-queue ran check-webkit-style on
attachment 44699
[details]
without any errors.
WebKit Commit Bot
Comment 42
2009-12-11 12:38:38 PST
Comment on
attachment 44699
[details]
Disabling on leopard only Clearing flags on attachment: 44699 Committed
r52014
: <
http://trac.webkit.org/changeset/52014
>
WebKit Commit Bot
Comment 43
2009-12-11 12:38:49 PST
All reviewed patches have been landed. Closing bug.
Darin Adler
Comment 44
2009-12-11 13:04:28 PST
(In reply to
comment #37
)
> just putting it in the skipped file makes it easier to run the test (with --skipped=ignore).
It might be good to replace the -disabled technique with a global Skipped file. Although that means there will be 70 other tests that --skipped=ignore will now include, which might not be good.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug