61847 – Worker may not be stopped after LT.

RESOLVED FIXED61847

Worker may not be stopped after LT.

https://bugs.webkit.org/show_bug.cgi?id=61847

Summary Worker may not be stopped after LT.

Hao Zheng

Reported 2011-06-01 01:48:01 PDT

In worker.py, only call to cleanup() is in __del__, which is not guaranteed to be called when the python interpreter exits. And self._driver is not killed after LT. After many runs of LT, there will be many zombie processes of driver. I think we need to run cleanup() explicitly for worker. I'm not sure where is better to put cleanup(): def handle_stop(self, src): self._done = True ** self.cleanup() Or, def run(self, port): self.safe_init(port) exception_msg = "" _log.debug("%s starting" % self._name) try: self._worker_connection.run_message_loop() if not self.is_done(): raise AssertionError("%s: ran out of messages in worker queue." % self._name) except KeyboardInterrupt: exception_msg = ", interrupted" except: exception_msg = ", exception raised" finally: _log.debug("%s done%s" % (self._name, exception_msg)) if exception_msg: exception_type, exception_value, exception_traceback = sys.exc_info() stack_utils.log_traceback(_log.debug, exception_traceback) # FIXME: Figure out how to send a message with a traceback. self._worker_connection.post_message('exception', (exception_type, exception_value, None)) self._worker_connection.post_message('done') ** self.cleanup() I'd like to hear your comments.

Attachments
Proposed patch. (1.47 KB, patch) 2011-06-01 21:15 PDT, Hao Zheng	no flags	Details Formatted Diff Diff
Patch (3.75 KB, patch) 2011-06-02 19:17 PDT, Dirk Pranke	no flags	Details Formatted Diff Diff
Patch (3.55 KB, patch) 2011-06-03 12:34 PDT, Dirk Pranke	no flags	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Dirk Pranke

Comment 1 2011-06-01 10:34:28 PDT

cleanup() should be getting called at the end of run(). I'm pretty sure it did at some point; I wonder if I deleted it? Feel free to add a patch for this, or I'll do so this afternoon.

Hao Zheng

Comment 2 2011-06-01 21:15:49 PDT

Created attachment 95716 [details] Proposed patch.

WebKit Commit Bot

Comment 3 2011-06-02 12:58:55 PDT

The commit-queue encountered the following flaky tests while processing attachment 95716 [details]: fast/history/history-subframe-with-name.html bug 51039 (author: mihaip@chromium.org) http/tests/websocket/tests/error-detect.html bug 54012 (author: abarth@webkit.org) The commit-queue is continuing to process your patch.

WebKit Commit Bot

Comment 4 2011-06-02 13:01:14 PDT

Comment on attachment 95716 [details] Proposed patch. Clearing flags on attachment: 95716 Committed r87946: <http://trac.webkit.org/changeset/87946>

WebKit Commit Bot

Comment 5 2011-06-02 13:01:18 PDT

All reviewed patches have been landed. Closing bug.

Adam Barth

Comment 6 2011-06-02 14:40:34 PDT

This broke NRWT in multiple configurations. Here's an example: http://build.chromium.org/p/chromium.webkit/builders/Webkit%20Linux%20%28dbg%29%281%29/builds/6754/steps/webkit_tests/logs/stdio 2011-06-02 13:35:14,919 8291 manager.py:672 DEBUG No wedged threads 2011-06-02 13:35:15,419 8291 manager.py:690 INFO Exception raised, exiting Traceback (most recent call last): File "/mnt/data/b/build/slave/Webkit_Linux__dbg__1_/build/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 460, in <module> sys.exit(main()) File "/mnt/data/b/build/slave/Webkit_Linux__dbg__1_/build/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 455, in main return run(port_obj, options, args) File "/mnt/data/b/build/slave/Webkit_Linux__dbg__1_/build/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 118, in run num_unexpected_results = manager.run(result_summary) File "/mnt/data/b/build/slave/Webkit_Linux__dbg__1_/build/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py", line 779, in run self._run_tests(self._test_files_list, result_summary)) File "/mnt/data/b/build/slave/Webkit_Linux__dbg__1_/build/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py", line 678, in _run_tests assert not worker_state.worker_connection.is_alive() AssertionError program finished with exit code 1

Dirk Pranke

Comment 7 2011-06-02 19:17:46 PDT

Created attachment 95845 [details] Patch

Hao Zheng

Comment 8 2011-06-02 20:13:12 PDT

Thanks for your help. Sorry to break the bots.

Tony Chang

Comment 9 2011-06-03 09:53:28 PDT

Comment on attachment 95845 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=95845&action=review > Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:681 > + _log.error('Worked %d did not exit in time. Aborting.' % worker_state.number) > + raise AssertionError('Worker %d did not exit in time' % worker_state.number) Should we raise an Assertion here or just log the error and try to keep going? > Tools/Scripts/webkitpy/layout_tests/layout_package/worker.py:150 > + self.kill_driver() > + self.stop_servers_with_lock() Nit: We may want to remove the _driver and _has_http_lock checks in the other call sites (in a separate patch).

Dirk Pranke

Comment 10 2011-06-03 10:01:43 PDT

(In reply to comment #9) > (From update of attachment 95845 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=95845&action=review > > > Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:681 > > + _log.error('Worked %d did not exit in time. Aborting.' % worker_state.number) > > + raise AssertionError('Worker %d did not exit in time' % worker_state.number) > > Should we raise an Assertion here or just log the error and try to keep going? > Good question. I'm not actually sure what the end result of the run will be in either case (hung process or process that eventually exits when the slow thread/process exits). I would prefer to leave it as an assertion for now just to help identify if it actually occurs. > > Tools/Scripts/webkitpy/layout_tests/layout_package/worker.py:150 > > + self.kill_driver() > > + self.stop_servers_with_lock() > > Nit: We may want to remove the _driver and _has_http_lock checks in the other call sites (in a separate patch). will do.

Tony Chang

Comment 11 2011-06-03 11:29:41 PDT

Comment on attachment 95845 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=95845&action=review >>> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:681 >>> + raise AssertionError('Worker %d did not exit in time' % worker_state.number) >> >> Should we raise an Assertion here or just log the error and try to keep going? > > Good question. I'm not actually sure what the end result of the run will be in either case (hung process or process that eventually exits when the slow thread/process exits). I would prefer to leave it as an assertion for now just to help identify if it actually occurs. I think this is OK, but In general I think we should try to have the bots keep going as much as possible. If the script crashes, it causes problems for the gardener who then needs to rollback and wait for the bots to cycle an extra time. The downside for you is much smaller: you just need to land the patch manually and watch the bots as they cycle to see if an error is logged. You might want to try landing this during non-busy hours or coordinate the landing with the gardener.

Dirk Pranke

Comment 12 2011-06-03 12:32:23 PDT

(In reply to comment #11) > (From update of attachment 95845 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=95845&action=review > > >>> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:681 > >>> + raise AssertionError('Worker %d did not exit in time' % worker_state.number) > >> > >> Should we raise an Assertion here or just log the error and try to keep going? > > > > Good question. I'm not actually sure what the end result of the run will be in either case (hung process or process that eventually exits when the slow thread/process exits). I would prefer to leave it as an assertion for now just to help identify if it actually occurs. > > I think this is OK, but In general I think we should try to have the bots keep going as much as possible. If the script crashes, it causes problems for the gardener who then needs to rollback and wait for the bots to cycle an extra time. The downside for you is much smaller: you just need to land the patch manually and watch the bots as they cycle to see if an error is logged. > > You might want to try landing this during non-busy hours or coordinate the landing with the gardener. Oh, I'm not worried about this patch in particular being broken, I'm worried about something happening down the road that causes this to fail. That said, I just tested it, and raising the assert won't by itself cause the program to exit (since there are still child processes running), so there's not much effective difference here between asserting and logging an error. I'll change this to a log for now. There's also two typos in that line that are real bugs, so I'll post another patch that actually works :)

Dirk Pranke

Comment 13 2011-06-03 12:34:24 PDT

Created attachment 95947 [details] Patch

Dirk Pranke

Comment 14 2011-06-03 12:44:03 PDT

Committed r88040: <http://trac.webkit.org/changeset/88040>

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component Tools / Tests

Assignee

Dirk Pranke

Reported

2011-06-01 01:48 PDT

Modified

2011-06-03 12:44 PDT History

CC List

4 users Show

URL

Keywords

Depends on

61965

Blocks

Dependencies

tree graph