[KYUUBI #7301] Fix engine not killed after deregisterEngine#7302
[KYUUBI #7301] Fix engine not killed after deregisterEngine#7302taylor12805 wants to merge 2 commits intoapache:masterfrom
Conversation
(cherry picked from commit b6b05e9)
There was a problem hiding this comment.
Pull request overview
This PR fixes a bug where non-CONNECTION share level engines were not killed when deregistered, leaving zombie engine processes running. The fix adds logic to kill the engine application process when deregistering engines with share levels USER, GROUP, SERVER_LOCAL, or SERVER.
Changes:
- Added engine process killing logic in the
deregistermethod for non-CONNECTION share level engines - The kill attempt is wrapped in a try-catch block to handle failures gracefully
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (shareLevel != CONNECTION && builder != null) { | ||
| try { | ||
| val appMgrInfo = builder.appMgrInfo() | ||
| engineManager.killApplication(appMgrInfo, engineRefId, Some(appUser)) |
There was a problem hiding this comment.
Missing null check for engineManager. The condition should be updated to check both builder and engineManager for null before attempting to call killApplication. Without this check, a NullPointerException will occur when engineManager is null, such as in test scenarios where EngineRef is instantiated with a null engineManager parameter.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7302 +/- ##
======================================
Coverage 0.00% 0.00%
======================================
Files 698 698
Lines 43646 43654 +8
Branches 5894 5895 +1
======================================
- Misses 43646 43654 +8 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
I suppose the failing test is not correlated to this PR |
|
It might be better to make the Engine can know whether it is zombie, and stop itself. Since the |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (shareLevel != CONNECTION && builder != null && engineManager != null) { | ||
| try { | ||
| val appMgrInfo = builder.appMgrInfo() | ||
| engineManager.killApplication(appMgrInfo, engineRefId, Some(appUser)) | ||
| info(s"Killed engine process for $engineRefId with share level $shareLevel") | ||
| } catch { | ||
| case e: Exception => | ||
| warn(s"Error killing engine process for $engineRefId", e) | ||
| } | ||
| } |
There was a problem hiding this comment.
Critical bug: The condition builder != null will prevent killing engines in the most common deregister scenarios. The builder field is only set when this EngineRef instance creates a new engine (line 216 in the create method). However, when an existing engine is discovered via getOrCreate, the builder remains null. This means when deregister is called for discovered engines (the exact scenario described in issue #7301), the engine process won't be killed.
Additionally, even if builder were non-null, calling builder.appMgrInfo() is incorrect here because the appMgrInfo should be retrieved from the ServiceNodeInfo that was just fetched from ZooKeeper. The correct pattern (as shown in AdminResource.scala:312) is to get appMgrInfo from sn.attributes.get(KyuubiReservedKeys.KYUUBI_ENGINE_APP_MGR_INFO_KEY).
The fix should:
- Get the ServiceNodeInfo from ZooKeeper
- Extract appMgrInfo from
sn.attributes - Use that to kill the application, without depending on the builder at all
| if (shareLevel != CONNECTION && builder != null && engineManager != null) { | ||
| try { | ||
| val appMgrInfo = builder.appMgrInfo() | ||
| engineManager.killApplication(appMgrInfo, engineRefId, Some(appUser)) | ||
| info(s"Killed engine process for $engineRefId with share level $shareLevel") | ||
| } catch { | ||
| case e: Exception => | ||
| warn(s"Error killing engine process for $engineRefId", e) | ||
| } | ||
| } |
There was a problem hiding this comment.
The existing test for deregister (line 377-398 in EngineRefTests.scala) doesn't verify that the engine process is actually killed when deregister is called. The test only verifies that the engine is removed from ZooKeeper. Since this PR's purpose is to fix the bug where engines are not killed after deregistration, a test should be added to verify that killApplication is called with the correct parameters when deregister succeeds. This could be done by mocking the engineManager and verifying the killApplication call.
(cherry picked from commit b6b05e9)
Why are the changes needed?
Fix old engine not killed bug: #7301
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
No