Skip to content

Conversation

@DaanHoogland
Copy link
Contributor

@DaanHoogland DaanHoogland commented Apr 14, 2025

Description

This PR is a small functional change inspired by #9800, #9852 and #9473. More related issues may exist. The intention is to introduce the concept of non fatal failures in the router health check scripts. In the past only success and failure were reported. Now also the statuses warning and unknown are available.

The scripts,

  • cpu_usage_check.py,
  • dhcp_check.py and
  • diskspace_check.py
    were changed, but the machanism may be applied to other scripts as well.

monitor.py has been changed to relay the extra possible statuses, and the backend and UI have been altered to store them and display them to the user.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

  1. create a VPC
  2. create a tier
  3. create a VM in the new tier
  4. wait for the VM to come up
  5. delete the VM
  6. .... (wait) once the vr has shut the tier down the VR won't have a webservice for the tier anymore and will report a failure which is a false positive. After the change it should show as UNKNOWN instead of FAILED

How did you try to break this feature and the system with this change?

added some fake tests to test the UI:
Screenshot 2025-04-25 at 17 02 52

@DaanHoogland DaanHoogland changed the title Ghi9800 update vr on setting change [router] make a distinction between fatal errors, warnings and unknown as healthcheck result Apr 14, 2025
@codecov
Copy link

codecov bot commented Apr 14, 2025

Codecov Report

❌ Patch coverage is 13.48684% with 263 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.39%. Comparing base (3d6ec29) to head (39fdd8d).
⚠️ Report is 56 commits behind head on main.

Files with missing lines Patch % Lines
...ork/router/VirtualNetworkApplianceManagerImpl.java 4.82% 134 Missing and 4 partials ⚠️
...tack/engine/orchestration/NetworkOrchestrator.java 9.52% 57 Missing ⚠️
...a/com/cloud/network/router/CommandSetupHelper.java 0.00% 22 Missing and 2 partials ⚠️
...loud/network/lb/LoadBalancingRulesManagerImpl.java 0.00% 13 Missing ⚠️
...src/main/java/com/cloud/api/ApiResponseHelper.java 0.00% 9 Missing ⚠️
.../api/response/RouterHealthCheckResultResponse.java 0.00% 8 Missing ⚠️
.../src/main/java/com/cloud/configuration/Config.java 50.00% 8 Missing ⚠️
...m/cloud/network/dao/RouterHealthCheckResultVO.java 0.00% 2 Missing ⚠️
.../apache/cloudstack/framework/config/ConfigKey.java 66.66% 2 Missing ⚠️
...oud/network/lb/ElasticLoadBalancerManagerImpl.java 0.00% 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #10710      +/-   ##
============================================
+ Coverage     17.36%   17.39%   +0.03%     
- Complexity    15237    15281      +44     
============================================
  Files          5888     5890       +2     
  Lines        525741   526155     +414     
  Branches      64164    64233      +69     
============================================
+ Hits          91274    91528     +254     
- Misses       424167   424282     +115     
- Partials      10300    10345      +45     
Flag Coverage Δ
uitests 3.62% <ø> (-0.01%) ⬇️
unittests 18.44% <13.48%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions
Copy link

github-actions bot commented May 9, 2025

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@Pearl1594 Pearl1594 modified the milestones: 4.20.1, 4.20.2 Jun 3, 2025
@Pearl1594
Copy link
Contributor

We probably need to raise this against 4.19?

@DaanHoogland DaanHoogland modified the milestones: 4.20.2, 4.19.4 Jun 3, 2025
@DaanHoogland DaanHoogland self-assigned this Aug 7, 2025
@DaanHoogland DaanHoogland force-pushed the ghi9800-updateVrOnSettingChange branch from 732ef96 to f3848c0 Compare August 7, 2025 07:09
Copy link
Contributor

@harikrishna-patnala harikrishna-patnala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code LGTM

Copy link
Member

@vishesh92 vishesh92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

@DaanHoogland
Copy link
Contributor Author

this has been (dev-)tested but needs an upgrade test as most important validation.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 15063

@blueorangutan
Copy link

[SF] Trillian test result (tid-14379)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 54023 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10710-t14379-kvm-ol8.zip
Smoke tests completed. 145 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_03_deploy_and_scale_kubernetes_cluster Failure 1.17 test_kubernetes_clusters.py
test_04_autoscale_kubernetes_cluster Failure 27.82 test_kubernetes_clusters.py
test_01_deployVMInSharedNetwork Failure 410.15 test_network.py

@blueorangutan
Copy link

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 15111

@harikrishna-patnala
Copy link
Contributor

@DaanHoogland is this failure related "test_01_deployVMInSharedNetwork" ?

@weizhouapache
Copy link
Member

weizhouapache commented Sep 20, 2025

@DaanHoogland is this failure related "test_01_deployVMInSharedNetwork" ?

maybe just a intermittent failure

@blueorangutan package

@blueorangutan
Copy link

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 15123

@weizhouapache
Copy link
Member

@blueorangutan test

@blueorangutan
Copy link

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-14410)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 52362 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10710-t14410-kvm-ol8.zip
Smoke tests completed. 147 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@DaanHoogland
Copy link
Contributor Author

@vishesh92 @harikrishna-patnala @weizhouapache , there is a slight backwards incompatibility with this change; success can now be true, false or not present. In addition to result there is status, which can be “success”, “failure”, “warning” or “unknown”. So the incompatibilty is that result may not be present. I think this is not grave as it would be a false positive in the past. What do you think?

Other than this we are good to merge (@vladimirpetrov tested but forgot to give his opinion here;)

@blueorangutan
Copy link

[SF] Trillian test result (tid-14415)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 50207 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10710-t14415-kvm-ol8.zip
Smoke tests completed. 147 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@harikrishna-patnala
Copy link
Contributor

@vishesh92 @harikrishna-patnala @weizhouapache , there is a slight backwards incompatibility with this change; success can now be true, false or not present. In addition to result there is status, which can be “success”, “failure”, “warning” or “unknown”. So the incompatibilty is that result may not be present. I think this is not grave as it would be a false positive in the past. What do you think?

Other than this we are good to merge (@vladimirpetrov tested but forgot to give his opinion here;)

as we are only updating the value of the response parameter properly I think this is fine and moreover more detailed 'status' is now present.

@harikrishna-patnala harikrishna-patnala merged commit aca8732 into apache:main Sep 22, 2025
27 of 28 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Apache CloudStack 4.22.0 Sep 22, 2025
@harikrishna-patnala harikrishna-patnala deleted the ghi9800-updateVrOnSettingChange branch September 22, 2025 06:09
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Oct 17, 2025
…n as healthcheck result (apache#10710)

* [routers] distiction between fatal failure and warning or unknown on healthchecks

* UI status for router health checks

* status from scripts varied

* automation signalled errors

* revert removal of update sql

* upgradeversion

* move config item and further cleanup

* handling services better

* backwards compatible response

---------

Co-authored-by: Daan Hoogland <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment