Collaborator

@Jeffwan Jeffwan commented Sep 3, 2025

Pull Request Description

[Feat] Support StormService pause rollout in upgrade

  • Update stormservice golang client
  • Improve the test coverage
  • Refactor the API to support manual resume

Related Issues

Resolves: #1291


@Jeffwan Jeffwan marked this pull request as draft September 3, 2025 13:22
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @Jeffwan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the StormService upgrade mechanism by introducing a robust canary deployment strategy. This allows for controlled, phased rollouts of new versions, minimizing risk by gradually exposing changes to a subset of the environment before full deployment. The new capabilities include flexible pausing and weight-based traffic management.

Highlights

  • Canary Deployment Feature: Introduces comprehensive canary deployment capabilities for StormService upgrades, enabling gradual rollouts with defined steps for weight-based traffic shifting and configurable pauses.
  • Flexible Pause Mechanisms: Implements both time-based automatic pauses and manual pauses that require explicit user intervention to resume, offering fine-grained control during staged rollouts.
  • API and CRD Extensions: Extends the StormService API with new CanaryUpdateStrategy and CanaryStatus fields, along with supporting types like CanaryStep, PauseStep, CanaryPhase, and PauseCondition, fully integrated via CRD updates and client-side apply configurations.
  • Enhanced Test Coverage: Adds extensive unit, integration, and end-to-end tests specifically for the new canary deployment logic, ensuring robustness and reliability of the feature.
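
To make the step types named above concrete, here is a minimal Go sketch of how a canary step with an optional pause could be modeled. Only the type names `CanaryStep` and `PauseStep` come from the summary; the fields `SetWeight`, `Pause`, and `Duration`, the convention that a missing duration means a manual pause, and the helpers `timedPause`, `isManualPause`, and `pauseExpired` are assumptions made for illustration and may not match the actual aibrix API.

```go
package main

import (
	"fmt"
	"time"
)

// PauseStep is named in the PR summary; its field is an assumption.
type PauseStep struct {
	// Duration of an automatic pause. nil is assumed to mean a manual
	// pause that waits for an explicit resume.
	Duration *time.Duration
}

// CanaryStep is named in the PR summary; its fields are assumptions.
type CanaryStep struct {
	SetWeight *int32     // assumed: percentage of replicas on the new revision
	Pause     *PauseStep // assumed: optional pause at this step
}

// timedPause is a hypothetical helper that builds a time-based pause step.
func timedPause(d time.Duration) CanaryStep {
	return CanaryStep{Pause: &PauseStep{Duration: &d}}
}

// isManualPause reports whether the step blocks until a user resumes it.
func isManualPause(s CanaryStep) bool {
	return s.Pause != nil && s.Pause.Duration == nil
}

// pauseExpired reports whether a timed pause is over after elapsed time.
// Manual pauses and steps without a pause never expire on their own.
func pauseExpired(s CanaryStep, elapsed time.Duration) bool {
	if s.Pause == nil || s.Pause.Duration == nil {
		return false
	}
	return elapsed >= *s.Pause.Duration
}

func main() {
	weight := int32(25)
	steps := []CanaryStep{
		{SetWeight: &weight},
		timedPause(30 * time.Second),
		{Pause: &PauseStep{}}, // manual pause
	}
	for i, s := range steps {
		fmt.Printf("step %d: manual=%v expiredAfter1m=%v\n",
			i, isManualPause(s), pauseExpired(s, time.Minute))
	}
}
```

Using a nil duration to mark a manual pause keeps the step list uniform, but a dedicated boolean or enum field would be an equally reasonable design.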

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant new feature: support for canary deployments in StormService upgrades. It adds new API fields under updateStrategy.canary to define canary steps, including setting weights and pausing. New controller logic is added to process these canary steps, along with corresponding status fields to track progress. The changes also include generated client code, CRD updates, unit tests, integration tests, and E2E tests for the new functionality.

While the overall structure and API design are sound, there is a critical issue: the core logic to apply the canary weight is not implemented. The functions responsible for adjusting the replica distribution are currently stubs. Additionally, the E2E tests are not comprehensive enough as they don't verify the actual workload state during the rollout. There is also a minor issue with a non-English comment in the code.
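
For readers unfamiliar with what "applying the canary weight" involves, the following is a rough sketch of turning a weight percentage into a replica split. The function name `splitReplicas` and the round-up rule are assumptions chosen for illustration; the PR's actual distribution logic is not shown in this conversation and may differ.

```go
package main

import "fmt"

// splitReplicas divides total replicas between the canary (new) revision
// and the stable revision for a weight in percent. Rounding up is an
// assumption, chosen so that any non-zero weight yields at least one
// canary replica.
func splitReplicas(total, weight int32) (canary, stable int32) {
	if total <= 0 {
		return 0, 0
	}
	if weight <= 0 {
		return 0, total
	}
	if weight >= 100 {
		return total, 0
	}
	canary = (total*weight + 99) / 100 // integer ceil(total*weight/100)
	return canary, total - canary
}

func main() {
	for _, w := range []int32{0, 25, 50, 100} {
		c, s := splitReplicas(10, w)
		fmt.Printf("weight=%d%% -> canary=%d stable=%d\n", w, c, s)
	}
}
```

With 10 replicas and a weight of 25, this rule gives 3 canary and 7 stable replicas. Rounding down instead would give 2 and 8, so the choice matters most for small replica counts.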

@googs1025
Collaborator

/cc
I will help review this feature this weekend 😄

@Jeffwan Jeffwan force-pushed the jiaxin/ss-pause-stop-desing branch 2 times, most recently from b21cf6c to 7546fcc Compare September 7, 2025 03:38
@Jeffwan Jeffwan changed the title [WIP][Feat] Support StormService pause rollout in upgrade [Feat] Support StormService pause rollout in upgrade Sep 7, 2025
@Jeffwan Jeffwan marked this pull request as ready for review September 7, 2025 03:53
@Jeffwan Jeffwan force-pushed the jiaxin/ss-pause-stop-desing branch 3 times, most recently from b365fff to 7a5ae34 Compare September 7, 2025 05:59
// Step 4: Clear canary status - this triggers normal rollout logic to take over
stormService.Status.CanaryStatus = nil

if err := r.Status().Patch(ctx, stormService, client.MergeFrom(original)); err != nil {
Collaborator

r.applyCanaryStatusUpdate has already updated the status once, and here we patch it one last time. Could this final patch cause the earlier update to be lost?

The call is err := r.Status().Patch(ctx, stormService, client.MergeFrom(original)), which uses original as the merge base.
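
One way to reason about this question is to look at what a merge-style patch contains when it is diffed against an older snapshot. The sketch below is a simplified, flat stand-in written with the standard library only. It is not controller-runtime's implementation, and `mergePatch` and the status keys are hypothetical. It only shows which keys such a diff carries and does not settle how the real controller behaves.

```go
package main

import (
	"fmt"
	"reflect"
)

// mergePatch returns a JSON-merge-patch-style diff for flat maps: keys whose
// values differ from base, plus nil for keys that modified no longer has.
// Nested objects and lists are not handled in this simplified version.
func mergePatch(base, modified map[string]any) map[string]any {
	patch := map[string]any{}
	for k, v := range modified {
		if old, ok := base[k]; !ok || !reflect.DeepEqual(old, v) {
			patch[k] = v
		}
	}
	for k := range base {
		if _, ok := modified[k]; !ok {
			patch[k] = nil // explicit null asks the server to drop the key
		}
	}
	return patch
}

func main() {
	// Snapshot taken at the start of the reconcile (hypothetical fields).
	original := map[string]any{"phase": "Progressing", "canaryStatus": "step-2"}

	// In-memory object after an earlier update changed phase and the
	// final step cleared canaryStatus.
	current := map[string]any{"phase": "Completed"}

	fmt.Println(mergePatch(original, current))
}
```

In this model, the diff against the stale snapshot carries both the earlier change and the cleared field, provided the in-memory object still holds the earlier change. Keys that are identical in both maps are left out of the patch entirely.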

Collaborator Author

You are right, this has not been refactored yet. The abort capability is not finished, so I should clean this up.

Collaborator Author

Removed for now; abort will be added back in a future PR.

Comment on lines 301 to 304
// Emit a consistent CanaryUpdate event even if the pause condition already exists
update := newCanaryStatusUpdate().
addEvent("Canary paused at manual pause step. Remove CanaryPauseStep pause condition to continue")
if err := r.applyCanaryStatusUpdate(ctx, stormService, update); err != nil {
Collaborator

Won't this produce a lot of repeated events? We requeue every 30 seconds.

Collaborator Author

Good point. I noticed this issue as well. I should remove some of the unhelpful events.
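
One possible way to cut down on the repeated events is to emit only when the pause condition is newly added, and to stay silent on later reconcile passes. The sketch below assumes a simplified `PauseCondition` with a single `Reason` field and a hypothetical helper `ensurePauseCondition`; neither is taken from the aibrix code.

```go
package main

import "fmt"

// PauseCondition is a simplified stand-in with an assumed Reason field.
type PauseCondition struct {
	Reason string
}

// ensurePauseCondition appends the condition if it is missing and reports
// whether it was added, so callers can emit an event only on that transition.
func ensurePauseCondition(conds []PauseCondition, reason string) ([]PauseCondition, bool) {
	for _, c := range conds {
		if c.Reason == reason {
			return conds, false
		}
	}
	return append(conds, PauseCondition{Reason: reason}), true
}

func main() {
	var conds []PauseCondition
	events := 0
	// Simulate three reconcile passes while the rollout stays paused.
	for i := 0; i < 3; i++ {
		var added bool
		conds, added = ensurePauseCondition(conds, "CanaryPauseStep")
		if added {
			events++ // an event would be recorded here
		}
	}
	fmt.Printf("conditions=%d events=%d\n", len(conds), events)
}
```

Here three passes produce one condition and one event. The trade-off is that a user who misses the first event sees no reminder, so the status field would need to make the paused state visible on its own.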

@googs1025 googs1025 self-assigned this Sep 8, 2025
@Jeffwan
Collaborator Author

Jeffwan commented Sep 15, 2025

@googs1025 I will address the comments tomorrow. I was too busy last week to work on this.

@Jeffwan Jeffwan force-pushed the jiaxin/ss-pause-stop-desing branch 3 times, most recently from 4bbf97b to e59a44b Compare September 23, 2025 23:08
@Jeffwan
Collaborator Author

Jeffwan commented Sep 23, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive canary deployment feature for StormService, including support for weighted steps and both automatic and manual pauses. The implementation is extensive, with new API types, controller logic, and thorough unit and integration tests. The core logic for progressing through canary steps, handling pauses, and calculating replica distribution is well-structured. However, I've identified a critical bug in the scaling logic that could cause a panic, and the API design for resuming manual pauses is unconventional and should be revised to follow Kubernetes best practices. There are also some minor code style and consistency issues. Overall, this is a significant and valuable feature addition that will be even better with these fixes.

* Update stormservice golang client
* Improve the test coverage
* Refactor the API to support manual resume
* improve the canary features
* Leave e2e test to future PRs
* fix lint and verify issues
* Polish the canary status
* Simplify the canary status fields
* Final patch

Signed-off-by: Jiaxin Shan <[email protected]>
@Jeffwan Jeffwan force-pushed the jiaxin/ss-pause-stop-desing branch from e59a44b to ff769e2 Compare September 23, 2025 23:31
@googs1025
Collaborator

will review today

Development

Successfully merging this pull request may close these issues.

Support Pauseable Upgrades in Storm Service

2 participants