Conversation
Summary of Changes

This pull request introduces a new test to validate the resilience of containerized services on network devices. By simulating a control processor failover, the test verifies that deployed containers and their associated volumes retain their state and functionality, exercising the system's container supervisor capabilities.
Pull Request Functional Test Report for #5071 / 1780e44

Virtual Devices
Hardware Devices
Code Review
This pull request introduces a new test for container supervisor failover (CNTR-3). The changes include a `README.md` file detailing the test procedure and a Go test file, `failover_test.go`, that implements the test. The test correctly follows the structure of setting up a container and volume, triggering a control processor switchover, and then verifying that the container and volume persist after recovery.
My review focuses on adherence to the repository's style guide. I've identified one instance where `time.Sleep` is used redundantly, which is discouraged. I've provided a suggestion to remove it to improve test efficiency. Otherwise, the code is well written and the test logic is sound.
feature/container/failover/README.md
Outdated
1. **Setup**: Using `gnoi.Containerz`, deploy a container image, create a volume, and start a container mounting that volume. Verify the container is running and the volume exists.
2. **Trigger Failover**: Identify the standby control processor using gNMI. Trigger a switchover using `gnoi.System.SwitchControlProcessor`.
3. **Verify Recovery**: Wait for the switchover to complete. Verify that the container started in step 1 is in `RUNNING` state and the volume still exists using `gnoi.Containerz`.
I think we should check that RPCs to the container work as well.
@@ -0,0 +1,71 @@
# CNTR-3: Container Supervisor Failover
This is a good start, but we should also test several other scenarios, such as what happens when the backup is not available while containers are started. Do the containers get started when the backup returns?
The question is how we can map this to gNOI calls on the vendor devices; can we manually kill and restart a `containerz` instance on the individual supervisors?
Additional scenario to consider: what happens when the primary returns after a failover? Will the original containers still be available? Are modifications to containers/volumes that only reached the backup properly replicated back to the primary?
This could be simulated by calling `SwitchControlProcessor` twice, although the `SwitchControlProcessor` implementation may just switch the handling supervisor without actually restarting the primary.
Added a test for double switchover. I will send a separate PR with cold-reboot tests.
feature/container/failover/README.md
Outdated
## CNTR-3.1: Container Supervisor Failover
1. **Setup**: Using `gnoi.Containerz`, deploy a container image, create a volume, and start a container mounting that volume. Verify the container is running and the volume exists.
Regarding "start a container mounting that volume": is the container in `failover_test.go` actually mounting the volume?
Yes, in `TestContainerAndVolumePersistence`.
```go
if _, err := cli.CreateVolume(ctx, volName, "local", nil, nil); err != nil {
	t.Fatalf("Failed to create volume: %v", err)
}
```
```suggestion
mountOpts := map[string]string{
	"options":    "bind",
	"mountpoint": "/tmp",
}
if _, err := cli.CreateVolume(ctx, volName, "local", nil, mountOpts); err != nil {
	t.Fatalf("Failed to create volume: %v", err)
}
```
Could we add mount options here? The reference implementation still has a bug where a subsequent call to `StartContainer` will fail unless mount options are specified.
Co-authored-by: Fabian Sommerauer <sommerauer.fabian@gmail.com>
No description provided.