Skip to content

Conversation

zhangjyr
Copy link
Collaborator

@zhangjyr zhangjyr commented Oct 15, 2025

Pull Request Description

This PR completes the last piece of the Batch API integration. Specifically:

  1. Envoy integration
  2. Added e2e test script to verify batch job execution
  3. Update the document to customize job execution

Note that the k8s job is disabled by default. Config object storage (S3 or TOS) to enable it.
The default behavior is to upload files to the metadata-local and running job in the dummy engine, which is only for testing purposes.

Related Issues

Resolves: #1277

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

Jingyuan Zhang added 6 commits October 2, 2025 14:26
Signed-off-by: Jingyuan Zhang <[email protected]>
…openai_batch_envoy_integration

Signed-off-by: Jingyuan Zhang <[email protected]>

# Conflicts:
#	python/aibrix/aibrix/batch/README.md
#	python/aibrix/aibrix/batch/scheduler.py
#	python/aibrix/aibrix/metadata/app.py
#	python/aibrix/tests/batch/conftest.py
#	python/aibrix/tests/batch/test_e2e_openai_batch_api.py
#	python/aibrix/tests/batch/test_job_cache.py
#	python/aibrix/tests/metadata/test_app_integration.py
…openai_batch_envoy_integration

Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Copy link
Contributor

Summary of Changes

Hello @zhangjyr, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request finalizes the integration of the Batch API by establishing the necessary Envoy routing for batch-related endpoints, introducing robust end-to-end tests to ensure the functionality of the batch processing workflow, and providing detailed documentation and configuration options for customizing batch job execution within Kubernetes. These changes collectively enable a fully functional and verifiable Batch API experience.

Highlights

  • Envoy Integration for Batch API: Configured Envoy Gateway to route '/v1/files' and '/v1/batches' endpoints to the metadata service, completing the network integration for the Batch API.
  • End-to-End Batch API Verification: Introduced a new comprehensive end-to-end test suite ('test_batch_api.py') to validate the entire batch processing workflow against a real service instance, covering file upload, job creation, status polling, and result retrieval.
  • Customizable Batch Job Execution: Added a Kubernetes Job template ('job_template_patch.yaml') and updated documentation to allow users to customize batch job execution environments, including container images, parallelism, and resource requirements.
  • Metadata Service Enhancements: Updated the metadata service Dockerfile, Kubernetes deployment, and RBAC rules to support Kubernetes Job creation and management, including new permissions and the ability to load custom job templates.
  • Object Storage Integration: Enabled the metadata service to configure S3 and TOS object storage credentials via Kubernetes secrets, facilitating file storage for batch operations.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively completes the Batch API integration by adding Envoy configuration, E2E tests, and comprehensive documentation. The changes are well-structured and consistent across the repository. My review focuses on improving the robustness of documentation and test setup, enhancing security by adhering to the principle of least privilege in RBAC configurations, and modernizing asynchronous code patterns in the new E2E tests.

- metadata.yaml
- redis.yaml

configMapGenerator:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need configmap here? I see python folder have many skeleton templates

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The job template patch has to be mapped to the container folder to take effect. This configmap can achieve this without rebuilding the image.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then should we remove file based templates?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am glad to hear the options. Can you elaborate on how users might customize the job template?
The bottom line is users have to change the k8s_job_template.yaml and rebuild the image.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

template should be managed by operation folks. We can have one template now, leave the flexibility support to later . Once we get more feedback, we can start to work on it

resources: ["jobs"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
# For batch job ServiceAccount management
- apiGroups: [""]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use a fix service account, along with aibrix installation? instead of reply on permission here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was originally fixed. I later found that the service account must be under the default namespace, while the kustomize overrides the namespace and puts everything in the aibrix-system.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

em. we can put the file under a separate config/manifest folder without aibrix-system override.

in helm, I think we can support it as well, something like

  namespace: {{ .Values.job.namespace }}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if it works. How is kustomize going to load config/manifest without overriding the namespace? I can have a try.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can ask user to follow a guidance to create a RBAC in short term, we do not want the controller to have such high permission.

- name: metadata-service
image: {{ .Values.metadata.service.container.image.repository }}:{{ .Values.metadata.service.container.image.tag }}
imagePullPolicy: {{ .Values.metadata.service.container.image.imagePullPolicy | default "IfNotPresent" }}
command: ["python", "-m", "aibrix.metadata.app"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep the explicit commands here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I can add an explicit command, but it would be aibrix_metadata

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's ok. user sometimes need to adjust parameters, if there's no config here, it automatically use container entrypoints, and user do not know it unless they know the commands in Dockerfile

Copy link
Collaborator Author

@zhangjyr zhangjyr Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found aibrix_metadatadoes not work in the helm. I have to change to
["python", "-m", "aibrix.metadata.app"].
One more thing, in the config, I disabled k8s-job by default, so users can run other metadata services (v1/models) if they do not configure external object storage (in this case, the k8s job will not work anyway). Do I need to apply the same logic in chart, too? This means users have to adjust values to enable object storage and k8s-job.

Jingyuan Zhang added 6 commits October 15, 2025 14:21
Cleanup files

Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
@zhangjyr
Copy link
Collaborator Author

@Jeffwan Do you have any idea why the helm chart test failed? It looks the metadata healthz and readyz endpoints not accessible. I ran the same chart-testing locally and it all fine. Does it has anything to do with the image problem you saw in the metadata service migration PR?

@zhangjyr zhangjyr requested a review from Jeffwan October 16, 2025 06:33
@Jeffwan
Copy link
Collaborator

Jeffwan commented Oct 16, 2025

@zhangjyr the problem is the helm doesn't use the images build from this branch, it just use nightly image. this is a known issue and we should fix it, but we can ignore the problem at this moment

Copy link
Collaborator

@Jeffwan Jeffwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm. I still have some questions on the permission

…ig/job is create for user to deploy job rbac.

Signed-off-by: Jingyuan Zhang <[email protected]>
@Jeffwan Jeffwan merged commit 402521d into vllm-project:main Oct 17, 2025
13 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC]: OpenAI-compatible Batch API support

2 participants