Summary
To expose an OpenAI-compatible Batch API in a production-ready and k8s-native way, this RFC collects various deployment scenarios, defines batch system roles and their responsibilities, and shows how those roles can be combined to serve each scenario. This RFC supersedes #182 and provides a detailed design of the underlying batch runtime.
Motivation
Deployment scenarios
After interviewing application developers, we found that AIBrix can be deployed in various settings for batch purposes:
- Batching using parallel LLM workers

The parallel LLM workers are created on demand and are dedicated to running batch inference tasks. This setting is suitable for:
- Running on spot resources.
- Batches of large size, containing a large number of inference tasks.
- Dedicated batch serving.
- Batching using an existing LLM service

The existing LLM service can be an external LLM service or an existing vLLM deployment, which can be scaled independently of the batch system.
Batch system roles
A batch system that supports both settings can be abstracted into the following roles:

The Batch Job Entity leverages k8s-native features and is responsible for:
- Providing active batch job status notifications via the k8s-native informer (a watch sketch follows this list).
- (Optional) Launching Batch Tasks Workers and providing fault tolerance for them.
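A minimal sketch of the informer-based notification, using the Kubernetes Python client's watch primitive; the namespace and label selector are illustrative assumptions:

```python
# Watch Batch Job Entity (here, a plain Kubernetes Job) status updates.
from kubernetes import client, config, watch

def watch_batch_jobs(namespace="aibrix-system"):
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    batch_v1 = client.BatchV1Api()
    w = watch.Watch()
    # The label marking jobs managed by the batch system is an assumption.
    for event in w.stream(batch_v1.list_namespaced_job,
                          namespace=namespace,
                          label_selector="app.kubernetes.io/managed-by=aibrix-batch"):
        job = event["object"]
        print(event["type"], job.metadata.name,
              "succeeded:", job.status.succeeded,
              "failed:", job.status.failed)
```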
The API Gateway offers the OpenAI Batch API interface and is responsible for:
- Submitting a batch job by creating a Batch Job Entity (see the sketch after this list).
- Forwarding other batch APIs to the Batch Job Controller.
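Below is a minimal sketch of the submit path, assuming FastAPI for the gateway endpoint and a plain Kubernetes Job as the Batch Job Entity; the image name, annotation key, and namespace are illustrative assumptions:

```python
# POST /v1/batches creates a Kubernetes Job as the Batch Job Entity.
import uuid

from fastapi import FastAPI
from kubernetes import client, config

app = FastAPI()

@app.post("/v1/batches")
def create_batch(body: dict):
    config.load_incluster_config()
    batch_id = f"batch-{uuid.uuid4().hex[:8]}"
    job = client.V1Job(
        metadata=client.V1ObjectMeta(
            name=batch_id,
            # Carry the uploaded input file id on the entity (assumed key).
            annotations={"batch.aibrix.ai/input-file-id": body["input_file_id"]},
        ),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="driver", image="aibrix/batch-driver:latest")],
                )
            )
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="aibrix-system", body=job)
    # Minimal OpenAI-style batch object in the response.
    return {"id": batch_id, "object": "batch", "status": "validating"}
```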
The Batch Job Controller implements the OpenAI Batch APIs and is responsible for:
- Watching and caching Batch Job Entity status updates.
- Querying the batch job list from the Metadata Store.
- Querying and caching detailed job progress and statistics from the Metadata Store on receipt of Batch Job Entity status updates.
- Scheduling and dispatching batch jobs to Batch Job Drivers.
- (Optional) Provisioning Batch Job Drivers if drivers are deployed separately, including:
  - Scaling Batch Job Drivers based on batch job volume (reference: Auto Scaling Microservices with Kubernetes Event-Driven Autoscaler (KEDA); see the sketch after this list).
  - Rescheduling batch jobs on the failure of a Batch Job Driver.
- (Optional) Scaling Batch Task Executors.
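As a sketch of the optional driver provisioning, the controller could apply a KEDA ScaledObject that scales the driver Deployment on pending job volume; the Redis address, list name, and deployment name are illustrative assumptions:

```python
# Provision driver autoscaling with a KEDA ScaledObject via CustomObjectsApi.
from kubernetes import client, config

def create_driver_scaler(namespace="aibrix-system"):
    config.load_incluster_config()
    scaled_object = {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": "batch-job-driver-scaler"},
        "spec": {
            "scaleTargetRef": {"name": "batch-job-driver"},  # driver Deployment
            "minReplicaCount": 0,
            "maxReplicaCount": 16,
            "triggers": [{
                # Scale on the number of pending batch jobs in a Redis list.
                "type": "redis",
                "metadata": {
                    "address": "redis.aibrix-system.svc:6379",
                    "listName": "pending-batch-jobs",
                    "listLength": "1",
                },
            }],
        },
    }
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="keda.sh", version="v1alpha1", namespace=namespace,
        plural="scaledobjects", body=scaled_object)
```

Scaling to zero lets the driver fleet idle at no cost between batch submissions.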
The Batch Job Driver drives job progress by:
- Reading and parsing the input file from I/O Storage to obtain the individual inference queries (Tasks); see the sketch after this list.
- (Optional) Scheduling and mapping tasks to workers for coordinated scheduling, including task rescheduling on worker failure.
- Collecting outputs and writing the aggregated output to I/O Storage.
- Checkpointing job progress to the Metadata Store.
- Restoring the job checkpoint from the Metadata Store on job retry.
- Updating the Batch Job Entity on job status changes.
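A minimal sketch of the driver's parsing, checkpointing, and restore steps; the JSONL line format follows the OpenAI Batch API input spec, while the Redis-backed Metadata Store interface and key names are illustrative assumptions:

```python
import json

import redis

def load_tasks(input_path: str):
    """Parse the OpenAI-style batch input file into individual tasks."""
    tasks = []
    with open(input_path) as f:
        for line in f:
            req = json.loads(line)
            tasks.append({
                "custom_id": req["custom_id"],   # caller-supplied task id
                "url": req["url"],               # e.g. /v1/chat/completions
                "body": req["body"],             # the inference request payload
            })
    return tasks

def checkpoint(store: redis.Redis, job_id: str, task_id: str, state: str):
    """Record per-task state so a retried job can resume where it left off."""
    store.hset(f"batch:{job_id}:progress", task_id, state)

def restore(store: redis.Redis, job_id: str):
    """Return the set of already-completed task ids for a retried job."""
    progress = store.hgetall(f"batch:{job_id}:progress")
    return {t.decode() for t, s in progress.items() if s == b"completed"}
```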
The Batch Tasks Worker handles individual tasks by:
- Dispatching tasks to the Batch Task Executor.
- (Optional) Claiming task ownership for work-stealing scheduling (see the sketch after this list), including:
  - Reading the task input from the shared input file in I/O Storage.
  - Writing the task output as an individual file to I/O Storage.
  - Implementing task execution idempotency.
- Updating task progress in the Metadata Store on task completion.
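A minimal sketch of work-stealing task claiming and idempotency, assuming a Redis-backed Metadata Store; an atomic SET NX with a TTL serves as the claim lock, and the key names and TTL are illustrative assumptions:

```python
import redis

def try_claim(store: redis.Redis, job_id: str, task_id: str,
              worker_id: str, ttl_s: int = 600) -> bool:
    """Atomically claim a task; returns False if another worker owns it."""
    return bool(store.set(f"batch:{job_id}:claim:{task_id}",
                          worker_id, nx=True, ex=ttl_s))

def mark_done(store: redis.Redis, job_id: str, task_id: str):
    """Record completion so re-delivered tasks are skipped (idempotency)."""
    store.hset(f"batch:{job_id}:progress", task_id, "completed")

def already_done(store: redis.Redis, job_id: str, task_id: str) -> bool:
    return store.hget(f"batch:{job_id}:progress", task_id) == b"completed"
```

Because a claim expires after the TTL, a task owned by a crashed worker becomes stealable again, while the completion record keeps re-execution idempotent.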
The Batch Task Executor performs the actual LLM inference.
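A minimal executor sketch: replay one parsed task against an OpenAI-compatible endpoint (a vLLM server here); the base URL is an illustrative assumption:

```python
from openai import OpenAI

# Point the OpenAI client at an OpenAI-compatible vLLM service (assumed URL).
client = OpenAI(base_url="http://vllm.aibrix-system.svc:8000/v1", api_key="EMPTY")

def execute(task: dict) -> dict:
    """Run a single batch task and return an OpenAI-style response record."""
    resp = client.chat.completions.create(**task["body"])
    return {"custom_id": task["custom_id"],
            "response": {"status_code": 200, "body": resp.model_dump()}}
```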
Example role mappings for various scenarios
- Simple LLM workers

- Colocating with existing online LLM services

Proposed Change
We will implement the support described in this document step by step, offering different variations to fit different batch scenarios, including:
- Modify the Gateway plugin to support the OpenAI Batch API interface, and:
  - Submit the Batch Job Entity.
  - Forward other OpenAI Batch APIs to the Batch API Service.
- Add a Batch API Service that extends the API Gateway functionality. The Batch API Service will:
  - Implement the OpenAI Batch APIs.
  - Disable/enable Batch Job Driver features.
- Add a separate, scalable Batch Job Driver service to support high-volume batch jobs.
- The Batch Job Entity will support two variations for different settings (see the sketch after this list):
  - Kubernetes Job
  - BatchJob CRD
- Add the Batch Tasks Worker to the LLM runtime to support work-stealing scheduling.
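As a sketch of the BatchJob CRD variation, a custom resource could be submitted through the CustomObjectsApi; the group, version, and field names below are illustrative assumptions, since the actual CRD schema will be defined by this RFC's implementation:

```python
from kubernetes import client, config

def submit_batchjob_cr(namespace="aibrix-system"):
    config.load_incluster_config()
    batch_job = {
        "apiVersion": "batch.aibrix.ai/v1alpha1",  # assumed group/version
        "kind": "BatchJob",
        "metadata": {"name": "batch-demo"},
        "spec": {
            "inputFileID": "file-abc123",          # id from the File API (assumed field)
            "endpoint": "/v1/chat/completions",
            "completionWindow": "24h",
        },
    }
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="batch.aibrix.ai", version="v1alpha1", namespace=namespace,
        plural="batchjobs", body=batch_job)
```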
PR Plan
PR1: Batch API Service, including Kubernetes Job support with a dummy scalar Tasks Worker.
PR2: Review File uploads and downloads API (#344) and add support for S3-compatible storage.
PR3: Gateway plugin support for the OpenAI Batch and File API interfaces.
PR4: Add Batch Tasks Worker to LLM runtime and integrate with Kubernetes Job.
PR5: Add BatchJob CRD support and the separate Batch Job Driver service.
Alternatives Considered
No response