
[RFC]: OpenAI-compatible Batch API support #1277

@zhangjyr

Description


Summary

To expose an OpenAI-compatible Batch API in a production-ready and k8s-native way, this RFC collects various deployment scenarios, defines batch system roles and their responsibilities, and shows how these roles can be combined to serve each scenario. This RFC supersedes #182 and provides a detailed design of the underlying batch runtime.

Motivation

Deployment scenarios

After interviewing application developers, we found that AIBrix can be deployed in various settings for batch workloads:

  1. Batching using parallel LLM workers
Image

The parallel LLM workers are created on demand and are dedicated to running batch inference tasks. This setting is suitable for:

  • Running on spot resources.
  • Batches of large size, containing a large number of inference tasks.
  • Dedicated batch serving.
  2. Batching using an existing LLM service.
Image

The existing LLM service can be an external LLM service or an existing vLLM deployment, which can be scaled independently of the batch system.

Batch system roles

A batch system that supports both settings can be abstracted as:

Image

The Batch Job Entity is responsible for utilizing k8s-native features, including:

  1. Using the k8s-native informer for active batch job status notification.
  2. (Optional) Launching Batch Tasks Workers and handling their fault tolerance.

The API Gateway offers the OpenAI Batch API interface and is responsible for:

  1. Submit a batch job by creating the Batch Job Entity.
  2. Forward other batch APIs to the Batch Job Controller.
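The gateway's routing split can be sketched as below; the function and parameter names are illustrative stand-ins, not AIBrix's actual interfaces:

```python
# Hypothetical sketch of the gateway's batch routing: submission creates a
# Batch Job Entity, everything else is forwarded to the Batch Job Controller.
# Names and signatures are illustrative assumptions.

def handle_batch_request(method: str, path: str, body: dict,
                         create_job_entity, forward_to_controller):
    """Route OpenAI Batch API calls at the gateway."""
    if method == "POST" and path == "/v1/batches":
        # Submission: create the k8s-native Batch Job Entity.
        return create_job_entity(body)
    # List / retrieve / cancel: forward to the Batch Job Controller.
    return forward_to_controller(method, path, body)
```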

The Batch Job Controller implements OpenAI Batch APIs and is responsible for:

  1. Watch and cache Batch Job Entity status updates.
  2. Query the batch job list from the Metadata Store.
  3. Query and cache detailed job progress and statistics from the Metadata Store on receipt of Batch Job Entity status updates.
  4. Schedule and dispatch batch jobs to Batch Job Drivers.
  5. (Optional) Provision Batch Job Drivers if drivers are deployed separately, including:
  • (Optional) Scaling Batch Task Executors.
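The controller's watch-and-cache behavior (steps 1–3 above) can be sketched as follows. The class and method names are illustrative assumptions, and a plain object stands in for the Metadata Store client:

```python
# Hedged sketch of the Batch Job Controller's caching behavior:
# on each Batch Job Entity status event from the informer, re-query the
# job's detailed progress from the Metadata Store and cache the snapshot.
# All names here are illustrative, not AIBrix's actual interfaces.

class BatchJobController:
    def __init__(self, metadata_store):
        self.store = metadata_store   # queried for job lists and progress
        self.cache = {}               # job_id -> latest progress snapshot

    def on_status_update(self, job_id: str):
        # Triggered by the informer when a Batch Job Entity changes:
        # fetch detailed progress and statistics, then cache them.
        self.cache[job_id] = self.store.get_progress(job_id)

    def list_jobs(self):
        # The authoritative job list lives in the Metadata Store.
        return self.store.list_jobs()
```

Caching on status events (rather than polling) keeps read-path queries cheap while the Metadata Store remains the source of truth.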

The Batch Job Driver drives job progress by:

  1. Read and parse the input file from I/O Storage to obtain the individual inference queries (tasks).
  2. (Optional) Schedule and map tasks to workers for coordinated scheduling, including task rescheduling on worker failure.
  3. Collect outputs and write aggregated output to the I/O Storage.
  4. Checkpoint job progress to the Metadata Store.
  5. Restore the job checkpoint from the Metadata Store on job retry.
  6. Update Batch Job Entity on job status change.
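The driver's parsing step (1 above) operates on the OpenAI batch input format: a JSONL file with one request per line, each carrying `custom_id`, `method`, `url`, and `body`. A minimal sketch, with I/O Storage access abstracted away:

```python
import json

# Sketch of the Batch Job Driver's input-parsing step. An OpenAI batch
# input file is JSONL: one request object per line with "custom_id",
# "method", "url", and "body". Reading the file from I/O Storage is
# abstracted away; this parses the raw text.

def parse_batch_input(raw: str):
    """Return a list of (custom_id, url, body) tuples, one per task."""
    tasks = []
    for line in raw.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        req = json.loads(line)
        tasks.append((req["custom_id"], req["url"], req["body"]))
    return tasks
```

The `custom_id` is what lets the driver map each output back to its input request when aggregating results into the output file.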

The Batch Tasks Worker handles individual tasks by:

  1. Dispatch tasks to the Batch Task Executor.
  2. (Optional) Claim task ownership for work-stealing scheduling, including:
  • Read the task input from the shared input file in the I/O Storage.
  • Write the task output as an individual file to the I/O Storage.
  • Implement task execution idempotency.
  • Update task progress to the Metadata Store on task completion.
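The work-stealing claim path might look like the sketch below, where a plain dict stands in for the Metadata Store; in a real deployment the claim would need to be an atomic compare-and-set so two workers cannot both win a task:

```python
# Hedged sketch of work-stealing task claiming with idempotent execution.
# A dict models the Metadata Store; a real implementation would use an
# atomic compare-and-set so concurrent workers cannot both claim a task.

def claim_and_run(task_id, metadata, execute):
    """Claim a task if unowned; skip if another worker already claimed
    or completed it, which is what makes re-execution idempotent."""
    if metadata.get(task_id) in ("claimed", "done"):
        return None                   # another worker owns it: steal nothing
    metadata[task_id] = "claimed"     # must be an atomic CAS in a real store
    result = execute(task_id)         # run inference for this task
    metadata[task_id] = "done"        # record completion in the store
    return result
```

Because each worker writes its task output as an individual file, a duplicate execution (e.g. after a worker crash mid-task) overwrites rather than corrupts the output, which is what the idempotency requirement above relies on.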

The Batch Task Executor performs the actual LLM inference.

Example role mappings for various scenarios

  1. Simple LLM workers
Image
  2. Colocating with existing online LLM services
Image

Proposed Change

We will implement the support described in this document step by step, offering variations to fit different batch scenarios:

  1. Modify the Gateway plugin to support the OpenAI Batch API interface, and:
  • Submit the Batch Job Entity.
  • Forward other OpenAI Batch APIs to the Batch API Service.
  2. Add a Batch API Service that extends the API Gateway functionality. The Batch API Service will:
  • Implement the OpenAI Batch APIs.
  • Enable/disable Batch Job Driver features.
  3. Add a separate, scalable Batch Job Driver service to support high-volume batch jobs.
  4. Support two variations of the Batch Job Entity for different settings:
  • Kubernetes Job
  • BatchJob CRD
  5. Add a Batch Tasks Worker to the LLM runtime to support work-stealing scheduling.
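For the CRD variation, a BatchJob instance might look like the following. The API group, version, and field names here are illustrative assumptions, not the final schema:

```yaml
# Illustrative only: group/version and spec fields are assumptions,
# not the final BatchJob CRD schema.
apiVersion: batch.aibrix.ai/v1alpha1
kind: BatchJob
metadata:
  name: nightly-eval
spec:
  inputFileID: file-abc123        # input uploaded via the Files API
  endpoint: /v1/chat/completions  # target OpenAI-compatible endpoint
  completionWindow: 24h           # OpenAI Batch API completion window
```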

PR Plan

PR1: Batch API Service, including Kubernetes Job support with a dummy, scalable tasks worker.
PR2: Review File uploads and downloads API (#344) and add support for S3-compatible storage.
PR3: Gateway plugin support for OpenAI Batch and File API interface.
PR4: Add Batch Tasks Worker to LLM runtime and integrate with Kubernetes Job.
PR5: Add BatchJob CRD support and separate Batch Job Driver service.

Alternatives Considered

No response

Metadata

Labels

kind/enhancement (New feature or request), kind/feature (Categorizes issue or PR as related to a new feature), priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release).
