Track TrainJob progress and expose training metrics #2779

@astefanutti

What you would like to be added?

This feature proposes (1) defining a standard "contract" through which training runtimes expose / push the fine-grained state of their training loops, and (2) implementing a mechanism that updates the TrainJob resource status to reflect the training state / progress in real time.

The status should, among other information:

  • Include a progress percentage (e.g. current steps / total steps) and, ideally, an ETA
  • Include the training metrics that are relevant for HPO with Katib
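
To make the two status items above concrete, here is a minimal sketch of how a progress percentage and an ETA could be derived from step counts. The `TrainingProgress` class and all of its field names are hypothetical and are not part of any existing TrainJob API. The ETA assumes that the remaining steps take as long, on average, as the completed ones, which is a simplification.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TrainingProgress:
    """Hypothetical progress snapshot; field names are illustrative only."""

    current_step: int
    total_steps: int
    elapsed_seconds: float
    # Free-form training metrics, e.g. {"loss": 0.42}, that an HPO tool could consume.
    metrics: Dict[str, float] = field(default_factory=dict)

    @property
    def percent(self) -> float:
        """Completed share of the run as a percentage, clamped to [0, 100]."""
        if self.total_steps <= 0:
            return 0.0
        return max(0.0, min(100.0, 100.0 * self.current_step / self.total_steps))

    @property
    def eta_seconds(self) -> Optional[float]:
        """Estimated remaining time, extrapolated from the average step duration so far."""
        if self.current_step <= 0 or self.total_steps <= 0:
            return None  # no completed step to extrapolate from
        remaining_steps = max(0, self.total_steps - self.current_step)
        return remaining_steps * (self.elapsed_seconds / self.current_step)
```

A run that has completed 250 of 1000 steps in 120 seconds would report 25% progress and an ETA of roughly 360 seconds under this linear assumption.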

The implementation outline could be:

  • Define a schema for the training runtime to expose the training loop metrics
  • Instrument training loops to periodically write their progress / status in the above format, e.g. to the standard output of their rank 0 node
    • For custom trainers, provide examples showing how to instrument the training loop, e.g. for HuggingFace Transformers Trainer callbacks
    • For built-in trainers, we may want to seamlessly instrument the runtime
  • Augment the TrainJob controller to watch rank 0 nodes of running TrainJobs to read the metrics and update the corresponding TrainJob statuses
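
The outline above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the `TRAINJOB_PROGRESS` line prefix, the JSON field names, and both helper functions are hypothetical, since the actual schema is what this issue proposes to define. The writer side would run in the training loop on the rank 0 node, and the reader side stands in for whatever component consumes the rank 0 logs.

```python
import json
import sys
from typing import Dict, Optional, TextIO

# Hypothetical marker that separates progress records from ordinary log lines.
PROGRESS_PREFIX = "TRAINJOB_PROGRESS "


def emit_progress(
    current_step: int,
    total_steps: int,
    metrics: Dict[str, float],
    stream: Optional[TextIO] = None,
) -> None:
    """Write one progress record as a single prefixed JSON line (stdout by default)."""
    out = stream if stream is not None else sys.stdout
    record = {"currentStep": current_step, "totalSteps": total_steps, "metrics": metrics}
    out.write(PROGRESS_PREFIX + json.dumps(record, sort_keys=True) + "\n")
    out.flush()  # flush so the record reaches the Pod logs promptly


def parse_latest_progress(log_text: str) -> Optional[dict]:
    """Return the last well-formed progress record found in a chunk of logs, or None."""
    latest = None
    for line in log_text.splitlines():
        if not line.startswith(PROGRESS_PREFIX):
            continue  # ordinary log line
        try:
            latest = json.loads(line[len(PROGRESS_PREFIX):])
        except json.JSONDecodeError:
            continue  # skip truncated or corrupted records
    return latest
```

For a custom trainer based on HuggingFace Transformers, one option would be to call `emit_progress` from a `TrainerCallback.on_log` hook, passing `state.global_step` and `state.max_steps`, and only when `state.is_world_process_zero` is true. Writing one self-describing line per record keeps the training side free of any Kubernetes API access, which fits the goal of not requiring extra permissions for the Pods.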

One benefit of this approach is that it adds no extra RBAC / security requirements for the TrainJob Pods, which would still be able to run using the default service account.

Why is this needed?

Model training is an iterative process whose progression over time is fairly predictable, which makes tracking the progress of train jobs possible, desirable, and useful.

While the progress of a training job is usually accessible by reading the logs of the job's rank 0 node, this is not the best user experience for AI practitioners, nor is it the most robust mechanism for clients to access / parse this information.

Exposing the training metrics in real time in the TrainJob API status will also unlock integration with other components like Katib, GUIs, and possibly experiment tracking solutions.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
