Draft
31 commits
66dbb3e
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Aug 12, 2025
47ad47e
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Aug 12, 2025
addf7eb
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Aug 13, 2025
4d0bc97
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Aug 20, 2025
28bf801
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Aug 27, 2025
aeb01aa
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Aug 27, 2025
0c6827a
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Sep 3, 2025
f4717f5
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Sep 4, 2025
cf511cd
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Sep 4, 2025
cafd9c7
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Sep 5, 2025
f5e173b
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Sep 8, 2025
93b3d53
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Sep 10, 2025
a7a9356
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Sep 10, 2025
ebbc634
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Sep 12, 2025
e935694
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Sep 16, 2025
0247643
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Sep 17, 2025
512ff06
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Sep 18, 2025
d326b87
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Sep 22, 2025
35ad459
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Sep 24, 2025
0642812
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Sep 29, 2025
9399127
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Oct 14, 2025
55ee7ae
Merge branch 'intelligent-machine-learning:master' into master
BalaBalaYi Oct 17, 2025
ab59cd2
stash
BalaBalaYi Oct 20, 2025
88cff04
doc updating
BalaBalaYi Oct 29, 2025
249f711
doc updating
BalaBalaYi Oct 29, 2025
ddee584
doc update
BalaBalaYi Oct 30, 2025
1a5db38
doc update
BalaBalaYi Oct 30, 2025
b52b8ff
Merge branch 'master' into v0.6.0_release
BalaBalaYi Oct 31, 2025
b699e25
fix
BalaBalaYi Oct 31, 2025
bae7192
Merge remote-tracking branch 'origin/v0.6.0_release' into v0.6.0_release
BalaBalaYi Oct 31, 2025
b0f6dfb
Merge branch 'master' into v0.6.0_release
BalaBalaYi Nov 3, 2025
13 changes: 9 additions & 4 deletions README.md
@@ -7,28 +7,33 @@
[![Build](https://github.com/intelligent-machine-learning/easydl/actions/workflows/main.yml/badge.svg)](https://github.com/intelligent-machine-learning/easydl/actions/workflows/main.yml)
[![OpenSSF Best Practices](https://www.bestpractices.dev/projects/9827/badge)](https://www.bestpractices.dev/projects/9827)
[![Code Coverage](https://codecov.io/gh/intelligent-machine-learning/dlrover/branch/master/graph/badge.svg)](https://codecov.io/gh/intelligent-machine-learning/dlrover)
[![GitHub contributors](https://img.shields.io/github/contributors/intelligent-machine-learning/dlrover?style=flat)](https://github.com/intelligent-machine-learning/dlrover/graphs/contributors)
[![PyPI Status Badge](https://badge.fury.io/py/dlrover.svg)](https://pypi.org/project/dlrover/)
</div>

DLRover makes the distributed training of large AI models easy, stable, fast and green.
It can automatically train the Deep Learning model on the distributed cluster.
It helps model developers to focus on model arichtecture, without taking care of
It helps model developers to focus on model architecture, without taking care of
any engineering stuff, say, hardware acceleration, distributed running, etc.
Now, it provides automated operation and maintenance for deep learning
training jobs on K8s/Ray. Major features are as follows:

- **Full-Scene**: Support deep learning full-scene distributed training computation implementation.
- **Fault-Tolerance**: The distributed training can continue running in the event of failures.
- **Flash Checkpoint**: The distributed training can recover failures from the in-memory checkpoint in seconds.
- **Auto-Scaling**: The distributed training can scale up/down resources to improve the stability, throughput
and resource utilization.
- **Others**:
- **XPU Timer Integration**: Runtime xpu-timer integration provides stronger runtime diagnostics and fault-tolerance capabilities.
- **Flash Checkpoint**: The distributed training can recover from failures in seconds using the in-memory checkpoint.

Furthermore, DLRover offers extension libraries for PyTorch and TensorFlow to expedite training. These are also open-source projects available in our [GitHub repositories](https://github.com/intelligent-machine-learning).
- [ATorch](https://github.com/intelligent-machine-learning/atorch): an extension library of PyTorch to Speed Up Training of Large LLM.
- [TFPlus](https://github.com/intelligent-machine-learning/tfplus): an extension library of TensorFlow to Speed Up Training of Search, Recommendation and Advertisement.
- [TFPlus](https://github.com/intelligent-machine-learning/tfplus)(K8S platform only): an extension library of TensorFlow to Speed Up Training of Search, Recommendation and Advertisement.

## Latest News

- [2025/08] [Practice: Gang Scheduling with DLRover](docs/tutorial/gang_scheduling.md)
- [2025/12] [DLRover on Ray's new architecture achieves its first official release.](docs/blogs/dlrover_on_ray.md)
- [2025/08] [Practice: Gang Scheduling with DLRover.](docs/tutorial/gang_scheduling.md)
- [2025/01] [EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models, ICLR'25.](https://arxiv.org/abs/2412.07210)
- [2024/06] [DLRover-RM has been accepted by VLDB'24.](docs/blogs/dlrover_rm.md)
- [2024/04] [Flash Checkpoint Supports HuggingFace transformers.Trainer to Asynchronously persist checkpoints.](docs/blogs/flash_checkpoint.md#huggingface-transformerstrainer)
4 changes: 4 additions & 0 deletions RELEASES.md
@@ -4,6 +4,10 @@ The DLRover project follows the semantic versioning scheme and maintains a separ

For the latest news about DLRover, see the following link: https://github.com/intelligent-machine-learning/dlrover?tab=readme-ov-file#latest-news=


## Release 0.6.0 on Dec 31, 2025
Please refer to [release 0.6.0](https://github.com/intelligent-machine-learning/dlrover/releases/tag/v0.6.0)

## Release 0.5.0 on Jul 7, 2025
Please refer to [release 0.5.0](https://github.com/intelligent-machine-learning/dlrover/releases/tag/v0.5.0)

8 changes: 4 additions & 4 deletions setup.py
@@ -1,4 +1,4 @@
# Copyright 2023 The DLRover Authors. All rights reserved.
# Copyright 2025 The DLRover Authors. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
@@ -31,13 +31,13 @@

setup(
name="dlrover",
version="0.6.0.dev0",
version="0.6.0",
description="An Automatic Distributed Deep Learning Framework",
long_description="DLRover helps model developers focus on model algorithm"
" itself, without taking care of any engineering stuff,"
" say, hardware acceleration, distribute running, etc."
" It provides static and dynamic nodes' configuration automatically,"
", before and during a model training job running on k8s",
" It provides static and dynamic workloads' configuration automatically"
", before and during a model training job running on k8s or ray.",
long_description_content_type="text/markdown",
author="Ant Group",
url="https://github.com/intelligent-machine-learning/dlrover",
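The `long_description` argument in the setup.py hunk is built from adjacent string literals, which Python joins with no separator inserted between them. A minimal sketch of how the joined text could be checked for punctuation slips before publishing; the helper name `find_punctuation_slips` and the sample fragments are hypothetical and are not part of DLRover's setup.py.

```python
import re


def find_punctuation_slips(text: str) -> list[str]:
    """Return suspicious punctuation runs (hypothetical helper, illustration only)."""
    # Match a comma or period that is followed, possibly after spaces, by a comma.
    return re.findall(r"[,.]\s*,", text)


# Adjacent string literals are concatenated exactly as written, so a trailing
# comma in one fragment and a leading comma in the next end up side by side.
sample = (
    "It configures workloads automatically,"
    ", before and during a training job."
)

slips = find_punctuation_slips(sample)
print(slips)
```

A check like this could run in CI against the real string; the regular expression here only covers doubled commas and a period followed by a comma, which is an assumption about which slips matter.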