Skip to content

Conversation

HAOCHENYE
Copy link
Collaborator

No description provided.

@CyCle1024 CyCle1024 requested a review from Copilot September 16, 2025 11:44
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds ETA (Estimated Time of Arrival) logging functionality and fixes an error when resuming training related to the e2e_tgs calculation. The changes track training time more accurately and provide users with estimated completion time information.

  • Add ETA calculation and logging to show estimated training completion time
  • Fix e2e_tgs calculation error when resuming training by properly tracking training time offset
  • Persist training time offset in checkpoint state for accurate resume functionality

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

self._cur_epoch = train_state["cur_epoch"]
self._consumed_samples = train_state["consumed_samples"]
self._consumed_tokens = train_state["consumed_tokens"]
self._train_time_offset = train_state["train_time_offset"]
Copy link
Preview

Copilot AI Sep 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will cause a KeyError when resuming from checkpoints created before this change, since 'train_time_offset' won't exist in older checkpoint files. Add a default value using train_state.get('train_time_offset', 0) to handle backward compatibility.

Suggested change
self._train_time_offset = train_state["train_time_offset"]
self._train_time_offset = train_state.get("train_time_offset", 0)

Copilot uses AI. Check for mistakes.

Comment on lines +1007 to +1008
eta_seconds = remaining_tokens / tgs
eta_hms = str(timedelta(seconds=int(eta_seconds)))
Copy link
Preview

Copilot AI Sep 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will cause a ZeroDivisionError if tgs is zero or very close to zero, which can happen during the first few training steps. Add a guard condition to check if tgs > 0 before performing the division.

Suggested change
eta_seconds = remaining_tokens / tgs
eta_hms = str(timedelta(seconds=int(eta_seconds)))
if tgs > 1e-8:
eta_seconds = remaining_tokens / tgs
eta_hms = str(timedelta(seconds=int(eta_seconds)))
else:
eta_seconds = float('inf')
eta_hms = "inf"

Copilot uses AI. Check for mistakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants