tushar00jain
Contributor

Summary:

  • Call the FR (Flight Recorder) API to reset the trace after every quorum. The process groups (PGs) may have changed, and the FR trace from any previous error has already been dumped, so each quorum starts with a fresh trace.
  • After every quorum, change the env var that determines which file the FR trace is dumped to.

Differential Revision: D84260745
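
For illustration, the two steps in the summary could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code in this PR: the env var name `EXAMPLE_FR_DUMP_FILE`, the `on_quorum_change` helper, and the `reset_trace` callback are all hypothetical stand-ins, since the summary names neither the actual FR reset call nor the actual env var.

```python
import os
from typing import Callable, MutableMapping

# Hypothetical env var name, used only for this sketch. The PR summary does
# not say which variable is changed.
FR_FILE_ENV_VAR = "EXAMPLE_FR_DUMP_FILE"


def on_quorum_change(
    quorum_id: int,
    reset_trace: Callable[[], None],
    base_path: str,
    env: MutableMapping[str, str] = os.environ,
) -> str:
    """Start a fresh trace for a new quorum and point dumps at a new file.

    `reset_trace` stands in for whatever FR reset API the PR calls.
    """
    # Step 1: drop the old trace. Earlier errors were already dumped, and the
    # process groups may differ in the new quorum.
    reset_trace()

    # Step 2: give this quorum its own dump file so that a later dump does
    # not overwrite the one from a previous quorum.
    path = f"{base_path}_quorum_{quorum_id}"
    env[FR_FILE_ENV_VAR] = path
    return path
```

Passing `env` as a parameter is only a convenience for testing the sketch with a plain dict instead of mutating the real process environment.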

meta-codesync bot commented Oct 16, 2025

@tushar00jain has exported this pull request. If you are a Meta employee, you can view the originating Diff in D84260745.

@meta-cla meta-cla bot added the CLA Signed label Oct 16, 2025
tushar00jain added a commit to tushar00jain/torchft that referenced this pull request Oct 16, 2025
@tushar00jain tushar00jain force-pushed the export-D84260745 branch 2 times, most recently from 1d99280 to d048341 on October 16, 2025 16:31
tushar00jain added a commit to tushar00jain/torchft that referenced this pull request Oct 17, 2025
Summary:

work.wait() can throw, so wrap it in a try/except and handle the failure gracefully by reporting the error to the manager, which causes should_commit to fail.

Differential Revision: D84880993
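
A rough sketch of the pattern described in this summary, assuming a simplified manager: `SketchManager` and `wait_and_report` are hypothetical names for illustration only, and the real manager's `should_commit` likely does more than check a stored error.

```python
from typing import Optional, Protocol


class Work(Protocol):
    """Anything with a wait() method that may raise."""

    def wait(self) -> object: ...


class SketchManager:
    """Hypothetical stand-in for the manager; it only tracks a reported error."""

    def __init__(self) -> None:
        self._error: Optional[Exception] = None

    def report_error(self, e: Exception) -> None:
        self._error = e

    def should_commit(self) -> bool:
        # In this sketch, any reported error makes the commit fail.
        return self._error is None


def wait_and_report(work: Work, manager: SketchManager) -> bool:
    """Wait on the work; on failure, report to the manager instead of raising."""
    try:
        work.wait()
        return True
    except Exception as e:
        manager.report_error(e)
        return False
```

With this shape, a failing `wait()` no longer propagates to the caller; the step is instead rejected when `should_commit()` is checked.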
Labels

CLA Signed, fb-exported, meta-exported
