Improve heartbeat failure messaging #358

@obgibson

  1. Publish a document explaining how heartbeats are used to mark runs and tasks as failed. It could live on metaflow.org and in the READMEs of this repository, in addition to the existing notes at https://github.com/Netflix/metaflow-service/blob/master/services/ui_backend_service/docs/environment.md#heartbeat-intervals
  2. When a task or run is marked as failed because of a missing heartbeat, show that reason in MFGUI.
  3. Introduce a default minimum and a default maximum heartbeat time. If a task/run misses the minimum heartbeat time, show it as "pending"; show it as "failed" only once it misses the maximum heartbeat time. This will have to account for resumes and multiple attempts.

The reason for this issue is that some runs/tasks are marked as "failed" before they have started, while others are still shown as "running" after they have failed because the heartbeat threshold has not been reached yet.
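
To make item 3 concrete, here is a minimal sketch of how the two thresholds could map to a displayed status. This is not the existing metaflow-service logic: the function name, the `Status` enum, the default values (60 s and 300 s), and the choice to measure silence from the creation time before the first heartbeat are all assumptions made for illustration. Resumes and multiple attempts are not handled here; one possible approach is to apply the same check to the latest attempt only.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    RUNNING = "running"
    PENDING = "pending"
    FAILED = "failed"


# Hypothetical defaults in seconds; not taken from metaflow-service.
MIN_HEARTBEAT_SECONDS = 60.0
MAX_HEARTBEAT_SECONDS = 300.0


def classify_heartbeat(
    last_heartbeat_ts: Optional[float],
    created_ts: float,
    now_ts: float,
    min_seconds: float = MIN_HEARTBEAT_SECONDS,
    max_seconds: float = MAX_HEARTBEAT_SECONDS,
) -> Status:
    """Map heartbeat silence to a status using two thresholds."""
    if min_seconds > max_seconds:
        raise ValueError("min_seconds must not exceed max_seconds")

    if last_heartbeat_ts is None:
        # Assumption: no heartbeat yet means the task has not started,
        # so it stays "pending" until the maximum time has elapsed.
        silence = now_ts - created_ts
        return Status.PENDING if silence <= max_seconds else Status.FAILED

    silence = now_ts - last_heartbeat_ts
    if silence <= min_seconds:
        return Status.RUNNING
    if silence <= max_seconds:
        # Missed the minimum but not the maximum: uncertain, not failed.
        return Status.PENDING
    return Status.FAILED
```

With these assumed defaults, a task whose last heartbeat is 150 seconds old would be shown as "pending" rather than "failed", and a task that has never sent a heartbeat would not be shown as "failed" until 300 seconds after creation.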
