diff --git a/docs/encyclopedia/detecting-workflow-failures.mdx b/docs/encyclopedia/detecting-workflow-failures.mdx
index c8e16df6fa..963f44046f 100644
--- a/docs/encyclopedia/detecting-workflow-failures.mdx
+++ b/docs/encyclopedia/detecting-workflow-failures.mdx
@@ -27,7 +27,7 @@ If you need to perform an action inside your Workflow after a specific period of
 - [Workflow Run Timeout](#workflow-run-timeout)
 - [Workflow Task Timeout](#workflow-task-timeout)
 
-## Workflow Execution Timeout? {#workflow-execution-timeout}
+## Workflow Execution Timeout {#workflow-execution-timeout}
 
 **What is a Workflow Execution Timeout in Temporal?**
 
@@ -51,7 +51,7 @@ If this timeout is reached, the Workflow Execution changes to a Timed Out status
 This timeout is different from the [Workflow Run Timeout](#workflow-run-timeout).
 This timeout is most commonly used for stopping the execution of a [Temporal Cron Job](/cron-job) after a certain amount of time has passed.
 
-## Workflow Run Timeout? {#workflow-run-timeout}
+## Workflow Run Timeout {#workflow-run-timeout}
 
 **What is a Workflow Run Timeout in Temporal?**
 
@@ -79,7 +79,7 @@ This timeout is most commonly used to limit the execution time of a single [Temp
 
 If the Workflow Run Timeout is reached, the Workflow Execution will be Timed Out.
 
-## Workflow Task Timeout? {#workflow-task-timeout}
+## Workflow Task Timeout {#workflow-task-timeout}
 
 **What is a Workflow Task Timeout in Temporal?**
 
@@ -104,3 +104,23 @@ This Timeout is primarily available to recognize whether a Worker has gone down
 This timeout is primarily available to recognize whether a Worker has gone down so that the Workflow Execution can be recovered on a different Worker.
 The main reason for increasing the default value is to accommodate a Workflow Execution that has an extensive Workflow Execution History, requiring more than 10 seconds for the Worker to load.
 It's worth mentioning that although you can extend the timeout up to the maximum value of 120 seconds, it's not recommended to move beyond the default value.
+
+## Detecting Workflow Task Failures
+
+Use the `TemporalReportedProblems` Search Attribute to detect Workflows with failed Workflow Tasks.
+A failed Workflow Task does not cause the Workflow to fail. Some Tasks within a Workflow may be intended to fail.
+For example, a Workflow Task may check a remote data source for new messages. If there aren't any, the Task will fail as intended.
+If your Task has code to handle the failure, the Workflow will proceed.
+However, if a Task in your Workflow fails and the failure is not handled, the Workflow continues to run but never completes.
+Detecting Workflows in this state is a common troubleshooting task.
+
+To identify Workflows with Task failures, you can use the Temporal Web UI. See [Task Failures View](/web-ui/#task-failures-view) for more details.
+
+You can also detect Workflows with Task failures by searching for the `TemporalReportedProblems` Search Attribute with your observability tools.
+
+:::warning Activating the Task Failures View for AWS Namespaces
+
+To enable the Task Failures View for a Namespace running on AWS, you need to update the Dynamic Config for that Namespace.
+See [Activating the Task Failures View for AWS Namespaces](/web-ui/#activate-task-failures-view-for-aws).
+
+:::
\ No newline at end of file
diff --git a/docs/encyclopedia/visibility/search-attributes.mdx b/docs/encyclopedia/visibility/search-attributes.mdx
index b4c9efbbfc..aa61c1170c 100644
--- a/docs/encyclopedia/visibility/search-attributes.mdx
+++ b/docs/encyclopedia/visibility/search-attributes.mdx
@@ -73,6 +73,7 @@ These Search Attributes are created when the initial index is created.
 | StateTransitionCount | Int | The number of times that Workflow Execution has persisted its state. Available only for closed Workflows.
|
 | TaskQueue | Keyword | Task Queue used by Workflow Execution. |
 | TemporalChangeVersion | Keyword List | Stores change/version pairs if the GetVersion API is enabled. |
+| TemporalReportedProblems | Keyword List | Stores information about Workflow Task failures. Formatted as `category= cause=`. |
 | TemporalScheduledStartTime | Datetime | The time that the Workflow is schedule to start according to the Schedule Spec. Can be manually triggered. Set on Schedules. |
 | TemporalScheduledById | Keyword | The Id of the Schedule that started the Workflow. |
 | TemporalSchedulePaused | Boolean | Indicates whether the Schedule has been paused. Set on Schedules. |
diff --git a/docs/web-ui.mdx b/docs/web-ui.mdx
index 2da59c03a6..129e88e463 100644
--- a/docs/web-ui.mdx
+++ b/docs/web-ui.mdx
@@ -54,7 +54,7 @@ For start time and end time, users can set their preferred date and time format
 
 Select a Workflow Execution to view the Workflow Execution's History, Workers, Relationships, pending Activities and Nexus Operations, Queries, and Metadata.
 
-### Saved Views
+### Saved Views {#saved-views}
 
 Saved Views let you save and reuse your frequently used visibility queries in the Temporal Web UI.
 Instead of recreating complex filters every time, you can save them once and apply them with a single click.
@@ -137,6 +137,52 @@ Saved Views that use relative times will be shared with absolute time.
 
 :::
 
+## Task Failures View {#task-failures-view}
+
+The Task Failures View is a pre-defined Saved View that displays Workflows that have a Workflow Task failure.
+These Workflows are still running, but one of their Tasks has failed or timed out.
+
+Each entry in the Task Failures View shows the Workflow ID, the Run ID, and the Workflow Type.
+Clicking any of these links opens the Workflow page for that Workflow.
+On this page, you will find more information about the Task that failed and any remaining pending Tasks.
+You can also cancel the Workflow by clicking the Request Cancellation button on this page.
+
+### Activating the Task Failures View for AWS Namespaces {#activate-task-failures-view-for-aws}
+
+To enable the Task Failures View for a Namespace running on AWS, you need to update the Dynamic Config first.
+To turn the feature on for a Namespace, use the following command:
+
+``` command
+omni ocld dynamic-config namespace patch --namespace "$NS" --json '{
+  "system.numConsecutiveWorkflowTaskProblemsToTriggerSearchAttribute": 5
+}'
+```
+
+`$NS` is the name of the Namespace where you want to set up the Task Failures View.
+`numConsecutiveWorkflowTaskProblemsToTriggerSearchAttribute` is the number of consecutive Workflow Task failures
+required to trigger the `TemporalReportedProblems` Search Attribute.
+The default value is 5. If adding this Search Attribute causes strain on the visibility system, consider increasing this number.
+
+To turn off the feature for a Namespace, set `numConsecutiveWorkflowTaskProblemsToTriggerSearchAttribute` to 0.
+You can also deactivate the feature by removing the key:
+
+``` command
+omni ocld dynamic-config namespace remove \
+  -n "$NS" \
+  --key "system.numConsecutiveWorkflowTaskProblemsToTriggerSearchAttribute"
+```
+
+where `$NS` is the name of the Namespace for which you wish to deactivate the feature.
+
+To determine which Namespaces in your fleet have the feature activated, use the following command:
+
+``` command
+omni ocld dc search \
+  --namespace \
+  --key-regex 'system.numConsecutiveWorkflowTaskProblemsToTriggerSearchAttribute' \
+  --all
+```
+
 ## History
 
 A Workflow Execution History is a view of the [Events](/workflow-execution/event#event) and Event fields within the