
Conversation

@gustavolira
Member

Description

Introduces a new automated script to provision the RHDH Orchestrator.

  • A dedicated shell script that sets up the Orchestrator backend with default values and a single command.
  • Configuration adjustments to streamline deployment and reduce manual steps.
  • Extended logging and validation to ensure correct integration of the Orchestrator module.

Which issue(s) does this PR fix

https://issues.redhat.com/browse/RHIDP-9016

PR acceptance criteria

Please make sure that the following steps are complete:

  • GitHub Actions are completed and successful
  • Unit Tests are updated and passing
  • E2E Tests are updated and passing
  • Documentation is updated if necessary (requirement for new features)
  • Add a screenshot if the change is UX/UI related

How to test changes / Special notes to the reviewer

@openshift-ci openshift-ci bot requested review from albarbaro and psrna October 22, 2025 17:31
@openshift-ci

openshift-ci bot commented Oct 22, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign albarbaro for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gustavolira gustavolira changed the title RHIDP-9016 - Create automated script to provision RHDH Orchestrator integrated with RHDH RHIDP-9016 - Create automated script to provision RHDH Orchestrator integrated Oct 22, 2025
@gustavolira gustavolira changed the title RHIDP-9016 - Create automated script to provision RHDH Orchestrator integrated RHIDP-9016 - create automated script to provision RHDH Orchestrator integrated Oct 22, 2025
@gustavolira gustavolira changed the title RHIDP-9016 - create automated script to provision RHDH Orchestrator integrated chore(e2e): rhidp-9016 - Create automated script to provision RHDH Orchestrator integrated Oct 22, 2025

@HusneShabbir
Contributor

/retest


costmetrics_operator_source: redhat-operators
costmetrics_operator_source_namespace: openshift-marketplace

costmetrics_client_id: "e989874e-279e-4291-b104-60fab5d7f9bc"
Contributor

Just to note - this is revoked

@chadcrum
Contributor

Are you including the Ansible roles on purpose?

If so, some of the roles are not directly related to RHDH / Orchestrator so you can remove them:

  • deploy-cost-metrics-operator
  • deploy-optimizer-app
  • deploy-orchestrator # This is old - I would remove this as well
  • deploy-resource-optimization-plugin
  • deploy-resource-optimization-workflow
  • odf-node-recovery
  • post-mortem

EOF

echo "=== Waiting for database initialization ==="
oc wait job -l job-name --for=condition=Complete -n ${NAMESPACE} --timeout=60s 2> /dev/null || true
Contributor

should job-name be something specific?

Member Author

added app: init-orchestrator-db label to the Job and updated the wait command to use it.
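
For reference, a minimal sketch of the corrected wait, assuming the init Job now carries the app: init-orchestrator-db label:

# Wait for the labeled init Job instead of the unmatched "job-name" selector
oc wait job -l app=init-orchestrator-db --for=condition=Complete -n "${NAMESPACE}" --timeout=60s 2>/dev/null || true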

# Clone the workflows repository
TEMP_DIR=$(mktemp -d)
echo "Cloning workflows repository to ${TEMP_DIR}..."
git clone "${WORKFLOW_REPO}" "${TEMP_DIR}/workflows"
Contributor

Cloning the workflows repo is only necessary when installing the greeting workflow, which I do not see happening in this script.

@gustavolira
Member Author

Are you including the Ansible roles on purpose?

If so, some of the roles are not directly related to RHDH / Orchestrator so you can remove them:

  • deploy-cost-metrics-operator
  • deploy-optimizer-app
  • deploy-orchestrator # This is old - I would remove this as well
  • deploy-resource-optimization-plugin
  • deploy-resource-optimization-workflow
  • odf-node-recovery
  • post-mortem

You're absolutely right, I should remove those roles. I used flight-path-auto-tests as the base for this project and forgot to clean up the roles that aren't needed for the basic Orchestrator infrastructure.

- Changed loop variable from 'i' to '_' in deploy-orchestrator.sh for clarity.
- Updated git clone command in 04-deploy-workflows.sh to use quotes around variables for better handling of paths.

Signed-off-by: Gustavo Lira <[email protected]>
- Modified the condition for creating a namespace to ensure it only executes when helm_managed_rhdh is false and the workflow_namespace is not "sonataflow-infra".
- Removed redundant condition check to streamline the task logic.

Signed-off-by: Gustavo Lira <[email protected]>
… to adhere to best practices.

- Updated various shell commands and YAML configurations for improved readability and consistency.
- Ensured proper formatting in multiple files, including deployment scripts and task definitions.

Signed-off-by: Gustavo Lira <[email protected]>
- Remove unused roles from flight-path-auto-tests base:
  - deploy-cost-metrics-operator
  - deploy-optimizer-app
  - deploy-orchestrator (old)
  - deploy-resource-optimization-plugin
  - deploy-resource-optimization-workflow
  - odf-node-recovery
  - post-mortem

- Remove unnecessary workflow repo cloning in 04-deploy-workflows.sh
  (workflows are created inline, no repo cloning needed)

- Fix PostgreSQL Job initialization:
  - Added app: init-orchestrator-db label to Job
  - Updated wait command to use correct label

Addresses feedback from @chadcrum
…eating SonataFlowPlatform

The SonataFlowPlatform CRD was being created before the Serverless Logic
Operator was ready, causing 'Failed to find exact match for
sonataflow.org/v1alpha08.SonataFlowPlatform' error.

Moved the wait for serverless operator components to run immediately after
the plugin-infra.sh script execution, before creating PostgreSQL and
SonataFlowPlatform resources.
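
A minimal sketch of the reordered wait, assuming the CRD is registered as sonataflowplatforms.sonataflow.org (hypothetical name, consistent with the sonataflow.org/v1alpha08 API group in the error above):

# Block until the Logic Operator has registered the CRD before creating any SonataFlowPlatform
oc wait crd/sonataflowplatforms.sonataflow.org --for=condition=established --timeout=300s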
…d of relying on external script

The external plugin-infra.sh script was failing with GitHub rate limits (429).
Replace it with direct operator installation using Kubernetes resources.

Changes:
- Install Red Hat OpenShift Serverless Operator via Subscription
- Install Red Hat OpenShift Serverless Logic Operator via Subscription
- Create required namespaces before operator installation
- Add proper wait logic for CSVs to be in Succeeded state
- Remove dependency on external GitHub script

This makes the deployment more reliable and predictable.
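
A minimal sketch of the Subscription-based install described above, assuming the standard serverless-operator package on the stable channel; the Logic Operator would get an analogous Subscription in its own namespace:

oc apply -f - <<'EOF'
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: serverless-operator
  namespace: openshift-serverless
spec:
  channel: stable
  name: serverless-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
# Then poll the CSV until it reports phase Succeeded (timeout handling omitted here)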
…nstalling Logic Operator

The Logic Operator requires Knative Serving and Knative Eventing to be
ready before it can install properly.

Changes:
- Create KnativeServing instance in knative-serving namespace
- Create KnativeEventing instance in knative-eventing namespace
- Wait for Knative components to be Ready before installing Logic Operator
- Add explicit wait for Logic Operator deployment to be Available
- Add wait for SonataFlow CRDs to be created before attempting to use them

This ensures the correct operator installation sequence and prevents
'Failed to find exact match for SonataFlowPlatform' errors.
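
A minimal sketch of the Knative bootstrap described above, assuming the operator.knative.dev/v1beta1 API; KnativeEventing is analogous in the knative-eventing namespace:

oc apply -f - <<'EOF'
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
EOF
# Wait for the Knative control plane to be Ready before installing the Logic Operator
oc wait knativeserving/knative-serving -n knative-serving --for=condition=Ready --timeout=600s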
The Logic Operator subscription needs an OperatorGroup in its namespace
to be properly resolved and installed.

Changes:
- Add OperatorGroup for openshift-serverless-logic namespace
- Remove wait for non-existent logic-operator deployment
  (Logic Operator is managed by Serverless Operator, no separate deployment)
- Increase CSV wait timeout from 5 to 10 minutes
- Improve error messages in wait loops

This ensures the Logic Operator CSV is created and the SonataFlow CRDs
are properly installed in the cluster.
…fra namespace

The vars_files was loading role defaults AFTER our vars definition,
causing rhdh_ns to be overwritten with 'rhdh-operator' instead of
'orchestrator-infra'.

Changes:
- Remove vars_files that was overriding our namespace variable
- Add explicit kubeconfig_path variable definition
- This ensures PostgreSQL and SonataFlowPlatform are created in the
  correct 'orchestrator-infra' namespace

Fixes issue where all infrastructure was being created in wrong namespaces.
The Logic Operator does not support OwnNamespace mode and was failing
with 'OwnNamespace InstallModeType not supported' error.

Changes:
- Remove targetNamespaces from OperatorGroup (enables AllNamespaces mode)
- Update CSV wait to accept both Succeeded and Failed phases
- Add note that Failed phase is acceptable if CRDs are installed
- The Logic Operator successfully installs CRDs even when CSV shows Failed

This is expected behavior - the Logic Operator installs the required
SonataFlow CRDs regardless of the CSV phase.
…Operator

The Logic Operator requires an OperatorGroup with AllNamespaces mode.

Changes:
- Rename OperatorGroup from 'openshift-serverless-logic' to 'global-operators'
- Keep empty spec to enable AllNamespaces mode
- Add wait for Logic Operator controller pod to be ready
- Add dbMigrationStrategy to SonataFlow Platform persistence config
- Improve wait conditions for all deployments
- Add KUBECONFIG environment to all shell commands
- Wait for deployments to be created before checking their status

This ensures the Logic Operator CSV reaches Succeeded state and the
controller pod is running before attempting to create SonataFlowPlatform.
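
A sketch of the resulting OperatorGroup described above; the empty spec is what enables AllNamespaces mode:

oc apply -f - <<'EOF'
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: global-operators
  namespace: openshift-serverless-logic
spec: {}
EOF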
… troubleshooting

Changes:
- Update script name from deploy.sh to deploy-orchestrator.sh
- Update command-line flags to match actual implementation
- Add troubleshooting section for Logic Operator CSV Failed state
- Document that Failed CSV state is expected when CRDs are installed
- Add verification commands for Logic Operator controller
…ler pod

The Logic Operator controller pod uses label 'app.kubernetes.io/name=sonataflow-operator'
not 'app.kubernetes.io/name=logic-operator-rhel8'.

This fixes the wait condition that was failing to find the controller pod.
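
A sketch of the corrected wait, with the label taken from the fix above and the namespace as an assumption:

# Namespace openshift-serverless-logic is an assumption for illustration
oc wait pod -l app.kubernetes.io/name=sonataflow-operator -n openshift-serverless-logic --for=condition=Ready --timeout=300s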
… playbook

The variable orchestrator_db_name was missing from the main playbook vars
after removing the vars_files that was loading defaults/main.yml.

This fixes the error: 'orchestrator_db_name' is undefined
…ests_subpath variables

These variables are required for workflow deployment but were missing
from the main playbook vars after removing vars_files.

This fixes the error: 'workflow_repo' is undefined
The SonataFlowPlatform uses 'Succeed' condition type, not 'Ready'.
This was causing the wait task to timeout unnecessarily.

Changes:
- Updated jsonpath to look for conditions[?(@.type=="Succeed")]
- Updated variable name from READY to SUCCEED for clarity
- Updated log messages to reflect correct condition name
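
A sketch of the corrected check, assuming the platform resource is named sonataflow-platform (hypothetical, inferred from the data-index service name later in this thread):

SUCCEED=$(oc get sonataflowplatform sonataflow-platform -n "${NAMESPACE}" \
  -o jsonpath='{.status.conditions[?(@.type=="Succeed")].status}')
[ "${SUCCEED}" = "True" ] && echo "SonataFlowPlatform is ready"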
…flows

Workflows were trying to connect to a hostname extracted from the secret, but they should
use the hardcoded service name 'postgresql'.

This fixes UnknownHostException: sonataflow-psql-postgresql.orchestrator-infra

Changes:
- Set dynamic_psql_svc_name to 'postgresql' directly
- Removed regex extraction that was causing incorrect hostname
- Workflows will now connect to postgresql.orchestrator-infra correctly
The namespace deletion wasn't waiting long enough and wasn't cleaning
SonataFlow resources first, which can block namespace deletion.

Changes:
- Delete SonataFlow and SonataFlowPlatform resources before namespace
- Improved wait loop with better feedback (shows progress every 5s)
- Added force cleanup for stuck resources after 60s timeout
- Better logging to show deletion progress

This ensures clean reinstallation when running without --no-clean flag.
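
A minimal sketch of the cleanup order described above, assuming NAMESPACE=orchestrator-infra:

# Delete SonataFlow resources first so their finalizers don't block namespace deletion
oc delete sonataflow --all -n "${NAMESPACE}" --ignore-not-found
oc delete sonataflowplatform --all -n "${NAMESPACE}" --ignore-not-found
oc delete namespace "${NAMESPACE}" --wait=false
for _ in $(seq 1 12); do
  oc get namespace "${NAMESPACE}" >/dev/null 2>&1 || break
  echo "Waiting for namespace ${NAMESPACE} to terminate..."
  sleep 5
done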
…credentials

Workflows were failing with 'couldn't find key POSTGRES_USER in Secret'
because the secret uses POSTGRESQL_USER and POSTGRESQL_PASSWORD, not
POSTGRES_USER and POSTGRES_PASSWORD.

Changes:
- Updated dynamic_psql_user_key to 'POSTGRESQL_USER'
- Updated dynamic_psql_password_key to 'POSTGRESQL_PASSWORD'
- Fixed extraction shell command to read POSTGRESQL_USER from secret
- Added error handling for missing POSTGRES_HOST key

This fixes CreateContainerConfigError in workflow pods.
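
A sketch of reading the corrected keys, assuming the secret is named sonataflow-psql-postgresql (name inferred from the hostname error earlier in this thread):

# Secret name is an assumption; the keys match the fix above
PSQL_USER=$(oc get secret sonataflow-psql-postgresql -n "${NAMESPACE}" \
  -o jsonpath='{.data.POSTGRESQL_USER}' | base64 -d)
PSQL_PASSWORD=$(oc get secret sonataflow-psql-postgresql -n "${NAMESPACE}" \
  -o jsonpath='{.data.POSTGRESQL_PASSWORD}' | base64 -d)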
…abase

The sonataflow user doesn't have permission to create databases.
Need to use postgres superuser instead.

Changes:
- Updated CREATE DATABASE command to use 'postgres' user
- Added OWNER sonataflow to ensure correct ownership
- This ensures the database is created successfully on first run
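
A sketch of the superuser-owned creation described above, assuming a postgresql Deployment, a sonataflow database name, and trusted local auth inside the container (all hypothetical):

# Deployment and database names are assumptions for illustration
oc exec -n "${NAMESPACE}" deploy/postgresql -- \
  psql -U postgres -c "CREATE DATABASE sonataflow OWNER sonataflow;"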
Allow users to override workflow images via variables.
Defaults to 'latest' tag but can be changed to specific versions
if image pull issues occur.

Changes:
- Added user_onboarding_image variable (defaults to latest)
- Added image patch task for user-onboarding workflow
- Image can be overridden at runtime: -e user_onboarding_image=<image:tag>
- Automatically restarts pods when image is changed

Example usage:
  ./deploy-orchestrator.sh
  # Or with specific tag:
  ansible-playbook ... -e user_onboarding_image=quay.io/orchestrator/demo-user-onboarding:656e56bd
…NSES files

Deleted outdated CHANGELOG.md and REVIEW_RESPONSES.md files to streamline the repository and eliminate unnecessary clutter. Updated deploy.yml to fix whitespace issues for better readability.
…meout

The wait tasks were hanging indefinitely when pods didn't exist or were
in error states like ImagePullBackOff.

Changes:
- Added loop to check pod existence before waiting (24 iterations x 5s = 2min)
- Check pod count first, then wait for readiness with short timeout
- Always exit 0 (success) after timeout to allow deployment to continue
- Better logging to show progress during wait
- Prevents Ansible from hanging on workflow pod issues

This ensures the deployment continues even if workflow pods have temporary
issues, allowing for manual troubleshooting without blocking the entire deploy.
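
A sketch of the bounded wait described above (24 iterations x 5s, about 2 minutes), assuming a hypothetical app=user-onboarding pod label:

for _ in $(seq 1 24); do
  COUNT=$(oc get pods -l app=user-onboarding -n "${NAMESPACE}" --no-headers 2>/dev/null | wc -l)
  [ "${COUNT}" -gt 0 ] && break
  echo "Waiting for workflow pods to appear..."
  sleep 5
done
# Short readiness wait; never fail the deployment on workflow pod issues
oc wait pods -l app=user-onboarding -n "${NAMESPACE}" --for=condition=Ready --timeout=60s || true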
The script was showing incorrect port (8080) for Data Index Service.
The actual service port is 80, not 8080.

Changes:
- Updated help text to show correct URL with /graphql endpoint
- Updated final success message with correct port (80)
- Added alternative full URL format for clarity
- Removed incorrect :8080 port reference

Correct URLs:
  - Short: http://sonataflow-platform-data-index-service.orchestrator-infra/graphql
  - Full: http://sonataflow-platform-data-index-service.orchestrator-infra.svc.cluster.local:80/graphql

This fixes RHDH connection errors to Data Index Service.
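
A quick reachability check against the corrected URL, run from a throwaway in-cluster pod (sketch; any in-cluster HTTP client works):

oc run curl-test --rm -i --restart=Never --image=curlimages/curl --command -- \
  curl -s -o /dev/null -w '%{http_code}\n' \
  http://sonataflow-platform-data-index-service.orchestrator-infra/graphql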
…port

Updated documentation to show correct Data Index Service URL.
The service uses port 80, not 8080, and requires /graphql endpoint.

Changes:
- Updated example configuration with correct URL
- Added both short and full URL formats
- Removed incorrect :8080 port reference

This matches the actual service configuration and prevents connection errors.
@openshift-ci

openshift-ci bot commented Oct 29, 2025

@gustavolira: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-ocp-helm
Commit: f25f7ad
Details: link
Required: true
Rerun command: /test e2e-ocp-helm

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


@github-actions
Contributor

github-actions bot commented Nov 6, 2025

This PR is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 21 days.

@github-actions github-actions bot added the Stale label Nov 6, 2025