
Conversation

@gustavolira
Member

Description

Introduces a new automated script to provision the RHDH Orchestrator.

  • A dedicated shell script that sets up the Orchestrator backend with default values and a single command.
  • Configuration adjustments to streamline deployment and reduce manual steps.
  • Extended logging and validation to ensure correct integration of the Orchestrator module.

Which issue(s) does this PR fix

https://issues.redhat.com/browse/RHIDP-9016

PR acceptance criteria

Please make sure that the following steps are complete:

  • GitHub Actions are completed and successful
  • Unit Tests are updated and passing
  • E2E Tests are updated and passing
  • Documentation is updated if necessary (requirement for new features)
  • Add a screenshot if the change is UX/UI related

How to test changes / Special notes to the reviewer

@openshift-ci openshift-ci bot requested review from albarbaro and psrna October 22, 2025 17:31
@openshift-ci

openshift-ci bot commented Oct 22, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign albarbaro for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gustavolira gustavolira changed the title RHIDP-9016 - Create automated script to provision RHDH Orchestrator integrated with RHDH RHIDP-9016 - Create automated script to provision RHDH Orchestrator integrated Oct 22, 2025
@gustavolira gustavolira changed the title RHIDP-9016 - Create automated script to provision RHDH Orchestrator integrated RHIDP-9016 - create automated script to provision RHDH Orchestrator integrated Oct 22, 2025
@gustavolira gustavolira changed the title RHIDP-9016 - create automated script to provision RHDH Orchestrator integrated chore(e2e): rhidp-9016 - Create automated script to provision RHDH Orchestrator integrated Oct 22, 2025

@HusneShabbir
Contributor

/retest


costmetrics_operator_source: redhat-operators
costmetrics_operator_source_namespace: openshift-marketplace

costmetrics_client_id: "e989874e-279e-4291-b104-60fab5d7f9bc"
Contributor

Just to note - this is revoked

@chadcrum
Contributor

Are you including the Ansible roles on purpose?

If so, some of the roles are not directly related to RHDH / Orchestrator so you can remove them:

  • deploy-cost-metrics-operator
  • deploy-optimizer-app
  • deploy-orchestrator # This is old - I would remove this as well
  • deploy-resource-optimization-plugin
  • deploy-resource-optimization-workflow
  • odf-node-recovery
  • post-mortem

EOF

echo "=== Waiting for database initialization ==="
oc wait job -l job-name --for=condition=Complete -n ${NAMESPACE} --timeout=60s 2> /dev/null || true
Contributor

should job-name be something specific?

Member Author

added app: init-orchestrator-db label to the Job and updated the wait command to use it.
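
For reference, a minimal sketch of the corrected wait, assuming the init Job now carries the app: init-orchestrator-db label:

# Wait for the labeled init Job instead of the unmatched "job-name" selector
oc wait job -l app=init-orchestrator-db --for=condition=Complete -n "${NAMESPACE}" --timeout=60s 2>/dev/null || true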

# Clone the workflows repository
TEMP_DIR=$(mktemp -d)
echo "Cloning workflows repository to ${TEMP_DIR}..."
git clone "${WORKFLOW_REPO}" "${TEMP_DIR}/workflows"
Contributor

Cloning the workflows repo is only necessary when installing the greeting workflow, which I do not see happening in this script.

@gustavolira
Member Author

Are you including the Ansible roles on purpose?

If so, some of the roles are not directly related to RHDH / Orchestrator so you can remove them:

  • deploy-cost-metrics-operator
  • deploy-optimizer-app
  • deploy-orchestrator # This is old - I would remove this as well
  • deploy-resource-optimization-plugin
  • deploy-resource-optimization-workflow
  • odf-node-recovery
  • post-mortem

You're absolutely right, I should remove those roles. I used flight-path-auto-tests as the base for this project and forgot to clean up the roles that aren't needed for the basic Orchestrator infrastructure.

- Changed loop variable from 'i' to '_' in deploy-orchestrator.sh for clarity.
- Updated git clone command in 04-deploy-workflows.sh to use quotes around variables for better handling of paths.

Signed-off-by: Gustavo Lira <[email protected]>
- Modified the condition for creating a namespace to ensure it only executes when helm_managed_rhdh is false and the workflow_namespace is not "sonataflow-infra".
- Removed redundant condition check to streamline the task logic.

Signed-off-by: Gustavo Lira <[email protected]>
… to adhere to best practices.

- Updated various shell commands and YAML configurations for improved readability and consistency.
- Ensured proper formatting in multiple files, including deployment scripts and task definitions.

Signed-off-by: Gustavo Lira <[email protected]>
- Remove unused roles from flight-path-auto-tests base:
  - deploy-cost-metrics-operator
  - deploy-optimizer-app
  - deploy-orchestrator (old)
  - deploy-resource-optimization-plugin
  - deploy-resource-optimization-workflow
  - odf-node-recovery
  - post-mortem

- Remove unnecessary workflow repo cloning in 04-deploy-workflows.sh
  (workflows are created inline, no repo cloning needed)

- Fix PostgreSQL Job initialization:
  - Added app: init-orchestrator-db label to Job
  - Updated wait command to use correct label

Addresses feedback from @chadcrum
…eating SonataFlowPlatform

The SonataFlowPlatform CRD was being created before the Serverless Logic
Operator was ready, causing 'Failed to find exact match for
sonataflow.org/v1alpha08.SonataFlowPlatform' error.

Moved the wait for serverless operator components to run immediately after
the plugin-infra.sh script execution, before creating PostgreSQL and
SonataFlowPlatform resources.
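
A minimal sketch of the reordered wait, assuming the CRD is registered as sonataflowplatforms.sonataflow.org (hypothetical name, consistent with the sonataflow.org/v1alpha08 API group in the error above):

# Block until the Logic Operator has registered the CRD before creating any SonataFlowPlatform
oc wait crd/sonataflowplatforms.sonataflow.org --for=condition=established --timeout=300s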
…d of relying on external script

The external plugin-infra.sh script was failing with GitHub rate limits (429).
Replace it with direct operator installation using Kubernetes resources.

Changes:
- Install Red Hat OpenShift Serverless Operator via Subscription
- Install Red Hat OpenShift Serverless Logic Operator via Subscription
- Create required namespaces before operator installation
- Add proper wait logic for CSVs to be in Succeeded state
- Remove dependency on external GitHub script

This makes the deployment more reliable and predictable.
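
A minimal sketch of the Subscription-based install described above, assuming the standard serverless-operator package on the stable channel; the Logic Operator would get an analogous Subscription in its own namespace:

oc apply -f - <<'EOF'
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: serverless-operator
  namespace: openshift-serverless
spec:
  channel: stable
  name: serverless-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
# Then poll the CSV until it reports phase Succeeded (timeout handling omitted here)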
…nstalling Logic Operator

The Logic Operator requires Knative Serving and Knative Eventing to be
ready before it can install properly.

Changes:
- Create KnativeServing instance in knative-serving namespace
- Create KnativeEventing instance in knative-eventing namespace
- Wait for Knative components to be Ready before installing Logic Operator
- Add explicit wait for Logic Operator deployment to be Available
- Add wait for SonataFlow CRDs to be created before attempting to use them

This ensures the correct operator installation sequence and prevents
'Failed to find exact match for SonataFlowPlatform' errors.
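
A minimal sketch of the Knative bootstrap described above, assuming the operator.knative.dev/v1beta1 API; KnativeEventing is analogous in the knative-eventing namespace:

oc apply -f - <<'EOF'
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
EOF
# Wait for the Knative control plane to be Ready before installing the Logic Operator
oc wait knativeserving/knative-serving -n knative-serving --for=condition=Ready --timeout=600s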
The Logic Operator subscription needs an OperatorGroup in its namespace
to be properly resolved and installed.

Changes:
- Add OperatorGroup for openshift-serverless-logic namespace
- Remove wait for non-existent logic-operator deployment
  (Logic Operator is managed by Serverless Operator, no separate deployment)
- Increase CSV wait timeout from 5 to 10 minutes
- Improve error messages in wait loops

This ensures the Logic Operator CSV is created and the SonataFlow CRDs
are properly installed in the cluster.
…fra namespace

The vars_files was loading role defaults AFTER our vars definition,
causing rhdh_ns to be overwritten with 'rhdh-operator' instead of
'orchestrator-infra'.

Changes:
- Remove vars_files that was overriding our namespace variable
- Add explicit kubeconfig_path variable definition
- This ensures PostgreSQL and SonataFlowPlatform are created in the
  correct 'orchestrator-infra' namespace

Fixes issue where all infrastructure was being created in wrong namespaces.
The Logic Operator does not support OwnNamespace mode and was failing
with 'OwnNamespace InstallModeType not supported' error.

Changes:
- Remove targetNamespaces from OperatorGroup (enables AllNamespaces mode)
- Update CSV wait to accept both Succeeded and Failed phases
- Add note that Failed phase is acceptable if CRDs are installed
- The Logic Operator successfully installs CRDs even when CSV shows Failed

This is expected behavior - the Logic Operator installs the required
SonataFlow CRDs regardless of the CSV phase.
…Operator

The Logic Operator requires an OperatorGroup with AllNamespaces mode.

Changes:
- Rename OperatorGroup from 'openshift-serverless-logic' to 'global-operators'
- Keep empty spec to enable AllNamespaces mode
- Add wait for Logic Operator controller pod to be ready
- Add dbMigrationStrategy to SonataFlow Platform persistence config
- Improve wait conditions for all deployments
- Add KUBECONFIG environment to all shell commands
- Wait for deployments to be created before checking their status

This ensures the Logic Operator CSV reaches Succeeded state and the
controller pod is running before attempting to create SonataFlowPlatform.
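
A sketch of the resulting OperatorGroup described above; the empty spec is what enables AllNamespaces mode:

oc apply -f - <<'EOF'
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: global-operators
  namespace: openshift-serverless-logic
spec: {}
EOF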
… troubleshooting

Changes:
- Update script name from deploy.sh to deploy-orchestrator.sh
- Update command-line flags to match actual implementation
- Add troubleshooting section for Logic Operator CSV Failed state
- Document that Failed CSV state is expected when CRDs are installed
- Add verification commands for Logic Operator controller
…ler pod

The Logic Operator controller pod uses label 'app.kubernetes.io/name=sonataflow-operator'
not 'app.kubernetes.io/name=logic-operator-rhel8'.

This fixes the wait condition that was failing to find the controller pod.
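
A sketch of the corrected wait, with the label taken from the fix above and the namespace as an assumption:

# Namespace openshift-serverless-logic is an assumption for illustration
oc wait pod -l app.kubernetes.io/name=sonataflow-operator -n openshift-serverless-logic --for=condition=Ready --timeout=300s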
… playbook

The variable orchestrator_db_name was missing from the main playbook vars
after removing the vars_files that was loading defaults/main.yml.

This fixes the error: 'orchestrator_db_name' is undefined
…ests_subpath variables

These variables are required for workflow deployment but were missing
from the main playbook vars after removing vars_files.

This fixes the error: 'workflow_repo' is undefined
The SonataFlowPlatform uses 'Succeed' condition type, not 'Ready'.
This was causing the wait task to timeout unnecessarily.

Changes:
- Updated jsonpath to look for conditions[?(@.type=="Succeed")]
- Updated variable name from READY to SUCCEED for clarity
- Updated log messages to reflect correct condition name
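
A sketch of the corrected check, assuming the platform resource is named sonataflow-platform (hypothetical, inferred from the data-index service name later in this thread):

SUCCEED=$(oc get sonataflowplatform sonataflow-platform -n "${NAMESPACE}" \
  -o jsonpath='{.status.conditions[?(@.type=="Succeed")].status}')
[ "${SUCCEED}" = "True" ] && echo "SonataFlowPlatform is ready"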
…flows

Workflows were trying to connect to a hostname extracted from the secret, but they should
use the hardcoded service name 'postgresql'.

This fixes UnknownHostException: sonataflow-psql-postgresql.orchestrator-infra

Changes:
- Set dynamic_psql_svc_name to 'postgresql' directly
- Removed regex extraction that was causing incorrect hostname
- Workflows will now connect to postgresql.orchestrator-infra correctly
The namespace deletion wasn't waiting long enough and wasn't cleaning
SonataFlow resources first, which can block namespace deletion.

Changes:
- Delete SonataFlow and SonataFlowPlatform resources before namespace
- Improved wait loop with better feedback (shows progress every 5s)
- Added force cleanup for stuck resources after 60s timeout
- Better logging to show deletion progress

This ensures clean reinstallation when running without --no-clean flag.
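
A minimal sketch of the cleanup order described above, assuming NAMESPACE=orchestrator-infra:

# Delete SonataFlow resources first so their finalizers don't block namespace deletion
oc delete sonataflow --all -n "${NAMESPACE}" --ignore-not-found
oc delete sonataflowplatform --all -n "${NAMESPACE}" --ignore-not-found
oc delete namespace "${NAMESPACE}" --wait=false
for _ in $(seq 1 12); do
  oc get namespace "${NAMESPACE}" >/dev/null 2>&1 || break
  echo "Waiting for namespace ${NAMESPACE} to terminate..."
  sleep 5
done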
…credentials

Workflows were failing with 'couldn't find key POSTGRES_USER in Secret'
because the secret uses POSTGRESQL_USER and POSTGRESQL_PASSWORD, not
POSTGRES_USER and POSTGRES_PASSWORD.

Changes:
- Updated dynamic_psql_user_key to 'POSTGRESQL_USER'
- Updated dynamic_psql_password_key to 'POSTGRESQL_PASSWORD'
- Fixed extraction shell command to read POSTGRESQL_USER from secret
- Added error handling for missing POSTGRES_HOST key

This fixes CreateContainerConfigError in workflow pods.
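
A sketch of reading the corrected keys, assuming the secret is named sonataflow-psql-postgresql (name inferred from the hostname error earlier in this thread):

# Secret name is an assumption; the keys match the fix above
PSQL_USER=$(oc get secret sonataflow-psql-postgresql -n "${NAMESPACE}" \
  -o jsonpath='{.data.POSTGRESQL_USER}' | base64 -d)
PSQL_PASSWORD=$(oc get secret sonataflow-psql-postgresql -n "${NAMESPACE}" \
  -o jsonpath='{.data.POSTGRESQL_PASSWORD}' | base64 -d)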
…abase

The sonataflow user doesn't have permission to create databases.
Need to use postgres superuser instead.

Changes:
- Updated CREATE DATABASE command to use 'postgres' user
- Added OWNER sonataflow to ensure correct ownership
- This ensures the database is created successfully on first run
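
A sketch of the superuser-owned creation described above, assuming a postgresql Deployment, a sonataflow database name, and trusted local auth inside the container (all hypothetical):

# Deployment and database names are assumptions for illustration
oc exec -n "${NAMESPACE}" deploy/postgresql -- \
  psql -U postgres -c "CREATE DATABASE sonataflow OWNER sonataflow;"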
Allow users to override workflow images via variables.
Defaults to 'latest' tag but can be changed to specific versions
if image pull issues occur.

Changes:
- Added user_onboarding_image variable (defaults to latest)
- Added image patch task for user-onboarding workflow
- Image can be overridden at runtime: -e user_onboarding_image=<image:tag>
- Automatically restarts pods when image is changed

Example usage:
  ./deploy-orchestrator.sh
  # Or with specific tag:
  ansible-playbook ... -e user_onboarding_image=quay.io/orchestrator/demo-user-onboarding:656e56bd
…NSES files

Deleted outdated CHANGELOG.md and REVIEW_RESPONSES.md files to streamline the repository and eliminate unnecessary clutter. Updated deploy.yml to fix whitespace issues for better readability.
…meout

The wait tasks were hanging indefinitely when pods didn't exist or were
in error states like ImagePullBackOff.

Changes:
- Added loop to check pod existence before waiting (24 iterations x 5s = 2min)
- Check pod count first, then wait for readiness with short timeout
- Always exit 0 (success) after timeout to allow deployment to continue
- Better logging to show progress during wait
- Prevents Ansible from hanging on workflow pod issues

This ensures the deployment continues even if workflow pods have temporary
issues, allowing for manual troubleshooting without blocking the entire deploy.
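
A sketch of the bounded wait described above (24 iterations x 5s, about 2 minutes), assuming a hypothetical app=user-onboarding pod label:

for _ in $(seq 1 24); do
  COUNT=$(oc get pods -l app=user-onboarding -n "${NAMESPACE}" --no-headers 2>/dev/null | wc -l)
  [ "${COUNT}" -gt 0 ] && break
  echo "Waiting for workflow pods to appear..."
  sleep 5
done
# Short readiness wait; never fail the deployment on workflow pod issues
oc wait pods -l app=user-onboarding -n "${NAMESPACE}" --for=condition=Ready --timeout=60s || true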
The script was showing incorrect port (8080) for Data Index Service.
The actual service port is 80, not 8080.

Changes:
- Updated help text to show correct URL with /graphql endpoint
- Updated final success message with correct port (80)
- Added alternative full URL format for clarity
- Removed incorrect :8080 port reference

Correct URLs:
  - Short: http://sonataflow-platform-data-index-service.orchestrator-infra/graphql
  - Full: http://sonataflow-platform-data-index-service.orchestrator-infra.svc.cluster.local:80/graphql

This fixes RHDH connection errors to Data Index Service.
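
A quick reachability check against the corrected URL, run from a throwaway in-cluster pod (sketch; any in-cluster HTTP client works):

oc run curl-test --rm -i --restart=Never --image=curlimages/curl --command -- \
  curl -s -o /dev/null -w '%{http_code}\n' \
  http://sonataflow-platform-data-index-service.orchestrator-infra/graphql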
…port

Updated documentation to show correct Data Index Service URL.
The service uses port 80, not 8080, and requires /graphql endpoint.

Changes:
- Updated example configuration with correct URL
- Added both short and full URL formats
- Removed incorrect :8080 port reference

This matches the actual service configuration and prevents connection errors.
@openshift-ci

openshift-ci bot commented Oct 29, 2025

@gustavolira: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-ocp-helm
Commit: f25f7ad
Details: link
Required: true
Rerun command: /test e2e-ocp-helm

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


@github-actions
Contributor

github-actions bot commented Nov 6, 2025

This PR is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 21 days.

@github-actions github-actions bot added the Stale label Nov 6, 2025