Skip to content

Conversation

@dguido
Copy link
Member

@dguido dguido commented Nov 27, 2025

Summary

  • Adds a pre-flight SSH connectivity check in users.yml before the main user management tasks
  • Provides a helpful error message with diagnostic steps when the server is unreachable

Problem

When running ./algo update-users, if the server is unreachable, users see cryptic Ansible SSH retry errors:

ssh_retry: attempt: 3, ssh return code is 255

This provides no indication of what's wrong or how to fix it.

Solution

Added two tasks after loading server config:

  1. Test SSH connectivity using wait_for module with 10-second timeout
  2. Fail with helpful message if unreachable, including:
    • Possible causes (server not running, IP changed, firewall blocking)
    • Diagnostic commands (nc -zv and ssh -vvv)

Example error message:

Cannot connect to 18.202.x.x on port 4160.

Possible causes:
- Server is not running (check your cloud provider console)
- IP address changed (common after EC2 restart without Elastic IP)
- Firewall/security group blocking port 4160

To diagnose:
  nc -zv 18.202.x.x 4160
  ssh -vvv -p 4160 -i configs/algo.pem [email protected]

Test plan

  • Ansible syntax check passes
  • All 91 unit tests pass
  • Pre-commit hooks pass (yamllint, ansible-lint)
  • Manual test with unreachable server shows helpful error
  • Manual test with reachable server proceeds normally

Fixes #14878

🤖 Generated with Claude Code

When running ./algo update-users, if the server is unreachable, users
previously saw cryptic Ansible SSH retry errors like:

    ssh_retry: attempt: 3, ssh return code is 255

This adds a pre-flight check that tests SSH connectivity before running
the full playbook. If the connection fails, it provides a helpful error
message with:
- Clear explanation of what failed
- Possible causes (server not running, IP changed, firewall blocking)
- Diagnostic commands to run (nc, ssh -vvv)

Fixes #14878

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@dguido dguido requested a review from jackivanov as a code owner November 27, 2025 19:45
@claude
Copy link
Contributor

claude bot commented Nov 27, 2025

Code Review

I've reviewed PR #14881 which adds a pre-flight SSH connectivity check before running user management tasks. Overall, this is a well-implemented improvement that addresses a real pain point. Here's my detailed feedback:


Strengths

  1. Excellent UX improvement: The pre-flight check will fail fast with a helpful error message instead of letting users sit through multiple SSH retry timeouts.

  2. Clear error messaging: The failure message includes:

    • What failed (connection details)
    • Possible causes (server not running, IP changed, firewall issues)
    • Diagnostic commands to run (nc, ssh -vvv)
  3. Proper placement: The check is correctly placed after loading the server config but before attempting the actual SSH connection, which is the ideal spot.

  4. Correct conditional logic: The when clauses properly skip the check for localhost deployments.


🔍 Issues Found

Critical: SSH Key Path Hardcoded

Location: users.yml:75

ssh -vvv -p {{ ansible_ssh_port | default(ssh_port) }} -i configs/algo.pem {{ server_user | default('algo') }}@{{ algo_server }}

Problem: The SSH key path is hardcoded as configs/algo.pem, but this won't work when:

  • Users have a custom SSH key configured in their .config.yml file
  • The .config.yml contains ansible_ssh_private_key_file pointing to a different key

Evidence: Looking at server.yml:195, the config can store a custom key path:

ansible_ssh_private_key_file: {{ SSH_keys.private }}

Fix: Use the variable from the loaded config:

ssh -vvv -p {{ ansible_ssh_port | default(ssh_port) }} -i {{ ansible_ssh_private_key_file | default('configs/algo.pem') }} {{ server_user | default('algo') }}@{{ algo_server }}

Minor: Username Default Inconsistency

Location: users.yml:75 vs users.yml:102

The diagnostic command uses:

{{ server_user | default('algo') }}

But when adding the host, it uses:

ansible_ssh_user: "{{ server_user | default('root') }}"

Impact: The diagnostic SSH command may use the wrong username if server_user is not defined.

Question: Which default is correct? Looking at server.yml:192, server_user is always written to .config.yml from ansible_ssh_user, so this shouldn't be undefined. However, for consistency, the diagnostic command should match line 102's default of 'root'.

Fix: Change line 75 to use 'root' as the default:

{{ server_user | default('root') }}@{{ algo_server }}

💡 Suggestions (Optional)

  1. Consider adding connection timeout context: The error message could mention that the check uses a 10-second timeout, so if the server is slow to respond (but not dead), users understand why it failed.

  2. DNS resolution note: If the algo_server is a hostname (not IP), DNS resolution failures would also cause this check to fail. Mentioning this in the error message could be helpful.


🧪 Testing Recommendations

The PR description mentions manual testing is incomplete. Before merging, please test:

  1. Unreachable server scenarios:

    • Server stopped
    • Firewall blocking the SSH port
    • Invalid IP address
    • Hostname that doesn't resolve
  2. Successful connection scenarios:

    • Standard deployment
    • Custom SSH port
    • Custom SSH key path
    • Localhost deployment (should skip the check)
  3. Edge cases:

    • IPv6 server address
    • Server with hostname vs IP
    • Custom server_user value

📋 Required Changes

Before merging, please address:

  1. Fix the hardcoded SSH key path (line 75) - use ansible_ssh_private_key_file variable
  2. Fix the username default (line 75) - use 'root' instead of 'algo'
  3. ⚠️ Complete manual testing as outlined above

📖 Alignment with CLAUDE.md Guidelines

  • ✅ Clear, concise implementation
  • ✅ No over-engineering
  • ✅ Follows the principle of better error messages
  • ✅ Maintains backwards compatibility (localhost check is skipped)
  • ✅ No security concerns introduced

Summary

This is a valuable improvement that addresses issue #14878 effectively. The two issues found are straightforward to fix and relate to variable consistency. Once the hardcoded SSH key path and username default are corrected, and manual testing is completed, this PR will be ready to merge.

Great work on improving the user experience! 👍

@dguido dguido merged commit 2662c5e into master Nov 27, 2025
23 checks passed
@dguido dguido deleted the fix/ssh-preflight-check-14878 branch November 27, 2025 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add pre-flight SSH connectivity check before update-users

2 participants