Skip to content

Conversation

@sandydoo
Copy link
Member

@sandydoo sandydoo commented Sep 18, 2025

  • Properly wait on child processes to ensure they are reaped
  • Remove manual timeout during process shutdown
  • Perform a textbook unix signal shutdown by restoring the default handler and re-raising the signal. If we're going to manage processes, we should do it properly.

Fixes #2141.

Wait on child processes instead of sleeping.
This is important to properly reap child processes.
Fixes an issue where process-compose would fail to restart the task runner.
Do the thing where we re-trigger the default handler to get the correct
signal behaviour.
@sandydoo sandydoo added bug Something isn't working tasks labels Sep 18, 2025
@github-actions
Copy link
Contributor

🔍 Suggested Reviewers

Based on git blame analysis of the changed lines, the following contributors have significant experience with the modified code:

Please consider reviewing this PR as you have authored significant portions of the code being modified. Your expertise would be valuable! 🙏

This comment was automatically generated by git-blame-auto-reviewer

Last updated: 2025-09-18T02:04:57.591Z

@claude
Copy link
Contributor

claude bot commented Sep 18, 2025

Claude finished @sandydoo's task —— View job


PR Review Complete

I've reviewed this PR that implements proper Unix signal handling for task process restarts. Here's my analysis focusing on areas for improvement:

Todo List:

  • Read project guidelines from CLAUDE.md
  • Analyze main.rs changes
  • Analyze signal_handler.rs changes
  • Analyze task_state.rs changes
  • Provide comprehensive review feedback

Critical Safety Issues

1. Unsafe Signal Handling in exit_process() (signal_handler.rs:80-89)

pub fn exit_process(&self) -> ! {
    let signal = self.last_signal().unwrap_or(libc::SIGTERM);
    unsafe {
        libc::signal(signal, libc::SIG_DFL);
        libc::kill(libc::getpid(), signal);
    }
    std::process::exit(1);
}

Issues:

  • The project guidelines explicitly state "No unsafe: Don't use unsafe code" but this function uses unsafe
  • libc::signal() is deprecated and unreliable in multithreaded programs - should use sigaction()
  • Race condition: another signal could arrive between setting the default handler and sending the signal
  • The fallback std::process::exit(1) at the end suggests the signal mechanism might fail

Suggested Fix:
Use nix crate's safe abstractions:

use nix::sys::signal::{SigAction, SigHandler, SigSet, Signal};

pub fn exit_process(&self) -> ! {
    let signal = match self.last_signal() {
        Some(sig) => Signal::try_from(sig).unwrap_or(Signal::SIGTERM),
        None => Signal::SIGTERM,
    };
    
    // Use sigaction instead of deprecated signal()
    let default_action = SigAction::new(SigHandler::SigDfl, SaFlags::empty(), SigSet::empty());
    unsafe {
        nix::sys::signal::sigaction(signal, &default_action).ok();
        nix::sys::signal::kill(nix::unistd::getpid(), signal).ok();
    }
    std::process::exit(1);
}

2. Process Group Handling Issue (task_state.rs:332-342)

// Send SIGTERM to the process group first for graceful shutdown
signal::killpg(Pid::from_raw(pid as i32), Signal::SIGTERM).ok();
tokio::select! {
    _ = child.wait() => {}
    _ = tokio::time::sleep(std::time::Duration::from_secs(5)) => {
        child.kill().await.ok();
    }
}

Issues:

  • killpg() sends signal to process group, but child.kill() only kills the direct child process
  • Inconsistent error handling - .ok() swallows all errors
  • 5-second timeout might not be appropriate for all processes
  • After killpg(), child processes might be orphaned

Suggested Fix:

// Send SIGTERM to the process group
if let Err(e) = signal::killpg(Pid::from_raw(pid as i32), Signal::SIGTERM) {
    tracing::warn!("Failed to send SIGTERM to process group: {}", e);
}

tokio::select! {
    _ = child.wait() => {}
    _ = tokio::time::sleep(std::time::Duration::from_secs(5)) => {
        // Send SIGKILL to process group for forceful cleanup
        signal::killpg(Pid::from_raw(pid as i32), Signal::SIGKILL).ok();
        child.kill().await.ok();
    }
}

Design Issues

3. Signal Handler Resource Management (signal_handler.rs:92-97)

impl Drop for SignalHandler {
    fn drop(&mut self) {
        // The join handle will be aborted when dropped, which is fine
        // since we're shutting down anyway
    }
}

The Drop implementation is empty, but the spawned task might be accessing shared resources. Consider graceful shutdown:

impl Drop for SignalHandler {
    fn drop(&mut self) {
        self._handle.abort();
        // Consider: self.cancellation_token.cancel(); if not already cancelled
    }
}

4. Hardcoded Timeout Values (task_state.rs:339)

The 5-second timeout for process termination is hardcoded. This should be configurable:

const GRACEFUL_SHUTDOWN_TIMEOUT: Duration = Duration::from_secs(5);

Or better yet, make it part of TaskConfig.

Minor Issues

5. Inconsistent Error Handling (task_state.rs:336,340)

Using .ok() to ignore errors prevents proper debugging:

signal::killpg(Pid::from_raw(pid as i32), Signal::SIGTERM).ok();
child.kill().await.ok();

Should log warnings when operations fail.

6. Missing Signal Validation (signal_handler.rs:81)

The unwrap_or(libc::SIGTERM) assumes SIGTERM is always valid. Consider validating the signal value.

Performance Concerns

7. Atomic Operations (signal_handler.rs:73-77)

pub fn last_signal(&self) -> Option<i32> {
    match self.last_signal.load(Ordering::Relaxed) {
        0 => None,
        i => Some(i)
    }
}

Using Relaxed ordering is appropriate here, but consider documenting why this is sufficient.

Test Coverage Gaps

The tests in signal_handler.rs don't cover:

  • Signal handling edge cases
  • The exit_process() function (though it's hard to test)
  • Concurrent signal reception
  • Process group termination logic

Security Considerations

The use of process_group(0) in task_state.rs:84 creates a new process group, which is good for signal isolation but ensure this doesn't interfere with existing process management.

Summary

The core concept of centralized signal handling is solid, but the implementation has critical safety issues, particularly the use of unsafe code against project guidelines and potential race conditions in signal handling. The process group management also needs refinement to ensure proper cleanup of child processes.

@domenkozar
Copy link
Member

This will be painful to resolve with #2060 that removes this logic completely :(

@sandydoo
Copy link
Member Author

This will be painful to resolve with #2060 that removes this logic completely :(

It'll be fine. The child shutdown will merge without issues.

As for the top-level signal handling, I think #2060 removed a bit too much. I only see ctrl+c handling left, which isn't enough. I don't see any issue integrating the tokio graceful stuff with this.

@domenkozar domenkozar merged commit 728857c into main Sep 18, 2025
267 of 278 checks passed
@sandydoo sandydoo deleted the tasks-fix-restarts branch September 26, 2025 23:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working tasks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Restarting process stopped working after v1.8

3 participants