stop() in physical.py fails to re-image VMs when agent crashes #2604

@para0x0dise

My Setup

I'm using physical machines managed by a FOG server to re-image VMs after each malware analysis. After an analysis completes, CAPE calls the stop() method in physical.py to reset the machine. This method checks the VM state; if it's running, it triggers a deployment task via the FOG server to restore the VM to a clean snapshot.

Here’s the relevant part of the code:

def stop(self, label):
    """Stop a physical machine.
    @param label: physical machine name.
    @raise CuckooMachineError: if unable to stop.
    """
    taskID_Deploy = 0
    hostID = 0

    # If the agent has crashed, this condition is never met,
    # so the VM is never re-imaged.
    if self._status(label) == self.RUNNING:
        log.debug("Rebooting machine: %s", label)
        machine = self._get_machine(label)

        r_hosts = requests.get(f"http://{self.options.fog.hostname}/fog/host", headers=headers)
        hosts = r_hosts.json()["hosts"]

        for host in hosts:
            if machine.name == host["name"]:
                print(f"{host['id']}: {host['name']}")
                hostID = host["id"]
                r_types = requests.get(f"http://{self.options.fog.hostname}/fog/tasktype", headers=headers)
                types = r_types.json()

Current Behavior

When the agent inside the VM crashes, self._status(label) does not return RUNNING. As a result, the VM is skipped and never re-imaged, leaving it in an infected state indefinitely.

# IF THE AGENT CRASHES, THIS CONDITION IS NEVER TRIGGERED,
# AND THE VM WILL NOT BE RE-IMAGED
if self._status(label) == self.RUNNING:
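
To make the failure mode concrete, here is a minimal, self-contained sketch. It assumes the machine status is derived from whether the in-guest agent responds, which is my reading of the behavior described above rather than confirmed code. The names `status_from_agent` and `stop_guarded` are hypothetical and exist only for illustration.

```python
RUNNING = "running"
STOPPED = "stopped"


def status_from_agent(agent_reachable: bool) -> str:
    """Assumed model: the machine counts as RUNNING only if the agent answers."""
    return RUNNING if agent_reachable else STOPPED


def stop_guarded(agent_reachable: bool, reimage) -> bool:
    """Mimic the guarded stop(): re-image only when the status is RUNNING.

    Returns True if a re-image was triggered, False if it was skipped.
    """
    if status_from_agent(agent_reachable) == RUNNING:
        reimage()
        return True
    return False


# Record every re-image request so the two cases can be compared.
reimage_calls = []
healthy = stop_guarded(True, lambda: reimage_calls.append("healthy-agent"))
crashed = stop_guarded(False, lambda: reimage_calls.append("crashed-agent"))
```

Under this assumption, only the machine with a healthy agent is re-imaged; the one with the crashed agent is skipped and stays dirty.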

Fix Attempt 1

To work around this, I modified the condition to check if the machine object is returned by self._get_machine(label) instead of relying on self._status(label):

machine = self._get_machine(label)

# if self._status(label) == self.RUNNING:
if machine:
    log.debug("Rebooting machine: %s", label)
    # machine = self._get_machine(label)
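
A slightly fuller sketch of the same idea, written as a standalone function so it can be run outside CAPE. It treats the reported status as informational only and schedules the re-image whenever the machine is known. The function `stop_machine` and its callback parameters (`get_machine`, `get_status`, `schedule_deploy`) are hypothetical stand-ins for the machinery's own lookups and the FOG deployment request, not CAPE or FOG APIs.

```python
import logging
from typing import Any, Callable, Optional

log = logging.getLogger(__name__)


def stop_machine(
    label: str,
    get_machine: Callable[[str], Optional[Any]],
    get_status: Callable[[str], str],
    schedule_deploy: Callable[[Any], None],
) -> bool:
    """Schedule a re-image for a known machine regardless of agent status.

    Returns True if a deployment was scheduled, False if the label is unknown.
    """
    machine = get_machine(label)
    if machine is None:
        log.warning("Unknown machine %s; nothing to re-image", label)
        return False

    # The status is logged for diagnostics but no longer gates the re-image.
    status = get_status(label)
    if status != "running":
        log.warning("Machine %s reports status %r; re-imaging anyway", label, status)

    schedule_deploy(machine)
    return True


# Example wiring with in-memory fakes (hypothetical data).
machines = {"win10-01": {"name": "win10-01"}}
deployed = []
ok = stop_machine("win10-01", machines.get, lambda _label: "stopped", deployed.append)
missing = stop_machine("ghost", machines.get, lambda _label: "stopped", deployed.append)
```

Keeping the status lookup as a log message rather than a gate preserves the diagnostic information while ensuring a crashed agent cannot block the re-image.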

New Problem Introduced

This workaround does trigger re-imaging even when the agent has crashed, but it appears to cause another issue: machines whose agent crashed are no longer picked for subsequent analyses. I suspect they are being marked as inactive or removed from the machine pool in the SQLAlchemy-backed database, but I have not confirmed this.
