We are using the visual encoder (ViT) of BioCLIP to process images, and we have run into an issue with the forward method in BaseCAM.
The problem is in the following two lines of code:
self.outputs = outputs = self.activations_and_grads(input_tensor)
target_categories = np.argmax(outputs.cpu().data.numpy(), axis=-1)
Here, outputs is the CLS token embedding: a high-dimensional feature vector that represents the global semantic content of the input image. It is not a classification result or a set of logits, so np.argmax over it returns the index of the largest feature dimension rather than a class label, and target_categories ends up meaningless.
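
To make the mismatch concrete, here is a minimal NumPy sketch with toy values. It assumes a CLIP-style setup in which class scores come from cosine similarity between the image embedding and one text embedding per class. The helper names (`l2_normalize`, `zero_shot_logits`) and all numbers are hypothetical and are not part of BioCLIP or the CAM library; this only illustrates why argmax over an embedding differs from argmax over class scores.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along `axis`."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def zero_shot_logits(image_embeddings, text_embeddings):
    """Cosine similarity between image embeddings and per-class text embeddings.

    image_embeddings: array of shape (batch, dim)
    text_embeddings:  array of shape (num_classes, dim)
    Returns an array of shape (batch, num_classes).
    """
    return l2_normalize(image_embeddings) @ l2_normalize(text_embeddings).T


# Toy stand-in for one CLS embedding (hypothetical values, 5 dimensions).
image_embeddings = np.array([[0.1, 0.3, -0.2, 0.2, 0.9]])

# Toy stand-ins for the text embeddings of 3 classes (hypothetical values).
text_embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],  # class 0
    [0.0, 0.0, 1.0, 0.0, 0.0],  # class 1
    [0.0, 0.5, 0.0, 0.0, 0.5],  # class 2
])

# What the quoted lines effectively compute: the largest feature dimension.
feature_index = np.argmax(image_embeddings, axis=-1)

# What a class target needs: argmax over per-class scores.
logits = zero_shot_logits(image_embeddings, text_embeddings)
class_index = np.argmax(logits, axis=-1)

print("argmax over embedding:", feature_index)  # a feature dimension index
print("argmax over logits:   ", class_index)    # a class index
```

With these toy values, the argmax over the embedding is 4, which is not even a valid index for 3 classes, while the argmax over the similarity scores selects class 2. In a real pipeline the same idea would apply to tensors produced by the image and text encoders, under the assumption that the model is wrapped so that its output is a vector of class scores rather than the raw embedding.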