Fix smart_batching_collate Inefficiency (#2556)

PrithivirajDamodaran · tomaarsen · web-flow · commit 684b6b5736c4 · 2024-05-22T15:48:55.000+02:00
* Fix smart_batching_collate Inefficiency

SentenceTransformer.py:846 emits an inefficiency warning:

".....Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:275.) labels = torch.tensor([example.label for example in batch])"

* Update SentenceTransformer.py

* Remove some comments; add edge case (if labels is empty)
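
The label handling described above can be sketched outside the class as follows. This is a minimal illustration, not the library's code: `labels_to_tensor` is a hypothetical helper name and the sample labels are made up. It assumes the labels are either NumPy arrays of equal shape or plain Python scalars.

```python
from typing import Sequence

import numpy as np
import torch


def labels_to_tensor(labels: Sequence) -> torch.Tensor:
    """Hypothetical helper mirroring the label handling in this patch."""
    # A list of ndarrays is stacked into one array first, so torch can wrap
    # a single buffer instead of converting many small arrays one by one.
    if len(labels) > 0 and isinstance(labels[0], np.ndarray):
        return torch.from_numpy(np.stack(labels))
    # Scalars, nested lists, and the empty list go through torch.tensor.
    return torch.tensor(labels)


# Made-up sample labels for illustration only.
array_labels = [np.array([0.1, 0.9], dtype=np.float32) for _ in range(4)]
scalar_labels = [0, 1, 1, 0]

stacked = labels_to_tensor(array_labels)   # ndarray path
scalars = labels_to_tensor(scalar_labels)  # plain-scalar path
empty = labels_to_tensor([])               # empty-batch edge case
```

The `len(labels) > 0` guard keeps `labels[0]` from raising an `IndexError` on an empty batch, and `torch.from_numpy` preserves the dtype of the stacked array.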

---------

Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
diff --git a/sentence_transformers/SentenceTransformer.py b/sentence_transformers/SentenceTransformer.py
@@ -1000,8 +1000,16 @@ def smart_batching_collate(self, batch: List["InputExample"]) -> Tuple[List[Dict
         """
         texts = [example.texts for example in batch]
         sentence_features = [self.tokenize(sentence) for sentence in zip(*texts)]
-        labels = torch.tensor([example.label for example in batch])
-        return sentence_features, labels
+        labels = [example.label for example in batch]
+
+        # Use torch.from_numpy to convert the numpy array directly to a tensor,
+        # which is the recommended approach for converting numpy arrays to tensors
+        if labels and isinstance(labels[0], np.ndarray):
+            labels_tensor = torch.from_numpy(np.stack(labels))
+        else:
+            labels_tensor = torch.tensor(labels)
+
+        return sentence_features, labels_tensor
 
     def _text_length(self, text: Union[List[int], List[List[int]]]):
         """