Description
Hello!
I found an AI-specific code smell in your project.
The smell is called: TensorArray Not Used.
You can find more information about it in this paper: https://dl.acm.org/doi/abs/10.1145/3522664.3528620.
According to the paper, the smell is described as follows:
Problem | If the developer initializes an array using `tf.constant()` and tries to assign new values to it in a loop to keep it growing, the code will run into an error. The developer can work around this with the low-level `tf.while_loop()` API, but coding this way is inefficient: a lot of intermediate tensors are built in the process.
---|---
Solution | Using `tf.TensorArray()` for an array that grows in a loop is the better solution for this kind of problem in TensorFlow 2.
Impact | Efficiency, Error-proneness
Example:
### TensorFlow
```diff
 import tensorflow as tf

 @tf.function
 def fibonacci(n):
     a = tf.constant(1)
     b = tf.constant(1)
-    c = tf.constant([1, 1])
+    c = tf.TensorArray(tf.int32, n)
+    c = c.write(0, a)
+    c = c.write(1, b)
     for i in range(2, n):
         a, b = b, a + b
-        c = tf.concat([c, [b]], 0)
+        c = c.write(i, b)
-    return c
+    return c.stack()
```
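For reference, here is the corrected function assembled from the diff above, as a minimal runnable sketch (TensorFlow 2; it assumes `n` is a Python integer of at least 2, so the loop is unrolled at tracing time):

```python
import tensorflow as tf

@tf.function
def fibonacci(n):
    a = tf.constant(1)
    b = tf.constant(1)
    # Preallocate a TensorArray instead of growing a tensor with tf.concat.
    c = tf.TensorArray(tf.int32, size=n)
    c = c.write(0, a)
    c = c.write(1, b)
    for i in range(2, n):
        a, b = b, a + b
        c = c.write(i, b)  # write() returns a new TensorArray handle
    return c.stack()      # pack all writes into a single tensor

print(fibonacci(8))  # tf.Tensor([ 1  1  2  3  5  8 13 21], shape=(8,), dtype=int32)
```

Each `write()` into the preallocated `TensorArray` avoids the per-iteration intermediate tensors that repeated `tf.concat` calls would create.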
You can find the code related to this smell here:
CLUE/baselines/models_pytorch/classifier_pytorch/transformers/tokenization_utils.py, lines 855 to 875 at commit 2ea9046:
```python
if add_special_tokens:
    sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
    token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
    encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(ids, pair_ids)
else:
    sequence = ids + pair_ids if pair else ids
    token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])

if return_tensors == 'tf' and is_tf_available():
    sequence = tf.constant([sequence])
    token_type_ids = tf.constant([token_type_ids])
elif return_tensors == 'pt' and is_torch_available():
    sequence = torch.tensor([sequence])
    token_type_ids = torch.tensor([token_type_ids])
elif return_tensors is not None:
    logger.warning("Unable to convert output to tensors format {}, PyTorch or TensorFlow is not available.".format(return_tensors))

encoded_inputs["input_ids"] = sequence
encoded_inputs["token_type_ids"] = token_type_ids

if max_length and len(encoded_inputs["input_ids"]) > max_length:
```
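To make the pattern easier to spot when reviewing the locations above, here is a standalone side-by-side sketch (hypothetical code, not taken from this repository) of the anti-pattern the detector targets and the `tf.TensorArray()` rewrite the paper recommends:

```python
import tensorflow as tf

# Hypothetical illustration only: the general shape of the smell
# (growing a tf.constant-seeded tensor with tf.concat in a loop)
# versus the TensorArray-based rewrite.

@tf.function
def accumulate_with_concat(rows, n):
    out = tf.constant([[0, 0]])               # smell: seed tensor that grows
    for i in range(n):
        out = tf.concat([out, [rows[i]]], 0)  # intermediate tensor per step
    return out

@tf.function
def accumulate_with_tensorarray(rows, n):
    out = tf.TensorArray(tf.int32, size=n)    # preallocated, no intermediates
    for i in range(n):
        out = out.write(i, rows[i])
    return out.stack()
```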
I also found instances of this smell in other files, such as:
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/bert/optimization_test.py#L26-L36 Line: 31
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/bert_wwm_ext/optimization_test.py#L26-L36 Line: 31
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/ernie/optimization_test.py#L26-L36 Line: 31
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/roberta_wwm_ext/optimization_test.py#L26-L36 Line: 31
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/roberta_wwm_large_ext/optimization_test.py#L26-L36 Line: 31
I hope this information is helpful!