
TensorArray Not Used on line 865 of tokenization_utils.py  #170

@CodeSmileBot

Description


Hello!

I found an AI-Specific Code smell in your project.
The smell is called: TensorArray Not Used

You can find more information about it in this paper: https://dl.acm.org/doi/abs/10.1145/3522664.3528620.

According to the paper, the smell is described as follows:

Problem: If the developer initializes an array with tf.constant() and then tries to assign new values to it inside a loop to keep it growing, the code will raise an error. The developer can work around this with the low-level tf.while_loop() API, but coding this way is inefficient: many intermediate tensors are built in the process.
Solution: Using tf.TensorArray() to grow an array inside a loop is the better solution for this kind of problem in TensorFlow 2.
Impact: Efficiency, Error-proneness
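As a minimal illustration of the problem described above (a sketch, assuming TensorFlow 2): a tf.Tensor does not support item assignment, so growing it in place fails, and growing it with tf.concat allocates a new intermediate tensor on every iteration.

```python
import tensorflow as tf

t = tf.constant([1, 1])
try:
    t[0] = 5  # a tf.Tensor does not support item assignment
except TypeError as e:
    print("item assignment failed:", e)

# Growing with tf.concat runs, but builds a brand-new
# tensor on every iteration instead of writing in place.
for _ in range(3):
    t = tf.concat([t, [t[-1] + t[-2]]], 0)
print(t.numpy())
```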

Example:

### TensorFlow
import tensorflow as tf

@tf.function
def fibonacci(n):
    a = tf.constant(1)
    b = tf.constant(1)
-   c = tf.constant([1, 1])
+   c = tf.TensorArray(tf.int32, n)
+   c = c.write(0, a)
+   c = c.write(1, b)

    for i in range(2, n):
        a, b = b, a + b
-       c = tf.concat([c, [b]], 0)
+       c = c.write(i, b)

-   return c
+   return c.stack()
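For reference, here is a self-contained, runnable version of the corrected function (a sketch assuming TensorFlow 2; the function name and dtype follow the example above, and tf.range is used so AutoGraph can trace the loop when n is a tensor):

```python
import tensorflow as tf

@tf.function
def fibonacci(n):
    # tf.TensorArray supports efficient per-index writes inside
    # a traced loop, unlike repeated tf.concat on a tf.constant.
    c = tf.TensorArray(tf.int32, size=n)
    a = tf.constant(1)
    b = tf.constant(1)
    c = c.write(0, a)
    c = c.write(1, b)
    for i in tf.range(2, n):
        a, b = b, a + b
        c = c.write(i, b)
    # stack() concatenates all written elements into one tensor.
    return c.stack()

print(fibonacci(tf.constant(8)).numpy())
```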

You can find the code related to this smell in this link:

if add_special_tokens:
    sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
    token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
    encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(ids, pair_ids)
else:
    sequence = ids + pair_ids if pair else ids
    token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])

if return_tensors == 'tf' and is_tf_available():
    sequence = tf.constant([sequence])
    token_type_ids = tf.constant([token_type_ids])
elif return_tensors == 'pt' and is_torch_available():
    sequence = torch.tensor([sequence])
    token_type_ids = torch.tensor([token_type_ids])
elif return_tensors is not None:
    logger.warning("Unable to convert output to tensors format {}, PyTorch or TensorFlow is not available.".format(return_tensors))

encoded_inputs["input_ids"] = sequence
encoded_inputs["token_type_ids"] = token_type_ids

if max_length and len(encoded_inputs["input_ids"]) > max_length:
    ...

I also found instances of this smell in other files, such as:

File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/bert/optimization_test.py#L26-L36 Line: 31
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/bert_wwm_ext/optimization_test.py#L26-L36 Line: 31
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/ernie/optimization_test.py#L26-L36 Line: 31
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/roberta_wwm_ext/optimization_test.py#L26-L36 Line: 31
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/roberta_wwm_large_ext/optimization_test.py#L26-L36 Line: 31

I hope this information is helpful!
