
Commit ada1286

futurulustiberiu44 authored and committed
Prevent empty sentences in tokenization (#114)
In some cases (usually involving sequences of multiple whitespace characters), the tokenizer can produce sentences with zero tokens. This causes errors later in the pipeline, specifically the following: ``` File "/usr/local/lib/python3.6/dist-packages/cube/api.py" line 194 in __call__ sequences = self._parser.parse_sequences(sequences) File "/usr/local/lib/python3.6/dist-packages/cube/generic_networks/parsers.py" line 496 in parse_sequences predicted_tags = self.tag(new_sequence) File "/usr/local/lib/python3.6/dist-packages/cube/generic_networks/parsers.py" line 226 in tag arc_matrix, aux_arc_matrix, proj_labels, softmax_morphology = self._predict_arc(seq) File "/usr/local/lib/python3.6/dist-packages/cube/generic_networks/parsers.py" line 470 in _predict_arc s_max = dy.softmax(dy.concatenate(s_max)) File "_dynet.pyx" line 4605 in _dynet.concatenate File "_dynet.pyx" line 4618 in _dynet.concatenate AssertionError: List is empty, nothing to concatenate. ``` This change removes empty sequences from the tokenization output.
1 parent 466637b commit ada1286

File tree

1 file changed (+2 −1)


cube/generic_networks/tokenizers.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -388,7 +388,8 @@ def tokenize(self, input_string):
                     if input_string[index + 1] in string.whitespace:
                         space_after_end_of_sentence = True
                 seq = self._get_tokens(w.strip(), space_after_end_of_sentence=space_after_end_of_sentence)
-                sequences.append(seq)
+                if seq:
+                    sequences.append(seq)
                 w = ""
                 last_ss_break = index
                 last_checked_index = index
```
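Applying the same `if seq:` guard in the sketch above (again with hypothetical names; `raw.split()` stands in for the repository's `self._get_tokens(...)`) keeps zero-token sentences out of the output:

```python
def tokenize(text):
    sequences = []
    for raw in text.split("."):  # naive sentence splitting, as above
        seq = raw.split()        # stand-in for self._get_tokens(...)
        if seq:                  # the fix: skip empty token sequences
            sequences.append(seq)
    return sequences

print(tokenize("First sentence.   \n   .Second sentence."))
# [['First', 'sentence'], ['Second', 'sentence']]
```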
