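RMB and USD amounts in Chinese text are now read out correctly. For example, 100¥ is read as 一百元, 100.11¥ as 一百元一角一分, and 100.00¥ as 一百元整. 100$ is read as 一百美元, and 100.11$ as 一百美元十一美分.
The changes are in cleaner.py and num.py. I first tried modifying normalize_sentence in text_normlization.py, but currency symbols seem to be treated as special characters and stripped before they reach it, so I chose to handle currency directly in the cleaner, ahead of that step.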
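To illustrate the idea of converting currency before symbols get stripped, here is a minimal sketch. It is not the code from cleaner.py or num.py: the names `int_to_cn`, `_read`, and `replace_currency` are hypothetical, and several behaviors are my assumptions rather than something this PR specifies (fractional digits beyond two are dropped, 整 is added only for yuan amounts written with `.00`, and a lone fen is read without a leading 零).

```python
import re

DIGITS = "零一二三四五六七八九"
# Units from largest to smallest; recursing on 亿 covers values beyond 10**16.
UNITS = ((10**8, "亿"), (10**4, "万"), (1000, "千"), (100, "百"), (10, "十"))


def _cn(n: int) -> str:
    """Read a non-negative integer in Chinese, always spelling 10 as 一十."""
    if n < 10:
        return DIGITS[n]
    for value, unit in UNITS:
        if n >= value:
            high, low = divmod(n, value)
            head = _cn(high) + unit
            if low == 0:
                return head
            # Insert 零 when at least one place is skipped (e.g. 101, 10001).
            gap = "零" if low < value // 10 else ""
            return head + gap + _cn(low)


def int_to_cn(n: int) -> str:
    """Hypothetical helper: drop the leading 一 of 一十 (11 is read 十一)."""
    text = _cn(n)
    return text[1:] if text.startswith("一十") else text


_AMOUNT = r"\d+(?:\.\d+)?"
_SYMBOL = r"[¥¥$]"
# The symbol may come before the amount ($193) or after it (100.11$).
_CURRENCY_RE = re.compile(
    rf"(?P<pre>{_SYMBOL})(?P<a1>{_AMOUNT})|(?P<a2>{_AMOUNT})(?P<post>{_SYMBOL})"
)


def _read(amount: str, symbol: str) -> str:
    whole, _, frac = amount.partition(".")
    units = int_to_cn(int(whole))
    frac2 = (frac + "00")[:2]  # assumption: keep two fractional digits only
    if symbol == "$":
        cents = int(frac2)
        return units + "美元" + (int_to_cn(cents) + "美分" if cents else "")
    jiao, fen = int(frac2[0]), int(frac2[1])
    if frac and jiao == 0 and fen == 0:
        return units + "元整"  # "100.00¥" is read with 整, "100¥" without
    tail = (DIGITS[jiao] + "角" if jiao else "") + (DIGITS[fen] + "分" if fen else "")
    return units + "元" + tail


def replace_currency(text: str) -> str:
    """Rewrite currency amounts as Chinese words; run before symbol stripping."""
    def repl(m: re.Match) -> str:
        if m.group("pre"):
            return _read(m.group("a1"), m.group("pre"))
        return _read(m.group("a2"), m.group("post"))
    return _CURRENCY_RE.sub(repl, text)


print(replace_currency("那个$193。"))    # 那个一百九十三美元。
print(replace_currency("那个11.11¥。"))  # 那个十一元一角一分。
```

Running a pass like this first means later normalization only sees plain Chinese text, so it no longer matters whether `$` or `¥` is discarded afterwards.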
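A few logs as a preview of the result (in each entry, 实际输入的目标文本 is the input text for the sentence and 前端处理后的文本 is the text after front-end processing):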
实际输入的目标文本(每句): 那个$193。
['那个$193。']
['zh']
前端处理后的文本(每句): 那个一百九十三美元.
3%|██▋ | 50/1500 [00:00<00:15, 95.48it/s]T2S Decoding EOS [0 -> 53]
3%|██▊ | 52/1500 [00:00<00:15, 93.62it/s]
实际输入的目标文本(每句): 还有这个100000000.11$。
['还有这个100000000.11$。']
['zh']
前端处理后的文本(每句): 还有这个一亿美元十一美分.
4%|███▏ | 60/1500 [00:00<00:15, 93.93it/s]T2S Decoding EOS [0 -> 70]
5%|███▋ | 69/1500 [00:00<00:15, 92.75it/s]
实际输入的目标文本(每句): 还有这个100000000000000000.11$。
['还有这个100000000000000000.11$。']
['zh']
前端处理后的文本(每句): 还有这个十亿亿美元十一美分.
4%|███▏ | 60/1500 [00:00<00:15, 95.48it/s]T2S Decoding EOS [0 -> 69]
5%|███▋ | 68/1500 [00:00<00:15, 93.86it/s]
实际输入的目标文本(每句): 那个11.11¥。
['那个11.11¥。']
['zh']
前端处理后的文本(每句): 那个十一元一角一分.
3%|██▏ | 40/1500 [00:00<00:15, 95.10it/s]T2S Decoding EOS [0 -> 49]
3%|██▌ | 48/1500 [00:00<00:15, 93.04it/s]
0.002 0.236 3.121 1.019
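I also tried adding cny/usd to the currency read-out, but LangSegmenter treats them as English and splits them into a separate segment. Since this form is neither very common nor very important, I have left it unhandled for now.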