
Spark NLP 4.2.0: Wav2Vec2 for Automatic Speech Recognition (ASR), TAPAS for Table Question Answering, CamemBERT for Token Classification, new evaluation metrics for external datasets in all classifiers, much faster EntityRuler, over 3000+ state-of-the-art multi-lingual models & pipelines, and many more!

@maziyarpanahi released this 27 Sep 14:29
· 1155 commits to master since this release

📢 Overview

For the first time, we are delighted to announce Automatic Speech Recognition (ASR) support in Spark NLP, using state-of-the-art Wav2Vec2 models at scale 🚀. This release also comes with Table Question Answering by TAPAS, CamemBERT for Token Classification, support for an external test dataset during training of all classifiers, a much faster EntityRuler, 3000+ state-of-the-art models, and other enhancements and bug fixes!

We are also celebrating crossing 11000+ free and open-source models & pipelines in our Models Hub. 🎉 As always, we would like to thank our community for their feedback, questions, and feature requests.


⭐ New Features & improvements

  • NEW: Introducing the Wav2Vec2ForCTC annotator in Spark NLP 🚀. Wav2Vec2ForCTC can load Wav2Vec2 models for the Automatic Speech Recognition (ASR) task. Wav2Vec2 is a multi-modal model that combines speech and text, and it is the first multi-modal model of its kind in Spark NLP. This annotator is compatible with all the models trained/fine-tuned by using Wav2Vec2ForCTC for PyTorch or TFWav2Vec2ForCTC for TensorFlow in HuggingFace 🤗 (#12767)

Figure: wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

  • NEW: Introducing the TapasForQuestionAnswering annotator in Spark NLP 🚀. TapasForQuestionAnswering can load TAPAS models with a cell selection head and an optional aggregation head on top for question-answering tasks on tables (linear layers on top of the hidden-states output to compute logits and optional logits_aggregation), e.g. for SQA, WTQ, or WikiSQL-supervised tasks. TAPAS is a BERT-based model specifically designed (and pre-trained) for answering questions about tabular data. This annotator is compatible with all the models trained/fine-tuned by using TapasForQuestionAnswering for PyTorch or TFTapasForQuestionAnswering for TensorFlow in HuggingFace 🤗

Figure: TAPAS: Weakly Supervised Table Parsing via Pre-training

  • NEW: Introducing the CamemBertForTokenClassification annotator in Spark NLP 🚀. CamemBertForTokenClassification can load CamemBERT models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using CamembertForTokenClassification for PyTorch or TFCamembertForTokenClassification for TensorFlow in HuggingFace 🤗 (#12752)
  • Implementing setTestDataset to evaluate metrics on an external dataset during training of Text Classifiers in Spark NLP. As in NerDLApproach, the metrics are calculated at each epoch. The feature has been added to the following multi-class/multi-label text classifier annotators: ClassifierDLApproach, SentimentDLApproach, and MultiClassifierDLApproach (#12796)
  • Refactoring and improving EntityRuler annotator inference to be up to 24x faster, especially when used with a long list of labels/entities. We speed up inference by implementing the Aho-Corasick algorithm to match patterns in a string. This requires some changes when using EntityRuler (#12634)
  • Add support for S3 storage in the cache_folder where models are downloaded, extracted, and loaded from. Previously, only local file systems, HDFS, and DBFS were supported. This new feature is especially useful for users on Kubernetes clusters with no access to HDFS or any other distributed file system (#12707)
  • Implementing lookaround functionalities in the DocumentNormalizer annotator. Currently, DocumentNormalizer has both lookahead and lookbehind functionalities. To extend support for more complex normalizations, especially within clinical text, we are introducing the lookaround feature (#12735)
  • Implementing setReplaceEntities param to NerOverwriter annotator to replace all the NER labels (entities) with the given new labels (entities) (#12745)
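
The Aho-Corasick idea behind the EntityRuler speed-up can be illustrated with a small standalone sketch. This is plain Python, not Spark NLP's actual implementation; the function names `build_automaton` and `find_entities`, as well as the sample patterns and labels, are hypothetical and only for illustration. The point is that all patterns are compiled into a single automaton, so the text is scanned once no matter how many entities are in the list.

```python
from collections import deque


def build_automaton(patterns):
    """Build an Aho-Corasick automaton from a {pattern: label} dict (illustrative only)."""
    goto = [{}]   # goto[state][char] -> next state (the trie)
    fail = [0]    # failure link: longest proper suffix that is also a trie prefix
    out = [[]]    # out[state] -> (pattern, label) pairs that end at this state

    # 1. Insert every pattern into the trie.
    for pattern, label in patterns.items():
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append((pattern, label))

    # 2. Compute failure links breadth-first; depth-1 states fail to the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_entities(text, automaton):
    """Scan text once and return (start, end, pattern, label) for every match."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern, label in out[state]:
            matches.append((i - len(pattern) + 1, i + 1, pattern, label))
    return matches


# Hypothetical entity list, for illustration only.
automaton = build_automaton(
    {"Winterfell": "LOCATION", "Jon Snow": "PERSON", "Snow": "SURNAME"}
)
print(find_entities("Jon Snow rules Winterfell", automaton))
# [(0, 8, 'Jon Snow', 'PERSON'), (4, 8, 'Snow', 'SURNAME'), (15, 25, 'Winterfell', 'LOCATION')]
```

Because the scan cost depends on the text length and the number of matches rather than on the number of patterns, this kind of matcher tends to pay off most with long entity lists, which is consistent with the speed-up described above.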

Bug Fixes

  • Fix a bug in generating the NerDL graph by using TF v2. The previous graph generated by the TFGraphBuilder annotator resulted in an exception when the length of the sequence was 1. This issue has been resolved and the new graphs created by TFGraphBuilder won't have this issue anymore (#12636)
  • Fix a bug introduced in the 4.0.0 release in the Transformer-based Word Embeddings annotators. In 4.0.0, the following annotators were migrated to BatchAnnotate to improve their performance, especially on GPU: AlbertEmbeddings, CamemBertEmbeddings, DeBertaEmbeddings, DistilBertEmbeddings, LongformerEmbeddings, RoBertaEmbeddings, XlmRoBertaEmbeddings, and XlnetEmbeddings. However, the migration introduced a bug in sentence indices that resulted in low accuracy when these annotators were combined with SentenceEmbeddings for Text Classification tasks (ClassifierDLApproach, SentimentDLApproach, and MultiClassifierDLApproach) (#12641)
  • Add support for a list of questions and contexts in LightPipeline. Previously, LightPipeline supported only one context and one question at a time for Question Answering annotators. fullAnnotate and annotate can now receive two lists, one of questions and one of contexts (#12653)
  • Fix division by zero exception in the GPT2Transformer annotator when the setDoSample param was set to true (#12661)
  • Fix AttributeError when PretrainedPipeline is used in Python with ImageAssembler as one of the stages (#12813)

📓 New Notebooks

Spark NLP | Notebooks
Wav2Vec2ForCTC | Automatic Speech Recognition in Spark NLP
ViTForImageClassification | HuggingFace in Spark NLP - ViTForImageClassification
CamemBertForTokenClassification | HuggingFace in Spark NLP - CamemBertForTokenClassification
ClassifierDLApproach | ClassifierDL Train and Evaluate
MultiClassifierDLApproach | MultiClassifierDL Train and Evaluate
SentimentDLApproach | SentimentDL Train and Evaluate
Pretrained/cache_folder | Download & Load Models From S3
EntityRuler | EntityRuler
EntityRuler | EntityRuler Alphabet
EntityRuler | EntityRuler LightPipeline
EntityRuler | EntityRuler Without Storage
DocumentNormalizer | Apply Lookaround Patterns
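
For readers unfamiliar with the lookahead and lookbehind patterns mentioned for DocumentNormalizer, here is a minimal sketch of the underlying regex idea. It uses Python's standard `re` module rather than the DocumentNormalizer API, and the helper names `mask_doses` and `mask_age` and the sample clinical-style strings are hypothetical. A lookaround asserts what must appear next to a match without consuming it, so only the target span is replaced.

```python
import re


def mask_doses(text):
    # Lookahead (?=...): match digits only when "mg" follows.
    # The unit itself is not consumed, so it stays in the output.
    return re.sub(r"\d+(?=\s?mg\b)", "<DOSE>", text)


def mask_age(text):
    # Lookbehind (?<=...): match digits only when preceded by "age: ".
    # The label itself is not consumed, so it stays in the output.
    return re.sub(r"(?<=age: )\d+", "<AGE>", text)


print(mask_doses("Take 20 mg twice, 3 tablets"))
# Take <DOSE> mg twice, 3 tablets
print(mask_age("age: 47, dose 20 mg"))
# age: <AGE>, dose 20 mg
```

Note that Python's `re` requires fixed-width lookbehind patterns; the regex engine used by Spark NLP may have different constraints.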

Models

Spark NLP 4.2.0 comes with 3000+ state-of-the-art pre-trained transformer models in many languages.

Featured Models

Model | Name | Lang
Wav2Vec2ForCTC | asr_wav2vec2_base_100h_by_facebook | en
Wav2Vec2ForCTC | asr_wav2vec2_base_960h_by_facebook | en
Wav2Vec2ForCTC | asr_wav2vec2_large_960h | en
Wav2Vec2ForCTC | asr_wav2vec2_large_xlsr_53_german_by_facebook | de
Wav2Vec2ForCTC | asr_wav2vec2_large_xlsr_53_french_by_facebook | fr
Wav2Vec2ForCTC | asr_wav2vec2_large_xlsr_53_polish_by_facebook | nl
Wav2Vec2ForCTC | asr_wav2vec2_base_10k_voxpopuli | hu
Wav2Vec2ForCTC | asr_wav2vec2_base_10k_voxpopuli | fi
Wav2Vec2ForCTC | asr_wav2vec2_base_10k_voxpopuli | it
Wav2Vec2ForCTC | asr_wav2vec2_large_xlsr_japanese_hiragana | ja

Check 2000+ Wav2Vec2 models & pipelines on Models Hub - Automatic Speech Recognition (ASR)

Spark NLP covers the following languages:

English; Multilingual; Afrikaans; Afro-Asiatic languages; Albanian; Altaic languages; American Sign Language; Amharic; Arabic; Argentine Sign Language; Armenian; Artificial languages; Atlantic-Congo languages; Austro-Asiatic languages; Austronesian languages; Azerbaijani; Baltic languages; Bantu languages; Basque; Basque (family); Belarusian; Bemba (Zambia); Bengali, Bangla; Berber languages; Bihari; Bislama; Bosnian; Brazilian Sign Language; Breton; Bulgarian; Catalan; Caucasian languages; Cebuano; Celtic languages; Central Bikol; Chichewa, Chewa, Nyanja; Chilean Sign Language; Chinese; Chuukese; Colombian Sign Language; Congo Swahili; Croatian; Cushitic languages; Czech; Danish; Dholuo, Luo (Kenya and Tanzania); Dravidian languages; Dutch; East Slavic languages; Eastern Malayo-Polynesian languages; Efik; Esperanto; Estonian; Ewe; Fijian; Finnish; Finnish Sign Language; Finno-Ugrian languages; French; French-based creoles and pidgins; Ga; Galician; Ganda; Georgian; German; Germanic languages; Gilbertese; Greek (modern); Greek languages; Gujarati; Gun; Haitian, Haitian Creole; Hausa; Hebrew (modern); Hiligaynon; Hindi; Hiri Motu; Hungarian; Icelandic; Igbo; Iloko; Indic languages; Indo-European languages; Indo-Iranian languages; Indonesian; Irish; Isoko; Isthmus Zapotec; Italian; Italic languages; Japanese; Kabyle; Kalaallisut, Greenlandic; Kannada; Kaonde; Kinyarwanda; Kirundi; Kongo; Korean; Kwangali; Kwanyama, Kuanyama; Latin; Latvian; Lingala; Lithuanian; Louisiana Creole; Lozi; Luba-Katanga; Luba-Lulua; Lunda; Lushai; Luvale; Macedonian; Malagasy; Malay; Malayalam; Malayo-Polynesian languages; Maltese; Manx; Marathi (Marāṭhī); Marshallese; Mexican Sign Language; Mon-Khmer languages; Morisyen; Mossi; Multiple languages; Ndonga; Nepali; Niger-Kordofanian languages; Nigerian Pidgin; Niuean; North Germanic languages; Northern Sotho, Pedi, Sepedi; Norwegian; Norwegian Bokmål; Norwegian Nynorsk; Nyaneka; Oromo; Pangasinan; Papiamento; Persian (Farsi); Peruvian Sign Language; Philippine languages; Pijin; Pohnpeian; Polish; Portuguese; Portuguese-based creoles and pidgins; Punjabi (Eastern); Romance languages; Romanian; Rundi; Russian; Ruund; Salishan languages; Samoan; San Salvador Kongo; Sango; Semitic languages; Serbo-Croatian; Seselwa Creole French; Shona; Sindhi; Sino-Tibetan languages; Slavic languages; Slovak; Slovene; Somali; South Caucasian languages; South Slavic languages; Southern Sotho; Spanish; Spanish Sign Language; Sranan Tongo; Swahili; Swati; Swedish; Tagalog; Tahitian; Tai; Tamil; Telugu; Tetela; Tetun Dili; Thai; Tigrinya; Tiv; Tok Pisin; Tonga (Tonga Islands); Tonga (Zambia); Tsonga; Tswana; Tumbuka; Turkic languages; Turkish; Tuvalu; Tzotzil; Ukrainian; Umbundu; Uralic languages; Urdu; Venda; Venezuelan Sign Language; Vietnamese; Wallisian; Walloon; Waray (Philippines); Welsh; West Germanic languages; West Slavic languages; Western Malayo-Polynesian languages; Wolaitta, Wolaytta; Wolof; Xhosa; Yapese; Yiddish; Yoruba; Yucatec Maya, Yucateco; Zande (individual language); Zulu

The complete list of all 11000+ models & pipelines in 230+ languages is available on Models Hub


📖 Documentation


Installation

Python

#PyPI

pip install spark-nlp==4.2.0

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.0

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.0

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.2.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.2.0</version>
</dependency>

spark-nlp-m1:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-m1_2.12</artifactId>
    <version>4.2.0</version>
</dependency>

FAT JARs

What's Changed

Contributors

@maziyarpanahi @suvrat-joshi @danilojsl @josejuanmartinez @ahmedlone127 @Damla-Gurbaz @vankov @xusliebana @DevinTDHa @jsl-builder @Cabir40 @muhammetsnts @wolliq @Meryem1425 @pabla @C-K-Loan @rpranab @agsfer

Full Changelog: 4.1.0...4.2.0


This discussion was created from the release John Snow Labs Spark-NLP 4.2.0: Wav2Vec2 for Automatic Speech Recognition (ASR), TAPAS for Table Question Answering, CamemBERT for Token Classification, new evaluation metrics for external datasets in all classifiers, much faster EntityRuler, over 3000+ state-of-the-art multi-lingual models & pipelines, and many more!.