Spark NLP 4.2.0: Wav2Vec2 for Automatic Speech Recognition (ASR), TAPAS for Table Question Answering, CamemBERT for Token Classification, new evaluation metrics for external datasets in all classifiers, much faster EntityRuler, 3000+ state-of-the-art multi-lingual models & pipelines, and many more!
📢 Overview
For the first time, we are delighted to announce Automatic Speech Recognition (ASR) support in Spark NLP, using state-of-the-art Wav2Vec2 models at scale. This release also comes with Table Question Answering by TAPAS, CamemBERT for Token Classification, support for an external test dataset during training of all classifiers, a much faster EntityRuler, 3000+ state-of-the-art models, and other enhancements and bug fixes!
We are also celebrating crossing 11000+ free and open-source models & pipelines in our Models Hub. As always, we would like to thank our community for their feedback, questions, and feature requests.
⭐ New Features & improvements
- NEW: Introducing the `Wav2Vec2ForCTC` annotator in Spark NLP. `Wav2Vec2ForCTC` can load `Wav2Vec2` models for the Automatic Speech Recognition (ASR) task. Wav2Vec2 is a multi-modal model that combines speech and text. It is the first multi-modal model of its kind in Spark NLP. This annotator is compatible with all models trained or fine-tuned with `Wav2Vec2ForCTC` for PyTorch or `TFWav2Vec2ForCTC` for TensorFlow in HuggingFace 🤗 (#12767)
  Paper: wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
- NEW: Introducing the `TapasForQuestionAnswering` annotator in Spark NLP. `TapasForQuestionAnswering` can load TAPAS models with a cell selection head and an optional aggregation head on top for question-answering tasks on tables (linear layers on top of the hidden-states output to compute logits and optional logits_aggregation), e.g. for SQA, WTQ, or WikiSQL-supervised tasks. TAPAS is a BERT-based model specifically designed (and pre-trained) for answering questions about tabular data. This annotator is compatible with all models trained or fine-tuned with `TapasForQuestionAnswering` for PyTorch or `TFTapasForQuestionAnswering` for TensorFlow in HuggingFace 🤗
- NEW: Introducing the `CamemBertForTokenClassification` annotator in Spark NLP. `CamemBertForTokenClassification` can load CamemBERT models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all models trained or fine-tuned with `CamembertForTokenClassification` for PyTorch or `TFCamembertForTokenClassification` for TensorFlow in HuggingFace 🤗 (#12752)
- Implementing `setTestDataset` to evaluate metrics on an external dataset during training of text classifiers in Spark NLP. As in NerDLApproach, metrics are calculated on each epoch. The feature has been added to the following multi-class/multi-label text classifier annotators: `ClassifierDLApproach`, `SentimentDLApproach`, and `MultiClassifierDLApproach` (#12796)
- Refactoring and improving the `EntityRuler` annotator, making inference up to 24x faster, especially when used with a long list of labels/entities. We speed up inference by implementing the Aho-Corasick algorithm to match patterns in a string. This requires the following changes when using `EntityRuler` (#12634)
- Add support for S3 storage in the `cache_folder` where models are downloaded, extracted, and loaded from. Previously, we only supported local file systems, HDFS, and DBFS. This new feature is especially useful for users on Kubernetes clusters with no access to HDFS or any other distributed file system (#12707)
- Implementing `lookaround` functionalities in the `DocumentNormalizer` annotator. Currently, `DocumentNormalizer` has both `lookahead` and `lookbehind` functionalities. To extend support for more complex normalizations, especially within clinical text, we are introducing the `lookaround` feature (#12735)
- Implementing the `setReplaceEntities` param in the `NerOverwriter` annotator to replace all the NER labels (entities) with the given new labels (entities) (#12745)
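To give a feel for why the Aho-Corasick algorithm mentioned for EntityRuler helps with long lists of entities, here is a minimal plain-Python sketch of the technique. It is not the Spark NLP implementation (which lives in the Scala code base); the function names `build_automaton` and `find_entities`, and the sample patterns, are hypothetical and chosen only for illustration. The idea is that all patterns are compiled once into a single automaton, so the text is scanned in one pass regardless of how many patterns there are.

```python
from collections import deque


def build_automaton(patterns):
    """Build an Aho-Corasick automaton from a {pattern: label} mapping."""
    goto = [{}]  # goto[state][char] -> next state (trie edges)
    fail = [0]   # fail[state] -> state of the longest proper suffix in the trie
    out = [[]]   # out[state] -> (pattern, label) pairs that end at this state

    # Phase 1: insert every pattern into the trie
    for pattern, label in patterns.items():
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append((pattern, label))

    # Phase 2: breadth-first pass to compute failure links
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the suffix state
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_entities(text, automaton):
    """Scan text once and return (start, end, pattern, label) for every match."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        # Follow failure links until a transition on ch exists (or we hit the root)
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern, label in out[state]:
            matches.append((i - len(pattern) + 1, i + 1, pattern, label))
    return matches


# Hypothetical entity list, for illustration only
automaton = build_automaton(
    {"John Snow": "PERSON", "Snow": "PERSON", "Winterfell": "LOCATION"}
)
for match in find_entities("John Snow rode to Winterfell", automaton):
    print(match)
```

With a naive approach, each pattern would be searched for separately, so the cost grows with the number of patterns. Here the scan over the text happens once, which is consistent with the note above that the speed-up is most visible with a long list of labels/entities.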
Bug Fixes
- Fix a bug in generating the NerDL graph with TF v2. The previous graph generated by the `TFGraphBuilder` annotator resulted in an exception when the length of the sequence was 1. This issue has been resolved, and new graphs created by `TFGraphBuilder` no longer have it (#12636)
- Fix a bug introduced in the 4.0.0 release in Transformer-based word embeddings annotators. In the 4.0.0 release, the following annotators were migrated to BatchAnnotate to improve their performance, especially on GPU: AlbertEmbeddings, CamemBertEmbeddings, DeBertaEmbeddings, DistilBertEmbeddings, LongformerEmbeddings, RoBertaEmbeddings, XlmRoBertaEmbeddings, and XlnetEmbeddings. However, the migration introduced a bug in sentence indices which, when these annotators were combined with SentenceEmbeddings for text classification tasks (ClassifierDLApproach and SentimentDLApproach), resulted in low accuracy (#12641)
- Add support for a list of questions and contexts in LightPipeline. Previously, only one context and question at a time were supported in LightPipeline for Question Answering annotators. We have added support to `fullAnnotate` and `annotate` to receive two lists of questions and contexts (#12653)
- Fix a division-by-zero exception in the `GPT2Transformer` annotator when the `setDoSample` param was set to true (#12661)
- Fix `AttributeError` when PretrainedPipeline is used in Python with ImageAssembler as one of the stages (#12813)
New Notebooks
Spark NLP | Notebooks |
---|---|
Wav2Vec2ForCTC | Automatic Speech Recognition in Spark NLP |
ViTForImageClassification | HuggingFace in Spark NLP - ViTForImageClassification |
CamemBertForTokenClassification | HuggingFace in Spark NLP - CamemBertForTokenClassification |
ClassifierDLApproach | ClassifierDL Train and Evaluate |
MultiClassifierDLApproach | MultiClassifierDL Train and Evaluate |
SentimentDLApproach | SentimentDL Train and Evaluate |
Pretrained/cache_folder | Download & Load Models From S3 |
EntityRuler | EntityRuler |
EntityRuler | EntityRuler Alphabet |
EntityRuler | EntityRuler LightPipeline |
EntityRuler | EntityRuler Without Storage |
DocumentNormalizer | Apply Lookaround Patterns |
- You can visit Import Transformers in Spark NLP
- You can visit Spark NLP Workshop for 100+ examples
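For readers unfamiliar with the lookaround patterns covered by the DocumentNormalizer notebook listed above, the following is a small illustration using Python's standard `re` module. It only demonstrates the regular-expression concepts (lookahead, lookbehind, and the two combined); it does not use the DocumentNormalizer API, and the sample sentence, the pattern names, and the `<DOSE>` placeholder are assumptions made up for this sketch.

```python
import re

# Hypothetical clinical-style sentence, for illustration only
text = "Patient weighs 80 kg, dose 5 mg, visit 3 of 10."

# Lookahead: match a number only if a unit follows (the unit is not consumed)
lookahead = re.compile(r"\d+(?=\s?(?:kg|mg)\b)")

# Lookbehind: match a number only if it is preceded by "dose "
lookbehind = re.compile(r"(?<=dose )\d+")

# Lookaround: combine both conditions around the same number
lookaround = re.compile(r"(?<=dose )\d+(?= mg\b)")

print(lookahead.findall(text))
print(lookbehind.findall(text))

# Normalization step: replace only the number, leaving its context untouched
print(lookaround.sub("<DOSE>", text))
```

Because the surrounding context is asserted but not consumed, a replacement touches only the matched value and leaves the neighbouring words in place, which is what makes this style of pattern convenient for normalization.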
Models
Spark NLP 4.2.0 comes with 3000+ state-of-the-art pre-trained transformer models in many languages.
Featured Models
Model | Name | Lang |
---|---|---|
Wav2Vec2ForCTC | asr_wav2vec2_base_100h_by_facebook | en |
Wav2Vec2ForCTC | asr_wav2vec2_base_960h_by_facebook | en |
Wav2Vec2ForCTC | asr_wav2vec2_large_960h | en |
Wav2Vec2ForCTC | asr_wav2vec2_large_xlsr_53_german_by_facebook | de |
Wav2Vec2ForCTC | asr_wav2vec2_large_xlsr_53_french_by_facebook | fr |
Wav2Vec2ForCTC | asr_wav2vec2_large_xlsr_53_polish_by_facebook | pl |
Wav2Vec2ForCTC | asr_wav2vec2_base_10k_voxpopuli | hu |
Wav2Vec2ForCTC | asr_wav2vec2_base_10k_voxpopuli | fi |
Wav2Vec2ForCTC | asr_wav2vec2_base_10k_voxpopuli | it |
Wav2Vec2ForCTC | asr_wav2vec2_large_xlsr_japanese_hiragana | ja |
Check out the 2000+ Wav2Vec2 models & pipelines on Models Hub - Automatic Speech Recognition (ASR)
Spark NLP covers the following languages:
`English`, `Multilingual`, `Afrikaans`, `Afro-Asiatic languages`, `Albanian`, `Altaic languages`, `American Sign Language`, `Amharic`, `Arabic`, `Argentine Sign Language`, `Armenian`, `Artificial languages`, `Atlantic-Congo languages`, `Austro-Asiatic languages`, `Austronesian languages`, `Azerbaijani`, `Baltic languages`, `Bantu languages`, `Basque`, `Basque (family)`, `Belarusian`, `Bemba (Zambia)`, `Bengali, Bangla`, `Berber languages`, `Bihari`, `Bislama`, `Bosnian`, `Brazilian Sign Language`, `Breton`, `Bulgarian`, `Catalan`, `Caucasian languages`, `Cebuano`, `Celtic languages`, `Central Bikol`, `Chichewa, Chewa, Nyanja`, `Chilean Sign Language`, `Chinese`, `Chuukese`, `Colombian Sign Language`, `Congo Swahili`, `Croatian`, `Cushitic languages`, `Czech`, `Danish`, `Dholuo, Luo (Kenya and Tanzania)`, `Dravidian languages`, `Dutch`, `East Slavic languages`, `Eastern Malayo-Polynesian languages`, `Efik`, `Esperanto`, `Estonian`, `Ewe`, `Fijian`, `Finnish`, `Finnish Sign Language`, `Finno-Ugrian languages`, `French`, `French-based creoles and pidgins`, `Ga`, `Galician`, `Ganda`, `Georgian`, `German`, `Germanic languages`, `Gilbertese`, `Greek (modern)`, `Greek languages`, `Gujarati`, `Gun`, `Haitian, Haitian Creole`, `Hausa`, `Hebrew (modern)`, `Hiligaynon`, `Hindi`, `Hiri Motu`, `Hungarian`, `Icelandic`, `Igbo`, `Iloko`, `Indic languages`, `Indo-European languages`, `Indo-Iranian languages`, `Indonesian`, `Irish`, `Isoko`, `Isthmus Zapotec`, `Italian`, `Italic languages`, `Japanese`, `Kabyle`, `Kalaallisut, Greenlandic`, `Kannada`, `Kaonde`, `Kinyarwanda`, `Kirundi`, `Kongo`, `Korean`, `Kwangali`, `Kwanyama, Kuanyama`, `Latin`, `Latvian`, `Lingala`, `Lithuanian`, `Louisiana Creole`, `Lozi`, `Luba-Katanga`, `Luba-Lulua`, `Lunda`, `Lushai`, `Luvale`, `Macedonian`, `Malagasy`, `Malay`, `Malayalam`, `Malayo-Polynesian languages`, `Maltese`, `Manx`, `Marathi (Marāṭhī)`, `Marshallese`, `Mexican Sign Language`, `Mon-Khmer languages`, `Morisyen`, `Mossi`, `Multiple languages`, `Ndonga`, `Nepali`, `Niger-Kordofanian languages`, `Nigerian Pidgin`, `Niuean`, `North Germanic languages`, `Northern Sotho, Pedi, Sepedi`, `Norwegian`, `Norwegian Bokmål`, `Norwegian Nynorsk`, `Nyaneka`, `Oromo`, `Pangasinan`, `Papiamento`, `Persian (Farsi)`, `Peruvian Sign Language`, `Philippine languages`, `Pijin`, `Pohnpeian`, `Polish`, `Portuguese`, `Portuguese-based creoles and pidgins`, `Punjabi (Eastern)`, `Romance languages`, `Romanian`, `Rundi`, `Russian`, `Ruund`, `Salishan languages`, `Samoan`, `San Salvador Kongo`, `Sango`, `Semitic languages`, `Serbo-Croatian`, `Seselwa Creole French`, `Shona`, `Sindhi`, `Sino-Tibetan languages`, `Slavic languages`, `Slovak`, `Slovene`, `Somali`, `South Caucasian languages`, `South Slavic languages`, `Southern Sotho`, `Spanish`, `Spanish Sign Language`, `Sranan Tongo`, `Swahili`, `Swati`, `Swedish`, `Tagalog`, `Tahitian`, `Tai`, `Tamil`, `Telugu`, `Tetela`, `Tetun Dili`, `Thai`, `Tigrinya`, `Tiv`, `Tok Pisin`, `Tonga (Tonga Islands)`, `Tonga (Zambia)`, `Tsonga`, `Tswana`, `Tumbuka`, `Turkic languages`, `Turkish`, `Tuvalu`, `Tzotzil`, `Ukrainian`, `Umbundu`, `Uralic languages`, `Urdu`, `Venda`, `Venezuelan Sign Language`, `Vietnamese`, `Wallisian`, `Walloon`, `Waray (Philippines)`, `Welsh`, `West Germanic languages`, `West Slavic languages`, `Western Malayo-Polynesian languages`, `Wolaitta, Wolaytta`, `Wolof`, `Xhosa`, `Yapese`, `Yiddish`, `Yoruba`, `Yucatec Maya, Yucateco`, `Zande (individual language)`, `Zulu`
The complete list of all 11000+ models & pipelines in 230+ languages is available on Models Hub
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==4.2.0
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.0
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.0
M1
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.0
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>4.2.0</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>4.2.0</version>
</dependency>
spark-nlp-m1:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-m1_2.12</artifactId>
<version>4.2.0</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.0.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.0.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.0.jar
- AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.0.jar
What's Changed
Contributors
@maziyarpanahi @suvrat-joshi @danilojsl @josejuanmartinez @ahmedlone127 @Damla-Gurbaz @vankov @xusliebana @DevinTDHa @jsl-builder @Cabir40 @muhammetsnts @wolliq @Meryem1425 @pabla @C-K-Loan @rpranab @agsfer
Full Changelog: 4.1.0...4.2.0