Release Spark NLP 4.0.0: New modern extractive Question answering (QA) annotators for ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa, official support for Apple silicon M1, support oneDNN to improve CPU up to 97%, improved transformers on GPU up to +700%, 1000+ state-of-the-art models, and lots more! · JohnSnowLabs/spark-nlp

Overview

We are very excited to release Spark NLP 4.0.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community! 🎉

This release comes with official support for the Apple silicon M1 chip (for the first time), official support for Spark/PySpark 3.2, support for the oneAPI Deep Neural Network Library (oneDNN) to improve TensorFlow performance on CPU by up to 97%, optimized transformer-based embeddings on GPU with performance gains of up to +700%, brand new extractive transformer-based Question Answering (QA) annotators for tasks like SQuAD based on ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa architectures, 1000+ state-of-the-art models, a WordEmbeddingsModel that now works in clusters without HDFS/DBFS/S3 (such as Kubernetes), new Databricks and EMR support, new NER models achieving the highest F1 score in Spark NLP, and many more enhancements and bug fixes!

We would like to mention that Spark NLP 4.0.0 drops support for Spark 2.3 and 2.4 (Scala 2.11). Starting with 4.0.0, we only support Spark/PySpark 3.x on Scala 2.12.

As always, we would like to thank our community for their feedback, questions, and feature requests.

Major features and improvements

NEW: Support for the oneAPI Deep Neural Network Library (oneDNN) optimizations to improve TensorFlow on CPU. Enabling oneDNN can improve the performance of some transformer-based models by up to 97%. By default, the oneDNN optimizations are turned off. To enable them, set the environment variable TF_ENABLE_ONEDNN_OPTS. On Linux systems, for instance: export TF_ENABLE_ONEDNN_OPTS=1
NEW: Optimizing batch processing for transformer-based Word Embeddings on a GPU device. These optimizations can result in performance improvements up to +700% (more details in the Benchmarks section)
NEW: Official support for Apple silicon M1 on macOS devices. You can use the spark-nlp-m1 package that supports Apple silicon M1 on your macOS machine in Spark NLP 4.0.0
NEW: Introducing AlbertForQuestionAnswering annotator in Spark NLP 🚀. AlbertForQuestionAnswering can load ALBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using AlbertForQuestionAnswering for PyTorch or TFAlbertForQuestionAnswering for TensorFlow models in HuggingFace 🤗
NEW: Introducing BertForQuestionAnswering annotator in Spark NLP 🚀. BertForQuestionAnswering can load BERT & ELECTRA Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using BertForQuestionAnswering and ElectraForQuestionAnswering for PyTorch or TFBertForQuestionAnswering and TFElectraForQuestionAnswering for TensorFlow models in HuggingFace 🤗
NEW: Introducing DeBertaForQuestionAnswering annotator in Spark NLP 🚀. DeBertaForQuestionAnswering can load DeBERTa v2&v3 Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using DebertaV2ForQuestionAnswering for PyTorch or TFDebertaV2ForQuestionAnswering for TensorFlow models in HuggingFace 🤗
NEW: Introducing DistilBertForQuestionAnswering annotator in Spark NLP 🚀. DistilBertForQuestionAnswering can load DistilBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using DistilBertForQuestionAnswering for PyTorch or TFDistilBertForQuestionAnswering for TensorFlow models in HuggingFace 🤗
NEW: Introducing LongformerForQuestionAnswering annotator in Spark NLP 🚀. LongformerForQuestionAnswering can load Longformer Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using LongformerForQuestionAnswering for PyTorch or TFLongformerForQuestionAnswering for TensorFlow models in HuggingFace 🤗
NEW: Introducing RoBertaForQuestionAnswering annotator in Spark NLP 🚀. RoBertaForQuestionAnswering can load RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using RobertaForQuestionAnswering for PyTorch or TFRobertaForQuestionAnswering for TensorFlow models in HuggingFace 🤗
NEW: Introducing XlmRoBertaForQuestionAnswering annotator in Spark NLP 🚀. XlmRoBertaForQuestionAnswering can load XLM-RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using XLMRobertaForQuestionAnswering for PyTorch or TFXLMRobertaForQuestionAnswering for TensorFlow models in HuggingFace 🤗
NEW: Introducing MultiDocumentAssembler annotator for cases where multiple inputs need to be converted to DOCUMENT, such as in the XXXForQuestionAnswering annotators
NEW: Introducing SpanBertCorefModel annotator for Coreference Resolution on BERT and SpanBERT models. It is an implementation of a SpanBERT-based coreference resolution model, based on the paper "BERT for Coreference Resolution: Baselines and Analysis".
NEW: Introducing enableInMemoryStorage parameter in WordEmbeddingsModel annotator. By enabling this parameter the annotator will no longer require a distributed storage to unpack indices and will perform everything in-memory.
Official support for Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP is shipped for Spark 3.2.x by default and additionally supports Spark/PySpark 3.0.x and 3.1.x
Unifying all supported Apache Spark packages on Maven into spark-nlp for CPU, spark-nlp-gpu for GPU, and spark-nlp-m1 for new Apple silicon M1 on macOS. The need for Apache Spark specific packages like spark-nlp-spark32 has been removed.
Adding a new param to sparknlp.start() function in Python and Scala for Apple silicon M1 on macOS (m1=True)
Upgrade TensorFlow to 2.7.1 and start supporting Apple silicon M1
Upgrade RocksDB with new enhancements and support for Apple silicon M1
Upgrade SentencePiece tokenizer TF ops to 2.7.1
Upgrade SentencePiece JNI to v0.1.96 and provide support for Apple silicon M1 on macOS
Upgrade to Scala 2.12.15
Update Colab, Kaggle, and SageMaker scripts
Refactor the entire Python module in Spark NLP to make the development and maintenance easier
Refactor unit tests in Python and migrate to pytest
Welcoming 6x new Databricks runtimes to our Spark NLP family:
- Databricks 10.4 LTS
- Databricks 10.4 LTS ML
- Databricks 10.4 LTS ML GPU
- Databricks 10.5
- Databricks 10.5 ML
- Databricks 10.5 ML GPU
Welcoming a new EMR 6.x series to our Spark NLP family:
- EMR 6.6.0 (Apache Spark 3.2.0 / Hadoop 3.2.1)
Migrate T5Transformer to TensorFlow v2 architecture by re-uploading all the existing models
Support for 2 inputs in LightPipeline with MultiDocumentAssembler
Add new default NerDL graph for xsmall DeBERTa embeddings model (384 dimensions)
Adding annotateJava method to PretrainedPipeline class in Java to facilitate the use of LightPipelines
Allow changing case sensitivity. Previously, users could not set the caseSensitive param via setCaseSensitive. Users can now change this value if the model was saved/uploaded with the wrong case sensitivity parameter (BERT, ALBERT, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, and Longformer for XXXForSequenceClassification and XXXForTokenClassification).
Keep accuracy in ClassifierDL and SentimentDL during the training between 0.0 and 1.0
Preserve the original form of the token in BPE Tokenizer used in RoBERTa annotators (used in embeddings, sequence and token classification)
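
The new QA annotators take two DOCUMENT inputs (a question and a context), which is what MultiDocumentAssembler provides. Here is a minimal Python sketch of how the pieces might be wired together. It assumes Spark NLP 4.0.0 and PySpark are installed and that the featured model distilbert_base_cased_qa_squad2 can be downloaded. The function name answer_question and the column names are illustrative choices, not part of the library:

```python
def answer_question(question, context):
    """Run one question/context pair through an extractive QA pipeline.

    Illustrative helper (not a Spark NLP API). Returns the list of answer
    strings produced by the annotator for the single input row.
    """
    # Imports are local so that defining this function does not require
    # Spark NLP or PySpark to be installed.
    import sparknlp
    from sparknlp.base import MultiDocumentAssembler
    from sparknlp.annotator import DistilBertForQuestionAnswering
    from pyspark.ml import Pipeline

    spark = sparknlp.start()

    # Convert both raw text columns into DOCUMENT annotations.
    assembler = (
        MultiDocumentAssembler()
        .setInputCols(["question", "context"])
        .setOutputCols(["document_question", "document_context"])
    )

    # Span classification head: predicts the answer span inside the context.
    span_classifier = (
        DistilBertForQuestionAnswering
        .pretrained("distilbert_base_cased_qa_squad2", "en")
        .setInputCols(["document_question", "document_context"])
        .setOutputCol("answer")
    )

    pipeline = Pipeline(stages=[assembler, span_classifier])
    data = spark.createDataFrame([[question, context]]).toDF("question", "context")
    result = pipeline.fit(data).transform(data)
    return result.select("answer.result").first()[0]
```

Calling answer_question("What's my name?", "My name is Clara and I live in Berkeley.") would start a local Spark session and download the model on first use. The other XXXForQuestionAnswering annotators should be usable the same way by swapping the annotator class and model name.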

Performance Improvements (Benchmarks)

We have introduced two major performance improvements for GPU and CPU devices in the Spark NLP 4.0.0 release.

The following benchmarks have been done by using a single Dell Server with the following specs:

GPU: Tesla P100 PCIe 12GB - CUDA Version: 11.3 - Driver Version: 465.19.01
CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz - 40 Cores
Memory: 80G

GPU

We have improved our batch processing approach for transformer-based Word Embeddings to improve their performance on a GPU device. These optimizations result in performance improvements up to +700%. The detailed list of improved transformer models on GPU in comparison to Spark NLP 3.4.x:

Model on GPU	Spark NLP 3.4.3 vs. 4.0.0
RoBERTa base	+560%(6.6x)
RoBERTa Large	+332%(4.3x)
Albert Base	+587%(6.9x)
Albert Large	+332%(4.3x)
DistilBERT	+659%(7.6x)
XLM-RoBERTa Base	+638%(7.4x)
XLM-RoBERTa Large	+365%(4.7x)
XLNet Base	+449%(5.5x)
XLNet Large	+267%(3.7x)
DeBERTa Base	+713%(8.1x)
DeBERTa Large	+477%(5.8x)
Longformer Base	+52%(1.5x)
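
Each entry in this table gives the same result twice: a percentage gain and the equivalent speedup factor. The relation is factor = 1 + gain / 100. The short Python sketch below reproduces a few entries. The helper names are illustrative, and the percentages are taken from the table above:

```python
def speedup_factor(percent_gain):
    """Convert a '+N%' improvement into an 'x times faster' factor."""
    return 1.0 + percent_gain / 100.0


def format_entry(percent_gain):
    """Render a gain the way the table does, e.g. '+560%(6.6x)'."""
    return f"+{percent_gain}%({speedup_factor(percent_gain):.1f}x)"


# A few rows from the GPU table (model name -> percentage gain).
gpu_gains = {
    "RoBERTa base": 560,
    "DeBERTa Base": 713,
    "Longformer Base": 52,
}

for model, gain in gpu_gains.items():
    print(f"{model}\t{format_entry(gain)}")
```

So "+700%" means roughly 8 times the original throughput, not 7 times.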

CPU

The oneAPI Deep Neural Network Library (oneDNN) optimizations are now available in Spark NLP 4.0.0 that uses TensorFlow 2.7.1. You can enable those CPU optimizations by setting the environment variable TF_ENABLE_ONEDNN_OPTS=1.
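
From Python, the variable has to be set before the Spark session (and therefore the JVM that runs TensorFlow) is created. A minimal sketch, assuming a local session started from the same Python process: the helper names are illustrative, and forwarding the variable to cluster executors through Spark's spark.executorEnv.* setting is an assumption about your deployment rather than something these release notes prescribe:

```python
import os

ONEDNN_FLAG = "TF_ENABLE_ONEDNN_OPTS"


def enable_onednn(environ=None):
    """Turn the oneDNN flag on in the given environment mapping.

    Defaults to the current process environment, which a locally
    launched Spark JVM inherits.
    """
    if environ is None:
        environ = os.environ
    environ[ONEDNN_FLAG] = "1"
    return environ[ONEDNN_FLAG]


def onednn_executor_conf():
    """Spark config entry that would pass the flag to executor processes."""
    return {f"spark.executorEnv.{ONEDNN_FLAG}": "1"}


# Call this before sparknlp.start() or SparkSession creation.
enable_onednn()
```

Setting the variable after the session is already running is unlikely to have any effect, since TensorFlow reads it at load time.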

Intel has been collaborating with Google to optimize TensorFlow's performance on Intel Xeon processor-based platforms using the Intel oneAPI Deep Neural Network Library (oneDNN), an open-source, cross-platform performance library for DL applications. TensorFlow optimizations are enabled via oneDNN to accelerate key performance-intensive operations such as convolution, matrix multiplication, and batch normalization.

The following table compares the last release, Spark NLP 3.4.3, on CPU against Spark NLP 4.0.0 on CPU with oneDNN enabled.

Model on CPU	3.4.x vs. 4.0.0 with oneDNN
BERT Base	+47%
BERT Large	+42%
RoBERTa Base	+51%
RoBERTa Large	+61%
Albert Base	+83%
Albert Large	+58%
DistilBERT	+80%
XLM-RoBERTa Base	+82%
XLM-RoBERTa Large	+72%
XLNet Base	+50%
XLNet Large	+27%
DeBERTa Base	+59%
DeBERTa Large	+56%
CamemBERT Base	+97%
CamemBERT Large	+65%
Longformer Base	+63%

Bug Fixes

Fix the default pre-trained model for DeBertaForTokenClassification in Scala and Python
Remove a requirement in DocumentNormalizer so that consecutive stage processing can produce empty text annotations without breaking the pipeline
Fix WordSegmenterModel outputting the wrong order of tokens. The regex that groups the tagging format was refactored to preserve the order of segmented outputs (tokens)
Fix encoding sentences not respecting the max sequence length given by a user in XlmRobertaSentenceEmbeddings
Fix encoding sentences by using SentencePiece to calculate the correct tokens indexing
Fix SentencePiece serialization issue when XlmRoBertaEmbeddings and XlmRoBertaSentenceEmbeddings annotators are used from a Fat JAR on GPU
Remove non-existing parameters from DocumentAssembler in Python

Updated Requirements

Java 8 (still supported) or 11
Apache Spark 3.x (3.0, 3.1, and 3.2)
NVIDIA® GPU drivers version 450.80.02 or higher
CUDA® Toolkit 11.2
cuDNN SDK 8.1.0
Scala 2.12.15

Backward Compatibility

Deprecate support for Spark/PySpark 2.3, Spark/PySpark 2.4, and Scala 2.11 #8319
The start() functions in Python and Scala will no longer have spark23, spark24, and spark32 parameters. The default sparknlp.start() works on PySpark 3.0.x, 3.1.x, and 3.2.x without the need for any Spark-related flags
Some models/pipelines which were trained or saved by using Spark and PySpark 2.3/2.4 will no longer work on Spark NLP 4.0.0
Remove json4s-ext dependency to allow the support for all Apache Spark major releases in one build

Models and Pipelines

Spark NLP 4.0.0 comes with 1000+ state-of-the-art pre-trained transformer models in many languages.

New NER Models

nerdl_conll_deberta_large NER model breaks the previously highest F1 on CoNLL03 dev by 1%

Model	Name	Lang	Dev F1
NerDLModel	nerdl_conll_deberta_large	`en`	`96%`
NerDLModel	nerdl_conll_elmo	`en`	`95.6%`
NerDLModel	nerdl_conll_deberta_base	`en`	`94%`

Featured Models

Model	Name	Lang
AlbertForQuestionAnswering	albert_base_qa_squad2	`en`
DeBertaForQuestionAnswering	deberta_v3_xsmall_qa_squad2	`en`
DistilBertForQuestionAnswering	distilbert_base_cased_qa_squad2	`en`
LongformerForQuestionAnswering	longformer_base_base_qa_squad2	`en`
RoBertaForQuestionAnswering	roberta_base_qa_squad2	`en`
XlmRoBertaForQuestionAnswering	xlm_roberta_base_qa_squad2	`en`
DistilBertForQuestionAnswering	distilbert_qa_multi_finedtuned_squad	`pt`
BertForQuestionAnswering	bert_qa_bert_large_cased_squad_v1.1_portuguese	`pt`
BertForQuestionAnswering	bert_qa_chinese_pert_base_mrc	`zh`
BertForQuestionAnswering	bert_qa_arap_qa_bert	`ar`
BertForQuestionAnswering	bert_qa_ainize_klue_bert_base_mrc	`ko`
BertForQuestionAnswering	bert_qa_Part_1_mBERT_Model_E1	`xx`
BertForQuestionAnswering	bert_qa_qacombination_bert_el_Danastos	`el`

Spark NLP covers the following languages:

English, Multilingual, Afrikaans, Afro-Asiatic languages, Albanian, Altaic languages, American Sign Language, Amharic, Arabic, Argentine Sign Language, Armenian, Artificial languages, Atlantic-Congo languages, Austro-Asiatic languages, Austronesian languages, Azerbaijani, Baltic languages, Bantu languages, Basque, Basque (family), Belarusian, Bemba (Zambia), Bengali, Bangla, Berber languages, Bihari, Bislama, Bosnian, Brazilian Sign Language, Breton, Bulgarian, Catalan, Caucasian languages, Cebuano, Celtic languages, Central Bikol, Chichewa, Chewa, Nyanja, Chilean Sign Language, Chinese, Chuukese, Colombian Sign Language, Congo Swahili, Croatian, Cushitic languages, Czech, Danish, Dholuo, Luo (Kenya and Tanzania), Dravidian languages, Dutch, East Slavic languages, Eastern Malayo-Polynesian languages, Efik, Esperanto, Estonian, Ewe, Fijian, Finnish, Finnish Sign Language, Finno-Ugrian languages, French, French-based creoles and pidgins, Ga, Galician, Ganda, Georgian, German, Germanic languages, Gilbertese, Greek (modern), Greek languages, Gujarati, Gun, Haitian, Haitian Creole, Hausa, Hebrew (modern), Hiligaynon, Hindi, Hiri Motu, Hungarian, Icelandic, Igbo, Iloko, Indic languages, Indo-European languages, Indo-Iranian languages, Indonesian, Irish, Isoko, Isthmus Zapotec, Italian, Italic languages, Japanese, Kabyle, Kalaallisut, Greenlandic, Kannada, Kaonde, Kinyarwanda, Kirundi, Kongo, Korean, Kwangali, Kwanyama, Kuanyama, Latin, Latvian, Lingala, Lithuanian, Louisiana Creole, Lozi, Luba-Katanga, Luba-Lulua, Lunda, Lushai, Luvale, Macedonian, Malagasy, Malay, Malayalam, Malayo-Polynesian languages, Maltese, Manx, Marathi (Marāṭhī), Marshallese, Mexican Sign Language, Mon-Khmer languages, Morisyen, Mossi, Multiple languages, Ndonga, Nepali, Niger-Kordofanian languages, Nigerian Pidgin, Niuean, North Germanic languages, Northern Sotho, Pedi, Sepedi, Norwegian, Norwegian Bokmål, Norwegian Nynorsk, Nyaneka, Oromo, Pangasinan, Papiamento, Persian (Farsi), Peruvian Sign Language, Philippine languages, Pijin, Pohnpeian, Polish, Portuguese, Portuguese-based creoles and pidgins, Punjabi (Eastern), Romance languages, Romanian, Rundi, Russian, Ruund, Salishan languages, Samoan, San Salvador Kongo, Sango, Semitic languages, Serbo-Croatian, Seselwa Creole French, Shona, Sindhi, Sino-Tibetan languages, Slavic languages, Slovak, Slovene, Somali, South Caucasian languages, South Slavic languages, Southern Sotho, Spanish, Spanish Sign Language, Sranan Tongo, Swahili, Swati, Swedish, Tagalog, Tahitian, Tai, Tamil, Telugu, Tetela, Tetun Dili, Thai, Tigrinya, Tiv, Tok Pisin, Tonga (Tonga Islands), Tonga (Zambia), Tsonga, Tswana, Tumbuka, Turkic languages, Turkish, Tuvalu, Tzotzil, Ukrainian, Umbundu, Uralic languages, Urdu, Venda, Venezuelan Sign Language, Vietnamese, Wallisian, Walloon, Waray (Philippines), Welsh, West Germanic languages, West Slavic languages, Western Malayo-Polynesian languages, Wolaitta, Wolaytta, Wolof, Xhosa, Yapese, Yiddish, Yoruba, Yucatec Maya, Yucateco, Zande (individual language), Zulu

The complete list of all 6000+ models & pipelines in 230+ languages is available on Models Hub

New Notebooks

Import hundreds of models in different languages to Spark NLP

Spark NLP	HuggingFace Notebooks	Colab
AlbertForQuestionAnswering	HuggingFace in Spark NLP - AlbertForQuestionAnswering
BertForQuestionAnswering	HuggingFace in Spark NLP - BertForQuestionAnswering
DeBertaForQuestionAnswering	HuggingFace in Spark NLP - DeBertaForQuestionAnswering
DistilBertForQuestionAnswering	HuggingFace in Spark NLP - DistilBertForQuestionAnswering
LongformerForQuestionAnswering	HuggingFace in Spark NLP - LongformerForQuestionAnswering
RoBertaForQuestionAnswering	HuggingFace in Spark NLP - RoBertaForQuestionAnswering
XlmRoBertaForQuestionAnswering	HuggingFace in Spark NLP - XlmRobertaForQuestionAnswering

You can visit Import Transformers in Spark NLP for more info

Documentation

Serving Spark NLP via API in Java
TF Hub & HuggingFace to Spark NLP
Models Hub with new models
Spark NLP documentation
Spark NLP Scala APIs
Spark NLP Python APIs
Spark NLP Workshop notebooks
Spark NLP publications
Spark NLP in Action
Spark NLP training certification notebooks for Google Colab and Databricks
Spark NLP Display for visualization of different types of annotations
Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

#PyPI

pip install spark-nlp==4.0.0

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, and 3.2.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.0

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.0
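
The three packages differ only in the artifact name, so the right coordinate can be picked programmatically when building a session by hand. The sketch below is a hedged Python example: maven_coordinate and build_session are illustrative helpers (not part of Spark NLP), and build_session assumes PySpark is installed and uses Spark's standard spark.jars.packages setting:

```python
SPARK_NLP_VERSION = "4.0.0"
SCALA_VERSION = "2.12"

# Artifact names as listed in these release notes.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "m1": "spark-nlp-m1",
}


def maven_coordinate(device="cpu"):
    """Return the Maven coordinate for the given device type."""
    if device not in ARTIFACTS:
        raise ValueError(f"unknown device {device!r}; expected one of {sorted(ARTIFACTS)}")
    return f"com.johnsnowlabs.nlp:{ARTIFACTS[device]}_{SCALA_VERSION}:{SPARK_NLP_VERSION}"


def build_session(device="cpu"):
    """Create a local SparkSession that pulls in the matching package."""
    # Local import so the coordinate helper works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("Spark NLP")
        .master("local[*]")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.jars.packages", maven_coordinate(device))
        .getOrCreate()
    )
```

In most cases sparknlp.start() (optionally with m1=True on Apple silicon) should be enough. A manual builder like this is mainly useful when additional Spark settings have to be supplied.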

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, and 3.2.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.0.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.0.0</version>
</dependency>

spark-nlp-m1:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-m1_2.12</artifactId>
    <version>4.0.0</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.0.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.0.0.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.0.0.jar

What's Changed

Full Changelog: 3.4.4...4.0.0

@vankov @mahmoodbayeshi @Ahmetemintek @DevinTDHa @albertoandreottiATgmail @KshitizGIT @jsl-models @gokhanturer @josejuanmartinez @murat-gunay @rpranab @wolliq @bunyamin-polat @pabla @danilojsl @agsfer @Meryem1425 @gadde5300 @muhammetsnts @Damla-Gurbaz @maziyarpanahi @jsl-builder @Cabir40 @suvrat-joshi
