
Spark NLP 4.0.0: New modern extractive Question answering (QA) annotators for ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa, official support for Apple silicon M1, support oneDNN to improve CPU up to 97%, improved transformers on GPU up to +700%, 1000+ state-of-the-art models, and lots more!

@maziyarpanahi maziyarpanahi released this 15 Jun 17:38
· 1523 commits to master since this release

Overview

We are very excited to release Spark NLP 4.0.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community! 🎉

This release comes with official support for the Apple silicon M1 chip (for the first time), official support for Spark/PySpark 3.2, support for the oneAPI Deep Neural Network Library (oneDNN) to improve TensorFlow performance on CPU by up to 97%, optimized transformer-based embeddings on GPU with performance gains of up to +700%, brand-new extractive transformer-based Question Answering (QA) annotators for tasks like SQuAD based on the ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa architectures, 1000+ state-of-the-art models, a WordEmbeddingsModel that now works in clusters without HDFS/DBFS/S3 (such as Kubernetes), new Databricks and EMR support, new NER models achieving the highest F1 score in Spark NLP, and many more enhancements and bug fixes!

Please note that Spark NLP 4.0.0 drops support for Spark 2.3 and 2.4 (Scala 2.11). Starting with 4.0.0, we only support Spark/PySpark 3.x on Scala 2.12.

As always, we would like to thank our community for their feedback, questions, and feature requests.


Major features and improvements

  • NEW: Support for the oneAPI Deep Neural Network Library (oneDNN) optimizations to improve TensorFlow on CPU. Enabling oneDNN can improve the performance of some transformer-based models by up to 97%. By default, the oneDNN optimizations are turned off. To enable them, set the environment variable TF_ENABLE_ONEDNN_OPTS. On Linux systems, for instance: export TF_ENABLE_ONEDNN_OPTS=1
  • NEW: Optimizing batch processing for transformer-based Word Embeddings on a GPU device. These optimizations can result in performance improvements up to +700% (more details in the Benchmarks section)
  • NEW: Official support for Apple silicon M1 on macOS devices. You can use the spark-nlp-m1 package that supports Apple silicon M1 on your macOS machine in Spark NLP 4.0.0
  • NEW: Introducing AlbertForQuestionAnswering annotator in Spark NLP 🚀. AlbertForQuestionAnswering can load ALBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using AlbertForQuestionAnswering for PyTorch or TFAlbertForQuestionAnswering for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing BertForQuestionAnswering annotator in Spark NLP 🚀. BertForQuestionAnswering can load BERT & ELECTRA Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using BertForQuestionAnswering and ElectraForQuestionAnswering for PyTorch or TFBertForQuestionAnswering and TFElectraForQuestionAnswering for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing DeBertaForQuestionAnswering annotator in Spark NLP 🚀. DeBertaForQuestionAnswering can load DeBERTa v2&v3 Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using DebertaV2ForQuestionAnswering for PyTorch or TFDebertaV2ForQuestionAnswering for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing DistilBertForQuestionAnswering annotator in Spark NLP 🚀. DistilBertForQuestionAnswering can load DistilBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using DistilBertForQuestionAnswering for PyTorch or TFDistilBertForQuestionAnswering for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing LongformerForQuestionAnswering annotator in Spark NLP 🚀. LongformerForQuestionAnswering can load Longformer Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using LongformerForQuestionAnswering for PyTorch or TFLongformerForQuestionAnswering for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing RoBertaForQuestionAnswering annotator in Spark NLP 🚀. RoBertaForQuestionAnswering can load RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using RobertaForQuestionAnswering for PyTorch or TFRobertaForQuestionAnswering for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing XlmRoBertaForQuestionAnswering annotator in Spark NLP 🚀. XlmRoBertaForQuestionAnswering can load XLM-RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using XLMRobertaForQuestionAnswering for PyTorch or TFXLMRobertaForQuestionAnswering for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing MultiDocumentAssembler annotator for cases where multiple inputs need to be converted to DOCUMENT, such as in XXXForQuestionAnswering annotators
  • NEW: Introducing SpanBertCorefModel annotator for coreference resolution with BERT and SpanBERT models, based on the paper "BERT for Coreference Resolution: Baselines and Analysis". It is an implementation of a SpanBERT-based coreference resolution model.
  • NEW: Introducing enableInMemoryStorage parameter in WordEmbeddingsModel annotator. With this parameter enabled, the annotator no longer requires distributed storage to unpack indices and performs everything in memory.
  • Official support for Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP is shipped for Spark 3.2.x by default and additionally supports Spark/PySpark 3.0.x and 3.1.x
  • Unifying all supported Apache Spark packages on Maven into spark-nlp for CPU, spark-nlp-gpu for GPU, and spark-nlp-m1 for new Apple silicon M1 on macOS. The need for Apache Spark specific packages like spark-nlp-spark32 has been removed.
  • Adding a new param to sparknlp.start() function in Python and Scala for Apple silicon M1 on macOS (m1=True)
  • Upgrade TensorFlow to 2.7.1 and start supporting Apple silicon M1
  • Upgrade RocksDB with new enhancements and support for Apple silicon M1
  • Upgrade SentencePiece tokenizer TF ops to 2.7.1
  • Upgrade SentencePiece JNI to v0.1.96 and provide support for Apple silicon M1 on macOS
  • Upgrade to Scala 2.12.15
  • Update Colab, Kaggle, and SageMaker scripts
  • Refactor the entire Python module in Spark NLP to make the development and maintenance easier
  • Refactor unit tests in Python and migrate to pytest
  • Welcoming 6x new Databricks runtimes to our Spark NLP family:
    • Databricks 10.4 LTS
    • Databricks 10.4 LTS ML
    • Databricks 10.4 LTS ML GPU
    • Databricks 10.5
    • Databricks 10.5 ML
    • Databricks 10.5 ML GPU
  • Welcoming a new EMR 6.x series to our Spark NLP family:
    • EMR 6.6.0 (Apache Spark 3.2.0 / Hadoop 3.2.1)
  • Migrate T5Transformer to TensorFlow v2 architecture by re-uploading all the existing models
  • Support for 2 inputs in LightPipeline with MultiDocumentAssembler
  • Add new default NerDL graph for xsmall DeBERTa embeddings model (384 dimensions)
  • Adding annotateJava method to PretrainedPipeline class in Java to facilitate the use of LightPipelines
  • Allow changing case sensitivity. Previously, users could not call setCaseSensitive on these annotators. This change lets users correct the value if the model was saved/uploaded with the wrong case sensitivity parameter (BERT, ALBERT, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, and Longformer for XXXForSequenceClassification and XXXForTokenClassification).
  • Keep accuracy in ClassifierDL and SentimentDL during the training between 0.0 and 1.0
  • Preserve the original form of the token in BPE Tokenizer used in RoBERTa annotators (used in embeddings, sequence and token classification)
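
The new QA annotators and MultiDocumentAssembler listed above are meant to be used together. The following is a minimal sketch in Python, assuming spark-nlp 4.0.0 and PySpark are installed. The function names build_qa_pipeline and answer, and the column names, are illustrative choices, not part of the release. Imports are done inside the functions so the snippet can be loaded without Spark present.

```python
def build_qa_pipeline():
    """Assemble a two-input extractive QA pipeline (requires spark-nlp 4.0.0+)."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import MultiDocumentAssembler
    from sparknlp.annotator import BertForQuestionAnswering

    # Turn the raw "question" and "context" columns into two DOCUMENT columns.
    assembler = (
        MultiDocumentAssembler()
        .setInputCols(["question", "context"])
        .setOutputCols(["document_question", "document_context"])
    )

    # Load the annotator's default pretrained model (downloaded on first use).
    span_classifier = (
        BertForQuestionAnswering.pretrained()
        .setInputCols(["document_question", "document_context"])
        .setOutputCol("answer")
    )
    return Pipeline(stages=[assembler, span_classifier])


def answer(spark, question, context):
    """Run the pipeline on one (question, context) pair.

    Returns the list of predicted answer strings from the "answer" column.
    """
    data = spark.createDataFrame([[question, context]]).toDF("question", "context")
    result = build_qa_pipeline().fit(data).transform(data)
    return result.select("answer.result").first()[0]
```

With a session from sparknlp.start(), a call such as answer(spark, "What's my name?", "My name is Clara and I live in Berkeley.") should return the extracted span. Any of the other XXXForQuestionAnswering annotators can presumably be swapped in the same way.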

Performance Improvements (Benchmarks)

We have introduced two major performance improvements for GPU and CPU devices in the Spark NLP 4.0.0 release.

The following benchmarks were run on a single Dell server with the following specs:

  • GPU: Tesla P100 PCIe 12GB - CUDA Version: 11.3 - Driver Version: 465.19.01
  • CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz - 40 Cores
  • Memory: 80G

GPU

We have improved our batch processing approach for transformer-based Word Embeddings to improve their performance on a GPU device. These optimizations result in performance improvements up to +700%. The detailed list of improved transformer models on GPU in comparison to Spark NLP 3.4.x:

| Model on GPU | Spark NLP 3.4.3 vs. 4.0.0 |
| --- | --- |
| RoBERTa Base | +560% (6.6x) |
| RoBERTa Large | +332% (4.3x) |
| Albert Base | +587% (6.9x) |
| Albert Large | +332% (4.3x) |
| DistilBERT | +659% (7.6x) |
| XLM-RoBERTa Base | +638% (7.4x) |
| XLM-RoBERTa Large | +365% (4.7x) |
| XLNet Base | +449% (5.5x) |
| XLNet Large | +267% (3.7x) |
| DeBERTa Base | +713% (8.1x) |
| DeBERTa Large | +477% (5.8x) |
| Longformer Base | +52% (1.5x) |
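
The table above pairs each percentage with a multiplier. As a small worked example of how the two relate (the helper name speedup_factor is hypothetical, and the formula is inferred from the paired values in the table), an improvement of +N% over the baseline corresponds to (100 + N) / 100 times the baseline:

```python
def speedup_factor(improvement_pct):
    """Convert a relative improvement in percent to a speedup multiplier."""
    # +N% on top of the baseline (100%) means (100 + N) / 100 times the baseline.
    return round((100 + improvement_pct) / 100, 1)


print(speedup_factor(560))  # RoBERTa Base row: prints 6.6
```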

[Figure: Spark NLP 3.4 vs. Spark NLP 4.0 on GPU]

CPU

The oneAPI Deep Neural Network Library (oneDNN) optimizations are now available in Spark NLP 4.0.0, which uses TensorFlow 2.7.1. You can enable these CPU optimizations by setting the environment variable TF_ENABLE_ONEDNN_OPTS=1.

Intel has been collaborating with Google to optimize TensorFlow performance on Intel Xeon processor-based platforms using the Intel oneAPI Deep Neural Network Library (oneDNN), an open-source, cross-platform performance library for DL applications. TensorFlow optimizations are enabled via oneDNN to accelerate key performance-intensive operations such as convolution, matrix multiplication, and batch normalization.

The following table compares the last release, Spark NLP 3.4.3, on CPU with Spark NLP 4.0.0 on CPU with oneDNN enabled.
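
For Python users, the environment variable described above can also be set from code. This is a minimal sketch with hypothetical helper names (enable_onednn, executor_env_conf). It assumes the variable is set before the Spark session and its JVM are created, and that on a cluster the flag also has to reach the executors, for which Spark's spark.executorEnv.* settings are one option. Whether this is sufficient depends on how Spark is launched in your environment.

```python
import os

ONEDNN_FLAG = "TF_ENABLE_ONEDNN_OPTS"


def enable_onednn(env=None):
    """Set the oneDNN flag in the given mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    env[ONEDNN_FLAG] = "1"
    return env


def executor_env_conf():
    """Spark config entry that forwards the flag to executor processes."""
    return {"spark.executorEnv." + ONEDNN_FLAG: "1"}


# Call this before the Spark session (and its JVM) is created.
enable_onednn()
```

The dictionary returned by executor_env_conf() can be passed as key/value pairs to SparkSession.builder.config(...) when building a session manually.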

| Model on CPU | 3.4.x vs. 4.0.0 with oneDNN |
| --- | --- |
| BERT Base | +47% |
| BERT Large | +42% |
| RoBERTa Base | +51% |
| RoBERTa Large | +61% |
| Albert Base | +83% |
| Albert Large | +58% |
| DistilBERT | +80% |
| XLM-RoBERTa Base | +82% |
| XLM-RoBERTa Large | +72% |
| XLNet Base | +50% |
| XLNet Large | +27% |
| DeBERTa Base | +59% |
| DeBERTa Large | +56% |
| CamemBERT Base | +97% |
| CamemBERT Large | +65% |
| Longformer Base | +63% |

[Figure: Spark NLP 3.4 on CPU vs. Spark NLP 4.0 on CPU with oneDNN]


Bug Fixes

  • Fix the default pre-trained model for DeBertaForTokenClassification in Scala and Python
  • Remove a requirement in DocumentNormalizer so that consecutive stages can produce empty text annotations without breaking the pipeline
  • Fix WordSegmenterModel outputting the wrong order of tokens. The regex that groups the tagging format was refactored to preserve the order of segmented outputs (tokens)
  • Fix encoding sentences not respecting the max sequence length given by a user in XlmRobertaSentenceEmbeddings
  • Fix encoding sentences by using SentencePiece to calculate the correct tokens indexing
  • Fix SentencePiece serialization issue when XlmRoBertaEmbeddings and XlmRoBertaSentenceEmbeddings annotators are used from a Fat JAR on GPU
  • Remove non-existing parameters from DocumentAssembler in Python

Updated Requirements

  • Java 8 (still supported) or 11
  • Apache Spark 3.x (3.0, 3.1, and 3.2)
  • NVIDIA® GPU drivers version 450.80.02 or higher
  • CUDA® Toolkit 11.2
  • cuDNN SDK 8.1.0
  • Scala 2.12.15

Backward Compatibility

  • Deprecate support for Spark/PySpark 2.3, Spark/PySpark 2.4, and Scala 2.11 #8319
  • The start() functions in Python and Scala will no longer have spark23, spark24, and spark32 parameters. The default sparknlp.start() works on PySpark 3.0.x, 3.1.x, and 3.2.x without the need for any Spark-related flags
  • Some models/pipelines which were trained or saved by using Spark and PySpark 2.3/2.4 will no longer work on Spark NLP 4.0.0
  • Remove json4s-ext dependency to allow the support for all Apache Spark major releases in one build
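
With the Spark-version flags gone, the remaining choice when starting a session is the device. The sketch below uses the gpu and m1 parameters of sparknlp.start() mentioned in these notes. The helper names start_kwargs and start_session, and the device labels "cpu", "gpu", and "m1", are hypothetical and only serve to illustrate the call.

```python
def start_kwargs(device="cpu"):
    """Map a device label to keyword arguments for sparknlp.start()."""
    flags = {"cpu": {}, "gpu": {"gpu": True}, "m1": {"m1": True}}
    if device not in flags:
        raise ValueError(f"unknown device: {device!r}")
    return flags[device]


def start_session(device="cpu"):
    """Start a Spark NLP session for the given device (requires spark-nlp 4.0.0)."""
    import sparknlp  # imported lazily so the helper above works without Spark

    # No spark23/spark24/spark32 flags are needed on PySpark 3.0.x to 3.2.x.
    return sparknlp.start(**start_kwargs(device))
```

For example, start_session("m1") is equivalent to calling sparknlp.start(m1=True) on an Apple silicon machine.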

Models and Pipelines

Spark NLP 4.0.0 comes with 1000+ state-of-the-art pre-trained transformer models in many languages.

New NER Models

nerdl_conll_deberta_large NER model breaks the previously highest F1 on CoNLL03 dev by 1%

| Model | Name | Lang | Dev F1 |
| --- | --- | --- | --- |
| NerDLModel | nerdl_conll_deberta_large | en | 96% |
| NerDLModel | nerdl_conll_elmo | en | 95.6% |
| NerDLModel | nerdl_conll_deberta_base | en | 94% |

Featured Models

| Model | Name | Lang |
| --- | --- | --- |
| AlbertForQuestionAnswering | albert_base_qa_squad2 | en |
| DeBertaForQuestionAnswering | deberta_v3_xsmall_qa_squad2 | en |
| DistilBertForQuestionAnswering | distilbert_base_cased_qa_squad2 | en |
| LongformerForQuestionAnswering | longformer_base_base_qa_squad2 | en |
| RoBertaForQuestionAnswering | roberta_base_qa_squad2 | en |
| XlmRoBertaForQuestionAnswering | xlm_roberta_base_qa_squad2 | en |
| DistilBertForQuestionAnswering | distilbert_qa_multi_finedtuned_squad | pt |
| BertForQuestionAnswering | bert_qa_bert_large_cased_squad_v1.1_portuguese | pt |
| BertForQuestionAnswering | bert_qa_chinese_pert_base_mrc | zh |
| BertForQuestionAnswering | bert_qa_arap_qa_bert | ar |
| BertForQuestionAnswering | bert_qa_ainize_klue_bert_base_mrc | ko |
| BertForQuestionAnswering | bert_qa_Part_1_mBERT_Model_E1 | xx |
| BertForQuestionAnswering | bert_qa_qacombination_bert_el_Danastos | el |

Spark NLP covers the following languages:

English ,Multilingual ,Afrikaans ,Afro-Asiatic languages ,Albanian ,Altaic languages ,American Sign Language ,Amharic ,Arabic ,Argentine Sign Language ,Armenian ,Artificial languages ,Atlantic-Congo languages ,Austro-Asiatic languages ,Austronesian languages ,Azerbaijani ,Baltic languages ,Bantu languages ,Basque ,Basque (family) ,Belarusian ,Bemba (Zambia) ,Bengali, Bangla ,Berber languages ,Bihari ,Bislama ,Bosnian ,Brazilian Sign Language ,Breton ,Bulgarian ,Catalan ,Caucasian languages ,Cebuano ,Celtic languages ,Central Bikol ,Chichewa, Chewa, Nyanja ,Chilean Sign Language ,Chinese ,Chuukese ,Colombian Sign Language ,Congo Swahili ,Croatian ,Cushitic languages ,Czech ,Danish ,Dholuo, Luo (Kenya and Tanzania) ,Dravidian languages ,Dutch ,East Slavic languages ,Eastern Malayo-Polynesian languages ,Efik ,Esperanto ,Estonian ,Ewe ,Fijian ,Finnish ,Finnish Sign Language ,Finno-Ugrian languages ,French ,French-based creoles and pidgins ,Ga ,Galician ,Ganda ,Georgian ,German ,Germanic languages ,Gilbertese ,Greek (modern) ,Greek languages ,Gujarati ,Gun ,Haitian, Haitian Creole ,Hausa ,Hebrew (modern) ,Hiligaynon ,Hindi ,Hiri Motu ,Hungarian ,Icelandic ,Igbo ,Iloko ,Indic languages ,Indo-European languages ,Indo-Iranian languages ,Indonesian ,Irish ,Isoko ,Isthmus Zapotec ,Italian ,Italic languages ,Japanese ,Japanese ,Kabyle ,Kalaallisut, Greenlandic ,Kannada ,Kaonde ,Kinyarwanda ,Kirundi ,Kongo ,Korean ,Kwangali ,Kwanyama, Kuanyama ,Latin ,Latvian ,Lingala ,Lithuanian ,Louisiana Creole ,Lozi ,Luba-Katanga ,Luba-Lulua ,Lunda ,Lushai ,Luvale ,Macedonian ,Malagasy ,Malay ,Malayalam ,Malayo-Polynesian languages ,Maltese ,Manx ,Marathi (Marāṭhī) ,Marshallese ,Mexican Sign Language ,Mon-Khmer languages ,Morisyen ,Mossi ,Multiple languages ,Ndonga ,Nepali ,Niger-Kordofanian languages ,Nigerian Pidgin ,Niuean ,North Germanic languages ,Northern Sotho, Pedi, Sepedi ,Norwegian ,Norwegian Bokmål ,Norwegian Nynorsk ,Nyaneka ,Oromo ,Pangasinan ,Papiamento ,Persian (Farsi) 
,Peruvian Sign Language ,Philippine languages ,Pijin ,Pohnpeian ,Polish ,Portuguese ,Portuguese-based creoles and pidgins ,Punjabi (Eastern) ,Romance languages ,Romanian ,Rundi ,Russian ,Ruund ,Salishan languages ,Samoan ,San Salvador Kongo ,Sango ,Semitic languages ,Serbo-Croatian ,Seselwa Creole French ,Shona ,Sindhi ,Sino-Tibetan languages ,Slavic languages ,Slovak ,Slovene ,Somali ,South Caucasian languages ,South Slavic languages ,Southern Sotho ,Spanish ,Spanish Sign Language ,Sranan Tongo ,Swahili ,Swati ,Swedish ,Tagalog ,Tahitian ,Tai ,Tamil ,Telugu ,Tetela ,Tetun Dili ,Thai ,Tigrinya ,Tiv ,Tok Pisin ,Tonga (Tonga Islands) ,Tonga (Zambia) ,Tsonga ,Tswana ,Tumbuka ,Turkic languages ,Turkish ,Tuvalu ,Tzotzil ,Ukrainian ,Umbundu ,Uralic languages ,Urdu ,Venda ,Venezuelan Sign Language ,Vietnamese ,Wallisian ,Walloon ,Waray (Philippines) ,Welsh ,West Germanic languages ,West Slavic languages ,Western Malayo-Polynesian languages ,Wolaitta, Wolaytta ,Wolof ,Xhosa ,Yapese ,Yiddish ,Yoruba ,Yucatec Maya, Yucateco ,Zande (individual language) ,Zulu

The complete list of all 6000+ models & pipelines in 230+ languages is available on Models Hub

New Notebooks

Import hundreds of models in different languages to Spark NLP

| Spark NLP | HuggingFace Notebooks | Colab |
| --- | --- | --- |
| AlbertForQuestionAnswering | HuggingFace in Spark NLP - AlbertForQuestionAnswering | Open In Colab |
| BertForQuestionAnswering | HuggingFace in Spark NLP - BertForQuestionAnswering | Open In Colab |
| DeBertaForQuestionAnswering | HuggingFace in Spark NLP - DeBertaForQuestionAnswering | Open In Colab |
| DistilBertForQuestionAnswering | HuggingFace in Spark NLP - DistilBertForQuestionAnswering | Open In Colab |
| LongformerForQuestionAnswering | HuggingFace in Spark NLP - LongformerForQuestionAnswering | Open In Colab |
| RoBertaForQuestionAnswering | HuggingFace in Spark NLP - RoBertaForQuestionAnswering | Open In Colab |
| XlmRoBertaForQuestionAnswering | HuggingFace in Spark NLP - XlmRoBertaForQuestionAnswering | Open In Colab |

You can visit Import Transformers in Spark NLP for more info


Documentation


Installation

Python

#PyPI

pip install spark-nlp==4.0.0

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, and 3.2.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.0

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.0
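
The same package coordinates can be supplied from Python when building a SparkSession by hand instead of using the command-line flags. This is a sketch, assuming PySpark is installed and the machine can reach Maven Central. The helper names package_coordinate and build_session and the variant labels are hypothetical. spark.jars.packages is the standard Spark setting behind the --packages flag.

```python
# Maven artifact names for the three unified packages listed above.
ARTIFACTS = {"cpu": "spark-nlp", "gpu": "spark-nlp-gpu", "m1": "spark-nlp-m1"}


def package_coordinate(variant="cpu", version="4.0.0", scala="2.12"):
    """Build the Maven coordinate for a Spark NLP package variant."""
    return f"com.johnsnowlabs.nlp:{ARTIFACTS[variant]}_{scala}:{version}"


def build_session(variant="cpu"):
    """Create a local SparkSession that pulls in the chosen Spark NLP package."""
    from pyspark.sql import SparkSession  # imported lazily; requires PySpark 3.x

    return (
        SparkSession.builder
        .appName("Spark NLP")
        .master("local[*]")
        .config("spark.jars.packages", package_coordinate(variant))
        .getOrCreate()
    )
```

For instance, package_coordinate("gpu") yields the coordinate used in the GPU commands above.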

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, and 3.2.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.0.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.0.0</version>
</dependency>

spark-nlp-m1:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-m1_2.12</artifactId>
    <version>4.0.0</version>
</dependency>

FAT JARs

What's Changed

Full Changelog: 3.4.4...4.0.0

@vankov @mahmoodbayeshi @Ahmetemintek @DevinTDHa @albertoandreottiATgmail @KshitizGIT @jsl-models @gokhanturer @josejuanmartinez @murat-gunay @rpranab @wolliq @bunyamin-polat @pabla @danilojsl @agsfer @Meryem1425 @gadde5300 @muhammetsnts @Damla-Gurbaz @maziyarpanahi @jsl-builder @Cabir40 @suvrat-joshi