Spark NLP 4.0.0: New modern extractive Question answering (QA) annotators for ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa, official support for Apple silicon M1, support oneDNN to improve CPU up to 97%, improved transformers on GPU up to +700%, 1000+ state-of-the-art models, and lots more!
Overview
We are very excited to release Spark NLP 4.0.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community! 🎉
This release comes with official support for Apple silicon M1 chip (for the first time), official support for Spark/PySpark 3.2, support oneAPI Deep Neural Network Library (oneDNN) to improve TensorFlow on CPU up to 97%, optimized transformer-based embeddings on GPU to increase the performance up to +700%, brand new modern extractive transformer-based Question answering (QA) annotators for tasks like SQuAD based on ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa architectures, 1000+ state-of-the-art models, WordEmbeddingsModel now works in clusters without HDFS/DBFS/S3 such as Kubernetes, new Databricks and EMR support, new NER models achieving highest F1 score in Spark NLP, and many more enhancements and bug fixes!
We would like to mention that Spark NLP 4.0.0 drops support for Spark 2.3 and 2.4 (Scala 2.11). Starting with 4.0.0, we only support Spark/PySpark 3.x on Scala 2.12.
As always, we would like to thank our community for their feedback, questions, and feature requests.
Major features and improvements
- NEW: Support for the oneAPI Deep Neural Network Library (oneDNN) optimizations to improve TensorFlow on CPU. Enabling oneDNN can improve some transformer-based models by up to 97%. By default, the oneDNN optimizations are turned off. To enable them, set the environment variable TF_ENABLE_ONEDNN_OPTS. On Linux systems, for instance:
export TF_ENABLE_ONEDNN_OPTS=1
- NEW: Optimizing batch processing for transformer-based Word Embeddings on a GPU device. These optimizations can result in performance improvements up to +700% (more details in the Benchmarks section)
- NEW: Official support for Apple silicon M1 on macOS devices. You can use the `spark-nlp-m1` package that supports Apple silicon M1 on your macOS machine in Spark NLP 4.0.0
- NEW: Introducing AlbertForQuestionAnswering annotator in Spark NLP 🚀. `AlbertForQuestionAnswering` can load ALBERT models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `AlbertForQuestionAnswering` for PyTorch or `TFAlbertForQuestionAnswering` for TensorFlow models in HuggingFace 🤗
- NEW: Introducing BertForQuestionAnswering annotator in Spark NLP 🚀. `BertForQuestionAnswering` can load BERT & ELECTRA models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `BertForQuestionAnswering` and `ElectraForQuestionAnswering` for PyTorch or `TFBertForQuestionAnswering` and `TFElectraForQuestionAnswering` for TensorFlow models in HuggingFace 🤗
- NEW: Introducing DeBertaForQuestionAnswering annotator in Spark NLP 🚀. `DeBertaForQuestionAnswering` can load DeBERTa v2 & v3 models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2ForQuestionAnswering` for PyTorch or `TFDebertaV2ForQuestionAnswering` for TensorFlow models in HuggingFace 🤗
- NEW: Introducing DistilBertForQuestionAnswering annotator in Spark NLP 🚀. `DistilBertForQuestionAnswering` can load DistilBERT models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `DistilBertForQuestionAnswering` for PyTorch or `TFDistilBertForQuestionAnswering` for TensorFlow models in HuggingFace 🤗
- NEW: Introducing LongformerForQuestionAnswering annotator in Spark NLP 🚀. `LongformerForQuestionAnswering` can load Longformer models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `LongformerForQuestionAnswering` for PyTorch or `TFLongformerForQuestionAnswering` for TensorFlow models in HuggingFace 🤗
- NEW: Introducing RoBertaForQuestionAnswering annotator in Spark NLP 🚀. `RoBertaForQuestionAnswering` can load RoBERTa models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `RobertaForQuestionAnswering` for PyTorch or `TFRobertaForQuestionAnswering` for TensorFlow models in HuggingFace 🤗
- NEW: Introducing XlmRoBertaForQuestionAnswering annotator in Spark NLP 🚀. `XlmRoBertaForQuestionAnswering` can load XLM-RoBERTa models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `XLMRobertaForQuestionAnswering` for PyTorch or `TFXLMRobertaForQuestionAnswering` for TensorFlow models in HuggingFace 🤗
- NEW: Introducing MultiDocumentAssembler annotator for cases where multiple inputs need to be converted to DOCUMENT, such as in the XXXForQuestionAnswering annotators
- NEW: Introducing SpanBertCorefModel annotator for Coreference Resolution on BERT and SpanBERT models. It is an implementation of a SpanBERT-based coreference resolution model, based on the paper "BERT for Coreference Resolution: Baselines and Analysis".
- NEW: Introducing the `enableInMemoryStorage` parameter in the `WordEmbeddingsModel` annotator. By enabling this parameter, the annotator will no longer require distributed storage to unpack indices and will perform everything in memory.
- Official support for Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP is shipped for Spark 3.2.x by default and additionally supports Spark/PySpark 3.0.x and 3.1.x
- Unifying all supported Apache Spark packages on Maven into `spark-nlp` for CPU, `spark-nlp-gpu` for GPU, and `spark-nlp-m1` for new Apple silicon M1 on macOS. The need for Apache Spark specific packages like `spark-nlp-spark32` has been removed.
- Adding a new param to the `sparknlp.start()` function in Python and Scala for Apple silicon M1 on macOS (`m1=True`)
- Upgrade TensorFlow to 2.7.1 and start supporting Apple silicon M1
- Upgrade RocksDB with new enhancements and support for Apple silicon M1
- Upgrade SentencePiece tokenizer TF ops to 2.7.1
- Upgrade SentencePiece JNI to v0.1.96 and provide support for Apple silicon M1 on macOS
- Upgrade to Scala 2.12.15
- Update Colab, Kaggle, and SageMaker scripts
- Refactor the entire Python module in Spark NLP to make the development and maintenance easier
- Refactor unit tests in Python and migrate to pytest
- Welcoming 6x new Databricks runtimes to our Spark NLP family:
- Databricks 10.4 LTS
- Databricks 10.4 LTS ML
- Databricks 10.4 LTS ML GPU
- Databricks 10.5
- Databricks 10.5 ML
- Databricks 10.5 ML GPU
- Welcoming a new EMR 6.x series to our Spark NLP family:
- EMR 6.6.0 (Apache Spark 3.2.0 / Hadoop 3.2.1)
- Migrate T5Transformer to TensorFlow v2 architecture by re-uploading all the existing models
- Support for 2 inputs in LightPipeline with MultiDocumentAssembler
- Add new default NerDL graph for xsmall DeBERTa embeddings model (384 dimensions)
- Adding annotateJava method to PretrainedPipeline class in Java to facilitate the use of LightPipelines
- Allow change of case sensitivity. Until now, the user could not set the `setCaseSensitive` param. This change allows users to change this value if the model was saved/uploaded with the wrong case sensitivity parameter (BERT, ALBERT, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, and Longformer for XXXForSequenceClassification and XXXForTokenClassification).
- Keep accuracy in ClassifierDL and SentimentDL during the training between 0.0 and 1.0
- Preserve the original form of the token in BPE Tokenizer used in RoBERTa annotators (used in embeddings, sequence and token classification)
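As a rough illustration of how `MultiDocumentAssembler` and the new question-answering annotators described above fit together, here is a minimal Python sketch. It assumes Spark NLP 4.0.0 and PySpark are installed and that the `distilbert_base_cased_qa_squad2` model from the Featured Models table can be downloaded. The helper name `run_qa_example`, the column names, and the sample row are illustrative choices, not part of the release.

```python
# Column names used to wire the two inputs through the pipeline (illustrative).
INPUT_COLS = ["question", "context"]
DOCUMENT_COLS = ["document_question", "document_context"]


def run_qa_example():
    """Build and run a two-stage extractive QA pipeline (needs Spark NLP 4.0.0)."""
    # Imports are local so this module can be loaded without Spark installed.
    import sparknlp
    from sparknlp.base import MultiDocumentAssembler
    from sparknlp.annotator import DistilBertForQuestionAnswering
    from pyspark.ml import Pipeline

    spark = sparknlp.start()

    # Convert both raw string columns into DOCUMENT annotations in one stage.
    document_assembler = (
        MultiDocumentAssembler()
        .setInputCols(INPUT_COLS)
        .setOutputCols(DOCUMENT_COLS)
    )

    # Span classifier; the model name comes from the Featured Models table.
    span_classifier = (
        DistilBertForQuestionAnswering
        .pretrained("distilbert_base_cased_qa_squad2", "en")
        .setInputCols(DOCUMENT_COLS)
        .setOutputCol("answer")
    )

    pipeline = Pipeline(stages=[document_assembler, span_classifier])

    # One question/context pair; real workloads would load a larger DataFrame.
    data = spark.createDataFrame(
        [["What's my name?", "My name is Clara and I live in Berkeley."]]
    ).toDF(*INPUT_COLS)

    result = pipeline.fit(data).transform(data)
    result.select("answer.result").show(truncate=False)
    return result
```

Calling `run_qa_example()` in an environment with Spark NLP 4.0.0 should print the extracted answer span. The other XXXForQuestionAnswering annotators are expected to slot into the same two-stage layout.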
Performance Improvements (Benchmarks)
We have introduced two major performance improvements for GPU and CPU devices in the Spark NLP 4.0.0 release.
The following benchmarks have been done by using a single Dell Server with the following specs:
- GPU: Tesla P100 PCIe 12GB - CUDA Version: 11.3 - Driver Version: 465.19.01
- CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz - 40 Cores
- Memory: 80G
GPU
We have improved our batch processing approach for transformer-based Word Embeddings to improve their performance on a GPU device. These optimizations result in performance improvements up to +700%. The detailed list of improved transformer models on GPU in comparison to Spark NLP 3.4.x:
Model on GPU | Spark NLP 3.4.3 vs. 4.0.0 |
---|---|
RoBERTa base | +560%(6.6x) |
RoBERTa Large | +332%(4.3x) |
Albert Base | +587%(6.9x) |
Albert Large | +332%(4.3x) |
DistilBERT | +659%(7.6x) |
XLM-RoBERTa Base | +638%(7.4x) |
XLM-RoBERTa Large | +365%(4.7x) |
XLNet Base | +449%(5.5x) |
XLNet Large | +267%(3.7x) |
DeBERTa Base | +713%(8.1x) |
DeBERTa Large | +477%(5.8x) |
Longformer Base | +52%(1.5x) |
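The table pairs each percentage with a multiplier, and the two are related by simple arithmetic: a gain of +P% corresponds to a factor of 1 + P/100. The short sketch below checks a few rows; the function name `speedup_factor` is only illustrative.

```python
def speedup_factor(percent_increase):
    """Convert a '+P%' improvement into the equivalent 'Nx' multiplier."""
    return 1 + percent_increase / 100


# A few rows copied from the GPU table above.
rows = [("RoBERTa base", 560), ("DeBERTa Base", 713), ("Longformer Base", 52)]

for model, pct in rows:
    print(f"{model}: +{pct}% = {speedup_factor(pct):.1f}x")
```

Rounded to one decimal place, these reproduce the 6.6x, 8.1x, and 1.5x figures listed in the table.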
CPU
The oneAPI Deep Neural Network Library (oneDNN) optimizations are now available in Spark NLP 4.0.0, which uses TensorFlow 2.7.1. You can enable those CPU optimizations by setting the environment variable `TF_ENABLE_ONEDNN_OPTS=1`.
Intel has been collaborating with Google to optimize TensorFlow performance on Intel Xeon processor-based platforms using the Intel oneAPI Deep Neural Network Library (oneDNN), an open-source, cross-platform performance library for DL applications. TensorFlow optimizations are enabled via oneDNN to accelerate key performance-intensive operations such as convolution, matrix multiplication, and batch normalization.
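For Python users who prefer not to export the variable in the shell, the flag can also be set from the driver script before the session starts. The following is a minimal sketch under a few assumptions: a local-mode session, the helper name `start_with_onednn` (hypothetical), and the use of Spark's `spark.executorEnv.*` setting to forward the variable to executors, which may or may not be needed depending on the deployment.

```python
import os

ONEDNN_FLAG = "TF_ENABLE_ONEDNN_OPTS"

# Set the flag before anything starts the JVM or loads TensorFlow,
# so the child JVM process inherits it.
os.environ[ONEDNN_FLAG] = "1"


def start_with_onednn():
    """Start a local Spark session with Spark NLP 4.0.0 and the oneDNN flag set."""
    # Local import so this module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("spark-nlp-onednn")
        .master("local[*]")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0")
        # Assumption: on a cluster the executors need the variable as well.
        .config(f"spark.executorEnv.{ONEDNN_FLAG}", "1")
        .getOrCreate()
    )
```

Whether the executor setting matters depends on where TensorFlow actually runs in a given deployment, so treat it as a starting point rather than a requirement.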
The following table compares Spark NLP 3.4.3 on CPU with Spark NLP 4.0.0 on CPU with oneDNN enabled:
Model on CPU | 3.4.x vs. 4.0.0 with oneDNN |
---|---|
BERT Base | +47% |
BERT Large | +42% |
RoBERTa Base | +51% |
RoBERTa Large | +61% |
Albert Base | +83% |
Albert Large | +58% |
DistilBERT | +80% |
XLM-RoBERTa Base | +82% |
XLM-RoBERTa Large | +72% |
XLNet Base | +50% |
XLNet Large | +27% |
DeBERTa Base | +59% |
DeBERTa Large | +56% |
CamemBERT Base | +97% |
CamemBERT Large | +65% |
Longformer Base | +63% |
Bug Fixes
- Fix the default pre-trained model for DeBertaForTokenClassification in Scala and Python
- Remove a requirement in DocumentNormalizer so that consecutive stage processing can produce empty text annotations without breaking the pipeline
- Fix WordSegmenterModel outputting the wrong order of tokens. The regex that groups the tagging format was refactored to preserve the order of segmented outputs (tokens)
- Fix encoding sentences not respecting the max sequence length given by a user in XlmRobertaSentenceEmbeddings
- Fix encoding sentences by using SentencePiece to calculate the correct tokens indexing
- Fix SentencePiece serialization issue when XlmRoBertaEmbeddings and XlmRoBertaSentenceEmbeddings annotators are used from a Fat JAR on GPU
- Remove non-existing parameters from DocumentAssembler in Python
Updated Requirements
- Java 8 (still supported) or 11
- Apache Spark 3.x (3.0, 3.1, and 3.2)
- NVIDIA® GPU drivers version 450.80.02 or higher
- CUDA® Toolkit 11.2
- cuDNN SDK 8.1.0
- Scala 2.12.15
Backward Compatibility
- Deprecate support for Spark/PySpark 2.3, Spark/PySpark 2.4, and Scala 2.11 #8319
- The start() functions in Python and Scala will no longer have `spark23`, `spark24`, and `spark32` parameters. The default `sparknlp.start()` works on PySpark 3.0.x, 3.1.x, and 3.2.x without the need for any Spark-related flags
- Some models/pipelines which were trained or saved by using Spark and PySpark 2.3/2.4 will no longer work on Spark NLP 4.0.0
- Remove json4s-ext dependency to allow the support for all Apache Spark major releases in one build
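With the Spark-version flags gone, the only arguments to `sparknlp.start()` mentioned in these notes that still select a package are hardware related. A small sketch of how a script might pick them is shown below. The helper names `start_kwargs` and `start_session` are hypothetical, and the Apple silicon check via `platform.machine() == "arm64"` is an assumption that will not hold when Python runs under Rosetta.

```python
import platform


def start_kwargs(system=None, machine=None, gpu=False):
    """Choose keyword arguments for sparknlp.start() from the host platform."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        # Apple silicon M1 on macOS uses the spark-nlp-m1 package.
        return {"m1": True}
    if gpu:
        return {"gpu": True}
    # Plain CPU: no Spark-version flags are needed on 3.0.x, 3.1.x, or 3.2.x.
    return {}


def start_session(gpu=False):
    """Start Spark NLP with the arguments chosen above (needs spark-nlp 4.0.0)."""
    import sparknlp  # local import so the helper above works without Spark NLP

    return sparknlp.start(**start_kwargs(gpu=gpu))
```

On a Linux x86_64 host without a GPU this reduces to a bare `sparknlp.start()` call.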
Models and Pipelines
Spark NLP 4.0.0 comes with 1000+ state-of-the-art pre-trained transformer models in many languages.
New NER Models
The `nerdl_conll_deberta_large` NER model breaks the previously highest F1 on CoNLL03 dev by 1%
Model | Name | Lang | Dev F1 |
---|---|---|---|
NerDLModel | nerdl_conll_deberta_large | en | 96% |
NerDLModel | nerdl_conll_elmo | en | 95.6% |
NerDLModel | nerdl_conll_deberta_base | en | 94% |
Featured Models
Model | Name | Lang |
---|---|---|
AlbertForQuestionAnswering | albert_base_qa_squad2 | en |
DeBertaForQuestionAnswering | deberta_v3_xsmall_qa_squad2 | en |
DistilBertForQuestionAnswering | distilbert_base_cased_qa_squad2 | en |
LongformerForQuestionAnswering | longformer_base_base_qa_squad2 | en |
RoBertaForQuestionAnswering | roberta_base_qa_squad2 | en |
XlmRoBertaForQuestionAnswering | xlm_roberta_base_qa_squad2 | en |
DistilBertForQuestionAnswering | distilbert_qa_multi_finedtuned_squad | pt |
BertForQuestionAnswering | bert_qa_bert_large_cased_squad_v1.1_portuguese | pt |
BertForQuestionAnswering | bert_qa_chinese_pert_base_mrc | zh |
BertForQuestionAnswering | bert_qa_arap_qa_bert | ar |
BertForQuestionAnswering | bert_qa_ainize_klue_bert_base_mrc | ko |
BertForQuestionAnswering | bert_qa_Part_1_mBERT_Model_E1 | xx |
BertForQuestionAnswering | bert_qa_qacombination_bert_el_Danastos | el |
Spark NLP covers the following languages:
`English`, `Multilingual`, `Afrikaans`, `Afro-Asiatic languages`, `Albanian`, `Altaic languages`, `American Sign Language`, `Amharic`, `Arabic`, `Argentine Sign Language`, `Armenian`, `Artificial languages`, `Atlantic-Congo languages`, `Austro-Asiatic languages`, `Austronesian languages`, `Azerbaijani`, `Baltic languages`, `Bantu languages`, `Basque`, `Basque (family)`, `Belarusian`, `Bemba (Zambia)`, `Bengali, Bangla`, `Berber languages`, `Bihari`, `Bislama`, `Bosnian`, `Brazilian Sign Language`, `Breton`, `Bulgarian`, `Catalan`, `Caucasian languages`, `Cebuano`, `Celtic languages`, `Central Bikol`, `Chichewa, Chewa, Nyanja`, `Chilean Sign Language`, `Chinese`, `Chuukese`, `Colombian Sign Language`, `Congo Swahili`, `Croatian`, `Cushitic languages`, `Czech`, `Danish`, `Dholuo, Luo (Kenya and Tanzania)`, `Dravidian languages`, `Dutch`, `East Slavic languages`, `Eastern Malayo-Polynesian languages`, `Efik`, `Esperanto`, `Estonian`, `Ewe`, `Fijian`, `Finnish`, `Finnish Sign Language`, `Finno-Ugrian languages`, `French`, `French-based creoles and pidgins`, `Ga`, `Galician`, `Ganda`, `Georgian`, `German`, `Germanic languages`, `Gilbertese`, `Greek (modern)`, `Greek languages`, `Gujarati`, `Gun`, `Haitian, Haitian Creole`, `Hausa`, `Hebrew (modern)`, `Hiligaynon`, `Hindi`, `Hiri Motu`, `Hungarian`, `Icelandic`, `Igbo`, `Iloko`, `Indic languages`, `Indo-European languages`, `Indo-Iranian languages`, `Indonesian`, `Irish`, `Isoko`, `Isthmus Zapotec`, `Italian`, `Italic languages`, `Japanese`, `Kabyle`, `Kalaallisut, Greenlandic`, `Kannada`, `Kaonde`, `Kinyarwanda`, `Kirundi`, `Kongo`, `Korean`, `Kwangali`, `Kwanyama, Kuanyama`, `Latin`, `Latvian`, `Lingala`, `Lithuanian`, `Louisiana Creole`, `Lozi`, `Luba-Katanga`, `Luba-Lulua`, `Lunda`, `Lushai`, `Luvale`, `Macedonian`, `Malagasy`, `Malay`, `Malayalam`, `Malayo-Polynesian languages`, `Maltese`, `Manx`, `Marathi (Marāṭhī)`, `Marshallese`, `Mexican Sign Language`, `Mon-Khmer languages`, `Morisyen`, `Mossi`, `Multiple languages`, `Ndonga`, `Nepali`, `Niger-Kordofanian languages`, `Nigerian Pidgin`, `Niuean`, `North Germanic languages`, `Northern Sotho, Pedi, Sepedi`, `Norwegian`, `Norwegian Bokmål`, `Norwegian Nynorsk`, `Nyaneka`, `Oromo`, `Pangasinan`, `Papiamento`, `Persian (Farsi)`, `Peruvian Sign Language`, `Philippine languages`, `Pijin`, `Pohnpeian`, `Polish`, `Portuguese`, `Portuguese-based creoles and pidgins`, `Punjabi (Eastern)`, `Romance languages`, `Romanian`, `Rundi`, `Russian`, `Ruund`, `Salishan languages`, `Samoan`, `San Salvador Kongo`, `Sango`, `Semitic languages`, `Serbo-Croatian`, `Seselwa Creole French`, `Shona`, `Sindhi`, `Sino-Tibetan languages`, `Slavic languages`, `Slovak`, `Slovene`, `Somali`, `South Caucasian languages`, `South Slavic languages`, `Southern Sotho`, `Spanish`, `Spanish Sign Language`, `Sranan Tongo`, `Swahili`, `Swati`, `Swedish`, `Tagalog`, `Tahitian`, `Tai`, `Tamil`, `Telugu`, `Tetela`, `Tetun Dili`, `Thai`, `Tigrinya`, `Tiv`, `Tok Pisin`, `Tonga (Tonga Islands)`, `Tonga (Zambia)`, `Tsonga`, `Tswana`, `Tumbuka`, `Turkic languages`, `Turkish`, `Tuvalu`, `Tzotzil`, `Ukrainian`, `Umbundu`, `Uralic languages`, `Urdu`, `Venda`, `Venezuelan Sign Language`, `Vietnamese`, `Wallisian`, `Walloon`, `Waray (Philippines)`, `Welsh`, `West Germanic languages`, `West Slavic languages`, `Western Malayo-Polynesian languages`, `Wolaitta, Wolaytta`, `Wolof`, `Xhosa`, `Yapese`, `Yiddish`, `Yoruba`, `Yucatec Maya, Yucateco`, `Zande (individual language)`, `Zulu`
The complete list of all 6000+ models & pipelines in 230+ languages is available on Models Hub
New Notebooks
Import hundreds of models in different languages to Spark NLP
Spark NLP | HuggingFace Notebooks | Colab |
---|---|---|
AlbertForQuestionAnswering | HuggingFace in Spark NLP - AlbertForQuestionAnswering | |
BertForQuestionAnswering | HuggingFace in Spark NLP - BertForQuestionAnswering | |
DeBertaForQuestionAnswering | HuggingFace in Spark NLP - DeBertaForQuestionAnswering | |
DistilBertForQuestionAnswering | HuggingFace in Spark NLP - DistilBertForQuestionAnswering | |
LongformerForQuestionAnswering | HuggingFace in Spark NLP - LongformerForQuestionAnswering | |
RoBertaForQuestionAnswering | HuggingFace in Spark NLP - RoBertaForQuestionAnswering | |
XlmRobertaForQuestionAnswering | HuggingFace in Spark NLP - XlmRobertaForQuestionAnswering | |
You can visit Import Transformers in Spark NLP for more info
Documentation
- Serving Spark NLP via API in Java
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==4.0.0
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, and 3.2.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.0
M1
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.0
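The three `--packages` coordinates above differ only in the artifact name, so scripts that launch Spark programmatically can derive them from the target hardware. A minimal sketch follows; the `ARTIFACTS` mapping and the function name `spark_nlp_coordinate` are illustrative, not part of Spark NLP.

```python
# Artifact names for each hardware target, as listed in this release.
ARTIFACTS = {"cpu": "spark-nlp", "gpu": "spark-nlp-gpu", "m1": "spark-nlp-m1"}


def spark_nlp_coordinate(variant="cpu", version="4.0.0", scala="2.12"):
    """Return the Maven coordinate to pass to --packages or spark.jars.packages."""
    if variant not in ARTIFACTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(ARTIFACTS)}")
    return f"com.johnsnowlabs.nlp:{ARTIFACTS[variant]}_{scala}:{version}"


for variant in ("cpu", "gpu", "m1"):
    print(spark_nlp_coordinate(variant))
```

The resulting string can be passed as the value of `spark.jars.packages` when building a `SparkSession` by hand.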
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, and 3.2.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>4.0.0</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>4.0.0</version>
</dependency>
spark-nlp-m1:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-m1_2.12</artifactId>
<version>4.0.0</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x/3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.0.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.0.0.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.0.0.jar
What's Changed
Full Changelog: 3.4.4...4.0.0
@vankov @mahmoodbayeshi @Ahmetemintek @DevinTDHa @albertoandreottiATgmail @KshitizGIT @jsl-models @gokhanturer @josejuanmartinez @murat-gunay @rpranab @wolliq @bunyamin-polat @pabla @danilojsl @agsfer @Meryem1425 @gadde5300 @muhammetsnts @Damla-Gurbaz @maziyarpanahi @jsl-builder @Cabir40 @suvrat-joshi