Release Spark NLP 4.2.3: Improved CoNLLGenerator annotator, new rules parameter in RegexMatcher, new IAnnotation feature for LightPipeline in Scala, and bug fixes · JohnSnowLabs/spark-nlp

📢 Overview

Spark NLP 4.2.3 🚀 comes with improvements to the CoNLLGenerator annotator, a new way to pass rules to the RegexMatcher annotator, unified control over the number of columns accepted by setInputCols in Scala and Python, documentation for the new IAnnotation feature for those using Spark NLP in Scala, and bug fixes.

Do not forget to visit Models Hub, with over 11,400 free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉

⭐ New Features & improvements

Add a metadata sentence key parameter to the CoNLLGenerator annotator to select which metadata field is used as the sentence
Escape special characters in the CoNLLGenerator annotator when writing to CSV, so that tokens containing special characters are preserved
Add rules and delimiter parameters to the RegexMatcher annotator so that rules can be passed as strings in addition to a file

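from sparknlp.annotator import RegexMatcher

# Each rule is "<regex><delimiter><identifier>"; the delimiter is set with setDelimiter.
regexMatcher = RegexMatcher() \
    .setRules(["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]) \
    .setDelimiter(",") \
    .setInputCols(["sentence"]) \
    .setOutputCol("regex") \
    .setStrategy("MATCH_ALL")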
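To make the rule format above concrete, here is a minimal pure-Python sketch of how a "pattern, delimiter, identifier" rule string can be split and applied to a sentence. It uses only the standard `re` module and is not Spark NLP's implementation. The helper names `parse_rules` and `match_all` are hypothetical, splitting on the last delimiter is a choice made for this sketch, and the loop only approximates what the MATCH_ALL strategy name suggests (every match of every rule is returned).

```python
import re

def parse_rules(rules, delimiter=","):
    """Split each 'pattern<delimiter>identifier' string into (compiled regex, identifier)."""
    parsed = []
    for rule in rules:
        # Assumption of this sketch: split on the LAST delimiter, so the
        # regex itself may contain the delimiter (e.g. "\\d{2,4}").
        pattern, identifier = rule.rsplit(delimiter, 1)
        parsed.append((re.compile(pattern), identifier))
    return parsed

def match_all(text, parsed_rules):
    """Return (matched text, begin, inclusive end, identifier) for every match of every rule."""
    results = []
    for pattern, identifier in parsed_rules:
        for m in pattern.finditer(text):
            results.append((m.group(0), m.start(), m.end() - 1, identifier))
    return results

rules = ["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]
matches = match_all("Signed on 2022/10/25.", parse_rules(rules))
for match in matches:
    print(match)
```

In this sketch both rules match inside the same date string, because the two-digit-year pattern also fits the tail of a four-digit-year date. Overlaps of this kind are worth keeping in mind when writing rules that share a structure.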

Add a check on the number of accepted input columns in Python. When a user passes more columns to setInputCols than an annotator allows, Spark NLP in Python now behaves the same way as in Scala
Add documentation for the new IAnnotation feature for Scala users
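The input-column check mentioned above can be pictured with a small pure-Python sketch. This is an illustration only, not Spark NLP's code: the function name `check_input_cols`, the exception class, and the error message are all hypothetical, and raising an error is an assumption about what "control" means here.

```python
class InputColumnCountError(ValueError):
    """Hypothetical error type used only in this sketch."""

def check_input_cols(annotator_name, input_cols, expected_count):
    """Reject more input columns than the annotator accepts (assumed behavior)."""
    if len(input_cols) > expected_count:
        raise InputColumnCountError(
            f"{annotator_name} accepts {expected_count} input column(s), "
            f"but {len(input_cols)} were given: {list(input_cols)}"
        )
    return list(input_cols)

# One column for a one-column annotator passes through unchanged.
accepted = check_input_cols("RegexMatcher", ["sentence"], expected_count=1)

# Two columns for the same annotator are rejected.
try:
    check_input_cols("RegexMatcher", ["sentence", "token"], expected_count=1)
    rejected = False
except InputColumnCountError:
    rejected = True
```

Validating at the point where the columns are set, rather than when the pipeline runs, surfaces the mistake before any Spark job is started.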

Bug Fixes

Fix NotSerializableException when the WordEmbeddings annotator is used on a Kubernetes (K8s) cluster while setEnableInMemoryStorage is set to true
Fix a bug in the RegexTokenizer annotator that produced wrong indexes when the pattern includes splits that are not followed by a space
Fix the training module failing on EMR due to incorrect Apache Spark version detection. This fixes the use of the following classes on EMR: CoNLL(), CoNLLU(), POS(), and PubTator()
Fix a bug in the CoNLLGenerator annotator where the token has non-int metadata
Fix the incorrect SentencePiece model name required by DeBertaForQuestionAnswering and DeBertaEmbeddings when importing models
Fix NaN results in some ViTForImageClassification models/pipelines
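As a rough illustration of the RegexTokenizer indexing issue listed above, the following pure-Python sketch shows what consistent token indexes look like when a split pattern matches something other than whitespace. It is not the annotator's implementation. The function name `split_with_indexes` is hypothetical, and inclusive end indexes are assumed here to mirror the begin/end convention of Spark NLP annotations.

```python
import re

def split_with_indexes(text, split_pattern):
    """Split text on a regex and return (token, begin, inclusive end) triples."""
    tokens = []
    start = 0
    for m in re.finditer(split_pattern, text):
        # Emit the text between the previous split and this one, if any.
        if m.start() > start:
            tokens.append((text[start:m.start()], start, m.start() - 1))
        # The next token begins right after the split, space or not.
        start = m.end()
    # Emit whatever remains after the last split.
    if start < len(text):
        tokens.append((text[start:], start, len(text) - 1))
    return tokens

text = "a-b c"
tokens = split_with_indexes(text, r"-|\s")
for token in tokens:
    print(token)
```

The invariant to check in any such tokenizer is that slicing the original text with the reported indexes gives back the token, regardless of whether the split character was a space.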

📓 New Notebooks

You can visit Import Transformers in Spark NLP
You can visit Spark NLP Workshop for 100+ examples

📖 Documentation

TF Hub & HuggingFace to Spark NLP
Models Hub with new models
Spark NLP documentation
Spark NLP Scala APIs
Spark NLP Python APIs
Spark NLP Workshop notebooks
Spark NLP publications
Spark NLP in Action
Spark NLP training certification notebooks for Google Colab and Databricks
Spark NLP Display for visualization of different types of annotations
Discussions: engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

#PyPI

pip install spark-nlp==4.2.3

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.3

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.3

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.2.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.2.3</version>
</dependency>

spark-nlp-m1:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-m1_2.12</artifactId>
    <version>4.2.3</version>
</dependency>

spark-nlp-aarch64:

<!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -->
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>4.2.3</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.3.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.3.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.3.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.2.3.jar

What's Changed

Models hub legal by @josejuanmartinez in #12999
Models hub finance by @josejuanmartinez in #13000
Embed React and ReactDOM instead of packages from unpkg [skip test] by @pabla in #13002
updated OCR release notes by @albertoandreottiATgmail in #13010
Compat tables by @albertoandreottiATgmail in #13012
Updating s3 link for dependency_conllu model by @luca-martial in #13016
Add new demos by @agsfer in #13020
Add new demos 24 by @agsfer in #13022
Updated legre_contract_doc_parties_en and finre_work_experience_en mo… by @bunyamin-polat in #13023
Docs/alab update documentation 410 by @diatrambitas in #13024
Doc fix scala and open source by @ArshaanNazir in #13008
Update 2022-10-22-finclf_bert_sentiment_analysis_lt.md by @gadde5300 in #13026
add alab image by @agsfer in #13030
Docs/alab update documentation 410 by @diatrambitas in #13034
SPARKNLP 643 detecting spark version in a safer way by @maziyarpanahi in #13035
Docs/alab update documentation 410 by @diatrambitas in #13041
Added content for exporting visual NER project ad updated few other sections by @suvrat-joshi in #13042
Bump model card Spark NLP HC version to 4.2.1 by @luca-martial in #13027
SPARKNLP-642: Fix indexing issue for regex splits without space by @DevinTDHa in #13032
Update ALAB by @agsfer in #13045
Serializable Issue K8s Word Embeddings by @danilojsl in #13001
FEATURE NMH-133: Rename products in search [skip-test] by @KshitizGIT in #12998
Fix sorting in the versions drop-down [skip test] by @pabla in #13049
Add tooltips for Unidirectional and Bidirectional models [skip test] by @pabla in #13064
FEATURE NMH-134: Rebranding products [skip-test] by @KshitizGIT in #13065
Adding Control for Annotators with One Column by @danilojsl in #12997
Update 2022-10-18-legre_confidentiality_en.md by @gadde5300 in #13059
Update 2022-09-28-legre_indemnifications_en.md by @gadde5300 in #13058
Fix a bug in Vision Transformer annotator that results in NaNs for some models by @ahmedlone127 in #13048
Bug fix and enhancements for CoNLLGenerator annotator by @maziyarpanahi in #13053
SPARKNLP-621: Add string support to RegexMatcher in addition to a file by @DevinTDHa in #13060
Add ScalaDoc for IAnnotation by @danilojsl in #13061
doc fix in old hc md files by @ArshaanNazir in #13025
Release/423 release candidate by @maziyarpanahi in #13036

Full Changelog: 4.2.2...4.2.3
