Spark NLP 4.2.3: Improved CoNLLGenerator annotator, new rules parameter in RegexMatcher, new IAnnotation feature for LightPipeline in Scala, and bug fixes

@maziyarpanahi maziyarpanahi released this 10 Nov 20:17
· 1024 commits to master since this release

📒 Overview

Spark NLP 4.2.3 🚀 comes with improvements to the CoNLLGenerator annotator, a new way to pass rules to the RegexMatcher annotator, unified control over the number of columns accepted by setInputCols in Scala and Python, documentation for the new IAnnotation feature for those using Spark NLP in Scala, and bug fixes.

Do not forget to visit Models Hub, which hosts over 11,400 free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉


⭐ New Features & Improvements

  • Add a metadata sentence key parameter to the CoNLLGenerator annotator to select which metadata field is used as the sentence
  • Escape values in the CoNLLGenerator annotator when writing to CSV, and preserve tokens that are special characters
  • Add rules and delimiter parameters to the RegexMatcher annotator so that rules can be passed as strings in addition to a file
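from sparknlp.annotator import RegexMatcher

# Each rule is "<regex><delimiter><identifier>"; the delimiter here is ","
regexMatcher = RegexMatcher() \
      .setRules(["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]) \
      .setDelimiter(",") \
      .setInputCols(["sentence"]) \
      .setOutputCol("regex") \
      .setStrategy("MATCH_ALL")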
  • Implement a check on the number of accepted input columns in Python. This syncs the behavior between Scala and Python when the user passes more columns to setInputCols than the annotator allows
  • Add documentation for the new IAnnotation feature for Scala users
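
To make the input column check more concrete, here is a minimal pure-Python sketch of the idea. It is not Spark NLP's implementation: the class `SketchAnnotator`, its `input_annotator_types` attribute, and the method `set_input_cols` are hypothetical names used only for illustration, and the sketch assumes the limit is simply the number of annotation types an annotator declares as input.

```python
class SketchAnnotator:
    """Hypothetical stand-in for an annotator; not a Spark NLP class."""

    # Assumption: the annotator declares which annotation types it consumes,
    # and the number of declared types is the maximum number of input columns.
    input_annotator_types = ["document", "token"]

    def __init__(self):
        self.input_cols = []

    def set_input_cols(self, *cols):
        # Accept both set_input_cols("a", "b") and set_input_cols(["a", "b"]).
        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
            cols = tuple(cols[0])

        allowed = len(self.input_annotator_types)
        if len(cols) > allowed:
            # Fail early instead of silently accepting extra columns.
            raise TypeError(
                f"at most {allowed} input column(s) are accepted for annotator "
                f"types {self.input_annotator_types}, got {len(cols)}: {list(cols)}"
            )

        self.input_cols = list(cols)
        return self
```

With this sketch, `SketchAnnotator().set_input_cols("document", "token")` succeeds, while passing a third column raises a `TypeError` before any pipeline is built.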

Bug Fixes

  • Fix NotSerializableException when the WordEmbeddings annotator is used on a Kubernetes (K8s) cluster while setEnableInMemoryStorage is set to true
  • Fix a bug in the RegexTokenizer annotator that produced wrong indexes when the pattern includes splits that are not followed by a space
  • Fix the training module failing on EMR because of incorrect Apache Spark version detection. The following classes now work on EMR: CoNLL(), CoNLLU(), POS(), and PubTator()
  • Fix a bug in the CoNLLGenerator annotator that occurred when a token has non-integer metadata
  • Fix the incorrect SentencePiece model name required when importing DeBertaForQuestionAnswering and DeBertaEmbeddings models
  • Fix NaN results in some ViTForImageClassification models/pipelines

📓 New Notebooks


📖 Documentation


Installation

Python

#PyPI

pip install spark-nlp==4.2.3

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.3

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.3
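
The package coordinates above all follow the same pattern, so a small helper can build them when the variant is chosen at runtime. This is only a convenience sketch: the function `spark_nlp_package` and the variant keys (`cpu`, `gpu`, `m1`, `aarch64`) are hypothetical names, not part of Spark NLP; the artifact names, Scala version, and release version are taken from the commands listed above.

```python
SPARK_NLP_VERSION = "4.2.3"
SCALA_VERSION = "2.12"

# Artifact names as they appear in the --packages commands above.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "m1": "spark-nlp-m1",
    "aarch64": "spark-nlp-aarch64",
}


def spark_nlp_package(variant="cpu", version=SPARK_NLP_VERSION, scala=SCALA_VERSION):
    """Return the Maven coordinate for the requested Spark NLP variant."""
    if variant not in ARTIFACTS:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {sorted(ARTIFACTS)}"
        )
    return f"com.johnsnowlabs.nlp:{ARTIFACTS[variant]}_{scala}:{version}"


if __name__ == "__main__":
    for name in ARTIFACTS:
        print(name, spark_nlp_package(name))
```

The returned string can be passed to `--packages` on the command line or to Spark's `spark.jars.packages` configuration when building a session.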

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.2.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.2.3</version>
</dependency>

spark-nlp-m1:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-m1_2.12</artifactId>
    <version>4.2.3</version>
</dependency>

spark-nlp-aarch64:

<!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -->
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>4.2.3</version>
</dependency>

FAT JARs

What's Changed

Full Changelog: 4.2.2...4.2.3