Spark NLP 4.2.3: Improved CoNLLGenerator annotator, new rules parameter in RegexMatcher, new IAnnotation feature for LightPipeline in Scala, and bug fixes
Overview
Spark NLP 4.2.3 comes with improvements to the CoNLLGenerator
annotator, a new way to pass rules to the RegexMatcher
annotator, unified control over the number of columns accepted by setInputCols
in Scala and Python, new documentation for the IAnnotation
feature for those who are using Spark NLP in Scala, and bug fixes.
Do not forget to visit Models Hub with 11400+ free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features & Improvements
- Add a metadata sentence key parameter to select which metadata field to use as the sentence for the CoNLLGenerator annotator
- Include escaping in the CoNLLGenerator annotator when writing to CSV and preserve special char tokens
- Add rules and delimiter parameters to the RegexMatcher annotator to support a string as input in addition to a file
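from sparknlp.annotator import RegexMatcher

# Each rule is "<regex><delimiter><identifier>"; the delimiter here is ","
regexMatcher = RegexMatcher() \
    .setRules(["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]) \
    .setDelimiter(",") \
    .setInputCols(["sentence"]) \
    .setOutputCol("regex") \
    .setStrategy("MATCH_ALL")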
- Implement control over the number of accepted input columns in Python. This syncs the behavior between Scala and Python when the user passes more columns to setInputCols than the annotator allows
- Add documentation for the new IAnnotation feature for Scala users
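To make the rules and delimiter parameters of RegexMatcher more concrete, here is a minimal pure-Python sketch of how a rule string of the form "pattern,identifier" could be interpreted. It does not use Spark NLP and is not the library's implementation: the helper names parse_rules and match_all are hypothetical, splitting on the last occurrence of the delimiter is an assumption of this sketch, and Python's re engine stands in for the JVM regex engine.

```python
import re


def parse_rules(rules, delimiter=","):
    """Split each 'pattern<delimiter>identifier' string into (compiled pattern, identifier).

    Assumption of this sketch: the identifier is whatever follows the LAST delimiter.
    """
    parsed = []
    for rule in rules:
        pattern, identifier = rule.rsplit(delimiter, 1)
        parsed.append((re.compile(pattern), identifier))
    return parsed


def match_all(text, parsed_rules):
    """Return every match of every rule as (matched text, identifier, begin, inclusive end)."""
    results = []
    for pattern, identifier in parsed_rules:
        for m in pattern.finditer(text):
            results.append((m.group(0), identifier, m.start(), m.end() - 1))
    return results


# Same rule strings as in the RegexMatcher example in these notes
rules = ["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]
parsed = parse_rules(rules, delimiter=",")

for match in match_all("Due 22/11/05", parsed):
    print(match)
```

Choosing a delimiter that never appears in the patterns avoids ambiguity: a quantifier such as {2,4} contains a comma, so a different delimiter may be the safer choice for such rules.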
Bug Fixes
- Fix NotSerializableException when the WordEmbeddings annotator is used over the K8s cluster while setEnableInMemoryStorage is set to true
- Fix a bug in the RegexTokenizer annotator when it outputs the wrong indexes if the pattern includes splits that are not followed by a space
- Fix training module failing on EMR due to a bad Apache Spark version detection. The use of the following classes was fixed on EMR: CoNLL(), CoNLLU(), POS(), and PubTator()
- Fix a bug in the CoNLLGenerator annotator where the token has non-int metadata
- Fix the wrong SentencePiece model's name required for DeBertaForQuestionAnswering and DeBertaEmbeddings when importing models
- Fix NaNs result in some ViTForImageClassification models/pipelines
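The RegexTokenizer fix above concerns token offsets. As an illustration only, the following pure-Python sketch shows what consistent begin and end indexes look like when the split pattern is not always followed by a space. The function tokenize_with_indexes is hypothetical and is not Spark NLP code; it assumes inclusive end offsets and a split pattern that never matches the empty string.

```python
import re


def tokenize_with_indexes(text, split_pattern):
    """Split text on split_pattern and return (token, begin, end) with inclusive end offsets."""
    tokens = []
    cursor = 0  # start of the token currently being collected
    for m in re.finditer(split_pattern, text):
        if m.start() > cursor:
            # Text between the previous separator and this one is a token
            tokens.append((text[cursor:m.start()], cursor, m.start() - 1))
        cursor = m.end()  # continue right after the separator, whatever its length
    if cursor < len(text):
        tokens.append((text[cursor:], cursor, len(text) - 1))
    return tokens


text = "a,b, c"
tokens = tokenize_with_indexes(text, r",\s*")

# Every reported span should reproduce its token when sliced from the input
for token, begin, end in tokens:
    assert text[begin:end + 1] == token
print(tokens)
```

Because the cursor advances by the actual length of each separator match, the offsets stay correct whether or not a comma is followed by a space.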
New Notebooks
- You can visit Import Transformers in Spark NLP
- You can visit Spark NLP Workshop for 100+ examples
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions: engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
# PyPI
pip install spark-nlp==4.2.3
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.3
M1
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.3
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>4.2.3</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>4.2.3</version>
</dependency>
spark-nlp-m1:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-m1_2.12</artifactId>
<version>4.2.3</version>
</dependency>
spark-nlp-aarch64:
<!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -->
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>4.2.3</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.3.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.3.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.3.jar
- AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.2.3.jar
What's Changed
- Models hub legal by @josejuanmartinez in #12999
- Models hub finance by @josejuanmartinez in #13000
- Embed React and ReactDOM instead of packages from unpkg [skip test] by @pabla in #13002
- updated OCR release notes by @albertoandreottiATgmail in #13010
- Compat tables by @albertoandreottiATgmail in #13012
- Updating s3 link for dependency_conllu model by @luca-martial in #13016
- Add new demos by @agsfer in #13020
- Add new demos 24 by @agsfer in #13022
- Updated legre_contract_doc_parties_en and finre_work_experience_en mo… by @bunyamin-polat in #13023
- Docs/alab update documentation 410 by @diatrambitas in #13024
- Doc fix scala and open source by @ArshaanNazir in #13008
- Update 2022-10-22-finclf_bert_sentiment_analysis_lt.md by @gadde5300 in #13026
- add alab image by @agsfer in #13030
- Docs/alab update documentation 410 by @diatrambitas in #13034
- SPARKNLP 643 detecting spark version in a safer way by @maziyarpanahi in #13035
- Docs/alab update documentation 410 by @diatrambitas in #13041
- Added content for exporting visual NER project ad updated few other sections by @suvrat-joshi in #13042
- Bump model card Spark NLP HC version to 4.2.1 by @luca-martial in #13027
- SPARKNLP-642: Fix indexing issue for regex splits without space by @DevinTDHa in #13032
- Update ALAB by @agsfer in #13045
- Serializable Issue K8s Word Embeddings by @danilojsl in #13001
- FEATURE NMH-133: Rename products in search [skip-test] by @KshitizGIT in #12998
- Fix sorting in the versions drop-down [skip test] by @pabla in #13049
- Add tooltips for Unidirectional and Bidirectional models [skip test] by @pabla in #13064
- FEATURE NMH-134: Rebranding products [skip-test] by @KshitizGIT in #13065
- Adding Control for Annotators with One Column by @danilojsl in #12997
- Update 2022-10-18-legre_confidentiality_en.md by @gadde5300 in #13059
- Update 2022-09-28-legre_indemnifications_en.md by @gadde5300 in #13058
- Fix a bug in Vision Transformer annotator that results in NaNs for some models by @ahmedlone127 in #13048
- Bug fix and enhancements for CoNLLGenerator annotator by @maziyarpanahi in #13053
- SPARKNLP-621: Add string support to RegexMatcher in addition to a file by @DevinTDHa in #13060
- Add ScalaDoc for IAnnotation by @danilojsl in #13061
- doc fix in old hc md files by @ArshaanNazir in #13025
- Release/423 release candidate by @maziyarpanahi in #13036
Full Changelog: 4.2.2...4.2.3