
Releases: JohnSnowLabs/spark-nlp

John Snow Labs Spark-NLP 2.0.0: Bert embeddings, embeddings as annotators, better OCR, new pretrained pipelines

21 Mar 19:33

Thank you for keeping up with the biggest changelog ever on Spark NLP: Spark NLP 2.0.0! Where to begin?
We have no less than 50 pull requests merged this time. Most importantly, we have become the first library to offer a production-ready
implementation of BERT embeddings. Along with this deep learning, context-based embeddings algorithm, here is a quick overview of what's new:

  • Word embeddings, as well as BERT embeddings, are now annotators, just like any other component in the library. This means embeddings can be
    cached in memory through DataFrames, saved to disk, and shared as part of pipelines!
  • We revamped and enhanced the Named Entity Recognition (NER) Deep Learning models to a new state-of-the-art level, reaching up to 93% micro-averaged F1 on industry-standard benchmarks.
  • We upgraded the TensorFlow version and started using contrib LSTM cells.
  • Performance and memory usage improvements tag along as well, by improving the serialization throughput of Deep Learning annotators based on feedback from Apache Spark contributor Davies Liu.
  • We revamped and expanded our pretrained pipelines list, added new pretrained models for different languages, and wrote
    tons of new example notebooks, all aimed at making the library easier to use. The overall API was also reworked to help newcomers get started.
  • The OCR module comes with a handful of improvements that increase accuracy.

All of this comes together with a full range of bug fixes and annotator improvements; follow the details below!
Bear with us, since documentation is still catching up a little, as are some new models yet to be made available. Stay tuned on Slack!

New Features

  • BertEmbeddings annotator, with four Google-ready models usable through Spark NLP as part of your pipelines; includes WordPiece tokenization (see the sketch after this list)
  • WordEmbeddings, our previous embeddings system, is now an annotator that serializes along with Spark ML pipelines
  • Created training helper functions that build Spark datasets from files, such as CoNLL and POS-tagged corpora (also shown in the sketch below)
  • NER DL has been revamped by using contrib LSTM cells; added native library handling for different operating systems
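For context, here is a minimal sketch of embeddings as regular pipeline stages. The default pretrained BERT model, the exact input columns, and the CoNLL file path are illustrative assumptions; check the documentation for the models and signatures available in your version.

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, BertEmbeddings

spark = sparknlp.start()  # local SparkSession with Spark NLP loaded

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

token = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Embeddings are now a regular annotator stage, so the fitted pipeline
# (embeddings included) saves to disk and shares like any PipelineModel.
bert = BertEmbeddings.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document, sentence, token, bert])

# One of the new training helpers: read a CoNLL corpus into a Spark
# dataset ready for NER training (the path is a placeholder).
from sparknlp.training import CoNLL
train_df = CoNLL().readDataset(spark, "path/to/eng.train")
```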

Enhancements

  • OCR improved image handling by binarizing buffered segments
  • OCR now allows automatic adaptive scaling
  • SentenceDetector params merged between the DL and rule-based annotators
  • SentenceDetector max length is now disabled by default, and truncation now happens at whitespace
  • Part of Speech, NER, Spell Checking, and Vivekn Sentiment Analysis annotators now train from the dataset passed to fit(), using Spark in the process
  • Tokens and chunks now hold metadata about which sentence they belong to, by sentence ID
  • AnnotatorApproach annotators now allow a trainingCols param, letting them use different inputs in training and in prediction; improves pipeline versatility
  • LightPipeline now allows its transform() method to be called against a DataFrame (see the sketch after this list)
  • Noticeable performance gains from improved serialization in annotators through the removal of transient variables
  • "Spark NLP in 30 seconds": the new SparkNLP.start() (Scala) and sparknlp.start() (Python) functions automatically create a local Spark session (see the sketch after this list)
  • Improved DateMatcher accuracy
  • Improved the Normalizer annotator by supporting and tokenizing a slang dictionary, with a case-sensitivity matching option
  • ContextSpellChecker is now capable of handling multiple sentences in a row
  • The PretrainedPipeline feature now handles John Snow Labs remote pretrained pipelines, making it easy to update and access new models
  • Improved training performance of the Symmetric Delete spell checking model
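A quick-start sketch tying several of these together: sparknlp.start(), a remote pretrained pipeline, and LightPipeline's new transform() over a DataFrame. The pipeline name is one of the pretrained pipelines shipped around this release; treat the attribute access and exact arguments as assumptions to verify against the docs.

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.base import LightPipeline

# New in 2.0.0: creates a local SparkSession with Spark NLP loaded
spark = sparknlp.start()

# Downloads and caches a John Snow Labs remote pretrained pipeline
pipeline = PretrainedPipeline("explain_document_ml", lang="en")
print(pipeline.annotate("Spark NLP 2.0.0 brings BERT embeddings."))

# LightPipeline now also accepts a DataFrame through transform()
df = spark.createDataFrame([("Harry Potter is a great movie.",)], ["text"])
result = LightPipeline(pipeline.model).transform(df)
result.show()
```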

Models and Pipelines

  • Added more than 15 pretrained pipelines that cover a huge range of use cases (documentation forthcoming)
  • Improved multi-language support by adding French and Italian pipelines and models. More to come!
  • Dependency Parser annotators now include a pretrained English model based on CoNLL-U 2009

Bugfixes

  • Fixed Python class name references when deserializing pipelines
  • Fixed serialization in ContextSpellChecker
  • Fixed a bug in LightPipeline that dropped the output of pipelines embedded in a PipelineModel
  • Fixed a wrong param name in DateMatcher that prevented accessing the param properly
  • Fixed a bug where DateMatcher didn't handle dashes in dates with two-digit years
  • Fixed a ContextSpellChecker bug that prevented it from being used repeatedly with collections in LightPipeline
  • Fixed a bug in OCR that made it blow up with some image formats when using the text-preferred method
  • Fixed a bug in OCR that made params not work in cluster mode
  • Fixed OCR setSplitPages and setSplitRegions to work properly when Tesseract detects multiple regions

Developer API

  • AnnotatorType params renamed to inputAnnotatorTypes and outputAnnotatorTypes
  • Embeddings now serialize along with a FloatArray in the Annotation class
  • Disabled useFeatureBroadcasting, as disabling it showed better performance numbers when training large models in annotators that use Features
  • OCR must now be instantiated
  • OCR works best with Tesseract 4.0.0-beta.1

Build and release

  • Added GPU build with tensorflow-gpu to Maven coordinates
  • Removed .jar file from pip package

John Snow Labs Spark-NLP 1.8.3: Revisited DeepSentenceDetector, embeddings from S3, fixed python deserialization modules

24 Feb 05:34

Overview

We're glad to announce a new release of Spark NLP. This one owes much to the community, who contributed
immensely by reporting bugs and giving feedback on the library. This release focuses on various bugfixes around DeepSentenceDetector
and Python deserialization of some specific pipelines. It also improves the DeepSentenceDetector, allowing further fine-tuning
and customization. Embeddings are now cached in the models folder, with further improvements for accessing
them through S3 storage. Finally, we have made serious improvements to the notebooks and documentation around the library.
Special thanks to @Tshimanga and @haimco10 for very interesting contributions. See you on Slack!


Enhancements

  • Improved OCR performance in skew detection
  • SentenceDetector now better handles single-quote protections (Thanks @haimco10)
  • DeepSentenceDetector can now explode sentences into rows (Thanks @Tshimanga from Deep6.ai)
  • EmbeddingsHelper is now capable of caching downloaded embeddings to avoid re-downloading
  • The application.conf file may now be read from an S3 location
  • DeepSentenceDetector now has access to all pragmatic SentenceDetector params, in order to fine-tune it (see the sketch after this list)
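A sketch of what that fine-tuning might look like. The input columns and setter names here are assumptions inferred from the pragmatic SentenceDetector's params, not confirmed API:

```python
from sparknlp.annotator import DeepSentenceDetector

# NER-assisted sentence detection; input columns are illustrative and
# typically include the document, tokens, and NER-derived chunks.
deep_sd = DeepSentenceDetector() \
    .setInputCols(["document", "token", "ner_con"]) \
    .setOutputCol("sentence") \
    .setExplodeSentences(True)  # new: emit one sentence per row
```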

Bugfixes

  • Fixed ambiguous classpath resolution in PySpark, causing errors when deserializing some models
  • Fixed DeepSentenceDetector not being deserializable in PySpark
  • Fixed Chunk2Doc and Doc2Chunk annotators not being loadable in PySpark
  • Fixed a bug where DeepSentenceDetector wouldn't correctly denote start and end offsets (Thanks @Tshimanga from Deep6.ai)
  • Fixed a bug where DeepSentenceDetector would miss sentence parts when the NER model missed a header sentence (Thanks @Tshimanga from Deep6.ai)
  • Cleaned and optimized DeepSentenceDetector code (Thanks @danilojsl)
  • Fixed a missing dependency for OCR

Documentation and notebooks

  • Added support and instructions for Anaconda deployment (Thanks @maziyarpanahi)
  • Updated various Python notebooks to show the use of Spark packages instead of jars
  • Added a new conference talk on Spark NLP, in French, at XebiCon'18
  • Updated documentation toward less use of jars in favor of dependency resolution

John Snow Labs Spark-NLP 1.8.2: OCR Autorotation, Embeddings bugfixes, new utility annotators and languages

08 Feb 04:15

Overview

This release targets improved performance and resource usage in pipelines that use word embeddings. It also comes
with a very interesting auto-rotation feature in OCR, and a couple of new annotators that solve particular needs, including the ChunkTokenizer
and a Param to limit sentence lengths. Finally, we are starting to organize our multilingual store of models and data for training models.
Check the examples for some Italian notebooks! Thanks again to the whole community for such quick feedback, all the time.


New Features

  • OCR is now capable of automatic rotation, significantly improving accuracy in some scenarios
  • ChunkTokenizer is a new annotator that tokenizes CHUNK-type annotations. It extends the Tokenizer algorithm and stores the chunk ID for reference.
  • SentenceDetector's new maxLength Param cuts off sentences longer than 240 characters by default. This avoids issues with Deep Learning annotators and may improve performance in some scenarios (see the sketch after this list).
  • NerConverter's new whiteList Param allows a list of NER labels to be kept while discarding the rest. May be useful for selective CHUNKing pipelines.
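A short sketch of both new Params in a typical pipeline fragment. The setter names follow the usual Param convention and the label set is illustrative; verify against the docs:

```python
from sparknlp.annotator import SentenceDetector, NerConverter

# Cut off sentences longer than 240 characters (the new default)
sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setMaxLength(240)

# Keep only whitelisted NER labels when building CHUNK annotations
converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("chunk") \
    .setWhiteList(["PER", "LOC"])
```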

Enhancements

  • Pipelines using word embeddings should now perform faster, due to a group of RocksDB optimizations allowing annotators to reuse currently open connections to the DB

Bugfixes

  • Fixed a bug where DeepSentenceDetector was missing the load() interface (Thanks @Tshimanga from Deep6!)
  • Fixed a bug where RocksDB opened too many files at once causing pipelines to fail or to work very slowly
  • Fixed NerCrfModel prefetching RocksDB, which caused slower performance

Framework

  • Added missing artifact resolution dependencies for OCR Module
  • Started adding and organizing multi-language models (Thanks @maziyarpanahi)
  • Updated RocksDB to 5.17.2

John Snow Labs Spark-NLP 1.8.1: ML SentenceDetector, improved ContextSpellChecker and bugfixes

26 Jan 00:20

Overview

This hotfix version of Spark-NLP improves framework support by adding Maven coordinates for OCR and allowing S3 retrieval of files.
We also included code for generating graphs for NerDL, and for creating your own metadata files for a private model downloader.
As a new feature, we are including an experimental machine learning based sentence detector, which uses NER for bounds detection.
Aside from this, we are including a few bug fixes and OCR improvements. Enjoy! And thanks again for the community contributions!


New Features

  • New DeepSentenceDetector annotator takes Spark-NLP's NER Deep Learning models as a base to improve sentence detection

Enhancements

  • Improved accuracy of ContextSpellChecker by enabling re-ranking of candidate words according to a weighted Levenshtein distance
  • The OCR process now defaults to splitting content into rows when paragraphs or pages are identified, for improved parallelism. May be turned off.

Examples and use cases

  • Added Scala examples for sentiment analysis and the Lemmatizer in Italian (Thanks to Vincenzo Gaudenzi from DXC.technology for the dataset and model contribution!)

Bugfixes

  • Fixed a bug in the Norvig and Symmetric spell checkers where the pattern parameter was not provided properly on the Scala side (Thanks @johnmccain for reporting!)

Framework

  • Added hadoop-aws dependency for remote download capabilities (e.g. word embeddings sets)

Other

  • Code for generating the metadata files used by the pretrained model downloader is now included. This may be useful if anyone wants to set up their own private local model downloader service.
  • NerDL graph generation code is now included in the library. This allows the use of custom word embedding dimensions and feature counts.

Special mentions

  • Vincenzo Gaudenzi (DXC.technology) for contributing Italian datasets and models, and @maziyarpanahi for creating examples with them
  • @correlator from Deep6.ai for contributing feedback on Slack and feature feedback in general
  • @johnmccain for reporting bugs in the spell checker
  • @rohit-nlp for delivering Maven coordinates for OCR
  • @haimco10 for contributing a sentence detector improvement for the apostrophe use case (not merged due to specific issues involved)

John Snow Labs Spark-NLP 1.8.0: Dependency Parser, Context Spell Checker and Spark 2.4.0

23 Dec 06:16

Overview

This release is huge! Spark-NLP made the leap to Spark 2.4.0, even with the challenge of not everyone being on board yet (e.g. Zeppelin doesn't support it yet).
In this version we release three new NLP annotators: two for dependency parsing and one for contextual, deep learning based spell checking.
We also significantly improved OCR functionality, fine-tuning capabilities, and general output performance, particularly for Tesseract.
Finally, there are plenty of bug fixes and improvements around word embeddings, along with performance boosts and reduced disk IO.
Feel free to shoot us any feedback you have! Particularly on your Spark 2.4.x experience.


New Features

  • Built on top of Spark 2.4.0
  • Dependency Parser annotator allows for sentence relationship encoding
  • Typed Dependency Parser annotator allows for labeling relationships within dependency tags
  • ContextSpellChecker is our first Deep Learning based spell checker; it evaluates context, not only tokens

Enhancements

  • More OCR parameters exposed for further fine tuning, including preferred methods priority and page segmentation modes
  • OCR now has a setSplitPages() setting, which controls whether to output one page per row or the entire document in one row
  • Improved word embeddings performance when working in local filesystems
  • Reduced the amount of disk IO when working with Word Embeddings
  • All python notebooks improved for better readability and better documentation
  • Simplified PySpark interface API
  • Added the CoNLLGenerator utility class, which helps build CoNLL-2003 files for NER training
  • EmbeddingsHelper now allows reading word embeddings files directly from s3a:// paths

Bugfixes

  • Solved race-condition issues around cluster usage of the RocksDB index for embeddings
  • Fixed an application.conf reading bug which didn't properly refresh AWS credentials
  • The RocksDB index no longer uses compression, in order to support Windows without native RocksDB compression libraries
  • Fixed various Python default parameter settings
  • Fixed a circular dependency with jbig in pdfbox image OCR

Deprecations

  • DeIdentification annotator is no longer supported in the open source version of Spark-NLP
  • AssertionStatus annotator is no longer supported in the open source version of Spark-NLP

John Snow Labs Spark-NLP 1.7.3: Fixed cluster-mode word embeddings on pretrained and improved PySpark API

11 Nov 23:09

Overview

This hotfix release focuses on fixing word-embeddings cluster problems on some frameworks, such as Databricks, while keeping the 1.7.x performance benefits. Various YARN-based clusters have been tested, Databricks cloud among them, to validate this hotfix.
Aside from that, multiple improvements have been committed toward better support of PySpark-NLP, fixing diverse technical issues in the API that help consistency in annotators' superclasses.
Finally, pip installation has been made easier with a SparkNLP class that creates the SparkSession automatically, for those who are learning Python Spark on their local computers.
Thanks to the whole community for reporting issues.


Bugfixes

  • Fixed 'RocksDB not serializable' when running LightPipeline scenarios or using the _.functions implicits
  • Fixed a dependency on apache.commons.codec causing Apache Zeppelin 0.8.0 not to work in %pyspark
  • Fixed the Python pretrained() downloader not correctly setting Params and incorrectly creating new Model UIDs
  • Fixed the 'JavaPackage not callable' error when using the AnnotatorModel.load() API without instantiating the class first
  • Fixed Spark addFiles missing a local file, causing word embeddings not to work properly in some cluster-based frameworks
  • Fixed a broadcast NoSuchElementException (Failed to get broadcast_6_piece0 of broadcast_6) causing pretrained models not to work in cluster frameworks (thanks @EnricoMi)

Developer API

  • EmbeddingsHelper.setRef() has been removed. The reference is now set implicitly through EmbeddingsHelper.load(). Embeddings no longer need to be loaded before deserializing models.
  • Fixed and properly renamed the chunk2doc and doc2chunk transformers; they should now work as expected
  • Renamed setCompositeTokens to setCompositeTokensPatterns to remind users that regexes are used in this Param
  • Fixed PySpark automatic getter and setter Param generation when using pretrained() or load() models
  • Simplified cluster path resolution for word embeddings

Other

  • sparknlp.base now contains a SparkNLP() class which automatically creates a SparkSession using the appropriate jar settings. Helps newcomers get started in PySpark NLP.

John Snow Labs Spark-NLP 1.7.2: Cluster deserialization, application.conf runtime read fix, hotfixes

20 Oct 23:14

Overview

A quick release with another hotfix, due to a newly found bug when deserializing word embeddings on a distributed filesystem. It also introduces changes to the application.conf reader in order
to allow run-time changes, and renames parts of the EmbeddingsHelper API.


Bugfixes

  • Fixed embeddings deserialization from distributed filesystems (caused by the Windows path fix)
  • Fixed application.conf not picking up changes at runtime
  • Added the missing remote_locs argument in the Python pretrained() functions
  • Fixed a wrong build version introduced in 1.7.1, so the proper pretrained models version is detected

Developer API

  • Renamed EmbeddingsHelper functions for more convenience

John Snow Labs Spark-NLP 1.7.1: Word embeddings deserialization hotfix, windows path fix, Chunk2Doc transformer

19 Oct 22:33

Overview

Thanks to our Slack community (Bryan Wilkinson, @maziyarpanahi, @apiltamang), a few bugs were pointed out very quickly after the 1.7.0 release. This hotfix fixes an embeddings deserialization issue when cache_pretrained is located on a distributed filesystem.
It also fixes path resolution on Windows. Thanks to Maziyar, a .gitattributes file has been added in order to identify the proper languages on GitHub.
Finally, 1.7.1 adds Chunk2Doc, an annotator missing from 1.7.0, which converts CHUNK types into DOCUMENT types for further re-tokenization or other annotations.


Enhancements

  • Chunk2Doc annotator converts annotatorType from CHUNK to DOCUMENT

Bugfixes

  • Fixed embedding-based annotators' deserialization error when cache_pretrained is on a distributed fs (Thanks to Bryan Wilkinson for pointing out the issue and testing the fix)
  • Fixed Windows path reading when deserializing embeddings (Thanks @apiltamang)

Other

  • Added .gitattributes in order to properly discard Jupyter as the main language of the GitHub repo (thanks @maziyarpanahi)

John Snow Labs Spark-NLP 1.7.0: Decoupled word embeddings, better windows support

16 Oct 05:49

Overview

Having multiple annotators that use the same word embeddings set may result in huge pipelines and heavy driver memory and storage consumption.
From now on, embeddings may be shared and reutilized across annotators, making the process much more efficient.
Also, thanks to @apiltamang, we now better support path resolution for Windows implementations.


Enhancements

  • Memory and storage savings: annotators with embeddings now expose the params includeEmbeddings and embeddingsRef, which set whether embeddings should be included when the annotator is saved, or referenced by ID from other annotators (see the sketch below)
  • The EmbeddingsHelper class allows embeddings management
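A minimal sketch of how these params might be used, assuming conventional setter names derived from the param names above; the annotator and the reference name are illustrative:

```python
from sparknlp.annotator import NerDLModel

# Don't embed the vectors in the saved model; reference a shared
# embeddings set by ID instead (setter names are assumptions).
ner = NerDLModel.pretrained() \
    .setIncludeEmbeddings(False) \
    .setEmbeddingsRef("shared_glove")
```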


Bug fixes

Thanks to @apiltamang for improving URI path support for Windows Servers


Developer API

Embeddings interfaces and method names completely refactored, hopefully simplified and easier to understand

John Snow Labs Spark-NLP 1.6.3: DeIdentification annotator, better OCR and bugfixes

17 Sep 17:45

Overview

This release includes a new annotator for de-identification of sensitive information. It uses CHUNK annotations, meaning its accuracy depends on the annotators earlier in the pipeline.
OCR capabilities have also been improved in the OCR module.
In terms of broken stuff, we've fixed a few annoying bugs in SymmetricDelete and the SentenceDetector explode feature.
Finally, the package is now part of the official pip repositories, meaning you can install it just like any other module. It also includes the jars, and we've added a SparkNLP class which creates the SparkSession easily for you.
Thanks again for all the community contributions in issues, feedback, and comments, on GitHub and on Slack.


New features

  • DeIdentification annotator: takes DOCUMENT and TOKEN from the original sentence, plus a CHUNK annotation, and anonymizes the target chunk in the sentence. The CHUNK annotation might come from NerConverter, TextMatcher, or other chunk annotators.

Enhancements

  • OCR: kernel zoom and region erosion improve overall detection quality. Fixed some stability bugs. Improved parallelism.

Bug fixes

  • SentenceDetector's explode-sentences-into-rows feature now works properly
  • Fixed the dictionary-based sentiment detector not working on PySpark
  • Added the missing NerConverter to annotator._ imports
  • Fixed the SymmetricDelete spell checker deleting tokens in some scenarios
  • Fixed the SymmetricDelete spell checker's unwanted lower-casing

Other

  • The PySpark pip package is now part of the official pip repositories
  • Pip installation now includes the corresponding spark-nlp jar; the base module includes the SparkNLP SparkSession creator