
Releases: JohnSnowLabs/spark-nlp

John Snow Labs Spark-NLP 1.3.0: Better tokenizer, assertion status annotator, and more

27 Jan 23:01


IMPORTANT: Pipelines from 1.2.6 or older cannot be loaded from 1.3.0

We are happy to announce a big release this time. 1.3.0 includes a brand new annotator for assertion status and an improved tokenizer, along with many enhancements and performance improvements across the library.


New features

  • #94
    The Tokenizer annotator has been revamped. It now follows standard NLP rules, matching over 90% of Stanford NLP tokens.
    The annotator supports more complex rules: custom composite words can be registered as exceptions (e.g. so that "New York" is not split),
    and custom prefix, infix, suffix and breaking rules can be set. It uses regular expression groups to match several token candidates per target word.
    Defaults have been updated to be language agnostic and to support foreign characters from the Unicode charset.
  • #93
    Assertion Status. This annotator identifies negated sequences within a target scope. It is a machine learning
    annotator and relies on a set of word embeddings; a sample set is provided as part of our Python notebook examples.
  • #90
    Recursive Pipelines. We have created our own Pipeline class that takes better advantage of Spark-NLP annotators.
    Although this Pipeline is completely optional and the annotators work well with the default Apache Spark estimators and transformers, it allows
    training our annotators more efficiently: annotator approaches gain access to the previous state of the Pipeline
    and can use it to tokenize or transform their own external content. Using such Pipelines is recommended.
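
The composite-word idea behind the revamped Tokenizer can be sketched in plain Scala. All names here are hypothetical illustrations, not the library's API: composite words are protected before a simplified regex-group split, then restored afterwards.

```scala
object TokenizerSketch {
  // Hypothetical composite-word exceptions; the real annotator takes these as params.
  val compositeWords = Seq("New York")

  // Simplified rule: one regex group covering word runs or single punctuation marks.
  private val tokenPattern = """(\w+|[^\w\s])""".r

  def tokenize(text: String): Seq[String] = {
    // Protect composite words so the splitter cannot break them apart.
    // (Using '_' as placeholder is a sketch-level simplification.)
    val protectedText = compositeWords.foldLeft(text) { (t, w) =>
      t.replace(w, w.replace(' ', '_'))
    }
    tokenPattern.findAllMatchIn(protectedText)
      .map(_.group(1).replace('_', ' ')) // restore the protected space
      .toSeq
  }
}
```

With this sketch, `TokenizerSketch.tokenize("I love New York.")` keeps "New York" as a single token while still splitting off the final period.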

Enhancements

  • #83
    Part-of-speech training has been improved in both performance and quality, and now makes better use of the input corpus.
    New params give more control over training: corpusFormat selects whether training data is read as a Dataset or as raw text files,
    and corpusLimit caps the number of files read when a folder is provided.
  • #84
    Thanks to @lambdaofgod, the Normalizer can now optionally lowercase tokens
  • Thanks to Lorenz Bernauer, the Normalizer default pattern is now language agnostic and no longer breaks Unicode characters such as Spanish or German letters
  • Features now have appropriate default values, which are lazy by nature and computed only once upon request. As a side effect, this improves Lemmatizer performance.
  • RuleFactory (a regex rule factory) performance has been improved: it now uses a factory pattern and no longer re-checks its strategy on every transformation at run time.
    This should have positive side effects in SentenceDetector, DateMatcher and RegexMatcher, which use this class extensively.
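
The lazy-default idea can be illustrated with a small stand-in class (hypothetical names; the library's actual Feature container differs): the default value is computed at most once, on first request, rather than at construction time.

```scala
// Sketch of a lazily-defaulted feature: `default` is a by-name parameter,
// so building the value (e.g. loading a lemma dictionary) is deferred until
// getOrDefault is first accessed, and then cached by the lazy val.
class LazyFeature[T](default: => T) {
  var computations = 0 // instrumentation for this sketch only
  lazy val getOrDefault: T = { computations += 1; default }
}
```

A Lemmatizer-style consumer would thus never pay for a default dictionary it does not read, and would pay at most once if it does.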

Class Renames

RegexTokenizer -> Tokenizer (it is not just regex anymore)
SentenceDetectorModel -> SentenceDetector (it is not a model, it is a rule-based algorithm)
SentimentDetectorModel -> SentimentDetector (it is not a model, it is a rule-based algorithm)


User Utilities

  • ResourceHelper has a createDatasetFromText function that lets the user more
    easily read one or multiple text files from a path into a Dataset, with various options,
    including filename-per-row or per-file aggregation. This class deserves wider use,
    since it helps with parsing local files. It shall be better documented.
  • com.johnsnowlabs.util now contains a Benchmark class that makes it easy to measure the running time of
    any function, used as Benchmark.time("Description of measured") {someFunction()}
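
A minimal sketch of a helper with that call shape (the actual com.johnsnowlabs.util.Benchmark implementation may differ in details such as output format):

```scala
object Benchmark {
  // Runs a block, prints elapsed wall-clock time with a description,
  // and returns the block's result so the call can be used inline.
  def time[T](description: String)(f: => T): T = {
    val start = System.nanoTime()
    val result = f
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"$description: $elapsedMs%.2f ms")
    result
  }
}
```

Because the measured block's result is returned, an existing expression such as `val tokens = tokenize(text)` can be wrapped as `val tokens = Benchmark.time("tokenize") { tokenize(text) }` without restructuring the surrounding code.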

Developer API

  • https://github.com/JohnSnowLabs/spark-nlp/pull/89/files
    Word embedding traits have been generalized. Now any annotator that wants to use them can easily access their properties
  • Recursive pipelines now allow injecting a PipelineModel object into the train() stage as an optional parameter. If the user
    utilizes a RecursivePipeline, the annotator may use this pipeline to transform secondary data inputs.
  • The Annotator abstract class has been split: a new parent RawAnnotator class contains all annotator properties
    and validations but does not use the annotate() function. This allows annotators that need to work directly with
    the transform() call while still participating alongside other annotators in the pipeline

Bugfixes

  • Fixed a bug in annotators with word embeddings not correctly serializing to disk
  • Fixed a bug creating temporary folders in home folder
  • Fixed a broken geospatial pattern in sentence detection

John Snow Labs Spark-NLP 1.2.6: Improved Serialization Performance

12 Jan 04:45

Enhancements

  • #82
    Vivekn sentiment analysis has improved memory consumption and training performance.
    The pruneCorpus parameter is now adjustable and defaults to 1; higher values lead to better performance
    but are meant for larger corpora. tokenPattern params allow different tokenization regexes
    for the corpora provided to the Vivekn and Norvig models.
  • #81
    Serialization improvements. The new default format is RDD object files (the parquet default was short-lived), which proved lighter on
    heap memory. Feature containers also gained lazier default values. New application.conf performance tuning
    settings let you customize whether Features are broadcast or not, and whether serialization uses parquet or object files.

John Snow Labs Spark-NLP 1.2.5

08 Jan 22:11

Note: Pipelines from 1.2.4 or older cannot be loaded from 1.2.5

New features

  • #70
    Word embeddings parameter for CRF NER annotator
  • #78
    Annotator Features replace Spark Params and are now serialized using Kryo and partitioned parquet files. This increases performance and reduces Driver memory consumption when saving and loading pipelines with large corpora. Such Features are now also broadcast for better performance in distributed environments. This enhancement is a breaking change and does not allow loading older pipelines.

Bug fixes

  • cb9aa43
    Stemmer could not be deserialized (it now implements DefaultParamsReadable)
  • #75
    Sentence Boundary detector was not properly setting bounds

Documentation (thanks @maziyarpanahi)

  • #79
    Typo in code
  • #74
    Bad description

John Snow Labs Spark-NLP 1.2.4

23 Dec 07:07

New features

  • c17ddac
    ResourceHelper now allows input files to be read as a Spark Dataset, implicitly enabling HDFS paths and larger annotator input files. Set 'TXTDS' as the input format Param to have annotators read this way. Allowed in: Lemmatizer, EntityExtractor, RegexMatcher, sentiment analysis models, spell checker and dependency parser.

Enhancements and progress

  • #64
    EntityExtractor has been refactored. This annotator uses an input file containing a list of entities to look for inside the target text. The refactor makes it easier to use and notably faster, through a trie search algorithm. Proper examples are included in the Python notebooks.
  • 4920e5c
    CRF NER Benchmarking progress. CRF NER Documentation and official release coming soon
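
The token-level trie search behind fast multi-word entity lookup can be sketched in plain Scala. This is an illustrative sketch with hypothetical names, not the library's implementation: each trie edge is a token, and matching walks the trie from a given position, remembering the longest terminal node reached.

```scala
import scala.collection.mutable

// Token-level trie: insert entity phrases once, then match any position in
// a tokenized sentence in time proportional to the longest entity.
class TokenTrie {
  private val children = mutable.Map.empty[String, TokenTrie]
  private var terminal = false

  def insert(tokens: Seq[String]): Unit =
    if (tokens.isEmpty) terminal = true
    else children.getOrElseUpdate(tokens.head, new TokenTrie).insert(tokens.tail)

  // Longest entity match starting at position i, returned as the exclusive
  // end index, or None when no entity starts here.
  def matchAt(tokens: IndexedSeq[String], i: Int): Option[Int] = {
    var node = this
    var j = i
    var best: Option[Int] = None
    while (j < tokens.length && node.children.contains(tokens(j))) {
      node = node.children(tokens(j))
      j += 1
      if (node.terminal) best = Some(j)
    }
    best
  }
}
```

Usage: after `insert(Seq("New", "York"))` and `insert(Seq("New", "York", "City"))`, matching at the position of "New" in "in New York City" returns the longer span, which is why a trie beats scanning the entity list per token.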

Bug fixes

  • Issue #41 <> d3b9086
    Fixed default resources not being loaded properly when using the library through --spark-packages. Improved input reading from resources and folder resources, and falling back to disk, with better error handling.
  • 0840585
    Corrected param names in DocumentAssembler
  • Issue #58 <> 5a53395
    Deleted a left-over deprecated function which was misleading.
  • c02591b
    Added a filtering to ensure no empty sentences arrive to unnormalized Vivekn Sentiment Analysis

Documentation and examples

  • b81e95c
    Added additional resources into FAQ page.
  • 0c3f43c
    Added Spark Submit example notebook with full Pipeline use case
  • Issue #53 <> 20efe4a
    Fixed mistakes in the Scala and Python documentation
  • 782eb8d
    Typos fix

Other

  • 91d8acb
    Removed Regex NER due to slowness and little use; CRF NER replaces it.