
Releases: JohnSnowLabs/spark-nlp

John Snow Labs Spark-NLP 1.3.0: Better tokenizer, assertion status annotator, and more

27 Jan 23:01


IMPORTANT: Pipelines from 1.2.6 or older cannot be loaded from 1.3.0

We are happy to announce a big release this time. 1.3.0 includes a brand new annotator for assertion status and an improved tokenizer, along with many enhancements and performance improvements across the library.


New features

  • #94
    The Tokenizer annotator has been revamped. It now follows standard NLP rules, matching over 90% of Stanford NLP tokens.
    The annotator supports more complex rules: custom composite words can be registered as exceptions (e.g. so that "New York" is not split),
    and custom prefix, infix, suffix and breaking rules can be set. It uses regular expression groups to match several token candidates per target word.
    Defaults have been updated to be language agnostic and to support foreign characters from the Unicode charset.
  • #93
    Assertion Status. This annotator identifies negated sequences within a target scope. It is a machine learning
    annotator and relies on a set of word embeddings; a sample set is provided as part of our Python notebook examples.
  • #90
    Recursive Pipelines. We have created our own Pipeline class that takes better advantage of Spark-NLP annotators.
    Although this Pipeline is completely optional and the annotators work well with the default Apache Spark estimators and transformers, it allows
    training our annotators more efficiently: annotator approaches gain access to the previous state of the Pipeline
    and can use it to tokenize or transform their own external content. Using such Pipelines is recommended.
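
The composite-word idea behind the revamped Tokenizer can be sketched in plain Scala. All names here are hypothetical illustrations, not the library's API: composite words are protected before a simplified regex-group split, then restored afterwards.

```scala
object TokenizerSketch {
  // Hypothetical composite-word exceptions; the real annotator takes these as params.
  val compositeWords = Seq("New York")

  // Simplified rule: one regex group covering word runs or single punctuation marks.
  private val tokenPattern = """(\w+|[^\w\s])""".r

  def tokenize(text: String): Seq[String] = {
    // Protect composite words so the splitter cannot break them apart.
    // (Using '_' as placeholder is a sketch-level simplification.)
    val protectedText = compositeWords.foldLeft(text) { (t, w) =>
      t.replace(w, w.replace(' ', '_'))
    }
    tokenPattern.findAllMatchIn(protectedText)
      .map(_.group(1).replace('_', ' ')) // restore the protected space
      .toSeq
  }
}
```

With this sketch, `TokenizerSketch.tokenize("I love New York.")` keeps "New York" as a single token while still splitting off the final period.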

Enhancements

  • #83
    Part-of-speech training has been improved in both performance and quality, and now makes better use of the input corpus.
    New params give more control over training: corpusFormat selects whether training data is read as a Dataset or as raw text files,
    and corpusLimit caps the number of files read when a folder is provided.
  • #84
    Thanks to @lambdaofgod, the Normalizer can now optionally lowercase tokens
  • Thanks to Lorenz Bernauer, the Normalizer default pattern is now language agnostic and no longer breaks Unicode characters such as Spanish or German letters
  • Features now have appropriate default values, which are lazy by nature and computed only once upon request. As a side effect, this improves Lemmatizer performance.
  • RuleFactory (a regex rule factory) performance has been improved: it now uses a factory pattern and no longer re-checks its strategy on every transformation at run time.
    This should have positive side effects in SentenceDetector, DateMatcher and RegexMatcher, which use this class extensively.
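
The lazy-default idea can be illustrated with a small stand-in class (hypothetical names; the library's actual Feature container differs): the default value is computed at most once, on first request, rather than at construction time.

```scala
// Sketch of a lazily-defaulted feature: `default` is a by-name parameter,
// so building the value (e.g. loading a lemma dictionary) is deferred until
// getOrDefault is first accessed, and then cached by the lazy val.
class LazyFeature[T](default: => T) {
  var computations = 0 // instrumentation for this sketch only
  lazy val getOrDefault: T = { computations += 1; default }
}
```

A Lemmatizer-style consumer would thus never pay for a default dictionary it does not read, and would pay at most once if it does.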

Class Renames

RegexTokenizer -> Tokenizer (it is not just regex anymore)
SentenceDetectorModel -> SentenceDetector (it is not a model, it is a rule-based algorithm)
SentimentDetectorModel -> SentimentDetector (it is not a model, it is a rule-based algorithm)


User Utilities

  • ResourceHelper has a createDatasetFromText function that lets the user more
    easily read one or multiple text files from a path into a Dataset, with various options,
    including filename-per-row or per-file aggregation. This class deserves wider use,
    since it helps with parsing local files. It shall be better documented.
  • com.johnsnowlabs.util now contains a Benchmark class that makes it easy to measure the running time of
    any function, used as Benchmark.time("Description of measured") {someFunction()}
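
A minimal sketch of a helper with that call shape (the actual com.johnsnowlabs.util.Benchmark implementation may differ in details such as output format):

```scala
object Benchmark {
  // Runs a block, prints elapsed wall-clock time with a description,
  // and returns the block's result so the call can be used inline.
  def time[T](description: String)(f: => T): T = {
    val start = System.nanoTime()
    val result = f
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"$description: $elapsedMs%.2f ms")
    result
  }
}
```

Because the measured block's result is returned, an existing expression such as `val tokens = tokenize(text)` can be wrapped as `val tokens = Benchmark.time("tokenize") { tokenize(text) }` without restructuring the surrounding code.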

Developer API

  • https://github.com/JohnSnowLabs/spark-nlp/pull/89/files
    Word embedding traits have been generalized. Now any annotator that wants to use them can easily access their properties
  • Recursive pipelines now allow injecting a PipelineModel object into the train() stage as an optional parameter. If the user
    utilizes a RecursivePipeline, the annotator may use this pipeline to transform secondary data inputs.
  • The Annotator abstract class has been split: a new parent RawAnnotator class contains all annotator properties
    and validations but does not use the annotate() function. This allows annotators that need to work directly with
    the transform() call while still participating alongside other annotators in the pipeline

Bugfixes

  • Fixed a bug in annotators with word embeddings not correctly serializing to disk
  • Fixed a bug creating temporary folders in home folder
  • Fixed a broken geospatial pattern in sentence detection

John Snow Labs Spark-NLP 1.2.6: Improved Serialization Performance

12 Jan 04:45

Enhancements

  • #82
    Vivekn sentiment analysis has improved memory consumption and training performance.
    The pruneCorpus parameter is now adjustable and defaults to 1; higher values lead to better performance
    but are meant for larger corpora. tokenPattern params allow different tokenization regexes
    for the corpora provided to the Vivekn and Norvig models.
  • #81
    Serialization improvements. The new default format is RDD object files (the parquet default was short-lived), which proved lighter on
    heap memory. Feature containers also gained lazier default values. New application.conf performance tuning
    settings let you customize whether Features are broadcast or not, and whether serialization uses parquet or object files.

John Snow Labs Spark-NLP 1.2.5

08 Jan 22:11

Note: Pipelines from 1.2.4 or older cannot be loaded from 1.2.5

New features

  • #70
    Word embeddings parameter for CRF NER annotator
  • #78
    Annotator Features replace Spark Params and are now serialized using Kryo and partitioned parquet files. This increases performance and reduces Driver memory consumption when saving and loading pipelines with large corpora. Such Features are now also broadcast for better performance in distributed environments. This enhancement is a breaking change and does not allow loading older pipelines.

Bug fixes

  • cb9aa43
    Stemmer could not be deserialized (it now implements DefaultParamsReadable)
  • #75
    Sentence Boundary detector was not properly setting bounds

Documentation (thanks @maziyarpanahi)

  • #79
    Typo in code
  • #74
    Bad description

John Snow Labs Spark-NLP 1.2.4

23 Dec 07:07

New features

  • c17ddac
    ResourceHelper now allows input files to be read as a Spark Dataset, implicitly enabling HDFS paths and larger annotator input files. Set 'TXTDS' as the input format Param to have annotators read this way. Allowed in: Lemmatizer, EntityExtractor, RegexMatcher, sentiment analysis models, spell checker and dependency parser.

Enhancements and progress

  • #64
    EntityExtractor has been refactored. This annotator uses an input file containing a list of entities to look for inside the target text. The refactor makes it easier to use and notably faster, through a trie search algorithm. Proper examples are included in the Python notebooks.
  • 4920e5c
    CRF NER Benchmarking progress. CRF NER Documentation and official release coming soon
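
The token-level trie search behind fast multi-word entity lookup can be sketched in plain Scala. This is an illustrative sketch with hypothetical names, not the library's implementation: each trie edge is a token, and matching walks the trie from a given position, remembering the longest terminal node reached.

```scala
import scala.collection.mutable

// Token-level trie: insert entity phrases once, then match any position in
// a tokenized sentence in time proportional to the longest entity.
class TokenTrie {
  private val children = mutable.Map.empty[String, TokenTrie]
  private var terminal = false

  def insert(tokens: Seq[String]): Unit =
    if (tokens.isEmpty) terminal = true
    else children.getOrElseUpdate(tokens.head, new TokenTrie).insert(tokens.tail)

  // Longest entity match starting at position i, returned as the exclusive
  // end index, or None when no entity starts here.
  def matchAt(tokens: IndexedSeq[String], i: Int): Option[Int] = {
    var node = this
    var j = i
    var best: Option[Int] = None
    while (j < tokens.length && node.children.contains(tokens(j))) {
      node = node.children(tokens(j))
      j += 1
      if (node.terminal) best = Some(j)
    }
    best
  }
}
```

Usage: after `insert(Seq("New", "York"))` and `insert(Seq("New", "York", "City"))`, matching at the position of "New" in "in New York City" returns the longer span, which is why a trie beats scanning the entity list per token.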

Bug fixes

  • Issue #41 <> d3b9086
    Fixed default resources not being loaded properly when using the library through --spark-packages. Improved input reading from resources and folder resources, and falling back to disk, with better error handling.
  • 0840585
    Corrected param names in DocumentAssembler
  • Issue #58 <> 5a53395
    Deleted a left-over deprecated function which was misleading.
  • c02591b
    Added a filtering to ensure no empty sentences arrive to unnormalized Vivekn Sentiment Analysis

Documentation and examples

  • b81e95c
    Added additional resources into FAQ page.
  • 0c3f43c
    Added Spark Submit example notebook with full Pipeline use case
  • Issue #53 <> 20efe4a
    Fixed mistakes in the Scala and Python documentation
  • 782eb8d
    Typos fix

Other

  • 91d8acb
    Removed Regex NER due to slowness and little use; CRF NER replaces it.