
Commit 454610b

Merge pull request #334 from JohnSnowLabs/180-release-candidate-proper
Release Candidate 1.8.0
2 parents eeff5c1 + ee9af37

File tree

11 files changed

+257
-205
lines changed


CHANGELOG

Lines changed: 47 additions & 0 deletions
@@ -1,3 +1,50 @@
+========
+1.8.0
+========
+---------------
+Overview
+---------------
+This release is huge! Spark-NLP made the leap onto Spark 2.4.0, even with the challenge of not everyone being on board there yet (e.g. Zeppelin doesn't support it yet).
+In this version we release three new NLP annotators: two for dependency parsing and one for contextual, deep-learning-based spell checking.
+We also significantly improved OCR functionality, fine-tuning capabilities and general output performance, particularly on Tesseract.
+Finally, there are plenty of bug fixes and improvements around word embeddings, along with performance boosts and reduced disk IO.
+Feel free to send us any feedback you have, particularly about your Spark 2.4.x experience.
+
+---------------
+New Features
+---------------
+* Built on top of Spark 2.4.0
+* Dependency Parser annotator allows for sentence relationship encoding
+* Typed Dependency Parser annotator allows for labeling relationships within dependency tags
+* ContextSpellChecker is our first deep-learning-based spell checker that evaluates context, not just tokens
+
+---------------
+Enhancements
+---------------
+* More OCR parameters exposed for further fine-tuning, including preferred-method priority and page segmentation modes
+* OCR now has a setSplitPages() setting which controls whether to output one page per row or the entire document in a single row
+* Improved word embeddings performance when working on local filesystems
+* Reduced the amount of disk IO when working with word embeddings
+* All Python notebooks improved for better readability and documentation
+* Simplified PySpark interface API
+* CoNLLGenerator utility class, which helps build CoNLL-2003 files for NER training
+* EmbeddingsHelper now allows reading word embeddings files directly from s3a:// paths
+
+---------------
+Bugfixes
+---------------
+* Solved race-condition issues around cluster usage of the RocksDB index for embeddings
+* Fixed an application.conf reading bug which didn't properly refresh AWS credentials
+* RocksDB index no longer uses compression, in order to support Windows without native RocksDB compression libraries
+* Fixed various Python default parameter settings
+* Fixed circular dependency with jbig pdfbox image OCR
+
+---------------
+Deprecations
+---------------
+* DeIdentification annotator is no longer supported in the open source version of Spark-NLP
+* AssertionStatus annotator is no longer supported in the open source version of Spark-NLP
+
 ========
 1.7.3
 ========
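The CoNLLGenerator entry above targets the CoNLL-2003 training format: one token per line with POS, chunk and NER columns, sentences separated by blank lines. A minimal plain-Python sketch of that layout (the `write_conll` helper is hypothetical, not part of the Spark-NLP API):

```python
# Sketch of the CoNLL-2003 layout used for NER training.
# write_conll is an illustrative helper, NOT Spark-NLP's CoNLLGenerator.
# Each token row is: TOKEN POS CHUNK NER-TAG.
def write_conll(sentences):
    lines = ["-DOCSTART- -X- -X- O", ""]  # document start marker
    for sentence in sentences:
        for token, pos, chunk, ner in sentence:
            lines.append(f"{token} {pos} {chunk} {ner}")
        lines.append("")  # blank line separates sentences
    return "\n".join(lines)

sample = [[("John", "NNP", "B-NP", "B-PER"),
           ("lives", "VBZ", "B-VP", "O"),
           ("in", "IN", "B-PP", "O"),
           ("London", "NNP", "B-NP", "B-LOC")]]
print(write_conll(sample))
```

A file in this shape can be consumed by CoNLL-style NER training readers.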

README.md

Lines changed: 31 additions & 30 deletions
@@ -9,7 +9,9 @@ Take a look at our official spark-nlp page: http://nlp.johnsnowlabs.com/ for use
 Questions? Feedback? Request access sending an email to [email protected]
 
 # Apache Spark Support
-As of *1.7.x* Spark-NLP does _NOT_ yet work with Spark 2.4.x
+Spark-NLP *1.8.0* has been built on top of Apache Spark 2.4.0
+
+Note that Spark 2.4.0 is not backwards compatible with Spark 2.3.x, so models and environments might not work
 
 # Usage
 
@@ -20,18 +22,18 @@ This library has been uploaded to the spark-packages repository https://spark-pa
 
 Benefit of spark-packages is that it makes the library available for both Scala/Java and Python
 
-To use the most recent version just add `--packages JohnSnowLabs:spark-nlp:1.7.3` to your spark command
+To use the most recent version just add `--packages JohnSnowLabs:spark-nlp:1.8.0` to your spark command
 
 ```sh
-spark-shell --packages JohnSnowLabs:spark-nlp:1.7.3
+spark-shell --packages JohnSnowLabs:spark-nlp:1.8.0
 ```
 
 ```sh
-pyspark --packages JohnSnowLabs:spark-nlp:1.7.3
+pyspark --packages JohnSnowLabs:spark-nlp:1.8.0
 ```
 
 ```sh
-spark-submit --packages JohnSnowLabs:spark-nlp:1.7.3
+spark-submit --packages JohnSnowLabs:spark-nlp:1.8.0
 ```
 
 ### offline mode using jars
@@ -43,14 +45,14 @@ Use either one of the following options
 
 * Add the following Maven Coordinates to the interpreter's library list
 ```
-com.johnsnowlabs.nlp:spark-nlp_2.11:1.7.3
+com.johnsnowlabs.nlp:spark-nlp_2.11:1.8.0
 ```
 * Add the path to the pre-built jar from [here](#pre-compiled-spark-nlp-and-spark-nlp-ocr) to the interpreter's library list, making sure the jar is available to the driver path
 
 ### Python in Zeppelin
 Apart from the previous step, install the Python module through pip
 ```
-pip install spark-nlp==1.7.3
+pip install spark-nlp==1.8.0
 ```
 Configure Zeppelin properly, using cells with %spark.pyspark or whichever interpreter name you chose.
 
@@ -61,7 +63,7 @@ An alternative option would be to set `SPARK_SUBMIT_OPTIONS` (zeppelin-env.sh) a
 ## Python without explicit Spark installation
 If you installed pyspark through pip, you can install spark-nlp through pip as well
 ```
-pip install spark-nlp==1.7.3
+pip install spark-nlp==1.8.0
 ```
 Then you'll have to create a SparkSession manually, for example:
 ```
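The SparkSession code block itself is truncated in this hunk. As a sketch only (the master, driver-memory and app-name values below are illustrative assumptions, not taken from the diff), manual creation that pulls in the spark-nlp package typically looks like:

```python
# Sketch: manually building a SparkSession with the spark-nlp 1.8.0 package.
# local[*] master and 6g driver memory are assumed values; adjust for your setup.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-nlp-example")
         .master("local[*]")
         .config("spark.driver.memory", "6g")
         .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:1.8.0")
         .getOrCreate())
```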
@@ -87,7 +89,7 @@ export PYSPARK_PYTHON=python3
 export PYSPARK_DRIVER_PYTHON=jupyter
 export PYSPARK_DRIVER_PYTHON_OPTS=notebook
 
-pyspark --packages JohnSnowLabs:spark-nlp:1.7.3
+pyspark --packages JohnSnowLabs:spark-nlp:1.8.0
 ```
 
 Alternatively, you can mix in using the `--jars` option for pyspark + `pip install spark-nlp`
@@ -112,12 +114,12 @@ sparknlp {
 ```
 
 ## Pre-compiled Spark-NLP and Spark-NLP-OCR
-You may download fat-jar from here:
-[Spark-NLP 1.7.3 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-1.7.3.jar)
-or non-fat from here
-[Spark-NLP 1.7.3 PKG JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp_2.11-1.7.3.jar)
+Spark-NLP FAT-JAR from here (does NOT include Spark):
+[Spark-NLP 1.8.0 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-1.8.0.jar)
+Spark-NLP GPU-enhanced TensorFlow FAT-JAR:
+[Spark-NLP 1.8.0-gpu FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-1.8.0-gpu.jar)
 Spark-NLP-OCR Module (requires native Tesseract 4.x+ for image-based OCR; does not require Spark-NLP to work but highly suggested)
-[Spark-NLP-OCR 1.7.3 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-ocr-assembly-1.7.3.jar)
+[Spark-NLP-OCR 1.8.0 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-ocr-assembly-1.8.0.jar)
 
 ## Maven central
 
@@ -129,19 +131,19 @@ Our package is deployed to maven central. In order to add this package as a depe
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp_2.11</artifactId>
-    <version>1.7.3</version>
+    <version>1.8.0</version>
 </dependency>
 ```
 
 #### SBT
 ```sbtshell
-libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.7.3"
+libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.8.0"
 ```
 
 If you are using `scala 2.11`
 
 ```sbtshell
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.7.3"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.8.0"
 ```
 
 ## Using the jar manually
@@ -162,22 +164,21 @@ The preferred way to use the library when running spark programs is using the `-
 
 If you have trouble using pretrained() models in your environment, here is a list of the various models (only valid for the latest versions).
 If a model listed here is older than the current version, it still works with current versions.
-### Updated for 1.7.3
+### Updated for 1.8.0
 ### Pipelines
-* [Basic Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_basic_en_1.6.1_2_1533856444797.zip)
-* [Advanced Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_advanced_en_1.7.0_2_1539460910585.zip)
-* [Vivekn Sentiment Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_vivekn_en_1.6.2_2_1534781342094.zip)
+* [Basic Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_basic_en_1.8.0_2.4_1545435998968.zip)
+* [Advanced Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_advanced_en_1.8.0_2.4_1545436028146.zip)
+* [Vivekn Sentiment Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_vivekn_en_1.8.0_2.4_1545436008101.zip)
 
 ### Models
-* [PerceptronModel (POS)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_fast_en_1.6.1_2_1533853928168.zip)
-* [ViveknSentimentModel (Sentiment)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/vivekn_fast_en_1.6.2_2_1534781337758.zip)
-* [SymmetricDeleteModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spell_sd_fast_en_1.6.2_2_1534781178138.zip)
-* [ContextSpellCheckerModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/context_spell_gen_en_1.7.0_2_1544041161062.zip)
-* [NorvigSweetingModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spell_fast_en_1.6.2_2_1534781328404.zip)
-* [NerCRFModel (NER)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_fast_en_1.7.0_2_1539896043754.zip)
-* [NerDLModel (NER)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_precise_en_1.7.0_2_1539623388047.zip)
-* [LemmatizerModel (Lemmatizer)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_fast_en_1.6.1_2_1533854538211.zip)
-* [AssertionDLModel (Assertion)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/as_fast_dl_en_1.7.0_2_1539653960749.zip)
+* [PerceptronModel (POS)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_fast_en_1.8.0_2.4_1545434653742.zip)
+* [ViveknSentimentModel (Sentiment)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/vivekn_fast_en_1.8.0_2.4_1545435741623.zip)
+* [SymmetricDeleteModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spell_sd_fast_en_1.8.0_2.4_1545435558025.zip)
+* ContextSpellCheckerModel (Spell Checker)
+* [NorvigSweetingModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spell_fast_en_1.8.0_2.4_1545435732032.zip)
+* [NerCRFModel (NER)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_fast_en_1.8.0_2.4_1545435254745.zip)
+* [NerDLModel (NER)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_precise_en_1.8.0_2.4_1545439567330.zip)
+* [LemmatizerModel (Lemmatizer)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_fast_en_1.8.0_2.4_1545435317864.zip)
 ``
 # FAQ
 [Check our Articles and FAQ page here](https://nlp.johnsnowlabs.com/articles.html)

build.sbt

Lines changed: 2 additions & 2 deletions
@@ -9,7 +9,7 @@ name := "spark-nlp"
 
 organization := "com.johnsnowlabs.nlp"
 
-version := "1.7.3"
+version := "1.8.0"
 
 scalaVersion in ThisBuild := scalaVer
 
@@ -151,7 +151,7 @@ assemblyMergeStrategy in assembly := {
 lazy val ocr = (project in file("ocr"))
   .settings(
     name := "spark-nlp-ocr",
-    version := "1.7.3",
+    version := "1.8.0",
     libraryDependencies ++= ocrDependencies ++
       analyticsDependencies ++
       testDependencies,

docs/index.html

Lines changed: 1 addition & 2 deletions
@@ -78,8 +78,7 @@ <h2 class="title">High Performance NLP with Apache Spark </h2>
 </p>
 <a class="btn btn-info btn-cta" style="float: center;margin-top: 10px;" href="mailto:[email protected]?subject=SparkNLP%20Slack%20access" target="_blank"> Questions? Join our Slack</a>
 <b/><p/><p/>
-<p><span class="label label-warning">2018 Nov 11st - Update!</span> 1.7.3 Released! Word embeddings decoupled from annotators, better Windows and improved cluster support</p>
-<p><span class="label label-danger">Apache Spark 2.4.x not yet supported</span></p>
+<p><span class="label label-warning">2018 Nov 21st - Update!</span> 1.8.0 Released! Dependency Parser, new Spell Checker, Spark 2.4.0, performance boosts and more!</p>
 </div>
 <div id="cards-wrapper" class="cards-wrapper row">
 <div class="item item-green col-md-4 col-sm-6 col-xs-6">

docs/notebooks.html

Lines changed: 7 additions & 7 deletions
@@ -103,7 +103,7 @@ <h4 id="scala-vivekn-notebook" class="section-block"> Vivekn Sentiment Analysis<
 Since we are dealing with small amounts of data, we put in practice LightPipelines.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.7.3/example/src/TrainViveknSentiment.scala" target="_blank"> Take me to code!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.8.0/example/src/TrainViveknSentiment.scala" target="_blank"> Take me to code!</a>
 </p>
 </div>
 </section>
@@ -135,7 +135,7 @@ <h4 id="vivekn-notebook" class="section-block"> Vivekn Sentiment Analysis</h4>
 better Sentiment Analysis accuracy
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.7.3/python/example/vivekn-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.8.0/python/example/vivekn-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -157,7 +157,7 @@ <h4 id="sentiment-notebook" class="section-block"> Rule-based Sentiment Analysis
 Each of these sentences will be used for giving a score to text
 </p>
 </p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.7.3/python/example/dictionary-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.8.0/python/example/dictionary-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -177,7 +177,7 @@ <h4 id="crfner-notebook" class="section-block"> CRF Named Entity Recognition</h4
 approach to use the same pipeline for tagging external resources.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.7.3/python/example/crf-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.8.0/python/example/crf-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -196,7 +196,7 @@ <h4 id="dlner-notebook" class="section-block"> CNN Deep Learning NER</h4>
 and it will leverage batch-based distributed calls to native TensorFlow libraries during prediction.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.7.3/python/example/dl-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.8.0/python/example/dl-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -211,7 +211,7 @@ <h4 id="text-notebook" class="section-block"> Simple Text Matching</h4>
 This annotator is an Annotator Model and does not require training.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.7.3/python/example/text-matcher/extractor.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.8.0/python/example/text-matcher/extractor.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -224,7 +224,7 @@ <h4 id="downloader-notebook" class="section-block"> Retrieving Pretrained models
 
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.7.3/python/example/model-downloader/ModelDownloaderExample.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.8.0/python/example/model-downloader/ModelDownloaderExample.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 </section>

0 commit comments
