
Commit 298efe3

Merge pull request #286 from JohnSnowLabs/171-release-candidate
release candidate 1.7.1
2 parents 4836768 + 66bf70c commit 298efe3

File tree

5 files changed (+56, -30 lines)


CHANGELOG

Lines changed: 26 additions & 0 deletions
@@ -1,3 +1,29 @@
+========
+1.7.1
+========
+---------------
+Overview
+---------------
+Thanks to our Slack community (Bryan Wilkinson, @maziyarpanahi, @apiltamang), a few bugs were pointed out very quickly after the 1.7.0 release. This hotfix fixes an embeddings deserialization issue when cache_pretrained is located on a distributed filesystem.
+It also fixes path resolution on Windows. Thanks to Maziyar, a .gitattributes file has been added so GitHub identifies the repository's languages properly.
+Finally, 1.7.1 adds Chunk2Doc, an annotator missing from 1.7.0, which converts CHUNK types into DOCUMENT types for further retokenization or other annotations.
+
+---------------
+Enhancements
+---------------
+* Chunk2Doc annotator converts annotatorType from CHUNK to DOCUMENT
+
+---------------
+Bugfixes
+---------------
+* Fixed embedding-based annotators' deserialization error when cache_pretrained is on a distributed filesystem (thanks to Bryan Wilkinson for pointing out the issue and testing the fix)
+* Fixed Windows path reading when deserializing embeddings (thanks @apiltamang)
+
+---------------
+Other
+---------------
+* .gitattributes added in order to properly discard Jupyter as the main language of the GitHub repo (thanks @maziyarpanahi)
+
========
1.7.0
========
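
For context on the new annotator, a minimal sketch of where Chunk2Doc could sit in a pipeline. This is not taken from this commit: the package path and setter names below are assumptions based on the library's usual setInputCols/setOutputCol conventions.

```scala
// Hedged sketch: package path and setter names are assumptions based on
// the library's usual conventions, not confirmed by this commit.
import com.johnsnowlabs.nlp.Chunk2Doc

// Re-type a CHUNK column (e.g. entity chunks from an upstream annotator)
// as DOCUMENT so downstream annotators can tokenize it again.
val chunk2doc = new Chunk2Doc()
  .setInputCols("chunk")      // CHUNK-typed column produced upstream
  .setOutputCol("chunk_doc")  // re-emitted with DOCUMENT type

// chunk2doc can then be appended as a stage in an existing Spark ML Pipeline.
```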

README.md

Lines changed: 15 additions & 15 deletions
@@ -14,18 +14,18 @@ Questions? Feedback? Request access sending an email to [email protected]

This library has been uploaded to the spark-packages repository https://spark-packages.org/package/JohnSnowLabs/spark-nlp .

-To use the most recent version, just add `--packages JohnSnowLabs:spark-nlp:1.7.0` to your spark command
+To use the most recent version, just add `--packages JohnSnowLabs:spark-nlp:1.7.1` to your spark command

```sh
-spark-shell --packages JohnSnowLabs:spark-nlp:1.7.0
+spark-shell --packages JohnSnowLabs:spark-nlp:1.7.1
```

```sh
-pyspark --packages JohnSnowLabs:spark-nlp:1.7.0
+pyspark --packages JohnSnowLabs:spark-nlp:1.7.1
```

```sh
-spark-submit --packages JohnSnowLabs:spark-nlp:1.7.0
+spark-submit --packages JohnSnowLabs:spark-nlp:1.7.1
```

## Jupyter Notebook
@@ -35,23 +35,23 @@ export SPARK_HOME=/path/to/your/spark/folder
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

-pyspark --packages JohnSnowLabs:spark-nlp:1.7.0
+pyspark --packages JohnSnowLabs:spark-nlp:1.7.1
```

## Apache Zeppelin
This way will work for both Scala and Python
```
-export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.7.0"
+export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.7.1"
```
Alternatively, add the following Maven coordinates to the interpreter's library list
```
-com.johnsnowlabs.nlp:spark-nlp_2.11:1.7.0
+com.johnsnowlabs.nlp:spark-nlp_2.11:1.7.1
```

## Python without explicit Spark installation
If you installed pyspark through pip, you can now install spark-nlp through pip as well
```
-pip install spark-nlp==1.7.0
+pip install spark-nlp==1.7.1
```
Then you'll have to create a SparkSession manually, for example:
```
@@ -84,11 +84,11 @@ sparknlp {

## Pre-compiled Spark-NLP and Spark-NLP-OCR
You may download the fat-jar from here:
-[Spark-NLP 1.7.0 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-1.7.0.jar)
+[Spark-NLP 1.7.1 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-1.7.1.jar)
or the non-fat jar from here:
-[Spark-NLP 1.7.0 PKG JAR](http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.7.0/spark-nlp_2.11-1.7.0.jar)
+[Spark-NLP 1.7.1 PKG JAR](http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.7.1/spark-nlp_2.11-1.7.1.jar)
Spark-NLP-OCR module (requires native Tesseract 4.x+ for image-based OCR; does not require Spark-NLP to work, but it is highly suggested):
-[Spark-NLP-OCR 1.7.0 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-ocr-assembly-1.7.0.jar)
+[Spark-NLP-OCR 1.7.1 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-ocr-assembly-1.7.1.jar)

## Maven central

@@ -100,19 +100,19 @@ Our package is deployed to maven central. In order to add this package as a depe
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.11</artifactId>
-    <version>1.7.0</version>
+    <version>1.7.1</version>
</dependency>
```

#### SBT
```sbtshell
-libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.7.0"
+libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.7.1"
```

If you are using `scala 2.11`

```sbtshell
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.7.0"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.7.1"
```

## Using the jar manually
@@ -133,7 +133,7 @@ The preferred way to use the library when running spark programs is using the `-

If you have trouble using pretrained() models in your environment, here is a list of various models (only valid for the latest versions).
If a model is older than the current version, it still works with current versions.
-### Updated for 1.7.0
+### Updated for 1.7.1
### Pipelines
* [Basic Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_basic_en_1.6.1_2_1533856444797.zip)
* [Advanced Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_advanced_en_1.7.0_2_1539460910585.zip)
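
For readers who hit trouble with pretrained() downloads, a hedged sketch of loading one of the pipelines listed above offline, assuming the downloaded zip unpacks into a saved Spark ML PipelineModel; the local path and the example DataFrame are hypothetical.

```scala
// Hedged sketch: assumes the downloaded zip unpacks into a saved Spark ML
// PipelineModel. The local path below is hypothetical.
import org.apache.spark.ml.PipelineModel

// In spark-shell, `spark` (a SparkSession) is already in scope.
val df = spark.createDataFrame(Seq((1, "John Snow Labs is based in Delaware."))).toDF("id", "text")

val pipeline = PipelineModel.load("/tmp/pipeline_basic_en_1.6.1_2_1533856444797")
val annotated = pipeline.transform(df)  // assumes the pipeline reads a "text" column
```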

build.sbt

Lines changed: 2 additions & 2 deletions
@@ -9,7 +9,7 @@ name := "spark-nlp"

organization := "com.johnsnowlabs.nlp"

-version := "1.7.0"
+version := "1.7.1"

scalaVersion in ThisBuild := scalaVer

@@ -138,7 +138,7 @@ assemblyMergeStrategy in assembly := {
lazy val ocr = (project in file("ocr"))
  .settings(
    name := "spark-nlp-ocr",
-    version := "1.7.0",
+    version := "1.7.1",
    libraryDependencies ++= ocrDependencies ++
      analyticsDependencies ++
      testDependencies,

docs/index.html

Lines changed: 1 addition & 1 deletion
@@ -78,7 +78,7 @@ <h2 class="title">High Performance NLP with Apache Spark </h2>
</p>
<a class="btn btn-info btn-cta" style="float: center;margin-top: 10px;" href="mailto:[email protected]?subject=SparkNLP%20Slack%20access" target="_blank"> Questions? Join our Slack</a>
<b/><p/><p/>
-<p><span class="label label-warning">2018 Oct 13th - Update!</span> 1.7.0 Released! Word embeddings decoupled from annotators and better Windows support</p>
+<p><span class="label label-warning">2018 Oct 19th - Update!</span> 1.7.1 Released! Word embeddings decoupled from annotators and better Windows support</p>
</div>
<div id="cards-wrapper" class="cards-wrapper row">
<div class="item item-green col-md-4 col-sm-6 col-xs-6">

docs/quickstart.html

Lines changed: 12 additions & 12 deletions
@@ -95,35 +95,35 @@ <h2 class="section-title">Requirements & Setup</h2>
To start using the library, execute any of the following lines
depending on your desired use case:
</p>
-<pre><code class="language-javascript">spark-shell --packages JohnSnowLabs:spark-nlp:1.7.0
-pyspark --packages JohnSnowLabs:spark-nlp:1.7.0
-spark-submit --packages JohnSnowLabs:spark-nlp:1.7.0
+<pre><code class="language-javascript">spark-shell --packages JohnSnowLabs:spark-nlp:1.7.1
+pyspark --packages JohnSnowLabs:spark-nlp:1.7.1
+spark-submit --packages JohnSnowLabs:spark-nlp:1.7.1
</code></pre>
<div><b>NOTE: </b>Spark's --packages option has been reported to work improperly, particularly in Python, when utilizing physical clusters.
Utilizing --jars is advised. For Python, add Spark-NLP through pip</div>
<p/>
<h3><b>Databricks cloud cluster</b> & <b>Apache Zeppelin</b></h3>
-<pre><code class="language-javascript">com.johnsnowlabs.nlp:spark-nlp_2.11:1.7.0</code></pre>
+<pre><code class="language-javascript">com.johnsnowlabs.nlp:spark-nlp_2.11:1.7.1</code></pre>
<p>
For Python in <b>Apache Zeppelin</b> you may need to set up <i><b>SPARK_SUBMIT_OPTIONS</b></i> utilizing the --packages instruction shown above, like this
</p>
-<pre><code class="language-javascript">export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.7.0"</code></pre>
+<pre><code class="language-javascript">export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.7.1"</code></pre>
<h3><b>Python Jupyter Notebook with PySpark</b></h3>
<pre><code class="language-javascript">export SPARK_HOME=/path/to/your/spark/folder
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

-pyspark --packages JohnSnowLabs:spark-nlp:1.7.0</code></pre>
+pyspark --packages JohnSnowLabs:spark-nlp:1.7.1</code></pre>
<h3><b>Python without explicit Spark Installation</b></h3>
<p>Use pip to install (after you have installed pyspark through pip)</p>
-<pre><code class="language-javascript">pip install spark-nlp==1.7.0</code></pre>
+<pre><code class="language-javascript">pip install spark-nlp==1.7.1</code></pre>
<p>This way, you will have to start the SparkSession in your Python program manually; here is an example</p>
<pre><code class="python">spark = SparkSession.builder \
    .appName("ner")\
    .master("local[*]")\
    .config("spark.driver.memory","4G")\
    .config("spark.driver.maxResultSize", "2G") \
-    .config("spark.driver.extraClassPath", "lib/spark-nlp-assembly-1.7.0.jar")\
+    .config("spark.driver.extraClassPath", "lib/spark-nlp-assembly-1.7.1.jar")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()</code></pre>
<h3>S3 based standalone cluster (No Hadoop)</h3>
@@ -145,11 +145,11 @@ <h3>S3 based standalone cluster (No Hadoop)</h3>
<h3>Pre-Compiled Spark-NLP for download</h3>
<p>
The pre-compiled Spark-NLP assembly fat-jar, for use in standalone projects, may be downloaded
-<a href="https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-1.7.0.jar">here</a>
+<a href="https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-1.7.1.jar">here</a>
The non-fat-jar may be downloaded
-<a href="http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.7.0/spark-nlp_2.11-1.7.0.jar">here</a>
+<a href="http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.7.1/spark-nlp_2.11-1.7.1.jar">here</a>
then run spark-shell or spark-submit with the appropriate <b>--jars
-/path/to/spark-nlp_2.11-1.7.0.jar</b> to use the library in spark.
+/path/to/spark-nlp_2.11-1.7.1.jar</b> to use the library in spark.
</p>
<p>
For further alternatives and documentation check out our README page on <a href="https://github.com/JohnSnowLabs/spark-nlp">GitHub</a>.
@@ -435,7 +435,7 @@ <h2 class="section-title">Utilizing Spark-NLP OCR PDF Converter</h2>
<h3 class="block-title">Installing Spark-NLP OCRHelper</h3>
<p>
First, either build from source or download the following standalone jar module (works from both Spark-NLP Python and Scala):
-<a href="https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-ocr-assembly-1.7.0.jar">Spark-NLP-OCR</a>
+<a href="https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-ocr-assembly-1.7.1.jar">Spark-NLP-OCR</a>
and add it to your Spark environment (with --jars, or the spark.driver.extraClassPath and spark.executor.extraClassPath configuration).
Second, if your PDFs don't have a text layer (this depends on how the PDFs were created), the library will use Tesseract 4.0 in the background.
Tesseract utilizes native libraries, so you'll have to get them installed on your system.
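
As a rough illustration of the flow described above, a heavily hedged sketch: the entry point, method name, and signature below are assumptions (the docs only name an "OCRHelper"), so check the README of your installed version before relying on them.

```scala
// Heavily hedged sketch: OcrHelper's package, method name, and signature
// are assumptions, not confirmed by this commit. The input path is
// hypothetical.
import com.johnsnowlabs.nlp.util.io.OcrHelper

// Turn a folder of PDFs into a DataFrame with a text column suitable for
// DocumentAssembler; pages without a text layer would fall back to the
// native Tesseract install.
val pdfDf = OcrHelper.createDataset(spark, "/path/to/pdfs")
```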
