
Commit 298efe3

Merge pull request #286 from JohnSnowLabs/171-release-candidate
release candidate 1.7.1
2 parents 4836768 + 66bf70c commit 298efe3

File tree

5 files changed (+56, -30 lines)


CHANGELOG

Lines changed: 26 additions & 0 deletions
@@ -1,3 +1,29 @@
+========
+1.7.1
+========
+---------------
+Overview
+---------------
+Thanks to our Slack community (Bryan Wilkinson, @maziyarpanahi, @apiltamang), a few bugs were pointed out very quickly after the 1.7.0 release. This hotfix fixes an embeddings deserialization issue when cache_pretrained is located on a distributed filesystem.
+It also fixes path resolution on Windows. Thanks to Maziyar, a .gitattributes file has been added so GitHub identifies the repository's languages properly.
+Finally, 1.7.1 adds Chunk2Doc, an annotator missing from 1.7.0, which converts CHUNK types into DOCUMENT types for further retokenization or other annotations.
+
+---------------
+Enhancements
+---------------
+* Chunk2Doc annotator converts annotatorType from CHUNK to DOCUMENT
+
+---------------
+Bugfixes
+---------------
+* Fixed embedding-based annotators' deserialization error when cache_pretrained is on a distributed filesystem (thanks to Bryan Wilkinson for pointing out the issue and testing the fix)
+* Fixed Windows path reading when deserializing embeddings (thanks @apiltamang)
+
+---------------
+Other
+---------------
+* .gitattributes added in order to properly discard Jupyter as the main language of the GitHub repo (thanks @maziyarpanahi)
+
========
1.7.0
========
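
For context on the new annotator, a minimal sketch of where Chunk2Doc could sit in a pipeline. This is not taken from this commit: the package path and setter names below are assumptions based on the library's usual setInputCols/setOutputCol conventions.

```scala
// Hedged sketch: package path and setter names are assumptions based on
// the library's usual conventions, not confirmed by this commit.
import com.johnsnowlabs.nlp.Chunk2Doc

// Re-type a CHUNK column (e.g. entity chunks from an upstream annotator)
// as DOCUMENT so downstream annotators can tokenize it again.
val chunk2doc = new Chunk2Doc()
  .setInputCols("chunk")      // CHUNK-typed column produced upstream
  .setOutputCol("chunk_doc")  // re-emitted with DOCUMENT type

// chunk2doc can then be appended as a stage in an existing Spark ML Pipeline.
```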

README.md

Lines changed: 15 additions & 15 deletions
@@ -14,18 +14,18 @@ Questions? Feedback? Request access sending an email to [email protected]

This library has been uploaded to the spark-packages repository https://spark-packages.org/package/JohnSnowLabs/spark-nlp .

-To use the most recent version, just add `--packages JohnSnowLabs:spark-nlp:1.7.0` to your spark command
+To use the most recent version, just add `--packages JohnSnowLabs:spark-nlp:1.7.1` to your spark command

```sh
-spark-shell --packages JohnSnowLabs:spark-nlp:1.7.0
+spark-shell --packages JohnSnowLabs:spark-nlp:1.7.1
```

```sh
-pyspark --packages JohnSnowLabs:spark-nlp:1.7.0
+pyspark --packages JohnSnowLabs:spark-nlp:1.7.1
```

```sh
-spark-submit --packages JohnSnowLabs:spark-nlp:1.7.0
+spark-submit --packages JohnSnowLabs:spark-nlp:1.7.1
```

## Jupyter Notebook
@@ -35,23 +35,23 @@ export SPARK_HOME=/path/to/your/spark/folder
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

-pyspark --packages JohnSnowLabs:spark-nlp:1.7.0
+pyspark --packages JohnSnowLabs:spark-nlp:1.7.1
```

## Apache Zeppelin
This way will work for both Scala and Python
```
-export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.7.0"
+export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.7.1"
```
Alternatively, add the following Maven coordinates to the interpreter's library list
```
-com.johnsnowlabs.nlp:spark-nlp_2.11:1.7.0
+com.johnsnowlabs.nlp:spark-nlp_2.11:1.7.1
```

## Python without explicit Spark installation
If you installed pyspark through pip, you can now install spark-nlp through pip as well
```
-pip install spark-nlp==1.7.0
+pip install spark-nlp==1.7.1
```
Then you'll have to create a SparkSession manually, for example:
```
@@ -84,11 +84,11 @@ sparknlp {

## Pre-compiled Spark-NLP and Spark-NLP-OCR
You may download the fat-jar from here:
-[Spark-NLP 1.7.0 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-1.7.0.jar)
+[Spark-NLP 1.7.1 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-1.7.1.jar)
or the non-fat jar from here:
-[Spark-NLP 1.7.0 PKG JAR](http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.7.0/spark-nlp_2.11-1.7.0.jar)
+[Spark-NLP 1.7.1 PKG JAR](http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.7.1/spark-nlp_2.11-1.7.1.jar)
Spark-NLP-OCR module (requires native Tesseract 4.x+ for image-based OCR; does not require Spark-NLP to work, but it is highly suggested):
-[Spark-NLP-OCR 1.7.0 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-ocr-assembly-1.7.0.jar)
+[Spark-NLP-OCR 1.7.1 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-ocr-assembly-1.7.1.jar)

## Maven central

@@ -100,19 +100,19 @@ Our package is deployed to maven central. In order to add this package as a depe
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.11</artifactId>
-    <version>1.7.0</version>
+    <version>1.7.1</version>
</dependency>
```

#### SBT
```sbtshell
-libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.7.0"
+libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.7.1"
```

If you are using `scala 2.11`

```sbtshell
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.7.0"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.7.1"
```

## Using the jar manually
@@ -133,7 +133,7 @@ The preferred way to use the library when running spark programs is using the `-

If you have trouble using pretrained() models in your environment, here is a list of various models (only valid for the latest versions).
If a model is older than the current version, it still works with current versions.
-### Updated for 1.7.0
+### Updated for 1.7.1
### Pipelines
* [Basic Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_basic_en_1.6.1_2_1533856444797.zip)
* [Advanced Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_advanced_en_1.7.0_2_1539460910585.zip)
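
For readers who hit trouble with pretrained() downloads, a hedged sketch of loading one of the pipelines listed above offline, assuming the downloaded zip unpacks into a saved Spark ML PipelineModel; the local path and the example DataFrame are hypothetical.

```scala
// Hedged sketch: assumes the downloaded zip unpacks into a saved Spark ML
// PipelineModel. The local path below is hypothetical.
import org.apache.spark.ml.PipelineModel

// In spark-shell, `spark` (a SparkSession) is already in scope.
val df = spark.createDataFrame(Seq((1, "John Snow Labs is based in Delaware."))).toDF("id", "text")

val pipeline = PipelineModel.load("/tmp/pipeline_basic_en_1.6.1_2_1533856444797")
val annotated = pipeline.transform(df)  // assumes the pipeline reads a "text" column
```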

build.sbt

Lines changed: 2 additions & 2 deletions
@@ -9,7 +9,7 @@ name := "spark-nlp"

organization := "com.johnsnowlabs.nlp"

-version := "1.7.0"
+version := "1.7.1"

scalaVersion in ThisBuild := scalaVer

@@ -138,7 +138,7 @@ assemblyMergeStrategy in assembly := {
lazy val ocr = (project in file("ocr"))
  .settings(
    name := "spark-nlp-ocr",
-    version := "1.7.0",
+    version := "1.7.1",
    libraryDependencies ++= ocrDependencies ++
      analyticsDependencies ++
      testDependencies,

docs/index.html

Lines changed: 1 addition & 1 deletion
@@ -78,7 +78,7 @@ <h2 class="title">High Performance NLP with Apache Spark </h2>
</p>
<a class="btn btn-info btn-cta" style="float: center;margin-top: 10px;" href="mailto:[email protected]?subject=SparkNLP%20Slack%20access" target="_blank"> Questions? Join our Slack</a>
<b/><p/><p/>
-<p><span class="label label-warning">2018 Oct 13th - Update!</span> 1.7.0 Released! Word embeddings decoupled from annotators and better Windows support</p>
+<p><span class="label label-warning">2018 Oct 19th - Update!</span> 1.7.1 Released! Word embeddings decoupled from annotators and better Windows support</p>
</div>
<div id="cards-wrapper" class="cards-wrapper row">
<div class="item item-green col-md-4 col-sm-6 col-xs-6">

docs/quickstart.html

Lines changed: 12 additions & 12 deletions
@@ -95,35 +95,35 @@ <h2 class="section-title">Requirements & Setup</h2>
To start using the library, execute any of the following lines
depending on your desired use case:
</p>
-<pre><code class="language-javascript">spark-shell --packages JohnSnowLabs:spark-nlp:1.7.0
-pyspark --packages JohnSnowLabs:spark-nlp:1.7.0
-spark-submit --packages JohnSnowLabs:spark-nlp:1.7.0
+<pre><code class="language-javascript">spark-shell --packages JohnSnowLabs:spark-nlp:1.7.1
+pyspark --packages JohnSnowLabs:spark-nlp:1.7.1
+spark-submit --packages JohnSnowLabs:spark-nlp:1.7.1
</code></pre>
<div><b>NOTE: </b>Spark's --packages option has been reported to work improperly, particularly in Python, when utilizing physical clusters.
Utilizing --jars is advised. For Python, add Spark-NLP through pip</div>
<p/>
<h3><b>Databricks cloud cluster</b> & <b>Apache Zeppelin</b></h3>
-<pre><code class="language-javascript">com.johnsnowlabs.nlp:spark-nlp_2.11:1.7.0</code></pre>
+<pre><code class="language-javascript">com.johnsnowlabs.nlp:spark-nlp_2.11:1.7.1</code></pre>
<p>
For Python in <b>Apache Zeppelin</b> you may need to set up <i><b>SPARK_SUBMIT_OPTIONS</b></i> utilizing the --packages instruction shown above, like this
</p>
-<pre><code class="language-javascript">export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.7.0"</code></pre>
+<pre><code class="language-javascript">export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.7.1"</code></pre>
<h3><b>Python Jupyter Notebook with PySpark</b></h3>
<pre><code class="language-javascript">export SPARK_HOME=/path/to/your/spark/folder
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

-pyspark --packages JohnSnowLabs:spark-nlp:1.7.0</code></pre>
+pyspark --packages JohnSnowLabs:spark-nlp:1.7.1</code></pre>
<h3><b>Python without explicit Spark Installation</b></h3>
<p>Use pip to install (after you have installed pyspark through pip)</p>
-<pre><code class="language-javascript">pip install spark-nlp==1.7.0</code></pre>
+<pre><code class="language-javascript">pip install spark-nlp==1.7.1</code></pre>
<p>This way, you will have to start the SparkSession in your Python program manually; here is an example</p>
<pre><code class="python">spark = SparkSession.builder \
    .appName("ner")\
    .master("local[*]")\
    .config("spark.driver.memory","4G")\
    .config("spark.driver.maxResultSize", "2G") \
-    .config("spark.driver.extraClassPath", "lib/spark-nlp-assembly-1.7.0.jar")\
+    .config("spark.driver.extraClassPath", "lib/spark-nlp-assembly-1.7.1.jar")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()</code></pre>
<h3>S3 based standalone cluster (No Hadoop)</h3>
@@ -145,11 +145,11 @@ <h3>S3 based standalone cluster (No Hadoop)</h3>
<h3>Pre-Compiled Spark-NLP for download</h3>
<p>
The pre-compiled Spark-NLP assembly fat-jar, for use in standalone projects, may be downloaded
-<a href="https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-1.7.0.jar">here</a>
+<a href="https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-1.7.1.jar">here</a>
The non-fat-jar may be downloaded
-<a href="http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.7.0/spark-nlp_2.11-1.7.0.jar">here</a>
+<a href="http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.7.1/spark-nlp_2.11-1.7.1.jar">here</a>
then run spark-shell or spark-submit with the appropriate <b>--jars
-/path/to/spark-nlp_2.11-1.7.0.jar</b> to use the library in spark.
+/path/to/spark-nlp_2.11-1.7.1.jar</b> to use the library in spark.
</p>
<p>
For further alternatives and documentation check out our README page on <a href="https://github.com/JohnSnowLabs/spark-nlp">GitHub</a>.
@@ -435,7 +435,7 @@ <h2 class="section-title">Utilizing Spark-NLP OCR PDF Converter</h2>
<h3 class="block-title">Installing Spark-NLP OCRHelper</h3>
<p>
First, either build from source or download the following standalone jar module (works from both Spark-NLP Python and Scala):
-<a href="https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-ocr-assembly-1.7.0.jar">Spark-NLP-OCR</a>
+<a href="https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-ocr-assembly-1.7.1.jar">Spark-NLP-OCR</a>
and add it to your Spark environment (with --jars, or the spark.driver.extraClassPath and spark.executor.extraClassPath configuration).
Second, if your PDFs don't have a text layer (this depends on how the PDFs were created), the library will use Tesseract 4.0 in the background.
Tesseract utilizes native libraries, so you'll have to get them installed on your system.
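
As a rough illustration of the flow described above, a heavily hedged sketch: the entry point, method name, and signature below are assumptions (the docs only name an "OCRHelper"), so check the README of your installed version before relying on them.

```scala
// Heavily hedged sketch: OcrHelper's package, method name, and signature
// are assumptions, not confirmed by this commit. The input path is
// hypothetical.
import com.johnsnowlabs.nlp.util.io.OcrHelper

// Turn a folder of PDFs into a DataFrame with a text column suitable for
// DocumentAssembler; pages without a text layer would fall back to the
// native Tesseract install.
val pdfDf = OcrHelper.createDataset(spark, "/path/to/pdfs")
```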
