
Commit 454610b

Merge pull request #334 from JohnSnowLabs/180-release-candidate-proper
Release Candidate 1.8.0
2 parents eeff5c1 + ee9af37

File tree

11 files changed

+257
-205
lines changed


CHANGELOG

Lines changed: 47 additions & 0 deletions
@@ -1,3 +1,50 @@
+========
+1.8.0
+========
+---------------
+Overview
+---------------
+This release is huge! Spark-NLP made the leap onto Spark 2.4.0, even with the challenge of not everyone being on board there yet (e.g. Zeppelin doesn't support it yet).
+In this version we release three new NLP annotators: two for dependency parsing and one for contextual, deep-learning-based spell checking.
+We also significantly improved OCR functionality, fine-tuning capabilities and general output performance, particularly on Tesseract.
+Finally, there are plenty of bug fixes and improvements around word embeddings, along with performance boosts and reduced disk IO.
+Feel free to send us any feedback you have, particularly about your Spark 2.4.x experience.
+
+---------------
+New Features
+---------------
+* Built on top of Spark 2.4.0
+* Dependency Parser annotator allows for sentence relationship encoding
+* Typed Dependency Parser annotator allows for labeling relationships within dependency tags
+* ContextSpellChecker is our first deep-learning-based spell checker that evaluates context, not just tokens
+
+---------------
+Enhancements
+---------------
+* More OCR parameters exposed for further fine-tuning, including preferred-method priority and page segmentation modes
+* OCR now has a setSplitPages() setting which controls whether to output one page per row or the entire document in a single row
+* Improved word embeddings performance when working on local filesystems
+* Reduced the amount of disk IO when working with word embeddings
+* All Python notebooks improved for better readability and documentation
+* Simplified PySpark interface API
+* CoNLLGenerator utility class, which helps build CoNLL-2003 files for NER training
+* EmbeddingsHelper now allows reading word embeddings files directly from s3a:// paths
+
+---------------
+Bugfixes
+---------------
+* Solved race-condition issues around cluster usage of the RocksDB index for embeddings
+* Fixed an application.conf reading bug which didn't properly refresh AWS credentials
+* RocksDB index no longer uses compression, in order to support Windows without native RocksDB compression libraries
+* Fixed various Python default parameter settings
+* Fixed circular dependency with jbig pdfbox image OCR
+
+---------------
+Deprecations
+---------------
+* DeIdentification annotator is no longer supported in the open source version of Spark-NLP
+* AssertionStatus annotator is no longer supported in the open source version of Spark-NLP
+
 ========
 1.7.3
 ========
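The CoNLLGenerator entry above targets the CoNLL-2003 training format: one token per line with POS, chunk and NER columns, sentences separated by blank lines. A minimal plain-Python sketch of that layout (the `write_conll` helper is hypothetical, not part of the Spark-NLP API):

```python
# Sketch of the CoNLL-2003 layout used for NER training.
# write_conll is an illustrative helper, NOT Spark-NLP's CoNLLGenerator.
# Each token row is: TOKEN POS CHUNK NER-TAG.
def write_conll(sentences):
    lines = ["-DOCSTART- -X- -X- O", ""]  # document start marker
    for sentence in sentences:
        for token, pos, chunk, ner in sentence:
            lines.append(f"{token} {pos} {chunk} {ner}")
        lines.append("")  # blank line separates sentences
    return "\n".join(lines)

sample = [[("John", "NNP", "B-NP", "B-PER"),
           ("lives", "VBZ", "B-VP", "O"),
           ("in", "IN", "B-PP", "O"),
           ("London", "NNP", "B-NP", "B-LOC")]]
print(write_conll(sample))
```

A file in this shape can be consumed by CoNLL-style NER training readers.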

README.md

Lines changed: 31 additions & 30 deletions
@@ -9,7 +9,9 @@ Take a look at our official spark-nlp page: http://nlp.johnsnowlabs.com/ for use
 Questions? Feedback? Request access sending an email to [email protected]
 
 # Apache Spark Support
-As of *1.7.x* Spark-NLP does _NOT_ yet work with Spark 2.4.x
+Spark-NLP *1.8.0* has been built on top of Apache Spark 2.4.0
+
+Note that Spark 2.4.0 is not backwards compatible with Spark 2.3.x, so models and environments might not work
 
 # Usage
 
@@ -20,18 +22,18 @@ This library has been uploaded to the spark-packages repository https://spark-pa
 
 Benefit of spark-packages is that it makes the library available for both Scala/Java and Python
 
-To use the most recent version just add `--packages JohnSnowLabs:spark-nlp:1.7.3` to your spark command
+To use the most recent version just add `--packages JohnSnowLabs:spark-nlp:1.8.0` to your spark command
 
 ```sh
-spark-shell --packages JohnSnowLabs:spark-nlp:1.7.3
+spark-shell --packages JohnSnowLabs:spark-nlp:1.8.0
 ```
 
 ```sh
-pyspark --packages JohnSnowLabs:spark-nlp:1.7.3
+pyspark --packages JohnSnowLabs:spark-nlp:1.8.0
 ```
 
 ```sh
-spark-submit --packages JohnSnowLabs:spark-nlp:1.7.3
+spark-submit --packages JohnSnowLabs:spark-nlp:1.8.0
 ```
 
 ### offline mode using jars
@@ -43,14 +45,14 @@ Use either one of the following options
 
 * Add the following Maven Coordinates to the interpreter's library list
 ```
-com.johnsnowlabs.nlp:spark-nlp_2.11:1.7.3
+com.johnsnowlabs.nlp:spark-nlp_2.11:1.8.0
 ```
 * Add the path to the pre-built jar from [here](#pre-compiled-spark-nlp-and-spark-nlp-ocr) to the interpreter's library list, making sure the jar is available to the driver path
 
 ### Python in Zeppelin
 Apart from the previous step, install the Python module through pip
 ```
-pip install spark-nlp==1.7.3
+pip install spark-nlp==1.8.0
 ```
 Configure Zeppelin properly, using cells with %spark.pyspark or whichever interpreter name you chose.
 
@@ -61,7 +63,7 @@ An alternative option would be to set `SPARK_SUBMIT_OPTIONS` (zeppelin-env.sh) a
 ## Python without explicit Spark installation
 If you installed pyspark through pip, you can install spark-nlp through pip as well
 ```
-pip install spark-nlp==1.7.3
+pip install spark-nlp==1.8.0
 ```
 Then you'll have to create a SparkSession manually, for example:
 ```
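The SparkSession code block itself is truncated in this hunk. As a sketch only (the master, driver-memory and app-name values below are illustrative assumptions, not taken from the diff), manual creation that pulls in the spark-nlp package typically looks like:

```python
# Sketch: manually building a SparkSession with the spark-nlp 1.8.0 package.
# local[*] master and 6g driver memory are assumed values; adjust for your setup.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-nlp-example")
         .master("local[*]")
         .config("spark.driver.memory", "6g")
         .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:1.8.0")
         .getOrCreate())
```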
@@ -87,7 +89,7 @@ export PYSPARK_PYTHON=python3
 export PYSPARK_DRIVER_PYTHON=jupyter
 export PYSPARK_DRIVER_PYTHON_OPTS=notebook
 
-pyspark --packages JohnSnowLabs:spark-nlp:1.7.3
+pyspark --packages JohnSnowLabs:spark-nlp:1.8.0
 ```
 
 Alternatively, you can mix in using the `--jars` option for pyspark + `pip install spark-nlp`
@@ -112,12 +114,12 @@ sparknlp {
 ```
 
 ## Pre-compiled Spark-NLP and Spark-NLP-OCR
-You may download fat-jar from here:
-[Spark-NLP 1.7.3 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-1.7.3.jar)
-or non-fat from here
-[Spark-NLP 1.7.3 PKG JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp_2.11-1.7.3.jar)
+Spark-NLP FAT-JAR from here (does NOT include Spark):
+[Spark-NLP 1.8.0 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-1.8.0.jar)
+Spark-NLP GPU-enhanced TensorFlow FAT-JAR:
+[Spark-NLP 1.8.0-gpu FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-1.8.0-gpu.jar)
 Spark-NLP-OCR Module (requires native Tesseract 4.x+ for image-based OCR; does not require Spark-NLP to work but highly suggested)
-[Spark-NLP-OCR 1.7.3 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-ocr-assembly-1.7.3.jar)
+[Spark-NLP-OCR 1.8.0 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-ocr-assembly-1.8.0.jar)
 
 ## Maven central
 
@@ -129,19 +131,19 @@ Our package is deployed to maven central. In order to add this package as a depe
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp_2.11</artifactId>
-    <version>1.7.3</version>
+    <version>1.8.0</version>
 </dependency>
 ```
 
 #### SBT
 ```sbtshell
-libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.7.3"
+libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.8.0"
 ```
 
 If you are using `scala 2.11`
 
 ```sbtshell
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.7.3"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.8.0"
 ```
 
 ## Using the jar manually
@@ -162,22 +164,21 @@ The preferred way to use the library when running spark programs is using the `-
 
 If you have trouble using pretrained() models in your environment, here is a list of the various models (only valid for the latest versions).
 If a model listed here is older than the current version, it still works with current versions.
-### Updated for 1.7.3
+### Updated for 1.8.0
 ### Pipelines
-* [Basic Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_basic_en_1.6.1_2_1533856444797.zip)
-* [Advanced Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_advanced_en_1.7.0_2_1539460910585.zip)
-* [Vivekn Sentiment Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_vivekn_en_1.6.2_2_1534781342094.zip)
+* [Basic Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_basic_en_1.8.0_2.4_1545435998968.zip)
+* [Advanced Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_advanced_en_1.8.0_2.4_1545436028146.zip)
+* [Vivekn Sentiment Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_vivekn_en_1.8.0_2.4_1545436008101.zip)
 
 ### Models
-* [PerceptronModel (POS)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_fast_en_1.6.1_2_1533853928168.zip)
-* [ViveknSentimentModel (Sentiment)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/vivekn_fast_en_1.6.2_2_1534781337758.zip)
-* [SymmetricDeleteModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spell_sd_fast_en_1.6.2_2_1534781178138.zip)
-* [ContextSpellCheckerModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/context_spell_gen_en_1.7.0_2_1544041161062.zip)
-* [NorvigSweetingModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spell_fast_en_1.6.2_2_1534781328404.zip)
-* [NerCRFModel (NER)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_fast_en_1.7.0_2_1539896043754.zip)
-* [NerDLModel (NER)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_precise_en_1.7.0_2_1539623388047.zip)
-* [LemmatizerModel (Lemmatizer)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_fast_en_1.6.1_2_1533854538211.zip)
-* [AssertionDLModel (Assertion)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/as_fast_dl_en_1.7.0_2_1539653960749.zip)
+* [PerceptronModel (POS)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_fast_en_1.8.0_2.4_1545434653742.zip)
+* [ViveknSentimentModel (Sentiment)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/vivekn_fast_en_1.8.0_2.4_1545435741623.zip)
+* [SymmetricDeleteModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spell_sd_fast_en_1.8.0_2.4_1545435558025.zip)
+* ContextSpellCheckerModel (Spell Checker)
+* [NorvigSweetingModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spell_fast_en_1.8.0_2.4_1545435732032.zip)
+* [NerCRFModel (NER)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_fast_en_1.8.0_2.4_1545435254745.zip)
+* [NerDLModel (NER)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_precise_en_1.8.0_2.4_1545439567330.zip)
+* [LemmatizerModel (Lemmatizer)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_fast_en_1.8.0_2.4_1545435317864.zip)
 ``
 # FAQ
 [Check our Articles and FAQ page here](https://nlp.johnsnowlabs.com/articles.html)

build.sbt

Lines changed: 2 additions & 2 deletions
@@ -9,7 +9,7 @@ name := "spark-nlp"
 
 organization := "com.johnsnowlabs.nlp"
 
-version := "1.7.3"
+version := "1.8.0"
 
 scalaVersion in ThisBuild := scalaVer
 
@@ -151,7 +151,7 @@ assemblyMergeStrategy in assembly := {
 lazy val ocr = (project in file("ocr"))
   .settings(
     name := "spark-nlp-ocr",
-    version := "1.7.3",
+    version := "1.8.0",
     libraryDependencies ++= ocrDependencies ++
       analyticsDependencies ++
       testDependencies,

docs/index.html

Lines changed: 1 addition & 2 deletions
@@ -78,8 +78,7 @@ <h2 class="title">High Performance NLP with Apache Spark </h2>
 </p>
 <a class="btn btn-info btn-cta" style="float: center;margin-top: 10px;" href="mailto:[email protected]?subject=SparkNLP%20Slack%20access" target="_blank"> Questions? Join our Slack</a>
 <b/><p/><p/>
-<p><span class="label label-warning">2018 Nov 11st - Update!</span> 1.7.3 Released! Word embeddings decoupled from annotators, better Windows and improved cluster support</p>
-<p><span class="label label-danger">Apache Spark 2.4.x not yet supported</span></p>
+<p><span class="label label-warning">2018 Nov 21st - Update!</span> 1.8.0 Released! Dependency Parser, new Spell Checker, Spark 2.4.0, performance boosts and more!</p>
 </div>
 <div id="cards-wrapper" class="cards-wrapper row">
 <div class="item item-green col-md-4 col-sm-6 col-xs-6">

docs/notebooks.html

Lines changed: 7 additions & 7 deletions
@@ -103,7 +103,7 @@ <h4 id="scala-vivekn-notebook" class="section-block"> Vivekn Sentiment Analysis<
 Since we are dealing with small amounts of data, we put in practice LightPipelines.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.7.3/example/src/TrainViveknSentiment.scala" target="_blank"> Take me to code!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.8.0/example/src/TrainViveknSentiment.scala" target="_blank"> Take me to code!</a>
 </p>
 </div>
 </section>
@@ -135,7 +135,7 @@ <h4 id="vivekn-notebook" class="section-block"> Vivekn Sentiment Analysis</h4>
 better Sentiment Analysis accuracy
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.7.3/python/example/vivekn-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.8.0/python/example/vivekn-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -157,7 +157,7 @@ <h4 id="sentiment-notebook" class="section-block"> Rule-based Sentiment Analysis
 Each of these sentences will be used for giving a score to text
 </p>
 </p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.7.3/python/example/dictionary-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.8.0/python/example/dictionary-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -177,7 +177,7 @@ <h4 id="crfner-notebook" class="section-block"> CRF Named Entity Recognition</h4
 approach to use the same pipeline for tagging external resources.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.7.3/python/example/crf-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.8.0/python/example/crf-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -196,7 +196,7 @@ <h4 id="dlner-notebook" class="section-block"> CNN Deep Learning NER</h4>
 and it will leverage batch-based distributed calls to native TensorFlow libraries during prediction.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.7.3/python/example/dl-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.8.0/python/example/dl-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -211,7 +211,7 @@ <h4 id="text-notebook" class="section-block"> Simple Text Matching</h4>
 This annotator is an Annotator Model and does not require training.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.7.3/python/example/text-matcher/extractor.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.8.0/python/example/text-matcher/extractor.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -224,7 +224,7 @@ <h4 id="downloader-notebook" class="section-block"> Retrieving Pretrained models
 
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.7.3/python/example/model-downloader/ModelDownloaderExample.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.8.0/python/example/model-downloader/ModelDownloaderExample.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 </section>

0 commit comments
