
Commit e41ca09

Merge pull request #494 from JohnSnowLabs/202-release-candidate
Release candidate 2.0.2
2 parents 0977e10 + 934b678 commit e41ca09


52 files changed, +453 -874 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -314,3 +314,4 @@ test_crf_pipeline/
test_*_pipeline/
*metastore_db*
python/src/
+.DS_Store

CHANGELOG

Lines changed: 54 additions & 0 deletions
@@ -1,3 +1,57 @@
+========
+2.0.2
+========
+---------------
+Overview
+---------------
+Thank you for joining us in this exciting Spark NLP year! We continue to make progress towards a better-performing library, both in speed and in accuracy.
+This release focuses strongly on the quality and stability of the library, making sure it works well in most cluster environments
+and improving compatibility across systems. Word Embeddings continue to be improved for better performance and a lower memory footprint.
+The Context Spell Checker continues to receive enhancements in concurrency and in its use of Spark. Finally, TensorFlow-based annotators
+have been significantly improved by refactoring the serialization design. Help us with feedback; we welcome any issue reports!
+
+---------------
+New Features
+---------------
+* NerCrf annotator now has an includeConfidence param that includes confidence scores for predictions in the metadata
+
+---------------
+Enhancements
+---------------
+* Cluster-mode performance improved in TensorFlow annotators by serializing internal information to bytes
+* Doc2Chunk annotator added new params startCol, startColByTokenIndex, failOnMissing and lowerCase, allowing better chunking of documents
+* All annotations that derive from sentence or chunk types now contain metadata referring to the sentence or chunk ID they belong to
+* ContextSpellChecker now creates a window around the token to improve computation performance
+* Improved WordEmbeddings matching accuracy by trying alternative case-sensitive tokens
+* WordEmbeddings won't load twice if already loaded
+* WordEmbeddings can use embeddingsRef if a source was not provided, improving reuse of embeddings within a pipeline
+* New WordEmbeddings param includeEmbeddings allows annotators not to save the entire embeddings source along with them
+* Contrib TensorFlow dependencies now only load when necessary
+
+---------------
+Bugfixes
+---------------
+* Added missing Symmetric Delete pretrained model
+* Fixed a broken param name in Normalizer (thanks @RobertSassen)
+* Fixed Cloudera cluster support
+* Fixed concurrent access in ContextSpellChecker in high-partition-count use cases and in LightPipelines
+* Fixed POS dataset creator to better handle corrupted pairs
+* Fixed a bug in Word Embeddings not matching exact case-sensitive tokens in some scenarios
+* Fixed OCR Tess4J initialization problems in concurrent scenarios
+
+---------------
+Models and Pipelines
+---------------
+* Renaming of models and pipelines (work in progress)
+* Better output column naming in pipelines
+
+---------------
+Developer API
+---------------
+* Unified more of the WordEmbeddings interface with dimension params and individual setters
+* Improved unit tests for better compatibility on Windows
+* Python embeddings moved to sparknlp.embeddings
+
========
2.0.1
========
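
The includeConfidence and Doc2Chunk items above are parameter-level additions. Below is a minimal PySpark sketch of how they might be wired together; the setter names (setIncludeConfidence, setStartCol, setStartColByTokenIndex, setFailOnMissing, setLowerCase) are assumed from Spark NLP's usual param-to-setter convention, and the data and column names are purely illustrative.

```python
# Minimal sketch, assuming the usual param-to-setter naming; columns are illustrative.
import sparknlp
from sparknlp.base import DocumentAssembler, Doc2Chunk
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Hypothetical input: a text column plus a target phrase and its token start index
data = spark.createDataFrame(
    [("Peter Parker lives in New York", "New York", 4)],
    ["text", "target", "target_start"]
)

document = DocumentAssembler().setInputCol("text").setOutputCol("document")

# New Doc2Chunk params from this release: startCol, startColByTokenIndex,
# failOnMissing and lowerCase (assumed setters)
chunker = Doc2Chunk() \
    .setInputCols(["document"]) \
    .setOutputCol("chunk") \
    .setChunkCol("target") \
    .setStartCol("target_start") \
    .setStartColByTokenIndex(True) \
    .setFailOnMissing(False) \
    .setLowerCase(True)

result = Pipeline(stages=[document, chunker]).fit(data).transform(data)
result.selectExpr("explode(chunk) as chunk").show(truncate=False)

# New in 2.0.2: NerCrf can attach confidence scores to prediction metadata.
# Assumed setter; the pretrained model download is omitted here.
# from sparknlp.annotator import NerCrfModel
# ner = NerCrfModel.pretrained() \
#     .setInputCols(["sentence", "token", "embeddings"]) \
#     .setOutputCol("ner") \
#     .setIncludeConfidence(True)
```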

README.md

Lines changed: 15 additions & 15 deletions
@@ -43,14 +43,14 @@ Take a look at our official spark-nlp page: http://nlp.johnsnowlabs.com/ for use

## Apache Spark Support

-Spark-NLP *2.0.1* has been built on top of Apache Spark 2.4.0
+Spark-NLP *2.0.2* has been built on top of Apache Spark 2.4.0

Note that Spark is not retrocompatible with Spark 2.3.x, so models and environments might not work.

If you are still stuck on Spark 2.3.x feel free to use [this assembly jar](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-2.3.2-nlp-assembly-1.8.0.jar) instead. Support is limited.
For OCR module, [this](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-2.3.2-nlp-ocr-assembly-1.8.0.jar) is for spark `2.3.x`.

-| Spark NLP | Spark 2.0.1 / Spark 2.3.x | Spark 2.4 |
+| Spark NLP | Spark 2.0.2 / Spark 2.3.x | Spark 2.4 |
|-------------|-------------------------------------|--------------|
| 2.x.x |NO |YES |
| 1.8.x |Partially |YES |
@@ -68,18 +68,18 @@ This library has been uploaded to the [spark-packages repository](https://spark-

Benefit of spark-packages is that makes it available for both Scala-Java and Python

-To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:2.0.1` to you spark command
+To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:2.0.2` to you spark command

```sh
-spark-shell --packages JohnSnowLabs:spark-nlp:2.0.1
+spark-shell --packages JohnSnowLabs:spark-nlp:2.0.2
```

```sh
-pyspark --packages JohnSnowLabs:spark-nlp:2.0.1
+pyspark --packages JohnSnowLabs:spark-nlp:2.0.2
```

```sh
-spark-submit --packages JohnSnowLabs:spark-nlp:2.0.1
+spark-submit --packages JohnSnowLabs:spark-nlp:2.0.2
```

This can also be used to create a SparkSession manually by using the `spark.jars.packages` option in both Python and Scala
@@ -147,7 +147,7 @@ Our package is deployed to maven central. In order to add this package as a depe
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.11</artifactId>
-    <version>2.0.1</version>
+    <version>2.0.2</version>
</dependency>
```

@@ -158,22 +158,22 @@ and
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-ocr_2.11</artifactId>
-    <version>2.0.1</version>
+    <version>2.0.2</version>
</dependency>
```

### SBT

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "2.0.1"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "2.0.2"
```

and

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-ocr
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-ocr" % "2.0.1"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-ocr" % "2.0.2"
```

Maven Central: [https://mvnrepository.com/artifact/com.johnsnowlabs.nlp](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp)
@@ -187,7 +187,7 @@ Maven Central: [https://mvnrepository.com/artifact/com.johnsnowlabs.nlp](https:/
If you installed pyspark through pip, you can install `spark-nlp` through pip as well.

```bash
-pip install spark-nlp==2.0.1
+pip install spark-nlp==2.0.2
```

PyPI [spark-nlp package](https://pypi.org/project/spark-nlp/)
@@ -210,7 +210,7 @@ spark = SparkSession.builder \
    .master("local[4]")\
    .config("spark.driver.memory","4G")\
    .config("spark.driver.maxResultSize", "2G") \
-    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.1")\
+    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.2")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()
```
@@ -224,7 +224,7 @@ Use either one of the following options
* Add the following Maven Coordinates to the interpreter's library list

```bash
-com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.1
+com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.2
```

* Add path to pre-built jar from [here](#pre-compiled-spark-nlp-and-spark-nlp-ocr) in the interpreter's library list making sure the jar is available to driver path
@@ -234,7 +234,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.1
Apart from previous step, install python module through pip

```bash
-pip install spark-nlp==2.0.1
+pip install spark-nlp==2.0.2
```

Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -260,7 +260,7 @@ export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

-pyspark --packages JohnSnowLabs:spark-nlp:2.0.1
+pyspark --packages JohnSnowLabs:spark-nlp:2.0.2
```

Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp`

build.sbt

Lines changed: 2 additions & 2 deletions
@@ -16,7 +16,7 @@ if(is_gpu.equals("false")){

organization:= "com.johnsnowlabs.nlp"

-version := "2.0.1"
+version := "2.0.2"

scalaVersion in ThisBuild := scalaVer

@@ -178,7 +178,7 @@ assemblyMergeStrategy in assembly := {
lazy val ocr = (project in file("ocr"))
  .settings(
    name := "spark-nlp-ocr",
-    version := "2.0.1",
+    version := "2.0.2",

    test in assembly := {},

docs/quickstart.html

Lines changed: 9 additions & 9 deletions
@@ -112,14 +112,14 @@ <h2 class="section-title">Requirements & Setup</h2>
To start using the library, execute any of the following lines
depending on your desired use case:
</p>
-<pre><code class="language-javascript">spark-shell --packages JohnSnowLabs:spark-nlp:2.0.1
-pyspark --packages JohnSnowLabs:spark-nlp:2.0.1
-spark-submit --packages JohnSnowLabs:spark-nlp:2.0.1
+<pre><code class="language-javascript">spark-shell --packages JohnSnowLabs:spark-nlp:2.0.2
+pyspark --packages JohnSnowLabs:spark-nlp:2.0.2
+spark-submit --packages JohnSnowLabs:spark-nlp:2.0.2
</code></pre>
<p/>
<h3><b>Straight forward Python on jupyter notebook</b></h3>
<p>Use pip to install (after you pip installed numpy and pyspark)</p>
-<pre><code class="language-javascript">pip install spark-nlp==2.0.1
+<pre><code class="language-javascript">pip install spark-nlp==2.0.2
jupyter notebook</code></pre>
<p>The easiest way to get started, is to run the following code: </p>
<pre><code class="pytohn">import sparknlp
@@ -131,21 +131,21 @@ <h3><b>Straight forward Python on jupyter notebook</b></h3>
.appName('OCR Eval') \
.config("spark.driver.memory", "6g") \
.config("spark.executor.memory", "6g") \
-.config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.1") \
+.config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.2") \
.getOrCreate()</code></pre>
<h3><b>Databricks cloud cluster</b> & <b>Apache Zeppelin</b></h3>
<p>Add the following maven coordinates in the dependency configuration page: </p>
-<pre><code class="language-javascript">com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.1</code></pre>
+<pre><code class="language-javascript">com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.2</code></pre>
<p>
For Python in <b>Apache Zeppelin</b> you may need to setup <i><b>SPARK_SUBMIT_OPTIONS</b></i> utilizing --packages instruction shown above like this
</p>
-<pre><code class="language-javascript">export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:2.0.1"</code></pre>
+<pre><code class="language-javascript">export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:2.0.2"</code></pre>
<h3><b>Python Jupyter Notebook with PySpark</b></h3>
<pre><code class="language-javascript">export SPARK_HOME=/path/to/your/spark/folder
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

-pyspark --packages JohnSnowLabs:spark-nlp:2.0.1</code></pre>
+pyspark --packages JohnSnowLabs:spark-nlp:2.0.2</code></pre>
<h3>S3 based standalone cluster (No Hadoop)</h3>
<p>
If your distributed storage is S3 and you don't have a standard hadoop configuration (i.e. fs.defaultFS)
@@ -442,7 +442,7 @@ <h2 class="section-title">Utilizing Spark NLP OCR Module</h2>
<p>
Spark NLP OCR Module is not included within Spark NLP. It is not an annotator and not an extension to Spark ML.
You can include it with the following coordinates for Maven:
-<pre><code class="java">com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.0.1</code></pre>
+<pre><code class="java">com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.0.2</code></pre>
</p>
<h3 class="block-title">Creating Spark datasets from PDF (To be used with Spark NLP)</h3>
<p>

project/assembly.sbt

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")
+addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9")

project/build.properties

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-sbt.version=0.13.16
+sbt.version=0.13.18

python/run-tests.py

Lines changed: 2 additions & 2 deletions
@@ -19,7 +19,7 @@
unittest.TextTestRunner().run(PipelineTestSpec())
unittest.TextTestRunner().run(SpellCheckerTestSpec())
unittest.TextTestRunner().run(SymmetricDeleteTestSpec())
-unittest.TextTestRunner().run(ContextSpellCheckerTestSpec())
+# unittest.TextTestRunner().run(ContextSpellCheckerTestSpec())
unittest.TextTestRunner().run(ParamsGettersTestSpec())
unittest.TextTestRunner().run(DependencyParserTreeBankTestSpec())
unittest.TextTestRunner().run(DependencyParserConllUTestSpec())
@@ -31,4 +31,4 @@
unittest.TextTestRunner().run(UtilitiesTestSpec())
unittest.TextTestRunner().run(ConfigPathTestSpec())
unittest.TextTestRunner().run(SerializersTestSpec())
-unittest.TextTestRunner().run(OcrTestSpec())
+unittest.TextTestRunner().run(OcrTestSpec())

python/setup.py

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@
    # For a discussion on single-sourcing the version across setup.py and the
    # project code, see
    # https://packaging.python.org/en/latest/single_source_version.html
-    version='2.0.1', # Required
+    version='2.0.2', # Required

    # This is a one-line description or tagline of what your project does. This
    # corresponds to the "Summary" metadata field:

python/sparknlp/__init__.py

Lines changed: 2 additions & 2 deletions
@@ -36,8 +36,8 @@ def start(include_ocr=False):
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    if include_ocr:
-        builder.config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.1,com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.0.1")
+        builder.config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.2,com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.0.2")
    else:
-        builder.config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.1") \
+        builder.config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.2") \

    return builder.getOrCreate()
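
For reference, a short usage sketch of the updated helper: sparknlp.start() builds (or reuses) a SparkSession pinned to the 2.0.2 artifacts shown in the diff above, and include_ocr=True additionally pulls in the spark-nlp-ocr package.

```python
import sparknlp

# Creates (or reuses) a SparkSession configured with JohnSnowLabs:spark-nlp:2.0.2
spark = sparknlp.start()
print(spark.version)

# With OCR support, the com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.0.2 artifact is added as well
# spark = sparknlp.start(include_ocr=True)
```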
