
Commit 5cbaa02

Merge pull request #571 from JohnSnowLabs/210-release-candidate-4
2.1.0 Release Candidate #4
2 parents: e37ec99 + 7f51807

603 files changed: +349,146 −1,035 lines


.sbtrc

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 alias assemblyAndCopy=;assembly;copyAssembledJar
 alias assemblyOcrAndCopy=;ocr/assembly;copyAssembledOcrJar
 alias assemblyEvalAndCopy=;evaluation/assembly;copyAssembledEvalJar
-alias assemblyAllAndCopy=;assemblyAndCopy;assemblyOcrAndCopy;assemblyEvalAndCopy;copyAssembledEvalJar
+alias assemblyAllAndCopy=;assemblyEvalAndCopy;assemblyOcrAndCopy
 alias assemblyAndCopyForPyPi=;assembly;copyAssembledJarForPyPi
-alias publishSignedOcr=;ocr/assembly;ocr/publishSigned
+alias publishSignedOcr=;ocr/assembly;ocr/publishSigned

CHANGELOG

Lines changed: 53 additions & 0 deletions
@@ -1,3 +1,56 @@
+========
+2.1.0
+========
+---------------
+Overview
+---------------
+Thank you for following up with the release candidates. This release is backwards-breaking because two basic annotators have been redesigned.
+The Tokenizer now has easier-to-customize params and simplified exception management.
+DocumentAssembler `trimAndClearNewLines` was redesigned into a `cleanupMode` param for further control over the cleanup process.
+Tokenizer now supports pretrained models, meaning you'll be able to access any of our language-based Tokenizers.
+Another big introduction is the `eval` module, an optional Spark NLP sub-module that provides evaluation scripts to
+make it easier to measure how your own models perform against a validation dataset, now using MLFlow.
+Some work also began on metrics during training, starting with the `NerDLApproach`.
+Finally, we'll have Scaladocs ready for easy library reference.
+Thank you for your feedback in our Slack channels.
+Particular thanks to @csnardi for fixing a bug in one of the release candidates.
+
+---------------
+New Features
+---------------
+* Spark NLP Eval module, which includes functions to evaluate NER and Spell Checkers with MLFlow (Python support and more annotators to come)
+
+---------------
+Enhancements
+---------------
+* DocumentAssembler's new param `cleanupMode` allows the user to decide what kind of cleanup to apply to the source text
+* Tokenizer has been significantly enhanced to allow easier and more intuitive customization
+* Norvig and Symmetric spell checkers now report confidence scores in metadata
+* NerDLApproach now reports metrics and f1 scores, with automated dataset splitting through `setTrainValidationProp`
+* Began making progress towards OCR reporting more meaningful metadata (noise levels, confidence score, etc.), laying the groundwork for further development
+
+---------------
+Bugfixes
+---------------
+* Fixed Dependency Parser not reporting offsets correctly
+* Dependency Parser now shows only the head token as part of the result, instead of pairs
+* Fixed NerDLModel not allowing non-contrib versions to be picked on Linux
+* Fixed a bug in embeddingsRef validation that allowed the user to override the ref when not possible
+* Removed unintentional GC calls that caused some performance issues
+
+---------------
+Framework
+---------------
+* ResourceDownloader is now capable of using credentials from standard AWS means (environment variables, credentials folder)
+
+---------------
+Documentation
+---------------
+* Scaladocs for Spark NLP reference
+* Added Google Colab walkthrough guide
+* Added Approach and Model class names in reference documentation
+* Fixed various typos and outdated pieces in documentation
+
 ========
 2.0.8
 ========
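To make the two redesigned annotators concrete, here is a minimal Scala sketch of the new DocumentAssembler `cleanupMode` param used together with the Tokenizer in a pipeline. The `"shrink"` value and the column names are illustrative assumptions, not taken from this commit:

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline

// cleanupMode replaces the old trimAndClearNewLines flag; "shrink" is an assumed
// example value that trims and collapses whitespace and newlines in the source text.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")

// The enhanced Tokenizer with its default params; custom rules and exceptions
// can be configured through its params as described in the changelog above.
val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer))
```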
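Likewise, a hedged sketch of the new training metrics in `NerDLApproach`: `setTrainValidationProp` holds out part of the training data so f1 metrics can be reported during training. The column names, epoch count, and 0.1 proportion are assumptions for illustration:

```scala
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach

// Assumes sentence, token and embeddings columns produced by earlier pipeline stages.
val nerTagger = new NerDLApproach()
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")
  .setLabelColumn("label")        // CoNLL-style label column in the training DataFrame
  .setMaxEpochs(10)
  .setTrainValidationProp(0.1f)   // hold out 10% of the training set for validation metrics
```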

README.md

Lines changed: 14 additions & 14 deletions
@@ -40,7 +40,7 @@ Take a look at our official Spark NLP page: [http://nlp.johnsnowlabs.com/](http:
 
 ## Apache Spark Support
 
-Spark NLP *2.0.8* has been built on top of Apache Spark 2.4.3
+Spark NLP *2.1.0* has been built on top of Apache Spark 2.4.3
 
 Note that Spark is not retrocompatible with Spark 2.3.x, so models and environments might not work.
 
@@ -65,18 +65,18 @@ This library has been uploaded to the [spark-packages repository](https://spark-
 
 Benefit of spark-packages is that makes it available for both Scala-Java and Python
 
-To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:2.0.8` to you spark command
+To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:2.1.0` to you spark command
 
 ```sh
-spark-shell --packages JohnSnowLabs:spark-nlp:2.0.8
+spark-shell --packages JohnSnowLabs:spark-nlp:2.1.0
 ```
 
 ```sh
-pyspark --packages JohnSnowLabs:spark-nlp:2.0.8
+pyspark --packages JohnSnowLabs:spark-nlp:2.1.0
 ```
 
 ```sh
-spark-submit --packages JohnSnowLabs:spark-nlp:2.0.8
+spark-submit --packages JohnSnowLabs:spark-nlp:2.1.0
 ```
 
 This can also be used to create a SparkSession manually by using the `spark.jars.packages` option in both Python and Scala
@@ -144,7 +144,7 @@ Our package is deployed to maven central. In order to add this package as a depe
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.11</artifactId>
-    <version>2.0.8</version>
+    <version>2.1.0</version>
 </dependency>
 ```
 
@@ -155,22 +155,22 @@ and
 <dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-ocr_2.11</artifactId>
-    <version>2.0.8</version>
+    <version>2.1.0</version>
 </dependency>
 ```
 
 ### SBT
 
 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "2.0.8"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "2.1.0"
 ```
 
 and
 
 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-ocr
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-ocr" % "2.0.8"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-ocr" % "2.1.0"
 ```
 
 Maven Central: [https://mvnrepository.com/artifact/com.johnsnowlabs.nlp](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp)
@@ -185,7 +185,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through
 
 Pip:
 ```bash
-pip install spark-nlp==2.0.8
+pip install spark-nlp==2.1.0
 ```
 Conda:
 ```bash
@@ -202,7 +202,7 @@ spark = SparkSession.builder \
     .master("local[4]")\
     .config("spark.driver.memory","4G")\
     .config("spark.driver.maxResultSize", "2G") \
-    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.8")\
+    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.1.0")\
     .config("spark.kryoserializer.buffer.max", "500m")\
     .getOrCreate()
 ```
@@ -216,7 +216,7 @@ Use either one of the following options
 * Add the following Maven Coordinates to the interpreter's library list
 
 ```bash
-com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.8
+com.johnsnowlabs.nlp:spark-nlp_2.11:2.1.0
 ```
 
 * Add path to pre-built jar from [here](#pre-compiled-spark-nlp-and-spark-nlp-ocr) in the interpreter's library list making sure the jar is available to driver path
@@ -226,7 +226,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.8
 Apart from previous step, install python module through pip
 
 ```bash
-pip install spark-nlp==2.0.8
+pip install spark-nlp==2.1.0
 ```
 
 Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -251,7 +251,7 @@ export PYSPARK_PYTHON=python3
 export PYSPARK_DRIVER_PYTHON=jupyter
 export PYSPARK_DRIVER_PYTHON_OPTS=notebook
 
-pyspark --packages JohnSnowLabs:spark-nlp:2.0.8
+pyspark --packages JohnSnowLabs:spark-nlp:2.1.0
 ```
 
 Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp`
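The README notes that a SparkSession can also be created manually with `spark.jars.packages` in both Python and Scala; for reference, here is a minimal Scala sketch of the Scala side (not part of this commit's diff), using the same package coordinate the README shows:

```scala
import org.apache.spark.sql.SparkSession

// Scala counterpart of the PySpark builder snippet in the README above.
val spark = SparkSession.builder()
  .appName("Spark NLP")
  .master("local[4]")
  .config("spark.driver.memory", "4G")
  .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.1.0")
  .config("spark.kryoserializer.buffer.max", "500m")
  .getOrCreate()
```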

build.sbt

Lines changed: 36 additions & 14 deletions
@@ -16,7 +16,7 @@ if(is_gpu.equals("false")){
 
 organization:= "com.johnsnowlabs.nlp"
 
-version := "2.0.8"
+version := "2.1.0"
 
 scalaVersion in ThisBuild := scalaVer
 
@@ -86,6 +86,7 @@ developers in ThisBuild:= List(
   Developer(id="showy", name="Eduardo Muñoz", email="[email protected]", url=url("https://github.com/showy"))
 )
 
+target in Compile in doc := baseDirectory.value / "docs/api"
 
 lazy val ocrDependencies = Seq(
   "net.sourceforge.tess4j" % "tess4j" % "4.2.1"
@@ -108,19 +109,19 @@ lazy val testDependencies = Seq(
 lazy val utilDependencies = Seq(
   "com.typesafe" % "config" % "1.3.0",
   "org.rocksdb" % "rocksdbjni" % "5.17.2",
-  "org.apache.hadoop" % "hadoop-aws" % "2.7.3"
+  "org.apache.hadoop" % "hadoop-aws" % "3.2.0"
     exclude("com.fasterxml.jackson.core", "jackson-annotations")
     exclude("com.fasterxml.jackson.core", "jackson-databind")
+    exclude("com.fasterxml.jackson.core", "jackson-core")
     exclude("commons-configuration","commons-configuration")
+    exclude("com.amazonaws","aws-java-sdk-bundle")
     exclude("org.apache.hadoop" ,"hadoop-common"),
-  "com.amazonaws" % "aws-java-sdk" % "1.11.568"
-    exclude("commons-codec", "commons-codec")
-    exclude("com.fasterxml.jackson.core", "jackson-core")
+  "com.amazonaws" % "aws-java-sdk-core" % "1.11.375"
     exclude("com.fasterxml.jackson.core", "jackson-annotations")
     exclude("com.fasterxml.jackson.core", "jackson-databind")
-    exclude("com.fasterxml.jackson.dataformat", "jackson-dataformat-smile")
-    exclude("com.fasterxml.jackson.datatype", "jackson-datatype-joda"),
-
+    exclude("com.fasterxml.jackson.core", "jackson-core")
+    exclude("commons-configuration","commons-configuration"),
+  "com.amazonaws" % "aws-java-sdk-s3" % "1.11.375",
   "org.rocksdb" % "rocksdbjni" % "5.17.2",
   "com.github.universal-automata" % "liblevenshtein" % "3.0.0"
     exclude("com.google.guava", "guava")
@@ -158,7 +159,6 @@ lazy val root = (project in file("."))
 
 
 val ocrMergeRules: String => MergeStrategy = {
-
   case "versionchanges.txt" => MergeStrategy.discard
   case "StaticLoggerBinder" => MergeStrategy.discard
   case PathList("META-INF", fileName)
@@ -171,10 +171,23 @@ val ocrMergeRules: String => MergeStrategy = {
   case _ => MergeStrategy.deduplicate
 }
 
+val evalMergeRules: String => MergeStrategy = {
+  case "versionchanges.txt" => MergeStrategy.discard
+  case "StaticLoggerBinder" => MergeStrategy.discard
+  case PathList("META-INF", fileName)
+    if List("NOTICE", "MANIFEST.MF", "DEPENDENCIES", "INDEX.LIST").contains(fileName) || fileName.endsWith(".txt")
+    => MergeStrategy.discard
+  case PathList("META-INF", "services", _ @ _*) => MergeStrategy.first
+  case PathList("META-INF", xs @ _*) => MergeStrategy.first
+  case PathList("org", "apache", "spark", _ @ _*) => MergeStrategy.discard
+  case PathList("apache", "commons", "logging", "impl", xs @ _*) => MergeStrategy.discard
+  case _ => MergeStrategy.deduplicate
+}
+
 assemblyMergeStrategy in assembly := {
   case PathList("apache.commons.lang3", _ @ _*) => MergeStrategy.discard
-  case PathList("org.apache.hadoop", _ @ _*) => MergeStrategy.last
-  case PathList("com.amazonaws", _ @ _*) => MergeStrategy.last
+  case PathList("org.apache.hadoop", xs @ _*) => MergeStrategy.first
+  case PathList("com.amazonaws", xs @ _*) => MergeStrategy.last
   case PathList("com.fasterxml.jackson") => MergeStrategy.first
   case PathList("META-INF", "io.netty.versions.properties") => MergeStrategy.first
   case PathList("org", "tensorflow", _ @ _*) => MergeStrategy.first
@@ -187,7 +200,15 @@ assemblyMergeStrategy in assembly := {
 lazy val evaluation = (project in file("eval"))
   .settings(
     name := "spark-nlp-eval",
-    version := "2.0.8",
+    version := "2.1.0",
+
+    assemblyMergeStrategy in assembly := evalMergeRules,
+
+    libraryDependencies ++= testDependencies ++ Seq(
+      "org.mlflow" % "mlflow-client" % "1.0.0"
+    ),
+
+    test in assembly := {},
 
     publishTo := Some(
       if (isSnapshot.value)
@@ -220,7 +241,7 @@ lazy val evaluation = (project in file("eval"))
 lazy val ocr = (project in file("ocr"))
   .settings(
     name := "spark-nlp-ocr",
-    version := "2.0.8",
+    version := "2.1.0",
 
     test in assembly := {},
 
@@ -294,9 +315,10 @@ copyAssembledOcrJar := {
   println(s"[info] $jarFilePath copied to $newJarFilePath ")
 }
 
+// Includes spark-nlp, so use sparknlp.jar
 copyAssembledEvalJar := {
   val jarFilePath = (assemblyOutputPath in assembly in "evaluation").value
-  val newJarFilePath = baseDirectory( _ / "python" / "lib" / "sparknlp-eval.jar").value
+  val newJarFilePath = baseDirectory( _ / "python" / "lib" / "sparknlp.jar").value
   IO.copyFile(jarFilePath, newJarFilePath)
   println(s"[info] $jarFilePath copied to $newJarFilePath ")
 }

docs/_layouts/landing.html

Lines changed: 5 additions & 5 deletions
@@ -49,22 +49,22 @@ <h1>{{ _section.title }}</h1>
 <div class="cell cell--12 cell--lg-12" style="text-align: left; background-color: #2d2d2d; padding: 10px">
 {% highlight bash %}
 # Install Spark NLP from PyPI
-$ pip install spark-nlp==2.0.8
+$ pip install spark-nlp==2.1.0
 
 # Install Spark NLP from Anacodna/Conda
 $ conda install -c johnsnowlabs spark-nlp
 
 # Load Spark NLP with Spark Shell
-$ spark-shell --packages JohnSnowLabs:spark-nlp:2.0.8
+$ spark-shell --packages JohnSnowLabs:spark-nlp:2.1.0
 
 # Load Spark NLP with PySpark
-$ pyspark --packages JohnSnowLabs:spark-nlp:2.0.8
+$ pyspark --packages JohnSnowLabs:spark-nlp:2.1.0
 
 # Load Spark NLP with Spark Submit
-$ spark-submit --packages JohnSnowLabs:spark-nlp:2.0.8
+$ spark-submit --packages JohnSnowLabs:spark-nlp:2.1.0
 
 # Load Spark NLP as external JAR after comiling and bulding Spark NLP by `sbt assembly`
-$ spark-shell --jar spark-nlp-assembly-2.0.8
+$ spark-shell --jar spark-nlp-assembly-2.1.0
 {% endhighlight %}
 </div>
 </div>
