Commit 4119c2d

Merge pull request #12763 from JohnSnowLabs/release/420-release-candidate
Release/420 release candidate
2 parents 5c523eb + f7b8bae commit 4119c2d

1,553 files changed (+418,491, -96,976 lines)

.sync/ignoreFiles

Lines changed: 352 additions & 0 deletions
@@ -0,0 +1,352 @@
docs
docs/*
*/docs
*/docs/*
**/docs/**
docs/**
**/*.min.js
**/*.js
**/*.py
python/**
**/python/**
/python/tensorflow/
/python/tensorflow/*

target
target/*
/target
/target/*
*/target/*

*.json

### Eclipse ###

.metadata
bin/
tmp/
*.tmp
*.bak
*.swp
*~.nib
local.properties
.settings/
.loadpath
.recommenders
PubMed*
*cache_pretrained*
*.crc
*.sst
_SUCCESS*
*stages*
*auxdata*
# External tool builders
.externalToolBuilders/

# Locally stored "Eclipse launch configurations"
*.launch

# PyDev specific (Python IDE for Eclipse)
*.pydevproject

# CDT-specific (C/C++ Development Tooling)
.cproject

# Java annotation processor (APT)
.factorypath

# PDT-specific (PHP Development Tools)
.buildpath

# sbteclipse plugin
.target

# Tern plugin
.tern-project

# TeXlipse plugin
.texlipse

# STS (Spring Tool Suite)
.springBeans

# Code Recommenders
.recommenders/

# Scala IDE specific (Scala & Java development for Eclipse)
.cache-main
.scala_dependencies
.worksheet

### Eclipse Patch ###
# Eclipse Core
.project

# JDT-specific (Eclipse Java Development Tools)
.classpath

### Intellij ###
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and Webstorm
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839

# User-specific stuff:
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/dictionaries

# Sensitive or high-churn files:
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.xml
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml

# Gradle:
.idea/**/gradle.xml
.idea/**/libraries

# CMake
cmake-build-debug/

# Mongo Explorer plugin:
.idea/**/mongoSettings.xml

## File-based project format:
*.iws

## Plugin-specific files:

# IntelliJ
/out/

# mpeltonen/sbt-idea plugin
.idea_modules/

# JIRA plugin
atlassian-ide-plugin.xml

# Cursive Clojure plugin
.idea/replstate.xml

# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties

### Intellij Patch ###
# Comment Reason: https://github.com/joeblau/gitignore.io/issues/186#issuecomment-215987721

*.iml
# modules.xml
# .idea/misc.xml
# *.ipr

# Sonarlint plugin
.idea/sonarlint

### Intellij+all ###
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and Webstorm
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839

# User-specific stuff:

# Sensitive or high-churn files:

# Gradle:

# CMake

# Mongo Explorer plugin:

## File-based project format:

## Plugin-specific files:

# IntelliJ

# mpeltonen/sbt-idea plugin

# JIRA plugin

# Cursive Clojure plugin

# Crashlytics plugin (for Android Studio and IntelliJ)

### Intellij+all Patch ###
# Ignores the whole idea folder
# See https://github.com/joeblau/gitignore.io/issues/186 and https://github.com/joeblau/gitignore.io/issues/360

.idea/

### Java ###
# Compiled class file
*.class

# Log file
*.log

# BlueJ files
*.ctxt

# Mobile Tools for Java (J2ME)
.mtj.tmp/

# Package Files #
*.jar
*.war
*.ear
*.zip
*.tar.gz
*.rar

# virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml
hs_err_pid*

### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
python/lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/
docs/vendor/

# Frontend
docs/_frontend/node_modules
docs/_frontend/static

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

### SBT ###
# Simple Build Tool
# http://www.scala-sbt.org/release/docs/Getting-Started/Directories.html#configuring-version-control

dist/*
lib_managed/
src_managed/
project/boot/
project/plugins/project/
.history
.lib/

### Scala ###

# End of https://www.gitignore.io/api/sbt,java,scala,python,eclipse,intellij,intellij+all

### Local ###
tmp_pipeline/
tmp_symspell/
test-output-tmp/
spark-warehouse/
/python/python.iml
test_crf_pipeline/
test_*_pipeline/
*metastore_db*
python/src/
python/tensorflow/bert/models/**
**/.DS_Store
**/tmp_*
docs/_site/**
docs/.sass-cache/**
tst_shortcut_sd/
src/*/resources/*.classes
/word_segmenter_metrics/
/special_class.ser
.bsp/sbt.json
python/docs/_build/**
python/docs/reference/_autosummary/**

CHANGELOG

Lines changed: 24 additions & 0 deletions
@@ -1,3 +1,27 @@
========
4.2.0
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing the **Wav2Vec2ForCTC** annotator in Spark NLP 🚀. `Wav2Vec2ForCTC` can load `Wav2Vec2` models for the Automatic Speech Recognition (ASR) task. Wav2Vec2 is a multi-modal model that combines speech and text. It is the first multi-modal model of its kind in Spark NLP. This annotator is compatible with all the models trained/fine-tuned by using `Wav2Vec2ForCTC` for **PyTorch** or `TFWav2Vec2ForCTC` for **TensorFlow** in HuggingFace 🤗 (https://github.com/JohnSnowLabs/spark-nlp/pull/12767)
* **NEW:** Introducing the **TapasForQuestionAnswering** annotator in Spark NLP 🚀. `TapasForQuestionAnswering` can load TAPAS models with a cell selection head and an optional aggregation head on top for question-answering tasks on tables (linear layers on top of the hidden-states output to compute logits and optional logits_aggregation), e.g. for SQA, WTQ or WikiSQL-supervised tasks. TAPAS is a BERT-based model specifically designed (and pre-trained) for answering questions about tabular data. This annotator is compatible with all the models trained/fine-tuned by using `TapasForQuestionAnswering` for **PyTorch** or `TFTapasForQuestionAnswering` for **TensorFlow** in HuggingFace 🤗
* **NEW:** Introducing the **CamemBertForTokenClassification** annotator in Spark NLP 🚀. `CamemBertForTokenClassification` can load CamemBERT models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `CamembertForTokenClassification` for **PyTorch** or `TFCamembertForTokenClassification` for **TensorFlow** in HuggingFace 🤗
(https://github.com/JohnSnowLabs/spark-nlp/pull/12752)
* Implementing `setTestDataset` to evaluate metrics on an external dataset during training of Text Classifiers in Spark NLP. This feature is similar to the one in `NerDLApproach`, where metrics are calculated on each epoch, and it has been added to the following multi-class/multi-label text classifier annotators: `ClassifierDLApproach`, `SentimentDLApproach`, and `MultiClassifierDLApproach` (https://github.com/JohnSnowLabs/spark-nlp/pull/12796)
* Refactoring and improving the `EntityRuler` annotator, making inference up to 24x faster, especially when used with a long list of labels/entities. Inference is sped up by implementing the Aho-Corasick algorithm to match patterns in a string. This requires changes when using `EntityRuler` (https://github.com/JohnSnowLabs/spark-nlp/pull/12634)
* Add support for S3 storage in the `cache_folder` where models are downloaded, extracted, and loaded from. Previously, only local file systems, HDFS, and DBFS were supported. This new feature is especially useful for users on Kubernetes clusters with no access to HDFS or any other distributed file systems (https://github.com/JohnSnowLabs/spark-nlp/pull/12707)
* Implementing `lookaround` functionality in the `DocumentNormalizer` annotator. Currently, `DocumentNormalizer` has both `lookahead` and `lookbehind` functionalities. To extend support for more complex normalizations, especially in clinical text, we are introducing the `lookaround` feature (https://github.com/JohnSnowLabs/spark-nlp/pull/12735)
* Implementing the `setReplaceEntities` param in the `NerOverwriter` annotator to replace all the NER labels (entities) with the given new labels (entities) (https://github.com/JohnSnowLabs/spark-nlp/pull/12745)
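
The `EntityRuler` entry above credits its speed-up to the Aho-Corasick algorithm. As a rough illustration of why that helps with long pattern lists, here is a minimal pure-Python sketch of the algorithm: all patterns are compiled into one automaton, and the text is then scanned once regardless of how many patterns there are. This is not Spark NLP's implementation; `build_automaton` and `find_all` are hypothetical names chosen for this example.

```python
from collections import deque


def build_automaton(patterns):
    """Build an Aho-Corasick automaton (illustrative sketch, not Spark NLP code)."""
    goto = [{}]   # goto[state][char] -> next state (trie edges)
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> patterns that end at this state

    # 1. Insert every pattern into the trie.
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # 2. Breadth-first pass to compute failure links.
    #    Depth-1 states keep fail == 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        # Follow failure links until some state accepts ch (or we reach the root).
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", automaton))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The scan cost grows with the length of the text plus the number of matches, not with the number of patterns, which is consistent with the changelog's note that the gain is largest for long lists of labels/entities.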

----------------
Bug Fixes
----------------
* Fix a bug in generating the NerDL graph by using TF v2. The previous graph generated by the `TFGraphBuilder` annotator resulted in an exception when the length of the sequence was 1. This issue has been resolved and the new graphs created by `TFGraphBuilder` won't have this issue anymore (https://github.com/JohnSnowLabs/spark-nlp/pull/12636)
* Fix a bug introduced in the 4.0.0 release in Transformer-based Word Embeddings annotators. In the 4.0.0 release, the following annotators were migrated to BatchAnnotate to improve their performance, especially on GPU: AlbertEmbeddings, CamemBertEmbeddings, DeBertaEmbeddings, DistilBertEmbeddings, LongformerEmbeddings, RoBertaEmbeddings, XlmRoBertaEmbeddings, and XlnetEmbeddings. However, the migration introduced a bug in sentence indices which, when these annotators were combined with SentenceEmbeddings for Text Classification tasks (ClassifierDLApproach and SentimentDLApproach), resulted in low accuracy (https://github.com/JohnSnowLabs/spark-nlp/pull/12641)
* Add support for a list of questions and contexts in LightPipeline. Previously, only one context and question at a time were supported in LightPipeline for Question Answering annotators. We have added support to `fullAnnotate` and `annotate` to receive two lists of questions and contexts (https://github.com/JohnSnowLabs/spark-nlp/pull/12653)
* Fix a division-by-zero exception in the `GPT2Transformer` annotator when the `setDoSample` param was set to true (https://github.com/JohnSnowLabs/spark-nlp/pull/12661)
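
The `lookaround` entry under New Features & Enhancements combines the ideas of lookahead and lookbehind. In general regex terms (shown here with Python's standard `re` module, not the `DocumentNormalizer` API), a lookaround match constrains what comes before and after a span without consuming that context. The sample text, pattern, and replacement token below are made up for illustration.

```python
import re

# Hypothetical clinical-style text; values and units are invented for this example.
text = "BP 120 mmHg, HR 72 bpm, dose 120 mg"

# Lookbehind (?<=...) requires "BP " before the number and lookahead (?=...)
# requires " mmHg" after it. Neither assertion is consumed, so only the digits
# themselves are replaced and the surrounding context stays intact.
pattern = re.compile(r"(?<=BP )\d+(?= mmHg)")

normalized = pattern.sub("<BP_VALUE>", text)
print(normalized)
# BP <BP_VALUE> mmHg, HR 72 bpm, dose 120 mg
```

Note that the second `120` is left alone because it is neither preceded by `BP ` nor followed by ` mmHg`, which is the kind of context-sensitive normalization the changelog entry alludes to.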

========
4.1.0
========
