📢 Spark NLP 6.2.0: A new stage for unstructured document ingestion and processing at scale

Spark NLP 6.2.0 introduces key upgrades across entity extraction, document normalization, HTML reading, and GGUF-based models. To recap, since the Spark NLP 6.1 releases you can:

Infer quantized cutting-edge LLMs and VLMs such as Gemma 3, Phi-4, Llama 3.1, Qwen 2.5
Rerank documents using llama.cpp with AutoGGUFReranker
Ingest unstructured documents of diverse formats
- Reader2Doc: streamlines the process of loading and integrating diverse file formats (PDFs, Word, Excel, PowerPoint, HTML, Text, Email, Markdown) directly into Spark NLP pipelines with a unified and flexible interface.
- Reader2Table: streamlines tabular data extraction from multiple document formats with seamless pipeline integration.
- Reader2Image: extracts structured image content from various document types.

Spark NLP release 6.2.0 further focuses on automation, structure-awareness, and resource efficiency, making pipelines easier to configure, manage, and extend.

🔥 Highlights

Auto Modes for EntityRuler and DocumentNormalizer: automatic regex and text-cleaning presets for faster setup.
Hierarchical Element Tracking in HTMLReader: adds element and parent identifiers for structure-aware document processing.
Resource Management for AutoGGUF Annotators: improved control and cleanup of llama.cpp-based models.

🚀 New Features & Enhancements

EntityRulerModel and DocumentNormalizer Auto Modes

`EntityRulerModel`

Added autoMode parameter to enable predefined regex entity groups ("network_entities", "communication_entities", "media_entities", "email_entities", "all_entities").
Added extractEntities parameter to filter entities within auto modes.
Automatically applies case-insensitive regex presets, and falls back to manual mode when autoMode is not set.
Retains full backward compatibility with JSON or RocksDB-based rules.

`DocumentNormalizer`

Added presetPattern and autoMode parameters to apply built-in text cleaning patterns.
New modes include "light_clean", "document_clean", "social_clean", "html_clean", and "full_auto".
Enables quick application of multiple cleaning operations without manual configuration.

Together, these additions significantly reduce boilerplate setup for common text extraction and normalization workflows.
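To make the idea of named preset groups with an entity filter concrete, here is a minimal plain-Python sketch. It does not call Spark NLP: the `PRESETS` table, the two regexes, the labels `EMAIL` and `IPV4`, and the `extract` helper are all hypothetical stand-ins, since these notes do not list the patterns that ship with the library. Only the group names and the case-insensitive matching come from the description above.

```python
import re

# Hypothetical preset groups keyed by auto-mode name.
# The real patterns bundled with Spark NLP are not shown in these notes.
PRESETS = {
    "email_entities": {"EMAIL": r"[\w.+-]+@[\w-]+\.[\w.-]+"},
    "network_entities": {"IPV4": r"\b(?:\d{1,3}\.){3}\d{1,3}\b"},
}


def extract(text, auto_mode, extract_entities=None):
    """Apply the regexes of one preset group, optionally keeping only some labels."""
    if auto_mode == "all_entities":
        # Merge every group into a single label -> pattern mapping.
        patterns = {label: p for group in PRESETS.values() for label, p in group.items()}
    else:
        patterns = PRESETS[auto_mode]

    found = []
    for label, pattern in patterns.items():
        if extract_entities and label not in extract_entities:
            continue  # Entity filtered out by the caller.
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((label, match.group()))
    return found


text = "Ping 10.0.0.1 or mail Ops@Example.com"
print(extract(text, "all_entities"))
print(extract(text, "all_entities", extract_entities=["IPV4"]))
```

The second call mirrors what an entity filter does inside an auto mode: the preset group stays the same, but only the requested labels are returned.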

Hierarchical Element Identification in HTMLReader

Introduced element_id and parent_id metadata fields for each parsed HTML element.
Enables explicit structural relationships (e.g., title → paragraph → link) for hierarchical retrieval and contextual reasoning.
Supports graph-based indexing, hybrid search, and multi-level document analysis.
Metadata propagation improvements ensure Sentence Detector outputs also retain upstream hierarchy information.
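As a rough illustration of what the two identifiers enable, the sketch below rebuilds parent/child links from a handful of element records in plain Python. The records are made up for the example, and the `type` and `text` keys are assumptions about the surrounding metadata; only `element_id` and `parent_id` are named in these notes.

```python
from collections import defaultdict

# Hypothetical parsed elements; only element_id and parent_id come from the release notes.
elements = [
    {"element_id": "e1", "parent_id": None, "type": "title", "text": "Release notes"},
    {"element_id": "e2", "parent_id": "e1", "type": "paragraph", "text": "Spark NLP 6.2.0 is out."},
    {"element_id": "e3", "parent_id": "e2", "type": "link", "text": "Full changelog"},
    {"element_id": "e4", "parent_id": "e1", "type": "paragraph", "text": "Install with pip."},
]

# Index elements by id and group child ids under their parent id.
by_id = {e["element_id"]: e for e in elements}
children = defaultdict(list)
for e in elements:
    children[e["parent_id"]].append(e["element_id"])


def ancestors(element_id):
    """Return the texts from the root element down to the given element."""
    chain = []
    current = by_id.get(element_id)
    while current is not None:
        chain.append(current["text"])
        current = by_id.get(current["parent_id"])
    return list(reversed(chain))


print(ancestors("e3"))   # context chain for the link
print(children["e1"])    # direct children of the title
```

A retrieval step could use a chain like the one returned by `ancestors` to attach the enclosing title and paragraph to a matched link before passing it downstream.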

AutoGGUF Annotator Enhancements

Applies to AutoGGUFModel, AutoGGUFVision, AutoGGUFEmbeddings, and AutoGGUFReranker:

Added close() method to explicitly release llama.cpp model resources, preventing memory retention in long-running sessions.
Introduced setRemoveThinkingTag(tag: String) parameter to remove internal <think>...</think> sections from model outputs.
- Regex pattern: (?s)<$tag>.+?</$tag>
- Simplifies downstream processing for chat and reasoning models.
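The effect of the pattern above can be reproduced outside Spark NLP with Python's `re` module. This is a sketch of the regex only, not of the annotator: the helper name `remove_thinking_tag` is hypothetical, and whether the library also trims surrounding whitespace is not stated in these notes.

```python
import re


def remove_thinking_tag(text: str, tag: str = "think") -> str:
    """Strip <tag>...</tag> sections using the pattern (?s)<$tag>.+?</$tag>."""
    # (?s) lets "." match newlines; ".+?" is non-greedy so each section is removed separately.
    pattern = re.compile(rf"(?s)<{re.escape(tag)}>.+?</{re.escape(tag)}>")
    return pattern.sub("", text)


raw = "<think>\nThe user wants a greeting.\n</think>Hello there!"
cleaned = remove_thinking_tag(raw)
print(cleaned)  # Hello there!
```

Because the match is non-greedy, text sitting between two separate thinking sections is preserved rather than swallowed by a single match.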

🐛 Bug Fixes

RobertaEmbeddings Warmup Test - fixed token sequence bug where unknown tokens caused initialization errors.

❤️ Community Support

Slack - real-time discussion with the Spark NLP community and team
GitHub - issue tracking, feature requests, and contributions
Discussions - community ideas and showcases
Medium - latest Spark NLP articles and tutorials
YouTube - educational videos and demos

💻 Installation

Python

pip install spark-nlp==6.2.0

Spark Packages

CPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.2.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.2.0

Apple Silicon

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.2.0

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.2.0

Maven

<dependency>
  <groupId>com.johnsnowlabs.nlp</groupId>
  <artifactId>spark-nlp_2.12</artifactId>
  <version>6.2.0</version>
</dependency>

FAT JARs

What's Changed

Full Changelog: 6.1.5...6.2.0
