6.2.0
📢 Spark NLP 6.2.0: A new stage for unstructured document ingestion and processing at scale
Spark NLP 6.2.0 introduces key upgrades across entity extraction, document normalization, HTML reading, and GGUF-based models. To recap, since the releases of Spark NLP 6.1 you can:
- Infer quantized cutting-edge LLMs and VLMs such as Gemma 3, Phi-4, Llama 3.1, Qwen 2.5
- Rerank documents using llama.cpp with AutoGGUFReranker
- Ingest unstructured documents of diverse formats
  - Reader2Doc: streamlines the process of loading and integrating diverse file formats (PDFs, Word, Excel, PowerPoint, HTML, Text, Email, Markdown) directly into Spark NLP pipelines with a unified and flexible interface.
  - Reader2Table: streamlines tabular data extraction from multiple document formats with seamless pipeline integration.
  - Reader2Image: extract structured image content from various document types
Spark NLP release 6.2.0 further focuses on automation, structure-awareness, and resource efficiency, making pipelines easier to configure, manage, and extend.
🔥 Highlights
- Auto Modes for EntityRuler and DocumentNormalizer: automatic regex and text-cleaning presets for faster setup.
- Hierarchical Element Tracking in HTMLReader: adds element and parent identifiers for structure-aware document processing.
- Resource Management for AutoGGUF Annotators: improved control and cleanup of llama.cpp-based models.
🚀 New Features & Enhancements
EntityRulerModel and DocumentNormalizer Auto Modes
EntityRulerModel
- Added autoMode parameter to enable predefined regex entity groups ("network_entities", "communication_entities", "media_entities", "email_entities", "all_entities").
- Added extractEntities parameter to filter entities within auto modes.
- Automatically applies case-insensitive regex presets and falls back to manual mode if autoMode is not specified.
- Retains full backward compatibility with JSON or RocksDB-based rules.
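To make the idea of preset entity groups concrete, here is a minimal pure-Python sketch of case-insensitive regex presets selected by a group name and optionally filtered by label. It is not the Spark NLP implementation: only the group names come from the notes above, while the labels (URL, IPV4, EMAIL), the patterns, and the function name extract_with_preset are assumptions made for illustration.

```python
import re

# Hypothetical preset table. Group names mirror the release notes;
# the labels and regex patterns are illustrative, not the shipped ones.
PRESETS = {
    "network_entities": {
        "URL": r"https?://\S+",
        "IPV4": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
    },
    "email_entities": {
        "EMAIL": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    },
}
# "all_entities" is modelled here as the union of the other groups.
PRESETS["all_entities"] = {
    **PRESETS["network_entities"],
    **PRESETS["email_entities"],
}


def extract_with_preset(text, auto_mode, extract_entities=None):
    """Return (label, matched_text) pairs for one preset group.

    extract_entities, when given, keeps only the listed labels.
    """
    results = []
    for label, pattern in PRESETS[auto_mode].items():
        if extract_entities is not None and label not in extract_entities:
            continue
        # Presets are applied case-insensitively.
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            results.append((label, match.group(0)))
    return results
```

Calling extract_with_preset(text, "all_entities", extract_entities={"IPV4"}) would keep only the IPv4 matches, which is the kind of narrowing the extractEntities parameter is described as providing.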
DocumentNormalizer
- Added presetPattern and autoMode parameters to apply built-in text cleaning patterns.
- New modes include "light_clean", "document_clean", "social_clean", "html_clean", and "full_auto".
- Enables quick application of multiple cleaning operations without manual configuration.
Together, these additions significantly reduce boilerplate setup for common text extraction and normalization workflows.
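As a rough illustration of what a cleaning preset amounts to, the sketch below maps a mode name to an ordered list of regex substitutions and applies them in sequence. The mode names are taken from the DocumentNormalizer notes above; the rules attached to each mode and the function name normalize_with_preset are hypothetical, and the built-in Spark NLP patterns may differ.

```python
import re

# Hypothetical rules per mode: (pattern, replacement) pairs applied in order.
# Only the mode names come from the release notes.
PRESET_RULES = {
    "light_clean": [
        (r"\s+", " "),       # collapse runs of whitespace
    ],
    "html_clean": [
        (r"<[^>]+>", " "),   # drop anything that looks like an HTML tag
        (r"\s+", " "),       # then collapse the whitespace left behind
    ],
}


def normalize_with_preset(text, auto_mode):
    """Apply every rule of the chosen preset, in order, and trim the result."""
    for pattern, replacement in PRESET_RULES[auto_mode]:
        text = re.sub(pattern, replacement, text)
    return text.strip()
```

The point of the design is that one mode name stands in for several chained cleaning steps, so the caller does not have to list each pattern by hand.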
Hierarchical Element Identification in HTMLReader
- Introduced element_id and parent_id metadata fields for each parsed HTML element.
- Enables explicit structural relationships (e.g., title → paragraph → link) for hierarchical retrieval and contextual reasoning.
- Supports graph-based indexing, hybrid search, and multi-level document analysis.
- Metadata propagation improvements ensure Sentence Detector outputs also retain upstream hierarchy information.
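The sketch below shows one way downstream code might use such identifiers, assuming the parsed elements have already been collected into plain Python dictionaries. Only the element_id and parent_id field names come from the notes above; the record layout, the "type" and "text" keys, the id values, and the helper names are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical flattened records, e.g. gathered from annotation metadata.
elements = [
    {"element_id": "e1", "parent_id": None, "type": "title", "text": "Install guide"},
    {"element_id": "e2", "parent_id": "e1", "type": "paragraph", "text": "Download the jar."},
    {"element_id": "e3", "parent_id": "e2", "type": "link", "text": "https://example.com/jar"},
    {"element_id": "e4", "parent_id": "e1", "type": "paragraph", "text": "Then run it."},
]


def build_children_index(records):
    """Map each parent_id to the list of its child element_ids."""
    children = defaultdict(list)
    for record in records:
        children[record["parent_id"]].append(record["element_id"])
    return dict(children)


def ancestor_chain(records, element_id):
    """Return element types from the root down to the given element."""
    by_id = {record["element_id"]: record for record in records}
    chain = []
    seen = set()  # guards against accidental cycles in the ids
    current = by_id.get(element_id)
    while current is not None and current["element_id"] not in seen:
        seen.add(current["element_id"])
        chain.append(current["type"])
        current = by_id.get(current["parent_id"])
    return list(reversed(chain))
```

With these records, ancestor_chain(elements, "e3") walks link, paragraph, title and returns them root-first, which is the title → paragraph → link relationship mentioned above; the children index is the adjacency list a graph-based index would start from.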
AutoGGUF Annotator Enhancements
For AutoGGUFModel, AutoGGUFVision, AutoGGUFEmbeddings, AutoGGUFReranker
- Added close() method to explicitly release llama.cpp model resources, preventing memory retention in long-running sessions.
- Introduced setRemoveThinkingTag(tag: String) parameter to remove internal <think>...</think> sections from model outputs.
  - Regex pattern: (?s)<$tag>.+?</$tag>
  - Simplifies downstream processing for chat and reasoning models.
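The effect of the regex pattern above can be reproduced in plain Python, assuming $tag is simply replaced by the tag name. This is a sketch of the pattern's behaviour only, not the annotator's code; the function name remove_thinking_tag is hypothetical, and whether Spark NLP also trims surrounding whitespace is not stated in these notes.

```python
import re


def remove_thinking_tag(text, tag):
    """Remove every <tag>...</tag> section from text.

    (?s) lets "." match newlines, and ".+?" is non-greedy, so each
    section ends at the nearest closing tag instead of the last one.
    """
    escaped = re.escape(tag)  # keep unusual tag names from acting as regex syntax
    pattern = re.compile(rf"(?s)<{escaped}>.+?</{escaped}>")
    return pattern.sub("", text)
```

Because the pattern uses ".+?" rather than ".*?", a completely empty pair such as <think></think> is left in place by this sketch, and surrounding whitespace is not trimmed.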
🐛 Bug Fixes
- RobertaEmbeddings Warmup Test - fixed token sequence bug where unknown tokens caused initialization errors.
❤️ Community Support
- Slack - real-time discussion with the Spark NLP community and team
- GitHub - issue tracking, feature requests, and contributions
- Discussions - community ideas and showcases
- Medium - latest Spark NLP articles and tutorials
- YouTube - educational videos and demos
💻 Installation
Python
pip install spark-nlp==6.2.0
Spark Packages
CPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.2.0
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.2.0
Apple Silicon
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.2.0
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.2.0
Maven
<dependency>
  <groupId>com.johnsnowlabs.nlp</groupId>
  <artifactId>spark-nlp_2.12</artifactId>
  <version>6.2.0</version>
</dependency>
FAT JARs
- CPU: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.2.0.jar
- GPU: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-6.2.0.jar
- Apple Silicon: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-6.2.0.jar
- AArch64: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-6.2.0.jar
What's Changed
- #14671 by @DevinTDHa
- #14672 by @DevinTDHa
- #14674 by @danilojsl
- #14675 by @danilojsl
- #14677 by @ahmedlone127
- #14673 by @AbdullahMubeenAnwar
Full Changelog: 6.1.5...6.2.0