6.2.0

@DevinTDHa DevinTDHa released this 22 Oct 15:38

📢 Spark NLP 6.2.0: A new stage for unstructured document ingestion and processing at scale

Spark NLP 6.2.0 introduces key upgrades across entity extraction, document normalization, HTML reading, and GGUF-based models. To recap, since the Spark NLP 6.1 releases you can:

  • Run inference with quantized, cutting-edge LLMs and VLMs such as Gemma 3, Phi-4, Llama 3.1, and Qwen 2.5
  • Rerank documents using llama.cpp with AutoGGUFReranker
  • Ingest unstructured documents of diverse formats
    • Reader2Doc: streamlines the process of loading and integrating diverse file formats (PDFs, Word, Excel, PowerPoint, HTML, Text, Email, Markdown) directly into Spark NLP pipelines with a unified and flexible interface.
    • Reader2Table: streamlines tabular data extraction from multiple document formats with seamless pipeline integration.
    • Reader2Image: extracts structured image content from various document types.

Spark NLP release 6.2.0 further focuses on automation, structure-awareness, and resource efficiency, making pipelines easier to configure, manage, and extend.

🔥 Highlights

  • Auto Modes for EntityRuler and DocumentNormalizer: automatic regex and text-cleaning presets for faster setup.
  • Hierarchical Element Tracking in HTMLReader: adds element and parent identifiers for structure-aware document processing.
  • Resource Management for AutoGGUF Annotators: improved control and cleanup of llama.cpp-based models.

🚀 New Features & Enhancements

EntityRulerModel and DocumentNormalizer Auto Modes

EntityRulerModel

  • Added autoMode parameter to enable predefined regex entity groups ("network_entities", "communication_entities", "media_entities", "email_entities", "all_entities").
  • Added extractEntities parameter to filter entities within auto modes.
  • Applies case-insensitive regex presets automatically when autoMode is set, and falls back to manual mode when it is not.
  • Retains full backward compatibility with JSON or RocksDB-based rules.
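
To make the idea of an auto mode concrete, here is a minimal plain-Python sketch (not Spark NLP code) of how a preset name could expand into a group of case-insensitive regexes, with an optional allow-list playing the role that extractEntities has in the notes above. The group names come from this release; the entity labels (EMAIL, IPV4), the patterns, and the helper name `extract` are illustrative assumptions, not the presets shipped with the library.

```python
import re

# Hypothetical preset groups: the group names come from the release notes,
# but the labels and patterns below are illustrative assumptions only.
PRESETS = {
    "email_entities": {"EMAIL": r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}"},
    "network_entities": {"IPV4": r"\b(?:\d{1,3}\.){3}\d{1,3}\b"},
}


def extract(text, auto_mode, extract_entities=None):
    """Return (label, match) pairs found by the chosen preset group."""
    found = []
    for label, pattern in PRESETS[auto_mode].items():
        # An optional allow-list narrows the preset to selected labels.
        if extract_entities is not None and label not in extract_entities:
            continue
        # Presets are applied case-insensitively.
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((label, m.group(0)))
    return found


sample = "Contact Admin@Example.COM from host 10.0.0.12."
emails = extract(sample, "email_entities")
ips = extract(sample, "network_entities", extract_entities=["IPV4"])
```

In the actual annotator the equivalent choice is made through the autoMode and extractEntities parameters rather than a Python dictionary.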

DocumentNormalizer

  • Added presetPattern and autoMode parameters to apply built-in text cleaning patterns.
  • New modes include "light_clean", "document_clean", "social_clean", "html_clean", and "full_auto".
  • Enables quick application of multiple cleaning operations without manual configuration.
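
As a rough illustration of what a cleaning preset bundles together, the plain-Python sketch below maps a preset name to an ordered list of regex substitutions. The preset names are taken from the list above; the specific steps and the helper `normalize` are assumptions made for this example and are not the library's built-in patterns.

```python
import re

# Each preset maps to an ordered list of (pattern, replacement) steps.
# Preset names are taken from the release notes; the steps are assumed
# for illustration and are not the library's built-in patterns.
PRESET_STEPS = {
    "light_clean": [(r"\s+", " ")],
    "html_clean": [(r"<[^>]+>", ""), (r"\s+", " ")],
}


def normalize(text, preset):
    """Apply every cleaning step of the preset in order, then trim."""
    for pattern, replacement in PRESET_STEPS[preset]:
        text = re.sub(pattern, replacement, text)
    return text.strip()


cleaned = normalize("<p>Hello,   <b>world</b>!</p>", "html_clean")
```

Here a single preset name stands in for two substitutions (tag removal, then whitespace collapsing) that would otherwise be configured one by one.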

Together, these additions significantly reduce boilerplate setup for common text extraction and normalization workflows.

Hierarchical Element Identification in HTMLReader

  • Introduced element_id and parent_id metadata fields for each parsed HTML element.
  • Enables explicit structural relationships (e.g., title → paragraph → link) for hierarchical retrieval and contextual reasoning.
  • Supports graph-based indexing, hybrid search, and multi-level document analysis.
  • Metadata propagation improvements ensure Sentence Detector outputs also retain upstream hierarchy information.
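
One way to use these identifiers downstream is to rebuild the element tree from flat metadata records. The sketch below is plain Python and assumes each parsed element is available as a dictionary. Only the element_id and parent_id keys come from the notes above; the id values, the `type` and `text` keys, and the helper `ancestry` are hypothetical.

```python
from collections import defaultdict

# Hypothetical metadata records: only the element_id and parent_id keys
# come from the release notes; ids, types, and text are made up.
elements = [
    {"element_id": "e1", "parent_id": None, "type": "title", "text": "Guide"},
    {"element_id": "e2", "parent_id": "e1", "type": "paragraph", "text": "Intro"},
    {"element_id": "e3", "parent_id": "e2", "type": "link", "text": "docs"},
]

# Index elements by id and group child ids under each parent id.
by_id = {e["element_id"]: e for e in elements}
children = defaultdict(list)
for e in elements:
    children[e["parent_id"]].append(e["element_id"])


def ancestry(element_id):
    """Walk parent_id links upward and return the path from the root."""
    path = []
    current = by_id.get(element_id)
    while current is not None:
        path.append(current["type"])
        current = by_id.get(current["parent_id"])
    return list(reversed(path))


link_path = ancestry("e3")
```

The `children` index supports top-down traversal (for example, collecting every paragraph under a title), while `ancestry` gives the bottom-up context of a single element.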

AutoGGUF Annotator Enhancements

For AutoGGUFModel, AutoGGUFVision, AutoGGUFEmbeddings, and AutoGGUFReranker:

  • Added close() method to explicitly release llama.cpp model resources, preventing memory retention in long-running sessions.
  • Introduced setRemoveThinkingTag(tag: String) parameter to remove internal <think>...</think> sections from model outputs.
    • Regex pattern: (?s)<$tag>.+?</$tag>
    • Simplifies downstream processing for chat and reasoning models.
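
The effect of the regex pattern above can be reproduced outside Spark NLP. The following plain-Python sketch builds the same pattern shape with the `re` module. The function name `remove_thinking_tag` and the default tag value are assumptions for this example, and the library may post-process the result differently (for instance, trimming whitespace).

```python
import re


def remove_thinking_tag(text, tag="think"):
    """Drop every <tag>...</tag> span, including spans that cross lines."""
    # Same shape as the pattern in the release notes: (?s) lets "." match
    # newlines, and ".+?" stops at the first closing tag.
    pattern = re.compile(rf"(?s)<{re.escape(tag)}>.+?</{re.escape(tag)}>")
    return pattern.sub("", text)


raw = "<think>\nstep 1\nstep 2\n</think>The answer is 4."
answer = remove_thinking_tag(raw)
```

Because the pattern uses `.+?` rather than `.*?`, it needs at least one character between the tags, so in this sketch an empty `<think></think>` pair is left in place.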

🐛 Bug Fixes

  • RobertaEmbeddings Warmup Test - fixed token sequence bug where unknown tokens caused initialization errors.

❤️ Community Support

  • Slack - real-time discussion with the Spark NLP community and team
  • GitHub - issue tracking, feature requests, and contributions
  • Discussions - community ideas and showcases
  • Medium - latest Spark NLP articles and tutorials
  • YouTube - educational videos and demos

💻 Installation

Python

pip install spark-nlp==6.2.0

Spark Packages

CPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.2.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.2.0

Apple Silicon

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.2.0

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.2.0

Maven

<dependency>
  <groupId>com.johnsnowlabs.nlp</groupId>
  <artifactId>spark-nlp_2.12</artifactId>
  <version>6.2.0</version>
</dependency>

Full Changelog: 6.1.5...6.2.0