6.2.1 #14692

DevinTDHa · 2025-11-07T16:51:30Z

DevinTDHa
Nov 7, 2025
Maintainer

📢 Spark NLP 6.2.1: Enhanced hierarchical document processing and training optimizations

Spark NLP 6.2.1 brings significant improvements to document ingestion with expanded hierarchical support, XML processing enhancements, and optimizations for NerDL training. This release builds on the foundation of 6.2.0, continuing to focus on structure-awareness, flexibility, and performance for production NLP pipelines.

🔥 Highlights

Hierarchical Document Processing: Extended support for PDF, Word, and Markdown with parent-child element relationships
NerDLApproach Training Optimizations: Reduced memory footprint and improved training performance with BERT-based embeddings
Improved Document Output Format: Single document annotations by default for more intuitive behavior with large documents
Enhanced XML Reading: Attribute extraction and improved tag handling in Reader2Doc

🚀 New Features & Enhancements

Hierarchical Support for Multiple Document Formats

Building on the HTMLReader hierarchical features introduced in 6.2.0, this release extends structured element tracking to additional document formats:

Reader2Doc now supports hierarchical processing for PDF, Microsoft Word, and Markdown files
Each extracted element includes:
- element_id: Unique UUID identifier per element
- parent_id: References the parent element's ID for logical document structure

Enables tree-like navigation and contextual understanding of document hierarchy:

Chapter 1
 ├── Narrative Text A
 ├── Narrative Text B
Chapter 2
 ├── Paragraph C

Supports advanced use cases including hierarchical retrieval, graph-based indexing, and multi-level document analysis
Metadata propagation ensures downstream annotators maintain structural relationships
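
As a rough illustration of the tree-like navigation described above, the following sketch rebuilds a hierarchy from element_id and parent_id values in plain Python. It does not call Spark NLP. The record layout, the text field, the short placeholder IDs (real IDs are UUIDs), and the use of None for elements without a parent are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical records. Only the element_id and parent_id names come from the
# release notes; the "text" field, the short IDs, and None for "no parent"
# are assumptions for illustration.
elements = [
    {"element_id": "a1", "parent_id": None, "text": "Chapter 1"},
    {"element_id": "a2", "parent_id": "a1", "text": "Narrative Text A"},
    {"element_id": "a3", "parent_id": "a1", "text": "Narrative Text B"},
    {"element_id": "b1", "parent_id": None, "text": "Chapter 2"},
    {"element_id": "b2", "parent_id": "b1", "text": "Paragraph C"},
]


def build_children_index(records):
    """Group records by parent_id so children can be looked up directly."""
    children = defaultdict(list)
    for record in records:
        children[record["parent_id"]].append(record)
    return children


def render_tree(records):
    """Return one indented line per element, parents before their children."""
    children = build_children_index(records)
    lines = []

    def visit(parent_id, depth):
        for record in children.get(parent_id, []):
            lines.append("  " * depth + record["text"])
            visit(record["element_id"], depth + 1)

    visit(None, 0)  # start from the elements that have no parent
    return lines


for line in render_tree(elements):
    print(line)
```

The same parent-to-children index could back hierarchical retrieval, for example by returning a chapter heading together with every element below it.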

NerDLApproach Training Optimizations

Significant performance improvements for training of NerDLApproach:

Reduced Memory Usage with BERT-based embeddings: Optimized output embedding allocations, lowering peak memory footprint during training
Automatic Dataset Caching: When using setEnableMemoryOptimizer(true) with maxEpoch > 1, input datasets are automatically cached to improve training speed
Graph Metadata Reuse: NerDLGraphChecker now populates TensorFlow graph metadata that NerDLApproach can reuse, reducing redundant computations during training initialization

With all these improvements, you can expect up to half the memory consumption and training time in RAM-constrained environments (when using setEnableMemoryOptimizer(true)). For larger distributed datasets, the effect will be more pronounced.

XML Reader and Reader2Doc Enhancements

Single Document Output by Default: Reader2Doc now creates a single document annotation per file by default, which gives more predictable behavior when processing large documents
- Lines are joined with a newline character (\n) by default; use the new setJoinString(string) parameter to set a custom separator
- Automatically includes specified attribute values in document output
Improved Tag Handling: XML reader now ignores empty tags without text content, reducing noise in parsed output
Enhanced content type handling for application/xml documents

XML Tag Attribute Extraction: New setExtractTagAttributes(attributes: list[str]) parameter enables extraction of XML attribute values. Example:

<bookstore>
    <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="web">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
    </book>
</bookstore>

We can extract the category and lang values with the following Reader2Doc configuration:

reader2doc = Reader2Doc() \
    .setContentType("application/xml") \
    .setContentPath("../src/test/resources/reader/xml/test.xml") \
    .setOutputCol("document") \
    .setExtractTagAttributes(["category", "lang"])

Resulting in:

children
en
Harry Potter
J K. Rowling
2005
29.99
web
en
Learning XML
Erik T. Ray
2003
39.95
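
To make the ordering of that output easier to follow, here is a small standard-library sketch that mimics the described behavior: requested attribute values are emitted before each element's text, tags without text are skipped, and the lines are joined into one document. It is an approximation written for illustration, not the Reader2Doc implementation; extract_lines and to_single_document are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

XML_SOURCE = """<bookstore>
    <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="web">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
    </book>
</bookstore>"""


def extract_lines(xml_text, attributes):
    """Walk elements in document order, emitting requested attribute
    values first and then the element's own text, if it has any."""
    lines = []
    for element in ET.fromstring(xml_text).iter():
        for name in attributes:
            if name in element.attrib:
                lines.append(element.attrib[name])
        text = (element.text or "").strip()
        if text:  # elements without text content contribute no line
            lines.append(text)
    return lines


def to_single_document(lines, join_string="\n"):
    """Join all lines into one string, mirroring the single-document default."""
    return join_string.join(lines)


lines = extract_lines(XML_SOURCE, ["category", "lang"])
document = to_single_document(lines)
print(document)
```

Passing a different join_string, such as a single space, plays the role that setJoinString has in Reader2Doc.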

🐛 Bug Fixes

Colab Environment Setup: Added Java installation to Colab setup script for improved out-of-the-box compatibility

❤️ Community Support

Slack - real-time discussion with the Spark NLP community and team
GitHub - issue tracking, feature requests, and contributions
Discussions - community ideas and showcases
Medium - latest Spark NLP articles and tutorials
YouTube - educational videos and demos

💻 Installation

Python

pip install spark-nlp==6.2.1

Spark Packages

CPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.2.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.2.1

Apple Silicon

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.2.1

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.2.1

Maven

<dependency>
  <groupId>com.johnsnowlabs.nlp</groupId>
  <artifactId>spark-nlp_2.12</artifactId>
  <version>6.2.1</version>
</dependency>

FAT JARs

What's Changed

Full Changelog: 6.2.0...6.2.1

This discussion was created from the release 6.2.1.
