6.2.1 #14692
DevinTDHa
announced in
Announcement
📢 Spark NLP 6.2.1: Enhanced hierarchical document processing and training optimizations
Spark NLP 6.2.1 brings significant improvements to document ingestion with expanded hierarchical support, XML processing enhancements, and optimizations for NerDL training. This release builds on the foundation of 6.2.0, continuing to focus on structure-awareness, flexibility, and performance for production NLP pipelines.
🔥 Highlights
Reader2Doc
🚀 New Features & Enhancements
Hierarchical Support for Multiple Document Formats
Building on the HTMLReader hierarchical features introduced in 6.2.0, this release extends structured element tracking to additional document formats:
Reader2Doc now supports hierarchical processing for PDF, Microsoft Word, and Markdown files
Each extracted element includes:
element_id: Unique UUID identifier per element
parent_id: References the parent element's ID for logical document structure
Enables tree-like navigation and contextual understanding of document hierarchy:
Supports advanced use cases including hierarchical retrieval, graph-based indexing, and multi-level document analysis
Metadata propagation ensures downstream annotators maintain structural relationships
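The parent/child links described above can be walked with ordinary dictionary lookups. The following is a minimal plain-Python sketch, not Spark NLP code: the records are hypothetical, the short IDs stand in for real UUIDs, and it assumes that top-level elements carry no parent_id (represented here as None), which the release notes do not state.

```python
from collections import defaultdict

# Hypothetical metadata records shaped like the element_id / parent_id fields
# described above. Real IDs are UUIDs; short strings are used for readability.
# Assumption: a top-level element has no parent, modeled here as None.
elements = [
    {"element_id": "a1", "parent_id": None, "text": "Introduction"},
    {"element_id": "b2", "parent_id": "a1", "text": "Spark NLP reads documents."},
    {"element_id": "c3", "parent_id": "a1", "text": "Background"},
    {"element_id": "d4", "parent_id": "c3", "text": "Earlier releases covered HTML."},
]


def build_children_index(records):
    """Map each parent_id to the list of its child records, in input order."""
    children = defaultdict(list)
    for record in records:
        children[record["parent_id"]].append(record)
    return children


def ancestors(element_id, records):
    """Return the texts of all ancestors of an element, nearest first."""
    by_id = {r["element_id"]: r for r in records}
    path = []
    parent_id = by_id[element_id]["parent_id"]
    while parent_id is not None:
        parent = by_id[parent_id]
        path.append(parent["text"])
        parent_id = parent["parent_id"]
    return path


children = build_children_index(elements)
```

Looking up children["a1"] gives the direct children of the first element, and ancestors("d4", elements) gives the chain of enclosing elements, which is the kind of context a hierarchical retrieval step might attach to a chunk.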
NerDLApproach Training Optimizations
Significant performance improvements for training of NerDLApproach:
When using setEnableMemoryOptimizer(true) with maxEpoch > 1, input datasets are automatically cached to improve training speed
NerDLGraphChecker now populates TensorFlow graph metadata that NerDLApproach can reuse, reducing redundant computations during training initialization
With all these improvements, you can expect memory consumption and training time to be reduced by up to half in RAM-constrained environments (when using setEnableMemoryOptimizer(true)). For larger distributed datasets, the effect will be more pronounced.
XML Reader and Reader2Doc Enhancements
Single Document Output by Default: Reader2Doc now creates a single document annotation per file by default, which gives more predictable behavior when processing large documents
Text elements are joined with \n by default, configurable via the new setJoinString(string) parameter for custom separators
Improved Tag Handling: The XML reader now ignores empty tags without text content, reducing noise in parsed output
Enhanced content type handling for application/xml documents
XML Tag Attribute Extraction: The new setExtractTagAttributes(attributes: list[str]) parameter enables extraction of XML attribute values. For example, the category and lang values of a tag can be extracted through the Reader2Doc configuration.
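To make the behaviors listed above concrete, here is a small sketch that imitates them with Python's standard xml.etree.ElementTree module. It is not the Spark NLP implementation and does not call Reader2Doc. The XML input, the function names read_elements and join_document, and the output shape are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; the release's own XML example is not reproduced here.
XML_SOURCE = """
<catalog>
  <book category="fiction" lang="en">A short novel</book>
  <book category="science" lang="de">Ein Lehrbuch</book>
  <note></note>
</catalog>
"""


def read_elements(xml_text, attributes):
    """Collect text-bearing elements plus the requested attribute values."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.iter():
        text = (node.text or "").strip()
        if not text:
            continue  # skip tags without text content, such as <note></note>
        picked = {name: node.attrib[name] for name in attributes if name in node.attrib}
        rows.append({"tag": node.tag, "text": text, "attributes": picked})
    return rows


def join_document(rows, join_string="\n"):
    """Merge all element texts into one document string using join_string."""
    return join_string.join(row["text"] for row in rows)


rows = read_elements(XML_SOURCE, ["category", "lang"])
document = join_document(rows)
```

Here the empty note tag contributes nothing, each book row keeps its category and lang values, and the two texts end up in one string separated by a newline. Passing a different join_string mirrors what a custom separator would do.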
🐛 Bug Fixes
❤️ Community Support
💻 Installation
Python
Spark Packages
CPU
GPU
Apple Silicon
AArch64
Maven
FAT JARs
What's Changed
Full Changelog: 6.2.0...6.2.1
This discussion was created from the release 6.2.1.