Commit 1d534f7

prabod and DevinTDHa authored
[SPARKNLP-1286] GGUFRankingFinisher (#14653)
* Add GGUFRankingFinisher and corresponding tests for ranking capabilities
* Add GGUFRankingFinisher implementation and tests for ranking functionality
* Update test case to tag "finisher the reranked documents" as SlowTest
* Add documentation for GGUFRankingFinisher with features, usage examples, and output schema
* Resolve partition warning for windowing
* Add GGUFRankingFinisher notebook
* Change pretrained AutoGGUFReranking model

---------

Co-authored-by: Devin Ha <[email protected]>
1 parent 0354d2f commit 1d534f7

12 files changed, +1889 -49 lines changed

docs/en/annotator_entries/AutoGGUFReranker.md

Lines changed: 2 additions & 2 deletions

@@ -33,7 +33,7 @@ val reranker = AutoGGUFReranker.pretrained()
   .setQuery("A man is eating pasta.")
 ```
 
-The default model is `"bge-reranker-v2-m3-Q4_K_M"`, if no name is provided.
+The default model is `"bge_reranker_v2_m3_Q4_K_M"`, if no name is provided.
 
 For available pretrained models please see the [Models Hub](https://sparknlp.org/models).
 
@@ -105,7 +105,7 @@ val document = new DocumentAssembler()
   .setOutputCol("document")
 
 val reranker = AutoGGUFReranker
-  .pretrained("bge-reranker-v2-m3-Q4_K_M")
+  .pretrained()
   .setInputCols("document")
   .setOutputCol("reranked_documents")
   .setBatchSize(4)
Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
# GGUFRankingFinisher

The `GGUFRankingFinisher` is a Spark NLP finisher designed to post-process the output of `AutoGGUFReranker`. It provides advanced ranking capabilities including top-k selection, score-based filtering, and normalization.

## Features

- **Top-K Selection**: Select only the top k documents by relevance score
- **Score Thresholding**: Filter documents by minimum relevance score
- **Min-Max Scaling**: Normalize relevance scores to 0-1 range
- **Sorting**: Automatically sorts documents by relevance score in descending order
- **Ranking**: Adds rank metadata to each document

## Parameters

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `inputCols` | `Array[String]` | Name of input annotation columns containing reranked documents | - |
| `outputCol` | `String` | Name of output annotation column containing ranked documents | `"ranked_documents"` |
| `topK` | `Int` | Maximum number of top documents to return (-1 for no limit) | `-1` |
| `minRelevanceScore` | `Double` | Minimum relevance score threshold | `Double.MinValue` |
| `minMaxScaling` | `Boolean` | Whether to apply min-max scaling to normalize scores | `false` |

## Usage

### Basic Usage

```scala
import com.johnsnowlabs.nlp.finisher.GGUFRankingFinisher

val finisher = new GGUFRankingFinisher()
  .setInputCols("reranked_documents")
  .setOutputCol("ranked_documents")
```

### Top-K Selection

```scala
val finisher = new GGUFRankingFinisher()
  .setInputCols("reranked_documents")
  .setOutputCol("ranked_documents")
  .setTopK(5) // Get top 5 most relevant documents
```

### Score Thresholding

```scala
val finisher = new GGUFRankingFinisher()
  .setInputCols("reranked_documents")
  .setOutputCol("ranked_documents")
  .setMinRelevanceScore(0.3) // Only documents with score >= 0.3
```

### Min-Max Scaling

```scala
val finisher = new GGUFRankingFinisher()
  .setInputCols("reranked_documents")
  .setOutputCol("ranked_documents")
  .setMinMaxScaling(true) // Normalize scores to 0-1 range
```
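
To make the normalization concrete, the following is a minimal plain-Python sketch of min-max scaling. It is not the finisher's implementation: the helper name `min_max_scale` is hypothetical, and the handling of identical scores (mapping them all to 1.0) is an assumption, since the documentation does not say what the finisher does in that case.

```python
def min_max_scale(scores):
    """Linearly rescale scores so the smallest maps to 0.0 and the largest to 1.0."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: with no spread, return 1.0 for every score
        # instead of dividing by zero.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


raw = [-2.0, 0.5, 3.0]
scaled = min_max_scale(raw)
print(scaled)  # [0.0, 0.5, 1.0]
```

Because the result depends on the minimum and maximum of the whole score set, the scaled value of a document changes whenever the set of scored documents changes.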

### Combined Usage

```scala
val finisher = new GGUFRankingFinisher()
  .setInputCols("reranked_documents")
  .setOutputCol("ranked_documents")
  .setTopK(3)
  .setMinRelevanceScore(0.2)
  .setMinMaxScaling(true)
```

## Complete Pipeline Example

```scala
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.AutoGGUFReranker
import com.johnsnowlabs.nlp.finisher.GGUFRankingFinisher
import org.apache.spark.ml.Pipeline

// Document assembler
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Reranker
val reranker = AutoGGUFReranker
  .pretrained()
  .setInputCols("document")
  .setOutputCol("reranked_documents")
  .setQuery("A man is eating pasta.")

// Finisher
val finisher = new GGUFRankingFinisher()
  .setInputCols("reranked_documents")
  .setOutputCol("ranked_documents")
  .setTopK(3)
  .setMinMaxScaling(true)

// Pipeline
val pipeline = new Pipeline()
  .setStages(Array(documentAssembler, reranker, finisher))
```

## Python Usage

```python
from sparknlp.finisher import GGUFRankingFinisher
from sparknlp.annotator import AutoGGUFReranker
from sparknlp.base import DocumentAssembler
from pyspark.ml import Pipeline

# Create document assembler (mirrors the Scala example above)
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Create reranker (mirrors the Scala example above)
reranker = AutoGGUFReranker.pretrained() \
    .setInputCols("document") \
    .setOutputCol("reranked_documents") \
    .setQuery("A man is eating pasta.")

# Create finisher
finisher = GGUFRankingFinisher() \
    .setInputCols("reranked_documents") \
    .setOutputCol("ranked_documents") \
    .setTopK(3) \
    .setMinMaxScaling(True)

# Create pipeline
pipeline = Pipeline(stages=[document_assembler, reranker, finisher])
```

## Output Schema

The finisher produces a DataFrame with the output annotation column containing ranked documents. Each document annotation contains:

- **result**: The document text
- **metadata**: Including `relevance_score`, `rank`, and original `query` information
- **begin/end**: Character positions in the original text
- **annotatorType**: Set to `DOCUMENT`
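
As an illustration of the fields listed above, one ranked annotation might look roughly like the following once collected into plain Python. This is a sketch only: every value is made up, the string form of `annotatorType` and the inclusive `end` offset are assumptions, and the metadata values are written as strings on the assumption that they follow Spark NLP's usual string-to-string metadata map. The helper `score_and_rank` is hypothetical.

```python
text = "A man is eating pasta."

# Illustrative only: field names follow the list above, all values are invented.
annotation = {
    "annotatorType": "document",  # assumed string form of the DOCUMENT type
    "begin": 0,
    "end": len(text) - 1,  # assumption: inclusive end offset
    "result": text,
    "metadata": {
        "relevance_score": "0.87",  # assumption: stored as a string
        "rank": "1",
        "query": "What is the man eating?",
    },
}


def score_and_rank(ann):
    """Parse the numeric score and rank out of the string-valued metadata."""
    meta = ann["metadata"]
    return float(meta["relevance_score"]), int(meta["rank"])


print(score_and_rank(annotation))  # (0.87, 1)
```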

## Processing Order

The finisher applies operations in the following order:

1. **Extract** documents and metadata from annotations across all rows
2. **Scale** relevance scores (if min-max scaling is enabled)
3. **Filter** by minimum relevance score threshold
4. **Sort** by relevance score (descending)
5. **Limit** to top-k results globally (if specified)
6. **Add rank** metadata to each document
7. **Return** filtered rows with ranked annotations
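
The seven steps above can be sketched in plain Python as follows. This is a simplified model rather than the finisher's actual code: the function `finish`, its argument names, and the `(text, score)` row layout are hypothetical, ranks are assumed to start at 1, and identical scores are assumed to scale to 1.0.

```python
def finish(rows, top_k=-1, min_score=float("-inf"), scale=False):
    """rows: list of rows, each a list of (text, score) pairs."""
    # 1. Extract documents from all rows, remembering the source row.
    docs = [
        {"row": i, "text": text, "score": score}
        for i, row in enumerate(rows)
        for text, score in row
    ]
    # 2. Scale scores to the 0-1 range across all extracted documents.
    if scale and docs:
        lo = min(d["score"] for d in docs)
        hi = max(d["score"] for d in docs)
        span = hi - lo
        for d in docs:
            # Assumption: identical scores all map to 1.0.
            d["score"] = (d["score"] - lo) / span if span else 1.0
    # 3. Filter by the minimum score (applied to scaled scores if scaling ran).
    docs = [d for d in docs if d["score"] >= min_score]
    # 4. Sort by score, highest first.
    docs.sort(key=lambda d: d["score"], reverse=True)
    # 5. Limit to the top k documents globally; non-positive k means no limit here.
    if top_k > 0:
        docs = docs[:top_k]
    # 6. Add rank metadata (assumed to be 1-based).
    for rank, d in enumerate(docs, start=1):
        d["rank"] = rank
    # 7. Regroup by source row; rows left without documents are dropped.
    grouped = {}
    for d in docs:
        grouped.setdefault(d["row"], []).append(d)
    return [grouped[i] for i in sorted(grouped)]


rows = [[("doc a", 0.2), ("doc b", 0.9)], [("doc c", 0.1)], [("doc d", 0.6)]]
result = finish(rows, top_k=2)
print([[d["text"] for d in row] for row in result])  # [['doc b'], ['doc d']]
```

In this example the second input row disappears from the result because its only document did not make the global top 2.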

## Notes

- The finisher expects input from `AutoGGUFReranker` or compatible annotators that produce documents with `relevance_score` metadata
- Min-max scaling is applied before threshold filtering, so thresholds should be set according to the scaled range (0.0-1.0)
- Results are always sorted by relevance score in descending order
- Top-k filtering is applied globally across all input rows, not per row
- The finisher adds `rank` metadata to each document indicating its position in the ranking
- Rows with empty annotation arrays are filtered out from the result
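
The note about thresholds and scaling is easy to overlook, so here is a small plain-Python illustration with made-up scores. It does not call Spark NLP; it only shows that the same threshold value selects a different set of documents once scores have been rescaled to the 0-1 range.

```python
# Made-up raw relevance scores for three documents.
raw_scores = {"doc a": 0.10, "doc b": 0.25, "doc c": 0.40}
threshold = 0.3

# Threshold applied to the raw scores (scaling disabled).
kept_raw = [name for name, s in raw_scores.items() if s >= threshold]

# Threshold applied after min-max scaling (scaling enabled).
lo, hi = min(raw_scores.values()), max(raw_scores.values())
scaled_scores = {name: (s - lo) / (hi - lo) for name, s in raw_scores.items()}
kept_scaled = [name for name, s in scaled_scores.items() if s >= threshold]

print(kept_raw)     # ['doc c']
print(kept_scaled)  # ['doc b', 'doc c']
```

With scaling enabled, "doc b" moves from a raw score of 0.25 to roughly 0.5 and now passes the 0.3 threshold, so a threshold tuned on raw scores should be revisited when `minMaxScaling` is turned on.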
