|
| 1 | +# GGUFRankingFinisher |
| 2 | + |
| 3 | +The `GGUFRankingFinisher` is a Spark NLP finisher that post-processes the output of `AutoGGUFReranker`. It sorts the reranked documents by relevance score, adds rank metadata, and optionally applies top-k selection, score-based filtering, and min-max normalization. |
| 4 | + |
| 5 | +## Features |
| 6 | + |
| 7 | +- **Top-K Selection**: Select only the top k documents by relevance score |
| 8 | +- **Score Thresholding**: Filter documents by minimum relevance score |
| 9 | +- **Min-Max Scaling**: Normalize relevance scores to the 0-1 range |
| 10 | +- **Sorting**: Automatically sorts documents by relevance score in descending order |
| 11 | +- **Ranking**: Adds rank metadata to each document |
| 12 | + |
| 13 | +## Parameters |
| 14 | + |
| 15 | +| Parameter | Type | Description | Default | |
| 16 | +|-----------|------|-------------|---------| |
| 17 | +| `inputCols` | `Array[String]` | Name of input annotation columns containing reranked documents | - | |
| 18 | +| `outputCol` | `String` | Name of output annotation column containing ranked documents | `"ranked_documents"` | |
| 19 | +| `topK` | `Int` | Maximum number of top documents to return (-1 for no limit) | `-1` | |
| 20 | +| `minRelevanceScore` | `Double` | Minimum relevance score threshold | `Double.MinValue` | |
| 21 | +| `minMaxScaling` | `Boolean` | Whether to apply min-max scaling to normalize scores | `false` | |
| 22 | + |
| 23 | +## Usage |
| 24 | + |
| 25 | +### Basic Usage |
| 26 | + |
| 27 | +```scala |
| 28 | +import com.johnsnowlabs.nlp.finisher.GGUFRankingFinisher |
| 29 | + |
| 30 | +val finisher = new GGUFRankingFinisher() |
| 31 | + .setInputCols("reranked_documents") |
| 32 | + .setOutputCol("ranked_documents") |
| 33 | +``` |
| 34 | + |
| 35 | +### Top-K Selection |
| 36 | + |
| 37 | +```scala |
| 38 | +val finisher = new GGUFRankingFinisher() |
| 39 | + .setInputCols("reranked_documents") |
| 40 | + .setOutputCol("ranked_documents") |
| 41 | + .setTopK(5) // Get top 5 most relevant documents |
| 42 | +``` |
| 43 | + |
| 44 | +### Score Thresholding |
| 45 | + |
| 46 | +```scala |
| 47 | +val finisher = new GGUFRankingFinisher() |
| 48 | + .setInputCols("reranked_documents") |
| 49 | + .setOutputCol("ranked_documents") |
| 50 | + .setMinRelevanceScore(0.3) // Only documents with score >= 0.3 |
| 51 | +``` |
| 52 | + |
| 53 | +### Min-Max Scaling |
| 54 | + |
| 55 | +```scala |
| 56 | +val finisher = new GGUFRankingFinisher() |
| 57 | + .setInputCols("reranked_documents") |
| 58 | + .setOutputCol("ranked_documents") |
| 59 | + .setMinMaxScaling(true) // Normalize scores to 0-1 range |
| 60 | +``` |
| 61 | + |
| 62 | +### Combined Usage |
| 63 | + |
| 64 | +```scala |
| 65 | +val finisher = new GGUFRankingFinisher() |
| 66 | + .setInputCols("reranked_documents") |
| 67 | + .setOutputCol("ranked_documents") |
| 68 | + .setTopK(3) |
| 69 | + .setMinRelevanceScore(0.2) // Compared against the scaled score, because scaling runs before filtering |
| 70 | + .setMinMaxScaling(true) |
| 71 | +``` |
| 72 | + |
| 73 | +## Complete Pipeline Example |
| 74 | + |
| 75 | +```scala |
| 76 | +import com.johnsnowlabs.nlp.base.DocumentAssembler |
| 77 | +import com.johnsnowlabs.nlp.annotators.seq2seq.AutoGGUFReranker |
| 78 | +import com.johnsnowlabs.nlp.finisher.GGUFRankingFinisher |
| 79 | +import org.apache.spark.ml.Pipeline |
| 80 | + |
| 81 | +// Document assembler |
| 82 | +val documentAssembler = new DocumentAssembler() |
| 83 | + .setInputCol("text") |
| 84 | + .setOutputCol("document") |
| 85 | + |
| 86 | +// Reranker |
| 87 | +val reranker = AutoGGUFReranker |
| 88 | + .pretrained() |
| 89 | + .setInputCols("document") |
| 90 | + .setOutputCol("reranked_documents") |
| 91 | + .setQuery("A man is eating pasta.") |
| 92 | + |
| 93 | +// Finisher |
| 94 | +val finisher = new GGUFRankingFinisher() |
| 95 | + .setInputCols("reranked_documents") |
| 96 | + .setOutputCol("ranked_documents") |
| 97 | + .setTopK(3) |
| 98 | + .setMinMaxScaling(true) |
| 99 | + |
| 100 | +// Pipeline |
| 101 | +val pipeline = new Pipeline() |
| 102 | + .setStages(Array(documentAssembler, reranker, finisher)) |
| 103 | +``` |
| 104 | + |
| 105 | +## Python Usage |
| 106 | + |
| 107 | +```python |
| 108 | +from sparknlp.finisher import GGUFRankingFinisher |
| 109 | +from sparknlp.annotator import AutoGGUFReranker |
| 110 | +from sparknlp.base import DocumentAssembler |
| 111 | +from pyspark.ml import Pipeline |
| 112 | + |
| 113 | +# Create finisher |
| 114 | +finisher = GGUFRankingFinisher() \ |
| 115 | + .setInputCols("reranked_documents") \ |
| 116 | + .setOutputCol("ranked_documents") \ |
| 117 | + .setTopK(3) \ |
| 118 | + .setMinMaxScaling(True) |
| 119 | + |
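| 120 | +# Create pipeline (document_assembler and reranker must be defined first, mirroring the Scala pipeline example above) |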
| 121 | +pipeline = Pipeline(stages=[document_assembler, reranker, finisher]) |
| 122 | +``` |
| 123 | + |
| 124 | +## Output Schema |
| 125 | + |
| 126 | +The finisher produces a DataFrame with the output annotation column containing ranked documents. Each document annotation contains: |
| 127 | + |
| 128 | +- **result**: The document text |
| 129 | +- **metadata**: Includes `relevance_score`, `rank`, and the original `query` information |
| 130 | +- **begin/end**: Character positions in the original text |
| 131 | +- **annotatorType**: Set to `DOCUMENT` |
| 132 | + |
| 133 | +## Processing Order |
| 134 | + |
| 135 | +The finisher applies operations in the following order: |
| 136 | + |
| 137 | +1. **Extract** documents and metadata from annotations across all rows |
| 138 | +2. **Scale** relevance scores (if min-max scaling is enabled) |
| 139 | +3. **Filter** by minimum relevance score threshold |
| 140 | +4. **Sort** by relevance score (descending) |
| 141 | +5. **Limit** to top-k results globally (if specified) |
| 142 | +6. **Add rank** metadata to each document |
| 143 | +7. **Return** filtered rows with ranked annotations |
| 144 | + |
| 145 | +## Notes |
| 146 | + |
| 147 | +- The finisher expects input from `AutoGGUFReranker` or compatible annotators that produce documents with `relevance_score` metadata |
| 148 | +- When min-max scaling is enabled, it is applied before threshold filtering, so `minRelevanceScore` is compared against the scaled score and should be chosen within the scaled range (0.0-1.0) |
| 149 | +- Results are always sorted by relevance score in descending order |
| 150 | +- Top-k filtering is applied globally across all input rows, not per row |
| 151 | +- The finisher adds `rank` metadata to each document indicating its position in the ranking |
| 152 | +- Rows with empty annotation arrays are filtered out from the result |
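
The processing order described above can be illustrated with a small pure-Python sketch. This is not the finisher's actual implementation: the `rank_documents` function, the dictionary layout, the 1-based rank, and the handling of identical scores are all assumptions made for illustration. It only mirrors the documented sequence of scale, filter, sort, limit, and rank.

```python
from typing import Dict, List


def rank_documents(
    docs: List[Dict],
    top_k: int = -1,
    min_relevance_score: float = float("-inf"),
    min_max_scaling: bool = False,
) -> List[Dict]:
    """Hypothetical helper mirroring the documented processing order."""
    # 1. Extract: copy the documents so the caller's data is not mutated.
    ranked = [dict(d) for d in docs]

    # 2. Scale: map scores onto the 0-1 range if requested.
    if min_max_scaling and ranked:
        lo = min(d["relevance_score"] for d in ranked)
        hi = max(d["relevance_score"] for d in ranked)
        span = hi - lo
        for d in ranked:
            # Assumption: if all scores are equal, map them to 1.0.
            d["relevance_score"] = (
                (d["relevance_score"] - lo) / span if span > 0 else 1.0
            )

    # 3. Filter: the threshold is compared against the (possibly scaled) score.
    ranked = [d for d in ranked if d["relevance_score"] >= min_relevance_score]

    # 4. Sort: highest relevance first.
    ranked.sort(key=lambda d: d["relevance_score"], reverse=True)

    # 5. Limit: keep only the top-k documents (-1 means no limit).
    if top_k > 0:
        ranked = ranked[:top_k]

    # 6. Add rank: 1-based position in the final ordering (an assumption).
    for position, d in enumerate(ranked, start=1):
        d["rank"] = position

    return ranked


docs = [
    {"text": "a", "relevance_score": 2.0},
    {"text": "b", "relevance_score": 6.0},
    {"text": "c", "relevance_score": 4.0},
    {"text": "d", "relevance_score": -2.0},
]

# With scaling, the raw scores become 0.5, 1.0, 0.75 and 0.0,
# so a threshold of 0.6 keeps only "b" and "c".
result = rank_documents(docs, top_k=3, min_relevance_score=0.6, min_max_scaling=True)
```

Because the threshold is compared after scaling, the same `min_relevance_score` of 0.6 would keep "a", "b", and "c" if scaling were disabled, since their raw scores are all above 0.6.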