Description
Is your feature request related to a problem? Please describe.
I’m using Label Studio Starter On-Prem with a team that annotates NER classes on text. The current agreement score options don’t reflect the real quality of our annotations, so I can’t reliably measure how consistently annotators label entities.
– Exact match is too strict: any tiny difference, like an extra or missing space at the edges of a span, causes a big drop in the score, even though the entity itself is annotated correctly.
– Percentage of regions by IoU (with an IoU threshold of 0.99) looks promising conceptually, but it does not handle overlapping spans well. For example, we use two labels, name and name.first. Our internal convention for LLM training is to first annotate "Joe", "Alex", "Mark" as name and then overlay them with name.first (see the schematic example after this list). If 10 people annotate only name, we get 100% agreement, but if the same 10 people also add the overlapping name.first spans, agreement can drop to around 50%, even though they all annotated the same entities consistently. That is very confusing.
– Intersection over 1D regions is essentially similar to Percentage of regions by IoU, but it also takes into account the order in which spans are stored. If two annotators label the same spans with the same labels but in a different order, the metric penalizes them, which doesn’t make sense for our use case.
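To make the overlap scenario concrete, here is roughly what every annotator's spans look like in our setup. The field names are schematic and not Label Studio's exact result format; the text is an invented example.

```python
text = "Joe Smith and Alex Brown met Mark Lee"

# Every annotator produces the same two layers of spans:
annotator_spans = [
    # layer 1: full names
    {"start": 0,  "end": 9,  "label": "name"},        # "Joe Smith"
    {"start": 14, "end": 24, "label": "name"},        # "Alex Brown"
    {"start": 29, "end": 37, "label": "name"},        # "Mark Lee"
    # layer 2: first names, overlapping the spans above
    {"start": 0,  "end": 3,  "label": "name.first"},  # "Joe"
    {"start": 14, "end": 18, "label": "name.first"},  # "Alex"
    {"start": 29, "end": 33, "label": "name.first"},  # "Mark"
]
```

Since every annotator produces exactly this set of spans, we would expect agreement to stay at 100% whether or not the second layer is present.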
The net effect is that my team may be doing a good job while the agreement score suggests otherwise, or vice versa.
Describe the solution you’d like
I would like a very simple, dedicated span-based agreement metric for NER that:
– Compares spans by their character positions and by their label (e.g., same [start, end) and same label name).
– Ignores the order of spans in the annotation.
– Supports overlapping spans (e.g., name and name.first on the same text segment) without artificially lowering the score when everyone annotates them consistently.
– Optionally allows a small tolerance for whitespace differences at the edges of spans (for example, an extra leading or trailing space does not automatically break the match).
In other words, I’d like a metric that simply answers: “For each character span and label, did annotators agree that this piece of text is an entity with this label?” and computes an agreement score (e.g., precision/recall/F1 over spans) based on that.
This could be a built-in metric like “Span label match (order-independent)” that I can select in the agreement settings.
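To make the request concrete, here is a minimal sketch of the comparison I have in mind for two annotators. All names here (Span, normalize, span_f1) are hypothetical and not tied to Label Studio's internals; the text argument is only needed for the optional whitespace tolerance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A labeled character span: [start, end) plus a label name."""
    start: int
    end: int
    label: str


def normalize(span: Span, text: str) -> Span:
    """Trim leading/trailing whitespace from the span boundaries."""
    start, end = span.start, span.end
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return Span(start, end, span.label)


def span_f1(spans_a, spans_b, text, whitespace_tolerant=True):
    """Order-independent agreement over (start, end, label) triples.

    Overlapping spans (e.g. name and name.first on the same characters)
    are simply separate set members, so consistent overlaps do not
    lower the score.
    """
    if whitespace_tolerant:
        spans_a = {normalize(s, text) for s in spans_a}
        spans_b = {normalize(s, text) for s in spans_b}
    else:
        spans_a, spans_b = set(spans_a), set(spans_b)

    if not spans_a and not spans_b:
        return 1.0  # both annotators agree there are no entities

    matched = len(spans_a & spans_b)
    precision = matched / len(spans_b) if spans_b else 0.0
    recall = matched / len(spans_a) if spans_a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


text = "Joe Smith and Alex Brown"
ann_a = [Span(0, 9, "name"), Span(0, 3, "name.first"),
         Span(14, 24, "name"), Span(14, 18, "name.first")]
# Second annotator left a trailing space on the first name span
ann_b = [Span(0, 10, "name"), Span(0, 3, "name.first"),
         Span(14, 24, "name"), Span(14, 18, "name.first")]
print(span_f1(ann_a, ann_b, text))  # 1.0 with whitespace tolerance
```

Because spans are compared as an unordered set of (start, end, label) triples, overlapping labels such as name and name.first are just separate members of the set, and the order in which regions are stored has no effect on the score.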
⸻
Describe alternatives you’ve considered
– Exact match: rejected because it is too sensitive to minor differences such as extra spaces at the span boundaries. These do not change the semantic correctness of the annotation, but they heavily reduce the score.
– Percentage of regions by IoU with a high threshold (0.99): works on intervals, but fails for overlapping spans and our name / name.first setup. It can show high agreement when important characters are missing inside the span, and low agreement when people annotate overlapping labels consistently.
– Intersection over 1D regions: conceptually close, but it also considers the sequence of spans in the annotation. If two annotators label the same entities with the same labels but in a different order, the agreement score drops, which is not desirable for NER.
I also considered writing a custom metric, but for a common NER use case a built-in, well-tested span+label agreement metric would be much more reliable and easier to maintain.
⸻
Additional context
– Plan: Starter On-Prem.
– Task type: text NER with multiple labels, including overlapping labels such as name and name.first.
– Goal: monitor annotator consistency and catch real mistakes (like missing part of a name), without penalizing harmless formatting differences (like spacing) or the internal logic of our overlapping labels used for LLM training.