Commit 59f6c01

[v2] refactor and create docs for results objects (#3155)
* [v2] refactor and create docs for results objects

  This is the last piece needed for the API documentation: https://embeddings-benchmark.github.io/mteb/

  - renamed `load_results` to `results`
  - moved `ModelResult` to its own script
  - restructured imports of results to use the `results` module
  - added documentation for the results module
  - converted a few functions marked with TODOs to private and deleted legacy loaders/converters
  - moved `load_results.py` out of `results`
  - added missing documentation for a few of the types

  Side note: I really feel like we are starting to resolve some of our circular import issues; I rarely run into them now, and when I do they are typically very easy to fix.

* avoid rename of namespace during import

* Merge branch 'v2.0.0' of https://github.com/embeddings-benchmark/mteb into refactor-results

* remove comment
1 parent 5d9b873 commit 59f6c01

25 files changed, +462 −439 lines changed
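At its core, the commit moves the result classes into a `mteb.results` namespace. A minimal sketch of the import change, grounded in the `mteb/__init__.py` and `mteb.results` diffs below (the commented-out paths are the ones this commit removes):

```python
# New (this commit): result objects live in the `results` module.
from mteb.results import BenchmarkResults, ModelResult, TaskResult

# Old (removed by this commit): they were spread across `load_results`.
# from mteb.load_results import BenchmarkResults
# from mteb.load_results.task_results import TaskResult
```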

docs/api/model.md

Lines changed: 0 additions & 2 deletions
```diff
@@ -1,7 +1,5 @@
 # Models

-<!-- TODO: Encoder or model? Encoder is consistent with the code, but might be less used WDYT? We also use ModelMeta ... -->
-
 A model in `mteb` covers two concepts: metadata and implementation.
 - Metadata contains information about the model such as maximum input
   length, valid frameworks, license, and degree of openness.
```

docs/api/results.md

Lines changed: 27 additions & 0 deletions
```diff
@@ -0,0 +1,27 @@
+# Results
+
+When a model is evaluated in MTEB it produces results. These results consist of:
+
+- `TaskResult`: Result for a single task
+- `ModelResult`: Result for a model on a set of tasks
+- `BenchmarkResults`: Result for a set of models on a set of tasks
+
+![](../images/visualizations/result_objects.png)
+
+In normal use these come up when running a model:
+```python
+# ...
+models_results = mteb.evaluate(model, tasks)
+type(models_results)  # mteb.results.ModelResult
+
+task_result = models_results.task_results[0]
+type(task_result)  # mteb.results.TaskResult
+```
+
+## Result Objects
+
+:::mteb.results.TaskResult
+
+:::mteb.results.ModelResult
+
+:::mteb.results.BenchmarkResults
```
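For context, a hedged sketch of how the three objects nest, based on the hierarchy described in the new docs page above. `load_results` and the `task_results` attribute appear elsewhere in this commit; the `model_results` attribute name is an assumption, not confirmed by this diff:

```python
import mteb

# BenchmarkResults -> ModelResult -> TaskResult, per the new docs.
benchmark_results = mteb.load_results()             # mteb.results.BenchmarkResults
model_result = benchmark_results.model_results[0]   # mteb.results.ModelResult (assumed attribute)
task_result = model_result.task_results[0]          # mteb.results.TaskResult
```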

mkdocs.yml

Lines changed: 7 additions & 6 deletions
```diff
@@ -80,12 +80,6 @@ nav:
     - Loading Results: usage/loading_results.md
     - Command Line Interface: usage/cli.md
     - Running the Leaderboard: usage/leaderboard.md
-  - API:
-    - api/index.md
-    - Benchmark: api/benchmark.md
-    - Task: api/task.md
-    - Model: api/model.md
-    - Additional Types: api/types.md
   - Overview:
     - overview/index.md
   - Benchmarks:
@@ -99,6 +93,13 @@ nav:
   # - Adding a Benchmark: adding_a_leaderboard_tab.md
   # - Adding a Task: adding_a_dataset.md
   # - Development Setup: CONTRIBUTING.md
+  - API:
+    - Overview: api/index.md
+    - Benchmark: api/benchmark.md
+    - Task: api/task.md
+    - Model: api/model.md
+    - Results: api/results.md
+    - Additional Types: api/types.md
   - Leaderboard: https://huggingface.co/spaces/mteb/leaderboard

 plugins:
```

mteb/MTEB.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -27,14 +27,14 @@

 import mteb
 from mteb.abstasks import AbsTask
-from mteb.load_results.task_results import TaskResult
 from mteb.models import (
     CrossEncoderWrapper,
     Encoder,
     ModelMeta,
     MTEBModels,
     SentenceTransformerEncoderWrapper,
 )
+from mteb.results import TaskResult

 if TYPE_CHECKING:
     from sentence_transformers import CrossEncoder, SentenceTransformer
```

mteb/__init__.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -5,12 +5,12 @@
 from mteb.abstasks import AbsTask
 from mteb.abstasks.task_metadata import TaskMetadata
 from mteb.evaluate import evaluate
-from mteb.load_results import BenchmarkResults, load_results
-from mteb.load_results.task_results import TaskResult
+from mteb.load_results import load_results
 from mteb.models import Encoder, SentenceTransformerEncoderWrapper
 from mteb.models.get_model_meta import get_model, get_model_meta, get_model_metas
 from mteb.MTEB import MTEB
 from mteb.overview import TASKS_REGISTRY, get_task, get_tasks
+from mteb.results import BenchmarkResults, TaskResult

 from .benchmarks.benchmark import Benchmark
 from .benchmarks.get_benchmark import get_benchmark, get_benchmarks
```
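Note that the top-level exports are unchanged by the move: `mteb/__init__.py` now re-exports the classes from `mteb.results` instead of `mteb.load_results`, so code like the following should keep working:

```python
import mteb

# Still available at the top level after the refactor, now sourced
# from mteb.results rather than mteb.load_results:
print(mteb.BenchmarkResults)
print(mteb.TaskResult)
```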

mteb/abstasks/AbsTaskTextRegression.py

Lines changed: 1 addition & 2 deletions
```diff
@@ -18,9 +18,8 @@
     calculate_score_statistics,
     calculate_text_statistics,
 )
-from mteb.load_results.task_results import ScoresDict
 from mteb.models import MTEBModels
-from mteb.types import HFSubset
+from mteb.types import HFSubset, ScoresDict
 from mteb.types.statistics import DescriptiveStatistics, ScoreStatistics, TextStatistics

 from .AbsTask import AbsTask
```

mteb/abstasks/aggregated_task.py

Lines changed: 3 additions & 6 deletions
```diff
@@ -5,13 +5,14 @@

 import numpy as np

+from mteb.results.task_result import TaskResult
+
 from .AbsTask import AbsTask
 from .aggregate_task_metadata import AggregateTaskMetadata

 if TYPE_CHECKING:
     from datasets import Dataset, DatasetDict

-    from mteb.load_results.task_results import TaskResult
     from mteb.models.models_protocols import Encoder
     from mteb.types import HFSubset, ScoresDict
     from mteb.types.statistics import DescriptiveStatistics
@@ -49,7 +50,7 @@ def task_results_to_scores(
         for task_res in task_results:
             for langs in eval_langs:
                 main_scores.append(
-                    task_res.get_score_fast(
+                    task_res._get_score_fast(
                         languages=[lang.split("-")[0] for lang in langs],
                         splits=self.metadata.eval_splits,
                         subsets=subsets,
@@ -68,10 +69,6 @@ def combine_task_results(self, task_results: list[TaskResult]) -> TaskResult:
         """Combines the task results for use in `task_results_to_scores`. Do not redefine this function if you want to implement a custom aggregation.
         Instead redefine `task_results_to_scores`.
         """
-        from mteb.load_results.task_results import (
-            TaskResult,  # to prevent circular imports, # TODO: can potentially likely be out of function in in v2.0.0
-        )
-
         eval_times = [tr.evaluation_time for tr in task_results if tr.evaluation_time]
         if len(eval_times) != len(task_results):
             logger.info(
```
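The docstring above directs custom aggregation to `task_results_to_scores` rather than `combine_task_results`. A hypothetical sketch of such a subclass, assuming a `TaskResult.get_score()` accessor and eliding the metadata a real aggregate task would also need:

```python
import numpy as np

from mteb.abstasks.aggregated_task import AbsTaskAggregate


class MedianAggregateTask(AbsTaskAggregate):
    """Illustrative only: aggregate main scores with a median instead of the default."""

    def task_results_to_scores(self, task_results):
        # `get_score` is an assumed TaskResult accessor; the real hook may
        # take additional arguments (see `_get_score_fast` above).
        scores = [tr.get_score() for tr in task_results]
        return {"main_score": float(np.median(scores))}
```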

mteb/benchmarks/benchmark.py

Lines changed: 3 additions & 3 deletions
```diff
@@ -4,12 +4,12 @@
 from dataclasses import dataclass
 from typing import TYPE_CHECKING

-from mteb.load_results.load_results import load_results
+from mteb.load_results import load_results
+from mteb.results import BenchmarkResults
 from mteb.types import StrURL

 if TYPE_CHECKING:
-    from mteb.abstasks.AbsTask import AbsTask
-    from mteb.load_results.benchmark_results import BenchmarkResults
+    from mteb.abstasks import AbsTask


 @dataclass
```

mteb/cache.py

Lines changed: 1 addition & 2 deletions
```diff
@@ -11,9 +11,8 @@
 from typing import cast

 from mteb.abstasks import AbsTask
-from mteb.load_results.benchmark_results import BenchmarkResults, ModelResult
-from mteb.load_results.task_results import TaskResult
 from mteb.models import ModelMeta
+from mteb.results import BenchmarkResults, ModelResult, TaskResult
 from mteb.types import ModelName, Revision

 logger = logging.getLogger(__name__)
```

mteb/evaluate.py

Lines changed: 1 addition & 2 deletions
```diff
@@ -11,8 +11,6 @@
 from mteb.abstasks.AbsTask import AbsTask
 from mteb.abstasks.aggregated_task import AbsTaskAggregate
 from mteb.cache import ResultCache
-from mteb.load_results.benchmark_results import ModelResult
-from mteb.load_results.task_results import TaskResult
 from mteb.models.model_meta import ModelMeta
 from mteb.models.models_protocols import (
     CrossEncoderProtocol,
@@ -23,6 +21,7 @@
     CrossEncoderWrapper,
     SentenceTransformerEncoderWrapper,
 )
+from mteb.results import ModelResult, TaskResult
 from mteb.types import HFSubset, SplitName
 from mteb.types._metadata import ModelName, Revision
```
