Commit 79c6559

docs: Update results to MTEB V2 (#311)
* Update results
* Update plot
* Update docs
1 parent e707803 · commit 79c6559

4 files changed: 73 additions & 73 deletions

README.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -174,6 +174,8 @@ We provide a number of models that can be used out of the box. These models are
 
 We have performed extensive experiments to evaluate the performance of Model2Vec models. The results are documented in the [results](results/README.md) folder. The results are presented in the following sections:
 - [MTEB Results](results/README.md#mteb-results)
+- [MMTEB Results](results/README.md#mmteb-results)
+- [Retrieval Results](results/README.md#retrieval-results)
 - [Training Results](results/README.md#training-results)
 - [Ablations](results/README.md#ablations)
```

(binary file changed: updated plot image, 145 KB; not rendered here)

results/README.md

Lines changed: 22 additions & 26 deletions
```diff
@@ -1,7 +1,9 @@
 # Results
 
 This document contains the results of the Model2Vec project. The results are presented in the following sections:
-- [MTEB Results](#mteb-results)
+- [MTEB Results (English)](#mteb-results-english)
+- [MMTEB Results (Multilingual)](#mmteb-results-multilingual)
+- [Retrieval Results](#retrieval-results)
 - [Training Results](#training-results)
 - [Ablations](#ablations)
```

```diff
@@ -13,17 +15,12 @@ Note: The `potion` and `M2V` models are our static models.
 
 | Model | Avg (All) | Avg (MTEB) | Class | Clust | PairClass | Rank | Ret | STS | Sum | Pearl | WordSim |
 |:-----------------------|------------:|-------------:|--------:|--------:|------------:|-------:|-------:|-------:|-------:|--------:|----------:|
-| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | 56.08 | 56.09 | 62.62 | 41.94 | 82.37 | 58.04 | 41.95 | 78.90 | 30.81 | 60.83 | 49.91 |
-| [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) | 52.46 | 51.66 | 65.97 | 35.29 | 78.17 | 50.92 | 33.52 | 74.22 | 29.78 | 55.37 | 55.15 |
-| [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M) | 50.54 | 50.03 | 64.44 | 32.93 | 76.62 | 49.73 | 31.71 | 73.24 | 29.28 | 53.54 | 50.75 |
-| [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) | 49.73 | 49.76 | 59.56 | 30.55 | 76.38 | 50.05 | 36.35 | 73.22 | 28.85 | 49.31 | 50.02 |
-| [potion-base-4M](https://huggingface.co/minishlab/potion-base-4M) | 48.87 | 48.23 | 62.19 | 31.47 | 75.37 | 48.75 | 29.11 | 72.19 | 28.89 | 52.55 | 49.21 |
-| [static-retrieval-mrl-en-v1](https://huggingface.co/minishlab/static-retrieval-mrl-en-v1) | 48.18 | 48.36 | 57.39 | 28.32 | 75.63 | 49.16 | 35.61 | 72.18 | 28.64 | 49.68 | 44.76 |
-| [static-similarity-mrl-multilingual-v1](https://huggingface.co/minishlab/static-similarity-mrl-multilingual-v1) | 48.15 | 47.15 | 59.96 | 24.40 | 79.02 | 48.25 | 29.54 | 74.88 | 30.28 | 51.66 | 51.66 |
-| [M2V_base_output](https://huggingface.co/minishlab/M2V_base_output) | 46.79 | 45.34 | 61.25 | 25.58 | 74.9 | 47.63 | 26.14 | 68.58 | 29.2 | 54.02 | 49.18 |
-| [potion-base-2M](https://huggingface.co/minishlab/potion-base-2M) | 45.52 | 44.77 | 58.45 | 27.5 | 73.72 | 46.82 | 24.13 | 70.14 | 31.51 | 50.82 | 44.72 |
-| [GloVe_300d](https://huggingface.co/sentence-transformers/average_word_embeddings_glove.6B.300d) | 42.84 | 42.36 | 57.31 | 27.66 | 72.48 | 43.3 | 22.78 | 61.9 | 28.81 | 45.65 | 43.05 |
-| [BPEmb_50k_300d](https://github.com/bheinzerling/bpemb) | 39.34 | 37.78 | 55.76 | 23.35 | 57.86 | 43.21 | 17.5 | 55.1 | 29.74 | 47.56 | 41.28 |
+| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | 55.80 | 55.93 | 69.25 | 44.90 | 82.37 | 47.14 | 42.92 | 78.95 | 25.96 | 60.83 | 49.91 |
+| [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) | 52.83 | 52.13 | 71.70 | 41.25 | 78.17 | 42.45 | 32.67 | 73.93 | 24.74 | 55.37 | 55.15 |
+| [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M) | 51.32 | 51.08 | 70.34 | 39.74 | 76.62 | 41.79 | 31.11 | 72.91 | 25.06 | 53.54 | 50.75 |
+| [M2V_base_output](https://huggingface.co/minishlab/M2V_base_output) | 48.77 | 47.96 | 66.84 | 33.96 | 74.90 | 39.31 | 25.36 | 68.76 | 26.61 | 54.02 | 49.18 |
+| [GloVe_300d](https://huggingface.co/sentence-transformers/average_word_embeddings_glove.6B.300d) | 45.49 | 45.82 | 62.73 | 37.10 | 72.48 | 38.28 | 21.80 | 61.52 | 26.81 | 45.65 | 43.05 |
+| [BPEmb_50k_300d](https://github.com/bheinzerling/bpemb) | 42.33 | 41.74 | 61.72 | 35.17 | 57.86 | 37.26 | 15.36 | 55.30 | 29.49 | 47.56 | 41.28 |
 
 
 <details>
```
```diff
@@ -39,22 +36,22 @@ For readability, the MTEB task names are abbreviated as follows:
 - Sum: Summarization
 </details>
 
-The results show that [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) is the most performant static embedding model. It reaches 92.11% of the performance of [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) with an average MTEB score of 51.66 while being orders of magnitude faster.
-
-Note: the [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M), [static-retrieval-mrl-en-v1](https://huggingface.co/minishlab/static-retrieval-mrl-en-v1), and [static-similarity-mrl-multilingual-v1](https://huggingface.co/minishlab/static-similarity-mrl-multilingual-v1) models are task-specific models. We've included them for completeness, but they should not be compared directly to the other models for tasks that they are not designed for.
+The results show that [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) is the most performant static embedding model. It reaches 93.21% of the performance of [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) with an average MTEB score of 52.13 while being orders of magnitude faster.
 
 The figure below shows the relationship between the number of sentences per second and the average MTEB score. The circle sizes correspond to the number of parameters in the models (larger = more parameters).
 This plot shows that the potion and M2V models are much faster than the other models, while still being competitive in terms of performance with the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model.
 NOTE: for fairness of comparison, we disabled multiprocessing for Model2Vec for this benchmark. All sentence-transformers models are run with the [sentence-transformers](https://github.com/UKPLab/sentence-transformers) library's default settings for `encode`.
 
-| ![Description](../assets/images/speed_vs_mteb_score_v3.png) |
+| ![Description](../assets/images/speed_vs_mteb_plot.png) |
 |:--:|
 |*Figure: The average MTEB score plotted against sentences per second. The circle size indicates model size.*|
 
 
-### MMTEB Results (Multilingual)
+## MMTEB Results (Multilingual)
 The results for the multilingual models are shown in the table below. We compare against the [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) model, as well as other multilingual static embedding models.
 
+Note: the MMTEB leaderboard ranks models using a [Borda count](https://en.wikipedia.org/wiki/Borda_count) over per-task ranks rather than a simple average. This rewards models that perform consistently well across all tasks, rather than those that excel on one task type while performing poorly on others.
+
 | Model | Mean (Task) | Mean (TaskType) | BitMining | Class | Clust | InstRet | MultiClass | PairClass | Rank | Ret | STS |
 | :---------------------------------------- | :---------- | :-------------- | :------------ | :------------- | :--------- | :-------------------- | :------------------------ | :------------------ | :-------- | :-------- | :-------- |
 | [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) | 52.07 | 45.65 | 76.35 | 54.60 | 38.08 | −3.00 | 20.12 | 75.97 | 50.20 | 33.17 | 65.35 |
```
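The Borda-count behaviour described in the note this hunk adds can be illustrated with a small, self-contained sketch. The model names and scores below are made up for illustration; they are not MMTEB data. On each task a model earns points equal to the number of models it beats, and points are summed across tasks:

```python
# Hypothetical per-task scores: one "spiky" model that excels on two tasks
# but fails on the third, versus a "consistent" model that is solid everywhere.
scores = {
    "spiky":      [0.99, 0.99, 0.20],
    "consistent": [0.65, 0.65, 0.75],
    "model_c":    [0.60, 0.60, 0.60],
    "model_d":    [0.50, 0.50, 0.70],
}

n_tasks = len(next(iter(scores.values())))
borda = dict.fromkeys(scores, 0)
for t in range(n_tasks):
    # Rank models on task t; the worst gets 0 points, the best gets n_models - 1.
    for points, model in enumerate(sorted(scores, key=lambda m: scores[m][t])):
        borda[model] += points

averages = {m: sum(s) / n_tasks for m, s in scores.items()}
# "spiky" has the higher simple average, yet "consistent" wins the Borda count.
print(borda)  # {'spiky': 6, 'consistent': 7, 'model_c': 3, 'model_d': 2}
```

This is exactly the effect the note describes: averaging rewards peaks, while Borda counting rewards consistent cross-task performance.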
```diff
@@ -74,27 +71,26 @@ For readability, the MMTEB task names are abbreviated as follows:
 - Class: Classification
 - Clust: Clustering
 - InstRet: Instruction Retrieval
-- MuliClass: Multilabel Classification
+- MultiClass: Multilabel Classification
 - PairClass: PairClassification
 - Rank: Reranking
 - Ret: Retrieval
 - STS: Semantic Textual Similarity
 
 </details>
 
-### Retrieval Results
+## Retrieval Results
 
-A subset of models we created and compare against are specifically designed for retrieval tasks. The results are shown in the table below, including two general-purpose models for comparison and a transformer.
+Some of our models are specifically designed for retrieval tasks. The results are shown in the table below, with [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) included as a transformer baseline and [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) as a general-purpose static baseline.
 
 | Model | Retrieval Score |
 |:-----------------------|------------------:|
-| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | 41.95 |
-| [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) | 36.35 |
-| [static-retrieval-mrl-en-v1](https://huggingface.co/minishlab/static-retrieval-mrl-en-v1) | 35.61 |
-| [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) | 33.52 |
-| [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M) | 31.71 |
+| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | 42.92 |
+| [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) | 35.06 |
+| [static-retrieval-mrl-en-v1](https://huggingface.co/minishlab/static-retrieval-mrl-en-v1) | 34.95 |
+| [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) | 32.67 |
 
-As can be seen, [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) model is the most performant static retrieval model, reaching 86.65%% of the performance of [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) with a retrieval score of 36.35.
+As can be seen, [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) is the most performant static retrieval model, reaching 81.69% of the performance of [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) with a retrieval score of 35.06.
 
 ## Training Results
 
```
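The relative-performance percentages quoted in the updated text (93.21% for potion-base-32M on average MTEB score, 81.69% for potion-retrieval-32M on retrieval) are simple ratios against the all-MiniLM-L6-v2 baseline; they can be reproduced directly from the table values:

```python
def relative_performance(model_score: float, baseline_score: float) -> float:
    """Return a model's score as a percentage of a baseline's score."""
    return 100 * model_score / baseline_score

# potion-base-32M vs all-MiniLM-L6-v2, average MTEB score
print(round(relative_performance(52.13, 55.93), 2))  # 93.21

# potion-retrieval-32M vs all-MiniLM-L6-v2, retrieval score
print(round(relative_performance(35.06, 42.92), 2))  # 81.69
```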

results/make_speed_vs_mteb_plot.py

Lines changed: 49 additions & 47 deletions
```diff
@@ -1,7 +1,6 @@
 """Script to benchmark the speed of various text embedding models and generate a plot of the MTEB score vs samples per second."""
 
 import argparse
-import json
 import logging
 from pathlib import Path
 from time import perf_counter
```
```diff
@@ -57,7 +56,7 @@ def make_plot(df: pd.DataFrame) -> ggplot:
     """Create a plot of the MTEB score vs samples per second."""
     df["label_y"] = (
         df["Average score"]
-        + 0.5  # a constant "base" offset for all bubbles
+        + 0.2  # a constant "base" offset for all bubbles
        + 0.08 * np.sqrt(df["Params (Million)"])
     )
     plot = (
```
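The label-offset rule this hunk tunes places each bubble's text above the bubble by a constant base plus a term that grows with the bubble's radius (the square root of the parameter count), so labels clear larger circles. A standalone sketch of the same formula, with illustrative inputs:

```python
import math

def label_y(avg_score: float, params_million: float) -> float:
    # Constant base offset (0.2) plus a radius-proportional term,
    # mirroring the formula in make_plot.
    return avg_score + 0.2 + 0.08 * math.sqrt(params_million)

# Two models with the same score but very different sizes get
# different label offsets:
print(round(label_y(50.0, 2.0) - 50.0, 3))    # 0.313
print(round(label_y(50.0, 120.0) - 50.0, 3))  # 1.076
```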
```diff
@@ -106,68 +105,66 @@ def benchmark_model(name: str, info: list[str], texts: list[str]) -> dict[str, float]:
     return {"docs_per_second": docs_per_second, "total_time": total_time}
 
 
-def main(save_path: str, n_texts: int) -> None:
+def main(save_path: str, n_texts: int, force_benchmark: bool) -> None:
     """Benchmark text embedding models and generate a plot."""
-    # Define the models to benchmark
-    models: dict[str, list[str]] = {
-        "BPEmb-50k-300d": ["", "BPEmb"],
-        "all-MiniLM-L6-v2": ["sentence-transformers/all-MiniLM-L6-v2", "ST"],
-        "bge-base-en-v1.5": ["BAAI/bge-base-en-v1.5", "ST"],
-        "GloVe 6B 300d": ["sentence-transformers/average_word_embeddings_glove.6B.300d", "ST"],
-        "potion-base-8M": ["minishlab/potion-base-8M", "M2V"],
-    }
+    save_dir = Path(save_path)
+    save_dir.mkdir(parents=True, exist_ok=True)
 
-    # Load the dataset
-    ds = load_dataset("wikimedia/wikipedia", data_files="20231101.en/train-00000-of-00041.parquet")["train"]
-    texts = ds["text"][:n_texts]
+    # Cached speeds (samples/sec, measured on CPU with 1k Wikipedia texts).
+    # Re-run with --force-benchmark to update these values.
+    cached_speeds: dict[str, float] = {
+        "BPEmb-50k-300d": 84.39,
+        "all-MiniLM-L6-v2": 62.00,
+        "bge-base-en-v1.5": 7.42,
+        "GloVe 6B 300d": 538.39,
+        "potion-base-8M": 5165.96,
+    }
 
     summarized_results = [
-        {"Model": "potion-base-2M", "Average score": 44.77, "Samples per second": None, "Params (Million)": 1.875},
-        {"Model": "GloVe 6B 300d", "Average score": 42.36, "Samples per second": None, "Params (Million)": 120.000},
-        {"Model": "potion-base-4M", "Average score": 48.23, "Samples per second": None, "Params (Million)": 3.750},
-        {"Model": "all-MiniLM-L6-v2", "Average score": 56.09, "Samples per second": None, "Params (Million)": 23.000},
-        {"Model": "potion-base-8M", "Average score": 50.03, "Samples per second": None, "Params (Million)": 7.500},
-        {"Model": "bge-base-en-v1.5", "Average score": 63.56, "Samples per second": None, "Params (Million)": 109.000},
-        {"Model": "M2V_base_output", "Average score": 45.34, "Samples per second": None, "Params (Million)": 7.500},
-        {"Model": "BPEmb-50k-300d", "Average score": 37.78, "Samples per second": None, "Params (Million)": 15.000},
-        {"Model": "potion-base-32M", "Average score": 51.66, "Samples per second": None, "Params (Million)": 32.300},
+        {"Model": "potion-base-2M", "Average score": 47.49, "Samples per second": None, "Params (Million)": 1.875},
+        {"Model": "GloVe 6B 300d", "Average score": 45.82, "Samples per second": None, "Params (Million)": 120.000},
+        {"Model": "potion-base-4M", "Average score": 49.77, "Samples per second": None, "Params (Million)": 3.750},
+        {"Model": "all-MiniLM-L6-v2", "Average score": 55.93, "Samples per second": None, "Params (Million)": 23.000},
+        {"Model": "potion-base-8M", "Average score": 51.08, "Samples per second": None, "Params (Million)": 7.500},
+        {"Model": "bge-base-en-v1.5", "Average score": 60.77, "Samples per second": None, "Params (Million)": 109.000},
+        {"Model": "BPEmb-50k-300d", "Average score": 41.74, "Samples per second": None, "Params (Million)": 15.000},
+        {"Model": "potion-base-32M", "Average score": 52.13, "Samples per second": None, "Params (Million)": 32.300},
     ]
 
-    timings = {}
-
-    for name, info in models.items():
-        timing = benchmark_model(name, info, texts)
-        timings[name] = timing
-        # Update summarized results
-        for result in summarized_results:
-            if result["Model"] == name:
-                result["Samples per second"] = timing["docs_per_second"]
-
-    # Set potion-base-8M as the reference speed for the other potion models
+    if force_benchmark:
+        models: dict[str, list[str]] = {
+            "BPEmb-50k-300d": ["", "BPEmb"],
+            "all-MiniLM-L6-v2": ["sentence-transformers/all-MiniLM-L6-v2", "ST"],
+            "bge-base-en-v1.5": ["BAAI/bge-base-en-v1.5", "ST"],
+            "GloVe 6B 300d": ["sentence-transformers/average_word_embeddings_glove.6B.300d", "ST"],
+            "potion-base-8M": ["minishlab/potion-base-8M", "M2V"],
+        }
+        ds = load_dataset("wikimedia/wikipedia", data_files="20231101.en/train-00000-of-00041.parquet")["train"]
+        texts = ds["text"][:n_texts]
+        for name, info in models.items():
+            timing = benchmark_model(name, info, texts)
+            cached_speeds[name] = float(timing["docs_per_second"])
+        logger.info("Updated speeds: %s", cached_speeds)
+
+    for result in summarized_results:
+        name = str(result["Model"])
+        if name in cached_speeds:
+            result["Samples per second"] = cached_speeds[name]
+
+    # Set potion-base-8M as the reference speed for the other M2V models
     potion_base_8m_speed = next(
         result["Samples per second"] for result in summarized_results if result["Model"] == "potion-base-8M"
     )
-    for model_name in ["M2V_base_output", "potion-base-2M", "potion-base-4M", "potion-base-32M"]:
+    for model_name in ["potion-base-2M", "potion-base-4M", "potion-base-32M"]:
         for result in summarized_results:
             if result["Model"] == model_name:
                 result["Samples per second"] = potion_base_8m_speed
 
-    # Ensure save_path is a directory
-    save_dir = Path(save_path)
-    save_dir.mkdir(parents=True, exist_ok=True)
-
-    # Save timings to JSON
-    json_path = save_dir / "speed_benchmark_results.json"
-    with open(json_path, "w") as file:
-        json.dump(timings, file, indent=4)
-
     # Create and save the plot
     df = pd.DataFrame(summarized_results)
     plot = make_plot(df)
     plot_path = save_dir / "speed_vs_mteb_plot.png"
     plot.save(plot_path, width=12, height=10)
-
-    logger.info(f"Timings saved to {json_path}")
     logger.info(f"Plot saved to {plot_path}")
 
 
```
```diff
@@ -179,6 +176,11 @@ def main(save_path: str, n_texts: int) -> None:
     parser.add_argument(
         "--n-texts", type=int, default=100_000, help="Number of texts to use from the dataset for benchmarking."
     )
+    parser.add_argument(
+        "--force-benchmark",
+        action="store_true",
+        help="Re-run the speed benchmark even if cached results exist.",
+    )
     args = parser.parse_args()
 
-    main(save_path=args.save_path, n_texts=args.n_texts)
+    main(save_path=args.save_path, n_texts=args.n_texts, force_benchmark=args.force_benchmark)
```
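The commit's overall pattern in this script — use cached measurements by default, re-run the expensive benchmark only when an opt-in flag is passed — can be sketched in isolation. Names and numbers below are illustrative stand-ins, not the script's actual models or speeds:

```python
import argparse

def fake_benchmark(name: str) -> float:
    """Stand-in for an expensive benchmark run (e.g. benchmark_model)."""
    return 100.0

def get_speeds(force_benchmark: bool) -> dict[str, float]:
    # Static, previously measured speeds (samples/sec).
    cached_speeds = {"model-a": 84.39, "model-b": 5165.96}
    if force_benchmark:
        # Overwrite the cache with fresh measurements.
        for name in cached_speeds:
            cached_speeds[name] = fake_benchmark(name)
    return cached_speeds

parser = argparse.ArgumentParser()
parser.add_argument(
    "--force-benchmark",
    action="store_true",
    help="Re-run the benchmark instead of using cached results.",
)
args = parser.parse_args([])  # empty argv for demonstration: flag defaults to False
print(get_speeds(args.force_benchmark))  # cached values, since the flag was not passed
```

The design choice mirrors the diff: re-runs stay reproducible on demand, while the default path avoids downloading datasets and loading models just to redraw the plot.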
