MTEB Leaderboard

Last Updated : 25 Aug, 2025

MTEB stands for Massive Text Embedding Benchmark. It is a benchmark that compares text embedding models to see how well they perform. Text embeddings turn sentences or paragraphs into vectors of numbers that computers can work with; different models create these embeddings, but they do not all do it equally well.
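
For example, using the sentence-transformers library (installed later in this article), a model such as all-MiniLM-L6-v2 maps each sentence to a 384-dimensional vector, and similar sentences end up with similar vectors. A minimal sketch:

Python
from sentence_transformers import SentenceTransformer, util

# Load a small pre-trained embedding model (also used later in this article).
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = ["A cat sits on the mat.", "A kitten is resting on a rug."]
embeddings = model.encode(sentences)  # each sentence becomes a 384-dimensional vector
print(embeddings.shape)               # (2, 384)

# Similar sentences produce similar vectors; cosine similarity makes that measurable.
print(util.cos_sim(embeddings[0], embeddings[1]))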

Key features of MTEB are:

  • Multi-Task Evaluation: Tests models across diverse tasks like classification, retrieval, clustering and semantic similarity.
  • Standardized Benchmarking: Uses a unified evaluation protocol for fair model comparison.
  • Domain Variety: Includes datasets from multiple domains to assess generalisation.
  • Model Transparency: Displays details like architecture, size and training data for informed selection.

Tasks Included in MTEB

MTEB covers 8 types of tasks that show how embeddings help in real-world problems:

  • Bitext Mining: Finding matching sentences in two languages.
  • Classification: Sorting texts into categories.
  • Clustering: Grouping similar texts.
  • Pair Classification: Deciding if two texts are similar.
  • Reranking: Ordering texts based on relevance.
  • Retrieval: Finding relevant documents for a query.
  • Semantic Textual Similarity (STS): Measuring how similar two sentences are.
  • Summarization: Checking how well machine summaries match human ones.
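
Recent versions of the mteb library expose a get_tasks helper that lets you browse the tasks behind each of these categories; a rough sketch (the exact API may differ slightly between mteb releases):

Python
import mteb

# List English STS tasks; task_types accepts the category names used by MTEB,
# e.g. "Classification", "Clustering", "Retrieval", "STS".
sts_tasks = mteb.get_tasks(task_types=["STS"], languages=["eng"])
for task in sts_tasks:
    print(task.metadata.name)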

Why do we need MTEB?

MTEB (Massive Text Embedding Benchmark) is needed because it provides a standardized, comprehensive way to evaluate text embedding models across many diverse tasks and datasets. It helps:

  • Compare models fairly on different NLP tasks like classification, retrieval, clustering and semantic similarity.
  • Assess how well embedding models generalise beyond a single task or dataset.
  • Guide users in choosing the right embedding model for their specific application.
  • Encourage the development of better, more versatile embeddings by benchmarking many models in one unified framework.

MTEB gives a broad, reliable measure of embedding quality across varied real-world scenarios, rather than relying on limited single-task evaluations.

Implementation

Step 1: Install dependencies

We will install:

  • MTEB: used for evaluating text embedding models.
  • sentence-transformers: provides pre-trained models for embeddings.
Python
!pip install mteb sentence-transformers

Step 2: Define the Models

We will define the models on which we want to run the benchmark.

  • model_names: list of embedding models to evaluate.
  • Includes models from HuggingFace (e.g., MiniLM, MPNet, BGE, Jina embeddings).
  • All will be tested on the same benchmark task.
Python
model_names = [
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/all-mpnet-base-v2",
    "sentence-transformers/all-distilroberta-v1",
    "sentence-transformers/paraphrase-MiniLM-L6-v2",
    "sentence-transformers/paraphrase-mpnet-base-v2",
    "intfloat/e5-base-v2",
    "BAAI/bge-base-en-v1.5",
    "jinaai/jina-embeddings-v2-base-en",
]

We may customize the model list according to our needs.

Step 3: Load Models and Run Benchmark

  • SentenceTransformer(model_name): loads the pre-trained model.
  • MTEB(tasks=tasks): prepares evaluation on STSBenchmark (Semantic Textual Similarity).
  • evaluation.run(...): executes benchmarking and saves results to results_local_comp/... with a separate sub-folder per model.
  • model_name.replace('/', '_') ensures no invalid folder names.
  • verbosity=0: keeps output minimal.
Python
from sentence_transformers import SentenceTransformer
from mteb import MTEB

tasks = ["STSBenchmark"]
results_base_dir = "results_local_comp"

for model_name in model_names:
    print(f"Evaluating {model_name} ...")
    model = SentenceTransformer(model_name)
    evaluation = MTEB(tasks=tasks)
    evaluation.run(
        model,
        output_folder=f"{results_base_dir}/{model_name.replace('/', '_')}",
        verbosity=0,
    )
print("Benchmarking done.")

Output:

[Screenshot: Model Loading and Evaluation]
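
Before extracting scores, it helps to check where the run placed its JSON files. Each model's scores end up two folder levels below its output folder (typically the model id and a revision hash), which is what the helper function in Step 5 navigates. A quick way to inspect the layout:

Python
import os

# Print every result JSON written by the benchmark run above.
for root, dirs, files in os.walk("results_local_comp"):
    for name in files:
        if name.endswith(".json"):
            print(os.path.join(root, name))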

Step 4: Setup Result Extraction

We will set up the result extraction block:

  • Import utilities for handling files (os), JSON results (json) and tables (pandas).
  • results_base_dir is updated to the absolute path where the results were saved.
Python
import os
import json
import pandas as pd

results_base_dir = "/content/results_local_comp"

Step 5: Define Helper Function to Read Metrics

We will define the helper function:

  • The function starts in a given model_folder and navigates two levels of subfolders.
  • It then looks for a file named STSBenchmark.json inside the second-level folder.
  • If the file exists, it loads the JSON data.
  • From the data, it extracts the test scores (under "scores" → "test") and builds a DataFrame with three columns:
  • Metric: metric name (e.g., accuracy, Pearson, Spearman).
  • Value: metric value (formatted to 4 decimal places).
  • Model: the name of the model_folder.
  • If any step fails (missing folders, missing file or no scores), it prints a clear message and returns None.
Python
def extract_metrics(model_folder):
    try:
        first_level = next(os.walk(model_folder))[1]
        if not first_level:
            print(f"No first level folders in {model_folder}")
            return None
        first_level_folder = first_level[0]

        second_level_path = os.path.join(model_folder, first_level_folder)
        second_level_walk = next(os.walk(second_level_path))
        second_level = second_level_walk[1]
        if not second_level:
            print(f"No second level folders in {second_level_path}")
            return None
        second_level_folder = second_level[0]

        json_path = os.path.join(
            second_level_path, second_level_folder, "STSBenchmark.json")
        if not os.path.isfile(json_path):
            print(f"No STSBenchmark.json file at {json_path}")
            return None

        with open(json_path, "r") as f:
            result_data = json.load(f)

        scores = result_data.get("scores", {})
        test_scores = scores.get("test", [])

        if not test_scores:
            print(f"No test scores found in {json_path}")
            return None

        metrics = test_scores[0]
        df = pd.DataFrame(list(metrics.items()), columns=["Metric", "Value"])
        df["Value"] = df["Value"].apply(
            lambda x: f"{x:.4f}" if isinstance(x, float) else x)
        df["Model"] = os.path.basename(model_folder)
        return df
    except Exception as e:
        print(f"Error processing folder {model_folder}: {e}")
        return None

Output:

[Screenshot: Unorganized Result]

Step 6: Collect Results for all Models

We will collect the results for all the models:

  • Lists all model result folders inside results_local_comp.
  • Loops through each model folder.
  • Extracts metrics using extract_metrics.
  • Stores results in a list of DataFrames.
Python
model_folders = [
    os.path.join(results_base_dir, d)
    for d in os.listdir(results_base_dir)
    if os.path.isdir(os.path.join(results_base_dir, d))
]

dfs = []
for folder in model_folders:
    df_metrics = extract_metrics(folder)
    if df_metrics is not None:
        dfs.append(df_metrics)

Step 7: Combine and Display all Metrics

We will display the results:

  • Concatenates all DataFrames into one large table (all models + scores).
  • Displays the final evaluation results (only if any metrics were found).
Python
if dfs:
    combined_df = pd.concat(dfs).reset_index(drop=True)
    display(combined_df)  # render the combined table in the notebook
else:
    print("No metrics found for any models.")

Output:

[Screenshot: Result of Various Metrics]
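
Since combined_df keeps one row per (Model, Metric) pair, we can also rank the models on a single metric to build a small local leaderboard. A sketch, assuming the cosine_spearman metric (listed in Step 8) is present in the results:

Python
import pandas as pd

# Rank models by the Spearman correlation of their cosine similarities (higher is better).
leaderboard = combined_df[combined_df["Metric"] == "cosine_spearman"].copy()
leaderboard["Value"] = pd.to_numeric(leaderboard["Value"], errors="coerce")
leaderboard = leaderboard.sort_values("Value", ascending=False)[["Model", "Value"]]
print(leaderboard.to_string(index=False))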

Step 8: Visualize the Result

We will plot the results. The metrics reported for each model are listed below, and a short sketch after the list shows how the correlation metrics are computed:

  • cosine_pearson: Pearson correlation between model’s cosine similarity scores and human judgments.
  • cosine_spearman: Spearman correlation (rank-order) for cosine similarity scores vs. human rankings.
  • euclidean_pearson: Pearson correlation for Euclidean distances converted to similarity, compared to human scores.
  • euclidean_spearman: Spearman correlation for Euclidean distance-based similarities vs. human rankings.
  • manhattan_pearson: Pearson correlation for Manhattan (L1) distance-based similarities.
  • manhattan_spearman: Spearman correlation for Manhattan distance-based similarities.
  • pearson: Pearson correlation on the model’s primary similarity scores (often cosine).
  • spearman: Spearman (rank) correlation on the primary similarity scores.
  • main_score: An overall score or summary metric for model performance (could be an average or primary metric used for ranking).
  • languages: The language(s) covered by the evaluation subset.
  • hf_subset: Indicates a specific subset of evaluation data (often from HuggingFace datasets).
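
To make the correlation metrics concrete, the sketch below computes a Pearson and Spearman correlation between a model's cosine similarities and a few hypothetical human ratings (made up here purely for illustration, not taken from STSBenchmark):

Python
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer, util

# Toy sentence pairs with assumed human similarity ratings on a 0-5 scale.
pairs = [
    ("A man is playing a guitar.", "A person plays a guitar."),
    ("A dog runs in the park.", "A cat sleeps on a sofa."),
    ("Children are playing football.", "Kids play soccer outside."),
]
human_scores = [4.8, 0.5, 4.2]  # hypothetical values, for illustration only

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
cosine_scores = [float(util.cos_sim(model.encode(a), model.encode(b)))
                 for a, b in pairs]

# cosine_pearson / cosine_spearman in the table are computed the same way,
# just over the full STSBenchmark test split.
print("Pearson :", pearsonr(cosine_scores, human_scores)[0])
print("Spearman:", spearmanr(cosine_scores, human_scores)[0])

The plotting code below then compares these metrics across all the evaluated models.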
Python
import matplotlib.pyplot as plt
import pandas as pd

combined_df['Value'] = pd.to_numeric(combined_df['Value'], errors='coerce')

pivot_df = combined_df.pivot(index='Model', columns='Metric', values='Value')

pivot_df.plot(kind='bar', figsize=(12, 6))

plt.title('Model Performance on STSBenchmark Metrics')
plt.ylabel('Metric Value')
plt.xlabel('Model')
plt.xticks(rotation=45, ha='right')

plt.legend(title='Metric', bbox_to_anchor=(
    1.02, 1), loc='upper left', borderaxespad=0)
plt.tight_layout()
plt.show()

Output:

[Bar chart: Visual Representation of Results]
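
Because the pivoted table also includes non-numeric fields (such as hf_subset and languages) that become NaN after the numeric conversion, the chart can get cluttered. One option, sketched below, is to keep only the correlation metrics before plotting (reusing pivot_df from the previous cell):

Python
import matplotlib.pyplot as plt

# Optional: restrict the chart to the correlation metrics described above.
correlation_metrics = [
    "cosine_pearson", "cosine_spearman",
    "euclidean_pearson", "euclidean_spearman",
    "manhattan_pearson", "manhattan_spearman",
]
available = [m for m in correlation_metrics if m in pivot_df.columns]

pivot_df[available].plot(kind="bar", figsize=(12, 6))
plt.title("Correlation Metrics on STSBenchmark")
plt.ylabel("Correlation")
plt.xticks(rotation=45, ha="right")
plt.legend(title="Metric", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()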

Application

  • Model Selection: Helps researchers and developers choose the best text embedding model for tasks like search, recommendation or clustering. It saves time by providing ready performance comparisons.
  • Performance Benchmarking: Provides a standardized way to compare embeddings across multiple domains and tasks, ensuring fair evaluation.
  • Progress Tracking: Enables monitoring of improvements in embedding models over time and helps spot emerging trends.
  • Research Insights: Highlights strengths and weaknesses of models, guiding future NLP research and dataset design.

Advantages

  • Tests many embedding models on a wide range of tasks and languages.
  • Uses the same method for all models to allow fair comparison.
  • Publicly available and updated often for transparency.
  • Helps users quickly find the best model for their needs.
  • Tracks improvements in embedding models over time.

Challenges

  • Task Diversity: A model may perform well overall but poorly in specific domains, making selection task-dependent and tricky.
  • Evaluation Bias: Benchmarks may favor models trained on datasets similar to MTEB tasks, reducing generalizability.
  • Resource Requirements: High-performing models can be computationally expensive to run, limiting accessibility.
  • Rapid Model Evolution: Frequent new releases can quickly make rankings outdated, requiring constant updates.
