MTEB Leaderboard

Last Updated : 25 Aug, 2025

MTEB stands for Massive Text Embedding Benchmark. It is a benchmark that compares text embedding models to see how well they perform. Text embeddings turn sentences or paragraphs into vectors of numbers that computers can work with; different models create these embeddings, but they do not all do it equally well.
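
For example, using the sentence-transformers library (installed later in this article), a model such as all-MiniLM-L6-v2 maps each sentence to a 384-dimensional vector, and similar sentences end up with similar vectors. A minimal sketch:

Python
from sentence_transformers import SentenceTransformer, util

# Load a small pre-trained embedding model (also used later in this article).
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = ["A cat sits on the mat.", "A kitten is resting on a rug."]
embeddings = model.encode(sentences)  # each sentence becomes a 384-dimensional vector
print(embeddings.shape)               # (2, 384)

# Similar sentences produce similar vectors; cosine similarity makes that measurable.
print(util.cos_sim(embeddings[0], embeddings[1]))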

Key features of MTEB are:

  • Multi-Task Evaluation: Tests models across diverse tasks like classification, retrieval, clustering and semantic similarity.
  • Standardized Benchmarking: Uses a unified evaluation protocol for fair model comparison.
  • Domain Variety: Includes datasets from multiple domains to assess generalisation.
  • Model Transparency: Displays details like architecture, size and training data for informed selection.

Tasks Included in MTEB

MTEB covers 8 types of tasks that show how embeddings help in real-world problems:

  • Bitext Mining: Finding matching sentences in two languages.
  • Classification: Sorting texts into categories.
  • Clustering: Grouping similar texts.
  • Pair Classification: Deciding if two texts are similar.
  • Reranking: Ordering texts based on relevance.
  • Retrieval: Finding relevant documents for a query.
  • Semantic Textual Similarity (STS): Measuring how similar two sentences are.
  • Summarization: Checking how well machine summaries match human ones.
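
Recent versions of the mteb library expose a get_tasks helper that lets you browse the tasks behind each of these categories; a rough sketch (the exact API may differ slightly between mteb releases):

Python
import mteb

# List English STS tasks; task_types accepts the category names used by MTEB,
# e.g. "Classification", "Clustering", "Retrieval", "STS".
sts_tasks = mteb.get_tasks(task_types=["STS"], languages=["eng"])
for task in sts_tasks:
    print(task.metadata.name)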

Why do we need MTEB?

MTEB (Massive Text Embedding Benchmark) is needed because it provides a standardized, comprehensive way to evaluate text embedding models across many diverse tasks and datasets. It helps:

  • Compare models fairly on different NLP tasks like classification, retrieval, clustering and semantic similarity.
  • Assess how well embedding models generalise beyond a single task or dataset.
  • Guide users in choosing the right embedding model for their specific application.
  • Encourage the development of better, more versatile embeddings by benchmarking many models in one unified framework.

MTEB gives a broad, reliable measure of embedding quality across varied real-world scenarios, rather than relying on limited single-task evaluations.

Implementation

Step 1: Install dependencies

We will install:

  • MTEB: used for evaluating text embedding models.
  • sentence-transformers: provides pre-trained models for embeddings.
Python
!pip install mteb sentence-transformers

Step 2: Define the Models

We will define the models on which we want to run the benchmark.

  • model_names: list of embedding models to evaluate.
  • Includes models from HuggingFace (e.g., MiniLM, MPNet, BGE, Jina embeddings).
  • All will be tested on the same benchmark task.
Python
model_names = [
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/all-mpnet-base-v2",
    "sentence-transformers/all-distilroberta-v1",
    "sentence-transformers/paraphrase-MiniLM-L6-v2",
    "sentence-transformers/paraphrase-mpnet-base-v2",
    "intfloat/e5-base-v2",
    "BAAI/bge-base-en-v1.5",
    "jinaai/jina-embeddings-v2-base-en",
]

We may customize the model list according to our needs.

Step 3: Load Models and Run Benchmark

  • SentenceTransformer(model_name): loads the pre-trained model.
  • MTEB(tasks=tasks): prepares evaluation on STSBenchmark (Semantic Textual Similarity).
  • evaluation.run(...): executes benchmarking and saves results to results_local_comp/... with a separate sub-folder per model.
  • model_name.replace('/', '_') ensures no invalid folder names.
  • verbosity=0: keeps output minimal.
Python
from sentence_transformers import SentenceTransformer
from mteb import MTEB

tasks = ["STSBenchmark"]
results_base_dir = "results_local_comp"

for model_name in model_names:
    print(f"Evaluating {model_name} ...")
    model = SentenceTransformer(model_name)
    evaluation = MTEB(tasks=tasks)
    evaluation.run(
        model,
        output_folder=f"{results_base_dir}/{model_name.replace('/', '_')}",
        verbosity=0,
    )
print("Benchmarking done.")

Output:

[Screenshot: Model Loading and Evaluation]
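
Before extracting scores, it helps to check where the run placed its JSON files. Each model's scores end up two folder levels below its output folder (typically the model id and a revision hash), which is what the helper function in Step 5 navigates. A quick way to inspect the layout:

Python
import os

# Print every result JSON written by the benchmark run above.
for root, dirs, files in os.walk("results_local_comp"):
    for name in files:
        if name.endswith(".json"):
            print(os.path.join(root, name))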

Step 4: Setup Result Extraction

We will set up the result extraction block:

  • Import utilities for handling files (os), JSON results (json) and tables (pandas).
  • results_base_dir is updated to the absolute path where the results were saved.
Python
import os
import json
import pandas as pd

results_base_dir = "/content/results_local_comp"

Step 5: Define Helper Function to Read Metrics

We will define the helper function:

  • The function starts in a given model_folder and navigates two levels of subfolders.
  • It then looks for a file named STSBenchmark.json inside the second-level folder.
  • If the file exists, it loads the JSON data.
  • From the data, it extracts the test scores (under "scores" → "test") and builds a DataFrame with three columns:
  • Metric: metric name (e.g., accuracy, Pearson, Spearman).
  • Value: metric value (formatted to 4 decimal places).
  • Model: the name of the model_folder.
  • If any step fails (missing folders, missing file or no scores), it prints a clear message and returns None.
Python
def extract_metrics(model_folder):
    try:
        first_level = next(os.walk(model_folder))[1]
        if not first_level:
            print(f"No first level folders in {model_folder}")
            return None
        first_level_folder = first_level[0]

        second_level_path = os.path.join(model_folder, first_level_folder)
        second_level_walk = next(os.walk(second_level_path))
        second_level = second_level_walk[1]
        if not second_level:
            print(f"No second level folders in {second_level_path}")
            return None
        second_level_folder = second_level[0]

        json_path = os.path.join(
            second_level_path, second_level_folder, "STSBenchmark.json")
        if not os.path.isfile(json_path):
            print(f"No STSBenchmark.json file at {json_path}")
            return None

        with open(json_path, "r") as f:
            result_data = json.load(f)

        scores = result_data.get("scores", {})
        test_scores = scores.get("test", [])

        if not test_scores:
            print(f"No test scores found in {json_path}")
            return None

        metrics = test_scores[0]
        df = pd.DataFrame(list(metrics.items()), columns=["Metric", "Value"])
        df["Value"] = df["Value"].apply(
            lambda x: f"{x:.4f}" if isinstance(x, float) else x)
        df["Model"] = os.path.basename(model_folder)
        return df
    except Exception as e:
        print(f"Error processing folder {model_folder}: {e}")
        return None

Output:

[Screenshot: Unorganized Result]

Step 6: Collect Results for all Models

We will collect the results for all the models:

  • Lists all model result folders inside results_local_comp.
  • Loops through each model folder.
  • Extracts metrics using extract_metrics.
  • Stores results in a list of DataFrames.
Python
model_folders = [
    os.path.join(results_base_dir, d)
    for d in os.listdir(results_base_dir)
    if os.path.isdir(os.path.join(results_base_dir, d))
]

dfs = []
for folder in model_folders:
    df_metrics = extract_metrics(folder)
    if df_metrics is not None:
        dfs.append(df_metrics)

Step 7: Combine and Display all Metrics

We will display the results:

  • Concatenates all DataFrames into one large table (all models + scores).
  • Displays the final evaluation results (only if any metrics were found).
Python
if dfs:
    combined_df = pd.concat(dfs).reset_index(drop=True)
    display(combined_df)  # render the combined table in the notebook
else:
    print("No metrics found for any models.")

Output:

[Screenshot: Result of Various Metrics]
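
Since combined_df keeps one row per (Model, Metric) pair, we can also rank the models on a single metric to build a small local leaderboard. A sketch, assuming the cosine_spearman metric (listed in Step 8) is present in the results:

Python
import pandas as pd

# Rank models by the Spearman correlation of their cosine similarities (higher is better).
leaderboard = combined_df[combined_df["Metric"] == "cosine_spearman"].copy()
leaderboard["Value"] = pd.to_numeric(leaderboard["Value"], errors="coerce")
leaderboard = leaderboard.sort_values("Value", ascending=False)[["Model", "Value"]]
print(leaderboard.to_string(index=False))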

Step 8: Visualize the Result

We will plot the results. The metrics reported for each model are listed below, and a short sketch after the list shows how the correlation metrics are computed:

  • cosine_pearson: Pearson correlation between model’s cosine similarity scores and human judgments.
  • cosine_spearman: Spearman correlation (rank-order) for cosine similarity scores vs. human rankings.
  • euclidean_pearson: Pearson correlation for Euclidean distances converted to similarity, compared to human scores.
  • euclidean_spearman: Spearman correlation for Euclidean distance-based similarities vs. human rankings.
  • manhattan_pearson: Pearson correlation for Manhattan (L1) distance-based similarities.
  • manhattan_spearman: Spearman correlation for Manhattan distance-based similarities.
  • pearson: Pearson correlation on the model’s primary similarity scores (often cosine).
  • spearman: Spearman (rank) correlation on the primary similarity scores.
  • main_score: An overall score or summary metric for model performance (could be an average or primary metric used for ranking).
  • languages: The language(s) covered by the evaluation subset.
  • hf_subset: Indicates a specific subset of evaluation data (often from HuggingFace datasets).
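
To make the correlation metrics concrete, the sketch below computes a Pearson and Spearman correlation between a model's cosine similarities and a few hypothetical human ratings (made up here purely for illustration, not taken from STSBenchmark):

Python
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer, util

# Toy sentence pairs with assumed human similarity ratings on a 0-5 scale.
pairs = [
    ("A man is playing a guitar.", "A person plays a guitar."),
    ("A dog runs in the park.", "A cat sleeps on a sofa."),
    ("Children are playing football.", "Kids play soccer outside."),
]
human_scores = [4.8, 0.5, 4.2]  # hypothetical values, for illustration only

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
cosine_scores = [float(util.cos_sim(model.encode(a), model.encode(b)))
                 for a, b in pairs]

# cosine_pearson / cosine_spearman in the table are computed the same way,
# just over the full STSBenchmark test split.
print("Pearson :", pearsonr(cosine_scores, human_scores)[0])
print("Spearman:", spearmanr(cosine_scores, human_scores)[0])

The plotting code below then compares these metrics across all the evaluated models.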
Python
import matplotlib.pyplot as plt
import pandas as pd

combined_df['Value'] = pd.to_numeric(combined_df['Value'], errors='coerce')

pivot_df = combined_df.pivot(index='Model', columns='Metric', values='Value')

pivot_df.plot(kind='bar', figsize=(12, 6))

plt.title('Model Performance on STSBenchmark Metrics')
plt.ylabel('Metric Value')
plt.xlabel('Model')
plt.xticks(rotation=45, ha='right')

plt.legend(title='Metric', bbox_to_anchor=(
    1.02, 1), loc='upper left', borderaxespad=0)
plt.tight_layout()
plt.show()

Output:

[Bar chart: Visual Representation of Results]
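
Because the pivoted table also includes non-numeric fields (such as hf_subset and languages) that become NaN after the numeric conversion, the chart can get cluttered. One option, sketched below, is to keep only the correlation metrics before plotting (reusing pivot_df from the previous cell):

Python
import matplotlib.pyplot as plt

# Optional: restrict the chart to the correlation metrics described above.
correlation_metrics = [
    "cosine_pearson", "cosine_spearman",
    "euclidean_pearson", "euclidean_spearman",
    "manhattan_pearson", "manhattan_spearman",
]
available = [m for m in correlation_metrics if m in pivot_df.columns]

pivot_df[available].plot(kind="bar", figsize=(12, 6))
plt.title("Correlation Metrics on STSBenchmark")
plt.ylabel("Correlation")
plt.xticks(rotation=45, ha="right")
plt.legend(title="Metric", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()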

Application

  • Model Selection: Helps researchers and developers choose the best text embedding model for tasks like search, recommendation or clustering. It saves time by providing ready performance comparisons.
  • Performance Benchmarking: Provides a standardized way to compare embeddings across multiple domains and tasks, ensuring fair evaluation.
  • Progress Tracking: Enables monitoring of improvements in embedding models over time and helps spot emerging trends.
  • Research Insights: Highlights strengths and weaknesses of models, guiding future NLP research and dataset design.

Advantages

  • Tests many embedding models on a wide range of tasks and languages.
  • Uses the same method for all models to allow fair comparison.
  • Publicly available and updated often for transparency.
  • Helps users quickly find the best model for their needs.
  • Tracks improvements in embedding models over time.

Challenges

  • Task Diversity: A model may perform well overall but poorly in specific domains, making selection task-dependent and tricky.
  • Evaluation Bias: Benchmarks may favor models trained on datasets similar to MTEB tasks, reducing generalizability.
  • Resource Requirements: High-performing models can be computationally expensive to run, limiting accessibility.
  • Rapid Model Evolution: Frequent new releases can quickly make rankings outdated, requiring constant updates.
