Evaluation

`mteb.evaluate`

`OverwriteStrategy`

Bases: HelpfulStrEnum

Enum for the overwrite strategy when running a task.

"always": Always run the task, overwriting the results
"never": Run the task only if the results are not found in the cache. If the results are found, it will not run the task.
"only-missing": Only rerun the missing splits of a task. It will not rerun the splits if the dataset revision or mteb version has changed.
"only-cache": Only load the results from the cache folder and do not run the task. Useful if you just want to load the results from the cache.
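To make the four strategies concrete, here is a minimal sketch that follows only the descriptions above. `Strategy` and `splits_to_run` are hypothetical stand-ins, not part of mteb, and the sketch ignores the dataset revision and mteb version checks that the real cache lookup may perform.

```python
from enum import Enum


class Strategy(str, Enum):
    """Hypothetical stand-in with the same four values as OverwriteStrategy."""

    ALWAYS = "always"
    NEVER = "never"
    ONLY_MISSING = "only-missing"
    ONLY_CACHE = "only-cache"


def splits_to_run(strategy, requested, cached):
    """Return the splits that would be evaluated under the given strategy.

    A simplified reading of the documented behaviour, not mteb's cache check.
    """
    strategy = Strategy(strategy)  # accepts either the enum member or its string value
    requested, cached = list(requested), set(cached)
    if strategy is Strategy.ALWAYS:
        return requested  # rerun everything, overwriting cached results
    if strategy is Strategy.ONLY_CACHE:
        return []  # never run, only load
    if strategy is Strategy.NEVER:
        return [] if cached else requested  # run only when nothing is cached
    # ONLY_MISSING: run just the splits that have no cached result
    return [split for split in requested if split not in cached]
```

With `requested=["dev", "test"]` and `cached=["dev"]`, this sketch returns both splits for `"always"`, only `"test"` for `"only-missing"`, and nothing for `"never"` or `"only-cache"`.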

Source code in mteb/evaluate.py

class OverwriteStrategy(HelpfulStrEnum):
    """Enum for the overwrite strategy when running a task.

    - "always": Always run the task, overwriting the results
    - "never": Run the task only if the results are not found in the cache. If the results are found, it will not run the task.
    - "only-missing": Only rerun the missing splits of a task. It will not rerun the splits if the dataset revision or mteb version has
        changed.
    - "only-cache": Only load the results from the cache folder and do not run the task. Useful if you just want to load the results from the
        cache.
    """

    ALWAYS = "always"
    NEVER = "never"
    ONLY_MISSING = "only-missing"
    ONLY_CACHE = "only-cache"

`evaluate(model, tasks, *, co2_tracker=None, raise_error=True, encode_kwargs=None, cache=ResultCache(), overwrite_strategy='only-missing', prediction_folder=None, show_progress_bar=True, public_only=None, num_proc=None, timer=None)`

Runs a model on one or more tasks and returns the results.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `ModelMeta \| MTEBModels \| SentenceTransformer \| CrossEncoder` | The model to use for encoding. | required |
| `tasks` | `AbsTask \| Iterable[AbsTask]` | A task or an iterable of tasks to run. | required |
| `co2_tracker` | `bool \| None` | If True, track the CO₂ emissions of the evaluation. This requires codecarbon, which can be installed using `pip install mteb[codecarbon]`. If None, CO₂ tracking runs only if codecarbon is installed. | `None` |
| `encode_kwargs` | `EncodeKwargs \| None` | Additional keyword arguments passed to the model's `encode` and `load_data` methods. | `None` |
| `raise_error` | `bool` | If True, raise an error if the task fails. If False, log the error and record it in the `exceptions` field of the returned `ModelResult`; the failed task adds no entry to `task_results`. | `True` |
| `cache` | `ResultCache \| None` | The cache to use for loading and saving results. If None, no cache is used. The default cache stores results in the `~/.cache/mteb` directory. The location can be overridden by setting the `MTEB_CACHE` environment variable to a different directory or by passing a `ResultCache` object directly. | `ResultCache()` |
| `overwrite_strategy` | `str \| OverwriteStrategy` | The strategy to use for running a task and overwriting the results. Can be "always" (always run the task, overwriting the results), "never" (run the task only if the results are not found in the cache), "only-missing" (only rerun the missing splits of a task; splits are not rerun if the dataset revision or mteb version has changed), or "only-cache" (only load the results from the cache folder and do not run the task). | `'only-missing'` |
| `prediction_folder` | `Path \| str \| None` | Optional folder in which to save model predictions for the task. Predictions are saved in `prediction_folder/{task_name}_predictions.json`. | `None` |
| `show_progress_bar` | `bool` | Whether to show a progress bar when running the evaluation. Setting this to False also sets `encode_kwargs['show_progress_bar']` to False if `encode_kwargs` is unspecified. | `True` |
| `public_only` | `bool \| None` | Run only public tasks. If None, it will attempt to run the private task. | `None` |
| `num_proc` | `int \| None` | Number of processes to use during data loading and transformation. Defaults to 1. | `None` |
| `timer` | `TimingStack \| None` | A context manager that tracks the timing of evaluation phases. | `None` |
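The interaction between `encode_kwargs` and `show_progress_bar` can be sketched as below. `resolve_encode_kwargs` is a hypothetical helper that mirrors the defaulting logic at the top of the `evaluate` source on this page. One assumed difference: it copies the dictionary, whereas the source adds `batch_size` to the caller's dictionary in place.

```python
def resolve_encode_kwargs(encode_kwargs=None, show_progress_bar=True):
    """Return the encode kwargs that would reach the model (illustrative helper)."""
    if encode_kwargs is None:
        # The progress-bar flag is only forwarded when no kwargs were given.
        resolved = {"show_progress_bar": False} if show_progress_bar is False else {}
    else:
        resolved = dict(encode_kwargs)  # copy so the caller's dict is left untouched
    # A batch size of 32 is filled in when the caller did not set one.
    resolved.setdefault("batch_size", 32)
    return resolved
```

Under this sketch, passing `encode_kwargs={"batch_size": 16}` together with `show_progress_bar=False` does not add a `show_progress_bar` entry, so callers who supply their own kwargs would need to set that key themselves.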

Returns:

| Type | Description |
| --- | --- |
| `ModelResult` | The results of the evaluation. |
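The sketch below illustrates how failures surface in the returned object when `raise_error=False`. The `ModelResult` and `TaskError` dataclasses here are simplified stand-ins that carry only the fields visible in the source on this page, and `run_all` is a hypothetical driver, not mteb's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class TaskError:
    """Stand-in: records which task failed and the error message."""

    task_name: str
    exception: str


@dataclass
class ModelResult:
    """Stand-in: successful results and recorded failures side by side."""

    model_name: str
    model_revision: str
    task_results: list = field(default_factory=list)
    exceptions: list = field(default_factory=list)


def run_all(task_fns, raise_error=True):
    """Run each callable; collect failures instead of raising when raise_error is False."""
    result = ModelResult("demo-model", "demo-revision")
    for name, fn in task_fns.items():
        try:
            result.task_results.append(fn())
        except Exception as e:
            if raise_error:
                raise
            result.exceptions.append(TaskError(task_name=name, exception=str(e)))
    return result


def failing_task():
    raise RuntimeError("dataset unavailable")


res = run_all(
    {"ok_task": lambda: {"main_score": 0.5}, "bad_task": failing_task},
    raise_error=False,
)
failed = [err.task_name for err in res.exceptions]
```

In this sketch `res.task_results` holds the one successful entry and `failed` lists `bad_task`, so a caller can report or retry failures after the run completes.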

Examples:

>>> import mteb
>>> model_meta = mteb.get_model_meta("sentence-transformers/all-MiniLM-L6-v2")
>>> task = mteb.get_task("STS12")
>>> result = mteb.evaluate(model_meta, task)
>>>
>>> # with CO2 tracking
>>> result = mteb.evaluate(model_meta, task, co2_tracker=True)
>>>
>>> # with encode kwargs
>>> result = mteb.evaluate(model_meta, task, encode_kwargs={"batch_size": 16})
>>>
>>> # with online cache
>>> cache = mteb.ResultCache(cache_path="~/.cache/mteb")
>>>
>>> cache.download_from_remote()
>>> result = mteb.evaluate(model_meta, task, cache=cache)

Source code in mteb/evaluate.py

def evaluate(  # noqa: PLR0913, PLR0914
    model: ModelMeta | MTEBModels | SentenceTransformer | CrossEncoder,
    tasks: AbsTask | Iterable[AbsTask],
    *,
    co2_tracker: bool | None = None,
    raise_error: bool = True,
    encode_kwargs: EncodeKwargs | None = None,
    cache: ResultCache | None = ResultCache(),
    overwrite_strategy: str | OverwriteStrategy = "only-missing",
    prediction_folder: Path | str | None = None,
    show_progress_bar: bool = True,
    public_only: bool | None = None,
    num_proc: int | None = None,
    timer: TimingStack | None = None,
) -> ModelResult:
    """This function runs a model on a given task and returns the results.

    Args:
        model: The model to use for encoding.
        tasks: A task or an iterable of tasks to run.
        co2_tracker: If True, track the CO₂ emissions of the evaluation. This requires codecarbon, which can be installed using
            `pip install mteb[codecarbon]`. If None, CO₂ tracking runs only if codecarbon is installed.
        encode_kwargs: Additional keyword arguments passed to the model's `encode` and `load_data` methods.
        raise_error: If True, raise an error if the task fails. If False, log the error and record it in the `exceptions` field of the
            returned `ModelResult`.
        cache: The cache to use for loading and saving results. If None, no cache is used. The default cache stores results in the
            `~/.cache/mteb` directory. It can be overridden by setting the `MTEB_CACHE` environment variable to a different directory or by directly
            passing a `ResultCache` object.
        overwrite_strategy: The strategy to use for running a task and overwriting the results. Can be:
            - "always": Always run the task, overwriting the results
            - "never": Run the task only if the results are not found in the cache. If the results are found, it will not run the task.
            - "only-missing": Only rerun the missing splits of a task. It will not rerun the splits if the dataset revision or mteb version has
                changed.
            - "only-cache": Only load the results from the cache folder and do not run the task. Useful if you just want to load the results from the
                cache.
        prediction_folder: Optional folder in which to save model predictions for the task. Predictions of the tasks will be saved in `prediction_folder/{task_name}_predictions.json`
        show_progress_bar: Whether to show a progress bar when running the evaluation. Default is True. Setting this to False will also set the
            `encode_kwargs['show_progress_bar']` to False if encode_kwargs is unspecified.
        public_only: Run only public tasks. If None, it will attempt to run the private task.
        num_proc: Number of processes to use during data loading and transformation. Defaults to 1.
        timer: A context manager that tracks the timing of evaluation phases.

    Returns:
        The results of the evaluation.

    Examples:
        >>> import mteb
        >>> model_meta = mteb.get_model_meta("sentence-transformers/all-MiniLM-L6-v2")
        >>> task = mteb.get_task("STS12")
        >>> result = mteb.evaluate(model_meta, task)
        >>>
        >>> # with CO2 tracking
        >>> result = mteb.evaluate(model_meta, task, co2_tracker=True)
        >>>
        >>> # with encode kwargs
        >>> result = mteb.evaluate(model_meta, task, encode_kwargs={"batch_size": 16})
        >>>
        >>> # with online cache
        >>> cache = mteb.ResultCache(cache_path="~/.cache/mteb")
        >>>
        >>> cache.download_from_remote()
        >>> result = mteb.evaluate(model_meta, task, cache=cache)
    """
    if isinstance(prediction_folder, str):
        prediction_folder = Path(prediction_folder)

    if encode_kwargs is None:
        encode_kwargs = (
            {"show_progress_bar": False} if show_progress_bar is False else {}
        )
    if "batch_size" not in encode_kwargs:
        encode_kwargs["batch_size"] = 32
        logger.info(
            "No batch size defined in encode_kwargs. Setting `encode_kwargs['batch_size'] = 32`. Explicitly set the batch size to silence this message."
        )

    model, meta, model_name, model_revision = _sanitize_model(model)
    _check_model_modalities(meta, tasks)
    overwrite_strategy = OverwriteStrategy.from_str(overwrite_strategy)

    # AbsTaskAggregate is a special case where we have to run multiple tasks and combine the results
    if isinstance(tasks, AbsTaskAggregate):
        existing_results, missing_eval = _check_cache(
            tasks, meta, cache, overwrite_strategy
        )

        if (
            existing_results
            and not missing_eval
            and overwrite_strategy != OverwriteStrategy.ALWAYS
        ):
            logger.info(
                f"Results for {tasks.metadata.name} already exist in cache. Skipping evaluation and loading results."
            )
            return ModelResult(
                model_name=model_name,
                model_revision=model_revision,
                task_results=[existing_results],
            )

        results = evaluate(
            model,
            tasks.metadata.tasks,
            co2_tracker=co2_tracker,
            raise_error=raise_error,
            encode_kwargs=encode_kwargs,
            cache=cache,
            overwrite_strategy=overwrite_strategy,
            prediction_folder=prediction_folder,
            show_progress_bar=show_progress_bar,
            public_only=public_only,
            num_proc=num_proc,
            timer=timer,
        )
        combined_results = tasks.combine_task_results(results.task_results)

        if existing_results:
            combined_results = existing_results.merge(combined_results)

        if cache:
            cache.save_to_cache(
                combined_results,
                meta,
                encode_kwargs=encode_kwargs,
            )

        return ModelResult(
            model_name=results.model_name,
            model_revision=results.model_revision,
            task_results=[combined_results],
            exceptions=results.exceptions,
        )

    if isinstance(tasks, AbsTask):
        task = tasks
    else:
        evaluate_results = []
        exceptions = []
        tasks_tqdm = tqdm(
            tasks,
            desc="Evaluating tasks",
            disable=not show_progress_bar,
        )
        for i, task in enumerate(tasks_tqdm):
            tasks_tqdm.set_description(f"Evaluating task {task.metadata.name}")
            _res = evaluate(
                model,
                task,
                co2_tracker=co2_tracker,
                raise_error=raise_error,
                encode_kwargs=encode_kwargs,
                cache=cache,
                overwrite_strategy=overwrite_strategy,
                prediction_folder=prediction_folder,
                show_progress_bar=False,
                public_only=public_only,
                num_proc=num_proc,
                timer=timer,
            )
            evaluate_results.extend(_res.task_results)
            if _res.exceptions:
                exceptions.extend(_res.exceptions)
        return ModelResult(
            model_name=_res.model_name,
            model_revision=_res.model_revision,
            task_results=evaluate_results,
            exceptions=exceptions,
        )

    existing_results, missing_eval = _check_cache(task, meta, cache, overwrite_strategy)

    if (
        existing_results
        and not missing_eval
        and overwrite_strategy != OverwriteStrategy.ALWAYS
    ):
        # if there are no missing evals we can just return the results
        logger.info(
            f"Results for {task.metadata.name} already exist in cache. Skipping evaluation and loading results."
        )
        return ModelResult(
            model_name=model_name,
            model_revision=model_revision,
            task_results=[existing_results],
        )
    if existing_results:
        logger.info(
            f"Found existing results for {task.metadata.name}, only running missing splits (subsets): {missing_eval}"
        )

    if isinstance(model, ModelMeta):
        logger.info(
            f"Loading model {model_name} with revision {model_revision} from ModelMeta."
        )
        model = model.load_model()
        logger.info("✓ Model loaded")

    if raise_error is False:
        try:
            result = _evaluate_task(
                model=model,
                splits=missing_eval,
                task=task,
                co2_tracker=co2_tracker,
                encode_kwargs=encode_kwargs,
                prediction_folder=prediction_folder,
                public_only=public_only,
                cache=cache,
                num_proc=num_proc,
                existing_results=existing_results,
            )
        except Exception as e:
            logger.error(
                f"Error while running task {task.metadata.name} on splits {list(missing_eval.keys())}: {e}"
            )
            result = TaskError(task_name=task.metadata.name, exception=str(e))
    else:
        result = _evaluate_task(
            model=model,
            splits=missing_eval,
            task=task,
            co2_tracker=co2_tracker,
            encode_kwargs=encode_kwargs,
            prediction_folder=prediction_folder,
            public_only=public_only,
            cache=cache,
            num_proc=num_proc,
            existing_results=existing_results,
        )
    logger.info(f"✓ Finished evaluation for {task.metadata.name}")

    if isinstance(result, TaskError):
        return ModelResult(
            model_name=model_name,
            model_revision=model_revision,
            task_results=[],
            exceptions=[result],
        )

    if cache:
        cache.save_to_cache(
            result,
            meta,
            encode_kwargs=encode_kwargs,
        )

    return ModelResult(
        model_name=model_name,
        model_revision=model_revision,
        task_results=[result],
    )
