Additional Types¶

MTEB implements a variety of utility types to allow us and you to better know what a model returns. This page documents some of these types.

Encoder Input/Output types ¶

`Array = NDArray[np.floating | np.integer | np.bool_] | torch.Tensor` `module-attribute` ¶

General array type, can be a numpy array (float, int, or bool) or a torch tensor.

`Conversation = list[ConversationTurn]` `module-attribute` ¶

A conversation, consisting of a list of messages.

`BatchedInput = TextInput | CorpusInput | QueryInput | ImageInput | AudioInput | VideoInput | MultimodalInput` `module-attribute` ¶

Represents the input format accepted by the encoder for a batch of data.

The encoder can process several input types depending on the task or modality. Each type is defined as a separate structured input with its own fields.

Supported input types¶

TextInput For pure text inputs.

{"text": ["This is a sample text.", "Another text."]}

2. CorpusInput For corpus-style inputs with titles and bodies.

{"text": ["Title 1 Body 1", "Title 2 Body 2"], "title": ["Title 1", "Title 2"], "body": ["Body 1", "Body 2"]}

3. QueryInput For query–instruction pairs, typically used in retrieval or question answering tasks. Queries and instructions are combined with the model's instruction template.

{
    "text": ["Instruction: Your task is to find document for this query. Query: What is AI?", "Instruction: Your task is to find term for definition. Query: Define machine learning."],
    "query": ["What is AI?", "Define machine learning."],
    "instruction": ["Your task is find document for this query.", "Your task is to find term for definition."]
}

4. ImageInput For visual inputs consisting of images.

{"image": [PIL.Image1, PIL.Image2]}

5. MultimodalInput For combined text–image (multimodal) inputs.

{"text": ["This is a sample text."], "image": [PIL.Image1]}

`TextBatchedInput = TextInput | CorpusInput | QueryInput` `module-attribute` ¶

The input to the encoder for a batch of text data.

`QueryDatasetType = Dataset` `module-attribute` ¶

Retrieval query dataset, containing queries. Should have columns: 1. id, text, instruction (optionally) for text queries 2. id, image for image queries 3. id, audio for audio queries 4. id, video for video queries or a combination of these for multimodal queries.

`CorpusDatasetType = Dataset` `module-attribute` ¶

Retrieval corpus dataset, containing documents. Should have columns: 1. id, title (optionally), body for text corpus 2. id, image for image corpus 3. id, audio for audio corpus 4. id, video for video corpus or a combination of these for multimodal corpus.

`InstructionDatasetType = Dataset` `module-attribute` ¶

Retrieval instruction dataset, containing instructions. Should have columns query-id, instruction.

`RelevantDocumentsType = Mapping[str, Mapping[str, int]]` `module-attribute` ¶

Relevant documents for each query, mapping query IDs to a mapping of document IDs and their relevance scores. Should have columns query-id, corpus-id, score.

`TopRankedDocumentsType = Mapping[str, list[str]]` `module-attribute` ¶

Top-ranked documents for each query, mapping query IDs to a list of document IDs. Should have columns query-id, corpus-ids.

`RetrievalOutputType = dict[str, dict[str, float]]` `module-attribute` ¶

Retrieval output, containing the scores for each query-document pair.

`EncodeKwargs` ¶

Bases: TypedDict

Keyword arguments for encoding methods.

Attributes:

Name	Type	Description
`batch_size`	`NotRequired[int]`	The batch size to use for encoding.
`show_progress_bar`	`NotRequired[bool]`	Whether to show a progress bar during encoding.
`precision`	`NotRequired[str]`	Quantization embeddings settings for sentence transformers

Source code in mteb/types/_encoder_io.py

class EncodeKwargs(TypedDict):
    """Keyword arguments for encoding methods.

    Attributes:
        batch_size: The batch size to use for encoding.
        show_progress_bar: Whether to show a progress bar during encoding.
        precision: Quantization embeddings settings for sentence transformers
    """

    batch_size: NotRequired[int]
    show_progress_bar: NotRequired[bool]
    precision: NotRequired[str]

`PromptType` ¶

Bases: HelpfulStrEnum

The type of prompt used in the input for retrieval models. Used to differentiate between queries and documents.

Attributes:

Name	Type	Description
`query`		A prompt that is a query.
`document`		A prompt that is a document.

Source code in mteb/types/_encoder_io.py

class PromptType(HelpfulStrEnum):
    """The type of prompt used in the input for retrieval models. Used to differentiate between queries and documents.

    Attributes:
        query: A prompt that is a query.
        document: A prompt that is a document.
    """

    query = "query"
    document = "document"

`ConversationTurn` ¶

Bases: TypedDict

A conversation, consisting of a list of messages.

Attributes:

Name	Type	Description
`role`	`str`	The role of the message sender.
`content`	`str`	The content of the message.

Source code in mteb/types/_encoder_io.py

class ConversationTurn(TypedDict):
    """A conversation, consisting of a list of messages.

    Attributes:
        role: The role of the message sender.
        content: The content of the message.
    """

    role: str
    content: str

`TextInput` ¶

Bases: TypedDict

The input to the encoder for text.

Attributes:

Name	Type	Description
`text`	`list[str]`	The text to encode. Can be a list of texts or a list of lists of texts.

Source code in mteb/types/_encoder_io.py

class TextInput(TypedDict):
    """The input to the encoder for text.

    Attributes:
        text: The text to encode. Can be a list of texts or a list of lists of texts.
    """

    text: list[str]

`CorpusInput` ¶

Bases: TextInput

The input to the encoder for retrieval corpus.

Attributes:

Name	Type	Description
`title`	`list[str]`	The title of the text to encode. Can be a list of titles or a list of lists of titles.
`body`	`list[str]`	The body of the text to encode. Can be a list of bodies or a list of lists of bodies.

Source code in mteb/types/_encoder_io.py

class CorpusInput(TextInput):
    """The input to the encoder for retrieval corpus.

    Attributes:
        title: The title of the text to encode. Can be a list of titles or a
            list of lists of titles.
        body: The body of the text to encode. Can be a list of bodies or a
            list of lists of bodies.
    """

    title: list[str]
    body: list[str]

`QueryInput` ¶

Bases: TextInput

The input to the encoder for queries.

Attributes:

Name	Type	Description
`query`	`list[str]`	The query to encode. Can be a list of queries or a list of lists of queries.
`conversation`	`NotRequired[list[Conversation]]`	Optional. A list of conversations, each conversation is a list of messages.
`instruction`	`NotRequired[list[str]]`	Optional. A list of instructions to encode.

Source code in mteb/types/_encoder_io.py

class QueryInput(TextInput):
    """The input to the encoder for queries.

    Attributes:
        query: The query to encode. Can be a list of queries or a list of lists of queries.
        conversation: Optional. A list of conversations, each conversation is a list of messages.
        instruction: Optional. A list of instructions to encode.
    """

    query: list[str]
    conversation: NotRequired[list[Conversation]]
    instruction: NotRequired[list[str]]

`ImageInput` ¶

Bases: TypedDict

The input to the encoder for images.

Attributes:

Name	Type	Description
`image`	`list[Image]`	The image to encode. Can be a list of images or a list of lists of images.

Source code in mteb/types/_encoder_io.py

class ImageInput(TypedDict):
    """The input to the encoder for images.

    Attributes:
        image: The image to encode. Can be a list of images or a list of lists of images.
    """

    image: list[Image.Image]

`AudioInputItem` ¶

Bases: TypedDict

An audio item for the AudioInput.

Dataset based on datasets.Audio will be converted to this format during encoding.

Attributes:

Name	Type	Description
`array`	`NDArray[floating]`	The audio array as bytes.
`sampling_rate`	`int`	The sampling rate of the audio.

Source code in mteb/types/_encoder_io.py

class AudioInputItem(TypedDict):
    """An audio item for the AudioInput.

    Dataset based on `datasets.Audio` will be converted to this format during encoding.

    Attributes:
        array: The audio array as bytes.
        sampling_rate: The sampling rate of the audio.
    """

    array: npt.NDArray[np.floating]
    sampling_rate: int

`AudioInput` ¶

Bases: TypedDict

The input to the encoder for audio.

Attributes:

Name	Type	Description
`audio`	`list[AudioInputItem]`	The audio to encode. Can be a list of audio files or a list of lists of audio files.

Source code in mteb/types/_encoder_io.py

class AudioInput(TypedDict):
    """The input to the encoder for audio.

    Attributes:
        audio: The audio to encode. Can be a list of audio files or a list of lists of audio files.
    """

    audio: list[AudioInputItem]

`VideoInput` ¶

Bases: TypedDict

The input to the encoder for video frames. Audio is currently included in the AudioInput.

Attributes:

Name	Type	Description
`video`	`Tensor`	The video frames as Tensor.

Source code in mteb/types/_encoder_io.py

class VideoInput(TypedDict):
    """The input to the encoder for video frames. Audio is currently included in the AudioInput.

    Attributes:
        video: The video frames as Tensor.
    """

    video: torch.Tensor

`MultimodalInput` ¶

Bases: TextInput, CorpusInput, QueryInput, ImageInput, AudioInput, VideoInput

The input to the encoder for multimodal data.

Source code in mteb/types/_encoder_io.py

class MultimodalInput(  # type: ignore[misc]
    TextInput, CorpusInput, QueryInput, ImageInput, AudioInput, VideoInput
):
    """The input to the encoder for multimodal data."""

    pass

`OutputDType` ¶

Bases: HelpfulStrEnum

Enum for valid compression levels.

Used by the CompressionWrapper class and specified by models to indicate the dtypes of output embeddings they support internally.

Source code in mteb/types/_encoder_io.py

class OutputDType(HelpfulStrEnum):
    """Enum for valid compression levels.

    Used by the CompressionWrapper class and specified by models to indicate the dtypes of output embeddings they
    support internally.
    """

    FLOAT16 = "float16"
    BF16 = "bfloat16"
    INT8 = "int8"
    INT4 = "int4"
    UINT8 = "uint8"
    UINT4 = "uint4"
    BINARY = "binary"
    FLOAT8_E4M3FN = "float8_e4m3fn"
    FLOAT8_E5M2 = "float8_e5m2"
    FLOAT8_E8M0FNU = "float8_e8m0fnu"
    FLOAT8_E4M3FNUZ = "float8_e4m3fnuz"
    FLOAT8_E5M2FNUZ = "float8_e5m2fnuz"

    def get_dtype(self) -> torch.dtype:
        """Returns the PyTorch dtype that matches the enum.

        Output types that are not natively supported by PyTorch like 4-bit integers require specific mapping to the
        desired dtype.
        """
        if self == OutputDType.UINT4:
            return torch.uint8
        elif self == OutputDType.INT4:
            return torch.int8
        elif self == OutputDType.BINARY:
            return torch.bool
        return cast("torch.dtype", getattr(torch, self.value))

`get_dtype()` ¶

Returns the PyTorch dtype that matches the enum.

Output types that are not natively supported by PyTorch like 4-bit integers require specific mapping to the desired dtype.

Source code in mteb/types/_encoder_io.py

def get_dtype(self) -> torch.dtype:
    """Returns the PyTorch dtype that matches the enum.

    Output types that are not natively supported by PyTorch like 4-bit integers require specific mapping to the
    desired dtype.
    """
    if self == OutputDType.UINT4:
        return torch.uint8
    elif self == OutputDType.INT4:
        return torch.int8
    elif self == OutputDType.BINARY:
        return torch.bool
    return cast("torch.dtype", getattr(torch, self.value))

Metadata types ¶

`ISOLanguageScript = str` `module-attribute` ¶

A string representing the language and script. Language is denoted as a 3-letter ISO 639-3 language code and the script is denoted by a 4-letter ISO 15924 script code (e.g. "eng-Latn").

`ISOLanguage = str` `module-attribute` ¶

A string representing the language. Language is denoted as a 3-letter ISO 639-3 language code (e.g. "eng").

`ISOScript = str` `module-attribute` ¶

A string representing the script. The script is denoted by a 4-letter ISO 15924 script code (e.g. "Latn").

`Languages = list[ISOLanguageScript] | Mapping[HFSubset, list[ISOLanguageScript]]` `module-attribute` ¶

A list of languages or a mapping from HFSubset to a list of languages. E.g. ["eng-Latn", "deu-Latn"] or {"en-de": ["eng-Latn", "deu-Latn"], "fr-it": ["fra-Latn", "ita-Latn"]}.

`Licenses = Literal['not specified', 'mit', 'cc-by-2.0', 'cc-by-3.0', 'cc-by-4.0', 'cc-by-sa-3.0', 'cc-by-sa-4.0', 'cc-by-nc-3.0', 'cc-by-nc-4.0', 'cc-by-nc-sa-3.0', 'cc-by-nc-sa-4.0', 'cc-by-nc-nd-4.0', 'cc-by-nd-4.0', 'openrail', 'openrail++', 'odc-by', 'afl-3.0', 'apache-2.0', 'cc-by-nd-2.1-jp', 'cc0-1.0', 'bsd-3-clause', 'gpl-3.0', 'lgpl', 'lgpl-3.0', 'cdla-sharing-1.0', 'mpl-2.0', 'msr-la-nc', 'multiple', 'gemma', 'eupl-1.2']` `module-attribute` ¶

The different licenses that a dataset or model can have. This list can be extended as needed.

`ModelName = str` `module-attribute` ¶

The name of a model, typically as found on HuggingFace e.g. sentence-transformers/all-MiniLM-L6-v2.

`Revision = str` `module-attribute` ¶

The revision of a model, typically a git commit hash. For APIs this can be a version string e.g. 1.

`Modalities = Literal['text', 'image', 'audio', 'video']` `module-attribute` ¶

The different modalities that a model can support.

Results types ¶

`HFSubset = str` `module-attribute` ¶

The name of a HuggingFace dataset subset, e.g. 'en-de', 'en', 'default' (default is used when there is no subset).

`SplitName = str` `module-attribute` ¶

The name of a data split, e.g. 'test', 'validation', 'train'.

`Score = Any` `module-attribute` ¶

A score value, could e.g. be accuracy. Normally it is a float or int, but it can take on any value. Should be json serializable.

`ScoresDict = Mapping[str, Score]` `module-attribute` ¶

A dictionary of scores, typically also include metadata, e.g {'main_score': 0.5, 'accuracy': 0.5, 'f1': 0.6, 'hf_subset': 'en-de', 'languages': ['eng-Latn', 'deu-Latn']}

`RetrievalEvaluationResult` ¶

Bases: NamedTuple

Holds the results of retrieval evaluation metrics.

Source code in mteb/types/_result.py

class RetrievalEvaluationResult(NamedTuple):
    """Holds the results of retrieval evaluation metrics."""

    all_scores: dict[str, dict[str, float]]
    ndcg: dict[str, float]
    map: dict[str, float]
    recall: dict[str, float]
    precision: dict[str, float]
    naucs: dict[str, float]
    mrr: dict[str, float]
    naucs_mrr: dict[str, float]
    hit_rate: dict[str, float]

`SubmitResultsResponse` ¶

Bases: TypedDict

Metadata returned by ResultCache.submit_results().

Source code in mteb/types/_result.py

class SubmitResultsResponse(TypedDict):
    """Metadata returned by ResultCache.submit_results()."""

    status: Literal["no_changes", "ready_for_submission", "pr_created"]
    models_submitted: list[tuple[str | None, str | None]]
    result_count: int
    path: NotRequired[str]
    pr_url: NotRequired[str]
    pr_number: NotRequired[int]
    fork_url: NotRequired[str | None]
    branch_name: NotRequired[str | None]

Statistics types ¶

`SplitDescriptiveStatistics` ¶

Bases: TypedDict

Base class for descriptive statistics for the subset.

Every per-task descriptive-stats TypedDict (Classification, Retrieval, STS, …) inherits from this. The per-task fields are added by each subclass; the multilingual subset wrapper is provided separately by :class:DescriptiveStatistics.

Source code in mteb/types/statistics.py

class SplitDescriptiveStatistics(TypedDict):
    """Base class for descriptive statistics for the subset.

    Every per-task descriptive-stats TypedDict (Classification, Retrieval,
    STS, …) inherits from this. The per-task fields are added by each
    subclass; the multilingual subset wrapper is provided separately by
    :class:`DescriptiveStatistics`.
    """

    pass

`DescriptiveStatistics` ¶

Bases: SplitDescriptiveStatistics

Multilingual wrapper for per-task descriptive statistics.

The concrete per-task *Statistics classes below mirror this shape with a tighter hf_subset_descriptive_stats value type (e.g. dict[HFSubset, ClassificationDescriptiveStatistics]). They don't inherit from this class — TypedDict forbids redeclaring fields across extension — but they're structurally assignable to it.

Attributes:

Name	Type	Description
`hf_subset_descriptive_stats`	`NotRequired[dict[HFSubset, SplitDescriptiveStatistics]]`	HFSubset descriptive statistics (only for multilingual datasets)

Source code in mteb/types/statistics.py

class DescriptiveStatistics(SplitDescriptiveStatistics):
    """Multilingual wrapper for per-task descriptive statistics.

    The concrete per-task ``*Statistics`` classes below mirror this shape with
    a tighter ``hf_subset_descriptive_stats`` value type (e.g.
    ``dict[HFSubset, ClassificationDescriptiveStatistics]``). They don't
    inherit from this class — TypedDict forbids redeclaring fields across
    extension — but they're structurally assignable to it.

    Attributes:
        hf_subset_descriptive_stats: HFSubset descriptive statistics (only for multilingual datasets)
    """

    hf_subset_descriptive_stats: NotRequired[dict[HFSubset, SplitDescriptiveStatistics]]

`TextStatistics` ¶

Bases: TypedDict

Class for descriptive statistics for texts.

Attributes:

Name	Type	Description
`total_text_length`	`int`	Total length of all texts
`min_text_length`	`int`	Minimum length of text
`average_text_length`	`float`	Average length of text
`max_text_length`	`int`	Maximum length of text
`unique_texts`	`int`	Number of unique texts

Source code in mteb/types/statistics.py

class TextStatistics(TypedDict):
    """Class for descriptive statistics for texts.

    Attributes:
        total_text_length: Total length of all texts
        min_text_length: Minimum length of text
        average_text_length: Average length of text
        max_text_length: Maximum length of text
        unique_texts: Number of unique texts
    """

    total_text_length: int
    min_text_length: int
    average_text_length: float
    max_text_length: int
    unique_texts: int

`ImageStatistics` ¶

Bases: TypedDict

Class for descriptive statistics for images.

Attributes:

Name	Type	Description
`min_image_width`	`float`	Minimum width of images
`average_image_width`	`float`	Average width of images
`max_image_width`	`float`	Maximum width of images
`min_image_height`	`float`	Minimum height of images
`average_image_height`	`float`	Average height of images
`max_image_height`	`float`	Maximum height of images
`unique_images`	`int`	Number of unique images

Source code in mteb/types/statistics.py

class ImageStatistics(TypedDict):
    """Class for descriptive statistics for images.

    Attributes:
        min_image_width: Minimum width of images
        average_image_width: Average width of images
        max_image_width: Maximum width of images

        min_image_height: Minimum height of images
        average_image_height: Average height of images
        max_image_height: Maximum height of images

        unique_images: Number of unique images
    """

    min_image_width: float
    average_image_width: float
    max_image_width: float

    min_image_height: float
    average_image_height: float
    max_image_height: float

    unique_images: int

`AudioStatistics` ¶

Bases: TypedDict

Class for descriptive statistics for audio.

Attributes:

Name	Type	Description
`total_duration_seconds`	`float`	Total length of all audio clips in total frames
`min_duration_seconds`	`float`	Minimum length of audio clip in seconds
`average_duration_seconds`	`float`	Average length of audio clip in seconds
`max_duration_seconds`	`float`	Maximum length of audio clip in seconds
`unique_audios`	`int`	Number of unique audio clips
`average_sampling_rate`	`float`	Average sampling rate
`sampling_rates`	`dict[int, int]`	Dict of unique sampling rates and their frequencies

Source code in mteb/types/statistics.py

class AudioStatistics(TypedDict):
    """Class for descriptive statistics for audio.

    Attributes:
        total_duration_seconds: Total length of all audio clips in total frames
        min_duration_seconds: Minimum length of audio clip in seconds
        average_duration_seconds: Average length of audio clip in seconds
        max_duration_seconds: Maximum length of audio clip in seconds
        unique_audios: Number of unique audio clips
        average_sampling_rate: Average sampling rate
        sampling_rates: Dict of unique sampling rates and their frequencies
    """

    total_duration_seconds: float

    min_duration_seconds: float
    average_duration_seconds: float
    max_duration_seconds: float

    unique_audios: int

    average_sampling_rate: float
    sampling_rates: dict[int, int]

`VideoStatistics` ¶

Bases: TypedDict

Class for descriptive statistics for video.

Attributes:

Name	Type	Description
`total_duration_seconds`	`float \| None`	Total duration of all video clips in seconds
`total_frames`	`int \| None`	Total number of frames across all video clips
`min_width`	`int \| None`	Minimum width of video frames
`average_width`	`float \| None`	Average width of video frames
`max_width`	`int \| None`	Maximum width of video frames
`min_height`	`int \| None`	Minimum height of video frames
`average_height`	`float \| None`	Average height of video frames
`max_height`	`int \| None`	Maximum height of video frames
`min_duration_seconds`	`float \| None`	Minimum duration of a video clip in seconds
`average_duration_seconds`	`float \| None`	Average duration of a video clip in seconds
`max_duration_seconds`	`float \| None`	Maximum duration of a video clip in seconds
`unique_videos`	`int`	Number of unique video clips
`average_fps`	`float \| None`	Average frames per second across all video clips
`fps`	`dict[int, int]`	Dict of unique (rounded) fps values and their frequencies
`min_resolution`	`tuple[int, int] \| None`	Resolution (width, height) with the smallest area
`average_resolution`	`tuple[float, float] \| None`	Average resolution (average_width, average_height)
`max_resolution`	`tuple[int, int] \| None`	Resolution (width, height) with the largest area
`resolutions`	`dict[str, int]`	Dict mapping "WxH" resolution strings to their frequency counts

Source code in mteb/types/statistics.py

class VideoStatistics(TypedDict):
    """Class for descriptive statistics for video.

    Attributes:
        total_duration_seconds: Total duration of all video clips in seconds
        total_frames: Total number of frames across all video clips

        min_width: Minimum width of video frames
        average_width: Average width of video frames
        max_width: Maximum width of video frames

        min_height: Minimum height of video frames
        average_height: Average height of video frames
        max_height: Maximum height of video frames

        min_duration_seconds: Minimum duration of a video clip in seconds
        average_duration_seconds: Average duration of a video clip in seconds
        max_duration_seconds: Maximum duration of a video clip in seconds

        unique_videos: Number of unique video clips

        average_fps: Average frames per second across all video clips
        fps: Dict of unique (rounded) fps values and their frequencies

        min_resolution: Resolution (width, height) with the smallest area
        average_resolution: Average resolution (average_width, average_height)
        max_resolution: Resolution (width, height) with the largest area
        resolutions: Dict mapping "WxH" resolution strings to their frequency counts
    """

    total_duration_seconds: float | None
    total_frames: int | None

    min_width: int | None
    average_width: float | None
    max_width: int | None

    min_height: int | None
    average_height: float | None
    max_height: int | None

    min_duration_seconds: float | None
    average_duration_seconds: float | None
    max_duration_seconds: float | None

    unique_videos: int

    average_fps: float | None
    fps: dict[int, int]

    min_resolution: tuple[int, int] | None
    average_resolution: tuple[float, float] | None
    max_resolution: tuple[int, int] | None
    resolutions: dict[str, int]

`LabelStatistics` ¶

Bases: TypedDict

Class for descriptive statistics for texts.

Attributes:

Name	Type	Description
`min_labels_per_text`	`int`	Minimum number of labels per text
`average_label_per_text`	`float`	Average number of labels per text
`max_labels_per_text`	`int`	Maximum number of labels per text
`unique_labels`	`int`	Number of unique labels
`labels`	`dict[str, dict[str, int]]`	dict of label frequencies

Source code in mteb/types/statistics.py

class LabelStatistics(TypedDict):
    """Class for descriptive statistics for texts.

    Attributes:
        min_labels_per_text: Minimum number of labels per text
        average_label_per_text: Average number of labels per text
        max_labels_per_text: Maximum number of labels per text

        unique_labels: Number of unique labels
        labels: dict of label frequencies
    """

    min_labels_per_text: int
    average_label_per_text: float
    max_labels_per_text: int

    unique_labels: int
    labels: dict[str, dict[str, int]]

`ScoreStatistics` ¶

Bases: TypedDict

Class for descriptive statistics for texts.

Attributes:

Name	Type	Description
`min_score`	`int \| float`	Minimum score
`avg_score`	`float`	Average score
`max_score`	`int \| float`	Maximum score

Source code in mteb/types/statistics.py

class ScoreStatistics(TypedDict):
    """Class for descriptive statistics for texts.

    Attributes:
        min_score: Minimum score
        avg_score: Average score
        max_score: Maximum score
    """

    min_score: int | float
    avg_score: float
    max_score: int | float

`TopRankedStatistics` ¶

Bases: TypedDict

Statistics for top ranked documents in a retrieval task.

Attributes:

Name	Type	Description
`num_top_ranked`	`int`	Total number of top ranked documents across all queries.
`min_top_ranked_per_query`	`int`	Minimum number of top ranked documents for any query.
`average_top_ranked_per_query`	`float`	Average number of top ranked documents per query.
`max_top_ranked_per_query`	`int`	Maximum number of top ranked documents for any query.

Source code in mteb/types/statistics.py

class TopRankedStatistics(TypedDict):
    """Statistics for top ranked documents in a retrieval task.

    Attributes:
        num_top_ranked: Total number of top ranked documents across all queries.
        min_top_ranked_per_query: Minimum number of top ranked documents for any query.
        average_top_ranked_per_query: Average number of top ranked documents per query.
        max_top_ranked_per_query: Maximum number of top ranked documents for any query.
    """

    num_top_ranked: int
    min_top_ranked_per_query: int
    average_top_ranked_per_query: float
    max_top_ranked_per_query: int

`RelevantDocsStatistics` ¶

Bases: TypedDict

Statistics for relevant documents in a retrieval task.

Attributes:

Name	Type	Description
`num_relevant_docs`	`int`	Total number of relevant documents across all queries.
`min_relevant_docs_per_query`	`int`	Minimum number of relevant documents for any query.
`average_relevant_docs_per_query`	`float`	Average number of relevant documents per query.
`max_relevant_docs_per_query`	`float`	Maximum number of relevant documents for any query.
`unique_relevant_docs`	`int`	Number of unique relevant documents across all queries.

Source code in mteb/types/statistics.py

class RelevantDocsStatistics(TypedDict):
    """Statistics for relevant documents in a retrieval task.

    Attributes:
        num_relevant_docs: Total number of relevant documents across all queries.
        min_relevant_docs_per_query: Minimum number of relevant documents for any query.
        average_relevant_docs_per_query: Average number of relevant documents per query.
        max_relevant_docs_per_query: Maximum number of relevant documents for any query.
        unique_relevant_docs: Number of unique relevant documents across all queries.
    """

    num_relevant_docs: int
    min_relevant_docs_per_query: int
    average_relevant_docs_per_query: float
    max_relevant_docs_per_query: float
    unique_relevant_docs: int

`SingleInputModalityStatistics` ¶

Bases: TypedDict

Per-modality statistics for a single-input dataset (Classification, Regression, …).

Fields are None when the corresponding modality is absent from the task.

Attributes:

Name	Type	Description
`text_statistics`	`TextStatistics \| None`	Statistics for the text column.
`image_statistics`	`ImageStatistics \| None`	Statistics for the image column.
`audio_statistics`	`AudioStatistics \| None`	Statistics for the audio column.
`video_statistics`	`VideoStatistics \| None`	Statistics for the video column.

Source code in mteb/types/statistics.py

class SingleInputModalityStatistics(TypedDict):
    """Per-modality statistics for a single-input dataset (Classification, Regression, …).

    Fields are ``None`` when the corresponding modality is absent from the task.

    Attributes:
        text_statistics: Statistics for the text column.
        image_statistics: Statistics for the image column.
        audio_statistics: Statistics for the audio column.
        video_statistics: Statistics for the video column.
    """

    text_statistics: TextStatistics | None
    image_statistics: ImageStatistics | None
    audio_statistics: AudioStatistics | None
    video_statistics: VideoStatistics | None

`PairModalityStatistics` ¶

Bases: TypedDict

Per-modality statistics for a paired dataset (STS, PairClassification, …).

Each modality has a *1_statistics field for the first item in the pair and a *2_statistics field for the second item. Fields are None when the corresponding modality is absent from the task.

Attributes:

Name	Type	Description
`text1_statistics`	`TextStatistics \| None`	Text statistics for the first item.
`text2_statistics`	`TextStatistics \| None`	Text statistics for the second item.
`image1_statistics`	`ImageStatistics \| None`	Image statistics for the first item.
`image2_statistics`	`ImageStatistics \| None`	Image statistics for the second item.
`audio1_statistics`	`AudioStatistics \| None`	Audio statistics for the first item.
`audio2_statistics`	`AudioStatistics \| None`	Audio statistics for the second item.
`video1_statistics`	`VideoStatistics \| None`	Video statistics for the first item.
`video2_statistics`	`VideoStatistics \| None`	Video statistics for the second item.
`unique_pairs`	`int`	Number of unique (item1, item2) pairs.

Source code in mteb/types/statistics.py

class PairModalityStatistics(TypedDict):
    """Per-modality statistics for a paired dataset (STS, PairClassification, …).

    Each modality has a ``*1_statistics`` field for the first item in the pair
    and a ``*2_statistics`` field for the second item.  Fields are ``None`` when
    the corresponding modality is absent from the task.

    Attributes:
        text1_statistics: Text statistics for the first item.
        text2_statistics: Text statistics for the second item.
        image1_statistics: Image statistics for the first item.
        image2_statistics: Image statistics for the second item.
        audio1_statistics: Audio statistics for the first item.
        audio2_statistics: Audio statistics for the second item.
        video1_statistics: Video statistics for the first item.
        video2_statistics: Video statistics for the second item.
        unique_pairs: Number of unique (item1, item2) pairs.
    """

    text1_statistics: TextStatistics | None
    text2_statistics: TextStatistics | None
    image1_statistics: ImageStatistics | None
    image2_statistics: ImageStatistics | None
    audio1_statistics: AudioStatistics | None
    audio2_statistics: AudioStatistics | None
    video1_statistics: VideoStatistics | None
    video2_statistics: VideoStatistics | None
    unique_pairs: int

`AnySTSDescriptiveStatistics` ¶

Bases: SplitDescriptiveStatistics

Descriptive statistics for STS.

Attributes:

Name	Type	Description
`num_samples`	`int`	number of samples in the dataset.
`number_of_characters`	`int \| None`	Total number of symbols in the dataset.
`unique_pairs`	`int \| None`	Number of unique pairs
`text1_statistics`	`TextStatistics \| None`	Statistics for sentence1
`text2_statistics`	`TextStatistics \| None`	Statistics for sentence2
`image1_statistics`	`ImageStatistics \| None`	Statistics for image1
`image2_statistics`	`ImageStatistics \| None`	Statistics for image2
`audio1_statistics`	`AudioStatistics \| None`	Statistics for audio1
`audio2_statistics`	`AudioStatistics \| None`	Statistics for audio2
`video1_statistics`	`VideoStatistics \| None`	Statistics for video1
`video2_statistics`	`VideoStatistics \| None`	Statistics for video2
`label_statistics`	`ScoreStatistics`	Statistics for labels

Source code in mteb/types/statistics.py

class AnySTSDescriptiveStatistics(SplitDescriptiveStatistics):
    """Descriptive statistics for STS.

    Attributes:
        num_samples: number of samples in the dataset.
        number_of_characters: Total number of symbols in the dataset.
        unique_pairs: Number of unique pairs

        text1_statistics: Statistics for sentence1
        text2_statistics: Statistics for sentence2

        image1_statistics: Statistics for image1
        image2_statistics: Statistics for image2

        audio1_statistics: Statistics for audio1
        audio2_statistics: Statistics for audio2

        video1_statistics: Statistics for video1
        video2_statistics: Statistics for video2

        label_statistics: Statistics for labels
    """

    num_samples: int
    number_of_characters: int | None
    unique_pairs: int | None

    text1_statistics: TextStatistics | None
    text2_statistics: TextStatistics | None

    image1_statistics: ImageStatistics | None
    image2_statistics: ImageStatistics | None

    audio1_statistics: AudioStatistics | None
    audio2_statistics: AudioStatistics | None

    video1_statistics: VideoStatistics | None
    video2_statistics: VideoStatistics | None

    label_statistics: ScoreStatistics

`BitextDescriptiveStatistics` ¶

Bases: SplitDescriptiveStatistics

Descriptive statistics for Bitext.

Attributes:

Name	Type	Description
`num_samples`	`int`	number of samples in the dataset.
`number_of_characters`	`int`	Total number of symbols in the dataset.
`unique_pairs`	`int`	Number of duplicate pairs
`sentence1_statistics`	`TextStatistics`	Statistics for sentence1
`sentence2_statistics`	`TextStatistics`	Statistics for sentence2

Source code in mteb/types/statistics.py

class BitextDescriptiveStatistics(SplitDescriptiveStatistics):
    """Descriptive statistics for Bitext.

    Attributes:
        num_samples: number of samples in the dataset.
        number_of_characters: Total number of symbols in the dataset.
        unique_pairs: Number of duplicate pairs

        sentence1_statistics: Statistics for sentence1
        sentence2_statistics: Statistics for sentence2
    """

    num_samples: int
    number_of_characters: int
    unique_pairs: int

    sentence1_statistics: TextStatistics
    sentence2_statistics: TextStatistics

`ClassificationDescriptiveStatistics` ¶

Bases: SplitDescriptiveStatistics

Descriptive statistics for Classification.

Attributes:

Name	Type	Description
`num_samples`	`int`	number of samples in the dataset.
`samples_in_train`	`int \| None`	Number of unique test samples (across all input modalities) that also appear in the train split. None when evaluated on the train split itself.
`text_statistics`	`TextStatistics \| None`	Statistics for text
`image_statistics`	`ImageStatistics \| None`	Statistics for images
`audio_statistics`	`AudioStatistics \| None`	Statistics for audio
`video_statistics`	`VideoStatistics \| None`	Statistics for video
`label_statistics`	`LabelStatistics`	Statistics for labels

Source code in mteb/types/statistics.py

class ClassificationDescriptiveStatistics(SplitDescriptiveStatistics):
    """Descriptive statistics for Classification.

    Attributes:
        num_samples: number of samples in the dataset.
        samples_in_train: Number of unique test samples (across all input modalities)
            that also appear in the train split. None when evaluated on the train split itself.

        text_statistics: Statistics for text
        image_statistics: Statistics for images
        audio_statistics: Statistics for audio
        video_statistics: Statistics for video
        label_statistics: Statistics for labels
    """

    num_samples: int
    samples_in_train: int | None

    text_statistics: TextStatistics | None
    image_statistics: ImageStatistics | None
    audio_statistics: AudioStatistics | None
    video_statistics: VideoStatistics | None
    label_statistics: LabelStatistics

`RegressionDescriptiveStatistics` ¶

Bases: SplitDescriptiveStatistics

Descriptive statistics for Regression.

Attributes:

Name	Type	Description
`num_samples`	`int`	number of samples in the dataset.
`samples_in_train`	`int \| None`	Number of texts in the train split
`text_statistics`	`TextStatistics \| None`	Statistics of texts
`image_statistics`	`ImageStatistics \| None`	Statistics of images
`audio_statistics`	`AudioStatistics \| None`	Statistics of audio
`video_statistics`	`VideoStatistics \| None`	Statistics of video
`values_statistics`	`ScoreStatistics`	Statistics of values

Source code in mteb/types/statistics.py

class RegressionDescriptiveStatistics(SplitDescriptiveStatistics):
    """Descriptive statistics for Regression.

    Attributes:
        num_samples: number of samples in the dataset.
        samples_in_train: Number of texts in the train split

        text_statistics: Statistics of texts
        image_statistics: Statistics of images
        audio_statistics: Statistics of audio
        video_statistics: Statistics of video

        values_statistics: Statistics of values
    """

    num_samples: int
    samples_in_train: int | None

    text_statistics: TextStatistics | None
    image_statistics: ImageStatistics | None
    audio_statistics: AudioStatistics | None
    video_statistics: VideoStatistics | None
    values_statistics: ScoreStatistics

`ClusteringDescriptiveStatistics` ¶

Bases: SplitDescriptiveStatistics

Descriptive statistics for Clustering (legacy AbsTaskClusteringLegacy).

Attributes:

Name	Type	Description
`num_samples`	`int`	number of samples in the dataset.
`text_statistics`	`TextStatistics \| None`	Statistics for text
`image_statistics`	`ImageStatistics \| None`	Statistics for images
`audio_statistics`	`AudioStatistics \| None`	Statistics for audio
`video_statistics`	`VideoStatistics \| None`	Statistics for video
`label_statistics`	`LabelStatistics`	Statistics for labels

Source code in mteb/types/statistics.py

class ClusteringDescriptiveStatistics(SplitDescriptiveStatistics):
    """Descriptive statistics for Clustering (legacy AbsTaskClusteringLegacy).

    Attributes:
        num_samples: number of samples in the dataset.

        text_statistics: Statistics for text
        image_statistics: Statistics for images
        audio_statistics: Statistics for audio
        video_statistics: Statistics for video
        label_statistics: Statistics for labels
    """

    num_samples: int

    text_statistics: TextStatistics | None
    image_statistics: ImageStatistics | None
    audio_statistics: AudioStatistics | None
    video_statistics: VideoStatistics | None
    label_statistics: LabelStatistics

`ClusteringFastDescriptiveStatistics` ¶

Bases: SplitDescriptiveStatistics

Descriptive statistics for ClusteringFast.

Attributes:

Name	Type	Description
`num_samples`	`int`	number of samples in the dataset.
`text_statistics`	`TextStatistics \| None`	Statistics for text
`image_statistics`	`ImageStatistics \| None`	Statistics for images
`audio_statistics`	`AudioStatistics \| None`	Statistics for audio
`video_statistics`	`VideoStatistics \| None`	Statistics for video
`labels_statistics`	`LabelStatistics`	Statistics for labels

Source code in mteb/types/statistics.py

class ClusteringFastDescriptiveStatistics(SplitDescriptiveStatistics):
    """Descriptive statistics for ClusteringFast.

    Attributes:
        num_samples: number of samples in the dataset.

        text_statistics: Statistics for text
        image_statistics: Statistics for images
        audio_statistics: Statistics for audio
        video_statistics: Statistics for video
        labels_statistics: Statistics for labels
    """

    num_samples: int

    text_statistics: TextStatistics | None
    image_statistics: ImageStatistics | None
    audio_statistics: AudioStatistics | None
    video_statistics: VideoStatistics | None
    labels_statistics: LabelStatistics

`PairClassificationDescriptiveStatistics` ¶

Bases: SplitDescriptiveStatistics

Descriptive statistics for PairClassification.

Attributes:

Name	Type	Description
`num_samples`	`int`	number of samples in the dataset.
`number_of_characters`	`int \| None`	Total number of symbols in the dataset.
`unique_pairs`	`int \| None`	Number of unique pairs
`text1_statistics`	`TextStatistics \| None`	Statistics for sentence1
`image1_statistics`	`ImageStatistics \| None`	Statistics for image1
`audio1_statistics`	`AudioStatistics \| None`	Statistics for audio1
`text2_statistics`	`TextStatistics \| None`	Statistics for sentence2
`image2_statistics`	`ImageStatistics \| None`	Statistics for image2
`audio2_statistics`	`AudioStatistics \| None`	Statistics for audio2
`labels_statistics`	`LabelStatistics`	Statistics for labels

Source code in mteb/types/statistics.py

class PairClassificationDescriptiveStatistics(SplitDescriptiveStatistics):
    """Descriptive statistics for PairClassification.

    Attributes:
        num_samples: number of samples in the dataset.
        number_of_characters: Total number of symbols in the dataset.
        unique_pairs: Number of unique pairs

        text1_statistics: Statistics for sentence1
        image1_statistics: Statistics for image1
        audio1_statistics: Statistics for audio1

        text2_statistics: Statistics for sentence2
        image2_statistics: Statistics for image2
        audio2_statistics: Statistics for audio2

        labels_statistics: Statistics for labels
    """

    num_samples: int
    number_of_characters: int | None
    unique_pairs: int | None

    text1_statistics: TextStatistics | None
    image1_statistics: ImageStatistics | None
    audio1_statistics: AudioStatistics | None
    video1_statistics: VideoStatistics | None
    text2_statistics: TextStatistics | None
    image2_statistics: ImageStatistics | None
    audio2_statistics: AudioStatistics | None
    video2_statistics: VideoStatistics | None
    labels_statistics: LabelStatistics

`ZeroShotClassificationDescriptiveStatistics` ¶

Bases: SplitDescriptiveStatistics

Descriptive statistics for ZeroShotClassification.

Attributes:

Name	Type	Description
`num_samples`	`int`	number of samples in the dataset.
`text_statistics`	`TextStatistics \| None`	Statistics for texts
`image_statistics`	`ImageStatistics \| None`	Statistics for images
`audio_statistics`	`AudioStatistics \| None`	Statistics for audio
`video_statistics`	`VideoStatistics \| None`	Statistics for video
`label_statistics`	`LabelStatistics`	Statistics for dataset labels
`candidates_labels_text_statistics`	`TextStatistics`	Statistics for candidate labels text

Source code in mteb/types/statistics.py

class ZeroShotClassificationDescriptiveStatistics(SplitDescriptiveStatistics):
    """Descriptive statistics for ZeroShotClassification.

    Attributes:
        num_samples: number of samples in the dataset.

        text_statistics: Statistics for texts
        image_statistics: Statistics for images
        audio_statistics: Statistics for audio
        video_statistics: Statistics for video
        label_statistics: Statistics for dataset labels

        candidates_labels_text_statistics: Statistics for candidate labels text
    """

    num_samples: int

    text_statistics: TextStatistics | None
    image_statistics: ImageStatistics | None
    audio_statistics: AudioStatistics | None
    video_statistics: VideoStatistics | None
    label_statistics: LabelStatistics
    candidates_labels_text_statistics: TextStatistics

`RetrievalDescriptiveStatistics` ¶

Bases: SplitDescriptiveStatistics

Descriptive statistics for Retrieval.

Attributes:

Name	Type	Description
`num_samples`	`int`	Total number of queries and documents
`num_queries`	`int`	Number of queries
`num_documents`	`int`	Number of documents
`number_of_characters`	`int`	Total number of characters in queries and documents
`documents_text_statistics`	`TextStatistics \| None`	Statistics for documents
`documents_image_statistics`	`ImageStatistics \| None`	Statistics for documents
`documents_audio_statistics`	`AudioStatistics \| None`	Statistics for documents
`documents_video_statistics`	`VideoStatistics \| None`	Statistics for documents
`queries_text_statistics`	`TextStatistics \| None`	Statistics for queries
`queries_image_statistics`	`ImageStatistics \| None`	Statistics for queries
`queries_audio_statistics`	`AudioStatistics \| None`	Statistics for queries
`queries_video_statistics`	`VideoStatistics \| None`	Statistics for queries
`relevant_docs_statistics`	`RelevantDocsStatistics`	Statistics for relevant documents
`top_ranked_statistics`	`TopRankedStatistics \| None`	Statistics for top ranked documents (if available)

Source code in mteb/types/statistics.py

class RetrievalDescriptiveStatistics(SplitDescriptiveStatistics):
    """Descriptive statistics for Retrieval.

    Attributes:
        num_samples: Total number of queries and documents
        num_queries: Number of queries
        num_documents: Number of documents
        number_of_characters: Total number of characters in queries and documents

        documents_text_statistics: Statistics for documents
        documents_image_statistics: Statistics for documents
        documents_audio_statistics: Statistics for documents
        documents_video_statistics: Statistics for documents
        queries_text_statistics: Statistics for queries
        queries_image_statistics: Statistics for queries
        queries_audio_statistics: Statistics for queries
        queries_video_statistics: Statistics for queries
        relevant_docs_statistics: Statistics for relevant documents
        top_ranked_statistics: Statistics for top ranked documents (if available)
    """

    num_samples: int
    num_queries: int
    num_documents: int
    number_of_characters: int

    documents_text_statistics: TextStatistics | None
    documents_image_statistics: ImageStatistics | None
    documents_audio_statistics: AudioStatistics | None
    documents_video_statistics: VideoStatistics | None

    queries_text_statistics: TextStatistics | None
    queries_image_statistics: ImageStatistics | None
    queries_audio_statistics: AudioStatistics | None
    queries_video_statistics: VideoStatistics | None

    relevant_docs_statistics: RelevantDocsStatistics

    # this is for datasets that do reranking
    top_ranked_statistics: TopRankedStatistics | None

`SummarizationDescriptiveStatistics` ¶

Bases: SplitDescriptiveStatistics

Descriptive statistics for Summarization.

Attributes:

Name	Type	Description
`num_samples`	`int`	number of samples in the dataset.
`number_of_characters`	`int`	Total number of symbols in the dataset.
`text_statistics`	`TextStatistics`	Statistics for the text
`human_summaries_statistics`	`TextStatistics`	Statistics for human summaries
`machine_summaries_statistics`	`TextStatistics`	Statistics for machine summaries
`score_statistics`	`ScoreStatistics`	Statistics for the relevance scores

Source code in mteb/types/statistics.py

class SummarizationDescriptiveStatistics(SplitDescriptiveStatistics):
    """Descriptive statistics for Summarization.

    Attributes:
        num_samples: number of samples in the dataset.
        number_of_characters: Total number of symbols in the dataset.

        text_statistics: Statistics for the text
        human_summaries_statistics: Statistics for human summaries
        machine_summaries_statistics: Statistics for machine summaries
        score_statistics: Statistics for the relevance scores
    """

    num_samples: int
    number_of_characters: int

    text_statistics: TextStatistics
    human_summaries_statistics: TextStatistics
    machine_summaries_statistics: TextStatistics
    score_statistics: ScoreStatistics

`ImageTextPairClassificationDescriptiveStatistics` ¶

Bases: SplitDescriptiveStatistics

Descriptive statistics for ImageTextPairClassification.

Attributes:

Name	Type	Description
`num_samples`	`int`	number of samples in the dataset.
`text_statistics`	`TextStatistics`	Statistics for text
`image_statistics`	`ImageStatistics`	Statistics for images

Source code in mteb/types/statistics.py

class ImageTextPairClassificationDescriptiveStatistics(SplitDescriptiveStatistics):
    """Descriptive statistics for ImageTextPairClassification.

    Attributes:
        num_samples: number of samples in the dataset.
        text_statistics: Statistics for text
        image_statistics: Statistics for images
    """

    num_samples: int
    text_statistics: TextStatistics
    image_statistics: ImageStatistics

`AnySTSStatistics` ¶

Bases: AnySTSDescriptiveStatistics

STS descriptive statistics, optionally with multilingual subsets.

Source code in mteb/types/statistics.py

class AnySTSStatistics(AnySTSDescriptiveStatistics):
    """STS descriptive statistics, optionally with multilingual subsets."""

    hf_subset_descriptive_stats: NotRequired[
        dict[HFSubset, AnySTSDescriptiveStatistics]
    ]

`BitextStatistics` ¶

Bases: BitextDescriptiveStatistics

Bitext mining descriptive statistics, optionally with multilingual subsets.

Source code in mteb/types/statistics.py

class BitextStatistics(BitextDescriptiveStatistics):
    """Bitext mining descriptive statistics, optionally with multilingual subsets."""

    hf_subset_descriptive_stats: NotRequired[
        dict[HFSubset, BitextDescriptiveStatistics]
    ]

`ClassificationStatistics` ¶

Bases: ClassificationDescriptiveStatistics

Classification descriptive statistics, optionally with multilingual subsets.

Source code in mteb/types/statistics.py

class ClassificationStatistics(ClassificationDescriptiveStatistics):
    """Classification descriptive statistics, optionally with multilingual subsets."""

    hf_subset_descriptive_stats: NotRequired[
        dict[HFSubset, ClassificationDescriptiveStatistics]
    ]

`ClusteringStatistics` ¶

Bases: ClusteringDescriptiveStatistics

Clustering descriptive statistics, optionally with multilingual subsets.

Source code in mteb/types/statistics.py

class ClusteringStatistics(ClusteringDescriptiveStatistics):
    """Clustering descriptive statistics, optionally with multilingual subsets."""

    hf_subset_descriptive_stats: NotRequired[
        dict[HFSubset, ClusteringDescriptiveStatistics]
    ]

`ClusteringFastStatistics` ¶

Bases: ClusteringFastDescriptiveStatistics

Clustering-fast descriptive statistics, optionally with multilingual subsets.

Source code in mteb/types/statistics.py

class ClusteringFastStatistics(ClusteringFastDescriptiveStatistics):
    """Clustering-fast descriptive statistics, optionally with multilingual subsets."""

    hf_subset_descriptive_stats: NotRequired[
        dict[HFSubset, ClusteringFastDescriptiveStatistics]
    ]

`ImageTextPairClassificationStatistics` ¶

Bases: ImageTextPairClassificationDescriptiveStatistics

Image/text pair classification stats, optionally with multilingual subsets.

Source code in mteb/types/statistics.py

class ImageTextPairClassificationStatistics(
    ImageTextPairClassificationDescriptiveStatistics
):
    """Image/text pair classification stats, optionally with multilingual subsets."""

    hf_subset_descriptive_stats: NotRequired[
        dict[HFSubset, ImageTextPairClassificationDescriptiveStatistics]
    ]

`PairClassificationStatistics` ¶

Bases: PairClassificationDescriptiveStatistics

Pair classification descriptive statistics, optionally with multilingual subsets.

Source code in mteb/types/statistics.py

class PairClassificationStatistics(PairClassificationDescriptiveStatistics):
    """Pair classification descriptive statistics, optionally with multilingual subsets."""

    hf_subset_descriptive_stats: NotRequired[
        dict[HFSubset, PairClassificationDescriptiveStatistics]
    ]

`RegressionStatistics` ¶

Bases: RegressionDescriptiveStatistics

Regression descriptive statistics, optionally with multilingual subsets.

Source code in mteb/types/statistics.py

class RegressionStatistics(RegressionDescriptiveStatistics):
    """Regression descriptive statistics, optionally with multilingual subsets."""

    hf_subset_descriptive_stats: NotRequired[
        dict[HFSubset, RegressionDescriptiveStatistics]
    ]

`RetrievalStatistics` ¶

Bases: RetrievalDescriptiveStatistics

Retrieval descriptive statistics, optionally with multilingual subsets.

Source code in mteb/types/statistics.py

class RetrievalStatistics(RetrievalDescriptiveStatistics):
    """Retrieval descriptive statistics, optionally with multilingual subsets."""

    hf_subset_descriptive_stats: NotRequired[
        dict[HFSubset, RetrievalDescriptiveStatistics]
    ]

`SummarizationStatistics` ¶

Bases: SummarizationDescriptiveStatistics

Summarization descriptive statistics, optionally with multilingual subsets.

Source code in mteb/types/statistics.py

class SummarizationStatistics(SummarizationDescriptiveStatistics):
    """Summarization descriptive statistics, optionally with multilingual subsets."""

    hf_subset_descriptive_stats: NotRequired[
        dict[HFSubset, SummarizationDescriptiveStatistics]
    ]

`ZeroShotClassificationStatistics` ¶

Bases: ZeroShotClassificationDescriptiveStatistics

Zero-shot classification stats, optionally with multilingual subsets.

Source code in mteb/types/statistics.py

class ZeroShotClassificationStatistics(ZeroShotClassificationDescriptiveStatistics):
    """Zero-shot classification stats, optionally with multilingual subsets."""

    hf_subset_descriptive_stats: NotRequired[
        dict[HFSubset, ZeroShotClassificationDescriptiveStatistics]
    ]

Additional Types¶

Encoder Input/Output types ¶

Array = NDArray[np.floating | np.integer | np.bool_] | torch.Tensor module-attribute ¶

Conversation = list[ConversationTurn] module-attribute ¶

BatchedInput = TextInput | CorpusInput | QueryInput | ImageInput | AudioInput | VideoInput | MultimodalInput module-attribute ¶

Supported input types¶

TextBatchedInput = TextInput | CorpusInput | QueryInput module-attribute ¶

QueryDatasetType = Dataset module-attribute ¶

CorpusDatasetType = Dataset module-attribute ¶

InstructionDatasetType = Dataset module-attribute ¶

RelevantDocumentsType = Mapping[str, Mapping[str, int]] module-attribute ¶

TopRankedDocumentsType = Mapping[str, list[str]] module-attribute ¶

RetrievalOutputType = dict[str, dict[str, float]] module-attribute ¶

EncodeKwargs ¶

PromptType ¶

ConversationTurn ¶

TextInput ¶

CorpusInput ¶

QueryInput ¶

ImageInput ¶

AudioInputItem ¶

AudioInput ¶

VideoInput ¶

MultimodalInput ¶

OutputDType ¶

get_dtype() ¶

Metadata types ¶

ISOLanguageScript = str module-attribute ¶

ISOLanguage = str module-attribute ¶

ISOScript = str module-attribute ¶

Languages = list[ISOLanguageScript] | Mapping[HFSubset, list[ISOLanguageScript]] module-attribute ¶

ModelName = str module-attribute ¶

Revision = str module-attribute ¶

Modalities = Literal['text', 'image', 'audio', 'video'] module-attribute ¶

Results types ¶

HFSubset = str module-attribute ¶

SplitName = str module-attribute ¶

Score = Any module-attribute ¶

ScoresDict = Mapping[str, Score] module-attribute ¶

RetrievalEvaluationResult ¶

SubmitResultsResponse ¶

Statistics types ¶

SplitDescriptiveStatistics ¶

DescriptiveStatistics ¶

TextStatistics ¶

ImageStatistics ¶

AudioStatistics ¶

VideoStatistics ¶

LabelStatistics ¶

ScoreStatistics ¶

TopRankedStatistics ¶

RelevantDocsStatistics ¶

SingleInputModalityStatistics ¶

PairModalityStatistics ¶

AnySTSDescriptiveStatistics ¶

BitextDescriptiveStatistics ¶

ClassificationDescriptiveStatistics ¶

RegressionDescriptiveStatistics ¶

ClusteringDescriptiveStatistics ¶

ClusteringFastDescriptiveStatistics ¶

PairClassificationDescriptiveStatistics ¶

ZeroShotClassificationDescriptiveStatistics ¶

RetrievalDescriptiveStatistics ¶

SummarizationDescriptiveStatistics ¶

ImageTextPairClassificationDescriptiveStatistics ¶

AnySTSStatistics ¶

BitextStatistics ¶

ClassificationStatistics ¶

ClusteringStatistics ¶

ClusteringFastStatistics ¶

ImageTextPairClassificationStatistics ¶

PairClassificationStatistics ¶

RegressionStatistics ¶

RetrievalStatistics ¶

SummarizationStatistics ¶

ZeroShotClassificationStatistics ¶

`Array = NDArray[np.floating | np.integer | np.bool_] | torch.Tensor` `module-attribute` ¶

`Conversation = list[ConversationTurn]` `module-attribute` ¶

`BatchedInput = TextInput | CorpusInput | QueryInput | ImageInput | AudioInput | VideoInput | MultimodalInput` `module-attribute` ¶

`TextBatchedInput = TextInput | CorpusInput | QueryInput` `module-attribute` ¶

`QueryDatasetType = Dataset` `module-attribute` ¶

`CorpusDatasetType = Dataset` `module-attribute` ¶

`InstructionDatasetType = Dataset` `module-attribute` ¶

`RelevantDocumentsType = Mapping[str, Mapping[str, int]]` `module-attribute` ¶

`TopRankedDocumentsType = Mapping[str, list[str]]` `module-attribute` ¶

`RetrievalOutputType = dict[str, dict[str, float]]` `module-attribute` ¶

`EncodeKwargs` ¶

`PromptType` ¶

`ConversationTurn` ¶

`TextInput` ¶

`CorpusInput` ¶

`QueryInput` ¶

`ImageInput` ¶

`AudioInputItem` ¶

`AudioInput` ¶

`VideoInput` ¶

`MultimodalInput` ¶

`OutputDType` ¶

`get_dtype()` ¶

`ISOLanguageScript = str` `module-attribute` ¶

`ISOLanguage = str` `module-attribute` ¶

`ISOScript = str` `module-attribute` ¶

`Languages = list[ISOLanguageScript] | Mapping[HFSubset, list[ISOLanguageScript]]` `module-attribute` ¶

`ModelName = str` `module-attribute` ¶

`Revision = str` `module-attribute` ¶

`Modalities = Literal['text', 'image', 'audio', 'video']` `module-attribute` ¶

`HFSubset = str` `module-attribute` ¶

`SplitName = str` `module-attribute` ¶

`Score = Any` `module-attribute` ¶

`ScoresDict = Mapping[str, Score]` `module-attribute` ¶

`RetrievalEvaluationResult` ¶

`SubmitResultsResponse` ¶

`SplitDescriptiveStatistics` ¶

`DescriptiveStatistics` ¶

`TextStatistics` ¶

`ImageStatistics` ¶

`AudioStatistics` ¶

`VideoStatistics` ¶

`LabelStatistics` ¶

`ScoreStatistics` ¶

`TopRankedStatistics` ¶

`RelevantDocsStatistics` ¶

`SingleInputModalityStatistics` ¶

`PairModalityStatistics` ¶

`AnySTSDescriptiveStatistics` ¶

`BitextDescriptiveStatistics` ¶

`ClassificationDescriptiveStatistics` ¶

`RegressionDescriptiveStatistics` ¶

`ClusteringDescriptiveStatistics` ¶

`ClusteringFastDescriptiveStatistics` ¶

`PairClassificationDescriptiveStatistics` ¶

`ZeroShotClassificationDescriptiveStatistics` ¶

`RetrievalDescriptiveStatistics` ¶

`SummarizationDescriptiveStatistics` ¶

`ImageTextPairClassificationDescriptiveStatistics` ¶

`AnySTSStatistics` ¶

`BitextStatistics` ¶

`ClassificationStatistics` ¶

`ClusteringStatistics` ¶

`ClusteringFastStatistics` ¶

`ImageTextPairClassificationStatistics` ¶

`PairClassificationStatistics` ¶

`RegressionStatistics` ¶

`RetrievalStatistics` ¶

`SummarizationStatistics` ¶

`ZeroShotClassificationStatistics` ¶