One speaker segmentation model to rule them all

Deep dive into pyannote/segmentation pretrained model

Hervé Bredin


October 23, 2022

In this blog post, I take a deep dive into the pretrained pyannote/segmentation speaker segmentation model, which happens to be one of the most popular audio models available on the 🤗 Hugging Face model hub.

from pyannote.audio import Model
model = Model.from_pretrained("pyannote/segmentation")

What does pyannote/segmentation do?

Every model has a specifications attribute that tells us a bit more about itself:

print(model.specifications)
# Specifications(
#     problem=<Problem.MULTI_LABEL_CLASSIFICATION: 2>,
#     resolution=<Resolution.FRAME: 1>,
#     duration=5.0,
#     warm_up=(0.0, 0.0),
#     classes=['speaker#1', 'speaker#2', 'speaker#3'],
#     permutation_invariant=True)

These specifications tell us the following about pyannote/segmentation:

  • it ingests 5-second audio chunks
  • it addresses a multi-label classification problem
  • … whose possible classes are speaker#1, speaker#2, and speaker#3
  • … and which is permutation_invariant (more about that below)
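The permutation-invariant property deserves a quick illustration: the model does not care which output slot a given speaker ends up in, so training compares the prediction to the reference under every possible speaker permutation and keeps the best match. Here is a toy sketch of that idea in plain Python (an illustration only, not the actual pyannote.audio loss):

from itertools import permutations

def permutation_invariant_loss(reference, prediction):
    """Toy permutation-invariant loss: evaluate the total absolute
    error under every speaker permutation and keep the lowest one.
    Both arguments are lists of per-speaker activation sequences."""
    best = float("inf")
    for perm in permutations(range(len(reference))):
        # total absolute error when speaker `spk` of the reference
        # is matched with speaker `perm[spk]` of the prediction
        error = sum(
            abs(r - p)
            for spk in range(len(reference))
            for r, p in zip(reference[spk], prediction[perm[spk]])
        )
        best = min(best, error)
    return best

# the prediction swaps the two speakers: a naive frame-wise loss
# would be maximal, but the permutation-invariant loss is zero
reference  = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
prediction = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
print(permutation_invariant_loss(reference, prediction))  # 0.0

Swapping the two speakers in the prediction costs nothing, which is exactly the behavior we want when output slots carry no fixed identity.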

We also learn that its output temporal resolution is the frame (i.e. it outputs a sequence of frame-wise decisions rather than just one decision for the whole chunk). The actual temporal resolution can be obtained through the magic introspection attribute (approximately 17ms for pyannote/segmentation):

model.introspection.frames
# SlidingWindow(start=0.0, duration=0.016875, step=0.016875)

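With a step of roughly 17 ms, a 5-second chunk therefore yields close to 300 frame-wise decisions. The back-of-the-envelope helpers below are hypothetical (not part of pyannote.audio) and hard-code the step as an approximation, just to make the frame/time bookkeeping concrete:

# approximate frame step of pyannote/segmentation, in seconds
FRAME_STEP = 0.017

def num_frames(chunk_duration: float, step: float = FRAME_STEP) -> int:
    """Rough number of frame-wise decisions produced for a chunk."""
    return int(chunk_duration / step)

def frame_to_time(frame_index: int, step: float = FRAME_STEP) -> float:
    """Timestamp (in seconds) of the middle of a given frame."""
    return (frame_index + 0.5) * step

print(num_frames(5.0))  # 294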
OK, but what does pyannote/segmentation really do?

To answer this question, let us consider the audio recording of a 30s conversation between two speakers (the blue one and the red one):