One speaker segmentation model to rule them all

Deep dive into the pyannote/segmentation pretrained model
Author

Hervé Bredin

Published

October 23, 2022

In this blog post, I talk about the pyannote.audio pretrained speaker segmentation model, which happens to be one of the most popular audio models available on the 🤗 Hugging Face model hub.

from pyannote.audio import Model
model = Model.from_pretrained("pyannote/segmentation")
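Note that, depending on when you read this, the model may be gated on the model hub: you might first have to accept its user conditions and pass a Hugging Face access token, along the lines of (with your own token in place of the placeholder):

model = Model.from_pretrained("pyannote/segmentation", use_auth_token="ACCESS_TOKEN_GOES_HERE")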

What does pyannote/segmentation do?

Every pyannote.audio model has a specifications attribute that tells us a bit more about itself:

print(model.specifications)
Specifications(
    problem=<Problem.MULTI_LABEL_CLASSIFICATION: 2>,
    resolution=<Resolution.FRAME: 1>,
    duration=5.0,
    warm_up=(0.0, 0.0),
    classes=['speaker#1', 'speaker#2', 'speaker#3'],
    permutation_invariant=True
)

These specifications tell us the following about pyannote/segmentation:

  • it ingests 5-second audio chunks (duration)
  • it addresses a multi-label classification problem
  • … whose possible classes are chosen among speaker#1, speaker#2, and speaker#3
  • … and are permutation_invariant (more about that below)

We also learn that its output temporal resolution is the frame (i.e. it outputs a sequence of frame-wise decisions rather than just one decision for the whole chunk). The actual temporal resolution can be obtained through the magic introspection attribute (approximately 17ms for pyannote/segmentation):

model.introspection.frames.step
0.016875
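To make these specifications more concrete, here is a minimal sketch that feeds a random 5-second mono chunk to the model (assuming its expected 16kHz sample rate) and inspects the frame-wise output it produces:

import torch

# random 5-second mono chunk, shaped (batch, channel, samples), assuming 16kHz audio
waveform = torch.randn(1, 1, 16000 * 5)

model.eval()
with torch.inference_mode():
    activations = model(waveform)

# (batch_size, num_frames, num_classes): one activation per ~17ms frame
# and per class (speaker#1, speaker#2, speaker#3)
print(activations.shape)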

OK, but what does pyannote/segmentation really do?

To answer this question, let us consider the audio recording of a 30s conversation between two speakers (the blue one and the red one):