# settings for pretty visualization

from rich import print
from pyannote.core import notebook, Segment
SAMPLE_EXTENT = Segment(0, 30)
notebook.crop = SAMPLE_EXTENT

SAMPLE_CHUNK = Segment(15, 20)
SAMPLE_URI = "sample"
SAMPLE_WAV = f"{SAMPLE_URI}.wav"
SAMPLE_REF = f"{SAMPLE_URI}.rttm"

In this blog post, I talk about pyannote.audio's pretrained speaker segmentation model, which happens to be one of the most popular audio models available on the 🤗 Hugging Face model hub.

from pyannote.audio import Model
model = Model.from_pretrained("pyannote/segmentation")

What does pyannote/segmentation do?

Every pyannote.audio model has a specifications attribute that tells us a bit more about itself:

print(model.specifications)
Specifications(
    problem=<Problem.MULTI_LABEL_CLASSIFICATION: 2>,
    resolution=<Resolution.FRAME: 1>,
    duration=5.0,
    warm_up=(0.0, 0.0),
    classes=['speaker#1', 'speaker#2', 'speaker#3'],
    permutation_invariant=True
)

These specifications tell us the following about pyannote/segmentation:

  • it ingests audio chunks of 5 seconds duration
  • it addresses a multi-label classification problem...
  • ... whose possible classes are chosen among speaker#1, speaker#2, and speaker#3 ...
  • ... and are permutation_invariant (more about that below)
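Permutation invariance deserves a quick illustration. Since speaker#1, speaker#2, and speaker#3 are arbitrary labels, a prediction should not be penalized just because it assigns speakers to different slots than the reference. The toy sketch below (only an illustration of the idea, not pyannote's actual training loss) scores a prediction against the reference under every speaker permutation and keeps the best match:

```python
from itertools import permutations

# frames x speakers activation matrices (toy values)
reference = [[1, 0], [0, 1], [0, 1]]
# same content as the reference, but with the speaker columns swapped
prediction = [[0, 1], [1, 0], [1, 0]]

def error(ref, hyp, perm):
    # total absolute error when reference speaker i is matched
    # with hypothesis speaker perm[i]
    return sum(
        abs(r[i] - h[j])
        for r, h in zip(ref, hyp)
        for i, j in enumerate(perm)
    )

# permutation-invariant error: best match over all speaker permutations
best = min(error(reference, prediction, perm)
           for perm in permutations(range(2)))
print(best)  # 0: the swapped prediction is a perfect match
```

Without the minimum over permutations, the (identity-mapped) error would be 6 even though the prediction is perfect up to speaker relabeling.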

We also learn that its output temporal resolution is the frame (i.e. it outputs a sequence of frame-wise decisions rather than just one decision for the whole chunk). The actual temporal resolution can be obtained through the magic introspection attribute (approximately 17ms for pyannote/segmentation):

model.introspection.frames.step
0.016875
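Given the chunk duration and the frame step reported above, a quick back-of-the-envelope calculation gives the approximate number of frame-wise predictions per chunk (the exact count depends on the model's receptive field and padding):

```python
# values reported above by model.specifications and model.introspection
duration = 5.0   # seconds per input chunk
step = 0.016875  # seconds between consecutive output frames

# rough estimate of frame-wise predictions per 5s chunk
num_frames = round(duration / step)
print(num_frames)  # 296
```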

OK, but what does pyannote/segmentation really do?

To answer this question, let us consider the audio recording of a 30s conversation between two speakers (the blue one and the red one):

from pyannote.database.util import load_rttm
reference = load_rttm(SAMPLE_REF)[SAMPLE_URI]
reference

from IPython.display import Audio as AudioPlayer
AudioPlayer(SAMPLE_WAV)