from pyannote.audio import Model
model = Model.from_pretrained("pyannote/segmentation")
In this blog post, I talk about the pyannote.audio pretrained speaker segmentation model, which happens to be one of the most popular audio models available on the 🤗 Hugging Face model hub.
What does pyannote/segmentation do?
Every pyannote.audio model has a specifications attribute that tells us a bit more about itself:
print(model.specifications)
Specifications(
    problem=<Problem.MULTI_LABEL_CLASSIFICATION: 2>,
    resolution=<Resolution.FRAME: 1>,
    duration=5.0,
    warm_up=(0.0, 0.0),
    classes=['speaker#1', 'speaker#2', 'speaker#3'],
    permutation_invariant=True,
)
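Since specifications is a regular attribute, its fields can also be read programmatically rather than just printed. A minimal sketch, using only the fields shown above:

# read individual fields from the specifications printed above
chunk_duration = model.specifications.duration    # 5.0 (seconds)
speaker_classes = model.specifications.classes    # ['speaker#1', 'speaker#2', 'speaker#3']
print(f"{chunk_duration:g}s chunks, {len(speaker_classes)} speaker classes")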
These specifications tell us the following about pyannote/segmentation:
- it ingests audio chunks of 5 seconds duration
- it addresses a multi-label classification problem
- … whose possible classes are chosen among speaker#1, speaker#2, and speaker#3
- … and are permutation_invariant (more about that below)
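To make these specifications concrete, here is a minimal sketch that feeds a single (random) 5-second chunk to the model. It assumes a 16 kHz sample rate and the usual calling convention of pyannote.audio models, i.e. a (batch, channel, samples) waveform tensor:

import torch

model.eval()

# a random 5-second mono chunk, assuming a 16 kHz sample rate
waveform = torch.randn(1, 1, 5 * 16000)  # (batch, channel, samples)

with torch.no_grad():
    activations = model(waveform)

# expected shape: (batch, num_frames, 3), i.e. one activation per frame
# and per speaker class
print(activations.shape)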
We also learn that its output temporal resolution is the frame (i.e. it outputs a sequence of frame-wise decisions rather than just one decision for the whole chunk). The actual temporal resolution can be obtained through the magic introspection attribute (approximately 17 ms for pyannote/segmentation):
model.introspection.frames.step
0.016875
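Put differently, each 5-second chunk is turned into a few hundred frame-wise decisions. A back-of-the-envelope sketch (the exact frame count also depends on the model's receptive field at chunk boundaries):

chunk_duration = model.specifications.duration  # 5.0 seconds
frame_step = model.introspection.frames.step    # ~0.017 seconds
print(chunk_duration / frame_step)              # roughly 296 frames per chunk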
OK, but what does pyannote/segmentation really do?
To answer this question, let us consider the audio recording of a 30s conversation between two speakers (the blue one and the red one):