2017

Hervé Bredin

ICASSP 2017, IEEE International Conference on Acoustics, Speech, and Signal Processing

TristouNet is a neural network architecture based on Long Short-Term Memory recurrent networks, meant to project speech sequences into a fixed-dimensional Euclidean space. Thanks to the triplet loss paradigm used for training, the resulting sequence embeddings can be compared directly with the Euclidean distance, for speaker comparison purposes. Experiments on short (between 500ms and 5s) speech turn comparison and speaker change detection show that TristouNet brings significant improvements over the current state-of-the-art techniques for both tasks.

.bib [Bredin2017] | .pdf
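As an illustration, the triplet loss at the core of TristouNet can be sketched as follows (a minimal NumPy sketch with toy embeddings; the margin value and embedding dimension are illustrative, not the paper's actual choices):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on fixed-dimensional embeddings.

    Pulls the anchor/positive (same speaker) pair together and pushes
    the anchor/negative (different speakers) pair apart by at least
    `margin` (an illustrative value, not the paper's).
    """
    d_ap = np.linalg.norm(anchor - positive)  # same-speaker distance
    d_an = np.linalg.norm(anchor - negative)  # different-speaker distance
    return max(0.0, d_ap - d_an + margin)

# Toy 3-dimensional embeddings; TristouNet would produce these with an
# LSTM-based network instead.
a = np.array([1.0, 0.0, 0.0])
p = np.array([0.9, 0.1, 0.0])
n = np.array([0.0, 1.0, 0.0])

loss = triplet_loss(a, p, n)
```

During training, minimizing this loss shapes the embedding space so that the plain Euclidean distance becomes a meaningful speaker-similarity score.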

2016

Johann Poignant, Mateusz Budnik, Hervé Bredin, Claude Barras, Mickaël Stefas, Pierrick Bruneau, Gilles Adda, Laurent Besacier, Hazim Ekenel, Gil Francopoulo, Javier Hernando, Joseph Mariani, Ramon Morros, Georges Quénot, Sophie Rosset, Thomas Tamisier

In this paper, we describe the organization and the implementation of the CAMOMILE collaborative annotation framework for multimodal, multimedia, multilingual (3M) data. Given the versatile nature of the analysis which can be performed on 3M data, the structure of the server was kept intentionally simple in order to preserve its genericity, relying on standard Web technologies. Layers of annotations, defined as data associated to a media fragment from the corpus, are stored in a database and can be managed through standard interfaces with authentication. Interfaces tailored specifically to the needed task can then be developed in an agile way, relying on simple but reliable services for the management of the centralized annotations. We then present our implementation of an active learning scenario for person annotation in video, relying on the CAMOMILE server; during a dry run experiment, the manual annotation of 716 speech segments was thus propagated to 3504 labeled tracks. The code of the CAMOMILE framework is distributed in open source.

.bib [Poignant2016b] | .pdf
Johann Poignant, Hervé Bredin, Claude Barras, Mickaël Stefas, Pierrick Bruneau, Thomas Tamisier

In this paper, we claim that the CAMOMILE collaborative annotation platform (developed in the framework of the eponymous CHIST-ERA project) eases the organization of multimedia technology benchmarks, automating most of the campaign's technical workflow and enabling collaborative (hence faster and cheaper) annotation of the evaluation data. This is demonstrated through the successful organization of a new multimedia task at MediaEval 2015, Multimodal Person Discovery in Broadcast TV.

.bib [Poignant2016] | .pdf
Hervé Bredin, Grégory Gelly

ACM MM 2016, 24th ACM International Conference on Multimedia

While successful on broadcast news, meetings or telephone conversations, state-of-the-art speaker diarization techniques tend to perform poorly on TV series or movies. In this paper, we propose to rely on state-of-the-art face clustering techniques to guide acoustic speaker diarization. Two approaches are tested and evaluated on the first season of the Game of Thrones TV series. The second (better) approach relies on a novel talking-face detection module based on a bi-directional long short-term memory recurrent neural network. Both audio-visual approaches outperform the audio-only baseline. A detailed study of the behavior of these approaches is also provided and paves the way to future improvements.

.bib [Bredin2016] | .pdf
Pierrick Bruneau, Mickaël Stefas, Johann Poignant, Hervé Bredin, Claude Barras

ISM 2016, 12th IEEE International Symposium on Multimedia

Part of the research effort in automatic person discovery in multimedia content consists in analyzing the errors made by algorithms. However, exploring the space of models relating algorithmic errors in person discovery to intrinsic properties of associated shots (e.g. person facing the camera) - coined as post-hoc analysis in this paper - requires data curation and statistical model tuning, which can be cumbersome. In this paper we present a visual and interactive tool that facilitates this exploration. Adequate statistical building blocks are defined, and coordinated by visual and interactive components inspired by the literature in information visualization. A case study is conducted with multimedia researchers to validate the tool. Real data obtained from the MediaEval person discovery task was used for this experiment. Our approach yielded novel insight that was completely unsuspected previously.

.bib [Bruneau2016] | .pdf

2015

Delphine Charlet, Johann Poignant, Hervé Bredin, Corinne Fredouille, Sylvain Meignier

ERRARE 2015, Second Workshop on Errors By Humans and Machines in Multimedia, Multimodal, and Multilingual Data Processing

Speaker identification approaches for TV broadcast are usually evaluated and compared based on global error rates derived from the overall duration of missed detection, false alarm and confusion. Based on the analysis of the output of the systems submitted to the final round of the French evaluation campaign REPERE, this paper highlights the fact that these average metrics lead to the incorrect intuition that current state-of-the-art algorithms partially recognize all speakers. Setting aside incorrect diarization and adverse acoustic conditions, we show that their performance is in fact essentially bi-modal: in a given show, either all speech turns of a speaker are correctly identified or none of them are. We then try to understand and explain this behavior through performance prediction experiments. These experiments show that the most discriminant speaker characteristics are -- first -- their total speech duration in the current show and -- then only -- the amount of training data available to build their acoustic model.

.bib [Charlet2015] | .pdf
Elena Knyazeva, Guillaume Wisniewski, Hervé Bredin, François Yvon

Interspeech 2015, 16th Annual Conference of the International Speech Communication Association

Though radio and TV broadcast are highly structured documents, state-of-the-art speaker identification algorithms do not take advantage of this information to improve prediction performance: speech turns are usually identified independently from each other, using unstructured multi-class classification approaches. In this work, we propose to address speaker identification as a sequence labeling task and use two structured prediction techniques to account for the inherent temporal structure of interactions between speakers: the first one relies on Conditional Random Field and can take into account local relations between two consecutive speech turns; the second one, based on the SEARN framework, sacrifices exact inference for the sake of the expressiveness of the model and is able to incorporate rich structure information during prediction. Experiments performed on The Big Bang Theory TV series show that structured prediction techniques outperform the standard unstructured approach.

.bib [Knyazeva2015] | .pdf
Mateusz Budnik, Laurent Besacier, Johann Poignant, Hervé Bredin, Claude Barras, Mickaël Stefas, Pierrick Bruneau, Thomas Tamisier

Interspeech 2015, 16th Annual Conference of the International Speech Communication Association

This paper presents a collaborative annotation framework for person identification in TV shows. The web annotation front-end will be demonstrated during the Show and Tell session. All the code for annotation is made available on github. The tool can also be used in a crowd-sourcing environment.

.bib [Budnik2015] | .pdf
Johann Poignant, Hervé Bredin, Claude Barras

MediaEval 2015

We describe the "Multimodal Person Discovery in Broadcast TV" task of the MediaEval 2015 benchmarking initiative. Participants are asked to return the names of people who can be both seen and heard in every shot of a collection of videos. The list of people is not known a priori and their names must be discovered in an unsupervised way from media content, using text overlay or speech transcripts. The task is evaluated using information retrieval metrics, based on a posteriori collaborative annotation of the test corpus.

.bib [Poignant2015] | .pdf
Johann Poignant, Hervé Bredin, Claude Barras

MediaEval 2015

This paper describes the algorithm tested by the LIMSI team in the MediaEval 2015 Person Discovery in Broadcast TV Task. For this task, we used an audio/video diarization process constrained by names written on screen. These names are used both to identify clusters and to prevent the fusion of two clusters with different co-occurring names. This method achieved an EwMAP of 83.1%, tuned on the out-of-domain development corpus.

.bib [Poignant2015a] | .pdf
Pierrick Bruneau, Mickaël Stefas, Hervé Bredin, Johann Poignant, Thomas Tamisier, Claude Barras

ICMI 2015, 17th International Conference on Multimodal Interaction

Classification quality criteria such as precision, recall, and F-measure are generally the basis for evaluating contributions in automatic speaker recognition. Specifically, comparisons are carried out mostly via mean values estimated on a set of media. Whilst this approach is relevant to assess improvement w.r.t. the state-of-the-art, or ranking participants in the context of an automatic annotation challenge, it gives little insight to system designers in terms of cues for improving algorithms, hypothesis formulation, and evidence display. This paper presents a design study of a visual and interactive approach to analyze errors made by automatic annotation algorithms. A timeline-based tool emerged from prior steps of this study. A critical review, driven by user interviews, exposes caveats and refines user objectives. The next step of the study is then initiated by sketching designs combining elements of the current prototype to principles newly identified as relevant.

.bib [Bruneau2015] | .pdf

2014

Anindya Roy, Hervé Bredin, William Hartmann, Viet-Bac Le, Claude Barras, Jean-Luc Gauvain

Multimedia Tools and Applications

It is possible to use lexical information extracted from speech transcripts for speaker identification (SID), either on its own or to improve the performance of standard cepstral-based SID systems upon fusion. This was previously established using isolated speech from single speakers (NIST SRE corpora, parliamentary speeches). In contrast, this work applies lexical approaches for SID to a different type of data: the REPERE corpus, consisting of unsegmented multiparty conversations, mostly debates, discussions and Q&A sessions from TV shows. It is hypothesized that people give out clues to their identity when speaking in such settings, which this work aims to exploit. The impact on SID performance of the diarization front-end required to pre-process the unsegmented data is also measured. Four lexical SID approaches are studied in this work, including TF-IDF, BM25 and LDA-based topic modeling. Results are analysed in terms of TV shows and speaker roles. Lexical approaches achieve low error rates for certain speaker roles such as anchors and journalists, sometimes lower than a standard cepstral-based Gaussian Supervector -- Support Vector Machine (GSV-SVM) system. Also, in certain cases, the lexical system shows modest improvement over the cepstral-based system performance using score-level sum fusion. To highlight the potential of using lexical information not just to improve upon cepstral-based SID systems but as an independent approach in its own right, initial studies on cross-media SID are briefly reported. Instead of using speech data as all cepstral systems require, this approach uses Wikipedia texts to train lexical speaker models which are then tested on speech transcripts to identify speakers.

.bib [Roy2014] | .pdf
Hervé Bredin, Anindya Roy, Viet-Bac Le, Claude Barras

International Journal of Multimedia Information Retrieval

This work introduces a unified framework for mono-, cross- and multi-modal person recognition in multimedia data. Dubbed Person Instance Graph, it models the person recognition task as a graph mining problem: i.e. finding the best mapping between person instance vertices and identity vertices. Practically, we describe how the approach can be applied to speaker identification in TV broadcast. Then, a solution to the above-mentioned mapping problem is proposed. It relies on Integer Linear Programming to model the problem of clustering person instances based on their identity. We provide an in-depth theoretical definition of the optimization problem. Moreover, we improve two fundamental aspects of our previous related work: the problem constraints and the optimized objective function. Finally, a thorough experimental evaluation of the proposed framework is performed on a publicly available benchmark database. Depending on the graph configuration (i.e. the choice of its vertices and edges), we show that multiple tasks can be addressed interchangeably (e.g. speaker diarization, supervised or unsupervised speaker identification), significantly outperforming state-of-the-art mono-modal approaches.

.bib [Bredin2014] | .pdf
Anindya Roy, Camille Guinaudeau, Hervé Bredin, Claude Barras

LREC 2014, 9th Language Resources and Evaluation Conference

We present a new dataset built around two TV series, The Big Bang Theory (a situation comedy) and Game of Thrones (a fantasy drama). It has multiple tracks including dialogue, crowd-sourced textual descriptions and metadata, all time-stamped and temporally aligned with each other. We provide tools to reproduce it for research purposes, provided that one has legally acquired the DVDs of the series. The alignment algorithm used is evaluated on a manually aligned subset of the data.

.bib [Roy2014a] | .pdf
Hervé Bredin, Antoine Laurent, Achintya Sarkar, Viet-Bac Le, Sophie Rosset, Claude Barras

Odyssey 2014, The Speaker and Language Recognition Workshop

We address the problem of named speaker identification in TV broadcast which consists in answering the question ''who speaks when?'' with the real identity of speakers, using person names automatically obtained from speech transcripts. While existing approaches rely on a first speaker diarization step followed by a local name propagation step to speaker clusters, we propose a unified framework called person instance graph where both steps are jointly modeled as a global optimization problem, then solved using integer linear programming. Moreover, when available, acoustic speaker models can be added seamlessly to the graph structure for joint named and acoustic speaker identification - leading to a 10% error decrease (from 45% down to 35%) over a state-of-the-art i-vector speaker identification system on the REPERE TV broadcast corpus.

.bib [Bredin2014a] | .pdf
Sabin Tiberius Strat, Alexandre Benoit, Patrick Lambert, Hervé Bredin, Georges Quénot

Fusion in Computer Vision -- Understanding Complex Visual Content

Current research shows that the detection of semantic concepts (animal, bus, person, dancing etc.) in multimedia documents such as videos requires the use of several types of complementary descriptors in order to achieve good results. In this work, we explore strategies for combining dozens of complementary content descriptors (or "experts") in an efficient way, through the use of late fusion approaches, for concept detection in multimedia documents. We explore two fusion approaches that share a common structure: both start with a clustering of experts stage, continue with an intra-cluster fusion and finish with an inter-cluster fusion, and we also experiment with other state-of-the-art methods. The first fusion approach relies on a priori knowledge about the internals of each expert to group the set of available experts by similarity. The second approach automatically obtains measures on the similarity of experts from their output to group the experts using agglomerative clustering, and then combines the results of this fusion with those from other methods. In the end, we show that an additional performance boost can be obtained by also considering the context of multimedia elements.

.bib [Strat2014] | .pdf
Pierrick Bruneau, Mickaël Stefas, Hervé Bredin, Anh-Phuong Ta, Thomas Tamisier, Claude Barras

iV 2014, 18th International Conference Information Visualisation

Multimedia annotation algorithms infer localized metadata in multimedia content, e.g. speakers or appearing faces. There is a growing need for experts from this domain to perform advanced analyses that go beyond medium-scale quality metrics. This paper describes a novel visual tool that applies interactive visualization principles to the concerns of multimedia experts. Multiple coordinated views, augmented by interactive inspection facilities, ease the navigation in media annotations and the visual detection of relevant information. The effectiveness of the proposition is demonstrated by experimental scenarios on a real multimedia corpus.

.bib [Bruneau2014] | .pdf
Pierrick Bruneau, Mickaël Stefas, Mateusz Budnik, Johann Poignant, Hervé Bredin, Thomas Tamisier, Benoît Otjacques

CDVE 2014, 11th International Conference on Cooperative Design, Visualization and Engineering

Reference multimedia corpora for use in automated annotation algorithms are very demanding of manual work. The Camomile project advocates the joint progress of automated annotation methods and tools for improving the benchmark resources. This paper shows some work in progress in interactive visualization of annotations, and perspectives in harnessing the collaboration between manual annotators, algorithm designers, and benchmark administrators.

.bib [Bruneau2014a] | .pdf
Camille Guinaudeau, Antoine Laurent, Hervé Bredin

MediaEval 2014

This paper provides an overview of the Social Event Detection (SED) system developed at LIMSI for the 2014 campaign. Our approach is based on a hierarchical agglomerative clustering that uses textual metadata, user-based knowledge and geographical information. These different sources of knowledge, either used separately or in cascade, reach good results for the full clustering subtask with a normalized mutual information equal to 0.95 and F1 scores greater than 0.82 for our best run.

.bib [Guinaudeau2014] | .pdf
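The normalized mutual information used to evaluate the full clustering subtask can be computed from label co-occurrence counts alone; a minimal from-scratch sketch (real evaluations would typically rely on a library implementation):

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two clusterings.

    NMI = I(A; B) / sqrt(H(A) * H(B)); 1.0 means identical
    partitions (up to label renaming), 0.0 means independence.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mutual_info = sum(
        (c / n) * math.log(n * c / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    denom = math.sqrt(entropy(count_a) * entropy(count_b))
    return mutual_info / denom if denom > 0 else 1.0

# Identical partitions up to label names: NMI is close to 1.0.
score = nmi([0, 0, 1, 1], ["x", "x", "y", "y"])
```

A system score of 0.95, as reported above, therefore means the predicted event clusters carry almost all the information of the reference clustering.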
Hervé Bredin, Anindya Roy, Nicolas Pécheux, Alexandre Allauzen

ACM MM 2014, 22nd ACM International Conference on Multimedia

We address the problem of speaker identification in multimedia data, and TV series in particular. While speaker identification is traditionally a supervised machine-learning task, our first contribution is to significantly reduce the need for costly preliminary manual annotations through the use of automatically aligned (and potentially noisy) fan-generated transcripts and subtitles. We show that both speech activity detection and speech turn identification modules trained in this weakly supervised manner achieve similar performance as their fully supervised counterparts (i.e. relying on fine manual speech/non-speech/speaker annotation). Our second contribution relates to the use of multilingual audio tracks usually available with this kind of content to significantly improve the overall speaker identification performance. Reproducible experiments (including dataset, manual annotations and source code) performed on the first six episodes of The Big Bang Theory TV series show that combining the French audio track (containing dubbed actor voices) with the English one (with the original actor voices) improves the overall English speaker identification performance by 5% absolute and up to 70% relative on the five main characters.

.bib [Bredin2014b] | .pdf

2013

Hervé Bredin, Johann Poignant

Interspeech 2013, 14th Annual Conference of the International Speech Communication Association

Most state-of-the-art approaches address speaker diarization as a hierarchical agglomerative clustering problem in the audio domain. In this paper, we propose to revisit one of them: speech turns clustering based on the Bayesian Information Criterion (a.k.a. BIC clustering). First, we show how to model it as an integer linear programming (ILP) problem. Its resolution leads to the same overall diarization error rate as standard BIC clustering but generates significantly purer speaker clusters. Then, we describe how this approach can easily be extended to the audiovisual domain and TV broadcast in particular. The straightforward integration of detected overlaid names (used to introduce guests or journalists, and obtained via video OCR) into a multimodal ILP problem yields significantly better speaker diarization results. Finally, we explain how this novel paradigm can incidentally be used for unsupervised speaker identification (i.e. not relying on any prior acoustic speaker models). Experiments on the REPERE TV broadcast corpus show that it achieves performance close to that of an oracle capable of identifying any speaker as long as their name appears on screen at least once in the video.

.bib [Bredin2013] | .pdf
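The flavor of clustering-as-ILP can be illustrated on a toy instance. This is a brute-force sketch, not the paper's actual formulation: the distance matrix, the `delta` threshold and the objective weights are all hypothetical, and a real system would hand the binary variables to an ILP solver instead of enumerating assignments.

```python
from itertools import product

def ilp_cluster(dist, delta):
    """Exhaustive solution of a tiny clustering-as-ILP instance.

    Each speech turn is assigned to one "center" turn; the objective
    minimizes the number of centers plus the (scaled) sum of
    within-cluster distances, and assignments farther than `delta`
    from their center are forbidden (hard constraint).
    """
    n = len(dist)
    best, best_cost = None, float("inf")
    for assign in product(range(n), repeat=n):
        centers = set(assign)
        # A turn used as a center must be assigned to itself.
        if any(assign[k] != k for k in centers):
            continue
        # Hard constraint: no assignment farther than delta.
        if any(dist[j][assign[j]] > delta for j in range(n)):
            continue
        cost = len(centers) + sum(dist[j][assign[j]] for j in range(n)) / delta
        if cost < best_cost:
            best, best_cost = assign, cost
    return best

# Toy distance matrix: turns 0 and 1 are close, turn 2 is far from both.
dist = [[0.0, 0.1, 0.9],
        [0.1, 0.0, 0.8],
        [0.9, 0.8, 0.0]]
assignment = ilp_cluster(dist, delta=0.5)
```

On this instance the optimum groups turns 0 and 1 together and leaves turn 2 in its own cluster; constraints such as "clusters with different overlaid names must not merge" would simply prune further assignments.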
Hervé Bredin, Johann Poignant, Guillaume Fortier, Makarand Tapaswi, Viet-Bac Le, Anindya Roy, Claude Barras, Sophie Rosset, Achintya Sarkar, Qian Yang, Hua Gao, Alexis Mignon, Jakob Verbeek, Laurent Besacier, Georges Quénot, Hazim Kemal Ekenel, Rainer Stiefelhagen

SLAM 2013, First Workshop on Speech, Language and Audio for Multimedia

We describe QCompere consortium submissions to the REPERE 2013 evaluation campaign. The REPERE challenge aims at gathering four communities (face recognition, speaker identification, optical character recognition and named entity detection) towards the same goal: multimodal person recognition in TV broadcast. First, four mono-modal components are introduced (one for each foregoing community) constituting the elementary building blocks of our various submissions. Then, depending on the target modality (speaker or face recognition) and on the task (supervised or unsupervised recognition), four different fusion techniques are introduced: they can be summarized as propagation-, classifier-, rule- or graph-based approaches. Finally, their performance is evaluated on REPERE 2013 test set and their advantages and limitations are discussed.

.bib [Bredin2013a] | .pdf
Johann Poignant, Hervé Bredin, Laurent Besacier, Georges Quénot, Claude Barras

SLAM 2013, First Workshop on Speech, Language and Audio for Multimedia

Existing methods for unsupervised identification of speakers in TV broadcast usually rely on the output of a speaker diarization module and try to name each cluster using names provided by another source of information: we call it "late naming". Hence, written names extracted from title blocks tend to lead to high-precision identification, although they cannot correct errors made during the clustering step. In this paper, we extend our previous "late naming" approach in two ways: "integrated naming" and "early naming". While "late naming" relies on a speaker diarization module optimized for speaker diarization, "integrated naming" jointly optimizes speaker diarization and name propagation in terms of identification errors. "Early naming" modifies the speaker diarization module by adding constraints preventing two clusters with different written names from being merged together. While "integrated naming" yields identification performance similar to "late naming" (with better precision), "early naming" improves over this baseline both in terms of identification error rate and stability of the clustering stopping criterion.

.bib [Poignant2013] | .pdf

2012

Bertrand Delezoide, Frédéric Precioso, Philippe Gosselin, Miriam Redi, Bernard Mérialdo, Lionel Granjon, Denis Pellerin, Michèle Rombaut, Hervé Jégou, Rémi Vieux, Boris Mansencal, Jenny Benois-Pineau, Stéphane Ayache, Bahjat Safadi, Franck Thollard, Georges Quénot, Hervé Bredin, Matthieu Cord, Alexandre Benoit, Patrick Lambert, Tiberius Strat, Joseph Razik, Sébastien Paris, Hervé Glotin

TRECVid 2011, TREC Video Retrieval Evaluation Online Proceedings

The IRIM group is a consortium of French teams working on Multimedia Indexing and Retrieval. This paper describes its participation in the TRECVID 2011 semantic indexing and instance search tasks. For the semantic indexing task, our approach uses a six-stage processing pipeline for computing scores for the likelihood of a video shot to contain a target concept. These scores are then used for producing a ranked list of images or shots that are the most likely to contain the target concept. The pipeline is composed of the following steps: descriptor extraction, descriptor optimization, classification, fusion of descriptor variants, higher-level fusion, and re-ranking. We evaluated a number of different descriptors and tried different fusion strategies. The best IRIM run has a Mean Inferred Average Precision of 0.1387, which ranked us 5th out of 19 participants. For the instance search task, we used both object-based and frame-based queries. We formulated the query in the standard way, as a comparison of visual signatures either of the object with parts of DB frames or of the query and DB frames. To produce visual signatures we also used two approaches: the first one is the baseline Bag-Of-Visual-Words (BOVW) model based on the SURF interest point descriptor; the second approach is a Bag-Of-Regions (BOR) model that extends the traditional notion of BOVW vocabulary not only to keypoint-based descriptors but to region-based descriptors.

.bib [Delezoide2012] | .pdf
Hervé Bredin

ICASSP 2012, IEEE International Conference on Acoustics, Speech, and Signal Processing

We investigate the use of speaker diarization (SD) and automatic speech recognition (ASR) for the segmentation of audiovisual documents into scenes. We introduce multiple monomodal and multimodal approaches based on a state-of-the-art algorithm called generalized scene transition graph (GSTG). First, we extend the latter with the use of semantic information derived from both SD and ASR. Then, multimodal fusion of color histograms, SD and ASR is investigated at various points of the GSTG pipeline (early, late or intermediate fusion). Experiments conducted on a few episodes of a popular TV show indicate that SD and ASR can be successfully combined with visual information and bring an additional +11% relative increase in F-measure for scene boundary detection over the state-of-the-art baseline.

.bib [Bredin2012] | .pdf
Hervé Bredin

ICASSP 2012, IEEE International Conference on Acoustics, Speech, and Signal Processing

We deal with the issue of combining dozens of classifiers into a better one. Our first contribution is the introduction of the notion of communities of classifiers. We build a complete graph with one node per classifier and edges weighted by a measure of similarity between connected classifiers. The resulting community structure is uncovered from this graph using the state-of-the-art Louvain algorithm. Our second contribution is a hierarchical fusion approach driven by these communities. First, intra-community fusion results in one classifier per community. Then, inter-community fusion takes advantage of their complementarity to achieve much better classification performance. Application to the combination of 90 classifiers in the framework of TRECVid 2010 Semantic Indexing task shows a 30% increase in performance relative to a baseline flat fusion.

.bib [Bredin2012a] | .pdf
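The two-level fusion driven by communities of classifiers can be sketched as follows. This is a simplified stand-in: greedy correlation thresholding replaces the Louvain algorithm, and the score matrix, threshold and shapes are all illustrative.

```python
import numpy as np

def community_fusion(scores, threshold=0.7):
    """Hierarchical fusion sketch: group classifiers whose outputs are
    strongly correlated (a crude proxy for community detection on a
    classifier-similarity graph), average within each community
    (intra-community fusion), then average the community-level
    scores (inter-community fusion).

    `scores` is an (n_classifiers, n_samples) array.
    """
    n = len(scores)
    corr = np.corrcoef(scores)
    # Greedy grouping: join the first community whose seed member
    # correlates above the threshold, otherwise start a new one.
    communities = []
    for i in range(n):
        for com in communities:
            if corr[i, com[0]] > threshold:
                com.append(i)
                break
        else:
            communities.append([i])
    intra = [scores[com].mean(axis=0) for com in communities]
    return np.mean(intra, axis=0)

# Two near-duplicate classifiers plus one complementary one: the first
# two form a community, so they are not over-weighted in the fusion.
scores = np.array([[0.9, 0.1, 0.8],
                   [0.8, 0.2, 0.9],
                   [0.1, 0.9, 0.2]])
fused = community_fusion(scores)
```

The point of the community structure is visible here: a naive flat average would let the two redundant classifiers dominate, while the two-level scheme gives each community one vote.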
Philippe Ercolessi, Christine Sénac, Hervé Bredin

CBMI 2012, 10th Workshop on Content-Based Multimedia Indexing

Multiple sub-stories usually coexist in every episode of a TV series. We propose several variants of an approach for plot de-interlacing based on scenes clustering -- with the ultimate goal of providing the end-user with tools for fast and easy overview of one episode, one season or the whole TV series. Each scene can be described in three different ways (based on color histograms, speaker diarization or automatic speech recognition outputs) and four clustering approaches are investigated, one of them based on a graphical representation of the video. Experiments are performed on two TV series of different lengths and formats. We show that semantic descriptors (such as speaker diarization) give the best results and underline that our approach provides useful information for plot de-interlacing.

.bib [Ercolessi2012] | .pdf
Johann Poignant, Hervé Bredin, Viet-Bac Le, Laurent Besacier, Claude Barras, Georges Quénot

Interspeech 2012, 13th Annual Conference of the International Speech Communication Association

We propose an approach for unsupervised speaker identification in TV broadcast videos, by combining acoustic speaker diarization with person names obtained via video OCR from overlaid texts. Three methods for the propagation of the overlaid names to the speech turns are compared, taking into account the co-occurrence duration between the speaker clusters and the names provided by the video OCR and using a task-adapted variant of the TF-IDF information retrieval coefficient. These methods were tested on the REPERE dry-run evaluation corpus, containing 3 hours of annotated videos. Our best unsupervised system reaches an F-measure of 70.2% when considering all the speakers, and 81.7% if anchor speakers are left out. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models trained on matching development data and additional TV and radio data only provided a 57.5% F-measure when considering all the speakers and 45.7% without anchors.

.bib [Poignant2012] | .pdf
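The simplest propagation strategy (assigning to each speaker cluster the overlaid name it co-occurs with longest) can be sketched as follows; cluster labels, names and durations are hypothetical, and the TF-IDF variant described in the paper is not shown:

```python
def propagate_names(cooccurrence):
    """Direct name propagation from overlaid texts to speaker clusters.

    `cooccurrence` maps each speaker cluster to a dict of
    {overlaid name: co-occurrence duration in seconds}; each cluster
    takes the name it co-occurs with longest, and clusters that never
    co-occur with any name stay unnamed (None).
    """
    identities = {}
    for cluster, durations in cooccurrence.items():
        identities[cluster] = (
            max(durations, key=durations.get) if durations else None
        )
    return identities

# Hypothetical diarization clusters and OCR co-occurrence durations.
cooccurrence = {
    "spk1": {"Alice Martin": 12.4, "Bob Durand": 1.2},
    "spk2": {"Bob Durand": 7.8},
    "spk3": {},
}
names = propagate_names(cooccurrence)
```

This is fully unsupervised: no acoustic speaker model is trained, which is what allows the approach to outperform the supervised baseline reported above on non-anchor speakers.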
Philippe Ercolessi, Christine Sénac, Hervé Bredin, Sandrine Mouysset

Document Numérique -- Numéro Spécial ``Résumé Automatique des Documents''

Modern TV series have complex plots made of several intertwined stories following numerous characters. In this paper, we propose an approach for automatically detecting these stories in order to generate video summaries, and we propose a visualization tool to have a quick and easy look at TV series. Based on an automatic scene segmentation of each TV series episode (a scene is defined as temporally and spatially continuous and semantically coherent), scenes are clustered into stories, made of (not necessarily adjacent) semantically similar scenes. Visual, audio and text modalities are combined to achieve better scene segmentation and story detection performance. An extraction of salient scenes from stories is performed to create the summary. Experiments are conducted on two TV series with different formats.

.bib [Ercolessi2012a] | .pdf
Hervé Bredin, Johann Poignant, Makarand Tapaswi, Guillaume Fortier, Viet Bac Le, Thibault Napoleon, Hua Gao, Claude Barras, Sophie Rosset, Laurent Besacier, Jakob Verbeek, Georges Quénot, Frédéric Jurie, Hazim Kemal Ekenel

ECCV 2012, Workshop on Information Fusion in Computer Vision for Concept Recognition

The REPERE challenge is a project aiming at the evaluation of systems for supervised and unsupervised multimodal recognition of people in TV broadcast. In this paper, we describe, evaluate and discuss QCompere consortium submissions to the 2012 REPERE evaluation campaign dry-run. Speaker identification (and face recognition) can be greatly improved when combined with name detection through video optical character recognition. Moreover, we show that unsupervised multimodal person recognition systems can achieve performance nearly as good as supervised monomodal ones (with several hundreds of identity models).

.bib [Bredin2012b] | .pdf
Tiberius Strat, Alexandre Benoit, Hervé Bredin, Georges Quénot, Patrick Lambert

ECCV 2012, Workshop on Information Fusion in Computer Vision for Concept Recognition

We deal with the issue of combining dozens of classifiers into a better one, for concept detection in videos. We compare three fusion approaches that share a common structure: they all start with a classifier clustering stage, continue with an intra-cluster fusion and end with an inter-cluster fusion. The main difference between them comes from the first stage. The first approach relies on a priori knowledge about the internals of each classifier (low-level descriptors and classification algorithm) to group the set of available classifiers by similarity. The second and third approaches obtain classifier similarity measures directly from their output and group them using agglomerative clustering for the second approach and community detection for the third one.

.bib [Strat2012] | .pdf
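The shared fuse-within-then-across-clusters structure of these approaches can be sketched in a few lines. The snippet below is only an illustrative reconstruction, not the paper's code: it groups toy classifier score vectors by output correlation with single-link merging, averages within each group (intra-cluster fusion), then averages the group outputs (inter-cluster fusion); the 0.9 similarity threshold and the toy scores are arbitrary choices.

```python
import numpy as np

def late_fusion(scores, sim_threshold=0.9):
    """Two-stage late fusion sketch: group classifiers whose score
    vectors are strongly correlated, average inside each group,
    then average the group outputs. Threshold is illustrative."""
    n = len(scores)
    # Single-link grouping on pairwise Pearson correlation (union-find).
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if np.corrcoef(scores[i], scores[j])[0, 1] >= sim_threshold:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(scores[i])
    intra = [np.mean(c, axis=0) for c in clusters.values()]  # intra-cluster fusion
    return np.mean(intra, axis=0)                            # inter-cluster fusion

# Three toy classifiers: two near-duplicates and one complementary one.
s1 = np.array([0.9, 0.1, 0.8, 0.2])
s2 = np.array([0.8, 0.2, 0.9, 0.1])
s3 = np.array([0.1, 0.9, 0.7, 0.6])
fused = late_fusion([s1, s2, s3])
```

Note the design benefit over a flat average: the two redundant classifiers are merged first, so they count once rather than twice in the final fusion.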
Philippe Ercolessi, Christine Sénac, Sandrine Mouysset, Hervé Bredin

AMVA 2012, 1st ACM International Workshop on Audio and Multimedia Methods for Large-Scale Video Analysis at ACM Multimedia 2012

Since the 1990s, TV series have tended to introduce more and more main characters, and they are often composed of multiple intertwined stories. In this paper, we propose a hierarchical plot de-interlacing framework that clusters semantic scenes into stories: a story is a group of scenes, not necessarily contiguous, showing a strong semantic relation. Each scene is described using three different modalities (based on color histograms, speaker diarization or automatic speech recognition outputs) as well as their multimodal combination. We introduce the notion of character-driven episodes as episodes where stories are emphasized by the presence or absence of characters, and we propose an automatic method, based on a social graph, to detect these episodes. Depending on whether an episode is character-driven or not, the plot de-interlacing (which is a scene clustering) is performed either through a traditional average-link agglomerative clustering with the speaker modality only, or through a spectral clustering with the fusion of all modalities. Experiments, conducted on twenty-three episodes from three quite different TV series (with different lengths and formats), show that the hierarchical framework brings an improvement for all the series.

.bib [Ercolessi2012b] | .pdf
Philippe Ercolessi, Hervé Bredin, Christine Sénac

ACM MM 2012, 20th ACM International Conference on Multimedia

Recent TV series tend to have more and more complex plots. They follow the lives of numerous characters and are made of multiple intertwined stories. In this paper, we introduce StoViz, a web-based interface allowing a fast overview of this kind of episode structure, based on our plot de-interlacing system. StoViz has two main goals. First, it provides the user with a useful overview of the episode by displaying each story separately, along with a short abstract extracted from it. Second, it allows an efficient visual comparison of the output of any automatic plot de-interlacing algorithm with the manual story annotation, and is therefore very helpful for evaluation purposes. StoViz is available online at http://stoviz.niderb.fr.

.bib [Ercolessi2012c] | .pdf

2011

David Gorisse, Frédéric Precioso, Philippe Gosselin, Lionel Granjon, Denis Pellerin, Michèle Rombaut, Hervé Bredin, Lionel Koenig, Rémi Vieux, Boris Mansencal, Jenny Benois-Pineau, Hugo Boujut, Claire Morand, Hervé Jégou, Stéphane Ayache, Bahjat Safadi, Yubing Tong, Franck Thollard, Georges Quénot, Matthieu Cord, Alexandre Benoît, Patrick Lambert

TRECVid 2010, TREC Video Retrieval Evaluation Online Proceedings

The IRIM group is a consortium of French teams working on Multimedia Indexing and Retrieval. This paper describes our participation in the TRECVID 2010 semantic indexing and instance search tasks. For the semantic indexing task, we evaluated a number of different descriptors and tried different fusion strategies, in particular hierarchical fusion. The best IRIM run has a Mean Inferred Average Precision of 0.0442, which is above the task median performance. We found that fusing the classification scores from different classifier types improves performance, and that audio descriptors can help even with quite low individual performance. For the instance search task, we used only one of the example images in our queries. Our run ranks near the middle of the list of participants. The experiment showed that HSV features outperform both the concatenation of HSV and edge histograms and the wavelet features.

.bib [Gorisse2011] | .pdf
Philippe Ercolessi, Hervé Bredin, Christine Sénac, Philippe Joly

WIAMIS 2011, 12th International Workshop on Image Analysis for Multimedia Interactive Services

In this paper, we propose a novel approach to scene segmentation of TV series. Using the output of our existing speaker diarization system, any temporal segment of the video can be described as a binary feature vector. A straightforward segmentation algorithm then groups similar contiguous speaker segments into scenes. An additional visual-only, color-based segmentation is then used to refine this first segmentation. Experiments are performed on a subset of the Ally McBeal TV series and show promising results, obtained with a rule-free and generic method. For comparison purposes, the test corpus annotations and description are made available to the community.

.bib [Ercolessi2011] | .pdf
Mathieu Ramona, Sébastien Fenet, Raphaël Blouet, Hervé Bredin, Thomas Fillon, Geoffroy Peeters

Applied Artificial Intelligence

This paper presents the first public framework for the evaluation of audio fingerprinting techniques. Although the domain of audio identification is very active, both in industry and academia, there is currently no common basis for comparing the proposed techniques, because corpora and evaluation protocols differ between authors. The framework presented here corresponds to a use case in which audio excerpts have to be detected in a radio broadcast stream. This scenario naturally provides a large variety of audio distortions, making the task a real challenge for fingerprinting systems. Scoring metrics are discussed with regard to this particular scenario. We then describe a complete evaluation framework including an audio corpus, the related ground-truth annotation, and a toolkit for computing the score metrics. An example of application of this framework, carried out during the evaluation campaign of the Quaero project, is finally detailed. This evaluation framework is publicly available for download and constitutes a simple yet thorough platform that can be used by the audio identification community to encourage reproducible results.

.bib [Ramona2011] | .pdf

2010

Bertrand Delezoide, Hervé Le Borgne, Pierre-Alain Moëllic, David Gorisse, Frédéric Precioso, Feng Wang, Bernard Merialdo, Philippe Gosselin, Lionel Granjon, Denis Pellerin, Michèle Rombaut, Hervé Bredin, Lionel Koenig, Hélène Lachambre, Elie El Khoury, Boris Mansencal, Yifan Zhou, Jenny Benois-Pineau, Hervé Jégou, Stéphane Ayache, Bahjat Safadi, Georges Quenot, Jonathan Fabrizio, Matthieu Cord, Hervé Glotin, Zhongqiu Zhao, Emilie Dumont, Bertrand Augereau

TRECVid 2009, TREC Video Retrieval Evaluation Online Proceedings

The IRIM group is a consortium of French teams working on Multimedia Indexing and Retrieval. This paper describes our participation in the TRECVID 2009 High Level Features detection task. We evaluated a large number of different descriptors (on TRECVID 2008 data) and tried different fusion strategies, in particular hierarchical fusion and genetic fusion. The best IRIM run has a Mean Inferred Average Precision of 0.1220, which is significantly above the TRECVID 2009 HLF detection task median performance. We found that fusing the classification scores from different classifier types improves performance, and that audio descriptors can help even with quite low individual performance.

.bib [Delezoide2010] | .pdf
Hervé Bredin, Lionel Koenig, Hélène Lachambre, Elie El Khoury

TRECVid 2009, TREC Video Retrieval Evaluation Online Proceedings

.bib [Bredin2010] | .pdf

2009

Hervé Bredin, Aurélien Mayoue, Gérard Chollet, Bernadette Dorizzi

Guide to Biometric Reference Systems and Performance Evaluation

.bib [Bredin2009] | .pdf
Walid Karam, Hervé Bredin, Hanna Greige, Gérard Chollet, Chafic Mokbel

EURASIP Journal on Advances in Signal Processing, Special Issue on Recent Advances in Biometric Systems: A Signal Processing Perspective

.bib [Karam2009] | .pdf
Saman H. Cooray, Hervé Bredin, Li-Qun Xu, Noel E. O'Connor

ACM MM 2009, 17th ACM International Conference on Multimedia

.bib [Cooray2009] | .pdf

2008

Hervé Bredin, Gérard Chollet

ICASSP 2008, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Bredin2008] | .pdf
Benoît Fauve, Hervé Bredin, Walid Karam, Florian Verdet, Aurélien Mayoue, Gérard Chollet, Jean Hennebert, R. Lewis, John Mason, Chafic Mokbel, Dijana Petrovska

ICASSP 2008, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Fauve2008] | .pdf
Emilie Dumont, Bernard Merialdo, Slim Essid, Werner Bailer, Herwig Rehatschek, Daragh Byrne, Hervé Bredin, Noel O'Connor, Gareth JF Jones, Alan F Smeaton, Martin Haller, Andreas Krutz, Thomas Sikora, Tomas Piatrik

TRECVID 2008, ACM International Conference on Multimedia Information Retrieval

.bib [Dumont2008] | .pdf
Hervé Bredin, Daragh Byrne, Hyowon Lee, Noel O'Connor, Gareth JF Jones

TRECVID 2008, ACM International Conference on Multimedia Information Retrieval 2008

.bib [Bredin2008a] | .pdf
Emilie Dumont, Bernard Merialdo, Slim Essid, Werner Bailer, Daragh Byrne, Hervé Bredin, Noel O'Connor, Gareth JF Jones, Martin Haller, Andreas Krutz, Thomas Sikora, Tomas Piatrik

SAMT 2008, 3rd International Conference on Semantic and Digital Media Technologies

.bib [Dumont2008a] | .pdf

2007

Hervé Bredin, Gérard Chollet

EURASIP Journal on Advances in Signal Processing, Special Issue on Knowledge-Assisted Media Analysis for Interactive Multimedia Applications

.bib [Bredin2007] | .pdf
Hervé Bredin, Gérard Chollet

ICASSP 2007, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Bredin2007a] | .pdf
Rémi Landais, Hervé Bredin, Leila Zouari, Gérard Chollet

Traitement et Analyse de l'Information : Méthodes et Applications

.bib [Landais2007] | .pdf
Enrique Argones-Rúa, Carmen García-Mateo, Hervé Bredin, Gérard Chollet

1st Spanish Workshop on Biometrics

.bib [Argones-Rua2007] | .pdf
Patrick Perrot, Hervé Bredin, Gérard Chollet

2007 International Crime Science Conference

.bib [Perrot2007] | .pdf
Enrique Argones-Rúa, Hervé Bredin, Carmen García-Mateo, Gérard Chollet, Daniel González-Jiménez

Pattern Analysis and Applications Journal

.bib [Argones-Rua2007a] | .pdf
Gérard Chollet, Rémi Landais, Hervé Bredin, Thomas Hueber, Chafic Mokbel, Patrick Perrot, Leila Zouari

Non-Linear Speech Processing

.bib [Chollet2007] | .pdf
Bouchra Abboud, Hervé Bredin, Guido Aversano, Gérard Chollet

Progress in Nonlinear Speech Processing

.bib [Abboud2007] | .pdf

2006

Hervé Bredin, Antonio Miguel, Ian Witten, Gérard Chollet

ICASSP 2006, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Bredin2006] | .pdf
Jacques Koreman, Andrew C Morris, D. Wu, Sabah Jassim, Harin Sellahewa, J. Ehlers, Gérard Chollet, Guido Aversano, Hervé Bredin, Sonia Garcia-Salicetti, Lorène Allano, Bao Ly Van, Bernadette Dorizzi

MMUA 2006, Workshop on Multimodal User Authentication

.bib [Koreman2006] | .pdf
Hervé Bredin, Guido Aversano, Chafic Mokbel, Gérard Chollet

MMUA 2006, Workshop on Multimodal User Authentication

.bib [Bredin2006a] | .pdf
Fabian Brugger, Leila Zouari, Hervé Bredin, Asma Amehraye, Gérard Chollet, Dominique Pastor, Yang Ni

JEP 2006, Journées d'Etudes sur la Parole

.bib [Brugger2006] | .pdf
Hervé Bredin, Najim Dehak, Gérard Chollet

ICPR 2006, IAPR International Conference on Pattern Recognition

.bib [Bredin2006b] | .pdf
Hervé Bredin, Gérard Chollet

VIE 2006, IEE International Conference on Visual Information Engineering

.bib [Bredin2006c] | .pdf

2005

Kevin McTait, Hervé Bredin, Silvia Colón, Thomas Fillon, Gérard Chollet

ISISPA 2005, International Symposium on Image and Signal Processing and Analysis

.bib [McTait2005] | .pdf

2014

Anindya Roy, Hervé Bredin, William Hartmann, Viet-Bac Le, Claude Barras, Jean-Luc Gauvain

Multimedia Tools and Applications

Lexical information extracted from speech transcripts can be used for speaker identification (SID), either on its own or, upon fusion, to improve the performance of standard cepstral-based SID systems. This was previously established using isolated speech from single speakers (NIST SRE corpora, parliamentary speeches). In contrast, this work applies lexical approaches for SID to a different type of data: the REPERE corpus, consisting of unsegmented multiparty conversations, mostly debates, discussions and Q&A sessions from TV shows. It is hypothesized that people give out clues to their identity when speaking in such settings, which this work aims to exploit. The impact on SID performance of the diarization front-end required to pre-process the unsegmented data is also measured. Four lexical SID approaches are studied in this work, including TFIDF, BM25 and LDA-based topic modeling. Results are analysed in terms of TV shows and speaker roles. Lexical approaches achieve low error rates for certain speaker roles such as anchors and journalists, sometimes lower than a standard cepstral-based Gaussian Supervector -- Support Vector Machine (GSV-SVM) system. Also, in certain cases, the lexical system shows modest improvement over the cepstral-based system using score-level sum fusion. To highlight the potential of using lexical information not just to improve upon cepstral-based SID systems but as an independent approach in its own right, initial studies on cross-media SID are briefly reported. Instead of using speech data, as all cepstral systems require, this approach uses Wikipedia texts to train lexical speaker models, which are then tested on speech transcripts to identify speakers.

.bib [Roy2014] | .pdf
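As a toy illustration of the lexical route, a TFIDF speaker model is just a term-weight vector per speaker, and identification reduces to nearest-model search under cosine similarity. The sketch below is not the paper's system: the two transcripts and role names are invented, and the IDF smoothing is one arbitrary choice among many.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors from a dict {speaker: training transcript}."""
    tfs = {spk: Counter(text.lower().split()) for spk, text in docs.items()}
    n = len(docs)
    df = Counter()
    for tf in tfs.values():
        df.update(tf.keys())                       # document frequency
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}  # smoothed IDF (arbitrary)
    models = {spk: {w: c * idf[w] for w, c in tf.items()} for spk, tf in tfs.items()}
    return models, idf

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def identify(test_text, models, idf):
    """Return the speaker whose TF-IDF model is closest to the test transcript."""
    tf = Counter(test_text.lower().split())
    vec = {w: c * idf.get(w, 0.0) for w, c in tf.items()}
    return max(models, key=lambda spk: cosine(vec, models[spk]))

# Hypothetical training transcripts for two speaker roles.
train = {
    "anchor": "welcome back to the show tonight our guest",
    "expert": "the model parameters are estimated from data",
}
models, idf = tfidf_vectors(train)
```

With these invented transcripts, a test utterance sharing vocabulary with a role's training text is assigned to that role, which is the intuition behind roles such as anchors being easy targets for lexical SID.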
Hervé Bredin, Anindya Roy, Viet-Bac Le, Claude Barras

International Journal of Multimedia Information Retrieval

This work introduces a unified framework for mono-, cross- and multi-modal person recognition in multimedia data. Dubbed Person Instance Graph, it models the person recognition task as a graph mining problem: i.e. finding the best mapping between person instance vertices and identity vertices. Practically, we describe how the approach can be applied to speaker identification in TV broadcast. Then, a solution to the above-mentioned mapping problem is proposed. It relies on Integer Linear Programming to model the problem of clustering person instances based on their identity. We provide an in-depth theoretical definition of the optimization problem. Moreover, we improve two fundamental aspects of our previous related work: the problem constraints and the optimized objective function. Finally, a thorough experimental evaluation of the proposed framework is performed on a publicly available benchmark database. Depending on the graph configuration (i.e. the choice of its vertices and edges), we show that multiple tasks can be addressed interchangeably (e.g. speaker diarization, supervised or unsupervised speaker identification), significantly outperforming state-of-the-art mono-modal approaches.

.bib [Bredin2014] | .pdf

2012

Philippe Ercolessi, Christine Sénac, Hervé Bredin, Sandrine Mouysset

Document Numérique -- Numéro Spécial "Résumé Automatique des Documents"

Modern TV series have complex plots made of several intertwined stories following numerous characters. In this paper, we propose an approach for automatically detecting these stories in order to generate video summaries, and we propose a visualization tool for a quick and easy overview of TV series. Based on an automatic scene segmentation of each TV series episode (a scene is defined as temporally and spatially continuous and semantically coherent), scenes are clustered into stories, made of (not necessarily adjacent) semantically similar scenes. Visual, audio and text modalities are combined to achieve better scene segmentation and story detection performance. Salient scenes are extracted from the stories to create the summary. Experiments are conducted on two TV series with different formats.

.bib [Ercolessi2012a] | .pdf

2011

Mathieu Ramona, Sébastien Fenet, Raphaël Blouet, Hervé Bredin, Thomas Fillon, Geoffroy Peeters

Applied Artificial Intelligence

This paper presents the first public framework for the evaluation of audio fingerprinting techniques. Although the domain of audio identification is very active, both in industry and academia, there is currently no common basis for comparing the proposed techniques, because corpora and evaluation protocols differ between authors. The framework presented here corresponds to a use case in which audio excerpts have to be detected in a radio broadcast stream. This scenario naturally provides a large variety of audio distortions, making the task a real challenge for fingerprinting systems. Scoring metrics are discussed with regard to this particular scenario. We then describe a complete evaluation framework including an audio corpus, the related ground-truth annotation, and a toolkit for computing the score metrics. An example of application of this framework, carried out during the evaluation campaign of the Quaero project, is finally detailed. This evaluation framework is publicly available for download and constitutes a simple yet thorough platform that can be used by the audio identification community to encourage reproducible results.

.bib [Ramona2011] | .pdf

2009

Walid Karam, Hervé Bredin, Hanna Greige, Gérard Chollet, Chafic Mokbel

EURASIP Journal on Advances in Signal Processing, Special Issue on Recent Advances in Biometric Systems: A Signal Processing Perspective

.bib [Karam2009] | .pdf

2007

Hervé Bredin, Gérard Chollet

EURASIP Journal on Advances in Signal Processing, Special Issue on Knowledge-Assisted Media Analysis for Interactive Multimedia Applications

.bib [Bredin2007] | .pdf
Enrique Argones-Rúa, Hervé Bredin, Carmen García-Mateo, Gérard Chollet, Daniel González-Jiménez

Pattern Analysis and Applications Journal

.bib [Argones-Rua2007a] | .pdf

2014

Sabin Tiberius Strat, Alexandre Benoit, Patrick Lambert, Hervé Bredin, Georges Quénot

Fusion in Computer Vision -- Understanding Complex Visual Content

Current research shows that the detection of semantic concepts (animal, bus, person, dancing, etc.) in multimedia documents such as videos requires the use of several types of complementary descriptors in order to achieve good results. In this work, we explore strategies for combining dozens of complementary content descriptors (or "experts") in an efficient way, through the use of late fusion approaches, for concept detection in multimedia documents. We explore two fusion approaches that share a common structure: both start with a clustering-of-experts stage, continue with an intra-cluster fusion and finish with an inter-cluster fusion; we also experiment with other state-of-the-art methods. The first fusion approach relies on a priori knowledge about the internals of each expert to group the set of available experts by similarity. The second approach automatically obtains expert similarity measures from their output, groups the experts using agglomerative clustering, and then combines the results of this fusion with those from other methods. In the end, we show that an additional performance boost can be obtained by also considering the context of multimedia elements.

.bib [Strat2014] | .pdf

2009

Hervé Bredin, Aurélien Mayoue, Gérard Chollet, Bernadette Dorizzi

Guide to Biometric Reference Systems and Performance Evaluation

.bib [Bredin2009] | .pdf

2007

Gérard Chollet, Rémi Landais, Hervé Bredin, Thomas Hueber, Chafic Mokbel, Patrick Perrot, Leila Zouari

Non-Linear Speech Processing

.bib [Chollet2007] | .pdf
Bouchra Abboud, Hervé Bredin, Guido Aversano, Gérard Chollet

Progress in Nonlinear Speech Processing

.bib [Abboud2007] | .pdf

2017

Hervé Bredin

ICASSP 2017, IEEE International Conference on Acoustics, Speech, and Signal Processing

TristouNet is a neural network architecture based on Long Short-Term Memory recurrent networks, meant to project speech sequences into a fixed-dimensional Euclidean space. Thanks to the triplet loss paradigm used for training, the resulting sequence embeddings can be compared directly with the Euclidean distance, for speaker comparison purposes. Experiments on short (between 500ms and 5s) speech turn comparison and speaker change detection show that TristouNet brings significant improvements over current state-of-the-art techniques for both tasks.

.bib [Bredin2017] | .pdf
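The triplet loss idea above (same-speaker embeddings pulled closer than different-speaker ones by at least a margin) can be sketched in a few lines of numpy. This is a minimal illustration of the loss itself, not the TristouNet implementation; the margin value and the toy 3-d embeddings are invented.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on Euclidean distances: zero once the anchor-positive
    distance is smaller than the anchor-negative distance by `margin`
    (margin value is illustrative)."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

# Toy unit-norm "embeddings": same-speaker pair close, other speaker far.
a = np.array([1.0, 0.0, 0.0])
p = np.array([0.9, 0.1, 0.0])
p /= np.linalg.norm(p)
n = np.array([0.0, 1.0, 0.0])

loss = triplet_loss(a, p, n)  # satisfied triplet: loss is zero
```

A training set of such triplets is what makes the learned embeddings directly comparable with the plain Euclidean distance at test time, as the abstract states.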

2016

Johann Poignant, Mateusz Budnik, Hervé Bredin, Claude Barras, Mickaël Stefas, Pierrick Bruneau, Gilles Adda, Laurent Besacier, Hazim Ekenel, Gil Francopoulo, Javier Hernando, Joseph Mariani, Ramon Morros, Georges Quénot, Sophie Rosset, Thomas Tamisier

In this paper, we describe the organization and the implementation of the CAMOMILE collaborative annotation framework for multimodal, multimedia, multilingual (3M) data. Given the versatile nature of the analyses that can be performed on 3M data, the structure of the server was kept intentionally simple in order to preserve its genericity, relying on standard Web technologies. Layers of annotations, defined as data associated with a media fragment from the corpus, are stored in a database and can be managed through standard interfaces with authentication. Interfaces tailored specifically to the task at hand can then be developed in an agile way, relying on simple but reliable services for the management of the centralized annotations. We then present our implementation of an active learning scenario for person annotation in video, relying on the CAMOMILE server; during a dry run experiment, the manual annotation of 716 speech segments was thus propagated to 3504 labeled tracks. The code of the CAMOMILE framework is distributed as open source.

.bib [Poignant2016b] | .pdf
Johann Poignant, Hervé Bredin, Claude Barras, Mickaël Stefas, Pierrick Bruneau, Thomas Tamisier

In this paper, we claim that the CAMOMILE collaborative annotation platform (developed in the framework of the eponymous CHIST-ERA project) eases the organization of multimedia technology benchmarks, automating most of the campaign technical workflow and enabling collaborative (hence faster and cheaper) annotation of the evaluation data. This is demonstrated through the successful organization of a new multimedia task at MediaEval 2015, Multimodal Person Discovery in Broadcast TV.

.bib [Poignant2016] | .pdf
Hervé Bredin, Grégory Gelly

ACM MM 2016, 24th ACM International Conference on Multimedia

While successful on broadcast news, meetings or telephone conversations, state-of-the-art speaker diarization techniques tend to perform poorly on TV series or movies. In this paper, we propose to rely on state-of-the-art face clustering techniques to guide acoustic speaker diarization. Two approaches are tested and evaluated on the first season of the Game of Thrones TV series. The second (better) approach relies on a novel talking-face detection module based on a bidirectional long short-term memory recurrent neural network. Both audio-visual approaches outperform the audio-only baseline. A detailed study of the behavior of these approaches is also provided and paves the way for future improvements.

.bib [Bredin2016] | .pdf
Pierrick Bruneau, Mickaël Stefas, Johann Poignant, Hervé Bredin, Claude Barras

ISM 2016, 12th IEEE International Symposium on Multimedia

Part of the research effort in automatic person discovery in multimedia content consists in analyzing the errors made by algorithms. However, exploring the space of models relating algorithmic errors in person discovery to intrinsic properties of the associated shots (e.g. person facing the camera), coined as post-hoc analysis in this paper, requires data curation and statistical model tuning, which can be cumbersome. In this paper we present a visual and interactive tool that facilitates this exploration. Adequate statistical building blocks are defined and coordinated by visual and interactive components inspired by the information visualization literature. A case study is conducted with multimedia researchers to validate the tool. Real data obtained from the MediaEval person discovery task was used for this experiment. Our approach yielded novel insight that was completely unsuspected previously.

.bib [Bruneau2016] | .pdf

2015

Delphine Charlet, Johann Poignant, Hervé Bredin, Corinne Fredouille, Sylvain Meignier

ERRARE 2015, Second Workshop on Errors By Humans and Machines in Multimedia, Multimodal, and Multilingual Data Processing

Speaker identification approaches for TV broadcast are usually evaluated and compared based on global error rates derived from the overall duration of missed detections, false alarms and confusions. Based on the analysis of the output of the systems submitted to the final round of the French evaluation campaign REPERE, this paper highlights the fact that these average metrics lead to the incorrect intuition that current state-of-the-art algorithms partially recognize all speakers. Setting aside incorrect diarization and adverse acoustic conditions, we show that their performance is in fact essentially bi-modal: in a given show, either all speech turns of a speaker are correctly identified or none of them are. We then proceed to understand and explain this behavior through performance prediction experiments. These experiments show that the most discriminant speaker characteristics are -- first -- their total speech duration in the current show and -- then only -- the amount of training data available to build their acoustic model.

.bib [Charlet2015] | .pdf
Elena Knyazeva, Guillaume Wisniewski, Hervé Bredin, François Yvon

Interspeech 2015, 16th Annual Conference of the International Speech Communication Association

Though radio and TV broadcasts are highly structured documents, state-of-the-art speaker identification algorithms do not take advantage of this information to improve prediction performance: speech turns are usually identified independently from each other, using unstructured multi-class classification approaches. In this work, we propose to address speaker identification as a sequence labeling task and use two structured prediction techniques to account for the inherent temporal structure of interactions between speakers: the first one relies on Conditional Random Fields and can take into account local relations between two consecutive speech turns; the second one, based on the SEARN framework, sacrifices exact inference for the sake of model expressiveness and is able to incorporate rich structural information during prediction. Experiments performed on The Big Bang Theory TV series show that structured prediction techniques outperform the standard unstructured approach.

.bib [Knyazeva2015] | .pdf
Mateusz Budnik, Laurent Besacier, Johann Poignant, Hervé Bredin, Claude Barras, Mickaël Stefas, Pierrick Bruneau, Thomas Tamisier

Interspeech 2015, 16th Annual Conference of the International Speech Communication Association

This paper presents a collaborative annotation framework for person identification in TV shows. The web annotation front-end will be demonstrated during the Show and Tell session. All the code for annotation is made available on github. The tool can also be used in a crowd-sourcing environment.

.bib [Budnik2015] | .pdf
Johann Poignant, Hervé Bredin, Claude Barras

MediaEval 2015

We describe the "Multimodal Person Discovery in Broadcast TV" task of the MediaEval 2015 benchmarking initiative. Participants are asked to return the names of people who can be both seen and heard in every shot of a collection of videos. The list of people is not known a priori and their names must be discovered in an unsupervised way from the media content, using text overlays or speech transcripts. The task is evaluated using information retrieval metrics, based on a posteriori collaborative annotation of the test corpus.

.bib [Poignant2015] | .pdf
Johann Poignant, Hervé Bredin, Claude Barras

MediaEval 2015

This paper describes the algorithm tested by the LIMSI team in the MediaEval 2015 Person Discovery in Broadcast TV task. For this task, we used an audio/video diarization process constrained by names written on screen. These names are used both to identify clusters and to prevent the fusion of two clusters with different co-occurring names. This method achieved an EwMAP of 83.1%, tuned on the out-of-domain development corpus.

.bib [Poignant2015a] | .pdf
Pierrick Bruneau, Mickaël Stefas, Hervé Bredin, Johann Poignant, Thomas Tamisier, Claude Barras

ICMI 2015, 17th International Conference on Multimodal Interaction

Classification quality criteria such as precision, recall, and F-measure are generally the basis for evaluating contributions in automatic speaker recognition. Specifically, comparisons are carried out mostly via mean values estimated on a set of media. While this approach is relevant for assessing improvement w.r.t. the state of the art, or for ranking participants in the context of an automatic annotation challenge, it gives little insight to system designers in terms of cues for improving algorithms, hypothesis formulation, and evidence display. This paper presents a design study of a visual and interactive approach to analyzing errors made by automatic annotation algorithms. A timeline-based tool emerged from prior steps of this study. A critical review, driven by user interviews, exposes caveats and refines user objectives. The next step of the study is then initiated by sketching designs that combine elements of the current prototype with principles newly identified as relevant.

.bib [Bruneau2015] | .pdf

2014

Anindya Roy, Camille Guinaudeau, Hervé Bredin, Claude Barras

LREC 2014, 9th Language Resources and Evaluation Conference

We present a new dataset built around two TV series, The Big Bang Theory (a situation comedy) and Game of Thrones (a fantasy drama). It has multiple tracks including dialogue, crowd-sourced textual descriptions and metadata, all time-stamped and temporally aligned with each other. We provide tools to reproduce it for research purposes, provided that one has legally acquired the DVDs of the series. The alignment algorithm used is evaluated on a manually aligned subset of the data.

.bib [Roy2014a] | .pdf
Hervé Bredin, Antoine Laurent, Achintya Sarkar, Viet-Bac Le, Sophie Rosset, Claude Barras

Odyssey 2014, The Speaker and Language Recognition Workshop

We address the problem of named speaker identification in TV broadcast, which consists in answering the question "who speaks when?" with the real identity of speakers, using person names automatically obtained from speech transcripts. While existing approaches rely on a first speaker diarization step followed by a local name propagation step to speaker clusters, we propose a unified framework called the person instance graph, where both steps are jointly modeled as a global optimization problem, then solved using integer linear programming. Moreover, when available, acoustic speaker models can be added seamlessly to the graph structure for joint named and acoustic speaker identification, leading to a 10% absolute error decrease (from 45% down to 35%) over a state-of-the-art i-vector speaker identification system on the REPERE TV broadcast corpus.

.bib [Bredin2014a] | .pdf
Pierrick Bruneau, Mickaël Stefas, Hervé Bredin, Anh-Phuong Ta, Thomas Tamisier, Claude Barras

iV 2014, 18th International Conference Information Visualisation

Multimedia annotation algorithms infer localized metadata in multimedia content, e.g. speakers or appearing faces. There is a growing need for experts in this domain to perform advanced analyses that go beyond medium-scale quality metrics. This paper describes a novel visual tool that applies interactive visualization principles to the concerns of multimedia experts. Multiple coordinated views, augmented by interactive inspection facilities, ease the navigation in media annotations and the visual detection of relevant information. The effectiveness of the proposition is demonstrated by experimental scenarios on a real multimedia corpus.

.bib [Bruneau2014] | .pdf
Pierrick Bruneau, Mickaël Stefas, Mateusz Budnik, Johann Poignant, Hervé Bredin, Thomas Tamisier, Benoît Otjacques

CDVE 2014, 11th International Conference on Cooperative Design, Visualization and Engineering

Reference multimedia corpora for use in automated annotation algorithms demand a great deal of manual work. The Camomile project advocates the joint progress of automated annotation methods and of tools for improving the benchmark resources. This paper shows work in progress on interactive visualization of annotations, and perspectives on harnessing the collaboration between manual annotators, algorithm designers, and benchmark administrators.

.bib [Bruneau2014a] | .pdf
Camille Guinaudeau, Antoine Laurent, Hervé Bredin

MediaEval 2014

This paper provides an overview of the Social Event Detection (SED) system developed at LIMSI for the 2014 campaign. Our approach is based on a hierarchical agglomerative clustering that uses textual metadata, user-based knowledge and geographical information. These different sources of knowledge, used either separately or in cascade, achieve good results on the full clustering subtask, with a normalized mutual information equal to 0.95 and F1 scores greater than 0.82 for our best run.

.bib [Guinaudeau2014] | .pdf
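The normalized mutual information score quoted above compares a predicted clustering against reference labels. A minimal pure-Python sketch (using arithmetic-mean normalization, one of several common variants; function and variable names are illustrative, not from the paper):

```python
from collections import Counter
from math import log

def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings of the same items."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    # mutual information between the two partitions
    mi = sum((nij / n) * log(n * nij / (count_true[a] * count_pred[b]))
             for (a, b), nij in joint.items())
    # entropy of a partition, from its cluster sizes
    entropy = lambda c: -sum((v / n) * log(v / n) for v in c.values())
    h_true, h_pred = entropy(count_true), entropy(count_pred)
    if h_true + h_pred == 0.0:  # both partitions trivial: identical by convention
        return 1.0
    return 2.0 * mi / (h_true + h_pred)  # arithmetic-mean normalization
```

A perfect clustering scores 1.0 regardless of how clusters are labeled, and an uninformative one scores 0.0, which is what makes the metric suitable for the event-clustering subtask described above.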
Hervé Bredin, Anindya Roy, Nicolas Pécheux, Alexandre Allauzen

ACM MM 2014, 22nd ACM International Conference on Multimedia

We address the problem of speaker identification in multimedia data, and TV series in particular. While speaker identification is traditionally a supervised machine-learning task, our first contribution is to significantly reduce the need for costly preliminary manual annotations through the use of automatically aligned (and potentially noisy) fan-generated transcripts and subtitles. We show that both speech activity detection and speech turn identification modules trained in this weakly supervised manner achieve performance similar to their fully supervised counterparts (i.e. relying on fine manual speech/non-speech/speaker annotation). Our second contribution relates to the use of the multilingual audio tracks usually available with this kind of content to significantly improve the overall speaker identification performance. Reproducible experiments (including dataset, manual annotations and source code) performed on the first six episodes of The Big Bang Theory TV series show that combining the French audio track (containing dubbed actor voices) with the English one (with the original actor voices) improves the overall English speaker identification performance by 5% absolute and up to 70% relative on the five main characters.

.bib [Bredin2014b] | .pdf

2013

Hervé Bredin, Johann Poignant

Interspeech 2013, 14th Annual Conference of the International Speech Communication Association

Most state-of-the-art approaches address speaker diarization as a hierarchical agglomerative clustering problem in the audio domain. In this paper, we propose to revisit one of them: speech turn clustering based on the Bayesian Information Criterion (a.k.a. BIC clustering). First, we show how to model it as an integer linear programming (ILP) problem. Its resolution leads to the same overall diarization error rate as standard BIC clustering but generates significantly purer speaker clusters. Then, we describe how this approach can easily be extended to the audiovisual domain and TV broadcast in particular. The straightforward integration of detected overlaid names (used to introduce guests or journalists, and obtained via video OCR) into a multimodal ILP problem yields significantly better speaker diarization results. Finally, we explain how this novel paradigm can incidentally be used for unsupervised speaker identification (i.e. not relying on any prior acoustic speaker models). Experiments on the REPERE TV broadcast corpus show that it achieves performance close to that of an oracle capable of identifying any speaker as long as their name appears on screen at least once in the video.

.bib [Bredin2013] | .pdf
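In generic form (hypothetical notation, not necessarily the paper's exact formulation), clustering as ILP introduces binary variables $x_{ij}$ attaching speech turn $i$ to cluster centre $j$, and trades off the number of clusters against intra-cluster dissimilarity $d$:

```latex
\begin{aligned}
\min\quad & \sum_{j} x_{jj} \;+\; \frac{1}{\delta} \sum_{i,j} d(i,j)\, x_{ij} \\
\text{s.t.}\quad & \sum_{j} x_{ij} = 1 \quad \forall i
  && \text{(each speech turn belongs to exactly one cluster)} \\
& x_{ij} \le x_{jj} \quad \forall i,j
  && \text{(turns may only attach to open centres)} \\
& x_{ij} \in \{0, 1\}
\end{aligned}
```

Here $\delta$ is a dissimilarity threshold playing the role the stopping criterion plays in standard agglomerative BIC clustering; a globally optimal solution is obtained in one shot rather than through greedy merges.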
Hervé Bredin, Johann Poignant, Guillaume Fortier, Makarand Tapaswi, Viet-Bac Le, Anindya Roy, Claude Barras, Sophie Rosset, Achintya Sarkar, Qian Yang, Hua Gao, Alexis Mignon, Jakob Verbeek, Laurent Besacier, Georges Quénot, Hazim Kemal Ekenel, Rainer Stiefelhagen

SLAM 2013, First Workshop on Speech, Language and Audio for Multimedia

We describe QCompere consortium submissions to the REPERE 2013 evaluation campaign. The REPERE challenge aims at gathering four communities (face recognition, speaker identification, optical character recognition and named entity detection) towards the same goal: multimodal person recognition in TV broadcast. First, four mono-modal components are introduced (one for each foregoing community) constituting the elementary building blocks of our various submissions. Then, depending on the target modality (speaker or face recognition) and on the task (supervised or unsupervised recognition), four different fusion techniques are introduced: they can be summarized as propagation-, classifier-, rule- or graph-based approaches. Finally, their performance is evaluated on the REPERE 2013 test set and their advantages and limitations are discussed.

.bib [Bredin2013a] | .pdf
Johann Poignant, Hervé Bredin, Laurent Besacier, Georges Quénot, Claude Barras

SLAM 2013, First Workshop on Speech, Language and Audio for Multimedia

Existing methods for unsupervised identification of speakers in TV broadcast usually rely on the output of a speaker diarization module and try to name each cluster using names provided by another source of information: we call it "late naming". Hence, written names extracted from title blocks tend to lead to high-precision identification, although they cannot correct errors made during the clustering step. In this paper, we extend our previous "late naming" approach in two ways: "integrated naming" and "early naming". While "late naming" relies on a speaker diarization module optimized for speaker diarization, "integrated naming" jointly optimizes speaker diarization and name propagation in terms of identification errors. "Early naming" modifies the speaker diarization module by adding constraints preventing two clusters with different written names from being merged together. While "integrated naming" yields identification performance similar to "late naming" (with better precision), "early naming" improves over this baseline both in terms of identification error rate and stability of the clustering stopping criterion.

.bib [Poignant2013] | .pdf

2012

Bertrand Delezoide, Frédéric Precioso, Philippe Gosselin, Miriam Redi, Bernard Mérialdo, Lionel Granjon, Denis Pellerin, Michèle Rombaut, Hervé Jégou, Rémi Vieux, Boris Mansencal, Jenny Benois-Pineau, Stéphane Ayache, Bahjat Safadi, Franck Thollard, Georges Quénot, Hervé Bredin, Matthieu Cord, Alexandre Benoit, Patrick Lambert, Tiberius Strat, Joseph Razik, Sébastien Paris, Hervé Glotin

TRECVid 2011, TREC Video Retrieval Evaluation Online Proceedings

The IRIM group is a consortium of French teams working on Multimedia Indexing and Retrieval. This paper describes its participation in the TRECVID 2011 semantic indexing and instance search tasks. For the semantic indexing task, our approach uses a six-stage processing pipeline to compute scores for the likelihood of a video shot to contain a target concept. These scores are then used to produce a ranked list of images or shots that are the most likely to contain the target concept. The pipeline is composed of the following steps: descriptor extraction, descriptor optimization, classification, fusion of descriptor variants, higher-level fusion, and re-ranking. We evaluated a number of different descriptors and tried different fusion strategies. The best IRIM run has a Mean Inferred Average Precision of 0.1387, which ranked us 5th out of 19 participants. For the instance search task, we used both object-based and frame-based queries. We formulated the query in the standard way, as a comparison of visual signatures either of the object with parts of database frames, or of the query and database frames. To produce visual signatures we used two approaches: the first is the baseline Bag-of-Visual-Words (BOVW) model based on the SURF interest point descriptor; the second is a Bag-of-Regions (BOR) model that extends the traditional BOVW vocabulary from keypoint-based descriptors to region-based descriptors.

.bib [Delezoide2012] | .pdf
Hervé Bredin

ICASSP 2012, IEEE International Conference on Acoustics, Speech, and Signal Processing

We investigate the use of speaker diarization (SD) and automatic speech recognition (ASR) for the segmentation of audiovisual documents into scenes. We introduce multiple monomodal and multimodal approaches based on a state-of-the-art algorithm called generalized scene transition graph (GSTG). First, we extend the latter with the use of semantic information derived from both SD and ASR. Then, multimodal fusion of color histograms, SD and ASR is investigated at various points of the GSTG pipeline (early, late or intermediate fusion). Experiments conducted on a few episodes of a popular TV show indicate that SD and ASR can be successfully combined with visual information and bring an additional +11% relative increase in F-measure for scene boundary detection over the state-of-the-art baseline.

.bib [Bredin2012] | .pdf
Hervé Bredin

ICASSP 2012, IEEE International Conference on Acoustics, Speech, and Signal Processing

We deal with the issue of combining dozens of classifiers into a better one. Our first contribution is the introduction of the notion of communities of classifiers. We build a complete graph with one node per classifier and edges weighted by a measure of similarity between connected classifiers. The resulting community structure is uncovered from this graph using the state-of-the-art Louvain algorithm. Our second contribution is a hierarchical fusion approach driven by these communities. First, intra-community fusion results in one classifier per community. Then, inter-community fusion takes advantage of their complementarity to achieve much better classification performance. Application to the combination of 90 classifiers in the framework of TRECVid 2010 Semantic Indexing task shows a 30% increase in performance relative to a baseline flat fusion.

.bib [Bredin2012a] | .pdf
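Once communities of classifiers are known, the two-stage fusion described above can be sketched in a few lines. A toy illustration in Python (uniform averaging at both stages is an assumption of this sketch, the paper may weight classifiers differently, and the Louvain community detection step is taken as given):

```python
def hierarchical_fusion(scores, communities):
    """Fuse per-classifier scores for one sample in two stages.

    scores      -- dict mapping classifier name to its score for this sample
    communities -- list of lists of classifier names (e.g. output of Louvain
                   community detection on the classifier-similarity graph)
    """
    # intra-community fusion: redundant classifiers collapse into one score
    intra = [sum(scores[c] for c in community) / len(community)
             for community in communities]
    # inter-community fusion: complementary communities are combined
    return sum(intra) / len(intra)

# hypothetical classifiers: two visual ones in one community, one audio-based
scores = {"hog_svm": 0.9, "sift_svm": 0.7, "mfcc_gmm": 0.2}
communities = [["hog_svm", "sift_svm"], ["mfcc_gmm"]]
fused = hierarchical_fusion(scores, communities)
```

The point of the structure is that a community of many near-duplicate classifiers counts once at the inter-community stage, instead of dominating a flat average by sheer numbers.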
Philippe Ercolessi, Christine Sénac, Hervé Bredin

CBMI 2012, 10th Workshop on Content-Based Multimedia Indexing

Multiple sub-stories usually coexist in every episode of a TV series. We propose several variants of an approach for plot de-interlacing based on scene clustering, with the ultimate goal of providing the end-user with tools for a fast and easy overview of one episode, one season or the whole TV series. Each scene can be described in three different ways (based on color histograms, speaker diarization or automatic speech recognition outputs) and four clustering approaches are investigated, one of them based on a graphical representation of the video. Experiments are performed on two TV series of different lengths and formats. We show that semantic descriptors (such as speaker diarization) give the best results and underline that our approach provides useful information for plot de-interlacing.

.bib [Ercolessi2012] | .pdf
Johann Poignant, Hervé Bredin, Viet-Bac Le, Laurent Besacier, Claude Barras, Georges Quénot

Interspeech 2012, 13th Annual Conference of the International Speech Communication Association

We propose an approach for unsupervised speaker identification in TV broadcast videos, by combining acoustic speaker diarization with person names obtained via video OCR from overlaid texts. Three methods for the propagation of the overlaid names to the speech turns are compared, taking into account the co-occurrence duration between the speaker clusters and the names provided by the video OCR, and using a task-adapted variant of the TF-IDF information retrieval coefficient. These methods were tested on the REPERE dry-run evaluation corpus, containing 3 hours of annotated videos. Our best unsupervised system reaches an F-measure of 70.2% when considering all the speakers, and 81.7% if anchor speakers are left out. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models trained on matching development data and additional TV and radio data only reached a 57.5% F-measure when considering all the speakers and 45.7% without anchors.

.bib [Poignant2012] | .pdf
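The simplest of the propagation strategies, assigning to each speaker cluster the overlaid name it co-occurs with longest, can be sketched as follows (names and structures are hypothetical; the paper's TF-IDF-like variant additionally discounts names that co-occur with many clusters):

```python
def propagate_names(cooccurrence):
    """Name each speaker cluster after its longest co-occurring overlaid name.

    cooccurrence -- dict: cluster id -> {name: co-occurrence duration (seconds)},
                    e.g. accumulated wherever a diarization cluster overlaps
                    in time with an OCR-detected name on screen.
    Clusters that never co-occur with any name remain unnamed.
    """
    return {cluster: max(names, key=names.get)
            for cluster, names in cooccurrence.items() if names}

# hypothetical example: spk1 mostly overlaps "Alice" title blocks
durations = {"spk1": {"Alice": 12.0, "Bob": 3.0}, "spk2": {"Bob": 8.0}}
identities = propagate_names(durations)
```

This is unsupervised in the sense the abstract means: no acoustic speaker model is trained in advance, identities come entirely from the video's own overlaid text.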
Hervé Bredin, Johann Poignant, Makarand Tapaswi, Guillaume Fortier, Viet Bac Le, Thibault Napoleon, Hua Gao, Claude Barras, Sophie Rosset, Laurent Besacier, Jakob Verbeek, Georges Quénot, Frédéric Jurie, Hazim Kemal Ekenel

ECCV 2012, Workshop on Information Fusion in Computer Vision for Concept Recognition

The REPERE challenge is a project aiming at the evaluation of systems for supervised and unsupervised multimodal recognition of people in TV broadcast. In this paper, we describe, evaluate and discuss QCompere consortium submissions to the 2012 REPERE evaluation campaign dry-run. Speaker identification (and face recognition) can be greatly improved when combined with name detection through video optical character recognition. Moreover, we show that unsupervised multimodal person recognition systems can achieve performance nearly as good as supervised monomodal ones (with several hundreds of identity models).

.bib [Bredin2012b] | .pdf
Tiberius Strat, Alexandre Benoit, Hervé Bredin, Georges Quénot, Patrick Lambert

ECCV 2012, Workshop on Information Fusion in Computer Vision for Concept Recognition

We deal with the issue of combining dozens of classifiers into a better one, for concept detection in videos. We compare three fusion approaches that share a common structure: they all start with a classifier clustering stage, continue with an intra-cluster fusion and end with an inter-cluster fusion. The main difference between them comes from the first stage. The first approach relies on a priori knowledge about the internals of each classifier (low-level descriptors and classification algorithm) to group the set of available classifiers by similarity. The second and third approaches obtain classifier similarity measures directly from their output and group them using agglomerative clustering for the second approach and community detection for the third one.

.bib [Strat2012] | .pdf
Philippe Ercolessi, Christine Sénac, Sandrine Mouysset, Hervé Bredin

AMVA 2012, 1st ACM International Workshop on Audio and Multimedia Methods for Large-Scale Video Analysis at ACM Multimedia 2012

Since the 90s, TV series tend to introduce more and more main characters and are often composed of multiple intertwined stories. In this paper, we propose a hierarchical framework for plot de-interlacing which clusters semantic scenes into stories: a story is a group of scenes, not necessarily contiguous, showing a strong semantic relation. Each scene is described using three different modalities (based on color histograms, speaker diarization or automatic speech recognition outputs) as well as their multimodal combination. We introduce the notion of character-driven episodes as episodes where stories are emphasized by the presence or absence of characters, and we propose an automatic method, based on a social graph, to detect these episodes. Depending on whether an episode is character-driven or not, the plot de-interlacing (a scene clustering) is made either through a traditional average-link agglomerative clustering with the speaker modality only, or through a spectral clustering with the fusion of all modalities. Experiments, conducted on twenty-three episodes from three quite different TV series (different lengths and formats), show that the hierarchical framework brings an improvement for all the series.

.bib [Ercolessi2012b] | .pdf
Philippe Ercolessi, Hervé Bredin, Christine Sénac

ACM MM 2012, 20th ACM International Conference on Multimedia

Recent TV series tend to have more and more complex plots. They follow the lives of numerous characters and are made of multiple intertwined stories. In this paper, we introduce StoViz, a web-based interface allowing a fast overview of this kind of episode structure, based on our plot de-interlacing system. StoViz has two main goals. First, it provides the user with a useful overview of the episode by displaying each story separately, along with a short abstract extracted from each. Second, it allows an efficient visual comparison of the output of any automatic plot de-interlacing algorithm with the manual annotation in terms of stories, and is therefore very helpful for evaluation purposes. StoViz is available online at http://stoviz.niderb.fr.

.bib [Ercolessi2012c] | .pdf

2011

David Gorisse, Frédéric Precioso, Philippe Gosselin, Lionel Granjon, Denis Pellerin, Michèle Rombaut, Hervé Bredin, Lionel Koenig, Rémi Vieux, Boris Mansencal, Jenny Benois-Pineau, Hugo Boujut, Claire Morand, Hervé Jégou, Stéphane Ayache, Bahjat Safadi, Yubing Tong, Franck Thollard, Georges Quénot, Matthieu Cord, Alexandre Benoît, Patrick Lambert

TRECVid 2010, TREC Video Retrieval Evaluation Online Proceedings

The IRIM group is a consortium of French teams working on Multimedia Indexing and Retrieval. This paper describes our participation in the TRECVID 2010 semantic indexing and instance search tasks. For the semantic indexing task, we evaluated a number of different descriptors and tried different fusion strategies, in particular hierarchical fusion. The best IRIM run has a Mean Inferred Average Precision of 0.0442, which is above the task median performance. We found that fusing the classification scores from different classifier types improves the performance and that, even with quite low individual performance, audio descriptors can help. For the instance search task, we used only one of the example images in our queries. Our rank is nearly in the middle of the list of participants. The experiment showed that HSV features outperform both the concatenation of HSV and Edge histograms and the Wavelet features.

.bib [Gorisse2011] | .pdf
Philippe Ercolessi, Hervé Bredin, Christine Sénac, Philippe Joly

WIAMIS 2011, 12th International Workshop on Image Analysis for Multimedia Interactive Services

In this paper, we propose a novel approach to scene segmentation of TV series. Using the output of our existing speaker diarization system, any temporal segment of the video can be described as a binary feature vector. A straightforward segmentation algorithm then groups similar contiguous speaker segments into scenes. An additional visual-only, color-based segmentation is then used to refine the first one. Experiments are performed on a subset of the Ally McBeal TV series and show promising results, obtained with a rule-free and generic method. For comparison purposes, the test corpus annotations and description are made available to the community.

.bib [Ercolessi2011] | .pdf

2010

Bertrand Delezoide, Hervé Le Borgne, Pierre-Alain Moëllic, David Gorisse, Frédéric Precioso, Feng Wang, Bernard Merialdo, Philippe Gosselin, Lionel Granjon, Denis Pellerin, Michèle Rombaut, Hervé Bredin, Lionel Koenig, Hélène Lachambre, Elie El Khoury, Boris Mansencal, Yifan Zhou, Jenny Benois-Pineau, Hervé Jégou, Stéphane Ayache, Bahjat Safadi, Georges Quénot, Jonathan Fabrizio, Matthieu Cord, Hervé Glotin, Zhongqiu Zhao, Emilie Dumont, Bertrand Augereau

TRECVid 2009, TREC Video Retrieval Evaluation Online Proceedings

The IRIM group is a consortium of French teams working on Multimedia Indexing and Retrieval. This paper describes our participation in the TRECVID 2009 High Level Features detection task. We evaluated a large number of different descriptors (on TRECVID 2008 data) and tried different fusion strategies, in particular hierarchical fusion and genetic fusion. The best IRIM run has a Mean Inferred Average Precision of 0.1220, which is significantly above the TRECVID 2009 HLF detection task median performance. We found that fusing the classification scores from different classifier types improves the performance and that, even with quite low individual performance, audio descriptors can help.

.bib [Delezoide2010] | .pdf
Hervé Bredin, Lionel Koenig, Hélène Lachambre, Elie El Khoury

TRECVid 2009, TREC Video Retrieval Evaluation Online Proceedings

.bib [Bredin2010] | .pdf

2009

Saman H. Cooray, Hervé Bredin, Li-Qun Xu, Noel E. O'Connor

ACM MM 2009, 17th ACM International Conference on Multimedia

.bib [Cooray2009] | .pdf

2008

Hervé Bredin, Gérard Chollet

ICASSP 2008, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Bredin2008] | .pdf
Benoît Fauve, Hervé Bredin, Walid Karam, Florian Verdet, Aurélien Mayoue, Gérard Chollet, Jean Hennebert, R. Lewis, John Mason, Chafic Mokbel, Dijana Petrovska

ICASSP 2008, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Fauve2008] | .pdf
Emilie Dumont, Bernard Merialdo, Slim Essid, Werner Bailer, Herwig Rehatschek, Daragh Byrne, Hervé Bredin, Noel O'Connor, Gareth JF Jones, Alan F Smeaton, Martin Haller, Andreas Krutz, Thomas Sikora, Tomas Piatrik

TRECVID 2008, ACM International Conference on Multimedia Information Retrieval

.bib [Dumont2008] | .pdf
Hervé Bredin, Daragh Byrne, Hyowon Lee, Noel O'Connor, Gareth JF Jones

TRECVID 2008, ACM International Conference on Multimedia Information Retrieval 2008

.bib [Bredin2008a] | .pdf
Emilie Dumont, Bernard Merialdo, Slim Essid, Werner Bailer, Daragh Byrne, Hervé Bredin, Noel O'Connor, Gareth JF Jones, Martin Haller, Andreas Krutz, Thomas Sikora, Tomas Piatrik

SAMT 2008, 3rd International Conference on Semantic and Digital Media Technologies

.bib [Dumont2008a] | .pdf

2007

Hervé Bredin, Gérard Chollet

ICASSP 2007, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Bredin2007a] | .pdf
Rémi Landais, Hervé Bredin, Leila Zouari, Gérard Chollet

Traitement et Analyse de l'Information : Méthodes et Applications

.bib [Landais2007] | .pdf
Enrique Argones-Rúa, Carmen García-Mateo, Hervé Bredin, Gérard Chollet

1st Spanish Workshop on Biometrics

.bib [Argones-Rua2007] | .pdf
Patrick Perrot, Hervé Bredin, Gérard Chollet

2007 International Crime Science Conference

.bib [Perrot2007] | .pdf

2006

Hervé Bredin, Antonio Miguel, Ian Witten, Gérard Chollet

ICASSP 2006, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Bredin2006] | .pdf
Jacques Koreman, Andrew C Morris, D. Wu, Sabah Jassim, Harin Sellahewa, J. Ehlers, Gérard Chollet, Guido Aversano, Hervé Bredin, Sonia Garcia-Salicetti, Lorène Allano, Bao Ly Van, Bernadette Dorizzi

MMUA 2006, Workshop on Multimodal User Authentication

.bib [Koreman2006] | .pdf
Hervé Bredin, Guido Aversano, Chafic Mokbel, Gérard Chollet

MMUA 2006, Workshop on Multimodal User Authentication

.bib [Bredin2006a] | .pdf
Fabian Brugger, Leila Zouari, Hervé Bredin, Asma Amehraye, Gérard Chollet, Dominique Pastor, Yang Ni

JEP 2006, Journées d'Etudes sur la Parole

.bib [Brugger2006] | .pdf
Hervé Bredin, Najim Dehak, Gérard Chollet

ICPR 2006, IAPR International Conference on Pattern Recognition

.bib [Bredin2006b] | .pdf
Hervé Bredin, Gérard Chollet

VIE 2006, IEE International Conference on Visual Information Engineering

.bib [Bredin2006c] | .pdf

2005

Kevin McTait, Hervé Bredin, Silvia Colón, Thomas Fillon, Gérard Chollet

ISPA 2005, International Symposium on Image and Signal Processing and Analysis

.bib [McTait2005] | .pdf