For my first participation in the TREC Video Retrieval Evaluation (TRECVid) campaign, I focused on the rushes summarisation task. In a nutshell: starting from a given video V of duration D, the objective is to produce a video summary v of duration d = 2% × D containing all the relevant information of the original video V.
Deciding whether a video segment contains relevant information is a difficult (impossible? subjective?) task in itself, and it is not the purpose of this article. I will definitely come back to this particular issue in future posts.
In this post, I will only describe the core of the abstraction algorithm: video footprints. For a full description of the two systems that we (the DCU team: Daragh Byrne and myself) submitted to TRECVid 2008, please refer to [Bredin2008a]:
H. Bredin, D. Byrne, H. Lee, N. O’Connor, and G. J. Jones
"Dublin City University at TRECVid 2008 BBC Rushes Summarisation Task,"
in TRECVID 2008, ACM International Conference on Multimedia Information Retrieval 2008, Vancouver, BC, Canada, 2008.
As with any content-based video analysis algorithm, the first mandatory step is to extract features directly from the video content. We chose a simplistic feature: an RGB colour histogram for every frame of the video, therefore focusing only on the visual content…
Dealing with the audio content is definitely something I plan to address in the near future.
Histogram computation with 8 bins per colour channel leads to the extraction of a 512-dimensional feature vector for each frame of the video (around 25 frames per second). This makes an ENORMOUS set of HUGE vectors. Applying a dimensionality reduction technique certainly helps get a clearer insight into what is going on here. Enter principal component analysis (a.k.a. PCA): keeping only the first two principal components, each frame can be described by a 2-dimensional feature vector, as shown in the picture below.
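To make this step concrete, here is a minimal NumPy sketch of the feature extraction: a joint 8×8×8 = 512-bin RGB histogram per frame, followed by a two-component PCA via SVD. Frame decoding is out of scope, so random arrays stand in for real frames; function names are mine, not from our TRECVid system.

```python
import numpy as np

def rgb_histogram(frame, bins_per_channel=8):
    """Map an (H, W, 3) RGB frame to a normalised 512-bin joint histogram."""
    # Quantise each channel into 8 levels (pixel values assumed in 0..255)
    quantised = (frame // (256 // bins_per_channel)).reshape(-1, 3)
    # Joint bin index: r*64 + g*8 + b
    joint = (quantised[:, 0] * bins_per_channel ** 2
             + quantised[:, 1] * bins_per_channel
             + quantised[:, 2])
    hist = np.bincount(joint, minlength=bins_per_channel ** 3).astype(float)
    return hist / hist.sum()

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes of the centred data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# Stand-in for decoded video frames: 100 random 36x64 RGB images
rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(100, 36, 64, 3))
X = np.stack([rgb_histogram(f) for f in frames])  # shape (100, 512)
Y = pca_2d(X)                                     # shape (100, 2)
```

Each row of `Y` is the 2-dimensional description of one frame; plotting `Y` in order gives the trajectory discussed next.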
Consequently, it is possible to follow the trajectory of the video in this 2-dimensional space, which is divided into 30×30 bins (as shown in the video below). Each time the trajectory goes through a bin, the bin is activated (i.e. it turns black). The resulting binary 30×30 matrix is called the footprint of the video.
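Computing the footprint itself then amounts to rasterising the trajectory onto a binary grid. A sketch, assuming the 2-D space is normalised to the bounding box of the trajectory (the actual binning boundaries in our system may differ):

```python
import numpy as np

def footprint(traj, grid=30):
    """Binary grid x grid footprint of a 2-D trajectory (shape (T, 2))."""
    fp = np.zeros((grid, grid), dtype=bool)
    lo, hi = traj.min(axis=0), traj.max(axis=0)
    # Map each point to a bin index; the epsilon keeps the max in the last bin
    idx = ((traj - lo) / (hi - lo + 1e-12) * grid).astype(int)
    idx = idx.clip(0, grid - 1)
    fp[idx[:, 0], idx[:, 1]] = True  # activate every visited bin
    return fp

# Toy trajectory of three frames in the 2-D PCA space
traj = np.array([[0.0, 0.0], [0.25, 0.75], [1.0, 1.0]])
fp = footprint(traj)
```

The same function applied to the sub-trajectory of a single shot yields a per-shot footprint.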
In this sample video, along with the footprint of the whole video (on the left), a footprint is also computed for each shot. The structure of the original video is clearly uncovered here: shots 1 and 2 are retakes of the same scene (like shots 3, 4 and 5) and this can be deduced from the observation of their footprints (they are very similar). In a summarization framework, this can be used to get rid of redundant shots (with similar footprints) or select the most informative sub-segment of a shot -- that is the one with the largest footprint (see video below).
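Both uses boil down to simple operations on the binary matrices. As an illustration only (our submitted system may well use a different similarity measure), here is a sketch with Jaccard overlap for detecting redundant retakes and bin count for picking the most informative footprint:

```python
import numpy as np

def jaccard(fp1, fp2):
    """Overlap between two binary footprints (1.0 = identical coverage)."""
    inter = np.logical_and(fp1, fp2).sum()
    union = np.logical_or(fp1, fp2).sum()
    return inter / union if union else 1.0

def most_informative(footprints):
    """Index of the footprint activating the most bins."""
    return int(np.argmax([fp.sum() for fp in footprints]))

# Toy footprints: two identical retakes and one wider-coverage shot
fp_a = np.zeros((30, 30), dtype=bool); fp_a[:3, :] = True    # 90 bins
fp_b = fp_a.copy()                                           # retake of fp_a
fp_c = np.zeros((30, 30), dtype=bool); fp_c[:10, :] = True   # 300 bins
```

Shots whose pairwise similarity exceeds a threshold can be collapsed into one, and among a shot's sub-segments the one maximising `fp.sum()` is kept for the summary.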
More applications of video footprints to come…