Video Primitives: the trackable, sketchable, and intrackable
Youdong Zhao and Song-Chun Zhu
In this project, we study mathematical models for small video patches (e.g., 15 × 15 × 5), called video "bricks", in natural videos. We cluster these video bricks into a variety of subspaces (called video words or video primitives) of varying dimensions in the high-dimensional (1,125) space. The structure of each word is characterized by both appearance and motion dynamics. As these small video bricks exhibit few compositional effects, we can divide their appearance into three pure types: flat patches, structural primitives (textons), and textures, and their motion into three pure types: still, trackable motion, and intrackable motion. A common generative model is introduced to model each video word individually. The representational power of a word is measured by its information gain, and the words are pursued one by one according to the significance of each cluster. These video words are atomic structures from which higher-level video patterns are constructed through composition. We present experiments on representing and segmenting video sequences using the learnt video vocabulary, which demonstrate the potential of our framework for learning basic video primitives.
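The brick-extraction and clustering step can be sketched as follows. This is a minimal toy illustration only: the 15 × 15 × 5 brick size comes from the text, while the synthetic video size, the number of clusters, and the plain k-means used here are illustrative assumptions (the project pursues subspace clusters via a generative model and information gain, not k-means).

```python
import numpy as np

def extract_bricks(video, bh=15, bw=15, bt=5):
    """Slice a video (T, H, W) into non-overlapping bricks,
    each flattened to a 15*15*5 = 1,125-dimensional vector."""
    T, H, W = video.shape
    bricks = []
    for t in range(0, T - bt + 1, bt):
        for y in range(0, H - bh + 1, bh):
            for x in range(0, W - bw + 1, bw):
                bricks.append(video[t:t+bt, y:y+bh, x:x+bw].ravel())
    return np.stack(bricks)

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means, a stand-in for the paper's subspace clustering."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each brick to its nearest cluster center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(1)
video = rng.random((10, 60, 60))     # 10 frames of a synthetic 60x60 "video"
bricks = extract_bricks(video)       # each row is a 1,125-dim brick vector
labels, centers = kmeans(bricks, k=3)
print(bricks.shape)                  # (32, 1125): 2 temporal x 4x4 spatial bricks
```

In the full framework, each such cluster would instead be fit with the common generative model, and clusters would be accepted one by one in order of significance.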
All images are collected from Google and BBC Motion Gallery.