Video Primitives: the trackable, sketchable, and intrackable

Youdong Zhao and Song-Chun Zhu


In this project, we study mathematical models for small video patches (e.g., 15 x 15 x 5 pixels), called video "bricks", in natural videos. We cluster these video bricks into a variety of subspaces (called video words or video primitives) of varying dimensions in the high-dimensional (1,125-dimensional) space. The structure of each word is characterized by both appearance and motion dynamics. Because these small video bricks exhibit few compositional effects, we can divide their appearance into three pure types: flat patch, structural primitive (texton), and texture; and their motion into three pure types: still, trackable motion, and intrackable motion. A common generative model is introduced to model these video words individually. The representational power of a word is measured by an information gain, and the words are pursued one by one according to the significance of each cluster. These video words are atomic structures from which higher-level video patterns are constructed through composition. We show experiments on representing and segmenting video sequences using the learned video vocabulary, which demonstrate the potential of our framework for learning basic video primitives.
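The 1,125-dimensional space arises from flattening each 15 x 15 x 5 brick into a vector (15 * 15 * 5 = 1,125). A minimal sketch of carving a video volume into such bricks, with an assumed non-overlapping stride (the project does not specify the sampling scheme):

```python
import numpy as np

def extract_bricks(video, size=(15, 15, 5), stride=(15, 15, 5)):
    """Slide a 3D window over a video volume (H x W x T) and
    return the flattened patches ("bricks") as row vectors."""
    H, W, T = video.shape
    h, w, t = size
    sh, sw, st = stride
    bricks = []
    for y in range(0, H - h + 1, sh):
        for x in range(0, W - w + 1, sw):
            for f in range(0, T - t + 1, st):
                bricks.append(video[y:y+h, x:x+w, f:f+t].ravel())
    return np.array(bricks)

video = np.random.rand(60, 60, 10)   # toy grayscale video volume
B = extract_bricks(video)
print(B.shape)                       # each row is a 1,125-dim point
```

Clustering the rows of such a matrix into subspaces of varying dimension is then a standard subspace-clustering problem.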

Filter Bank

Large-scale 3D Gabor filters (13 x 13 x 5 pixels)

Small-scale 3D Gabor filters (7 x 7 x 5 pixels)
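The page does not give the filter equations, so the following is a hedged sketch of one common construction: a separable space-time Gabor, an oriented spatial cosine carrier modulated by a temporal carrier under a Gaussian envelope, at the two stated sizes. The frequency and bandwidth parameters here are illustrative assumptions, not the project's values:

```python
import numpy as np

def gabor_3d(size_xy=13, size_t=5, freq=0.15, theta=0.0, omega=0.2,
             sigma_xy=4.0, sigma_t=1.5):
    """Space-time Gabor filter: an oriented spatial cosine wave,
    drifting in time at rate omega, under a Gaussian envelope.
    All parameter defaults are assumptions for illustration."""
    r, rt = size_xy // 2, size_t // 2
    x, y, t = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1),
                          np.arange(-rt, rt + 1), indexing='ij')
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotate spatial axis
    env = np.exp(-(x**2 + y**2) / (2 * sigma_xy**2)
                 - t**2 / (2 * sigma_t**2))
    g = env * np.cos(2 * np.pi * (freq * xr + omega * t))
    return g - g.mean()   # zero-mean: flat bricks give zero response

large = gabor_3d(13, 5)                       # 13 x 13 x 5 bank member
small = gabor_3d(7, 5, freq=0.25, sigma_xy=2.0)  # 7 x 7 x 5 bank member
print(large.shape, small.shape)
```

A full bank would sweep theta over several orientations and omega over several temporal rates at both scales.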


Dataset arrangement

The dataset is organized along two axes:

Motion types: static, trackable motion, intrackable motion, lighting
Appearance types: sketch (e.g., bar), texture (e.g., water)

All images are collected from Google and BBC Motion Gallery.