Video understanding aims to automatically recognize the objects,
scenes and events. A significant portion of videos are accompanied by
textual descriptions, which are informative for video understanding. We
propose to process video and text jointly in order to obtain a more
accurate and comprehensive interpretation of the scenes and events. The
joint interpretation of video and text is represented by a parse graph,
which is an extension of the constituency-based parse trees used in
natural language syntactic parsing. We propose a probabilistic
generative model that takes into
account the parsing of video and text, the relationships between the
joint parse graph and the video and text parse graphs, and the prior
knowledge of reasonable joint parse graphs. Based on the probabilistic
model, we propose a system consisting of three modules: video parsing,
text parsing and joint inference. For joint inference, we propose a
novel algorithm based on graph matching, deduction and revision. We
evaluated our approach on several surveillance datasets and analyzed
the impact of the degree of video-text overlap on the performance of
joint parsing. We also demonstrate the usefulness of the joint parse graph by applying it to semantic query answering.
Joint Parsing: Spatial, Temporal and Causal Inference for Understanding Images and Videos