Joint Video and Text Parsing for Understanding Events and Answering Queries

Demo Download

Video understanding aims to automatically recognize the objects, scenes and events. A significant portion of videos are accompanied by textual descriptions, which are informative for video understanding. We propose to process video and text jointly in order to obtain a more accurate and comprehensive interpretation of the scenes and events. The joint interpretation of video and text is represented by a parse graph, which is an extension of the constituency-based parse trees used in natural language syntactic parsing. We propose a probabilistic generative model that takes into account the parsing of video and text, the relationships between the joint parse graph and the video and text parse graphs, and the prior knowledge of reasonable joint parse graphs. Based on the probabilistic model, we propose a system consisting of three modules: video parsing, text parsing and joint inference. For joint inference, we propose a novel algorithm based on graph matching, deduction and revision. We evaluated our approach on several surveillance datasets and analyzed the impact of the degree of video-text overlap on the performance of joint parsing. We also demonstrate the usefulness of the joint parse graph by applying it to semantic query answering.

Joint Video and Text Parsing for Understanding Events and Answering Queries

Joint Parsing: Spatial, Temporal and Causal Inference for Understanding Images and Videos

Natural Language Query Based on Joint Parsing