Inferring the "Dark Matter" and "Dark Energy" from Image and Video
Figure 1. Panels: (a) Open-door is not defined by poses. (b) Where are the chairs and trashcans? (c) What is water?
Caption: (a) Many daily actions are defined by their causal effects, not by poses. (b) The courtyard scene includes three chairs, two trashcans, and vending machines, which cannot be recognized by their appearance due to low resolution and large within-category variations, but can be inferred from the scene layout or from human trajectories in the video. (c) Water and other fluids play important roles in our activities, but are hardly detectable in images.
In images and videos, many entities (functional objects, stuff like water, object fluents, mental intents) and relations (causal effects, physical supports, attraction fields) are infeasible to detect by their appearance using existing approaches, and most of them do not appear in any pixels at all. Yet they are pervasive, and they govern the placement and motion of the visible entities, which are relatively easier to detect. By analogy, they are like the dark matter and dark energy that physicists study in the standard cosmological model. Studying such "dark entities" and "dark relations" in vision is crucial for closing the performance gaps in the recognition of objects, scenes, actions, and events. This proposal will study these dark entities and dark relations in three projects:
- learning causal models and reasoning about perceptual causality from video;
- parsing 3D scenes by reasoning about physical relations, stability, and safety;
- understanding scenes by inferring attraction relations, hidden objects, and intents.
The project has the potential to transform computer vision research in several respects.
- Representing causal knowledge to go beyond associational knowledge in vision. Graphical models have been widely used in computer vision as the backbone for representing objects, scenes, actions, and events. These models are associational or contextual in space and time, but not causal; they are good at answering what, who, where, and when. Causal models are a large part of human knowledge and allow us to answer deeper questions of why, why not, and what if (counterfactual). The proposed study will be the first formal study of causality (learning, modeling, and reasoning) in the vision literature.
- Reasoning about the dark entities and relations to go beyond the current geometry-based and appearance-based paradigm. Perceptual causality, human intents, and physics are generally applicable to all categories of objects, scenes, actions, and events, i.e., they are transportable across datasets. These entities and relations are deeper and more invariant than geometry and appearance, the dominant features used in visual recognition.
- Developing a joint representation and joint inference algorithms. The rich contextual and causal links in this joint representation are essential for building robust vision systems in which each visual entity can be inferred through multiple routes, but they are not systematically studied or integrated in the existing paradigm.
The proposed research will provide core techniques for improving the performance of key tasks in computer vision: object recognition, scene understanding, and action and event recognition. Improving the performance of these tasks will have broader impacts on the following applications.
- Video surveillance for security and timely intelligence. Understanding functional objects, scenes, actions, and causality is crucial for detecting and predicting abnormal or suspicious actions. The PI will collaborate with a leading surveillance company to transfer the results.
- Intelligent robots for rescue in disaster areas. A typical challenging task is a rescue mission in a disaster area (such as after an earthquake or tsunami). The robots must be able to reason about the physical stability and safety of the scene, and about the causal effects of their actions on the damaged site. Reasoning about physics, causality, intents, and stuff like water is also a crucial capability for intelligent robots searching a building or providing health care for seniors at home.
- Aerial scene and activity understanding. Unmanned aerial vehicles (UAVs, or drones) are becoming widely used in missions such as anti-terrorism and the search for fugitives in remote areas. Objects and humans in aerial videos can hardly be recognized by their appearance due to the top-down view and low resolution. Ground objects, such as a building or a site, and the activity of a person or vehicle will have to be inferred from intents and activity patterns, a topic studied in the proposed project.
Motivation: going beyond geometry-based and appearance-based approaches
The goal of computer vision, as formulated by Marr, is to compute what is where by looking. This paradigm guided the geometry-based approaches of the 1980s-1990s and the appearance-based methods of the past 20 years. Despite remarkable progress in recognizing objects, actions, and scenes by using large datasets, better-designed features, and machine learning techniques, performance on challenging benchmarks is still far from satisfactory. To gain the remaining percentage points, we must look at a bigger picture and model and reason about the missing dimensions. By analogy, this is similar to research in cosmology and astronomy. Physicists proposed in the 1980s, and have now come to accept, a standard cosmological model in which the mass-energy visible through telescopes accounts for less than 5% of the universe, while the rest is dark matter (23%) and dark energy (72%) [url: map.gsfc.nasa.gov/universe/]. The properties and characteristics of dark matter and dark energy have to be reasoned jointly from the visible mass-energy using a sophisticated cosmological model. Dark matter and dark energy, in return, help to explain the formation, evolution, and motion of the visible universe.
In vision, "dark matter" corresponds to entities that are infeasible to recognize by their visual appearance. These include, but are not limited to,
- the status of an agent's (human or animal) goals and intents, such as being hungry or thirsty, which trigger actions;
- the status of an object, such as a door being "locked";
- stuff like water, which has no specific geometric shape or appearance;
- physical forces like gravity and supporting relations between objects;
- causal effects and causal relations between actions and the changing object statuses;
- attraction relations between an object (like food) and an agent (hungry); and so on.
Objective: visual recognition by reasoning fluents, causality, intents, attractions and physics
Our objective is to study joint reasoning algorithms on a joint representation that integrates
- the "visible": traditional recognition categories such as objects, scenes, actions, and events;
- the "dark": higher-level cognitive concepts such as fluents, causality, intents, attractions, and physics.
To this end,
- we will study a joint spatial, temporal, and causal and-or graph (STC-AoG) for representation;
- we will study joint reasoning algorithms on the STC-AoG, building upon existing work in our group.
This work is supported by the NSF Robust Intelligence program: IIS 1423305