The goal of this project is to develop a unified knowledge representation for robot autonomy. The representation is homogeneous in the sense that a single, simple and-or graph structure repeats itself across the spatial, temporal and causal dimensions to represent the complex structure of the following heterogeneous data sources (a minimal sketch of such a node follows the list below):
- Images and video: capturing objects, scenes, humans, actions, events and their various attributes, fluents and relations;
- Point clouds: capturing depth maps of objects and scenes, as well as human skeleton and hand poses;
- Natural language: speech and text from humans in queries and situated dialogues;
- Sound waveforms: recording the responses of objects to actions and thus reflecting their status (fluent) changes, such as a door closing or a microwave running; and
- Force sensor signals: recording the force, friction, torque and pressure when a human (or robot) acts on tools and objects.
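To make the shared structure concrete, the sketch below shows one way a recursive and-or node could cover spatial decomposition, temporal decomposition and causal action-fluent links with the same data type. This is a minimal illustration in Python; the class name, fields and example labels are our own assumptions, not the project's actual schema.

```python
# Illustrative sketch only: one recursive and-or graph node type reused
# across the spatial, temporal and causal dimensions. Names and fields
# are assumptions made for this example.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AOGNode:
    label: str                       # e.g. "dining-room", "set-table", "door"
    node_type: str                   # "AND" (decomposition), "OR" (alternatives), "LEAF"
    dimension: str                   # "spatial", "temporal", or "causal"
    children: List["AOGNode"] = field(default_factory=list)
    attributes: Dict[str, float] = field(default_factory=dict)  # fluents, poses, scores

# Spatial: a scene decomposes (AND) into objects; an object selects (OR) a variant.
chair = AOGNode("chair", "OR", "spatial",
                children=[AOGNode("wooden-chair", "LEAF", "spatial"),
                          AOGNode("plastic-chair", "LEAF", "spatial")])
table = AOGNode("table", "LEAF", "spatial")
room  = AOGNode("dining-room", "AND", "spatial", children=[chair, table])

# Temporal: an event decomposes (AND) into ordered sub-events.
set_table = AOGNode("set-table", "AND", "temporal",
                    children=[AOGNode("fetch-dishes", "LEAF", "temporal"),
                              AOGNode("place-dishes", "LEAF", "temporal")])

# Causal: an action links (AND) a precondition fluent to an effect fluent.
close_door = AOGNode("close-door", "AND", "causal",
                     children=[AOGNode("door-open", "LEAF", "causal"),
                               AOGNode("door-closed", "LEAF", "causal")])
```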
These data arrive offline from Internet searches or online as robots interact with humans and environments. They record the states of the world in which robots navigate and operate, and they contain compositional structures in space and time at multiple scales, together with complex dynamics and the functional and causal relations that are typical of complex systems in biology and materials science.
We will develop datafication tools for the following tasks and applications:
- Visual perception --- reconstructing 3D scene layouts from images and point clouds, recognizing objects from their 3D shapes, and understanding the actions and events performed by others (humans and robots) in the scene, as well as the spatial and temporal relations between these entities at multiple scales, from both third-person and first-person views;
- Commonsense reasoning --- understanding the functions of objects, the use of tools, the underlying physics (such as supporting relations and stability), and the causal effects of actions on the status of objects using sound and force sensors (see the toy sketch after this list);
- Situated dialogue with humans --- understanding the goals and intents of humans (or other agents) in the scene, understanding natural language in text or speech, and performing situated dialogues with humans for learning and cooperation; understanding and learning from human instructions and demonstrations; and
- Robot demonstrations --- demonstrating robot actions in real environments. We plan two sets of tasks: i) a robot assembling furniture and setting up rooms; and ii) a robot unfolding and folding clothes.
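As a toy illustration of the commonsense-reasoning item above, the sketch below infers a possible status (fluent) change, such as "microwave is working", by comparing short-time sound energy before and after an action. It is not the project's actual pipeline; the function name, window size and threshold are arbitrary assumptions for the sketch.

```python
# Toy illustration (not the project's actual method): detect a fluent change
# from a mono sound waveform by comparing energy before and after an action.
import numpy as np

def fluent_changed(waveform: np.ndarray, sample_rate: int, action_time: float,
                   window: float = 0.5, energy_ratio_threshold: float = 3.0) -> bool:
    """Return True if signal energy around the action time shifts enough
    to suggest a status (fluent) change of the object being acted on."""
    t = int(action_time * sample_rate)
    w = int(window * sample_rate)
    before = waveform[max(0, t - w):t].astype(np.float64)
    after = waveform[t:t + w].astype(np.float64)
    e_before = np.mean(before ** 2) + 1e-12   # small constant avoids division by zero
    e_after = np.mean(after ** 2) + 1e-12
    ratio = max(e_after / e_before, e_before / e_after)
    return ratio > energy_ratio_threshold
```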
Acknowledgments
This work is supported by DARPA Award N66001-15-C-4035.