• Rama Chellappa (University of Maryland, College Park)

    The Evolution of Stochastic Grammars for Representation and Recognition of Activities in Videos


    The speaker is one of the privileged many to have been taught syntactic pattern recognition methods by the late Prof. K.S. Fu. In this talk, I will discuss the evolution of stochastic image grammars from the early seventies to now, with a focus on image and video understanding applications. In the early days, the grammars used for image analysis applications were mostly based on regular or context-free grammars and were largely deterministic. Given his interest in statistical methods for pattern recognition, Prof. Fu assigned probabilities to production rules and designed many stochastic parsers for image recognition problems. Despite the many advances made by Profs. Fu, Pavlidis, and many others, the grammar-based approach to image recognition problems went out of fashion in the mid eighties, when Prof. Fu passed away. However, over the last decade, this field has been revitalized and has become an active area of research. In particular, researchers such as Bobick, Stu Geman, Song-Chun Zhu, Larry Davis, and Subrahmanian have proposed many approaches for image analysis and activity recognition in videos based on grammar or grammar-like representations. I will briefly discuss these methods and present many examples of stochastic grammar-based approaches for activity recognition in videos, done in collaboration with Prof. Subrahmanian's group at UMD. I will also discuss some open issues to be addressed.
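    The core idea above, attaching probabilities to production rules, can be illustrated with a small sketch. This is a toy stochastic context-free grammar for an invented "pick up an object" activity, not a grammar from the talk; the symbols and probabilities are made up for illustration.

```python
import random

# Toy stochastic CFG: each nonterminal maps to (probability, right-hand side)
# rules, in the spirit of assigning probabilities to productions.
SCFG = {
    "ACTIVITY": [(0.6, ["APPROACH", "PICKUP"]),
                 (0.4, ["APPROACH", "PICKUP", "LEAVE"])],
    "APPROACH": [(1.0, ["walk"])],
    "PICKUP":   [(0.7, ["bend", "grasp"]),
                 (0.3, ["grasp"])],
    "LEAVE":    [(1.0, ["walk"])],
}

def sample(symbol, rng=random):
    """Expand a symbol into a terminal sequence, choosing rules by probability."""
    if symbol not in SCFG:  # terminal symbol
        return [symbol]
    r, acc = rng.random(), 0.0
    for p, rhs in SCFG[symbol]:
        acc += p
        if r <= acc:
            return [t for s in rhs for t in sample(s, rng)]
    # fall back to the last rule on floating-point rounding
    return [t for s in SCFG[symbol][-1][1] for t in sample(s, rng)]

print(sample("ACTIVITY"))
```

Sampling repeatedly yields activity strings with frequencies governed by the rule probabilities; a stochastic parser inverts this, scoring an observed event string by the probability of its derivations.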

  • Pedro Domingos (University of Washington)

    Sum-Product Networks: A New Deep Architecture


    The key limiting factor in graphical model inference and learning is the complexity of the partition function. We thus ask the question: what are the most general conditions under which the partition function is tractable? The answer leads to a new kind of deep architecture, which we call sum-product networks (SPNs) and which I will present in this talk. The key idea of SPNs is to compactly represent the partition function by introducing multiple layers of hidden variables. An SPN is a rooted directed acyclic graph with variables as leaves, sums and products as internal nodes, and weighted edges. Essentially all tractable graphical models can be cast as SPNs, but SPNs are also strictly more general. For example, SPNs can be exponentially more compact than hierarchical mixture models, because they allow mixtures over subsets of variables and their reuse. SPNs can be exponentially more compact than junction trees when context-specific independence and determinism are present, since they exploit these and junction trees do not. This holds even when junction trees are formed from Bayesian networks with context-specific independence in the form of decision trees at the nodes, because decision trees suffer from the replication problem, and can be exponentially larger than a DAG representation of the same function. SPNs can also compactly represent some classes of distributions in which no conditional independences hold, and whose graphical models are, therefore, completely connected graphs.
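    The structural definition above (leaves as variable indicators, internal sum and product nodes, weighted edges) can be made concrete with a minimal sketch. The two-variable SPN and its weights below are invented for illustration; setting every indicator to 1 marginalizes all variables and evaluates the partition function in one bottom-up pass.

```python
# Minimal sum-product network over two binary variables x1, x2.
# Leaves are indicator functions; internal nodes are weighted sums and products.

class Leaf:
    def __init__(self, var, val):
        self.var, self.val = var, val
    def eval(self, x):
        # x maps var -> 0/1, or None to marginalize (indicator set to 1)
        return 1.0 if x[self.var] is None or x[self.var] == self.val else 0.0

class Product:
    def __init__(self, children):
        self.children = children
    def eval(self, x):
        v = 1.0
        for c in self.children:
            v *= c.eval(x)
        return v

class Sum:
    def __init__(self, weighted_children):
        self.wc = weighted_children  # list of (weight, child)
    def eval(self, x):
        return sum(w * c.eval(x) for w, c in self.wc)

# A mixture of two product distributions over (x1, x2).
spn = Sum([
    (0.3, Product([Leaf("x1", 1), Leaf("x2", 1)])),
    (0.7, Product([Leaf("x1", 0), Leaf("x2", 0)])),
])

Z = spn.eval({"x1": None, "x2": None})   # partition function: all indicators 1
p11 = spn.eval({"x1": 1, "x2": 1}) / Z   # P(x1=1, x2=1)
```

Evaluation cost is linear in the number of edges, which is exactly the tractability property the abstract describes.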

    We evaluated SPNs by applying them to the problem of completing faces. This is a good test for a deep architecture, because it is an extremely difficult task where detecting deep structure is key. To our knowledge, no previous learner has done well on this problem. We used two well-known datasets: Caltech-101 and Olivetti. The SPNs learned in our experiments contained up to 46 layers. Compared to state-of-the-art deep architectures such as DBNs and DBMs, SPNs are simpler, require much less engineering, and are at least an order of magnitude faster in inference and learning.

    Much remains to be explored about SPNs, including the extension of SPNs to sequential domains, design principles for SPN architectures, other learning methods for SPNs, and further applications. While graphical models can compactly represent more distributions than SPNs, this potential is hampered by the need for approximate inference. A proper comparison between the two model classes is a fascinating question, both theoretically and empirically.

  • Pedro Felzenszwalb (Brown University)

    Object Detection Grammars


    In this talk I will discuss various aspects of object detection using compositional models, focusing on the framework of object detection grammars, discriminative training and efficient computation.

    Object detection grammars provide a formalism for expressing very general types of models for object detection. Over the past few years we have considered a sequence of increasingly richer models. Each model in this sequence builds on the structures and methods employed by the previous models, while staying within the framework of discriminatively trained grammar models. Along the way, we have increased representational capacity, developed new machine learning techniques, and focused on efficient computation. We are now at a stage where grammar-based models are starting to outperform simpler models. We have a complete implementation of the formalism that makes it possible to quickly define new types of models using a simple modeling language.
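    The central computation in such grammar models can be sketched with a toy score recursion: a nonterminal's score at a placement is the maximum, over its productions, of a production score plus the scores of the parts. The symbols, biases, and filter responses below are hypothetical, not taken from the talk's models, and real detectors also maximize over part placements and deformation costs, which this sketch omits.

```python
# Toy score recursion for a grammar-based detector.
GRAMMAR = {
    # nonterminal -> list of (production bias, child symbols)
    "PERSON": [(0.5, ["HEAD", "TORSO"]),
               (0.2, ["HEAD", "TORSO", "LEGS"])],
}
# Terminal scores, standing in for filter responses at one image placement.
FILTER_SCORE = {"HEAD": 1.2, "TORSO": 0.8, "LEGS": -0.1}

def score(symbol):
    """Best derivation score rooted at symbol (max over productions)."""
    if symbol in FILTER_SCORE:            # terminal: appearance score
        return FILTER_SCORE[symbol]
    best = float("-inf")
    for bias, parts in GRAMMAR[symbol]:   # nonterminal: max over derivations
        best = max(best, bias + sum(score(p) for p in parts))
    return best

best = score("PERSON")  # picks the two-part derivation here: 0.5 + 1.2 + 0.8
```

Discriminative training then adjusts the biases and filter weights so that this max-over-derivations score separates positive from negative placements.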

    This is joint work with Ross Girshick and David McAllester.

  • Roxana Girju (University of Illinois at Urbana-Champaign)


    Stochastic Models for Semantic Parsing, Multi-Faceted Topic Discovery, and Causal Event Inference: Perspectives from Natural Language Processing

    In this talk I will address three important problems in Natural Language Processing with direct relevance to Image Understanding: Semantic Parsing, Multi-Faceted Topic Discovery, and Causal Event Inference.

    Although notoriously difficult, the problem of semantic relation discovery is simple to define: Given a noun-noun pair n1 -- n2, determine the semantic relationship between the two nouns in the context of a sentence. For example, the pair "door" -- "car" in the sentence 'I opened the door of the car and grabbed my phone' encodes a part-whole relationship (i.e., the door is part of the car). This task requires several local and global decisions -- i.e., it involves the meaning of the two noun concepts along with the meaning of other words in the sentence context. I will present a state-of-the-art semantic parser of English which relies on a stochastic hierarchical model of semantic relation discovery. It is an efficient WordNet-based learning model that identifies and extracts concept features from the WordNet IS-A backbone, which was designed to capture and relate various concept meanings (e.g., “car” is a kind of “vehicle”). The model finds the best abstraction level of concept semantic classes that would separate the positive and the negative examples, and that would accurately classify unseen instances. This is done by finding a boundary (i.e., a division in the WordNet noun concept hierarchy) that would best generalize over the training examples. The results show that some relations are better suited for such hierarchical models than others, and that contextual data are important in the performance of such a parser. The advantages and the limitations of the semantic parser will be exemplified on an application of Text-to-Scene conversion of various textual descriptions.
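    The boundary-finding idea above can be sketched on a toy IS-A hierarchy: generalize each positive training noun to the most general ancestor class that covers no negative example. The miniature hierarchy and the word lists below are invented stand-ins for the WordNet noun backbone and a training set.

```python
# Hypothetical miniature of the WordNet IS-A backbone: child -> parent.
ISA = {
    "car": "vehicle", "truck": "vehicle", "vehicle": "artifact",
    "door": "artifact", "artifact": "entity",
    "dog": "animal", "animal": "entity",
}

def ancestors(word):
    """IS-A chain from the word itself up to the root, specific first."""
    chain = [word]
    while chain[-1] in ISA:
        chain.append(ISA[chain[-1]])
    return chain

def covers(cls, word):
    return cls in ancestors(word)

def boundary(positives, negatives):
    """Most general class per positive noun that excludes every negative."""
    out = {}
    for p in positives:
        for cls in reversed(ancestors(p)):   # try the most general class first
            if not any(covers(cls, n) for n in negatives):
                out[p] = cls
                break
    return out

generalized = boundary(["car", "truck"], ["door"])  # both map to "vehicle"
```

Here "entity" and "artifact" are rejected because they also cover the negative example "door", so the boundary settles one level lower, at "vehicle"; this is the generalize-but-separate trade-off the parser resolves over the full hierarchy.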

    In the second part of the talk I will address the problem of multi-faceted topic discovery. Oftentimes, similar information can be presented from different viewpoints, aspects, or perspectives, and it is useful to model such differences. Probabilistic topic models have emerged as a popular approach to uncovering hidden structures in text collections, and can be used to cluster both words and documents. These models, however, typically learn distributions over words along only a single dimension of topicality, and ignore the fact that words may fall along other aspects such as sentiment, perspective, or theme. I will present a novel topic model, TAM, which not only allows documents to contain multiple aspects, but also learns these aspects automatically. The model can be applied to a single collection, and, without document labels, it can discover patterns shared among multiple text collections as well as differences between them. Here a document is a mixture of such aspects, rather than belonging to exactly one aspect. For example, a recipe for curry pizza would contain elements from both Indian and American food, rather than strictly one or the other. The model is also applied to a corpus of editorials on the Israeli-Palestinian conflict.
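    The document-as-aspect-mixture idea can be sketched generatively: each word first picks an aspect from the document's mixture, then a word from that aspect's distribution. The aspects, vocabularies, and probabilities below are invented for the curry-pizza example; TAM itself additionally learns these distributions and combines aspects with topics, which this sketch does not attempt.

```python
import random

# Hypothetical aspect-specific word distributions.
ASPECT_WORDS = {
    "indian":   {"curry": 0.6, "naan": 0.3, "pizza": 0.1},
    "american": {"pizza": 0.5, "burger": 0.4, "curry": 0.1},
}

def draw(dist, rng):
    """Sample a key from a dict of probabilities."""
    r, acc = rng.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r <= acc:
            return k
    return k  # guard against floating-point rounding

def generate(aspect_mix, n_words, seed=0):
    """Generate a document: each word picks an aspect, then a word."""
    rng = random.Random(seed)
    doc = []
    for _ in range(n_words):
        aspect = draw(aspect_mix, rng)
        doc.append(draw(ASPECT_WORDS[aspect], rng))
    return doc

# A "curry pizza" recipe mixes both aspects rather than belonging to one.
doc = generate({"indian": 0.5, "american": 0.5}, 20)
```

Inference reverses this process, recovering the per-document aspect mixture and the aspect word distributions from unlabeled text.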

    In the last part of the talk I will address another challenging but very important problem: causal event inference. Unlike computers, people are very good at perceiving and inferring the causal relationships between events in a discourse context. Detecting such relations enables people to find meaningful order in events and activities, which in turn helps them organize, plan, and even predict the future. I will present an unsupervised learning model which identifies causal relationships between scenario-specific events in news articles. The model generates ranked causal relationships by identifying appropriate candidate event pairs for each scenario of a particular domain. Scenario-specific events, contributing towards the same objectives in a domain, are likely to be dependent on each other, and thus form good candidates for causal relationships. For evaluation, we rely on the manipulation theory of causation and a comparison of precision-recall performance curves. The tests performed also bring insights into the cognitive aspects of this challenging task, i.e., how people perceive such causal relationships in context.
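    One simple way to surface candidate event pairs of the kind described above is to rank co-occurring events by an association score such as pointwise mutual information. The tiny "corpus" of event sets and the PMI ranking below are an invented stand-in for the model's scenario-specific candidate generation, not the method from the talk.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus: each "document" is the set of events it mentions.
docs = [
    {"earthquake", "collapse", "rescue"},
    {"earthquake", "collapse"},
    {"earthquake", "rescue"},
    {"storm", "flood"},
    {"storm", "flood", "rescue"},
]

n = len(docs)
event_cnt = Counter(e for d in docs for e in d)
pair_cnt = Counter(frozenset(p) for d in docs
                   for p in combinations(sorted(d), 2))

def pmi(a, b):
    """Pointwise mutual information of two events over documents."""
    joint = pair_cnt[frozenset((a, b))] / n
    return math.log(joint / ((event_cnt[a] / n) * (event_cnt[b] / n)))

# Rank candidate pairs; strongly associated events make good candidates
# for a downstream causal test.
ranked = sorted(pair_cnt, key=lambda p: pmi(*sorted(p)), reverse=True)
```

Association alone does not establish causation, which is why an evaluation grounded in the manipulation theory of causation is needed on top of any such ranking.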

  • Noah Goodman (Stanford University)

    Learning and the Language of Thought


    Logic and probability are key themes of cognitive science that have long had an uneasy coexistence. I will describe the Probabilistic Language of Thought approach that brings them together into compositional representations with probabilistic meaning -- formalized as stochastic lambda calculus. I will describe how this general framework is realized in the probabilistic programming language Church. This language provides a compact formal representation for complex generative models. Church also provides a central metaphor for cognitive science: concepts as functions in probabilistic programs. This metaphor, in turn, leads to a view of concept learning as the induction of probabilistic programs. I will turn to a discussion of human concept learning, within the probabilistic language of thought framework, beginning with simple categorization tasks and extending to acquisition of abstract concepts, such as the natural numbers. I will finally discuss a technique -- Bayesian program merging -- for searching over the space of Church programs, and apply this to learning structured visual concepts from structured (tree-like) observations.
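    The "concepts as functions in probabilistic programs" idea can be sketched in Python rather than Church: a stochastic function defines a generative model, and conditioning is performed by rejection sampling. The coin model below is a standard textbook example chosen for illustration, not an example from the talk.

```python
import random

def flip(p, rng):
    """A stochastic primitive, like Church's `flip`."""
    return rng.random() < p

def model(rng):
    """Generative model: a possibly biased coin and five observed flips."""
    fair = flip(0.5, rng)                 # latent: is the coin fair?
    p_heads = 0.5 if fair else 0.9
    heads = [flip(p_heads, rng) for _ in range(5)]
    return fair, heads

def query(condition, n=20000, seed=0):
    """Rejection sampling: estimate P(fair | condition(heads))."""
    rng, kept, fair_ct = random.Random(seed), 0, 0
    for _ in range(n):
        fair, heads = model(rng)
        if condition(heads):
            kept += 1
            fair_ct += fair
    return fair_ct / kept

# Posterior that the coin is fair, given all five flips came up heads.
post = query(lambda hs: all(hs))
```

The appeal of the framework is that the same pattern scales from this toy model to structured concepts: any stochastic function can serve as a concept, and any condition as evidence.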

  • Edwin Hancock (University of York)

    Information Theoretic Methods for Learning Generative Models for Relational Structures


    This talk focusses on work aimed at developing a principled probabilistic and information theoretic framework for learning generative models of relational structure. The aim is to develop methods that can be used to learn models that can capture the variability present in graph-structures used to represent shapes or arrangements of shape-primitives in images. Here nodes represent the parts or shape-primitives of an object, and the edges represent the relationships which prevail between the parts. The aim is to learn the relationships from examples. Of course such structures can exhibit variability in the arrangement of parts, and the data used in training can be subject to uncertainty. It hence represents a demanding learning problem, for which there is limited available methodology.

    Whereas most traditional pattern recognition and machine learning is concerned with pattern vectors, the issue of how to capture variability in graph, tree or string representations has received relatively little attention in the literature. The main reason for the lack of progress is the difficulty in developing representations that can capture variations in graph-structure. This variability can be attributed to a) variations in either node or edge attributes, b) variations in node or edge composition and c) variations in edge-connectivity. This trichotomy provides a natural framework for analyzing the state-of-the-art in the literature. Most of the work on Bayes nets in the graphical models literature can be viewed as modeling variations in node or edge attributes. Examples also include the work aimed at using Gaussian models to capture variations in edge attributes. The problems of modeling variations in node and edge composition are more challenging since they focus on modeling the structure of the graph rather than its attributes.

    The problem of learning edge structure is probably the most challenging of those listed above. Broadly speaking there are two approaches to characterizing variations in edge structure for graphs. The first of these is graph spectral, while the second is probabilistic. In the case of graph spectra, many of the ideas developed in the generative modeling of shape using principal components analysis can be translated relatively directly to graphs using simple vectorization procedures based on the correspondences conveyed by the ordering of Laplacian eigenvectors. Although these methods are simple and effective, they are limited by the stability of the Laplacian spectrum under perturbations in graph structure. The probabilistic approach is potentially more robust, but requires accurate correspondence information to be inferred from the available graph structure. If this is to hand, then a representation of edge structure can be learned. To date the most effective algorithm falling into this category exploits a part-based representation.
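    The graph-spectral route mentioned above can be sketched in a few lines: embed each graph as the sorted spectrum of its Laplacian, so that standard vector-space (e.g. PCA-style) generative modeling can be applied. The two three-node graphs below are toy examples; real pipelines also use the eigenvectors to induce node correspondences, which this sketch omits.

```python
import numpy as np

def laplacian(adj):
    """Combinatorial graph Laplacian L = D - A."""
    adj = np.asarray(adj, dtype=float)
    return np.diag(adj.sum(axis=1)) - adj

def spectral_vector(adj):
    """Sorted Laplacian eigenvalues as a fixed-length graph descriptor."""
    return np.sort(np.linalg.eigvalsh(laplacian(adj)))

triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path     = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]

v1 = spectral_vector(triangle)  # spectrum (0, 3, 3)
v2 = spectral_vector(path)      # spectrum (0, 1, 3)
```

The limitation noted in the abstract is visible here: small structural perturbations can reorder eigenvalues and eigenvectors, destabilizing the resulting vectorization.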

    In this talk, we focus on the third problem and aim to learn a generative model that can be used to describe the distribution of structural variations present in a set of sample graphs, and in particular to characterize the variations of the edge structure present in the set. We follow recent work by Torsello and Hancock, and pose the problem as that of learning a generative supergraph representation from which we can sample. However, their work is based on trees, and since the trees are rooted the learning process can be effected by performing tree merging operations in polynomial time. This greedy strategy does not translate tractably to graphs, where the complexity becomes exponential, and we require different strategies for learning and sampling. Torsello and Hancock realize both using edit operations; here, on the other hand, we use a soft-assign method for optimization and then generate new instances by Gibbs sampling. We take an information theoretic approach to estimating the supergraph structure by using a minimum description length criterion. By taking into account the overall code-length of the model, MDL allows us to select a supergraph representation that trades off goodness-of-fit to the observed sample graphs against the complexity of the model. We adopt a probabilistic model to furnish the required learning framework and encode the complexity of the supergraph using its von Neumann entropy (i.e. the entropy of its normalized Laplacian eigenvalues). Finally, a variant of the EM algorithm is developed to minimize the total code-length criterion, in which the correspondences between the nodes of the sample graphs and those of the supergraph are treated as missing data. In the maximization step, we update both the node correspondence information and the structure of the supergraph using graduated assignment. This novel technique is applied to a large database of object views, and used to learn class prototypes that can be used for the purposes of object recognition.
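    The von Neumann entropy used as the model-complexity term can be sketched directly: scale the normalized Laplacian by 1/|V| so that its eigenvalues sum to one, then take the Shannon entropy of that spectrum. The graph passed in below is a toy example; this sketch covers only the complexity term, not the full MDL criterion or the EM loop.

```python
import numpy as np

def von_neumann_entropy(adj):
    """Entropy of the normalized Laplacian spectrum, scaled to sum to one."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    # Normalized Laplacian: I - D^{-1/2} A D^{-1/2}; its trace equals |V|.
    l_norm = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt
    lam = np.linalg.eigvalsh(l_norm) / len(adj)  # eigenvalues now sum to 1
    lam = lam[lam > 1e-12]                       # convention: 0 log 0 = 0
    return float(-np.sum(lam * np.log(lam)))

# Denser, more regular graphs have flatter spectra and higher entropy.
s = von_neumann_entropy([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
```

In the MDL objective, this entropy penalizes overly complex supergraphs, balancing against the likelihood of the sample graphs under the model.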