[1] E.T. Jaynes, "Information theory and statistical mechanics", Physical Review,
      106, pp. 620-630, 1957.
[2] S.C. Zhu, Y.N. Wu, and D.B. Mumford, "Minimax Entropy Principle and Its Applications to
      Texture Modeling", Neural Computation, 9, pp. 1627-1660, Nov. 1997.
[3] S.C. Zhu, Y.N. Wu, and D.B. Mumford, "FRAME: Filters, Random fields And Maximum Entropy
      --- towards a unified theory for texture modeling", Int'l Journal of Computer Vision, 27(2),
      pp. 1-20, 1998.

Minimax Entropy: A Mathematical Theory for Descriptive Learning


The red cloud represents a joint density P in a high-dimensional space.
A set of axes (blue) is chosen to observe the distribution, like viewing
a room through an aperture.
What is observed along an axis is a marginal distribution
(the projection of the density P onto that axis), which is
conveniently estimated by an empirical histogram.
In computer vision, one often faces the problem of estimating a probability density that characterizes a certain visual pattern, such as textures, shapes, or generic images. As the ensemble equivalence theorem indicates, a pattern on a finite lattice is defined by a probability model. The difficulty of learning such densities lies in two aspects. One is the well-known curse of dimensionality: the density of a 256 x 256 image lives in a space of 256 x 256 = 65,536 dimensions, with only a small number of observations available. The other is the non-Gaussian shape of the density, of which the vision community is increasingly aware. Intuitively, the density function has multiple modes, usually caused by hidden variables that are not accounted for in the joint distribution.

The minimax entropy learning theory studied in (Zhu, Wu, and Mumford, 1997)[2,3] provides a mathematically rigorous scheme for learning high-dimensional densities. The general idea is illustrated in the figure. For a density in K-dimensional space (see the red cloud, with K=3, in the left figure), one can measure as observations the marginal distributions along various axes, which are projections of the density onto those axes (see the right figure). One then constructs a model that reproduces all observed statistics (marginal distributions). Among all densities satisfying these constraints, we choose the one with maximum entropy[3]. This is posed as a constrained optimization problem and yields the Gibbs (Markov random field) model[1,2]. The axes (observations) must be chosen so that they are informative, in the sense that the constructed model approximates the underlying density by minimizing a Kullback-Leibler divergence (or cross entropy). This leads to the general minimax entropy principle expressed below:
1. We choose the features and statistics F that minimize the entropy of the model.
2. We choose the parameters beta (the model) that give maximum entropy among all densities reproducing the statistics F.
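
The two steps above can be sketched on a toy problem. The example below is an illustrative assumption, not the procedure of [2,3]: the "image space" is an 8 x 8 grid of states, the underlying density f is a hand-made two-mode mixture, the candidate features are four projection axes (x, y, x+y, x-y), and the maximum entropy step is solved by iterative scaling, a standard method for matching marginals. All function and variable names (fit_maxent, candidates, and so on) are hypothetical. The sketch relies on the fact that, for a maximum entropy model p that reproduces the observed marginals of f, KL(f || p) = entropy(p) - entropy(f), so picking the features that minimize the model entropy also minimizes the divergence from f.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a strictly positive probability vector."""
    return float(-np.sum(p * np.log(p)))

def kl(f, p):
    """Kullback-Leibler divergence KL(f || p) for strictly positive vectors."""
    return float(np.sum(f * np.log(f / p)))

def marginal(p, bins, n_bins):
    """Histogram of p along one feature: probability mass in each bin."""
    return np.bincount(bins, weights=p, minlength=n_bins)

def fit_maxent(features, observed, n_states, tol=1e-10, max_sweeps=5000):
    """Max-entropy step: start from the uniform density and rescale it
    (iterative scaling) until every model marginal matches the observed one.
    The result has the Gibbs form p(s) ~ exp(sum_k lambda_k[bin_k(s)])."""
    log_p = np.full(n_states, -np.log(n_states))
    for _ in range(max_sweeps):
        worst = 0.0
        for bins, h_obs in zip(features, observed):
            h_model = marginal(np.exp(log_p), bins, len(h_obs))
            worst = max(worst, float(np.abs(h_model - h_obs).max()))
            log_p += np.log(h_obs[bins]) - np.log(h_model[bins])
        if worst < tol:
            break
    p = np.exp(log_p)
    return p / p.sum()

# Toy "underlying" density f on an 8 x 8 grid (assumed for illustration).
n = 8
xs, ys = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
xs, ys = xs.ravel(), ys.ravel()
f = (np.exp(-((xs - 2) ** 2 + (ys - 2) ** 2) / 4.0)
     + np.exp(-((xs - 5) ** 2 + (ys - 5) ** 2) / 4.0) + 0.05)
f /= f.sum()

# Candidate axes: each maps a state to the bin it falls in when projected.
candidates = {"x": xs, "y": ys, "x+y": xs + ys, "x-y": xs - ys + n - 1}

# Min-entropy step: greedily add the axis whose max-entropy model
# has the lowest entropy.
chosen, entropies, model = [], [], None
while len(chosen) < len(candidates):
    best = None
    for name in candidates:
        if name in chosen:
            continue
        feats = [candidates[c] for c in chosen + [name]]
        obs = [marginal(f, b, int(b.max()) + 1) for b in feats]
        p = fit_maxent(feats, obs, f.size)
        if best is None or entropy(p) < entropy(best[1]):
            best = (name, p)
    chosen.append(best[0])
    model = best[1]
    entropies.append(entropy(model))

print("selection order:", chosen)
print("model entropy after each step:", entropies)
print("entropy of f:", entropy(f), "KL(f || model):", kl(f, model))
```

In this sketch the model entropy can only decrease as axes are added, because each new axis adds constraints, and it stays above the entropy of f. A real texture model would use filter-response histograms over a very large image space, where the marginals of the model are not available in closed form and have to be estimated by sampling.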

Minimax entropy provides a unifying scheme for learning homogeneous and inhomogeneous Gibbs models, as well as for verifying such models. The bottom figure illustrates this scheme.

The minimax entropy learning scheme [2,3]: Given a set of training images as instances of a pattern (texture, shape) generated by some underlying stochastic process, we pursue a Gibbs model by choosing a set of informative features (minimizing the entropy); a maximum entropy distribution is then learned and verified through a general MCMC sampling process. The process ends when the MCMC samples appear indistinguishable from the observed ones.
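
The learn-then-verify loop can be sketched on a deliberately small example. Everything here is an assumption made for illustration, not the setup of [2,3]: the "image" is a binary ring of 256 sites, the single statistic is the fraction of neighboring pairs that differ, the model is p(x) proportional to exp(-beta * number of differing pairs), the sampler is a checkerboard Gibbs sampler, and beta is updated with the usual stochastic maximum-likelihood gradient for exponential families (sampled statistic minus observed statistic). The names gibbs_sweep, unequal_fraction, beta_hat, and h_syn are hypothetical.

```python
import numpy as np

def unequal_fraction(x):
    """Statistic h(x): fraction of neighboring pairs on the ring that differ."""
    return float(np.mean(x != np.roll(x, 1)))

def gibbs_sweep(x, beta, rng):
    """One Gibbs sweep for p(x) ~ exp(-beta * #differing neighbor pairs).
    Even sites are updated together, then odd sites; this is valid because
    sites of one parity are conditionally independent given the other
    parity (the ring length must be even)."""
    n = x.size
    for parity in (0, 1):
        idx = np.arange(parity, n, 2)
        s = x[(idx - 1) % n] + x[(idx + 1) % n]   # neighbors equal to 1
        # P(x_i = 1 | neighbors) = 1 / (1 + exp(-beta * (2 s - 2)))
        p1 = 1.0 / (1.0 + np.exp(-beta * (2 * s - 2)))
        x[idx] = rng.random(idx.size) < p1
    return x

rng = np.random.default_rng(0)

# "Observed" pattern: runs of four 0s and four 1s around the ring.
observed = np.tile(np.repeat([0, 1], 4), 32)
h_obs = unequal_fraction(observed)

# Learning: alternate one sampling sweep with one gradient step on beta.
beta, betas = 0.0, []
x = rng.integers(0, 2, size=observed.size)
for _ in range(400):
    x = gibbs_sweep(x, beta, rng)
    # Too many differing pairs in the sample -> raise beta, and vice versa.
    beta += 0.5 * (unequal_fraction(x) - h_obs)
    betas.append(beta)
beta_hat = float(np.mean(betas[200:]))   # average out the sampling noise

# Verification: synthesize from the learned model and compare statistics.
stats = []
for _ in range(300):
    x = gibbs_sweep(x, beta_hat, rng)
    stats.append(unequal_fraction(x))
h_syn = float(np.mean(stats[100:]))

print("observed statistic:", h_obs)
print("learned beta:", beta_hat)
print("synthesized statistic:", h_syn)
```

In this toy case verification only compares the one statistic the model was built to match. In the scheme described above, the synthesized samples are compared with the observed images more broadly, and a visible mismatch is the signal to go back and choose further features.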