References:

[1] E. T. Jaynes, "Information theory and statistical mechanics", Physical Review, 106, 620-630, 1957.
[2] S. C. Zhu, Y. N. Wu, and D. B. Mumford, "Minimax Entropy Principle and Its Applications to Texture Modeling", Neural Computation, 9, pp. 1627-1660, Nov. 1997.
[3] S. C. Zhu, Y. N. Wu, and D. B. Mumford, "FRAME: Filters, Random fields And Maximum Entropy --- towards a unified theory for texture modeling", Int'l Journal of Computer Vision, 27(2), pp. 1-20, 1998.

Minimax Entropy : a Mathematical Theory for Descriptive Learning

The red cloud represents a joint density P in a high-dimensional space. A set of axes (blue) is chosen to observe the distribution, like viewing a room through an aperture.
What an axis observes is a marginal distribution (the projection of the density P onto that axis), which is conveniently estimated by an empirical histogram.
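The projection-and-histogram observation described above can be sketched numerically. In this toy example, samples from a 3-D Gaussian mixture stand in for the high-dimensional density P; the mixture, the chosen axis, and the bin count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the high-dimensional density P: samples from a
# 3-D mixture of two Gaussians (K = 3, as in the left figure).
samples = np.concatenate([
    rng.normal(loc=[-2.0, 0.0, 1.0], scale=0.5, size=(500, 3)),
    rng.normal(loc=[2.0, 1.0, -1.0], scale=0.7, size=(500, 3)),
])

# An "axis" is a unit vector; the observation is the 1-D projection.
axis = np.array([1.0, 0.0, 0.0])
projection = samples @ axis

# The empirical histogram estimates the marginal distribution along the axis.
hist, bin_edges = np.histogram(projection, bins=20, density=True)

print(hist.sum() * np.diff(bin_edges)[0])  # integrates to ~1.0
```

With `density=True` the histogram is normalized, so it is a proper density estimate of the marginal; the two Gaussian modes of P appear as two bumps along this axis.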

In computer vision, one often faces the problem of estimating a probability
density that is characteristic of a certain visual pattern, such as a texture,
shape, or generic image. Indeed, the ensemble equivalence theorem indicates
that a pattern on a finite lattice is defined by a probability model. The
difficulty of learning such densities lies in two aspects. One is the
well-known curse of dimensionality: the density of a 256 x 256 image lives in
a 256 x 256 = 65,536 dimensional space, while only a small number of
observations is available. The other is the non-Gaussian shape of the density
--- a growing awareness in the vision community. Intuitively, the density
function has multiple modes, usually caused by hidden variables that are not
accounted for in the joint distribution.

The minimax entropy learning theory studied in (Zhu, Wu, and Mumford,
1997) [2,3] provides a mathematically rigorous scheme for learning
high-dimensional densities. The general idea is illustrated in the figure
below. For a density in a K-dimensional space (see the red cloud, with K = 3,
in the left figure), one can measure as observations the marginal
distributions along various axes, which are the projections of the density
onto those axes (see the right figure). One then constructs a model that
reproduces all observed statistics (marginal distributions). Among all
densities satisfying these constraints, we choose the one with maximum
entropy [3]. This is posed as a constrained optimization problem, and it
yields a Gibbs (Markov random field) model [1,2]. The axes (observations)
must be chosen so that they are informative, in the sense that the
constructed model approximates the underlying density by minimizing
a Kullback-Leibler divergence (or cross entropy). This leads to the
general minimax entropy principle expressed below:

1. We choose the best features and statistics F to minimize the entropy
of the model.

2. We choose the best parameters beta, so that the model has maximum entropy.
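The two steps can be written compactly (a sketch of the formulation in [2,3]; here H_alpha(I) denotes the chosen statistic, e.g. the histogram of responses of filter alpha, and Omega_F is the set of densities reproducing the observed statistics of the feature set F):

```latex
% Step 2 (max entropy): among all densities matching the observed
% statistics, pick the least-committed one.
p^{*} = \arg\max_{p \in \Omega_F} \Big\{ -\int p(I)\,\log p(I)\, dI \Big\},
\quad
\Omega_F = \big\{\, p : E_{p}[H_{\alpha}(I)] = \mu_{\alpha}^{\mathrm{obs}},
\ \alpha = 1,\dots,K \,\big\}.

% Solving this constrained optimization by Lagrange multipliers
% yields the Gibbs (Markov random field) form.
p(I; \Lambda) = \frac{1}{Z(\Lambda)}
\exp\Big\{ -\sum_{\alpha=1}^{K}
\langle \lambda_{\alpha}, H_{\alpha}(I) \rangle \Big\}.

% Step 1 (min entropy): choose the feature set whose max-entropy model
% has the lowest entropy; since the model matches the statistics of the
% true density f, this is equivalent to minimizing the KL divergence.
F^{*} = \arg\min_{F}\, \mathrm{entropy}\big( p(I; \Lambda_{F}) \big)
      = \arg\min_{F}\, KL\big( f \,\|\, p(I; \Lambda_{F}) \big).
```

The equivalence in the last line holds because KL(f || p) = entropy(p) - entropy(f) when p matches the statistics of f, and entropy(f) does not depend on F.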

Minimax entropy provides a unified scheme for learning both homogeneous and inhomogeneous Gibbs models, as well as for verifying such models. The bottom figure illustrates this scheme.

The minimax entropy learning scheme [2,3]:
Given a set of training images as instances of a pattern (texture, shape)
generated by some underlying stochastic process, we pursue a Gibbs model
by choosing a set of informative features (minimizing the entropy); a
maximum entropy distribution is then learned and verified through a
general MCMC sampling process. The process ends when the MCMC samples
appear indistinguishable from the observed ones.
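This loop can be sketched numerically on a toy problem. This is not the full FRAME algorithm: the "image" space here is just the integers 0-9, and two hand-picked statistics stand in for filter-response histograms; the learning rate and iteration counts are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D stand-in for image space: x in {0, ..., 9}. Two hypothetical
# features play the role of filter statistics in the FRAME model.
xs = np.arange(10)
features = np.stack([xs / 9.0, (xs / 9.0) ** 2])   # shape (2, 10)

# "Observed" statistics, taken here from a known target Gibbs density
# so that we can check that learning recovers them.
target = np.exp(2.0 * features[0] - 3.0 * features[1])
target /= target.sum()
mu_obs = features @ target

# Max-entropy step: fit the Gibbs model p(x) ~ exp(lam . f(x)) by
# gradient ascent; the log-likelihood gradient is mu_obs - E_p[f].
lam = np.zeros(2)
for _ in range(5000):
    p = np.exp(lam @ features)
    p /= p.sum()
    lam += 2.0 * (mu_obs - features @ p)

# Verification step: Metropolis (MCMC) sampling from the learned model.
def energy(x):
    return -lam @ features[:, x]

x, chain = 0, []
for _ in range(20000):
    y = rng.integers(10)                                # uniform proposal
    if rng.random() < np.exp(energy(x) - energy(y)):    # accept/reject
        x = y
    chain.append(x)

mu_mcmc = features[:, chain[2000:]].mean(axis=1)
print(np.round(mu_obs, 3), np.round(mu_mcmc, 3))  # the two should nearly match
```

In the full scheme the samples are images drawn by a Gibbs sampler, and learning stops when their filter histograms are indistinguishable from those of the training images; here the same stopping criterion reduces to the sampled feature means matching `mu_obs`.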