We[4] have attempted three models for textons. The results for a cheetah image are shown below. Click each topic for details.

Textons by Filter TCA

Textons by Patch TCA

Textons by Wavelet bases


Textons refer to the fundamental micro-structures in generic natural images and the basic elements of early (pre-attentive) visual perception. In practice, the study of textons has important implications for a series of problems. Firstly, decomposing an image into its constituent components reduces information redundancy and thus leads to better image coding algorithms. Secondly, the decomposed image representation often has much lower dimension and less dependence between variables (coefficients), and therefore facilitates the image modeling that is necessary for image segmentation and recognition. Thirdly, in biological vision, the micro-structures in natural images provide an ecological cue for understanding the functions of neurons in the early stages of the biological vision system. However, in the literature of computer vision and visual perception, the word "texton" remains a vague concept, and a precise mathematical definition has yet to be found. Here we present some studies related to this topic.

  1. Sparse coding with over-complete basis 

As shown in Figure 1, Olshausen and Field (1997)[3] learned a set of bases in non-parametric form from a large ensemble of image patches. These bases form an over-complete set learned under the general idea of sparse coding. In contrast to the orthogonal bases or tight frames of the Fourier and wavelet transforms, these bases are highly correlated.

Figure 1. Some image bases learned with sparse coding (Olshausen and Field 1997)
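
To make the idea of coding with an over-complete, correlated basis concrete, the following is a minimal sketch rather than the procedure of [3]: it keeps a small dictionary fixed and only infers sparse coefficients, using iterative soft-thresholding (ISTA) on an L1-penalized least-squares objective. The toy dictionary of three atoms in a 2-D signal space, the function names, and the values of `lam` and `step` are all assumptions made for illustration.

```python
import math

def soft_threshold(v, t):
    """Shrink v toward zero by t; values within [-t, t] become zero."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def ista(x, atoms, lam=0.1, step=0.3, n_iter=500):
    """Approximately minimise 0.5*||x - sum_k a[k]*atoms[k]||^2 + lam*||a||_1."""
    a = [0.0] * len(atoms)
    for _ in range(n_iter):
        recon = [sum(ak * atom[i] for ak, atom in zip(a, atoms))
                 for i in range(len(x))]
        resid = [xi - ri for xi, ri in zip(x, recon)]
        # Correlation of each atom with the residual (negative gradient).
        corr = [sum(ai * ri for ai, ri in zip(atom, resid)) for atom in atoms]
        # Gradient step on the quadratic term, then shrinkage for the L1 term.
        a = [soft_threshold(ak + step * ck, step * lam)
             for ak, ck in zip(a, corr)]
    return a

# Hypothetical toy dictionary: three unit-norm, mutually correlated atoms
# in a 2-D signal space, so the set is over-complete (3 atoms > 2 dimensions).
atoms = [[math.cos(t), math.sin(t)] for t in (0.0, math.pi / 3, 2 * math.pi / 3)]
x = [2.0 * v for v in atoms[0]]          # signal: a multiple of the first atom
code = ista(x, atoms)
active = [k for k, ak in enumerate(code) if abs(ak) > 1e-9]
print(active, [round(ak, 3) for ak in code])
```

Because the atoms are not orthogonal, many coefficient vectors reconstruct the same signal; the L1 penalty is what selects the representation with few active bases.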


  1. K-means clustering in feature space

Leung and Malik (1999)[2] use a discriminative model to compute image elements by clustering filter responses. At each pixel, the image is convolved with a pyramid of filters at various scales and orientations. The filter responses are then clustered by the K-means method. Because the feature vector over-constrains a local image patch, a pseudo-inverse method can recover an image icon from each cluster center, as shown in Figure 2 below. An obvious problem is that what is potentially the same image structure appears multiple times, as shifted, rotated, or scaled versions of each other.


Figure 2. (a) Polka-dot image. (b) Textons found via K-means with K=25. (c) Mapping of pixels to the texton channels. (Leung and Malik 1999)
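
The pipeline described above (filter responses per pixel, then K-means) can be sketched on a toy image as follows. This is only an illustration under stated assumptions: the three 3x3 filters stand in for a much larger multi-scale, multi-orientation bank, the farthest-first initialisation is chosen here to keep the run deterministic, and the names `FILTERS`, `filter_responses`, `kmeans`, and `texton_map` are hypothetical.

```python
# Hypothetical 3x3 filter bank: two edge detectors and a local average.
FILTERS = [
    [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],      # horizontal derivative
    [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],      # vertical derivative
    [[1 / 9] * 3 for _ in range(3)],           # local mean
]

def sqdist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def filter_responses(img):
    """Return interior pixel coordinates and their filter-response vectors."""
    coords = [(y, x) for y in range(1, len(img) - 1)
              for x in range(1, len(img[0]) - 1)]
    # Correlation of each filter with the 3x3 neighbourhood of the pixel.
    feats = [[sum(f[j][i] * img[y + j - 1][x + i - 1]
                  for j in range(3) for i in range(3))
              for f in FILTERS] for y, x in coords]
    return coords, feats

def kmeans(points, k, n_iter=20):
    """Plain K-means with deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(sqdist(p, c) for c in centers)))
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sqdist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

# Toy image: dark left half, bright right half (a vertical step edge).
img = [[0.0] * 4 + [1.0] * 4 for _ in range(6)]
coords, feats = filter_responses(img)
centers, labels = kmeans(feats, k=3)
texton_map = dict(zip(coords, labels))    # pixel -> texton channel
```

On this image the three clusters separate the dark region, the edge, and the bright region. A rotated copy of the same edge would produce a different response vector and hence a separate cluster, which is the redundancy noted above.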


  1. Transformed components in filter space

To overcome this problem in Leung and Malik's model[2], we[4] adopt a transformed component analysis (TCA) method, which introduces the transformation as a hidden (latent) variable. Image structures that are potentially the same are transformed and thus combined into one cluster. Figure 3 shows two examples. For more details and examples, please see here.


Figure 3. Textons learned by the TCA method in filter space. The label maps are shown to the right of the icons. (Zhu, Guo, Wu and Wang 2002)
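
The effect of treating the transformation as a latent variable can be sketched with a simplified, hard-assignment variant of clustering. This is an illustrative stand-in, not the TCA formulation of [4]. It assumes each feature vector holds responses at four orientations, so that rotating the underlying structure corresponds to a cyclic shift of the channels; the function names and the toy data are hypothetical.

```python
def shift(v, s):
    """Cyclically shift the orientation channels of v by s positions."""
    s %= len(v)
    return v[s:] + v[:s]

def sqdist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def best_alignment(v, centers):
    """Return the (cluster, shift) pair with the smallest distance after shifting v."""
    candidates = [(c, s) for c in range(len(centers)) for s in range(len(v))]
    return min(candidates,
               key=lambda cs: sqdist(shift(v, cs[1]), centers[cs[0]]))

def transformed_kmeans(samples, init_centers, n_iter=10):
    """K-means in which every sample is first aligned to its cluster center."""
    centers = [list(c) for c in init_centers]
    for _ in range(n_iter):
        # Choose the latent transformation and the cluster jointly.
        assign = [best_alignment(v, centers) for v in samples]
        # Update each center from the aligned (transformed) samples only.
        for c in range(len(centers)):
            aligned = [shift(v, s)
                       for v, (ci, s) in zip(samples, assign) if ci == c]
            if aligned:
                centers[c] = [sum(col) / len(aligned) for col in zip(*aligned)]
    return centers, assign

edge = [1.0, 0.0, 0.0, 0.0]     # responds at a single orientation
cross = [1.0, 0.0, 1.0, 0.0]    # responds at two perpendicular orientations
samples = [shift(edge, s) for s in range(4)] + [shift(cross, s) for s in range(2)]
centers, assign = transformed_kmeans(samples, [edge, cross])
clusters = [c for c, _ in assign]
```

All four rotated copies of the edge fall into one cluster, whereas ordinary K-means would need one center per orientation.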


  1. Transformed components in image space

For the method above, even though a local image patch can be recovered from the feature vector of filter responses by methods such as the pseudo-inverse, this is not a convenient way to reconstruct the image. It is a discriminative model, not a generative model.

We[1] build a generative model by replacing the filter responses with image patches as the features. The image patches can move within a local area and can be rotated and scaled. As with TCA in filter space, these local patches are transformed to form tight clusters in the 121-dimensional space by an EM algorithm. Image elements found in a cheetah skin pattern are shown in Figure 4. For more examples, please see here.

Figure 4. Textons learned by TCA in image space. (Guo, Zhu and Wu 2001)
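
As a rough sketch of how EM can cluster patches under latent transformations, the example below models each patch as a transformed cluster center plus isotropic Gaussian noise. It is a simplification under stated assumptions and not the model of [1]: the only transformations are the four 90-degree rotations (no translation or scaling), the patches are 3x3, the noise level `sigma` is fixed, all clusters and rotations have equal prior weight, and every name is hypothetical.

```python
import math

def rot90(p):
    """Rotate a square patch (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*p[::-1])]

def rotations(p):
    out = [p]
    for _ in range(3):
        out.append(rot90(out[-1]))
    return out

def flat(p):
    return [v for row in p for v in row]

def em_transformed(patches, init_centers, sigma=0.5, n_iter=10):
    centers = [flat(c) for c in init_centers]
    # cands[n][s]: patch n under rotation s, flattened to a vector.
    cands = [[flat(r) for r in rotations(p)] for p in patches]
    for _ in range(n_iter):
        # E-step: responsibility of every (cluster, rotation) pair per patch.
        resp = []
        for cand in cands:
            logw = [[-sum((a - b) ** 2 for a, b in zip(v, mu)) / (2 * sigma ** 2)
                     for v in cand] for mu in centers]
            m = max(max(row) for row in logw)      # for numerical stability
            w = [[math.exp(l - m) for l in row] for row in logw]
            z = sum(sum(row) for row in w)
            resp.append([[x / z for x in row] for row in w])
        # M-step: each center is the responsibility-weighted mean of the
        # rotated patches.
        for c in range(len(centers)):
            total = sum(sum(r[c]) for r in resp)
            if total > 0:
                centers[c] = [
                    sum(r[c][s] * cand[s][i]
                        for r, cand in zip(resp, cands) for s in range(4)) / total
                    for i in range(len(centers[c]))]
    return centers, resp

corner = [[1, 1, 0], [1, 0, 0], [0, 0, 0]]
bar = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
patches = rotations(corner) + rotations(bar)[:2]
centers, resp = em_transformed(patches, [corner, bar])
patch_labels = [max(range(2), key=lambda c: sum(r[c])) for r in resp]
```

Because the cluster centers live in pixel space, each one can be displayed directly as an image icon, with no pseudo-inverse step.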


  1. Texton learning: from bases to textons 

One of the main problems with previous work is the lack of variability in the learned image elements. We[4] propose to define a "texton" as a mini-template that consists of a number of bases in some geometric and photometric configuration. Figure 5 shows the example of a star pattern. A "star" is represented explicitly by a generative model of several bases. In addition, these bases are no longer assumed to be independently distributed: a probabilistic model that accounts for the spatial relations among the bases is built and learned. For more examples, please see here.

a) Reconstructing a star pattern with two layers of bases. An individual star is decomposed into a LoG base in the upper layer for the body of the star, plus a few other bases (mostly Gcos and Gsin) in the lower layer for the angles.

b) The texton template for the star pattern.


c) How bases compose the image of a star.

Figure 5. Illustration of the step from bases to textons, using a star pattern as an example. (Zhu, Guo, Wu and Wang 2002)
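
The notion of a texton as a mini-template of bases can be pictured as a small data structure. The sketch below covers only the deterministic part, namely a template and its placement in the image under a geometric and photometric transformation. It does not attempt the learned probabilistic model. The field names, the five-point layout, and all numeric values are assumptions made for illustration.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Base:
    kind: str          # "LoG", "Gcos" or "Gsin"
    x: float           # position (template frame or image frame)
    y: float
    theta: float       # orientation in radians
    scale: float
    amplitude: float   # photometric coefficient

def star_template(n_points=5, radius=3.0):
    """Hypothetical star: one LoG base for the body, one Gcos base per point."""
    bases = [Base("LoG", 0.0, 0.0, 0.0, 2.0, 1.0)]
    for k in range(n_points):
        phi = 2 * math.pi * k / n_points
        bases.append(Base("Gcos", radius * math.cos(phi), radius * math.sin(phi),
                          phi, 1.0, 0.5))
    return bases

def instantiate(template, x, y, rotation=0.0, zoom=1.0, contrast=1.0):
    """Place a texton in the image: rotate, scale and shift every base together."""
    c, s = math.cos(rotation), math.sin(rotation)
    return [Base(b.kind,
                 x + zoom * (c * b.x - s * b.y),
                 y + zoom * (s * b.x + c * b.y),
                 b.theta + rotation,
                 b.scale * zoom,
                 b.amplitude * contrast)
            for b in template]

star = star_template()
placed = instantiate(star, x=10.0, y=20.0, rotation=math.pi / 2, zoom=2.0)
```

Since the bases move together, one template accounts for every shifted, rotated, or scaled instance of the star, while variation among instances can be expressed as perturbations of the individual base parameters.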



  1. Guo, C., Zhu, S. and Wu, Y. "Visual learning by integrating descriptive and generative methods", Proc. of 8th Int'l Conf. on Computer Vision, Vancouver, Canada, July 2001

  2. Leung, T. and Malik, J. "Recognizing surfaces using three-dimensional textons", Proc. of 7th Int'l Conf. on Computer Vision, Corfu, Greece, September 1999

  3. Olshausen, B. and Field, D. "Sparse coding with an over-complete basis set: A strategy employed by V1?", Vision Research, 37:3311-3325, 1997

  4. Zhu, S., Guo, C., Wu, Y. and Wang, Y. "What are Textons?", Proc. of 7th European Conf. on Computer Vision, Copenhagen, Denmark, May-June 2002

  5. Guo, C., Zhu, S. and Wu, Y. "A Mathematical Theory of Primal Sketch and Sketchability", Proc. of Int'l Conf. on Computer Vision, Nice, France, 2003