Images are ambiguous at each of many levels of a contextual hierarchy. Nevertheless, most scenes are highly unambiguous at the level of interpretation, as evidenced by the superior performance of human observers. These observations argue for global vision models, such as deformable templates. Unfortunately, such models are computationally intractable for unconstrained problems. We will propose a compositional model in which more-or-less local entities are recursively composed, subject to syntactic restrictions, to form objects and object groupings. A gradient favoring composition is imposed via a description-length cost functional, thereby casting the recognition problem in a Bayesian framework. The actual recognition engine generates multiple compositional structures, corresponding to multiple scene interpretations, which are later resolved by appealing to the minimum-description-length principle. Viewed from the Bayesian perspective, this amounts to computing an approximate MAP labeling of the image.
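As a minimal sketch of this connection (in generic notation not taken from the abstract itself), let D denote the image data and I a candidate compositional interpretation. Choosing the interpretation with the shortest two-part description length,

    I* = argmin_I [ -log P(D | I) - log P(I) ] = argmax_I P(I | D),

is equivalent to MAP estimation, since -log P(D | I) is the code length of the data given the interpretation and -log P(I) is the code length of the interpretation itself.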
In the companion paper, we argued in favor of a compositional approach
to artificial vision: rather than representing shapes as points in a
feature space in such a way that metric relations between
representations agree with metric relations between shapes (see
e.g. Shimon Edelman), we propose to construe representations
themselves as relationships. The nature of the relationships used is
highly problem-dependent, but, in general, representations are
constructed hierarchically/recursively in terms of parts and their
relations, starting with a small number of low-level constituents, and
eventually resulting, through strictly constrained composition rules,
in unlimited numbers of composite items. Here we argue that
composition is central to human cognition, and that compositional
models provide, in principle, a way to address the following three
major aspects of cognitive function, in particular vision: (i) human
recognition performance is excellent despite ambiguity at all but
the most global levels of interpretation; (ii) cognitive functions
display a considerable amount of invariance; (iii) computational
requirements in recognition tasks may be daunting. Drawing on these
observations, we suggest that brains are likely equipped with a mechanism
allowing them to perform dynamical relational binding between neural
representations. We discuss candidate mechanisms, in particular
mechanisms that rely on the fine temporal structure of neural
activity.
The human visual system can recognize unprimed views of common objects at
sustained rates in excess of 10 per second. How can a visual system work
so fast? A classical hypothesis holds that the visual system is
organized as a feedforward feature-extracting hierarchy that builds a
progressively more identity-specific but viewpoint-invariant representation
of visual objects [e.g. Rosenblatt, Fukushima]. Recent neurophysiological
results [Tanaka, et al.] extending classical results from other groups are
intriguing, as they demonstrate a substantial population of neurons in the
``object recognition areas'' of the primate visual system that respond best
to specific complex mini-patterns, e.g. localized conjunctions of contour,
texture, and/or color elements. In many cases, these neurons exhibit
considerable insensitivity to changes in viewpoint-related parameters, such
as stimulus position and scale, while remaining selective for their
preferred stimulus pattern. These empirical data seem to support the idea
that the brain uses a set of features that is (1) large in number, (2)
dominated by spatially localized measures, and (3) based on multiple visual
cues. The first aspect may relate to the several advantages of
high-dimensional feature-space representations, a topic recently discussed at
length elsewhere [Califano & Mohan]. The second aspect may relate to (i)
the need to cope with non-rigid object transformations, which preserve
local but not global structure, (ii) the need to cope with object textures,
defined in large part by local relative-orientation structure, and (iii)
the need to cope with occlusion and clutter, which are least disruptive to
an object's code when derived from features with localized support. The
third aspect may relate to (i) the need to maximize object discrimination
power by utilizing all available visual cues, (ii) the need to richly
represent objects of many different types, and (iii) the need to ``buffer''
the visual representation of objects or scenes against a variety of forms
of image degradation, to which different visual cues are by nature
differentially sensitive.
In this vein, a view-based recognition system called SEEMORE is described,
based on a set of 100 feature invariants that emphasize spatially localized
receptive-field-style computations, and which are collectively sensitive to
a range of visual cues (contour shape, color, and texture). SEEMORE's
architecture is essentially a ``histogramming'' scheme, similar to the
color histogramming approach of [Swain & Ballard], but including shape and
texture-related ``bins'' in addition to color. Experiments reveal good
recognition performance in a 3-D object recognition problem with 100
objects of many types (rigid, non-rigid, ``statistical'', views of complex
scenes, etc.), and entailing image transformations that include rotations
in depth and the image plane, scaling, non-rigid object deformations,
partial occlusion, limited clutter, and other types of image degradation.
An optimization scheme is developed to scale individual feature dimensions
in order to maximize the performance of SEEMORE's high-dimensional
nearest-neighbor classifier. Generalization behavior and classification
errors are illustrated, showing the emergence of several striking natural
object categories that are not explicitly encoded in the dimensions of the
feature space.
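As a rough sketch of this kind of architecture (the function names, feature extractors, and per-dimension scaling below are illustrative assumptions, not SEEMORE's actual code), recognition reduces to a scaled nearest-neighbor lookup over concatenated feature histograms:

    import numpy as np

    def feature_histogram(image, extractors):
        # Concatenate the bin counts of each localized feature extractor
        # (e.g. color, contour shape, texture) into one long feature vector.
        return np.concatenate([extract(image) for extract in extractors])

    def fit_scaling(train_vectors):
        # Stand-in for the optimization that weights each feature dimension
        # to maximize classifier performance; here each dimension is simply
        # normalized by its standard deviation over the training set.
        return 1.0 / (train_vectors.std(axis=0) + 1e-8)

    def classify(query, train_vectors, train_labels, scale):
        # Scaled nearest-neighbor classification in feature-histogram space.
        dists = np.linalg.norm((train_vectors - query) * scale, axis=1)
        return train_labels[np.argmin(dists)]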
It has been argued that it is possible to account for the principal
properties of cells in primary visual cortex in terms of their ability to
produce sparse responses to natural scenes. In a neural network that optimizes
sparseness and minimizes reconstruction error, units develop that are
localized, bandpass and oriented, similar to cortical simple cells. It is
argued that the responses of these cells may be as independent as possible
given a linear code. To account for higher levels of visual processing, it
is proposed that one must consider the more complex forms of redundancy
found in object relationships and natural scenes. In this talk, we
consider particular forms of such redundancy not captured by linear codes,
and discuss how particular types of non-linearity can transform this
redundancy.
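In generic notation (a sketch; the particular penalty S and trade-off parameter lambda are assumptions, not specifics of the talk), the sparseness-plus-reconstruction objective mentioned above can be written as

    E = sum_x [ I(x) - sum_i a_i phi_i(x) ]^2  +  lambda * sum_i S(a_i),

where the first term is the reconstruction error of the image I from basis functions phi_i with coefficients a_i, and the second term penalizes non-sparse coefficient vectors; minimizing E over both the coefficients and the basis functions yields localized, oriented, bandpass units.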
Objects may appear on the retina at many different positions, sizes, and
orientations, or under other geometric distortions. These variations
produce a tremendous amount of redundancy in images that must be dealt
with efficiently and effectively. Flexible templates provide one
method for doing so, because they represent the variations
independently from the object structure. How would a flexible
template scheme be implemented in the brain? In this talk, I will
present several alternatives that have been proposed for implementing
flexible templates in neural circuitry. I will discuss the advantages
and disadvantages from a computational perspective, and ways of testing
these theories neurophysiologically and psychophysically.
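As a generic formalization (the notation here is assumed, not taken from the talk), a flexible-template match can be written as

    min_theta sum_x [ I(x) - M(w_theta(x)) ]^2,

where M is the stored object template and w_theta is a parameterized geometric deformation (translation, scaling, rotation, and so on); the object's structure lives in M while the viewing variations live in theta, which is what allows the two to be represented independently.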
In recent years a number of researchers have reported scaling power
spectra in static natural images. That is, their spectra take the
form of a power-law. This result is surprisingly robust given the
variety in each team's choice of image calibration and subject
matter. We propose that the salient universal structure present in
natural images is that they are composed of statistically independent
occluding objects.
In such a world the correlation function is generated by two
underlying causes: the distribution of object-to-object transitions,
and the correlations present within objects. We show that
correlations present within objects have little spatial structure, and
thus the overall correlation function of natural images is dominated
by the probability of object transitions. If the transition
probability distribution is power-law in the separation distance, then
the correlation function (and thus the power spectrum) will also be
power-law. Further, this result is unaffected by image calibration
since object transitions are robust to this transformation.
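A one-line version of this argument (in assumed notation): if two pixels separated by distance d fall within the same object with probability q(d), object intensities are drawn independently across objects, and within-object correlations have little structure (as argued above), then the correlation function is approximately

    C(d) ~= sigma^2 * q(d),

where sigma^2 is the intensity variance; a power-law q(d) therefore directly yields a power-law correlation function and hence a power-law power spectrum.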
By creating images of occluding objects using a given object size
distribution we can create any form of power spectrum we wish. We
demonstrate that it is not simply the presence of edges within images
which gives rise to a scaling spectrum of the form 1/f^2. Rather, it
is the probability distribution of object transitions which plays the
important role. Finally, our results also extend to recent findings on
scaling in natural spatio-temporal image sequences (Dong & Atick,
Network, 1995).
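A toy 1-D simulation of this claim might look like the following (the size distribution, its exponent, and all other choices are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 2 ** 16
    image = np.empty(N)
    i = 0
    while i < N:
        # Occluding "objects": constant-intensity segments whose lengths
        # follow a heavy-tailed (power-law) size distribution.
        size = int(np.ceil(rng.pareto(1.5))) + 1
        image[i:i + size] = rng.uniform(-1.0, 1.0)
        i += size

    spectrum = np.abs(np.fft.rfft(image - image.mean())) ** 2
    freqs = np.fft.rfftfreq(N)
    # The log-log slope of the (noisy) periodogram approximates the spectral
    # exponent; changing the size distribution changes this exponent.
    slope = np.polyfit(np.log(freqs[1:]), np.log(spectrum[1:]), 1)[0]
    print("estimated spectral slope:", slope)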
Because local disparity information is often sparse and noisy, there
are two conflicting demands when estimating disparity in an image
region: the need to average spatially to obtain an accurate estimate, and
the need to avoid averaging over discontinuities. We have developed
a network model of disparity estimation based on disparity-selective
neurons, such as those found in the early stages of processing in
visual cortex. The model can accurately estimate multiple disparities
in a region, which may be caused by transparency or occlusion.
The model consists of several stages and computes its output using
only feedforward processing. One-dimensional binocular retinal input
is preprocessed with disparity energy filters at a range of spatial
frequencies and phases. The output of these disparity energy filters
forms the input to two separate pathways: the local disparity pathway,
and the selection pathway. The local disparity pathway computes an
estimate of the disparity in a local region of the image. Because
these local disparity measurements may be unreliable, a process is
needed to determine which signals to integrate. The selection pathway
fulfills this role by selectively gating those disparity signals that
reliably indicate the true disparity of the object. The output of
this stereo model is a distributed representation of disparity.
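A minimal sketch of the gating step (the array names and shapes below are assumptions, not the model's actual variables): each position contributes its local disparity distribution only to the extent that the selection pathway judges it reliable.

    import numpy as np

    def integrate_disparity(local_estimates, gates):
        # local_estimates: (n_positions, n_disparities) activity of the local
        #                  disparity pathway (one distribution per position).
        # gates:           (n_positions,) reliability signals from the
        #                  selection pathway, between 0 and 1.
        gated = local_estimates * gates[:, None]         # suppress unreliable sites
        return gated.sum(axis=0) / (gates.sum() + 1e-8)  # pooled distributed code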
This selective integration of reliable local estimates enables the
network to demonstrate stereo hyperacuity (i.e., sub-pixel disparity
estimation from pixel-based inputs) on normal and transparent
random-dot stereograms. Analysis of the model suggests that the
selection units respond to disparity contrast --- that is,
edges in depth. We predict that neurons in visual area MT will
demonstrate a similar selectivity to disparity contrast.
Cortical receptive fields (RFs) are influenced by the nature of the visual
environment. It has been argued that learning is responsible for
creating "optimal" feature detectors in the cortex, and that principal
components are the "optimal" projections. I will discuss the notion
of optimality and how it applies to the visual cortex: the number of
neurons in visual cortex is an order of magnitude larger than the
number of input lines from the LGN, and therefore different notions of
optimality should apply here.
The receptive fields extracted by a principal component rule are
sensitive to the second-order statistics of the visual environment. I
will show how to extract a representation for the spectrum of natural
images. This spectrum will be decomposed into a radially symmetric,
scale-invariant component and a non-radially-symmetric portion. I
will use this decomposition to derive the receptive fields extracted
by the principal component rule from such an environment.
Comparison with simulation results shows that the RFs are sensitive
both to the radially symmetric and the non-symmetric portions of the
spectrum. They develop orientation selectivity, but differ from those
found biologically in several respects.
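Since a principal component rule depends only on second-order statistics, its receptive fields can be computed directly from the patch covariance; a minimal sketch (array names are assumed):

    import numpy as np

    def principal_component_rfs(patches, n_components=8):
        # patches: (n_patches, n_pixels) array of image patches drawn from
        # the visual environment.
        patches = patches - patches.mean(axis=0)
        cov = patches.T @ patches / len(patches)        # second-order statistics
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1]               # sort by variance
        return eigvecs[:, order[:n_components]].T       # rows = receptive fields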
Images formed by a human face change with viewpoint. A new technique
is described for synthesizing images of faces from new viewpoints,
when only a single 2D image is available. The technique draws on a
single generic 3D model of a human head and on prior knowledge of
faces based on example images of other faces seen in different poses.
The example images are used to ``learn'' a pose-invariant shape and
texture description of a new face. The representations are based
on the idea of linear object classes. These are 3D objects whose
3D shape can be represented as a linear combination of a sufficiently
small number of prototypical objects. The separation of shape and
texture information in images of human faces was done using point
correspondence between the different facial images, which was
established automatically through optical flow algorithms.
Linear object classes have the property that new orthographic views
of any object of the class under uniform affine 3D transformations,
and in particular rigid transformations in 3D, can be generated
exactly if the corresponding transformed views are known for the set
of prototypes. Thus if the training set consists of frontal and rotated
views of a set of prototype faces, any rotated view of a new face can
be generated from a single frontal view -- provided that the linear
class assumption holds.
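A minimal sketch of the linear-object-class step (array names and shapes are assumptions): the coefficients that express a new frontal view as a linear combination of prototype frontal views are reused to combine the rotated prototype views.

    import numpy as np

    def synthesize_rotated_view(new_frontal, proto_frontal, proto_rotated):
        # new_frontal:   (n_features,) vectorized frontal view of the new face.
        # proto_frontal: (n_features, n_prototypes) frontal prototype views.
        # proto_rotated: (n_features, n_prototypes) the same prototypes rotated.
        alpha, *_ = np.linalg.lstsq(proto_frontal, new_frontal, rcond=None)
        return proto_rotated @ alpha   # predicted rotated view of the new face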
This linear class approach works well for features shared by all
faces (e.g. eyebrows, nose, mouth, or the ears), but it has limited
representational possibilities for features particular to an
individual face (e.g. a mole on the cheek).
To overcome this problem, a single 3D model of a human head is added
to the linear class approach. Face textures mapped onto the 3D model
can be transformed into a new pose. The final ``rotated'' image for a
given face can then be generated by applying, to this new image of the
3D model, the shape transformation given by the linear object
class approach.
V1 neurons exhibit many layers of responses. Recent
neurophysiological evidence suggests that they are sensitive to the
inside-outside relationship, the boundary, the medial axis, and the
spatial extent of globally defined shapes. There is also evidence
for a neural correlate of amodal completion of occluded surfaces. This
evidence, together with Kovacs and Julesz's psychophysical evidence,
suggests that the medial axis might play a key role in representing shape
in the visual system, and that such a representation might exist as early
as V1. Furthermore, this diversity in neural responses, even in the
same cells, seems incompatible with the one-neuron-one-feature
idea. Computationally, we know "shape" must somehow be computed, but
no successful vision system has done this in a purely feedforward manner. A
radical reinterpretation of the classical paradigm is called for: one
such reinterpretation proposes a) multiplexing of data in every spike train
and b) a role for V1 not merely as an initial stage in visual computation,
but as a high-resolution buffer that remains active throughout the whole
visual computation.
Barlow's theory regarding suspicious coincidence detectors calls for
a neuronal mechanism for prior probability
estimation. Such a mechanism will be described, along with some of its
implications for sensory representation.
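One standard way of making this concrete (a sketch, not necessarily the formulation used in the talk): a conjunction of events A and B is a suspicious coincidence when it occurs more often than expected by chance, i.e. when

    log [ P(A, B) / ( P(A) * P(B) ) ]  >  0,

and evaluating this quantity requires the very prior probabilities P(A) and P(B) that the proposed neuronal mechanism is meant to estimate.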