Images are ambiguous at each of many levels of a contextual hierarchy. Nevertheless, most scenes are highly unambiguous at the level of interpretation, as evidenced by the superior performance of human observers. These observations argue for global vision models, such as deformable templates. Unfortunately, such models are computationally intractable for unconstrained problems. We will propose a compositional model in which more-or-less local entities are recursively composed, subject to syntactic restrictions, to form objects and object groupings. A gradient favoring composition is imposed via a description-length cost functional, thereby casting the recognition problem in a Bayesian framework. The actual recognition engine generates multiple compositional structures, corresponding to multiple scene interpretations, which are later resolved by appealing to the minimum-description-length principle. Viewed from the Bayesian perspective, this amounts to computing an approximate MAP labeling of the image.
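As a minimal sketch of this connection (in generic notation not taken from the abstract itself), let D denote the image data and I a candidate compositional interpretation. Choosing the interpretation with the shortest two-part description length,

    I* = argmin_I [ -log P(D | I) - log P(I) ] = argmax_I P(I | D),

is equivalent to MAP estimation, since -log P(D | I) is the code length of the data given the interpretation and -log P(I) is the code length of the interpretation itself.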
In the companion paper, we argued in favor of a compositional approach
to artificial vision: rather than representing shapes as points in a
feature space in such a way that metric relations between
representations agree with metric relations between shapes (see
e.g. Shimon Edelman), we propose to construe representations
themselves as relationships. The nature of the relationships used is
highly problem-dependent, but, in general, representations are
constructed hierarchically/recursively in terms of parts and their
relations, starting with a small number of low-level constituents, and
eventually resulting, through strictly constrained composition rules,
in unlimited numbers of composite items. Here we argue that
composition is central to human cognition, and that compositional
models provide, in principle, a way to address the following three
major aspects of cognitive function, in particular vision: (i) human
recognition performance is excellent despite ambiguity at all but
the most global levels of interpretation; (ii) cognitive functions
display a considerable amount of invariance; (iii) computational
requirements in recognition tasks may be daunting. Drawing on these
observations, we suggest that brains are likely equipped with a mechanism
allowing them to perform dynamical relational binding between neural
representations. We discuss candidate mechanisms, in particular
mechanisms that rely on the fine temporal structure of neural
activity.
The human visual system can recognize unprimed views of common objects at
sustained rates in excess of 10 per second. How can a visual system work
so fast? A classical hypothesis holds that the visual system is
organized as a feedforward feature-extracting hierarchy that builds a
progressively more identity-specific but viewpoint-invariant representation
of visual objects [e.g. Rosenblatt, Fukushima]. Recent neurophysiological
results [Tanaka, et al.] extending classical results from other groups are
intriguing, as they demonstrate a substantial population of neurons in the
``object recognition areas'' of the primate visual system that respond best
to specific complex mini-patterns, e.g. localized conjunctions of contour,
texture, and/or color elements. In many cases, these neurons exhibit
considerable insensitivity to changes in viewpoint-related parameters, such
as stimulus position and scale, while remaining selective for their
preferred stimulus pattern. These empirical data seem to support the idea
that the brain uses a set of features that is (1) large in number, (2)
dominated by spatially localized measures, and (3) based on multiple visual
cues. The first aspect may relate to the several advantages of
high-dimensional feature-space representations, a topic recently discussed at
length elsewhere [Califano & Mohan]. The second aspect may relate to (i)
the need to cope with non-rigid object transformations, which preserve
local but not global structure, (ii) the need to cope with object textures,
defined in large part by local relative-orientation structure, and (iii)
the need to cope with occlusion and clutter, which are least disruptive to
an object's code when derived from features with localized support. The
third aspect may relate to (i) the need to maximize object discrimination
power by utilizing all available visual cues, (ii) the need to richly
represent objects of many different types, and (iii) the need to ``buffer''
the visual representation of objects or scenes against a variety of forms
of image degradation, to which different visual cues are by nature
differentially sensitive.
In this vein, a view-based recognition system called SEEMORE is described,
based on a set of 100 feature invariants that emphasize spatially localized
receptive-field-style computations, and which are collectively sensitive to
a range of visual cues (contour shape, color, and texture). SEEMORE's
architecture is essentially a ``histogramming'' scheme, similar to the
color histogramming approach of [Swain & Ballard], but including shape and
texture-related ``bins'' in addition to color. Experiments reveal good
recognition performance in a 3-D object recognition problem with 100
objects of many types (rigid, non-rigid, ``statistical'', views of complex
scenes, etc.), and entailing image transformations that include rotations
in depth and the image plane, scaling, non-rigid object deformations,
partial occlusion, limited clutter, and other types of image degradation.
An optimization scheme is developed to scale individual feature dimensions
in order to maximize the performance of SEEMORE's high-dimensional
nearest-neighbor classifier. Generalization behavior and classification
errors are illustrated, showing the emergence of several striking natural
object categories that are not explicitly encoded in the dimensions of the
feature space.
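As a rough sketch of this kind of architecture (the function names, feature extractors, and per-dimension scaling below are illustrative assumptions, not SEEMORE's actual code), recognition reduces to a scaled nearest-neighbor lookup over concatenated feature histograms:

    import numpy as np

    def feature_histogram(image, extractors):
        # Concatenate the bin counts of each localized feature extractor
        # (e.g. color, contour shape, texture) into one long feature vector.
        return np.concatenate([extract(image) for extract in extractors])

    def fit_scaling(train_vectors):
        # Stand-in for the optimization that weights each feature dimension
        # to maximize classifier performance; here each dimension is simply
        # normalized by its standard deviation over the training set.
        return 1.0 / (train_vectors.std(axis=0) + 1e-8)

    def classify(query, train_vectors, train_labels, scale):
        # Scaled nearest-neighbor classification in feature-histogram space.
        dists = np.linalg.norm((train_vectors - query) * scale, axis=1)
        return train_labels[np.argmin(dists)]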
It has been argued that it is possible to account for the principal
properties of cells in primary visual cortex in terms of their ability to
produce sparse responses to natural scenes. In a neural network that optimizes
sparseness and minimizes reconstruction error, units develop that are
localized, bandpass and oriented, similar to cortical simple cells. It is
argued that the responses of these cells may be as independent as possible
given a linear code. To account for higher levels of visual processing, it
is proposed that one must consider the more complex forms of redundancy
found in object relationships and natural scenes. In this talk, we
consider particular forms of such redundancy not captured by linear codes,
and discuss how particular types of non-linearity can transform this
redundancy.
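In generic notation (a sketch; the particular penalty S and trade-off parameter lambda are assumptions, not specifics of the talk), the sparseness-plus-reconstruction objective mentioned above can be written as

    E = sum_x [ I(x) - sum_i a_i phi_i(x) ]^2  +  lambda * sum_i S(a_i),

where the first term is the reconstruction error of the image I from basis functions phi_i with coefficients a_i, and the second term penalizes non-sparse coefficient vectors; minimizing E over both the coefficients and the basis functions yields localized, oriented, bandpass units.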
Objects may appear on the retina at many different positions, sizes, and
orientations, or under other geometric distortions. These variations
produce a tremendous amount of redundancy in images that must be dealt
with efficiently and effectively. Flexible templates provide one
method for doing so, because they represent the variations
independently from the object structure. How would a flexible
template scheme be implemented in the brain? In this talk, I will
present several alternatives that have been proposed for implementing
flexible templates in neural circuitry. I will discuss the advantages
and disadvantages from a computational perspective, and ways of testing
these theories neurophysiologically and psychophysically.
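As a generic formalization (the notation here is assumed, not taken from the talk), a flexible-template match can be written as

    min_theta sum_x [ I(x) - M(w_theta(x)) ]^2,

where M is the stored object template and w_theta is a parameterized geometric deformation (translation, scaling, rotation, and so on); the object's structure lives in M while the viewing variations live in theta, which is what allows the two to be represented independently.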
In recent years a number of researchers have reported scaling power
spectra in static natural images. That is, their spectra take the
form of a power-law. This result is surprisingly robust given the
variety in each team's choice of image calibration and subject
matter. We propose that the salient universal structure present in
natural images is that they are composed of statistically independent
occluding objects.
In such a world the correlation function is generated by two
underlying causes: the distribution of object-to-object transitions,
and the correlations present within objects. We show that
correlations present within objects have little spatial structure, and
thus the overall correlation function of natural images is dominated
by the probability of object transitions. If the transition
probability distribution is power-law in the separation distance, then
the correlation function (and thus the power spectrum) will also be
power-law. Further, this result is unaffected by image calibration
since object transitions are robust to this transformation.
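A one-line version of this argument (in assumed notation): if two pixels separated by distance d fall within the same object with probability q(d), object intensities are drawn independently across objects, and within-object correlations have little structure (as argued above), then the correlation function is approximately

    C(d) ~= sigma^2 * q(d),

where sigma^2 is the intensity variance; a power-law q(d) therefore directly yields a power-law correlation function and hence a power-law power spectrum.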
By creating images of occluding objects using a given object size
distribution we can create any form of power spectrum we wish. We
demonstrate that it is not simply the presence of edges within images
which gives rise to a scaling spectrum of the form 1/f^2. Rather, it
is the probability distribution of object transitions which plays the
important role. Finally, our results also extend to recent findings on
scaling in natural spatio-temporal image sequences (Dong & Atick,
Network, 1995).
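A toy 1-D simulation of this claim might look like the following (the size distribution, its exponent, and all other choices are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 2 ** 16
    image = np.empty(N)
    i = 0
    while i < N:
        # Occluding "objects": constant-intensity segments whose lengths
        # follow a heavy-tailed (power-law) size distribution.
        size = int(np.ceil(rng.pareto(1.5))) + 1
        image[i:i + size] = rng.uniform(-1.0, 1.0)
        i += size

    spectrum = np.abs(np.fft.rfft(image - image.mean())) ** 2
    freqs = np.fft.rfftfreq(N)
    # The log-log slope of the (noisy) periodogram approximates the spectral
    # exponent; changing the size distribution changes this exponent.
    slope = np.polyfit(np.log(freqs[1:]), np.log(spectrum[1:]), 1)[0]
    print("estimated spectral slope:", slope)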
Because local disparity information is often sparse and noisy, there
are two conflicting demands when estimating disparity in an image
region: the need to average spatially to obtain an accurate estimate, and
the need to avoid averaging over discontinuities. We have developed
a network model of disparity estimation based on disparity-selective
neurons, such as those found in the early stages of processing in
visual cortex. The model can accurately estimate multiple disparities
in a region, which may be caused by transparency or occlusion.
The model consists of several stages and computes its output using
only feedforward processing. One-dimensional binocular retinal input
is preprocessed with disparity energy filters at a range of spatial
frequencies and phases. The output of these disparity energy filters
forms the input to two separate pathways: the local disparity pathway,
and the selection pathway. The local disparity pathway computes an
estimate of the disparity in a local region of the image. Because
these local disparity measurements may be unreliable, a process is
needed to determine which signals to integrate. The selection pathway
fulfills this role by selectively gating those disparity signals that
reliably indicate the true disparity of the object. The output of
this stereo model is a distributed representation of disparity.
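A minimal sketch of the gating step (the array names and shapes below are assumptions, not the model's actual variables): each position contributes its local disparity distribution only to the extent that the selection pathway judges it reliable.

    import numpy as np

    def integrate_disparity(local_estimates, gates):
        # local_estimates: (n_positions, n_disparities) activity of the local
        #                  disparity pathway (one distribution per position).
        # gates:           (n_positions,) reliability signals from the
        #                  selection pathway, between 0 and 1.
        gated = local_estimates * gates[:, None]         # suppress unreliable sites
        return gated.sum(axis=0) / (gates.sum() + 1e-8)  # pooled distributed code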
This selective integration of reliable local estimates enables the
network to demonstrate stereo hyperacuity (i.e., sub-pixel disparity
estimation from pixel-based inputs) on normal and transparent
random-dot stereograms. Analysis of the model suggests that the
selection units respond to disparity contrast --- that is,
edges in depth. We predict that neurons in visual area MT will
demonstrate a similar selectivity to disparity contrast.
Cortical receptive fields (RFs) are influenced by the nature of the visual
environment. It has been argued that learning is responsible for
creating "optimal" feature detectors in the cortex, and that principal
components are the "optimal" projections. I will discuss the notion
of optimality and how it applies to the visual cortex: the number of
neurons in visual cortex is an order of magnitude larger than the
number of input lines from the LGN, and therefore different notions of
optimality should apply here.
The receptive fields extracted by a principal component rule are
sensitive to the second-order statistics of the visual environment. I
will show how to extract a representation for the spectrum of natural
images. This spectrum will be decomposed into a radially symmetric,
scale-invariant component and a non-radially-symmetric portion. I
will use this decomposition to derive the receptive fields extracted
by the principal component rule from such an environment.
Comparison with simulation results shows that the RFs are sensitive
both to the radially symmetric and the non-symmetric portions of the
spectrum. They develop orientation selectivity, but differ from those
found biologically in several respects.
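Since a principal component rule depends only on second-order statistics, its receptive fields can be computed directly from the patch covariance; a minimal sketch (array names are assumed):

    import numpy as np

    def principal_component_rfs(patches, n_components=8):
        # patches: (n_patches, n_pixels) array of image patches drawn from
        # the visual environment.
        patches = patches - patches.mean(axis=0)
        cov = patches.T @ patches / len(patches)        # second-order statistics
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1]               # sort by variance
        return eigvecs[:, order[:n_components]].T       # rows = receptive fields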
Images formed by a human face change with viewpoint. A new technique
is described for synthesizing images of faces from new viewpoints,
when only a single 2D image is available. The technique draws on a
single generic 3D model of a human head and on prior knowledge of
faces based on example images of other faces seen in different poses.
The example images are used to ``learn'' a pose-invariant shape and
texture description of a new face. The representations are based
on the idea of linear object classes. These are 3D objects whose
3D shape can be represented as a linear combination of a sufficiently
small number of prototypical objects. The separation of shape and
texture information in images of human faces was done using point
correspondence between the different facial images, which was
established automatically through optical flow algorithms.
Linear object classes have the property that new orthographic views
of any object of the class under uniform affine 3D transformations,
and in particular rigid transformations in 3D, can be generated
exactly if the corresponding transformed views are known for the set
of prototypes. Thus if the training set consists of frontal and rotated
views of a set of prototype faces, any rotated view of a new face can
be generated from a single frontal view -- provided that the linear
class assumption holds.
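A minimal sketch of the linear-object-class step (array names and shapes are assumptions): the coefficients that express a new frontal view as a linear combination of prototype frontal views are reused to combine the rotated prototype views.

    import numpy as np

    def synthesize_rotated_view(new_frontal, proto_frontal, proto_rotated):
        # new_frontal:   (n_features,) vectorized frontal view of the new face.
        # proto_frontal: (n_features, n_prototypes) frontal prototype views.
        # proto_rotated: (n_features, n_prototypes) the same prototypes rotated.
        alpha, *_ = np.linalg.lstsq(proto_frontal, new_frontal, rcond=None)
        return proto_rotated @ alpha   # predicted rotated view of the new face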
This linear class approach works well for features shared by all
faces (e.g. eyebrows, nose, mouth, or the ears), but it has limited
representational possibilities for features particular to an
individual face (e.g. a mole on the cheek).
To overcome this problem, a single 3D model of a human head is added
to the linear class approach. Face textures mapped onto the 3D model
can be transformed into a new pose. The final ``rotated'' image for a
given face can then be generated by applying, to this new image of the
3D model, the shape transformation given by the linear object
class approach.
V1 neurons exhibit many layers of responses. Recent
neurophysiological evidence suggests that they are sensitive to the
inside-outside relationship, the boundary, the medial axis, and the
spatial extent of globally defined shapes. There is also evidence
for a neural correlate of amodal completion of occluded surfaces. This
evidence, together with Kovacs and Julesz's psychophysical evidence,
suggests that the medial axis might play a key role in representing shape
in the visual system, and that such a representation might exist as early
as V1. Furthermore, this diversity in neural responses, even in the
same cells, seems incompatible with the one-neuron-one-feature
idea. Computationally, we know "shape" must somehow be computed, but
no successful vision system has done this in a purely feedforward manner. A
radical reinterpretation of the classical paradigm is called for: one
such reinterpretation proposes a) multiplexing of data in every spike train
and b) a role for V1 not merely as an initial stage in visual computation,
but as a high-resolution buffer that remains active throughout the whole
visual computation.
Barlow's theory regarding suspicious coincidence detectors calls for
a neuronal mechanism for prior probability
estimation. Such a mechanism will be described, along with some of its
implications for sensory representation.
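One standard way of making this concrete (a sketch, not necessarily the formulation used in the talk): a conjunction of events A and B is a suspicious coincidence when it occurs more often than expected by chance, i.e. when

    log [ P(A, B) / ( P(A) * P(B) ) ]  >  0,

and evaluating this quantity requires the very prior probabilities P(A) and P(B) that the proposed neuronal mechanism is meant to estimate.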