Functionally, neuromorphic vision chips do what a video camera does when combined with a computer running some dedicated vision program, perhaps an algorithm for detecting edges. Computationally, though, the architectures of the two systems are quite different. Neuromorphic systems, like nervous systems, use massively parallel, analog, nonclocked, collective processing, rather than the numerical and symbolic processing basic to artificial intelligence and conventional machine vision. These neuromorphic properties are well suited to implementing the types of mathematical operations that occur in what is called early vision. (Early vision is the set of processes that make use of two-dimensional intensity arrays to recover distance, texture, and other physical properties associated with the surfaces of the three-dimensional objects visible around the viewer.)
The first reflex of today's system engineers, surrounded as they are by digital computers, is to sample and digitize the incoming video signal as soon as possible. Yet since the brightness of an image is continuous in time and amplitude, why import unnecessary artifacts? Why not instead exploit the physics of conductances, capacitances, and nonlinearities inherent in transistors to implement operations that are expensive in the digital domain? When such analog circuits are integrated with 2-D arrays of photoreceptors, the resulting silicon retinas capture the image with a virtuosity no digital computer can match unless capable of hundreds of millions of floating-point operations per second. And the package can be as small as 1 cm².
Before these devices can be built, several key components must be designed. Adaptive photoreceptors are needed to sense image intensities over eight orders of magnitude--the range of natural lighting from moonlight to high noon. Linear and nonlinear resistive grids must filter the image in order to reduce the ever-present noise and to enhance and detect certain features, such as edges. Smart communication protocols are necessary to send streams of visual information between chips. Velocity sensors have to reliably detect motion in the scene. Finally, the chips must be able to adapt their outputs to wide variations in parameters using on-chip learning.
Not every IC dedicated to visual algorithms is a neuromorphic vision chip. The latter processes the image on the same physical plane as it acquires the image (focal plane processing). On the other hand, dedicated signal-processing circuits take the digitized output of a camera and apply a particular visual algorithm to every picture element (pixel) in the image, one after the other.
The dedicated circuits are usually based either on standard digital signal-processing (DSP) chips or on digital systems specially designed for such tasks as block matching in video or filtering images using convolution. Block matching is popular for estimating motion in images. In convolution, the most common image-processing technique, passing a "filter function" over each point in the original image transforms it into the filtered image. The new value of a pixel is the sum of the products of this filter function with the image intensity at each pixel in its neighborhood, suitably normalized.
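The convolution just described can be sketched in a few lines of Python. This is only an illustration of the sum-of-products rule, not the way any DSP chip implements it. The function name `convolve2d`, the clamping of pixels at the image border, and the choice to leave zero-sum kernels unnormalized are all assumptions made for the example.

```python
def convolve2d(image, kernel):
    """Convolve a 2-D list `image` with an odd-sized 2-D list `kernel`.

    Each output pixel is the sum of products of the (flipped) kernel with the
    neighboring intensities, divided by the sum of the kernel weights.
    Border handling is an assumption: out-of-range pixels are clamped to the
    nearest edge pixel.
    """
    rows, cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    r_off, c_off = k_rows // 2, k_cols // 2

    # Normalize by the total kernel weight; zero-sum kernels (edge detectors)
    # are left unnormalized to avoid dividing by zero.
    weight = sum(sum(row) for row in kernel)
    norm = weight if weight != 0 else 1.0

    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    rr = min(max(r + i - r_off, 0), rows - 1)
                    cc = min(max(c + j - c_off, 0), cols - 1)
                    # Flipping the kernel indices makes this a true convolution.
                    acc += kernel[k_rows - 1 - i][k_cols - 1 - j] * image[rr][cc]
            out[r][c] = acc / norm
    return out


# A smoothing (box) filter and an edge-enhancing (discrete Laplacian) filter.
BOX = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

# A vertical step edge: dark on the left, bright on the right.
step = [[0, 0, 10, 10] for _ in range(4)]
edges = convolve2d(step, LAPLACIAN)
print(edges[0])  # responds only on either side of the step
```

Every output pixel requires one multiply-accumulate per kernel element, which is why a serial processor must step through the image pixel by pixel, whereas a focal-plane array evaluates all neighborhoods at once.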
In these applications, a mathematical operation that needs to be repeated over and over again is cast in special-purpose digital hardware; otherwise, it would limit system performance too much. One example is the correlation chip that Woodward Yang at Harvard University, Cambridge, Mass., developed for recognizing faces. Here, the most demanding operation is to match one face against a large database. So small chunks of the image are fed to Yang's digital chip, which matches them against a template. The chip carries out about 100 000 correlations each second on a 64-by-64-pixel image and outputs the best fit. But although the correlator chip by itself only requires 0.1 W of power, the entire system, including camera and microprocessor, is still large and power hungry.
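Matching a chunk of image against a template can be sketched as follows. This is not a description of Yang's chip, whose internal arithmetic the text does not specify; it assumes a zero-mean normalized cross-correlation as the match score, and the names `normalized_correlation` and `best_match` are hypothetical.

```python
import math


def normalized_correlation(patch, template):
    """Zero-mean normalized cross-correlation of two equal-sized 2-D lists.

    Returns a score in [-1, 1]; flat patches (no variance) score 0.
    """
    p = [v for row in patch for v in row]
    t = [v for row in template for v in row]
    p_mean = sum(p) / len(p)
    t_mean = sum(t) / len(t)
    num = sum((a - p_mean) * (b - t_mean) for a, b in zip(p, t))
    den = math.sqrt(
        sum((a - p_mean) ** 2 for a in p) * sum((b - t_mean) ** 2 for b in t)
    )
    return num / den if den > 0 else 0.0


def best_match(image, template):
    """Slide `template` over `image`; return (score, row, col) of the best fit."""
    t_rows, t_cols = len(template), len(template[0])
    best = (-2.0, -1, -1)  # any real score beats -2
    for r in range(len(image) - t_rows + 1):
        for c in range(len(image[0]) - t_cols + 1):
            patch = [row[c:c + t_cols] for row in image[r:r + t_rows]]
            score = normalized_correlation(patch, template)
            if score > best[0]:
                best = (score, r, c)
    return best


# Hide a 2-by-2 pattern in an otherwise dark 5-by-5 image and find it again.
template = [[1, 9], [9, 1]]
image = [[0] * 5 for _ in range(5)]
image[2][1:3] = template[0]
image[3][1:3] = template[1]
score, row, col = best_match(image, template)
print(score, row, col)
```

The nested loops make the cost plain: one correlation per candidate position, each touching every template pixel, repeated for every entry in the database. That repetition is the operation worth casting into special-purpose hardware.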
Today, there are two approaches to image acquisition. The first, sensors based on charge-coupled devices (CCDs), dominates the consumer market. The CCDs sense light intensity by integrating the photocharge in time on a grid of some 800 by 600 pixels. The continuously valued output at each pixel, sampled in time, constitutes the output of the camera. It is typically sent to a "frame-grabber" board, where its amplitude is digitized (usually to 8-bit, or 256-level, resolution) for further analysis.
The amplitude of light in the natural world, however, swings over eight orders of magnitude from moonlight to a sun-filled day, while the dynamic range of CCDs is unfortunately much less. When the dynamic range needed to process the image exceeds the CCD's capability, the image is clipped; and blooming can occur when the charge on a pixel exceeds its holding capacity and the excess spills over into neighboring pixels. A clipped region in the image will be uniformly white, with no details apparent. Blooming manifests itself by a white line in the image, created by the excess charge that flows from the bright pixel onto and along a rail in the imager. The usual remedy for a limited dynamic range is to include automatic gain control. In this case, a mechanical iris will serve, or else the charge integration time of the imager may be adjusted to the brightness of the scene.
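The trade-off between clipping and gain control can be illustrated with a toy model. The sketch below assumes an idealized linear pixel whose charge is intensity times integration time, limited by a fixed holding capacity and quantized to 256 levels. The names `ccd_pixel`, `auto_exposure`, and `full_well`, and all the numbers, are hypothetical and not taken from any real imager; blooming is not modeled.

```python
def ccd_pixel(intensity, integration_time, full_well=1.0, levels=256):
    """Toy linear pixel: integrate, clip at the holding capacity, quantize."""
    charge = min(intensity * integration_time, full_well)  # clipping
    return min(int(charge / full_well * levels), levels - 1)


def auto_exposure(scene, full_well=1.0):
    """Choose the integration time at which the brightest pixel just fills the well."""
    return full_well / max(scene)


# Four pixels spanning six orders of magnitude in intensity (arbitrary units).
scene = [1e-4, 1e-2, 1.0, 100.0]

# Fixed exposure: the two brightest pixels are indistinguishable (clipped).
fixed = [ccd_pixel(i, 1.0) for i in scene]

# Gain control via integration time: highlights are recovered,
# but the two darkest pixels now fall below one quantization step.
t = auto_exposure(scene)
adjusted = [ccd_pixel(i, t) for i in scene]

print(fixed)
print(adjusted)
```

In this toy model a single global gain setting only moves the window of usable intensities; it does not widen it. Detail is lost at one end of the range or the other.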
CCD cameras do not compute. Indeed, they should not, since their output, a series of bits that can easily be transmitted to a TV monitor, should look as much as possible like the input when displayed on the screen. This also implies that the image requires high resolution all over, since it is not known ahead of time where the viewer will be looking.
Biological creatures view things differently--the photoreceptors in their eyes sense the intensity continuously in time and adapt to the local image intensity in both space and time, thereby maximizing the receptors' dynamic range. Photoreceptors with similar properties can be built using CMOS devices. A simple photodiode can logarithmically compress the photocurrent into a voltage signal, but its response is very slow at lower intensities. Further, device mismatches due to fabrication variables will skew the response of adjacent receptors to identical input. Indeed, variation in voltage due to device mismatch can be as large as the signal itself. All these problems can be solved by adaptive photoreceptors.
Some of the best adaptive photoreceptors have been designed by Tobias Delbruck at Caltech. The response of his five-transistor photoreceptor is logarithmic, so that the differential response to a constant contrast is unaffected by changes in the absolute light intensity. Its output adapts to slow (seconds long) changes in image intensity over more than six orders of magnitude, while preserving a high gain for transient changes in the image. And, in stark contrast to CCDs, no expensive clocks are needed, reducing power consumption and the need for support circuitry.
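Why a logarithmic response makes the output depend on contrast rather than absolute intensity can be checked numerically. The sketch below models only the idealized logarithm, not Delbruck's circuit or its adaptation dynamics; `log_response`, `contrast_step`, and the `gain` parameter are hypothetical names for illustration.

```python
import math


def log_response(intensity, gain=1.0):
    """Idealized logarithmic photoreceptor output (arbitrary units)."""
    return gain * math.log(intensity)


def contrast_step(background, contrast, gain=1.0):
    """Output change when a patch `contrast` times brighter than the background appears."""
    return log_response(background * contrast, gain) - log_response(background, gain)


# The same 2:1 contrast seen under dim, moderate, and bright illumination.
for background in (1e-3, 1.0, 1e3):
    print(background, contrast_step(background, 2.0))
```

Because log(cI) - log(I) = log(c), the step in output is the same at every background level. A linear sensor would instead produce a step proportional to the background intensity.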
There is a price, though. In a 1.2-µm CMOS process, a single adaptive photoreceptor uses about 52 µm by 52 µm of silicon real estate, compared to 7 µm by 7 µm for a state-of-the-art CCD pixel, or roughly 55 times the area [see "Vision chips compared"].
Image resolution is another important difference between artificial and natural vision systems. While we primates sample the world in daylight using one to two million photoreceptors, other animals need many fewer. Highly evolved insects that use vision to find food and mates and avoid predators and obstacles have in effect 10 000 or fewer pixels with which to sample their environment. Although their visual performance in real time is beyond current machine-vision systems, even the cheapest hand-held video camera has many, many more pixels. The moral here is that while we humans are used to seeing high-resolution images, many visual tasks need far fewer pixels.