We have developed a field-worthy, high-definition, real-time depth-mapping television camera called the HDTV Axi-Vision Camera. The camera can simultaneously capture both an ordinary HDTV color image and a depth image of objects on more than 1280×720 pixels at a frame rate of 29.97 Hz, or on 853×480 pixels at a frame rate of 59.94 Hz. The number of detectable pixels per unit time was increased by about 5 times that of the prototype camera by improving the sensitivity and resolution of the depth-mapping camera. Short video clips demonstrate how depth information from the camera can be used to create a virtual image in actual television program production.
© 2004 Optical Society of America
Three-dimensional (3D) camera systems that can detect 3D information about objects are finding their way into a wide range of applications, including television (TV) program production, 3D modeling, and robotic vision, as well as graphics animation. Among the several methods used for 3D imaging, triangulation is the most common method of detecting depth information. The systems based on this approach use either stereoscopic images taken by several cameras or structured light projections. These triangulation methods, however, require complex signal processing and cannot easily detect pixel-by-pixel depth information at video frame rates with high enough resolution for TV.
Recently, a 3D camera using a CMOS sensor and a sectioned laser beam has been developed. Its operation is based on the principle of triangulation, and it can detect depth images at 640×480 resolution with a maximum frame rate of 65.1 Hz. This type of system, however, generally requires a high-power laser source, introducing a safety concern when the object to be detected is a human being. Another shortcoming of this method is that the positions of the light sources and the sensor are separated, and thus, the distance to objects in the shadows of the sectioned light cannot be detected.
The time-of-flight (TOF) method has the advantage of quick, straightforward information processing. The conventional system using this method, however, requires two-dimensional laser beam scanning to make a depth image of an object, so it is only suitable for stationary objects. A TOF simultaneous 3D imaging system that avoids laser scanning by using CCD and CMOS technologies has been proposed, but in this system, it is difficult to increase the total number of pixels to the level required for TV resolution.
We previously proposed the Axi-Vision Camera [6, 7], which served as the prototype for the HDTV Axi-Vision Camera. The camera achieves simultaneous depth sensing by combining ascending and descending intensity-modulated light with an ultra-fast shutter camera. The depth information can be calculated in real time, because neither scanning illumination light nor complex signal processing is required. The prototype camera based on this method can capture depth images on 768×493 pixels at a frame rate of 15 Hz. The performance of the prototype, however, is not sufficient for high-definition television (HDTV) program production.
To use the proposed method in an actual HDTV post-production system, we have developed an HDTV Axi-Vision Camera that can output ordinary HDTV color images and high-definition depth images of objects. To keep a high signal-to-noise ratio (SNR) of the depth detection, the camera was equipped with an image intensifier with increased sensitivity and resolution, optical devices with optimum design, and high-power LED array illuminators.
Here we report how we developed the prototype Axi-Vision Camera into a studio-worthy HDTV Axi-Vision Camera. A TV program was produced with the HDTV Axi-Vision Camera and broadcast live from an NHK broadcasting station in Japan. A short video clip from the TV program (Fig. 1) is attached for viewing. This video clip demonstrates the quality of the camera’s performance.
2.1 Principle of depth detection
To better understand the descriptions in the following sections, it is worthwhile to briefly summarize the principle of operation of the Axi-Vision Camera. The principle of depth mapping is based on the fact that the intensity of an ultra-fast snapshot of an object becomes dependent on the distance from the camera to the object if the intensity of the illuminating light is varied at a speed comparable to the speed of light. Figure 2 illustrates this principle.
The upper section shows the arrangement of the components, with the infrared intensity-modulated LED illuminating sources on the left and a CCD camera with an ultra-fast shutter utilizing an image intensifier on the right. The lower section of the figure shows a time diagram of the illuminating and reflected light during the first three video frame cycles.
The solid red line indicates the triangularly intensity-modulated illuminating LED light at the source. The dotted red line indicates the light intensity at the camera after reflection from one particular point P on the object. The triangle formed by the dotted line is delayed by Δt=2d/v from that formed by the solid line because of the time required for the light to make the round trip, where d is the distance from the camera to point P and v is the velocity of light. The pair of vertical black dotted lines indicates the duration when the ultra-fast shutter in front of the CCD camera is open.
If the shutter opens during the ascending intensity modulation, as shown in the first video frame, the input light to the CCD camera during the exposure time (represented by the hatched section in Fig. 2) decreases with the distance d to the object, because the dotted red line shifts to the right. During this cycle of illumination, smaller input light to the CCD camera means longer distance to the object.
On the other hand, if the shutter opens during the descending intensity modulation, as shown in the second video frame, the input light to the CCD camera during the exposure time increases with the distance d, again because the dotted red line shifts to the right. During this cycle of illumination, larger input light to the CCD camera means longer distance.
The third video frame repeats the situation of the first video frame. Even though the intensity of the input light during either the ascending or descending cycle provides distance information, by combining the two, the effect of the object’s reflectivity can be removed, thereby isolating the distance information.
Next, we represent the above description through simple mathematical expressions. Let s(t) represent the triangularly modulated illuminating light power at the source. The intensity I+(ts,d) of the light reflected from the object at distance d during the ascending illumination cycle is
Similarly, the intensity I-(ts,d) during the descending illumination cycle is
where σ is the back-scattering cross section of the object, T is the period of the intensity modulation, and ts is the instant at which the shutter is opened. From Eqs. (1) and (2), the distance d to the object is obtained as
When the shutter opening instant ts is chosen as T/4, Eq. (3) becomes
where λ is the wavelength of the triangular wave.
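As a hedged sketch of the relations described above, assume an ideal triangular ramp that rises linearly from 0 to a peak power S_0 over a half period T/2, with time measured from the start of each ramp, and lump the back-scattering cross section σ and the propagation loss into a single factor g(d). The symbols S_0 and g(d) and the exact prefactors are assumptions made for illustration, not the original expressions. Under these assumptions the reflectivity-dependent factor cancels in the ratio R:

```latex
\[
\begin{aligned}
I_{+}(t_s,d) &= g(d)\,\frac{2S_0}{T}\left(t_s-\frac{2d}{v}\right), \\
I_{-}(t_s,d) &= g(d)\,\frac{2S_0}{T}\left(\frac{T}{2}-t_s+\frac{2d}{v}\right), \\
R &= \frac{I_{+}}{I_{-}}, \qquad
d = \frac{v}{2}\left[t_s-\frac{T}{2}\,\frac{R}{1+R}\right], \\
d\,\Big|_{t_s=T/4} &= \frac{\lambda}{8}\,\frac{1-R}{1+R}, \qquad \lambda = vT .
\end{aligned}
\]
```

In this idealized form, R = 1 corresponds to zero round-trip delay relative to the shutter reference, and R falls below 1 as the object moves farther away, in line with the qualitative description of the ascending and descending cycles.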
As shown in the lower portion of Fig. 2, Image A with the ascending reflected light is consecutively captured by about 10⁶ shuttering actions during video frame 1 (lasting 1/30 or 1/60 second) and stored. Then, the ramping direction of the intensity-modulated light is reversed by shifting the phase of the modulation signal. Image B with the descending illumination cycle is then captured consecutively by about 10⁶ shuttering actions and stored during video frame 2. The ratio defined by Eq. (4) is calculated from Images A and B, and then the Depth image 1 is calculated from Eq. (3) by the signal processor. In this way, the camera alternately captures Image A or B during each video frame, and the Depth image n is updated each time.
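The per-pixel calculation from Images A and B can be sketched as follows. This is a minimal illustration, assuming ideal linear ramps so that the distance follows d = (v/2)[ts − (T/2)·R/(1+R)] with R = I+/I-; the function names and the simulated intensities are hypothetical and are not taken from the camera's signal processor.

```python
C = 299_792_458.0  # speed of light in m/s


def depth_from_pair(i_up, i_down, mod_freq_hz, t_s=None):
    """Distance from one pixel of Image A (ascending) and Image B (descending).

    Assumes ideal triangular modulation; R/(1+R) is written as
    i_up / (i_up + i_down) to avoid dividing by a zero descending value.
    """
    period = 1.0 / mod_freq_hz
    if t_s is None:
        t_s = period / 4.0  # shutter opens at the middle of the ramp
    total = i_up + i_down
    if total <= 0:
        return None  # no reflected light detected at this pixel
    return 0.5 * C * (t_s - 0.5 * period * i_up / total)


def depth_image(image_a, image_b, mod_freq_hz):
    """Apply depth_from_pair to every pixel of two equally sized images."""
    return [
        [depth_from_pair(a, b, mod_freq_hz) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]


def simulate_pair(distance_m, mod_freq_hz, reflectivity=1.0):
    """Hypothetical ideal intensities for an object at distance_m (test helper)."""
    period = 1.0 / mod_freq_hz
    t_s = period / 4.0
    delay = 2.0 * distance_m / C  # round-trip delay
    i_up = reflectivity * (t_s - delay)
    i_down = reflectivity * (period / 2.0 - t_s + delay)
    return i_up, i_down


# Two objects at the same distance but with different reflectivity
bright = simulate_pair(1.0, 15e6, reflectivity=1.0)
dark = simulate_pair(1.0, 15e6, reflectivity=0.2)
print(depth_from_pair(*bright, 15e6), depth_from_pair(*dark, 15e6))
```

Both calls recover the same distance because the reflectivity scales the ascending and descending intensities equally and therefore drops out of the ratio.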
2.2 Signal-to-noise ratio of the image intensifier of the CCD Camera
With an increase in the number of depth-image pixels, the signal current detected by each pixel decreases. As a result, the SNR of the image decreases, and hence, the depth resolution decreases. In this section, we clarify the relationship between the camera parameters and the SNR of the image intensifier of the CCD camera, which is the main noise source in this system.
Let the intensity of the reflected light incident on the image intensifier be E [W/m²]. The average number of electrons released from the photocathode over one pixel area per frame is given by
where η(0≤η≤1) is the quantum efficiency of the photocathode of the image intensifier, Ap is the area of one pixel of the CCD, τ is the imaging time, ε is the photon energy, and m is the magnification of the relay lens between the phosphor and the CCD camera.
The standard deviation σpe of the total number of electrons released from the photocathode is the square root of this average number, as expected for shot noise. The SNR of the released electrons is given by
The SNR is also reduced by additional noise associated with the amplification process, such as internal electron losses at the microchannel plate and the phosphor plate. These internal aspects of the image intensifier performance can be expressed by the noise factor Nf. The SNR of the image intensifier at the output phosphor plate of the device is expressed as
On the other hand, the intensity of the reflected light from the center of the measurement range is given by
where ρ is the reflectivity of the object, TL is the transmittance of the optics, I is the illumination power, FN is the f-number of the camera lens, and S is the illuminated area. Using Eq. (9), Eq. (8) becomes
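The chain of reasoning in this section can be sketched as follows, assuming Poisson (shot-noise) statistics, a relay lens of magnification m so that one CCD pixel corresponds to an area A_p/m^2 on the photocathode, and an irradiance E that scales as ρ T_L I / (F_N^2 S). The symbol n for the mean electron count and the omitted numerical constants are assumptions; the noise factor N_f is taken to rescale the result by a constant that does not depend on the camera parameters listed here.

```latex
\[
\begin{aligned}
n &= \frac{\eta\,E\,A_p\,\tau}{\varepsilon\,m^{2}}, \qquad
\sigma_{pe} = \sqrt{n}, \qquad
\mathrm{SNR} = \frac{n}{\sigma_{pe}} = \sqrt{n}, \\
E &\propto \frac{\rho\,T_L\,I}{F_N^{2}\,S}
\;\;\Longrightarrow\;\;
\mathrm{SNR} \propto \sqrt{\frac{\eta\,\rho\,T_L\,I\,A_p\,\tau}{\varepsilon\,m^{2}\,F_N^{2}\,S}} .
\end{aligned}
\]
```

Because the illuminated area S grows as the square of the object distance for a fixed beam angle, this form gives an SNR that is inversely proportional to the distance.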
The one-pixel area Ap of the HDTV camera is less than one-fifth that of the SDTV camera, and the SNR is reduced accordingly. Moreover, when updating the depth image at 30 Hz, the imaging time τ is one-half that of the SDTV camera.
To ensure that the HDTV Axi-Vision Camera maintains the same noise level as the SDTV prototype, we need to compensate for the decreases in the parameters Ap and τ by enhancing other camera parameters. To maintain a good SNR in the HDTV depth image, we made the following improvements:
∗ Selected a photocathode material with high quantum efficiency η.
∗ Optimized the design of the optical devices to enhance the transmittance TL.
∗ Increased the illuminating power I of the LED array.
3. HDTV Axi-Vision Camera system
3.1 Basic configuration
Figures 3 and 4 show a block diagram and a close-up photograph, respectively, of the HDTV Axi-Vision Camera. The camera specifications are tabulated in Table 1. Near-infrared LED arrays are used for the intensity-modulated light source, because they provide a fast, direct modulation capability. Near-infrared light at a wavelength of 850 nm also lies outside the visible range and thus does not interfere with other visible illumination.
The visible light reflected from the object passes through the camera lens, the dichroic prism, and a relay lens. The resulting color images are captured by an ordinary HDTV camera. The near-infrared light reflected from the object is separated by the dichroic prism and directed to the image intensifier system. This near-infrared image of the object is focused onto the photocathode of the image intensifier, which converts the input optical image into an electron image. After electron multiplication by a factor of 10³ to 10⁴ in the microchannel plate, this electron image strikes the surface of the phosphor plate. The multiplied electric charge thus creates an optical image on the phosphor plate.
The shutter opening action of the image intensifier is repeated with bias voltage pulses of the same period as the illuminating light modulation. The shutter is opened about 10⁶ times during one video frame, thus providing a satisfactory SNR. The optical image from the phosphor plate is focused by a relay lens onto a high-resolution progressive CCD camera.
The function generator provides both the ascending and descending signals to modulate the output of the illuminating LED array and the trigger pulses to open the image intensifier shutter. The signal processor sorts the images with ascending and descending illumination into frame memories. The intensities from these two types of images are used in calculating Eq. (3) to determine the depth of the object. The acquisition time for a depth image of 1280×720 pixels is 1/30 second, while that for 853×480 pixels is 1/60 second. The depth image is finally converted to an HDTV signal and output as an HD-SDI (Serial Digital Interface) signal.
3.2 Raising sensitivity and the resolution of the image intensifier
We developed a new, highly sensitive image intensifier specifically for the HDTV Axi-Vision Camera. As shown in Fig. 5, the photocathode of the improved image intensifier is made of a GaAs target with a quantum efficiency of 11.7% at a wavelength of 850 nm, which is eight times higher than that of the multi-alkali target used in the prototype. Moreover, the quantum efficiency of the GaAs photocathode could be further enhanced to 15%–20% by optimizing the photocathode layer.
The new image intensifier has a double-layered microchannel plate structure to give sufficient amplification for imaging by the high-resolution CCD camera and to reduce the damage to the photocathode by the ion-feedback effect. Furthermore, to enable capturing of high-definition images, the spatial resolution of the image intensifier was increased by introducing a microchannel plate with a small channel diameter (6 µm), as well as by developing a proximity structure.
3.3 Optimum design of the optical devices
In the prototype, a large dichroic mirror (separating visible and near-infrared light) was positioned in front of the camera lens. As a result, the LED arrays had to be installed behind the dichroic mirror, and the camera system was bulky. With this configuration, the LED light power was not used economically. Two separate camera lenses were required, with one for the color camera and one for the depth-mapping camera.
To address these shortcomings, some design changes were made for the HDTV version of the camera. As noted above, a small dichroic prism was placed between the camera lens and the CCD camera. To prevent leakage of visible light to the depth-mapping camera, an optical filter (with an optical density of more than 5 at visible wavelengths) was fitted in front of the image intensifier. To ensure depth mapping with a high SNR, the dichroic prism and the optical filter were designed to maintain about 90% transmittance at a wavelength of 850 nm (Fig. 6). An antireflection coating is, moreover, deposited on the camera lens; the transmittance of the lens is 92% at a wavelength of 850 nm. The total transmittance of the optics is more than twice that of the prototype camera. Furthermore, the camera system is compact and can accommodate a set of zoom lenses (focal length: 7.8 mm to 133 mm, zoom ratio: ×17).
3.4 High-power LED array illuminator
The HDTV camera replaces the bulky, mirror-type dichroic reflector of the prototype with the dichroic prism, thereby creating space for four clusters of LED array units arranged around the camera lens, as shown in Fig. 7(a). The total power of the LED array is 1 W, which is twice that of the prototype. Fig. 7(b) shows the spatial distribution of the optical power of the LED array illuminators. It was measured with an optical power meter in the object plane (2 m wide, 2 m high) at a 3-m distance from the camera. The power at the center was 38.9 µW/cm². The illumination light achieved high power and uniformity, as shown in Fig. 7(b). The angle of the beam irradiation is 40 degrees in order to cover the angle of view of an ordinary camera lens. Because the four LED arrays are arranged close to the camera lens, both the divergence and the shadows of the illumination are reduced.
In summary, the improvements in the quantum efficiency η of the image intensifier, the transmittance TL of the optics, and the power I of the LED illumination resulted in a fivefold improvement in the SNR (Eq. (10)) of the HDTV Axi-Vision Camera over that of the prototype. Even though the area Ap of each CCD pixel is reduced by a factor of five in the high-resolution camera, and the imaging time τ is reduced to one-half that of the prototype camera, the new camera system is sufficiently sensitive to capture depth images without sacrificing SNR.
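The arithmetic behind this summary can be checked with a short sketch. It assumes the shot-noise-limited scaling in which the SNR is proportional to the square root of the product η·TL·I·Ap·τ, and it uses the approximate factors quoted in the text (η ×8, TL ×2, I ×2, Ap ×1/5, τ ×1/2); the function name and the rounding of "more than twice" and "less than one-fifth" to exact factors are assumptions.

```python
import math


def relative_snr(eta=1.0, transmittance=1.0, led_power=1.0,
                 pixel_area=1.0, imaging_time=1.0):
    """SNR relative to the prototype, assuming SNR ∝ sqrt(η·TL·I·Ap·τ).

    Every argument is a ratio (new value / prototype value).
    """
    return math.sqrt(eta * transmittance * led_power * pixel_area * imaging_time)


# Gain from the three improvements alone
gain = relative_snr(eta=8.0, transmittance=2.0, led_power=2.0)

# Loss from the smaller pixel and the shorter imaging time alone
loss = relative_snr(pixel_area=1.0 / 5.0, imaging_time=1.0 / 2.0)

# Net change when everything is combined
net = relative_snr(eta=8.0, transmittance=2.0, led_power=2.0,
                   pixel_area=1.0 / 5.0, imaging_time=1.0 / 2.0)

print(f"gain = {gain:.2f}, loss = {loss:.2f}, net = {net:.2f}")
```

Under these assumed factors the gain is a little over five, which more than offsets the loss from the smaller pixel area and the shorter imaging time.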
To test the range of the system, we used as the object a sheet of white paper with a relative reflectivity of 0.85, normalized by the reflectivity of a BaSO4 standard white plate. The distance from the camera was varied from 4 to 8 m at intervals of 0.2 m, and the output video signal level of the depth image was measured with a video signal analyzer. The frequency of the intensity modulation was 15 MHz, and the midpoint of the measurement range was set at a distance of 6 m from the camera. The amount of detected light decreases as the shutter time is shortened, so the shutter time has to be chosen as a compromise between the amount of light incident on the camera and the resolution of the image. The optimized time was 5 ns, which is about 1/10 of the modulation period of the illumination.
The output image signal changed with the distance from the camera to the object as shown in Fig. 8. Although the measured curve is relatively straight around 6 m, it starts to curve as it approaches the end of the measured range. This curvature is likely caused by the ramp of the LED intensity modulation not being perfectly triangular near the edge of the triangle. The curved portion of the line could be compensated by using a nonlinear amplifier.
4.2 Depth resolution
The depth resolution of the camera was evaluated by measuring the noise level of the output video signal. To obtain the maximum depth resolution, the illumination light was modulated at 45 MHz, and the gate width of the image intensifier shutter was 2 ns, which is about 1/10 of the modulation period of illumination. The root-mean-square value σs of the noise voltage of the output depth video signal was measured, and the distance corresponding to 3σs defined the depth resolution. The distance from the camera to the object used in section 4.1 was varied from 1 to 10 m, and during each measurement, the camera control settings were adjusted for optimal performance.
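The 3σs definition can be illustrated with a small sketch. It assumes that the output signal is locally linear in distance, so that a noise voltage converts to a distance through the slope of the signal-versus-distance curve; the function name, the two-point slope estimate, and all numerical values are hypothetical and are not measurements from the camera.

```python
import statistics


def depth_resolution_m(noise_samples_mv, signal_near_mv, signal_far_mv,
                       d_near_m, d_far_m):
    """Distance equivalent of 3·σs for a locally linear depth signal.

    σs is taken as the RMS deviation of the noise samples about their mean.
    """
    sigma_s = statistics.pstdev(noise_samples_mv)
    slope_mv_per_m = (signal_far_mv - signal_near_mv) / (d_far_m - d_near_m)
    return 3.0 * sigma_s / abs(slope_mv_per_m)


# Hypothetical readings: 1 mV RMS noise on a signal that changes by
# 100 mV between 1.9 m and 2.1 m
samples = [99.0, 101.0, 99.0, 101.0]
resolution = depth_resolution_m(samples, 100.0, 200.0, 1.9, 2.1)
print(f"depth resolution = {resolution * 100:.2f} cm")
```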
Figure 9 shows the depth resolution as a function of the distance to the object. When the distance was 2 m, the depth resolution was 1.7 cm, which is the same depth resolution as that obtained by the prototype SDTV version of the Axi-Vision Camera. At a distance of 5 m, the resolution was 3 cm, and at a distance of 10 m, it was 4.9 cm. The depth resolution was thus degraded in proportion to the object distance. This behavior confirms the prediction from Eq. (10) that the SNR of the image intensifier decreases in inverse proportion to the object distance. In synthesizing an image in a virtual studio, the distance from a person to the nearest background is in most cases on the order of tens of centimeters or more. The depth resolution measured above is thus sufficiently fine for image synthesis.
Next, the relationship between the reflectivity of an object and the depth resolution was measured. Papers of different reflectance were used as objects. The reflectance of the BaSO4 standard white plate was set as unity, and all other values were normalized by this reflectance. The distance from the camera to the object was set at 2 m, the frequency of the intensity modulation of the illuminating light was 45 MHz, and the shutter time was 2 ns.
These results are shown in Fig. 10. The depth resolution was degraded with decreasing object reflectance, due to the larger relative contribution of shot noise in the image intensifier. With a low-reflectance object, such as human black hair, the reflectance was lower than 0.1, and the depth resolution was degraded to less than one-half of that obtained for objects with a reflectance of unity.
5. Application to image composition
5.1 Depth keying
The most common method used in TV image composition is the chroma keying method, which segments objects from images by using the color information of a scene. This method, however, has several drawbacks. It requires a special background screen, typically blue or green in color. Clothes of the same color as the background screen cannot be used, and reflected light from the screen has undesirable effects on the luminance and color of the segmented image.
As a countermeasure to these problems, depth keying methods that rely on the depth information of objects have been proposed. The usefulness of such methods was experimentally demonstrated by using a stereoscopic camera based on triangulation or the prototype Axi-Vision Camera.
An example of the image extraction process using the HDTV Axi-Vision Camera is shown in Fig. 11. The distance from the camera to the person was about 2.5 m, and that to the wall was about 3.7 m. The frequency of the intensity modulation was 45 MHz, and the shutter time was 2 ns. From the color image (Fig. 11(a)) and the depth image (Fig. 11(b)) obtained by the HDTV Axi-Vision Camera, the image of an object in a particular range could be isolated. Figure 11(c) shows isolation of objects in the furthest range, Fig. 11(d) shows isolation of objects in the middle range, and Fig. 11(e) shows isolation of the image of the hands in the nearest range. The images of objects at an arbitrary depth can be extracted without the undesirable effects of the chroma keying method. These results demonstrate that the camera is capable of supporting a practical depth keying system in the HDTV signal format.
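The range-based isolation described above can be sketched as a per-pixel test on the depth image. This is a minimal illustration that assumes the color and depth images are already registered pixel for pixel and that depth values are in meters; the function name, the fill color, and the tiny example image are hypothetical.

```python
def depth_key(color, depth, near_m, far_m, fill=(0, 0, 0)):
    """Keep color pixels whose depth lies in [near_m, far_m); fill the rest."""
    keyed = []
    for color_row, depth_row in zip(color, depth):
        keyed.append([
            pixel if near_m <= d < far_m else fill
            for pixel, d in zip(color_row, depth_row)
        ])
    return keyed


# One-row example: a person at about 2.5 m in front of a wall at about 3.7 m
color = [[(200, 180, 160), (90, 90, 90)]]
depth = [[2.5, 3.7]]

person_only = depth_key(color, depth, near_m=2.0, far_m=3.0)
wall_only = depth_key(color, depth, near_m=3.0, far_m=5.0)
print(person_only, wall_only)
```

Choosing a different pair of range limits selects a different layer of the scene, which is the operation illustrated by the panels of Fig. 11.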
5.2 Virtual studio synthesized by combining live images with prerecorded scenes
A virtual studio was synthesized by combining scenes stored in a library of prerecorded scenery. During the process of combining library scenes with a live image, it is necessary to know which scenery is hiding which, and the distance information is essential in this process. Figure 12 shows the results of a virtual image created by combining a live image with scenes prerecorded with the depth information.
The inset on the lower right-hand side of Fig. 12 shows the synthesized image obtained with the HDTV Axi-Vision Camera. The image of the existing back wall of the studio was first removed and then replaced with that of a sliding door, which was inserted between an actress and the back wall. The color and depth images of the actress in front of the set were taken by the HDTV Axi-Vision Camera. Whenever the actress comes close to the camera, her image is properly synthesized with the image components of the sliding door and appears in front of it. Whenever she moves farther away from the camera, her image is ignored and no longer composited with the image of the sliding door, so that her image disappears in the synthesized image. Thus, the camera can make a composite of moving objects in real time. Such results as those shown in Fig. 12 demonstrate that it is possible to use the depth keying method to create a virtual effect without using the blue or green background screens required for the chroma keying method.
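The occlusion decision described above, in which the actress appears in front of the sliding door only when she is nearer to the camera than the door, can be sketched as a per-pixel depth comparison. This is a minimal illustration assuming registered images and depths in meters; the function name and the example values are hypothetical and do not describe the production system.

```python
def composite_by_depth(live_color, live_depth, scene_color, scene_depth):
    """For each pixel, output whichever source is nearer to the camera."""
    out = []
    for lc_row, ld_row, sc_row, sd_row in zip(
            live_color, live_depth, scene_color, scene_depth):
        out.append([
            lc if ld <= sd else sc
            for lc, ld, sc, sd in zip(lc_row, ld_row, sc_row, sd_row)
        ])
    return out


ACTRESS = (220, 190, 170)
DOOR = (120, 80, 40)

# Inserted sliding door placed at 3.0 m in both cases
in_front = composite_by_depth([[ACTRESS]], [[2.0]], [[DOOR]], [[3.0]])
behind = composite_by_depth([[ACTRESS]], [[4.0]], [[DOOR]], [[3.0]])
print(in_front, behind)
```

The same comparison applies when the second source is a computer-generated layer with its own depth values.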
5.3 Virtual studio synthesized by combining live images with computer graphic images
Images artificially generated through computer graphics (CG) were combined with a live image taken by the HDTV Axi-Vision Camera. The camera determined the distance relationship between the CG and the live image.
A short video clip of the “50th Anniversary: Today is the birthday of TV. Grand finale.” broadcast live from the Japan Broadcasting Station (NHK) on February 1, 2003 is attached at the beginning of this paper as Fig. 1. The clip shows a composite image in which the CG image moves around the person. These special compositions, which conventionally required considerable time and cost in post-production, can now be created in real time with this camera. The depth image was also interlocked with the zoom function of the camera lens; natural depth composition in the camera work was achieved by using the information about the relative locations of the camera with respect to the linked CG image.
In this live broadcast, the HDTV Axi-Vision Camera played a key role in creating a dynamic yet realistic image by three-dimensionally combining a real image of objects and people with CG images.
We have developed the HDTV Axi-Vision Camera, which can simultaneously capture both high-definition color images and corresponding depth images of objects at a video frame rate. We presented virtual images created by the camera. Future issues for extending the applications of the camera include (1) further improving the SNR of the system, so that it can be used for objects with weak near-infrared reflection, (2) improving the sensitivity and dynamic range, and (3) optimizing the image processing technology to reduce the noise in the depth image. This camera has the potential to create more attractive and expressive images than are possible with conventional technology.
The authors would like to thank Yumi Matsutoya, Yasushi Akimoto, and Michiko Shimizu, for their performances in the video clip and the program, and their permission for citation of their images in this paper. We also express our gratitude to Mary Jean Giliberto for her meticulous proofreading and enthusiastic assistance in organizing the manuscript. We extend our sincere thanks to Melles Griot Corporation for their technical support in developing the optics, and to Ikegami Tsushinki Corporation for their technical support in developing the camera. Finally, we thank Seiki Inoue, Norio Akiyama, Manabu Oketani, Shuhei Matsui, Ayumi Iizuka, and all the staff at the NHK Broadcasting Center engaged in the video clip and TV program production with the HDTV Axi-Vision Camera.
1. S. Kimura, H. Kano, T. Kanade, A. Yoshida, E. Kawamura, and K. Oda, “CMU video-rate stereo machine,” in Proceedings of 1995 Mobile Mapping Symposium (American Society for Photogrammetry and Remote Sensing, Columbus, Ohio, 1995), pp. 9–18.
2. K. Sato and S. Inokuchi, “Range-imaging system utilizing nematic liquid crystal mask,” in 1st International Conference on Computer Vision ICCV (Institute of Electrical and Electronics Engineers, London, 1987), pp. 657–661.
3. Y. Oike, M. Ikeda, and K. Asada, “Design and implementation of real-time 3-D image sensor with 640×480 pixel resolution,” IEEE J. Solid-State Circuits 39, 622–628 (2004).
4. R. A. Jarvis, “A laser time-of-flight range scanner for robotic vision,” IEEE Trans. on Pattern Analysis and Machine Intelligence PAMI-5, 505–512 (1983).
5. R. Lange, P. Seitz, A. Biber, and S. Lauxtermann, “Demodulation pixels in CCD and CMOS technologies for time-of-flight ranging,” in Sensors and Camera Systems for Scientific, Industrial, and Digital Photography Applications, Morley M. Blouke, Nitin Sampat, George M. Williams, and Thomas Yeh, eds., Proc. SPIE 3965, 177–188 (2000).
6. M. Kawakita, K. Iizuka, H. Kikuchi, H. Fujikake, J. Yonai, and T. Aida, “A 3D camera system using a high-speed shutter and intensity modulated illuminator,” (in Japanese) Institute of Image Information and Television Engineers ITE Tech. Rep. 22, 19–24 (1998).
7. M. Kawakita, K. Iizuka, T. Aida, H. Kikuchi, H. Fujikake, J. Yonai, and K. Takizawa, “Axi-Vision Camera (real-time depth-mapping camera),” Appl. Opt. 39, 3931–3939 (2000).
8. M. Kawakita, K. Iizuka, T. Aida, H. Kikuchi, H. Fujikake, J. Yonai, and K. Takizawa, “Axi-Vision Camera: a three-dimension camera,” in Three-Dimensional Image Capture and Applications III, Brian D. Corner and Joseph H. Nurre, eds., Proc. SPIE 3958, 61–70 (2000).
9. R. J. Hertel, “Signal and noise properties of proximity focused image tubes,” in Ultrahigh Speed and High Speed Photography, Photonics, and Videography ’89: Seventh in a Series, Gary L. Stradling, ed., Proc. SPIE 1155, 332–343 (1989).
10. M. Kawakita, K. Iizuka, Y. Iino, H. Kikuchi, H. Fujikake, and T. Aida, “Real-time depth-mapping three-dimension TV camera (Axi-Vision Camera),” (in Japanese) IEICE Trans. on Information & Systems J87-D-II, No. 6 (2004) (to be published).
11. Hamamatsu Photonics K.K., http://www.hpk.co.jp/.
12. S. Shimoda, M. Hayashi, and Y. Kanatsugu, “New chroma-key imaging technique with hi-vision background,” IEEE Trans. on Broadcasting 35, 357–361 (1989).
13. Y. Yamanouchi, H. Mitsumine, and S. Inoue, “Image-based virtual studio using ultra high-definition omnidirectional images,” (in Japanese) The Journal of the Institute of Image Information and Television Engineers ITE 55, 159–166 (2001).