
Depth image simplification

Open Access

Abstract

Stereoscopic vision is widely used to acquire depth information, estimate distances between objects, and detect obstacles. In this work, a method is proposed that reduces the amount of data needed to obtain depth information for a given scene. The method reduces a 640x480 image to a 3x3 matrix, simplifying the instructions and decision making of an actuating device. Excellent results were obtained with a processing time of 3 seconds using Python 3.7.2, OpenCV 4.0.1, and two Logitech C170 webcams.

© 2020 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Nowadays, image processing has gained great importance, mainly in unmanned vehicle applications, where it helps to avoid collisions with obstacles. However, there are other areas in which it is also necessary to know an object’s proximity, such as medicine, manufacturing, and other industries [1–6].

In those areas it is important to implement an image processing method that acquires the image depth and maps it to a small matrix while keeping the most relevant information. In this way the environment information can be acquired in the shortest possible time, so that collisions or damage are avoided, without compromising the memory storage required by other processes. This is particularly important because some systems, such as sensory substitution devices, need to be portable, and increasing storage means increasing the weight and cost of the device.

In this work, based on stereoscopy and disparity, depth images were represented in gray levels and later simplified with the proposed method to obtain the environment information in the form of a 3x3 matrix. This method simplifies the instructions and decision making of an actuating device, processing only the relevant information without compromising the data storage that may be needed by the main process. With this approach, an average response time of 3 seconds and a storage of 20 KB are obtained.

2. Stereoscopy and disparity

Stereoscopic vision is the process of perceiving relative distances to objects from the lateral displacement (disparity) between two images of a scene. It is based on the principle of human vision, in which the eyes acquire two images and, through a triangulation process, the depth information of the observed scene is obtained.

A conventional digital stereoscopic vision system consists of a pair of parallel cameras, horizontally separated by a known distance B, called the baseline [7,8] (Fig. 1).

Fig. 1. Stereoscopic vision system graphical representation, where B is the baseline, f is the focal length, X is a specific position of the object, x and x’ are the object representations on the image plane at the cameras’ focal length, Z is the distance between the object and the cameras, and O and O’ are the principal points of the cameras.

Considering Fig. 1, it is possible to compute the distance Z (the distance to object X) from the disparity value d using Eq. (1):

$$Z = \frac{Bf}{d}$$
with
$$d= x-x'$$
where:

x and x’ are the object representations on the image plane at the cameras’ focal length,

B is the distance between both cameras (the baseline),

f is the focal length of the cameras,

d is the disparity value of the image.
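
As a worked example of Eq. (1), the following Python sketch computes Z for one matched point. The 7 cm baseline matches the setup described in Section 3, while the focal length in pixels and the disparity value are illustrative assumptions.

```python
# Worked example of Eq. (1): Z = B*f/d.
# B = 7 cm is the camera separation used in Section 3; the focal length in
# pixels and the disparity below are illustrative values only.

def depth_from_disparity(baseline_cm, focal_px, disparity_px):
    """Distance Z (in cm) to a point with the given disparity."""
    if disparity_px <= 0:
        raise ValueError("Disparity must be positive for a finite depth.")
    return baseline_cm * focal_px / disparity_px

if __name__ == "__main__":
    B = 7.0    # baseline between the two webcams, in cm
    f = 700.0  # assumed focal length in pixels (obtained from calibration)
    d = 35.0   # assumed disparity of a matched point, in pixels
    print(f"Z = {depth_from_disparity(B, f, d):.1f} cm")  # Z = 140.0 cm
```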

3. Methodology

Image processing was performed using Python 3.7.2 and the OpenCV library, version 4.0.1. Stereoscopic images were captured using two Logitech C170 [9] webcams placed at a 7 cm horizontal separation from each other; the cameras were then calibrated using a chessboard pattern in order to obtain their intrinsic and extrinsic parameters. With these known values, the images were rectified and the radial and tangential distortion of the camera lenses was corrected [10].
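
A minimal OpenCV sketch of this rectification step is shown below. It assumes the intrinsic and extrinsic parameters obtained from the chessboard calibration have already been stored on disk; the file name stereo_calib.npz and the parameter keys are hypothetical, not taken from the original implementation.

```python
# Sketch of the rectification step with OpenCV 4, assuming the calibration
# results (K1, D1, K2, D2, R, T) were previously saved to "stereo_calib.npz".
import cv2
import numpy as np

def rectify_pair(img_left, img_right, calib_file="stereo_calib.npz"):
    """Undistort and rectify a stereo pair so epipolar lines become horizontal."""
    c = np.load(calib_file)
    size = (img_left.shape[1], img_left.shape[0])  # (width, height)

    # Rectification transforms computed from the calibration parameters.
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(
        c["K1"], c["D1"], c["K2"], c["D2"], size, c["R"], c["T"])

    # Per-camera remapping tables; these correct radial and tangential distortion.
    map1x, map1y = cv2.initUndistortRectifyMap(c["K1"], c["D1"], R1, P1, size, cv2.CV_32FC1)
    map2x, map2y = cv2.initUndistortRectifyMap(c["K2"], c["D2"], R2, P2, size, cv2.CV_32FC1)

    left_rect = cv2.remap(img_left, map1x, map1y, cv2.INTER_LINEAR)
    right_rect = cv2.remap(img_right, map2x, map2y, cv2.INTER_LINEAR)
    return left_rect, right_rect, Q
```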

By triangulating the points of the objects in the environment with respect to the camera locations, using Semi-Global Matching (SGM) [11], a 640x480 pixel depth image was obtained in gray levels. The generated depth image has gray-level values ranging from 0 to 255 that are proportional to the objects’ proximity, where 0 (black) corresponds to the furthest objects and 255 (white) to the objects closest to the cameras. To simplify the stereoscopic image, a segmentation process was performed on the gray levels considering six proximity ranges, as represented by the block diagram in Fig. 2. Since the first two images represent distant objects with a minimum collision risk, both were discarded for the next processing step [12–14].
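
A sketch of how such a gray-level depth image can be produced with OpenCV's semi-global block matcher is given below; the matcher parameters are typical illustrative values, not the authors' exact settings.

```python
# Sketch of the depth-image step: semi-global (block) matching with OpenCV's
# StereoSGBM, followed by normalization so near objects tend toward 255 (white)
# and far objects toward 0 (black). Parameter values are illustrative.
import cv2
import numpy as np

def depth_image(left_rect, right_rect):
    grayL = cv2.cvtColor(left_rect, cv2.COLOR_BGR2GRAY)
    grayR = cv2.cvtColor(right_rect, cv2.COLOR_BGR2GRAY)

    sgbm = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=64,     # must be a multiple of 16
        blockSize=5,
        P1=8 * 5 * 5,          # smoothness penalties suggested by the OpenCV docs
        P2=32 * 5 * 5,         # for grayscale input and blockSize = 5
        uniquenessRatio=10,
        speckleWindowSize=100,
        speckleRange=2)

    disp = sgbm.compute(grayL, grayR).astype(np.float32) / 16.0  # fixed point -> pixels
    disp[disp < 0] = 0
    # Larger disparity means a closer object, so min-max normalization maps
    # near objects to bright gray levels.
    return cv2.normalize(disp, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```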

Each of the four generated images is binarized to isolate the detected objects from the background, as can be seen in Fig. 3. From the depth image in Fig. 3(b), four images are obtained (Fig. 3(c)); in each of them the objects within the corresponding distance range (gray levels) are shown in white, while the background and the objects outside the selected distance range are shown in black.
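
The segmentation and binarization step can be sketched as follows. Splitting the 0–255 range into six equal bands is an assumption, since the paper only states that six proximity ranges are considered and that the two farthest are discarded.

```python
# Sketch of the segmentation/binarization step: the depth image is split into
# six gray-level bands and the four nearest bands are binarized (pixels inside
# the band become white, everything else black). Equal band widths are assumed.
import cv2
import numpy as np

def binarize_ranges(depth_img, n_ranges=6, keep_last=4):
    """Return one binary image per kept proximity range, ordered far to near."""
    edges = np.linspace(0, 256, n_ranges + 1).astype(int)
    masks = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masks.append(cv2.inRange(depth_img, int(lo), int(hi) - 1))
    return masks[-keep_last:]  # drop the two farthest (darkest) bands
```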

Fig. 2. Processing flowchart, where ①, ②, ③, ④, ⑤ and ⑥ are the six ranges of gray levels considered.

Fig. 3. a) Scene with objects, b) depth image, and c) segmentation and binarization of the four generated images.

Thereafter, each image is divided into nine sections, corresponding to the 3x3 output matrix, as can be seen in Fig. 4, in order to evaluate the detected objects individually in each section and within each of the four depth levels.
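
A possible implementation of this sectioning with NumPy slicing is sketched below; the exact block boundaries (roughly 213x160 pixels for a 640x480 image) are an assumption.

```python
# Sketch of the 3x3 sectioning: each binary image is split into nine blocks so
# that every block can be evaluated independently.
import numpy as np

def split_into_sections(img, rows=3, cols=3):
    """Return a rows x cols nested list of sub-images covering the whole image."""
    h, w = img.shape[:2]
    ys = np.linspace(0, h, rows + 1).astype(int)
    xs = np.linspace(0, w, cols + 1).astype(int)
    return [[img[ys[r]:ys[r + 1], xs[c]:xs[c + 1]] for c in range(cols)]
            for r in range(rows)]
```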

Fig. 4. Division of images 3, 4, 5 and 6 into nine sections.

To determine the relevance of the detected objects, the percentage of white pixels of each object is evaluated in relation to the background of each generated image, giving higher priority to objects located in the brighter gray and white levels, because they are closer to the cameras, as can be seen in the diagrams of Fig. 5.
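
This evaluation can be sketched as a white-pixel fraction computed per section; the 5% relevance threshold used below is illustrative, as the paper does not state the exact percentage at which an object is considered present.

```python
# Sketch of the section evaluation: fraction of white pixels in each of the
# nine sections of one binarized range image. The threshold is an assumption.
import numpy as np

def white_fraction(section):
    """Fraction of pixels in a binary section that are white (non-zero)."""
    return float(np.count_nonzero(section)) / section.size

def occupancy_grid(sections, threshold=0.05):
    """3x3 boolean grid: True where a section contains a relevant object."""
    return [[white_fraction(s) >= threshold for s in row] for row in sections]
```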

Fig. 5. Evaluation of the sections in a) image 3, b) image 4, c) image 5 and d) image 6.

With the obtained information, a 3x3 matrix is generated with values of 0, 0.2, 0.5, 0.8 or 1 (0 indicates the absence of objects while 1 indicates the closest objects), giving greater importance to objects detected at smaller distances, as can be seen in the block diagrams in Fig. 6.
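
A sketch of how the four occupancy grids could be merged into the output matrix is shown below, overwriting each cell with the value of the nearest range in which an object was detected; this overwrite order is an assumed implementation of the stated priority rule.

```python
# Sketch of the output-matrix generation: four 3x3 occupancy grids (ordered
# from the farthest kept range to the nearest) are merged so that closer
# obstacles always take priority over farther ones.
import numpy as np

LEVELS = (0.2, 0.5, 0.8, 1.0)  # far-to-near values; 0 means no object detected

def output_matrix(grids):
    """grids: list of four 3x3 boolean occupancy grids, ordered far to near."""
    out = np.zeros((3, 3))
    for value, grid in zip(LEVELS, grids):
        out[np.array(grid)] = value  # nearer ranges overwrite farther ones
    return out
```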

Fig. 6. Generation of the 3x3 output matrix, assigning the values 0, 0.2, 0.5, 0.8 or 1 (0 indicates the absence of objects and 1 indicates the closest objects).

4. Results

The method was tested in controlled environments, where objects were placed at different distances from the cameras as shown in Fig. 7, using an Acer laptop with an Intel processor (its characteristics can be seen in Table 1).

Fig. 7. Experimental setup in a controlled environment. a) side view and b) webcam image. Here Cs are the cameras, Ob1, Ob2 and Ob3 are objects placed at different distances.

Table 1. Intel processor characteristics.

For a better appreciation of the results, the 3x3 matrix is visualized as a gray-level image (Fig. 8). The depth image of the experimental setup of Fig. 7 is shown in Fig. 8(a). The nearest objects appear in white, while distant objects appear in shades of gray (darker the farther away they are).
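
A minimal sketch of this visualization, scaling the 3x3 matrix to gray levels and enlarging it with nearest-neighbor interpolation so that each cell appears as a uniform block, is shown below.

```python
# Sketch of the Fig. 8(b)-style display: matrix values in [0, 1] are scaled to
# 0-255 and the 3x3 array is enlarged so each cell becomes a uniform block
# (white = closest obstacle, black = no object).
import cv2
import numpy as np

def matrix_to_image(matrix, size=(480, 480)):
    img = (np.asarray(matrix) * 255).astype(np.uint8)
    return cv2.resize(img, size, interpolation=cv2.INTER_NEAREST)
```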

Fig. 8. a) Depth image, b) 3x3 matrix display in gray levels.

Tests were performed with different experimental arrangements in controlled environments and with depth image datasets [15].

The results shown in Figs. 9 and 10 correspond to the processing of two different experimental setups, in which objects of different sizes are placed at different distances.

From Figs. 9 and 10, it can be concluded that the size of the objects does not represent a problem when determining their proximity to the cameras.

Fig. 9. a) Scene with objects, b) 3x3 matrix display in gray levels.

Fig. 10. a) Scene with objects, b) 3x3 matrix display in gray levels.

The results shown in Figs. 11, 12, 13 and 14 correspond to the processing of depth images from four different datasets. The images were previously resized to 640x480 pixels in order to compare the obtained results with those of the depth images in Figs. 9 and 10.
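
The resizing step can be sketched with a standard OpenCV call; the file name and interpolation mode below are illustrative.

```python
# Illustrative resizing of a dataset depth image to 640x480 before it enters
# the same segmentation pipeline; "dataset_depth.png" is a hypothetical file.
import cv2

depth = cv2.imread("dataset_depth.png", cv2.IMREAD_GRAYSCALE)
depth_resized = cv2.resize(depth, (640, 480), interpolation=cv2.INTER_AREA)
```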

Fig. 11. a) Depth image, b) 3x3 matrix display in gray levels.

Fig. 12. a) Depth image, b) 3x3 matrix display in gray levels.

Fig. 13. a) Depth image, b) 3x3 matrix display in gray levels.

Fig. 14. a) Depth image, b) 3x3 matrix display in gray levels.

From the obtained results it can be seen that the proposed method can be applied to different 640x480 pixel depth images, allowing its use in many applications. Table 2 compares the processing time and the output matrix size with those of other procedures, highlighting the advantages of the proposed method.

Table 2. Comparison of depth image works.

In the proposed method, unlike that of Garduño [16], all the processing was carried out on a single device, and the output data is smaller than a 640x480 image or a point cloud, which is the form of the output data of Lavin [18], Garduño [16], Hernández [17], Sun [19] and Wei [21].

The proposed method could be used together with Wozniak’s work [20], simplifying the identification of obstacles for virtual reality headset users.

However, for medical applications such as Hernández’s work [17], the processing time must be reduced to less than 1 second, as in Wei’s work [21]; nonetheless, the proposed method can be considered real-time processing.

It is currently difficult to find methods that simplify a depth image to obtain the relevant data about obstacle locations; however, there are many applications for which this is very useful, such as sensory substitution systems for blind people. Table 3 compares sensory substitution devices and the application of the proposed method to this area.

Table 3. Sensory substitution devices and the proposed method application to this area.

5. Conclusions

A method was proposed that allows depth image simplification, reducing a 640x480 image to a 3x3 matrix with values of 0, 0.2, 0.5, 0.8 or 1, with an average response time of 3 seconds from image capture to the final 3x3 matrix, and only 2 seconds when starting from an already generated depth image.

The output matrix can be easily interpreted by another system, and it reduces the number of bits required to transmit the depth information to only nine values (each one of 0, 0.2, 0.5, 0.8 or 1) that indicate the proximity of obstacles and the direction in which they are located.
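
A back-of-the-envelope comparison illustrates this reduction: nine cells that each take one of five levels can be encoded in 27 bits, versus 2,457,600 bits for a full 640x480 8-bit depth image.

```python
# Size comparison between a full 640x480 8-bit depth image and the 3x3 matrix
# whose cells take one of five levels (0, 0.2, 0.5, 0.8, 1).
import math

depth_image_bits = 640 * 480 * 8           # 2,457,600 bits
matrix_bits = 9 * math.ceil(math.log2(5))  # 9 cells x 3 bits = 27 bits
print(depth_image_bits, matrix_bits)       # reduction factor of roughly 91,000x
```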

It is currently difficult to find methods that simplify a depth image to obtain the relevant data about the location of obstacles; however, there are many applications for which this is very useful, such as sensory substitution systems for blind people, assistance robots, or unmanned vehicles.

The processing time might be improved to be comparable with Hernández’s work [17], since in medical applications the processing time has to be less than 1 second; for this, the RAM could be increased to 8 GB or a processor with a higher frequency, such as 4.5 GHz, could be used.

Using a camera viewing angle of $58^{\circ}$ [9], the minimum distance at which it is possible to determine the proximity of objects is 8 cm.

Funding

Consejo Nacional de Ciencia y Tecnología (449733).

Disclosures

The authors declare no conflicts of interest.

References

1. J. Zabalza, Z. Fei, C. Wong, Y. Yan, C. Mineo, E. Yang, T. Rodden, J. Mehnen, Q.-C. Pham, and J. Ren, “Smart sensing and adaptive reasoning for enabling industrial robots with interactive human-robot capabilities in dynamic environments - a case study,” Sensors 19(6), 1354 (2019). [CrossRef]  

2. S. Emani, K. Soman, V. S. Variyar, and S. Adarsh, “Obstacle detection and distance estimation for autonomous electric vehicle using stereo vision and dnn,” in Soft Computing and Signal Processing, (Springer, 2019), pp. 639–648.

3. U. B. Himmelsbach, T. M. Wendt, N. Hangst, and P. Gawron, “Single pixel time-of-flight sensors for object detection and self-detection in three-sectional single-arm robot manipulators,” in 2019 Third IEEE International Conference on Robotic Computing (IRC), (IEEE, 2019), pp. 250–253.

4. M. Sun, P. Ding, J. Song, M. Song, and L. Wang, “Watch your step: Precise obstacle detection and navigation for mobile users through their mobile service,” IEEE Access 7, 66731–66738 (2019). [CrossRef]  

5. M. Meenakshi and P. Shubha, “Design and implementation of healthcare assistive robot,” in 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS), (IEEE, 2019), pp. 61–65.

6. Y. Tange, T. Konishi, and H. Katayama, “Development of vertical obstacle detection system for visually impaired individuals,” in Proceedings of the 7th ACIS International Conference on Applied Computing and Information Technology, (ACM, 2019), p. 17.

7. M. Okutomi and T. Kanade, “A multiple-baseline stereo,” in Proceedings. 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (IEEE, 1991), pp. 63–69.

8. T. Kanade, H. Kano, S. Kimura, A. Yoshida, and K. Oda, “Development of a video-rate stereo machine,” in Proceedings 1995 IEEE/RSJ International Conference on Intelligent Robots and Systems. Human Robot Interaction and Cooperative Robots, vol. 3 (IEEE, 1995), pp. 95–100.

9. Logitech, “Especificaciones webcam logitech c170,” https://support.logitech.com/en_us/product/webcam-c170/specs.

10. D. Malacara-Hernández and Z. Malacara-Hernández, Handbook of optical design (CRC Press, 2016).

11. H. Hirschmüller, “Stereo processing by semiglobal matching and mutual information,” IEEE Trans. Pattern Anal. Mach. Intell. 30(2), 328–341 (2008). [CrossRef]  

12. B. Jahne, Practical handbook on image processing for scientific and technical applications (CRC Press, 2004).

13. G. Cristóbal, P. Schelkens, and H. Thienpont, Optical and digital image processing: fundamentals and applications (John Wiley & Sons, 2013).

14. W. K. Pratt, Digital Image Processing (Wiley-Interscience, 1978).

15. D. Scharstein, R. Szeliski, and H. Hirschmüller, Middlebury stereo vision page, http://vision.middlebury.edu/stereo/.

16. M. A. G. Ramon, “Segmentación de imágenes obtenidas a través de un sensor kinect con criterios morfológicos y atributos visuales de profundidad,” Ph.D. thesis, Universidad Autonoma de Querétaro (2018).

17. C. Castedo Hernández, R. Estop Remacha, and L. Santos de la Fuente, “Sistema de visión estereoscópico para el guiado de un robot quirúrgico en operaciones de cirugía laparosócopica hals,” Actas de las XXXVIII Jornadas de Automática (2017).

18. J. E. L. Delgado, R. A. Cantu, J. E. M. Cruz, and N. I. G. Morales, “Desarrollo de un sistema de reconstrucción 3d estereoscópica basado en la disparidad,” in Determinacion del grado de estres en docentes universitarios con actividad, (2018), p. 6548.

19. Y. Sun, X. Liang, H. Fan, M. Imran, and H. Heidari, “Visual hand tracking on depth image using 2-d matched filter,” in 2019 UK/China Emerging Technologies (UCET), (IEEE, 2019), pp. 1–4.

20. P. Wozniak, A. Capobianco, N. Javahiraly, and D. Curticapean, “Depth sensor based detection of obstacles and notification for virtual reality systems,” in International Conference on Applied Human Factors and Ergonomics, (Springer, 2019), pp. 271–282.

21. Y. Wei, J. Yang, C. Gong, S. Chen, and J. Qian, “Obstacle detection by fusing point clouds and monocular image,” Neural Process. Lett. 49(3), 1007–1019 (2019). [CrossRef]  

22. A. Ali and M. A. Ali, “Blind navigation system for visually impaired using windowing-based mean on microsoft kinect camera,” in 2017 Fourth International Conference on Advances in Biomedical Engineering (ICABME), (IEEE, 2017), pp. 1–4.

23. R. Ribani and M. Marengoni, “Vision substitution with object detection and vibrotactile stimulus,” in Proc. 14th Int. Joint Conf. Comput. Vis., Imag. Comput. Graph. Theory Appl., (2019), pp. 584–590.

24. M. P. Cervellini, E. Gonzalez, J. C. Tulli, A. Uriz, P. D. Agüero, and M. G. Kuzman, “Sistema de sustitución sensorial visual - táctil para no videntes empleando sensores infrarrojos,” XVIII Congreso Argentino de Bioingeniería SABI 2011 - VII Jornadas de Ingeniería Clínica (2011).

25. C. Feltner, J. Guilbe, S. Zehtabian, S. Khodadadeh, L. Boloni, and D. Turgut, “Smart walker for the visually impaired,” in ICC 2019-2019 IEEE International Conference on Communications (ICC), (IEEE, 2019), pp. 1–6.
