## Abstract

Point cloud data offer the potential for viewpoint-independent object recognition based solely on the geometrical information about an object that they contain. We consider two types of one-dimensional data products extracted from point clouds: range histograms and point-separation histograms. We evaluate each histogram in terms of its viewpoint independence. The Jensen-Shannon divergence is used to show that point-separation histograms have the potential for viewpoint independence. We demonstrate viewpoint-independent recognition performance using lidar data sets from two vehicles and a simple algorithm for a two-class recognition problem. We find that point-separation histograms have good potential for viewpoint-independent recognition over a hemisphere.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. INTRODUCTION

Point clouds have found application in many areas ranging from manufacturing to remote surveillance. Remote acquisition of 3D point clouds by lidar (light detection and ranging) is the sensing mode of interest in this manuscript and has a long history [1]. Examples include mapping [2], tree canopy survey [3], and object identification (see, e.g., papers in [4]). Point cloud imagery is becoming ubiquitous, in no small part because of the potential it offers for autonomous vehicle operation (see, e.g., [5]). A feature of point clouds is that they offer full 3D geometric information about an object. A benefit of geometric information is that it has greater immunity to variations in operating conditions such as illumination and weather. Three-dimensional point cloud data also offer the potential for improved segmentation through ground plane removal [6]. Of primary interest in this paper is the utility of 3D geometric information in performing viewpoint-independent recognition. Note that we are interested in objects such as vehicles at remote distances, not faces at close range. This places a constraint on the type of information available in the remote-sensing application. The particular imaging scenario will often dictate the resolution that can be achieved.

In simplified terms, one can consider an image as a mapping of parameters associated with a scene onto a 2D grid of scene samples. Usually, the parameter of interest is the flux exiting the projected surface of the object, which gives an “intensity” image. A more recent example of a parameter for imaging is the color-dependent flux map or “hyperspectral” image. Other parameters can also be mapped to the sampling grid, including ones not normally considered for images, such as vibration [7]. For geometric information contained in the point cloud, one can consider forming range images [8]. That is, we can represent point clouds as a classical cross-range (pixelated) image, but now the random variable stored in each pixel is not intensity but rather range from the image-gathering sensor to the sample grid point on the target. In this case, the variation between pixels (contrast) is in range, not intensity [9]. An example of a range-contrast image is shown in Fig. 1, along with a more typical intensity-contrast image. From a visual perspective, each type of image offers its own benefits in terms of contrast. The features of the object that are most apparent (have high contrast) differ between the two sensor modes. We point out that each type of image has its own information capacity. Gray-scale images have information capacity related to the number of bits of digitization of the intensity at each pixel and the noise in each pixel [10]. A similar comment can be made about the range image. Here, the number of bits of digitization is set not by the number of intensity levels that the pixel can resolve but rather by the number of time bins that the pixel can resolve [11]. Note that the electronics associated with extracting information from each pixel are quite different between the two sensing modes.

From an exploitation perspective, range images such as those in Fig. 1 can be processed using standard techniques for pattern recognition. For template-matching approaches to recognition, training of the templates can be problematic because of the large number of variables that must be included in the training set. These variables can include illumination, object pose, and articulation of parts of the object. We want to explore whether 3D point clouds can reduce this image variability. Point clouds do not necessarily require inclusion of intensity information, only the geometrical information, so part of the image variability can be eliminated. Articulation remains an issue with point clouds and so must be dealt with in the selection of training data. Here, we are primarily interested in seeing whether point cloud data can be made less sensitive to the pose of an object with respect to the sensor. Reduction of the impact of pose variation on recognition has received considerable attention [12–15].

In [12], a computationally intensive process is used on point cloud data to identify feature points and then develop local surface patches of the objects. These local patches are used to represent targets at various view angles. In [13], an algorithm based on probabilistic multiple interpretations is developed to estimate the pose of the object of interest. The object with estimated pose is then compared with a template of that pose. In [14], the object of interest is decomposed into Zernike modes and the even-order modes are retained as having the best rotation invariance. A neural network is then used with the input Zernike modes to perform recognition. In [15], the range images are phase-encoded through the use of a Fourier transform. The phase-encoded images are then converted into spherical coordinates, and an algorithm is developed to operate on these images.

To operate on gathered point cloud data in real-time sensing, all of these approaches require significant preprocessing of the point cloud data prior to the generation of the features that are used for recognition. The features we propose here are simple 1D histograms extracted directly from the point cloud data with only minimal preprocessing of the point cloud. The first starts with range images as discussed above and treats range per pixel as a random variable, from which properly normalized histograms (probability density functions) can be formed. No additional processing on the range data is required. The second random variable that we treat is the separation between pairs of observable points in the point cloud from a given view of the object. Since each point contains the geometric location of the point with respect to some origin, we can compute the distance (separation) between all pairs of points in the point cloud. Preprocessing here involves the computation of point separations, but after that the histogram is formed in a standard manner. In the following sections, we analyze these two random variables to see whether their probability density functions can serve as features of the objects that exhibit invariance as a function of view of the object. The remainder of the paper is organized as follows. In Section 2, we briefly discuss the data used in the analyses and the metric used to evaluate viewpoint immunity. In Section 3, we analyze both range histograms and point-separation histograms and comment on their viewpoint invariance. In Section 4, we use point-separation histograms as measurements in a two-class recognition problem. We conclude with a summary in Section 5.

## 2. ANALYSIS METHODOLOGY

In this work, we restrict ourselves to images that are segmented from a scene and have the backgrounds removed. We form 1D histograms of random variables of interest that are derived from the 3D data. To simulate the type of data that can be obtained from a remote lidar system, we begin with point cloud data from two vehicles that were obtained by a FARO laser scanner [16]. These data are of a car and a forklift. The data from the scanner are of high spatial resolution, so we filtered the data to provide point clouds of 1 cm precision. The data are input with a nadir view (top down) and then rotated to represent a particular view of interest that is consistent with what might be gathered remotely. The rotation is performed using standard Cartesian coordinate rotation formulae. The data are then binned in the following manner. First, points with the same cross-range location are filtered to keep only the point with the range closest to the sensor. This filtering process simulates the type of data that would be acquired in a remote-sensing application. These filtered data are then grouped into range/cross-range voxels of varying sizes. The voxel sizes are chosen to represent realistic capabilities of lidar sensors that, again, could be used to gather the data in remote-imaging applications. Hence, the sensor characteristics and the spatial averaging introduced by those characteristics determine to a large degree the properties of the random variables; this work does not use spatial correlations of the high-resolution object to determine the binning. Each voxel within the binned data then represents a potential point in the lower-resolution point cloud, depending on whether the voxel contained any points from the high-resolution point cloud. We did not track the number of points within each voxel, as we are interested in only the geometric information of the object, not the intensity.
We performed the binning process for 37 views over a hemisphere of observation directions in azimuth (from the rear of the vehicles to the front of the vehicles). Each view is separated by 5° in azimuth using a constant elevation of 17° above the ground plane.
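
The binning steps described above can be sketched in a few lines of Python with NumPy. This is a minimal sketch under stated assumptions, not the code used in this work: the function names `to_sensor_frame` and `bin_view`, the axis convention (object frame with $z$ up and azimuth measured about $z$), the 1 cm grid used for the occlusion filter, and the use of voxel centers as output coordinates are all illustrative choices.

```python
import numpy as np

def to_sensor_frame(points, azimuth_deg, elevation_deg):
    """Express (N, 3) object-frame points as (cross-range u, cross-range v, range)."""
    az, el = np.radians([azimuth_deg, elevation_deg])
    # Unit vector from the object toward the sensor (assumed convention).
    s = np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])
    u = np.array([-np.sin(az), np.cos(az), 0.0])  # horizontal cross-range axis
    v = np.cross(s, u)                            # vertical cross-range axis
    pts = np.asarray(points, dtype=float)
    # Range is relative (no sensor offset) and grows away from the sensor.
    return np.column_stack((pts @ u, pts @ v, -(pts @ s)))

def bin_view(points, azimuth_deg, elevation_deg,
             pixel=0.10, range_res=0.10, fine=0.01):
    """Return centers of occupied voxels for one view (sizes in meters)."""
    uvr = to_sensor_frame(points, azimuth_deg, elevation_deg)

    # Step 1: occlusion filter. Within each fine cross-range cell,
    # keep only the point with the smallest range.
    cells = np.floor(uvr[:, :2] / fine).astype(np.int64)
    order = np.lexsort((uvr[:, 2], cells[:, 1], cells[:, 0]))  # cell, then range
    cells_sorted = cells[order]
    first = np.ones(len(order), dtype=bool)
    first[1:] = np.any(cells_sorted[1:] != cells_sorted[:-1], axis=1)
    visible = uvr[order][first]

    # Step 2: group visible points into voxels and record occupancy only
    # (the number of points per voxel is not tracked).
    size = np.array([pixel, pixel, range_res])
    occupied = np.unique(np.floor(visible / size).astype(np.int64), axis=0)
    return (occupied + 0.5) * size

# Hypothetical usage: 37 views, 5 degrees apart, at 17 degrees elevation.
# views = [bin_view(cloud, az, 17.0, pixel=0.30) for az in range(0, 181, 5)]
```

Sorting by cell and then by range makes the first entry of each cell the closest surface, which is one simple way to emulate self-occlusion without ray casting.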

The cross-range pixel sizes we used in this work varied between 10 and 50 cm. Such sizes will be considered large by the facial recognition community but are consistent with remote imaging. In typical applications, the pixel size will vary with gross range to the target, assuming a constant aperture size. The resolution in range was set to 10 cm. While ambitious, this resolution is not impossible to obtain [11]. Range resolution is not determined by the same properties of diffraction that have an impact on cross-range resolution, so keeping the same range resolution for all cases is appropriate.
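
As a rough check on what a 10 cm range resolution asks of the receiver, and assuming a direct-detection time-of-flight sensor (an assumption, since no particular sensor architecture is specified here), the range resolution $\Delta R$ and the timing resolution $\Delta t$ are related through the round-trip travel time:

$$\Delta R = \frac{c\,\Delta t}{2} \quad\Longrightarrow\quad \Delta t = \frac{2\,\Delta R}{c} = \frac{2\,(0.10\ \mathrm{m})}{3.0\times 10^{8}\ \mathrm{m/s}} \approx 0.67\ \mathrm{ns}.$$

Under that assumption, each pixel would need time bins of well under a nanosecond, which depends on the timing electronics rather than on the aperture.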

After the above binning process, the data are then converted to the 1D representations described in Section 1. Typical template-matching approaches will group training data into views over some angular width and develop a template for each of the view groups. Since we are interested in comparing representations of the objects that exhibit some degree of viewpoint invariance, we assess how well the 1D representations exhibit this invariance using an information theory metric. We want to compare how the probability density functions (PDFs) vary as a function of view angle and also how the PDFs vary between vehicles. A metric to compare two distributions is the Jensen–Shannon divergence (JSD) [17]. Given two probability distributions $P$ and $Q$ and the mixture distribution $M = (P + Q)/2$, the JSD is defined as

$$\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\, D(P \,\|\, M) + \tfrac{1}{2}\, D(Q \,\|\, M), \tag{1}$$

where $D(\cdot \,\|\, \cdot)$ is the Kullback–Leibler divergence. The JSD is symmetric in $P$ and $Q$ and equals zero when the two distributions are identical.
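
A minimal numerical sketch of the JSD between two histograms, written with the standard definition in terms of Kullback–Leibler divergences to the mixture $M$. The function name `jensen_shannon_divergence` and the choice of base-2 logarithms (which bounds the result between 0 and 1) are assumptions for illustration; the logarithm base used in this work is not stated.

```python
import numpy as np

def jensen_shannon_divergence(p, q, base=2.0):
    """JSD between two histograms defined on the same bins."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()          # normalize to discrete probabilities
    q = q / q.sum()
    m = 0.5 * (p + q)        # mixture distribution M

    def kl(a, b):
        mask = a > 0         # terms with a == 0 contribute 0 by convention
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) / np.log(base)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Because $M$ is nonzero wherever $P$ or $Q$ is nonzero, the divergence stays finite even when the two histograms occupy different bins, which is convenient for comparing views whose PDFs do not fully overlap.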

## 3. ANALYSIS OF ONE-DIMENSIONAL REPRESENTATIONS

#### A. Range Histograms

One feature that can be derived directly from a range image is a histogram of the ranges that make up the image. Point cloud data generation assigns to each measured point an ($x$, $y$, $z$) location, in Cartesian coordinates, with respect to some origin (for example, the sensor aperture). Once a viewpoint is specified, a range image can be created from the point cloud data using the binning process described in Section 2. After a range image is created, such as that shown in Fig. 1, we treat the range in each pixel as a random variable and form a histogram of the ranges. In forming the histogram, we discard the ($x$, $y$) information, which formed the 2D image, and are left with a list of 1D ($z$) coordinates. A histogram of this list of coordinates is a feature extracted from the point cloud data for the particular point of view that is desired. By proper normalization, we convert the histograms to probability density functions (PDFs). Using the binned data from the FARO scanner, we developed PDFs of range distributions for the two vehicles. Examples of PDFs of ranges for both vehicles are shown in Fig. 2. We make two notes. First, the PDFs for the same vehicle exhibit a good deal of difference between views. Second, the degree of visual difference between vehicles depends on the view.
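
The conversion from a range image to a normalized range histogram can be sketched as follows. This is an illustration only: the function name `range_pdf`, the use of NaN to mark pixels with no return, and the choice to start the bins at the smallest observed range are assumptions, not details taken from this work.

```python
import numpy as np

def range_pdf(range_image, bin_width=0.10):
    """Histogram the per-pixel ranges of a 2D range image and normalize to unit area."""
    img = np.asarray(range_image, dtype=float)
    z = img[np.isfinite(img)]            # drop pixels without a return; (x, y) is discarded
    r_min, r_max = z.min(), z.max()
    # One bin more than strictly needed, so the largest range always falls inside.
    n_bins = int(np.floor((r_max - r_min) / bin_width)) + 1
    edges = r_min + bin_width * np.arange(n_bins + 1)
    counts, _ = np.histogram(z, bins=edges)
    pdf = counts / (counts.sum() * bin_width)   # area under the PDF equals 1
    return pdf, edges
```

To compare PDFs across views with a bin-wise metric, the same bin edges would have to be used for every view; here the edges follow the data of a single image for simplicity.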

For each view, we computed the JSD between two distributions, either between different views of the car compared with itself or between different views of the car compared with the forklift. Figure 3 shows similarity matrices resulting from (car–car) comparisons and (car–forklift) comparisons. In the (car–car) comparisons, the JSD will be zero along the diagonal because the distributions are identical. The histogram distributions for the car are most similar when comparing oblique views and are most dissimilar when looking broadside or from either end. When comparing views of the car with the forklift, we see view aspects in which the histograms are dissimilar and views in which the distributions are nearly identical. As expected, the average JSD between views of the car is smaller than the average JSD between the car and forklift.
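
A similarity matrix of this kind can be sketched by evaluating the JSD for every pair of views. The sketch below uses the equivalent entropy form, $\mathrm{JSD}(P \,\|\, Q) = H(M) - \tfrac{1}{2}[H(P) + H(Q)]$ with $H$ the Shannon entropy; the names `entropy_bits` and `jsd_matrix`, the base-2 logarithm, and the requirement that all histograms share the same bins are assumptions made for illustration.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def jsd_matrix(views_a, views_b):
    """Pairwise JSD between two stacks of histograms, shape (n_views, n_bins)."""
    a = np.asarray(views_a, dtype=float)
    b = np.asarray(views_b, dtype=float)
    a = a / a.sum(axis=1, keepdims=True)
    b = b / b.sum(axis=1, keepdims=True)
    out = np.empty((len(a), len(b)))
    for i, p in enumerate(a):
        for j, q in enumerate(b):
            out[i, j] = entropy_bits(0.5 * (p + q)) - 0.5 * (entropy_bits(p) + entropy_bits(q))
    return np.maximum(out, 0.0)   # clip tiny negative round-off

# Hypothetical usage with 37 view histograms per vehicle:
# car_car = jsd_matrix(car_pdfs, car_pdfs)            # zero diagonal
# car_forklift = jsd_matrix(car_pdfs, forklift_pdfs)
```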

#### B. Point-Separation Histograms

One property of an object that is potentially more invariant to sensor perspective is the location of points on the object with respect to each other. The distances between pairs of points are independent of the origin established by sensor location. One can compute a histogram of relative point separations that result from calculating distances between all pairs of points within the point cloud. To compute point separations, one of the first tasks is to determine the norm that will be used to define separation. We chose the classic ${\rm L}_2$ norm (Euclidean distance), but other norms such as ${\rm L}_1$ (taxicab) or ${\rm L}_3$ and higher are also possible. The different norms will produce different values of separation, but, because we are interested in histograms of the separations, it may not make a difference which norm is used. The Euclidean distance is the length of the vector connecting two points forming a point pair. The vector connecting any two points $R$ and $Q$ is given by

$$\mathbf{d}_{RQ} = (x_R - x_Q)\,\hat{\mathbf{x}} + (y_R - y_Q)\,\hat{\mathbf{y}} + (z_R - z_Q)\,\hat{\mathbf{z}}, \tag{2}$$

where $(x_R, y_R, z_R)$ and $(x_Q, y_Q, z_Q)$ are the Cartesian coordinates of the two points, and the point separation is the length $\|\mathbf{d}_{RQ}\|_2$ of this vector.

We used the following process to investigate the utility of point separations as a viewpoint-independent representation. For a particular view, we define a cross-range resolution and follow the binning process described in Section 2. Once the point cloud data are binned, we compute a distribution of ${\rm L}_2$ norm point separations between all pairs of points. The point-separation distribution is created by listing all points $i = 1, \ldots, N$ of the binned data and then computing the Euclidean distance between point $i$ and all points $j \gt i$. From this list of Euclidean distances, a histogram is formed of the number of times each point separation occurs. We then normalize the histogram to have unit area, thus creating a probability density function (PDF) for the random variable “point separation,” which is a 1D representation of the 3D point cloud. We note that the computation of the point-separation histogram feature is a bit more complex than computing the range histogram, but the feature is still derived directly from the point cloud data.
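
The pairwise computation described above can be sketched with NumPy as follows. The function name `point_separation_pdf`, the default 10 cm bin width, and the optional `max_sep` argument (used to give every view the same bin edges) are assumptions for illustration rather than settings taken from this work.

```python
import numpy as np

def point_separation_pdf(points, bin_width=0.10, max_sep=None):
    """Unit-area histogram of L2 separations between all point pairs (i, j > i)."""
    pts = np.asarray(points, dtype=float)
    if len(pts) < 2:
        raise ValueError("need at least two points to form a separation")
    i, j = np.triu_indices(len(pts), k=1)           # every pair with j > i
    sep = np.linalg.norm(pts[i] - pts[j], axis=1)   # n(n-1)/2 Euclidean distances
    if max_sep is None:
        max_sep = sep.max()
    # Bins start at zero; separations beyond the last edge would be dropped.
    n_bins = int(np.floor(max_sep / bin_width)) + 1
    edges = bin_width * np.arange(n_bins + 1)
    counts, _ = np.histogram(sep, bins=edges)
    pdf = counts / (counts.sum() * bin_width)       # normalize to unit area
    return pdf, edges
```

For a cloud of a few thousand points the index arrays hold several million pairs, so a memory-conscious version would accumulate the histogram in chunks; the vectorized form is kept here for readability.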

For remote-sensing applications, the number of pixels that span an object of interest can be limited. As a result, the number of point separations can also be limited. We can estimate the number of point pairs by looking at vehicle sizes compared with the number of samples (voxels) within the vehicle. As an example, we consider the car with the voxel dimensions discussed in Section 2. With 10 cm pixel sizes, the car has ${\sim}50 \times 20 \times 15 = 15{,}000$ voxels; most of these are obscured for any particular view, so only a few thousand voxels are observed. For the case of a 50 cm cross-range pixel, there are only a few hundred voxels on the surface; however, the number of point-separation pairs goes as ${\sim}n^2$, so there are many point separations from which to form a histogram.
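
To make the scaling concrete, the number of distinct pairs among $n$ observed voxels is $n(n-1)/2$. The values of $n$ below are illustrative stand-ins for "a few hundred" and "a few thousand" voxels, not counts measured in this work:

$$N_{\rm pairs} = \frac{n(n-1)}{2}: \qquad n = 300 \;\Rightarrow\; N_{\rm pairs} = 44{,}850, \qquad n = 3000 \;\Rightarrow\; N_{\rm pairs} = 4{,}498{,}500.$$

Even at the coarsest sampling, then, the histogram would be populated by tens of thousands of separations.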

To assess the utility of point-separation PDFs in viewpoint independence, we formed point-separation PDFs for the 37 views discussed in Section 2. The 37 PDFs for the car are shown in Fig. 4 for two different cross-range pixel values. Note that we used a different number of levels in the histogram process for the two pixel sizes. As the pixel size increases, the number of pixels across the target decreases, thus reducing the number of point-separation pairs. The number of histogram levels was chosen to give a relatively smooth PDF. Visually, we see from the figures that the PDFs are similar over the 37 views. To quantitatively compare the PDFs, we again employ the JSD metric. Similarity matrices for the point-separation PDFs of the car and between the car and the forklift are shown in Fig. 5. By comparing these matrices with those shown in Fig. 3, we see that the point-separation PDFs have a mean JSD about two orders of magnitude smaller than the range PDFs. The point-separation PDFs maintain a much greater similarity across the views for a single vehicle but have a significant difference between vehicles. Because of the high degree of similarity within a vehicle, we feel justified in forming an average PDF for each vehicle, which now represents the vehicle regardless of the azimuthal viewpoint over the hemisphere. Figure 6 shows the average PDFs for both vehicles. The plots show the mean over 37 views at each histogram bin of the PDFs, with associated standard deviations. We note that the two averaged PDFs overlap significantly at only isolated points. One way to quantify the separation of the PDFs is to use a signal-to-noise ratio (SNR) definition similar to that used in quantifying the closure of an eye diagram in communication theory. We define the SNR as

## 4. PATTERN-RECOGNITION EXPERIMENT USING POINT-SEPARATION HISTOGRAMS

#### A. Averaged Histograms

Because the PDFs of the two vehicles have some separation, we use these PDFs in a simple two-class decision problem. In this problem, we use the average PDFs shown in Fig. 6 as templates. Then, a point-separation PDF of a random view (over a hemisphere) of the car is computed. It is important to note that we do not restrict the random view to be one of the 37 training views. Rather, it can assume any azimuthal angle over the hemisphere. The results are shown in Fig. 8, where point-separation PDFs of 100 random views (uniformly distributed over the azimuthal hemisphere) are overlaid on the templates of the two vehicles. We see the random views cluster on the car template with reduced overlap on the forklift template. We used a simple algorithm to decide which template label to assign to the random-view PDF. At each point-separation histogram bin, we found which template was closest to the value computed for the random PDF and then summed the number of times each template was chosen as the closest. The template with the largest count was declared the winner. Out of 100 random trials, all 100 were correctly labeled as “car.” The high percentage of correct decisions (100% for this experiment) indicates the stability of the point-separation feature with respect to an object’s viewing angle. Since the random view angles were not restricted to the training views, any view angle over the hemisphere appears to create a point-separation PDF that is highly invariant over viewpoint.
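
The template averaging and the bin-wise voting rule can be sketched as follows. The names `build_template` and `classify`, the dictionary of templates, and the handling of ties (the first template listed wins) are assumptions for illustration; tie handling is not specified in the description above.

```python
import numpy as np

def build_template(view_pdfs):
    """Mean and standard deviation over views at each histogram bin."""
    v = np.asarray(view_pdfs, dtype=float)      # shape (n_views, n_bins)
    return v.mean(axis=0), v.std(axis=0)

def classify(pdf, templates):
    """Label a PDF by counting, bin by bin, which template mean is closest.

    templates: dict mapping label -> mean PDF on the same bins as `pdf`.
    Returns the winning label and the vote count per label.
    """
    labels = list(templates)
    stack = np.stack([np.asarray(templates[k], dtype=float) for k in labels])
    closest = np.argmin(np.abs(stack - np.asarray(pdf, dtype=float)), axis=0)
    votes = np.bincount(closest, minlength=len(labels))
    return labels[int(np.argmax(votes))], dict(zip(labels, votes.tolist()))

# Toy four-bin example with made-up numbers (not data from this work).
templates = {"car": [0.1, 0.4, 0.3, 0.2], "forklift": [0.4, 0.1, 0.2, 0.3]}
label, votes = classify([0.12, 0.38, 0.31, 0.19], templates)
print(label, votes)   # car {'car': 4, 'forklift': 0}
```

The rule uses only the template means; the per-bin standard deviations returned by `build_template` are available for weighting the votes, which this sketch does not attempt.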

To test the ability of point-separation PDFs to separate vehicles that are more similar, we modified the forklift by removing the forks. We then performed the two-class decision using the forklift and the “forkless” lift. The results are shown in Fig. 9. As expected, the templates for the forklift and modified forklift are similar, with most of the difference in the PDFs at the higher point-separation values. Nonetheless, the modified forklift was correctly classified 98 out of 100 times for a pixel size of 10 cm and 100 out of 100 times for a pixel size of 30 cm. The small differences in the PDFs appear to be sufficient to allow classification.

#### B. Point Position Noise

Thus far, we have assumed no noise in the point cloud data. That is, the points used to calculate the averaged templates have the same values as the points used to obtain the histograms/PDFs from random single views. In an actual implementation, the point cloud data used to train the template would be obtained from measurements of a particular vehicle or potentially from computer-aided design (CAD) models. Measurements in the field would most likely come from different vehicles/sources than those used to train the templates. Therefore, it is possible that the points from the field measurement would have slightly different values than those used in template generation. To consider robustness to variation in the points, we want to include fluctuation in the surface position attributed to a particular point in our data. We consider all of space to be filled with voxels with dimension and orientation defined by the optic axis and capabilities of the sensor. The object being investigated will then populate the voxels with either a one (unobscured surface with respect to the sensor) or zero (either no object surface present in that voxel or the surface is obscured from the sensor for that particular view). To induce noise in point location, it is necessary to artificially include some point variability. Because the data are grouped into voxels of finite size, the variability only becomes significant if the variability of the point locations is enough to cause the surface location to move from one voxel location to some neighboring voxel location. For the work here, we assume that the change in location is at most one voxel in each dimension. We assume a uniform likelihood of motion in each dimension. We apply this motion to a percentage of the points, i.e., only a percentage of the points have sufficient error to cause them to move from one voxel to another. 
Keep in mind that the voxels to some extent already include point location errors in terms of the sensor resolution capabilities.
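
One possible reading of this noise model is sketched below on integer voxel indices. The function name `jitter_voxels` and both parameters are assumptions: `fraction` is the share of voxels eligible to move, and `p_move` is the probability that an eligible voxel moves by one voxel (in either direction with equal likelihood) along each axis. The source does not give these settings, so the defaults are illustrative only.

```python
import numpy as np

def jitter_voxels(voxel_idx, fraction=1.0, p_move=0.5, rng=None):
    """Shift a random subset of occupied voxels by at most one voxel per axis."""
    rng = np.random.default_rng(rng)
    idx = np.asarray(voxel_idx, dtype=np.int64)
    n = len(idx)
    chosen = rng.random(n) < fraction            # which voxels may move at all
    moves = rng.random((n, 3)) < p_move          # does each axis move?
    signs = rng.choice([-1, 1], size=(n, 3))     # direction of each move
    shift = moves * signs * chosen[:, None]
    # Occupancy is binary, so voxels that land on the same index merge.
    return np.unique(idx + shift, axis=0)
```

With these assumed defaults, the expected share of voxels displaced in at least one dimension is $1 - (1 - 0.5)^3 = 87.5\%$; other combinations of `fraction` and `p_move` give other percentages.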

Results in Fig. 10 are for noise applied to 88% of the voxels, i.e., 88% of the voxels are shifted by one voxel in at least one dimension (some may be shifted in two or three dimensions). Even with this noise in surface location with respect to the sensor, the point-separation histogram appears relatively stable, and we obtained 100 correct decisions out of 100 trials for pixel sizes up to 30 cm. There is a slight reduction in the number of correct decisions at a pixel size of 40 cm. At a 50 cm pixel size, the rate of correct decisions is 90%.

We repeated the above analysis using noisy forklift data with the “forkless” lift as the template. Results are shown in Fig. 11. As can be seen, the ability to separate more similar objects is degraded with noise and with increasing pixel size. Nonetheless, there is still appreciable capability (75% correct decisions out to pixel sizes of 30 cm).

## 5. DISCUSSION

Point clouds offer different possibilities for data exploitation. The geometric information contained within the point clouds can be used to reduce the dimensionality of the data by looking at histograms of features derived from the geometric information. These features include the spatial averaging one would expect from realistic sensors operating at longer ranges. In this paper, we have looked at two 1D features that can be extracted from point cloud data. The first, range histograms, requires only the formation of range images at a specified sampling and specified viewpoint. Computation of the range histogram then involves tabulating the values in the pixels prior to the histogram process. For the MATLAB code used in this work, with no attempt to optimize speed, the time to compute the histogram with about 3500 points was about 0.19 s per view. Visually, the range histograms vary with viewpoint. Histograms of point separations remove the impact of sensor origin on the data and offer the potential for viewpoint independence. Development of point-separation histograms involves the same formation of range images but then requires on the order of $n(n-1)/2$ computations of point separations. The time to compute the point-separation histogram was about 0.31 s per view, which we do not consider an excessive increase in computation time with respect to the range histograms. We must point out that the rotation of the point cloud data, binning of the data, and development of range images in the above computations are required only during the template training phase. In application to remote sensing, as point clouds are generated, the data would be immediately used to create point-separation histograms and compare them with the templates. This process can occur in real time as each point of the cloud is gathered.
This feature is in contrast with other approaches in the references that require full point cloud generation prior to computation or require the full algorithm to be performed on the cloud.

Of course, the compute time for the sensed data could be reduced by gathering fewer points. In a sensing application using only one view, only a fraction of the total potential point cloud would be gathered. The invariance of the feature to viewpoint enables gathering only single views. To address the impact of reducing the number of points in the point cloud, we apply a Poisson random number generator to a point cloud from a single view. We set the rate parameter to a mean of 100 points per point cloud; this reduces the number of points to about 10%–20% of the original number (recall that the binning process described in Section 2 already reduces the number of points derived from the original data). We use a Poisson random generator because it emulates a photon-counting detection process applied to a remote-sensing application. Because we use a random process to decimate the point cloud, we plot in Fig. 12 the mean and standard deviation of 100 realizations of point-separation PDFs resulting from the reduced number of points. We compare these results with the PDF from the higher-density point cloud. We can see that, even with a significant reduction in the number of points (80%–90% reduction), the point-separation PDF is a good approximation to the PDF resulting from the higher-density point cloud. We feel this shows the potential robustness of the point-separation feature.
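
The decimation step can be sketched as below. How the Poisson generator was applied to the cloud is not spelled out, so this sketch makes an explicit assumption: the number of retained points is drawn from a Poisson distribution with the stated mean, and that many points are then selected uniformly at random without replacement. The function name `poisson_decimate` is hypothetical.

```python
import numpy as np

def poisson_decimate(points, mean_points=100, rng=None):
    """Keep a Poisson-distributed number of randomly selected points."""
    rng = np.random.default_rng(rng)
    pts = np.asarray(points)
    n = len(pts)
    k = min(int(rng.poisson(mean_points)), n)     # cannot keep more than we have
    keep = rng.choice(n, size=k, replace=False)   # uniform selection, no repeats
    return pts[keep]

# Hypothetical usage: 100 realizations from one binned view.
# realizations = [poisson_decimate(view_points, 100, rng=seed) for seed in range(100)]
```

An alternative with the same photon-counting flavor would draw an independent Poisson count for each voxel and keep the voxels with at least one count; either choice yields a retained fraction that fluctuates from realization to realization.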

We have done preliminary work on the impact of reduced numbers of points on recognition performance [18]. A more thorough analysis is beyond the scope of this paper, as it will likely require more complex recognition algorithms and possibly more complex template training. Recognition using sparse point clouds will be the subject of a future publication.

Results presented here validate viewpoint independence, both visually and with the JSD metric, using realistic voxel sizes at the object. For relatively dissimilar objects (car and forklift), we found minimal impact on recognition performance as a function of pixel size. For more similar objects (forklift and “forkless” lift), the impact of pixel size was more prominent but not debilitating. Considerable additional work is required to quantify the robustness of point separations as a representation of point cloud data. In particular, robustness to vehicle articulation (changes in position/orientation of parts of the vehicle) needs to be investigated. The presence of spurious data points from incomplete segmentation and from poor background removal should also be investigated.

A potential approach to investigating these additional considerations would be to apply deep learning to point clouds, such as in [19]. In that reference, the authors consider a point cloud as an unordered ensemble of vectors. Here, we have replaced their vector ensemble with another ensemble of vectors defined by Eq. (2). This new ensemble maintains the properties they list as defining: it is unordered, maintains interaction between points, and is invariant under certain transformations. Although it is beyond the scope of this work, the point-separation vectors could possibly be used as input for deep-learning algorithms. Point separation appears to be invariant to point of view, so the training data required for a neural network could be greatly simplified. Deep-learning development could be used to address the issues cited above, particularly the articulation problem.

## Funding

Air Force Research Laboratory (FA865018C1073).

## Acknowledgment

Helpful discussions with Vince Velten are gratefully acknowledged. Support from the Decision Sciences Branch of Sensors Directorate, Air Force Research Laboratory is acknowledged.

## Disclosures

The authors declare no conflicts of interest.

## Data Availability

No data were generated in the presented research.

## REFERENCES AND NOTES

**1.** V. Molebny, P. McManamon, O. Steinvall, T. Kobayashi, and W. Chen, “Laser radar: historical
prospective—from the East to the West,” Opt.
Eng. **56**, 031220 (2016).

**2.** W. E. Clifton, B. Steele, G. Nelson, A. Truscott, M. Itzler, and M. Entwistle, “Medium altitude airborne Geiger-mode mapping LIDAR system,” Proc. SPIE **9465**, 946506 (2015).

**3.** Y. Wang, H. Weinacker, and B. Koch, “A lidar point cloud based procedure for vertical canopy structure analysis and 3D single tree modelling in forest,” Sensors **8**, 3938–3951 (2008).

**4.** F. A. Sadjadi, ed., “Automatic Target Recognition XI,” Proc. SPIE **4379** (2001).

**5.** J. Hecht, “Lidar for self-driving cars,” Opt. Photon. News **29**, 26–33 (2018).

**6.** Ground and obstacle removal can be accomplished using available MATLAB routines, e.g., https://www.mathworks.com/help/driving/ug/ground-plane-and-obstacle-detection-using-lidar.html.

**7.** P. Lutzmann, R. Frank, and R. R. Ebert, “Laser-radar-based vibration imaging of remote objects,” Proc. SPIE **4035**, 436–443 (2000).

**8.** An early reference on range images is M. Rioux, J. A. Beraldin, M. O’Sullivan, and L. Cournoyer, “Eye-safe laser scanner for range imaging,” Appl. Opt. **30**, 2219–2223 (1991).

**9.** E. A. Watson, “Image quality metrics for non-traditional imagery,” in *OSA Frontiers in Optics* (2014), paper FTu5E.4.

**10.** M. A. Neifeld, “Information, resolution, and space-bandwidth product,” Opt. Lett. **23**, 1477–1479 (1998).

**11.** P. McManamon, P. Banks, J. Beck, D. Fried, A. Huntington, and E. Watson, “A comparison of flash lidar detector options,” Opt. Eng. **56**, 031223 (2017).

**12.** Q. Wang, L. Wang, and J. Sun, “Rotation-invariant target recognition in LADAR range imagery using model matching approach,” Opt. Express **18**, 15349–15360 (2010).

**13.** Z. Lu and S. Lee, “Probabilistic 3D object recognition and pose estimation using multiple interpretations generation,” J. Opt. Soc. Am. A **28**, 2607–2618 (2011).

**14.** Z. Liu, Q. Li, Z. Xia, and Q. Wang, “Target recognition of ladar range images using even-order Zernike moments,” Appl. Opt. **51**, 7529–7536 (2012).

**15.** J. García, J. J. Valles, and C. Ferreira, “Detection of three-dimensional objects under arbitrary rotations based on range images,” Opt. Express **11**, 3352–3358 (2003).

**16.** https://www.faro.com/products/construction-bim/faro-focus/.

**17.** J. Lin, “Divergence measures based on the Shannon entropy,” IEEE Trans. Info. Theory **37**, 145–151 (1991).

**18.** E. A. Watson, “Viewpoint-independent object recognition using photon-counted point clouds,” in *Laser Congress*, OSA Technical Digest (2019), paper LTu5B.3.

**19.** C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: deep learning on point sets for 3D classification and segmentation,” arXiv:1612.00593v2 [cs.CV] (2017).