## Abstract

The use of the Mahalanobis distance in a lookup table approach to retrieval of in-water Inherent Optical Properties (IOPs) led to significant improvements in the accuracy of the retrieved IOPs, as high as 50% in some cases, with an average improvement of 20% over a wide range of case II waters. Previous studies have shown that inherent noise in hyperspectral data can cause significant errors in the retrieved IOPs. For LUT-based retrievals that rely on spectrum matching, the particular metric used for spectral comparisons has a significant effect on the accuracy of the results, especially in the presence of noise in the data. In this study, we have compared the Euclidean distance and the Mahalanobis distance as metrics for spectral comparison. In addition to providing justification for the preference of the Mahalanobis Distance over the Euclidean Distance, we have also included a statistical description of noisy hyperspectral data.

© 2013 OSA

## 1. Introduction

Historically, most of ocean color analysis has centered on estimating chl-*a* concentration (as a proxy for algal biomass) from satellite data. While the determination of algal biomass is a key aspect of monitoring the ecological status of a water body, several scientific and military purposes require analysis that goes beyond merely estimating chl-*a* concentration and requires the retrieval of other optically active components in water and the characteristics of the substratum. For example, determining the sediment concentration in the surface layer helps assess the availability and transmission of light into deeper layers, which is a crucial parameter in deriving energy budgets for primary production studies; characterizing the bottom type and bathymetry is important for many purposes, such as maritime navigation, underwater geologic studies, global climate studies, habitat management, seafood safety control, coastal erosion control, etc.

Numerous empirically-driven and analytically-driven algorithms have been developed for retrieving in-water and under-water characteristics from satellite data, with varying degrees of success in the achieved accuracy. Contrary to open-ocean waters, the coastal, estuarine, and inland waters are often optically complex due to the relative abundance of other optically active components besides phytoplankton, and are conventionally categorized as Case II waters [1]. The optical properties of Case II waters are not directly correlated to the concentration of phytoplankton alone. Therefore, spectral algorithms for retrieving a particular biophysical parameter in such waters are prone to effects due to variations in other parameters. For example, blue-green algorithms for estimating chlorophyll-*a* (chl-*a*) concentration do not perform well in Case II waters due to absorption by colored dissolved organic matter (CDOM) in the blue spectral region (e.g., [2, 3]). Algorithms relying on reflectances in the red and near infrared (NIR) regions have been preferred for retrieving chl-*a* concentration in Case II waters because of the diminished effects of absorption by CDOM and scattering by Suspended Particulate Matter (SPM) in the red and NIR wavelengths (e.g., [4, 5]). Nevertheless, algorithms based on band ratios in the red and NIR wavelengths still assume uniform absorption and scattering by non-algal components at the wavelengths of interest. This assumption breaks down in cases of low chl-*a* and high SPM concentrations, causing significant errors in the estimated chl-*a* concentration (e.g., [6]). Modified NIR-red algorithms have been developed to handle the non-uniform scattering by SPM in the red and NIR regions [7, 8]. Even though such algorithms have been shown to yield accurate results for very highly turbid waters from specific geographic regions, their applicability to waters from various geographic regions with varying biophysical characteristics is yet to be proven.

Comprehensive analysis of coastal, estuarine, and inland ecosystems require the retrieval of multiple in-water and under-water characteristics. Algorithms based on a Look-Up-Table (LUT) approach (e.g., [9]) can retrieve multiple parameters simultaneously and, when properly designed, are less prone to adverse effects of the dominance of a few parameters over the rest. The Naval Research Laboratory in Washington, D.C. has developed a LUT-based Coastal Water Spectral Toolkit (CWST) for retrieving Inherent Optical Properties (IOPs) of water, constituent concentrations, sediment type, bottom type, bottom reflectance, and bottom depth from remotely sensed data. The CWST essentially employs a spectrum-matching approach that is based on spectral distances calculated between the input spectrum and modeled spectra in a large database. The accuracy and reliability of this approach depend significantly on the metric used for calculating the spectral distances. In this paper, the use of the Euclidean and the Mahalanobis distances are compared as the metric used in choosing the best-matching spectrum, and demonstrate the benefits of using the Mahalanobis distance instead of the Euclidean distance.

## 2. Experimental description and methodology

#### 2.1 Look-Up-Table (LUT) based IOP retrieval

In order to implement a LUT-based model, one must generate in advance a large number of modeled remote sensing reflectance (${R}_{rs}$) spectra, covering a wide range of bio-optical parameters, to ensure that each image spectrum can be matched to a modeled spectrum within acceptable limits of accuracy. The CWST contains a very large database of modeled spectra, generated using a wide variety of IOP parameters, bottom reflectance, depths, sediment types, and phase functions. The forward model used for generating the spectra is Ecolight, which is a simplified version of the radiative transfer model Hydrolight [10, 11]. The full set of modeled data, currently comprising approximately 20 million unique ${R}_{rs}$ spectra, is stored in a database for warehousing.

To run the IOP retrieval for a given image or set of individual spectra, the database is first reduced by limiting the ranges of the various input parameters to those appropriate for the input spectra (for example, bottom types that are known to not exist in the region from which the input spectra were collected are omitted). This reduction in modeled spectra both increases the speed of the analysis and reduces the chance of incorrect matches that might occur due to unusual combinations of input parameters that aren’t realistic for this region. Each input spectrum is then compared to each of the modeled spectra in the (reduced) database, and the spectral distance (according to some metric) is calculated. The modeled spectrum that is closest to the input spectrum is considered the best match, and the (known) parameters corresponding to this spectrum are then assigned to the input spectrum.

#### 2.2 Sensor noise modeling

In very general terms, the process of measuring a hyperspectral spectrum in a CCD array is a means of converting incoming light (photons) into ‘digital numbers’; sensor calibration is then used to turn the digital numbers into radiance values. The total measured signal is a sum of the incoming light and any sensor-generated noise. The latter terms include dark noise, read noise, and digitization noise, and are independent of the incoming signal strength; accurate characterization of these terms for a given sensor is possible in the laboratory. The incoming light is essentially a count of photons per unit time; this count will vary as a Poisson distribution with a standard deviation equal to the mean number of photons received. This variation is usually referred to as photon (or ‘shot’) noise. When the incoming light is significantly large, the photon noise dominates the sensor noise and the data are said to be ‘shot-noise limited’.

In this paper, we simulate real-world noisy measurements using the sensor model for NRL’s Hyperspectral Imager for the Coastal Ocean (HICO), which has been operating continuously aboard the International Space Station since October 2009. The sensor is a 512 x 512 CCD array with a spectral range of 350-1080 nm and a spectral channel width of 5.73 nm; the ground sampling distance is approximately 90m at nadir. Detailed characterizations of the sensor noise for HICO are available in [12]; a much more detailed description of the noise modeling algorithm can be found in [13].

Only simulated data are used in the paper. Ideally, one would prefer to use actually measured field data. However, there is limited ability to collect a sufficient number of *in situ* measurements over a wide range of biophysical conditions to make statistically meaningful inferences. Through simulations it is possible to generate large data sets over any number of conditions at intervals that are narrow enough to permit investigation of even small biophysical changes in water and still provide a reasonable estimate of how the algorithm will perform with field data.

To generate noisy samples, the following three-step procedure is used: starting with a given ${R}_{rs}$ spectrum$\rho $, an at-sensor radiance spectrum, $L$, is generated by using a forward version of the Tafkaa atmospheric correction algorithm [14, 15]. Next, a ‘noisy’ radiance spectrum ${L}_{n}=L+\eta $ is generated by adding a zero-mean Gaussian noise spectrum $\eta $; the variance of the noise is determined by the noise model described in [13] and is independent of the wavelengths. Finally, Tafkaa is used to atmospherically correct the data and produce a noisy ${R}_{rs}$ spectrum ${\rho}_{n}$.

Note that, by construction, the noisy radiance spectra form a Gaussian distribution with a mean roughly equal to the noise-free spectrum and a covariance matrix ${\text{\Sigma}}_{R}$; since the noise is uncorrelated among wavelengths, the covariance matrix is diagonal. In general,, the atmospheric correction algorithm is, to a very good approximation, an affine mapping (that is, ${\rho}_{n}=A{R}_{n}+b$ for some matrix $A$ and vector $b$), it follows that the distribution of the noisy remote sensing reflectance ${\rho}_{n}$will also be Gaussian, with a mean equal to the noise-free reflectance spectrum $\rho $ and a covariance matrix $\text{\Sigma}=A{\text{\Sigma}}_{R}{A}^{t}$. If we further assume that the matrix $A$ is diagonal – that is, no inelastic terms are included in the atmospheric correction – then it follows that that the reflectance covariance matrix $\text{\Sigma}$ is also diagonal.

#### 2.3 Distance metrics

In order to run the CWST IOP retrieval, some metric must be defined in order to calculate the distance between the input and the database spectra. The most traditional choice is the standard least-squares (or ${L}_{2}$) distance,

An alternative metric is the statistical-based Mahalanobis distance (MD) [16], which measures the distance from a vector $x$ to a given (multivariate) distribution (or set of points) $Y$. Formally, if the mean and covariance of $Y$ are $\mu \text{and\Sigma}$, respectively, then the Mahalanobis distance ${d}_{M}\left(x\right)$ of $x$to $Y$ is given by

For two vectors $x,y$, the MD can be generalized asIf the covariance matrix $\text{\Sigma}$ is diagonal, with diagonal entries ${\sigma}_{1}^{2},\dots ,{\sigma}_{n}^{2}$, then the MD formula simplifies to## 3. Results

#### 3.1 IOP retrievals

The general setup of our experiment consists of first generating noisy spectra from a given ${R}_{rs}$input, as described in Sec. 2.2 and running the LUT-based IOP inversion (Sec. 2.1) using both the ${L}_{2}$ and Mahalanobis distances and comparing the results.

To generate the noisy data, we began with a set of 52 optically-deep ${R}_{rs}$spectra generated by Ecolight over various levels of chl-*a*, CDOM, and SPM. All other parameters (phase functions, sediment type, wind speed, etc.) were set fixed for all generated spectra; each spectrum was generated on a grid of 68 HICO wavelengths within the range 405 – 789 nm. Each of the 52 noise-free spectra was then input into the noise model (Sec. 2.2) and for each input 1000 noisy ${R}_{rs}$ samples generated, an example is shown in Fig. 1. In all cases, the atmospheric conditions were held constant; a complete description of the atmospheric parameters may be found in [13]. We also calculated the associated (sample) covariance matrix for each input for use in the Mahalanobis distance calculations.

Next, we extracted a subset of the existing ${R}_{rs}$ library spectra from the full database. To keep the comparison as simple as possible, we followed the model used to generate the noisy data and only used optically deep spectra with varying amounts of chl-*a*, CDOM, and SPM; the sediment type and scattering phase functions for both SPM and chl-*a* were also varied; all other parameters (wind speed, solar zenith angle, etc.) were fixed to the same values as for the noisy input. An approximate overview of the various ranges and discretization values used in the LUT for CWST is given in Table 1; 261,281 modeled spectra were used in the comparison. The database spectra were originally generated on a wavelength grid of 350 - 790 nm at 5 nm resolution; in order to compare the spectra in the database with the noise-modeled data, the former were resampled onto the HICO grid using cubic-spline interpolation.

The next step was to run the CWST IOP inversion for each of the 52,000 noisy spectra, once under each metric. To run the inversion, we simply ran a brute-force search on each input spectrum, calculating the distance between the input and every database spectra, and finding the minimum. The parameters corresponding to the database spectrum with the minimum spectral distance from the input spectrum were then assigned to the input spectrum, and the retrieved parameters were compared to the original, noise-free parameters. We note that no attempt was made to optimize the search, and all calculations were done in the most naïve way possible. Due to the additional matrix-vector product, the time to run to the Mahalanobis distance search was approximately twice as long; however, on a modern desktop, each version can be done quickly (on average, ${L}_{2}$search ran in approximately 0.15 seconds per input spectra while MD ran in approximately 0.30 seconds).

The main results are summarized in columns 4 and 5 of Table 2, which shows a simple count of how many of each run of 1000 noisy spectra exactly matched the input spectrum – that is, the retrieved values of all parameters were the same. As can be seen from the results, the Mahalanobis distance outperforms the standard ${L}_{2}$ distance for each of the 52 input spectra. The improvement ranged from 2 to 50%, with an average improvement of approximately 20%. It was noticed, somewhat surprisingly, that whenever the retrieved concentrations of chl-*a*, CDOM (expressed in terms of the absorption coefficient at 440 nm), and SPM were correct, the other three free parameters – sediment type and both phase functions – were also correct. It follows that the ‘number correct’ column is the same whether we compare only chl-*a*, CDOM and SPM or all six parameters.

The average relative error for the retrieved chl-*a*, CDOM and SPM values are shown in columns 7-12 of Table 2. The reported values are of the form

In each case, it is important to note that the average value will be biased by the discrete nature of the LUT; when constructing the database, a ‘step size’ must be chosen for each parameter. The correct step size to use is itself a research topic [17]; it is also important to remember that since the retrieval is done for all parameters simultaneously, the discretization grid of any one parameter will impact the retrieval of the others as well. A more complete analysis would include an examination of the distribution of each parameter in each input ‘bin’, as shown in Fig. 2. In general, the overall pattern was similar for each parameter; the majority of the retrieved values exactly matched the true value; the remaining values were within one or two step-units on either side (with a few occasional outliers).

#### 3.2 Theoretical justification

In order to gain an understanding of why the Mahalanobis distance outperforms ${L}_{2}$, an examination of how the noisy data and database spectra are distributed under these two metrics was performed.

From Sec. 2.2, it is reasonable to expect that the generated noisy spectra $X=\left\{{x}_{1},{x}_{2},\dots ,{x}_{1000}\right\}$ follow, to a very good approximation, a multivariate Gaussian distribution with mean $\mu =\rho $ equal to the noise-free input, and diagonal covariance matrix $\text{\Sigma}$.

To test this, the distribution of the distances $d\left({x}_{i},\rho \right)$ between the noise-free and noisy spectra, $\rho $ and ${x}_{i},$ was calculated for both the Mahalanobis and ${L}_{2}$metrics, and the experimental results were compared with the expected theoretical values.

If the data are assumed Gaussian, then it can be shown [18] that the MD reduces to the sum of squared normal variables and therefore has a chi-squared distribution with *n* degrees of freedom, where *n* is the number of wavelengths.

Similarly, under the same assumptions, the ${L}_{2}$ distance Eq. (1) can be rewritten as

An example of the actual and fitted distributions for one of the input spectra is shown in Fig. 3 below. We note that each of the other 51 noise-free spectra lead to fits that are qualitatively similar to the example shown; although not necessarily conclusive, it appears from this that the multivariate Gaussian distribution for the noisy data is reasonable.

An alternative, more geometrical view of the two distances can be obtained by examining how the data are distributed in the spectral space. In general, the data will have a higher ‘spread’ along bands that have more noise, as shown in Fig. 4; as a result, the set of noisy data tends to be more of an ellipsoid than a perfect sphere. From Eq. (4), for a fixed value the Mahalanobis distance defines an ellipsoid with axial lengths given by the band variances; in contrast, for a fixed distance ${L}_{2}$ defines a sphere that is equidistant along each band as shown in Fig. 4. Intuitively, the MD distance does a much better job of modeling the true spread of the noisy data within the spectral space.

With this in mind, it is useful to examine how many of the database spectra lie within the same ‘range’ of the noise-free spectrum as the noisy spectra. From above, we know we can estimate the distribution of the noisy spectra as a chi-squared (MD) or normal distribution (${L}_{2}$); from this, we can estimate the range within which a given percentage of the noisy data must fall. For example, under the normal distribution, we can estimate that approximately 85% of the data are less than one sigma above the mean, and about 99% of the data are less than $\mu +2.33\sigma $. Similar results for the chi-squared distribution with a given degree of freedom can be found via tables or software-based cumulative distribution functions. In Table 3, we show the total number of noisy and database spectra that lie within the 85 / 99% ranges of the noisy data; an example is shown in Fig. 4 (spectral space) and Fig. 5 (distance). In each case, it can be seen that the number of ‘incorrect’ database spectra within the noise range drops significantly under MD; intuitively, this implies that the noisy data are ‘closer’ to the true noise-free spectrum under MD than under ${L}_{2}$ and that the likelihood of the noisy data being closest to the truth is higher.

## Summary and future work

Noise, including sensor and shot noise, is inescapable in real-world data, and has been shown previously [13] to strongly affect the accuracy of retrieved in-water constituents. In this work we have shown that statistical modeling, including the use of the Mahalanobis distance to compare spectra, can partially offset this effect and lead to improved retrievals.

Realistic sensor models were used to simulate noisy data; in particular, extensive simulations were required to derive the (full) covariance matrix needed to use the Mahalanobis distance. In order to reduce the computational complexity, a more efficient method of estimating the (diagonal) covariance is needed; we have recently begun an examination of this problem and expect to present these results in the near future. We also note that an accurate description of the noise levels and how they affect retrievals may be used to help guide future sensor design.

## References and links

**1. **A. Morel and L. Prieur, “Analysis of variations in ocean color,” Limnol. Oceanogr. **22**(4), 709–722 (1977). [CrossRef]

**2. **R. P. Bukata, J. H. Jerome, K. Y. Kondratyev, and D. V. Pozdnyakov, *Optical Properties and Remote Sensing of Inland and Coastal Waters* (CRC Press, 1995).

**3. **K. L. Carder, R. G. Steward, G. R. Harvey, and P. B. Ortner, “Marine humic and fulvic acids: their effects on remote sensing of ocean chlorophyll,” Limnol. Oceanogr. **34**(1), 68–81 (1989). [CrossRef]

**4. **A. Gitelson, “The peak near 700 nm on radiance spectra of algae and water - relationships of its magnitude and position with chlorophyll concentration,” Int. J. Remote Sens. **13**(17), 3367–3373 (1992). [CrossRef]

**5. **G. Dall’Olmo and A. A. Gitelson, “Effect of bio-optical parameter variability on the remote estimation of chlorophyll-a concentration in turbid productive waters: experimental results,” Appl. Opt. **44**(3), 412–422 (2005). [CrossRef] [PubMed]

**6. **Y. Z. Yacobi, W. J. Moses, S. Kaganovsky, B. Sulimani, B. C. Leavitt, and A. A. Gitelson, “NIR-red reflectance-based algorithms for chlorophyll-*a* estimation in mesotrophic inland and coastal waters: Lake Kinneret case study,” Water Res. **45**(7), 2428–2436 (2011). [CrossRef] [PubMed]

**7. **C. Le, Y. Li, Y. Zha, D. Sun, C. Huang, and H. Lu, “A four-band semi-analytical model for estimating chlorophyll a in highly turbid lakes: the case of Taihu Lake, China,” Remote Sens. Environ. **113**(6), 1175–1182 (2009). [CrossRef]

**8. **W. Yang, B. Matsushita, J. Chen, T. Fukushima, and R. Ma, “An enhanced three-band index for estimating chlorophyll-*a* in turbid case-II waters: case studies of Lake Kasumigaura, Japan, and Lake Dianchi, China,” IEEE Geosci. Remote Sens. Lett. **7**(4), 655–659 (2010). [CrossRef]

**9. **C. D. Mobley, L. K. Sundman, C. O. Davis, J. H. Bowles, T. V. Downes, R. A. Leathers, M. J. Montes, W. P. Bissett, D. D. R. Kohler, R. P. Reid, E. M. Louchard, and A. Gleason, “Interpretation of hyperspectral remote-sensing imagery by spectrum matching and look-up tables,” Appl. Opt. **44**(17), 3576–3592 (2005). [CrossRef] [PubMed]

**10. **C. D. Mobley, “A numerical model for the computation of radiance distributions in natural waters with wind roughened surfaces,” Limnol. Oceanogr. **34**(8), 1473–1483 (1989). [CrossRef]

**11. **C. D. Mobley and L. K. Sundman, *Hydrolight 5 Ecolight 5 technical documentation*, 1st Ed., (Sequoia Scientific Inc., 2008).

**12. **R. L. Lucke, M. Corson, N. R. McGlothlin, S. D. Butcher, D. L. Wood, D. R. Korwan, R. R. Li, W. A. Snyder, C. O. Davis, and D. T. Chen, “Hyperspectral Imager for the Coastal Ocean: instrument description and first images,” Appl. Opt. **50**(11), 1501–1516 (2011). [CrossRef] [PubMed]

**13. **W. J. Moses, J. H. Bowles, R. L. Lucke, and M. R. Corson, “Impact of signal-to-noise ratio in a hyperspectral sensor on the accuracy of biophysical parameter estimation in case II waters,” Opt. Express **20**(4), 4309–4330 (2012). [CrossRef] [PubMed]

**14. **B. C. Gao, M. J. Montes, Z. Ahmad, and C. O. Davis, “Atmospheric correction algorithm for hyperspectral remote sensing of ocean color from space,” Appl. Opt. **39**(6), 887–896 (2000). [CrossRef] [PubMed]

**15. **M. J. Montes, B. C. Gao, and C. O. Davis, “A new algorithm for atmospheric correction of hyperspectral remote sensing data,” Proc. SPIE, Geo-Spatial Image and Data Exploitation II, W. E. Roper (ed.), **4383**: 23–30 (2001). [CrossRef]

**16. **P. C. Mahalanobis, “On the generalized distance in statistics,” Proc. Natl. Inst. Sci. India **2**(1), 49–55 (1936).

**17. **J. Hedley, C. Roelfsema, and S. Phinn, “Efficient radiative transfer model inversion for remote sensing applications,” Remote Sens. Environ. **113**(11), 2527–2532 (2009). [CrossRef]

**18. **K. V. Mardia, J. T. Kent, and J. M. Bibby, *Multivariate Analysis* (Academic Press, 2003).

**19. **P. Billingsley, *Probability and Measure, Third Ed.* (John Wiley & Sons, 1995).