This paper proposes new inversion algorithms for the estimation of Chlorophyll-a concentration (Chla) and the ocean’s inherent optical properties (IOPs) from the measurement of remote sensing reflectance (Rrs). With in situ data from the NASA bio-optical marine algorithm data set (NOMAD), inversion algorithms were developed by the novel gene expression programming (GEP) approach, which creates, manipulates and selects the most appropriate tree-structured functions based on evolutionary computing. The limitations and validity of the proposed algorithms are evaluated by simulated Rrs spectra with respect to NOMAD, and a closure test for IOPs obtained at a single reference wavelength. The application of GEP-derived algorithms is validated against in situ, synthetic and satellite match-up data sets compiled by NASA and the International Ocean Color Coordinate Group (IOCCG). The new algorithms are able to provide Chla and IOPs retrievals to those derived by other state-of-the-art regression approaches and obtained with the semi- and quasi-analytical algorithms, respectively. In practice, there are no significant differences between GEP, support vector regression, and multilayer perceptron model in terms of the overall performance. The GEP-derived algorithms are successfully applied in processing the images taken by the Sea Wide Field-of-view Sensor (SeaWiFS), generate Chla and IOPs maps which show better details of developing algal blooms, and give more information on the distribution of water constituents between different water bodies.
© 2015 Optical Society of America
The application of ocean-color remote sensing is promising with regard to its ability to allow for a high spatial and temporal assessment of important physical, chemical, and biological parameters of the global ocean from space . Deriving optical properties associated with these essential parameters from remotely sensed radiance above the sea surface, is known as the “inverse problem” in the field of ocean optics. Retrieval algorithms or inversion algorithms/models are applied to solve the inverse problem, which use inputs representing “ocean-color”  (Rrs, the ratio between the water-leaving radiance and downwelling irradiance, sr−1) measured by a satellite sensor at a limited number of spectral bands in the visible range. Typical outputs are the total phytoplankton content in terms of Chlorophyll-a concentration (Chla); and ocean inherent optical properties (IOPs) , such as the absorption and back-scattering characteristics for phytoplankton, detritus (non-algal particles) and colored dissolved organic matters (CDOM or gelbstoff). Estimation of the bio-optical properties through solving the inverse problem is of importance in understanding the related environmental dynamics at both local and global scales.
The various inversion algorithms can generally be classified into two different approaches based on whether a forward-model that physically describes the bio-optical relationships and radiative-transfer processes is or is not involved in the inversion process . First, model-based approaches solve the inverse problem by deriving algebraic results , or approximating suitable solutions using optimization or spectral matching methods, such as the Levenberg-Marquart approach , and nonlinear least-square schemes [7, 8]. If a representative model is available, the major benefit obtained from such approaches is that few training data are required to develop the algorithm. In order to obtain algebraic solutions or improve the efficiency of optimization, it is commonly necessary to simplify the forward-model, such as by the use of semi-analytical and quasi-analytical approximations [5, 6]. However, simplification may create unrepresentative models and lead to incorrect results, particularly when dealing with optically complex waters . Second, without the use of a predefined model, empirical or regression approaches can directly relate the inputs to the outputs by training with a set of known samples. Although it is indicated that the model-based approach may have a higher level of complexity and perhaps better accuracy for Case-2 waters , current operational algorithms used for mapping global ocean-color products are developed using a data-driven or empirical approach, mainly due to higher model transferability and lower computational load. The IOP products were mostly derived by model-based approaches or empirical algorithms developed by data generated from mechanistic models. The performance of empirical methods is highly dependent on the level of representativeness and complexity of the training data. Because the interactions between different IOPs are commonly not linear , bio-optical models are not globally the same , and bio-optical measurements often return imprecise and uncertain data [10, 11], the concepts of a universally applicable model or representative data are problematic in real oceans. This has led to the use of soft-computing techniques in the ocean-color community, to develop more reliable and robust algorithms to deal with such nonlinear systems. For example, recent work has applied artificial neural networks (ANNs) [12–17], support vector regression (SVR) [18, 19], cross-entropy method , active learning method , and evolutionary algorithms (EAs) [22, 23].
EAs are inspired by natural processes that drive evolution , such as inheritance, variation and the Darwinian struggle for survival. EAs have been widely used in many disciplines to solve complex problems, and are effective, robust and easy-to-implement methods that can search for global solutions to optimization problems . In general, EAs encompass all adaptive and computational models of natural evolutionary systems , such as genetic algorithms (GAs)  and genetic programming (GP) techniques . GAs are genotype-algorithms, in which the individuals are fixed-length linear strings and function as genomes (chromosomes). A GA works as a nonlinear optimization method in the model-based approach for solving the inverse problem [22, 23], in which candidate solutions are encoded as chromosomes and manipulated by natural selection operators, e.g., reproduction, crossover and mutation, and an individual’s fitness for survival is evaluated in decoded forms (using real values) by the predefined forward model and objective function. In contrast, a GP is a phenotype-algorithm, where the phenotypes of individuals are represented and manipulated as nonlinear entities of different sizes and shapes, commonly a tree-structured function . Solving the inverse problem with GP is also known as symbolic regression [27, 28], in which the most appropriate backward model for linking inputs and outputs is developed by directly optimizing a population of tree-structured functions through the related evolutionary processes. GPs have had significant success in creating innovative (i.e., commonly improved) functions and equations for industrial applications. However, in a Koza-style GP , both the searching operators and the fitness functions are based on the phenotype. Therefore, without the use of a genotype, illegal or invalid tree-structures may often be created in the selection process, which limits the optimization efficiency and algorithm transferability .
Several genotype-phenotype GP techniques which can manipulate individuals genetically as GA and can maintain the functional complexity of the phenotype, have recently been proposed, such as the grammatical evolution  and gene expression programming (GEP) techniques . A genotype-phenotype system makes a separation between an organism’s genotype and phenotype, which means that it is more compatible with how genetics actually operates in nature, and is more likely to provide reasonable, efficient and simple algorithms to solve the inverse problem. In GEP, the individuals are encoded as GAs, which are then translated into tree-structured functions, the gene expression trees (ETs), by a functional relationship between chromosomes and phenotype expressions. With the unequivocal translation system , any modification made in the genome can always result in correct ETs. However to date no research has been performed using the GEP to develop ocean-color retrieval algorithms.
With an attempt to gain the computational advantages from both the evolutionary algorithm and empirical approaches, this work develops new inversion algorithms for the retrieval of Chla and IOPs from remote sensing of ocean color using GEP. First, version 2 of the NASA bio-Optical Marine Algorithm Data set (NOMAD)  is examined to summarize all the available data points of Rrs(λ), Chla and IOPs. Second, available data pairs are sampled into two subsets: a training data set for the creation of inversion models, and a validation data set that is used to check the performance of the model. Third, the GEP technique is applied to develop six ocean-color inversion algorithms for the estimation of (1) Chla (mg/m3); (2) absorption coefficients at 443nm (m−1), including phytoplankton [aph(443)], detritus-gelbstoff [adg(443)], particulate matters [ap(443)], and total optically-active-substances [a(443)]; and (3) the particle backscattering coefficients at 555nm (m−1) [bbp(555)]. Finally, the application of these new models is validated against the in situ, synthetic, and Sea Wide Field-of-view Sensor (SeaWiFS) match-up data sets compiled by NASA and International Ocean Color Coordinate Group (IOCCG) . Most of the in situ data used in this study comes from locations relatively close to the coast (some are Case 2 waters), and the IOCCG synthetic data set comprises a wide range of parameters characterizing the global ocean (Case 1 waters). Compared to existing empirical- and model-based approaches, the inversion algorithms proposed in this study are clearly expressed functions which have higher transferability and can be readily applied, and they are capable of: (1) providing better retrievals for in situ data; (2) being used to process the SeaWiFS data at regional and global scales; and (3) generating ocean-color products which give more information on the distribution of water constituents.
2. Data sets
NOMAD is a collection of in situ coincident observation data of radiometric, phytoplankton pigment concentrations and bio-optical properties from NASA’s SeaWiFS Bio-optical Archive and Storage System (SeaBASS). NOMAD is also the most popular and representative publicly available data set used to support algorithm development and satellite product validation, having been cited more than 100 times in the ISI Web of Knowledge since 2005. The latest version of the NOMAD data set (updated on 18 July, 2008), referred as NOMADv2a, was used in this study for the development of inversion models through GEP. The IOP data with coincident ground measured values of Rrs(λi) available at the first five SeaWiFS bands (λi = 411, 443, 489, 510 and 555nm) are used in this work, resulting in the following numbers of corresponding data points: HPLC measured Chla is 1,210, aph(443) is 897, adg(443) is 788, ap(443) is 932, a(443) is 799 and bbp(555) is 332. Although studies have shown that the Rrs measured at 670nm is very important for inversion of coastal waters , it was not used in this study when developing the algorithm, because there is an insufficient number of accurate measurements in NOMADv2a. The frequency distributions of the six selected data pairs are shown in Fig. 1. The NOMADv2a data set contains some near coastal samples, but the majority of the data is distributed in the lower end of each histogram, representing clear waters. The six data pairs were sorted by increasing the levels of Chla and IOPs, and afterwards the odd- and even-numbered data were used separately as training and validation samples for the development of GEP inversion models. A limited number of bbp(555) observations were available, therefore an additional 56 data points were allocated to the training samples from the validation set, and thus two-thirds of the data were used for training. As shown in Table 1, there were 222 to 605 data points available to train the GEP algorithms.
The IOCCG prepared a synthetic and an in situ data set (hereafter referred as IOCCG synthetic and in situ data set) containing IOPs and the corresponding apparent optical properties (AOPs) for its Report Number 5 , in order to evaluate and compare the performance of a variety of retrieval algorithms commonly used in remote sensing practices. Note that the uncertainties in IOP measurements were designed into the IOCCG synthetic data by mapping one Chla to 25 different sets of IOPs based on theory and field data , resulting in 500 IOPs that are generated from 20 levels of Chla in the range of 0.03–30 mg/m3, and the corresponding Rrs(λ) simulated by Hydrolight for the sun at 30° and 60° from zenith, respectively. The IOCCG in situ data set comprises 656 cases including Chla, a(443), adg(443), aph(443) and Rrs(λ), which also originated from the SeaBASS data.
Another subset of NOMAD, released earlier as version 1.3 on 19 Sept 2005 (referred to as NOMADv13), consists of coincident observations of Rrs(λ) from SeaWiFS imagery, and was used to ensure the inversion models are also able to process real satellite data. Due to difficulties in meeting the criteria of satellite “match-up” , the sample size of NOMADv13 is relatively small, where 268, 123, 111, 123,111 and 28 cases are available for on-orbit validation of the derived Chla, aph(443), adg(443), ap(443), a(443) and bbp(555), respectively.
Overall, among the four data sets used, only the training samples (Table 1) in NOMADv2a are used to develop the inversion models, with the remainder used for validation. The IOCCG synthetic and in situ data sets are applied to evaluate and compare the performance of our inversion models and existing algorithms. NOMADv13 was used as satellite-matchup data for the validation of retrievals generated by GEP-derived inversion models, in which the inputs are the satellite observed Rrs(λ), rather than the ground measurements in other data sets.
3. Symbolic regression
The ocean color inverse problem can be regarded as an issue of symbolic regression , which requires finding a function in symbolic form that fits the relationships among given data points that consist of Rrs(λ) observed at a number of wavebands as the predictor variables, and the corresponding measurements of an ocean-color product as the variable to be predicted. Symbolic regression is also known as error-driven-evolution . In an ocean-color inverse problem, where the evolution of solutions identified by symbolic regression is driven by a quantifiable objective, namely fitness. The fitness evaluation function is commonly a measure of the deviation between the simulated and observed values. Compared to other conventional linear or nonlinear regression techniques, the major difference with symbolic regression is that no a priori or pre-defined functions/models are specified. In symbolic regression, the computational constraint is that all of the mathematical operators or logical expressions used in the function should be chosen from a pre-defined pool of mathematical expressions or symbols. In the first part of this section, the general procedures for solving the symbolic regression problem with GEP will be introduced. Particular attention will be given to demonstrate how the unequivocal translation system works. The steps and settings for the development of ocean-color inversion algorithms using the GEP symbolic regression approach will then be described in the second part.
3.1 GEP based symbolic regression
GEP is a genotype-phenotype genetic algorithm, in which the genotype of each individual is represented as a linear symbolic string (chromosome) with one or more fixed-length genes, as shown in Fig. 2(a); and the inferred phenotype (ET) is expressed nonlinearly in diagram form, as shown in Fig. 2(b). The computational strength of GEP is that not only can the ET be read straightforwardly from the “easily manipulated” linear string, but that it can also represent a real mathematic function [Fig. 2(c)] without losing the computational complexity. The process of transferring genetic information from a gene into an ET is carried out by the so-called translation system, in which several rules are applied to prevent the forming of illegal ETs and ensure that all possible messages or changes made in the gene can be directly translated into an ET. The rules that govern this are briefly described, as follows:
- 1. The head region of a gene [Fig. 2(a)] can contain symbols that represent both the functional and terminal nodes in the inferred ET [Fig. 2(b)], whereas the tail region contains only terminals. For instance, the symbols “Q”, “*”, “M” and “+” in the head part [Fig. 2(a)] represent the square root, multiplication, maximum of the two, and addition function, respectively, and these are functional nodes. The symbols “a” and “b” represent the predictor variables or numerical constants, and these are terminal nodes [Fig. 2(a)].
- 2. The length of the head region (h) in a chromosome is freely chosen, and the length of the tail (t) is then calculated as follows:
Where n is the maximum number of arguments that a predefined function or operator can take. In Fig. 2, h is 6 and n equals the number of arguments used in the “maximum of two” function (represented as “M”), and hence the tail length is 7. Consequently, the size and shape of an ET, as well as the created real function, depend on the number/type of functional symbols used in the head region. In this case, when the head region is full of mathematical expressions, each of which has an argument number of 2, then all of the noncoding region will be coded, and the size of ET reaches the maximum available number of 13.
- 3. Multigenic chromosomes are often used in GEP to increase the diversity of individuals, in which each gene of the chromosome represents a particular subunit of the related ET. Note that both the fitness values of sub-ETs and multi-subunit ETs will be evaluated during the evolutionary process. This is useful when both the size (complexity) and performance (fitness) of the identified function are of interest to developers.
- 4. Reproduction of the genotype of an individual with best performing fitness through genetic operators, i.e., replication, mutation, transposition and recombination (crossover), is an essential part of evolution. With the exception of replication, the common genetic operators used in GEP are illustrated in Fig. 3. Because of the linear-string structure, possible modifications made on phenotypes by the genetic operators can be directly translated into the phenotype without using special rules, as shown in Figs. 3(a) and 3(b).
To solve symbolic regression problems, the GEP creates and optimizes regression models to the highest fitness value following the steps below:
- 1. The GEP process begins by randomly generating a population of individuals (genotype).
- 2. Each individual, which is translated to one possible regression model (phenotype) by the translation system, can be any combination of predefined mathematical operators, logical expressions, predictor variables and numerical constants.
- 3. The probability for each individual to survive in the evolutionary process is assessed by a fitness evaluation function.
- 4. The second and subsequent generations are created through a reproduction process. In this process, an individual with a high fitness value has a greater opportunity of being selected and passing its genetic characteristics on to its offspring by duplicating or processing genetic operations.
- 5. In the final generation, the optimum regression model linking the predictor and predictand variables can be obtained.
3.2 Development of ocean-color retrieval algorithms using GEP
The symbolic regression function for solving the ocean-color inverse problem can be defined as the following form:Eq. (2), consists of the following steps:
First, formulate the fitness functions to guide and constrain the direction of evolution. The most widely used evaluation functions in symbolic regression are the coefficient of determination (R2) and root-mean-square-error (RMSE) [29, 36], which tend to match several peak and higher values rather than fitting the overall distribution patterns . In the contrast, the log-transformed RMSE (logRMSE), which places more emphasis on fitting the lower observed values [5, 38], can cover a greater portion of IOP data in NOMADv2a [Fig. 1], which is formulated as follows:Table 1, calculated from both the training and validation samples).
Second, set up the pre-defined pool of mathematical expressions or symbols used to form the regression models. A total of 13 mathematical functions are selected (Table 2) based on the recommendation from Ferreira  for symbolic regression problems, and a review of existing empirical algorithms [39–41]. Moore, et al.  indicated that the use of the Rrs(λ) pre-classification technique can better characterize the uncertainty of satellite data products, and it is also applicable to screening an appropriate algorithm for a specific optical water type. To implement this concept of pre-classification in the proposed inversion model using GEP, another six logical expressions are introduced in this study, as shown in Table 2, including the “maximum of two”, “maximum of three”, and four different “Greater Or Equal To” (GOE2) function sets.
Third, create the genetic diversity that is needed for individuals to evolve, thus making it more likely that the fittest individual will be found. In GEP, the level of genetic diversification is controlled by various parameters describing the probability of specific genetic operations occurring, e.g., the rate of mutation and crossover. The initial settings of the genetic operation parameters are given in Table 2, and were also recommended by Ferreira  for optimum evolution.
Finally, other settings for planning the regression model architecture, e.g., the head size, available number of numerical constants, number of genes (sub-ETs) per chromosome, and function linking sub-ETs, are based on the discussions in the forums hosted by Gepsoft , and determined by trial and error tests. All of the GEP settings are summarized in Table 2, where (i) x, y and z cannot only represent a predictor or a numerical constant, but also a functional node. For example, the addition function set “x+y” can be the addition of two predictors “d0+d1”, one predictor and one function set “d0+d2/d1”, and two function sets, and (ii) the function sets consist of only one variable x indicates that the number of arguments the function can take (arity) is 1. For example, the number of arities for x + y and max3(x,y,z) is 2 and 3, respectively.
4. Performance and model-closure evaluation
To evaluate and compare the performance of the GEP derived inversion models, measurements of (1) model rationality, in terms of the number of valid retrievals (n); (2) coefficient of determination in log phase (R2); (3) logRMSE; (4) mean absolute percentage deviation (MAPD); and (5) Type II regression results (the intercept and slope of major axis regression) are used in this study.
First, the performance of trained GEP models was assessed when applied to the overall NOMADv2a data set by comparing their retrievals to those from (1) NASA’s operational near-surface Chla algorithm for SeaWiFS (OC4v6); (2) a recently published Chla retrieval algorithm developed using GP based symbolic regression (GPSR) ; and (3) the latest version of the Quasi-Analytical Algorithm for IOPs (QAAv5) . Note that the GPSR model was also developed based on NOMADv2a, but with three log-10 band reflectance ratios as inputs. In order to independently evaluate the strength of GEP between other data-driven empirical approaches, three inversion models were developed using the same training data as GEP, and also included in the benchmark comparison: (1) the OC4-type band-ratio polynomial regression model (OC4x); (2) the Multilayer Perceptron model (MLP), which has one hidden layer (6 neurons) and a logistic activation function for both hidden and output layers; and (3) the ε-insensitive SVR model (SVR) equipped with the Radial Basis Function kernels . The MLP and SVR were both developed and optimally tuned using the DTREG commercial software, version 10.7.18.
Secondly, performance indicators for the GEP inversion models are calculated with the same procedures and data sets utilized in the IOCCG Report Number 5 , and are evaluated by comparing them with the indicator values obtained from various algorithms tested in this earlier report. The following inversion algorithms were selected for comparison: (1) the QAAv4, (2) the MERIS neural network algorithm (NN) , (3) the linear matrix inversion algorithm (LMI) , (4) the GSM SA bio-optical model (GSM)  and (5) the SA reflectance model based inversion algorithm (SARA) .
Inversion algorithms for IOPs and diffuse attenuation coefficients are mostly based on physical models or a bio-optical data set generated by “unbiased” and “closure-achieved” forward models. Since the GEP-derived IOP algorithms were fully based on in situ observations to define regression functions, they may be affected by errors within the data set. Additionally, the application of symbolic regression for IOP is limited to only one dependent variable, e.g., aph(443), and each IOP retrieval algorithm is developed independently and might have difficulties in achieving closure when combined with each other. Therefore, the validity of proposed algorithms should be evaluated by confirming that the IOPs obtained by independent algorithms are capable of being linked by bio-optical models forming IOPs spectra, and then achieving optical closure when a radiative-transfer model is applied. As shown on Fig. 4, a model-closure evaluation for the proposed algorithms is conducted by comparing the input Rrs(λ) (IOCCG synthetic data) of the inversion algorithms with the Rrs(λ) simulated by the forward models and corresponding IOP retrievals. The models, parameters, and approaches used to generate a(λ) and bb(λ) with IOPs derived at specific wavelengths, and the settings of HydroLight radiative transfer model used to simulate Rrs(λ) with the generated IOP spectra, are given in Lee (2006) . Slight changes were made by shifting the reference wavelengths of absorption and scattering from 440 and 550nm to 443 and 555nm, respectively, and combining ad(λ) and ag(λ) in Lee’s approach to give adg(λ). The q-index developed by Doerffer and Schiller (2007)  to determine the degree of Rrs(λ) mismatch (mismatch when q4) was used for closure evaluation.
5. Results and discussion
5.1 The GEP-derived ocean-color inversion models
The inversion models developed by GEP were in the form of ETs, which were translated to the following algebraic equations, based on the procedure described in Section 3.1.Table 2); max2 is the maximum of two function; and max3 is the maximum of three function. Interestingly, the GOE2 functions are the most frequently chosen by the above equations. It is evident that logical expressions play an important role in GEP-derived inversion models by varying the functional relationships between variables according to the input value of Rrs(λ). Although similar pre-classification strategies have been widely used in existing algorithms, such as the max3 function in OCx algorithms  and the color index (CI) threshold in the CI-based algorithm , this study is the first to develop an optimum inversion model derived using the GOE2 functions.
5.2 The limitation of GEP-derived algorithms
A synthetic Rrs(λ) data set was developed in this study to evaluate the robustness of the proposed algorithms. The test spectra were generated by taking 100,000 random Latin-hypercube samples of the Rrs(λ) spectrum from a multivariate lognormal distribution with a mean and covariance matrix obtained by all available measurements (n = 3,079) in the NOMADv2a data set. The evaluation aimed to be comprehensive and independent, as a major portion of the measurements was not included in the training data, and the testing spectra were simulated to sufficiently cover possible spectral values and patterns in the NOMADv2a data set. However, the simulated data may contain biases from the measurement errors within the in situ data set. The evaluation results are shown on Table 3. Although for the bbp(555) retrievals are relatively less robust, more than 99% of the retrieved values from other algorithms are within the scope of training/validation data set. The proposed algorithms were robust even when Rrs(411) was at a relatively low level of 0.00014 sr−1. However, an exceptional condition may occur while processing ocean-color imagery over coastal regions, where the Rrs(411) and Rrs(443) are often biased lower than 0.00014 sr−1 or become negative values due to the atmospheric correction errors . Further analysis using different SeaWiFS scenes is required to ensure the validity of GEP-derived algorithms.
5.3 Performance of GEP models for the NOMADv2a in situ bio-optical data set
The overall performance of the GEP-derived algorithms is shown in Table 4 and Fig. 5, by comparing their retrievals to the known Chla and IOPs in both training and validation data sets. The performance indicators for GEP, MLP, and SVR in training and validation stages are both reported in Table 4. The performance for all data is generally the average value of both stages. There are no significant differences among all empirical algorithms with regard to the type II regression. Because of a greater size of training data, the OC4v6 developed using all of the available samples in the NOMADv2a data set (n = 1,243) performs better than the OC4x based on the same training data used in this study (n = 605). Nevertheless, the other four empirical algorithms derived by the state-of-the-art regression approaches with less training data, e.g., GEP, GPSR, MLP and SVR, provide tighter and stronger relationships with the known values in terms of having lower errors (logRMSE and MAPD) and higher R2 values than OC4v6. Based on the same training data, although the SVR performs the best across GEP and MLP for all and the major portion of cases (within the range of 0.25 and 3 mg/m3), the GEP performs the best for oligotrophic waters, where the levels of Chla were commonly lower than 0.25 mg/m3 , and well-balanced its performance across different levels of Chla. This advantage can help to process an ocean-color image with a higher dynamic range of Chla distribution. As shown in Fig. 5(a), because of the relatively few samples taken from eutrophic waters, a similar trend of under-estimation was found in both GEP and OC4v6 algorithms when the observations were higher than 10 mg/m3. The band-ratio algorithm derived by symbolic regression (GPSR) has a simpler architecture than our GEP-derived model [Eq. (5)]. However, the increased complexity of the GEP algorithm with additional GOE2 functions can obtain more accurate results, as seen in the fact that all of the performance indicators are better. In practice, there are no significant differences between GEP, SVR, and MLP in terms of the overall performance. The GEP-derived algorithm has the potential to be improved by using more function sets and adding more sub-ETs. The SVR and MLP can also be improved by tuning the network structures until fully optimized. The key strengths of the GEP-derived algorithm are its greater computational efficiency, which has been shown by GP in Tang et al. (2012) , and its greater algorithm transferability, due to its explicit nature (compared to SVR and MLP, which are implicit, black-box approaches).
Comparisons were also made to the IOP retrievals from the model-based QAAv5 algorithm using the same data set, as shown in Table 4 and Fig. 5. The empirically-based GEP algorithms perform better than the model-based QAAv5 in terms of all performance metrics, for the NOMADv2a in situ data. For the category of a(443) or bbp(555), which combine the contributions from both phytoplankton and detritus-gelbstoff, two algorithms are able to attain good retrievals. However, without the use of the additional measurements of Rrs(667), it is evident that QAAv5 has difficulty in separating the contributions of aph(443) and adg(443) from a(443), particularly when aph(443) is lower than 0.1 m−1. With the advantage of using an independent algorithm for each inversion [Eqs. (6) and (7)], the GEP has better retrievals of both aph(443) and adg(443), which can help in accurately deriving the concentration of the specified water constituent.
The four- and three-component bio-optical model is summarized in the following equation:50]. Based in Eq. (11), the adg(443) can be calculated either by Eq. (7) alone, or by subtracting aph(443) and aw(443) from a(443) by utilizing the Eqs. (6)-(9). Although these equations were developed independently by GEP, and appear to be very different in their formulation, a comparison between the adg(443) retrievals obtained in two different ways [Fig. 6(c)] shows that both approaches are capable of carrying out the inversion, and hence the algorithms are generally interoperable. Therefore, specific information about gelbstoff and detritus in Eq. (11), i.e., ag(443) and ad(443), is obtainable from an “allocation” procedure. The ag(443) can be obtained by subtracting ap(443) and aw(443) from a(443), and then used to derive ad(443) from adg(443). The performances of deriving ag(443) and ad(443) using Eqs. (6)-(9) and following the above procedure are shown in Figs. 6(a) and 6(b), respectively. Although a similar overestimation is found for both adg(443) and ag(443) in the lower values, all of the ag(443) retrievals are valid (n/N = 100% and R2 = 0.65) by comparing them to the available cases in NOVADv2a (with coincident measurements of ag(443) and Rrs(λ) at the first five SeaWiFS visible bands). On the other hand, the ad(443) retrievals from GEP algorithms and the “allocation” procedure appear to be reasonable when higher than 0.01 m−1, but most are underestimated, with invalid estimates (<0) for the lower values. This indicates that even though the proposed GEP algorithms are developed empirically and individually, they are still capable of dividing the contributions between aph(443), ag(443) and ad(443), from a(443) based on the assumptions of the four-component bio-optical model. As shown on Figs. 6(a) and 6(b), such application may be limited when the derived ag(λ) < 0.1 m−1 or ad(λ) < 0.01 m−1, while adg(λ) is too small to be correctly allocated.
5.4 Comparison with model-based IOP algorithms using IOCCG synthetic data set
Table 5 summaries the performance indicators of GEP and selected inversion algorithms by comparing their retrievals to 500 synthetic and 656 in situ cases in the IOCCG data set. Because the performance (Table 5) is compared without using the same data set to tune the algorithms, it might not be an independent and unbiased evaluation of algorithm performance. Results included in this section allow for easy comparison between various existing methods and the new GEP-derived algorithms, and provides general information on the utilization of the algorithms for interested researchers. Note that the bbp(555) is not included in the in situ data, and the comparison made to the bbp(555) retrievals from GEP is limited to values less than 0.01 m−1 in the synthetic data set, as the data set used to train and validate the GEP algorithms (Table 1) does not have any observations higher than 0.01 m−1. Due to wavelength mismatch, the performance of the NN algorithm for in situ data was not applied and reported in the IOCCG Report. Additionally, the proposed GEP algorithm is the only full-empirical approach among all the algorithms compared.
When the synthetic data set is applied, both the GEP and model-based approaches can achieve good retrievals, as shown in Table 5. Notably, an earlier version of QAAv5, the QAAv4, obtains the best retrievals for nearly all categories of the IOCCG synthetic data set, whereas the GEP approach generally ranks second to fourth among the tested algorithms. As shown in Fig. 7, Rrs(λ) simulated at different sun angles may not produce retrievals that are significantly different from one another. The IOCCG synthetic data set was generated using various reasonable bio-optical models, with parameters that were mostly obtained from field measurements . In contrast, the GEP-derived algorithms may include all of the possible relationships that exist between IOPs and Rrs(λ), whether they are reasonable or not. Therefore, some minor systematic deviations were found in the GEP derived retrievals, and these decreased the overall performance of GEP. As shown in Fig. 7 and the results of the type II regression, over-estimation was commonly observed for Chla, aph(443), ap(443) and the lower part of the adg(443) retrievals, while under-estimation occurred in the higher parts of both adg(443) and a(443). Except for the systematic errors originating from the differences between the empirical and model-based approaches, the GEP algorithms generally perform well in dealing with the noisy synthetic data.
As shown in Table 5 (numbers in the parentheses), the performance of all inversion algorithms decreased due to the higher uncertainty in measurement errors in the in situ data set. For example, the logRMSE values of the QAAv4 increased from 2.0 (for aph) to 2.9 (for a) times the results of the synthetic cases. The empirical-based GEP algorithms are relatively robust, with only the a(443) having a 1.3 times higher logRMSE value, and they perform the best on the categories of aph and adg. Additionally, no system deviations were found in retrieving the 656 in situ cases using the GEP-derived algorithms, as seen in the type II regression results (Table 5) and Fig. 8. Overall, compared with algorithms that disregard several cases due some quality checking mechanisms, the performance of the GEP algorithms was evaluated for a combination of 1,156 cases (for bbp(553) is 893). No unreasonable retrievals (i.e., negative IOPs) were found for any of the cases.
The results of the closure evaluation for IOPs obtained by GEP- algorithms are shown in Fig. 9 and Table 6. Again, the closure evaluation was only performed with the 240 available Rrs(λ) cases that had retrieved bbp(555) values of less than 0.01 m−1. The GEP were fully based on in situ data to define empirical functions, and hence the goodness of closure were highly dependent to the level of errors and uncertainties within the data. In the other hand, the model-based algorithms were algebraic or optimization solutions of simplified analytical models, which are commonly regarded as unbiased and closure-achieved. Model-based approaches such as QAA and GSM can achieve nearly 100% closure using synthetic data. As shown on Fig. 9 and Table 6, the agreement between the input reflectance values and those derived by the retrieved IOPs is reasonable for an empirical algorithm developed by in situ data. The q-indexes show that there are only three mismatched cases among the 240 test spectra, with a low average value of 1.48. These results confirm that the IOPs retrieved at a single wavelength are capable of being combined by forward models to achieve closure when bbp(555) retrievals are valid within the region of training data.
5.5 Algorithm verification with level-2 match-up data
The satellite match-up data from NOMADv13 was applied to validate the practical application of GEP-derived algorithms, in which the retrievals were processed through Rrs(λ) from SeaWiFS imagery, rather than the ground measurements in the NOMADv2a data set. Note that a few cases were excluded in the verification when one of the Rrs(λ) values was lower than an estimated detection limit of 0.00005 sr−1. Table 7 and Fig. 10 present the verification results by comparing the retrievals to the SeaWiFS match-ups. The retrievals from GEP-derived algorithms have good agreement with the match-up data, with all R2 values higher than 0.8 and MAPD less than 50% on most categories, except for the HPLC-measured Chla and adg(443). Compared to the 248 Chla data measured on-boat using fluorometers (an average of 1.07 mg/m3), much fewer samples were analyzed by HPLC and these were collected mostly in coastal regions, where the higher concentrations (an average of 1.55 mg/m3) led to larger retrieving errors. Problems with over-estimation were consistently found for lower adg(443) values from all data sets used in this study. This partly indicates that the current selection of logical/mathematical expressions in GEP may be further improved for adg(443), from the view of symbolic regression.
5.6 Application of GEP-derived algorithms
Figure 11 shows the results of applying the developed inversion algorithms [Eqs. (5)-(8)] to the SeaWiFS Global Area Coverage (GAC) data. In the test scene, more than 99% of the retrieved pixel values were within the range of training/validation data and flagged as reasonable retrievals. Note that the bbp retrievals were relatively low, at 97%, because of its narrower parameter range. The majority of not-valid pixels are located in very turbid coastal regions, due to their negative Rrs(411) or Rrs(443) values. Compared to the OC4v6, which uses a maximum Rrs(blue) function to determine the most appropriate blue-to-green ratio, the GEP algorithms developed in their current form rely more heavily on the quality of Rrs(blue). This special SeaWiFS scene taken of the Black Sea and part of the Mediterranean on May 4, 2002, and shows a very wide range of Chla concentrations and optical properties across different water bodies (Sea of Azov, Black Sea and Mediterranean). Figure 11(b) shows that the standard SeaWiFS Chla product can provide differentiation of phytoplankton blooms in the Black Sea at a gross level. However, both the Chla and aph(443) derived by the GEP algorithms [Figs. 11(c) and 11(d)] provide differentiation at a finer scale. For example, using the two algorithms developed in this study we are not only able to depict algal blooms close to the western shorelines of the Black Sea, but also identify the special offshore “bright blooms” . It is notable that the ap(443) algorithm improved contrast in the speckling pattern [Fig. 11(f)], which may indicate the presence of artifacts from the empirical algorithm, or may be a correct depiction of the spatial distribution of oceanic particles. The band-shift corrections  can be applied with respect to the NOMAD training data to allow for the application of GEP algorithms to modern sensors with different central-wavebands, e.g., MODIS and MERIS.
6. Concluding remarks
This work presents a new symbolic regression approach using GEP for the development of ocean-color inversion algorithms. The algorithms are trained by a subset of samples from the NOMADv2a data set, and are validated and evaluated against the remainder of NOMADv2a, a synthetic and in situ data set compiled by IOCCG, and a satellite match-up data set. The GEP-derived algorithms consist of various mathematical and logical expressions that are optimized and selected through evolutionary processes. The results of function finding indicate that a series of special GOE2 functions are the most chosen, and play a significant role in the classification of Rrs(λ) inputs and determining the algorithm formulations.
The results of comparison of GEP algorithms with other algorithms using the NOMADv2a data set demonstrate that the state-of-the-art regression approaches, e.g., GEP, GPSR, MLP and SVR, outperform the OCv6 with less training data. The SVR and GEP perform the best for the majority of cases (3 ≥ Chla > 0.25 mg/m3) and for oligotrophic waters (Chla ≤ 0.25 mg/m3), respectively. However, it is concluded that there are no significant differences between GEP, SVR, and MLP in terms of the overall performance, since these state-of-the-art data interpolation approaches all have the potential to be improved. The GEP algorithms proposed in their current forms are recommended for ocean color data processing because of the following advantages: (1) well-balanced performance across different Chla levels; (2) higher model transferability owing to their explicit nature; (3) appropriate function size compared to other GP derived algorithms, which commonly appear in the form of tree-structure functions or source code for computing; and (4) higher flexibility of model improvement, due to a wide choice of function sets and model architectures.
For IOPs retrievals, the GEP algorithms are robust and their performance is comparable to those of existing model-based approaches regarding the model-synthesized IOCCG data with added noise. Analytical or semi-analytical algorithms enable opportunities to understand the basics of ocean optics. On the other hand, simple empirical relationships extend the potential application of ocean-color products due to the easy and efficient processing ability. It might be the fundamental limitation for algorithms capable to perform complex data interpolations is that they are only valid at the data interval used for the model development. The general “out-of-range” limitation is applicable for these approaches to characterize the quality of each retrieved pixel value in ocean-color products. In this study, the statistical analysis using 100,000 simulated Rrs spectra shows that the GEP-derived algorithms are very reliable and robust at conditions similar to the NOMADv2a data set. The results of image processing show that some exceptional cases may occur in very turbid coastal regions where the Rrs(411) or Rrs(443) are negative. For those within the parameter region, but with questionable retrievals, the results of the closure evaluation provide some support for the GEP-derived IOP algorithms. The success of closure evaluation depends strongly upon the accurate field measurements in the NOMADv2a data set. However, unlike the MLP approach, which has the advantage of building bi-directional networks to derive pixel-by-pixel q-indexes as a threshold to identify not-valid values, the GEP symbolic regression is limited to only one dependent variable. To the best of my knowledge, there are no established procedures for estimating confidence or prediction intervals for symbolic regression models to prevent questionable retrievals. The strategy of assembling a set of diverse models to provide a confidence prediction  is thus suggested in any future GEP scheme.
As an empirically-based approach, the GEP-derived algorithms are easily utilized and fast at processing the satellite imagery. After processing by the GEP-derived algorithms, the SeaWiFS GAC image shows more details of developing algal blooms in the Black Sea, and gives more information on the distribution of particles and CDOM.
The GEP algorithms proposed in this study fully rely on training data to define the symbolic regression model, and thus it is recommended that a further study include more data for algorithm development, which can be a data set simulated by mechanistic models or a collection of accurate in situ data, particularly for higher bbp samples.
The authors thank the Ministry of Science and Technology, Taiwan ROC, for the financial support through grants NSC-101-2221-E-006-153 and MOST-103-2221-E-006-014.
References and links
1. T. Platt, “Primary production of the ocean water column as a function of surface light intensity: algorithms for remote sensing,” Deep Sea Res. Part I Oceanogr. Res. Pap. 33(2), 149–163 (1986). [CrossRef]
2. Z. P. Lee, K. L. Carder, R. G. Steward, T. G. Peacock, C. O. Davis, and J. L. Mueller, “Remote sensing reflectance and inherent optical properties of oceanic waters derived from above-water measurements,” Proc. SPIE 2963, 160–166 (1997). [CrossRef]
3. A. Morel, “Optical modeling of the upper ocean in relation to its biogenous matter content (case I waters).,” J. Geophys. Res.-Oceans 93(C9), 10749–10768 (1988). [CrossRef]
4. S. Sathyendranath, Remote Sensing of Ocean Colour in Coastal, and Other Optically-Complex, Waters (IOCCG, 2000)
5. Z. P. Lee, K. L. Carder, and R. A. Arnone, “Deriving inherent optical properties from water color: a multiband quasi-analytical algorithm for optically deep waters,” Appl. Opt. 41(27), 5755–5772 (2002). [CrossRef] [PubMed]
6. K. L. Carder, F. R. Chen, Z. P. Lee, S. K. Hawes, and D. Kamykowski, “Semianalytic Moderate-Resolution Imaging Spectrometer algorithms for chlorophyll a and absorption with bio-optical domains based on nitrate-depletion temperatures,” J. Geophys. Res.- Oceans 104(C3), 5403–5421 (1999). [CrossRef]
7. S. A. Garver and D. A. Siegel, “Inherent optical property inversion of ocean color spectra and its biogeochemical interpretation. 1. Time series from the Sargasso Sea,” J. Geophys. Res.- Oceans 102(C8), 18607–18625 (1997). [CrossRef]
8. C. C. Liu and R. L. Miller, “A spectrum matching method for estimating the chlorophyll-a concentration, CDOM ratio and backscatter fraction from remote sensing of ocean color,” Can. J. Rem. Sens. 34(4), 343–355 (2008). [CrossRef]
9. F. Melin, “Global distribution of the random uncertainty associated with satellite-derived Chla,” IEEE Geosci. Remote Sens. 7(1), 220–224 (2010). [CrossRef]
10. Z. Lee, R. Arnone, C. Hu, P. J. Werdell, and B. Lubac, “Uncertainties of optical parameters and their propagations in an analytical ocean color inversion algorithm,” Appl. Opt. 49(3), 369–381 (2010). [CrossRef] [PubMed]
11. D. A. Toole, D. A. Siegel, D. W. Menzies, M. J. Neumann, and R. C. Smith, “Remote-sensing reflectance determinations in the coastal ocean environment: impact of instrumental characteristics and environmental variability,” Appl. Opt. 39(3), 456–469 (2000). [CrossRef] [PubMed]
12. J. Chen, W. Quan, T. Cui, Q. Song, and C. Lin, “Remote sensing of absorption and scattering coefficient using neural network model: Development, validation, and application,” Remote Sens. Environ. 149, 213–226 (2014). [CrossRef]
13. P. Cipollini, G. Corsini, M. Diani, and R. Grasso, “Retrieval of sea water optically active parameters from hyperspectral data by means of generalized radial basis function neural networks,” IEEE Trans. Geosci. Remote 39(7), 1508–1524 (2001). [CrossRef]
14. D. D’Alimonte and G. Zibordi, “Phytoplankton determination in an optically complex coastal region using a multilayer perceptron neural network,” IEEE Trans. Geosci. Remote 41(12), 2861–2868 (2003). [CrossRef]
15. I. Ioannou, A. Gilerson, B. Gross, F. Moshary, and S. Ahmed, “Neural network approach to retrieve the inherent optical properties of the ocean from observations of MODIS,” Appl. Opt. 50(19), 3168–3186 (2011). [CrossRef] [PubMed]
16. C. Jamet, H. Loisel, and D. Dessailly, “Retrieval of the spectral diffuse attenuation coefficient K-d(λ) in open and coastal ocean waters using a neural network inversion,” J. Geophys. Res.- Oceans 117(C10), C10023 (2012). [CrossRef]
17. L. E. Keiner and X. H. Yan, “A neural network model for estimating sea surface chlorophyll and sediments from thematic mapper imagery,” Remote Sens. Environ. 66(2), 153–165 (1998). [CrossRef]
18. G. Camps-Valls, L. Gomez-Chova, J. Munoz-Mari, J. Vila-Frances, J. Amoros-Lopez, and J. Calpe-Maravilla, “Retrieval of oceanic chlorophyll concentration with relevance vector machines,” Remote Sens. Environ. 105(1), 23–33 (2006). [CrossRef]
19. H. G. Zhan, P. Shi, and C. Q. Chen, “Retrieval of oceanic chlorophyll concentration using support vector machines,” IEEE Trans. Geosci. Remote 41(12), 2947–2951 (2003). [CrossRef]
20. M. S. Salama and F. Shen, “Stochastic inversion of ocean color data using the cross-entropy method,” Opt. Express 18(2), 479–499 (2010). [PubMed]
21. H. T. Shahraiyni, S. B. Shouraki, F. Fell, M. Schaale, J. Fischer, A. Tavakoli, R. Preusker, M. Tajrishy, M. Vatandoust, and H. Khodaparast, “Application of the active learning method to the retrieval of pigment from spectral remote sensing reflectance data,” Int. J. Remote Sens. 30(4), 1045–1065 (2009). [CrossRef]
22. C. H. Chang, C. C. Liu, and C. G. Wen, “Integrating semianalytical and genetic algorithms to retrieve the constituents of water bodies from remote sensing of ocean color,” Opt. Express 15(2), 252–265 (2007). [CrossRef] [PubMed]
23. H. G. Zhan, Z. P. Lee, P. Shi, C. Q. Chen, and K. L. Carder, “Retrieval of water optical properties for optically deep waters using genetic algorithms,” IEEE Trans. Geosci. Remote 41(5), 1123–1128 (2003). [CrossRef]
24. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, 1989).
25. K. Miettinen, P. Neittaanmaki, M. M. Makela, and J. Periaux, Evolutionary Algorithms in Engineering and Computer Sciences (John Wiley & Sons, 1999).
26. J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection (MIT Press, 1992).
27. C. Fonlupt, “Solving the ocean color problem using a genetic programming approach,” Appl. Soft Comput. 1(1), 63–72 (2001). [CrossRef]
28. S. Tang, C. Michel, and P. Larouche, “Development of an explicit algorithm for remote sensing estimation of chlorophyll a using symbolic regression,” Opt. Lett. 37(15), 3165–3167 (2012). [CrossRef] [PubMed]
29. C. Ferreira, “Gene expression programming: a new adaptive algorithm for solving problems,” Complex Syst. 13, 87–129 (2001).
30. M. O'Neill and C. Ryan, “Grammatical Evolution: A steady state approach,” in Late Breaking Papers at the Genetic Programming 1998 Conference, J. R. Koza, ed., (Stanford University Bookstore, 1998), pp. 180–185.
31. P. J. Werdell and S. W. Bailey, “An improved in-situ bio-optical data set for ocean color algorithm development and satellite data product validation,” Remote Sens. Environ. 98(1), 122–140 (2005). [CrossRef]
32. Z. P. Lee, Remote Sensing of Inherent Optical Properties: Fundamentals, Tests of Algorithms, and Applications (IOCCG, 2006).
33. D. Odermatt, A. Gitelson, V. E. Brando, and M. Schaepman, “Review of constituent retrieval in optically deep and complex waters from satellite imagery,” Remote Sens. Environ. 118, 116–126 (2012). [CrossRef]
34. Z. P. Lee, “Models, parameters, and approaches that used to generate wide range of absorption and backscattering spectra,” http://www.ioccg.org/groups/lee_data.pdf, accessed 2/16/15.
35. S. W. Bailey and P. J. Werdell, “A multi-sensor approach for the on-orbit validation of ocean color satellite data products,” Remote Sens. Environ. 102(1-2), 12–23 (2006). [CrossRef]
36. M. Z. Hashmi, A. Y. Shamseldin, and B. W. Melville, “Statistical downscaling of watershed precipitation using Gene Expression Programming (GEP),” Environ. Model. Softw. 26(12), 1639–1646 (2011). [CrossRef]
37. A. Efstratiadis and D. Koutsoyiannis, “One decade of multi-objective calibration approaches in hydrological modelling: a review,” Hydrol. Sci. J. 55(1), 58–78 (2010). [CrossRef]
38. E. G. Bekele and J. W. Nicklow, “Multi-objective automatic calibration of SWAT using NSGA-II,” J. Hydrol. (Amst.) 341(3-4), 165–176 (2007). [CrossRef]
39. A. H. Barnard, J. R. V. Zaneveld, and W. S. Pegau, “In situ determination of the remotely sensed reflectance and the absorption coefficient: closure and inversion,” Appl. Opt. 38(24), 5108–5117 (1999). [CrossRef] [PubMed]
40. Z. P. Lee, K. L. Carder, R. G. Steward, T. G. Peacock, C. O. Davis, and J. S. Patch, “An empirical algorithm for light absorption by ocean water based on color,” J. Geophys. Res.- Oceans 103(C12), 27967–27978 (1998). [CrossRef]
41. J. E. O’Reilly, S. Maritorena, B. G. Mitchell, D. A. Siegel, K. L. Carder, S. A. Garver, M. Kahru, and C. McClain, “Ocean color chlorophyll algorithms for SeaWiFS,” J. Geophys. Res.- Oceans 103(C11), 24937–24953 (1998). [CrossRef]
42. T. S. Moore, J. W. Campbell, and M. D. Dowell, “A class-based approach to characterizing and mapping the uncertainty of the MODIS ocean chlorophyll product,” Remote Sens. Environ. 113(11), 2424–2430 (2009). [CrossRef]
43. Gepsoft, “Getting Started with Regression,” http://www.gepsoft.com/tutorials.htm, assessed 2/15/15.
44. R. Doerffer and H. Schiller, “The MERIS case 2 water algorithm,” Int. J. Remote Sens. 28(3-4), 517–535 (2007). [CrossRef]
45. F. E. Hoge and P. E. Lyon, “Satellite retrieval of inherent optical properties by linear matrix inversion of oceanic radiance models: An analysis of model and radiance measurement errors,” J. Geophys. Res.- Oceans 101(C7), 16631–16648 (1996). [CrossRef]
48. C. Hu, Z. Lee, and B. Franz, “Chlorophyll a algorithms for oligotrophic oceans: A novel approach based on three-band reflectance difference,” J. Geophys. Res.- Oceans 117(C1), C01011 (2012). [CrossRef]
49. W. Shi and M. H. Wang, “Detection of turbid waters and absorbing aerosols for the MODIS ocean color data processing,” Remote Sens. Environ. 110(2), 149–161 (2007). [CrossRef]
51. G. C. Feldman, “Feature of the Day (May 4, 2002), Ocean Color Image Gallery,” http://oceancolor.gsfc.nasa.gov/cgi/image_archive.cgi, assessed 2/15/15.
52. G. Zibordi, F. Melin, and J. F. Berthon, “Comparison of SeaWiFS, MODIS and MERIS radiometric products at a coastal site,” Geophys. Res. Lett. 33(6), L06617 (2006). [CrossRef]
53. M. Kotanchek, G. Smits, and E. Vladislavleva, “Trustable symbolic regression models: using ensembles, interval arithmetic and pareto fronts to develop robust and trust-aware models,” in Genetic Programming Theory and Practice V, R. Riolo, T. Soule, and B. Worzel, eds. (Springer, 2008), pp. 201–220.