The deployment of towed depth-profiling paravane systems and autonomous gliders is providing a wealth of high-resolution oceanographic datasets. These datasets are, however, over-sampled in space and time. This paper describes a data-adaptive, user-configurable method that has been used to significantly reduce the time/space density of such data without compromising the inherent scientific information that they provide. The method involves sub-sampling at fixed space and time intervals, with additional samples being kept given a significant change either (1) in the depth extent of the along-track profiles or (2) in the values of the profiles themselves. An example is provided showing how well the algorithm works on nearly 5,000 chlorophyll fluorescence profiles collected off the coast of Australia.
© 2008 Optical Society of America
1. Introduction

In the field of oceanographic research, one of the most common measurement techniques has been to acquire “depth profiles,” i.e., to measure a parameter, or group of parameters, as a function of depth. One of the earliest, and perhaps most common, profile measurements is the temperature profile. As technology has advanced, scientists have added many more sensors to their profiling equipment, such as sensors that measure conductivity, chlorophyll fluorescence, and optical backscatter. For decades, such sensor packages were lowered on a hydro-wire, but more recently, they have been deployed on paravanes that can be towed behind a ship and “yo-yo’ed” up and down to acquire a dense series of “profiles.” (Although such measurements actually trace a 2-D sawtooth or sine-wave pattern behind the ship, each vertical excursion is often treated as a depth profile. This approximation is justified by the fact that horizontal coherence scales are usually several orders of magnitude greater than vertical coherence scales [1,2].) An example of a commercially available paravane system is the Chelsea Aquashuttle [3,4,5]. Another example is the SeaSoar system built and deployed by the Woods Hole Oceanographic Institution (WHOI) [6]. Such systems can acquire thousands of multi-variate profiles in a period of a few days.
Besides the paravane-based systems, the recent proliferation of Autonomous Underwater Vehicles (AUVs) has introduced still another method of gathering densely sampled, multi-variate profile data. For example, Rutgers University has deployed Slocum AUV gliders (built by Webb Research, Inc.) off the New Jersey coast, in the Mediterranean Sea, in the Baltic Sea, and off the coast of Australia [7]. Such glider deployments can last several weeks and obtain >10,000 profiles. The main point is that, unlike the time-consuming deployments of sensors from a ship via a hydro-wire, the AUV vastly reduces the time between individual depth “profiles” from days or hours to minutes. As a result, AUVs collect unprecedented data volumes, typically many orders of magnitude greater than those of traditional ship-based water-profiling surveys. However, due to the inherent spatial coherence scales in the ocean, each profile is generally not significantly different from the preceding one.
The influx of this “over-sampled” AUV data provides much finer temporal and spatial sampling, but it poses at least two analysis problems: (1) how to examine and quality-check thousands of data profiles per AUV survey, and (2) which of the profiles to archive to capture the inherent ocean variability that was measured by the AUV. With respect to the first problem, even with modern high-speed computers, the sheer number of profiles gathered from these deployments makes graphing and processing the data time consuming. Furthermore, it is not uncommon to exhaust a computer’s memory when displaying the results of a single AUV deployment.
The second problem, deciding which and how many of the original profiles to archive, directly impacts the World-wide Ocean Optics Database (WOOD). For such historical archives, the question really becomes one of how to sub-sample the original space-time series of profiles so that the resultant dataset accurately represents the original conditions but does not over-sample the environment. For example, in open-ocean regions free of oceanic fronts, hundreds of successive profiles may look virtually identical. In contrast, when a front is crossed, or when one approaches a shoreline, conditions are likely to vary quite rapidly. Our solution was to develop data-thinning software that intelligently and automatically extracts only the essential data from the original dataset, saving only those profiles that are necessary to accurately represent the collected data. This software can “thin” the data based on several parameters, including the distance between profiles, the time between profiles, and, more importantly, the differences in data structure between profiles. Furthermore, it simultaneously tracks each parameter measured during the AUV deployment, such as temperature, salinity, beam attenuation, and the optical backscattering coefficient, and it ensures that if a profile of one parameter is kept as a unique feature, then the corresponding profiles of the other parameters are kept as well.
This paper documents the capabilities and algorithms associated with this software and describes in detail how it intelligently “thins” over-sampled datasets. Examples are provided to show the effectiveness of the methodology in processing AUV or SeaSoar data. The paper also discusses options for future development, such as automatic spike editing.
In applying our data-thinning algorithm, we make use of the following terms. First, during our initial data processing, we organize the data into single-variable files. These files encompass all data of one data type gathered from one “cruise” or AUV deployment. Each file contains many profiles: each profile is a collection of a single parameter (such as chlorophyll concentration) versus depth obtained at (nominally) one geographical location and one time. Profiles have a metadata header that maintains important features such as location, date, time, a common cruise number, and the identification number of the profile within the larger file. The profile/file structure of these datasets is important to remember, for it determines how the data-thinning algorithm uniquely traverses and analyzes sets of files from many different parameters.
2. Methodology and application
As mentioned above, the World-wide Ocean Optics Database (WOOD) provides an archive of bio-optical data from a wide variety of sources. One of these sources is the AUV, a source that has become increasingly common in the past few years. WOOD became a logical testbed for the development and testing of software that can compare and adaptively reduce (or “thin”) raw AUV data files. (The same methods apply to SeaSoar data.) As described further below, AUV data to be stored in the WOOD are thinned based on four criteria: distance between measurements, elapsed time relative to previous profiles, vertical extent of the data, and changes in the relative structure of successive data profiles.
AUVs collect data profiles while they traverse pre-programmed or user-directed paths through the ocean. Adjacent profiles along such a traverse are in close proximity to one another (usually <1 km), and therefore rarely differ much from one another. In fact, in many open ocean areas, a profile (such as temperature or chlorophyll) may not change significantly for tens or even hundreds of kilometers. As a result, thinning such data via a criterion that is solely based on statistically significant variations could result in huge spatial gaps in the final output. To avoid such problems, one requirement of the data-thinning software is that, regardless of meaningful changes within the data, it will keep a complete set of profiles at some “reasonable” (user-specified) minimum spatial interval.
The second requirement of the data-thinning algorithm involves elapsed time. Even if the AUV were to make continuous circles at one location, and presuming the profile remained constant during that time, one would still want to keep a sufficient number of those profiles to provide a “representative” time series of the original series. The raw dataset must therefore be sub-sampled in relation to both space and time. If the original dataset provides a profile every 5 minutes, then storing a profile, for example, every 4 hours would provide a good representation of that day’s data while saving precious storage space and dramatically reducing loading and retrieval times in a data archive like WOOD.
In addition to the fixed (and somewhat arbitrary) geospatial and chronological thinning criteria described above, additional criteria are imposed to ensure that enough profiles are retained to accurately capture any significant physical variability in adjacent profiles, such as when the depth extent varies significantly (usually due to bathymetric variability) or when an oceanographic front is crossed. To meet these data-sensitive thinning criteria, the algorithm assigns the first data profile in a file to be a “reference” profile. It then iterates through every subsequent profile in the data file, comparing the current profile against the reference profile, which is always the most recently saved profile in the thinned file. This comparison involves an examination of the depth extent of the data and a calculation of the change in the structure exhibited by the profile. For the change in structure, a percent change as well as an absolute mean change is computed. The percent change criterion is not specified on a parameter-by-parameter basis because it is meant to serve as a single metric for the entire thinning process. In contrast, parameter-specific absolute mean change criteria handle the situation in which data values change from one minutely small value (e.g., 0.01) to another (e.g., 0.02). While the percent change from 0.01 to 0.02 is 100%, the mean change is only 0.01; thus, the mean change criterion is useful when deciding whether or not to keep profiles having such small absolute changes in structure.
The respective change equations are given below:

  Percent change = (100/N) Σ_{i=1..N} |y_i − y_ref,i| / |y_ref,i|    (1)

  Mean change = (1/N) Σ_{i=1..N} |y_i − y_ref,i|    (2)

where N is the number of depth points compared. A profile is saved only when the result of Eq. (1) exceeds the percent change threshold (Z1) and the result of Eq. (2) exceeds the absolute mean change threshold (Z2).
In the equations above, y_i and y_ref,i are continually updated as new profiles are tested and saved: y_i is the ith data point in the profile being tested, while y_ref,i is the ith data point in the most recently saved, or “reference,” profile. After each equation is evaluated, the results are compared to the threshold change criteria the user has defined in a user-input file (see Table 1); if the change criteria are exceeded, the test profile becomes the new “reference” profile. Nominal values for the various change criteria are summarized in Table 1, but users are free to modify any of these settings in a text input file.
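The two structural-change tests can be sketched as follows. This is an illustrative Python rendering, not the actual FORTRAN implementation; the function names and the example thresholds (20%, 0.05) are chosen for the sketch only, and reference values are assumed to be nonzero.

```python
def mean_change(profile, reference):
    """Absolute mean change between a test profile and the reference (Eq. 2)."""
    n = len(reference)
    return sum(abs(y - yref) for y, yref in zip(profile, reference)) / n

def percent_change(profile, reference):
    """Mean point-by-point percent change relative to the reference (Eq. 1).

    Assumes the reference values are nonzero.
    """
    n = len(reference)
    return 100.0 * sum(abs(y - yref) / abs(yref)
                       for y, yref in zip(profile, reference)) / n

def exceeds_structure_criteria(profile, reference, z1_percent, z2_absolute):
    """A profile is 'significantly different' only when BOTH thresholds are exceeded."""
    return (percent_change(profile, reference) > z1_percent and
            mean_change(profile, reference) > z2_absolute)

# The 0.01 -> 0.02 case from the text: a 100% relative change, but only a
# 0.01 absolute change, so an absolute threshold (Z2) of 0.05 rejects it.
print(exceeds_structure_criteria([0.02, 0.02], [0.01, 0.01],
                                 z1_percent=20.0, z2_absolute=0.05))  # False
```

Requiring both tests to pass is what prevents tiny absolute fluctuations at low parameter values from being saved as “significant” changes.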
Change criteria for data thinning based on the absolute mean value strongly depend on the parameter, the season, and the ambient conditions. For example, when thinning a dataset of absorption and beam attenuation (at multiple wavelengths), temperature, salinity, and uncalibrated chlorophyll fluorometry, the following absolute mean change criteria resulted in about a 65% reduction in the overall file size:
Absorption (400 to 700 nm): 0.05/m
Beam attenuation (400 to 700 nm): 0.2/m
Temperature: 0.5 °C
Salinity: 0.2 ppt
Fluorometry: 0.05 V
To reiterate, the relative and absolute mean changes are computed for each of the parameters collected in a given profile. The percentage change criteria result in the saving of a given profile only if the absolute mean change also exceeds the user-provided threshold. The additional constraint for a minimum mean change of Z2 is based on the fact that at some low parameter value, even a large percentage change is unimportant. For example, if chlorophyll falls below 0.2 mg/m3, then even a change of 0.1 mg/m3 is still too small to be significant. Appendix A gives a more complete list of recommended absolute change thresholds to use with Eq. (2) to determine whether to save a given profile.
As previously discussed, AUV deployments collect more than one type of data. In deployments of interest to the WOOD archives, the AUVs are typically equipped with sensors to measure stratification (temperature and conductivity), biological properties (e.g., chlorophyll and bioluminescence), and optical properties (e.g., beam attenuation and scattering coefficients) as a function of depth. Because the data are multi-variate, the data-thinning algorithm must be able to concurrently examine the depth profiles of each variable. For a given profile, if any one of the variables exhibits a sufficient change to justify keeping that parameter’s depth profile, then the profiles from all the other parameters are stored as well. For example, temperature and salinity might change less than the specified criteria, but if the chlorophyll concentration exceeds its threshold, then all three variables are stored in the thinned data files. This approach ensures the maintenance of a synoptic, coherent representation of the multi-variate data: all thinned files contain the same profiles so that data may be compared across various parameters. This method produces a matching set of files that can be easily compared using the “joined query” option in WOOD. (The joined query option searches across multiple parameter tables using the unique profile identifier number that ensures a given profile is from the same original multi-variate profile as that selected from another parameter table.)
To run the data-thinning program—called SUBSMP4.EXE—one first sets up a simple input file, such as the one shown in Table 2. This file contains the filenames of all files (parameters) to be thinned. Next, the distance criteria for thinning are specified, followed by elapsed time and percentage depth change criteria. In this example, if any of the following fixed conditions occur, then the profile becomes the new reference profile and is added to the thinned file:
• The distance from the reference profile exceeds 5 nmi.
• Elapsed time exceeds 1.5 hr.
• The depth extent of the profile changes by more than 30 %.
The input file also has the parameter-specific Z1 (percentage change in structure) and Z2 (absolute mean change) “threshold” values to use to identify “significant” feature changes (i.e., a change that causes that profile to be kept in the thinned file regardless of the changes in the distance, elapsed time, or depth extent).
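The three fixed retention triggers from the example above (5 nmi, 1.5 hr, 30% depth-extent change) can be sketched as follows. This is an assumed Python rendering for illustration: profile positions are taken as (lat, lon) in degrees, times in hours, and maximum depths in meters, and a haversine great-circle distance is used, which may differ from the actual program's distance calculation.

```python
import math

EARTH_RADIUS_NMI = 3440.065  # mean Earth radius in nautical miles

def distance_nmi(lat1, lon1, lat2, lon2):
    """Great-circle distance between two profile locations (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2 +
         math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_NMI * math.asin(math.sqrt(a))

def fixed_criteria_triggered(ref, cur, max_nmi=5.0, max_hours=1.5,
                             max_depth_pct=30.0):
    """True when distance, elapsed time, or depth-extent change forces a save."""
    if distance_nmi(ref["lat"], ref["lon"], cur["lat"], cur["lon"]) > max_nmi:
        return True
    if cur["time"] - ref["time"] > max_hours:
        return True
    depth_pct = 100.0 * abs(cur["max_depth"] - ref["max_depth"]) / ref["max_depth"]
    return depth_pct > max_depth_pct
```

Any one of the three conditions is sufficient to keep the profile, independent of the structural-change tests.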
This input file, called SUBSMP4.INP, is provided to the thinning program from a DOS window by using the input redirection symbol (SUBSMP4 <SUBSMP4.INP). The inputs are stored in memory and the program begins iterating through the profiles in each input file. The first profile from each file is chosen as the initial reference profile and, for every parameter being thinned, is copied to a new, thinned file. The algorithm then compares the next profile with the reference: the absolute mean difference between them, as well as the percentage change of the mean difference, is computed and compared to the threshold values provided at the start of the process. If both values exceed their respective thresholds, the profile is saved to the new thinned file and becomes the new reference profile. (This test is done across all the parameters being thinned, so if any one parameter experiences a change that exceeds its threshold criteria, the profile is saved for all the parameters and assigned as the new reference profile.) If the thresholds are not met by any of the parameters under consideration, the profile is skipped and the subsequent profile is tested. The process repeats until the entire set of files has been examined. In this way, an accurate representation of each parameter is generated, and each file contains only the profiles that the algorithm deemed noteworthy across all data types.
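The iteration just described can be condensed into a short sketch. This is an illustrative Python version of the loop, not the FORTRAN code: each "file" is represented as a list of profiles (profiles at the same index across files come from the same location and time), and the change test is passed in as a callable so the sketch stays independent of any particular criterion.

```python
def thin(files, changed):
    """Thin multiple parameter files in lockstep.

    files:   {param_name: [profile, ...]}, all lists the same length.
    changed: callable(param_name, current_profile, reference_profile) -> bool.
    Returns the indices of retained profiles and one thinned list per parameter.
    """
    params = list(files)
    n = len(files[params[0]])
    kept = [0]          # the first profile is always the initial reference
    ref_idx = 0
    for i in range(1, n):
        # If ANY parameter changed significantly, keep profile i for ALL
        # parameters so the thinned files stay synchronized.
        if any(changed(p, files[p][i], files[p][ref_idx]) for p in params):
            kept.append(i)
            ref_idx = i  # the saved profile becomes the new reference
    thinned = {p: [files[p][i] for i in kept] for p in params}
    return kept, thinned
```

Note that the reference index is shared across parameters, which is what guarantees every thinned file contains exactly the same set of profiles.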
2.1 Preconditions for effective use of the thinning software
To achieve optimal results, several conditions should be met prior to using the software. First, to avoid falsely triggering the percentage or mean change criteria, the files used as input to this software should be “cleaned.” Cleaning entails the removal of any significant data artifacts, sporadic biases, errors, and noise spikes in the data. Large spikes caused by instrument malfunction or sudden changes in value due to, for example, scattering light off the ocean floor, are likely to be interpreted by this software as real variability in data structure. Note that the software’s inherent tendency to save such bad data is almost impossible to change because the algorithm is designed to preserve variability, and it has no way to discriminate between spurious and real changes. Thus, it is important to remove any false spikes and significant errors in the data prior to running them through this software. In some cases, one has to completely remove a profile that is deemed bad. However, this removal will produce unwanted discrepancies relative to the other parameters if the removal is not done for all the parameter files. As discussed next, software has been written to remove these discrepancies to ensure that the thinning algorithm still works properly when the cleansing/editing process creates differences in the number of profiles across multiple variables.
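As an illustration of the kind of pre-cleaning the thinning software expects, the following is a simple median-based despiker. This is not the authors' Matlab editor (which is manual); the window size, the MAD-based threshold, and the replace-with-median policy are all assumptions chosen for the sketch, and on noise-free data any deviation from the local median is treated as a spike.

```python
def despike(values, window=5, threshold=3.0):
    """Replace points that deviate from the local median by more than
    `threshold` times the local median absolute deviation (MAD)."""
    half = window // 2
    cleaned = list(values)
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        neighborhood = sorted(values[lo:hi])
        median = neighborhood[len(neighborhood) // 2]
        mad = sorted(abs(v - median) for v in neighborhood)[len(neighborhood) // 2]
        # When MAD is 0 (locally constant data), any deviation counts as a spike.
        if abs(values[i] - median) > threshold * mad:
            cleaned[i] = median  # replace the spike with the local median
    return cleaned
```

A robust automated despiker would also need to distinguish genuine step changes (e.g., crossing a front) from instrument artifacts, which is precisely the unsolved problem noted in Section 4.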
A second precondition for running the software is to ensure that the files containing the various parameters have identical sets of profiles. The reason is that the software iterates sequentially through the individual data profiles within a given data file, and it does so concurrently for multiple files. If a profile is missing from any one file, it must be missing from all the parameter files to avoid a mismatch in the profile being thinned. To force every data file to have the same number of profiles, a FORTRAN program called MATCH2 was written to sort through two files, find all the common profile “cast numbers,” and then output two new files with identically matching profiles. A slower but more general-purpose Matlab function, called “profile_intersect,” was also written. This function allows the user to enter any number of input file names; the output is a set of files (with the extension INT, for INTERSECTION) having identically matching sets of profiles.
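The matching step can be sketched in a few lines. This is an illustrative Python function in the spirit of MATCH2 and profile_intersect, not a transcription of either: each "file" is assumed to be a mapping from cast number to profile, and only the cast numbers common to every parameter file are retained.

```python
def profile_intersect(files):
    """Keep only the casts present in every parameter file.

    files: {param_name: {cast_number: profile}}.
    Returns the same structure restricted to the common cast numbers.
    """
    # Intersect the cast-number sets of all parameter files.
    common = set.intersection(*(set(casts) for casts in files.values()))
    return {param: {c: casts[c] for c in sorted(common)}
            for param, casts in files.items()}
```

After this step, every parameter file has an identical profile set, so the thinning loop can safely index all files in lockstep.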
3. Results

To date, this algorithm has been used on many sets of over-sampled multi-variate data, each time with positive results. Figures 1 and 2 show how the algorithm significantly reduces the size of a dataset without compromising the overall structure of the data. This particular dataset was reduced by 72% while maintaining its overall shape. It is important to note that this relatively small dataset (179 profiles) was used as a test; in general, the algorithm is most useful for files containing thousands or even hundreds of thousands of profiles.
While the above figures are certainly evidence of this algorithm’s ability to maintain data structure, they do not demonstrate the capability of thinning based on the combined effects of structural (oceanic) changes plus elapsed time plus geographical location. The following map (Fig. 3) is the result of running the algorithm on a complicated dataset of 4,967 temperature, salinity, beam attenuation coefficient (c660 nm), yellow matter fluorescence, and chlorophyll fluorescence profiles taken off the coast of Australia. (These SeaSoar data were provided by Frank Bahr at WHOI.) When plotted on this scale, the original (“un-thinned”) profile locations merge into what appears to be a single blue line (they are actually discrete asterisks so close together that they become indistinguishable). After being thinned for time, space, depth extent, and data structure, the thinned data occur at only the red asterisks. As expected, the thinned data are sampled less frequently in the deeper waters than in the waters closer to the continental shelf (indicated by the 200-m depth contour).
The corresponding sets of 1,269 thinned profiles have been examined, and they provide a representative subset of the original data. For example, Fig. 4 shows the chlorophyll fluorescence profiles before and after the thinning algorithm has been applied, and it is clear that the algorithm tracks the structural variations that occur along the track of the sensor system.
By examining the diagnostic outputs from the program, one finds that, in this example, most of the spacing in the deeper water is due to the software’s capability to thin based on elapsed time and distance. The sampling gets closer in space and time in the shallower waters where more structural (and depth extent) variations tend to occur.
The examples shown in Figs. 1 through 4 demonstrate qualitatively how the data-thinning software works. To quantitatively verify that the code was working correctly, several simple test cases were run in which a single criterion was configured to trigger all of the saved profiles. For example, a test file was created containing many copies of a single profile, with successive copies truncated by discrete percentages of the original depth to test the percentage depth change logic. In another test, all the threshold criteria except distance were set to effectively infinite values so they could not cause a profile to be saved; a test dataset with a geographical spacing of exactly 1.0 nmi was then run through the code, and the output was correctly thinned to the expected 5-nmi interval. For more details about the quantitative testing performed to date, see page 12 of Barrett and Smart [8].
4. Discussion and recommendations
For certain specialized applications, such as an assessment of internal wave activity, data thinning should not be applied. However, for the purpose of archiving a representative sample in a relational database (i.e., our specific application), some kind of thinning is virtually essential. One might argue that our change thresholds (Table 2 and Appendix A) are too arbitrary and that the thinning criteria should instead be computed from an initial analysis of the degree of variability present in the original data. This argument has merit, and we do sometimes adjust the thinning criteria to account for the variability in the original data. However, we do not automate that process because of the following danger: applying a purely statistics-based approach to a dataset with almost no variability would produce very small change thresholds, causing sensor noise to be saved as if it were significant structure; similarly, a dataset with a high degree of variability would lead to unusually large change thresholds that could discard meaningful features. We have avoided that problem by using “subject matter expert” thresholds that can be kept uniform across similar datasets and are (if anything) overly conservative (we err on the side of setting the thresholds too low to avoid losing any meaningful variability). A semi-automated approach for setting thresholds is discussed further in Appendix A.
Regardless of how change thresholds are obtained and applied, one must first ensure that the data are free of noise spikes or other artifacts. Although data artifacts are common in raw bio-optical data, there is still a significant lack of robust tools for cleaning files and especially for removing noise spikes. Although we have developed a powerful Matlab on-screen editor that allows the user to manually identify and remove spikes and bad data using a mouse, we still lack a reliable automated spike editor, i.e., a tool capable of scanning a file and intelligently removing spikes due to bad data while ignoring genuine changes. We have asked numerous oceanographers and several commercial companies if they have such a tool, and we have done web searches to discover what other scientists are using for this purpose, but to date, we have not found a general-purpose solution for this need. As the data-thinning software works best when noise spikes are absent, a reliable automated spike editor would be an extremely beneficial complement.
Finally, the software is currently implemented in FORTRAN. While FORTRAN is very fast, it makes the code difficult to edit and extend. Ideally, this algorithm would eventually be reimplemented in Java, where it could be more easily expanded and shared. A Java version would not be as fast as the current FORTRAN code, but the benefit of easy extension probably outweighs that disadvantage.
The advent of autonomous data-gathering techniques, such as AUVs, has yielded a massive influx of oceanographic data. This newfound wealth of data, while helpful for analysis and research, creates a dilemma when sharing these data through internet databases. As a result, a solution was needed to take a large amount of data and extract an accurate but much smaller representation. The algorithm discussed above does this job quite well, but there is still room for improvement. For example, adding a pattern recognition capability and an associated change threshold look-up table would make the software easier to use, and updating the software’s architecture/language (e.g., to Java) would make the code easier to extend to other applications and also more portable. As large datasets become increasingly prevalent, solutions like this one will become increasingly important to the scientific community and especially to those charged with archiving their results. Finally, the task of thinning over-sampled data will be significantly aided by the development of a capable suite of data-cleansing/editing tools.
Appendix A

Equation (2) defined the method used to test for a significant absolute change in a given parameter. The table below provides the recommended threshold values against which the result of that equation is compared.
The current software requires the user to manually provide such threshold values for every file processed. We are considering the following improvement: give the algorithm a memory that spans multiple uses; this memory would record threshold values versus results and would rank these pairs by accuracy and user preference. This table of ranked threshold-result pairs would then be available to the software as a reference for “recommended” or default threshold values. Instead of the user manually trying several threshold values before settling on an optimal set, the software itself would be able to look up “good” threshold values determined from previous usage. In this way, extended use of the software would result in faster, better results; the software would “learn” the unique patterns and properties of each parameter and be able to thin them more efficiently. However, such learning should probably be associated with regions and seasons that are known to have similar properties. Research needs to be done to determine how to specify a set of properties defining similarity in such a supervised learning system.
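The proposed “memory” could be structured along the following lines. This sketch is entirely hypothetical, since the feature does not yet exist: past runs are recorded as (parameter, region, season) → (threshold, score) pairs, and the best-scored threshold is offered as the default for a matching context. The class and method names, and the idea of a single numeric user-preference score, are all assumptions.

```python
from collections import defaultdict

class ThresholdMemory:
    """Hypothetical lookup table of threshold values learned from past runs."""

    def __init__(self):
        self._runs = defaultdict(list)

    def record(self, parameter, region, season, threshold, score):
        """Store the outcome of one thinning run (higher score = better result)."""
        self._runs[(parameter, region, season)].append((threshold, score))

    def suggest(self, parameter, region, season, default=None):
        """Return the best-scored threshold seen for this context, if any."""
        runs = self._runs.get((parameter, region, season))
        if not runs:
            return default
        return max(runs, key=lambda pair: pair[1])[0]
```

Keying the table on region and season reflects the caveat above: a threshold learned for one oceanographic regime should not silently carry over to a dissimilar one.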
The absolute change criteria provided in this Appendix are based on years of experience working with this particular kind of data, and the values also reflect the known accuracy of measurement systems. Nevertheless, a more objective approach would be to first screen each variable for its inherent variability within the unthinned data, and to provide statistics (such as the mean and standard deviation, the minimum and maximum values, the 5 and 95 percentile values, etc.) to the user. This information could then be combined with the expert’s knowledge about such data to make a better-informed decision as to the change criteria for each parameter.
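The pre-screening step suggested above can be sketched as a small summary function. This is illustrative only; the nearest-rank percentile convention and the use of the population standard deviation are assumptions, and in practice these statistics would be presented to the expert rather than used to set thresholds automatically.

```python
def variability_summary(values):
    """Summarize one parameter's variability across the unthinned data."""
    ordered = sorted(values)
    n = len(ordered)
    mean = sum(ordered) / n
    std = (sum((v - mean) ** 2 for v in ordered) / n) ** 0.5

    def pct(p):
        # Nearest-rank percentile (an assumed convention for this sketch).
        return ordered[min(n - 1, int(p / 100.0 * n))]

    return {"mean": mean, "std": std,
            "min": ordered[0], "max": ordered[-1],
            "p05": pct(5), "p95": pct(95)}
```

An expert could, for example, compare a candidate absolute change threshold against the standard deviation or the 5–95 percentile spread before committing to it.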
Suggested minimal change required to be “significant” for saving to World-wide Ocean Optics Database (WOOD)
This work was supported by the Office of Naval Research (ONR) Grant # N000149810773, “World-Wide Ocean Optics Database (WOOD),” funded by Dr. Steve Ackleson (Code 322OP). Special thanks are also due to Frank Bahr (WHOI) and Oscar Schofield (Rutgers), who provided numerous high-quality datasets used in this work.
The Office of Naval Research (ONR) has funded The Johns Hopkins University Applied Physics Laboratory (JHU/APL) to build and maintain the WOOD (see http://wood.jhuapl.edu).
References and links
1. D. Olbers, Ocean Waves, Volume 3: Oceanography, J. Sündermann, ed. (Springer-Verlag, 1986), Chap. 6.
2. R. E. Thomson, S. E. Roth, and J. Dymond, “Near-inertial motions over a mid-ocean ridge; effects of topography and hydrothermal plumes,” J. Geophys. Res. 95, 7261–7278 (1990). [CrossRef]
3. J. Dunning and L. Hutchings, “TUORs for marine monitoring,” International Ocean Systems, November/December 2005. http://www.chelsea.co.uk/Technical%20Papers/SA%20Nushuttle%20article%20TP11%2012.pdf.
4. E. Keegan, “Aquashuttle monitors Exxon Valdez oil spill,” Spill Science & Technology Bulletin 2, 87–88 (1995). [CrossRef]
5. R. Burt, “The growth in towed undulating vehicles for oceanographic data gathering,” in Proceedings of Oceans 2000 MTS/IEEE, Providence, RI, 11 September 2000 (IEEE, 2000), Vol. 1, pp. 641–645.
6. K. H. Brink, F. Bahr, and R. K. Shearman, “Alongshore currents and mesoscale variability near the shelf edge off northwestern Australia,” J. Geophys. Res. 112, C05013, doi:10.1029/2006JC003725 (2007). [CrossRef]
7. O. Schofield, J. Kohut, D. Aragon, L. Creed, J. Graver, C. Haldeman, J. Kerfoot, H. Roarty, C. Jones, D. Webb, and S. Glenn, “Slocum gliders: robust and ready,” J. Field Robotics 24, 1–13 (2007). [CrossRef]
8. K. Barrett and J. Smart, “Sub-sampling software for environmental profile data,” The Johns Hopkins University APL Internal Memorandum STF-06-095, 30 June 2006.