Ensuring the security of water quality in distribution systems has recently attracted global attention because of the potential threat from harmful contaminants. Real-time monitoring based on ultraviolet optical sensors is a promising technique: it is reagent-free, requires little maintenance, and offers rapid analysis and wide coverage. However, ultraviolet absorption spectra are large and easily subject to interference. In on-site applications, there is almost no prior knowledge, such as the spectral characteristics of potential contaminants, before the contaminant is identified. Meanwhile, the concept of normal water quality also varies with the operating conditions. In this paper, a procedure based on multivariate statistical analysis is proposed to detect distribution water quality anomalies using ultraviolet optical sensors. First, principal component analysis is employed to capture the main features of variation in the spectral matrix and to reduce the dimensionality. A new statistical variable is then constructed and used to evaluate the local outlying degree according to the chi-square distribution in the principal component subspace. The probability that the latest observation is an anomaly is calculated by accumulating the outlying degrees of the adjacent previous observations. To develop a more reliable anomaly detection procedure, several key parameters are discussed. With the proposed method, distribution water quality anomalies and abnormal optical changes can be detected. A contaminant intrusion experiment is conducted in a pilot-scale distribution system by injecting phenol solution. The effectiveness of the proposed procedure is finally verified using the experimental spectral data.
© 2015 Optical Society of America
Ensuring drinking water safety in water distribution systems is of great significance for public health and social security, and it has attracted considerable attention from water utilities and researchers around the world [1–3]. Detecting anomalies in water quality is an effective way to indicate potentially harmful compounds resulting from accidental spills or malicious poisoning. In particular, early detection of anomalies is critical in helping utility personnel to distinguish water quality events from natural variation and to initiate an effective response that can reduce adverse impacts [4, 5].
The notion of anomaly detection here refers to finding observations in the data that do not conform to normal behavior. Anomaly detection is used extensively in a variety of areas, such as the detection of traffic accidents, financial fraud and network intrusions. In recent decades, especially after the terrorist attacks of September 11th, anomaly detection for water quality has attracted a lot of interest. Methods for detecting unusual changes in water distribution systems can be found in the literature. The basic strategy of the existing anomaly detection procedures involves evaluating standard distribution water quality parameters. The value sequences of standard parameters, including pH, free chlorine, oxidation reduction potential (ORP), conductivity, turbidity and dissolved oxygen (DO), are evaluated using anomaly detection algorithms such as classification, clustering, statistical models or sequence analysis [3, 8]. Yang, for example, detected and classified reactive contaminants in a drinking water pipe based on chlorine residual loss and its variations [3, 9]. Hall investigated the response of standard parameters to a secondary effluent challenge and found that sensors for free chlorine, total organic carbon (TOC), ORP, conductivity and chloride respond to most contaminants. These previous works are effective and meaningful, and they refine the role of standard parameters in early warning systems. Some of these standard parameters, such as pH and conductivity, are measured using membranes or electrodes, and their measurement is almost real-time. For others, including free chlorine and TOC, the measurement is mainly based on wet chemistry and sequential injection analysis, which consume both time and reagents. Because toxic reagents are used, waste collection and further treatment are also required. Moreover, the instruments are expensive and need frequent maintenance. In addition, each instrument generally measures only one parameter or one contaminant, so many instruments are needed to form an anomaly detection system with wide coverage. These disadvantages make high-frequency analysis (every two minutes or faster) impossible. The low time resolution of the value sequences leads to a higher risk of missed and delayed detection.
One general solution for detection is based on UV spectral data, which have been used in water quality monitoring as an alternative measure of TOC. The UV spectrum is the absorption in the ultraviolet region (namely 200~400nm), which is associated with benzene rings or conjugated double bonds. These functional groups exist widely in various contaminants, especially in man-made industrial compounds, so the UV spectrum can be employed to determine the existence and even the rough concentration of these organic contaminants. A number of studies have shown the application of UV spectral analysis to wet chemical process monitoring and drinks classification [13–15]. For online water quality monitoring, UV spectral analysis is mainly used to estimate the level of TOC, biological oxygen demand (BOD) or chemical oxygen demand (COD) as a rapid alternative measurement [16, 17]. The UV spectral technique is less time-consuming and reagent-free, and there is almost no need for sample preparation, waste treatment or regular maintenance. Several frameworks have been proposed to detect specific contaminants in water based on these estimated parameters [18, 19]. However, the information contained in the UV spectra is much richer than that in conventional surrogate parameters. Hence, there are also anomaly detection procedures that evaluate abnormal deviations of the spectra directly. Although its lower detection limit is slightly inferior to that of standard analytical methods such as liquid chromatography, the UV spectrum has the advantages of low cost, easy application, and rapid and continuous monitoring. These advantages make the UV spectrum an alternative monitoring tool and a trigger for more accurate analysis. Simple alarm parameters have been generated from time-resolved delta spectrometry to identify low probability/high impact contamination events [4, 20]. Similar studies can be found in wastewater quality monitoring. Durrenmatt, for example, identified unusual changes in wastewater caused by the discharge of industrial wastewater by clustering influent UV-Vis spectra.
Despite the remarkable work in the previous literature, several problems remain. First, there are a variety of anomaly types corresponding to different contaminants and concentrations, while the potential contaminants are numerous and cannot all be listed. Hence, the distribution of anomalies cannot be determined in advance, and fixed trained models may be unreliable. Second, anomaly detection can also be implemented by converting the spectrum to standard surrogate parameters such as TOC or COD and then applying conventional anomaly detection algorithms to the value sequence. However, this strategy is inefficient because it compresses the rich information into a limited number of value sequences, and the regression models used for the conversion also require regular correction. Finally, the UV spectrum is easily affected by instrument noise and operating conditions such as temperature, dispersive light and turbulence. Meanwhile, the concept of normal water quality varies slightly between production batches and water supply plants.
To address these problems, this study proposes a novel adaptive approach that combines principal component analysis (PCA) with a statistical analysis model for UV spectral data evaluation and anomaly detection in distribution water quality monitoring. The approach builds on the wide coverage and rapid response of UV spectral analysis for potential organic contaminants. PCA is employed to project the UV absorption spectra onto a smaller but representative data space because of its sensitivity to the major directions of change. A new statistical variable is then constructed to evaluate the outlying degree of the latest observation in the principal component subspace. The probability of an anomaly is calculated by taking the adjacent previous observations into consideration. The proposed approach is finally applied to an organic contaminant intrusion conducted in a pilot-scale distribution system.
In this paper, anomaly detection in UV spectral data for distribution water quality monitoring is studied. The evaluation of UV spectra is a very difficult task. The UV spectrum is a wide-coverage indicator because ultraviolet-absorbing functional groups exist widely in potential organic contaminants. These contaminants have different absorption characteristics and different key wavelengths in the UV range; therefore, the optimal monitoring wavelength can scarcely be specified before the intruding contaminant is identified. Screening against a database of high-risk contaminants would be helpful. However, the UV spectrum has low compound specificity and is ambiguous because of the low concentration of the intruding contaminants. Meanwhile, a high-risk contaminant database is restricted to regional features, and a large amount of statistical work is needed when it is applied to a new region. Hence, the whole spectrum, containing hundreds of wavelengths, should be monitored and evaluated despite its large size. The UV spectrum is also easily affected by operating conditions such as temperature, dispersive light and turbulence. The interference and the large size increase the difficulty of signal extraction and make direct analysis almost impossible. Meanwhile, there are natural changes in water quality due to the operation batch and long-term changes in the source water. Instrument drift also degrades a model that cannot adapt. To address these problems, a novel algorithm combining effective information extraction and adaptive statistical modeling is investigated; its main stages are depicted in Fig. 1 and described in the following sections.
2.1 Principal components analysis
Principal components analysis (PCA) is an efficient mathematical tool for data compression and information extraction. Numerous applications can be found in traffic monitoring, financial fraud detection, fault diagnosis, image recognition and, recently, spectral analysis. The general objective of the method is to transform the original matrix into a new set of orthogonal variables that are linear combinations of the original variables [23]. The covariance matrix is diagonal if the variables are independent and orthogonal, so the procedure diagonalizes the correlation matrix using its eigenvalues and eigenvectors. The first principal component (p1) is associated with the largest eigenvalue and represents the new coordinate direction accounting for the maximum variability in the spectral data. The second principal component (p2) accounts for the maximum of the remaining variability not explained by p1, and so on for the others. In this new coordinate system, observations can be expressed by their projections on the corresponding new variables, where the projections are linear combinations of the original absorbances at each wavelength.
As an effective dimensionality reduction method, PCA arranges the principal components in decreasing order of importance. The first few principal components can be extracted to express the characteristics of the original matrix while the others, of less importance, are discarded [28], as shown in Eq. (3).
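As a hedged illustration of the projection described in this section, the following Python sketch mean-centers a spectral matrix, diagonalizes its covariance matrix and keeps the scores on the first f components. The function name `pca_scores`, the use of NumPy, the choice of the covariance rather than the correlation matrix, and the synthetic data are all assumptions made for illustration; this is not the Matlab implementation used in the study.

```python
import numpy as np

def pca_scores(X, f):
    """Project the rows of X (observations x wavelengths) onto the first f
    principal components. Returns (scores, eigenvalues, loadings), with
    components sorted by decreasing eigenvalue."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # mean-center each wavelength
    cov = np.cov(Xc, rowvar=False)           # covariance between wavelengths
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # re-sort in descending order
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    loadings = eigvecs[:, :f]
    scores = Xc @ loadings                   # linear combinations of absorbances
    return scores, eigvals, loadings

# Synthetic stand-in for a spectral matrix (hypothetical data):
# 200 observations at 5 wavelengths driven by one latent factor plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[1.0, 0.8, 0.6, 0.4, 0.2]]) + 0.05 * rng.normal(size=(200, 5))
scores, eigvals, loadings = pca_scores(X, f=2)
```

In this synthetic case a single component carries nearly all of the variance. The experimental spectra discussed later behave very differently, with the variance spread over many components.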
2.2 Calculation of outlying degrees
In this study, the latest observation is evaluated using an adaptive model rather than a fixed trained model. The adaptive model is trained on a moving window, which is a subset consisting of the latest several hundred previous observations. When a new observation is measured, the window moves forward by replacing the earliest observation with the latest one. This adaptive model has the flexibility to track changes in normal water quality accurately. A basic assumption in anomaly detection techniques is that normal data are similar to the majority of the data in the moving window: in a stochastic model, normal observations occur with high frequency while anomalies occur with low probability. The spectra are transformed into a subspace formed by the first several principal components, where normal observations and anomalies are more easily distinguished. The probability of an outlier is calculated based on principal components analysis and a statistical model, with the main stages and the inference process depicted in Fig. 2.
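The moving-window update can be sketched in a few lines of Python. The class name `MovingWindow` and the toy integer observations are hypothetical; in practice each observation would be a full spectrum, and the window size would be the few hundred observations mentioned above.

```python
from collections import deque

class MovingWindow:
    """Fixed-size buffer holding the most recent observations."""

    def __init__(self, size):
        self.buffer = deque(maxlen=size)

    def update(self, observation):
        # Appending to a full deque discards the earliest observation,
        # which is the "move forward" step described in the text.
        self.buffer.append(observation)

    def is_full(self):
        return len(self.buffer) == self.buffer.maxlen

# Toy usage with a window of 3 observations (hypothetical size).
window = MovingWindow(size=3)
for step in range(5):
    window.update(step)
```

After the five updates the window holds only the three latest observations, so a model refitted on `window.buffer` at each step follows slow changes in the background.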
In this case, because of interference from instrument noise and normal changes in the water flow, the spectrum deviates with a small amplitude. Contaminant intrusion is assumed to be rare in this study, so the spectral signal is mainly affected by repeatable, non-directional interference and obeys a Gaussian distribution. For anomalies, by contrast, the deviations are much larger and last for several adjacent measurements with the same deviating direction, which is determined by the spectral absorption characteristics of the intruding contaminant. PCA captures the main deviating directions of the spectral matrix in the moving window and records the scores of each spectrum on these directions. If there are anomalies in the moving window, their deviating direction will be one of the first principal components, according to the nature of PCA, and in this direction the score of an abnormal spectrum is much larger than that of normal ones. An anomaly can therefore be detected by evaluating the score sequences on each direction, denoted t in this study. However, because the first f score vectors retained in PCA need to be evaluated simultaneously, data fusion is necessary before or after the evaluation. In this study, the sum of the squares of the first f standardized scores is employed as an integrated indicator for anomaly detection; it is equivalent to the squared Mahalanobis distance of the observation from the mean center of the moving window [6]. A significance level α can be found to satisfy Eq. (5), based on the chi-square cumulative distribution function.
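A minimal Python sketch of this indicator follows. It assumes that the scores are mean-centered, that each score is standardized by the eigenvalue (score variance) of its component, and that the outlier probability is read directly from the chi-square cumulative distribution function with f degrees of freedom. That last mapping is an interpretation of the text, since Eq. (5) is not reproduced here. The function names are hypothetical, and the chi-square CDF is evaluated with a standard series for the regularized incomplete gamma function so that only the standard library is needed.

```python
import math

def chi2_cdf(x, k):
    """Chi-square CDF with k degrees of freedom, using the series expansion
    of the regularized lower incomplete gamma function P(k/2, x/2)."""
    if x <= 0:
        return 0.0
    a, z = k / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    value = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return min(1.0, value)

def outlying_degree(scores, eigenvalues):
    """Sum of squared standardized scores on the retained components."""
    return sum(t * t / lam for t, lam in zip(scores, eigenvalues))

def outlier_probability(scores, eigenvalues):
    """Assumed mapping: chi-square CDF of the outlying degree with f = len(scores)."""
    return chi2_cdf(outlying_degree(scores, eigenvalues), len(scores))

# Hypothetical observation with two retained components.
d2 = outlying_degree([1.0, 2.0], [1.0, 4.0])
p = outlier_probability([1.0, 2.0], [1.0, 4.0])
```

For two degrees of freedom the CDF has the closed form 1 - exp(-x/2), which gives a convenient check of the series.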
2.3 Detection of anomaly
An outlier in the spectral data can be recognized as a point water quality event that lies outside the majority. However, a transient point anomaly that lasts for only one or two measurements is more likely to result from environmental interference or instrumental noise than from the contaminant intrusion that is actually of concern. The longer an anomaly lasts, the higher the probability of a water quality event. To increase the detection accuracy of the proposed approach, a trust mechanism developed from the cumulative sum is introduced to evaluate the probability of a water quality anomaly according to the adjacent previous observations, as shown in Eqs. (8) and (9).
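One plausible reading of this accumulation is sketched below in Python. Eqs. (8) and (9) are not reproduced in this text, so the normalized weighted average used here is an assumption, as are the function names. The weights follow the one-sided Gaussian described in Section 4.1.4, with a standard deviation equal to one third of the cumulating scale.

```python
import math

def gaussian_weights(scale):
    """One-sided Gaussian weights for lags 0..scale-1 (lag 0 = latest),
    with sigma = scale / 3, normalized to sum to 1."""
    sigma = scale / 3.0
    raw = [math.exp(-(lag ** 2) / (2.0 * sigma ** 2)) for lag in range(scale)]
    total = sum(raw)
    return [w / total for w in raw]

def event_probability(outlier_probs, scale=6):
    """Weighted average of the latest `scale` outlier probabilities.
    `outlier_probs` is ordered in time, with the latest value last."""
    if not outlier_probs:
        raise ValueError("at least one outlier probability is required")
    recent = outlier_probs[-scale:]
    weights = gaussian_weights(scale)[:len(recent)]
    norm = sum(weights)
    # weights[0] belongs to the latest observation, hence the reversal.
    return sum(w * p for w, p in zip(weights, reversed(recent))) / norm

# A single transient spike versus a sustained deviation (hypothetical values).
spike = event_probability([0.1, 0.1, 0.1, 0.1, 0.1, 0.99])
sustained = event_probability([0.99] * 6)
```

Under these assumptions the single spike stays well below a 0.90 alarm threshold while the sustained deviation exceeds it, which mirrors the intended behavior of the trust mechanism.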
2.4 Receiver operating characteristic curve
The receiver operating characteristic (ROC) curve is employed to evaluate the ability of the proposed algorithm to discriminate between anomalous and normal observations. A ROC curve plots PD against FAR over a range of thresholds, as shown in Table 1. PD, the probability of detection, is the number of anomalies correctly detected divided by the total number of anomalies, which is known in a simulation experiment. FAR, the false alarm rate, is the number of normal observations incorrectly reported as anomalies divided by the total number of truly normal observations. The threshold in this study is the anomaly probability above which a water sample is confirmed as an anomaly. Both PD and FAR rise from 0 to 1 as the threshold falls. However, if the difference between anomalous and normal observations is pronounced, PD rises faster than FAR, so the ROC curve of a well-performing algorithm approaches the upper left corner. The ability of an event detection algorithm can be measured by the area under the ROC curve, which ranges from 0.5 to 1.
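The construction of the ROC curve and its area can be illustrated with a short Python sketch. The function names, the toy probabilities and the threshold grid are hypothetical; the area is computed with the trapezoidal rule, which is a common choice rather than one stated in the paper.

```python
def roc_points(probabilities, labels, thresholds):
    """Return one (FAR, PD) pair per threshold.
    labels[i] is True when observation i is a known anomaly."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = []
    for th in thresholds:
        detected = sum(1 for p, y in zip(probabilities, labels) if y and p >= th)
        false_alarms = sum(1 for p, y in zip(probabilities, labels) if not y and p >= th)
        points.append((false_alarms / negatives, detected / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the (FAR, PD) points after sorting by FAR."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy data: two normal observations and two anomalies (hypothetical values).
probs = [0.1, 0.4, 0.35, 0.8]
truth = [False, False, True, True]
points = roc_points(probs, truth, thresholds=[0.0, 0.2, 0.375, 0.6, 1.01])
auc = area_under_curve(points)
```

The threshold grid should start at or below the smallest probability and end above the largest one, so that the curve spans from (0, 0) to (1, 1).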
3. Applications and results
A simple simulation is conducted in this section to verify the effectiveness of the proposed procedure before the experimental testing. A total of 500 spectra of normal distribution water are acquired by on-line monitoring using the spectrometer probe (S:CAN analysis::lyser). Two Gaussian curves, defined in Eqs. (10) and (11), are superimposed on the spectra from the 245th to the 254th step and from the 445th to the 454th step, respectively. Each step is an individual spectral measurement, taken at a 30 s interval. The intensity of the Gaussian curves is set to 5σ, with reference to the widely used 3σ threshold method; a larger intensity is used because of the particular shape of the Gaussian curve.
Figure 3 shows how the absorbance at 270nm and 370nm varies with the change in concentration. Almost no method based on a single-wavelength threshold can achieve anomaly detection in this case, because the optimal wavelength is uncertain. The simulated spectral data are then analyzed using the proposed procedure, with the parameters set as listed in Table 2.
Figure 4 shows the analytical results of the proposed procedure. The simulated contaminant intrusions can be clearly identified regardless of the key wavelengths. There is a delay of 2-3 steps due to the probability accumulation, which is a necessary trade-off for the stability of the procedure. The simulation thus verifies the effectiveness of the proposed procedure.
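The way such simulated events can be built is sketched below in Python. Eqs. (10) and (11) are not reproduced in this text, so the Gaussian form, the 10nm width, the noise level and the flat background are hypothetical. The peak positions of 270nm and 370nm are taken from the wavelengths shown in Fig. 3 and are likewise an assumption about the simulated curves.

```python
import math

def gaussian_bump(wavelengths, center_nm, width_nm, amplitude):
    """Gaussian-shaped absorbance added by a simulated contaminant."""
    return [amplitude * math.exp(-((w - center_nm) ** 2) / (2.0 * width_nm ** 2))
            for w in wavelengths]

def overlay(spectra, wavelengths, first_step, last_step, center_nm, width_nm, amplitude):
    """Return a copy of `spectra` with the bump added to steps
    first_step..last_step (1-based, inclusive)."""
    bump = gaussian_bump(wavelengths, center_nm, width_nm, amplitude)
    out = [list(row) for row in spectra]
    for step in range(first_step, last_step + 1):
        out[step - 1] = [a + b for a, b in zip(out[step - 1], bump)]
    return out

wavelengths = [200 + 2.5 * i for i in range(81)]             # 200 to 400 nm
background = [[0.0] * len(wavelengths) for _ in range(500)]  # placeholder spectra
sigma_noise = 0.001                                          # hypothetical noise level
simulated = overlay(background, wavelengths, 245, 254, 270.0, 10.0, 5 * sigma_noise)
simulated = overlay(simulated, wavelengths, 445, 454, 370.0, 10.0, 5 * sigma_noise)
```

With real data, `background` would be the 500 measured spectra and `sigma_noise` would be estimated from them.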
3.2 Experimental setup and water quality events
The experimental testing with a real contaminant is conducted in a pilot PVC-U water distribution pipe, 6m long and 1.50cm in diameter, as shown in Fig. 5. The material is hard to corrode, so interactions with the pipe wall are assumed to be negligible. The distribution water used in the following experiments is supplied by Hangzhou Water Affair Holding Group Co. LTD., the water supply utility for the city. The experimental setup is connected to the Hangzhou distribution system and collects water directly from the operating network. The flow rate and linear velocity are around 5L/min and 0.472m/s, respectively. The contaminant solution is injected into the pipe flow using a peristaltic pump to simulate a contaminant intrusion. The injection flow rate is controlled by a feedback system according to the pipe flow rate, at a dilution ratio of one to fifty. The distribution water and the injected solution are then mixed by a downstream static mixer, 20cm long and 2cm in diameter. The static mixer simulates the diffusion process along the radial direction: it mixes the distribution water with the injected solution and ensures that the concentration is approximately uniform along the radial direction for each slice of the pipe. The UV spectra are acquired by the spectrometer probe (S:CAN analysis::lyser) with the light path submerged in a 5cm high overflow pipe. The overflow pipe is 5.5m downstream from the injection port, so the hydraulic residence time for this experimental setup is 11.67s. The spectrophotometer measures with a 1cm light path every 30 seconds. The absorption spectra from 200 to 750nm are acquired and recorded in 2.5nm wavelength increments.
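The quoted velocity and residence time can be checked with simple arithmetic. The Python sketch below assumes that 1.50cm is the inner diameter, that the pipe runs full, and that the static mixer and overflow pipe do not change the mean velocity; the variable names are illustrative.

```python
import math

flow_l_per_min = 5.0   # main pipe flow rate
diameter_m = 0.015     # assumed inner diameter of the pipe
distance_m = 5.5       # injection port to overflow pipe

flow_m3_per_s = flow_l_per_min / 1000.0 / 60.0
area_m2 = math.pi * (diameter_m / 2.0) ** 2
velocity_m_per_s = flow_m3_per_s / area_m2           # mean linear velocity
residence_time_s = distance_m / velocity_m_per_s     # hydraulic residence time

injection_l_per_min = flow_l_per_min / 50.0          # one-to-fifty dilution ratio
```

Under these assumptions the computed values agree with the 0.472m/s and 11.67s given in the text to within rounding.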
Phenol, a frequently detected organic contaminant that is listed as one of the conventional indicators in the Chinese national standard for drinking water quality (GB5749-2006), is employed as the test contaminant. As a typical aromatic compound, phenol absorbs significantly in the ultraviolet band. A test is carried out before the experiment to verify whether the spectral changes resulting from the intruding phenol can be detected by the spectrometer probe. Solutions of phenol in distribution water are prepared at concentrations varying from 0ug/L to 1000ug/L and scanned by the same spectrometer probe used in the experiments. The spectral data are mean-centered and plotted in Fig. 6 with the background removed. As Fig. 6 illustrates, the phenol peak is located at 270nm, consistent with the experimental results published in the previous literature. The amplitudes of the peak are plotted against the phenol concentration. The spectral absorbance is approximately linear with concentration, which is consistent with the Lambert-Beer law. The dashed line in Fig. 6 is the standard peak absorbance at 270nm according to the literature. A small decrease can be observed for the two highest concentrations. This may result from the aqueous background and from a loss of phenol due to inadequate mixing or side reactions. Spectra for concentrations below 100ug/L can hardly be distinguished from the background in Fig. 6. The signal-to-noise ratios are plotted against concentration for both the whole spectrum and the absorbance at 270nm. As illustrated in Fig. 7, the signal-to-noise ratios for the three lowest concentrations are below 3 even at the wavelength where the signal is strongest. According to the test, a phenol intrusion at a concentration below 100ug/L is hardly detectable by the experimental setup. A longer light path, a higher-resolution spectrophotometer, and physical or chemical contaminant extraction methods such as chromatography or surface-enhanced techniques are needed to recognize micro-pollution events. The detection limit of the UV spectrum is inferior to the 2ug/L listed in the Chinese national standard. However, the standard analytical method is not compulsory in China, so it is performed rather infrequently because of its noticeable drawbacks. In the intrusion cases that have already occurred in Chinese distribution systems, the concentrations were much higher than 100ug/L. Hence, the UV spectrum can be regarded as an effective trigger for the standard analytical method.
The experiment lasted for 21 hours and a total of 2500 spectra were acquired. The phenol solution for injection is prepared from the same fresh tap water that flows in the experimental setup and a 1g/L phenol standard solution (AR, provided by Aladdin Reagent Co. Ltd.). The injection solution is prepared right before the injection, so photolysis, volatilization and oxidation of the phenol can be ignored. The injection solution is prepared at different concentration levels from 150ug/L to 5mg/L. The injection flow is controlled by a feedback loop and is set to one fiftieth of the main pipe flow. Diffusion along the flow direction is also ignored because of the short hydraulic residence time and the use of the static mixer, which mixes the injected solution and the stream water immediately after injection. The expected phenol concentration after mixing is one fiftieth of the injected concentration, namely 1000ug/L, 500ug/L, 200ug/L, 100ug/L, 50ug/L and 30ug/L, respectively, as shown in Fig. 6. Each scan of the spectrometer takes about 20s, and there is a 10s interval between two scans. Considering the 11.67s hydraulic residence time, the injection begins at the 15th second of the previous scan to ensure that the next step is completely intruded. The same operation is applied when the injection stops. The duration of each intrusion event is fixed at around 20min, with an interval of around 90min between events.
All spectral data were exported from the instrument and analyzed in Matlab R2011b. Figure 8 shows how the absorbance at 260nm, 270nm and 280nm varies with the change in concentration, while Fig. 9 shows the 3-D spectral plot of the distribution water during the experimental period. As can be seen from the results, the absorbance at 270nm has the most significant signal, and the change resulting from a concentration of 100ug/L can be clearly identified. At the other wavelengths the signal is much weaker, and even the change resulting from a concentration of 200ug/L can hardly be identified. The experimental results are consistent with the test conducted previously.
3.3 Anomaly detection using proposed procedure
All spectral data are analyzed using principal components analysis to examine the internal data structure. Figure 10 shows the eigenvalues and the cumulative percent variance. The variance is distributed relatively uniformly, and no one or two principal components dominate the signal. This indicates that the spectral signal consists mainly of random noise, and that the signal of the intruding contaminant is too weak and transient to dominate the principal components analysis of the whole spectral matrix.
All spectral data are then analyzed using the proposed procedure for potential anomaly detection. As a general configuration, the spectral range from 230nm to 400nm is used in anomaly detection, as a typical mid- and near-ultraviolet band where the key wavelengths of aromaticity, COD, proteins and other contaminants are located. The size of the moving window is set to 200, so that the 199 previous observations (about 100min) are used in the principal component analysis and in the assessment of the new observation during each loop. The cut-off threshold for PCA is set to 0.90 to ensure that the majority of the information is retained, as in most of the previous literature. The Gaussian cumulating scale is set to 6, so that the outlier probabilities of 6 previous observations are used for anomaly evaluation, and the standard deviation of the weighting coefficient distribution function is one third of the Gaussian cumulating scale.
Figures 11 and 12 show the anomaly detection effectiveness of the proposed procedure. AUC here is short for the area under the curve. In Fig. 11(a), the intruded concentration is plotted to label the steps where the contaminant was intruded, which are referred to as true positives in this study. The event probability produced by the proposed procedure is plotted in Fig. 11(b). A threshold of 0.90 is specified to decide whether an alarm should be triggered. For a more reliable statistical model, detected anomalies are removed from the principal component analysis and outlier assessment when later observations are evaluated. The alarms are recorded in Fig. 11(c), and their correctness, evaluated against the true positives in Fig. 11(a), is shown in Fig. 11(d).
As can be seen in Fig. 11, water quality events at concentrations of 50ug/L and 30ug/L can hardly be identified from the output of the proposed procedure. The average probabilities for 50ug/L and 30ug/L, namely 0.76 and 0.65, are only a little higher than the average probability of 0.51 for the normal background. These lower-concentration intrusions can be detected if a lower trigger threshold is employed, 0.65 for example. However, many more false alarms would be generated as the threshold is lowered.
For intrusions at concentrations above 100ug/L, anomalies are correctly detected except for a 1- or 2-step delay on the rising and falling edges. These delays do not result from the hydraulic residence time between the injection port and the spectrometer probe, because according to Fig. 8 the absorbance changes can be observed at exactly the steps where the edges are located, without any delay or advance. The delay results mainly from the Gaussian cumulating, in which each observation is evaluated according to the accumulation of several adjacent previous outlier probabilities. A 1- or 2-step delay is acceptable because a contaminant intrusion generally lasts for several steps.
The results in this section demonstrate the effectiveness of the proposed procedure in detecting water quality events caused by phenol intrusion at concentrations above 100ug/L. It is reasonable to expect that intrusions of other colored contaminants can be recognized if they cause a spectral change between 240nm and 400nm that can be detected by the spectrometer probe.
4.1 Setting of approach parameters
Table 3 summarizes the 4 parameters that need to be specified in the proposed approach and for which no proper theoretical guidance is available. A brief discussion is given in this section to develop a more reliable anomaly detection procedure.
4.1.1 Selecting of wavelength band
Figures 13 and 14 depict the analysis results based on new wavelength ranges that extend into the far-ultraviolet and the visible wavelengths, respectively. A false alarm can be observed around step 418 in both figures. In Fig. 13, the alarm signal for the concentration of 100ug/L is missing in some steps. In Fig. 14, the intrusion at a concentration of 50ug/L around the 1950th step is not detected at all. Comparing the probability curves in Figs. 12-14, the effectiveness of the proposed approach declines as the wavelength range varies. The same trend can be observed in the areas under the ROC curve listed in Table 4.
This may be because the newly selected wavelengths contain much more interference than spectral absorbance resulting from contaminant intrusions. Hence, the selection of the wavelength range should take the signal-to-noise ratio into consideration. The optimal wavelength band is one that contains most of the key wavelengths and has a higher spectral signal. Because the key wavelength, where the absorbance peak of the potential contaminant lies, cannot be determined in advance, the spectral range of 230nm-400nm is a sound general choice. Wavelengths below 230nm can be interfered with by the solvent itself, while wavelengths above 400nm suffer from interference from color and turbidity. In addition, this wavelength band contains numerous key wavelengths of potential pollutants.
4.1.2 Setting of moving window size
In Figs. 15 and 16, different moving window sizes are tested and the results are compared to find the appropriate size, and Table 5 lists the AUCs obtained with different moving window sizes. The size of the moving window affects the results in many respects, because the window is used to normalize the data, to compute the principal components and to assess outliers. The results show that the detection rate and the false alarm rate are only slightly affected by the moving window size. This may result from the steadiness of the process during the simulated intrusions: the background spectral data show almost no difference on various time scales, so a subset of 100 previous observations can characterize the background data as well as a subset of 500 observations. Representative historical data could be selected as a training set to build a fixed model instead of the moving window, in order to reduce the computational complexity. However, the concept of normal is fixed in such a model, whereas according to the literature normal should be defined as the majority. The concept of normal may therefore change as the operating conditions change, and a fixed model can hardly track this change. For the same reason, the size of the moving window should also be smaller than the time scale of the changes in operating conditions.
4.1.3 Setting of cut-off threshold
As a dimensionality reduction method, principal component analysis cuts off the unimportant components, here regarded as noise, after the coordinate transformation. The criterion that distinguishes principal components from unimportant dimensions is the cut-off threshold. Because the components are sorted in descending order of importance, or data variance, which is embodied in the corresponding eigenvalues, the cut-off threshold can be set on the number of components, on the eigenvalues or on the cumulative percent of the total data variance. Once the threshold is satisfied, the remaining components are cut off as unimportant dimensions carrying scarce information. In this study, the cut-off threshold is determined by an adaptive function based on the cumulative percent variance, as generally employed in the literature. Obviously, a higher threshold means that more components, and thus more signal as well as more noise, are retained. In this section, cut-off thresholds at four levels are tested to explore the effect of the threshold on the effectiveness of the proposed procedure. The analytical results are depicted as follows.
As shown in Fig. 17 , Fig. 18 , Fig. 19 , and Fig. 20 and Table 6 , the effectiveness of the proposed procedure decreases lightly as the cut-off threshold decreases. About 28 components remain in the moving windows as principal components when the cut-off threshold is 0.90. Comparing to the original spectral matrix, the dimensionality reduction is not significant. This may result from the data structure of the original matrix, as the absorptions on 65 wavelengths are almost independent and low related due to the 2.5nm spectral resolution. While the signal from contaminants is weak and short, it can almost not hold the dominant position. A relatively higher threshold is necessary to retaining vital weak signals consisted in posterior components. 0.90 is reasonable and a widely used cut-off threshold to tradeoff between retaining vital components and eliminating interferences.
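The cumulative percent variance criterion discussed in this subsection can be sketched as follows. The sketch assumes that the eigenvalues of the covariance matrix within the moving window are already available and applies a plain cumulative-variance rule; the adaptive function used in this study is not specified here, so `components_to_retain` is a hypothetical helper and not the authors' implementation.

```python
def components_to_retain(eigenvalues, cutoff=0.90):
    """Smallest k whose leading eigenvalues explain at least `cutoff` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)   # largest variance first
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= cutoff:
            return k
    return len(ordered)   # reached only if rounding keeps the ratio below cutoff
```

When the variance is spread over many weakly correlated wavelengths, the eigenvalues decay slowly and a threshold of 0.90 keeps a comparatively large number of components, which is consistent with the roughly 28 components reported above.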
4.1.4 Setting of cumulating scale
The cumulating scale refers to the number of previous observations that are considered when the event probability is calculated from the previous and latest outlier probabilities. A single observation is easily disturbed and may receive a high probability in the outlier evaluation. If the high outlier probability is occasional and transient, lasting for only one step, the observation can be dismissed as a mistake and should not trigger an alarm. In the previous literature, an alarm is triggered when the number of consecutive outliers exceeds a specified threshold. In this study, however, the outlier evaluation is quantitative rather than qualitative: the outlier probability is accumulated and compared with a specified threshold, 0.9 in this paper. A small improvement is also made to the accumulation. Different weighting coefficients are assigned to the previous outlier probabilities according to a one-sided (unilateral) Gaussian curve, as shown in Eq. (10). More recent observations receive higher weighting coefficients, while observations older than 3 times the standard deviation of the unilateral Gaussian curve are excluded from the accumulation because their weighting coefficients are negligible. Hence, the cumulating scale is equal to 3 times the standard deviation of the unilateral Gaussian curve. In this section, three cumulating scales are tested to determine a proper value. The results are as follows.
As shown in Fig. 21, the event probabilities at the steps where contaminant intrusions occur are almost 1 with a sharp rising edge, which leads to a low false negative rate. However, there are also considerable false positives lasting for only one or two steps; occasional mistakes caused by noise cannot be removed correctly on this scale. For the event probabilities in Fig. 23, by contrast, the rising edge is blurred by the smoothing effect of the accumulation, and there are longer delays lasting for three steps. Figures 21, 22 and 23 and Table 7 show a slight increase of AUC with increasing cumulating scale. As the cumulating scale increases, the event probability curve becomes smoother, so the AUC improves correspondingly. However, the delay on the rising edge increases, while the probabilities at the steps where true events occur decrease and tend toward the average. The cumulating scale should be consistent with the time scale of the background noise. According to the discussion in this section, a cumulating scale larger than 6 steps does not help to reduce the false positive rate or improve the effectiveness of the proposed procedure.
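The weighted accumulation described in this subsection can be sketched as follows. The exact form of Eq. (10) is not reproduced here, so the sketch assumes weights proportional to exp(-lag²/(2σ²)) with σ equal to one third of the cumulating scale, and an event probability defined as the normalized weighted average of the most recent outlier probabilities. Both function names and the normalization are assumptions for illustration.

```python
import math

def gaussian_weights(scale):
    """Unnormalized one-sided Gaussian weights for lags 0..scale-1 (lag 0 = latest)."""
    if scale < 1:
        raise ValueError("scale must be at least 1")
    sigma = scale / 3.0   # cumulating scale = 3 standard deviations
    return [math.exp(-(lag ** 2) / (2.0 * sigma ** 2)) for lag in range(scale)]

def event_probability(outlier_probs, scale=3):
    """Weighted average of the latest `scale` outlier probabilities (input is oldest first)."""
    if not outlier_probs:
        raise ValueError("need at least one outlier probability")
    recent = outlier_probs[-scale:][::-1]            # latest first, so index = lag
    weights = gaussian_weights(scale)[:len(recent)]  # truncate early in the series
    # Assumption: normalize by the weights actually used so the result stays in [0, 1].
    return sum(w * p for w, p in zip(weights, recent)) / sum(weights)
```

Under these assumed weights and a scale of 3 steps, a single isolated outlier probability of 1 yields an event probability of roughly 0.57, below the 0.9 alarm threshold, whereas a larger scale spreads the weight over more steps and delays the rising edge.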
4.1.5 Summary of approach parameters setting
In this section, the relation between the results of the approach and the parameter values has been discussed. General guidance on setting the parameters can be summarized from the demonstrations above, as follows.
a) Wavelength range
The wavelength range chosen for the analysis should include the key wavelengths of potential contaminants, while wavelengths that suffer from more interference, such as instrumental noise, absorbance of water, turbidity and turbulence, should be removed. The range 230 nm to 400 nm is a good choice if no further detail about potential contaminants is available. The wavelength range can be narrowed according to the spectral characteristics of the intruded contaminants once the intrusion is confirmed.
b) Moving window size
A certain number of previous observations is needed for principal component analysis. These observations should be representative and able to describe the structure of the latest background spectral data. The moving window size is related to the cycle of the background fluctuations and the frequency of contaminant intrusion. According to the experimental results in this study, the water quality is relatively stable compared with the high sampling frequency. Hence, 100 previous observations are enough for the model. Allowing for some redundancy, 200 previous observations are an appropriate setting.
c) Cut-off threshold
The cut-off threshold should ensure that the retained principal components depict the characteristics of the original data while eliminating interference and noise. In this study, the signal from contaminants is weak and of short duration. A larger cut-off threshold such as 0.90 is appropriate because it retains more components and, with them, the signals from contaminants.
d) Cumulating scale
The cumulating scale should take the time scale of the background noise into consideration. Cumulating scales larger than 6 steps do not help to reduce the false positive rate or improve the effectiveness of the proposed procedure. A scale of 3 steps would be a general choice.
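For convenience, the guidance in items a) to d) can be collected in a single settings object. The following is only a sketch: the class name, field names and validation rules are hypothetical, and the default values are the ones recommended above for the experimental conditions of this study.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DetectionSettings:
    wavelength_min_nm: float = 230.0   # default range when the contaminant is unknown
    wavelength_max_nm: float = 400.0
    window_size: int = 200             # previous observations in the moving window
    cpv_cutoff: float = 0.90           # cumulative percent variance to retain
    cumulating_scale: int = 3          # steps used when accumulating outlier probability
    event_threshold: float = 0.9       # alarm when the event probability exceeds this

    def __post_init__(self):
        # Hypothetical sanity checks, not part of the original procedure.
        if not self.wavelength_min_nm < self.wavelength_max_nm:
            raise ValueError("wavelength range is empty")
        if self.window_size < 1 or self.cumulating_scale < 1:
            raise ValueError("window_size and cumulating_scale must be positive")
        if not 0.0 < self.cpv_cutoff <= 1.0:
            raise ValueError("cpv_cutoff must lie in (0, 1]")
```

These defaults should be revisited for other sites, since the appropriate window size and cumulating scale depend on the time scales of the operating condition and of the background noise.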
4.2 Comparison with 3σ threshold method
The conventional threshold method is used in this section as a reference for the effectiveness of the proposed procedure. The 3σ threshold method is implemented following two different strategies. In the first strategy, only the absorbance at 270 nm is considered, and an alarm is triggered once it deviates from the average absorbance by more than 3 times the standard deviation. In the second strategy, all wavelengths of the latest observation are evaluated, and an alarm is triggered when the absorbance at any wavelength exceeds the 3σ threshold for that wavelength. All standard deviations and average absorbances are obtained from the first 500 observations of the experiment. The results of both strategies are as follows.
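The two strategies can be sketched in a few lines. The sketch assumes that each spectrum is a list of absorbances, that the baseline is fitted on a training block such as the first 500 observations, and that the population standard deviation is used; the function names and the `index` argument, which selects the position of the chosen wavelength such as 270 nm, are hypothetical.

```python
from statistics import mean, pstdev

def fit_baseline(training):
    """Per-wavelength mean and standard deviation from the training spectra."""
    columns = list(zip(*training))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]

def alarm_single(spectrum, means, stds, index):
    """Strategy 1: alarm if one chosen wavelength deviates by more than 3 sigma."""
    return abs(spectrum[index] - means[index]) > 3.0 * stds[index]

def alarm_any(spectrum, means, stds):
    """Strategy 2: alarm if any wavelength deviates by more than 3 sigma."""
    return any(abs(x - m) > 3.0 * s for x, m, s in zip(spectrum, means, stds))
```

Because the second strategy tests every wavelength independently, noise at any single wavelength is enough to raise an alarm, which is one plausible reason why it is more prone to false positives than the first strategy.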
As depicted in Fig. 24, for events with an intruded concentration larger than 200 mg/L, the conventional threshold method has almost the same false alarm rate and detection rate as the proposed procedure. For the event with an intruded concentration of 100 mg/L, however, the detection rate in Fig. 24 is inferior to that of the proposed procedure. Meanwhile, as shown in Fig. 25, the correct alarms can hardly be distinguished from the false positives. The AUCs listed in Table 8 also show a significant difference in performance between the proposed procedure and the conventional threshold method. Note that 270 nm is the key wavelength of phenol, the contaminant used in the simulated intrusion, whereas in practice no key wavelength can be specified because no contaminant can be confirmed before the qualitative analysis is triggered. Hence, the second strategy should be used when there is no prior knowledge. The experimental results show that the proposed approach works much better than the common threshold method in reducing interference and extracting valid signal. The threshold method can hardly recognize the water quality event.
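The AUC values used for the comparisons in Tables 5 to 8 summarize how well a score separates event steps from normal steps. How the AUC was computed in this study is not stated, so the following is a generic sketch based on the rank interpretation of the area under the ROC curve: the fraction of (event, non-event) pairs in which the event step receives the higher score, with ties counted as one half. The function name `roc_auc` is an assumption.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons (ties count one half)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one event and one non-event step")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 1 means that every event step scores above every normal step, while 0.5 corresponds to a score that does not discriminate at all. The pairwise loop is quadratic in the number of steps, which is acceptable for a short experiment but would call for a sort-based implementation on long records.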
Real-time water quality monitoring based on UV spectra has proved to be a promising alternative monitoring technique and an effective trigger, owing to advantages such as being reagent-free, low maintenance cost, rapid analysis and wide coverage. A few reports on using spectral data to detect water quality anomalies have been published. However, the large spectral dimension, the uncertainty about the intruded contaminants and the noise from water and instrument can produce a large number of missed detections or false alarms, which compromises the reliability of the detection system. In this paper, a novel procedure based on statistical analysis is proposed to detect anomalies from ultraviolet spectral data. Firstly, principal component analysis is employed to capture the main features of variation from the spectral matrix. The probability of an outlier is then estimated according to the chi-square distribution. Next, the probability of an anomaly is established by cumulative sum. A simulated intrusion is conducted in the pilot-scale distribution system by injecting phenol, and the spectral data are collected and analyzed using the proposed procedure. The experimental results confirm that the proposed procedure is a promising method for anomaly detection. General guidance on parameter setting is summarized and a comparison with the 3σ threshold method is conducted. Compared with other methods, the proposed procedure can be easily implemented and is a generic detection procedure applicable to potential water quality anomalies. Meanwhile, the variable concept of normal water quality can also be addressed using the moving window and the statistical model. The proposed procedure has several other advantages, such as dimension reduction, the data tracking feature of the moving window, and the cumulating trust mechanism.
Due to these advantages, the novel procedure can be widely applied to detecting distribution water quality anomalies from ultraviolet spectral data, and will contribute to ensuring the safety of water quality.
This work was funded by the National Natural Science Foundation of China (No. 41101508) “Research on Water Quality Event Detection Methods based on Time-Frequency Analysis and Multi-sensor Data Fusion”, the Public Welfare Program of Zhejiang Provincial Science Technology Department (No. 2014C33025) “Research and Development on Small Mobile Water Quality Emergency and Monitoring Platform for Urban Water Supply Security”, the Program for Zhejiang Provincial Innovative Research Team (No. 2012R10037-08) “Research on Water Quality Early-warning Methods and Technologies”, and the Fundamental Research Funds for the Central Universities (No. 2014FZA5008) “Research on Mobile monitoring and Source Tracking Technology for Urban Water Supply Quality Pollution”.
References and links
1. L. Perelman, J. Arad, M. Housh, and A. Ostfeld, “Event detection in water distribution systems from multivariate water quality time series,” Environ. Sci. Technol. 46(15), 8212–8219 (2012).
2. K. A. Klise and S. A. McKenna, “Water quality change detection: multivariate algorithms,” in Conference on Optics and Photonics in Global Homeland Security II, T. T. Saito, and D. Lehrfeld, ed. (SPIE, 2006), J2030.
3. Y. Jeffrey Yang, R. C. Haught, and J. A. Goodrich, “Real-time contaminant detection and classification in a drinking water pipe using conventional water quality sensors: Techniques and experimental results,” J. Environ. Manage. 90(8), 2494–2506 (2009).
4. G. Langergraber, A. Weingartner, and N. Fleischmann, “Time-resolved delta spectrometry: a method to define alarm parameters from spectral data,” in International Conference on Automation in Water Quality Monitoring, (IWA, 2006), pp. 13–20.
5. J. Broeke, G. Langergraber, and A. Weingartner, “On-line and in-situ UV/vis spectroscopy for multi-parameter measurements: a brief review,” Spectrosc. Eur. 18(4), 15–18 (2006).
6. V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Comput. Surv. 41(3), 1 (2009).
7. J. Hall, A. D. Zaffiro, R. B. Marx, P. C. Kefauver, E. R. Krishnan, and J. G. Herrmann, “On-line water quality parameters as indicators of distribution system contamination,” J. Am. Water Works Assoc. 99(1), 66–77 (2007).
9. Y. Jeffrey Yang, J. A. Goodrich, R. M. Clark, and S. Y. Li, “Modeling and testing of reactive contaminant transport in drinking water pipes: Chlorine response and implications for online contaminant detection,” Water Res. 42(6-7), 1397–1412 (2008).
10. N. D. Lourenço, J. C. Menezes, H. M. Pinheiro, and D. Diniz, “Development of PLS calibration models from UV-Vis spectra for TOC estimation at the outlet of a fuel park wastewater treatment plant,” Environ. Technol. 29(8), 891–898 (2008).
11. R. Aryal, S. Vigneswaran, and J. Kandasamy, “Application of ultraviolet (UV) spectrophotometry in the assessment of membrane bioreactor performance for monitoring water and wastewater treatment,” Appl. Spectrosc. 65(2), 227–232 (2011).
12. L. G. Dias, A. C. Veloso, D. M. Correia, O. Rocha, D. Torres, I. Rocha, L. R. Rodrigues, and A. M. Peres, “UV spectrophotometry method for the monitoring of galacto-oligosaccharides production,” Food Chem. 113(1), 246–252 (2009).
13. F. J. Acevedo, J. Jiménez, S. Maldonado, E. Domínguez, and A. Narváez, “Classification of wines produced in specific regions by UV-visible spectroscopy combined with support vector machines,” J. Agric. Food Chem. 55(17), 6842–6849 (2007).
14. O. Barbosa-García, G. Ramos-Ortiz, J. L. Maldonado-Molina, M. A. Meneses-Nava, and J. E. Landgrave, “UV–vis absorption spectroscopy and multivariate analysis as a method to discriminate tequila,” Spectrochim. Acta A Mol. Biomol. Spectrosc. 66(1), 129–134 (2007).
15. U. T. C. P. Souto, M. J. C. Pontes, E. C. Silva, R. K. H. Galvão, M. C. U. Araujo, F. A. C. Sanches, F. A. S. Cunha, and M. S. R. Oliveira, “UV–Vis spectrometric classification of coffees by SPA–LDA,” Food Chem. 119(1), 368–371 (2010).
16. S. Fogelman, M. Blumenstein, and H. Zhao, “Estimation of chemical oxygen demand by ultraviolet spectroscopic profiling and artificial neural networks,” Neural Comput. Appl. 15(3-4), 197–203 (2006).
18. R. Guercio and E. Ruzza, “An early warning monitoring system for quality control in a water distribution network,” in International Conference on Sustainable Water Resources Management, C. A. Guercio, E. Di Ruzza, ed. (Wessex Inst Technol, 2007), pp. 143–152.
19. J. Broeke, A. Brandt, A. Weingartner, and F. Hofstadter, “Monitoring of organic micro contaminants in drinking water using a submersible UV/vis spectrophotometer” in NATO Advanced Research Workshop on Security of Water Supply Systems, J. Pollert, B. Dedus, ed. (NATO, 2005), pp. 27–31.
20. G. Langergraber, J. Broeke, W. Lettl, and A. Weingartner, “Real-time detection of possible harmful events using UV/vis spectrometry,” Spectrosc. Eur. 18(4), 19–22 (2006).
21. D. J. Dürrenmatt and W. Gujer, “Identification of industrial wastewater by clustering wastewater treatment plant influent ultraviolet visible spectra,” Water Sci. Technol. 63(6), 1153–1159 (2011).
22. N. D. Lourenço, F. Paixão, H. M. Pinheiro, and A. Sousa, “Use of spectra in the visible and near-mid-ultraviolet range with Principal Component Analysis and Partial Least Squares Processing for monitoring of suspended solids in municipal wastewater treatment plants,” Appl. Spectrosc. 64(9), 1061–1067 (2010).
23. S. M. S. Nagendra and M. Khare, “Principal component analysis of urban traffic characteristics and meteorological data,” Transp. Res. Part D Transp. Environ. 8(4), 285–297 (2003).
24. P. L. Brockett, R. A. Derrig, L. L. Golden, A. Levine, and M. Alpert, “Fraud classification using principal component analysis of RIDITs,” J. Risk Insur. 69(3), 341–371 (2002).
25. W. Sun, J. Chen, and J. Li, “Decision tree and PCA-based fault diagnosis of rotating machinery,” Mech. Syst. Signal Process. 21(3), 1300–1317 (2007).
26. W. Zhao, R. Chellappa, and A. Krishnaswamy, “Discriminant analysis of principal components for face recognition,” in IEEE International Conference on Automatic Face and Gesture Recognition, (FG, 1998), pp. 336–341.
27. A. M. A. Dias, I. Moita, M. M. Alves, E. C. Ferreira, R. Páscoa, and J. A. Lopes, “Activated sludge process monitoring through in situ near-infrared spectral analysis,” Water Sci. Technol. 57(10), 1643–1650 (2008).
28. W. Li, H. Yue, S. Valle-Cervantes, and S. Qin, “Recursive PCA for adaptive process monitoring,” J. Process Contr. 10(5), 471–486 (2000).
29. Standard for drinking water, GB 5749–2006 of China, 2006.
30. H. Kenzo, Handbook of Ultraviolet and Visible Absorption Spectra of Organic Compounds (Plenum Press Data Division, New York, 1967).