Optica Publishing Group

Method for developing and using high quality reference images to evaluate tone mapping operators

Open Access

Abstract

With the advancement of imaging technology, high dynamic range (HDR) images can now be captured and displayed to produce realistic effects. Tone mapping operators (TMOs) are used to map HDR radiance to the displayable range. A reliable TMO would play a significant role in the accurate reproduction of HDR scenes. The present study aimed to establish an image quality metric based on external references to evaluate various TMOs. Two psychophysical experiments were conducted to develop reference images and to investigate the performance of TMOs. In experiment 1, a set of high quality reference images was developed by rendering image features in terms of contrast, sharpness, and colorfulness, to achieve good rendering for each image. The images were used as reference in experiment 2 to evaluate the performance of 14 TMOs using a six-point categorical judgment method. The TMOs were evaluated using four scales, i.e., contrast, sharpness, colorfulness, and overall performance. The hierarchical relationship among TMOs was established. The results were further compared with previous studies, and high correlation was found between the current experiments and previous studies.

Published by Optica Publishing Group under the terms of the Creative Commons Attribution 4.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.

1. INTRODUCTION

High dynamic range (HDR) imaging enables one to capture a wide range of luminance, i.e., from extremely dark shadows to bright sunshine, and to display it in a limited range [1]. A good tone mapping operator (TMO) maps the real-world dynamic range of a scene to the displayable range while preserving contrast and details [2–4]. Various TMOs have been developed based on the currently available knowledge of the physiology of the human visual system and on image processing techniques [5–9]. TMOs have been applied to map HDR images to the dynamic range of mobile phone and computer displays [10]. The tone mapping process can introduce problems including loss of detail and degradation of colors [11]. To improve the performance of TMOs, it is essential to assess the quality of the resulting tone-mapped images; however, assessment methods differ. There are two types of image assessment methods: (i) subjective evaluation methods based on visual assessment, such as preference, naturalness, or overall quality, and (ii) objective assessment methods based on image metrics such as color difference formulas and structural similarity. Although subjective methods are expensive in experimental design and time, human observers are the ultimate end users of multimedia applications, so such methods are frequently used. Objective methods, on the other hand, are automatic and faster, and attempt to predict the image quality that visual assessment would yield. However, there is as yet no agreement on standard procedures for objective metrics.

Image assessment methods are further divided into comparative (reference image based) and non-comparative (no reference image used) methods. The former compare the test image with a reference image, whereas in the latter, the test image is compared with the observer's memory of the scene. Subjective evaluation of various TMOs has typically been studied based on visual assessments of preference and similarity [10]. Ledda et al. [12] compared six TMOs, including histogram adjustment [13], bilateral filter [14], Reinhard’s photographic reproduction [8], the iCAM06 model [15], Drago’s logarithmic model [5], and local eye adaptation [16], with the HDR image on an HDR display. In their study, iCAM06 performed the best, followed by Reinhard’s photographic reproduction, local eye adaptation, histogram adjustment, Drago’s logarithmic model, and bilateral filter. Other studies also investigated different TMOs in which observers judged image quality, mainly in terms of brightness, colorfulness, contrast, and overall quality [17,18]. Akyüz et al. [18] compared seven TMOs using a stimulus derived from the Cornsweet–Craik–O’Brien illusion [19] and grouped similar TMOs into various categories. Čadík et al. [20] conducted an experiment comparing 14 tone-mapped images with the original scene and found Reinhard’s photographic reproduction to be the best TMO. Almost all the studies mentioned above ranked Reinhard’s photographic reproduction in the highest category, whereas there were contradictions in the rankings of the other TMOs. This could be because different scenarios and images were used in the evaluation methods; the operators also perform differently when evaluated using different image attributes [17].

Image quality metrics (IQMs) are classified into three types: full reference, reduced reference, and no reference [21]. In full reference IQMs, the reference image is available to compare with the tone-mapped image. In reduced reference IQMs, a specific set of features of the reference image is available for quality evaluations. In no reference IQMs, no information regarding the reference image is used. There is a lack of reliable methods for reference image based evaluation of TMOs. This study focuses on evaluating TMOs using full reference IQMs. Many image databases [22,23] are publicly available, but HDR reference images have not been developed so far. Therefore, in this study, we first produced high quality reference images and afterwards evaluated different TMOs.

Previous studies have assessed image quality attributes such as naturalness and the detailed reproduction of scenes. TMOs use various parameters to render image attributes. For instance, the Reinhard-D TMO [6] adjusts the colorfulness of images, and Fairchild’s iCAM06 [15] applies CIECAM02 attributes to perform image rendering. Many mobile phone applications, including games, boost various image attributes for more appealing effects. Therefore, it is important to evaluate TMOs in terms of contrast, colorfulness, and sharpness, as these attributes correspond directly to human visual perception.

The primary aim of the study was to scale observers’ preference for each image and to compare the images produced by different TMOs against reference images. The paper is organized as follows. Two psychophysical experiments were conducted to assess the quality of images produced by popular TMOs. Experiment 1 focused on producing a set of high quality reference images, i.e., images with the right levels of various image attributes. Experiment 2 studied the performance of 14 TMOs using the reference images obtained in experiment 1. The reference images were high quality in terms of contrast, sharpness, and colorfulness, and therefore the TMOs were evaluated on these scales and on overall preference compared to the reference images. The ranking of the TMOs and the hierarchical relationship among them were established using these four scales.

2. EXPERIMENT 1: DEVELOPMENT OF REFERENCE IMAGES

To assess TMOs, it is important to produce the maximum level of details in images, especially in shadows and highlights for reference images. Ten HDR images covering different kinds of natural scenes were selected from the RIT Fairchild database [24]. Each HDR image was transformed to XYZ tristimulus space using the camera characterization model available with the database. Tone mapping algorithms were applied only to the Y-channel. Afterwards, the XYZ images were transformed to the CIELAB color space to render in terms of contrast, sharpness, and colorfulness. The rendered images were then transformed to RGB space for the calibrated display.

A. Experiment Design

Prior to the experiment, various TMOs were implemented [5–9,13,25]. In the authors’ preliminary tests, the local version of Reinhard’s photographic reproduction operator [6] outperformed the other TMOs once its user parameters were adjusted, and it gave the desired level of detail in shadows and highlights for the experiment. Each of the selected HDR images was tone mapped by manually varying the parameters of the Reinhard local TMO to produce the right level of detail in highlights and shadows while preserving maximum contrast. Contrast and colorfulness are important image characteristics for image quality [26]. The images were also sharpened, as sharpness defines contrast at edges.

Various schemes for image contrast enhancement have been proposed [27–30]. One of the most effective and commonly used methods is adaptive histogram equalization (AHE) with a uniform distribution. However, AHE often over-enhances image contrast [31]. Therefore, the modified AHE proposed by Reza [29], contrast-limited AHE (CLAHE), was used to control the over-enhancement of image contrast.
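The clipping step that distinguishes CLAHE from plain AHE can be sketched as follows for a single tile. This is an illustrative numpy implementation; the function name, the `clip_limit` parameterization, and the single-tile scope are our simplifications (Reza's full method tiles the image and bilinearly interpolates between tile mappings):

```python
import numpy as np

def clip_limited_equalize(channel, clip_limit=0.01, n_bins=256):
    """Contrast-limited histogram equalization on one tile (illustration only).

    `channel` is a 2-D array of values in [0, 1]; `clip_limit` is the
    fraction of pixels any single histogram bin may hold before the
    excess is redistributed uniformly, which is the core idea behind
    CLAHE's clipping step.
    """
    hist, _ = np.histogram(channel, bins=n_bins, range=(0.0, 1.0))
    limit = max(1, int(clip_limit * channel.size))
    excess = np.clip(hist - limit, 0, None).sum()      # pixels above the clip limit
    hist = np.minimum(hist, limit) + excess // n_bins  # redistribute excess evenly
    cdf = np.cumsum(hist).astype(np.float64)
    cdf /= cdf[-1]                                     # normalized mapping in [0, 1]
    idx = np.clip((channel * (n_bins - 1)).astype(int), 0, n_bins - 1)
    return cdf[idx]
```

Lowering `clip_limit` flattens the equalization toward the identity mapping, which is exactly how CLAHE curbs the over-enhancement of plain AHE.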

To render images in terms of sharpness, the unsharp masking technique was implemented [32]. In unsharp masking, a blurred image $(\bar{f}({x,y}))$ is subtracted from the original image ($f({x,y})$) to create a mask of the image ${g_{\rm{mask}}}({x,y})$ as given in Eq. (1):

$${g_{{\rm{mask}}}}(x,y) = f(x,y) - \bar f (x,y).$$

The mask [${g_{\rm{mask}}}({x,y})$] is then added to the original image [$f({x,y})$] to give the sharpened image [$g({x,y})$], i.e.,

$$g(x,y) = f(x,y) + {g_{{\rm{mask}}}}(x,y).$$

The blurring can be produced using traditional Gaussian filtering or edge-preserving filters such as bilateral filtering [33], rank-order filters [34], and guided image filtering [35]. In this study, traditional Gaussian filtering was used; the other methods may perform better and will be evaluated in future work.
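Equations (1) and (2) with a Gaussian blur can be sketched as follows. The separable-kernel implementation and the optional `amount` parameter are our additions for illustration (with `amount=1` the code reduces exactly to Eq. (2)):

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    # 1-D Gaussian kernel; the 2-D blur is separable into row and column passes.
    radius = radius if radius is not None else int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def unsharp_mask(f, sigma=2.0, amount=1.0):
    """g(x,y) = f(x,y) + amount * [f(x,y) - f_blur(x,y)], per Eqs. (1)-(2)."""
    k = gaussian_kernel(sigma)
    blur = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, f)
    blur = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, blur)
    mask = f - blur            # Eq. (1): high-frequency detail
    return f + amount * mask   # Eq. (2): sharpened image
```

Varying `sigma` reproduces the five sharpness levels described below: small sigmas leave the image nearly unchanged, while larger sigmas strengthen the sharpening (or, with a negative `amount`, the blurring) effect.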

Image colorfulness was enhanced by boosting the CIELAB chroma $({\rm{C}}_{\textit{ab}}^*)$ to make the images more colorful.
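Scaling CIELAB chroma while holding lightness and hue fixed can be sketched as follows. This is a minimal illustration; gamut clipping, which a production pipeline would need before display, is omitted:

```python
import numpy as np

def boost_chroma(lab, factor):
    """Scale CIELAB chroma C*_ab by `factor`, leaving L* and hue angle fixed.

    `lab` is an (..., 3) array of [L*, a*, b*]; factor > 1 increases
    colorfulness, factor < 1 desaturates.
    """
    out = np.asarray(lab, dtype=np.float64).copy()
    out[..., 1] *= factor   # a* scales linearly with chroma
    out[..., 2] *= factor   # b* too; hue = atan2(b*, a*) is unchanged
    return out
```

Because $C_{ab}^* = \sqrt{a^{*2} + b^{*2}}$, multiplying $a^*$ and $b^*$ by the same factor multiplies chroma by that factor while preserving the hue angle, which is why the rendering operates on the $a^*$ and $b^*$ channels only.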

Each image was rendered for the three scales, i.e., contrast, sharpness, and colorfulness. For each of the three scales, five rendered images were produced including one original image, one most pleasing, one with the maximum admissible effect, and two images neighboring the most pleasing image. The maximum admissible effect in the image was chosen visually such that after this rendering, the images would appear unnatural or had defects. For example, in the case of contrast rendering, one original version, one with the best contrast, one with the highest possible contrast (i.e., a level up to which no oversaturation or defects were caused), and two more images with contrast close to the best version were generated. To render the five levels of sharpness, the standard deviation (SD) of the Gaussian distribution function was varied to get blurring and sharpening effects. To render image colorfulness, five levels of colorfulness were applied to each of the previously generated images.

In total, $1250\;[= 5\;({\rm{contrast\;levels}}) \times 5\;({\rm{sharpness\;levels}}) \times 5\;({\rm{colorfulness\;levels}}) \times 10\;({\rm{original\;images}})]$ rendered images were obtained. Twenty percent of the images were repeated for analysis of the observer’s variability. In total, 1500 images were processed.

B. Interface

Figure 1 shows the interface designed for the experiment. The images were presented one by one on a calibrated display to be judged as high quality or low quality. Observers selected either “low” or “high” quality by clicking the corresponding button, then clicked “forward” for a new image or “backward” to redo the previous judgment in case of a change of decision.

Fig. 1. Experiment 1 setup.

The calibrated NEC PA302W AH-IPS LCD was located in a darkroom whose wall reflectance was approximately 4%–5%. The gain-offset-gamma (GOG) model was used to characterize the display [36], and a Konica Minolta CS-2000 spectroradiometer was used to measure colors on the display. The 24 colors of the X-Rite ColorChecker chart [37] were used to test the model’s performance in terms of CIELAB color difference [38,39]. The mean and maximum color differences were 0.64 and 1.66 CIELAB units, respectively; this performance is considered satisfactory.
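A minimal sketch of the GOG characterization pipeline is shown below. The function names and parameter packaging are our own, and the actual gain, offset, gamma, and primary matrix would come from fitting the spectroradiometer measurements described above:

```python
import numpy as np

def gog_linearize(d, gain, offset, gamma):
    """Gain-offset-gamma electro-optical transfer for one channel.

    `d` is the normalized digital count in [0, 1]; returns normalized
    linear luminance. The gain, offset, and gamma are fitted per channel
    from measured ramps.
    """
    x = np.clip(gain * d + offset, 0.0, None)  # clamp: below-offset counts emit nothing
    return x ** gamma

def rgb_to_xyz(d_rgb, params, primaries):
    """Display characterization: linearize each channel with its GOG
    parameters, then mix through the primary matrix (3x3 of the measured
    XYZ of peak R, G, B). All parameter values here are placeholders."""
    linear = np.stack(
        [gog_linearize(d_rgb[..., i], *params[i]) for i in range(3)], axis=-1
    )
    return linear @ primaries.T
```

Inverting this model (XYZ through the inverse primary matrix, then the inverse per-channel transfer) is what maps the processed CIELAB/XYZ images back to RGB digital counts for the calibrated display.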

The processed images were transformed to RGB using the display model. To evaluate spatial uniformity, the display was divided into ${{3}} \times {{3}}$ segments, and the mean color difference between the center and the other segments was 1.21 CIELAB units. The luminance of the peak white was set to ${{287}}\;{\rm{cd}}/{{\rm{m}}^2}$, with chromaticity matching CIE standard illuminant D65. Overall, the display performed well and was suitable for the experiment. The observers were seated in the darkroom, with the chair height adjusted to maintain a viewing/illumination geometry of $0^\circ {:}0^\circ$ and a viewing distance of approximately 60 cm.

C. Procedure

Before the experiment, each observer passed the Ishihara test [40] to ensure normal color vision. The observers were trained on how to rate three images rendered using the different scales, so that they understood the experiment. In addition, a neutral gray image ($L^*$ of 50) was presented for 30 s for adaptation to the environment before the actual experimental images were displayed. The test images were displayed in a random sequence for each observer. Each observer completed the experiment in two sessions of 50 min each; the sessions were held either on different days or separated by a five-minute break.

A panel of 20 observers from two ethnic groups (Chinese and Pakistani origin) participated in the experiment. The mean age and SD of the observers were 26.1 and 4.1 years, respectively. All observers were students from various disciplines. In total, 30,000 [= (1250 rendered images + 250 repeated images) × 20 observers] judgments were made. Each observer on average took about 120 min.

3. EXPERIMENT 1: RESULTS AND DISCUSSION

A. Observer Variation

To test the accuracy of experiment 1, the wrong decision (WD) rate was calculated to analyze both inter- and intra-observer variability [41]. For intra-observer variability, a WD was counted when an observer gave disagreeing assessments of the same image on its repeat: for example, an image was ranked “high quality” at its first showing but “low quality” when it was repeated for the same observer. The number of WDs was divided by the total number of decisions to give the WD percentage. For inter-observer variability, the same measure was applied to each image across the whole panel of observers. In both cases, a lower WD value indicates better agreement, and hence a more reliable experiment. For intra-observer variability, the worst observer had 0.30 and the best 0.14, with a mean of 0.19 and SD of 0.028. The best and worst values of inter-observer variability were 0.21 and 0.30, respectively, with a mean of 0.26. These means and SDs are not large, which shows that the observers were reliable and the collected data can be used with confidence.
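The WD percentage for one observer can be computed as follows (a minimal sketch; the rating labels are placeholders):

```python
def wrong_decision_rate(first_pass, repeat_pass):
    """Fraction of repeated images whose 'high'/'low' rating flipped.

    Both arguments are equal-length sequences of ratings for the same
    images, e.g. ['high', 'low', ...] from the first showing and from
    the repeat showing.
    """
    assert len(first_pass) == len(repeat_pass)
    flips = sum(a != b for a, b in zip(first_pass, repeat_pass))
    return flips / len(first_pass)
```

Applied per observer over that observer's repeated images, this gives intra-observer variability; applied per image over all observers' ratings, it gives the inter-observer figure.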

B. Reference Images

Thurstone’s law of comparative judgment [41] was used to analyze the raw data of experiment 1. The preference scores for all rendered images are reported in Fig. 2 using box and whisker plots. The $x$ axis is labeled with the image number, and the $y$ axis displays the preference score. As mentioned previously, each image was rendered on three scales (i.e., sharpness, colorfulness, and contrast), each with five levels; therefore, preference scores were calculated for 1250 images. If more than 50% of observers rate an image as high quality, its preference score is positive; otherwise, it is negative. Each box and whisker plot displays a five-number summary of the preference scores of the rendered images (i.e., minimum, maximum, median, and first and third quartiles). The median is the middle value, the first quartile is the median of the lower half of the preference scores, and the third quartile is the median of the upper half.
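A preference score with the sign convention described above (positive when more than 50% of observers vote "high quality") can be obtained by mapping the vote proportion through the inverse normal CDF, in the spirit of Thurstonian scaling. The edge-case correction for unanimous votes is a common practical convention and is not necessarily the one used in the study:

```python
from statistics import NormalDist

def preference_score(n_high, n_total):
    """z-score of the proportion of 'high quality' votes for one image.

    The proportion is mapped through the inverse normal CDF, so more
    than 50% 'high' votes gives a positive score and less than 50% a
    negative one. Unanimous proportions are nudged off 0/1 to keep the
    inverse CDF finite.
    """
    p = n_high / n_total
    p = min(max(p, 0.5 / n_total), 1 - 0.5 / n_total)  # avoid +/- infinity
    return NormalDist().inv_cdf(p)
```

For example, an image rated "high" by exactly half the panel scores 0, and the score is antisymmetric about that point.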

Fig. 2. Comparison of preference scores for original (red circle), reference images (black stars), and spread of the other rendered images. The error bars correspond to each of the boxplots.

Fig. 3. High quality reference images.

The error bars, representing one standard error, are shown along with each boxplot. They are very small, which implies that the reference images are in general quite robust.

It can be seen in Fig. 2 that, for each boxplot, the median and the first and third quartiles lie below zero, except for image 10, for which the third quartile lies above zero. This indicates that most of the preference scores accumulate on the negative side and fewer images have positive scores, as expected. Images with negative scores were discarded as low quality. For each original image, the rendered version with the highest preference score was selected as the reference image; these correspond to the maximum values of the boxplots, except for the eighth image, for which the outlier value was used. These scores are labeled with black stars, and the preference scores of the original images are labeled with red circles in Fig. 2. None of the original images had a positive preference score, so none qualified as high quality under the defined criteria; in other words, more than 50% of observers did not prefer the tone-mapped images as they were. Moreover, each reference image had higher contrast, colorfulness, and sharpness than the corresponding original image. Figure 3 presents the 10 reference images developed in the psychophysical experiment.

4. EXPERIMENT 2: EVALUATION OF TMOs

A. Experiment Design

Experiment 2 was designed to assess the performance of TMOs psychophysically. The reference images developed in experiment 1 were used to evaluate the TMOs in terms of differences from those images. As discussed previously, the reference images were higher in contrast, sharpness, and colorfulness than the original images; test images closer to the reference images would therefore be of higher quality. The experiment evaluated performance on four scales, i.e., contrast, sharpness, colorfulness, and overall preference. Fourteen TMOs were selected, including recent and other common TMOs: Schlick’s rational quantization [9], Durand’s bilateral filtering based TMO [33], the local and global versions of Reinhard’s photographic reproduction (Reinhard-L and Reinhard-G) [8,42], Fattal’s gradient domain based operator [14], Drago’s logarithmic TMO [5], Reinhard and Devlin’s dynamic range reduction operator inspired by photoreceptor physiology (Reinhard-D) [6], Meylan’s retinex based adaptive filter [43], Lischinski’s interactive local adjustment of tone values [44], Fairchild’s refined image appearance model iCAM06 [15], KimKautz consistent tone reproduction [45], Shan’s globally optimized linear windowed tone mapping [46], and Khan’s tone mapping technique based on a sensitivity model of the human visual system [47]. Default parameters were used in each TMO to avoid author bias. The original MATLAB codes provided by the authors on their websites were used for the Khan, Banterle, Shan, iCAM06, and Meylan TMOs; the others were implemented by the authors. Each TMO was applied to the 10 HDR images; therefore, 140 tone-mapped images were rendered. To cross-check the repeatability of the experiment, 30% of the images were repeated, giving a total of 182 images for each scale.

Fig. 4. Setup for experiment 2.

To assess the performance of the TMOs, a six-point categorical scale was defined to rank the closeness of the test images to the reference images. The six categories were: (1) no difference, (2) just noticeable difference, (3) small difference, (4) acceptable difference, (5) large difference, and (6) extremely large difference. Observers were asked to judge the test images, i.e., the tone-mapped images, with respect to the reference images. The experimental setup presented to the observers is shown in Fig. 4.

B. Procedure

Experiment 2 was conducted under the same conditions as experiment 1, and a similar procedure was adopted to screen the observers’ color vision. Before the actual experiment, a training session presented a set of three images displaying various levels of the scales, and the terms were explained to the observers. During the experiment, pairs of reference and test images were displayed in a random sequence, with the positions of the test and reference images also randomized. Observers rated the difference between the test and reference images on the categorical scale of 1 to 6.

Twenty-one observers took part in the experiment; their mean age and SD were 23 and 2.4 years, respectively. On average, each observer took 15 min to complete each session and 60 min to complete the experiment. In total, the observers made 15,288 judgments, i.e., ${{21}}\;{\rm{observers}} \times {{4}}\;{\rm{sessions}} \times ({{14}}\;{\rm{TMOs}} \times {{10}}\;{\rm{images}}\;+ 42\;{\rm{repeated}}\;{\rm{images}})$.

5. EXPERIMENT 2: RESULTS AND DISCUSSION

A. Observer Variation

To test observer accuracy in terms of inter- and intra-observer variability, the standardized residual sum of squares (STRESS) metric was used [48]. Percent STRESS values range between 0 and 100: for perfect agreement between two sets of observations the STRESS value is zero, and higher STRESS represents poorer agreement. STRESS is defined as

$${\rm{STRESS = 100}}\sqrt {\frac{{{{\sum \!{(F\Delta {E_i} - \Delta {V_i})}}^2}}}{{\sum {\Delta {V_i}^2}}}} \; {\rm{and}} \; F = \frac{{\sum {\Delta {E_i}\Delta {V_i}}}}{{\sum {\Delta {E_i}^2}}},$$
where ${{\Delta}}E$ and ${{\Delta}}V$ are the first and second observations, respectively, and $F$ is the scaling factor that adjusts ${{\Delta}}E$ and ${{\Delta}}V$ to the same scale. STRESS was calculated for each of the four sessions of the experiment and is reported in Table 1. The mean intra-observer variability in each session is lower than the corresponding inter-observer variability, indicating that observers were more consistent with themselves than with the other observers. The inter-observer variability shows that observers were most consistent in the overall preference session and least consistent in the colorfulness session, with mean values of 18 and 21, respectively; the inter-observer variabilities for contrast and sharpness were equal. The mean and SD of inter-observer variability over all four sessions were 20 and 3 STRESS units, respectively, and those of intra-observer variability were 12 and 4. Observer variance was similar across the contrast, colorfulness, sharpness, and overall sessions. The SD values in each session and the mean SD indicate that variation among the observations was small.
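A direct numpy transcription of the STRESS metric is given below; here $F$ is taken as the least-squares factor scaling $\Delta E$ onto $\Delta V$, which makes the metric invariant to the overall scale of either observation:

```python
import numpy as np

def stress(dE, dV):
    """Percent STRESS between two sets of observations.

    F is the least-squares factor bringing dE onto dV's scale; a value
    of 0 means perfect (scale-independent) agreement, and larger values
    mean poorer agreement.
    """
    dE = np.asarray(dE, dtype=float)
    dV = np.asarray(dV, dtype=float)
    F = np.sum(dE * dV) / np.sum(dE ** 2)
    return 100.0 * np.sqrt(np.sum((F * dE - dV) ** 2) / np.sum(dV ** 2))
```

Intra-observer variability uses an observer's first-pass and repeat scores as the two observations; inter-observer variability uses one observer's scores against the panel mean.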

Table 1. Inter- and Intra-observer Variability of the Four Sessions

B. Evaluation of TMOs

The raw data from experiment 2 were used to calculate the differences of the tone-mapped images from the reference images, and the ranking of the TMOs was established. For better visualization of the results for each TMO and image, compact box and whisker plots are drawn in Fig. 5, presenting the spread and the minimum and maximum values of the differences for each TMO on each scale.

Fig. 5. Performance comparison of TMOs using experiment 2 data.

For contrast reproduction, the scores are shown in the blue boxplots. It is clear from Fig. 5 that the images with the lowest difference belong to the Schlick operator: its image with the greatest difference from the reference images scores about 4.4, and its image with the minimum difference scores close to one. On the judgment scale, this means the worst image is at a large difference and the best image at no difference. As the distribution of values lies in the lower part of the boxplot, the Schlick operator can be interpreted as generating high contrast images compared with the other operators, since its difference from the high contrast reference images is smaller. It is also worth noting that one of the parameters of the Schlick operator is the minimum luminance level of the display; as the reference images are high contrast images, it can be stated that the Schlick operator produces high contrast images for this particular display.

Similarly, the difference scores of Khan’s TMO accumulate in the upper section of the graph, with a minimum value around 4.4 and a maximum value of 5.8. This implies that the best image of Khan’s operator is at a large difference and the worst images are at an extremely large difference, with the scores of the other images lying between them. As most of the images have differences greater than five, the first and second quartiles are on the upper side of the graph, conveying that most of the images tone mapped with Khan’s operator were at a large difference from the reference images. Therefore, Khan’s TMO produces either very low contrast or very high contrast images compared with the reference images; the authors note that the images were in fact low contrast compared with the reference images. Shan’s TMO also produced images with large differences from the reference images, with a minimum value greater than four and a maximum value greater than 5.5, behavior consistent with Khan’s TMO. Similar interpretations can be drawn for the other TMOs.

The red boxplots in Fig. 5 represent the distribution of the difference scores on the sharpness scale. The boxplot of each TMO shows a wide spread of scores except for Khan’s TMO, which shows outliers: most of its images had large differences from the reference images, with a minimum score greater than four and all scores distributed between four and 5.5. This means that Khan’s TMO was at a large difference in sharpness for most of the images. Schlick outperformed the others again: its image with the minimum difference is at a small difference, with an average score of 3.6, and its image with the maximum score is at a large difference, with a score of 4.8. The median indicates that most of its images had smaller differences than those of the other TMOs. The Meylan TMO has minimum and maximum scores of 4.3 and six, respectively, indicating that images tone mapped with it lie between acceptable and extremely large difference; furthermore, the length of the first quartile and the median indicate that most of its images are at a large difference from the reference images. The scores of Fattal’s TMO are scattered around a single point with outliers, meaning most of its images score around five and are at a large difference from the reference images, with a minimum difference of 4.1 and a maximum of 5.3.

The green boxplots in Fig. 5 show the difference scores on the colorfulness scale. The Schlick operator performed similarly to its contrast results; however, its minimum difference for colorfulness is nearly three and its maximum is 4.6, implying that the images lie between acceptable and large difference. The Meylan TMO again has minimum and maximum scores of 4.3 and six, respectively, and the length of the first quartile and the median indicate that most of its images are at a large difference from the reference images. The scores of Fattal’s TMO are scattered around a single point with outliers, indicating that most of its images score around 4.8, with a minimum difference of 3.7 and a maximum of five. The spread of iCAM06 is smaller, with scores around 4.8, indicating that its images are at a large difference from the reference images.

Fig. 6. Ranking of each TMO corresponding to the four scales. Lower scores represent better performance.

The black boxplots in Fig. 5 convey the overall performance of the TMOs, for which observers judged the images in terms of overall preference relative to the reference images. Observers clearly preferred the images of Schlick’s TMO the most: its scores lie in the range of 3.5 to 4.5, implying that the images lie between small and large difference. The tone-mapped images of the Lischinski TMO ranged from acceptable to large difference, while most images from Khan’s and Shan’s TMOs ranged from large to extremely large difference.

C. Ranking of TMOs

The ranking of the TMOs was established using the mean difference scores against the reference images. Figure 6 shows the ranking of each TMO evaluated psychophysically; the $y$ axis represents the mean difference score for each TMO on the six-point categorical scale, whose values were assigned 1 to 6, with 1 indicating no difference and 6 an extremely large difference from the reference image. Overall, Schlick’s TMO gave the best performance on all four scales, with Lischinski second, followed by Reinhard-L and Reinhard-G. It should be noted that even Schlick’s mean scores sit around the level of acceptable difference: 4.0 for overall preference, and 3.5, 3.8, and 4.2 for contrast, colorfulness, and sharpness, respectively. The mean scores of Khan’s and Shan’s TMOs reveal that their images are at the greatest difference from the reference images, ranking them as the worst performers in our studies; the other TMOs lie in between.
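The ranking procedure reduces to sorting TMOs by their mean categorical score; a minimal sketch (the TMO names and scores below are placeholders, not the study's data):

```python
import numpy as np

def rank_tmos(scores):
    """Rank TMOs by mean categorical difference score.

    `scores` maps TMO name -> list of per-image scores on the 1-6 scale;
    lower mean means closer to the reference, hence better. Returns
    (name, mean) pairs sorted best-first.
    """
    means = {name: float(np.mean(s)) for name, s in scores.items()}
    return sorted(means.items(), key=lambda kv: kv[1])
```

Running this once per judgment scale (contrast, sharpness, colorfulness, overall) yields the four rankings plotted in Fig. 6.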

D. Hierarchical Relationship among TMOs

Mahalanobis distance is a statistical measure of the dissimilarity of data points relative to the probability distribution of the data, accounting for different variances along different dimensions. A dendrogram based on Mahalanobis distance for each scale is given in Figs. 7(a)–7(d), showing the hierarchical relationship among the TMOs in the form of a hierarchical binary tree.
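The distance underlying the dendrograms can be sketched as follows. This computes a plain Mahalanobis distance from the sample covariance; the linkage method used to build the binary tree is not specified in the text, so only the metric itself is shown:

```python
import numpy as np

def mahalanobis(x, y, data):
    """Mahalanobis distance between feature vectors x and y.

    `data` (rows = observations, columns = dimensions) supplies the
    covariance, so that dimensions with large variance are down-weighted
    relative to plain Euclidean distance.
    """
    cov_inv = np.linalg.inv(np.cov(np.asarray(data).T))
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ cov_inv @ d))
```

Here each TMO would be a feature vector of its per-image difference scores, and agglomerative clustering on the resulting pairwise distances produces trees like those in Fig. 7.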

Fig. 7. Tree-structured Mahalanobis distances to determine similarity among TMOs.

For the contrast judgment scale, Fig. 7(a) shows that Durand-B and Shan; Lischinski and Reinhard-L; iCAM06 and Meylan; Reinhard-G, KimKautz, and Drago; and Banterle and Reinhard-D lie in the same clusters, which means their contrast reproductions are similar. Schlick has the smallest difference from the Lischinski and Reinhard-L TMOs, while Khan’s TMO has the greatest difference from them.

Figure 7(b) represents the similarity of the color reproduction of the TMOs. Here, Schlick’s TMO is at the highest difference from the other TMOs. iCAM06 and Khan; Durand-B and Fattal; Lischinski, Reinhard-L, KimKautz, and Reinhard-G; Banterle and Reinhard-D; and Shan and Meylan lie in the same clusters, which means images tone mapped by these operators have similar color reproduction.

Figure 7(c) presents the clustering of the sharpness of the tone-mapped images produced by the various TMOs. Reinhard-D, Banterle, and Drago are close to each other, so their sharpening effects would be similar. Fattal has the least difference from the Shan and Durand-B TMOs. Khan’s TMO has the highest difference from the other TMOs, which all lie along the horizontal axis.

Figure 7(d) depicts the clustering and similarity of TMOs for the overall preference scale. In this case, iCAM06 and Durand-B; Meylan, Shan, and Khan; Schlick and Lischinski; Reinhard-L and Reinhard-G; and Drago, Banterle, and Reinhard-D gave similar performances.

E. Comparison with Other Studies

The results of the experiment were compared with previous studies. Cerda-Company et al. evaluated 15 TMOs in a paired comparison experiment against original physical scenes [49]. They compared their results with many other studies [12,20,50–53] and concluded that most TMO rankings agree across studies; therefore, the results of this experiment are compared with the ranking from Cerda-Company’s data. Eight TMOs—Drago, Reinhard-L, Reinhard-D, Fattal, iCAM06, Durand-B, Meylan, and KimKautz [5,6,8,14,15,33,43,45]—are the same as those used in Cerda-Company’s experiment. We calculated Spearman’s correlation coefficient ($\rho$) between the scores obtained in Cerda-Company’s second experiment (scene reproduction) and those obtained in the current experiment. The scores from experiment 2 were normalized to one. A scatter plot of the current experiment’s scores versus Cerda-Company’s scores, along with the least-squares line, is depicted in Fig. 8. Six out of eight TMOs correlate well with the ranking obtained in Cerda-Company’s experiment, with $\rho = {0.92}$. The Reinhard-D [6] and Fattal [14] TMOs do not correlate well; however, these two TMOs also rank poorly in our experiment, similar to Cerda-Company’s results. Reinhard-L is better than iCAM06; Drago is better than Durand-B, iCAM06, and Fattal; and Meylan is worse than iCAM06, Durand-B, and the other TMOs. This shows that the results obtained from our experiment are consistent with other studies.
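The rank-correlation step can be sketched as follows; the two score vectors here are illustrative placeholders standing in for the normalized scores of the six well-correlated shared TMOs, not the values from either study:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical normalized scores for six shared TMOs: one vector from the
# current experiment, one from the comparison study (placeholder values).
ours = np.array([0.95, 0.80, 0.60, 0.55, 0.40, 0.20])
theirs = np.array([0.90, 0.85, 0.55, 0.60, 0.35, 0.15])

# Spearman's rho compares the rank orders, so monotone rescaling of
# either score set leaves the coefficient unchanged.
rho, p = spearmanr(ours, theirs)
print(f"Spearman rho = {rho:.2f}")
```

A high $\rho$ (close to 1) indicates the two studies rank the TMOs in nearly the same order, as reported above for six of the eight shared operators.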

Fig. 8. Correlation between experiment 2 and Cerda-Company’s [49] scores.

6. CONCLUSION

In this study, a method is proposed to generate and use high quality reference images to test the performance of TMOs with full-reference IQMs. Two psychophysical experiments were carried out. In the first experiment, 10 high quality reference images were obtained whose contrast, sharpness, and colorfulness were rated as well rendered by human observers. These reference images were used in a second psychophysical experiment to evaluate the performance of 14 TMOs on four scales: contrast, colorfulness, sharpness, and overall preference. To evaluate the performance of the TMOs, tone-mapped images were compared with the reference images. The Schlick operator ranked highest, giving contrast, colorfulness, and sharpness reproduction close to the reference images. Lischinski, Reinhard-L, and Reinhard-G followed the Schlick operator, while the Khan and Shan operators showed the largest differences from the reference images in contrast and sharpness reproduction, and Meylan and Shan showed the largest differences in color reproduction. When judged on overall preference, the Khan and Shan operators again showed the largest differences, whereas the best operator, Schlick’s TMO, showed only a small difference. Based on Mahalanobis distance, the similarity among TMOs was established; operators in different clusters may give different contrast, colorfulness, and sharpness reproductions, e.g., higher contrast, and thus perform differently from one another. However, it should be noted that the reference images were judged by average observers, and preferences may differ among individuals. Finally, the results were compared with previous studies, and the comparison with Cerda-Company’s [49] study showed good rank correlation for six of the shared TMOs.

Acknowledgment

The authors thank Dr. Mark Fairchild for providing the HDR image database, i.e., the HDR photographic survey.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Ref. [24].

REFERENCES

1. P. Sen and C. Aguerrebere, “Practical high dynamic range imaging of everyday scenes: photographing the world as we see it with our own eyes,” IEEE Signal Process. Mag. 33(5), 36–44 (2016). [CrossRef]  

2. M. Narwaria, M. P. Da Silva, P. Le Callet, and R. Pepion, “Tone mapping based HDR compression: does it affect visual experience?” Signal Process. Image Commun. 29, 257–273 (2014). [CrossRef]  

3. Y. Hel-Or, H. Hel-Or, and E. David, “Matching by tone mapping: photometric invariant template matching,” IEEE Trans. Pattern Anal. Mach. Intell. 36, 317–330 (2013). [CrossRef]  

4. K. Ma, H. Yeganeh, K. Zeng, and Z. Wang, “High dynamic range image compression by optimizing tone mapped image quality index,” IEEE Trans. Image Process. 24, 3086–3097 (2015). [CrossRef]  

5. F. Drago, K. Myszkowski, T. Annen, and N. Chiba, “Adaptive logarithmic mapping for displaying high contrast scenes,” in Computer Graphics Forum (Wiley, 2003), pp. 419–426.

6. E. Reinhard and K. Devlin, “Dynamic range reduction inspired by photoreceptor physiology,” IEEE Trans. Vis. Comput. Graph. 11, 13–24 (2005). [CrossRef]  

7. J. Duan and G. Qiu, “Fast tone mapping for high dynamic range images,” in 17th International Conference on Pattern Recognition (ICPR) (IEEE, 2004), pp. 847–850.

8. E. Reinhard, M. Stark, P. Shirley, and J. Ferwerda, “Photographic tone reproduction for digital images,” in 29th Annual Conference on Computer Graphics and Interactive Techniques (2002), pp. 267–276.

9. C. Schlick, “Quantization techniques for visualization of high dynamic range pictures,” in Photorealistic Rendering Techniques (Springer, 1995), pp. 7–20.

10. E. Staff, “The display industry awards: honoring creativity at its best,” Inf. Disp. 36, 10–15 (2020). [CrossRef]  

11. X. Liu, Y. Fang, R. Du, Y. Zuo, and W. Wen, “Blind quality assessment for tone-mapped images based on local and global features,” Inf. Sci. 528, 46–57 (2020). [CrossRef]  

12. P. Ledda, A. Chalmers, T. Troscianko, and H. Seetzen, “Evaluation of tone mapping operators using a high dynamic range display,” ACM Trans. Graph. 24, 640–648 (2005). [CrossRef]  

13. G. W. Larson, H. Rushmeier, and C. Piatko, “A visibility matching tone reproduction operator for high dynamic range scenes,” IEEE Trans. Vis. Comput. Graph. 3, 291–306 (1997). [CrossRef]  

14. R. Fattal, D. Lischinski, and M. Werman, “Gradient domain high dynamic range compression,” in 29th Annual Conference on Computer Graphics and Interactive Techniques (2002), pp. 249–256.

15. J. Kuang, G. M. Johnson, and M. D. Fairchild, “iCAM06: a refined image appearance model for HDR image rendering,” J. Vis. Commun. Image Represent. 18, 406–414 (2007). [CrossRef]  

16. P. Ledda, L. P. Santos, and A. Chalmers, “A local model of eye adaptation for high dynamic range images,” in 3rd International Conference on Computer Graphics, Virtual Reality, Visualisation and Interaction in Africa (2004), pp. 151–160.

17. M. Barkowsky and P. Le Callet, “On the perceptual similarity of realistic looking tone mapped high dynamic range images,” in IEEE International Conference on Image Processing (IEEE, 2010), pp. 3245–3248.

18. A. O. Akyüz and E. Reinhard, “Perceptual evaluation of tone-reproduction operators using the Cornsweet-Craik-O’Brien illusion,” ACM Trans. Appl. Percept. 4, 1–29 (2008). [CrossRef]  

19. T. N. Cornsweet, Visual Perception (Academic, 1970).

20. M. Čadík, M. Wimmer, L. Neumann, and A. Artusi, “Image attributes and quality for evaluation of tone mapping operators,” in Proceedings of Pacific Graphics 2006 (National Taiwan University Press, 2006), pp. 35–44.

21. S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, “Deep neural networks for no-reference and full-reference image quality assessment,” IEEE Trans. Image Process. 27, 206–219 (2017). [CrossRef]  

22. R. Franzen, “True color Kodak images,” http://r0k.us/graphics/kodak (2013).

23. A. G. Weber, “The USC-SIPI image database: version 5,” Signal and Image Processing Institute, Department of Electrical Engineering, University of Southern California (2014; original release October 1997).

24. M. D. Fairchild, “The HDR photographic survey,” in Color and Imaging Conference (Society for Imaging Science and Technology, 2007), pp. 233–238.

25. R. Mantiuk, S. Daly, and L. Kerofsky, “Display adaptive tone mapping,” in ACM SIGGRAPH (2008), pp. 1–10.

26. J. A. Olson and N. I. Krinsky, “Introduction: the colorful, fascinating world of the carotenoids: important physiologic modulators,” FASEB J. 9, 1547–1550 (1995). [CrossRef]  

27. H. Ibrahim and N. S. P. Kong, “Brightness preserving dynamic histogram equalization for image contrast enhancement,” IEEE Trans. Consum. Electron. 53, 1752–1758 (2007). [CrossRef]  

28. G.-H. Park, H.-H. Cho, and M.-R. Choi, “A contrast enhancement method using dynamic range separate histogram equalization,” IEEE Trans. Consum. Electron. 54, 1981–1987 (2008). [CrossRef]  

29. A. M. Reza, “Realization of the contrast limited adaptive histogram equalization (CLAHE) for real-time image enhancement,” J. VLSI Signal Process. Syst. Signal Image Video Technol. 38, 35–44 (2004). [CrossRef]  

30. N. S. P. Kong and H. Ibrahim, “Color image enhancement using brightness preserving dynamic histogram equalization,” IEEE Trans. Consum. Electron. 54, 1962–1968 (2008). [CrossRef]  

31. M. Kaur, J. Kaur, and J. Kaur, “Survey of contrast enhancement techniques based on histogram equalization,” Int. J. Adv. Comput. Sci. Appl. 2, 137–141 (2011). [CrossRef]  

32. D. Sundararajan, Digital Image Processing: A Signal Processing and Algorithmic Approach (Springer, 2017).

33. F. Durand and J. Dorsey, “Fast bilateral filtering for the display of high-dynamic-range images,” in 29th Annual Conference on Computer Graphics and Interactive Techniques (2002), pp. 257–266.

34. G. R. Arce and R. E. Foster, “Detail-preserving ranked-order based filters for image processing,” IEEE Trans. Acoust. Speech Signal Process. 37, 83–98 (1989). [CrossRef]  

35. K. He, J. Sun, and X. Tang, “Guided image filtering,” IEEE Trans. Pattern Anal. Mach. Intell. 35, 1397–1409 (2012). [CrossRef]  

36. R. S. Berns, “Methods for characterizing CRT displays,” Displays 16, 173–182 (1996). [CrossRef]  

37. D. Varghese, R. Wanat, and R. Mantiuk, “Colorimetric calibration of high dynamic range images with a ColorChecker chart,” in Proceedings of the HDRi (2014).

38. M. R. Luo, “CIE division 8: a servant for the imaging industry,” Proc. SPIE 4922, 51–55 (2002). [CrossRef]  

39. CIE, “Colorimetry - Part 4: CIE 1976 L*a*b* colour space,” Joint ISO/CIE Standard ISO 11664-4:2008 (2008).

40. J. Clark, “The Ishihara test for color blindness,” Am. J. Physiol. Opt. 5, 269–276 (1924).

41. L. L. Thurstone, “A law of comparative judgment,” Psychol. Rev. 34, 273–286 (1927). [CrossRef]  

42. E. Reinhard, “Parameter estimation for photographic tone reproduction,” J. Graph. Tools 7, 45–51 (2002). [CrossRef]  

43. L. Meylan and S. Susstrunk, “High dynamic range image rendering with a retinex-based adaptive filter,” IEEE Trans. Image Process. 15, 2820–2830 (2006). [CrossRef]  

44. D. Lischinski, Z. Farbman, M. Uyttendaele, and R. Szeliski, “Interactive local adjustment of tonal values,” ACM Trans. Graph. 25, 646–653 (2006). [CrossRef]  

45. M. H. Kim and J. Kautz, “Consistent tone reproduction,” in Computer Graphics and Imaging (2008), pp. 152–159.

46. Q. Shan, J. Jia, and M. S. Brown, “Globally optimized linear windowed tone mapping,” IEEE Trans. Vis. Comput. Graph. 16, 663–675 (2009). [CrossRef]  

47. I. R. Khan, S. Rahardja, M. M. Khan, M. M. Movania, and F. Abed, “A tone-mapping technique based on histogram using a sensitivity model of the human visual system,” IEEE Trans. Ind. Electron. 65, 3469–3479 (2017). [CrossRef]  

48. S. Ma, M. Wei, J. Liang, B. Wang, Y. Chen, M. Pointer, and M. Luo, “Evaluation of whiteness metrics,” Lighting Res. Technol. 50, 429–445 (2018). [CrossRef]  

49. X. Cerda-Company, C. A. Parraga, and X. Otazu, “Which tone-mapping operator is the best? A comparative study of perceptual quality,” J. Opt. Soc. Am. A 35, 626–638 (2018). [CrossRef]  

50. J. Kuang, H. Yamaguchi, G. M. Johnson, and M. D. Fairchild, “Testing HDR image rendering algorithms,” in Color and Imaging Conference (Society for Imaging Science and Technology, 2004), pp. 315–320.

51. A. Yoshida, V. Blanz, K. Myszkowski, and H.-P. Seidel, “Perceptual evaluation of tone mapping operators with real-world scenes,” Proc. SPIE 5666, 192–203 (2005). [CrossRef]  

52. J. Kuang, H. Yamaguchi, C. Liu, G. M. Johnson, and M. D. Fairchild, “Evaluating HDR rendering algorithms,” ACM Trans. Appl. Percept. 4, 9-es (2007). [CrossRef]  

53. M. Čadík, M. Wimmer, L. Neumann, and A. Artusi, “Evaluation of HDR tone mapping methods using essential perceptual attributes,” Comput. Graph. 32, 330–349 (2008). [CrossRef]  


Figures (8)

Fig. 1. Experiment 1 setup.
Fig. 2. Comparison of preference scores for original (red circle), reference images (black stars), and spread of the other rendered images. The error bars correspond to each of the boxplots.
Fig. 3. High quality reference images.
Fig. 4. Setup for experiment 2.
Fig. 5. Performance comparison of TMOs using experiment 2 data.
Fig. 6. Ranking of each TMO corresponding to the four scales. Lower scores represent better performance.
Fig. 7. Tree-structured Mahalanobis distances to determine similarity among TMOs.
Fig. 8. Correlation between experiment 2 and Cerda-Company’s [49] scores.

Tables (1)

Table 1. Inter- and Intra-observer Variability of the Four Sessions

Equations (3)

$$g_{\mathrm{mask}}(x,y)=f(x,y)-\bar{f}(x,y).$$
$$g(x,y)=f(x,y)+g_{\mathrm{mask}}(x,y).$$
$$\mathrm{STRESS}=100\sqrt{\frac{\sum_i\left(F\,\Delta E_i-\Delta V_i\right)^2}{\sum_i \Delta V_i^2}}\quad\text{and}\quad F=\frac{\sum_i \Delta E_i\,\Delta V_i}{\sum_i \Delta E_i^2},$$
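As a sketch, the STRESS index can be implemented directly from its definition; the input vectors here are placeholder values, not experimental data:

```python
import numpy as np

def stress(dE, dV):
    """STRESS between metric-predicted differences dE and visually
    perceived differences dV; 0 means perfect proportional agreement."""
    dE = np.asarray(dE, dtype=float)
    dV = np.asarray(dV, dtype=float)
    F = np.sum(dE * dV) / np.sum(dE ** 2)  # least-squares scaling factor
    return 100.0 * np.sqrt(np.sum((F * dE - dV) ** 2) / np.sum(dV ** 2))

# Perfectly proportional data: scaling is absorbed by F, so STRESS = 0.
print(stress([1, 2, 3], [2, 4, 6]))  # → 0.0
```

Because $F$ rescales the predicted differences before comparison, STRESS is invariant to a uniform scaling of either data set, which makes it suitable for comparing metrics with different units.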