
Which tone-mapping operator is the best? A comparative study of perceptual quality

Open Access

Abstract

Tone-mapping operators (TMOs) are designed to generate perceptually similar low-dynamic-range images from high-dynamic-range ones. We studied the performance of 15 TMOs in two psychophysical experiments where observers compared the digitally generated tone-mapped images to their corresponding physical scenes. All experiments were performed in a controlled environment, and the setups were designed to emphasize different image properties: in the first experiment we evaluated the local relationships among intensity levels, and in the second one we evaluated global visual appearance among physical scenes and tone-mapped images, which were presented side by side. We ranked the TMOs according to how well they reproduced the results obtained in the physical scene. Our results show that ranking position clearly depends on the adopted evaluation criteria, which implies that, in general, these tone-mapping algorithms consider either local or global image attributes but rarely both. Regarding the question of which TMO is the best, KimKautz [“Consistent tone reproduction,” in Proceedings of Computer Graphics and Imaging (2008)] and Krawczyk [“Lightness perception in tone reproduction for high dynamic range images,” in Proceedings of Eurographics (2005), p. 3] obtained the best results across the different experiments. We conclude that more thorough and standardized evaluation criteria are needed to study all the characteristics of TMOs, as there is ample room for improvement in future developments.

© 2018 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. INTRODUCTION

In almost all naturalistic viewing situations, we are immersed in scenes that could be described as high dynamic range (HDR); in other words, the intensity difference between the brightest and the darkest patch is much larger than the range that both capture and display devices can faithfully reproduce. For instance, the energy ratio between sunlight and starlight is about 100,000,000:1 [1]. If the human visual system (HVS) were to linearly represent these extreme differences in its normal daylight operation, it would require a much larger sensitivity range for its retinal sensors (cones) and neural pathways than is achievable within biological limitations. Instead, millions of years of evolution have solved this problem by adapting the sensorial and neural machinery, allowing it to nonlinearly convert the large natural intensity range into a much smaller range of about 10,000:1 [2,3].

A. Historical Context

The problem of translating the HDR world into low-dynamic-range (LDR) depictions is very old. Renaissance painters such as da Vinci and Caravaggio tried to solve it with an artistic technique called chiaroscuro, which exploits strong contrasts between different painted areas to create very powerful effects. This, and the need to overcome the limitations of physical materials (oil paints and substrates), inspired later artists to produce remarkable paintings. Perhaps the most dramatic were created by depicting a single artificial source of light (such as a candle), rendering the details of the central subject very bright while the surrounding subjects remain darker. It can be argued that some of the works by Rembrandt and Constable are no different from today’s HDR photography [4–6].

The arrival of photography brought a new set of challenges, given the strong limitations of early light-sensitive materials [7]. The first characterizations of silver halide films as plots of density versus exposure were made by Hurter and Driffield in 1890, as discussed in [6]. Outdoor scenes in particular were very difficult to capture, and early photographers experimented with multiple exposures to overcome dynamic range problems. When photographs involved human subjects, these had to remain still during the whole process so that several exposures could be combined into a single image.

B. Electronic HDR Imaging

Analog HDR imaging allowed only limited manipulations (via exposure time, chemical reactions or the combination of several exposures, etc.), but the arrival of electronic digital imaging made possible long-range interactions of pixels located in different parts of the image. This opened the field to multiple possibilities including mimicking the operation of the HVS and the work of chiaroscuro artists.

Physiological and psychophysical research has shown that photopic human vision is the result of highly nonlinear processing of the information captured by retinal cones. This processing includes the inhibition of the output of a neuron by the output of surrounding neurons in its receptive field [8], which results in higher sensitivity to edges and spots than to uniform light. Other processing includes the combination of visual information in the retina into a series of postreceptoral chromatically opponent channels that transmit it to the visual cortex via the optic nerve [9]. In the cortex, visual information is mostly processed in terms of its spatial frequency and orientation [10]. In the 1960s, a series of psychophysical experiments with achromatic Mondrians run by Edwin Land demonstrated that patches reflecting light with exactly the same physical properties can appear completely different to observers [11]. This implies that a digital image (where these patches produce exactly the same pixel values) cannot be modified using a pixel-wise transformation to simulate the appearance reported by observers. In other words, the information contained in individual pixels is not enough to mimic human vision. A comprehensive review of these experiments can be found in [12–15]. Other effects to consider are related to how the visual cortex processes local brightness interactions [16,17]. More detailed experiments have shown the effects of edges on illumination perception by matching the appearance of painted wooden facets to that of a painted test target (“ground truth”) [18], a paradigm very similar to ours (see below).

In order to mimic the response of the HVS, electronic imaging systems set out to use information not only from single pixels but from the entire scene. This allowed them a much larger flexibility to calculate appearances and to apply them to electronic displays or prints. Ironically, later HDR algorithms reverted to the old “multiple exposures” and “pixel-wise” processing techniques of analog photography for the same task (see below).

C. Tone-Mapping Operator (TMO)

Mapping the HDR dynamic range of the world onto LDR media presents an important challenge for visual representation technologies, mainly because most imaging devices (cameras and monitors) are only able to capture or display images within a small range of about 100:1 [2], which can be increased up to 1000:1 for specialized HDR LED-based displays [19]. To solve this problem, an assortment of nonlinear image processing techniques has been developed to display HDR scenes on LDR devices. To construct the HDR image, many LDR images of the same scene are usually taken at different exposure values, capturing a much larger dynamic range. The HDR image is generated by extracting from each LDR image the information corresponding to its region of interest (where it is neither over- nor underexposed) and combining them. Since this new HDR image cannot be displayed on a standard LDR monitor, an algorithm is needed to reduce its dynamic range to match that of the monitor. A common solution is to use a TMO to reduce the dynamic range while keeping the perceptual characteristics of the original HDR image approximately constant. The performance of these TMOs depends on several factors, including lighting and viewing conditions, aesthetic/realistic preferences, local/global assumptions, and so forth, and they are usually evaluated using computational [20,21] and psychophysical [22–32] methods.
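For illustration, the following minimal Python sketch shows the usual exposure-fusion step that produces the HDR radiance map; it is a generic, simplified scheme under our own naming (a hat-shaped weighting function and already-linearized inputs are assumptions), not the exact procedure used by any of the operators evaluated here:

```python
import numpy as np

def build_hdr(ldr_stack, exposure_times):
    """Fuse linearized LDR exposures into one HDR radiance map.

    ldr_stack: list of float arrays with values in [0, 1] (linearized).
    exposure_times: exposure time in seconds for each image.
    """
    num = np.zeros_like(ldr_stack[0])
    den = np.zeros_like(ldr_stack[0])
    for img, t in zip(ldr_stack, exposure_times):
        # Hat weight: trust well-exposed mid-range pixels, ignore
        # pixels that are nearly under- or overexposed.
        w = 1.0 - np.abs(2.0 * img - 1.0)
        num += w * (img / t)  # per-exposure radiance estimate
        den += w
    return num / np.maximum(den, 1e-6)
```

The resulting radiance map still exceeds the monitor's range, which is precisely where a TMO comes in.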

Although HDR images are able to reproduce a wider range of luminance highlights and shadows than LDR ones, the presence of veiling glare both in the camera and the eye limits the possible range of accurate luminance measurements [6]. Since HDR images are nevertheless perceptually closer to the original scene, the perceived improvement must have causes other than simply capturing a larger range of luminances. It has been hypothesized [6] that the improvement comes from a better preservation of relative spatial information during digital quantization (spatial differences in highlights and shadows are preserved), and that TMOs exploit this to replicate HVS processing.

In this work we present a new set of experiments and analyses to psychophysically evaluate the performance of 15 state-of-the-art TMOs. This allowed us to rank the TMOs according to how well they represent the original scene as human observers perceive it. Unlike previous studies, all the experiments were performed in a controlled environment, and tone-mapped images were presented side by side with the physical scene.

D. “Global” Versus “Local” Analysis

At this point we believe it is important to clarify the terminology used throughout this work. The term tone is traditionally used to describe pixel data (as in “tone mapping”) and was introduced by Mees [7] in 1920 to explain how exposure was related to photographic print density (silver halide response). Indeed, tone scale is the name given to a look-up table that transforms data in an input space to a desired output space.

The term global TMO, which is also used by several authors [33–35], generally refers to an algorithm that applies the same pixel-wise adjustment to all pixels in the image (although, in fact, it uses the most local information possible: a single pixel). In contrast, the term local TMO generally defines an algorithm that applies a combination of pixel-wise processing and spatial transformations to improve the image. Although confusing, we will follow the traditional terminology here, using global TMO for algorithms that apply pixel-wise processing and local TMO for algorithms that apply a combination of pixel-wise and spatial image processing.
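To make the distinction concrete, here is a minimal sketch of both kinds of operator. The global curve is the well-known $L/(1+L)$ compression used in the global stage of Reinhard et al. [35]; the local variant is a naive center/surround illustration of our own (not one of the evaluated TMOs), whose spatially varying adaptation term is exactly the kind of processing that can produce halos around high-contrast edges:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def global_tmo(lum):
    """Global (pixel-wise) operator: the same curve maps every pixel."""
    return lum / (1.0 + lum)

def local_tmo(lum, sigma=8.0):
    """Local operator: each pixel is compressed relative to the average
    luminance of its neighborhood (a crude local adaptation model)."""
    adapt = gaussian_filter(lum, sigma)
    return lum / (1.0 + adapt)
```

The only difference between the two sketches is whether the denominator depends on the pixel's neighborhood; that single change is what separates global from local processing in the traditional terminology.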

We will refer to our psychophysical experiments (see below) as scene reproduction when observers judge images by freely comparing them, and segment matching when they match the luminances of specific points in the scene to those of a reference table in the same scene.

2. STATE OF THE ART

A. Previous TMO Psychophysics

Although the idea of using algorithms to match the brightness of real scenes to that of imaging devices is not new [33,36], TMOs did not become popular until the turn of the century, when affordable digital cameras became available [30,35,37–49]. To date, many different psychophysical experiments have been performed, and they can be classified as described in the following sections.

1. Experiments without a Reference HDR Scene

One of the first psychophysical experiments to evaluate TMOs compared the performance of six TMOs on four different (synthetic and photographic) scenes by asking subjects to make pairwise perceptual evaluations and by rating stimuli with respect to three attributes: apparent image contrast, apparent level of detail, and apparent naturalness [22]. The results showed that preferred operators produced detailed images with moderate contrast.

Kuang et al. [23] performed pairwise comparisons on eight different TMOs using 10 different scenes and two conditions (color and gray-level) where subjects had to choose the preferred image considering general rendering performance (including tone compression performance, color saturation, natural appearance, image contrast, and image sharpness). Their results showed that the gray-scale tone-mapping performances are consistent with those in the overall rendering results, if not the same.

2. Experiments with a Reference HDR Scene

Yoshida et al. [24] conducted a psychophysical experiment based on a direct comparison between the appearance of real-world scenes and TMO images of these scenes displayed on an LDR monitor. In their experiment, they differentiated between global and local operators and introduced, for the first time, the comparison between a tone-mapped image and the real scene, selecting two different indoor architectural scenes. Fourteen subjects were asked to give ratings according to several criteria such as realism (image naturalness in terms of reproducing the overall appearance of the real-world views) and image appearance (brightness, contrast, and detail reproduction in dark and bright regions). They found that none of these image appearance attributes had a strong influence on the perception of naturalness by itself. This work was extended to find out which attributes of image appearance accounted for the differences between tone-mapped images and the real scene [28]. They observed a clear distinction between global and local operators. However, they concluded again that none of the evaluated image attributes had a strong influence on the perception of naturalness by itself, which suggested that naturalness depends on a combination of the other attributes with different weights.

In another work, Ashikhmin and Goyal [26] performed three different experiments in which subjects ranked different tone-mapped images depending on the task. In the first experiment, the authors asked which image subjects liked more, without showing the reference scene. In the second, they asked which image seemed more real, again without the reference scene, and in the third, they asked which image was the closest to the real scene while viewing the reference scene. They observed that rankings were totally different when subjects could compare the tone-mapped image to the reference scene.

In a subsequent study, Kuang et al. [29] performed three different experiments they named preference evaluation, image-preference modelling, and accuracy evaluation. In the preference evaluation experiment, pairwise comparisons between tone-mapped images were performed. Here they used only color images, and the aim was to evaluate the general rendering performance by instructing observers to consider perceptual attributes such as overall impression on image contrast, colorfulness, image sharpness, and natural appearance. In contrast, in the image-preference modelling experiment, they rated gray-scale images (which were gray-scale versions of the first experiment color images). Here, observers considered perceptual attributes such as highlight details, shadow details, overall contrast, sharpness, colorfulness, and appearance of artifacts, comparing the TMOs’ visual rendering “to their internal representation of a ‘perfect’ image in their minds” [29]. In the accuracy evaluation, both pairwise comparison and rating techniques were used in order to evaluate the perceptual accuracy of the rendering algorithms. The pairwise comparison of TMOs was performed without viewing the real scene, and subjects were asked to compare the overall impression on image contrast, colorfulness, image sharpness, and overall natural appearance. An additional rating evaluation was performed using the real scenes set up in the adjoining room as references. Here, subjects had to rate image attributes such as highlight contrast, shadow contrast, highlight colorfulness, shadow colorfulness, overall contrast, and overall rendering accuracy compared to the overall appearance of the real-world scenes. In both experiments, observers did not have immediate access to the real scene and had to rely on their memories (either short- or long-term) to perform the tasks.

To validate the iCAM06 operator [30], its authors performed two psychophysical experiments similar to the previous ones [29]. The first experiment was a pairwise comparison without viewing the reference scene. Observers had to choose the tone-mapped image that they preferred based on overall impression on image quality (considering contrast, colorfulness, image sharpness, and overall natural appearance). In the second experiment, observers were also asked to evaluate overall rendering accuracy by comparing the overall appearance of the rendered images to their corresponding real-world scenes, which were set up in an adjoining room.

While looking for a definition of an overall image quality measure, Cadík et al. [27] studied the relationships between image attributes such as brightness, contrast, reproduction of colors, and reproduction of details. They performed two psychophysical experiments, using 14 TMOs, in order to propose a scheme of relationships between these attributes, being aware that some special attributes, which were not evaluated (e.g., glare stimulation, visual acuity, and artifacts), can influence their relationships. In the first one, 10 subjects were asked to perform ratings using five criteria: overall image quality and the four basic attributes (brightness, contrast, and reproduction of detail and of colors). These evaluations were performed using a real scene as a reference (a typical real indoor HDR scene). In the second experiment, subjects did not have access to the real scene and had to rank image printouts according to the overall image quality and the four basic attributes.

In a new study, Cadík et al. [32] performed exactly the same type of experiments with two additional scenes, for a total of three: a real indoor HDR scene, an HDR outdoor scene, and a night urban HDR scene. In the first experiment, subjects were asked to rate overall image quality and the quality of reproduction of five attributes by comparing samples to the real scene. These attributes were the same four basic ones of their previous work plus the lack of disturbing image artifacts (one of the non-evaluated special attributes in [27]). These experiments were set up in an uncontrolled natural environment, so subjects had to perform the experiments at the same time of day as the HDR image was acquired. In the second experiment, subjects had no possibility of directly comparing to the real scene and had to rank the image printouts according to the overall image quality and the quality of the basic attributes.

3. Experiments Using an HDR Monitor

In 2005, Ledda et al. [25] performed two different psychophysical experiments comparing six different TMOs to linearly mapped HDR scenes displayed on an HDR device. They used 23 different color and gray-scale HDR scenes, showing three different images per comparison: the HDR image and two tone-mapped images. In the first experiment, subjects were asked to select the TMO image more similar to the HDR reference by judging its global appearance. In the second one, they were asked to make their judgment based on detail reproduction.

In a later work, Akyüz et al. [31] asked subjects to rank six images (one HDR image, three tone-mapped images, one objectively good LDR exposure value, and one subjectively good LDR exposure value) according to their subjective preferences. They found that participants did not systematically prefer tone-mapped HDR images over the best single LDR exposures.

All the previous studies have been focused on subjective comparisons of global and local image appearance attributes such as contrast, colorfulness, sharpness, and reproduction artifacts, either within TMOs or against the real scene. While this is no doubt extremely important, we believe a good TMO should output a scene that produces the same visual sensation as the physical scene, in particular the interrelations between objects and their perceived attributes. For instance, no study has been conducted (as far as we know) to evaluate whether objects represented within a TMO image maintain the same perceived visual differences as the real scene. This is the main objective of our work.

B. Tone-Mapping Operators

As mentioned before, TMOs can be classified according to their processing as global or local. Global operators perform the same computation on all pixels, regardless of spatial position, which makes them more computationally efficient at the cost of losing contrast and image detail. Some examples of global TMOs are [37,39,43,45]. On the other hand, local operators, which take into account surrounding pixels, produce images with more contrast and a higher level of detail, but they may show problems with halos around high-contrast edges. Local operators are inspired by the local adaptation process present at the early processing stages of the HVS. Some examples of local operators are [30,38,40–42,44,47,49]. Some TMOs can be global or local depending on their configuration parameters. One example is [35]; another is [48], which operates in two stages, the first global and the second local. A brief summary of the properties of each TMO used in our experiments is given in Table 1. The first column shows the names that we will use to refer to each operator throughout this work. The characteristics of each TMO are as follows:


Table 1. Summary of Used TMO Characteristics

  • Ashikhmin [40]. This local TMO is inspired by the processing mechanisms present at the first stages of the HVS. The intensity range is compressed by a local luminance adaptation function, and, in a last step, detail information is added.
  • Drago [43]. This global TMO is based on luminance logarithmic compression that, depending on scene content, uses a predetermined logarithmic basis to preserve contrast and details.
  • Durand [41]. This local TMO decomposes the image into two layers: the base and the detail. Large-scale variations of the base layer are compressed, while the magnitudes of the detail layer are preserved.
  • Fattal [42]. This local TMO manipulates the gradient fields of the luminance image. Its idea is to identify high gradients in different scales and attenuate their magnitudes while maintaining their directions.
  • Ferradans [48]. This TMO can be executed as global or local because it is divided into two stages. In the first stage, it applies a global method that implements visual adaptation, trying to mimic the saturation of human cones. In the second stage, it enhances local contrast using a variational model inspired by color vision phenomenology. In our work, this operator was run as local.
  • Ferwerda [39]. This global TMO is based on a computational model of visual adaptation that was adjusted to fit psychophysical results on threshold visibility, color appearance, visual acuity, and sensitivity over time.
  • iCAM06 [30]. This local TMO is based on the iCAM06 color appearance model, which gives the perceptual attributes of each pixel, such as lightness, chromaticity, hue, contrast, and sharpness. It includes an inverse model which considers viewing conditions to generate the result.
  • KimKautz [37]. This global TMO is based on the assumption that human visual sensitivity is adapted to the average log luminance of the scene and that it follows a Gaussian distribution.
  • Krawczyk [38]. This local TMO is inspired by the anchoring theory [50]. It decomposes the image into patches of consistent luminance (frameworks) and calculates, locally, the lightness values.
  • Li [44]. This local TMO is based on multiscale image decomposition that uses a symmetrical analysis–synthesis filter bank to reconstruct the signal and applies local gain control to the subbands to reduce the dynamic range.
  • Mertens [46]. This technique fuses the original LDR images of different exposure values (exposure fusion) to obtain the final “tone-mapped” image, which avoids the generation of an HDR image. Guided by simple quality measures like saturation and contrast, it selects “good” pixels of the sequence and combines them to create the resulting image. Thus, for this method we used a stack of LDR images instead of an HDR image.
  • Meylan [47]. This local TMO is derived from a model of retinal processing. In a first step, a basic tone-mapping algorithm is applied on the mosaic image captured by the sensors. In a second step, it introduces a variation of the center/surround spatial opponency.
  • Otazu [49]. This local TMO is based on a multipurpose human color perception algorithm. It decomposes the intensity channel into a multiresolution contrast decomposition and applies a nonlinear saturation model of visual cortex neurons.
  • Reinhard [35]. This TMO can be executed as global or local. It performs a global scaling of the dynamic range followed by dodging and burning (local) processes. In our work, this operator was run as global, which is its default value in the toolbox.
  • Reinhard–Devlin [45]. This global TMO uses a model of photoreceptor adaptation which can be automatically adjusted to the general light level.

3. METHODS

In order to compare TMOs, we performed two different experiments called segment matching and scene reproduction experiments. The aim of the first experiment was to study the internal relationships among gray levels in the tone-mapping image and in the real scene (i.e., a segment matching experiment similar to [18]). The aim of the second experiment was to evaluate TMOs according to how similar their results were perceived to be with respect to the real scene. In both cases, we obtained a ranking of the different TMOs. Behind these experiments is the idea that a good TMO is one whose output is perceptually similar to the real scene, and, to do that, a good reproduction of the objects’ relationships is needed.

A. Materials

Our experiments were performed in a controlled environment where the only sources of light were a lamp, which illuminated the real scene, and a CRT screen. We used a ViSaGe MKII Stimulus Generator and a Mitsubishi Diamond-Pro 2045u CRT monitor side by side with a handmade real HDR scene. The monitor was calibrated using the Cambridge Research Systems Ltd. (Rochester, England) software for the ViSaGe MKII Stimulus Generator and a ColorCal suction-cup colorimeter (Minolta sensor). Both the monitor and the real scene were set up so that the objects in both scenes subtended approximately the same angle (18.13° × 13.81°) and looked similarly positioned to the observer.

We built three different HDR scenes, each including a gray-level reference table and two solid parallelepipeds (cuboids). The reference table was built by printing a series of 65 gray squares (2.8 cm × 2.2 cm) arranged in a flat 11 × 6 distribution. The rows were labelled A, B, C, …, K and the columns 1, 2, 3, …, 6. The lightness of these patches decreased monotonically from the top (patch A1, #1) to the bottom (patch K5, #65), as measured by our PR-655 SpectraScan spectroradiometer. The printed values were selected so that their CIE L* (lightness) values were equally spaced, meaning that their distribution was approximately uniform in terms of perceived lightness (see Table 2). The cuboids consisted of pieces of wood (3.6 cm × 3.6 cm × a variable length between 9.4 and 10 cm) whose sides (facets) were covered with arbitrary samples of the same printed paper as the reference table. There were two cuboids in each scene (one under direct illumination and the other in the shade). The third column of Table 3 shows the patch of the reference table that each cuboid facet corresponded to, the fourth column indicates its position with respect to the illumination, and the last column indicates its luminance (when placed within its scene). Table 2 also shows the luminance values for these patches once lit by our light source. The chromaticity of all printed material was CIE xy = (0.3652, 0.3817). The rest of the scenes consisted of many plastic and wooden objects of different colors and shapes (see Fig. 1).
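As a side note, equal spacing in CIE L* can be mapped back to luminance with the standard CIELAB inverse; the short sketch below illustrates the idea with hypothetical endpoint values (the exact lightness range of our patches is given in Table 2):

```python
import numpy as np

def lstar_to_Y(L, Yn=100.0):
    """Invert CIE L* to luminance Y, given the white luminance Yn."""
    L = np.asarray(L, dtype=float)
    f = (L + 16.0) / 116.0
    # Standard CIELAB inverse: cubic branch above L* = 8,
    # linear branch (Y = Yn * L / 903.3) below it.
    return np.where(L > 8.0, Yn * f ** 3, Yn * L / 903.3)

# 65 lightness values equally spaced in L* (endpoints are hypothetical)
L_star = np.linspace(95.0, 5.0, 65)
Y = lstar_to_Y(L_star)  # approximately uniform in perceived lightness
```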


Table 2. L* CIELAB Color Space Values and Luminance Values (cd/m²) of Each Patch in the Reference Table


Table 3. Photometric Assessment of the Scene Facets Used in Our Matching Experiments


Fig. 1. To show the general appearance of the physical scenes, here we show a single LDR exposure (chosen by simple visual inspection by the authors) from the set of LDR exposures used to create the HDR images. Since these are single LDR exposures, the cuboids in the dark regions are not completely visible in these pictures.


Two facets of one cuboid and three of the other were always visible from the subjects’ location, resulting in 15 different gray facets in total (see Table 3). The incandescent lamp (100 W) had its bulb painted blue to simulate D65 illumination and was set up so that the luminance of the brightest object was about the same as the maximum luminance the monitor was capable of producing (about 100 cd/m²).

We photographed the real scene using a Sigma Foveon SD10 camera placed in the exact same position as the subjects’ heads during the psychophysical experiments. The same camera was calibrated for use in other measurements [51], and because of this, we have a fairly good idea of the linearity and spectral sensitivity of its sensors. The setup was arranged so that the images presented on the monitor looked geometrically the same as the real ones shown beside it. Since the walls were covered in black felt, reflections from all other objects were minimized. The dynamic range of the scenes as measured by multiple exposures using the camera was approximately 10⁵ for scene 1 and 10⁶ for scenes 2 and 3. The dynamic range of the reference table as measured by the PR-655 was 104.0–0.559 cd/m².

Although it has been shown that, because of glare, it is not possible to achieve an accurate representation of the scene luminance distribution from a combination of many LDR images, this technique can still provide a good enough approximation [6]. Consequently, a set of 25 photographs was taken at different exposure values (from 15 s to 1/6000 s) using the same aperture, focal distance, zoom settings, and visual field. Individual images were stored in RAW format and transformed into 16-bit sRGB (using the camera manufacturer’s software).

To avoid any bias regarding the operators, all experiments started with a 1-min subject adaptation to the ambient light. Most TMO implementations were obtained from the popular HDR Toolbox for MATLAB [52], while the others (Ferradans, iCAM06, Li, Meylan, and Otazu) were obtained from their corresponding authors’ web pages. In order to avoid benefiting any of the TMOs, we ran all of them with their default settings. In Ferradans’ case, we had to choose between two different parameters, and we selected the default values specified in their paper (ρ = 0 and α1 = 3). Another option would have been to ask each TMO’s authors to perform the best possible tone mapping, but we discarded it because of its impracticality (we could not ask all authors the same), and besides, this practice impairs the reproducibility of the results.

4. EXPERIMENTS

A. Experiment 1: Segment Matching

1. Procedure

The segment matching experiment consisted of two different tasks:

Task 1. After adaptation, subjects were asked to match, in the real scene (i.e., with the monitor turned off), the brightness of the five cuboid facets to the brightness of the patches in the reference table in each scene [see Fig. 2(a)]. Although there were no time constraints to perform the tasks, subjects were advised to take no more than 30 s per match.


Fig. 2. In Experiment 1, observers performed two tasks. In Task 1 [Fig. 2(a)], observers had to match the brightnesses of the five cuboids’ facets to the brightnesses of five patches in the reference table. In Task 2 [Fig. 2(b)], observers had to perform the same task on the TMO image displayed on the calibrated monitor. (Red arrows are randomly drawn for illustrative purposes).


Task 2. Here the real scene was not visible, and the observers only saw digital (tone-mapped) versions presented on the monitor. Their task was similar to Task 1, except that all matchings were conducted entirely between the facets and patches shown on the screen [see Fig. 2(b)].

There were three conditions for Experiment 1, corresponding to the three different scenes created (see Fig. 1). Observers performed 240 matchings in total (5 facets × 15 different tone-mapped images × 3 scenes plus 15 matchings in the real scenes). In practice, all matchings were conducted by writing, for each facet, the coordinates of the matching reference table patch on a piece of paper. The presentation order of the tone-mapped images was randomized.

2. Experimental Design

In Experiment 1, the independent variables (IVs) were the cuboids’ facets and the reference table patches. The dependent variables (DVs) were the subjects’ segment matches in the tone-mapped images (Task 2), and the control variables (CVs) were the subjects’ segment matches in the real scene (Task 1). Our null hypothesis was that there was no significant difference between the segments matched in the real scene (CVs—Task 1) and the matches in the tone-mapped images (DVs—Task 2), because the TMOs perfectly reproduce the perceptual relationships among the objects present in the real scene.

3. Participants

Task 1 was completed by a group of 12 observers with normal or corrected-to-normal vision, recruited from our lab academic/research community. This group (eight male and four female) was comprised of people aged between 17 and 54. Nine of them were completely naïve to the aims of the experiment. Task 2 was completed by 10 of the previous observers (eight male and two female).

4. Results

Figure 3 shows a plot of the segment matches obtained in Task 2 against the segment matches in Task 1. We fitted a linear model to the results obtained by each TMO. If a TMO reproduced the interrelations among the gray facets well, the fit should be very similar to the fit for the real scene (i.e., points should lie about the diagonal).


Fig. 3. Results of Experiment 1. Segment matches in the tone-mapped images are plotted against segment matches in the real scene. Markers and lines identify each TMO. Since not all the data had a normal distribution, the markers show the median of the subjects’ observations. Horizontal lines indicate the first and the third quartiles of Task 1, and vertical lines indicate the first and the third quartiles of Task 2. For each operator, we fitted a linear model using the median of the subjects’ observations. The figure is divided in four panels for clarity. The real scene is plotted against itself in all panels to provide a fixed reference (y=x). In summary, the better the TMO, the closer its fit to the solid black line.


We performed two different analyses to evaluate to what extent the local interrelations perceived by the observers in the tone-mapped versions corresponded to those perceived in the real scene. In the first analysis, we studied the slopes of the different fitted linear models with respect to the slope obtained in the real scene. The smaller the difference, the better the reproduction of the interrelations (it means that the TMO maintained the relationships among the facets and patches). Figure 3 shows the offset between the lines fitted to the TMOs and the line fitted to the real scene. In the second analysis, we studied this displacement by computing the root mean square error (RMSE) between them.
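Both analyses reduce to a few lines of computation. The sketch below is our reading of the procedure (function and variable names are illustrative): a line is fitted to each TMO's per-facet medians, its slope is compared to the real-scene slope of 1, and the RMSE measures the displacement of the fitted line from the identity line:

```python
import numpy as np

def tmo_line_stats(real_medians, tmo_medians):
    """Slope difference and RMSE of a TMO's fitted line vs. the real scene.

    Both inputs hold per-facet medians of the subjects' matches, expressed
    as reference-table patch positions. The real scene plots onto y = x,
    so its slope is 1 and any displacement from that line counts as error."""
    x = np.asarray(real_medians, dtype=float)
    y = np.asarray(tmo_medians, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)       # linear model for this TMO
    fitted = slope * x + intercept
    rmse = np.sqrt(np.mean((fitted - x) ** 2))   # displacement from y = x
    return abs(slope - 1.0), rmse
```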

All results are shown in Table 4, where iCAM06 has the smallest distance to the real scene in both analyses. Since its slope difference and RMSE are very small, we can assume that the pixel interrelations in its tone-mapped image perceptually mimic the real scene. Given that iCAM06 is based on a color appearance model that considers perceptual attributes such as lightness, chromaticity, hue, contrast, and sharpness, its results are expected to be in line with observers’ perception.


Table 4. Performance of All TMOs in the Segment Matching Experiment

We calculated Spearman’s rank correlation coefficient between the rankings obtained from the two segment matching analyses (see Table 4) and obtained a value of 0.59 (p<0.05). Since both rankings are quite similar, it is worth paying attention to some interesting cases such as Ferradans, whose slope is very close to that of the real scene but whose fitted model lies systematically below the real scene’s line (i.e., its RMSE is very large). An opposite example is Mertens, which has a different slope, but its RMSE is the second smallest.

Another interesting observation from Fig. 3 is that, at the lowest and highest brightness values, the agreement between subjects is higher than at middle values (both horizontal and vertical dispersion lines are smaller). This suggests that the TMOs are more accurate at reproducing both the brightest and the darkest parts of the image. To analyze this effect in more detail, we studied the subjects’ results for each facet. In Fig. 4, the abscissa shows the segments matched in the real scene ordered from darkest to brightest, and the ordinate represents the RMSE in the tone-mapped images with respect to the real scene. We defined this as $\mathrm{RMSE}_{\mathrm{scene}}=\sqrt{\frac{1}{n}\sum_{i}(x_i-y_i)^2}$, where $x_i$ is the $i$th subject’s segment match in the real scene, $y_i$ is the $i$th subject’s segment match in the tone-mapped image, and $n$ is the number of subjects. Again, in almost all TMOs, the RMSE value is smaller for the darkest and brightest facets than for mid-gray facets. Thus, not only the agreement between subjects but also the error ($\mathrm{RMSE}_{\mathrm{scene}}$) is lower for both the brightest and darkest values.
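A direct implementation of this per-facet metric might look as follows (a sketch; the array handling is ours):

```python
import numpy as np

def rmse_scene(real_matches, tmo_matches):
    """Per-facet RMSE between subjects' matches in the real scene (x_i)
    and in the tone-mapped image (y_i), following the formula above."""
    x = np.asarray(real_matches, dtype=float)
    y = np.asarray(tmo_matches, dtype=float)
    return np.sqrt(np.mean((x - y) ** 2))
```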


Fig. 4. RMSE with respect to the real scene (RMSEscene) is the difference between segments matched in the tone-mapped image and in the real scene. Different types of lines and markers represent different TMOs. The abscissa represents the segments matched in the real scene ordered from darkest to brightest. According to this metric, the smaller the value, the better the TMO.


B. Experiment 2: Scene Reproduction

1. Procedure

Experiment 2 consisted of a pairwise comparison of tone-mapped images obtained using different TMOs in the presence of the original scene (side by side). After 1-min adaptation in front of the physical scene, a pair of tone-mapped images of the same physical scene was randomly selected and presented sequentially to the observer on the CRT screen beside the real scene. Subjects could press a gamepad button to toggle which image of the tone-mapped pair was presented on the monitor (only one image was displayed at a time). For this task, they were asked to “select the image that was more similar to the real scene.” As before, there was no time limit, but subjects were advised to complete a trial in less than 30 s. After an image was chosen, a gray background was shown for 2 s, and a different random pair was selected for the next trial. Every subject performed 105 comparisons per scene (all pairwise combinations of the 15 TMOs), taking around 25 min in total. There were three experimental conditions, corresponding to the three different physical scenes created (see Fig. 1). Between conditions, subjects were forced to take a 5–10 min break outside while the physical scene was replaced.

2. Experimental Design

In this experiment the IVs were the different TMOs, the DVs were the subjects’ evaluations (i.e., the preference matrix), and the CV was the real scene. Our null hypothesis was that there were no differences in the TMOs performances since all of them perceptually reproduce the real scene.

3. Participants

A group of 10 people with normal or corrected-to-normal vision, seven male and three female, recruited from our lab academic and research community, completed this experiment. This group was comprised of people aged between 17 and 54 years old. Seven of them were naïve to the aims of the experiment.

4. Results

From the pairwise comparison results, we defined a preference matrix for each subject and each scene. We constructed a directed graph where the nodes were the evaluated TMOs and the arrows pointed from a preferred TMO to a non-preferred TMO. For example, if $\mathrm{TMO}_i$ was preferred over $\mathrm{TMO}_j$ (i.e., the tone-mapped image from $\mathrm{TMO}_i$ was judged more similar to the real scene than the one from $\mathrm{TMO}_j$), we drew an arrow from node $i$ to node $j$, for $i \neq j$.
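A minimal sketch of this bookkeeping (the matrix convention and names are ours):

```python
import numpy as np

def preference_matrix(choices, n_tmos):
    """Build one subject's preference matrix for one scene.

    choices: list of (winner, loser) TMO index pairs from the trials.
    P[i, j] counts how often TMO i was preferred over TMO j, i.e., an
    arrow from node i to node j in the directed graph."""
    P = np.zeros((n_tmos, n_tmos), dtype=int)
    for winner, loser in choices:
        P[winner, loser] += 1
    return P
```

The row sums of this matrix (the out-degrees $a_i$ of the graph) are what the consistency analysis below operates on.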

From this graph, we were able to analyze the intra-subject consistency coefficient ζ for each scene. The consistency coefficient for each subject and scene is defined by

$$\zeta_{st}=\begin{cases}1-\dfrac{24\,d_{st}}{n^{3}-n}, & \text{if } n \text{ is odd}\\[1ex]1-\dfrac{24\,d_{st}}{n^{3}-4n}, & \text{if } n \text{ is even,}\end{cases}\qquad \text{with } d_{st}=\frac{n(n-1)(2n-1)}{12}-\frac{1}{2}\sum_{i=1}^{n}a_{ist}^{2},\tag{1}$$
where $s$ is the scene number ($s\in[1,3]$), $t$ is the subject number ($t\in[1,m]$), $n$ is the number of evaluated TMOs, and $a_{ist}$ is the number of arrows that leave node $i$. The maximum $\zeta$ value is 1 (perfect within-subject consistency).

The consistency between subjects, that is, inter-subject agreement, is measured by the Kendall coefficient of agreement [25,53]. This measure is defined by

$$u_s=\frac{2\sum_{i\neq j}\binom{p_{ij}}{2}}{\binom{m}{2}\binom{n}{2}}-1,\tag{2}$$
where $p_{ij}$ is the number of times $\mathrm{TMO}_i$ is preferred over $\mathrm{TMO}_j$ and $m$ is the number of subjects. Since the number of subjects is even ($m=10$), the possible minimum value of $u$, given by Eq. (2), is $u=-\frac{1}{m-1}$, and its possible maximum value is $u=1$.

In order to study whether the $u_s$ values are significant, we used the chi-squared test ($\chi^2$). The $\chi_s^2$ values are defined by

$$\chi_s^2=\frac{n(n-1)\left(1+u_s(m-1)\right)}{2}.\tag{3}$$

The number of degrees of freedom of the chi-squared test is given by $n(n-1)/2$.
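The three statistics can be computed directly from the preference matrices; the sketch below follows the matrix convention introduced above and is our own implementation of the published formulas:

```python
import numpy as np
from math import comb

def consistency_zeta(P):
    """Intra-subject consistency from one subject's preference matrix
    for one scene: a_i are the out-degrees, d_st the circular triads."""
    n = P.shape[0]
    a = P.sum(axis=1).astype(float)
    d = n * (n - 1) * (2 * n - 1) / 12.0 - 0.5 * np.sum(a ** 2)
    max_d = (n ** 3 - n) / 24.0 if n % 2 else (n ** 3 - 4 * n) / 24.0
    return 1.0 - d / max_d

def kendall_u(P_total, m):
    """Inter-subject agreement from the pooled matrix over m subjects
    (P_total[i, j] = times TMO i beat TMO j, summed over subjects)."""
    n = P_total.shape[0]
    s = sum(comb(int(P_total[i, j]), 2)
            for i in range(n) for j in range(n) if i != j)
    return 2.0 * s / (comb(m, 2) * comb(n, 2)) - 1.0

def chi2_stat(u, n, m):
    """Chi-squared statistic for u, tested with n(n-1)/2 degrees of freedom."""
    return n * (n - 1) * (1.0 + u * (m - 1)) / 2.0
```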

In Table 5, we show all statistical measures for each scene, where we can see that intra- and inter-subject consistency values are very high and statistically significant. Then, in Fig. 5, we show the results of the overall paired comparison evaluations for every scene (obtained from Thurstone’s law of comparative judgment, Case V [54]) with 95% confidence limits. Spearman’s correlation between these rankings shows that TMOs behave similarly across different scenes (the coefficients are equal to or higher than 0.90, with p<0.05). We computed the mean value over all the scenes (Table 6) and observed that the best ranked TMOs were KimKautz, Krawczyk, and Reinhard, an ordering completely different from the ranking obtained in the previous experiment.
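For completeness, here is a compact sketch of Thurstone Case V scaling as applied to a pooled preference matrix (the clipping threshold is an arbitrary choice of ours to keep z-scores finite, not a value from the analysis):

```python
import numpy as np
from scipy.stats import norm

def thurstone_case_v(P):
    """Thurstone Case V scale values from a pooled preference matrix.

    P[i, j] = number of times TMO i was preferred over TMO j."""
    trials = P + P.T                          # comparisons per TMO pair
    with np.errstate(invalid="ignore", divide="ignore"):
        prop = np.where(trials > 0, P / trials, 0.5)
    prop = np.clip(prop, 0.01, 0.99)          # avoid infinite z-scores
    z = norm.ppf(prop)                        # proportions -> z-scores
    np.fill_diagonal(z, 0.0)
    return z.mean(axis=1)                     # higher = more often preferred
```

A TMO's score is simply the mean of its z-transformed win proportions against all other operators, which is why the scale is arbitrary and only differences between scores are meaningful.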


Table 5. Summary of All Statistical Analyses from Section 4.B.4


Fig. 5. Case V Thurstone’s law scores for each evaluated TMO in each scene. Thurstone scores are an arbitrary measure of how often a particular TMO is preferred over the others; thus, the higher the score, the better the TMO. Vertical lines show the 95% confidence limits.



Table 6. Ranking Obtained by Averaging the Scores Given by Case V Thurstone’s Law in the Three Different Scenes

5. DISCUSSION

Comparing the results of our two experiments, we observe that in Section 4.A (segment matching experiment—see Table 4), local TMOs are significantly better than global ones. On the contrary, in Section 4.B (scene reproduction experiment—see Table 6), global TMOs are significantly better than local ones. We computed Spearman’s correlation coefficient between the rankings of the two experiments and verified that there is no correlation.

An interesting example of this lack of correlation is iCAM06. It is clearly at the top of the rankings in the segment matching experiment, but it falls to a middle position in the scene reproduction experiment. This means that it correctly reproduces relationships among gray levels, but overall features are not maintained. An extreme example is Fattal, which is in the fourth position in the segment matching rankings but is last in the scene reproduction ranking. This can be explained because Fattal is based on local (or spatial) features, for instance, luminance gradients, but it does not enforce global features (such as global brightness and contrast). In fact, from Table 4 (RMSE results) we can conclude that Fattal produces a tone-mapped image that is systematically brighter than the real scene. Since Fattal’s fitted line has almost the same slope as the real scene (see Fig. 3), removing this offset could improve its performance in the scene reproduction experiment.

From the previous results, we infer that overall appearance not only depends on the correct reproduction of intensity relationships, but might also depend on many other weighted local attributes, such as the reproduction of gray-level and color relationships, contrast, brightness, artifacts, level of detail, and so on. This is in agreement with other authors [24,27,28,32]. Furthermore, our results show that overall attributes should also be considered to correctly reproduce the appearance.

Regarding the question of which is the best TMO, KimKautz and Krawczyk are very close in all rankings; hence both can be considered equally good.

A. Comparison to Other Studies

In Section 4.A (segment matching) we took into account a particular criterion that, to our knowledge, has never been studied in this kind of TMO ranking experiment. Moreover, we compare our segment matching results to those obtained by other works that study TMOs applied to gray-level images (given that our analysis was performed on gray-level facets).

Many works perform overall appearance comparisons, either with (as in our work) or without the real scene. Although Kuang et al. [23] performed an experiment without a real scene reference, our scene reproduction results agree with theirs in that Fattal is the worst ranked operator and Reinhard is one of the best ranked. Contrary to our results, Kuang et al. [23] concluded that Durand is better than Reinhard. The reason could be that they might have run Reinhard in local operator mode, which we did not. Furthermore, they performed a study with gray-scale images, and their results showed that Durand was better than Reinhard but iCAM was worse than Reinhard, which is broadly similar to our segment matching experiment’s results. The results differ for iCAM, but note that they used iCAM [55], whereas we used iCAM06.

Yoshida et al. [24,28] performed experiments with architectural indoor HDR scenes and concluded that Reinhard and Drago were good in terms of naturalness and Durand was not ranked as highly as in [23] (in an experiment without the reference scene). Our results agree with Yoshida et al. [24,28]. Moreover, Yoshida et al. [28] showed that global and local operators obtain different results, but global TMO results are more similar among themselves than local TMO results. As pointed out in the previous section, this relationship is also present in our study (Tables 4 and 6).

Ledda et al. [25] used an HDR display and obtained a ranking according to the overall similarity of TMO images. In this ranking, iCAM was first, which does not agree with our results. In addition, their ranking shows the following TMO order: Reinhard, Drago, and Durand, which matches our results. These authors also performed experiments in gray scale, obtaining Reinhard as the best ranked, which does not agree with our results.

Cadík et al. [27,32] performed a very exhaustive study of perceptual attributes. We agree with some of their results like the good ranking of Reinhard (close to the best) and the unnaturalness of Fattal. Moreover, we strongly agree with them in that the best overall quality is generally observed in images produced by global TMOs. Nevertheless, we want to point out that there was some conflict between these two studies. In the first one [27], Durand was the worst ranked operator, ranked even lower than Fattal, but in the second one [32], Fattal was the worst ranked and Durand was in a middle position. Our results are in line with Cadík et al. [32].

We do not agree with Kuang et al. [29] in that Durand is always the best ranked operator (with and without a reference scene). Furthermore, in contrast with our results, Reinhard is in a middle position of their ranking.

Kuang et al. [30] suggested, again, that Durand was better than Reinhard and iCAM06 was even better than Durand. In our results, Durand and iCAM06 are quite close, but Reinhard is much better than them. Again, Reinhard could have been run in local TMO mode.

In a study similar to that of Kuang et al. [23], Ashikhmin and Goyal [26] concluded that, compared to the real scene, Fattal and Drago were two of their overall best performers. We do not agree that Fattal is one of the best performers, but we have to point out that, in their work, they tuned the TMOs’ parameters, which implies that Fattal could be a good TMO when its parameters are finely tuned. Furthermore, in their work, Drago obtained more or less the same results as Fattal, but Reinhard obtained worse results than both. They do not specify how they ran Reinhard, but it is possible that they ran it in local mode. They found that trilateral filtering [56], which is an improvement on Durand, was the worst ranked TMO, so it is consistent that, in our work, Durand obtained worse results than Drago and Reinhard.

In [31], the outputs of the most internally sophisticated TMOs are statistically worse than the best single LDR exposure. Since a global operator is generally less sophisticated than a local one, we could expect global TMO results to be better than local TMO results. Contrary to this expectation, Mertens (which cannot be considered a sophisticated TMO because it simply fuses single LDR exposures) is in a middle position in the segment matching experiment but is one of the worst ranked in the scene reproduction experiment.

Some authors emphasize the creation and use of particular metrics to compare tone-mapped images. For example, Ferradans et al. [48] performed an evaluation of several TMOs using the metric of Aydin et al. [20]. Although it is not the purpose of our work, we performed a very preliminary analysis comparing our results to those of Aydin et al. [20] as shown in [48]. We agree that Fattal was the operator with the highest total error percentages, but disagree with the general overall TMO ranking. A detailed analysis comparing numerical metrics and psychophysical results is scheduled for future work.

It is possible to identify several shortcomings in our study that need to be addressed before a more definitive conclusion can be reached. First, we have assumed that the software provided by the Sigma camera manufacturer is accurate enough to convert the scene luminance array to the sRGB digital file used as input to all TMO algorithms. This assumption hides possible inaccuracies due to glare effects, lens aberrations, and possible tone/chroma enhancements. In the past, we calibrated this camera and measured the linearity and spectral sensitivity of its sensors for use in daylight settings [51] and verified that tone/chroma enhancements are kept to a minimum, at least for its raw image settings. For this work we did not employ our own calibration (which is valid within a fairly limited dynamic range) but decided to rely on the manufacturer’s algorithm instead. All of this limits the reproducibility of our experiments (unless, of course, the same camera is used). We are also aware that the absence of an accurate radiometric description of our scenes further limits the reproducibility of our experiments. To this end, we provide photometric information at least for the patches and facets used in the matching comparisons (see Tables 2 and 3) and the dynamic range of both the monitor and the scenes (see Section 3.A).

6. CONCLUSIONS

Our results show that TMO quality rankings strongly depend on the criteria used for the psychophysical evaluation. Not surprisingly, on one hand, local TMOs are better than global TMOs in our segment matching experiment because these operators do not consider just a pixel but also a region of pixels (i.e., spatial information). On the other hand, global TMOs are better than local ones in our scene reproduction experiment. We found no significant correlation between the segment matching and scene reproduction rankings, showing that observers use several visual attributes to perform their tasks and that some of these attributes are not considered by TMOs. We conclude that TMOs should take into account both local and global characteristics of the image, which implies that there is ample room for improvement in the future development of TMO algorithms. Furthermore, we suggest that agreed standard criteria should be defined for a proper and fair comparison among them.

Our rankings also show there is no TMO that is clearly better than all the others across our experiments, but KimKautz and Krawczyk are perhaps the best ranked since they do not underperform in any of the metrics.

As a general conclusion, since none of the tested TMOs satisfies all the testing criteria (segment matching, scene reproduction, and their respective analyses), operators have to be selected depending on each particular task. This is a consequence of the lack of a coherent understanding of the goals of a TMO, which is reflected in the wide variety of evaluation methods and results present in the literature. From a scientific point of view, a TMO should aim to perceptually reproduce the real scene rather than modify image appearance according to aesthetics (for which we already have a wide selection of image tools). Having said that, it is also important to consider that these operators are widely used in digital cameras and mobile phone cameras, and TMO users often prefer aesthetic improvements over accurate scene reproduction.

Funding

Agència de Gestió d’Ajuts Universitaris i de Recerca (AGAUR) (2017-SGR-649); Ministerio de Economía y Competitividad (MINECO) (DPI2017-89867-C2-1-R); CERCA Programme/Generalitat de Catalunya.

Acknowledgment

We would like to thank Carlo Gatta for his useful comments on the psychophysical experiment design, Javier Retana for his useful comments on the statistical analysis procedures, and the reviewers, who provided very interesting and useful comments on the paper and the work in general. Thanks to all subjects who have participated in the psychophysical experiments and all TMO authors who publicly share their code.

REFERENCES

1. J. Ferwerda and S. Luka, "A high resolution, high dynamic range display system for vision research," J. Vis. 9(8), 346 (2009).

2. E. Reinhard, G. Ward, S. Pattanaik, and P. Debevec, High Dynamic Range Imaging: Acquisition, Display and Image-Based Lighting, 1st ed. (Morgan Kaufmann, 2005), Chap. 6, pp. 187–221.

3. R. Snowden, P. Thompson, and T. Troscianko, Basic Vision: An Introduction to Visual Perception (Oxford University, 2006).

4. J. McCann, "Art, science, and appearance in HDR," J. Soc. Inf. Disp. 15, 709–719 (2007).

5. C. Parraman, "The drama of illumination: artist's approaches to the creation of HDR in paintings and prints," Proc. SPIE 7527, 75270U (2010).

6. J. McCann and A. Rizzi, The Art and Science of HDR Imaging, 1st ed. (Wiley, 2012), Chap. 13, pp. 119–121.

7. C. Mees, The Fundamentals of Photography, 2nd ed. (Eastman Kodak, 1921).

8. H. Barlow, "Summation and inhibition in the frog's retina," J. Physiol. 119, 69–88 (1953).

9. A. Derrington, J. Krauskopf, and P. Lennie, "Chromatic mechanisms in lateral geniculate nucleus of macaque," J. Physiol. 357, 241–265 (1984).

10. C. Blakemore and F. Campbell, "On the existence of neurons in the human visual system selectively sensitive to the orientation and size of retinal images," J. Physiol. 203, 237–260 (1969).

11. E. Land, "The retinex," Am. Sci. 52, 247–253, 255–264 (1964).

12. E. Land and J. McCann, "Lightness and retinex theory," J. Opt. Soc. Am. 61, 1–11 (1971).

13. J. McCann, "Capturing a black cat in shade: past and present of retinex color appearance models," J. Electron. Imaging 13, 36–47 (2004).

14. J. McCann, "Retinex at 50: color theory and spatial algorithms, a review," J. Electron. Imaging 26, 031204 (2017).

15. J. McCann, "Lessons learned from Mondrians applied to real images and color gamuts," in 7th Color Imaging Conference: Color Science, Systems and Applications (1999), pp. 1–8.

16. X. Otazu, M. Vanrell, and C. Parraga, "Multiresolution wavelet framework models brightness induction effects," Vis. Res. 48, 733–751 (2008).

17. X. Otazu, C. A. Parraga, and M. Vanrell, "Toward a unified chromatic induction model," J. Vis. 10(12), 5 (2010).

18. J. McCann, C. Parraman, and A. Rizzi, "Reflectance, illumination, and appearance in color constancy," Front. Psychol. 5, 5 (2014).

19. A. Ruppertsberg, M. Bloj, F. Banterle, and A. Chalmers, "Displaying colourimetrically calibrated images on a high dynamic range display," J. Visual Commun. Image Represent. 18, 429–438 (2007).

20. T. Aydin, R. Mantiuk, K. Myszkowski, and H. Seidel, "Dynamic range independent image quality assessment," ACM Trans. Graph. 27, 69 (2008).

21. H. Yeganeh and Z. Wang, "Objective quality assessment of tone-mapped images," IEEE Trans. Image Process. 22, 657–667 (2013).

22. F. Drago, W. Martens, K. Myszkowski, and H. Seidel, "Perceptual evaluation of tone mapping operators," in ACM SIGGRAPH Conference Abstracts and Applications (2003).

23. J. Kuang, H. Yamaguchi, G. Johnson, and M. Fairchild, "Testing HDR image rendering algorithms," in IS&T/SID 12th Color Imaging Conference (2004).

24. A. Yoshida, V. Blanz, K. Myszkowski, and H. Seidel, "Perceptual evaluation of tone mapping operators with real-world scenes," in Human Vision and Electronic Imaging X (SPIE, 2005).

25. P. Ledda, A. Chalmers, T. Troscianko, and H. Seetzen, "Evaluation of tone mapping operators using a high dynamic range display," ACM Trans. Graph. 24, 640–648 (2005).

26. M. Ashikhmin and J. Goyal, "A reality check for tone-mapping operators," ACM Trans. Appl. Percept. 3, 399–411 (2006).

27. M. Cadík, M. Wimmer, L. Neumann, and A. Artusi, "Image attributes and quality for evaluation of tone mapping operators," in 14th Pacific Conference on Computer Graphics and Applications (2006), pp. 35–44.

28. A. Yoshida, V. Blanz, K. Myszkowski, and H. Seidel, "Testing tone mapping operators with human-perceived reality," J. Electron. Imaging 16, 013004 (2007).

29. J. Kuang, H. Yamaguchi, C. Liu, G. Johnson, and M. Fairchild, "Evaluating HDR rendering algorithms," ACM Trans. Appl. Percept. 4, 1–27 (2007).

30. J. Kuang, G. Johnson, and M. Fairchild, "iCAM06: a refined image appearance model for HDR image rendering," J. Vis. Commun. Image Represent. 18, 406–414 (2007).

31. A. Akyüz, R. Fleming, B. Riecke, E. Reinhard, and H. Bülthoff, "Do HDR displays support LDR content? A psychophysical evaluation," ACM Trans. Graph. 26, 38 (2007).

32. M. Cadík, M. Wimmer, L. Neumann, and A. Artusi, "Evaluation of HDR tone mapping methods using essential perceptual attributes," Comput. Graph. 32, 330–349 (2008).

33. J. Tumblin and H. Rushmeier, "Tone reproduction for realistic images," IEEE Comput. Graph. Appl. 13, 42–48 (1993).

34. G. Ward, A Contrast-Based Scalefactor for Luminance Display (Academic, 1994), pp. 415–421.

35. E. Reinhard, M. Stark, P. Shirley, and J. Ferwerda, "Photographic tone reproduction for digital images," ACM Trans. Graph. 21, 267–276 (2002).

36. N. Miller, P. Y. Ngai, and D. D. Miller, "The application of computer graphics in lighting design," J. Illum. Eng. Soc. 14, 6–26 (1984).

37. M. Kim and J. Kautz, "Consistent tone reproduction," in Proceedings of Computer Graphics and Imaging (2008).

38. G. Krawczyk, K. Myszkowski, and H. Seidel, "Lightness perception in tone reproduction for high dynamic range images," in Proceedings of Eurographics (2005), p. 3.

39. J. Ferwerda, S. Pattanaik, P. Shirley, and D. Greenberg, "A model of visual adaptation for realistic image synthesis," in Proceedings of ACM SIGGRAPH (ACM, 1996), pp. 249–258.

40. M. Ashikhmin, "A tone mapping algorithm for high contrast images," in 13th Eurographics Workshop on Rendering (2002).

41. F. Durand and J. Dorsey, "Fast bilateral filtering for the display of high-dynamic-range images," in Proceedings of ACM SIGGRAPH (ACM, 2002), pp. 257–266.

42. R. Fattal, D. Lischinski, and M. Werman, "Gradient domain high dynamic range compression," in Proceedings of ACM SIGGRAPH (ACM, 2002), pp. 249–256.

43. F. Drago, K. Myszkowski, T. Annen, and N. Chiba, "Adaptive logarithmic mapping for displaying high contrast scenes," in Proceedings of Eurographics (2003), Vol. 22.

44. Y. Li, L. Sharan, and E. Adelson, "Compressing and companding high dynamic range images with subband architectures," ACM Trans. Graph. 24, 836–844 (2005).

45. E. Reinhard and K. Devlin, "Dynamic range reduction inspired by photoreceptor physiology," IEEE Trans. Vis. Comput. Graph. 11, 13–24 (2005).

46. T. Mertens, J. Kautz, and F. Van Reeth, "Exposure fusion," in 15th Pacific Conference on Computer Graphics and Applications (2007), pp. 382–390.

47. L. Meylan, D. Alleysson, and S. Süsstrunk, "Model of retinal local adaptation for the tone mapping of color filter array images," J. Opt. Soc. Am. A 24, 2807–2816 (2007).

48. S. Ferradans, M. Bertalmío, E. Provenzi, and V. Caselles, "An analysis of visual adaptation and contrast perception for tone mapping," IEEE Trans. Pattern Anal. Mach. Intell. 33, 2002–2012 (2011).

49. X. Otazu, "Perceptual tone-mapping operator based on multiresolution contrast decomposition," Perception 41 (ECVP Abstract Supplement), 86 (2012).

50. A. Gilchrist, C. Kossyfidis, F. Bonato, T. Agostini, J. Cataliotti, X. Li, B. Spehar, V. Annan, and E. Economou, "An anchoring theory of lightness perception," Psychol. Rev. 106, 795–834 (1999).

51. "Camera calibration methods," 2018, http://www.cvc.uab.es/color_calibration/CameraCal2.htm.

52. F. Banterle, A. Artusi, K. Debattista, and A. Chalmers, Advanced High Dynamic Range Imaging: Theory and Practice (AK Peters/CRC Press, 2011).

53. M. Kendall and B. Babington-Smith, "On the method of paired comparisons," Biometrika 31, 324–345 (1940).

54. E. Montag, "Louis Leon Thurstone in Monte Carlo: creating error bars for the method of paired comparison," Proc. SPIE 5294, 222–230 (2004).

55. M. Fairchild and G. Johnson, "Rendering HDR images," in 11th Color Imaging Conference (IS&T/SID, 2003), pp. 108–111.

56. P. Choudhury and J. Tumblin, "The trilateral filter for high contrast images and meshes," in Proceedings of the Eurographics Symposium on Rendering (2003), pp. 186–196.



Figures (5)

Fig. 1. To show the general appearance of the physical scenes, here we show a single LDR exposure (chosen by simple visual inspection by the authors) from the set of LDR exposures used to create the HDR images. Because each picture is a single LDR exposure, the cuboids in the dark regions are not completely visible.

Fig. 2. In Experiment 1, observers performed two tasks. In Task 1 [Fig. 2(a)], observers matched the brightnesses of the five cuboids' facets to the brightnesses of five patches in the reference table. In Task 2 [Fig. 2(b)], they performed the same task on the TMO image displayed on the calibrated monitor. (Red arrows are drawn randomly, for illustrative purposes.)

Fig. 3. Results of Experiment 1. Segment matches in the tone-mapped images are plotted against segment matches in the real scene. Markers and lines identify each TMO. Since not all the data were normally distributed, the markers show the median of the subjects' observations. Horizontal lines indicate the first and third quartiles of Task 1, and vertical lines indicate the first and third quartiles of Task 2. For each operator, we fitted a linear model using the median of the subjects' observations. The figure is divided into four panels for clarity. The real scene is plotted against itself in all panels to provide a fixed reference (y = x). In summary, the better the TMO, the closer its fit to the solid black line.

Fig. 4. RMSE with respect to the real scene (RMSE_scene) is the difference between segments matched in the tone-mapped image and in the real scene. Different types of lines and markers represent different TMOs. The abscissa represents the segments matched in the real scene, ordered from darkest to brightest. According to this metric, the smaller the value, the better the TMO.

Fig. 5. Case V Thurstone's law scores for each evaluated TMO in each scene. Thurstone scores are an arbitrary measure of how many times a particular TMO is judged better than the others; thus, the higher the score, the better the TMO. Vertical lines show the 95% confidence limits.
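For readers who wish to reproduce the kind of scoring shown in Fig. 5, the sketch below illustrates the standard Thurstone Case V computation from a paired-comparison count matrix, in the spirit of [53,54]. It is a minimal illustration rather than our exact analysis pipeline: the function name, the toy counts, and the clamping rule for unanimous preferences are assumptions introduced here for the example.

```python
from statistics import NormalDist

def thurstone_case_v(counts, m):
    """Thurstone Case V scale values from a paired-comparison matrix.

    counts[i][j] = number of trials in which stimulus i was preferred
    over stimulus j; m = number of trials per pair. Unanimous
    proportions (0 or 1) are clamped, a common practical fix to avoid
    infinite z-scores."""
    nd = NormalDist()
    n = len(counts)
    eps = 1.0 / (2 * m)                      # clamp bound for 0 and 1
    z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                prop = min(max(counts[i][j] / m, eps), 1 - eps)
                z[i][j] = nd.inv_cdf(prop)   # preference proportion -> z-score
    # Case V score: mean z-score of each stimulus against all others
    return [sum(row) / (n - 1) for row in z]

# Toy example: 3 TMOs, 20 comparisons per pair (hypothetical counts)
counts = [[0, 14, 17],
          [6, 0, 12],
          [3, 8, 0]]
print(thurstone_case_v(counts, m=20))
```

The returned values lie on an interval scale, so only differences between scores are meaningful, which is why the caption of Fig. 5 describes them as arbitrary.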

Tables (6)

Table 1. Summary of Used TMO Characteristics

Table 2. L* (CIELAB) and Luminance (cd/m²) Values of Each Patch in the Reference Table

Table 3. Photometric Assessment of the Scene Facets Used in Our Matching Experiments

Table 4. Performance of All TMOs in the Segment Matching Experiment

Table 5. Summary of All Statistical Analyses from Section 4.B.4

Table 6. Ranking Obtained by Averaging the Scores Given by Case V Thurstone's Law in the Three Different Scenes

Equations (3)

$$\zeta_{st} = \begin{cases} 1 - \dfrac{24\,d_{st}}{n^{3}-n}, & \text{if } n \text{ is odd}\\[6pt] 1 - \dfrac{24\,d_{st}}{n^{3}-4n}, & \text{if } n \text{ is even,}\end{cases} \qquad \text{with } d_{st} = \frac{n(n-1)(2n-1)}{12} - \frac{1}{2}\sum_{i=1}^{n} a_{ist}^{2},$$

$$u_{s} = \frac{2\sum_{ij}\binom{p_{ij}}{2}}{\binom{m}{2}\binom{n}{2}} - 1,$$

$$\chi_{s}^{2} = \frac{n(n-1)\bigl(1 + u_{s}(m-1)\bigr)}{2}.$$
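To make the last two formulas concrete: they are Kendall and Babington-Smith's coefficient of agreement u_s and its associated chi-square statistic [53], computed from the matrix of pairwise preference counts for m observers and n stimuli. The following minimal Python sketch (the function name and the toy counts are hypothetical, not our experimental data) computes both:

```python
from math import comb

def kendall_agreement(p, m):
    """Coefficient of agreement u and its chi-square statistic
    (Kendall and Babington-Smith [53]) for a paired-comparison
    experiment with m observers.

    p[i][j] = number of observers who preferred stimulus i over
    stimulus j, so p[i][j] + p[j][i] == m for every pair i != j."""
    n = len(p)
    # Sum of C(p_ij, 2) over all ordered pairs i != j
    total = sum(comb(p[i][j], 2)
                for i in range(n) for j in range(n) if i != j)
    u = 2 * total / (comb(m, 2) * comb(n, 2)) - 1
    chi2 = n * (n - 1) * (1 + u * (m - 1)) / 2
    return u, chi2

# Toy example: 3 stimuli judged by 5 observers (hypothetical counts)
p = [[0, 4, 5],
     [1, 0, 3],
     [0, 2, 0]]
u, chi2 = kendall_agreement(p, m=5)
print(f"u = {u:.3f}, chi2 = {chi2:.3f}")  # u = 0.333, chi2 = 7.000
```

When all m observers agree on every pair, u_s reaches its maximum of 1; values near 0 indicate agreement no better than chance.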