Tone-mapping operators (TMOs) are designed to generate perceptually similar low-dynamic-range images from high-dynamic-range ones. We studied the performance of 15 TMOs in two psychophysical experiments where observers compared the digitally generated tone-mapped images to their corresponding physical scenes. All experiments were performed in a controlled environment, and the setups were designed to emphasize different image properties: in the first experiment we evaluated the local relationships among intensity levels, and in the second one we evaluated global visual appearance among physical scenes and tone-mapped images, which were presented side by side. We ranked the TMOs according to how well they reproduced the results obtained in the physical scene. Our results show that ranking position clearly depends on the adopted evaluation criteria, which implies that, in general, these tone-mapping algorithms consider either local or global image attributes but rarely both. Regarding the question of which TMO is the best, KimKautz [“Consistent tone reproduction,” in Proceedings of Computer Graphics and Imaging (2008)] and Krawczyk [“Lightness perception in tone reproduction for high dynamic range images,” in Proceedings of Eurographics (2005), p. 3] obtained the better results across the different experiments. We conclude that more thorough and standardized evaluation criteria are needed to study all the characteristics of TMOs, as there is ample room for improvement in future developments.
© 2018 Optical Society of America under the terms of the OSA Open Access Publishing Agreement
1. INTRODUCTION
In almost all naturalistic viewing situations, we are immersed in scenes that could be described as high dynamic range (HDR); in other words, the intensity difference between the brightest and the darkest patch is much higher than the difference that capture and display devices can faithfully handle. For instance, the energy ratio between sunlight and starlight is approximately 100,000,000:1. If the human visual system (HVS) were to linearly represent these extreme differences in its normal daylight operation, it would require a much larger sensitivity range for its retinal sensors (cones) and neural pathways than is achievable within biological limitations. Instead, millions of years of evolution have solved this problem by adapting the sensorial and neural machinery, allowing it to nonlinearly convert the large natural intensity range into a much smaller range of about 10,000:1 [2,3].
A. Historical Context
The problem of translating the HDR world into low-dynamic-range (LDR) depictions is very old. Renaissance painters such as da Vinci and Caravaggio tried to solve it by creating an artistic technique called chiaroscuro, which pays attention to strong contrasts in different painted areas, creating very strong effects. This and the need to overcome the limitations of physical materials (oil paints and substrates) inspired later artists to produce remarkable paintings. Perhaps the most dramatic were created by depicting a single artificial source of light (such as a candle), making the details of the central subject very bright, while other subjects are slightly darker. It can be argued that some of the works by Rembrandt and Constable are no different from today’s HDR photography [4–6].
The arrival of photography implied a new set of challenges given the strong limitations of early light-sensitive materials . Examples of the first characterization of silver halide films as plots of density versus exposure were made by Hurter and Driffield in 1890, as discussed in . In particular, outdoor scenes were very difficult to capture, and early photographers experimented with multiple exposures to overcome dynamic range problems. When photographs involved human subjects, these had to remain still during the whole process so that several exposures could be combined into a single image.
B. Electronic HDR Imaging
Analog HDR imaging allowed only limited manipulations (via exposure time, chemical reactions or the combination of several exposures, etc.), but the arrival of electronic digital imaging made possible long-range interactions of pixels located in different parts of the image. This opened the field to multiple possibilities including mimicking the operation of the HVS and the work of chiaroscuro artists.
Physiological and psychophysical research has shown that photopic human vision is the result of highly nonlinear processing of the information captured by retinal cones. This processing includes the inhibition of the output of a neuron by the output of surrounding neurons in its field of view , which results in higher sensitivity for edges and spots than for uniform light. Other processing includes the combination of visual information in the retina into a series of postreceptoral chromatically opponent channels to transmit it to the visual cortex via the optic nerve . In the cortex, visual information is mostly processed in terms of its spatial frequency and visual orientation . In the 1960s, a series of psychophysical experiments with achromatic Mondrians run by Edwin Land demonstrated that patches reflecting light with exactly the same physical properties appear completely different to observers . This implies that a digital image (where these patches produce exactly the same pixel values) cannot be modified using a pixel-wise transformation to simulate the appearance reported by observers. In other words, the information contained in individual pixels is not enough to mimic human vision. A comprehensive review of these experiments can be found in [12–15]. Other effects to consider are related to how the visual cortex processes local brightness interactions [16,17]. More detailed experiments have shown the effects of edges in illumination perception by matching the appearance of painted wooden facets to that of a painted test target (“ground truth”) , a paradigm very similar to ours (see below).
In order to mimic the response of the HVS, electronic imaging systems set out to use information not only from single pixels but from the entire scene. This allowed them a much larger flexibility to calculate appearances and to apply them to electronic displays or prints. Ironically, later HDR algorithms reverted to the old “multiple exposures” and “pixel-wise” processing techniques of analog photography for the same task (see below).
C. Tone-Mapping Operator (TMO)
Mapping the dynamic range of the HDR world into LDR media presents an important challenge for visual representation technologies, mainly because most imaging devices (cameras and monitors) are only able to capture or display images within a small range of about 100:1, which can be increased up to 1000:1 for specialized HDR LED-based displays. To solve this problem, an assortment of nonlinear image processing techniques has been defined to display HDR scenes on LDR devices. To construct the HDR image, many LDR images of the same scene are usually taken at different exposure values, capturing a much larger dynamic range. The HDR image is generated by extracting from each LDR image the information corresponding to its region of interest (where it is neither over- nor underexposed) and combining these regions. Since this new HDR image cannot be displayed on a standard LDR monitor, an algorithm is needed to reduce its dynamic range to match that of the monitor. A common solution is to use a TMO to reduce the dynamic range while keeping the perceptual characteristics of the original HDR image approximately constant. The performance of these TMOs depends on several factors, including lighting and viewing conditions, aesthetic/realistic preferences, local/global assumptions, and so forth, and is usually evaluated using computational [20,21] and psychophysical [22–32] methods.
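The exposure-merging step described above can be sketched as follows. This is only a minimal illustration, assuming a linear sensor response and a simple hat-shaped weighting; the function name `merge_exposures` and the thresholds `low` and `high` are hypothetical and do not correspond to any specific HDR software.

```python
import numpy as np

def merge_exposures(images, exposure_times, low=0.05, high=0.95):
    """Merge linear LDR exposures (values in [0, 1]) into a relative-radiance map.

    Assumes a linear sensor response; `low` and `high` are hypothetical
    thresholds marking under- and overexposed pixels.
    """
    images = np.asarray(images, dtype=np.float64)
    times = np.asarray(exposure_times, dtype=np.float64)
    # Hat-shaped weight: largest for mid-range pixels, zero outside (low, high).
    weights = np.where((images > low) & (images < high),
                       1.0 - np.abs(2.0 * images - 1.0), 0.0)
    # Each exposure estimates radiance as pixel value / exposure time.
    radiance = images / times[:, None, None]
    total = weights.sum(axis=0)
    merged = (weights * radiance).sum(axis=0) / np.maximum(total, 1e-12)
    # Pixels that are never well exposed fall back to the plain average.
    return np.where(total > 0, merged, radiance.mean(axis=0))

# Synthetic scene spanning four orders of magnitude, "photographed" five times.
scene = np.array([[0.01, 0.1], [1.0, 100.0]])
times = [0.005, 0.05, 0.5, 5.0, 50.0]
shots = [np.clip(scene * t, 0.0, 1.0) for t in times]
hdr = merge_exposures(shots, times)
```

In this synthetic case each pixel is well exposed in at least one shot, so the merged map recovers the scene values up to rounding. Real cameras add noise, nonlinearity, and veiling glare, which this sketch ignores.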
Although HDR images are able to reproduce a wider range of luminance highlights and shadows than LDR ones, the presence of veiling glare both in the camera and the eye limits the possible range of accurate luminance measurements . Since HDR is perceptually closer to the original scene, there must be other reasons than simply obtaining a larger range of luminances for this perceived improvement. It has been hypothesized  that the improvement comes from a better preservation of relative spatial information that comes from digital quantization (spatial differences in highlights and shadows are preserved), and TMOs use this to replicate the HVS processing.
In this work we present a new set of experiments and analyses to psychophysically evaluate the performance of 15 state-of-the-art TMOs. This allowed us to rank the TMOs according to how well they represent the original scene as human observers perceive it. Unlike previous studies, all the experiments were performed in a controlled environment, and tone-mapped images were presented side by side with the physical scene.
D. “Global” Versus “Local” Analysis
At this point we believe it is important to clarify the terminology used throughout this work. The term tone is traditionally used to describe pixel data (as in “tone mapping”) and was introduced by Mees  in 1920 to explain how exposure was related to photographic print density (silver halide response). Indeed, tone scale is the name given to a look-up table that transforms data in an input space to a desired output space.
The term global TMO, which is also used by several authors [33–35], generally refers to an algorithm that applies the same pixel-wise adjustment to all pixels in the image (although, in fact, it uses the most local information: a single pixel). In contrast, the term local TMO generally defines an algorithm that applies a combination of pixel-wise processing and spatial transformations to improve the image. Although confusing, we will follow the traditional terminology here, using global TMO for algorithms that apply pixel-wise processing and local TMO for algorithms that apply a combination of pixel-wise and spatial image processing.
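The distinction can be illustrated with a short sketch. The two functions below are generic stand-ins written for this explanation (they are not any of the published operators evaluated here): `global_tmo` applies one fixed curve to every pixel, whereas `local_tmo` also uses a neighborhood mean, so two pixels with identical input values can receive different outputs.

```python
import numpy as np

def global_tmo(lum):
    """Global operator: the same pixel-wise curve (a tone scale) for every pixel."""
    return lum / (1.0 + lum)

def box_mean(lum, radius=1):
    """Mean over a (2*radius+1)^2 window, replicating values at the borders."""
    padded = np.pad(lum, radius, mode="edge")
    size = 2 * radius + 1
    out = np.zeros_like(lum, dtype=np.float64)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + lum.shape[0], dx:dx + lum.shape[1]]
    return out / size**2

def local_tmo(lum, radius=1):
    """Illustrative local operator: compress each pixel relative to its surround."""
    return lum / (lum + box_mean(lum, radius))

# Two patches with the same luminance (10), one in a dark and one in a bright surround.
lum = np.ones((3, 6))
lum[:, 3:] = 100.0
lum[1, 1] = 10.0
lum[1, 4] = 10.0

g = global_tmo(lum)
l = local_tmo(lum)
```

Here `g[1, 1]` and `g[1, 4]` are identical, while `l[1, 1]` is much larger than `l[1, 4]`, because the second patch sits in a brighter surround.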
We will refer to our psychophysical experiments (see below) as scene reproduction when observers judge images by freely comparing them, and segment matching when they match the luminances of specific points in the scene to those of a reference table in the same scene.
2. STATE OF THE ART
A. Previous TMO Psychophysics
Although the idea of using algorithms to match the brightness of real scenes to that of imaging devices is not new [33,36], TMOs did not become popular until the turn of the century, when affordable digital cameras became available [30,35,37–49]. To date, many different psychophysical experiments have been performed, and they can be classified as described in the following sections.
1. Experiments without a Reference HDR Scene
One of the first psychophysical experiments to evaluate TMOs compared the performance of six TMOs on four different (synthetic and photographic) scenes by asking subjects to make pairwise perceptual evaluations and by rating stimuli with respect to three attributes: apparent image contrast, apparent level of detail, and apparent naturalness . The results showed that preferred operators produced detailed images with moderate contrast.
Kuang et al.  performed pairwise comparisons on eight different TMOs using 10 different scenes and two conditions (color and gray-level) where subjects had to choose the preferred image considering general rendering performance (including tone compression performance, color saturation, natural appearance, image contrast, and image sharpness). Their results showed that the gray-scale tone-mapping performances are consistent with those in the overall rendering results, if not the same.
2. Experiments with a Reference HDR Scene
Yoshida et al. conducted a psychophysical experiment based on a direct comparison between the appearance of real-world scenes and TMO images of these scenes displayed on an LDR monitor. In their experiment, they differentiated between global and local operators, and introduced, for the first time, the comparison between a tone-mapped image and the real scene, selecting two different indoor architectural scenes. Fourteen subjects were asked to give ratings according to several criteria such as realism (image naturalness in terms of reproducing the overall appearance of the real-world views) and image appearance (brightness, contrast, and detail reproduction in dark regions and in bright ones). They found that none of these image appearance attributes had a strong influence on the perception of naturalness by itself. This work was extended to find out which attributes of image appearance accounted for the differences between tone-mapped images and the real scene. They observed a clear distinction between global and local operators. However, they concluded again that none of the evaluated image attributes had a strong influence on the perception of naturalness by itself, which suggested that naturalness depends on a combination of the other attributes with different weights.
In another work, Ashikhmin and Goyal  performed three different experiments. Subjects ranked different tone-mapped images depending on the task. In the first experiment, the authors asked which image they liked more without having the reference scene. In the second one, the authors asked which image seemed more real without viewing the reference scene, and in the third one, they asked which image was the closest to the real scene viewing the reference scene. They observed that rankings were totally different when subjects could compare the tone-mapped image to the reference scene.
In a subsequent study, Kuang et al.  performed three different experiments they named preference evaluation, image-preference modelling, and accuracy evaluation. In the preference evaluation experiment, pairwise comparisons between tone-mapped images were performed. Here they used only color images, and the aim was to evaluate the general rendering performance by instructing observers to consider perceptual attributes such as overall impression on image contrast, colorfulness, image sharpness, and natural appearance. In contrast, in the image-preference modelling experiment, they rated gray-scale images (which were gray-scale versions of the first experiment color images). Here, observers considered perceptual attributes such as highlight details, shadow details, overall contrast, sharpness, colorfulness, and appearance of artifacts, comparing the TMOs’ visual rendering “to their internal representation of a ‘perfect’ image in their minds” . In the accuracy evaluation, both pairwise comparison and rating techniques were used in order to evaluate the perceptual accuracy of the rendering algorithms. The pairwise comparison of TMOs was performed without viewing the real scene, and subjects were asked to compare the overall impression on image contrast, colorfulness, image sharpness, and overall natural appearance. An additional rating evaluation was performed using the real scenes set up in the adjoining room as references. Here, subjects had to rate image attributes such as highlight contrast, shadow contrast, highlight colorfulness, shadow colorfulness, overall contrast, and overall rendering accuracy compared to the overall appearance of the real-world scenes. In both experiments, observers did not have immediate access to the real scene and had to rely on their memories (either short- or long-term) to perform the tasks.
To validate the iCAM06 operator , its authors performed two psychophysical experiments similar to the previous ones . The first experiment was a pairwise comparison without viewing the reference scene. Observers had to choose the tone-mapped image that they preferred based on overall impression on image quality (considering contrast, colorfulness, image sharpness, and overall natural appearance). In the second experiment, observers were also asked to evaluate overall rendering accuracy by comparing the overall appearance of the rendered images to their corresponding real-world scenes, which were set up in an adjoining room.
While looking for a definition of an overall image quality measure, Cadík et al.  studied the relationships between image attributes such as brightness, contrast, reproduction of colors, and reproduction of details. They performed two psychophysical experiments, using 14 TMOs, in order to propose a scheme of relationships between these attributes, being aware that some special attributes, which were not evaluated (e.g., glare stimulation, visual acuity, and artifacts), can influence their relationships. In the first one, 10 subjects were asked to perform ratings using five criteria: overall image quality and the four basic attributes (brightness, contrast, and reproduction of detail and of colors). These evaluations were performed using a real scene as a reference (a typical real indoor HDR scene). In the second experiment, subjects did not have access to the real scene and had to rank image printouts according to the overall image quality and the four basic attributes.
In a later study, Cadík et al. performed exactly the same type of experiments, adding two new scenes for a total of three: a real indoor HDR scene, an outdoor HDR scene, and a night urban HDR scene. In the first experiment, subjects were asked to rate overall image quality and the quality of reproduction of five attributes by comparing samples to the real scene. These attributes were the four basic ones of their previous work plus the lack of disturbing image artifacts (which was one of the non-evaluated special attributes in their previous study). These experiments were set up in an uncontrolled natural environment, so subjects had to perform the experiments at the same time of day at which the HDR image had been acquired. In the second experiment, subjects had no possibility of directly comparing to the real scene and had to rank the image printouts according to the overall image quality and the quality of the basic attributes.
3. Experiments Using an HDR Monitor
In 2005, Ledda et al. performed two different psychophysical experiments comparing six different TMOs to linearly mapped HDR scenes displayed on an HDR device. They used 23 different color and gray-scale HDR scenes, showing three different images per comparison: the HDR image and two tone-mapped images. In the first experiment, subjects were asked to select the tone-mapped image more similar to the HDR reference by judging its global appearance. In the second one, they were asked to base their judgment on detail reproduction.
In a later work, Akyüz et al.  asked subjects to rank six images (one HDR image, three tone-mapped images, one objectively good LDR exposure value, and one subjectively good LDR exposure value) according to their subjective preferences. They found that participants did not systematically prefer tone-mapped HDR images over the best single LDR exposures.
All the previous studies have been focused on subjective comparisons of global and local image appearance attributes such as contrast, colorfulness, sharpness, and reproduction artifacts, either within TMOs or against the real scene. While this is no doubt extremely important, we believe a good TMO should output a scene that produces the same visual sensation as the physical scene, in particular the interrelations between objects and their perceived attributes. For instance, no study has been conducted (as far as we know) to evaluate whether objects represented within a TMO image maintain the same perceived visual differences as the real scene. This is the main objective of our work.
B. Tone-Mapping Operators
As mentioned before, TMOs can be classified according to their processing as global or local. Global operators perform the same computation on all pixels, regardless of spatial position, which makes them more computationally efficient at the cost of losing contrast and image detail. Some examples of global TMOs are [37,39,43,45]. On the other hand, local operators, which take into account surrounding pixels, produce images with more contrast and a higher level of detail, but they may show halos around high-contrast edges. Local operators are inspired by the local adaptation process present at the early processing stages of the HVS. Some examples of local operators are [30,38,40–42,44,47,49]. Some TMOs can be global or local depending on their configuration parameters. One example is Reinhard, and another is Ferradans, which is developed in two stages, the first global and the second local. A brief summary of the properties of each TMO used in our experiments is given in Table 1. The first column shows the names that we will use to refer to each operator throughout this work. The characteristics of each TMO are as follows:
- Ashikhmin. This local TMO is inspired by the processing mechanisms present at the first stages of the HVS. The intensity range is compressed by a local luminance adaptation function, and, in a last step, detail information is added.
- Drago. This global TMO is based on logarithmic luminance compression that, depending on scene content, uses a predetermined logarithmic basis to preserve contrast and details.
- Durand. This local TMO decomposes the image into two layers: the base and the detail. Large-scale variations of the base layer are encoded, while the magnitudes of the detail layer are preserved.
- Fattal. This local TMO manipulates the gradient fields of the luminance image. Its idea is to identify high gradients at different scales and attenuate their magnitudes while maintaining their directions.
- Ferradans. This TMO can be executed as global or local because it is divided into two stages. In the first stage, it applies a global method that implements visual adaptation, trying to mimic the saturation of human cones. In the second stage, it enhances local contrast using a variational model inspired by color vision phenomenology. In our work, this operator was run as local.
- Ferwerda. This global TMO is based on a computational model of visual adaptation that was adjusted to fit psychophysical results on threshold visibility, color appearance, visual acuity, and sensitivity over time.
- iCAM06. This local TMO is based on the iCAM06 color appearance model, which gives the perceptual attributes of each pixel, such as lightness, chromaticity, hue, contrast, and sharpness. It includes an inverse model that considers viewing conditions to generate the result.
- KimKautz. This global TMO is based on the assumption that human visual sensitivity is adapted to the average log luminance of the scene and that it follows a Gaussian distribution.
- Li. This local TMO is based on a multiscale image decomposition that uses a symmetrical analysis–synthesis filter bank to reconstruct the signal and applies local gain control to the subbands to reduce the dynamic range.
- Mertens. This technique fuses original LDR images of different exposure values (exposure fusion) to obtain the final “tone-mapped” image, which avoids the generation of an HDR image. Guided by simple quality measures such as saturation and contrast, it selects “good” pixels from the sequence and combines them to create the resulting image. Thus, for this method we used a stack of LDR images instead of an HDR image.
- Meylan. This local TMO is derived from a model of retinal processing. In a first step, a basic tone-mapping algorithm is applied to the mosaic image captured by the sensors. In a second step, it introduces a variation of the center/surround spatial opponency.
- Otazu. This local TMO is based on a multipurpose human color perception algorithm. It decomposes the intensity channel into a multiresolution contrast decomposition and applies a nonlinear saturation model of visual cortex neurons.
- Reinhard. This TMO can be executed as global or local. It performs a global scaling of the dynamic range followed by dodging and burning (local) processes. In our work, this operator was run as global, which is its default setting in the toolbox.
- Reinhard–Devlin. This global TMO uses a model of photoreceptor adaptation that can be automatically adjusted to the general light level.
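As a concrete instance of a global operator from the list above, the following sketch implements the commonly cited form of the global photographic operator associated with Reinhard: the luminance is scaled by a "key" relative to its log-average and then compressed. It is a minimal illustration acting on luminance only, not the HDR Toolbox implementation used in our experiments; the default `key=0.18` and the choice of the maximum scaled luminance as white point are assumptions of this sketch.

```python
import numpy as np

def reinhard_global(lum, key=0.18, white=None, eps=1e-6):
    """Global photographic tone curve applied to a luminance array.

    `key` and the default white point (the maximum scaled luminance)
    are assumptions made for this sketch.
    """
    lum = np.asarray(lum, dtype=np.float64)
    # The log-average luminance characterizes the overall scene brightness.
    log_avg = np.exp(np.mean(np.log(eps + lum)))
    scaled = key * lum / log_avg
    if white is None:
        white = scaled.max()
    # Compressive curve; luminances at the white point map to 1.
    return scaled * (1.0 + scaled / white**2) / (1.0 + scaled)

# Luminances covering six orders of magnitude.
hdr_lum = np.logspace(-2, 4, 7)
ldr_lum = reinhard_global(hdr_lum)
```

Because the same curve is applied to every pixel, the mapping is monotonic: the order of luminances is preserved, but local contrast in very bright or very dark regions is compressed.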
In order to compare TMOs, we performed two different experiments, called the segment matching and scene reproduction experiments. The aim of the first experiment was to study the internal relationships among gray levels in the tone-mapped image and in the real scene (i.e., a segment matching experiment similar to ). The aim of the second experiment was to evaluate TMOs according to how similar their results were perceived to be with respect to the real scene. In both cases, we obtained a ranking of the different TMOs. Behind these experiments is the idea that a good TMO is one whose output is perceptually similar to the real scene, and, to achieve that, a good reproduction of the relationships among objects is needed.
Our experiments were performed in a controlled environment where the only sources of light were a lamp, which illuminated the real scene, and a CRT screen. We used a ViSaGe MKII Stimulus Generator and a Mitsubishi Diamond-Pro 2045u CRT monitor placed side by side with a handmade real HDR scene. The monitor was calibrated using the Cambridge Research Systems Ltd. (Rochester, England) software for the ViSaGe MKII Stimulus Generator and a ColorCal (Minolta sensor) suction-cup colorimeter. Both the monitor and the real scene were set up so that the objects in both scenes subtended approximately the same angle () and looked similarly positioned to the observer.
We built three different HDR scenes, each including a gray-level reference table and two solid parallelepipeds (cuboids). The reference table was built by printing a series of 65 gray squares () arranged in a flat distribution. The rows were labelled A, B, C, …, K and the columns 1, 2, 3, …, 6. The lightness of these patches decreased monotonically from the top (patch A1–#1) to the bottom (patch K5–#65), as measured by our PR-655 SpectraScan spectroradiometer. The printed values were selected so that their CIE L* (lightness) values were equally spaced, meaning that their distribution was approximately uniform in terms of perceived lightness (see Table 2). The cuboids consisted of pieces of wood ( length between 9.4 and 10 cm), whose sides (facets) were covered with arbitrary samples of the same printed paper as the reference table. There were two cuboids in each scene (one under direct illumination and the other in the shade). The third column of Table 3 shows the patch of the reference table that each cuboid facet corresponded to, the fourth column indicates its position with respect to the illumination, and the last column indicates its luminance (when placed within its scene). Table 2 also shows the luminance values for these patches once lit by our light source. The chromaticity of all printed material was . The rest of each scene consisted of many plastic and wooden objects of different colors and shapes (see Fig. 1).
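To illustrate what equal spacing in CIE L* implies, the sketch below converts a set of equally spaced lightness values into relative luminance using the standard CIE 1976 lightness formula. The endpoints (L* from 95 down to 20) are hypothetical placeholders chosen for the example; the measured values are those reported in Table 2.

```python
import numpy as np

DELTA = 6.0 / 29.0  # breakpoint constant of the CIE 1976 lightness function

def lstar_to_relative_luminance(lstar):
    """Invert CIE 1976 L* to relative luminance Y/Yn."""
    lstar = np.asarray(lstar, dtype=np.float64)
    f = (lstar + 16.0) / 116.0
    return np.where(f > DELTA, f**3, 3.0 * DELTA**2 * (f - 4.0 / 29.0))

def relative_luminance_to_lstar(y):
    """CIE 1976 L* from relative luminance Y/Yn."""
    y = np.asarray(y, dtype=np.float64)
    f = np.where(y > DELTA**3, np.cbrt(y), y / (3.0 * DELTA**2) + 4.0 / 29.0)
    return 116.0 * f - 16.0

# Hypothetical reference table: 65 patches equally spaced in L* (not the measured values).
lstar_patches = np.linspace(95.0, 20.0, 65)
y_patches = lstar_to_relative_luminance(lstar_patches)

# Equal steps in L* correspond to unequal steps in luminance.
lstar_steps = np.diff(lstar_patches)
luminance_steps = np.diff(y_patches)
```

The lightness steps are constant, whereas the luminance steps shrink toward the dark end of the table, which is why a table that is uniform in L* looks approximately uniform to an observer while being far from uniform in physical units.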
Two facets of one cuboid and three of the other were always visible from the subjects’ location, resulting in 15 different gray facets in total (see Table 3). The incandescent lamp (100 W) had its bulb painted blue to simulate D65 illumination and was set up so that the luminance of the brightest object was about the same as the maximum luminance the monitor was capable of producing (about ).
We photographed the real scene using a Sigma Foveon SD10 camera placed in the exact same position as the subjects’ heads during the psychophysical experiments. The same camera was calibrated for use in other measurements , and because of this, we have a fairly good idea of the linearity and spectral sensitivity of its sensors. The setup was arranged so that the images presented on the monitor looked geometrically the same as the real ones shown beside it. Since the walls were covered in black felt, reflections from all other objects were minimized. The dynamic range of the scenes as measured by multiple exposures using the camera was approximately for scene 1 and for scenes 2 and 3. The dynamic range of the reference table as measured by the PR-655 was .
Although it has been shown that, because of glare, it is not possible to achieve an accurate representation of the scene luminance distribution from a combination of many LDR images, this technique can still provide a good enough approximation. Consequently, a set of 25 photographs was taken at different exposure times (from 15 to 1/6000 s) using the same aperture, focal distance, zoom settings, and visual field. Individual images were stored in RAW format and transformed into 16-bit sRGB (using the camera manufacturer’s software).
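The range covered by this exposure bracket can be expressed in photographic stops with a short calculation. The even spacing on a logarithmic scale used below is an assumption made for illustration only; the actual shutter speeds of the 25 photographs are not specified here.

```python
import math

longest = 15.0            # longest exposure time, in seconds
shortest = 1.0 / 6000.0   # shortest exposure time, in seconds
n_shots = 25

# Total span of the bracket in stops (factors of two in exposure time).
span_stops = math.log2(longest / shortest)

# Assumption: exposures evenly spaced on a log scale (not stated in the text).
step_stops = span_stops / (n_shots - 1)
exposure_times = [longest * 2.0 ** (-k * step_stops) for k in range(n_shots)]
```

The ratio between the longest and shortest exposure is 90,000:1, that is, roughly 16.5 stops, so under the even-spacing assumption consecutive photographs would differ by about two thirds of a stop.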
To avoid any bias regarding the operators, all experiments started with a 1-min subject adaptation to the ambient light. Most TMO implementations were obtained from the popular HDR Toolbox for MATLAB, while others (Ferradans, iCAM06, Li, Meylan, and Otazu) were obtained from their corresponding authors’ web pages. In order to avoid benefiting any of the TMOs, we ran all of them with their default settings. In Ferradans’ case, we had to choose values for two parameters, and we selected the default values specified in their paper ( and ). An alternative would have been to ask each TMO’s author to produce the best possible tone mapping, but we discarded this option because it was impractical (we could not make the same request of all authors) and because this practice impairs the reproducibility of the results.
A. Experiment 1: Segment Matching
The segment matching experiment consisted of two different tasks:
Task 1. After adaptation, subjects were asked to match, in the real scenes (i.e., with monitor turned off), the brightness of the five cuboids’ facets to the brightness of the patches in the reference table in each scene [see Fig. 2(a)]. Although there were no time constraints to perform the tasks, subjects were advised to take no more than 30 s per match.
Task 2. Here the real scene was not visible, and the observers only saw digital (tone-mapped) versions presented on the monitor. Their task was similar to Task 1, except that all matchings were conducted entirely between the facets and patches shown on the screen [see Fig. 2(b)].
There were three conditions for Experiment 1, corresponding to the three different scenes created (see Fig. 1). Observers performed 240 matchings in total (5 facets × 15 different tone-mapped images × 3 scenes plus 15 matchings in the real scenes). In practice, all matchings were conducted by writing, for each facet, the coordinates of the matching reference table patch on a piece of paper. The presentation order of the tone-mapped images was randomized.
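The trial count above can be checked by enumerating the design, as in the sketch below. The helper `presentation_order` is a hypothetical illustration of how a randomized presentation order could be drawn; it is not the procedure actually used to generate the orders in the experiment.

```python
import random

N_FACETS, N_TMOS, N_SCENES = 5, 15, 3

# Task 2: every visible facet in every tone-mapped image of every scene.
task2_trials = [(scene, tmo, facet)
                for scene in range(N_SCENES)
                for tmo in range(N_TMOS)
                for facet in range(N_FACETS)]

# Task 1: every visible facet in each real scene.
task1_trials = [(scene, facet)
                for scene in range(N_SCENES)
                for facet in range(N_FACETS)]

total_matchings = len(task1_trials) + len(task2_trials)

def presentation_order(n_tmos, seed=None):
    """Return a random permutation of TMO indices (hypothetical helper)."""
    rng = random.Random(seed)
    order = list(range(n_tmos))
    rng.shuffle(order)
    return order
```

Enumerating the design gives 225 matchings in Task 2 and 15 in Task 1, which adds up to the 240 matchings reported.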
2. Experimental Design
In Experiment 1, the independent variables (IVs) were the cuboids’ facets and the reference table patches. The dependent variables (DVs) were the subjects’ segment matches in the tone-mapped images (Task 2), and the control variables (CVs) were the subjects’ segment matches in the real scene (Task 1). Our null hypothesis was that there was no significant difference between the segments matched in the real scene (CVs, Task 1) and those matched in the tone-mapped images (DVs, Task 2), that is, that the TMOs perfectly reproduce the perceptual relationships among the objects present in the real scene.
Task 1 was completed by a group of 12 observers with normal or corrected-to-normal vision, recruited from our lab academic/research community. This group (eight male and four female) was comprised of people aged between 17 and 54. Nine of them were completely naïve to the aims of the experiment. Task 2 was completed by 10 of the previous observers (eight male and two female).
Figure 3 shows a plot of the segment matches obtained in Task 2 against the segment matches in Task 1. We fitted a linear model to the results obtained for each TMO. If a TMO reproduced the interrelations among the gray facets well, its fit should be very similar to the fit for the real scene (i.e., points should lie along the diagonal).
We performed two different analyses to evaluate to what extent the local interrelations perceived by the observers in the tone-mapped versions corresponded to those perceived in the real scene. In the first analysis, we studied the slopes of the different fitted linear models with respect to the slope obtained in the real scene. The smaller the difference, the better the reproduction of the interrelations (that is, the TMO maintained the relationships among the facets and patches). Figure 3 also shows the offset between the lines fitted to the TMOs and the line fitted to the real scene. In the second analysis, we studied this displacement by computing the root mean square error (RMSE) between them.
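The two analyses can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are made up, the function names `fit_line` and `line_rmse` are hypothetical, the real-scene reference is taken to be the diagonal, and the RMSE is sampled at the real-scene segment positions, which may differ from the procedure actually used.

```python
from math import sqrt

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def line_rmse(line_a, line_b, xs):
    """RMSE between two fitted lines, sampled at the positions xs."""
    (slope_a, icpt_a), (slope_b, icpt_b) = line_a, line_b
    squared = [((slope_a * x + icpt_a) - (slope_b * x + icpt_b)) ** 2 for x in xs]
    return sqrt(sum(squared) / len(xs))

# Hypothetical mean segment matches for five facets (not data from the paper).
real = [2.0, 5.0, 9.0, 14.0, 18.0]   # Task 1: matches in the real scene
tmo = [4.0, 7.0, 11.0, 16.0, 20.0]   # Task 2: same ordering, constant offset

reference_line = fit_line(real, real)  # assumption: the reference is the diagonal
tmo_line = fit_line(real, tmo)

# First analysis: difference between slopes.
slope_difference = abs(tmo_line[0] - reference_line[0])
# Second analysis: displacement between the two fitted lines.
offset_rmse = line_rmse(tmo_line, reference_line, real)

print(slope_difference, offset_rmse)
```

In this made-up case the slope difference is essentially zero while the RMSE is not, which is the kind of situation where the two analyses rank an operator differently.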
All results are shown in Table 4, where iCAM06 has the smallest distance to the real scene in both analyses. Since its slope difference and RMSE are very small, we can assume that the pixel interrelations in its tone-mapped image perceptually mimic the real scene. Given that iCAM06 is based on a color appearance model that considers perceptual attributes such as lightness, chromaticity, hue, contrast, and sharpness, its results are expected to be in line with observers’ perception.
We calculated Spearman’s rank correlation coefficient between the rankings obtained from both segment matching analyses (see Table 4) and obtained a value of 0.59 (). Although both rankings are quite similar, it is worth paying attention to some interesting cases such as Ferradans, whose slope is very close to that of the real scene, but whose fitted model lies systematically under the real scene’s line (i.e., its RMSE is very large). An opposite example is Mertens, which has a different slope, but whose RMSE is the second smallest.
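For reference, Spearman's coefficient between two rankings of the same operators can be computed as in the following minimal sketch. It assumes rankings without ties and uses made-up ranks; `spearman_rho` is a hypothetical helper name, not a function from the study.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two rankings of the same items.

    Uses the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is only
    valid when neither ranking contains ties.
    """
    if len(rank_a) != len(rank_b):
        raise ValueError("both rankings must cover the same items")
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical ranks of five operators under the two analyses.
slope_ranks = [1, 2, 3, 4, 5]
rmse_ranks = [2, 1, 3, 5, 4]
print(spearman_rho(slope_ranks, rmse_ranks))
```

A value close to 1 indicates that both analyses order the operators similarly, and a value close to 0 indicates no monotonic relationship.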
Another interesting observation from Fig. 3 is that, at the lowest and highest brightness values, the agreement between subjects is higher than at middle values (both horizontal and vertical dispersion lines are smaller). This suggests that the TMOs are more accurate at reproducing both the brightest and the darkest parts of the image. To analyze this effect in more detail, we studied the subjects’ results for each facet. In Fig. 4, the abscissa shows the segments matched in the real scene ordered from darkest to brightest and the ordinate represents the RMSE in the tone-mapped images with respect to the real scene. We defined RMSE as $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(r_i - t_i)^2}$, where $r_i$ is the $i$th subject’s segment matched in the real scene, $t_i$ is the $i$th subject’s segment matched in the tone-mapped image, and $N$ is the number of subjects. Again, in almost all TMOs, the RMSE value is smaller for the darkest and brightest facets than for mid-gray facets. Thus, not only the agreement between subjects but also the error (RMSE) is lower for both the brightest and darkest values.
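The per-facet error described here can be illustrated with a short sketch. The matches below are invented, `facet_rmse` is a hypothetical name, and pairing each subject's Task 1 match with the same subject's Task 2 match is an assumption about how the comparison was organized.

```python
from math import sqrt

def facet_rmse(real_matches, tmo_matches):
    """RMSE across subjects for one facet.

    real_matches[i] and tmo_matches[i] are the segments chosen by subject i
    in the real scene and in the tone-mapped image, respectively.
    """
    if len(real_matches) != len(tmo_matches):
        raise ValueError("one match per subject is required in both tasks")
    n = len(real_matches)
    return sqrt(sum((r - t) ** 2 for r, t in zip(real_matches, tmo_matches)) / n)

# Hypothetical matches of four subjects for a dark, a mid-gray and a bright facet.
facets = {
    "dark":   ([1, 1, 2, 1], [1, 2, 2, 1]),
    "mid":    ([9, 10, 8, 11], [12, 7, 11, 8]),
    "bright": ([19, 20, 20, 19], [20, 20, 19, 19]),
}
errors = {name: facet_rmse(real, tmo) for name, (real, tmo) in facets.items()}
print(errors)
```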
B. Experiment 2: Scene Reproduction
Experiment 2 consisted of a pairwise comparison of tone-mapped images obtained using different TMOs in the presence of the original scene (side by side). After 1-min adaptation in front of the physical scene, a pair of tone-mapped images of the same physical scene was randomly selected and presented sequentially to the observer on the CRT screen beside the real scene. Subjects could press a gamepad button to toggle which image of the tone-mapped pair was presented on the monitor (only one image was displayed at a time). For this task, they were asked to “select the image that was more similar to the real scene.” As before, there was no time limit, but subjects were advised to complete a trial in less than 30 s. After an image was chosen, a gray background was shown for 2 s, and a different random pair was selected for the next trial. Every subject performed 105 comparisons per scene, taking around 25 min in total. There were three experimental conditions, corresponding to the three different physical scenes created (see Fig. 1). Between conditions, subjects were forced to take a 5–10 min break outside while the physical scene was replaced.
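The figure of 105 comparisons per scene is the number of unordered pairs that can be formed from 15 operators, 15 × 14 / 2. A minimal sketch of how such a randomized trial list could be generated follows. The placeholder operator names, the `trial_pairs` helper and the randomization of which image is shown first are assumptions, not details taken from the experiment software.

```python
import random
from itertools import combinations

def trial_pairs(tmo_names, seed=None):
    """Return every unordered pair of operators once, in random order.

    Within each pair, the image displayed first is also chosen at random.
    """
    rng = random.Random(seed)
    pairs = [tuple(rng.sample(pair, 2)) for pair in combinations(tmo_names, 2)]
    rng.shuffle(pairs)
    return pairs

# Placeholder names standing in for the 15 evaluated operators.
names = [f"TMO{i:02d}" for i in range(1, 16)]
pairs = trial_pairs(names, seed=1)
print(len(pairs))
```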
2. Experimental Design
In this experiment, the IVs were the different TMOs, the DVs were the subjects’ evaluations (i.e., the preference matrix), and the CV was the real scene. Our null hypothesis was that there were no differences in the TMOs’ performances since all of them perceptually reproduce the real scene.
A group of 10 people with normal or corrected-to-normal vision, seven male and three female, recruited from our lab’s academic and research community, completed this experiment. This group comprised people aged between 17 and 54 years old. Seven of them were naïve to the aims of the experiment.
From the pairwise comparison results, we defined a preference matrix for each subject and each scene. We constructed a directed graph where the nodes were the evaluated TMOs and the arrows pointed from a preferred TMO to a non-preferred TMO. For example, if $\mathrm{TMO}_i$ is preferred over $\mathrm{TMO}_j$ (the tone-mapped image from $\mathrm{TMO}_i$ is more similar to the real scene than the one from $\mathrm{TMO}_j$), we drew an arrow from $\mathrm{TMO}_i$ to $\mathrm{TMO}_j$, for $i \neq j$.
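A preference matrix of this kind is simply the adjacency matrix of the directed graph. The sketch below builds one from a list of trial outcomes. The `(winner, loser)` encoding and the function names are assumptions made for illustration.

```python
def preference_matrix(choices, n_tmos):
    """Build the preference matrix for one subject and one scene.

    choices is an iterable of (winner, loser) operator indices, one per trial.
    matrix[i][j] == 1 encodes an arrow from operator i to operator j.
    """
    matrix = [[0] * n_tmos for _ in range(n_tmos)]
    for winner, loser in choices:
        if winner == loser:
            raise ValueError("an operator cannot be compared with itself")
        matrix[winner][loser] += 1
    return matrix

def out_degrees(matrix):
    """Number of arrows leaving each node, i.e. how often each operator won."""
    return [sum(row) for row in matrix]

# Hypothetical outcomes for three operators: 0 beats 1 and 2, and 1 beats 2.
matrix = preference_matrix([(0, 1), (0, 2), (1, 2)], n_tmos=3)
print(matrix, out_degrees(matrix))
```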
From this graph, we were able to analyze the intra-subject consistency coefficient for each scene. The consistency coefficient for each subject and scene is defined by2), is , and its possible maximum value is .
In order to study whether values are significant, we used the chi-squared test (). The values are defined by
The number of degrees of freedom of the chi-squared test is given by .
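As an illustration of intra-subject consistency, the sketch below follows the standard coefficient of consistency of Kendall and Babington-Smith, which counts circular triads (A preferred over B, B over C, and C over A) in a subject's preference graph. It is an assumption that this matches the definition used here. The function names are hypothetical, and the chi-squared significance test is not included.

```python
from itertools import combinations

def circular_triads(pref):
    """Count circular triads in a complete preference matrix.

    pref[i][j] is truthy when item i was preferred over item j.
    """
    count = 0
    for i, j, k in combinations(range(len(pref)), 3):
        clockwise = pref[i][j] and pref[j][k] and pref[k][i]
        counter_clockwise = pref[j][i] and pref[k][j] and pref[i][k]
        if clockwise or counter_clockwise:
            count += 1
    return count

def consistency_coefficient(pref):
    """Kendall and Babington-Smith coefficient of consistency.

    Returns 1 when there are no circular triads and 0 when their number
    reaches its maximum: (n^3 - n) / 24 for odd n, (n^3 - 4n) / 24 for even n.
    """
    n = len(pref)
    max_triads = (n ** 3 - n) / 24 if n % 2 else (n ** 3 - 4 * n) / 24
    return 1 - circular_triads(pref) / max_triads

# Hypothetical three-operator examples: fully transitive vs. fully circular.
transitive = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
circular = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(consistency_coefficient(transitive), consistency_coefficient(circular))
```

Enumerating all triples is adequate for 15 operators, where there are only 455 of them.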
In Table 5, we show all statistical measures for each scene, where we can see that intra- and inter-subject consistency values are very high and statistically significant. Then, in Fig. 5, we show the results of the overall paired comparison evaluations for every scene (obtained from Thurstone’s law of comparative judgment, Case V ) with 95% confidence limits. Spearman’s correlation between these rankings shows that TMOs have similar behavior across different scenes (their coefficients are equal to or higher than 0.90, with ). We computed the mean value along all the scenes (Table 6) and observed that the best ranked TMOs were KimKautz, Krawczyk, and Reinhard, which are completely different from the rankings obtained in the previous experiment.
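Thurstone's Case V scaling converts pairwise preference counts into interval-scale scores by passing each preference proportion through the inverse normal cumulative distribution and averaging. The following is a minimal sketch with invented counts. Clamping extreme proportions with `eps` is a common practical choice rather than something taken from the study, and the 95% confidence limits are not computed.

```python
from statistics import NormalDist

def thurstone_case_v(wins, eps=0.01):
    """Thurstone Case V scale values from a matrix of pairwise wins.

    wins[i][j] is the number of times item i was preferred over item j.
    Proportions are clamped to [eps, 1 - eps] so that unanimous results
    do not map to infinite z-scores.
    """
    n = len(wins)
    normal = NormalDist()
    scores = []
    for i in range(n):
        z_sum = 0.0
        for j in range(n):
            if i == j:
                continue
            total = wins[i][j] + wins[j][i]
            proportion = wins[i][j] / total if total else 0.5
            proportion = min(max(proportion, eps), 1 - eps)
            z_sum += normal.inv_cdf(proportion)
        scores.append(z_sum / n)  # row mean, with the diagonal counted as 0
    return scores

# Hypothetical counts for three operators judged by ten observers.
wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
scores = thurstone_case_v(wins)
print(scores)
```

Higher scores correspond to operators judged more similar to the real scene, and the scores sum to approximately zero by construction.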
Comparing the results of our two experiments, we observe that in Section 4.A (segment matching experiment, see Table 4), local TMOs are significantly better than global ones. On the contrary, in Section 4.B (scene reproduction experiment, see Table 6), global TMOs are significantly better than local ones. We computed Spearman’s correlation coefficient between the rankings of both experiments and verified that there is no correlation.
An interesting example of this lack of correlation is iCAM06. It is clearly at the top of the rankings in the segment matching experiment, but it is in the middle position in the scene reproduction experiment. This means that it correctly reproduces relationships among gray levels, but overall features are not maintained. An extreme example is Fattal, which is in the fourth position in the segment matching rankings but is the last in the scene reproduction ranking. This can be explained because Fattal is based on local (or spatial) features, for instance, luminance gradients, but it does not enforce global features (such as global brightness and contrast). In fact, from Table 4 (RMSE results) we can conclude that Fattal produces a tone-mapped image which is systematically brighter than the real scene. Since Fattal’s fitted line has almost the same slope as the real scene (see Fig. 3), removing this offset could improve its performance in the scene reproduction experiment.
From the previous results, we infer that overall appearance not only depends on the correct reproduction of intensity relationships, but might also depend on many other weighted local attributes, such as the reproduction of gray-level and color relationships, contrast, brightness, artifacts, level of detail, and so on. This is in agreement with other authors [24,27,28,32]. Furthermore, our results show that overall attributes should also be considered to correctly reproduce the appearance.
Regarding the question of which is the best TMO, KimKautz and Krawczyk are very close in all rankings; hence both can be considered equally good.
A. Comparison to Other Studies
In Section 4.A (segment matching) we took into account a particular criterion which, to our knowledge, has never been studied in this kind of TMO ranking experiment. Nevertheless, we compare our segment matching results to the results obtained by other works that study TMOs applied to gray-level images (given that our analysis has been performed on gray-level facets).
Many works perform overall appearance comparisons, either with (as in our work) or without the real scene. Although Kuang et al.  performed an experiment without a real scene reference, our scene reproduction results agree with theirs in that Fattal is the worst ranked operator and Reinhard is one of the best ranked. Contrary to our results, Kuang et al.  conclude that Durand is better than Reinhard. The reason could be that they might have run Reinhard in local operator mode, which we did not. Furthermore, they performed a study with gray-scale images and their results showed that Durand was better than Reinhard, but iCAM was worse than Reinhard, which is broadly similar to our segment matching experiment’s results. The results differ for iCAM, but they used iCAM  rather than the iCAM06 used in our study.
Yoshida et al. [24,28] performed experiments with architectural indoor HDR scenes and concluded that Reinhard and Drago were good in terms of naturalness and Durand was not ranked as highly as in  (in an experiment without the reference scene). Our results agree with Yoshida et al. [24,28]. Moreover, Yoshida et al.  showed that global and local operators obtain different results, but global TMO results are more similar among themselves than local TMO results. As pointed out in the previous section, this relationship is also present in our study (Tables 4 and 6).
Ledda et al.  used an HDR display and obtained a ranking according to the overall similarity of TMO images. In this ranking, iCAM was the first one, which does not agree with our results. In addition, their ranking shows the following TMO order: Reinhard, Drago, and Durand, which matches our results. These authors also performed experiments in gray scale, obtaining Reinhard as the best ranked, which does not agree with our results.
Cadík et al. [27,32] performed a very exhaustive study of perceptual attributes. We agree with some of their results like the good ranking of Reinhard (close to the best) and the unnaturalness of Fattal. Moreover, we strongly agree with them in that the best overall quality is generally observed in images produced by global TMOs. Nevertheless, we want to point out that there was some conflict between these two studies. In the first one , Durand was the worst ranked operator, ranked even lower than Fattal, but in the second one , Fattal was the worst ranked and Durand was in a middle position. Our results are in line with Cadík et al. .
We do not agree with Kuang et al.  in that Durand is always the best ranked operator (with and without a reference scene). Furthermore, in contrast with our results, Reinhard is in a middle position of their ranking.
Kuang et al.  suggested, again, that Durand was better than Reinhard and iCAM06 was even better than Durand. In our results, Durand and iCAM06 are quite close, but Reinhard is much better than them. Again, Reinhard could have been run in local TMO mode.
In a study similar to that of Kuang et al. , Ashikhmin and Goyal  concluded that, compared to the real scene, Fattal and Drago were two of their overall best performers. We do not agree that Fattal is one of the best performers, but we have to point out that, in their work, they tuned the TMO’s parameters, which implies that Fattal could be a good TMO when the parameters are finely tuned. Furthermore, in their work, Drago obtained more or less the same results as Fattal, but Reinhard obtained worse results than both. They do not specify how they ran Reinhard, but it is possible that they ran it in the local mode. They found that the trilateral filtering , which is an improvement of Durand, was the worst ranked TMO, so it makes sense that, in our work, Durand has obtained worse results than Drago and Reinhard.
In , the outputs of the most internally sophisticated TMO are statistically worse than the best single LDR exposure. Since a global operator is generally less sophisticated than a local one, we could expect that global TMO results are better than local TMO results. Contrary to this theory, Mertens (which cannot be considered a sophisticated TMO because it uses single exposure values) is in the middle positions in the segment matching experiments, but it is one of the worst ranked in the scene reproduction experiments.
Some authors emphasize the creation and use of particular metrics to compare tone-mapped images. For example, Ferradans et al.  performed an evaluation of several TMOs using the metric of Aydin et al. . Although it is not the purpose of our work, we performed a very preliminary analysis comparing our results to those of Aydin et al.  as shown in . We agree that Fattal was the operator with highest total error percentages, but disagree with the general overall TMO ranking. A detailed analysis comparing numerical metrics and psychophysical results is scheduled for future work.
It is possible to identify several shortcomings in our study that need to be addressed before a more definitive conclusion is achieved. First, we have assumed that the software provided by the Sigma camera manufacturers is accurate enough to convert the scene luminance array to the sRGB digital file used as input to all TMO algorithms. This assumption hides possible inaccuracies because of glare effects, lens aberrations, and possible tone/chroma enhancements. In the past, we calibrated this camera and measured the linearity and spectral sensitivity of its sensors for use in daylight settings  and verified that tone/chroma enhancements are kept to a minimum at least for its raw image settings. For this work we did not employ our own calibration (which is valid within a fairly limited dynamic range) but decided to rely on the manufacturer’s algorithm instead. All these limit the reproducibility of our experiments (unless of course the same camera is used). We are also aware that the absence of an accurate radiometric description of our scenes also limits the reproducibility of our experiments. To this end we provide photometric information at least of the patches and facets used in the matching comparisons (see Tables 2 and 3) and the dynamic range of both the monitor and the scenes (see Section 3.A).
Our results show that TMO quality rankings strongly depend on the criteria used for the psychophysical evaluation. Not surprisingly, on one hand, local TMOs are better than global TMOs in our segment matching experiment because these operators do not consider just a pixel but also a region of pixels (i.e., spatial information). On the other hand, global TMOs are better than local ones in our scene reproduction experiment. We have found no significant correlation between segment matching and scene reproduction rankings, showing that observers are using several visual attributes to perform their tasks and some of these attributes are not considered by TMOs. We conclude that TMOs should take into account both local and global characteristics of the image, which implies that there is ample room for improvement in the future development of TMO algorithms. Furthermore, we suggest that agreed standard criteria should be defined for a proper and fair comparison among them.
Our rankings also show there is no TMO that is clearly better than all the others across our experiments, but KimKautz and Krawczyk are perhaps the best ranked since they do not underperform in any of the metrics.
As a general conclusion, since none of the tested TMOs satisfies all the testing criteria (segment matching, scene reproduction, and their respective analyses), operators have to be selected depending on each particular task. This is a consequence of the lack of coherent understanding of the goals of a TMO, which is reflected in the wide variety of evaluation methods and results present in the literature. From a scientific point of view, a TMO should aim to perceptually reproduce the real scene instead of modifying image appearance according to aesthetics (for which we already have a wide selection of image tools). Having said so, it is also important to consider that these operators are widely used in digital cameras and mobile phone cameras, and TMO users often prefer aesthetic improvements over accurate scene reproduction.
Agència de Gestió d’Ajuts Universitaris i de Recerca (AGAUR) (2017-SGR-649); Ministerio de Economía y Competitividad (MINECO) (DPI2017-89867-C2-1-R); CERCA Programme/Generalitat de Catalunya.
We would like to thank Carlo Gatta for his useful comments on the psychophysical experiment design, Javier Retana for his useful comments on the statistical analysis procedures, and the reviewers, who provided very interesting and useful comments on the paper and the work in general. Thanks to all subjects who have participated in the psychophysical experiments and all TMO authors who publicly share their code.
1. J. Ferwerda and S. Luka, “A high resolution, high dynamic range display system for vision research,” J. Vis. 9(8), 346 (2009).
2. E. Reinhard, G. Ward, S. Pattanaik, and P. Debevec, High Dynamic Range Imaging: Acquisition, Display and Image-Based Lighting, 1st ed. (Morgan Kaufmann, 2005), Chap. 6, pp. 187–221.
3. R. Snowden, P. Thompson, and T. Troscianko, Basic Vision: An Introduction to Visual Perception (Oxford University, 2006).
4. J. McCann, “Art, science, and appearance in HDR,” J. Soc. Inf. Disp. 15, 709–719 (2007).
5. C. Parraman, “The drama of illumination: artist’s approaches to the creation of HDR in paintings and prints,” Proc. SPIE 7527, 75270U (2010).
6. J. McCann and A. Rizzi, The Art and Science of HDR Imaging, 1st ed. (Wiley, 2012), Chap. 13, pp. 119–121.
7. C. Mees, The Fundamentals of Photography, 2nd ed. (Eastman Kodak, 1921).
8. H. Barlow, “Summation and inhibition in the frog’s retina,” J. Physiol. 119, 69–88 (1953).
9. A. Derrington, J. Krauskopf, and P. Lennie, “Chromatic mechanisms in lateral geniculate-nucleus of macaque,” J. Physiol. 357, 241–265 (1984).
10. C. Blakemore and F. Campbell, “On the existence of neurons in the human visual system selectively sensitive to the orientation and size of retinal images,” J. Physiol. 203, 237–260 (1969).
11. E. Land, “The retinex,” Am. Sci. 52, 247–253, 255–264 (1964).
12. E. Land and J. McCann, “Lightness and retinex theory,” J. Opt. Soc. Am. 61, 1–11 (1971).
13. J. McCann, “Capturing a black cat in shade: past and present of retinex color appearance models,” J. Electron. Imaging 13, 36–47 (2004).
14. J. McCann, “Retinex at 50: color theory and spatial algorithms, a review,” J. Electron. Imaging 26, 031204 (2017).
15. J. McCann, “Lessons learned from Mondrians applied to real images and color gamuts,” in 7th Color Imaging Conference: Color Science, Systems and Applications (1999), pp. 1–8.
16. X. Otazu, M. Vanrell, and C. Parraga, “Multiresolution wavelet framework models brightness induction effects,” Vis. Res. 48, 733–751 (2008).
17. X. Otazu, C. A. Parraga, and M. Vanrell, “Toward a unified chromatic induction model,” J. Vis. 10(12), 5 (2010).
18. J. McCann, C. Parraman, and A. Rizzi, “Reflectance, illumination, and appearance in color constancy,” Front. Psychol. 5, 5 (2014).
19. A. Ruppertsberg, M. Bloj, F. Banterle, and A. Chalmers, “Displaying colourimetrically calibrated images on a high dynamic range display,” J. Visual Commun. Image Represent. 18, 429–438 (2007).
20. T. Aydin, R. Mantiuk, K. Myszkowski, and H. Seidel, “Dynamic range independent image quality assessment,” ACM Trans. Graph. 27, 69 (2008).
21. H. Yeganeh and Z. Wang, “Objective quality assessment of tone-mapped images,” IEEE Trans. Image Process. 22, 657–667 (2013).
22. F. Drago, W. Martens, K. Myszkowski, and H. Seidel, “Perceptual evaluation of tone mapping operators,” in ACM SIGGRAPH Conference Abstracts and Applications (2003).
23. J. Kuang, H. Yamaguchi, G. Johnson, and M. Fairchild, “Testing HDR image rendering algorithms,” in IS&T/SID 12th Color Imaging Conference (2004).
24. A. Yoshida, V. Blanz, K. Myszkowski, and H. Seidel, “Perceptual evaluation of tone mapping operators with real-world scenes,” in Human Vision & Electronic Imaging X (SPIE, 2005).
25. P. Ledda, A. Chalmers, T. Troscianko, and H. Seetzen, “Evaluation of tone mapping operators using a high dynamic range display,” ACM Trans. Graph. 24, 640–648 (2005).
26. M. Ashikhmin and J. Goyal, “A reality check for tone-mapping operators,” ACM Trans. Appl. Percept. 3, 399–411 (2006).
27. M. Cadík, M. Wimmer, L. Neumann, and A. Artusi, “Image attributes and quality for evaluation of tone mapping operators,” in 14th Pacific Conference on Computer Graphics and Applications (2006), pp. 35–44.
28. A. Yoshida, V. Blanz, K. Myszkowski, and H. Seidel, “Testing tone mapping operators with human-perceived reality,” J. Electron. Imaging 16, 013004 (2007).
29. J. Kuang, H. Yamaguchi, C. Liu, G. Johnson, and M. Fairchild, “Evaluating HDR rendering algorithms,” ACM Trans. Appl. Percept. 4, 1–27 (2007).
30. J. Kuang, G. Johnson, and M. Fairchild, “iCAM06: a refined image appearance model for HDR image rendering,” J. Vis. Commun. Image Represent. 18, 406–414 (2007).
31. A. Akyüz, R. Fleming, B. Riecke, E. Reinhard, and H. Bülthoff, “Do HDR displays support LDR content? A psychophysical evaluation,” ACM Trans. Graph. 26, 38 (2007).
32. M. Cadík, M. Wimmer, L. Neumann, and A. Artusi, “Evaluation of HDR tone mapping methods using essential perceptual attributes,” Comput. Graph. 32, 330–349 (2008).
33. J. Tumblin and H. Rushmeier, “Tone reproduction for realistic images,” IEEE Comput. Graph. Appl. 13, 42–48 (1993).
34. G. Ward, A Contrast-Based Scalefactor for Luminance Display (Academic, 1994), pp. 415–421.
35. E. Reinhard, M. Stark, P. Shirley, and J. Ferwerda, “Photographic tone reproduction for digital images,” ACM Trans. Graph. 21, 267–276 (2002).
36. N. Miller, P. Y. Ngai, and D. D. Miller, “The application of computer graphics in lighting design,” J. Illum. Eng. Soc. 14, 6–26 (1984).
37. M. Kim and J. Kautz, “Consistent tone reproduction,” in Proceedings of Computer Graphics and Imaging (2008).
38. G. Krawczyk, K. Myszkowski, and H. Seidel, “Lightness perception in tone reproduction for high dynamic range images,” in Proceedings of Eurographics (2005), p. 3.
39. J. Ferwerda, S. Pattanaik, P. Shirley, and D. Greenberg, “A model of visual adaptation for realistic image synthesis,” in Proceedings of ACM SIGGRAPH (ACM, 1996), pp. 249–258.
40. M. Ashikhmin, “A tone mapping algorithm for high contrast images,” in 13th Eurographics Workshop on Rendering (2002).
41. F. Durand and J. Dorsey, “Fast bilateral filtering for the display of high dynamic-range images,” in Proceedings of ACM SIGGRAPH (ACM, 2002), pp. 257–266.
42. R. Fattal, D. Lischinski, and M. Werman, “Gradient domain high dynamic range compression,” in Proceedings of ACM SIGGRAPH (ACM, 2002), pp. 249–256.
43. F. Drago, K. Myszkowski, T. Annen, and N. Chiba, “Adaptive logarithmic mapping for displaying high contrast scenes,” in Proceedings of Eurographics (2003), Vol. 22.
44. Y. Li, L. Sharan, and E. Adelson, “Compressing and companding high dynamic range images with subband architectures,” ACM Trans. Graph. 24, 836–844 (2005).
45. E. Reinhard and K. Devlin, “Dynamic range reduction inspired by photoreceptor physiology,” IEEE Trans. Vis. Comput. Graph. 11, 13–24 (2005).
46. T. Mertens, J. Kautz, and F. Van Reeth, “Exposure fusion,” in 15th Pacific Conference on Computer Graphics and Applications (2007), pp. 382–390.
47. L. Meylan, D. Alleysson, and S. Süsstrunk, “Model of retinal local adaptation for the tone mapping of color filter array images,” J. Opt. Soc. Am. A 24, 2807–2816 (2007).
48. S. Ferradans, M. Bertalmío, E. Provenzi, and V. Caselles, “An analysis of visual adaptation and contrast perception for tone mapping,” IEEE Trans. Pattern Anal. Mach. Intell. 33, 2002–2012 (2011).
49. X. Otazu, “Perceptual tone-mapping operator based on multiresolution contrast decomposition,” Perception 41 ECVP Abstract Supplement (2012), p. 86.
50. A. Gilchrist, C. Kossyfidis, F. Bonato, T. Agostini, J. Cataliotti, X. Li, B. Spehar, V. Annan, and E. Economou, “An anchoring theory of lightness perception,” Psychol. Rev. 106, 795–834 (1999).
51. “Camera calibration methods,” 2018, http://www.cvc.uab.es/color_calibration/CameraCal2.htm.
52. F. Banterle, A. Artusi, K. Debattista, and A. Chalmers, Advanced High Dynamic Range Imaging: Theory and Practice (AK Peters, CRC Press, 2011).
53. M. Kendall and B. Babington-Smith, “On the method of paired comparisons,” Biometrika 31, 324–345 (1940).
54. E. Montag, “Louis Leon Thurstone in Monte Carlo: creating error bars for the method of paired comparison,” Proc. SPIE 5294, 222–230 (2004).
55. M. Fairchild and G. Johnson, “Rendering HDR images,” in 11th Color Imaging Conference (IS&T/SID, 2000), pp. 108–111.
56. P. Choudhury and J. Tumblin, “The trilateral filter for high contrast images and meshes,” in Proceedings of the Eurographics Symposium on Rendering (2003), pp. 186–196.