Abstract
In computation of full-parallax computer-generated hologram (CGH), balance between speed and memory usage is always the core of algorithm development. To solve the speed problem of coherent ray trace (CRT) algorithm and memory problem of look-up table (LUT) algorithm without sacrificing reconstructed object quality, we develop a novel algorithm with split look-up tables (S-LUT) and implement it on graphics processing unit (GPU). Our results show that S-LUT on GPU has the fastest speed among all the algorithms investigated in this paper, while it still maintaining low memory usage. We also demonstrate high quality objects reconstructed from CGHs computed with S-LUT on GPU. The GPU implementation of our new algorithm may enable real-time and interactive holographic 3D display in the future.
©2009 Optical Society of America
1. Introduction
Holographic 3D display uses diffraction effect from holograms to reconstruct 3D objects. It is a very promising true 3D display technology, which has capability to provide wide color gamut and full depth cues, without using any glasses [1]. However, generation of holograms has heavy computational load [2]. To solve this problem needs fast algorithm and powerful computer hardware.
There are different kinds of algorithms for computer generated hologram (CGH) [2,3]. Fourier Ping-Pong algorithm was used to create Fourier holograms for holographic 3D display [4,5]. In this algorithm, one hologram-size fast Fourier transform operation is required for each slice of object, thus it is slow for objects with many depth layers. Lucente [6] developed diffraction specific algorithm and significantly reduced CGH computational load. But it is more suitable for horizontal parallax only (HPO) holograms and display system with high horizontal resolution. Coherent ray trace (CRT) algorithm simulates light propagation from all sampling object points to all hologram pixels [7]. It can achieve very high quality image reconstruction with full parallax, while keeping low memory consumption. However, because of direct many-to-many convolution, CRT algorithm is very time consuming. Look-up table (LUT) was used to reduce the computational load of CRT algorithm by pre-calculating possible results and storing them in a table for in-line look-up [8]. As the amount of possible results is huge, the size of look-up table is big. Kim et al. [9] implemented a novel look-up table (N-LUT) method to reduce memory usage but it is still in the order of gigabyte (GB).
All CGH algorithms need to trade off among reconstruction quality, computational speed and memory usage. For the holographic 3D display system developed in Data Storage Institute (DSI) [10], we need to compute full parallax holograms with fast speed on PC, even for big holograms containing complicated 3D object specifications. Thus, we have developed a new algorithm using split look-up tables (S-LUT), which can meet the requirements of high speed, low memory usage and full parallax reconstruction.
Using powerful computer hardware is another way to increase CGH computational speed. In the past, one used super computers / clusters for this task due to its heavy computational load [2,3]. Special-purpose computational chip was also designed to increase the speed [11]. Recently there is a trend to use GPU for CGH computation [7,12,13]. In this paper, we implement CGH algorithms on GPU in order to further increase the computational speed. But instead of using GPU’s graphic pipeline / OpenGL [7,12,13], we use C + + -like parallel GPU application programming interface (API), Compute Unified Device Architecture (CUDA) [14], as CUDA is designed especially for direct GPU computing without limitations by graphic pipeline / shader.
2. Conventional CRT and LUT algorithms
Conventional CRT algorithm directly simulates physical light propagation process from each object point in 3D space to each hologram pixel. Its mathematical formula can be written as in Eq. (1) [7,13]:
where is the intensity in the hologram plane at , Nthe number of object points and λ the wavelength. is the object point coordinates and the intensities.Algorithm steps of CRT follow Eq. (1), and can be listed as:
where X and Y are horizontal and vertical resolutions of spatial light modulator (SLM) used in display system, respectively.From the algorithm described above, the most inner loop ① runs times. The total computational complexity is in , where is the big O notation. For each run of loop ①, operations include one cosine, one square root, five additions and five multiplications. These operations introduce heavy computational load, especially cosine and square root operations, and make the algorithm run very slow. On the other hand, CRT algorithm has some advantages such as no additional memory requirement other than input 3D object specification and output hologram, as well as high reconstruction quality.
In order to solve slow speed problem of CRT, LUT algorithm [8,9] had been developed. In LUT, all possible results of the cosine function in Eq. (1) are off-line pre-computed and stored in a table; later whenever a result is needed for in-line computation, it will be directly read out from the table. Original LUT uses HPO, and reduces CGH computational load by one order [8] while keeping affordable memory requirement. But using HPO is considered as downgrading of reconstruction quality, as HPO lacks vertical parallax. If using LUT for full parallax hologram computation, memory requirement will dramatically increase.
The mathematical detail of LUT can be described in Eq. (2):
where , and .Converted into algorithm, it becomes:
- //Off-line pre-computation:
- For each possible : ;
- //In-line computation:
- For each hologram pixel :
- ;
- For each object point :
- ; ②
- End;
- End;
Although the most inner loop ② of in-line computation part also runs times. The total computational complexity is also in . Each time only one multiplication and one addition operations are performed. The rest of computational operations are done by the off-line pre-computation in LUT. Therefore, the computational speed of LUT algorithm is much faster than CRT. Considering the memory requirement of LUT, values are from 0 to with stepping , and values are from 0 to with stepping , where is the pixel size of SLM, the sampling interval of object space in horizontal direction and W the number of object space sampling points along horizontal direction. For typical holographic display system, , and in our display system W and X are in the same order of magnitude. Thus, the minimum of is and the maximum is , and the stepping of is . As a result, the table size in dimension is . Similarly, the table size in dimension is , where is the sampling interval of object space, H the number of object space sampling points, the pixel size of SLM, all in vertical direction. In total, the table size is , where D is the number of object space sampling points along depth direction (depth layers). Normally and are in the order of micrometer while and are in the order of sub-millimeter, so the table size can easily boost and uses up all available memory space of a PC. This limits the application of LUT algorithm although it is faster than CRT. However the reconstruction quality of LUT is as good as that of CRT because their computational results are mathematically identical.
3. A novel S-LUT algorithm
In order to solve the speed problem of CRT and the memory problem of LUT without sacrificing reconstruction quality, we propose a new algorithm using look-up tables for horizontal and vertical directions, separately. We name it split look-up table (S-LUT) algorithm.
Recalled from Eq. (1), the resulted hologram can be described as:
where and .As in our display system, reconstruction size is small as compared to object-hologram distance, i.e. and , using Fresnel approximation, Eq. (3) can be approximated as Eq. (4):
Because wavelength is much smaller than the object-hologram distance, i.e. , small change of in the order of λ will not affect the reconstruction quality. Therefore can be adjusted to the nearest multiple of λ. Under this condition, is multiple of , and one more can be added into Eq. (4) without changing the value of . As a result, Eq. (4) becomes Eq. (5):
Using inverse Fresnel approximation on Eq. (5) results in Eq. (6):
Rewrite Eq. (6) in complex form as Eq. (7):
Let us define horizontal light modulation factor as and vertical light modulation factor as , then Eq. (7) can be rewritten as Eq. (8):
For n object points falling on the same vertical line , is all the same, thus the hologram resulted from these n object points can be calculated by Eq. (9):
As is independent of , Eq. (9) can be broken down into two steps as Eq. (10) and Eq. (11):
where is the sum of contribution in vertical direction from those n object points to the hologram .The final hologram is the sum of all contributions from object points on different vertical lines and can be computed by Eq. (12):
Therefore our new algorithm S-LUT can be listed as:
- //Off-line pre-computation:
- For each possible : ;
- For each possible : ;
- //In-line computation:
- For each hologram pixel :
- For each that exists :
- For each : ;
- For each with and each :
- ; ③
- End;
- For each :
- ; ④
- End;
- End;
The list of object points falling on the same vertical line is generated by the program in input processing during in-line computation. The n object points with non-zero intensity are marked down and stored in an array for each vertical line together with respective y coordinates.
In in-line computation of S-LUT algorithm, for a vertical line containing n object points with non-zero intensity, loop ③ runs times and loop ④ runs times. Therefore, total in-line computational complexity is in , and during each step of both loop ③ and loop ④, only one multiplication and one addition operations need to be performed. Similar to memory analysis of LUT, the table size of in dimension is , and that of in dimension is , while the size in dimension is D. As a result, the total size of two tables is . Reconstruction quality of S-LUT is as good as CRT, if all conditions, , , and being multiple of λ, are met.
The comparison in computational complexity, operation and memory usage for CRT, LUT and S-LUT algorithms is listed in Table 1 , for n object points falling on the same vertical line.

Table 1. Complexity, operation and memory usage comparison among algorithms
From Table 1, it can be seen that both computational complexity and memory usage of S-LUT are one order less than that of LUT, i.e. S-LUT runs much faster than LUT, while using much less memory.
4. Implementation of algorithms on GPU
By using CUDA, mathematical operations can be performed on GPU cores in parallel, without mapping them to graphic pipeline. As generally GPU has more cores than CPU (30 computational cores for nVidia GTX285 GPU [15] and 4 cores for Intel core i7 965 CPU [16]), parallel performance of GPU is better than that of CPU, although individual core speed of GPU is slower than that of CPU (680 MHz for MSI GTX 285 OC [17], 3.2 GHz for Intel core i7 965 [16]). Multiple GPUs can be used in a PC, which provides more GPU cores to speed up CGH computation. Another advantage of GPU over CPU is that each GPU has its own memory interface to its dedicated memory chips. As a result, memory for different GPUs can read/write simultaneously. Currently only global memory on GPU is used to store the pre-computed data of and . In the future, shared memory on GPU may be used for further computational speed optimization. However CPU only has one interface to system main memory and data for CPU cores need to be queued up, which may create computational bottleneck.
GPU also has some disadvantages. High speed computation on GPU is mainly for single precision calculation. Error rate of GPU and its memory is higher than that of CPU and main memory. Therefore, direct computation on GPU without error correction may not be suitable for high precision scientific computation. As hologram has high error tolerance, GPU still can meet the requirement of CGH computation, especially when resulted holograms need to be binarised before they are lunched to SLM for reconstruction.
Using a multi-GPU PC for CGH computation needs to consider how to distribute and balance workload among GPUs. In our implementation, the program will distribute loads to each CUDA-enabled GPU in unit of vertical line in object space, in such a way that computation time on each GPU is as close to each other as possible.
Detailed implementation steps of S-LUT algorithm on GPU are listed below, where m is the number of CUDA-enabled GPUs in the PC:
- //Off-line pre-computation:
- For each possible : ;
- For each possible : ;
- //In-line computation:
- Distribute load into sets , elements in are vertical lines in object space;
- For each , create a thread for GPU k and run in parallel:
- Allocate GPU memory on GPU k;
- Transfer tables , and to GPU global memory;
- For each hologram pixel :
- For each , call CUDA kernel to do the following three loops:
- For each : ;
- For each with and each :
- ;
- End;
- For each :
- ;
- End;
- End;
- Transfer to main memory;
- Deallocate GPU memory on GPU k;
- End;
- Combine all to get ;
From steps listed above, it is noticed that additional memory needs to be allocated for workload and holograms of individual GPUs in main memory. This increases memory requirement by as compared to that running on CPU.
5. Experimental results
Our computational experiments are done on a PC with specifications listed in Table 2 . Our program uses only single CPU core and total 3 × 30 GPU cores. Some parameters used in CGH computation are listed in Table 3 .

Table 2. Specifications of PC

Table 3. CGH computation parameters
The time for calculation of and is shown in Fig. 1 . It can be seen that the time needed for this off-line calculation is proportional to the number of depth layers.
The comparison of computational speed among different algorithms implemented on both CPU and GPU is shown in Fig. 2 . The detail of computation time of S-LUT on GPU is shown in Fig. 3 . The object points are randomly distributed in the object space (Table 3) with 21 depth layers.
From CPU computational results shown in Fig. 2, it is clear that S-LUT is much faster than CRT and LUT, especially when the number of object points increases. LUT also shows faster computation as compared with CRT. Running time of S-LUT depends mainly on the number of vertical lines rather than the number of object points in object space as described in Table 1. As a result, running time for S-LUT keeps almost constant when the number of object points goes beyond a certain value, which in our test cases is 10k. Beyond this value, almost all vertical lines in object space have at least one object point and need to be computed for different number of object points above 10k.
From GPU computational results shown in Fig. 2, the speed of CRT on GPU increases drastically as compared with that on CPU. CUDA automatically optimizes the execution of kernel code running on GPUs to cover read / write time of GPU memory. This optimization effect is more obvious for larger number of object points for the cases of CRT and LUT on GPU. Therefore, their computation time does not have linear relationship with the number of object points. From Fig. 2 and Fig. 3, the speed of S-LUT on GPU is about 100 times faster than S-LUT on CPU and CRT on GPU. It is shown that S-LUT is the fastest among all algorithms. However, LUT is even slower than CRT on GPU. This is because LUT reads huge amount of data from large area in memory on GPU, and different GPU cores read data simultaneously, which causes a bottleneck in memory bandwidth. As for S-LUT, there is no such a bottleneck because it only reads small amount of data from small area in memory, as shown in Table 1.
As shown in Fig. 3, S-LUT on GPU achieves 0.3 second running time for less than 1k object points and 0.5 second for more than 10k object points. For the number of object points larger than 40k, S-LUT on GPU is faster than LUT on CPU by at least 700 times.
The comparison of memory usage is shown in Fig. 4 , in which base memory refers to the memory used to store input 3D object specification and final output hologram.
LUT on both CPU and GPU uses much more memory space than CRT and S-LUT. It reaches 12GB physical memory limitation of our PC at 90 depth layers. With virtual memory support, LUT can handle up to 200 layers by using 27GB memory, but the computation as well as the whole PC slows down, and further increase in the number of depth layers stops the OS from responding. S-LUT on CPU uses very small amount of memory and only needs 1.2GB for 20k depth layers. S-LUT on GPU needs roughly twice of base memory because tables of object points are created to optimize overall performance of algorithm. However it still keeps memory usage at 7.5GB for 10k depth layers, which are far more than enough for normal CGH computation.
In order to test memory limitations, we purposely increase resolutions of SLM and object space to four times in both horizontal and vertical directions, i.e. 4096 × 3072 for hologram and 1200 × 1200 for object space, and the comparison is shown in Fig. 5 . Under this condition, LUT on both CPU and GPU stops at 8 depth layers with 17.5GB memory usage; S-LUT on CPU stops at 1490 depth layers with 474MB memory for tables and because at 1490 depth layers base memory occupies available memory space of 8GB. S-LUT on GPU uses 12GB at 1000 depth layers, and stops at 1300 depth layers with 15GB memory usages. This shows that S-LUT on GPU can meet most of application requirements within main memory limitation.
Based on the results presented above, S-LUT on GPU has shown the fastest speed while maintaining low memory usage. It is most efficient in addressing the balance between computational speed and memory usage among all the algorithms investigated in this paper.
Figure 6 shows objects reconstructed from CGHs computed by CRT on CPU and S-LUT on GPU with our holographic 3D display system, in which the SLM device has 1920 × 1080 pixels, and its pixel size is 10.8 µm in both horizontal and vertical directions. The pictures were captured by a 2D SLR camera. Random-phase noise reduction technique [18] is adopted to improve reconstruction quality for all the objects. By naked eyes or even camera, no difference in reconstruction is observed between above two algorithms for both 2D (resolution chart, 372k object points) and 3D (teapot, 100 depth layers, 37k object points) objects. It is shown that S-LUT algorithm implemented on GPU can preserve full parallax and high reconstruction quality of original CRT algorithm.

Fig. 6 Reconstructed objects: (a) 2D resolution chart by CRT on CPU, (b) 2D resolution chart by S-LUT on GPU, (c) 3D teapot by CRT on CPU, and (d) 3D teapot by S-LUT on GPU.
Figure 7 shows a snapshot from a short video clip of a rotating 3D globe (150 depth layers, 62k object points). It is displayed at video rate by our holographic 3D display system [10]. The holograms used to reconstruct this dynamic 3D globe are computed by S-LUT on GPU.

Fig. 7 (Media 1) Video clip of a rotating 3D globe reconstructed from holograms computed with S-LUT on GPU.
6. Conclusion
We have developed a new algorithm by using split look-up tables (S-LUT) and successfully implemented it on GPU using CUDA. Comparisons among different algorithms implemented on both CPU and GPU have also been done in terms of computational speed, memory usage as well as reconstruction quality. S-LUT on GPU can achieve a speed faster than LUT on CPU by at least 700 times for the number of object points larger than 40k under the specified system specifications. It is also about 100 times faster than CRT on GPU. The memory usage of S-LUT on GPU is much less than that of LUT, which enables fast CGH computation for complicated 3D objects. The reconstruction of objects from holograms computed with S-LUT on GPU has shown full parallax and the same image quality as that with CRT on CPU. S-LUT algorithm implemented on GPU is very promising to meet the requirements of high speed, low memory usage and high quality full-parallax reconstruction for real-time and interactive holographic 3D display in the future.
Acknowledgement
This project is a joint project between DSI and I2R (Institute for Infocomm Research) and is funded by HOME2015 Programme of A*STAR, Singapore. We would like to thank Dr. Xu Baoxi from DSI, Dr. Xu Shuhong and Dr. Farzam Farbiz from I2R for helpful discussions.
References and links
1. C. Slinger, C. Cameron, and M. Stanley, “Computer-generated holography as a generic display technology,” Computer 38(8), 46–53 (2005). [CrossRef]
2. C. Slinger, C. Cameron, S. Coomber, R. Miller, D. Payne, A. Smith, M. Smith, M. Stanley, and P. Watson, “Recent developments in computer-generated holography: toward a practical electroholography system for interactive 3D visualization,” Proc. SPIE 5290, 27–41 (2004). [CrossRef]
3. C. D. Cameron, D. A. Pain, M. Stanley, and C. W. Slinger, “Computational challengers of emerging novel true 3D holography displays,” Proc. SPIE 4109, 129–140 (2000). [CrossRef]
4. J. W. Goodman, “Introduction to Fourier Optics 3rd Edition,” McGraw-Hill College, (Roberts & Co. Publishers, 2005).
5. Y. Ichioka, M. Izumi, and Y. Suzuki, “Scanning halftone plotter and computer-generated continuous tone hologram,” Appl. Opt. 10(2), 403–411 (1971). [PubMed] [CrossRef] [PubMed]
6. M. Lucente, “Diffraction-specific fringe computation for electro-holography,” Ph. D.Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology (1994).
7. N. Masuda, T. Ito, T. Tanaka, A. Shiraki, and T. Sugie, “Computer generated holography using a graphics processing unit,” Opt. Express 14(2), 603–608 (2006), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-14-2-603. [PubMed] [CrossRef] [PubMed]
8. M. Lucente, “Interactive computation of holograms using a look-up table,” J. Electron. Imaging 2(1), 28–34 (1993). [CrossRef]
9. S. C. Kim and E. S. Kim, “Fast computation of hologram patterns of a 3D object using run-length encoding and novel look-up table methods,” Appl. Opt. 48(6), 1030–1041 (2009). [CrossRef]
10. X. W. Xu, S. Solanki, X. A. Liang, S. H. Xu, A. T. Ridwan, Y. C. Pan, F. Farbiz, B. X. Xu, and T. C. Chong, “Computer-generated holography for dynamic display of 3D objects with full parallax,” The International Journal of Virtual Reality 8(2), 33–38 (2009), http://www.ijvr.org/issues/issue2-2009/6.pdf.
11. T. Ito and T. Shimobaba, “One-unit system for electroholography by use of a special-purpose computational chip with a high-resolution liquid-crystal display toward a three-dimensional television,” Opt. Express 12(9), 1788–1793 (2004), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-12-9-1788. [PubMed] [CrossRef] [PubMed]
12. C. Petz and M. Magnor, “Fast hologram synthesis for 3D geometry models using graphics hardware,” Proc. SPIE 5005, 266–275 (2003). [CrossRef]
13. L. Ahrenberg, P. Benzie, M. Magnor, and J. Watson, “Computer generated holography using parallel commodity graphics hardware,” Opt. Express 14(17), 7636–7641 (2006), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-14-17-7636. [PubMed] [CrossRef] [PubMed]
14. nVidia, “Compute Unified Device Architecture Programming Guide ver. 2.2”, (nVidia, 2009). http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf
15. nVidia, “Specification of GeForce GTX 285”, (nVidia, 2008). http://www.nvidia.com/object/product_geforce_gtx_285_us.html [PubMed]
16. Intel, “Specification of Intel Core i7processor Extreme Edition”, (Intel, 2009). http://www.intel.com/products/processor/corei7ee/specifications.htm [PubMed]
17. MSI, “Specification of MSI N285GTX-T2D1G-OC”, (MSI, 2008). http://www.msi.com/index.php?func=prodvgaspec&maincat_no=130&cat2_no=136&cat3_no=&prod_no=1726#menu [PubMed]
18. L. Golan and S. Shoham, “Speckle elimination using shift-averaging in high-rate holographic projection,” Opt. Express 17(3), 1330–1339 (2009), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-17-3-1330. [PubMed] [CrossRef] [PubMed]