Electro-holography is a promising display technology that can reconstruct a photorealistic three-dimensional (3D) movie; however, it has yet to be realized practically owing to the enormous calculation power required. A special-purpose computer for electro-holography, namely HORN, has been studied for over 20 years as a means to solve this problem. The latest version of HORN, HORN-8, was developed using field programmable gate array (FPGA) technology. Initially, a circuit for amplitude-type electro-holography was implemented in HORN-8; however, the implementation of phase-type electro-holography remained an issue. In this paper, we report the development of a new version of HORN-8 and its cluster system, which achieves real-time reconstruction of a 3D movie from point clouds comprising 32,000 points for phase-type electro-holography.
© 2018 Optical Society of America under the terms of the OSA Open Access Publishing Agreement
Since D. Gabor invented holography in 1947, electro-holography has shown great promise as a display technology for realizing photorealistic three-dimensional (3D) images. However, it has not been put into practical use for two reasons: (1) the enormous calculation burden of computer-generated holograms (CGHs), and (2) the pixel pitch of the spatial light modulator (SLM), which determines the frame rate and viewing angle of the 3D images. This study focuses on solving the first of these two problems.
Studies on accelerating the calculation of electro-holography are roughly divided into two approaches, and can be further classified according to the description of the 3D information to be displayed, such as point-cloud [2–11], polygon, and multi-view [13,14] models. HORN-8 is designed for CGHs of the point-cloud-based 3D model because of the simplicity of its CGH calculation; thus, the following explanations are based on a point-cloud CGH. The first approach is to reduce the computational complexity by devising algorithms [2–11], for example, the look-up table (LUT) method [2–5], the wavefront recording plane method [6,7], and the inter-frame difference method [8,9]. The second approach is to develop hardware systems, such as graphics processing units (GPUs) [15–17], many-core processors, and field programmable gate arrays (FPGAs) [19–27].
In 1993, the authors developed a high-performance computation system for electro-holography, namely HOlographic ReconstructioN (HORN) [20–27]. In comparison with general-purpose computers such as those based on GPUs, the computational efficiency of a special-purpose computer is higher because the calculation target is restricted. Recently, we released the latest model of the HORN series, HORN-8, a peripheral-board-type dedicated computer with eight FPGAs (Fig. 1). We first implemented a circuit for an amplitude-type CGH and successfully synthesized an amplitude-type CGH of 100 million pixels at video rate. However, this was not the final stage of HORN-8. Since an amplitude-type CGH has poorer light-utilization efficiency than a phase-type CGH, the sharpness of the 3D image is lower. Therefore, to realize a photorealistic 3D image with the HORN system, it is indispensable to implement a phase-type CGH in HORN-8. In this paper, we report the phase-type HORN-8, which is based on our previous amplitude-type HORN-8. To distinguish between the two, we hereafter refer to the phase-type HORN-8 simply as HORN-8, and to the earlier version as the amplitude-type HORN-8.
The remainder of this paper is organized as follows. In Section 2, we briefly introduce the history of the HORN-8 project. In Section 3, we describe the algorithms and architectures implemented in HORN-8. In Section 4, we describe the specifications of the HORN-8 board and its performance. In Section 5, we discuss the experimental results, and in Section 6 we conclude this work.
2. HORN-8 project
The HORN-8 project began in October 2012. We started with the detailed design of HORN-8 and then carried out board production and component mounting. The initial cost of the board production was 300,000 yen. We made the first board in 2013 and produced ten boards in total by 2014. The primary component costs were 120,000 yen per calculation FPGA (Xilinx Virtex-5 XC5VLX110-2FF676C), 30,000 yen for the communication FPGA (Xilinx Virtex-5 XC5VLX30T-2FF665C), and 50,000 yen for the main circuit board including the mounting fee; thus, the total cost of one HORN-8 board was approximately one million yen (approx. $10,000 U.S.).
We then completed the development of the related software (e.g., a device driver) in 2015 and succeeded in operating a single HORN-8 board with the amplitude-type CGH calculation circuits. We then moved to the development phase of scaling the system up to a cluster. We succeeded in constructing an amplitude-type HORN-8 cluster system with two boards in 2016 and with eight boards in 2017, which can produce an amplitude-type CGH of 100 million pixels from 10 million point-light sources (PLSs). Finally, in 2018, we succeeded in producing the phase-type HORN-8 cluster system that we report in this paper.
3. Hardware design
A CGH is obtained by numerically simulating the spherical wave emitted from each PLS constituting the 3D image and computing the complex amplitude distribution U(xα, yα) that those spherical waves produce on the CGH plane, where (xα, yα) are the coordinates on the CGH.
Defining (xj, yj, zj) as the coordinates of the position of the j-th PLS of a total of Nobj points, the complex amplitude distribution on plane U(xα, yα) under the condition of zj ≫ xj, yj becomes

U(xα, yα) = Σj Aj exp[ iπ{(xα − xj)² + (yα − yj)²} / (λzj) ],  (1)

where the sum runs over j = 1, …, Nobj, Aj is the amplitude of the j-th PLS, and λ is the wavelength of the reference light.
The phase-type CGH is obtained by quantizing the argument of Eq. (1) as

θ(xα, yα) = tan−1[ Im{U(xα, yα)} / Re{U(xα, yα)} ],  (2)

where the resulting phase is quantized to the modulation levels of the SLM.
In the HORN-8 system, as in the conventional HORN systems, we pipelined the phase calculation in Eq. (2) with the recurrence relation algorithm [28]. The recurrence relation algorithm first defines an initial phase at a position separated by n pixels from the reference point (Xα, Yα), as in Eq. (4); because the phase is a quadratic function of the pixel index, each subsequent phase along the row is then obtained from the previous one by additions alone, as in Eq. (6), without re-evaluating the squares at every pixel.
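The idea behind the recurrence relation can be illustrated with a short floating-point sketch (ours, not the HORN-8 fixed-point circuit; the parameter names are hypothetical): because the phase is a quadratic function of the pixel index along a row, its second difference is constant, so each successive pixel costs two additions instead of squarings and multiplications.

```python
import numpy as np

def phase_row_direct(x0, y, xj, yj, rho, wavelength, pitch, n_pix):
    """Directly evaluate the quadratic phase for one CGH row."""
    x = x0 + pitch * np.arange(n_pix)
    return (2 * np.pi / wavelength) * rho * ((x - xj) ** 2 + (y - yj) ** 2)

def phase_row_recurrence(x0, y, xj, yj, rho, wavelength, pitch, n_pix):
    """Same row via the recurrence relation: only additions inside the loop."""
    k = (2 * np.pi / wavelength) * rho
    theta = k * ((x0 - xj) ** 2 + (y - yj) ** 2)       # initial term (BPU role)
    delta = k * (2 * pitch * (x0 - xj) + pitch ** 2)   # first difference
    gamma = k * 2 * pitch ** 2                         # constant second difference
    out = np.empty(n_pix)
    for n in range(n_pix):                             # APU-like pipeline steps
        out[n] = theta
        theta += delta                                 # advance phase by addition
        delta += gamma                                 # advance difference by addition
    return out
```

Here `rho` plays the role of the pre-calculated ρj (≈ 1/(2zj) under our assumptions) that HORN-8 uses to avoid a division in the pipeline.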
In addition, we developed an approximation method for the cosine and sine functions based on Nishitsuji’s approximation method [29] to improve the degree of parallelism of the calculation circuit by eliminating the read-only memory and other related resources (e.g., memory channels) used for the look-up tables of the cosine and sine functions. According to [29], CGHs created with Nishitsuji’s approximation method can reconstruct 3D images with sufficient image quality; thus, the approximation method can replace the conventional LUTs for the cosine and sine functions.
In this approximation method, we first extract 6 bits from the beginning of the fractional part of Θ and reinterpret them as a two’s-complement value with only fractional bits. We define this value as Θs, whose range is −0.5 ≤ Θs < 0.5. Treating Θs as a two’s-complement fixed-point number with no integer bits, the approximated cosine function, c(Θs), and sine function, s(Θs), can be written as in Eq. (3).
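The bit reinterpretation can be modeled in a few lines of Python (a software model of the bit manipulation, not the FPGA implementation; the function name is ours). Reading the top 6 fractional bits as a two’s-complement fraction wraps Θ into [−0.5, 0.5) without changing the value of cos(2πΘ) or sin(2πΘ), which are periodic with period 1 in this normalized unit.

```python
import math

def wrap_theta(theta, bits=6):
    """Extract the top `bits` fractional bits of theta and reinterpret them
    as a two's-complement fraction, giving Theta_s in [-0.5, 0.5)."""
    scale = 1 << bits                       # 64 levels for 6 bits
    q = int(math.floor((theta % 1.0) * scale)) & (scale - 1)
    if q >= scale // 2:                     # MSB set -> negative in two's complement
        q -= scale
    return q / scale
```

Because the wrap only subtracts whole periods, cos(2π · wrap_theta(t)) tracks cos(2πt) to the 6-bit phase resolution.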
Figure 2 shows the comparison between the outputs of the cosine and sine functions from the conventional LUT, taken as the ground truth, and the outputs of this approximation method. Note that the value range of the ground truth is adjusted to that of the approximation method. According to Fig. 2, the approximate shapes of c(Θs) and s(Θs) follow the ground truths, the positions of the peaks match completely, and the phase and output values are in one-to-one correspondence within the range of Θs; therefore, the approximation method can serve as an alternative to the LUT.
Figure 3 shows the block diagram of the HORN-8 board. The host PC connected to the HORN-8 board controls the entire calculation process. Input and output data (e.g., commands, PLS data, and the CGH) are transferred via PCI-Express. The HORN-8 board has seven FPGAs for calculation, called calculation FPGAs, and one FPGA for control, called the communication FPGA, connected via a ring-bus. The CGH calculation is shared among the calculation FPGAs on the board, and the communication FPGA controls the ring-bus and the PCI-Express interface. Each calculation FPGA is allocated a portion of the CGH divided by rows and processes it in parallel; thus, every calculation FPGA stores the whole PLS data. The remainder of this section describes the details of the calculation FPGAs.
Figure 4 shows the signal flow diagram of the calculation FPGAs. Each calculation FPGA acquires the PLS data and the allocated coordinates of the CGH, and outputs the calculated CGH via the ring-bus. Rx and Tx in Fig. 4 are the modules for reading and writing data from and to the ring-bus, respectively. Each calculation FPGA has a hierarchical structure, which is composed of HORN-CONTROL, a basic phase unit (BPU), an additional phase unit (APU), a complex amplitude unit (CAU), and other accompanying function blocks. In HORN-8, one calculation FPGA has one BPU and 319 APUs, i.e., one calculation FPGA calculates 320 pixels of CGH.
HORN-CONTROL controls the overall CGH calculation and consists of the HORN-CORE as the CGH calculation circuit, a communication module that sends and receives the PLSs and the calculated CGH (data receive control, data send control), and block random access memory (BRAM). In HORN-8, the maximum number of PLSs that can be calculated at one time is 32,768 owing to the restriction of the BRAM capacity. Therefore, for a 3D model exceeding 32,768 points, HORN-8 divides the PLSs, calculates the CGH in multiple passes, and outputs the results sequentially to reconstruct a visually integrated 3D image using the afterimage effect.
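This chunked processing can be sketched as follows (a minimal illustration assuming a simple list of PLS records; the function name is ours):

```python
def split_pls(points, max_points=32768):
    """Split a point cloud into chunks that each fit one calculation pass,
    mirroring the BRAM limit of 32,768 PLSs per pass."""
    return [points[i:i + max_points] for i in range(0, len(points), max_points)]
```

For example, a 40,000-point model is split into chunks of 32,768 and 7,232 points, each processed in one pass and output sequentially.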
HORN-CORE is the calculation circuit block that performs the CGH calculation with the recurrence relation; it consists of the BPU, APUs, CAU, a selector, and an arctangent table. The arctangent table is a two-input LUT of tan−1[I/R], where R and I are the real and imaginary parts, respectively. The HORN-CORE calculates 320 successive pixels of the CGH in a pipeline from the start coordinate of the recurrence relation (Xα, Yα), and serially inputs the coordinates of the PLSs (xj, yj, ρj), where ρj is pre-calculated from zj to eliminate a division.
Figures 5–7 show the signal flow diagrams of the BPU, APU, and CAU, respectively. The numbers depicted in these figures express the bit-length of each signal. The BPU calculates the initial term Θ0 defined by Eq. (4), and each APU calculates Eq. (6); that is, the BPU and each APU are responsible for one pixel out of the 320 pixels allocated to each calculation FPGA. The CAU calculates the cosine and sine functions and accumulates the results as a complex amplitude distribution. After processing all of the PLS coordinates, the CAU normalizes the accumulated values and outputs them as U, which becomes the index value of the arctangent table. In HORN-8, the first half of U is the real part and the latter half is the imaginary part.
The role of the normalization circuit is to reduce the scale of the arctangent table by shortening the bit-width of the index. The normalization circuit operates as follows: (a) starting from the top bit of the real and imaginary parts, respectively, find the first position where a bit differs from the sign bit; (b) extract 5 bits starting from the bit immediately above the position found in (a), so that one sign bit is retained. For example, when the bit-strings of the real and imaginary parts are “000 0000 1010 0100 0000” and “111 1111 1101 0011 0000,” the normalization circuit extracts 5 bits from the 7th bit of the real part and the 9th bit of the imaginary part; i.e., the normalized bit-string of the real part becomes “01010” and that of the imaginary part becomes “10100.”
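A software model of this normalization (ours, operating on MSB-first bit-strings rather than hardware signals) reproduces the example above:

```python
def normalize(bits, out_width=5):
    """Model of the normalization circuit: find the first bit that differs
    from the sign bit, then extract `out_width` bits starting one position
    above it (keeping one sign bit). `bits` is an MSB-first bit string."""
    sign = bits[0]
    # Position of the first bit differing from the sign bit (default: tail).
    pos = next((i for i, b in enumerate(bits) if b != sign), len(bits) - 1)
    start = max(pos - 1, 0)                  # keep one copy of the sign bit
    window = bits[start:start + out_width]
    return window.ljust(out_width, "0")      # zero-pad if near the LSB end
```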
4. Packaging and performance
4.1. Hardware specification
The HORN-8 board adopts PCI-Express Gen. 1 as the communication interface to the host PC and mounts seven calculation FPGAs (Xilinx Virtex-5 XC5VLX110-2-FF676) and one communication FPGA (Xilinx Virtex-5 XC5VLX30T-2-FF665). The number of pixels that can be processed in parallel on a single HORN-8 board is 2,240, because each calculation FPGA can calculate 320 pixels. This is half that of the amplitude-type HORN-8, as the phase-type HORN-8 calculates both the real and imaginary parts simultaneously. The usage rates of the slices (logic cells) and the Block RAM are both 98%, and the operating frequency is 0.25 GHz.
Moreover, we constructed a cluster system with a maximum of eight HORN-8 boards. The cluster system consists of one master node PC, which does not mount a HORN-8 board, and four slave node PCs, each of which mounts two HORN-8 boards. We connected the PCs with Gigabit Ethernet cables and used MPICH3 for message passing.
4.2.1. Evaluation setup
For the performance evaluation, we implemented the CGH calculation on a central processing unit (CPU) and a GPU for comparing the calculation time and image quality with those of HORN-8. The computer environment used for comparison is as follows: OS: Windows 10 Enterprise 64 bit, CPU: Intel Core i7-6700K 4.00 GHz, GPU: NVIDIA GeForce GTX 1080 Ti, Memory: 16 GB, Compiler (CPU): Intel C++ Compiler 17.0, Compiler (GPU): NVIDIA CUDA Compiler driver 8.0. The resolution of the CGH is 1,920 × 1,080 pixels, and the number of PLSs in the 3D image is Nobj = 8,000.
As for the CPU, we implemented the CGH calculation of Eq. (1) with the recurrence relation algorithm, look-up tables for the cosine and sine functions, the built-in arctangent function of C++, OpenMP for parallelization, and fixed-point arithmetic, for the comparison of computational speed. Additionally, we implemented a CGH calculation with no approximation, using double-precision floating point and the built-in trigonometric functions of C++, for the comparison of the image quality of the reconstructed images.
As for the GPU, we implemented the CGH calculation of Eq. (1) directly with a single-precision floating-point program; i.e., we did not apply the recurrence relation and pipelined structure of HORN-8 to the GPU, because a GPU is suited to the simultaneous execution of independent operations, generally called single instruction, multiple data. We adopted the following performance-tuning techniques: (a) using shared memory and (b) using the built-in fast trigonometric functions. Although a GPU can execute parallel calculations at high speed, memory access often becomes the bottleneck of computational efficiency. To avoid this, we stored the PLS data in shared memory, a cache placed between the GPU's main memory and the processors' registers. We did not apply Nishitsuji's approximation method to the cosine and sine functions, because the GPU has special function units for trigonometric functions that are better optimized for GPU programs than the approximation method. Likewise, we used the built-in arctangent function rather than a look-up table, because the memory of a GPU is limited and its access speed often bottlenecks performance.
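For reference, the direct summation of Eq. (1) used in the comparison programs can be sketched with NumPy (a minimal, unoptimized model of the CPU/GPU reference code; the parameter names and centered pixel grid are our assumptions): each CGH pixel accumulates the complex amplitude from every PLS, and the phase-type CGH is the argument of that sum.

```python
import numpy as np

def phase_cgh(points, width, height, pitch, wavelength):
    """Direct evaluation of the point-cloud CGH. `points` is an (N, 3) array
    of (x, y, z) PLS coordinates in meters; returns the phase in (-pi, pi]."""
    ys, xs = np.mgrid[0:height, 0:width]
    xa = (xs - width / 2) * pitch            # CGH-plane coordinates
    ya = (ys - height / 2) * pitch
    u = np.zeros((height, width), dtype=np.complex128)
    for x, y, z in points:                   # sum the spherical-wave terms
        theta = (np.pi / (wavelength * z)) * ((xa - x) ** 2 + (ya - y) ** 2)
        u += np.exp(1j * theta)
    return np.angle(u)                       # argument -> phase-type CGH
```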
Table 1 shows the comparison of the calculation times and frame rates of the CGHs created by the CPU, the GPU, and a single HORN-8 board. As shown in Table 1, the HORN-8 board achieved a speed-up of approximately 100 times over the CPU and about 1.03 times over the GPU. We also realized a frame rate above the video rate (29.97 fps, the standard frame rate of the NTSC broadcasting system). Note that, according to the indicators of NVIDIA's profiling tool, the execution efficiency of the GPU program is sufficient for a fair comparison with HORN-8; e.g., the achieved occupancy, the actual operating rate of the GPU's processors, is 97%. This high occupancy indicates that the threads of the GPU program are well distributed among the processors and that the program avoids inefficiencies such as memory-access stalls.
Figure 8 shows the performance of the HORN-8 cluster system. As shown in the figure, we succeeded in calculating the CGH from a point cloud of 32,000 PLSs at video rate. On the other hand, as seen in the results for the cluster systems with six and eight HORN-8 boards, there is a region where the change in calculation time is not smooth. This is because the communication time increases when Nobj exceeds 32,768: the computation results must then be divided and transmitted separately, owing to the limited memory capacity of the calculation FPGAs.
Figure 9 shows the comparison of the reconstructed images of the CGHs with Nobj = 8,000, obtained by optical reconstruction and numerical simulation of the CGHs created by HORN-8 and by the CPU with the double-precision floating-point program as the ground truth. For the optical reconstruction, we used a phase-modulated liquid crystal on silicon display (Holoeye PLUTO-2 Spatial Light Modulator) as the SLM and a high-intensity green light-emitting diode with wavelength λ = 523 nm as the light source. For the numerical simulation, we used the angular spectrum method with the CWO++ library [30]. As shown in Fig. 9, the optically reconstructed image of the CGH output by HORN-8 matches both the numerical simulation result and the original 3D model well. The peak signal-to-noise ratio (PSNR) between the numerically simulated results shown in Figs. 9(c) and 9(d) is 29.88 dB. Since a PSNR of 30 dB or more is a common image-quality criterion for two-dimensional images [31], the reconstructed image of the HORN-8 system is considered to be of good quality.
Finally, we describe the computational efficiency of the HORN-8 system. The theoretical calculation time of the CGH with the HORN-8 system, Thorn, is defined as follows:

Thorn = Nobj Nhol / (2240 Nboard f),

where Nhol is the number of pixels of the CGH, Nboard is the number of HORN-8 boards, f = 0.25 GHz is the operating frequency, and 2,240 is the number of pixels calculated in parallel on one board. The computational efficiency is defined as E = Thorn/Ttotal, where Ttotal is the measured total calculation time, including the input and output transfer times Tin and Tout.
Figure 10 shows the comparison of the computational efficiency of the HORN-8 system (including both single-board and cluster operations). As shown in Fig. 10, the computational efficiency of the HORN-8 system reaches over 90% in the single-, two-, four-, and six-board systems, and over 80% in the eight-board cluster system. For example, in the single-board operation, Thorn = 29.62 ms and Ttotal = 30.01 ms, i.e., E = 98.7%, when Nobj = 8,000 and Nhol = 1,920 × 1,080 pixels. In the cluster operation, Thorn = 29.62 ms and Ttotal = 31.51 ms, i.e., E = 94.0%, in the four-board cluster system when Nobj = 32,000 and Nhol = 1,920 × 1,080 pixels.
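The quoted efficiency figures can be reproduced with a short sketch; the expression for Thorn below is inferred from the board specification (2,240 pixels processed in parallel per board at f = 0.25 GHz), and the Ttotal values are the measured times quoted above.

```python
def t_horn_ms(n_obj, n_hol, n_board, pixels_per_board=2240, freq_hz=0.25e9):
    """Theoretical HORN-8 calculation time, in milliseconds."""
    return n_obj * n_hol / (pixels_per_board * n_board * freq_hz) * 1e3

n_hol = 1920 * 1080
single = t_horn_ms(8000, n_hol, 1)    # ~29.62 ms; E = 29.62/30.01 ~ 98.7 %
quad   = t_horn_ms(32000, n_hol, 4)   # ~29.62 ms; E = 29.62/31.51 ~ 94.0 %
```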
According to Fig. 10, the computational efficiency increases with the number of calculation points and converges to a constant value, and the improvement becomes gentler as the number of boards in the cluster increases. The HORN-8 system divides the CGH calculation among the HORN-8 boards, so Tout is not concealed when the number of PLSs is small, i.e., the calculation times satisfy Tout > Thorn + Tin. On the other hand, as seen in the results for the cluster systems with six and eight HORN-8 boards, the computational efficiency drops when the number of PLSs is around 40,000 and 60,000. Since above Nobj = 32,768 the PLSs must be transmitted separately, owing to the limited memory capacity of each calculation FPGA, Tout increases with the number of divisions; thus, Tout is not completely concealed in this situation. For example, for Nobj = 40,000, HORN-8 divides the PLSs into chunks of Nobj = 32,768 and Nobj = 7,232. According to Fig. 10, the computational efficiencies at Nobj = 32,768 and 7,232 are approximately 80% and 20%, respectively; thus, the overall efficiency is expected to be about 50%, which matches the result shown in Fig. 10. For Nobj = 70,000, HORN-8 divides the calculation into two chunks of Nobj = 32,768 plus the residual, so the computational efficiency becomes higher than in the case of Nobj = 32,768.
5. Conclusion and future work
In this paper, we reported the HORN-8 system, the latest model of the HORN series of special-purpose computers, for phase-type electro-holography. HORN-8 can reconstruct 3D videos composed of tens of thousands of PLSs at video rate, which could enable interactive systems such as 3D televisions and telephones.
The performance of HORN-8 is approximately the same as that of the latest GPU; however, because the FPGAs mounted on HORN-8 are not of the newest generation, its performance will improve significantly when the latest FPGAs are used. For example, if we replace the FPGAs with the Xilinx Virtex UltraScale+ VU037P in the same structure, we estimate that the performance of HORN-8 will increase approximately 18-fold, based on a comparison of the number of configurable logic block look-up tables (CLB LUTs). In other words, a single HORN-8 system could calculate the CGH created from Nobj = 140,000 PLSs above the video rate, and the cluster system with eight HORN-8 boards could calculate the CGH created from over Nobj = 570,000 PLSs. Since the performance of the current HORN-8 is almost the same as that of the latest GPU, this estimated performance would be hard to reach with current GPUs, which suggests the superiority of the HORN-8 architecture.
Japan Society for the Promotion of Science (Grant-in-Aid No. 25240015)
2. M. E. Lucente, “Interactive computation of holograms using a look-up table,” J. Electron. Imaging 2, 28–34 (1993). [CrossRef]
4. T. Nishitsuji, T. Shimobaba, T. Kakue, N. Masuda, and T. Ito, “Fast calculation of computer-generated hologram using the circular symmetry of zone plates,” Opt. Express 18, 19504–19509 (2010).
5. Y. Pan, X. Xu, S. Solanki, X. Liang, R. B. A. Tanjung, C. Tan, and T.-C. Chong, “Fast CGH computation using S-LUT on GPU,” Opt. Express 17, 18543–18555 (2009). [CrossRef]
8. S.-C. Kim, J.-H. Yoon, and E.-S. Kim, “Fast generation of three-dimensional video holograms by combined use of data compression and lookup table techniques,” Appl. Opt. 47, 5986–5995 (2008). [CrossRef] [PubMed]
9. X.-B. Dong, S.-C. Kim, and E.-S. Kim, “MPEG-based novel look-up table for rapid generation of video holograms of fast-moving three-dimensional objects,” Opt. Express 22, 8047–8067 (2014). [CrossRef] [PubMed]
10. T. Nishitsuji, T. Shimobaba, T. Kakue, and T. Ito, “Review of Fast Calculation Techniques for Computer-Generated Holograms With the Point-Light-Source-Based Model,” IEEE Trans. Ind. Inf. 13, 2447–2454 (2017). [CrossRef]
11. T. Shimobaba, T. Kakue, and T. Ito, “Review of Fast Algorithms and Hardware Implementations on Computer Holography,” IEEE Trans. Ind. Inf. 12, 1611–1622 (2016). [CrossRef]
16. Y. Ichihashi, R. Oi, T. Senoh, K. Yamamoto, and T. Kurita, “Real-time capture and reconstruction system with multiple GPUs for a 3D live scene by a generation from 4K IP images to 8K holograms,” Opt. Express 20, 21645–21655 (2012). [CrossRef] [PubMed]
17. H. Niwase, N. Takada, H. Araki, H. Nakayama, A. Sugiyama, T. Kakue, T. Shimobaba, and T. Ito, “Real-time spatiotemporal division multiplexing electroholography with a single graphics processing unit utilizing movie features,” Opt. Express 22, 28052–28057 (2014). [CrossRef] [PubMed]
18. K. Murano, T. Shimobaba, A. Sugiyama, N. Takada, T. Kakue, M. Oikawa, and T. Ito, “Fast computation of computer-generated hologram using Xeon Phi coprocessor,” Comput. Phys. Commun. 185, 2742–2757 (2014). [CrossRef]
19. Z.-Y. Pang, Z.-X. Xu, Y. Xiong, B. Chen, H.-M. Dai, S.-J. Jiang, and J.-W. Dong, “Hardware architecture for full analytical Fraunhofer computer-generated holograms,” Opt. Eng. 54, 095101 (2015). [CrossRef]
20. T. Ito, T. Yabe, M. Okazaki, and M. Yanagi, “Special-purpose computer HORN-1 for reconstruction of virtual image in three dimensions,” Comput. Phys. Commun. 82(2), 104–110 (1994). [CrossRef]
21. T. Ito, H. Eldeib, K. Yoshida, S. Takahashi, T. Yabe, and T. Kunugi, “Special-purpose computer for holography HORN-2,” Comput. Phys. Commun. 93(1), 13–20 (1996). [CrossRef]
22. T. Shimobaba, N. Masuda, T. Sugie, S. Hosono, S. Tsukui, and T. Ito, “Special-purpose computer for holography HORN-3 with PLD technology,” Comput. Phys. Commun. 130(1), 75–82 (2000). [CrossRef]
23. T. Shimobaba, S. Hishinuma, and T. Ito, “Special-purpose computer for holography HORN-4 with recurrence algorithm,” Comput. Phys. Commun. 148(2), 160–170 (2002). [CrossRef]
24. T. Ito, N. Masuda, K. Yoshimura, A. Shiraki, T. Shimobaba, and T. Sugie, “Special-purpose computer HORN-5 for a real-time electroholography,” Opt. Express 13(6), 1923–1932 (2005). [CrossRef] [PubMed]
25. Y. Ichihashi, H. Nakayama, T. Ito, N. Masuda, T. Shimobaba, A. Shiraki, and T. Sugie, “HORN-6 special-purpose clustered computing system for electroholography,” Opt. Express 17, 13895–13903 (2009).
26. N. Okada, D. Hirai, Y. Ichihashi, A. Shiraki, T. Kakue, T. Shimobaba, N. Masuda, and T. Ito, “Special-purpose computer HORN-7 with FPGA technology for phase modulation type electro-holography,” IDW/AD’12 Proc. Int. Display Workshops 3Dp–26 (2012).
27. T. Sugie, T. Akamatsu, T. Nishitsuji, R. Hirayama, N. Masuda, H. Nakayama, Y. Ichihashi, A. Shiraki, M. Oikawa, N. Takada, Y. Endo, T. Kakue, T. Shimobaba, and T. Ito, “High-performance parallel computing for next-generation holographic imaging,” Nature Electronics 1(4), 254–259 (2018). [CrossRef]
28. T. Shimobaba and T. Ito, “An efficient computational method suitable for hardware of computer-generated hologram with phase computation by addition,” Comput. Phys. Commun. 138, 44–52 (2001). [CrossRef]
29. T. Nishitsuji, T. Shimobaba, T. Kakue, D. Arai, and T. Ito, “Simple and fast cosine approximation method for computer-generated hologram calculation,” Opt. Express 23, 32465–32470 (2015). [CrossRef] [PubMed]
30. T. Shimobaba, J. Weng, T. Sakurai, N. Okada, T. Nishitsuji, N. Takada, A. Shiraki, N. Masuda, and T. Ito, “Computational wave optics library for C++: CWO++ library,” Comput. Phys. Commun. 183, 1124–1138 (2012). [CrossRef]
31. R. Gomes, W. Junior, E. Cerqueira, and A. Abelem, “A QoE Fuzzy Routing Protocol for Wireless Mesh Networks,” in Future Multimedia Networking (SpringerBerlin Heidelberg, 2010), pp. 1–12.