
PCIe-based FPGA-GPU heterogeneous computation for real-time multi-emitter fitting in super-resolution localization microscopy

Open Access

Abstract

Real-time multi-emitter fitting is a key technology for advancing super-resolution localization microscopy (SRLM), especially when dynamic imaging quality control and/or optimization of experimental conditions is required. However, as the activation density increases, the demand for computing resources grows rapidly due to the complexity of the fitting algorithms, making it difficult to realize real-time multi-emitter fitting at emitter densities above 0.6 µm−2 in a large field of view (FOV), even after acceleration with popular Graphics Processing Unit (GPU) computation. Here we adopt the task parallelism strategy from computer science to construct a Peripheral Component Interconnect Express (PCIe) based all-in-one heterogeneous computing platform (AIO-HCP), in which data are exchanged directly between the two major parallel computing devices, a Field Programmable Gate Array (FPGA) and a GPU, which execute simultaneously. Using simulated and experimental data, we verify that AIO-HCP achieves a data throughput of up to ∼ 1.561 GB/s between FPGA and GPU. With this new platform, we develop a multi-emitter fitting method, called AIO-STORM, based on stream-parallel scheduling of big data. We show that AIO-STORM is capable of real-time image processing of raw images with a 100 µm × 100 µm FOV, 10 ms exposure time and a structure density of 5.5 µm−2, without sacrificing image quality. This study overcomes the data throughput limitation of heterogeneous devices, demonstrates the power of the PCIe-based heterogeneous computation platform, and offers opportunities for multi-scale stitching of super-resolution images.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Super-resolution localization microscopy (SRLM) refers to a series of super-resolution fluorescence microscopy techniques, including photoactivated localization microscopy (PALM), stochastic optical reconstruction microscopy (STORM), direct stochastic optical reconstruction microscopy (dSTORM), point accumulation for imaging in nanoscale topography (PAINT), and related techniques [1,2]. As a typical computational microscopy approach, SRLM requires a suitable image processing algorithm to extract the precise locations of millions of emitters from thousands or even tens of thousands of raw fluorescence images, and these locations are further used to reconstruct a final super-resolution (SR) image [3]. Therefore, the performance of the image processing algorithms, quantified mainly by detection rate, localization precision, and execution speed [4], is critical for making SRLM successful in practical biomedical applications [5].

The reported image processing algorithms in SRLM include non-iterative algorithms, single-emitter fitting algorithms, multi-emitter fitting algorithms, compressed sensing algorithms, and several other approaches [6]. Among them, multi-emitter fitting algorithms have received intensive attention in the past decade, because they can increase the imaging speed of SRLM by allowing more emitters to be localized within a single fluorescent spot (also called a region of interest, ROI) [7]. When multi-emitter fitting algorithms are combined with maximum-likelihood estimation (MLE), it is possible to approach the theoretical maximum localization precision [8], while the detection rate remains satisfactory. However, because a large amount of time is required to perform the iterative optimization process, MLE-based multi-emitter fitting algorithms usually run at a slow execution speed [9], which limits the practical use of these algorithms in routine biomedical experiments.

It is well-known that Graphics Processing Unit (GPU) computation can significantly increase the execution speed of MLE-based multi-emitter fitting algorithms [10]. For a rough comparison, the analytical speed of a recent MLE-based multi-emitter fitting algorithm called UNLOC [11], which was implemented with multi-core CPU computation, was about 20 times that of a popular localization algorithm called ThunderSTORM [12]. For another multi-emitter fitting method called QC-STORM [13], where GPU computation was used to accelerate a series of MLE-based fitting algorithms (which can process up to three molecules per ROI), the analytical speed (full routine) was about 3-4 orders of magnitude faster than ThunderSTORM, and even slightly faster than the state-of-the-art non-iterative multi-emitter localization algorithm called WindSTORM [14]. Unfortunately, QC-STORM is not able to provide real-time multi-emitter fitting for raw images from popular sCMOS cameras working at full frame rates (for example, Hamamatsu Flash 4.0 V3, 2048 × 2048 pixels, 100 frames per second (fps)), even at moderate emitter densities of 2-3 µm−2. Note that, for real-time image processing, the image processing speed should be faster than the acquisition speed of a single frame, that further improvement of the execution speed of QC-STORM would require expensive upgrades of the GPU hardware, and that the slow pre-processing cannot be avoided due to the von Neumann computation architecture of the GPU. For GPU computation, data processing is driven by software instructions, so the data has to be transferred between computing and storage units; each processing step involves the execution of multiple instructions in centralized computing units, and repetitive transfer of intermediate data between computing and storage units is required [15]. On the contrary, for a Field Programmable Gate Array (FPGA) implementation, data processing happens at multiple hardware stages. At each stage, computing is accomplished with specifically programmed hardware units; the data flows through these units, and no buffering or repetitive transfer of intermediate data is involved [16].

Recently, we implemented a series of MLE-based fitting algorithms into a cooperative computation platform consisting of FPGA computation and GPU computation, and proposed an upgraded multi-emitter localization method called HCP-STORM [17]. We found that HCP-STORM could enable real-time image processing for raw images with 2048 × 2048 pixels @ 100 fps. However, the performance of HCP-STORM is limited not only by the FPGA resources, but also by the bandwidth of USB 3.0, which is used to transmit ROIs from FPGA to GPU. A rough estimation indicated that, when the emitter density increases to 0.6 µm−2, USB 3.0 becomes insufficient to transmit all extracted ROIs in real time. Furthermore, with the rapid development of scientific complementary metal-oxide semiconductor (sCMOS) cameras, researchers can now use back-illuminated sCMOS cameras with larger sensor arrays and faster frame rates (for example, Photometrics Kinetix, 3200 × 3200 pixels @ 498 fps) as the detector in SRLM. This kind of sCMOS camera provides roughly ten times the data throughput of current popular sCMOS cameras. That is to say, in order to deal with higher emitter densities (> 0.6 µm−2) and/or detectors with higher data throughput, the image processing method in the current solutions should be significantly upgraded [18], especially for biomedical applications that require multi-FOV stitching of super-resolution images and multi-scale visualization of cell functions [19,20].

Here we use the PCIe (Peripheral Component Interconnect Express) interface to build a heterogeneous FPGA-GPU computation platform called AIO-HCP (all-in-one heterogeneous computing platform) [21], and present a pilot evaluation of the power of this platform by implementing the same series of MLE-based fitting algorithms as used in HCP-STORM. To simplify the performance evaluation hereafter, the MLE-based fitting method running on this new FPGA-GPU platform is called AIO-STORM, in which we adopt the task parallelism strategy from computer science to improve the overall computation efficiency through stream-parallel scheduling of big data. Instead of processing images frame by frame through GPU computation, FPGA computation caches and processes images row by row through state machines, which are synchronized with the rolling shutter mode of the sCMOS camera used to capture raw images. The FPGA outputs ROIs in batches to the GPU for further emitter localization, under the scheduling of the PCIe-based host computer and Windows driver. We apply several tricks to accelerate the image processing speed (see Section 2.1 for details). We verify that AIO-HCP guarantees a data interaction capacity higher than the data transmission bandwidth of the sCMOS camera used in this study, and that AIO-STORM enables real-time image processing for raw images with 1024×1024 pixels @ 100 fps and structure densities up to 5.5 µm−2, without sacrificing image quality. This study demonstrates the data processing power of heterogeneous parallel computing for computational microscopy. We believe this study paves the way for enhancing the potential applications of SRLM in cases that require both high imaging speed and high localization accuracy.

2. Methods

2.1 PCIe-based heterogeneous FPGA-GPU computation platform

A typical SRLM system commonly consists of a camera and a computer, with a GPU card in the computer to accelerate emitter localization. The data processing steps in emitter localization mainly include image pre-processing (denoising, background removal, fluorescent spot identification, ROI extraction), localization, and rendering [22]. In this paper, we built a PCIe-based heterogeneous computing platform (called AIO-HCP), where an FPGA card (Xilinx KC705 FPGA with PCIe interface), a GPU card (Nvidia TITAN Xp with 3840 CUDA cores and 11 GB memory), a CPU (Intel Core i7-8700, 32 GB memory), and a hard disk are integrated inside a computer chassis. These devices communicate with each other through PCIe slots, as shown in Fig. 1(a).

Fig. 1. PCIe-based FPGA-GPU heterogeneous computing platform for image processing in SRLM. (a) System configuration. (b) The data flow for execution procedure in AIO-HCP.

We adopt the task parallelism strategy from computer science, implement the MLE-based fitting algorithms of QC-STORM into the AIO-HCP, and present a big data stream-parallel scheduling method, called AIO-STORM, for accelerating multi-emitter fitting localization in SRLM. The image pre-processing is performed by the FPGA, the localization and rendering are performed by the GPU, and the final image display is under the control of the CPU. We apply several tricks in AIO-STORM to accelerate image processing. Trick 1: The ROIs extracted in the FPGA are transmitted to the GPU card via a Direct Memory Access (DMA) engine, which is implemented in the PCIe-core of the FPGA. In this case, the time consumed by the CPU (central processing unit) and memory accesses can be minimized, which greatly reduces the software overhead. Trick 2: Task parallelism scheduling is designed through the host computer and Windows driver to enable simultaneous working of FPGA and GPU. This treatment significantly reduces the overall time consumption. Trick 3: The batch loading volume in the FPGA is maximized, so that a large number of ROIs can be sent from FPGA to GPU in each transfer. Note that the GPU executes parallel computation in a single instruction multiple data (SIMD) fashion, and that a larger batch loading volume produces fewer data interactions and higher utilization of GPU cores.

2.2 Data interaction via direct memory access

In traditional heterogeneous computation, data interaction between FPGA and GPU actually means transferring the data in the FPGA to the memory of the computer (under the control of the CPU), after which the GPU reads the data from the memory for further computation. These steps require CPU instructions and memory read/write, and thus are not efficient for parallel computation due to the frequent data access operations. In our AIO-HCP, the data interaction is performed directly between FPGA and GPU. Specifically, when a PCIe-based FPGA card is connected to a computer, the computer allocates resources to the FPGA card automatically through a host-computer configuration, including base address, interrupt number, etc. When the DMA in the FPGA card requests to use the PCIe bus, the CPU releases control of the bus, the FPGA takes master control of the bus, and data is exchanged directly between FPGA and GPU through the DMA engine [23]. In this way, the number of memory accesses is reduced, and the time consumption of data interaction is also effectively reduced, as shown in Fig. 1(b). Notably, in order to improve the generality of AIO-HCP, we take the PCIe-core as a data interaction module to connect the user logic in the FPGA and the algorithms in the GPU. Therefore, to implement other algorithms into AIO-HCP, users do not need to develop the high-speed interface for data interaction again; they only need to load their own application logic and algorithms into the corresponding processor.

2.3 Implementing MLE-based fitting algorithms into the AIO-HCP

In AIO-HCP, the PCIe-core in the FPGA card implements the handshake sequence among the different heterogeneous devices using the DMA engine, the Windows driver and the User App. This treatment ensures data transfer accuracy, even at gigabyte-per-second throughput levels. In addition, to maximize the PCIe throughput, the descriptor of the DMA engine is kept inside the FPGA rather than in system memory. For implementing the multi-emitter fitting algorithms (QC-STORM) into the AIO-HCP, the complete workflow of the handshake sequence between the DMA engine and the Windows driver is described in the following steps:

Step 1: Data transfer from camera to FPGA. The FPGA acquires raw images from the image acquisition card of the camera through the PCIe interface. Notably, a state machine is adopted to stay consistent with the rolling shutter mode of the sCMOS camera.

Step 2: ROI extraction in FPGA. The pre-processing steps of QC-STORM are performed in the FPGA, including denoising, background removal, and sub-region extraction, using three filters (a 5×5 Gaussian low-pass filter to reduce noise, a 7×7 annular filter to remove background, and a standard deviation filter to obtain the background fluctuation intensity). Then, from a denoised image, the FPGA identifies fluorescent spots as local maxima whose pixel value is larger than a threshold, calculated as 8.5 times the standard deviation of the adjacent background pixels [13]. ROIs are extracted from the raw image with a size of either 7×7 or 11×11 pixels. When the distance between a local maximum and its nearest local maximum is larger than 7 pixels, the ROI centered at the former local maximum is considered a single-molecule ROI. Otherwise, the ROI is considered a multi-molecule ROI, and its extracted size is expanded to 11×11 pixels. More details can be seen in the pre-processing logic in Fig. 2(a).
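
For illustration, this detection and classification logic can be emulated in software with a few lines of NumPy/SciPy (a simplified sketch, not the FPGA implementation: the uniform mean filter below stands in for the 7×7 annular background filter, and all function and variable names are ours):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter, maximum_filter

def extract_rois(frame, k_sigma=8.5, min_sep=7):
    """Software emulation of the FPGA pre-processing: denoise, remove
    background, find local maxima, and classify the ROI size."""
    img = frame.astype(float)
    smooth = gaussian_filter(img, sigma=1.0, truncate=2.0)    # ~5x5 low-pass
    local_mean = uniform_filter(img, size=7)                  # stand-in for the 7x7 annular filter
    local_sq = uniform_filter(img ** 2, size=7)
    local_std = np.sqrt(np.maximum(local_sq - local_mean ** 2, 0.0))
    detect = smooth - local_mean                               # background-removed image
    # candidate pixels: local maxima exceeding k_sigma x background fluctuation
    is_max = detect == maximum_filter(detect, size=7)
    candidates = np.argwhere(is_max & (detect > k_sigma * local_std))
    rois = []
    for y, x in candidates:
        d = np.hypot(*(candidates - (y, x)).T)
        d[d == 0] = np.inf
        size = 7 if d.min() > min_sep else 11                  # single- vs multi-molecule ROI
        half = size // 2
        if half <= y < frame.shape[0] - half and half <= x < frame.shape[1] - half:
            rois.append(((y, x), frame[y - half:y + half + 1, x - half:x + half + 1]))
    return rois
```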

Fig. 2. The implementation of a series of MLE-based fitting algorithms (QC-STORM) under AIO-HCP. (a) The algorithms implemented among FPGA, GPU and CPU. (b) The handshake sequence via the DMA engine, the Windows driver and the User App. The brown lines indicate the data transfer direction, and the blue lines indicate the instruction direction.

Step 3: Transfer ROIs from FPGA to GPU. The DMA engine loads a descriptor (see Fig. 2(b), instruction 1) and moves data (ROIs) from the FPGA to the ring buffer (Fig. 2(b), instruction 2). After filling a programmable number of 8 kB pages, the DMA updates the write (WR) pointer with the last filled descriptor (Fig. 2(b), instruction 3). The driver reads the updated WR pointer (Fig. 2(b), instruction 4), moves the data to the GPU (Fig. 2(b), instruction 5), and updates the read (RD) pointer in the FPGA with the last address read (Fig. 2(b), instruction 6).

Step 4: Localization and rendering in GPU. AIO-HCP allocates the localization and rendering steps of QC-STORM to the GPU through a data handling function (see the localization algorithm part in Fig. 2(a)). The localization algorithms include: a maximum likelihood estimator based on Broyden-Fletcher-Goldfarb-Shanno optimization (MLEbfgs) for single-molecule localization, a weighted MLE (MLEwt) for localization of a single-molecule signal with contamination, and a two-emitter MLE (MLE2e) and a three-emitter MLE (MLE3e) for multi-emitter localization. The ROIs with a size of 7×7 pixels are processed with MLEbfgs and MLEwt, while the ROIs with a size of 11×11 pixels are processed with MLE2e and MLE3e. More details can be found in the QC-STORM paper [13].
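
As a point of reference, a generic single-emitter MLE with a Gaussian PSF and a Poisson noise model can be written in a few lines; this is only an illustrative stand-in for MLEbfgs and does not reproduce the CUDA kernels or the weighted/multi-emitter variants of QC-STORM:

```python
import numpy as np
from scipy.optimize import minimize

def mle_fit_single(roi, psf_sigma=1.3):
    """Generic single-emitter MLE on a square ROI (Gaussian PSF, Poisson
    noise): an illustrative stand-in for MLEbfgs, not the CUDA kernels."""
    n = roi.shape[0]
    yy, xx = np.mgrid[0:n, 0:n]

    def model(p):
        x0, y0, photons, bg = p
        g = np.exp(-((xx - x0) ** 2 + (yy - y0) ** 2) / (2 * psf_sigma ** 2))
        return photons * g / (2 * np.pi * psf_sigma ** 2) + bg

    def neg_log_likelihood(p):
        mu = np.maximum(model(p), 1e-9)
        return np.sum(mu - roi * np.log(mu))     # Poisson NLL up to a constant

    p0 = [n / 2, n / 2, max(roi.sum() - n * n * roi.min(), 1.0), roi.min()]
    res = minimize(neg_log_likelihood, p0, method="L-BFGS-B",
                   bounds=[(0, n - 1), (0, n - 1), (1, None), (0, None)])
    return res.x   # x0, y0, total photons, background per pixel
```

A two- or three-emitter fit follows the same pattern, with additional (x0, y0, photons) triplets in the parameter vector at a correspondingly higher iteration cost.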

Step 5: Transfer super-resolution image from GPU to CPU. The DMA engine loads a new descriptor from its internal list, and compares this address with the RD pointer to avoid overwriting data in the ring buffer (Fig. 2(b), instruction 7). If the addresses are the same (that is, the driver is slower than the DMA), the writing process is paused until RD is updated. Finally, the GPU sends the localization results to the User App for further super-resolution image visualization (Fig. 2(b), instruction 8). The visualization of localizations is processed in batches, depending on the loading volume from the user logic each time. A larger loading volume means fewer data interactions and lower software overhead, and thus results in a shorter time consumption.

It is worth noting that, to avoid boundary effects during the pre-processing in the FPGA, we first accumulate a buffer of 7 rows and 1024 columns and slide a 7×7 pixel window across this area pixel by pixel to determine local maxima. Note that local maxima are identified only if they are located at the center of the sliding window, and the locations of the local maxima are marked in the current buffer and stored in a second buffer. The next area for window sliding includes data from a new row and six adjacent rows from the buffer. After we identify all local maxima in a buffer area of 14 rows and 1024 columns, we calculate the distances between local maxima in the current area. If the distance between two local maxima is larger than 7 pixels, we extract a single-molecule ROI of 7×7 pixels using the location of this local maximum in the second buffer. Otherwise, we extract a multi-molecule ROI of 11×11 pixels in the current buffer. We repeat the window sliding and ROI extraction processes until we reach the end of the raw image. On the other hand, we found no boundary effect across different ROIs, since the localization algorithms were the same as in our previous paper [13], and the performance of these algorithms has been evaluated systematically using both simulated and experimental datasets.
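
The row-buffered scan can likewise be emulated in software as a streaming routine that never holds more than a handful of rows in memory; this mirrors the FPGA state machine only in spirit, and the buffer-handling details are simplified:

```python
import numpy as np
from collections import deque

def stream_local_maxima(rows, width=1024, win=7):
    """Scan an image row by row, holding only `win` rows at a time, and yield
    (row, col) positions that are the maximum of their win x win window."""
    half = win // 2
    buf = deque(maxlen=win)                  # rolling buffer of the last `win` rows
    for row_idx, row in enumerate(rows):
        buf.append(np.asarray(row, dtype=float))
        if len(buf) < win:
            continue                         # wait until the buffer is full
        window_rows = np.stack(list(buf))    # shape (win, width)
        center_row = row_idx - half          # image row aligned with the window centre
        for col in range(half, width - half):
            patch = window_rows[:, col - half:col + half + 1]
            if patch[half, half] == patch.max():
                yield center_row, col

# toy usage: feed rows one at a time, as the rolling-shutter readout would
frame = np.random.poisson(200, size=(64, 1024)).astype(float)
maxima = list(stream_local_maxima(iter(frame)))
```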

When all the data have been moved from the FPGA to the computer, the DMA updates the WR pointer with the last descriptor, and informs the driver about the exact amount of data written in the last 8 kB page. An interrupt is also generated to inform the CPU about the end of transmission. It is worth noting that timely intervention in the image processing can be achieved by configuring the registers in the host computer to modify the parameters in the FPGA logic. This is a unique feature of our PCIe-based architecture for heterogeneous computation (see the brown lines in Fig. 2(b)).
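
The WR/RD pointer handshake can be illustrated with a small software model of the ring buffer; the page size and the stall-on-full behaviour follow the description above, while the class and method names are purely illustrative:

```python
class RingBuffer:
    """Toy model of the DMA ring buffer: the FPGA-side writer advances WR,
    the driver-side reader advances RD, and writing stalls when the next
    page would overwrite data that has not yet been read."""

    def __init__(self, n_pages=64, page_size=8 * 1024):
        self.pages = [None] * n_pages
        self.n_pages, self.page_size = n_pages, page_size
        self.wr = 0     # next page the DMA engine will fill
        self.rd = 0     # next page the driver will read

    def write_page(self, payload):
        nxt = (self.wr + 1) % self.n_pages
        if nxt == self.rd:
            return False                     # buffer full: pause until RD advances
        assert len(payload) <= self.page_size
        self.pages[self.wr] = payload
        self.wr = nxt                        # DMA updates the WR pointer
        return True

    def read_page(self):
        if self.rd == self.wr:
            return None                      # nothing new to forward to the GPU
        payload = self.pages[self.rd]
        self.rd = (self.rd + 1) % self.n_pages   # driver updates the RD pointer
        return payload

# toy usage: the FPGA writes ROI pages, the driver drains them toward the GPU
rb = RingBuffer(n_pages=4)
for i in range(10):
    if not rb.write_page(bytes([i]) * 1024):
        rb.read_page()                       # driver frees one page
        rb.write_page(bytes([i]) * 1024)
```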

2.4 Stream parallel scheduling of big data between FPGA and GPU

The handshake sequence in the previous section describes a single DMA-based data interaction, which is the basis for accurate stream-parallel scheduling of big data between the heterogeneous devices. The processing speed of the FPGA is fixed [17]; it is determined by the working frequency and the number of pixels processed simultaneously per clock cycle, and is independent of the emitter density. Therefore, during a batch data load, outputting more ROIs from the FPGA means fewer data interactions between FPGA and GPU, leading to a minimized waiting time for GPU localization and an improved overall computation efficiency. Limited by the resources of the FPGA and GPU cards used in this study, the single data load to the FPGA was set to 1024 × 1024 pixels × 2000 frames. Notably, the single data load in QC-STORM is 1024 × 1024 pixels × 50 frames due to the limited memory size of 4 GB in the GPU card [24].
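
The benefit of a large batch load can be seen with a simple timing model in which every FPGA-to-GPU hand-off carries a fixed software and scheduling overhead; the per-frame and per-hand-off times below are hypothetical placeholders, not measured values:

```python
def total_time_s(n_frames, batch_frames, t_frame_s, t_handoff_s):
    """Toy model: per-frame work plus a fixed overhead per FPGA-to-GPU
    hand-off (driver call, kernel launch, pointer updates)."""
    n_batches = -(-n_frames // batch_frames)        # ceiling division
    return n_frames * t_frame_s + n_batches * t_handoff_s

# hypothetical timings, for illustration only
for batch in (50, 200, 2000):
    print(batch, total_time_s(10_000, batch, t_frame_s=0.005, t_handoff_s=0.05))
```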

We compared the computation architectures (Fig. 3(a-c)) and overall execution times (Fig. 3(d-f)) of different computation platforms. Note that the GPU-based computation platform is used in QC-STORM, the FPGA-GPU cooperative computation platform is used in HCP-STORM, and the PCIe-based FPGA-GPU heterogeneous computation platform is used in AIO-STORM, respectively. The stream-parallel scheduling in AIO-STORM is shown in Fig. 3(g). While the FPGA extracts ROIs from the second data load (stream 1), the GPU simultaneously localizes fluorescent spots from the previous batch (stream 0). In this way, AIO-STORM optimizes the stream-parallel scheduling of big data between FPGA and GPU, and thus minimizes the overall time consumption for multi-emitter fitting.
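
A minimal sketch of this two-stage pipeline shows why the overall time approaches the slower of the two stages times the number of batches, rather than their sum; the per-batch stage times are placeholders for illustration only:

```python
def serial_time_s(n_batches, t_fpga_s, t_gpu_s):
    """Cooperative scheme: pre-processing and localization alternate."""
    return n_batches * (t_fpga_s + t_gpu_s)

def pipelined_time_s(n_batches, t_fpga_s, t_gpu_s):
    """Stream-parallel scheme (cf. Fig. 3(g)): first FPGA batch, then the
    two stages overlap, and the GPU finishes the last batch on its own."""
    return t_fpga_s + (n_batches - 1) * max(t_fpga_s, t_gpu_s) + t_gpu_s

# hypothetical per-batch stage times, for illustration only
print(serial_time_s(10, 2.0, 1.5))      # 35.0
print(pipelined_time_s(10, 2.0, 1.5))   # 21.5
```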

Fig. 3. Data stream execution in different computation platforms. (a) The computation architecture in GPU-based computation platform. (b) The computation architecture in FPGA-GPU cooperative computation platform. (c) The computation architecture in PCIe-based FPGA-GPU heterogeneous computation platform. (d) The execution time for (a). (e) The execution time for (b). (f) The execution time for (c). (g) The stream parallel scheduling in AIO-STORM.

2.5 Image simulation and algorithm evaluation

We evaluated the performance of AIO-STORM on simulated datasets with different activation densities (0.1-4 µm−2, according to typical experimental density ranges). The performance of the currently fastest multi-emitter fitting method, QC-STORM, and the popular multi-emitter fitting method ThunderSTORM was also evaluated and compared. At each density level, we simulated 200 frames of raw images with randomly distributed emitters, with 1024×1024 pixels in each frame. The photon number of each emitter was set to 5000 photons per frame, and the background was set to 200 photons/pixel. The emission from the emitters was captured with a quantum efficiency of 0.77, and degraded by shot noise and readout noise (1.6 e−). The full width at half maximum (FWHM) of the Gaussian point spread function (PSF) was 200 nm.
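
A simplified version of this simulation, using the parameter values listed above, can be written as follows; the 100 nm pixel size is an assumption made only for this sketch, since it is not specified here:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_frame(size=1024, density_um2=1.0, pixel_nm=100.0, photons=5000,
                   bg=200.0, qe=0.77, read_noise=1.6, fwhm_nm=200.0):
    """Simulate one SRLM frame with randomly placed emitters, following the
    parameters of Section 2.5 (the 100 nm pixel size is assumed here)."""
    sigma_px = fwhm_nm / 2.355 / pixel_nm
    fov_um2 = (size * pixel_nm / 1000.0) ** 2
    n_emitters = rng.poisson(density_um2 * fov_um2)
    xs = rng.uniform(0, size, n_emitters)
    ys = rng.uniform(0, size, n_emitters)

    ideal = np.full((size, size), bg, dtype=float)      # expected photons/pixel
    yy, xx = np.mgrid[0:size, 0:size].astype(float)
    for x0, y0 in zip(xs, ys):
        # add each Gaussian PSF only on a small patch around the emitter
        xl, xh = int(max(x0 - 5, 0)), int(min(x0 + 6, size))
        yl, yh = int(max(y0 - 5, 0)), int(min(y0 + 6, size))
        g = np.exp(-((xx[yl:yh, xl:xh] - x0) ** 2 +
                     (yy[yl:yh, xl:xh] - y0) ** 2) / (2 * sigma_px ** 2))
        ideal[yl:yh, xl:xh] += photons * g / (2 * np.pi * sigma_px ** 2)

    detected = rng.poisson(qe * ideal)                  # shot noise after QE
    frame = detected + rng.normal(0.0, read_noise, detected.shape)
    return frame, np.column_stack([xs, ys])

frame, truth = simulate_frame(size=256)
```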

We used data throughput to estimate the bandwidth requirements at different activation densities. We also used data throughput to evaluate the data interaction volume of heterogeneous devices with different DMA sizes. Note that, to ensure real-time data processing, the data interaction volume should be larger than the data bandwidth requirements. We used simulated datasets to evaluate the localization performance of the three algorithms using Jaccard index, recall, root mean square error (RMSE) and run time. Notably, the localization accuracy is characterized by RMSE, the detection rate is evaluated by Jaccard index, and the ratio of correctly identified ROIs is checked by Recall.
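
These metrics can be computed by matching localizations to ground-truth positions within a tolerance radius; the sketch below uses a simple greedy nearest-neighbour matching and a 1-pixel tolerance, both of which are our own choices rather than the exact evaluation protocol:

```python
import numpy as np
from scipy.spatial import cKDTree

def match_metrics(found, truth, tol_px=1.0):
    """Match found localizations to ground truth within `tol_px` (greedy,
    nearest-neighbour) and return Jaccard index, recall and RMSE."""
    found = np.asarray(found, dtype=float)
    truth = np.asarray(truth, dtype=float)
    dist, idx = cKDTree(truth).query(found, distance_upper_bound=tol_px)
    matched, errors = set(), []
    for d, j in zip(dist, idx):
        if np.isfinite(d) and j not in matched:
            matched.add(j)
            errors.append(d)
    tp = len(matched)
    fp, fn = len(found) - tp, len(truth) - tp
    jaccard = tp / (tp + fp + fn)
    recall = tp / (tp + fn)
    rmse = float(np.sqrt(np.mean(np.square(errors)))) if errors else float("nan")
    return jaccard, recall, rmse
```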

2.6 Super-resolution fluorescence imaging experiments

The optical microscopy system and the biological sample preparation are basically the same as those reported for HCP-STORM [17], but the computation architecture and data processing method are different. In AIO-STORM, raw fluorescence images were pre-processed in the FPGA to obtain ROIs, which were passed through PCIe to the GPU for further localization and rendering. The pre-processing and localization were executed simultaneously under the control of the host computer and Windows driver. Super-resolution images were displayed in batches after rendering. Limited by the resources in the FPGA and the typical imaging FOV, images were acquired with an FOV of 1024 × 1024 pixels and an exposure time of 10 ms. To evaluate the performance of AIO-STORM in practical super-resolution imaging, we specifically selected an image with high emitter density, and further selected an FOV with relatively high density to compare the number of localizations and the Fourier ring correlation (FRC) resolution.

3. Results and discussion

3.1 Estimating the data throughput in SRLM

We estimated the relationship between data throughput and activation density using the simulated datasets described in Section 2.5. The results can be seen in Fig. 4. ROIs with 7×7 pixels and 11×11 pixels were extracted for single-emitter localization and multi-emitter localization, respectively. This treatment is helpful for reducing the execution time and guaranteeing localization accuracy, especially for the high-density cases. With increasing activation density, the data throughput of ROIs with 7×7 pixels shows a peak at ∼ 1.2 µm−2, followed by a slow decrease at higher densities. However, the data throughput of ROIs with 11×11 pixels and of the total ROIs (7×7 & 11×11 pixels) keeps increasing with density. Furthermore, when the activation density reaches 0.6 µm−2, the data throughput of the total ROIs reaches about 300 MB/s, which is nearly the capacity limit of USB 3.0 (used in HCP-STORM). When the activation density is 4 µm−2, the total data throughput (the brown line in Fig. 4) almost reaches the maximum data throughput of the current sCMOS camera (∼ 800 MB/s). In this case, the pre-processing in the FPGA leads to almost no data reduction. To ensure accurate data transmission at this kind of high data throughput, we would need to use additional storage, which would seriously affect the overall speed due to the extra read/write in external memories. That is to say, when the activation density is higher than 0.6 µm−2, a new data interaction scheme with a bandwidth larger than 800 MB/s should be provided.
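
The trend in Fig. 4 can be approximated from the ROI sizes and counts alone. The sketch below assumes a spatially random emitter distribution, 100 nm pixels, 16-bit pixel depth and, on average, two emitters per multi-molecule ROI; these are simplifications intended to reproduce the qualitative behaviour, not the exact curves:

```python
import numpy as np

def roi_throughput_mb_s(density_um2, fov_px=2048, pixel_nm=100.0, fps=100,
                        bytes_per_px=2, sep_px=7):
    """Rough ROI data-rate estimate for a spatially random (Poisson) emitter
    distribution; a multi-molecule ROI is assumed to cover ~2 emitters."""
    fov_um2 = (fov_px * pixel_nm / 1000.0) ** 2
    emitters = density_um2 * fov_um2                      # emitters per frame
    lam = emitters / fov_px ** 2                          # emitters per pixel
    p_multi = 1.0 - np.exp(-lam * np.pi * sep_px ** 2)    # nearest neighbour < 7 px
    px_single = emitters * (1.0 - p_multi) * 7 * 7
    px_multi = emitters * p_multi / 2.0 * 11 * 11
    px_total = min(px_single + px_multi, float(fov_px ** 2))   # cannot exceed a frame
    return px_total * bytes_per_px * fps / 1e6

for d in (0.1, 0.6, 1.2, 4.0):
    print(d, round(roi_throughput_mb_s(d), 1))
```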

Fig. 4. The dependence of data throughput on different activation densities.

3.2 Data interaction capacity of AIO-STORM

It is well known that, in PCIe, data transmission from device A to device B is handled at different layers (transaction layer, data link layer, and physical layer), which are encapsulated within the FPGA as an IP-core. A standard IP-core can be called to control the timing and the packet size for data transmission and reception. The complete communication mode of PCIe is shown in Fig. 5(a). According to the protocol, the data throughput of PCIe is determined by the location of each field and the number of occupied bytes in a packet (Fig. 5(b)).

Fig. 5. The data transmission links in PCIe. (a) The data transmission layer in PCIe. (b) The data packet structure of PCIe protocol.

We evaluated the data interaction capacity of AIO-HCP and AIO-STORM using the data throughput $T$:

$$T = \frac{P_L}{P_L + O_V} \cdot N \cdot Gen \cdot (250\,\mathrm{MB/s}).$$

Here $P_L$ is the maximum payload size, $O_V$ is the protocol overhead (24 bytes for 64-bit memory addressing), $N$ is the number of lanes, and $Gen$ is the generation of the PCIe endpoint [25]. In this study, we used an FPGA Kintex-7 accelerating card with a PCIe Gen 2 × 4 lanes endpoint.
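
Plugging the parameters of this endpoint into the equation above gives the theoretical throughput for different payload sizes; a one-line numerical check:

```python
def pcie_throughput_mb_s(payload_bytes, overhead_bytes=24, lanes=4, gen=2):
    """T = PL / (PL + OV) * N * Gen * 250 MB/s (the equation above)."""
    return payload_bytes / (payload_bytes + overhead_bytes) * lanes * gen * 250.0

for pl in (8, 32, 64, 128):
    print(pl, round(pcie_throughput_mb_s(pl), 1))   # 128 B -> ~1684 MB/s
```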

To improve $T$, we need to increase $P_L$ and decrease $O_V$. Thus, based on the PCIe protocol (Fig. 5(b)), we tested the data throughput with a fixed $O_V$ and different $P_L$ (8-128 bytes). Note that the theoretical maximum throughput for the FPGA used in this study is 1684 MB/s ($P_L$ = 128 bytes). We found that the data interaction capacity of AIO-HCP is 1651 MB/s, which is approximately 98% of the theoretical maximum (1684 MB/s, see the pink dashed line in Fig. 6). Note that the data interaction capacity of AIO-HCP is 4% higher than the value reported for a similar FPGA board, Virtex-6 Gen 2 × 4 (1583 MB/s) [25]. The data interaction capacity of AIO-STORM was found to be 580 MB/s. The significantly lower capacity of AIO-STORM compared with AIO-HCP arises from the FPGA processing frequency being reduced to ensure accurate timing: the processing frequency is 200 MHz in AIO-HCP, but 75 MHz in AIO-STORM. In the near future, we could upgrade to FPGA chips with more resources, so that the FPGA processing frequency does not need to be reduced; in this case, the data interaction capacity of AIO-STORM could be the same as that of AIO-HCP. Notably, the data throughput of AIO-HCP (Fig. 6) exceeds the bandwidth requirement for ROI transmission at different activation densities (Fig. 4), which is a prerequisite for enabling real-time image processing in SRLM.

Fig. 6. The data interaction capacity of AIO-HCP (no load) and AIO-STORM (loaded with localization algorithms) at different payload sizes. Note that the PCIe-based FPGA-GPU heterogeneous computation involves bidirectional data interaction. Here FPGA-GPU means the FPGA is used as the master device, and GPU-FPGA means the GPU is used as the master device.

3.3 Evaluating the performance of AIO-STORM using simulated images

We compared the localization performance of AIO-STORM with QC-STORM and ThunderSTORM using simulated image datasets with different activation densities (from 0.1 µm−2 to 4 µm−2, according to typical experimental density ranges). We quantified the localization performance of these algorithms with the Jaccard index, recall rate, root-mean-square error (RMSE) and run time, as shown in Fig. 7. We found that AIO-STORM exhibits localization performance close to that of ThunderSTORM at different activation densities, and both are slightly worse than QC-STORM (Fig. 7(a-c)). The slightly lower localization performance of AIO-STORM compared with QC-STORM is probably due to the simplified threshold judgement in ROI extraction, which was used to minimize resource use in the FPGA. Furthermore, AIO-STORM ran 4-5 orders of magnitude faster than ThunderSTORM, and about four times faster than QC-STORM (Fig. 7(d)). The run time of AIO-STORM increases slightly at higher activation densities.

Fig. 7. Comparison on the overall performance of three localization algorithms in processing simulated images at different activation densities. (a) The dependence of Jaccard on activation density. (b) The dependence of Recall on activation density. (c) The dependence of RMSE on activation density. (d) The Run time at different activation densities.

We note that the computing performance of the FPGA is determined by the working frequency and the number of pixels processed per clock cycle. Therefore, under the same hardware conditions, AIO-STORM provides a greater speed advantage in scenarios with a larger amount of data to be pre-processed, and increasing the resources and working frequency of the FPGA chip can directly improve the processing speed.

3.4 Evaluating the performance of AIO-STORM using experimental images

We quantified the localization performance of AIO-STORM on an experimental dataset of fixed COS-7 cells. We labelled the tubulin structures with Alexa Fluor 647, and imaged the sample on a home-built SRLM setup. As seen in Fig. 8(a), AIO-STORM is capable of processing 10,000 experimental raw images (1024×1024 pixels, 5.5 µm−2 structure density) within 81 s, corresponding to 8.1 ms for each image. This data processing time is shorter than the exposure time (10 ms). Therefore, AIO-STORM is fast enough to enable real-time data processing for the representative Hamamatsu Flash 4.0 V3 sCMOS camera working at a typical data rate (1024 × 1024 pixels @ 100 fps). The overall data processing speed of AIO-STORM is nearly 4 orders of magnitude faster than ThunderSTORM, and about 4 times faster than QC-STORM, without sacrificing super-resolution image quality (as indicated by the FRC resolution in Fig. 8(b) and Fig. 8(c), and the line profiles in Fig. 8(d)). Notably, in AIO-STORM, the working frequency of the FPGA chip was reduced from 200 MHz to 75 MHz, reducing the overall image processing speed by nearly a factor of three. If we were to use an FPGA chip with more resources and a higher working frequency, the overall image processing speed may be sufficient to support real-time image processing for SRLM with a typical sCMOS camera (2048 × 2048 pixels @ 100 fps) and high-density emitters.
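
For completeness, FRC resolution is commonly estimated by splitting the localizations into two halves, rendering two super-resolution images, and finding the spatial frequency at which their ring correlation drops below the fixed 1/7 threshold. The sketch below is a generic implementation of that recipe, not the code used to produce Fig. 8:

```python
import numpy as np

def frc_curve(img1, img2):
    """Fourier ring correlation between two equally sized square images,
    e.g. rendered from odd- and even-numbered localizations."""
    f1 = np.fft.fftshift(np.fft.fft2(img1))
    f2 = np.fft.fftshift(np.fft.fft2(img2))
    n = img1.shape[0]
    y, x = np.indices(img1.shape)
    r = np.hypot(x - n // 2, y - n // 2).astype(int).ravel()
    num = np.bincount(r, weights=(f1 * np.conj(f2)).real.ravel())
    den = np.sqrt(np.bincount(r, weights=np.abs(f1).ravel() ** 2) *
                  np.bincount(r, weights=np.abs(f2).ravel() ** 2))
    return num / np.maximum(den, 1e-12)

def frc_resolution_nm(img1, img2, pixel_nm, threshold=1.0 / 7.0):
    """Resolution = inverse of the first spatial frequency where the FRC
    curve drops below the 1/7 threshold."""
    curve = frc_curve(img1, img2)
    n = img1.shape[0]
    for ring in range(1, n // 2):
        if curve[ring] < threshold:
            return n * pixel_nm / ring       # 1 / (ring / (n * pixel_nm))
    return float("nan")
```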

Fig. 8. Evaluation of the overall performance of the three localization algorithms using experimental images. (a) A super-resolution image reconstructed from a total of 10000 raw images with 1024 × 1024 pixels per frame. The presented image was processed by AIO-STORM. The overall execution times for three localization algorithms are shown in the upper-left corner. (b) Enlarged super-resolution images from the boxed region in (a) and a frame of raw image. These super-resolution images were processed using different localization algorithms shown in the upper-left corner. The FRC resolution and the total number of localization points are shown in the lower-right corners. (c) Enlarged super-resolution images of the boxed region in (b). The FRC resolution and the total number of localization points are shown in the lower-right corners. (d) Projected line profiles of the marked areas in (c).

Furthermore, we evaluated the overall processing speed of AIO-STORM under different batch sizes (50-2000 frames). Experimental images were used in the evaluation. As shown in Fig. 9, the overall data processing speed increases with the batch size. When the batch size reaches 2000 frames, the image processing speed is even faster than the image acquisition speed of the sCMOS camera, meaning that real-time data processing is realized. These results also confirm the effectiveness of the task-parallel scheduling described in Section 2.4.

Fig. 9. The processing speed of FPGA under different batch sizes.

3.5 Pilot study on real-time multi-emitter fitting with high emitter densities

Using simulated datasets with a large density range (0.1-10 µm−2), we compared the localization performance of AIO-STORM with QC-STORM, ThunderSTORM and DAOSTORM (a well-known localization algorithm that can process ROIs with very high activation density) [26]. As shown in Fig. 10, ThunderSTORM exhibits the best performance in both recall and localization error, followed by AIO-STORM. Note that AIO-STORM executes 4-5 orders of magnitude faster than ThunderSTORM (Fig. 7(d)), and that the performance of DAOSTORM depends on the threshold setting and could potentially be improved in some cases. Moreover, DAOSTORM is based on multi-core CPU computation, and thus runs much slower than QC-STORM (based on GPU computation) and AIO-STORM (based on FPGA-GPU heterogeneous computation). In addition, the AIO-HCP platform reported in this study is a general heterogeneous platform for accelerating image processing, and we can implement other localization algorithms, including but not limited to ThunderSTORM and DAOSTORM, into this platform.

Fig. 10. Evaluation of the localization performance of the four algorithms using simulation datasets. (a) Recall at different activation densities. (b) Localization error at different activation densities.

3.6 Discussions on the heterogeneous computing platform and method

In HCP-STORM, because GPU localization and rendering are performed after the FPGA-based image pre-processing (see Fig. 3(b) and Fig. 3(e)), the overall execution time is the sum of the pre-processing, localization and rendering times. However, in AIO-STORM, GPU localization and FPGA pre-processing are executed in parallel using data batches (see Fig. 3(c) and Fig. 3(f)). Therefore, the overall execution time of AIO-STORM is the sum of the FPGA pre-processing time for the first batch, the synchronous execution time of FPGA and GPU, and the GPU localization and rendering time for the last batch (see Fig. 3(g)). Unsurprisingly, the overall execution time of AIO-STORM is much shorter than that of HCP-STORM.

Theoretically, for GPU-accelerated localization algorithms such as QC-STORM, the overall execution speed can be improved by using a better GPU card and/or adopting multiple GPUs. However, limited by the caches in GPUs, a large amount of time and resources is wasted in repetitive data computation and storage operations, which are controlled by software instructions. FPGA computation does not need such caches; the data stream is driven through parallel pipelines and processed by hardware-based logic circuits, which greatly improves the computation speed.

Furthermore, in this study, we demonstrated that the overall execution speed of AIO-STORM is substantially improved owing to the heterogeneous computing platform and the scheduling method. However, the maximum performance of AIO-STORM has not yet been achieved, because the working frequency of the FPGA card was reduced from 200 MHz to 75 MHz to ensure correct data exchange. This operation reduced the overall image processing speed of AIO-STORM by a factor of about three. In the near future, this limitation can be overcome using an FPGA with more resources (four times the current resources would be sufficient). Moreover, because the data interaction in high-speed heterogeneous computing requires high timing accuracy, it is difficult to realize joint control of software and hardware for heterogeneous computing. A further improvement would be the development of a unified and simplified software-level tool for heterogeneous computing. Notably, the PCIe-based heterogeneous computation platform (AIO-HCP) can be used to accelerate other algorithms, including but not limited to localization algorithms.

4. Conclusion

In summary, a PCIe-based FPGA-GPU heterogeneous computing platform (AIO-HCP) and a multi-emitter localization algorithm (AIO-STORM) were proposed and evaluated in this study. The AIO-HCP could guarantee a data throughput of up to ∼ 1.561 GB/s between heterogeneous devices, and provides a powerful computing platform for computational microscopy. The AIO-STORM could achieve real-time multi-emitter fitting for raw images of 1024 × 1024 pixels @ 100 fps, under a structure density of up to 5.5 µm−2 without sacrificing the imaging quality. The image processing speed of AIO-STORM is four times faster than QC-STORM and four orders of magnitude faster than ThunderSTORM. A high-efficiency parallel scheduling method was proposed to improve the execution speed of AIO-STORM.

Finally, we point out that the localization performance of the AIO-STORM algorithm is currently limited by the FPGA resources in this study. Through convenient hardware migration and software upgrade, the AIO-STORM algorithm could be used to process raw images from cameras with a higher data throughput (for example, Photometrics Kinetix which works at ∼ 3200 × 3200 pixels and 400 fps). This study offers opportunities for enabling super-resolution mosaic imaging and multi-scale visualization of cellular functions.

Funding

National Natural Science Foundation of China (81827901); Start-up Fund from Hainan University (KYQD(ZR)-20077); Scientific Research Program Project of Hubei Province Education Department (B2020405).

Acknowledgments

We thank the Optical Bioimaging Core Facility of WNLO-HUST for technical support, and appreciate Prof. Xiaohu Lv from Huazhong University of Science and Technology for the help in preparing and editing the manuscript.

Disclosures

Chinese patent (No. 201810410930.2).

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. L. Schermelleh, A. Ferrand, T. Huser, C. Eggeling, M. Sauer, O. Biehlmaier, and G. P. C. Drummen, “Super-resolution microscopy demystified,” Nat. Cell Biol. 21(1), 72–84 (2019). [CrossRef]  

2. S. J. Sahl, S. W. Hell, and S. Jakobs, “Fluorescence nanoscopy in cell biology,” Nat. Rev. Mol. Cell Biol. 18(11), 685–701 (2017). [CrossRef]  

3. H. Deschout, F. C. Zanacchi, M. Mlodzianoski, A. Diaspro, J. Bewersdorf, S. T. Hess, and K. Braeckmans, “Precisely and accurately localizing single emitters in fluorescence microscopy,” Nat. Methods 11(3), 253–266 (2014). [CrossRef]  

4. I. M. Khater, I. R. Nabi, and G. Hamarneh, “A review of super-resolution single-molecule localization microscopy cluster analysis and quantification methods,” Patterns 1(3), 100038 (2020). [CrossRef]  

5. A. Mau, K. Friedl, C. Leterrier, N. Bourg, and S. Leveque-Fort, “Fast widefield scan provides tunable and uniform illumination optimizing super-resolution microscopy on large fields,” Nat. Commun. 12(1), 3077 (2021). [CrossRef]  

6. D. Sage, H. Kirshner, T. Pengo, N. Stuurman, J. Min, S. Manley, and M. Unser, “Quantitative evaluation of software packages for single-molecule localization microscopy,” Nat. Methods 12(8), 717–724 (2015). [CrossRef]  

7. Y. Wang, T. Quan, S. Zeng, and Z. L. Huang, “PALMER: a method capable of parallel localization of multiple emitters for high-density localization microscopy,” Opt. Express 20(14), 16039–16049 (2012). [CrossRef]  

8. A. Small and S. Stahlheber, “Fluorophore localization algorithms for super-resolution microscopy,” Nat. Methods 11(3), 267–279 (2014). [CrossRef]  

9. F. Huang, S. L. Schwartz, J. M. Byars, and K. A. Lidke, “Simultaneous multiple-emitter fitting for single molecule super-resolution imaging,” Biomed. Opt. Express 2(5), 1377–1393 (2011). [CrossRef]  

10. D. Sage, T.-A. Pham, H. Babcock, T. Lukes, T. Pengo, J. Chao, R. Velmurugan, A. Herbert, A. Agrawal, S. Colabrese, A. Wheeler, A. Archetti, B. Rieger, R. Ober, G. M. Hagen, J.-B. Sibarita, J. Ries, R. Henriques, M. Unser, and S. Holden, “Super-resolution fight club: assessment of 2D and 3D single-molecule localization microscopy software,” Nat. Methods 16(5), 387–395 (2019). [CrossRef]  

11. S. Mailfert, J. Touvier, L. Benyoussef, R. Fabre, A. Rabaoui, M. C. Blache, Y. Hamon, S. Brustlein, S. Monneret, D. Marguet, and N. Bertaux, “A theoretical high-density nanoscopy study leads to the design of UNLOC, a parameter-free algorithm,” Biophys. J. 115(3), 565–576 (2018). [CrossRef]  

12. M. Ovesny, P. Krizek, J. Borkovec, Z. K. Svindrych, and G. M. Hagen, “ThunderSTORM: a comprehensive ImageJ plug-in for PALM and STORM data analysis and super-resolution imaging,” Bioinformatics 30(16), 2389–2390 (2014). [CrossRef]  

13. L. Li, B. Xin, W. Kuang, Z. Zhou, and Z. L. Huang, “Divide and conquer: real-time maximum likelihood fitting of multiple emitters for super-resolution localization microscopy,” Opt. Express 27(15), 21029–21049 (2019). [CrossRef]  

14. H. Ma, J. Xu, and Y. Liu, “WindSTORM: Robust online image processing for high-throughput nanoscopy,” Sci. Adv. 5(4), eaaw0683 (2019). [CrossRef]  

15. C. A. Navarro, N. Hitschfeld-Kahler, and L. Mateu, “A survey on parallel computing and its applications in data-parallel problems using GPU Architectures,” Commun. comput. phys. 15(2), 285–329 (2014). [CrossRef]  

16. L. Bustio-Martínez, R. Cumplido, M. Letras, R. Hernández-León, C. Feregrino-Uribe, and J. Hernández-Palancar, “FPGA/GPU-based acceleration for frequent itemsets mining: a comprehensive review,” ACM Comput. Surv. 54, Article 179 (2021).

17. D. Gui, Y. Chen, W. Kuang, M. Shang, Z. Wang, and Z. L. Huang, “Accelerating multi-emitter localization in super-resolution localization microscopy with FPGA-GPU cooperative computation,” Opt. Express 29(22), 35247–35260 (2021). [CrossRef]  

18. D. Mahecic, I. Testa, J. Griffie, and S. Manley, “Strategies for increasing the throughput of super-resolution microscopies,” Current Opinion Chemical Biology 51, 84–91 (2019). [CrossRef]  

19. Y. Du, C. Wang, C. Zhang, L. Guo, Y. Chen, M. Yan, Q. Feng, M. Shang, W. Kuang, Z. Wang, and Z.-L. Huang, “Computational framework for generating large panoramic super-resolution images from localization microscopy,” Biomed. Opt. Express 12(8), 4759–4778 (2021). [CrossRef]  

20. J. Fan, J. Suo, J. Wu, H. Xie, Y. Shen, F. Chen, G. Wang, L. Cao, G. Jin, Q. He, T. Li, G. Luan, L. Kong, Z. Zheng, and Q. Dai, “Video-rate imaging of biological dynamics at centimetre scale and micrometre resolution,” Nat. Photonics 13(11), 809–816 (2019). [CrossRef]  

21. K. Chen, H. Chen, J. Huang, F. Lanni, S. Tang, and W. Wu, “A generic high bandwidth data acquisition card for physics experiments,” IEEE Trans. Instrum. Meas. 69(7), 4569–4577 (2020). [CrossRef]  

22. A. Beghin, A. Kechkar, C. Butler, F. Levet, M. Cabillic, O. Rossier, G. Giannone, R. Galland, D. Choquet, and J.-B. Sibarita, “Localization-based super-resolution imaging meets high-content screening,” Nat. Methods 14(12), 1184–1190 (2017). [CrossRef]  

23. C. Shim, R. Shinde, and M. Choi, “Compatibility enhancement and performance measurement for socket interface with PCIe interconnections,” Hum. Cent. Comput. Inf. Sci. 9(1), 10 (2019). [CrossRef]  

24. L. Li, B. Xin, W. Kuang, Z. Zhou, and Z. L. Huang, “Divide and conquer: real-time maximum likelihood fitting of multiple emitters for super-resolution localization microscopy,” Opt. Express 27(15), 21029–21049 (2019).

25. L. Rota, M. Caselle, S. Chilingaryan, A. Kopmann, and M. Weber, “A new DMA PCIe architecture for gigabyte data transmission,” in Proceedings of the 2014 19th IEEE-NPSS Real Time Conference (IEEE, 2014), pp. 1–2.

26. S. J. Holden, S. Uphoff, and A. N. Kapanidis, “DAOSTORM: an algorithm for high- density super-resolution microscopy,” Nat Methods 8(4), 279–280 (2011). [CrossRef]  
