
Accelerating multi-emitter localization in super-resolution localization microscopy with FPGA-GPU cooperative computation


Abstract

Real-time multi-emitter localization is essential for advancing high-throughput super-resolution localization microscopy (HT-SRLM). In the past decade, graphics processing unit (GPU) computation has been the dominant approach for accelerating multi-emitter localization. However, when HT-SRLM is combined with a scientific complementary metal-oxide-semiconductor (sCMOS) camera working at full frame rate, real-time image processing remains difficult to achieve with GPU acceleration alone, resulting in a massive data storage challenge and even system crashes. Here we take advantage of the cooperative acceleration power of field-programmable gate array (FPGA) computation and GPU computation, and propose a method called HCP-STORM to enable real-time multi-emitter localization. Using simulated images, we verified that HCP-STORM is capable of real-time processing of raw images from a representative Hamamatsu Flash 4.0 V3 sCMOS camera working at full frame rate (that is, 2048×2048 pixels @ 10 ms exposure time). Using experimental images, we show that HCP-STORM is 25 times faster than QC-STORM and 295 times faster than ThunderSTORM, with a small but acceptable degradation in image quality. This study demonstrates the potential of FPGA-GPU cooperative computation for accelerating multi-emitter localization, and takes a significant step toward the maturity of HT-SRLM technology.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Super-resolution localization microscopy (SRLM) is beneficial for studying important biomedical questions with nanoscale resolution, for example, unraveling complex biological processes in a large heterogeneous cell population [1,2]. Combining SRLM with scientific complementary metal-oxide-semiconductor (sCMOS) cameras [3], especially back-illuminated sCMOS cameras [4], strengthens the power of SRLM by providing high-throughput imaging capacity. However, this combination brings challenges in real-time processing of massive data, especially when multi-emitter localization is needed to further improve the imaging throughput of SRLM [5,6].

It is well known that multi-emitter localization allows a much higher emitter density (or activation density) in a raw image frame than traditional sparse localization methods, so that a final super-resolution image can be reconstructed from a reduced number of raw image frames [7,8]. In this way, the imaging throughput of SRLM can be improved by at least several times. Undoubtedly, the development of multi-emitter localization methods is desirable for HT-SRLM. However, due to their mathematical complexity, most multi-emitter localization methods usually run far behind the frame rates of popular sCMOS cameras [7], creating huge pressure to store enormous numbers of raw images, especially when HT-SRLM is required for long-term use [9].

Currently, the execution speed of multi-emitter localization methods can be significantly improved by graphics processing unit (GPU) acceleration [10,11]. In 2019, we proposed a maximum likelihood estimation (MLE) based algorithm called QC-STORM [12] for multi-emitter localization. QC-STORM exhibits a speed gain of 2∼3 orders of magnitude over the popular fitting-based ThunderSTORM [13], and provides real-time image processing of raw images with 1024 × 1024 pixels and 10 ms exposure time, with spatial resolution comparable to that of ThunderSTORM. However, the achieved execution speed of QC-STORM is still not fast enough to enable real-time multi-emitter localization for raw images from popular sCMOS cameras working at full frame rate, and a further increase in execution speed would require expensive upgrades to the workstation and graphics card.

Also in 2019, Munro and co-workers used a high-performance computing cluster to accelerate SRLM data processing [14]. However, due to the high cost and the need for experienced users, this approach is unlikely to see widespread use. Recently, Gaire and co-workers proposed using blind sparse inpainting to reduce the number of raw images needed to reconstruct a 3D super-resolution image [20]. It would be interesting to see the benefits of combining this sparse inpainting technology with multi-emitter localization methods for high-throughput SRLM.

Another way to improve the imaging throughput of SRLM is deep learning, where a super-resolution image is obtained directly from a much smaller number of raw images than that used by the multi-emitter localization approach [15–17]. For example, a deep learning method called Deep-STORM [16] uses a deep convolutional neural network to achieve three orders of magnitude faster speed than traditional localization-based methods. The deep learning approach has also made good progress in multi-color SRLM [18] and 3D SRLM [19]. However, deep learning methods are usually limited by the size of the neural network and the available memory of commercial GPUs, and have not been reported to achieve real-time image processing for sCMOS-based SRLM, especially when the sCMOS camera works at full field of view and frame rate. Therefore, we believe that exploring a new way to accelerate multi-emitter localization at large field of view and full frame rate is still highly desirable.

We examined the time cost of every image processing step in QC-STORM, and found that the GPU acceleration in QC-STORM is effective in localization and rendering, but not in the pre-processing steps (which include Denoising and background removal, Fluorescent spot identification, and ROI extraction). Therefore, to further accelerate QC-STORM, a typical MLE-based multi-emitter localization method with excellent localization precision, we need a better way to speed up these image pre-processing steps. Previously, we used the field-programmable gate array (FPGA) chip inside a Hamamatsu sCMOS camera to identify and export the regions of interest (ROIs) containing emitter signal, so that the data flow can be significantly reduced with negligible information loss [21]. This example clearly shows the power of FPGA programming in sCMOS-based SRLM, at least in reducing the pressure of heavy data flow from sCMOS cameras. However, it is not easy to access the internal FPGA chip of a commercial sCMOS camera due to limitations imposed by the camera manufacturer, the computation resources of the internal FPGA chip may not be sufficient to implement the pre-processing steps in QC-STORM, and external memory would be needed to coordinate the internal FPGA chip with the localization algorithm. Nevertheless, we concluded that FPGA-GPU cooperative computation would be promising for accelerating multi-emitter localization.

In this paper, we propose a heterogeneous computation platform (HCP), which includes a multicore CPU, a general-purpose GPU card, and a customized external FPGA board, to accelerate multi-emitter localization with minimal reduction in localization precision. We implemented QC-STORM [12] on this heterogeneous computation platform, and call the resulting method HCP-STORM. In this method, we accelerate the localization speed with FPGA-GPU cooperative computation, and ensure the localization precision with MLE-based fitting. The image processing steps in QC-STORM were decomposed into the pre-processing steps mentioned above and several post-processing steps (including Emitter localization, Statistics, and Rendering). The pre-processing steps were implemented on the external FPGA, and the post-processing steps were implemented on the GPU. Using simulated and experimental image datasets, we verified that implementing the pre-processing steps on the FPGA can effectively improve the data throughput, and that HCP-STORM achieves localization precision and recall rates comparable to ThunderSTORM and QC-STORM, while the overall execution time of HCP-STORM is more than an order of magnitude shorter than that of QC-STORM. Furthermore, we found that HCP-STORM can process a raw image of 512 × 512 pixels within 0.6∼0.7 ms, which is sufficient to enable real-time processing of an FOV of 2048 × 2048 pixels within a 10 ms exposure time.

2. Methods

2.1 Heterogeneous computing platform for FPGA-GPU cooperative computation

The sCMOS camera in existing SRLM systems is typically connected to a data acquisition card (also called a frame grabber) that is usually installed in a PCIe slot of a personal computer (PC). In this paper, we designed and inserted a repeater between the sCMOS camera in our system and the corresponding data acquisition card (Fig. 1(a)), so that a copy of the raw images can be obtained and sent to an external FPGA board for pre-processing, without affecting the original data flow control. In this way, we built a heterogeneous computing platform (HCP) with three different computation devices (FPGA, GPU, and CPU). Then, we optimally mapped a multi-emitter localization method (QC-STORM) onto the platform (Fig. 1(b)), with the goal of benefiting from FPGA-GPU cooperative computation. Specifically, on the external FPGA board, we executed three pre-processing steps: Denoising & background removal, Fluorescent spot identification, and ROI extraction & classification. Note that in a traditional multi-emitter localization algorithm with GPU acceleration, these pre-processing steps are usually executed on the GPU (Fig. 1(c)), and ROI classification is rarely used in the pre-processing stage. Next, the classified ROIs were transmitted to the memory of the PC through different FIFOs (first-in, first-out buffers) and a USB 3.0 interface. Later, the ROIs in memory were copied to the GPU and processed with single-emitter or multi-emitter localization to obtain localization coordinates. Finally, a super-resolution image was rendered on the GPU. Note that the steps implemented on the GPU are called post-processing steps in this paper, and that the time-consuming pre-processing steps were all performed on the FPGA, which improves the overall execution speed.
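
To make the data path concrete, here is a minimal host-side sketch of the producer-consumer flow in Fig. 1(b), written in Python for clarity. The device object and its read_roi_packet() method are hypothetical stand-ins for the board's USB 3.0 driver, not a real API; the real pre-processing runs in FPGA logic and the real fitting runs in CUDA kernels.

```python
# Minimal sketch, assuming a hypothetical USB 3.0 driver object `dev`.
import queue
import threading

roi_queue = queue.Queue(maxsize=64)  # buffers classified-ROI packets between USB and GPU

def usb_reader(dev):
    """Producer: drain classified-ROI packets from the FPGA's USB 3.0 FIFOs."""
    while True:
        packet = dev.read_roi_packet()   # hypothetical call; blocks until data is ready
        roi_queue.put(packet)
        if packet is None:               # end-of-acquisition marker
            return

def gpu_worker(localizer):
    """Consumer: dispatch ROIs to single- or multi-emitter MLE fitting on the GPU."""
    while True:
        packet = roi_queue.get()
        if packet is None:
            return
        localizer.process(packet)        # e.g., MLE fitting + rendering

def run(dev, localizer):
    t = threading.Thread(target=usb_reader, args=(dev,), daemon=True)
    t.start()
    gpu_worker(localizer)                # consume on the main thread until end-of-stream
    t.join()
```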

Fig. 1. Heterogeneous computing platform for processing raw images in SRLM. (a) The system configuration. The decoder and encoder were designed to enable the Full Camera Link interface. (b) The data processing steps in HCP-STORM. (c) The data processing steps in QC-STORM.


2.2 Pre-processing steps for extracting ROIs from raw images

We examined the ROI extraction methods used in two MLE-based localization algorithms, MaLiang [10] and QC-STORM [12], and realized that, after ROI extraction, we could add a further step called ROI classification to increase the execution speed of multi-emitter localization. Therefore, after considering the capabilities of FPGA-based pipeline processing, we propose an FPGA-based ROI extraction method. The main characteristics of this new method are: 1) using an external FPGA chip to reduce the transmission and storage volume of raw images; 2) using an annular filter to remove non-uniform background; 3) classifying the extracted ROIs for single-emitter and multi-emitter localization; and 4) eliminating the influence of overlapping emitters on background threshold determination.

The workflow of the FPGA-based ROI extraction method is shown in Fig. 2, which includes all of the pre-processing steps implemented in the FPGA. First, we used a 3×3 Gaussian low-pass filter to denoise the raw image and a 5×5 annular filter to remove the background. Next, we used a standard deviation filter (see Section 2.3 for details) to calculate the background fluctuation intensity, which serves as the threshold for local maxima identification. Then, we identified fluorescent spots whose values exceed the threshold (called local maxima) in the denoised image, and extracted ROIs of 7×7 or 11×11 pixels from the raw image. When the distance between two local maxima is larger than 7 pixels, we concluded that the current ROI contains only one emitter, and a region of 7×7 pixels is sufficient to extract the ROI effectively. Otherwise, we used a region of 11×11 pixels to extract the ROI and marked it as a multi-emitter ROI. This treatment helps reduce the execution time of the subsequent MLE-based emitter localization.
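
The following NumPy/SciPy sketch summarizes this pipeline algorithmically. The filter kernels and the threshold factor k are plausible choices for illustration, not the exact coefficients used on the board, and the plain local standard deviation stands in for the modified estimator of Section 2.3.

```python
# Algorithmic sketch of the pre-processing pipeline (implemented in FPGA logic in the paper).
import numpy as np
from scipy import ndimage

GAUSS_3x3 = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], float) / 16.0
# 5x5 annular filter: the average of the outer ring estimates local background.
ANNULAR_5x5 = np.ones((5, 5), float)
ANNULAR_5x5[1:4, 1:4] = 0.0
ANNULAR_5x5 /= ANNULAR_5x5.sum()

def preprocess(raw):
    smoothed = ndimage.convolve(raw.astype(float), GAUSS_3x3, mode="nearest")
    background = ndimage.convolve(raw.astype(float), ANNULAR_5x5, mode="nearest")
    signal = smoothed - background                    # denoised, background-removed
    # Background fluctuation used as the detection threshold; the FPGA uses the
    # modified trimmed estimator described in Section 2.3 instead of this one.
    local_mean = ndimage.uniform_filter(signal, 7)
    local_sq = ndimage.uniform_filter(signal**2, 7)
    sigma = np.sqrt(np.maximum(local_sq - local_mean**2, 0.0))
    return signal, sigma

def extract_rois(raw, signal, sigma, k=4.0):
    """Find local maxima above k*sigma and cut 7x7 / 11x11 ROIs from the raw frame."""
    maxima = (signal == ndimage.maximum_filter(signal, 3)) & (signal > k * sigma)
    ys, xs = np.nonzero(maxima)
    rois = []
    for y, x in zip(ys, xs):
        # Multi-emitter if another maximum lies within 7 pixels (Section 2.2).
        crowded = np.any((np.abs(ys - y) <= 7) & (np.abs(xs - x) <= 7)
                         & ~((ys == y) & (xs == x)))
        half = 5 if crowded else 3                    # 11x11 or 7x7 ROI
        if half <= y < raw.shape[0] - half and half <= x < raw.shape[1] - half:
            rois.append((raw[y-half:y+half+1, x-half:x+half+1], crowded))
    return rois
```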


Fig. 2. FPGA implementation of image filtering and ROI extraction.


2.3 FPGA implementation of the pre-processing steps

The FPGA hardware for implementing the pre-processing steps was based on a customized FPGA board, which mainly comprises a Xilinx Kintex-7 FPGA chip (model: xc7k325t), a pair of Full Camera Link interfaces, a USB 3.0 interface, and several user input/output interfaces. This FPGA board was specially designed to work with a popular sCMOS camera (Hamamatsu Flash 4.0 V3), where Full Camera Link is used to connect the camera to the associated data acquisition card in a PC. This sCMOS camera, like many other sCMOS cameras, commonly operates in rolling shutter mode when used in SRLM. In this mode, the camera sensor is virtually split into two horizontal halves, and a raw image is read out row after row through a Full Camera Link interface (working in 80-bit (Deca) mode, five pixels per clock), from the center toward the two edges in an interleaved manner. Therefore, extra effort is needed to send an image from the camera to the FPGA board. Specifically, according to the parity of the row number, we assigned each row to an upper image or a lower image. We then allocated two identical circuit modules to process the upper and lower images, and finally sent the identified ROIs to the memory of the PC through the USB 3.0 interface. Additionally, we noticed that a localization algorithm in SRLM usually cannot handle several margin rows at the edge of the upper or lower image, due to the incomplete fluorescence emission pattern there, which would result in dark stripes (that is, areas without data points) in the final super-resolution image. Therefore, we discarded six rows at the top of the upper image, and generated six new rows at the bottom of the upper image using the top rows of the corresponding lower image. We applied a similar treatment to the lower image, that is, we discarded its six bottom rows and added six new rows at its top. In this way, we not only eliminated the dark stripes, but also maintained the frame rate, because the image sizes were kept the same before and after the treatment.
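
The row handling described above can be summarized with a short sketch (Python, for illustration only; the real logic is FPGA circuitry operating on streaming rows, and the parity convention here is one plausible reading of the interleaved readout).

```python
# Sketch of rolling-shutter row de-interleaving and the 6-row margin treatment.
import numpy as np

MARGIN = 6  # rows a localization algorithm cannot use at a half-image edge

def split_halves(interleaved_rows):
    """Assign alternating rows to the upper / lower half-image by row parity."""
    upper, lower = [], []
    for i, row in enumerate(interleaved_rows):
        (upper if i % 2 == 0 else lower).append(row)
    return np.array(upper), np.array(lower)

def patch_margins(upper, lower):
    """Drop 6 unusable edge rows from each half and refill from the adjacent half,
    so the half-image sizes (and hence the frame rate) are unchanged."""
    upper_p = np.vstack([upper[MARGIN:], lower[:MARGIN]])    # drop top, borrow lower's top rows
    lower_p = np.vstack([upper[-MARGIN:], lower[:-MARGIN]])  # drop bottom, borrow upper's bottom rows
    return upper_p, lower_p
```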

The FPGA implementation of the pre-processing steps was divided into two main circuit modules: image filtering and ROI extraction. The input clock had a frequency of 85 MHz, and data were read out in 80-bit (Deca) mode through the Full Camera Link interface. The clock-driven image lines flowed automatically through the generated hardware, and the image filtering module was executed during image transmission. The FPGA implementation of image filtering included three steps: (1) a 3×3 Gaussian low-pass filter was used to smooth the raw image; (2) a 5×5 annular filter was used to remove background; and (3) a 7×7 standard deviation filter was used to calculate a background fluctuation image. The Gaussian low-pass filter and the annular filter were realized using a standard image convolution process, while the standard deviation filter was modified to reduce the computation load on the FPGA chip. To apply the standard deviation filter to a sub-region of 7×7 pixels, we sorted the 24 pixel values in the outermost ring and selected the 12 smallest values. We then selected the middle 8 of these 12 values, and calculated their mean and standard deviation. The former represents the mean background intensity, and the latter represents the background fluctuation in this sub-region.
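
In algorithmic form, the modified estimator works as follows; this NumPy sketch mirrors the sort-and-trim description above but does not reproduce the FPGA sorting network itself. Trimming the brightest ring values keeps overlapping emitters from inflating the background estimate.

```python
# Sketch of the modified 7x7 standard-deviation filter (trimmed outer-ring estimator).
import numpy as np

def background_stats_7x7(window):
    """window: 7x7 array. Returns (mean background, background fluctuation)."""
    assert window.shape == (7, 7)
    ring = np.concatenate([window[0, :], window[-1, :],        # top and bottom rows
                           window[1:-1, 0], window[1:-1, -1]]) # left and right columns
    assert ring.size == 24                                     # outermost ring of a 7x7 window
    smallest12 = np.sort(ring)[:12]     # 12 smallest ring values (rejects emitter pixels)
    middle8 = smallest12[2:10]          # middle 8 of the 12 (trims both extremes)
    return middle8.mean(), middle8.std()
```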

In the ROI extraction module, local maxima were calculated to identify ROIs. The identified ROIs were then extracted with different sizes and sent to a GPU device for further processing, as shown in the RTL (register-transfer level) schematic diagram (Fig. 3). To judge whether a candidate local maximum marks a valid emitter, we used the following criteria: 1) the center pixel of the central 3×3 pixels has the maximum value; 2) the intensity of the center pixel is two times larger than the mean background intensity; 3) the sum of the center pixel and its 8-neighborhood pixels is 11 times larger than the mean background intensity; and 4) the sum of the center pixel and its 4-neighborhood pixels is 9 times larger than the mean background intensity. The distances between the local maxima that satisfy these criteria were then calculated and used to classify each ROI as single-emitter or multi-emitter (see Section 2.2). We note that the parameters used in the ROI extraction module were tuned on simulated images to achieve a high recall rate and a low false detection rate, and that under the current implementation, about 91% of the FPGA logic resources were used for the pre-processing steps.
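
The four criteria translate directly into comparisons against the mean background intensity, as sketched below; "two times larger" and the other factors are implemented here as simple multiplicative thresholds, which is one plausible reading of the paper's wording.

```python
# Sketch of the four ROI-acceptance criteria for a candidate local maximum.
import numpy as np

def is_valid_maximum(patch3x3, mean_bg):
    """patch3x3: the central 3x3 pixels around a candidate local maximum."""
    center = patch3x3[1, 1]
    crit1 = center == patch3x3.max()             # 1) center is the maximum of the 3x3
    crit2 = center > 2.0 * mean_bg               # 2) center bright enough
    crit3 = patch3x3.sum() > 11.0 * mean_bg      # 3) center + 8-neighborhood sum
    cross = center + patch3x3[0, 1] + patch3x3[2, 1] \
                   + patch3x3[1, 0] + patch3x3[1, 2]
    crit4 = cross > 9.0 * mean_bg                # 4) center + 4-neighborhood sum
    return crit1 and crit2 and crit3 and crit4
```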


Fig. 3. RTL schematic diagram of the pre-processing steps in FPGA.


2.4 GPU implementation of the post-processing steps

As seen in Fig. 1(b), the post-processing steps include single-emitter localization, multi-emitter localization, rendering, and statistics analysis. The last two steps are easy to implement because their treatments are well understood: Gaussian rendering is widely employed in the rendering step [22], and the statistics analysis collects important localization parameters, such as XYZ coordinates, total emission intensity, and background (see Section 3.1 of our previous paper [23]). However, suitable localization algorithms must be carefully selected for single emitters and multiple emitters. Among the numerous reported algorithms for molecule localization, it is widely accepted that MLE-based localization algorithms achieve the highest localization precision [8]. In the past decade, researchers have used GPU computation to accelerate the execution of MLE-based localization algorithms. However, these algorithms still slow down as the activation density increases, since a more complex and time-consuming fitting model is required at higher activation densities. In this paper, given the ROI classification in the pre-processing steps, we decided to combine the FPGA-based pre-processing with a GPU-based multi-emitter localization method called QC-STORM [12], because the classified ROIs map naturally onto the several MLE-based localization algorithms in QC-STORM: MLEbfgs (sparse emitter), MLEwt (sparse emitter with signal contamination), MLE2e (two emitters), and MLE3e (three emitters).
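
Conceptually, the classified ROIs select among QC-STORM's fitters as sketched below. The fitter functions are empty stubs named after QC-STORM's kernels and do not reflect their real CUDA interfaces; the routing logic is the point of the sketch.

```python
# Sketch of ROI-to-fitter dispatch; stubs stand in for QC-STORM's GPU kernels.
def mle_bfgs(roi): ...   # single sparse emitter (stub for MLEbfgs)
def mle_wt(roi): ...     # single emitter with signal contamination (stub for MLEwt)
def mle_2e(roi): ...     # two overlapping emitters (stub for MLE2e)
def mle_3e(roi): ...     # three overlapping emitters (stub for MLE3e)

def localize(roi, is_multi, contaminated=False, crowd=2):
    """Route a classified ROI to the cheapest adequate MLE model."""
    if not is_multi:                                    # 7x7 ROI from the FPGA
        return mle_wt(roi) if contaminated else mle_bfgs(roi)
    return mle_2e(roi) if crowd <= 2 else mle_3e(roi)   # 11x11 multi-emitter ROI
```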

It is worth pointing out that the PC periodically checks the status of the FIFOs and reads out a FIFO when it is half-full (> 40 KB). In this way, the classified ROIs in different FIFOs are transmitted to the memory of the PC, and the GPU reads the memory and determines which localization algorithm in QC-STORM should be used. Because we did not modify the mathematics of the localization algorithms in QC-STORM, the localization precision of QC-STORM is preserved. However, care must be taken in coordinating the double FIFOs with the PC, especially for the last frame, where the data volume may be less than half the FIFO size.
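
A host-side readout loop consistent with this description might look as follows. The device object and its methods are hypothetical names for illustration; only the half-full threshold of 40 KB and the last-frame flush come from the text.

```python
# Sketch of the half-full FIFO polling with a last-frame flush (hypothetical driver API).
HALF_FULL = 40 * 1024  # bytes

def drain_fifos(dev, fifos, last_frame_done):
    packets = []
    for f in fifos:
        level = dev.read_fifo_status(f)        # hypothetical: bytes currently buffered
        # Normally wait until half-full to amortize USB transfer overhead, but
        # flush whatever remains once the final frame has been pre-processed.
        if level >= HALF_FULL or (last_frame_done and level > 0):
            packets.append(dev.read_fifo(f, level))  # hypothetical bulk read
    return packets
```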

2.5 Image simulation and algorithm evaluation

We simulated a series of image datasets with different activation densities. Each dataset contained a total of 200 images, with an FOV of 256×256 pixels and a pixel size of 100 nm. Fluorescent molecules were randomly distributed in the images, and the number of fluorescent molecules in each image was equal to the activation density multiplied by the image area. Following previous reports [12,24], the total number of signal photons from each fluorescent molecule followed a lognormal distribution with a mean of 5,000 photons and a standard deviation of 2,000 photons, and the background photons followed a Poisson distribution with a mean of 200 photons. We assumed the images were detected by an sCMOS camera (Hamamatsu Flash 4.0 V3) with 0.77 quantum efficiency at a 670 nm wavelength and 1.6 e− readout noise. The full width at half maximum (FWHM) of the Gaussian point spread function (PSF) was 200 nm.
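
For reference, a NumPy sketch of this simulation is given below. It uses a simplified sCMOS noise model (Poisson shot noise on QE-scaled photons plus Gaussian read noise, no per-pixel gain or offset map) and an unoptimized per-molecule loop.

```python
# Sketch of the frame simulation described above (simplified sCMOS noise model).
import numpy as np

rng = np.random.default_rng(0)
H = W = 256                              # pixels
PIXEL_NM, FWHM_NM = 100.0, 200.0
SIGMA_PX = FWHM_NM / 2.355 / PIXEL_NM    # Gaussian PSF sigma in pixels
QE, READ_NOISE, BG = 0.77, 1.6, 200.0    # quantum efficiency, e- rms, mean bg photons

def simulate_frame(density_per_um2):
    area_um2 = (H * PIXEL_NM / 1000.0) * (W * PIXEL_NM / 1000.0)
    n = int(round(density_per_um2 * area_um2))           # molecules per frame
    xs, ys = rng.uniform(0, W, n), rng.uniform(0, H, n)
    # Lognormal photon counts with mean 5000 and std 2000 photons.
    mu = np.log(5000**2 / np.sqrt(5000**2 + 2000**2))
    s = np.sqrt(np.log(1 + (2000 / 5000) ** 2))
    photons = rng.lognormal(mu, s, n)
    img = np.full((H, W), BG)                            # expected photons per pixel
    yy, xx = np.mgrid[0:H, 0:W]
    for x, y, p in zip(xs, ys, photons):                 # add each normalized Gaussian PSF
        img += p * np.exp(-((xx - x)**2 + (yy - y)**2) / (2 * SIGMA_PX**2)) \
               / (2 * np.pi * SIGMA_PX**2)
    # Detection: Poisson shot noise on photoelectrons, then Gaussian read noise.
    return rng.poisson(img * QE) + rng.normal(0, READ_NOISE, (H, W))
```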

We compared the performance of HCP-STORM with that of QC-STORM and ThunderSTORM. We used the data reduction rate and the recall rate to quantify the performance of the pre-processing steps. The data reduction rate was calculated as (Size_raw − Size_ROIs) / Size_raw, where Size_ROIs is the summed size of all extracted ROIs. The recall rate was defined as the number of correctly identified ROIs divided by the total number of ROIs. We used root-mean-square error (RMSE), Jaccard index, and run time to evaluate the overall performance of the three methods on the simulated datasets, and used the number of localization points and the Fourier ring correlation (FRC) resolution [25,26] to evaluate the overall performance on experimental data. Note that RMSE is normally used to represent localization accuracy, and the Jaccard index is used to quantify the detection rate [7]. Before calculating the FRC resolution, we detected and merged localizations in consecutive frames into a single localization, using a distance threshold of 50 nm, so that repeated localizations from the same emitter were eliminated. The number of localization points and the FRC resolution were calculated at different densities. The evaluations were performed on a commercial PC equipped with a 6-core, 12-thread CPU (Intel Core i7-8700), 32 GB memory, and a graphics card (Nvidia TITAN Xp, with 3840 CUDA cores and 12 GB memory).
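
The headline metrics can be written compactly, as below. The sketch uses a greedy nearest-neighbour matching with the 100 nm threshold mentioned in Section 3.4; benchmarking papers sometimes use optimal assignment instead, so treat this as an approximation.

```python
# Sketch of the evaluation metrics (greedy matching, non-empty position arrays assumed).
import numpy as np

def data_reduction_rate(raw_bytes, roi_bytes):
    # (Size_raw - Size_ROIs) / Size_raw; negative when the ROIs exceed the raw frame.
    return (raw_bytes - roi_bytes) / raw_bytes

def match_and_score(found, truth, tol_nm=100.0):
    """found, truth: (N,2) arrays of xy positions in nm. Returns (Jaccard, RMSE)."""
    used = np.zeros(len(truth), bool)
    errs = []
    for p in np.asarray(found):
        d = np.linalg.norm(truth - p, axis=1)
        d[used] = np.inf                 # each ground-truth emitter matched at most once
        j = np.argmin(d)
        if d[j] <= tol_nm:
            used[j] = True
            errs.append(d[j])
    tp = len(errs)
    jaccard = tp / (len(found) + len(truth) - tp)
    rmse = float(np.sqrt(np.mean(np.square(errs)))) if errs else float("nan")
    return jaccard, rmse
```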

2.6 Super-resolution fluorescence imaging experiments

Microtubules in fixed COS-7 cells were immunolabelled with primary antibodies and with secondary antibodies conjugated to Alexa Fluor 647, and imaged on a home-built SRLM system (Fig. 4), which mainly consists of an Olympus IX73 inverted microscope, a 640 nm excitation laser (∼7 kW/cm2), a 405 nm activation laser, an Olympus 60×/NA 1.42 oil-immersion objective, and a Hamamatsu Flash 4.0 V3 sCMOS camera. Both lasers were purchased from LaserWave, China, and were coupled into the microscope through a multi-mode fiber combiner [3]. A standard STORM buffer, consisting of 50 mM Tris (pH 8.0), 10 mM NaCl, 10% (w/v) glucose, 100 mM mercaptoethylamine, 500 µg/mL glucose oxidase, and 40 µg/mL catalase, was used. The sample was placed on a stage for imaging. The fluorescence emission collected by the objective was separated from the lasers with a dichroic mirror (ZT488/532/633/830/1064rpc, Chroma, USA), passed through emission filters (FF01-680/42, NF03-405/488/532/635E, Semrock, USA), and finally detected by the sCMOS camera (Flash 4.0 V3, Hamamatsu, Japan).


Fig. 4. The home-built SRLM system with FPGA-GPU cooperative computation.


The images acquired by the camera were pre-processed in the FPGA, and the extracted ROIs were passed through memory to the GPU for localization and rendering. Under the control of the CPU, the final super-resolution image was displayed and stored on the hard disk. Due to the resource limits of the FPGA chip, imaging was performed with an FOV of 512 × 512 pixels and an exposure time of 10 ms. We also duplicated the acquired images from the camera (via the FPGA board) and saved them directly to the hard disk.

3. Results and discussion

3.1 Comparing the time costs in different image processing steps of QC-STORM

Using the simulated images described in Section 2.5, we compared the time costs of the different image processing steps in QC-STORM. The overall execution times were calculated from a total of 100 simulated images, each with 256×256 pixels, at different activation densities (from 0.1 to 3 µm−2). The time costs of the different image processing steps are shown in Fig. 5(a-c). Note that the measured times depend on the computation ability of the PC. We found that the overall execution time increases significantly with higher activation density (Fig. 5). We also found that, as the activation density increases, the time spent on both localization and ROI extraction increases rapidly. In fact, at an activation density of 3 µm−2, the combined time of localization and ROI extraction accounts for over 70% of the overall execution time (Fig. 5(c)). Therefore, to reduce the overall execution time, we need not only a faster multi-emitter localization algorithm but also a better way to perform ROI extraction. Additionally, it is important to minimize the time spent on data access (see the time cost labeled in purple in Fig. 5), especially in the low activation density case (0.1 µm−2), where more than half of the time is spent on data access (Fig. 5(a)). In our FPGA-based ROI extraction method, the time cost of data access was minimized using line buffers.


Fig. 5. The time costs of different image processing steps in QC-STORM. (a) The time costs at an activation density of 0.1 µm−2. (b) The time costs at an activation density of 1 µm−2. (c) The time costs at an activation density of 3 µm−2. The activation densities and the overall image processing times are shown in the lower left corner. Here the overall execution times were calculated from a total of 100 simulated images with 256×256 pixels in each image. Note that the calculated time depends on the computation ability of the PC.


3.2 Evaluating the pre-processing performance using simulated images

We evaluated the performance of the FPGA-based ROI extraction method in processing simulated images at different activation densities. The evaluation was performed in a test environment with a Xilinx Kintex-7 board connected to a PC through a USB 3.0 interface. The simulated images were delivered to the FPGA chip from the PC instead of from a camera, and the parameters described in Section 2.5 were calculated. As seen in Fig. 6(a), as the activation density increases, the number of single-emitter ROIs (7×7 pixels) increases at first, reaches a maximum at an activation density of approximately 1.2 µm−2, and then decreases slowly. In contrast, the number of multi-emitter ROIs (11×11 pixels) keeps growing as the activation density increases. When the activation density is higher than 2 µm−2, the extracted ROIs mainly contain multiple emitters.


Fig. 6. The performance of FPGA-based pre-processing. (a) The number of extracted ROIs at different activation densities. (b) Data reduction rate at different activation densities. (c) Recall rates at different activation densities.


On the other hand, if we extract ROIs with a fixed size of 7×7 pixels and then transmit the extracted ROIs instead of the whole raw image, we can effectively reduce the data flow when the activation density is lower than 2.5 µm−2. Above this density threshold, the extraction strategy is no longer effective for this ROI size, as indicated by the negative data reduction rates (Fig. 6(b), blue data points). However, if we extract ROIs with a combination of 7×7 pixels and 11×11 pixels (Fig. 6(b), red data points), this limitation disappears, at least for activation densities of up to 4 µm−2; a back-of-envelope check of the crossover is sketched below.
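
The crossover density is plausible from simple arithmetic: with 100 nm pixels, a 256×256 frame covers 655.36 µm², and each fixed 7×7 ROI carries 49 pixels, so the extracted pixels match the raw frame size at roughly 2 µm−2. Since closely spaced emitters can share an ROI (and this estimate ignores per-ROI metadata), the measured crossover lands somewhat higher, near 2.5 µm−2.

```python
# Back-of-envelope crossover density for fixed 7x7 ROI extraction.
H = W = 256
area_um2 = (H * 0.1) * (W * 0.1)        # 100 nm pixels -> 655.36 um^2
crossover = (H * W) / (49 * area_um2)   # raw pixels / (ROI pixels per emitter * area)
print(f"naive crossover density: {crossover:.1f} um^-2")  # ~2.0 um^-2
```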

Furthermore, we compared the recall rates of the three methods (Fig. 6(c)), and found that the FPGA-based ROI extraction method exhibits a recall rate comparable to that of QC-STORM, and significantly better than that of ThunderSTORM when the activation density is higher than 1 µm−2.

3.3 Evaluating the pre-processing performance using experimental images

We evaluated the pre-processing performance of the FPGA-based ROI extraction method on experimental images (see details in Section 2.6). We show an overlay of all raw images in Fig. 7(a), and an overlay of all extracted ROIs in Fig. 7(b). The data volume was reduced significantly, from 32 MB for all raw images to 0.5 MB for the ROIs, a data reduction of ∼60 times. We further examined a representative raw image (Fig. 7(c)) and the ROIs extracted from it (Fig. 7(d)), and found that not all emitters were extracted, probably due to low SNR or uneven background.


Fig. 7. The pre-processing performance of the FPGA-based ROI extraction method using experimental images. (a) Overlay of 1000 raw image frames. (b) Overlay of all the extracted ROIs from the 1000 image frames. (c) A representative raw image frame. (d) The extracted ROIs from the representative raw image frame shown in (c). (e) Comparing the identified ROIs and the ground truth. The data were from (c) and (d). The red circles are for ground truth, the green squares are for the extracted ROIs with 7×7 pixels, and the blue squares are for the extracted ROIs with 11×11 pixels.


3.4 Evaluating the overall performance using simulated images

Based on the simulated images described in Section 2.5, we compared the overall performance of HCP-STORM with that of QC-STORM, using RMSE (for localization precision), Jaccard index (for detection rate), and overall execution time (i.e., image processing speed). The distance threshold for calculating the Jaccard index was 100 nm. Compared to QC-STORM, HCP-STORM achieves comparable localization precision (Fig. 8(a)), but suffers a degradation in detection rate (Fig. 8(b)). This degradation is probably due to the simplified pre-processing procedures, in which line buffers were used instead of frame buffers; line buffers cannot support more advanced image processing algorithms. However, the overall execution time of HCP-STORM is 4∼15 times shorter than that of QC-STORM. For activation densities up to 4 µm−2, HCP-STORM requires at most 0.15 ms to finish all data processing steps, compared to about 2 ms for QC-STORM (Fig. 8(c)).

Fig. 8. Comparison of the overall performance of HCP-STORM and QC-STORM in processing simulated images at different activation densities. (a) RMSE versus activation density. (b) Jaccard index versus activation density. (c) Run time versus activation density. (d) Pre-processing time versus activation density. Note that the times in (c) and (d) are for one simulated image with 256×256 pixels, and the calculated values depend on the computation ability of the PC. The dotted green line in (c) indicates a run time of 0.15 ms.


Additionally, the pre-processing time of HCP-STORM is constant (0.061 ms) across all activation densities, whereas the pre-processing time of QC-STORM increases notably with activation density (Fig. 8(d)). From these results, we estimate that the overall execution time of HCP-STORM (0.15 ms @ 256×256 pixels, corresponding to 9.6 ms @ 2048×2048 pixels) is fast enough to enable real-time processing for an sCMOS camera working at full data rate (that is, 2048×2048 pixels at 100 frames per second (fps)). The RMSE and recall rate are both acceptable for activation densities of up to 2.5 µm−2. Note that the overall execution times of the two methods were measured on a low-end PC rather than on a powerful workstation (which was used in our previous QC-STORM paper [12]), and that the speed gain of HCP-STORM over QC-STORM is probably due to several factors, including but not limited to a minimized and constant pre-processing time, simultaneous FPGA and GPU acceleration, and fewer identified ROIs.

3.5 Evaluating the overall performance using experimental images

We compared the overall performance of HCP-STORM with that of QC-STORM and ThunderSTORM, using image quality (quantified by the total number of localization points, the FRC resolution, and line profiles) and image processing speed. The super-resolution image shown in Fig. 9(a) was reconstructed from 10,000 raw images of 512 × 512 pixels each. The total image processing times of the different localization methods are shown in the upper right corner of Fig. 9(a). We found that HCP-STORM is about 25 times faster than QC-STORM, and 295 times faster than ThunderSTORM. Specifically, HCP-STORM requires only 6.1 s to process all 10,000 raw images of 512 × 512 pixels, corresponding to 0.61 ms per raw image. Therefore, for this kind of SRLM experiment with sparse biological structures, about 9.76 ms would be needed to process a raw image with 2048 × 2048 pixels. This execution speed is sufficient to enable real-time data processing for the representative Hamamatsu Flash 4.0 V3 sCMOS camera working at full data rate (2048 × 2048 pixels @ 100 fps).

Fig. 9. Evaluation of the overall performance of the three methods using experimental images. (a) A super-resolution image reconstructed from a total of 10,000 raw images with 512 × 512 pixels in each image. The images were processed by HCP-STORM. The overall image processing times are shown in the top-right corner. (b) Super-resolution images from different localization methods and a representative raw image of the boxed region i) in (a). The FRC resolution and the total number of localization points are shown in the lower-left corners. (c) Projected line profiles of the marked areas in (b). (d) Super-resolution images from different localization methods and a representative raw image of the boxed region ii) in (a). (e) Projected line profiles of the marked areas in (d).


Further analysis of two enlarged areas (Fig. 9(b)–9(e)) shows that the image quality of HCP-STORM is not as good as that of QC-STORM and ThunderSTORM. Since the localization algorithms of HCP-STORM are the same as those in QC-STORM, the image quality degradation in HCP-STORM mainly results from the reduced number of localization points. As seen in Fig. 9(b), HCP-STORM localizes about 18% fewer molecules (1.4×10^4 localizations) than QC-STORM (1.7×10^4 localizations). Similar behavior is found in Fig. 9(d). To solve this low recall rate problem, the current ROI extraction method should be upgraded to handle the complex situations found in experimental images (including, but not limited to, low SNR and uneven fluorescence background). However, due to the limited resources of the current FPGA chip (91% of whose logic resources are already used), we could not implement a more complicated ROI extraction method on this chip. We therefore plan to integrate an FPGA chip with greater resources into our computation platform in the near future.

4. Conclusion

In this paper, we proposed a method called HCP-STORM for real-time multi-emitter localization in SRLM. This method takes advantage of FPGA-GPU cooperative computation to accelerate multi-emitter localization, and relies on a series of MLE-based localization algorithms from a recently reported multi-emitter localization method called QC-STORM to ensure localization precision. Using both simulated and experimental images, we verified that an FPGA implementation of the pre-processing steps in QC-STORM can significantly increase the overall data processing speed, with only a small degradation in image quality. In fact, we demonstrated that HCP-STORM running on a low-end PC can process a simulated raw image of 256×256 pixels within 0.15 ms, compared with ∼2 ms for QC-STORM. In this case, the RMSE and recall rate of HCP-STORM remain acceptable for activation densities of up to 2.5 µm−2. We further showed that HCP-STORM requires only 0.61 ms to finish all image processing procedures for an experimental raw image of 512 × 512 pixels, which is fast enough to enable real-time image processing for SRLM with a popular Hamamatsu Flash 4.0 V3 sCMOS camera working at full data rate (2048 × 2048 pixels @ 100 fps), with a small drop in the quality of the super-resolution images. We believe that this drawback can be removed by increasing the computation resources of the FPGA chip in the near future.

Furthermore, the application of FPGA-GPU cooperative computation could be extended to 3D SRLM or multi-color SRLM, provided sufficient FPGA resources are available for the required image processing (for example, a larger filter convolution kernel). Taking all the findings in this paper together, we conclude that FPGA-GPU cooperative computation will provide new opportunities for real-time data processing in HT-SRLM, without resorting to expensive workstations and graphics cards for GPU computation.

Funding

National Natural Science Foundation of China (81827901, 82160345); Key research and development program of Hainan province (ZDYF2021GXJS017); Fundamental Research Funds for the Central Universities (2018KFYXKJC039); Natural Science Foundation of Hainan Province (620RC558); Natural Science Foundation Project of CQCSTC (cstc2018jcyjAX0398); Start-up Fund from Hainan University (KYQD(ZR)20022, KYQD(ZR)-20077).

Acknowledgments

We thank the Optical Bioimaging Core Facility of WNLO-HUST for technical support.

Disclosures

Chinese patent (No. 201710089310.9).

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. R. Pepperkok and J. Ellenberg, “High-throughput fluorescence microscopy for systems biology,” Nat. Rev. Mol. Cell Biol. 7(9), 690–696 (2006).

2. M. Mattiazzi Usaj, E. B. Styles, A. J. Verster, H. Friesen, C. Boone, and B. J. Andrews, “High-Content Screening for Quantitative Cell Biology,” Trends Cell Biol. 26(8), 598–611 (2016).

3. Z. Zhao, B. Xin, L. Li, and Z. L. Huang, “High-power homogeneous illumination for super-resolution localization microscopy with large field-of-view,” Opt. Express 25(12), 13382–13395 (2017).

4. Z. N. Zhang, Y. J. Wang, R. Piestun, and Z. L. Huang, “Characterizing and correcting camera noise in back-illuminated sCMOS cameras,” Opt. Express 29(5), 6668–6690 (2021).

5. H. Ma and Y. Liu, “Super-resolution localization microscopy: Toward high throughput, high quality, and low cost,” APL Photonics 5(6), 060902 (2020).

6. D. Mahecic, I. Testa, J. Griffie, and S. Manley, “Strategies for increasing the throughput of super-resolution microscopies,” Curr. Opin. Chem. Biol. 51, 84–91 (2019).

7. D. Sage, H. Kirshner, T. Pengo, N. Stuurman, J. Min, S. Manley, and M. Unser, “Quantitative evaluation of software packages for single-molecule localization microscopy,” Nat. Methods 12(8), 717–724 (2015).

8. A. Small and S. Stahlheber, “Fluorophore localization algorithms for super-resolution microscopy,” Nat. Methods 11(3), 267–279 (2014).

9. A. Beghin, A. Kechkar, C. Butler, F. Levet, M. Cabillic, O. Rossier, G. Giannone, R. Galland, D. Choquet, and J.-B. Sibarita, “Localization-based super-resolution imaging meets high-content screening,” Nat. Methods 14(12), 1184–1190 (2017).

10. T. Quan, P. Li, F. Long, S. Zeng, Q. Luo, P. N. Hedde, G. U. Nienhaus, and Z. L. Huang, “Ultra-fast, high-precision image analysis for localization-based super resolution microscopy,” Opt. Express 18(11), 11867–11876 (2010).

11. Y. Wang, T. Quan, S. Zeng, and Z. L. Huang, “PALMER: a method capable of parallel localization of multiple emitters for high-density localization microscopy,” Opt. Express 20(14), 16039–16049 (2012).

12. L. Li, B. Xin, W. Kuang, Z. Zhou, and Z.-L. Huang, “Divide and Conquer: Real-time maximum likelihood fitting of multiple emitters for super-resolution localization microscopy,” Opt. Express 27(15), 21029–21049 (2019).

13. M. Ovesny, P. Krizek, J. Borkovec, Z. Svindrych, and G. M. Hagen, “ThunderSTORM: a comprehensive ImageJ plug-in for PALM and STORM data analysis and super-resolution imaging,” Bioinformatics 30(16), 2389–2390 (2014).

14. I. Munro, E. Garcia, M. Yan, S. Guldbrand, S. Kumar, K. Kwakwa, C. Dunsby, M. A. A. Neil, and P. M. W. French, “Accelerating single molecule localization microscopy through parallel processing on a high-performance computing cluster,” J. Microsc. 273, 148–160 (2019).

15. W. Ouyang, A. Aristov, M. Lelek, X. Hao, and C. Zimmer, “Deep learning massively accelerates super-resolution localization microscopy,” Nat. Biotechnol. 36(5), 460–468 (2018).

16. E. Nehme, L. E. Weiss, T. Michaeli, and Y. Shechtman, “Deep-STORM: super-resolution single-molecule microscopy by deep learning,” Optica 5(4), 458–464 (2018).

17. R. Strack, “Deep learning advances super-resolution imaging,” Nat. Methods 15(6), 403 (2018).

18. S. K. Gaire, Y. Zhang, H. Li, R. Yu, H. F. Zhang, and L. Ying, “Accelerating multicolor spectroscopic single-molecule localization microscopy using deep learning,” Biomed. Opt. Express 11(5), 2705–2721 (2020).

19. N. Boyd, E. Jonas, H. Babcock, and B. Recht, “DeepLoco: Fast 3D Localization Microscopy Using Neural Networks,” bioRxiv, 267096 (2018).

20. S. K. Gaire, Y. Wang, H. F. Zhang, D. Liang, and L. Ying, “Accelerating 3D single-molecule localization microscopy using blind sparse inpainting,” J. Biomed. Opt. 26(2), 026501 (2021).

21. H. Ma, H. Kawai, E. Toda, S. Zeng, and Z. L. Huang, “Localization-based super-resolution microscopy with an sCMOS camera part III: camera embedded data processing significantly reduces the challenges of massive data handling,” Opt. Lett. 38(11), 1769–1771 (2013).

22. D. Baddeley, M. B. Cannell, and C. Soeller, “Visualization of Localization Microscopy Data,” Microsc. Microanal. 16(1), 64–72 (2010).

23. Y. Du, C. Wang, C. Zhang, L. Guo, Y. Chen, M. Yan, Q. Feng, M. Shang, W. Kuang, Z. Wang, and Z.-L. Huang, “Computational framework for generating large panoramic super-resolution images from localization microscopy,” Biomed. Opt. Express 12(8), 4759–4778 (2021).

24. J. Min, C. Vonesch, H. Kirshner, L. Carlini, N. Olivier, S. Holden, S. Manley, J. C. Ye, and M. Unser, “FALCON: fast and unbiased reconstruction of high-density super-resolution microscopy data,” Sci. Rep. 4(1), 4577 (2015).

25. R. P. J. Nieuwenhuizen, K. A. Lidke, M. Bates, D. L. Puig, D. Gruenwald, S. Stallinga, and B. Rieger, “Measuring image resolution in optical nanoscopy,” Nat. Methods 10(6), 557–562 (2013).

26. N. Banterle, K. H. Bui, E. A. Lemke, and M. Beck, “Fourier ring correlation as a resolution criterion for super-resolution microscopy,” J. Struct. Biol. 183(3), 363–367 (2013).
