Supervised classification methods for flash X-ray single particle diffraction imaging

Jing Liu; Gijs van der Schot; Stefan Engblom

doi:10.1364/OE.27.003884

1. Introduction

Modern X-ray Free Electron Laser (XFEL) technology has provided the opportunity for exploring biological structures from individual biological particles, rather than relying on crystallization-based technologies. It is therefore potentially possible to investigate biomolecules or biological processes that are intrinsically dynamic. XFELs produce X-ray pulses shorter than 50 femtoseconds (fs), which are 10⁹ times more brilliant than the radiation produced in conventional synchrotrons. The ultra-short and extremely bright X-ray pulses outrun the radiation damage and allow the recording of sufficiently strong and interpretable 2-dimensional (2D) diffraction patterns from single biological particles [1, 2]. This principle is called diffract-and-destroy and has been shown to be successful for particles as large as small cells, and down to viruses smaller than 50 nanometers (nm) [3–7].

Another feature of XFELs is their high repetition rates. The Linac Coherent Light Source (LCLS) [8] operates at 120 Hz and can produce over 400,000 diffraction patterns per hour, i.e., more than 1.6 TB per hour or 38 TB per day. The massive volume of data makes manual classification of diffraction patterns impractical. The challenge is much more severe in the newest facility — the European XFEL [9], which operates at up to 27,000 Hz and can store more than 12.6 million images per hour [10]. Ideally, all these images would originate from one single biomolecule per exposure. However, the detector also records diffracted signals from multiple scatterers such as particle clusters, buffer impurities, and contaminant materials as discussed in [5,7].

In order to assemble the 2D diffraction patterns into 3D structures, it is essential that data frames are classified and that diffraction patterns originating from contaminants and multiple molecules are sorted out. In 2014, a real-time rejection method [11] was proposed to select diffraction patterns by thresholding and using Time-of-Flight spectroscopy. Other previous sorting algorithms were based on feature vectors and spectral clustering techniques [12, 13]. Diffusion maps and manifold embedding [14] have also been proposed, and have successfully classified icosahedral-shaped viruses [15–18]. However, all of these methods work best after a substantial amount of data has been retrieved, and/or require a solid understanding of the raw data. This makes them less suitable in a streaming context.

In this paper, we develop two template-based classification methods for particle selection: the Eigen-Image (EI) and the Log-Likelihood (LL) method. The methods assess the similarity between template diffraction patterns and incoming patterns by analyzing eigenvector projections and log-likelihood functions, respectively. By relying on templates, neither method depends on access to the full dataset, and consequently both are suitable for real-time processing on site. With our methods, we thus aim to select single-particle diffraction patterns of homogeneous quality. Such datasets may help 3D assembling algorithms [19] to converge more quickly and improve the final 3D resolution.

In §2, we briefly describe a typical Flash X-ray single-particle diffraction Imaging (FXI) experiment. Next, we introduce the EI and the LL method for classification in §3. Following data descriptions in §4, we perform numerical experiments to evaluate the sharpness of our classification methods in §5. A concluding discussion is found in §6.

2. Flash X-ray single particle diffraction Imaging (FXI)

For a typical FXI experiment, the acquired diffraction data are depicted graphically in Fig. 1. A stream of biological molecules is injected into the X-ray interaction region, where sample particles interact with incoming coherent X-ray pulses, resulting in a collection of diffraction patterns on the detector. This procedure is a stochastic process as the interactions between particles and X-ray pulses occur at random. Firstly, the number of particles at the interaction point is unobserved, i.e., we may obtain blank frames with only background noise, single-particle patterns, multiple-particle patterns, and frames with signals from contaminants. Secondly, the current FXI technology cannot monitor the orientations of particles, and therefore extra steps are necessary to recover the 3D structure from single-particle frames. Finally, the strengths of the diffraction signals vary considerably, mainly due to the stochastic nature of the XFELs and the different locations of particles in the interaction region. The relative strength of the diffraction signal is referred to as photon fluence, and we denote it by ϕ.

Fig. 1 (a): A typical setup of an FXI experiment. (b)–(e): Four diffraction patterns from a mimivirus taken from [4]. (b) was a blank frame which contained only background scattering. (c) and (d) were frames from multiple particles or with contaminants. (e) was a single-particle frame from an icosahedral-shaped virus with a relatively strong signal. This is the most interesting pattern and can be used for assembling a 3D structure in later steps. All diffraction patterns are displayed in logarithmic scale.

Typical FXI setups use digital detectors made up of individual pixels, and therefore the captured frames are pixelized. Further, some pixel counts near the center are inaccessible or overflow as a result of physical limitations and arrangements of the detector.

3. Classification Methods

Template-based methods for classifying patterns allow for identifying the class of an unlabelled pattern by searching for its best-matched template. For such methods, the collection of templates is referred to as the training dataset, an unlabelled pattern is called a testing image, and the classification procedure is referred to as the classifier. In this section, we discuss two classifiers — the Eigen-Image (EI) [20–25] and the Log-Likelihood (LL) classifier [26–28], to classify a testing diffraction pattern relying on a training dataset.

3.1. Eigen-Image (EI) Classifier

The EI method has two steps — the training and the classification step. In the training step, we train our EI classifier by projecting the training dataset to its eigenvectors. In the classification step, we label a testing image by minimizing the distance between the eigenvector projections of the testing image and the training dataset.

Let i.i.d. template diffraction patterns $T = {(T_{k})}_{k = 1}^{M_{data}}$ be the training dataset, consisting of M_data frames. Since the detector is pixelized, we denote the kth pattern by $T_{k} = {(T_{i k})}_{i = 1}^{M_{pix}}$ . To train an EI classifier, we first transfer the training dataset T into the image space A by the shift

A = {(A_{k})}_{k = 1}^{M_{data}} = {(T_{k} - \bar{T})}_{k},

where T̄ is the pixel average of the training dataset,

\bar{T} = \frac{1}{M_{data}} \sum_{k = 1}^{M_{data}} T_{k} .

In practice, the covariance matrix of A, namely AA^T (of size M_pix × M_pix), is too large to decompose into eigenvectors directly, and therefore we factorize the much smaller matrix A^TA instead (of size M_data × M_data, where usually M_data ≪ M_pix). Hence,

A^{T} A = V Λ V^{T},

where Λ is the diagonal matrix whose diagonal elements are the eigenvalues of A^TA, and V is the matrix of the corresponding eigenvectors. We can now compute the eigenvectors of the covariance matrix AA^T by

U = A V,

and U is sometimes also referred to as eigenfaces [20,25].

The eigenvector projection matrix of the image space A is defined as follows:

Ω = U^{T} A .

Using U and Ω, we can now classify a testing diffraction pattern $P = {(P_{i})}_{i = 1}^{M_{pix}}$ by minimizing the Euclidean distance (i.e., the L₂-norm) between its eigenvector projection W and the columns Ω_k of Ω,

\underset{k}{arg min} {‖ W - Ω_{k} ‖}_{L_{2}},

where

W = U^{T} (P - \bar{T}) .
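
The two EI steps can be illustrated with a short NumPy sketch. It is only a minimal illustration under stated assumptions, not the authors' Matlab implementation: templates are assumed to be stored as columns of an array `T` of shape (M_pix, M_data), and the function names `train_ei` and `classify_ei` are hypothetical.

```python
import numpy as np

def train_ei(T):
    """Training step. T has shape (M_pix, M_data), one template per column."""
    T_bar = T.mean(axis=1, keepdims=True)   # pixel-wise average of the templates
    A = T - T_bar                           # shifted training data
    # Decompose the small (M_data x M_data) matrix A^T A instead of A A^T.
    _eigvals, V = np.linalg.eigh(A.T @ A)
    U = A @ V                               # eigenvectors of A A^T (not normalized)
    Omega = U.T @ A                         # projections of the training patterns
    return T_bar, U, Omega

def classify_ei(P, T_bar, U, Omega):
    """Classification step. P is one flattened pattern of length M_pix."""
    W = U.T @ (P.reshape(-1, 1) - T_bar)        # projection of the testing image
    dists = np.linalg.norm(Omega - W, axis=0)   # L2 distance to every column of Omega
    return int(np.argmin(dists)), dists

# Toy usage with random stand-in data (not simulated diffraction patterns).
rng = np.random.default_rng(0)
T_demo = rng.uniform(0.0, 10.0, size=(50, 5))
model = train_ei(T_demo)
best_k, _ = classify_ei(T_demo[:, 2], *model)
```

Since U and Ω are computed once, classifying each further image costs one matrix-vector product plus M_data distance evaluations.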

3.2. Log-Likelihood (LL) Classifier

The LL Classifier attempts to classify a testing image by maximizing the log-likelihood function of a given probability density function. Since the photon counting procedure is assumed to obey the Poisson distribution, we can write the joint likelihood function as follows:

\prod_{i = 1}^{M_{pix}} P (P_{i} | T_{i k}, ϕ_{k}) = \prod_{i = 1}^{M_{pix}} \frac{{(ϕ_{k} T_{i k})}^{P_{i}} e^{- ϕ_{k} T_{i k}}}{P_{i}!} = : 𝒬_{k},

where ϕ_k is the photon fluence (relative signal strength) of the testing image with respect to the kth template, and can be estimated by

ϕ_{k} = \frac{\sum_{i = 1}^{M_{pix}} P_{i}}{\sum_{i = 1}^{M_{pix}} T_{i k}} .

The joint log-likelihood function ℒ for the LL classifier is therefore

log (𝒬_{k}) \propto \sum_{i = 1}^{M_{pix}} \left( P_{i} log (T_{i k}) + P_{i} log (ϕ_{k}) - ϕ_{k} T_{i k} \right) = : ℒ_{k},

where the term \sum_{i} log (P_{i}!) has been dropped since it does not depend on k. We can now classify the testing image P by simply maximizing the joint log-likelihood function in Eq. (10):

\underset{k}{arg max} ℒ_{k} = \underset{k}{arg max} \sum_{i = 1}^{M_{pix}} \left( P_{i} log (T_{i k}) + P_{i} log (ϕ_{k}) - ϕ_{k} T_{i k} \right) .

For classifying multiple testing images, the EI method computes U and Ω only once at the beginning, and we then compute Eq. (6) and Eq. (7) for each testing image. The EI classifier requires fewer computations than the LL classifier, since Eq. (7) costs O(M_data × M_pix) arithmetic operations, which is the cost of the last term alone in Eq. (11).
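
As an illustration of the LL classifier, the following NumPy sketch evaluates the log-likelihood for all templates at once. It is a minimal sketch, not the authors' code: the function name `classify_ll` and the small constant `eps`, which guards against log(0) in template pixels with zero intensity, are assumptions made here, and the testing image is assumed to contain at least one photon.

```python
import numpy as np

def classify_ll(P, T, eps=1e-12):
    """P: photon counts, shape (M_pix,). T: templates, shape (M_pix, M_data)."""
    P = P.astype(float)
    T_sum = T.sum(axis=0)
    phi = P.sum() / T_sum                 # fluence estimate, one value per template
    log_T = np.log(T + eps)               # eps is an assumption to avoid log(0)
    # Log-likelihood per template, with the log(P_i!) term left out.
    L = P @ log_T + P.sum() * np.log(phi) - phi * T_sum
    return int(np.argmax(L)), L, phi

# Toy usage: a testing image that is an exact rescaling of template 3.
rng = np.random.default_rng(1)
T_demo = rng.uniform(0.1, 5.0, size=(40, 6))
k_best, L_demo, phi_demo = classify_ll(0.5 * T_demo[:, 3], T_demo)
```

With the fluence estimated as above, the last term equals the total photon count of the testing image for every template, so the ranking is decided by the first two terms.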

4. Data Description

The raw FXI data frames differ to varying degrees. Typically they may be sorted into hierarchical categories. On the top level, the classes can be single-particle patterns and non-single-particle patterns. Further, single-particle patterns, which typically are the most interesting ones, may be classified by the particle rotations, sizes, shapes, etc. Making templates for all possible categories can be hard or even impossible, and consequently we train and evaluate our classifiers with single-particle diffraction patterns.

Most viruses have either a helical or an icosahedral capsid structure [29,30], and icosahedral viruses are of great interest [15,17,31]. Therefore, we focus on FXI experiments of icosahedral particles. To illustrate our method, we used regular uniform-density icosahedrons to generate diffraction patterns via Condor [32]. For our simulations, we used a setup similar to the beam profile of the FXI mimivirus experiment [4]. More specifically, we used X-ray pulses with 1 mJ peak energy and 1 nm wavelength. We also assumed that the X-ray pulses had a circular focus of 10 μm in diameter. Further, the distance between the detector and the interaction region was 0.74 meters, and the detector itself was 960 × 960 pixels, with the size of each pixel 75 × 75 μm². Finally, a circular missing-data area of 80 pixels in diameter was set to zero.

To assess our classifiers systematically, we gradually increased the complexity of the testing dataset. With five synthetic testing datasets, we mimicked diffraction patterns of particles with noise, different fluences, and of various sizes and shapes. We also evaluated our methods for the actual mimivirus FXI data [31]. Fig. 2 illustrates two noisy icosahedral diffraction patterns at particle sizes 180 nm and 200 nm in the same particle orientation, and one spheroid diffraction pattern at size 180 nm.

Fig. 2 (a): A noisy diffraction pattern from a 180 nm icosahedron. (b): A noisy pattern from a 200 nm icosahedron in the same particle orientation. (c): A noiseless pattern from a 180 nm spheroid.

4.1. Homogeneous Datasets

We first simulated diffraction patterns from a regular icosahedron of diameter 180 nm. The training dataset T had 290 frames, and the Euclidean distance between any two of its patterns was larger than or equal to 220. The first testing dataset D was a noiseless homogeneous dataset, which contained M_data = 1000 noiseless icosahedral diffraction patterns. The first 290 frames were from the training dataset T and were used as benchmarks. The remaining 810 frames were random-orientation patterns from the same icosahedron.

Since the photon counting procedure is assumed to follow the Poisson distribution, we added Poissonian noise to D for our noisy dataset P,

P_{k} ~ Poisson (D_{k}), k = 1, 2, \dots, M_{data} .

By scaling P with different fluences, we obtained our last homogeneous testing dataset — the scaled noisy dataset F by

F_{k} ~ Poisson (Φ_{k} D_{k}),

where Φ_k was chosen uniformly at random between 0.01 and 1.1,

Φ_{k} ~ 𝒰 {0.01, 1.1} .
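
The construction of the noisy dataset P and the scaled noisy dataset F from the noiseless dataset D can be sketched as follows. This is an illustration only: the array layout (one frame per column), the function names, and the random seed are assumptions, and the stand-in array below replaces the Condor-simulated patterns.

```python
import numpy as np

def make_noisy(D, rng):
    """Poisson noise on each noiseless frame: P_k ~ Poisson(D_k)."""
    return rng.poisson(D)

def make_scaled_noisy(D, rng, lo=0.01, hi=1.1):
    """One random fluence per frame, then Poisson noise: F_k ~ Poisson(Phi_k D_k)."""
    Phi = rng.uniform(lo, hi, size=D.shape[1])   # Phi_k drawn uniformly from [lo, hi)
    return rng.poisson(D * Phi), Phi             # Phi broadcasts over the columns

# Stand-in for the noiseless dataset D: 64 pixels, 10 frames (hypothetical values).
rng = np.random.default_rng(0)                   # seed chosen only for reproducibility
D_demo = rng.uniform(0.0, 100.0, size=(64, 10))
P_demo = make_noisy(D_demo, rng)
F_demo, Phi_demo = make_scaled_noisy(D_demo, rng)
```

Keeping `Phi` alongside `F` makes it possible to compare an estimated fluence with the true one later on.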

4.2. Heterogeneous Particle Sizes

Considering the potential size variation of viruses, we generated our testing dataset S (M_data = 2000) from uniform-density icosahedrons with randomly and uniformly chosen diameters between 150 nm and 210 nm (∼ 𝒰{150, 210}). Similar to F, all patterns in S were Poissonian with random fluences according to [Eq. (14)].

4.3. Heterogeneous Particle Shapes

To mimic heterogeneous particle shapes, the synthetic testing dataset X contained diffraction patterns from both icosahedrons and spheroids. The diameters of the objects varied from 150 nm to 210 nm, with changing fluences Φ_k ∼ 𝒰{0.01, 1.1}. Further, the shapes of the spheroids were also changing, as the aspect ratios of the spheroids (the ratio of the length of the minor axis to the length of the major one) were varying between 0.6 and 1. In total, the dataset X contained M_data = 1200 frames — 200 spheroidal patterns and 1000 icosahedral patterns randomly selected from S.

4.4. Mimivirus Dataset

To be relevant to real FXI experiments, we also classified the mimivirus dataset [4,31], which contained 198 single mimivirus patterns. To classify this dataset, we generated a new training dataset (T₂) with the corresponding experimental beam profile [4], consisting of 1000 random-orientation frames of a 490 nm icosahedron. Later, we moved forward to the raw data frames from the FXI mimivirus experiments [4]. To capture non-single particles, and keep the training dataset (T₃) as small as possible, we only added ten spherical patterns, with particle sizes varying from 100 nm to 1000 nm. Table 1 lists the primary parameters of all datasets.

Table 1. Primary parameters of all datasets.

5. Experiments

We now perform numerical experiments to investigate the efficiency and the accuracy of our EI and LL classifiers. To save memory and execution time without losing much accuracy in the classification, only the central 480 × 480 pixels were used in the computations, and they were binned into 120 × 120 pixels, i.e., every 4 × 4 pixels were averaged into one pixel.
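
The cropping and binning step can be written compactly with a reshape. The sketch below is only an illustration: it assumes the region of interest is centred on the array centre, and the function name `crop_and_bin` is hypothetical.

```python
import numpy as np

def crop_and_bin(frame, crop=480, bin_size=4):
    """Keep the central crop x crop pixels and average bin_size x bin_size blocks."""
    n_rows, n_cols = frame.shape
    r0 = (n_rows - crop) // 2
    c0 = (n_cols - crop) // 2
    centre = frame[r0:r0 + crop, c0:c0 + crop]
    m = crop // bin_size
    # Split each axis into (block index, position inside block) and average the blocks.
    return centre.reshape(m, bin_size, m, bin_size).mean(axis=(1, 3))

# Toy usage on a 960 x 960 frame, the detector size used in the simulations.
frame_demo = np.arange(960 * 960, dtype=float).reshape(960, 960)
binned_demo = crop_and_bin(frame_demo)   # shape (120, 120)
```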

5.1. Metrics

Since the proposed methods are sensitive to the particle rotations, the best-matched pattern from the training dataset should have the closest particle rotation to the testing image. Therefore, we compare the best-matched template with the testing image taking also the particle size into consideration. Let $Γ_{k} = {(Γ_{i k})}_{i = 1}^{M_{pix}}$ be the kth frame of the testing dataset Γ. Let $R = {(R_{i})}_{i = 1}^{M_{pix}}$ be the best-matched pattern of Γ_k from the training dataset. The pattern distance between Γ_k and R is now defined as:

C_{k} (Γ_{k}, R) = \underset{s, {\hat{Φ}}_{k}}{arg min} \frac{\sum_{i = 1}^{M_{pix}} {({\hat{Φ}}_{k} U_{i k} (R, s) - Γ_{i k})}^{2}}{\sum_{i = 1}^{M_{pix}} {({\hat{Φ}}_{k} U_{i k} (R, s))}^{2}},

where Φ̂_k is the estimated fluence,

{\hat{Φ}}_{k} = \frac{\sum_{i = 1}^{M_{pix}} Γ_{i k}}{\sum_{i = 1}^{M_{pix}} U_{i k} (R, s)} .

Further, U_k(R, s) is an interpolation (or extrapolation) method that resizes the pattern R by a factor s and returns a scaled image of the same size as Γ_k. Note that U(R, s) = R for our homogeneous testing datasets D, P and F, and Φ̂_k = 1 for the first two. It is natural to set 0.5 as a threshold to determine if the testing image Γ_k can be accepted by R.

Similarly, we define the fluence distance by

E_{k} = \frac{{(Φ_{k} - {\hat{Φ}}_{k})}^{2}}{Φ_{k}^{2}},

where Φ_k is the true fluence used to generate Γ_k.
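
For the homogeneous case, where no resizing is needed and U(R, s) = R, both distances reduce to a few array operations. The sketch below covers only that special case: the search over the size factor s is omitted, and the function name is hypothetical.

```python
import numpy as np

def pattern_and_fluence_distance(gamma, R, true_fluence=None):
    """Distances between a testing frame gamma and its best-matched template R.

    Assumes U(R, s) = R, i.e. template and testing frame share the particle size.
    """
    phi_hat = gamma.sum() / R.sum()          # estimated fluence
    model = phi_hat * R                      # template scaled to the frame's fluence
    C = np.sum((model - gamma) ** 2) / np.sum(model ** 2)   # pattern distance
    E = None
    if true_fluence is not None:
        E = (true_fluence - phi_hat) ** 2 / true_fluence ** 2   # fluence distance
    return C, phi_hat, E

# Toy usage: a frame that is an exact copy of the template at fluence 0.3.
rng = np.random.default_rng(2)
R_demo = rng.uniform(1.0, 10.0, size=100)
C_demo, phi_demo, E_demo = pattern_and_fluence_distance(0.3 * R_demo, R_demo, 0.3)
accepted = C_demo < 0.5                      # threshold of 0.5 from the text
```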

5.2. Homogeneous Patterns

We first tested our classifiers on the homogeneous synthetic datasets: the noiseless dataset D, the noisy dataset P, and the scaled-Poisson dataset F, as listed in Table 1. Since the first 290 images of the three testing datasets were modifications of the training dataset, we used them as benchmarks, and compared their average pattern distance with the distances from the remaining patterns in the datasets. As listed in Table 2, we observed that both classifiers matched all benchmark frames successfully with pattern distance around 0, 0.03 and 0.04 on datasets D, P and F, respectively. The classification accuracy was 100% for the benchmark frames. However, for the remaining patterns, the EI classifier performed slightly better than the LL classifier, obtaining a pattern distance about 1% smaller.

Table 2. The average pattern distance for classification as defined in Eq. (15) of the EI and the LL classifier. Benchmark is the average distance of the benchmark patterns, and Remaining denotes the distance of the remaining patterns in the testing dataset, with respect to their best-matched templates.

The fluence distance, as defined in Eq. (17), of the dataset F from the EI classifier was 0.035, compared with 0.049 from the LL classifier, excluding the benchmark patterns. Further, the EI classifier was more efficient and took only 3.7 ms per image in our Matlab implementation, nearly 15 times faster than the LL classifier.

5.3. Heterogeneous Particle Sizes

We next evaluated the dataset S, containing patterns from icosahedrons of diameters between 150 nm and 210 nm. Fig. 3 illustrates the average pattern and fluence distance for the EI and the LL classifiers. As expected, both classifiers obtained the smallest distances when the particle size of the testing pattern was similar to the template size (180 nm), and the LL classifier had slightly larger distances on average. Furthermore, the EI classifier was better at estimating particle sizes, as shown in Fig. 3(c), and this also implied that the EI classifier was more accurate in searching for the best-matched template than the LL classifier was.

Fig. 3 The pattern and the fluence distance of dataset S, which contained 2000 icosahedral diffraction patterns of different sizes, from the EI (a) and the LL (b) classifier. The classification and the fluence distance were defined in Eq. (15) and Eq. (17), respectively. Both classifiers obtained the smallest distances around the template size (180 nm). (c): The absolute errors in estimating sizes from the EI (blue triangle) and the LL (red star) classifier. On average we obtained a minimum size error of 1 nm at around 180 nm particle size, and a maximum error of 4 nm around the upper boundary of the sizes in our testing dataset.

The size and the fluence estimation procedures together took around 80 ms for each image, i.e., approximately 1.5 times longer than the LL classifier or 22 times longer than the EI classifier. In other words, with size estimation, the EI classifier can handle 12 images per second and the LL classifier 8 images per second using our straightforward Matlab implementation. With the Matlab Parallel Computing Toolbox and Distributed Computing Server, it is therefore possible to speed up both classifiers to the LCLS repetition rate (120 Hz). Since both methods can be parallelized, the European XFEL detector read-out rate of 3,520 Hz [10] is within reach for a compiled-language implementation.
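
The per-image timings quoted in the text (3.7 ms for the EI classifier, nearly 15 times that for the LL classifier, and around 80 ms for size and fluence estimation) can be turned into rough throughput figures. The sketch below is back-of-the-envelope arithmetic only. It takes the LL time as exactly 15 × 3.7 ms, which is an upper estimate, and it assumes ideal linear scaling over parallel workers with no communication overhead, which a real deployment would not achieve.

```python
import math

t_ei = 3.7e-3          # EI classification time per image, in seconds
t_ll = 15 * t_ei       # LL time: "nearly 15 times" slower, taken as an upper estimate
t_est = 80e-3          # size and fluence estimation time per image, in seconds

def images_per_second(t_classify, t_estimate=t_est):
    """Serial throughput of classification followed by size/fluence estimation."""
    return 1.0 / (t_classify + t_estimate)

def workers_needed(target_hz, per_worker_rate):
    """Parallel workers needed to reach target_hz, assuming ideal linear scaling."""
    return math.ceil(target_hz / per_worker_rate)

ei_rate = images_per_second(t_ei)              # images per second, EI
ll_rate = images_per_second(t_ll)              # images per second, LL
ei_workers_lcls = workers_needed(120, ei_rate)
ei_workers_xfel = workers_needed(3520, ei_rate)
```

Under these idealized assumptions, matching 120 Hz needs on the order of ten workers, while 3,520 Hz needs a few hundred.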

5.4. Heterogeneous Shapes

In this section, we investigate the performance of the LL and EI classifiers for the dataset X, which contained particles with heterogeneous shapes and sizes. For identifying the spheroids in X, we added a 180 nm sphere diffraction pattern into the training dataset T, see T₁ (the training dataset 1) and X for more details in Table 1. Both classifiers distinguished the icosahedral and spheroidal diffraction patterns successfully, as listed in Table 3. All icosahedral diffraction patterns were classified as icosahedrons with small pattern distances (< 0.25). With a pattern-distance threshold of 0.5, the EI classifier rejected 78 elongated spheroidal patterns and identified 114 spheroidal frames as spheroids successfully. However, 8 (4%) frames were misclassified as icosahedrons, and their pattern distances were between 0.42 and 0.5, see Fig. 4(b). The LL classifier gave a similar but slightly worse result as it misclassified 9 spheroidal frames as icosahedrons. Since our desired patterns were indeed icosahedral, we could also describe the testing dataset by two classes: the icosahedron and the non-icosahedron. Both classifiers gave 100% recall (true positive rate), while the EI classifier gave a classification accuracy of 99.33%, slightly better than that of the LL classifier (99.25%), see more details on the definitions of accuracy and recall in [33].
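
The two-class accuracy and recall quoted above follow directly from the counts in the text. The short check below is a worked example rather than part of the original analysis. It treats every spheroidal frame that was either rejected or labelled as a spheroid as a true negative, and the function name is hypothetical.

```python
n_icosahedral = 1000   # icosahedral patterns in dataset X
n_spheroidal = 200     # spheroidal patterns in dataset X

def two_class_scores(false_positives, false_negatives=0):
    """Accuracy and recall for the classes 'icosahedron' vs 'non-icosahedron'."""
    true_positives = n_icosahedral - false_negatives
    true_negatives = n_spheroidal - false_positives
    total = n_icosahedral + n_spheroidal
    accuracy = (true_positives + true_negatives) / total
    recall = true_positives / (true_positives + false_negatives)
    return accuracy, recall

# 8 (EI) and 9 (LL) spheroidal frames were misclassified as icosahedrons.
acc_ei, rec_ei = two_class_scores(false_positives=8)
acc_ll, rec_ll = two_class_scores(false_positives=9)
```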

Fig. 4 (a): The pattern distances of dataset X. For both classifiers, all icosahedral patterns were located in the perfectly matched region. For the EI classifier, all elongated spheroidal patterns (78 patterns) were rejected. For the remaining 122 accepted spheroidal patterns, 114 were successfully classified as spheroids, and eight frames or 4% of the spheroidal patterns were misclassified. The LL classifier gave one more misclassified spheroidal pattern. (b) and (c): The relationship between the pattern distances and the aspect ratios for the spheroidal patterns from the EI (b) and the LL classifier (c). The red stars were misclassified patterns. The aspect ratio was the ratio of the length of the minor axis to the length of the major axis of the spheroidal particle. (d)–(h): Five combination images, corresponding to the five data points (red circles) in (a) of the EI classifier. The left half of each image was from the testing dataset X, and the right half was the best-matched pattern from the training dataset. The number in each figure was the pattern distance. All figures are drawn in logarithmic scale.

Table 3. Classification results of dataset X. The threshold of the pattern distance for rejection was set to 0.5. For the cases where the results from the LL classifier were different from the EI classifier, the values from the LL classifier are shown in parentheses.

We visually illustrate the classification results in Fig. 4. The rejected frames from both classifiers were elongated frames and the aspect ratios for most of them were smaller than 0.75, see Figs. 4(b) and 4(c). Further, we observed that the pattern distances decreased with increasing aspect ratio for both classifiers.

5.5. Mimivirus diffraction patterns

We also tested our classifiers on the mimivirus FXI dataset, which has been used previously for a 3D mimivirus reconstruction [31]. To compensate for detector saturation at the image center and the low signal-to-noise ratio at the edges of the patterns, we used the central part of the diffraction patterns for classification, see Fig. 5. The training dataset for the mimivirus dataset (T₂) contained 1000 randomly-oriented icosahedral patterns of 490 nm in diameter. Furthermore, we binned 4 × 4 pixels into one pixel in the calculations.

Fig. 5 A mimivirus diffraction pattern (a) and its central region used for classification (b). The region shown in (b) was the region between two circles in (a).

As expected, we obtained larger pattern distances from both the EI and the LL classifier, compared with the synthetic datasets, see Fig. 6, and this is due to the heterogeneity in size and shape of the mimiviruses. Again, a pattern-distance threshold of 0.5 was used to detect irregular patterns. In total, both classifiers rejected 9.1% of patterns (18 patterns). To quantify the quality of our selected patterns, we fitted the complete dataset and the two selected datasets into a 3D intensity by the Expand-Maximize-Compress (EMC) method [34], and looked at the correlation between the pattern distances and the sum of the largest 0.035% (i.e., the largest 30) rotational probabilities of each diffraction pattern in Figs. 6(b) and 6(c).

Fig. 6 (a): The pattern distances of the mimivirus dataset from the EI and the LL classifier. 18 of 198 patterns (9.1%) were rejected with a threshold of 0.5 from both classifiers. (b): The relationship between the pattern distance and the sum of the largest 0.035% (the largest 30) rotational probabilities of each diffraction pattern. (d)–(h): Five combination images at the data points from the EI classifier (red circles) in (a). The left part of each image was from the mimivirus dataset, and the right part was the corresponding template scaled by the recovered fluence. (g) was a slightly elongated pattern and the particle size of (h) was smaller than the template size.

As expected, the sum of the rotational probabilities increased with decreasing pattern distance. However, we did not get a linear correlation, most likely because the mimivirus samples were not regular uniform-density icosahedrons and had different particle sizes. For example, in Figs. 6(d)–6(f), the templates and the mimivirus patterns matched quite well; however, the particle in Fig. 6(g) was slightly elongated, and the particle size of Fig. 6(h) was 457 nm, which was 33 nm smaller than the template size (490 nm). Further, our classifiers also improved the 3D reconstruction results, i.e., the average (minimum) of the sum of the rotational probabilities was improved to 0.207 (0.034) for the EI classifier, and 0.204 (0.032) for the LL classifier, compared with 0.174 (0.029) from the complete dataset. However, the 3D intensities reconstructed from these three sets were quite similar, mainly due to the limited number of patterns.

5.6. Raw mimivirus dataset

The purpose of our final experiments is to investigate the behaviour of the methods when confronted with a raw mimivirus dataset. These data frames differ considerably from one another, and hence it is hard to simulate all possibilities in order to obtain good templates. For classifying the raw mimivirus dataset [31], we therefore used 1000 randomly-oriented patterns and 10 sphere patterns of different sizes as the training dataset, see T₃ in Table 1 for more details.

From 50,712 raw diffraction patterns in run 92 and run 93 (see more details about the experimental runs in Table 1), we found 578 hits by using the methods described in [35]. We centred all 578 hits (patterns with one or more particles) and set the pattern-distance threshold to 0.5. Further, we also limited the estimated sizes of selected patterns to within [450, 540] nm. Finally, we also rejected all patterns which were accepted by the sphere templates. With these three criteria the EI classifier obtained 108 single-particle frames, and among them, 75 frames were manually classified as single-particle patterns, giving a precision (Positive Predictive Value, or PPV) of 0.69, see Table 4. Further, 14 manual singles were misclassified, giving a recall (True Positive Rate, TPR) of 0.84. The LL classifier gave similar but slightly worse results, see Table 5. Both classifiers obtained high accuracies (> 0.9). However, 33 (38) accepted frames were non-single hits for the EI (LL) classifier, giving a relatively low precision value of 0.69 (0.65). The often quoted F1 score (the harmonic mean of precision and recall) was 0.76 (0.72).
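
The EI figures above can be reproduced from the counts given in the text: 75 true positives, 33 false positives and 14 false negatives among 578 hits. The sketch below is a worked example only. The number of true negatives is not stated in the text and is inferred here under the assumption that all 578 hits were labelled manually; the function name is hypothetical.

```python
def classification_scores(tp, fp, fn, total):
    """Precision (PPV), recall (TPR), F1 score and accuracy from confusion counts."""
    tn = total - tp - fp - fn            # inferred, assuming every hit was labelled
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / total
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# EI classifier on the raw mimivirus hits, using the counts quoted in the text.
ei_scores = classification_scores(tp=75, fp=33, fn=14, total=578)
```

Because single-particle frames are a small fraction of the hits, the accuracy is dominated by correctly rejected frames, which is why precision and recall are reported alongside it.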

Table 4. Classification results of the raw mimivirus dataset [31] from the EI classifier, see text for more details on the classification measures (following [33, 36] closely). In the table, ACC, TPR, PPV, and F1 abbreviate Accuracy, True Positive Rate, Positive Predictive Value, and F1 score, respectively. TPR and PPV are sometimes referred to as the recall and the precision.

Table 5. Classification results of the raw mimivirus dataset [4] from the LL classifier.

It is worth pointing out that the mimivirus dataset had a low single-hit rate, and we may get a better precision if there are more single-particle patterns in the testing dataset. Further, most rejected single-particle frames were sorted out by the size threshold. By loosening the size constraint, we could reduce the number of rejected single-particle patterns. However, the number of accepted non-single-particle patterns may then instead increase. In Fig. 7, we also visually illustrate five diffraction patterns together with their classification results from the EI classifier.

Fig. 7 Five combination images from the EI classifier for the raw mimivirus data in [31]. The left part of each image was from the mimivirus dataset, and the right part was the corresponding template scaled by the recovered fluence. The color scale is logarithmic and ranges from 0 to 1000 photons per pixel. (a): One of the best accepted diffraction patterns. (b): One selected diffraction pattern with low photon counts: the total number of photons was around 2.4 × 10⁵. (c): A potential water droplet, which was rejected since it was accepted by a sphere template. (d) and (e): Patterns rejected by the size threshold.

6. Conclusions

The FXI technique holds the promise of obtaining biomolecule structures from single particles. It operates at a high repetition rate and records millions of diffraction patterns every day. The stochastic nature of XFELs and the heterogeneity of the sample molecules make the recorded dataset too complex and massive to classify manually. By using our knowledge of the sample molecules, such as sizes and shapes, we can use template-based methods to reduce the complexity of the classification problem and select more homogeneous diffraction patterns. In consequence, the next step of FXI data analysis, the 3D orientation determination procedure, will hopefully need fewer computations and converge faster, leading to a 3D model with an improved resolution. Both proposed methods obtained a high classification accuracy, and most non-single-particle patterns were sorted out. Some non-single-particle patterns were still selected, and since the raw testing dataset had a low single-particle hit rate, we might have selected fewer non-single-particle patterns with a higher single-particle hit rate. Improvements in the quality of the raw data frames would thus still remain very beneficial for the end-result resolution.

In our straightforward Matlab implementations, both methods can classify a testing pattern within tens of milliseconds (3.7 ms for the EI classifier and about 15 times longer for the LL classifier), and they can be accelerated to the XFEL repetition rates, albeit using considerable resources. We also observed that the rotational probabilities, from the 3D orientation determination procedure, increased with decreasing pattern distances. Further, the selected patterns from our classifiers fit better into a 3D Fourier intensity, resulting in a potentially better resolution of the 3D electron density of the sample molecules.

Newer facilities, such as the European XFEL, operate at high repetition rates and will create massive volumes of FXI diffraction data with heterogeneities to varying degrees. With our methods, we can use most of our knowledge of the sample molecules to reduce data storage and automatically select homogeneous single-particle patterns. We also foresee that an on-site FXI analysis pipeline, which connects our classifier to the 3D reconstruction procedure, can solve the 3D structure with sub-nanometer resolution in the near future.

Funding

This work was financially supported by the Swedish Research Council within the UPMARC Linnaeus center of Excellence (S. Engblom, J. Liu) and by the Swedish Research Council, the Röntgen-Ångström Cluster, the Knut och Alice Wallenbergs Stiftelse, the European Research Council (J. Liu, G. Schot).

References

1. R. Neutze, R. Wouts, D. van der Spoel, E. Weckert, and J. Hajdu, “Potential for biomolecular imaging with femtosecond X-ray pulses,” Nature 406, 752–757 (2000).

2. H. N. Chapman, A. Barty, S. Marchesini, A. Noy, S. P. Hau-Riege, C. Cui, M. R. Howells, R. Rosen, H. He, J. C. H. Spence, U. Weierstall, T. Beetz, C. Jacobsen, and D. Shapiro, “High-resolution ab initio three-dimensional x-ray diffraction microscopy,” J. Opt. Soc. Am. A 23, 1179 (2006).

3. M. M. Seibert, T. Ekeberg, F. R. N. C. Maia, M. Svenda, J. Andreasson, O. Joensson, D. Odic, B. Iwan, A. Rocker, D. Westphal, M. Hantke, D. P. DePonte, A. Barty, J. Schulz, L. Gumprecht, N. Coppola, A. Aquila, M. Liang, T. A. White, A. Martin, C. Caleman, S. Stern, C. Abergel, V. Seltzer, J.-M. Claverie, C. Bostedt, J. D. Bozek, S. Boutet, A. A. Miahnahri, M. Messerschmidt, J. Krzywinski, G. Williams, K. O. Hodgson, M. J. Bogan, C. Y. Hampton, R. G. Sierra, D. Starodub, I. Andersson, S. Bajt, M. Barthelmess, J. C. H. Spence, P. Fromme, U. Weierstall, R. Kirian, M. Hunter, R. B. Doak, S. Marchesini, S. P. Hau-Riege, M. Frank, R. L. Shoeman, L. Lomb, S. W. Epp, R. Hartmann, D. Rolles, A. Rudenko, C. Schmidt, L. Foucar, N. Kimmel, P. Holl, B. Rudek, B. Erk, A. Hoemke, C. Reich, D. Pietschner, G. Weidenspointner, L. Strueder, G. Hauser, H. Gorke, J. Ullrich, I. Schlichting, S. Herrmann, G. Schaller, F. Schopper, H. Soltau, K.-U. Kuehnel, R. Andritschke, C.-D. Schroeter, F. Krasniqi, M. Bott, S. Schorb, D. Rupp, M. Adolph, T. Gorkhover, H. Hirsemann, G. Potdevin, H. Graafsma, B. Nilsson, H. N. Chapman, and J. Hajdu, “Single mimivirus particles intercepted and imaged with an X-ray laser,” Nature 470, 78–81 (2011).

4. T. Ekeberg, M. Svenda, M. M. Seibert, C. Abergel, F. Maia, V. Seltzer, D. P. DePonte, A. Aquila, J. Andreasson, B. Iwan, O. Jönsson, D. Westphal, D. Odić, I. Andersson, A. Barty, M. Liang, A. Martin, L. Gumprecht, H. Fleckenstein, and J. Hajdu, “Single-shot diffraction data from the mimivirus particle using an x-ray free-electron laser,” Sci. Data 3, 160060 (2016).

5. M. F. Hantke, D. Hasse, F. R. N. C. Maia, T. Ekeberg, K. John, M. Svenda, N. D. Loh, A. V. Martin, N. Timneanu, D. S. Larsson, G. van der Schot, G. H. Carlsson, M. Ingelman, J. Andreasson, D. Westphal, M. Liang, F. Stellato, D. P. DePonte, R. Hartmann, N. Kimmel, R. A. Kirian, M. M. Seibert, K. Mühlig, S. Schorb, K. Ferguson, C. Bostedt, S. Carron, J. D. Bozek, D. Rolles, A. Rudenko, S. Epp, H. N. Chapman, A. Barty, J. Hajdu, and I. Andersson, “High-throughput imaging of heterogeneous cell organelles with an X-ray laser,” Nat. Photonics 8, 943 (2014).

6. G. van der Schot, M. Svenda, F. R. N. C. Maia, M. Hantke, D. P. DePonte, M. M. Seibert, A. Aquila, J. Schulz, R. Kirian, M. Liang, F. Stellato, B. Iwan, J. Andreasson, N. Timneanu, D. Westphal, F. N. Almeida, D. Odic, D. Hasse, G. H. Carlsson, D. S. D. Larsson, A. Barty, A. V. Martin, S. Schorb, C. Bostedt, J. D. Bozek, D. Rolles, A. Rudenko, S. Epp, L. Foucar, B. Rudek, R. Hartmann, N. Kimmel, P. Holl, L. Englert, N.-T. Duane Loh, H. N. Chapman, I. Andersson, J. Hajdu, and T. Ekeberg, “Imaging single cells in a beam of live cyanobacteria with an X-ray laser,” Nat. Commun. 6, 5704 (2015).

7. B. J. Daurer, K. Okamoto, J. Bielecki, F. R. N. C. Maia, K. Mühlig, M. M. Seibert, M. F. Hantke, C. Nettelblad, W. H. Benner, M. Svenda, N. Tîmneanu, T. Ekeberg, N. D. Loh, A. Pietrini, A. Zani, A. D. Rath, D. Westphal, R. A. Kirian, S. Awel, M. O. Wiedorn, G. van der Schot, G. H. Carlsson, D. Hasse, J. A. Sellberg, A. Barty, J. Andreasson, S. Boutet, G. Williams, J. Koglin, I. Andersson, J. Hajdu, and D. S. D. Larsson, “Experimental strategies for imaging bioparticles with femtosecond hard X-ray pulses,” IUCrJ 4, 251 (2017).

8. J. D. Bozek, “AMO instrumentation for the LCLS X-ray FEL,” Eur. Phys. J. Special Top. 169, 129–132 (2009).

9. E. A. Schneidmiller and M. V. Yurkov, “Photon beam properties at the European XFEL,” Tech. rep., European XFEL (XFEL) (2011).

10. J. Becker, L. Bianco, R. Dinapoli, P. Göttlicher, H. Graafsma, D. Greiffenberg, M. Gronewald, B. H. Henrich, H. Hirsemann, S. Jack, R. Klanner, H. Krüger, A. Klyuev, S. Lange, A. Marras, A. Mozzanica, B. Schmitt, J. Schwandt, I. Sheviakov, X. Shi, U. Trunk, M. Zimmer, and J. Zhang, “High speed cameras for X-rays: AGIPD and others,” J. Instrumentation 8, C01042 (2013).

11. J. Andreasson, A. V. Martin, M. Liang, N. Timneanu, A. Aquila, F. Wang, B. Iwan, M. Svenda, T. Ekeberg, M. Hantke, J. Bielecki, D. Rolles, A. Rudenko, L. Foucar, R. Hartmann, B. Erk, B. Rudek, H. N. Chapman, J. Hajdu, and A. Barty, “Automated identification and classification of single particle serial femtosecond X-ray diffraction data,” Opt. Express 22, 2497–2510 (2014).

12. S. A. Bobkov, A. B. Teslyuk, R. P. Kurta, O. Y. Gorobtsov, O. M. Yefanov, V. A. Ilyin, R. A. Senin, and I. A. Vartanyants, “Sorting algorithms for single-particle imaging experiments at X-ray free-electron lasers,” J. Synchrotron Radiat. 22, 1345–1352 (2015).

13. C. H. Yoon, P. Schwander, C. Abergel, I. Andersson, J. Andreasson, A. Aquila, S. Bajt, M. Barthelmess, A. Barty, M. J. Bogan, C. Bostedt, J. Bozek, H. N. Chapman, J.-M. Claverie, N. Coppola, D. P. DePonte, T. Ekeberg, S. W. Epp, B. Erk, H. Fleckenstein, L. Foucar, H. Graafsma, L. Gumprecht, J. Hajdu, C. Y. Hampton, A. Hartmann, E. Hartmann, R. Hartmann, G. Hauser, H. Hirsemann, P. Holl, S. Kassemeyer, N. Kimmel, M. Kiskinova, M. Liang, N.-T. D. Loh, L. Lomb, F. R. N. C. Maia, A. V. Martin, K. Nass, E. Pedersoli, C. Reich, D. Rolles, B. Rudek, A. Rudenko, I. Schlichting, J. Schulz, M. Seibert, V. Seltzer, R. L. Shoeman, R. G. Sierra, H. Soltau, D. Starodub, J. Steinbrener, G. Stier, L. Strüder, M. Svenda, J. Ullrich, G. Weidenspointner, T. A. White, C. Wunderer, and A. Ourmazd, “Unsupervised classification of single-particle X-ray diffraction snapshots by spectral clustering,” Opt. Express 19, 16542–16549 (2011).

14. C. Yoon, “Novel algorithms in coherent diffraction imaging using x-ray free-electron lasers,” Proc. SPIE 8500, 85000H (2012).

15. H. K. N. Reddy, C. H. Yoon, A. Aquila, S. Awel, K. Ayyer, A. Barty, P. Berntsen, J. Bielecki, S. Bobkov, M. Bucher, G. A. Carini, S. Carron, H. Chapman, B. Daurer, H. DeMirci, T. Ekeberg, P. Fromme, J. Hajdu, M. F. Hantke, P. Hart, B. G. Hogue, A. Hosseinizadeh, Y. Kim, R. A. Kirian, R. P. Kurta, D. S. D. Larsson, N. Duane Loh, F. R. N. C. Maia, A. P. Mancuso, K. Mühlig, A. Munke, D. Nam, C. Nettelblad, A. Ourmazd, M. Rose, P. Schwander, M. Seibert, J. A. Sellberg, C. Song, J. C. H. Spence, M. Svenda, G. van der Schot, I. A. Vartanyants, G. J. Williams, and P. L. Xavier, “Coherent soft X-ray diffraction imaging of coliphage PR772 at the Linac coherent light source,” Sci. Data 4, 170079 (2017).

16. M. Rose, S. Bobkov, K. Ayyer, R. P. Kurta, D. Dzhigaev, Y. Y. Kim, A. J. Morgan, C. H. Yoon, D. Westphal, J. Bielecki, J. A. Sellberg, G. Williams, F. R. Maia, O. M. Yefanov, V. Ilyin, A. P. Mancuso, H. N. Chapman, B. G. Hogue, A. Aquila, A. Barty, and I. A. Vartanyants, “Single-particle imaging without symmetry constraints at an X-ray free-electron laser,” IUCrJ 5, 727–736 (2018).

17. R. P. Kurta, J. J. Donatelli, C. H. Yoon, P. Berntsen, J. Bielecki, B. J. Daurer, H. DeMirci, P. Fromme, M. F. Hantke, F. R. N. C. Maia, A. Munke, C. Nettelblad, K. Pande, H. K. N. Reddy, J. A. Sellberg, R. G. Sierra, M. Svenda, G. van der Schot, I. A. Vartanyants, G. J. Williams, P. L. Xavier, A. Aquila, P. H. Zwart, and A. P. Mancuso, “Correlations in scattered x-ray laser pulses reveal nanoscale structural features of viruses,” Phys. Rev. Lett. 119, 158102 (2017).

18. A. Hosseinizadeh, G. Mashayekhi, J. Copperman, P. Schwander, A. Dashti, R. Sepehr, R. Fung, M. Schmidt, C. H. Yoon, B. G. Hogue, G. J. Williams, A. L. Aquila, and A. Ourmazd, “Conformational landscape of a virus by single-particle x-ray scattering,” Nat. Methods 14, 877–881 (2017).

19. N. D. Loh and V. Elser, “Reconstruction algorithm for single-particle diffraction imaging experiments,” Phys. Rev. E 80, 026705 (2009).

20. B. Moghaddam, W. Wahid, and A. Pentland, “Beyond eigenfaces: probabilistic matching for face recognition,” in Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, (1998), pp. 30–35.

21. J. Burnstone and H. Yin, “Eigenlights: recovering illumination from face images,” in Intelligent Data Engineering and Automated Learning - IDEAL 2011, (Springer, 2011), pp. 490–497.

22. W. S. Yambor, B. A. Draper, and J. R. Beveridge, “Analyzing PCA-based face recognition algorithms: eigenvector selection and distance measures,” in Empirical Evaluation Methods in Computer Vision, (World Scientific, 2011).

23. A. Pentland, B. Moghaddam, and T. Starner, “View-based and modular eigenspaces for face recognition,” in 1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (1994), pp. 84–91.

24. M. Kirby and L. Sirovich, “Application of the Karhunen-Loeve procedure for the characterization of human faces,” IEEE Trans. Pattern Anal. Mach. Intell. 12, 103–108 (1990).

25. L. Sirovich and M. Kirby, “Low-dimensional procedure for the characterization of human faces,” J. Opt. Soc. Am. A 4, 519–524 (1987).

26. C. Biernacki, G. Celeux, and G. Govaert, “Assessing a mixture model for clustering with the integrated completed likelihood,” IEEE Trans. Pattern Anal. Mach. Intell. 22, 719–725 (2000).

27. C. S. Long, K. M. Chugg, and A. Polydoros, “Further results in likelihood classification of QAM signals,” in Military Communications Conference, 1994. MILCOM ’94. Conference Record, 1994 IEEE, vol. 1 (1994), pp. 57–61.

28. V. G. Chavali and C. R. C. M. da Silva, “Maximum-likelihood classification of digital amplitude-phase modulated signals in flat fading non-Gaussian channels,” IEEE Trans. Commun. 59, 2051–2056 (2011).

29. J. Lidmar, L. Mirny, and D. R. Nelson, “Virus shapes and buckling transitions in spherical shells,” Phys. Rev. E 68, 051910 (2003).

30. G. Vernizzi and M. Olvera de la Cruz, “Faceting ionic shells into icosahedra via electrostatics,” Proc. Natl. Acad. Sci. U.S.A. 104, 18382–18386 (2007).

31. T. Ekeberg, M. Svenda, C. Abergel, F. R. N. C. Maia, V. Seltzer, J.-M. Claverie, M. Hantke, O. Jönsson, C. Nettelblad, G. van der Schot, M. Liang, D. P. DePonte, A. Barty, M. M. Seibert, B. Iwan, I. Andersson, N. D. Loh, A. V. Martin, H. Chapman, C. Bostedt, J. D. Bozek, K. R. Ferguson, J. Krzywinski, S. W. Epp, D. Rolles, A. Rudenko, R. Hartmann, N. Kimmel, and J. Hajdu, “Three-dimensional reconstruction of the giant mimivirus particle with an X-ray free-electron laser,” Phys. Rev. Lett. 114, 098102 (2015).

32. M. F. Hantke, T. Ekeberg, and F. R. N. C. Maia, “Condor: a simulation tool for flash X-ray imaging,” J. Appl. Crystallogr. 49, 1356–1362 (2016).

33. T. Fawcett, “An introduction to ROC analysis,” Pattern Recognit. Lett. 27, 861–874 (2006).

34. T. Ekeberg, S. Engblom, and J. Liu, “Machine learning for ultrafast X-ray diffraction patterns on large-scale GPU clusters,” Int. J. High Perform. Comput. Appl. 29, 233–243 (2015).

35. A. Barty, R. A. Kirian, F. R. N. C. Maia, M. Hantke, C. H. Yoon, T. A. White, and H. Chapman, “Cheetah: software for high-throughput reduction and analysis of serial femtosecond X-ray diffraction data,” J. Appl. Crystallogr. 47, 1118–1131 (2014).

36. D. Powers, “Evaluation: From precision, recall and F-factor to ROC, informedness, markedness & correlation,” J. Mach. Learn. Technol. 2, 37–63 (2008).

Dataset	Size (nm)	M_data^f	Noise	Fluence Ψ	# Photons^g(×10⁵)
T^a(Training)	180	290	N/A	1	≈ 10
D^b(Noiseless)	180	1000	N/A	1	≈ 10
P^b(Shot-noise)	180	1000	Poisson	1	≈ 10
F^b(Fluence-scaled)	180	1000	Poisson	𝒰 {0.01, 1.1}^c	[0.1, 11]
S^b(Varied-size)	𝒰 {150, 210}^c	2000	Poisson	𝒰 {0.01, 1.1}^c	[0.1, 11]
X^d(Spheroid)	𝒰 {150, 210}^c	1200	Poisson	𝒰 {0.01, 1.1}^c	[0.1, 11]
T₁ⁱ(Training 1)	180	290+1	N/A	1	≈ 10
Mimivirus^e	≈ 490	198	N/A	N/A	[4.5, 34]
T₂ⁱ(Training 2)	≈ 490	1000	N/A	1	≈ 10
Raw-mimi^h	≈ 490	50712	N/A	N/A	N/A
T₃ⁱ(Training 3)	490	1000+10	N/A	1	≈ 10

Dataset	EI: Benchmark	EI: Remaining	LL: Benchmark	LL: Remaining
D (Noiseless)	0	0.036	0	0.041
P (Shot-noise)	0.031	0.053	0.031	0.063
F (Fluence-scaled)	0.042	0.062	0.043	0.074

	Classified as
	Icosahedron	Spheroid	Rejected	Total
Data: icosahedron	1000	0	0	1000
Data: spheroid	8 (9)	114 (118)	78 (73)	200

	Run 92: Single	Run 92: Other	Run 93: Single	Run 93: Other	Run 92 & 93: Single	Run 92 & 93: Other
Accepted	44	20	31	13	75	33
Rejected	7	199	7	257	14	456

	ACC=0.90	TPR=0.86	ACC=0.94	TPR=0.82	ACC=0.92	TPR=0.84
	PPV=0.68	F1=0.77	PPV=0.70	F1=0.76	PPV=0.69	F1=0.76

	Run 92: Single	Run 92: Other	Run 93: Single	Run 93: Other	Run 92 & 93: Single	Run 92 & 93: Other
Accepted	43	23	28	15	71	38
Rejected	8	196	10	255	18	451

	ACC=0.89	TPR=0.84	ACC=0.92	TPR=0.74	ACC=0.90	TPR=0.80
	PPV=0.65	F1=0.74	PPV=0.65	F1=0.69	PPV=0.65	F1=0.72
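
The summary rows of the two Accepted/Rejected tables above can be reproduced from their counts. The sketch below uses the standard definitions of accuracy (ACC), true positive rate (TPR), positive predictive value (PPV) and F1 score, treating an accepted single-particle pattern as a true positive. The function name and the returned dictionary are illustrative choices, not part of the authors' code.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion counts.

    tp: accepted single-particle patterns   fp: accepted other patterns
    fn: rejected single-particle patterns   tn: rejected other patterns
    """
    total = tp + fp + fn + tn
    tpr = tp / (tp + fn)   # recall / sensitivity
    ppv = tp / (tp + fp)   # precision
    return {
        "ACC": (tp + tn) / total,
        "TPR": tpr,
        "PPV": ppv,
        "F1": 2 * ppv * tpr / (ppv + tpr),
    }


# Combined counts (Run 92 & 93) copied from the two tables above.
first_table = classification_metrics(tp=75, fp=33, fn=14, tn=456)
second_table = classification_metrics(tp=71, fp=38, fn=18, tn=451)

for name, result in (("first", first_table), ("second", second_table)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```

For the combined runs this reproduces the printed values of both tables to two decimals. Individual entries can differ in the last digit depending on rounding: for Run 92 of the first table, PPV = 44/64 = 0.6875, which is printed as 0.68.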

Optics Express