## Abstract

Image reconstruction in fluorescence optical tomography is a three-dimensional nonlinear ill-posed problem governed by a system of partial differential equations. In this paper we demonstrate that a combination of state of the art numerical algorithms and a careful hardware optimized implementation allows to solve this large-scale inverse problem in a few seconds on standard desktop PCs with modern graphics hardware. In particular, we present methods to solve not only the forward but also the non-linear inverse problem by massively parallel programming on graphics processors. A comparison of optimized CPU and GPU implementations shows that the reconstruction can be accelerated by factors of about 15 through the use of the graphics hardware without compromising the accuracy in the reconstructed images.

© 2011 OSA

## 1. Introduction

In this paper, we study an indirect measurement technology, namely fluorescence optical tomography, which relies on the ability of fluorescent dyes (fluorophores) to absorb and re-emit light at different wavelengths. The aim of fluorescence tomography is to obtain the distribution of a fluorophore inside an object (an only indirectly accessible quantity) from optical measurements at the boundary (an easily obtainable information). The strength of this imaging modality is the availability of a large number of fluorophores which change their optical properties in dependence of its biological surrounding. Thus, they reflect physiologically interesting parameters like the pH value [1, 2] or tissue oxygenation [3].

Unfortunately, photons are massively scattered in biological tissues. Therefore, fluorescence tomography requires the accurate simulation of light propagation in highly scattering media and the solution of ill-posed nonlinear inverse problems, which are computationally intensive tasks. Fast and reliable numerical algorithms will crucially determine the success of this new imaging modality in biomedical sciences as well as in engineering applications.

A recent trend is the application of graphics processing units (GPUs) in scientific computations. In the field of optical tomographic imaging, Zhang et al [4] have reported GPU accelerated bioluminescence reconstructions and make use the proprietary package CULA by EM Photonics for the inversion of the finite element system matrix. Prakash et al [5] also use CULA from within the Matlab-framework NIRFAST to speed up the reconstruction algorithm for diffuse optical tomography. In their work, six distinct tasks are GPU-accelerated such as the solution of a linear equation system and matrix-matrix multiplications.

The previously mentioned works use the GPU as co-processor mainly, i.e. data is transferred to the graphics card where certain operations are performed in parallel. The result is copied back to the CPU then. This submission is different in the sense that we aim to perform nearly all computations on the graphics hardware without the need to transfer intermediate data between the CPU and the graphics card. Thus, the GPU is not only used for trivial multiplication tasks but also for the assembly of the forward and adjoint system matrices and the sensitivity matrix, for example. This reduces the comparatively slow data transfer between the computer and the graphics hardware to a minimum and results in much faster image reconstruction. Furthermore, all algorithms are iterative which circumvents the need for matrix inversion algorithms. Our implementation is based on the open-source library *AGILE* [6] which is available at http://www.imt.tugraz.at/research/agile-gpu-image-reconstruction-library/.

The outline of this paper is as follows: Section 2 provides background material about fluorescence tomography and describes the iterative regularization method that is used for the stable solution of the inverse problem. Section 3 deals with the discretization of the problem by finite element methods and gives details for an efficient implementation on graphics hardware. In Section 4, we define a test problem and derive estimates for the complexity and required memory of the reconstruction algorithm. The theoretical results are confirmed by numerical tests, in which we compare the performance of our algorithms on CPU and GPU hardware. Our tests illustrate that the reconstruction times can be reduced significantly (by a factor of about 15) through the use of the graphics hardware.

## 2. Methods

#### 2.1. Mathematical model

The propagation of light in highly scattering media, e.g. in biological tissue, is typically modeled by the diffusion approximation [7], which serves as a first order approximation to the radiative transport equation. Since light of two different wavelengths is employed in fluorescence tomography, the mathematical model consists of a coupled system of partial differential equations (PDE) [8], namely

*φ*,

_{i}*i*=

*x*,

*m*are the photon densities for the excitation (x) and the emission (m) light in the domain Ω covered by tissue, and the source term

*q*models a collimated light source. The parameters

*κ*and

_{i}*μ*are the diffusion and absorption coefficients of the tissue, and

_{i}*γ*is a measure incorporating the fluorophore’s extinction coefficient and its quantum yield. We assume that these parameters depend in a known form on the concentration

*c*of fluorophore in the tissue, and we write

*κ*(

_{i}*c*),

*μ*(

_{i}*c*) or

*γ*(

*c*) if this dependence needs to be emphasized.

Homogeneous Robin boundary conditions [7]

*q*.

The simplest type of boundary measurement is the photon flux at the emission wavelength which leaves the domain which is modeled by

where the function*d*allows to model the aperture and additional (linear) transfer characteristics of the detector. Other types of measurement functions are possible as well and have been used in e.g. [9–11].

The tomographic measurement process finally consists of illuminating the object with a sequence of light sources *q _{j}*,

*j*= 1,...,

*ns*, and taking measurements of the emission light with an array of detectors

*d*,

_{i}*i*= 1,...,

*nd*at the boundary. This gives rise to a measurement matrix ℳ of dimension

*nd*×

*ns*, which obviously depends on the distribution

*c*of the fluorophore, and we write ℳ(

*c*) to express this dependence.

#### 2.2. Image reconstruction

The aim of fluorescence tomography is to determine the inaccessible fluorophore distribution *c* which gave rise to the measurements ℳ* ^{δ}*—which are perturbed by measurement noise and model errors—of the emitted light. A Tikhonov-type regularization term is incorporated in the objective functional for stability reasons. Then the reconstruction problem reads

*c*onto the output ℳ(

*c*) and ${S}_{k}^{*}$ is its adjoint. The regularization parameter

*α*stabilizes the solution and is usually decreased in every iteration.

_{k}For the sake of completeness, we would like to mention other possibilities for the regularizing of the Newton iteration, e.g. the truncated Newton-CG algorithm [13], the Levenberg-Marquardt method [14], or the Newton-Landweber method [15]. Several of these approaches have been demonstrated to work well for optical tomography applications [8, 16–19]. Non-quadratic penalty terms have been studied e.g. in [20, 21].

## 3. Implementation

In the following, we discuss the the main steps of the reconstruction algorithm in more detail and provide information for an efficient implementation. In order to quantify the expected computational workload, we also derive complexity estimates for the individual steps and provide upper bounds for the memory requirements.

#### 3.1. Discretization

For the discretization of the governing partial differential equations, we consider a standard finite element method with piecewise linear basis functions for the photon fields. In our computations, we use an unstructured mesh with *N _{V}* vertices consisting of

*N*tetrahedral elements. For computational efficiency but also simplicity, the optical parameters and the fluorophore concentration

_{T}*c*are discretized by piecewise constant functions over the same mesh. The discretized version of the system (1)–(2) reads

*K*,

_{i}*M*and

_{i}*R*,

_{i}*i*∈ {

*x*,

*m*}, denote the stiffness-, mass- and boundary mass-matrices at the respective wavelength. Each column of the

*N*×

_{V}*ns*matrix

*Q*corresponds to a different excitation

*q*,

_{j}*j*= 1,...,

*ns*, and the

*ns*columns of the solution matrices

*V*and

_{x}*V*store the vector representations of the corresponding photon density fields

_{m}*φ*and

_{x}*φ*. The measurement data are given by ℳ(

_{m}*c*) =

*D*

^{⊤}

*V*, where each of the

_{m}*nd*columns of the detector matrix

*D*corresponds to one of the detectors

*d*,

_{i}*i*= 1,...,

*nd*, in (4).

The discrete sensitivity *S* = *S*(*c*) is an *nd* × *ns* × *N _{T}* tensor (respectively an (

*nd*·

*ns*) ×

*N*matrix). Its action onto a parameter perturbation

_{T}*h*∈ ℝ

*is given by*

^{NT}*h*is an arbitrary element from the

*N*dimensional parameter space, and

_{T}*W*,

_{m}*W*denote the solutions of the adjoint system

_{x}The fully discrete equivalent of the iteratively regularized Gauß-Newton method (6) has the form listed in Algorithm 1.

Before going into details of the implementation and giving complexity estimates for the individual step as well as upper bounds for the memory requirements, let us make some general remarks:

- The presented algorithms are independent of the architecture used for the actual computations, i.e. apart from some implementation details, the same algorithms are used for CPU and GPU implementations. In our numerical tests, the high-level algorithms are always controlled by the CPU, while the actual computations (e.g. matrix-vector multiplications) are executed on the CPU or GPU, respectively, inside a few
*kernels*. Only these few kernels have to be adapted to the specific hardware. The advantage of such a “minimally invasive” framework is that algorithms can be migrated gradually from one hardware to another while at the same time the high-level abstraction of the algorithms can be preserved. Such an approach seems very natural, and has been employed previously, e.g. for the simulation of fluid dynamics on CPU-GPU clusters [22]. - All compute intensive steps of the reconstruction algorithm have a high degree of locality: A major part of the operations, e.g. the assembly of the system matrices or the sensitivity, can be performed in an element-by-element fashion. This means that similar computations can executed for individual elements independently from each other. Global operations, e.g. the solution of the linear systems, rely on sparse matrix algebra. Note that each row (or column) of the sparse matrices corresponds to a vertex of the mesh, and the non-zero entries of this row stem from degrees of freedom (vertices) belonging to the patch of elements surrounding this vertex. Therefore, also sparse-matrix operations share a high degree of locality, and consequently, the overall reconstruction algorithm is very well-suited for parallelization.

#### 3.2. Initialization

The first task is to set up a multilevel mesh hierarchy: The vertex coordinates of the finest level are loaded, and the transfer operators between adjacent levels *l* and *l* + 1 are stored as sparse prolongation matrices
${P}_{l}^{l+1}$. This allows to lift a function from level *l* to the next finer level *l* + 1, i.e.
${u}_{l+1}={P}_{l}^{l+1}{u}_{l}$ denotes the linear prolongation of a piecewise linear finite element function *u _{l}* to the finer mesh where it is represented by

*u*

_{l}_{+1}. The mesh hierarchy is only required for the initialization step. The memory requirement for storing the prolongation matrices ${P}_{l}^{l+1}$ is

*O*(

*N*).

_{V}The element vertex numbers are stored in an *N _{T}* × 4 table containing the global vertex numbers for the

*N*individual elements. The 4-element array

_{T}*g*then contains the numbers of the vertices spanning the tetrahedron

^{T}*T*. During the setup step, also the vertex-to-element connectivity is required, which is simply given by ${L}_{V}^{E}={\left({L}_{E}^{V}\right)}^{\top}$. The memory requirement for these look-up tables is

*O*(

*N*+

_{V}*N*).

_{T}One additional look-up table is needed for the efficient assembly of the sparse system matrix. This matrix is stored in compressed row-storage (CRS) format. Its sparsity pattern can be created from the mesh hierarchy and the mapping (local index → global index → CRS data index) can be readily computed. We store the inverse of this mapping, which is a one-to-many table and requires *O*(*N _{T}*) memory.

Finally, we pre-compute the mass-, boundary-mass- and stiffness-matrices, *M ^{T}*,

*R*and

^{T}*K*, for unit coefficients:

^{T}*N*matrices of size 4 × 4, which are copied to the GPU memory.

_{T}#### 3.3. Sparse matrix format

The solution of linear equation systems heavily relies on fast sparse matrix-vector products. For comparison of matrix-vector kernel performance of different sparse storage layouts we refer to [23]. The results presented in this paper are based on a blocked interleaved compressed row-storage (CRS) format sketched in Fig. 1. In contrast to the conventional CRS layout, the adapted storage scheme stores the elements of the rows interleaved which makes access from different threads more efficient. As not all of the rows have the same number of non-zero entries, the shorter rows have to be padded with dummy elements. In order not to waste too much memory, this padding is done in blocks of 32 rows (a block-size of two is used in Fig. 1 to illustrate the technique). Further details about this format can be found in [24].

#### 3.4. Assembly of the system matrices

When assembling of the system matrix, the individual element contributions have to be summed. In a straightforward implementation this would result in non-local random write-access on the output matrix which is costly on graphics hardware. However, due to the piecewise constant discretization of the coefficients *κ*, *μ*, *ρ* and *c*, the contribution of an element *T* to the excitation and emission system is given as

*K*,

^{T}*M*and

^{T}*R*are scaled with the element’s optical parameters and the output is stored in a vector

^{T}*H*with length (4 · 4 ·

*N*). In the second step, we sum for every position in the CRS matrix the corresponding elements from

_{T}*H*. In principle, this summation can be done by multiplying a suitable sparse matrix with

*H*. However, as all non-zero elements in this matrix are 1, we implemented a specialized GPU kernel for this task, which simply sums the elements whose indices are stored in a look-up table. By using such a two-step scheme, we avoid non-local write-access and need non-local read-access only for the second summation step, which is implemented using the GPU’s texture cache. Figure 2 illustrates this procedure. As the construction of the local element matrices does only require local memory access, it can be done fully in parallel which provides almost optimal performance on both, CPU and GPU processors. The total complexity of this step is

*O*(

*N*).

_{T}#### 3.5. Solution of the linear systems

For the solution of the linear systems *A _{x}V_{x}* =

*Q*and

*A*=

_{m}V_{m}*G*(

*c*)

*V*for the forward problem, respectively,

_{x}*A*=

_{m}W_{m}*D*and

*A*=

_{x}W_{x}*G*(

*c*)

*W*for the adjoint problem, we employ a preconditioned blocked conjugate gradient method (PBCG): instead of solving for one right hand side (i.e. a column in

_{m}*Q*) after another, we solve simultaneously for all right hand sides (blocked). The algorithm uses the PCG framework provided by the

*AGILE*library. Thus, only a new kernel had to be coded for the blocked matrix-vector-product.

To accelerate the convergence, a single *V* -cycle of a geometric multigrid preconditioner [25] is used. On the coarsest mesh level, an exact solve is performed by a blocked conjugate gradient method (BCG), which is similar to PBCG but without further preconditioning. A single BCG iteration is also used in the smoothing step. The application of one multigride *V* -cycle has linear complexity. If the right-hand side has *ns* columns, one application of the multigrid preconditioner requires *O*(*ns* ·*N _{V}*) algebraic operations.

Note that all basic operations in the conjugate gradient method and the multigrid preconditioner (e.g. the matrix-vector products or simpler vector operations) can easily be executed on CPUs or GPUs, respectively. Due to the preconditioning, the typical iteration numbers required for convergence of the PBCG algorithm are 10–15. Therefore, the total complexity of the solution of the forward and adjoint problems is *O*(*ns* ·*N _{V}*) and

*O*(

*nd*·

*N*), respectively.

_{V}#### 3.6. Assembly of the sensitivity

For the assembly of the sensitivity matrix, we use the representation (9). As *h* is a piecewise constant function, the mass-*M*(*h*) and stiffness-matrices *K*(*h*) are just the element matrices *M ^{T}* and

*K*for the

^{T}*T*-th finite element scaled by the derivatives of the optical parameters. Since the computations for the individual elements are independent from each other, this loop can be executed in parallel.

The pseudo-code for the GPU kernel used to assemble the sensitivity matrix is listed in Algorithm 2. The threads are started in blocks of dimension (**x**, **y**, **z**) = (4 × 4 × 32). The size of (4 × 4) is due to the size of the element-wise mass- and stiffness-matrices on a tetrahedral mesh. The last dimension is the number of elements which are processed in parallel and has been set to 32 to maximize the number of threads per block.

For readability, all data structures which exist once for every element being processed carry a superscript **z** instead of writing them as an array. Thus, *n*** ^{z}** is actually an array with 32 entries,

*d*

**[**

^{z}**x**] is a 2D array with size (32 × 4) and

*k*

**(**

^{z}**x,y**) is a 3D array with size (32 × 4 × 4), for example.

The number of the element which is processed by a thread can be inferred from the index of the thread block as shown in line 1. In line 2, *g ^{T}* denotes the element-to-vertex connectivity of the

*T*-th finite element, which is cached into the vector

*d*. The algorithm caches the element’s mass- and stiffness-matrices (lines 4–6) and the forward and ajoint solutions (lines 9–10 and 12–13). Line 15 is an abbreviation for the computation of the sensitivity contribution according to Eq. (9). These contributions are summed in parallel (line 16) and the result is stored in

*S*.

Assembling the sensitivity matrix requires *O*(*ns* ·*nd* ·*N _{T}*) operations and the same amount of memory. Hence, this is the most time consuming and memory intensive step in the current algorithm. We would also like to note that the assembly of the sensitivity matrices

*S*can be avoided completely if it is applied on-the-fly when computing the matrix-vector products; cf [19] and the remarks in section 4.6.

_{k}#### 3.7. Solution of the Gauß-Newton system

Formally, the Gauß-Newton matrix
${S}_{k}^{*}{S}_{k}+{\alpha}_{k}I$ is a dense *N _{T}* ×

*N*matrix, which would be expensive to compute and almost impossible to store. Since by construction ${S}_{k}^{*}{S}_{k}+{\alpha}_{k}I$ is symmetric and positive definite, the Gauß-Newton system can be solved efficiently by the method of conjugate gradients, which requires only the multiplication with

_{T}*S*and ${S}_{k}^{*}$. In this manner, the complexity of multiplication with the Gauß-Newton matrix is given by

_{k}*O*(

*ns*·

*nd*·

*N*) algebraic operations, whereas a multiplication with the pre-multiplied matrix ${S}_{k}^{*}{S}_{k}+{\alpha}_{k}I$ would be

_{T}*O*(

*N*

_{T2}).

For the iterative solution of the Gauß-Newton systems we utilize hardware optimized dense matrix-vector kernels, i.e. *LAPACK-BLAS* [26] for the CPU and a custom implementation based on *AGILE* for the GPU version.

## 4. Results

#### 4.1. Model problem

For our numerical tests, we considered a mouse phantom based on the Digimouse model [27]. The measurement setup consisted of *ns* = 24 sources and *nd* = 24 detectors, which were arranged on three rings around the homogeneous torso. Two spherical fluorophore inclusions with a diameter of 5 mm were placed at the height of the middle optode ring inside the body, and the aim of our test was to reconstruct these inclusions. The setup is depicted in Fig. 3.

For our simulations we used realistic optical tissue properties which were gathered from literature [28–31], cf Table 1. These parameters were kept constant during simulation and reconstruction. In practice, the estimation of those background parameters would require additional a-priori knowledge or measurements.

The data used in our numerical tests were generated on a finer mesh with 200835 elements and additionally perturbed by 1% Gaussian noise relative to the largest measurement datum.

For the finite element discretization, we used a hierarchy of three nested tetrahedral meshes, which were generated with the open source mesh generator NETGEN [32]. The three meshes consisted of 2720, 21 760, and 174080 elements and 853, 5103, and 34 677 vertices, respectively.

#### 4.2. Choice of parameters

The regularization parameters *α _{k}* and the stopping index maxit in the Gauß-Newton algorithm were chosen as follows: We conducted a series of numerical tests for ten similar phantoms, and determined a set of parameters

*α*that worked reasonably well for all samples. These parameters were then used for the actual reconstruction. For the results presented here, a geometric sequence

_{k}*α*=

_{k}*qα*

_{0}with

*α*

_{0}= 1 and

*q*= 0.2 was used. The Gauß-Newton iteration was stopped when the regularization parameter reached the pre-defined threshold

*α*= 1 × 10

_{min}^{−5}, which corresponds to maxit=8 Newton iterations.

#### 4.3. Numerical results

All computations were executed on an AMD Phenom 9950 processor. Both, the CPU code and the GPU host code (i.e. not the code running on the graphics hardware ifself) are single-threaded. The compute intensive operations (e.g. the matrix-vector multiplications and the assembly of the sensitivity matrix) were realized as compute kernels, which were implemented for both, CPU and GPU hardware. For our GPU tests, we used an NVIDIA GTX 480 graphics card and NVIDIA’s CUDA toolkit [33] for programming. A typical result of the image reconstruction is shown in Fig. 4.

#### 4.4. Performance analysis

In Table 2, we provide a detailed performance analysis of the reconstruction algorithm and compare the computation times of the CPU and GPU implementations. For both architectures we also compared the reconstruction with and without multigrid preconditioning. For all relevant tasks, the execution on the GPU results in a significant speed-up. An exception is the assembly of the system matrices on the coarsest level, where the number of elements is so small that the acceleration achieved on the graphics card is spoiled by the latency for starting the kernels on the GPU. However, the time spent for this part of the computation is negligible. The observed speed-up by factors of about 10 is in good agreement to the accelerations of sparse matrix-vector products [34]. (Note that the factor 10 is mainly due to the approximately 10 times larger memory bandwidth provided by GPUs).

#### 4.5. Accuracy

Only single-precision arithmetic executes very fast on current commodity graphics hardware, i.e. use of double-precision operations results in increased memory requirements and a tremendous performance drop. In order to obtain insight into the errors introduced by using single-precision operations, we conducted the tests in single-precision on CPU and GPU hardware, and also in double-precision on the CPU.

First, the outcome of the forward operator (i.e. the simulated measurement data) is compared for the true fluorescence distribution on the mesh with with 200835 elements. The errors are related to the largest measurement datum such that the numerical error can be compared to the data noise easily. Given the single- and double-precision CPU and GPU measurement data ℳ* _{CS}*, ℳ

*and ℳ*

_{CD}*the discrepancies are*

_{GS}To give evidence for the similarity of the resultant images, the relative errors based on the *L*^{2} norm
${\Vert c\Vert}_{{L}^{2}}^{2}:={\int}_{\mathrm{\Omega}}c{(x)}^{2}dx$ of the reconstructed concentrations are compared. Denoting the reconstructed fluorophore concentration from the single- and double-precision CPU and GPU implementation with *c _{CS}*,

*c*and

_{CD}*c*, these relative errors are

_{GS}#### 4.6. Memory requirements

In Table 3, we list the memory requirements for the essential data structures on the finest finite element mesh. The additional memory required for the coarse level matrices required for the multigrid preconditioner is negligible.

In our implementation, the sensitivity matrix requires most of the GPU memory. To avoid excessive memory consumption, the product of the sensitivity matrix with a vector can also be computed on-the-fly, without the need to store the whole sensitivity. Also the complexity of the multiplication with the sensitivity (now *O*(*ns* ·*nd* ·*N _{T}*)) can be reduced, i.e. it has been demonstrated in [19] that the application of the sensitivity can be realized in

*O*(

*ns*·

*N*) operations. Such an approach, however, requires the solution of the adjoint problems for every multiplication with the sensitivity matrix. Comparing with the computation times in Table 2, this would drastically slow down the computations in practice, at least for the values of

_{V}*ns*,

*nd*, and

*N*,

_{T}*N*used in our test study. If the number of sources or detectors becomes very large (e.g. if a camera is used as detector), such an alternative might become favorable.

_{V}## 5. Discussion

In this article, we considered the image reconstruction in fluorescence tomography by iterative regularization methods of Newton-type and a discretization by finite element methods. We demonstrated that a combination of state of the art numerical methods, the utilization of modern hardware, and a hardware optimized implementation allows to solve this large-scale nonlinear inverse problem within a few seconds on standard desktop PCs, without compromising accuracy. In particular, the extensive use of the GPU’s computing power resulted in a significant speed-up of the reconstruction process by factors of approximately 10–15. Such a degree of acceleration is in good agreement to other work [34, 35].

The use of unstructured meshes for our finite element simulations provides high flexibility with respect to geometry and facilitates the mesh generation process. A further acceleration of the computations could be achieved, however, by utilizing structured grids, which lead to structured sparsity patterns of the system matrices and allow faster access of global memory, cf [36].

In contrast to previous work [4, 5], we have presented algorithms which are fully iterative in nature and do not require matrix inversion libraries such as CULA, for example. Additionally, we ported the whole reconstruction algorithm to the graphics card and use the CPU only as controller which avoids expensive memory transfer between the host and the GPU.

The pre-computation of the mass- and stiffness-matrices of the individual elements and the creation of prolongation and restriction matrices for the mesh hierarchy are the only tasks which are still executed on the CPU. However, these setup steps have to be computed only once and could in principle also be stored along with the mesh data.

To achieve efficient code on graphics hardware, care has to be taken when accessing global memory. Memory access is most efficient, if a sequence of threads reads or writes to a sequential memory area (*coalesced* memory access) [33]. Therefore, the photon field matrices *V _{x}*,

*V*and

_{m}*W*,

_{x}*W*, but also the associated source and detector matrices

_{m}*Q*and

*D*are stored in row-major order. As one column of these matrices belongs to one photon field, this means that the elements of a given photon field are interleaved. In combination with the blocked conjugate gradient method, internal performance measurements have shown that this memory layout provides speed-ups by a factor of roughly 2–3 on CPUs and up to 10 on GPUs.

The lack of fast double-precision floating point operations on current graphics hardware limits the accuracy of the reconstructions to some extent. In our numerical examples, the relative errors in the forward simulation due to the single-precision arithmetic was less then 1% for both the CPU and GPU reconstructions. Thus, the numerical errors can be expected to be dominated by measurement noise and model errors and single-precision arithmetic may well suffice for this kind of (severely ill-posed) inverse problems.

The choice of an appropriate sequence of regularization parameter *α _{k}* crucially determines the performance of the iterative reconstruction algorithm. Following the analysis of the iteratively regularized Gauß-Newton method [12, 15],

*α*should decay exponentially, e.g.

_{k}*α*=

_{k}*α*

_{0}

*q*for some 0 <

^{k}*q*< 1. The choice of

*α*

_{0}and

*q*, however, depends on the problem. While in principle, appropriate values can be found by a careful analysis of the problem under investigation, a more practical way to tune these parameters is to perform a sequence of reconstructions for known solutions and select regularization parameter that work well for all cases in this training set. The same set of parameters can then be used for similar problems.

The choice of the stopping index maxit is another crucial ingredient for the numerical algorithm. If *α*_{0} and *q* have been fixed, and *α _{k}* =

*α*

_{0}

*q*, then this amounts to choosing a minimal regularization parameter

^{k}*α*:=

_{min}*α*

_{maxit}. A rigorous way to choose

*α*(respectively maxit) is provided by Morozov’s discrepancy principle [37, 38], i.e. the reconstruction algorithm is stopped, as soon as ||ℳ(

_{min}*c*) – ℳ

_{k}*|| ∼*

^{δ}*δ*, where

*δ*is a measure for the measurement and model errors. Another rigorous way would be to choose

*α*∼

*δ*. Both choices however require the knowledge of the level

*δ*of the perturbations, which is usually not known in practice. A heuristic way to choose

*α*without knowledge of

_{min}*δ*is provided by the L-curve method [38,39] or generalized cross-validation [40]. If a series of similar problems is considered, then maxit (or

*α*) can again be obtained by “training” the algorithm on a small data set with known solutions. This strategy is probably the most practical one and also the method we utilize in our experiments.

_{min}In our opinion, the fast and reliable image reconstruction is a core requirement for the acceptance of new imaging modalities like fluorescence tomography in pre-clinical and also engineering applications. GPU-accelerated algorithms can result in reconstruction times in the order of a few seconds, which allows the inspection of the results on-the-fly and, thus, certainly facilitates the adoption of this technique in a non-research environment.

## Acknowledgments

This work is supported by the project F3207-N18 granted by the Austrian Science Fund (FWF).

## References and links

**1. **S. Mordon, V. Maunoury, J. M. Devoisselle, Y. Abbas, and D. Coustaud, “Characterization of tumorous and normal tissue using a pH-sensitive fluorescence indicator (5,6-carboxyfluorescein) in vivo,” J. Photochem. Photobiol. B **13**, 307–314 (1992). [CrossRef] [PubMed]

**2. **I. Gannot, I. Ron, F. Hekmat, V. Chernomordik, and A. Gandjbakhche, “Functional optical detection based on pH dependent fluorescence lifetime,” Lasers Surg. Med. **35**, 342–348 (2004). [CrossRef] [PubMed]

**3. **E. Shives, Y. Xu, and H. Jiang, “Fluorescence lifetime tomography of turbid media based on an oxygen-sensitive dye,” Opt. Express **10**, 1557–1562 (2002). [PubMed]

**4. **B. Zhang, X. Yang, F. Yang, X. Yang, C. Qin, D. Han, X. Ma, K. Liu, and J. Tian, “The CUBLAS and CULA based GPU acceleration of adaptive finite element framework for bioluminescence tomography,” Opt. Express **18**, 20201–20214 (2010). [CrossRef] [PubMed]

**5. **J. Prakash, V. Chandrasekharan, V. Upendra, and P. K. Yalavarthy, “Accelerating frequency-domain diffuse optical tomographic image reconstruction using graphics processing units,” J. Biomed. Opt. **15**, 066009 (2010). [CrossRef]

**6. **F. Knoll, M. Freiberger, K. Bredies, and R. Stollberger, “AGILE: An open source library for image reconstruction using graphics card hardware acceleration,” in “Proceedings of the 19th Scientific Meeting and Exhibition of ISMRM, Montreal, CA,” (2011), p. 2554.

**7. **S. R. Arridge, “Optical tomography in medical imaging,” Inv. Probl. **15**, R41–R93 (1999). [CrossRef]

**8. **A. Joshi, W. Bangerth, and W. M. Sevick-Muraca, “Adaptive finite element based tomography for fluorescence optical imaging in tissue,” Opt. Express **12**, 5402–5417 (2004). [CrossRef] [PubMed]

**9. **S. R. Arridge and M. Schweiger, “The use of multiple data types in time-resolved optical absorption and scattering tomography (TOAST),” in “Mathematical Methods in Medical Imaging II,”, D. C. Wilson and J. N. Wilson, eds. (1993), *Proc. SPIE*2035, pp. 218–229. [CrossRef]

**10. **E. M. Sevick and B. Chance, “Photon migration in a model of the head measured using time and frequency domain techniques: potentials of spectroscopy and imaging,” in “Time-Resolved Spectroscopy and Imaging of Tissues,”, B. Chance and A. Katzir, eds. (1991), Proc. SPIE **1431**, pp. 84–96. [CrossRef]

**11. **S. R. Arridge, “Photon-measurement density functions. part I: Analytical forms,” Appl. Opt. **34**, 7395–7409 (1995). [CrossRef] [PubMed]

**12. **A. Bakushinsky, “The problem of the convergence of the iteratively regularized Gauss-Newton method,” Comput. Math. Math. Phys. **32**, 1353–1359 (1992).

**13. **M. Hanke, “Regularizing properties of a truncated Newton-CG algorithm for nonlinear inverse problems,” Numer. Func. Anal. Optim. **18**, 971–993 (1997). [CrossRef]

**14. **D. Marquardt, “An algorithm for least-squares estimation of nonlinear parameters,” SIAM J. Appl. Math. **11**, 431–441 (1963). [CrossRef]

**15. **B. Kaltenbacher, “Some Newton-type methods for the regularization of nonlinear ill-posed problems,” Inv. Probl. **13**, 729–753 (1997). [CrossRef]

**16. **H. Jiang, “Frequency-domain fluorescent diffusion tomography: A finite-element-based algorithm and simulations,” Appl. Opt. **37**, 5337–5343 (1998). [CrossRef]

**17. **R. Roy and E. M. Sevick-Muraca, “Truncated Newtons optimization scheme for absorption and fluorescence optical tomography: Part I theory and formulation,” Opt. Express **4**, 353–371 (1999). [CrossRef] [PubMed]

**18. **M. Schweiger, S. R. Arridge, and I. Nissilä, “Gauss–Newton method of image reconstruction in diffuse optical tomography,” Phys. Med. Biol. **50**, 2365–2386 (2005). [CrossRef] [PubMed]

**19. **H. Egger and M. Schlottbom, “Efficient reliable image reconstruction schemes for diffuse optical tomography,” Inv. Probl. Sci. Eng. **19**, 155–180 (2011). [CrossRef]

**20. **A. Godavarty, E. M. Sevick-Muraca, and M. J. Eppstein, “Three-dimensional fluorescence lifetime tomography,” Med. Phys. **32**, 992–1000 (2005). [CrossRef] [PubMed]

**21. **M. Freiberger, H. Egger, and H. Scharfetter, “Nonlinear inversion schemes for fluorescence optical tomography,” IEEE Trans. Biomed. Eng. **57**, 2723–2729 (2010). [CrossRef]

**22. **D. Göddeke, C. Becker, and S. Turek, “Integrating GPUs as fast co–processors into the parallel FE package FEAST,” in “19th Symposium Simulations Technique,”, M. Becker and H. Szczerbicka, eds., *Frontiers in Simulation* (SCS Publishing House, 2006), pp. 277–282.

**23. **M. M. Baskaran and R. Bordawekar, “Optimizing sparse matrix-vector multiplication on GPUs,” IBM Technical Report RC24704, IBM Ltd. (2009).

**24. **M. Liebmann, “Efficient PDE solvers on modern hardware with applications in medical and technical sciences,” Ph.D. thesis (University of Graz, 2009).

**25. **W. L. Briggs, V. E. Henson, and S. F. McCormick, *A multigrid tutorial* (SIAM, 2000), 2nd ed. [CrossRef]

**26. **E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, *LAPACK Users’ Guide* (SIAM, 1999), 3rd ed. [CrossRef]

**27. **B. Dogdas, D. Stout, A. F. Chatziiouannou, and R. M. Leahy, “Digimouse: a 3D whole body mouse atlas from CT and cryosection data,” Phys. Med. Biol. **52**, 577–587 (2007). [CrossRef] [PubMed]

**28. **G. Alexandrakis, F. R. Rannou, and A. F. Chatziioannou, “Tomographic bioluminescence imaging by use of a combined optical-PET (OPET) system: a computer simulation feasibility study,” Phys. Med. Biol. **50**, 4225–4241 (2005). [CrossRef] [PubMed]

**29. **M. Keijzer, W. M. Star, and P. R. M. Storchi, “Optical diffusion in layered media,” Appl. Opt. **27**, 1820–1824 (1988). [CrossRef] [PubMed]

**30. **A. Joshi, “Adaptive finite element methods for fluorescence enhanced optical tomography,” Ph.D. thesis (Texas A&M University, 2005).

**31. **M. L. Landsman, G. Kwant, G. A. Mook, and W. G. Zijlstra, “Light-absorbing properties, stability, and spectral stabilization of indocyanine green,” J. Appl. Physiol. **40**, 575–583 (1976). [PubMed]

**32. **J. Schöberl, “NETGEN - an advancing front 2D/3D-mesh generator based on abstract rules,” Comput. Vis. Sci. **1**, 41–52 (1997). [CrossRef]

**33. **NVIDIA, *NVIDIA CUDA Programming Guide 2.0* (NVIDIA Cooperation, 2008).

**34. **N. Bell and M. Garland, “Implementing sparse matrix-vector multiplication on throughput-oriented processors,” in “SC ’09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis,” (ACM, 2009), pp. 1–11.

**35. **D. Göddeke, H. Wobker, R. Strzodka, J. Mohd-Yusof, P. S. McCormick, and S. Turek, “Co-processor acceleration of an unmodified parallel solid mechanics code with FEASTGPU,” Int. J. Comput. Sci. Eng. **4**, 254–269 (2009). [CrossRef]

**36. **M. Köster, D. Göddeke, H. Wobker, and S. Turek, “How to gain speedups of 1000 on single processors with fast FEM solvers – benchmarking numerical and computational efficiency,” Tech. rep., Fakultät für Mathematik, TU Dortmund (2008). Ergebnisberichte des Instituts für Angewandte Mathematik, Nummer 382.

**37. **V. A. Morozov, “On the solution of functional equations by the method of regularization,” Soviet Math. Dokl. **7**, 414–417 (1966).

**38. **H. W. Engl, M. Hanke, and A. Neubauer, *Regularization of Inverse Problems* (Kluwer, 1996).

**39. **P. C. Hansen, *Rank-Defficient and Discrete Ill-Posed Problems* (SIAM, 1998). [CrossRef]

**40. **G. Wahba, *Spline Models for Observational Data* (SIAM, 1990).