## Abstract

In this paper, a new and convenient calibration algorithm is proposed for unsynchronized camera networks with a large capture volume. The proposed method provides a simple and accurate means of calibration using a small 3D reference object. Moreover, since the inaccuracy of the object is also compensated simultaneously, the manufacturing cost can be decreased. The extrinsic and intrinsic parameters are recovered simultaneously by capturing an object placed arbitrarily in different locations in the capture volume. The proposed method first resolves the problem linearly by factorizing projection matrices into the camera and the object pose parameters. Due to the multi-view constraints imposed on factorization, consistency of the rigid transformations among cameras and objects can be imposed. These consistent estimation results can be further refined using a non-linear optimization process. The proposed algorithm is evaluated via simulated and real experiments in order to verify that it is more efficient than previous methods.

© 2012 OSA

## 1. Introduction

To perform vision-based tasks such as 3D reconstruction and tracking using camera networks, the cameras must be calibrated in advance. To obtain accurate calibration results, the first method developed consists in setting up a calibration object with great accuracy [1]. In cases where the image capture volume is large, the calibration object should also be as large as possible. This is due to the fact that the points on the object should cover the volume so as to provide enough reference information. However, in practice it is difficult to handle and maintain a large calibration object; while manufacturing a large object with accuracy is both expensive and poses a difficult problem to resolve.

To avoid the problem of having to use a 3D calibration object, a method using a simple 2D object was proposed in [2, 3]. In practice, this method is very handy where a small capture volume is involved. However, certain drawbacks remain when the capture volume is increased. The large 2D object should be flat and must be moved by hand to obtain multiple images so as to make calibration possible. Moreover, for camera networks in which the cameras are arranged around the capture volume, it is impossible for the cameras on the opposite side to see the reference points concurrently when using a 2D object. This is one of the disadvantages of using a 2D object compared to using a 3D object. To acquire good relative extrinsic parameters between the cameras, it is desirable to obtain as many reference points observed in a common coordinate system as possible.

Svoboda *et al.* suggested a more convenient method in which a freely moving bright spot is used as the calibration object [4]. This method, which employs the self-calibration technique, requires electronically synchronized cameras that can capture the multiple images simultaneously. However, since the cameras may have different shutter lag and no external trigger input for *Genlock*, this requirement cannot always be guaranteed. If synchronization is not ensured for the method, then a large number of still images should be captured, as in *stop-motion* animation, which is a very difficult task. Consequently, such a method is unsuitable for unsynchronized systems, so in this case we would have no choice but to use large 2D or 3D reference objects, as in previous methods. Moreover, methods based on the self-calibration technique require a great deal of overlap among the many cameras to ensure stable calibration results. This implies the need for densely distributed cameras. The self-calibration technique also suffers from critical configurations of cameras and reference points, for which self-calibration is not possible [1].

Many methods adopting the idea of using moving bright spots have been suggested for the calibration of sparsely distributed camera networks. Kurillo *et al.* and Medeiros *et al.* each proposed a method based on pairwise calibration [5, 6]. They used a rod of known length with LED lights at each end to establish a consistent scale for sparsely distributed cameras. Chen *et al.* used an iterative approach to refine the calibration results from unsynchronized cameras [7]. However, all these methods assume that the intrinsic parameters are given *a priori*, which contributes to stable calibration results for sparsely distributed cameras.

Sinha and Pollefeys developed a multi-camera calibration method using the motion of silhouettes [8]. This method makes use of discrete sampling on the silhouette boundary of a subject to find the *frontier points*. The latter can be used to determine the epipolar geometry between any two views. This elegant approach has the advantage of not using any calibration object. However, many frontier points should be spread over the images (See Fig.8 in [8]) to obtain reliable results. These can be obtained when a subject moves widely over the capture volume. However, if the cameras are unsynchronized, the subject must act as it would in a *stop-motion* animation. Although this work also suggested a scheme for solving the problem of unsynchronized video streams, an approximate interpolation strategy was used to treat sub-frame discrepancy, and it can only be applied when the subject’s inter-frame motion is small.

Another method which also uses object silhouettes, by introducing an optimization process over a continuous space, also has been suggested [9]. Although this method alleviates the above requirements concerning the distribution of the *frontier points*, it requires good initialization of the camera parameters.

In this paper, we propose a new calibration algorithm using a small 3D reference object for camera networks. The purpose is to overcome the problem stated above and to eliminate the need for large 2D or 3D reference objects in the case of unsynchronized camera systems. The method is very useful in that it is easier to manufacture, move, and maintain a small reference object. The extrinsic and intrinsic parameters are recovered simultaneously by capturing the object, which is placed arbitrarily in different locations, in the capture volume. The calibration object is maneuvered so that the reference points on the object can fully cover the capture volume. The relative pose parameters between the objects are obtained simultaneously with the camera parameters. Thus the object acts as a large virtual calibration object, and the proposed method provides a simple and accurate means of calibrating camera networks. Moreover, since the proposed method can compensate the inaccuracy of the object, the manufacturing cost of the 3D reference object can be reduced. The radial lens distortion is also calibrated simultaneously using a similar method to [2].

We also think that if the reference object has many feature points on the surface and the object is positioned in many locations, many corresponding image points can be obtained and these can also be used as the input of the algorithm of [4]. In this paper, this issue is also considered in an experimental comparison, which shows that the previous methods cannot obtain stable estimation results in the case of sparsely distributed cameras compared with the proposed method.

## 2. Related works

To calibrate camera networks, Kassebaum *et al.* proposed a method of placing a 3D calibration object in only one position in each pairwise view overlap [10]. They assumed that the intrinsic parameters were given by a separate estimation method. Although the method can obtain reasonable results by virtue of the given parameters, the need for a large object for accuracy cannot be avoided. Tests are presented in this paper to show that the results obtained with that approach are substantially less accurate than the ones achieved with our algorithm.

Most self-calibration methods based on the factorization approach decompose a point measurement matrix into the camera and point parameters [4, 11–14]. The proposed method newly suggests a factorization-based approach which decomposes projection matrices corresponding to each object location into camera and object parameters. The proposed method is inspired by the works of [3, 15, 16], in which the measurement matrix of primitives, e.g., rectangles or parallelepipeds, was decomposed. The proposed method applies this type of work to multi-camera calibration using a small reference object. Since the projection matrices have more geometric information than points, it is very helpful in practice for sparsely distributed cameras compared to the point-based self-calibration technique.

The point-based factorization method used an iterative approach to estimate the scale factors corresponding to the projective depths of each point [4, 12]. However, in the proposed framework, very simple approaches can be used to estimate the scale factors linearly. Two alternative approaches are introduced. One of them was originated from the work of Wilczkowiak *et al.* [16]. In that work, they also needed a rank-3 factorization as in our case to decompose the measurement matrix of parallelepipeds. In this paper, we analyze the two approaches and suggest which one is better in terms of accuracy.

The work most closely related with the proposed method is that of Ueshiba and Tomita [3]. However, this method has the disadvantage of using a 2D reference object as mentioned above and, consequently, has many missing entries in the measurement matrix when used in camera network environments. They did not suggest any method to resolve this problem as in [13, 14], and only performed experiments for a small capture volume. However, in the proposed method using a 3D object, a simple approach is introduced to resolve this problem.

## 3. Preliminaries

#### 3.1. Camera parameterization

The camera is represented using a pinhole model. The projection model of a point **X** in the 3D world coordinate system to a point **x** in the 2D image of the *i*th camera is expressed as follows:

**x**and

**X**are homogeneous coordinates of the points and ’≅’ indicates equality up to scale.

**K**

*is the camera’s intrinsic matrix, given by [1]. The rotation matrix*

_{i}**R**

*and the vector*

_{i}**t**

*represent the camera’s orientation and position, respectively. The 3 × 4 matrix*

_{i}**M**

*encapsulates the camera’s intrinsic and extrinsic parameters.*

_{i}#### 3.2. Object pose parameterization

Let **X**_{ref} be the homogeneous coordinates of a reference point with respect to the coordinate system attached to the 3D object. When the object is located at the *j*th position on the world coordinate system, the point is represented as follows:

**S**

*represents the object’s orientation and the vector*

^{j}**v**

*its position.*

^{j}## 4. Image measurements

#### 4.1. Projection matrix

Consider the projection of a reference point into a camera image plane. Using the results from Section 3, the projection of the reference points in the image is:

The 3 × 4 matrix
${\mathbf{P}}_{i}^{j}$ will be called the *projection matrix*. It represents a perspective projection that maps the reference points on the object located at the *j*th position onto the corresponding imaged points of the *i*th camera. This is illustrated in Fig 1. The rotation matrix **R**_{i}**S*** ^{j}* and the vector

**R**

_{i}**v**

*+*

^{j}**t**

*represent the*

_{i}*i*th camera’s rotation and position with respect to the coordinate system attached to the

*j*th object.

When at least six reference points in general position on the object can be viewed, the projection matrix can be computed up to scale [17]. Assume that the object is positioned in *j*th location and can be viewed from the *i*th camera. Let
${\mathbf{X}}_{\text{ref}}^{k}$ be the *k*th point on the reference object with respect to the coordinate system attached to the object. Let
${\tilde{\mathbf{x}}}_{i}^{{j}_{k}}\left(={\left[u,v\right]}^{T}\right)$ be the corresponding point measured from the captured image. For each point correspondence
${\mathbf{X}}_{\text{ref}}^{k}\leftrightarrow {\tilde{\mathbf{x}}}_{i}^{{j}_{k}}$, we can derive a relationship

*l*th row of ${\mathbf{P}}_{i}^{j}$. From set of

*n*point correspondences, a 2

*n*× 12 matrix Λ can be obtained by stacking up the Eq. (4). The projection matrix is computed up to scale by solving the set of equations

The parameters of the projection matrix are further refined by minimizing the re-projection errors defined as follows:

*i*th camera when the object is placed in the

*j*th location, and ${\widehat{\mathbf{x}}}_{i}^{{j}_{k}}$ represents the image points synthesized with ${\mathbf{P}}_{i}^{j}$ and ${\mathbf{X}}_{\text{ref}}^{k}$. This non-linear minimization problem is solved with the Levenberg-Marquardt algorithm initialized with the results from Eq. (5).

#### 4.2. Measurement matrix

Let us now consider the case in which an object is located at *n* different locations and seen by *m* cameras. Let
${\tilde{\mathbf{P}}}_{i}^{j}$ be the estimation result of the projection matrix associated with the *i*th camera and the *j*th object pose and
${\lambda}_{i}^{j}$ a scale factor such that the following equality can be written:

We may gather the estimated projection matrix for all *m* cameras and *n* object poses into the following single matrix:

**W̃**will be called the

*measurement matrix*. When the scale factors are recovered, the measurement matrix can be factorized as follows:

## 5. Parameter estimation

#### 5.1. Rescaling the measurement matrix (Method 1)

In this section, we will describe how to solve the scale factor problem referred to in Section 4.2. Let
${\tilde{\mathbf{H}}}_{i}^{j}$ be the leading 3 × 3 sub-matrix of
${\tilde{\mathbf{P}}}_{i}^{j}$, which can be written as
${\tilde{\mathbf{H}}}_{i}^{j}\cong {\mathbf{K}}_{i}{\mathbf{R}}_{i}{\mathbf{S}}^{j}$. Assume that, for alternative scale factors
$\left\{{\rho}_{i}^{j}\right\}$, the following *reduced measurement* matrix can be factorized as follows:

If {*a _{i}*} have the values such that

*i*’s, then From this observation, we can see that if all the ${\tilde{\mathbf{H}}}_{i}^{j}$’s are rescaled to have unit determinants, the above factorization is possible. Then, the alternative scale factors can be determined as

#### 5.2. Rescaling the measurement matrix (Method 2)

In fact, after obtaining
${\tilde{\mathbf{P}}}_{i}^{j}$, we can acquire
$\left\{{\lambda}_{i}^{j}\right\}$ by RQ-decomposition of
${\tilde{\mathbf{H}}}_{i}^{j}$’s. Assume that
${\tilde{\mathbf{H}}}_{i}^{j}$ is decomposed as
${\tilde{\mathbf{H}}}_{i}^{j}=\mathrm{\Sigma}\mathrm{\Theta}$, where Σ is an upper triangular matrix and Θ is an orthogonal matrix whose determinant is 1. Since **K*** _{i}* is an upper triangular matrix and (

**R**

_{i}**S**

*) is also rotation matrix, we can think that $\mathrm{\Sigma}={\mu}_{i}^{j}{\mathbf{K}}_{i}$ and Θ =*

^{j}**R**

_{i}**S**

*. Moreover, since (*

^{j}**K**

*)*

_{i}_{33}is 1, (Σ)

_{33}is ${\mu}_{i}^{j}$.

Because ${\mathbf{H}}_{i}^{j}={\mathbf{K}}_{i}{\mathbf{R}}_{i}{\mathbf{S}}^{j}$, where ${\mathbf{H}}_{i}^{j}$ is the leading 3 × 3 sub-matrix of ${\mathbf{P}}_{i}^{j}$, the scale factors can be determined as ${\lambda}_{i}^{j}=1/{\mu}_{i}^{j}$.

#### 5.3. Analysis of the two rescaling methods

In this section, a more detailed discussion of the two rescailing methods described in Sections 5.1 and 5.2 is given.

Suppose that
${\tilde{\mathbf{H}}}_{i}^{j}$ has RQ-decomposition into the product of Σ and Θ, where Σ is an upper triangular matrix and Θ is an orthogonal matrix whose determinant is 1. Let the diagonal elements of Σ be *a*, *b*, and *c*. Then, the scale factor from *Method 2* is 1/*c*. It is worthwhile to note that {(**K*** _{i}*)

_{11}, (

**K**

*)*

_{i}_{22}}(= {

*f*,

_{u}*f*}) = {

_{v}*a*/

*c*,

*b*/

*c*}. The determinant of ${\tilde{\mathbf{H}}}_{i}^{j}$ is the product of the diagonal elements,

*abc*.

Consequently, the scale factor
${\rho}_{i}^{j}$ from *Method 1* is (*abc*)^{−1/3} and
${\lambda}_{i}^{j}$ from *Method 2* is (*ccc*)^{−1/3}. However, it is worthwhile to note that these scale factors are not ideal values because they are computed from
${\tilde{\mathbf{H}}}_{i}^{j}$ contaminated with image noise. From this observation, we can see that
${\rho}_{i}^{j}$ has better numerical conditioning than
${\lambda}_{i}^{j}$ in the presence of noise. When
${\tilde{\mathbf{H}}}_{i}^{j}$ are contaminated with image noise, the values of *a*, *b*, and *c* are also deviated from their true values. Since these variables are independent of each other, the product of *a*, *b*, and *c* has noise reduction effect compared to *ccc*. On the other hand, the product of three *c*’s amplifies its noise.

To support this analysis, we performed the following simulated experiments. We generated 3000 random
${\tilde{\mathbf{H}}}_{i}^{j}$’s. Zero-mean Gaussian-distributed noise was added to the elements of these matrices. The standard derivation used for each element was 10% of the magnitude of the element. Fig. 2 shows the distribution of the relative error of *abc* and *ccc*. For example, the relative error of *abc* is ‖*abc* − *ãb̃c̃*‖/*abc*, where {*a*, *b*, *c*} are obtained from
${\mathbf{H}}_{i}^{j}$ and {*ã*, *b̃*, *c̃*} from contaminated
${\tilde{\mathbf{H}}}_{i}^{j}$. The experimental comparison and evaluation of the two methods are presented in Section 6.1.2.

#### 5.4. Factorization

As usual in the previous factorization approaches, the SVD (Singular Value Decomposition) is used to obtain the low-rank factorization of **Ỹ**. Let the SVD of **Ỹ** be given as:

Assume that the diagonal matrix **D** contains the singular values of **Ỹ**: *σ*_{1}≥*σ*_{2}≥. . . ≥*σ*_{3}* _{n}*. In the absence of noise,

**Ỹ**satisfying Eq. (10) has rank 3 and consequently

*σ*

_{4}=

*σ*

_{5}= . . . =

*σ*

_{3}

*= 0. If noise were present, this would not be the case. If we want to find the rank 3 matrix which is closest to*

_{n}**Ỹ**in the Frobenius norm, such a matrix can be obtained by setting all the singular values, besides the three largest ones, to zero. Let

**Ū**and

**V̄**be the matrices consisting of the first three columns of

**U**and

**V**, respectively. Then, the rank-3 factorization result can be given as:

**T**is an arbitrary non-singular 3 × 3 matrix. To obtain the camera and object pose parameters, we have to resolve this affine ambiguity. This issue is considered in the next section. It is worthwhile noting that this ambiguity always exists even if the method described in section 5.2 is used.

#### 5.5. Resolving affine ambiguity

Let the matrices **Û**_{3m}_{×3} and **V̂**_{2}_{n}_{×3} in Eq. (16) be decomposed in the 3 × 3 and 2 × 3 submatrices, respectively:
${\widehat{\mathbf{U}}}_{3m\times 3}^{T}=\left[\begin{array}{ccc}{\mathbf{U}}_{1}^{T}& \dots & {\mathbf{U}}_{m}^{T}\end{array}\right]$ and
${\widehat{\mathbf{V}}}_{2n\times 3}^{T}=\left[\begin{array}{ccc}{\mathbf{V}}_{1}^{T}& \dots & {\mathbf{V}}_{n}^{T}\end{array}\right]$.

From Eq. (10) and (16), we have ${\mathbf{T}}^{-1}{\mathbf{V}}_{j}^{T}={\mathbf{S}}^{j}$. This can be written as:

From the summation of Eq. (17) for all*j*’s, it can be seen that

**TT**

*≅*

^{T}**D**because

**V̂**

^{T}**V̂**=

**D**and, consequently, that $\mathbf{T}\cong \sqrt{\mathbf{D}}{\mathbf{R}}_{w}$, where

**R**

*is an arbitrary rotation matrix representing the fact that the global Euclidean reference frame can be chosen arbitrarily.*

_{w}#### 5.6. Reconstructing the camera and object pose parameters

We can obtain **K*** _{i}* and

**R**

*from RQ-decomposition of*

_{i}**U**

_{i}**T**and

**S**

*from the rotation matrix closest to ${\mathbf{T}}^{-1}{\mathbf{V}}_{j}^{T}$ [18].*

_{j}Let ${\tilde{\mathbf{h}}}_{i}^{j}$ be the fourth column of ${\tilde{\mathbf{P}}}_{i}^{j}$, which can be written as:

**t**

*and*

_{i}**v**

*.*

^{j}#### 5.7. Filling missing entries

In practice, for camera networks where all the fields of view of the cameras do not overlap, the calibration object placed in different locations cannot be seen from all the cameras simultaneously. In this case, some entries in the matrix **Ỹ** are missing and the factorization cannot proceed. To fill the missing entries, the following simple deduction process can be used.

*k*th camera and the

*l*th object position. The candidates for the intermediate

*k*th camera and the

*l*th object position may be multiple. In this case, all the candidates can be used simultaneously to compensate the image noise.

Owing to the simple loop style Eq. (19), the filling process can be completed much more simply compared to the complex process of the point-based factorization methods [13, 14], which fit a low rank matrix to the measurement matrix.

#### 5.8. Refining the parameters

The parameters are further refined by minimizing the re-projection errors defined as follows:

*k*th point on the reference object and ${I}_{i}^{j}$ is the set of indices of the reference points that can be visible from the

*i*th camera when the object is placed in the

*j*th location. ${\tilde{\mathbf{x}}}_{i}^{{j}_{k}}$ represents the image points measured from the captured images while ${\widehat{\mathbf{x}}}_{i}^{{j}_{k}}$ represents the image points synthesized with the camera and object parameters. This non-linear minimization problem is solved with the Levenberg-Marquardt algorithm initialized with the estimation results obtained using the technique described in the previous sections. Due to the multi-view constraints imposed on the factorization, consistency of the rigid transformations among cameras and objects is guaranteed. These consistent estimation results can be used directly to the initialization without any pretreatment.

#### 5.9. Compensating the inaccuracy of a reference object

Critical issues when using a 3D reference object are manufacturing cost and accuracy. There is a trade-off between these two factors. The manufacturing accuracy severely affects the calibration quality in the previous method using a 3D object. On the other hand, since the images of the same object placed in different locations are captured using the proposed method, it is possible to compensate the inaccuracy of the object.

It is assumed that the object is composed of planes and that the planes are totally flat. Because the size of the object in the proposed method is small, it is not so difficult to satisfy this requirement for individual small planes. As such, it can be thought that the inaccuracy is caused while these planes are being assembled and patterns printed on the surface. Since it is easy to measure the relative 2D positions between the feature points on each surface plane, we can limit the variables regarding the inaccuracy to the rotation matrices **R*** _{f}* and the translation vectors

**t**

*of the planes with respect to the coordinate attached to the object. These variables are described in Fig. 3 for one example plane. Since the planes at different positions can be viewed from more than two cameras concurrently, as in Fig. 3, and the variables are not changed while the object moves, they can be computed to satisfy overall consistency. The re-projection error function is extended from Eq. (20) as follows:*

_{f}*f*(

*k*) is the index of the face including the

*k*th point.

**R**

*(*

_{f}*k*) and

**t**

*(*

_{f}*k*) are initialized according to the original drawing of the object. The other parameters are initialized with the estimation results obtained up to section 5.7.

#### 5.10. Dealing with radial distortion

Up to now, the lens distortion of cameras has not been considered. In the proposed method, the first two terms of radial distortion are considered because it has been found that a more elaborate model offers negligible improvement when compared with sensor quantization and just causes numerical instability [2]. A similar method used in [2] is adopted.
${\widehat{\mathbf{x}}}_{i}^{{j}_{k}}$ of Eq. (21) is extended to include the radial lens distortion parameters. The real (distorted) normalized image coordinates (*x̆*, *y̆*) are represented as follows:

*x*,

*y*) are distortion-free normalized image coordinates, and

*k*

_{1}and

*k*

_{2}are the coefficients of the radial distortion. An initial guess of

*k*

_{1}and

*k*

_{2}can be obtained through similar equations to Eq. (13) in [2], which can be modified to be proper for the proposed method or simply by setting them to 0.

The initial values of the other parameters can be set as the estimation results obtained up to Section 5.9. Otherwise, both parameters related to object inaccuracy and radial distortion are refined simultaneously by initializing the other parameters with the estimation results obtained up to Section 5.7. It was found that the results of the former and latter approaches are almost the same, but that the former approach requires more computational time. All of the experimental results presented in this paper were obtained using the latter approach.

## 6. Experimental results

#### 6.1. Simulated experiments

Simulated experiments were performed to analyze the performance of the proposed algorithm. The advantage of simulated experiments instead of real image experiments is that a rigorous ground truth is available, while the relevant parameters are varied systematically.

Two types of experimental environments were constructed. First, six cameras were evenly located on a circle with a radius of 3m (See Fig. 4(a)). Since all the cameras looked towards the center, all the fields of view of the cameras overlapped. In the second environment, five cameras were mounted on the side walls of a corridor (See Fig. 4(b)). Since the optical axes of all the cameras were set to be parallel to the end wall of the corridor, not all of the fields of view of the cameras overlapped. The length, width, and height of the corridor were 5m, 6m, and 2m, respectively. We refer to these two environments as *ENV1* and *ENV2*, respectively.

Synthetic 1600×1200 images were taken by the cameras with the following intrinsic parameters: (*f _{u}*,

*f*,

_{v}*s*,

*u*

_{0},

*v*

_{0})=(1600, 1600, 0, 800, 600). Zero-mean Gaussian-distributed noise with standard deviation

*σ*was added to the image projections. It was assumed that there were no gross outliers owing to false correspondence. However, to consider outliers in feature point detection owing to perspective and photometric effect, 10% of the noise-vectors are replaced with those from another Gaussian distribution with 2

*σ*for all simulated experiments.

The 3D object having the shape shown in Fig. 7(a) was used and each face had 9 reference points. The number of trials was 100 for each simulation and the results shown are the averages. All results were obtained with the method described in Section 5.1 in determining the scale factors except for the results of Section 6.1.2. Since the performances obtained for all intrinsic parameters are highly similar, the estimation results for *f _{v}*,

*u*, and

_{o}*v*are not depicted here.

_{o}### 6.1.1. Performance w.r.t noise level, object size, and number of object positions

In this section, we analyze the performance with respect to the noise magnitude, the size of the reference object, and the number of object positions.

All of the results are displayed in Table 1. From the results, it is clear that the proposed method can obtain accurate estimates with a relatively small calibration object. When the number of object positions is increased in the case of a smaller object (240mm), error compensation can be observed. The object with a size of 240mm had a reasonable minimum size for acquiring stable results for the given environments. It is worthwhile noting that the number of object positions is equal to the number of shots required to calibrate the network. In *ENV2*, more object positions were needed than in *ENV1*, because not all the cameras viewed the same region and the object was placed in several separate view overlaps. Fig. 5 shows the visibility map for the object positions in *ENV2* when the number of object positions is 25. From this map, it can be seen that the results for *ENV2* validate the process of filling the missing entries described in Section 5.7.

The last four rows of Table 1 are to compare the proposed method with the method suggested by Kassebaum *et al.* [10]. As expected, the method could not yield accurate estimation results without a sufficiently sized object. To obtain estimation results comparable with those of the proposed method, that method would have needed a very large object with a width of 1450mm in these simulation environments. Moreover, even though a large object was used, that method obtained worse results than those of the proposed method when the intrinsic parameters were not given. This is due to the fact that the feature points cannot be placed inside the object. Unlike when using a large object placed in only one position, the proposed method can provide a great number of feature points evenly distributed over the capture volume.

### 6.1.2. Comparison of the two rescaling methods

We suggested two methods of determining the scale factors in Sections 5.1 and 5.2. The comparison of the two methods is summarized in Table 2. The number of object positions was 25 in this experiment and the environment was *ENV2*. After carrying out the factorization step, the RPE results obtained from *Method 1* were better than those obtained with the *Method 2*. These results can be expected from the analysis of Section 5.3.

These results are the initial values for the non-linear optimization described in Section 5.8. In the case of a small noise level (*σ* = 0.5) and a larger object (360mm), the final results after the refinement step are fortunately similar. However, the differences in the initial values affect the final results when the noise level increases or a smaller object (240mm) is used.

The feasibility of the factorization step depends on the rank 3-ness of the reduced measurement matrix **Ỹ** rescaled by the two methods. Fig. 6 shows the magnitude of the singular values of rescaled **Ỹ** for each environment of Table 2. To be close to rank 3, the ratio of the 3rd and 4th singular values should be large with respect to the ratio of the 1st and 3rd singular values. We can see that this ratio of *Method 1* is larger than that of *Method 2*. From these experimental results, it can be concluded that *Method 1* is more desirable than *Method 2*.

### 6.1.3. Performance w.r.t object inaccuracy and radial distortion

In this section, we test the feasibility of the approach described in Sections 5.9 and 5.10. The width of the object was 360mm. The image noise was zero-mean Gaussian-distributed noise with standard deviation of 0.5. The number of object positions was 9 for *ENV1* and 25 for *ENV2*, respectively. The results are summarized in Table 3. It can be seen that the proposed method can be used even if the reference object is inaccurate. The radial distortion parameters were also well estimated. The proposed algorithm could not converge for the inaccuracy level more than (*θ̂ _{f}*, ‖

**t̂**

*‖)=(3°, 5mm) in*

_{f}*ENV1*and (1°, 3mm) in

*ENV2*(See Table 3 for an explanation of the parameters). Since there were many missing entries in

*ENV2*compared with

*ENV1*, error propagation during the filling process caused this difference. However, this performance means that we can allow a considerable degree of inaccuracy with the object and decrease the manufacturing cost.

### 6.1.4. Comparison with the method of [4]

As mentioned in Section 1, the feature points detected for the proposed method can be used as the input of the algorithm of [4]. Unlike a moving bright spot, which can be viewed from almost any direction, all the features on the object cannot be seen from each camera because the object cannot be made to be transparent. Consequently, the survival duration of the features along the cameras is relatively short compared to instances where a moving bright spot is used. If the cameras are arranged more densely, more stable results may be obtained due to the effects of chaining local stable reconstruction. The simulation results for the method of [4] using the features detected for the proposed method are shown in Table 4. The width of the object was 360mm. The image noise was zero-mean Gaussian-distributed noise with standard deviation of 0.5. Since the results of [4] are up to scale, a similarity transformation was applied so that the distances between the estimated camera positions and their corresponding ground truths were minimal. From the results, we can see that satisfactory results cannot be obtained even if the number of object positions increases. Since the cameras of *ENV2* look down at the capture volume, the points on the upper plane of the object can be viewed at three adjacent cameras simultaneously. This is the reason why the calibration results can be obtained in *ENV2* compared to the results in *ENV1* when the number of cameras is small (5 or 6). To identify the effects of the number of cameras, more cameras were located evenly in the environments while fixing the number of object positions. As mentioned above, more accurate calibration results were obtained. This implies that the method of [4] can be applied only in the case of densely distributed cameras when the feature points are extracted from a 3D reference object for unsynchronized camera networks.

#### 6.2. Real image experiments

### 6.2.1. Manufacturing a reference object

Owing to the advantage of the proposed algorithm described in Section 6.1.3, we simply constructed the reference object with cheap foam boards manually (See Fig. 7(a)). Eighteen 2D patterns containing nine concentric circles were printed on sheets of paper which were then attached to the reference object. Since at least two planes should be visible from the cameras in any direction in order to compute the projection matrices, we considered this symmetric design to be rational. The width was about 360mm.

### 6.2.2. Extraction of feature points

The feature points on the reference object were extracted automatically. First, circle candidates were extracted via edge detection and blob coloring. Then, two concentric circles having the nearest mass centers were paired to discriminate the probable circles on the background. By recognizing the characters shown in Fig. 7(a), the indices of the faces and circles were identified. After obtaining the ellipse parameters of the imaged circles using the method of [19], the bi-tangents of every pair of circles on the same plane were computed and the intersection points between the bi-tangents were used as feature points. The extraction results are shown in Fig. 7(b). It is worthwhile noting that these point coordinates should be normalized to obtain stable calibration results [20]. This was also performed for the simulated experiments outlined above.

### 6.2.3. Actual camera network environments

The proposed method was evaluated on two actual camera networks. The types of environments were similar to *ENV1* in Fig. 4(a) and *ENV2* in Fig. 4(b). We refer to these environments as *Network1* and *Network2*, respectively. Fig. 8 shows the illustration of the camera networks. Table 5 shows the detailed specifications of the camera networks.

*Network1* is shown in the left-hand image of Fig. 8. Images captured from some cameras of *Network1* are shown in Fig. 9. Fig. 10 shows the recovered cameras and objects. From these figure, it can be seen that the qualitative positions of cameras and objects were estimated correctly. Since the ground truth is not available in real image experiments, the re-projection results are good indices of the performance of the proposed method. The average re-projection error(RPE) of the reference points was 0.0835 pixels. When the inaccuracy of the object was not compensated by the technique described in Section 5.9, the average RPE was 0.284 pixels. This implies that the object had inaccuracy and the pose discrepancy of the object’s faces from an original drawing was well estimated. The average of the estimated discrepancy of the position was 2.32mm.

The right-hand image of Fig. 8 shows *Network2*. Fig. 11 shows the images of the reference object captured from some cameras. Fig. 12 shows the visibility map for the object positions in *Network2*. The filling process described in Section 5.7 was conducted successfully in this experiment. The visualization of the calibration results are shown in Fig. 13. The average RPE was 0.0473 pixels. When the inaccuracy of the object was not compensated, the average RPE is 0.148 pixels.

In order to investigate the consistency of the proposed method, we applied it to the camera set {0,2,4} and {1,3}, respectively, in both *Network1* and *Network2*. The calibration results are shown in Table 6 and 7. We can see that the differences between the results from the subgroups of the cameras and those from all the cameras are quite small. These results imply that the proposed algorithm provides consistent and stable estimation results.

The results of calibration could not be obtained using the method of [4] with the feature points detected in either *Network1* or *Network2*. In these environments, no features could be viewed from more than two cameras simultaneously due to the sparse distribution of the cameras.

## 7. Conclusions

The intrinsic and extrinsic parameters of camera networks were recovered only by capturing a small inaccurate reference object. The proposed method provides a simple and accurate means of calibrating unsynchronized camera networks without the need for a large reference object for accuracy. The algorithm is based on the factorization of projection matrices into the camera and object pose parameters. Since the camera and object pose parameters can be acquired simultaneously, consistency in the rigid transformations among cameras and objects is obtained. These consistent results were used as the initial values for the non-linear optimization step for further refinement. It was shown that the proposed method can compensate the inaccuracy of the object and, consequently, decrease the manufacturing cost of the 3D reference object. The radial lens distortion was also calibrated simultaneously. It was also shown that the method of [4] is not suitable for sparsely distributed cameras when a 3D reference object is used for unsynchronized camera networks. Various performance analyses of the proposed method were also presented through the simulated and real image experiments.

## Acknowledgments

This work was supported by the strategic technology development program of MCST/MKE/KEIT [KI001798, Development of Full 3D Reconstruction Technology for Broadcasting Communication Fusion]. The authors would also like to thank the anonymous reviewers for their valuable comments to improve the quality of the paper, especially for the suggestions and ideas about the analysis of the two rescaling methods and the noise modelling.

## References and links

**1. **R. Hartley and A. Zisserman, *Multiple View Geometry in Computer Vision*, 2nd. ed. (Cambridge University Press, 2003).

**2. **Z. Zhang, “A flexible new technique for camera calibration,” Tech. Rep. MSR-TR-98-71, Microsoft Corporation (1998).

**3. **T. Ueshiba and F. Tomita, “Plane-based calibration algorithm for multi-camera systems via factorization of homography matrices,” in “*Proc. IEEE International Conference on Computer Vision*,” (Nice, France, 2003), pp. 966–973. [CrossRef]

**4. **T. Svoboda, D. Martinec, and T. Pajdla, “A convenient multi-camera selfcalibration for virtual environments,” Presence: Teleop. Virt. Environ. **14**, 407–422 (2005). [CrossRef]

**5. **G. Kurillo, Z. Li, and R. Bajcsy, “Wide-area external multi-camera calibration using vision graphs and virtual calibration object,” in “Proc. ACM/IEEE International Conference on Distributed Smart Cameras,” (2008), pp. 1–9. [CrossRef]

**6. **H. Medeiros, H. Iwaki, and J. Park, “Online distributed calibration of a large network of wireless cameras using dynamic clustering,” in “Proc. ACM/IEEE International Conference on Distributed Smart Cameras,” (2008), pp. 1–10. [CrossRef]

**7. **X. Chen, J. Davis, and P. Slusallek, “Wide area camera calibration using virtual calibration objects,” in “Proc. IEEE International Conference on Computer Vision and Pattern Recognition,” Hilton Head, SC, USA (2000), pp. 520–527.

**8. **S. N. Sinha and M. Pollefeys, “Camera network calibration and synchronization from silhouettes in archived video,” Int. J. Comput. Vision **87**, 266–283 (2010). [CrossRef]

**9. **E. Boyer, “On using silhouettes for camera calibration,” in “Proc. Asian Conference on Computer Vision,” Hyderabad, India (2006), pp. 1–10.

**10. **J. Kassebaum, N. Bulusu, and W.-C. Feng, “3-d target-based distributed smart camera network localization,” IEEE Trans. Image Process. **19**, 2530–2539 (2010). [CrossRef] [PubMed]

**11. **C. Tomasi and T. Kanade, “Shape and motion from image streams under orthography: a factorization method,” Int. J. Comput. Vision **9**, 137–154 (1992). [CrossRef]

**12. **P. Sturm and B. Triggs, “A factorization based algorithm for multi-image projective structure and motion,” in “Proc. European Conference on Computer Vision,” Cambridge, UK (1996), pp. 709–720.

**13. **D. Jacobs, “Linear fitting with missing data: Applications to structure from motion and to characterizing intensity images,” in “Proc. IEEE International Conference on Computer Vision and Pattern Recognition,” San Juan, Puerto Rico (1997), pp. 206–212.

**14. **D. Martinec and T. Pajdla, “Structure from many perspective images with occlusions,” in “Proc. European Conference on Computer Vision,” Copenhagen, Denmark, (2002), pp. 355–369.

**15. **P. Sturm, “Algorithms for plane-based pose estimation,” in “Proc. IEEE International Conference on Computer Vision and Pattern Recognition,” Hilton Head Island, SC, USA, (2000), pp. 706–711.

**16. **M. Wilczkowiak, P. Sturm, and E. Boyer, “Using geometric constraints through parallelepipeds for calibration and 3D modelling,” IEEE Trans. Pattern Anal. Mach. Intell. **27**, 194–207 (2005). [CrossRef] [PubMed]

**17. **O. Faugeras, *Three-Dimensional Computer Vision* (The MIT Press, 1993).

**18. **K. S. Arun, T. S. Huang, and S. D. Blosten, “Least-squares fitting of two 3-D point sets,” IEEE Trans. Pattern Anal. Mach. Intell. **PAMI-9**, 698–700 (1987). [CrossRef]

**19. **A. Fitzgibbon, M. Pilu, and R. B. Fisher, “Direct least square fitting of ellipses,” IEEE Trans. Pattern Anal. Mach. Intell. **21**, 476–480 (1999). [CrossRef]

**20. **R. Hartley, “In defence of the 8-point algorithm,” in “Proc. International Conference on Computer Vision,” Sendai, Japan (1995), pp. 1064–1070.