
Precise chirp control with model-based reinforcement learning for broadband frequency-swept laser of LiDAR


Abstract

Artificial intelligence (AI) has been widely used in various fields of physics and engineering in recent decades. In this work, we introduce model-based reinforcement learning (MBRL), an important branch of machine learning in the AI domain, to broadband frequency-swept laser control for frequency modulated continuous wave (FMCW) light detection and ranging (LiDAR). To avoid direct interaction between the optical system and the MBRL agent, we establish a model of the frequency measurement system on the basis of the experimental data and the nonlinearity property of the system. In light of the difficulty of this challenging high-dimensional control task, we propose a twin critic network built on the Actor-Critic structure to better learn the complex dynamic characteristics of the frequency-swept process, and the proposed MBRL structure greatly stabilizes the optimization process. In the training of the neural network, we delay the policy update and introduce a smoothing regularization strategy for the target policy to further enhance network stability. With the well-trained control policy, the agent generates regularly updated modulation signals that control the laser chirp precisely, and an excellent detection resolution is obtained. Our work demonstrates that integrating data-driven reinforcement learning (RL) with optical system control offers an opportunity to reduce system complexity and accelerate the investigation and optimization of control systems.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Artificial intelligence (AI) [1] has captured the attention of great minds over the past several decades, and the prevalence of data science and information technology has greatly accelerated AI development in countless applications. As a subdomain of AI, machine learning (ML) [2], a powerful intelligent data-driven technique, is becoming a vital tool to analyze, discover and understand the world around us. With the appearance of deep learning (DL) [3], ML has tackled many challenging tasks that were traditionally inaccessible. As a special case, deep reinforcement learning (RL), at the intersection of DL and dynamic system control, has been successfully applied to the games of Go [4] and StarCraft [5] and, with the goal of generalizing AI, to real autonomous robots [6] and self-driving cars [7], all of which involve interaction between an agent and its environment.

Motivated by the success of RL in these scenarios, the combination of RL and optics has also attracted a great deal of attention from the optics community, including adaptive optics (AO) [8,9], quantum optics [10], nonlinear spectroscopy [11], optical communication [12,13] and so on. RL presents a novel approach to control and optimize optical systems, especially in scenarios where conventional methods struggle under nonideal conditions, such as nonlinearity and noisy environments. For example, in order to achieve a high-power laser beam, Ref. [14] used a neural network (NN) to predict the relative phase noise, and adjusted and stabilized the relative phase of multiple light sources by employing deep RL to realize coherent beam combining. Intuitively, in many cases the RL agent interacts with the optical system directly to explore the best action policy, which is known as model-free RL (MFRL). However, to achieve an excellent feedback control strategy, RL generally requires hundreds of thousands of interactions with the environment. As mentioned in [15], it is difficult to quantify the data volume required to achieve expert-level performance, especially in real-world tasks. Therefore, without any constraint, the process is quite time consuming and complicated, and MFRL is data-expensive with high sample complexity. Simultaneously, in some scenarios, unpredictable control actions can cause serious damage to the system. For example, an abrupt increase of the injection current above the threshold can easily damage the optical amplifier or the laser, even within a very short period. Implementing a strategy to limit the range of random actions may be an option to protect the system components from damage. In [16], the action selection was constrained under special circumstances to reduce the scope of exploration and avoid meaningless exploration. But risks remain. For example, if the designed action constraint relies on human experience, it is difficult to evaluate the long-term impact of the action on the state when the action is relatively complex or the relationship between the action and the state is complicated. Introducing strong handcrafted or prior information could hinder the generalization of the algorithm on a case-by-case basis. Additionally, if the constraint strategy is driven solely by data, there is a risk that it may not work as intended.

To handle these issues, the environment can be simulated, and the agent can then be trained through interaction with the environmental model. Subsequently, the trained agent can be applied to the real-world optical system. This whole approach is called model-based reinforcement learning (MBRL). In Ref. [17], to achieve the inverse design of structural color for display, trained supervised learning (SL) models were investigated to reflect the relationship between geometry and color. Based on these models, MBRL was employed to explore and design specific geometries that achieve the desired colors. Reference [8] utilized MBRL to control the AO loop, and the adaptive deformable mirror and Shack-Hartmann sensor misalignment were addressed by predicting the temporal evolution of turbulence, thereby solving typical calibration issues in AO systems.

Following this routine, we propose an MBRL approach and validate it in the broadband frequency-swept linearization task for frequency modulated continuous wave (FMCW) light detection and ranging (LiDAR). FMCW LiDAR is an emerging technique that provides high-precision detection of non-cooperative targets, in which a frequency-swept laser (FSL) is employed as the FMCW optical source. As the detection resolution of FMCW LiDAR is inversely proportional to the bandwidth of the optical source, a broadband FSL is demanded to take full advantage of the spatial resolution of FMCW LiDAR. However, the bandwidth of the FSL is limited by its inherent nonlinearity, especially at a high modulation rate. Therefore, for a higher resolution, a control system has to produce the modulation signal required for a broadband and linear frequency sweep. Generally speaking, traditional linearization methods are roughly separated into active control methods and passive post-processing methods. The optical phase locked loop (OPLL) [18] is a classical active control method. Since the optical frequency sweep is locked to an external reference signal in an optoelectronic feedback loop, it is characterized by precise control at the cost of expensive components and system complexity. Meanwhile, in the case of a broadband frequency sweep, the frequency twist caused by the high nonlinearity may throw the loop out of lock. The passive post-processing method, which uses the output of an auxiliary interferometer as a reference signal to resample the ranging signal [19], is reported to deal successfully with linearization in the ranging process at the expense of a post-processing algorithm and system complexity, and a mismatch between the reference signal and the ranging signal may lead to inaccurate ranging results.

By introducing MBRL to achieve a broadband and linear frequency sweep in this work, we design the frequency measurement system and establish the system model employed as the RL environment. The control policy is optimized in this "digital environment". The agent can be well trained in the "digital environment" easily, and the issue of data hunger is alleviated. The strategy exploration process is also kept under control without any concern about system damage. Furthermore, with the improved data efficiency, the system complexity and the costs of computation and signal synchronization are effectively controlled. Since the enhanced nonlinearity brought by the broadband sweep leads to an extreme increase in the dimensions of the state and action spaces of the environment, we introduce a twin critic network on the basis of the Actor-Critic RL structure to increase the stability of the optimization process and better cope with the complex dynamic characteristics of the frequency-swept process. In the training of the NN, we delay the policy update and introduce a smoothing regularization strategy for the target policy to reduce the threat induced by the estimation error of the critic NNs and further enhance network stability. With the optimized control policy, the broadband optical source is linearized and can be equipped in an FMCW LiDAR system with a much better spatial resolution.

2. Related works

2.1 Comparison of the MBRL and the MFRL

In this section, we further discuss the MBRL and the MFRL, whose relative merits have been debated for decades in the RL community. As mentioned in [20], both methods rely on computing functions to estimate the value of states and actions, and on anticipating future events by using a backed-up value as an updating target for an approximate value function. During this process, the agent interacts with the environment to explore and learn its characteristics, which necessitates a substantial amount of data to perform well.

The fundamental difference between these methods lies in their respective approaches to learning. Model-free methods are trial-and-error learners that rely on direct interaction with the environment and are effective at capturing environmental characteristics. However, their high sample complexity limits their practicality. By contrast, model-based methods are considered a promising approach to decreasing the sample complexity using a planning approach that relies on an accurate model [15]. Although this approach is more sample-efficient, accurately modeling the environment can be challenging in certain domains, and modeling errors can lead to deficient policies [21].

Recent research has demonstrated that meta-RL methods can overcome the modelling error by training a meta-policy to adapt to tasks on-the-fly [22]. Additionally, probabilistic models and ensembles [23] have been used to characterize the uncertainty of learned models, leading to model-based methods that match model-free asymptotic performance in challenging domains while using fewer samples. In some cases, MBRL can even outperform MFRL on long-term empirical performance, particularly when using a high-quality learned model [24].

Moreover, in real-world applications, we consider not only efficiency but also ease of use and reliability, and sometimes must handle significant dynamical uncertainty. Considering efficiency, sample complexity, and safety, we prefer a model-based approach for the broadband frequency-swept linearization task.

2.2 Overestimation in RL methods

RL methods have to handle the exploration-exploitation trade-off, which results in incomplete exploration and variance in the value function. Together with the greed for the maximum value, these factors contribute to the common problem of overestimation in RL methods. Here, we broadly classify RL methods into value-based and policy-based methods to discuss the approaches used to address overestimation.

Value-based methods learn the state or state-action value function and then select actions based on it. Q-learning is a typical algorithm in this category, where Q represents the state-action value function. To address the overestimation bias of Q-learning, double Q-learning [25] applies the double estimator to Q-learning, which converges to the optimal policy and performs better than Q-learning in certain settings. The deep Q-network (DQN) [26] approximates the Q-function using a convolutional NN instead of a lookup table, making it more powerful in high-dimensional tasks. To address the substantial overestimation of the Q-function, double DQN [27] uses two separate value functions to decouple the selection and evaluation of actions, resulting in improved training stability compared to DQN. Other extensions, such as dueling DQN [28] and prioritized experience replay (PER) [29], further improve the stability and performance of DQN.
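
To make the double-estimator idea concrete, a minimal tabular double Q-learning update is sketched below; the state and action sets, step size, and example transition are hypothetical and serve only to illustrate the decoupled selection and evaluation of actions.

```python
import numpy as np

# Minimal tabular double Q-learning update (states, actions and step sizes are hypothetical).
rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
Q_A = np.zeros((n_states, n_actions))
Q_B = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9

def double_q_update(s, a, r, s_next):
    """Randomly update one table, selecting the greedy action with it but evaluating it with the other."""
    if rng.random() < 0.5:
        a_star = int(np.argmax(Q_A[s_next]))        # action selected by Q_A ...
        target = r + gamma * Q_B[s_next, a_star]    # ... but evaluated by Q_B
        Q_A[s, a] += alpha * (target - Q_A[s, a])
    else:
        a_star = int(np.argmax(Q_B[s_next]))
        target = r + gamma * Q_A[s_next, a_star]
        Q_B[s, a] += alpha * (target - Q_B[s, a])

double_q_update(s=0, a=1, r=1.0, s_next=2)   # one example transition
```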

Policy-based methods, such as trust region policy optimization (TRPO) [30], optimize the policy directly and avoid the policy degradation due to the value function estimation error. They perform better in continuous action space tasks but have shortcomings in data efficiency and convergence.

To combine the advantages of both kinds of methods, the actor-critic method is designed, where the policy-based method works as the actor and the value-based method works as the critic. By extending DQN and the deterministic policy gradient (DPG) [31], deep deterministic policy gradient (DDPG) [7] learns competitive policies for tasks using low-dimensional observations. However, it also suffers from overestimation bias due to the noisy estimation of the value. To address this issue, a variant of DDPG, known as twin delayed DDPG (TD3) [32], maintains a pair of critics along with a single actor, following the routine of double DQN. It exceeds the performance of numerous state-of-the-art algorithms and is widely employed in spacecraft control [33], traffic control [34], and power management [35].

2.3 Principle of the FMCW LiDAR

Before establishing the MBRL, we analyze the principle of the FMCW LiDAR to confirm the final purpose of the FSL linearization and the evaluation metrics of linearity. Figure 1(a) depicts the schematic of an FMCW LiDAR system. Part of the FMCW signal emitted from the FSL radiates into space as the detection signal, and the rest acts as the reference signal, which interferes with the detection signal returning from a stationary target. The beat signal is received by a photodetector (PD). The frequency waveforms of the detection signal, the reference signal and the beat signal are illustrated in Fig. 1(b).

Fig. 1. The illustration of the FMCW LiDAR system. PD represents photodetector, ADC represents analog digital converter.

As the frequency sweep is linear theoretically, the reference signal frequency in the period $(-T_m/2+\tau _d, T_m/2)$ can be written as

$$f_1(t) = \xi t+f_0,$$
$$\xi = \delta f \cdot f_m,$$
where $\xi$ describes the modulation rate, $f_0$ is the center frequency, $\delta f$ is the bandwidth of the optical source, $f_m = 1/T_m$ is the modulation frequency and $\tau _d$ is the delay time corresponding to the detected round-trip distance $2d$ illustrated in Fig. 1(a). Similarly, the frequency of the detection signal is given as
$$f_2(t) = \xi (t-\tau_d)+f_0.$$

Since $\tau _d$ is small enough in practice, the recorded photocurrent is given by

$$I(\tau_d,t) \propto \cos(\phi_{1}(t) -\phi_{2}(t) ) \approx \cos(2\pi\tau_d f_1(t))=\cos(\phi _b(t)),$$
where $\phi _b(t)$, $\phi _{1}(t)$, and $\phi _{2}(t)$ are the phases of the beat signal, the reference signal and the detection signal, respectively. Therefore,
$$\phi_b(t) =2\pi \tau_d f_1(t) =2\pi \tau_d(\xi t+f_0),$$
and the beat frequency is a constant
$$f^{*}_b = \xi \tau_d.$$

The detected distance $d$ is proportional to the beat frequency

$$d=\frac{1}{2}c\tau_d =\frac{c}{2 \xi}f^{*}_b,$$
where $c$ is the speed of light. Obviously, for a desired precision range inversion, an accurate beat frequency is required. When the spectrum of the beat signal is obtained by the fast Fourier transform (FFT), the space resolution defined by the Rayleigh resolution is given as $\delta d = \frac {c}{2\delta f}$. Since we use the Hanning window in FFT in this work, the theoretical space resolution (TSR) is estimated from the full-width at half-maximum (FWHM) of the beat spectrum given as [36]
$$\delta d_{FWHM}=2\delta d = \frac{c}{\delta f}.$$

In this work, it is employed as the basic evaluation indicator of linearity and detection precision.
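
As a quick numerical illustration of Eq. (8), the short sketch below evaluates the FWHM-based TSR for the roughly 117 GHz sweep bandwidth reported later in Section 4; the value is only indicative.

```python
# Numerical check of Eq. (8): delta_d_FWHM = c / delta_f.
c = 299_792_458.0   # speed of light (m/s)
delta_f = 117e9     # sweep bandwidth (Hz), value reported in Section 4

delta_d = c / (2 * delta_f)   # Rayleigh resolution delta_d = c / (2 * delta_f)
delta_d_fwhm = 2 * delta_d    # FWHM-based TSR with a Hanning window, Eq. (8)
print(f"Rayleigh resolution: {delta_d * 1e3:.2f} mm, TSR (FWHM): {delta_d_fwhm * 1e3:.2f} mm")
```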

3. Methodology

The schematic of the MBRL broadband frequency-swept linearization is shown in Fig. 2. With the frequency measurement system, we analyze the influence of the nonlinearity and establish the system model as the environment of the RL agent. In this way, the training process of the MBRL is fully decoupled from the experimental platform, which enhances the data efficiency. The RL agent is designed based on the Actor-Critic structure, and the control policy is optimized with the training data provided by the model.

Fig. 2. The illustration of model-based reinforcement learning. NN represents the neural network.

3.1 Frequency measurement system and model

The MBRL starts with the establishment of the frequency measurement system as given in Fig. 3. Since the beat signal phase is proportional to the optical frequency sweep, an auxiliary Mach-Zehnder interferometer (MZI) with the relative delay $\tau$ is introduced as the core element of the measurement system. Considering the inherent nonlinearity of the FSL, the chirp frequency is defined as:

$$f(t) = f_0 + \xi t + f_{nl}(t) = f_0+F(u(t)),$$
where $f_{nl}(t)$ is the nonlinear part of the frequency sweep, $u(t)$ is the modulation signal, and $F(\cdot )$ represents the nonlinear mapping relationship. With the recorded beat signal, the beat signal phase is extracted by a Hilbert Transform, and according to Eq. (5), it can be represented as
$$\phi _b(t) = 2\pi \tau f(t) = 2 \pi \tau (f_0 + \xi t + f_{nl}(t)) = 2 \pi \tau(f_0+F(u(t))).$$

Furthermore, the beat frequency, given by Eq. (11), deviates from the constant $\xi \tau$ by a term proportional to the rate of change of the frequency-swept nonlinearity $f_{nl}(t)$.

$$f_b(t) = \frac{1}{2\pi}\frac{d \phi _b(t)}{dt}=\xi \tau + \tau \frac{df_{nl}(t)}{dt}.$$
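
For illustration, the phase extraction and instantaneous-frequency calculation of Eqs. (10)-(11) can be sketched as below; the sample rate, the synthetic nonlinearity, and the omission of the constant phase due to $f_0$ are assumptions used only to make the example self-contained, not a description of our acquisition code.

```python
import numpy as np
from scipy.signal import hilbert

# Illustrative beat-signal analysis following Eqs. (10)-(11); all numbers are placeholders.
fs = 10e6                              # sample rate (Hz), assumed
t = np.arange(0, 1e-3, 1 / fs)         # one modulation period T_m = 1 ms
tau = 5e-9                             # MZI relative delay (s)
xi = 117e9 / 1e-3                      # nominal modulation rate (Hz/s)
f_nl = 2e9 * np.sin(2 * np.pi * 3e3 * t)        # synthetic nonlinearity f_nl(t)
phi_b = 2 * np.pi * tau * (xi * t + f_nl)       # beat phase, Eq. (10); constant 2*pi*tau*f_0 dropped
beat = np.cos(phi_b)

analytic = hilbert(beat)                        # analytic signal of the recorded beat signal
phase = np.unwrap(np.angle(analytic))           # extracted beat phase
f_b = np.gradient(phase, 1 / fs) / (2 * np.pi)  # instantaneous beat frequency, Eq. (11)
print(f"mean beat frequency ~ {f_b[100:-100].mean() / 1e3:.0f} kHz "
      f"(ideal xi*tau = {xi * tau / 1e3:.0f} kHz)")
```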

Due to the existence of the nonlinearity, the beat frequency is time-variant. According to Eq. (7), the detection distance cannot be converted precisely from an imprecise beat frequency. The beat signal bandwidth can be estimated by the Carson bandwidth rule as

$$\delta f_b = 2(1+\beta)f_m,$$
where $\beta$ is the modulation index, and is approximated by the root mean square (RMS) value of the frequency-swept nonlinearity $f_{nl,rms}$ in this case, i.e. $\beta = 2\pi \tau f_{nl,rms}$ [37]. Plugging into Eq. (12), the estimation of the bandwidth can be rewritten as
$$\delta f_b = 2(1+2\pi \tau f_{nl,rms})f_m.$$

In this way, with the spectrum analysis of the beat signal, we can evaluate the frequency-swept nonlinearity quantitatively. And the TSR of the FMCW LiDAR can be rewritten as

$$\delta d_{FWHM} = \frac{c} {2\xi} \delta f_b = \frac{c(1+2\pi \tau f_{nl,rms}) f_m}{\xi} = \frac{c(1+2\pi \tau f_{nl,rms})}{\delta f}.$$

The existence of the frequency-swept nonlinearity clearly limits the detection resolution, and when the nonlinearity term satisfies $2 \pi \tau f_{nl,rms} \ll 1$, $\delta d_{FWHM}$ reduces to Eq. (8). Note that this expression relies on the sawtooth modulation signal used in this experiment, where $\xi = \delta f / T_m = \delta f \cdot f_m$. Therefore, the FSL linearization is particularly important for high-precision FMCW LiDAR.
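
As a hedged numerical check of Eqs. (13)-(14), the following sketch inverts the Carson-rule bandwidth for $f_{nl,rms}$ and converts it to the TSR, using the delay and sweep bandwidth quoted in Section 4 and two example beat bandwidths of the kind listed later in Table 2; small differences from the TSR values quoted in Section 4.2 stem from rounding of the sweep parameters.

```python
import numpy as np

# Inverting Eq. (13) for the RMS nonlinearity and inserting it into Eq. (14).
c = 299_792_458.0    # speed of light (m/s)
tau = 5e-9           # MZI delay (s), Section 4
f_m = 1e3            # modulation frequency (Hz)
delta_f = 117e9      # sweep bandwidth (Hz), Section 4

def rms_nonlinearity(delta_f_b):
    """f_nl,rms recovered from the measured beat-signal bandwidth, Eq. (13)."""
    return (delta_f_b / (2 * f_m) - 1) / (2 * np.pi * tau)

def tsr(delta_f_b):
    """FWHM-based theoretical space resolution, Eq. (14)."""
    return c * (1 + 2 * np.pi * tau * rms_nonlinearity(delta_f_b)) / delta_f

for bw in (4.8e3, 6.7e3):   # example beat bandwidths (Hz), cf. Table 2
    print(f"beat bandwidth {bw / 1e3:.1f} kHz -> f_nl,rms {rms_nonlinearity(bw) / 1e6:.1f} MHz, "
          f"TSR {tsr(bw) * 1e3:.1f} mm")
```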

Fig. 3. Setup of frequency measurement system. MZI represents Mach-Zehnder interferometer, PD is photodetector, ADC is analog digital converter.

To accomplish our MBRL, we establish the system model on the basis of the experimental data and simple physical relationships imitating the response of the system. Since the modulation slope and the beat frequency are the input and the output of the FMCW frequency measurement system, the kernel of the model is a mapping between them, defined in Eq. (15), rather than a detailed mathematical model built directly.

$$f_b (t)=h(u^{'} (t))=h(\zeta(t)),$$
where $\zeta (t)=u^{'}(t)$ represents the modulation slope. In our work, this mapping is evaluated from a pair consisting of a modulation slope and the corresponding beat frequency, and is represented as
$$\tilde{f}_b (t)=\tilde{h} (\zeta(t))=\frac{f_{b,1} (t)}{\zeta_1(t)}\zeta(t),$$
where $\tilde{f}_b (t)$ is the evaluation of the beat frequency, $\tilde{h} (\zeta (t))$ is the evaluation of the mapping relationship, and $f_{b,1} (t)$ and $\zeta _1(t)$ are a pair of known beat frequency and modulation slope.

Simultaneously, the mapping relationship of the beat frequency and the modulation slope can also be deduced according to Eq. (10).

$$f_b(t) = \frac{1}{2\pi}\frac{d \phi _b(t)}{dt}=\tau \frac{df(t)}{dt}=\tau F^{'}(u) \frac{du(t)}{dt} =G(u)\zeta(t),$$
where $G(\cdot )$ represents the nonlinearity property of the system. The consistency of this result with Eq. (16) is obvious. Therefore, we obtain
$$\tilde{G}(u(t))=\frac{f_{b,1} (t)}{\zeta_1(t)},$$
where $\tilde{G}$ is the evaluation of $G(u(t))$ calculated by a pair of known modulation slope and the corresponding beat frequency. Therefore, the mapping relationship can also be written as
$$\tilde{f}_b (t) = \tilde{G}(u(t))\zeta(t).$$

In our work, the number of independent variables of $G(\cdot )$, or of $F(\cdot )$ in Eq. (9), does not matter.

However, randomness factors in the system make the measured frequency change all the time, which means that the calculated $\tilde{G}(u(t))$ can only represent the system characteristics instantaneously. To mimic the experimental system as closely as possible, we introduce a noise term $n(t)$ to characterize the inherent randomness of the system. In this way, the agent is capable of learning more about the system characteristics from the training data generated by the model, which makes the well-trained policy perform better in the experimental system. Therefore, we establish the model as

$$f_{b,m}(t) = \tilde{G}(u(t))\zeta(t) +n(t).$$

With any given modulation signal slope, the corresponding beat frequency is calculated. This model is used as the RL environment to provide the simulation data and assist the agent to find the modulation signal required by the broadband and linear frequency sweep.
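
A minimal sketch of how the model of Eq. (20) could be assembled from one recorded (slope, beat-frequency) pair is given below; the synthetic reference data, array lengths, and noise level are placeholders, not the experimental record.

```python
import numpy as np

class FrequencyMeasurementModel:
    """System model of Eqs. (16)-(20): f_b,m(t) = G~(u(t)) * zeta(t) + n(t)."""

    def __init__(self, zeta_1, f_b_1, noise_std, seed=0):
        self.G = np.asarray(f_b_1) / np.asarray(zeta_1)   # G~ from a known pair, Eq. (18)
        self.noise_std = noise_std                        # standard deviation of the noise term n(t)
        self.rng = np.random.default_rng(seed)

    def beat_frequency(self, zeta):
        """Predicted beat frequency for a new modulation-slope trajectory, Eq. (20)."""
        noise = self.rng.normal(0.0, self.noise_std, size=np.shape(zeta))
        return self.G * np.asarray(zeta) + noise

# Illustrative use with synthetic reference data (placeholders, not measured values).
n = 1000
zeta_1 = np.ones(n)                                                # known modulation slope (a.u.)
f_b_1 = 585e3 * (1 + 0.1 * np.sin(np.linspace(0, 2 * np.pi, n)))   # known beat frequency (Hz)
model = FrequencyMeasurementModel(zeta_1, f_b_1, noise_std=1e3)
f_bm = model.beat_frequency(0.9 * np.ones(n))                      # response to a new slope
```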

3.2 MBRL for broadband frequency-swept linearization

The main structure of MBRL is shown in Fig. 2. During the training process, the RL agent obtains the state $s_t$ from the environment, provides the action $a_t$ decided by the current control policy to guide the state transition, and receives the reward $r_t$ to evaluate the action. In the case of the broadband frequency-swept linearization, we design the state $s_t$ as

$$s_t=normalization ([u(t), f_{b,m}(t), f_{b,m}(t)-f_{b,m}(t-1)]),$$
where $normalization(x_i) = \frac {x_i - x_{i,min}}{x_{i,max}-x_{i,min}}$ is the normalization function. In addition, the action $a_t$ is defined as the modulation signal slope $\zeta (t)$ in the continuous action space. In order to improve the optimization efficiency, the reward function is designed as
$$r_t={-}normalization(|f_{b,m}(t) -f^{*}_{b}|).$$

This means that the smaller the frequency deviation from the reference frequency $f^{*}_b$, the greater the reward. When the beat frequency equals the constant at each time step, the reward is maximal and the nonlinearity should be linearized completely according to Eq. (11).
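
A minimal sketch of the state and reward constructions of Eqs. (21)-(22) follows; the array-based formulation and the small constant added to the denominator are implementation assumptions.

```python
import numpy as np

def minmax_normalize(x):
    """normalization(x_i) = (x_i - x_min) / (x_max - x_min), as used in Eqs. (21)-(22)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + 1e-12)   # small constant avoids division by zero

def build_states(u, f_bm):
    """States of Eq. (21): modulation signal, model beat frequency, and its first difference."""
    df = np.diff(f_bm, prepend=f_bm[0])
    return np.stack([minmax_normalize(u), minmax_normalize(f_bm), minmax_normalize(df)], axis=-1)

def rewards(f_bm, f_b_ref):
    """Rewards of Eq. (22): negative normalized deviation from the reference beat frequency."""
    return -minmax_normalize(np.abs(np.asarray(f_bm) - f_b_ref))
```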

As shown in Fig. 2, the Actor-Critic structure is employed as the core of the MBRL agent. The actor NN optimizes the deterministic control policy with the aim of learning reward-maximizing behavior. And the critic NN estimates the state-action value function to evaluate the current policy. The state-action value function represents the discount cumulative reward of $(s_t, a_t)$ defined as

$$Q(s_t,a_t) =\mathbb{E}[\sum_{t^{'}=t}^T{\gamma ^{t^{'}-t}r_{t^{'}}}|s_t,a_t],$$
where $\mathbb {E}$ represents the expectation, and $\gamma$ is the discount factor. The critic NN fits the function by using the temporal difference learning based on the Bellman equation.
$$Q^{*}(s_t,a_t) =\mathbb{E}[r(s_t,a_t) +\gamma max_{a_{t+1}} Q^{*}(s_{t+1},a_{t+1})],$$
where $Q ^{*}(s_t,a_t)$ is the optimal state-action value function. As the deep function estimation requires multiple gradient updates, a frozen target network is designed to maintain a stable objective.
$$y_t=r( s_t,a_t) +\gamma [Q^{'}(s_{t+1},\mu^{'}(s_{t+1}|\theta ^{\mu^{'}})|\theta ^{Q^{'}})],$$
where $\theta ^{\mu ^{'}}$ and $\theta ^{Q^{'}}$ are the hyper-parameters of the target actor NN and the target critic NN, respectively.

As mentioned in [25],

$$\mathbb{E}[\max(Q_1,Q_2,\ldots)] \geq \max[\mathbb{E}(Q_1),\mathbb{E}(Q_2),\ldots],$$
where $Q_i=Q(s_{t+1},a_{t+1,i})$ represents the state-action value of different actions, which are stochastic values. Owing to the incomplete exploration and the variance of the value function, the equality generally does not hold. In addition to the desire for the maximum value in the critic NN training process, an overestimation bias is induced by the function approximation error and accumulates during the optimization process, affecting the convergence stability of the network and even resulting in a suboptimal policy. An alternative double estimator method [25] was proposed to estimate the maximum value of the $Q$ function. Therefore, we employ twin critic NNs in the algorithm, i.e. two pairs of evaluation NN and target NN with the same structure but different weights, as the double estimator, and select the smaller output of the target NNs as the optimal objective to calculate $y_t$. Simultaneously, in terms of the interaction between the critic NNs and the actor NNs, the existence of the estimation error may cause the method to fail to learn. Therefore, we delay the policy update until the value estimation error is small enough, and introduce a smoothing regularization strategy for the target policy, enforcing that similar actions have similar values, to avoid the deterministic policy overfitting to narrow peaks in the value estimate. In this way, the estimation objective is rewritten as
$$y_t=r(s_t,a_t) +\gamma min_{i=1,2}[ Q^{'}_{i}( s_{t+1},\mu^{'}( s_{t+1}|\theta ^{\mu^{'}}) + \epsilon |\theta ^{Q^{'}_{i}})],$$
where $\epsilon \sim \mathcal {N}(0,\delta )$.
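A PyTorch-flavoured sketch of the target of Eq. (27) is given below; the network interfaces are assumed, $\gamma = 0.8$ is the value selected in Section 4.1.1, and clipping the smoothing noise follows common practice rather than anything stated above.

```python
import torch

@torch.no_grad()
def td_target(r, s_next, target_actor, target_critic_1, target_critic_2,
              gamma=0.8, noise_std=0.1, noise_clip=0.2):
    """y_t = r + gamma * min_i Q'_i(s_{t+1}, mu'(s_{t+1}) + eps), Eq. (27)."""
    a_mu = target_actor(s_next)                                        # target-policy action
    eps = (noise_std * torch.randn_like(a_mu)).clamp(-noise_clip, noise_clip)
    a_next = a_mu + eps                                                # smoothed target action
    q_min = torch.min(target_critic_1(s_next, a_next),                 # smaller of the twin
                      target_critic_2(s_next, a_next))                 # target critics
    return r + gamma * q_min
```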

The loss function of the critic NN is defined as the mean square error (MSE) between the estimation result and the objective.

$$L(\theta ^{Q_{i}}) =\mathbb{E}[(Q_{i}(s_t,a_t|\theta ^{Q_{i}})-y_t)^2],$$
where $\theta ^{Q_{i}}$ ($i=1,2$) are the hyper-parameters of the evaluation critic NNs. And the actor NN is updated through the deterministic policy gradient algorithm:
$$\nabla _{\theta ^{\mu}}J=\mathbb{E}[\nabla _{\theta ^{\mu}}Q_{1}(s,a|\theta ^{Q_{1}})|s=s_t,a=\mu (s_t|\theta ^{\mu})],$$
where $\theta ^{\mu }$ is the hyper-parameter of the evaluation actor NN. Considering that the environment model is simplified, a light-weight NN structure is applied to the actor NNs and the critic NNs. Apart from the input and output layers, there are only two hidden layers stacked with nonlinear activation functions, and the Adam optimizer is employed.
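
The light-weight networks and their updates could be organised as in the sketch below; the layer width, activation choice, and action scaling are assumptions, while the two hidden layers, the Adam optimizer, and the learning rates (LR-A = 0.001, LR-C = 0.0001, chosen in Section 4.1.1) follow the text.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Two hidden layers with nonlinear activations, as described for the actor and critic NNs."""
    def __init__(self, in_dim, out_dim, hidden=64, out_act=None):
        super().__init__()
        layers = [nn.Linear(in_dim, hidden), nn.ReLU(),
                  nn.Linear(hidden, hidden), nn.ReLU(),
                  nn.Linear(hidden, out_dim)]
        if out_act is not None:
            layers.append(out_act)
        self.net = nn.Sequential(*layers)

    def forward(self, *xs):
        return self.net(torch.cat(xs, dim=-1))   # critics receive (s, a) concatenated

state_dim, action_dim = 3, 1                     # three state components, Eq. (21); scalar slope action
actor = MLP(state_dim, action_dim, out_act=nn.Tanh())           # deterministic policy mu(s)
critic_1 = MLP(state_dim + action_dim, 1)                       # twin evaluation critics
critic_2 = MLP(state_dim + action_dim, 1)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)       # LR-A (Section 4.1.1)
critic_opt = torch.optim.Adam(list(critic_1.parameters()) + list(critic_2.parameters()),
                              lr=1e-4)                          # LR-C (Section 4.1.1)

def update_critics(s, a, y):
    """MSE loss of Eq. (28) applied to both evaluation critics."""
    loss = ((critic_1(s, a) - y) ** 2).mean() + ((critic_2(s, a) - y) ** 2).mean()
    critic_opt.zero_grad(); loss.backward(); critic_opt.step()

def update_actor(s):
    """Deterministic policy gradient of Eq. (29): ascend Q_1(s, mu(s))."""
    loss = -critic_1(s, actor(s)).mean()
    actor_opt.zero_grad(); loss.backward(); actor_opt.step()
```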

In the training process, the interaction between the evaluation actor NN and the system model starts with a random policy. The experienced data is stored in the replay buffer, and the NNs do not start training until enough data has been stored in the buffer. The mini-batch used in the training process is also sampled from the buffer. To guarantee the exploration of the environment, the action determined by the evaluation actor NN is randomly perturbed before being transmitted to the model. This gives the agent more opportunities to reach states beyond those of the current policy in search of higher rewards, but reduces the exploitation of the experienced data, which is known as the exploration-exploitation trade-off. When the reward curve converges, the linearization control policy is optimized. Following this routine, we generate the modulation signal with the obtained policy and, accordingly, capture a beat signal with a constant frequency to enhance the TSR of the FMCW LiDAR.
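
The interaction loop described above might be organised as in the outline below; the buffer implementation, episode structure, exploration noise level, and policy-update interval are assumptions, `env` stands for the system model of Eq. (20), and target-network soft updates are omitted for brevity.

```python
import random
from collections import deque
import torch

def train(env, actor, update_critics, update_actor, td_target,
          episodes=500, steps_per_episode=200, buffer_size=1_300_000,
          batch_size=4096, explore_std=0.1, policy_delay=2):
    """Outline of the training loop: explore with perturbed actions, store transitions in the
    replay buffer, update the critics every step and the actor only every `policy_delay` steps.
    `env` is assumed to return torch tensors and `td_target` to implement Eq. (27)."""
    buffer = deque(maxlen=buffer_size)   # replay buffer (RBS)
    step = 0
    for _ in range(episodes):
        s = env.reset()
        for _ in range(steps_per_episode):
            with torch.no_grad():
                a_det = actor(s)                                    # deterministic action mu(s)
                a = a_det + explore_std * torch.randn_like(a_det)   # exploration noise added
            s_next, r = env.step(a)
            buffer.append((s, a, r, s_next))
            s = s_next
            step += 1
            if len(buffer) < batch_size:   # do not train until the buffer holds enough data
                continue
            s_b, a_b, r_b, s2_b = (torch.stack(x)
                                   for x in zip(*random.sample(buffer, batch_size)))
            update_critics(s_b, a_b, td_target(r_b, s2_b))   # target of Eq. (27)
            if step % policy_delay == 0:                      # delayed policy update (DPU)
                update_actor(s_b)
```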

4. Results and discussion

To verify the control validity of the algorithm, we apply the MBRL to the frequency measurement system. The established experimental platform is shown in Fig. 4. The frequency measurement system uses a distributed feedback (DFB) laser emitting at 1550 nm as the optical source. The initial modulation signal is a linear sawtooth signal $u(t)$ with $f_m = 1$ kHz ($T_m=1$ ms) superimposed on a 380 mA DC bias. The bandwidth is about 117 GHz. The relative delay of the MZI is $\tau =5$ ns, and the beat signal is recorded by a PD (PDA10CS2, Thorlabs). With the experimental data collected from the system, the modelling, training and modulation signal generation processes are performed off-line on a laptop computer. The training parameters are listed in Table 1. The influence of the important hyper-parameters and network components on the convergence of the MBRL is analyzed in Section 4.1, and the experimental performance of the control policy obtained by the MBRL is discussed in Section 4.2.

Fig. 4. The experimental platform of the frequency measurement system. MZI represents Mach-Zehnder interferometer, PD represents photodetector.

Table 1. Training parameters of MBRL

4.1 Training performance of the MBRL algorithm

4.1.1 Parameter optimization

The intense nonlinearity of the broadband FSL leads to an extreme increase in the dimensions of the state and action spaces, resulting in a more difficult and time-consuming convergence of the network. Therefore, the hyper-parameters involved in the training process are optimized, especially the replay buffer size (RBS), the batch size (BS), the learning rate (LR), the exploration rate (ER) and the discount factor $\gamma$. The MBRL performance with different hyper-parameters is shown in Fig. 5, which plots the average reward per period against the number of training periods. Figure 5(a) shows the effect of the RBS. In this work, the replay buffer is employed to store the experienced data generated during the interaction between the agent and the environment, and the training data is sampled from it. Therefore, with a larger RBS, the distribution of the training data is more stable, which is clearly beneficial to the convergence of the networks. As shown in Fig. 5(a), when the RBS is no larger than 1,000,000, the network fluctuates greatly during the training process. Simultaneously, a concern with a large RBS is the data update efficiency. As better experienced data with higher rewards become available while the network optimization continues, an excessive RBS keeps the proportion of these high-reward samples in the replay buffer small, affecting the efficiency with which the network learns the reward-maximizing behavior. As shown by the reward curve in Fig. 5(a) where the RBS equals 1,500,000, an excessively large RBS leads to a deterioration of convergence. To balance the convergence stability and the data efficiency, the RBS is set to 1,300,000.

Fig. 5. Agent rewards based on different hyper-parameters.

The BS is also one of the key factors affecting the network performance. As in DL, the BS determines the capability of the hidden feature representation. Especially in this case, since the experienced data in the replay buffer is continuously updated and the mini-batch is sampled from the replay buffer randomly, a larger BS helps maintain the stability of the data structure in the batch, improving the stability of the training process. However, increasing the BS also increases the computational cost. According to the reward curves shown in Fig. 5(b), when the BS is set to less than 4096, the convergence processes fluctuate. Therefore, the BS is set to 4096 in this work.

Furthermore, we also consider the effect of the LR on the convergence. Since the LR defines the step size of the optimization process, the design of the LRs of the actor NN (LR-A) and the critic NN (LR-C) is critical. A concern with a large LR is that the unstable distribution of the training data may cause divergence, as shown in Fig. 5(c) where both LR-A and LR-C are 0.01; as they decrease, the fluctuation is suppressed. On the other hand, a small LR may reduce the efficiency of the network optimization. Comparing the reward curve with LR-A of 0.001 and LR-C of 0.0001 to the curve with LR-A of 0.0001 and LR-C of 0.001 in Fig. 5(c), the former performs much better. This indicates that the critic NN is more sensitive to variations in the data distribution in this work and therefore benefits from a smaller LR. Based on the above analysis, the LR-A is set to 0.001 and the LR-C is set to 0.0001.

The ER represents the strategy of the exploration-exploitation trade-off in this work. We define a fixed range of the action noise, and the ER determines the number of periods over which it persists. A larger ER means the attenuation rate of the action noise is smaller, which ensures adequate exploration of the state space. It is then possible to find a better policy, but much more time is required to converge, as shown in Fig. 5(d) where the ER is 1/3. Conversely, a decrease of the ER enhances the agent's utilization of the experienced data, at the risk of missing the optimal policy. According to the reward curves in Fig. 5(d), the evolution is accelerated as the ER decreases. Therefore, the ER is set to 1/5.

The final hyper-parameter of serious concern is the discount factor $\gamma$ used in the calculation of the state-action value. According to Eq. (23), $\gamma$, ranging from zero to one, determines the importance of future rewards within a period. As $\gamma$ is defined to be less than one, rewards further in the future are weighted less than closer ones, and the current reward is weighted most. However, evaluating the action selection of the current state without the future reward is not sufficient: an action with a large reward only in the short term is not in line with the optimization goal of the MBRL. Especially in such a broadband frequency-swept task, the nonlinearity differs considerably at different time steps of the control period, and the modulation signal determined by the action accumulates over time steps. Therefore, the long-term cumulative reward is more valuable. The reward curves in Fig. 5(e) demonstrate that when $\gamma$ is small, the fluctuation is obvious. The closer $\gamma$ is to one, the more the agent focuses on the future reward. However, since the future reward is estimated by the critic NNs, the estimation error may cause divergence of the network. When $\gamma$ is set to 1, all rewards have the same weight, but the agent does not work well in this case, as shown in Fig. 5(e) where $\gamma$ equals 1. Therefore, $\gamma$ is set to 0.8.
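
For reference, the hyper-parameter values selected in this subsection can be collected into a single configuration; the key names are ours, and Table 1 should be consulted for the complete list.

```python
# Hyper-parameters chosen in Section 4.1.1 (key names are illustrative; see Table 1).
mbrl_config = {
    "replay_buffer_size": 1_300_000,   # RBS
    "batch_size": 4096,                # BS
    "lr_actor": 1e-3,                  # LR-A
    "lr_critic": 1e-4,                 # LR-C
    "exploration_rate": 1 / 5,         # ER, fraction of periods over which action noise persists
    "discount_factor": 0.8,            # gamma
}
```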

With the optimized hyper-parameters, the RL agent is well trained. Since these hyper-parameters are mainly determined by the linearization task, there is no need to re-optimize them when an optical element in the experimental system is replaced, as this does not affect the nature of the task, such as its characteristics and complexity. Therefore, the MBRL has the potential to generalize easily across optical systems built from different elements on the same system framework.

4.1.2 Ablation experiment

In the proposed MBRL, we note the influence of the overestimation bias of the state-action value on the policy optimization process, which is ubiquitous in Actor-Critic-based RL methods. Overestimation allows a randomly poor state evaluation to be assigned a high value, leading to unstable network convergence and even a suboptimal policy. Combined with the high-dimensional control task resulting from the high nonlinearity in this work, it produces an extremely divergent value estimate. Therefore, we introduce several techniques, including the twin critic NN (TCN), the delayed policy update (DPU), and the regularization of the target strategy (RTS), and evaluate how important they are in this section.

Figure 6 shows the results of the ablation experiments, in which each component is removed from the MBRL to evaluate its contribution. The reward curve clearly rises faster without these three components, but it hardly converges and the maximal reward is about -0.05, much smaller than the -0.005 achieved by the MBRL. With the TCN introduced, the fluctuation of the reward curve is suppressed effectively and the maximal reward rises to -0.01. Since the TCN provides the smaller choice of the target state-action value, it converges stably in a more gradual way, and the more accurate estimation helps the agent find a better policy. Furthermore, the DPU reduces the variance induced in the policy update process, preventing policy degradation and smoothing the convergence process, as shown by the reward curve in Fig. 6. In terms of the RTS, since the deterministic policy can overfit to the value evaluation and the target policy is susceptible to the induced error, the regularization helps fit the value of a small area around the target action, which benefits smooth estimation. Consequently, the proposed MBRL performs best in this linearization task and obtains the optimal control policy.

Fig. 6. Agent rewards based on different component combinations.

4.2 Experimental performance of the MBRL algorithm

4.2.1 Linearization performance of the MBRL algorithm

In our manuscript, the MBRL is established on the basis of the Actor-Critic structure. To accomplish the broadband frequency-swept linearization task, we further optimize the Actor-Critic structure by introducing a twin critic NN structure that prevents the agent from suboptimal policy updates caused by overestimation of the state-action value function, and the strategies of policy updating and target regularization are also optimized to reduce the accumulation of estimation errors. Therefore, we compare the proposed MBRL with the original Actor-Critic algorithm, named the model-based-AC, using the same model to demonstrate the control efficiency of our algorithm.

First of all, we compare the training processes of these two model-based methods, as shown in Fig. 7. During early training, the model-based-AC algorithm shows a faster reward increase than the MBRL due to its simpler NN structure and weight-updating strategy. However, in subsequent periods, the reward curve of the model-based-AC exhibits a wavelike pattern, indicating an unstable convergence of the policy caused by the cumulative error in value function estimation, whereas the proposed MBRL effectively eliminates the impact of estimation errors in the value function and exhibits stable convergence. The final reward achieved by the model-based-AC is -0.047, much lower than the -0.005 achieved by the proposed MBRL, indicating that the model-based-AC has a much greater deviation from the reference frequency than the MBRL after training. It also implies that in the broadband frequency-swept linearization task, the MBRL has a better control accuracy and would perform better in the experimental system. Furthermore, it can be inferred that for a control task with a high-dimensional continuous state and action space, as in the broadband frequency-swept case, the overestimation of the state-action value has a significant impact on the control effect, which means that the twin critic NNs and the optimized strategies of policy updating and target regularization used in our algorithm contribute to stabilizing the training process and achieving better control accuracy.

Fig. 7. Agent rewards based on different control methods.

In the experiment, the modulation signals generated by the methods are applied to the frequency measurement system, and the beat signals collected from the system are analyzed to illustrate the control efficiency of the methods. The initial modulation signal is a linear sawtooth signal, and the frequency vs. time curve of the corresponding beat signal is shown in Fig. 8. It is obvious that without the aid of control methods, the nonlinearity of the FSL leads to a time-variant frequency with a large fluctuation range. However, with the control provided by the model-based-AC or the proposed MBRL, the fluctuation range of the beat frequency is significantly reduced, demonstrating efficient linearization by both model-based methods. Furthermore, Fig. 9 shows that with the control of the model-based methods, the spectral peaks of the beat signals are sharper, indicating fewer frequency components. With the control of the MBRL and the model-based-AC, the signal-to-noise ratios (SNR) of the beat signals rise to 19.7 dB and 14.6 dB, respectively, as listed in Table 2, representing an improvement in signal quality. With the bandwidths of the beat signals listed in Table 2 and Eq. (13), we calculate the RMS of the residual nonlinearity (RN) to be 44.56 MHz and 74.80 MHz for the beat signals under the MBRL and the model-based-AC control, respectively. Compared to the value obtained by the system without control, the linearization efficiency of these model-based methods is significant, and the proposed MBRL performs better. Furthermore, according to Eq. (14), we can estimate the TSR of LiDAR systems employing such an FSL to be 0.0068 m, 0.0098 m and 0.03 m, respectively. Therefore, it can be concluded that the MBRL offers superior linearization control efficiency and has greater potential for high-precision detection applications.

Fig. 8. Frequency vs. time curves of different control methods.

Fig. 9. Power spectra of different control methods.

Table 2. Evaluation metrics for different methods.

To achieve a stable resolution, we need to monitor the long-term performance of the control methods. As shown in Fig. 10, the experimental system operates continuously for over 2 hours and the RMS of the RN is used to evaluate the linearization efficiency. The modulation signal is generated while the agent interacts with the model using the well-trained policy. Although the initial beat frequency and the policy are fixed, the next beat frequency changes slightly because of the noise term in the model, which further affects the subsequent choice of modulation slopes. Therefore, the generated modulation signal is, to some extent, adaptable to random changes of the model and the external environmental noise. In practice, it is sufficient for our system to update the modulation signal every five minutes (or even over a longer period), and we illustrate the stability of the methods in the continuously operating system in Fig. 10. The curves of the RMS of the RN show small fluctuations over time without degradation, indicating the stable linearization condition and control efficiency of both model-based methods. Owing to its good control accuracy, the MBRL has a better linearization performance most of the time.

Fig. 10. Long-term performance of different control methods.

Moreover, since the updating process of the modulation signal does not require the beat signal from the frequency measurement system, it is capable of simplifying the auxiliary branch of the FMCW LiDAR when the MBRL is applied to linearize the frequency-swept light source. The branch includes the MZI, PD, and ADC, which are typically required by traditional methods, such as the iteration method.

Consequently, by comparing the MBRL with the model-based-AC, it has been demonstrated that the proposed control algorithm exhibits excellent performance in terms of the control efficiency of linearizing the FSL. Moreover, the occasional updating strategy of the modulation signal enhances its robustness during long-term operation and provides a potential approach to simplify the LiDAR system. With an efficient embedded implementation, it could be generalized to experimental contexts in which light sources must be optimized for the actual systems, such as quantum optics and nonlinear spectroscopy.

4.2.2 Experimental analysis of the noise term in the model

To demonstrate the contributions of the noise term in the training and the experimental application of the algorithms, we design the contrast experiments based on our MBRL with and without the noise term, and model-based-AC with and without the noise term.

The training processes of these control methods are shown in Fig. 7. It is obvious that without the noise in the model, the reward curves of the no-noise-model-based-AC and the no-noise-MBRL fluctuate more, especially in the later stage of the training process. This is because the noise term in the model introduces randomness into the state transition process, preventing the policy from converging to a suboptimal solution. Therefore, the methods that incorporate the noise in the model during training exhibit greater stability.

The control efficiency in the experiments is analyzed with the obtained beat signals, as shown in Fig. 8 and Fig. 9. All of these control methods perform well in reducing the fluctuation range of the beat frequency over time. From the spectrum analysis, the bandwidths of the beat signals controlled by the no-noise methods are higher than those obtained with the noise term, 7.4 kHz vs. 6.7 kHz and 5.9 kHz vs. 4.8 kHz, as listed in Table 2, and the corresponding RMS values of the RN are similar, indicating a slightly worse detection resolution.

However, the long-term performances of the control methods demonstrate significant differences between the methods with and without noise in the model. As mentioned previously, the modulation signal is generated during the interaction of the agent and the model. Since the no-noise model has no stochastic noise term, the state transition is deterministic, and therefore the modulation signal produced by the fixed policy and initial state is fixed. Compared to the corresponding methods with the noise term, both no-noise methods exhibit degradation in the later stage of the monitoring process because of the random noise in the system, especially the no-noise-model-based-AC. The noise term in the model simulates the stochastic characteristics of the system and is incorporated into the control algorithms in the training process. Therefore, the optimized policy is adapted to the random changes of the model and generates constantly updated modulation signals. When these modulation signals are used to control the laser regularly, we obtain a better performance in the face of random changes of the system. Although the randomness cannot be compensated for completely, the linearity remains stable with no degradation, as shown in Fig. 10.

Therefore, introducing the noise term in the model is beneficial to the stable convergence of the training process, and incorporates the stochastic characteristic with RL algorithms, resulting in better robustness for the long-term operation of broadband frequency-swept systems. Additionally, the noise term in the model enables the generation of regularly updated modulation signals without the need for new experimental data from the system, which simplifies the auxiliary branch of the LiDAR system typically required by traditional methods.

However, due to the highly dynamic FMCW frequency-swept process and the limited experimental data used for model establishment, our simplified model cannot fully reflect the actual system characteristics. Therefore, although the control efficiency of our proposed MBRL is better than that of the others, there is still a gap to the theoretical limit calculated by Eq. (8), and there are strong motivations to reach the best possible performance in real-world scenarios. Optimizing the model could improve the performance of the algorithm, resulting in lower nonlinearity and meeting the requirements of high-precision detection scenarios such as profile scanning with FMCW LiDAR and optical coherence tomography (OCT). Therefore, we have been working on establishing the model with other powerful data-driven methods, such as the Multi-Layer Perceptron (MLP) and the Generative Adversarial Network (GAN).

4.2.3 Generalization performance of the MBRL

To demonstrate the generalization of our method under different experimental conditions, we directly replace some components in the frequency measurement system, including the laser and the MZI (of the same type produced by the same company), to test the efficiency of the optimized control policy. The beat signal collected from the system is analyzed as shown in Fig. 11 and Fig. 12. The fluctuation range of the beat frequency vs. time is greatly reduced, the bandwidth of the beat signal is 5.8 kHz, and the RMS of the RN is 60.48 MHz, as listed in Table 3. Although the RN is higher than that of the agent trained from scratch in the figures, the experimental results illustrate that our proposed MBRL still has a good control effect.

To obtain better performance, a few extra training periods are required to fine-tune the agent based on the well-trained policy instead of training from scratch, and the re-training process is much faster, as shown in Fig. 13. The re-training process starts from a higher reward, indicating that the initial beat frequency is closer to the reference, and reaches -0.009 within about 100 periods, much fewer than required when training from scratch. The beat signal analysis shown in Fig. 11 and Fig. 12 suggests that the control efficiency is similar to that of the agent trained from scratch. As listed in Table 3, the bandwidth and RN of the beat signal are 5 kHz and 47.75 MHz, respectively.

Fig. 11. Frequency vs. time curves of different control methods.

Fig. 12. Power spectra of different control methods.

Fig. 13. Agent rewards based on different control methods.

Table 3. Evaluation metrics for different methods.

Therefore, the proposed MBRL has a good generalization ability in the new system. Since the kernel physical principles behind the frequency measurement system used as our experimental system are optical interference and nonlinearity, the MBRL is potentially feasible for existing instruments or experimental systems, such as precise spectroscopy, optical frequency domain reflectometer (OFDR)-based distributed sensing, and OCT. More generally, similar to our experimental system, many phenomena in optics are high-dimensional and nonlinear, with noise-sensitive dynamics, and are challenging to control using conventional methods. Therefore, RL methods have the potential to drive the next generation of optics and laser technologies, and even the next generation of scientific control technologies.

5. Conclusion

To summarize, the proposed MBRL linearizes the broadband frequency sweep, whose nonlinearity is one of the key factors limiting the development of FMCW LiDAR. The established frequency measurement system model protects the system from the potential damage induced by exploration and incorporates the stochastic characteristics of the system into the MBRL during the agent training process, and the data efficiency of the interaction between the RL agent and its environment is sufficiently improved. Although the broadband sweep increases the difficulty of the system control, the well-designed network structure and the optimized hyper-parameters guarantee the successful acquisition of the desired modulation signal. The regularly updated modulation signals enhance the system robustness during long-term operation and provide a potential approach to simplify the LiDAR system. With the control of the MBRL, the TSR is improved to 0.0068 m. The generalization of the proposed MBRL demonstrates its feasibility for existing instruments or experimental systems, such as precise spectroscopy, OFDR-based distributed sensing, and OCT, which share similar physical principles with our experimental system. More generally, for systems that are high-dimensional and nonlinear, with noise-sensitive dynamics, and challenging to control using conventional methods, such as coherent optical interference, quantum optics, and nonlinear spectroscopy, the MBRL could be a potential candidate. The MBRL is trained on a laptop so far. Since the training and the modulation signal generation processes are off-line, it can leverage sufficient computing capability and avoid the impact of latency in data ingestion. After the training process, the model and the well-trained agent could be implemented on a field-programmable gate array (FPGA) to ensure the updating of the modulation signals in long-term continuous operation. An efficient FPGA implementation can provide enough computing capability to ensure the practical application of our algorithm in real-world systems.

Funding

Natural Science Foundation of Sichuan, China (2023NSFSC0492, 2022NSFSC0460); Zhejiang Provincial Natural Science Foundation of China (Y23F050001); Medico-Engineering Cooperation Funds from University of Electronic Science and Technology of China (ZYGX2021YGLH214); Municipal Government of Quzhou (2022D026, 2022D032); Quzhou City Science and Technology Project (2022K27, 2022K40).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. M. Musib, F. Wang, M. A. Tarselli, R. Yoho, K. H. Yu, R. M. Andrés, N. F. Greenwald, X. Pan, C. H. Lee, J. Zhang, K. Dutton-Regester, J. W. Johnston, and I. M. Sharafeldin, “Artificial intelligence in research,” Science 357(6346), 28–30 (2017). [CrossRef]  

2. G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L. Zdeborová, “Machine learning and the physical sciences,” Rev. Mod. Phys. 91(4), 045002 (2019). [CrossRef]  

3. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521(7553), 436–444 (2015). [CrossRef]  

4. D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, “Mastering the game of Go without human knowledge,” Nature 550(7676), 354–359 (2017). [CrossRef]  

5. O. Vinyals, I. Babuschkin, W. M. Czarnecki, et al., “Grandmaster level in StarCraft II using multi-agent reinforcement learning,” Nature 575(7782), 350–354 (2019). [CrossRef]  

6. X. Da, Z. Xie, D. Hoeller, B. Boots, A. Anandkumar, Y. Zhu, B. Babich, and A. Garg, “Learning a contact-adaptive controller for robust, efficient legged locomotion,” arXiv, arXiv:2009.10019 (2020). [CrossRef]  

7. T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv, arXiv:1509.02971 (2015). [CrossRef]  

8. J. Nousiainen, C. Rajani, M. Kasper, and T. Helin, “Adaptive optics control using model-based reinforcement learning,” Opt. Express 29(10), 15327–15344 (2021). [CrossRef]  

9. B. Pou, F. Ferreira, E. Quinones, D. Gratadour, and M. Martin, “Adaptive optics control with multi-agent model-free reinforcement learning,” Opt. Express 30(2), 2991–3015 (2022). [CrossRef]  

10. V. Cimini, I. Gianani, N. Spagnolo, F. Leccese, F. Sciarrino, and M. Barbieri, “Calibration of quantum sensors by neural networks,” Phys. Rev. Lett. 123(23), 230502 (2019). [CrossRef]  

11. C. M. Valensise, A. Giuseppi, G. Cerullo, and D. Polli, “Deep reinforcement learning control of white-light continuum generation,” Optica 8(2), 239–242 (2021). [CrossRef]  

12. X. Chen, B. Li, R. Proietti, H. Lu, Z. Zhu, and S. B. Yoo, “DeepRMSA: A deep reinforcement learning framework for routing, modulation and spectrum assignment in elastic optical networks,” J. Lightwave Technol. 37(16), 4155–4163 (2019). [CrossRef]

13. X. Luo, C. Shi, L. Wang, X. Chen, Y. Li, and T. Yang, “Leveraging double-agent-based deep reinforcement learning to global optimization of elastic optical networks with enhanced survivability,” Opt. Express 27(6), 7896–7911 (2019). [CrossRef]  

14. H. Tünnermann and A. Shirakawa, “Deep reinforcement learning for coherent beam combining applications,” Opt. Express 27(17), 24223–24230 (2019). [CrossRef]  

15. S. Y. Arnob, R. Islam, and D. Precup, “Importance of empirical sample complexity analysis for offline reinforcement learning,” arXiv, arXiv:2112.15578 (2021). [CrossRef]  

16. J. Xiao, G. Wang, Y. Zhang, and L. Cheng, “A distributed multi-agent dynamic area coverage algorithm based on reinforcement learning,” IEEE Access 8, 33511–33521 (2020). [CrossRef]  

17. Z. Huang, X. Liu, and J. Zang, “The inverse design of structural color using machine learning,” Nanoscale 11(45), 21748–21758 (2019). [CrossRef]  

18. Y. Feng, W. Xie, Y. Meng, L. Zhang, Z. Liu, W. Wei, and Y. Dong, “High-performance optical frequency-domain reflectometry based on high-order optical phase-locking-assisted chirp optimization,” J. Lightwave Technol. 38(22), 6227–6236 (2020). [CrossRef]  

19. Y. Tian, J. Cui, Z. Wang, and J. Tan, “Nonlinear correction of a laser scanning interference system based on a fiber ring resonator,” Appl. Opt. 61(4), 1030–1034 (2022). [CrossRef]  

20. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (MIT Press, 2018).

21. T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba, “Benchmarking model-based reinforcement learning,” arXiv, arXiv:1907.02057 (2019). [CrossRef]  

22. F. M. Luo, T. Xu, H. Lai, X. H. Chen, W. Zhang, and Y. Yu, “A survey on model-based reinforcement learning,” arXiv, arXiv:2206.09328 (2022). [CrossRef]  

23. K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” Adv. Neural Inf. Process. Syst. 31 (2018).

24. J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver, “Mastering Atari, Go, chess and shogi by planning with a learned model,” Nature 588(7839), 604–609 (2020). [CrossRef]

25. H. Hasselt, “Double q-learning,” Adv. Neural Inf. Process. Syst. 23 (2010).

26. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv, arXiv:1312.5602 (2013). [CrossRef]

27. H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30 (2016).

28. Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in International Conference on Machine Learning (PMLR, 2016), pp. 1995–2003.

29. T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv, arXiv:1511.05952 (2015). [CrossRef]  

30. J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning (PMLR, 2015), pp. 1889–1897.

31. D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in International Conference on Machine Learning (PMLR, 2014), pp. 387–395.

32. S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in International Conference on Machine Learning (PMLR, 2018), pp. 1587–1596.

33. K. Iiyama, K. Tomita, B. A. Jagatia, T. Nakagawa, and K. Ho, “Deep reinforcement learning for safe landing site selection with concurrent consideration of divert maneuvers,” arXiv, arXiv:2102.12432 (2021). [CrossRef]  

34. L. Qian, X. Xu, Y. Zeng, and J. Huang, “Deep, consistent behavioral decision making with planning features for autonomous vehicles,” Electronics 8(12), 1492 (2019). [CrossRef]  

35. K. Deng, Y. Liu, D. Hai, H. Peng, L. Löwenstein, S. Pischinger, and K. Hameyer, “Deep reinforcement learning based energy management strategy of fuel cell hybrid railway vehicles considering fuel cell aging,” Energy Convers. Manage. 251, 115030 (2022). [CrossRef]  

36. R. K. Ula, Y. Noguchi, and K. Iiyama, “Three-dimensional object profiling using highly accurate FMCW optical ranging system,” J. Lightwave Technol. 37(15), 3826–3833 (2019). [CrossRef]

37. X. Zhang, J. Pouls, and M. C. Wu, “Laser frequency sweep linearization by iterative learning pre-distortion for FMCW LiDAR,” Opt. Express 27(7), 9965–9974 (2019). [CrossRef]



Figures (13)

Fig. 1. The illustration of the FMCW LiDAR system. PD represents photodetector, ADC represents analog digital converter.
Fig. 2. The illustration of model-based reinforcement learning. NN represents the neural network.
Fig. 3. Setup of frequency measurement system. MZI represents Mach-Zehnder interferometer, PD is photodetector, ADC is analog digital converter.
Fig. 4. The experimental platform of the frequency measurement system. MZI represents Mach-Zehnder interferometer, PD represents photodetector.
Fig. 5. Agent rewards based on different hyper-parameters.
Fig. 6. Agent rewards based on different component combinations.
Fig. 7. Agent rewards based on different control methods.
Fig. 8. Frequency vs. time curves of different control methods.
Fig. 9. Power spectra of different control methods.
Fig. 10. Long-term performance of different control methods.
Fig. 11. Frequency vs. time curves of different control methods.
Fig. 12. Power spectra of different control methods.
Fig. 13. Agent rewards based on different control methods.

Tables (3)

Table 1. Training parameters of MBRL
Table 2. Evaluation metrics for different methods.
Table 3. Evaluation metrics for different methods.

Equations (29)


(1) $f_1(t) = \xi t + f_0,$
(2) $\xi = \delta f \cdot f_m,$
(3) $f_2(t) = \xi (t - \tau_d) + f_0.$
(4) $I(\tau_d, t) \propto \cos\!\left(\phi_1(t) - \phi_2(t)\right) = \cos\!\left(2\pi \tau_d f_1(t)\right) = \cos\!\left(\phi_b(t)\right),$
(5) $\phi_b(t) = 2\pi \tau_d f_1(t) = 2\pi \tau_d \left(\xi t + f_0\right),$
(6) $f_b = \xi \tau_d.$
(7) $d = \tfrac{1}{2} c \tau_d = \dfrac{c}{2\xi} f_b,$
(8) $\delta d_{FWHM} = 2\,\delta d = \dfrac{c}{\delta f}.$
(9) $f(t) = f_0 + \xi t + f_{nl}(t) = f_0 + F(u(t)),$
(10) $\phi_b(t) = 2\pi \tau f(t) = 2\pi \tau \left(f_0 + \xi t + f_{nl}(t)\right) = 2\pi \tau \left(f_0 + F(u(t))\right).$
(11) $f_b(t) = \dfrac{1}{2\pi} \dfrac{d\phi_b(t)}{dt} = \xi \tau + \tau \dfrac{d f_{nl}(t)}{dt}.$
(12) $\delta f_b = 2(1 + \beta) f_m,$
(13) $\delta f_b = 2\left(1 + 2\pi \tau f_{nl,rms}\right) f_m.$
(14) $\delta d_{FWHM} = \dfrac{c}{2\xi}\,\delta f_b = \dfrac{c\left(1 + 2\pi \tau f_{nl,rms}\right) f_m}{\xi} = \dfrac{c\left(1 + 2\pi \tau f_{nl,rms}\right)}{\delta f}.$
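For orientation, the short sketch below evaluates the ideal and nonlinearity-degraded FWHM resolution expressions above (Eqs. (8) and (14) in this list) for an assumed set of parameters; the sweep bandwidth, interferometer delay, and rms residual nonlinearity are illustrative values only, not those of the experimental system.

```python
import math

# Illustrative evaluation of the FWHM range-resolution expressions above;
# every parameter value here is an assumption chosen only for demonstration.
c = 299_792_458.0    # speed of light in vacuum (m/s)
delta_f = 10e9       # assumed total sweep bandwidth (Hz)
tau = 50e-9          # assumed interferometer delay (s)
f_nl_rms = 1e6       # assumed rms residual sweep nonlinearity (Hz)

ideal_fwhm = c / delta_f                                            # Eq. (8)
degraded_fwhm = c * (1 + 2 * math.pi * tau * f_nl_rms) / delta_f    # Eq. (14)

print(f"ideal FWHM resolution:    {ideal_fwhm * 100:.2f} cm")
print(f"degraded FWHM resolution: {degraded_fwhm * 100:.2f} cm")
```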
(15) $f_b(t) = h(u(t)) = h(\zeta(t)),$
(16) $\tilde{f}_b(t) = \tilde{h}(\zeta(t)) = \dfrac{f_{b,1}(t)}{\zeta_1(t)}\,\zeta(t),$
(17) $f_b(t) = \dfrac{1}{2\pi}\dfrac{d\phi_b(t)}{dt} = \tau \dfrac{d f(t)}{dt} = \tau \dfrac{d F(u)}{du}\dfrac{d u(t)}{dt} = G(u)\,\zeta(t),$
(18) $\tilde{G}(u(t)) = \dfrac{f_{b,1}(t)}{\zeta_1(t)},$
(19) $\tilde{f}_b(t) = \tilde{G}(u(t))\,\zeta(t).$
(20) $f_{b,m}(t) = \tilde{G}(u(t))\,\zeta(t) + n(t).$
(21) $s_t = \mathrm{normalization}\!\left(\left[\,u(t),\; f_{b,m}(t),\; f_{b,m}(t) - f_{b,m}(t-1)\,\right]\right),$
(22) $r_t = \mathrm{normalization}\!\left(\left|f_{b,m}(t) - f_b\right|\right).$
(23) $Q(s_t, a_t) = \mathbb{E}\!\left[\sum_{t'=t}^{T}\gamma^{\,t'-t}\, r_{t'} \,\middle|\, s_t, a_t\right],$
(24) $Q(s_t, a_t) = \mathbb{E}\!\left[r(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})\right],$
(25) $y_t = r(s_t, a_t) + \gamma\, Q\!\left(s_{t+1}, \mu(s_{t+1}\mid\theta^{\mu}) \,\middle|\, \theta^{Q}\right),$
(26) $\mathbb{E}\!\left[\max(Q_1, Q_2, \ldots)\right] \geq \max\!\left[\mathbb{E}(Q_1), \mathbb{E}(Q_2), \ldots\right]$
(27) $y_t = r(s_t, a_t) + \gamma \min_{i=1,2} Q_i\!\left(s_{t+1}, \mu(s_{t+1}\mid\theta^{\mu}) + \epsilon \,\middle|\, \theta^{Q_i}\right),$
(28) $L(\theta^{Q_i}) = \mathbb{E}\!\left[\left(Q_i(s_t, a_t \mid \theta^{Q_i}) - y_t\right)^2\right],$
(29) $\nabla_{\theta^{\mu}} J = \mathbb{E}\!\left[\nabla_{\theta^{\mu}} Q_1(s, a \mid \theta^{Q_1})\big|_{s = s_t,\, a = \mu(s_t \mid \theta^{\mu})}\right],$
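As a minimal sketch of how the twin-critic target, critic loss, and delayed policy update (Eqs. (27)–(29) above) can be assembled, the PyTorch fragment below computes the clipped double-Q target with target-policy smoothing and the deterministic policy-gradient loss; the network widths, noise scales, discount factor, and update delay are placeholder values, not those used in this work.

```python
import torch
import torch.nn as nn

# Twin-critic (clipped double-Q) target with target-policy smoothing and a
# delayed policy update; all sizes and coefficients below are illustrative.
state_dim, action_dim, gamma = 3, 1, 0.99

actor        = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
actor_target = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critics        = [nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1)) for _ in range(2)]
critic_targets = [nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1)) for _ in range(2)]

def td3_losses(s, a, r, s_next, step, policy_delay=2):
    """Return the critic loss (Eq. (28)) and, on delayed steps, the actor loss (Eq. (29))."""
    with torch.no_grad():
        eps = (0.2 * torch.randn_like(a)).clamp(-0.5, 0.5)            # smoothing noise epsilon
        a_next = (actor_target(s_next) + eps).clamp(-1.0, 1.0)
        q_next = torch.min(*(qt(torch.cat([s_next, a_next], dim=1)) for qt in critic_targets))
        y = r + gamma * q_next                                        # clipped double-Q target, Eq. (27)
    critic_loss = sum(((q(torch.cat([s, a], dim=1)) - y) ** 2).mean() for q in critics)
    actor_loss = None
    if step % policy_delay == 0:                                      # delayed policy update
        actor_loss = -critics[0](torch.cat([s, actor(s)], dim=1)).mean()
    return critic_loss, actor_loss
```

In a full training loop these losses would be minimized with separate optimizers, and the target networks would be updated with a soft (Polyak) average.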