Coherent beam combining is a method to scale the peak and average power levels of laser systems beyond the limit of a single-emitter system. This is achieved by stabilizing the relative optical phase of multiple lasers and combining their beams. We investigated the use of reinforcement learning (RL) and neural networks (NNs) in this domain. Starting from a randomly initialized neural network, the system converged to a phase stabilization policy comparable to a software-implemented proportional-integral-derivative (PID) controller. Furthermore, we demonstrate the capability of neural networks to predict relative phase noise, which is one potential advantage of this method.
© 2019 Optical Society of America under the terms of the OSA Open Access Publishing Agreement
Coherent beam combining (CBC) is becoming a standard method for realizing laser sources that are not feasible by other means [1–3]. Compared to the challenges of scaling a single-emitter system, it is often simpler to use multiple emitters and combine their output beams. The added complexity of the relative phase stabilization is often more than offset by the possibility of using comparatively simple single emitters.
A common way to combine multiple emitters into one is a Mach-Zehnder interferometer (MZI) with a common seed laser and (fiber) amplifiers in each arm. The light fields interfere on the output beam splitter and therefore the complete output power can be combined into a single beam. This requires the phases of the light fields to be stable. The most common way to achieve relative phase stability between the beams is to measure the power level at the output and either use the Hänsch-Couillaud (HC) scheme [4], a dither method such as LOCSET [5], or a hill-climbing method to generate the error signal. A proportional-integral-derivative (PID) controller is then used to compensate the arm length difference and achieve stable phases. In the case of pulsed laser systems, not just the phase but also the timing of the pulses needs to be synchronized. Scaling such systems to many emitters, to pulsed lasers with low repetition rates, or over long distances leads to complexity due to the many degrees of freedom or to intrinsic latency. In these cases, artificial intelligence based control methods could offer some advantages:
- Combining systems often require many optical and/or electronic components, making the system more complex to develop and manufacture. A flexible self-learning software solution that can be used across different laser architectures can simplify this process.
- Standard CBC stabilization methods rely exclusively on a reaction to observed data. If the latency of the feedback loop cannot be made low enough, the performance suffers or the lock fails. In classic control, this can be compensated in part by model predictive control (MPC) [7]. But MPC is not feasible for CBC, as manually developing such models for each laser is not efficient.
- Additional information can be derived from the combination of multiple types of sensors and unstructured feedback (such as camera data). While Kalman filters and iterative algorithms can combine different sensor types, automated derivation from data has the benefit that one algorithm works across a larger class of systems without manual tuning.
Machine learning, and specifically reinforcement learning (RL), has the potential to address these topics, as it reduces the need for manual model development while being flexible enough to adapt to complex problem domains. In principle it is very simple to set up: ideally, one just picks the feedback channels and a metric to optimize for, and the algorithm finds a suitable control strategy (called a policy in this context). This process happens without manual intervention, and the same algorithm is able to scale over many different laser and control architectures. While the resulting control policy is not guaranteed to be optimal, our results were comparable to a manually tuned PID controller for our laser. What makes RL even more attractive is its capability to adapt to complex problems. Instead of an error signal, reinforcement learning expects any kind of characterization of the state of the system, which can then be used to maximize a reward. In the simplest case, the reward can be the output power, but any other metric that is maximized for desired behavior works too. This metric is only needed for training and not for the stabilization itself. Therefore, a feed of complex and/or unstructured data can be used as the feedback channel without manual preprocessing.
In very simple terms, RL works by starting with a random control policy represented by a mathematical function, typically a neural network. RL then observes the rewards and iteratively tunes the policy using stochastic gradient ascent. Although the principle is simple, the details are complicated and beyond the scope of this paper; an in-depth introduction is given by Sutton and Barto [8]. At least in applications where latency is not as critical, such as Go, video games, and robotics, this concept has been successfully applied [9–11]. In these cases, the neural network acquires the domain knowledge necessary to perform the task. In CBC, a neural network would be given a characterization of the current system state and used to determine the change of the control voltage of the phase actuator.
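As a concrete illustration of this loop, consider a toy problem far simpler than a laser: a two-armed bandit in which the policy is a single sigmoid parameter and the REINFORCE estimator performs the stochastic gradient ascent. All names and values below are illustrative and not part of the experiment.

```python
import numpy as np

# Toy sketch of policy-gradient ascent on a two-armed bandit: the policy is
# sigmoid(theta) (probability of picking arm 1); arm 1 pays reward 1, arm 0
# pays reward 0. REINFORCE update: theta += lr * reward * d(log pi)/d(theta).
rng = np.random.default_rng(0)
theta, lr = 0.0, 0.5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(2000):
    p = sigmoid(theta)                  # probability of choosing arm 1
    action = rng.random() < p
    reward = 1.0 if action else 0.0
    # d log pi(a|theta)/d theta: (1 - p) for arm 1, -p for arm 0
    grad = (1.0 - p) if action else -p
    theta += lr * reward * grad

print(sigmoid(theta) > 0.9)  # the policy has learned to favor the rewarded arm
```

Starting from a random (here: 50/50) policy, only the observed rewards drive the parameter toward a useful control strategy, which mirrors the training process described above.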
NNs have been used for adaptive optics in telescopes as early as the 1990s [12], and phase stabilization using NNs, called neural phase-locked loops, has been used in the electronics domain before [13]. Due to recent advances in the field, there seems to be increasing interest in NNs in the optics community, including applications such as beam shaping and CBC [14]. To our knowledge, apart from parts of this work that we presented at conferences last year [15, 16], this is the first report of using deep reinforcement learning for a beam combining task.
2. Implementation of reinforcement learning
We first verified the capability of a reinforcement learning algorithm to learn to control the driving voltage of a mirror mounted on a piezoelectric transducer (PZT) to combine two fiber amplifiers. We built a standard MZI based CBC setup for testing (Fig. 1). The seed was a soliton mode-locked oscillator operating at a repetition rate of 48 MHz with about 5 nm bandwidth. Before the interferometer, the pulses were stretched. After being split in a fused coupler, the beams were amplified by two polarization-maintaining erbium-doped fiber amplifiers and combined again. The hardware used to run the algorithm was a Windows PC running Python and TensorFlow [17] with an AD/DA card (National Instruments 6323) and a high-voltage driver.
When applying machine learning techniques to coherent beam combining, we need to be able to use digital controllers and to tolerate typical PC latencies on the order of milliseconds. While this is relatively slow compared to analog controllers, for most applications it is not a significant problem, although it will be a limiting factor if a very low noise output beam is required. It would be possible to implement neural networks on FPGAs, which would significantly decrease latency but complicate the setup.
To apply reinforcement learning to the problem, we used the Deep Q-Network (DQN) described by Mnih et al. [10]. We split our continuous, infinite time series into so-called episodes with a length of 500 observations. The state of the system was described by the last 20 power observations and the last 20 observations from the PZT monitor port. Although a longer history could be beneficial, there is a trade-off between accuracy and speed, so we picked 20 as a compromise. This state is fed into a 3-layer dense neural network (100 neurons each), which calculates the value of 21 different actions that are mapped to changes in PZT control voltage. The PZT control voltage is then changed according to the action with the maximum value.
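This state/action interface can be sketched as follows. The weights here are random placeholders standing in for the DQN-trained network; the layer sizes follow the text, while the ±0.1 V step range is an assumed value for illustration.

```python
import numpy as np

# Sketch of the DQN interface: 40-sample state in, Q-values for 21 discrete
# actions out, greedy action mapped to a PZT voltage step. Weights are random
# placeholders; in the experiment they are trained by DQN.
rng = np.random.default_rng(1)

N_HISTORY, N_ACTIONS, N_HIDDEN = 20, 21, 100
# 21 discrete actions mapped to symmetric voltage steps (range is assumed)
VOLTAGE_STEPS = np.linspace(-0.1, 0.1, N_ACTIONS)

# Three dense layers with 100 neurons each, ReLU activations.
W1 = rng.normal(0, 0.1, (2 * N_HISTORY, N_HIDDEN)); b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0, 0.1, (N_HIDDEN, N_HIDDEN));      b2 = np.zeros(N_HIDDEN)
W3 = rng.normal(0, 0.1, (N_HIDDEN, N_ACTIONS));     b3 = np.zeros(N_ACTIONS)

def q_values(power_hist, pzt_hist):
    """State = last 20 power samples + last 20 PZT monitor samples."""
    state = np.concatenate([power_hist, pzt_hist])
    h = np.maximum(0.0, state @ W1 + b1)
    h = np.maximum(0.0, h @ W2 + b2)
    return h @ W3 + b3

def select_voltage_step(power_hist, pzt_hist):
    """Greedy DQN action: apply the voltage change with the largest Q-value."""
    return VOLTAGE_STEPS[np.argmax(q_values(power_hist, pzt_hist))]

dv = select_voltage_step(rng.random(N_HISTORY), rng.random(N_HISTORY))
print(-0.1 <= dv <= 0.1)  # the chosen step lies within the actuator range
```

During training, an epsilon-greedy variant of `select_voltage_step` would occasionally pick random actions for exploration.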
The last known power was used as the reward function, and the network was trained using back-propagation by saving all results into a replay buffer. Since the network tries to optimize the reward, after training it was able to maximize the power with about 1% RMS noise (Fig. 2(a)). At this point the neural network can be saved and run without training (Fig. 2(b)). Therefore, reinforcement learning has generated a neural network suitable for phase locking by only observing data, without any explicit information about optics or coherent beam combining. The training phase took about 4 hours on our hardware. Results from a simulation imply that this would increase significantly, to about 1–2 days, if a four-channel system with a single photodiode as feedback had to be trained (see section 5).
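The replay buffer itself is a simple data structure: transitions are appended as they are observed, and random minibatches are drawn for back-propagation. A minimal sketch (capacity and batch size are placeholder values, not the experiment's):

```python
import random
from collections import deque

# Minimal replay buffer: stores (state, action, reward, next_state) tuples and
# serves uniform random minibatches for training.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions drop off automatically

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlations in the data
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for i in range(1000):
    buf.add(i, 0, float(i), i + 1)  # dummy transitions; reward = power in our case
batch = buf.sample(32)
print(len(batch))  # 32
```

Sampling decorrelated minibatches rather than consecutive samples is what makes training on a continuous measurement stream stable.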
To evaluate the achieved performance, we observed the output power noise in more detail. To minimize potential long-term drifts in this experiment, we used a single-frequency seed instead of the mode-locked laser. The coherence length of the single-frequency seed ensures that the fringe contrast stays approximately constant over a long time, so the performance does not degrade after re-acquisition of the lock even if the actuator range is limited. We also did not amplify in the interferometer arms, to make sure that the observed noise is caused by the combining and not by pump noise coupling directly to interferometer noise. We used a separate National Instruments USB-6008 for control and the NI 6323 in hardware-timed mode (10 kHz) to acquire the data for performance evaluation. This means we can evaluate the phase noise up to 5 kHz. We do not expect different behavior beyond this frequency because the speed of the controller is limited. We also recorded the power noise present without any beam combination to obtain a baseline for the detection sensitivity.
The output of the DQN method is discrete, yet our control task is continuous. As we have seen, this can be addressed simply by mapping outputs to different control voltage changes, but a direct continuous output could be advantageous for the noise performance. There is an extension to deep reinforcement learning called deep deterministic policy gradient (DDPG) [18], which is able to address this. To train a DDPG agent, two networks are necessary: one network (the actor) decides the action, which is simply the required change of the control voltage, and a second neural network (the critic) evaluates the choice by estimating the expected future reward. While slightly more difficult to implement, this allows more fine-grained control over the output, so we used this method in this experiment.
The best estimation of residual phase noise can be done at the mid-fringe point. Therefore, we adapted the reward function to r = -|P/Pmax - 1/2|, where P is the current power and Pmax is the maximum output power. The largest reward is 0 and is achieved for P = Pmax/2. To have a meaningful comparison, we also implemented a simple PID controller in software and used it to lock to the same point. The result can be seen in Fig. 3. The performance of both controllers was comparable, and the unity-gain frequency was about 100 Hz, limited by the AD/DA converter used in this experiment. Therefore, we have shown that the neural network controller performs comparably to the PID controller.
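Any reward that takes its maximum value of 0 at P = Pmax/2 encodes the mid-fringe lock; the absolute-value form below is one such choice, shown here together with a textbook discrete PID controller of the kind used for comparison (gains and time step are placeholders, not the tuned values):

```python
# Sketch: a mid-fringe reward (maximum 0 at P = Pmax/2) and a textbook
# discrete PID controller; all gains here are illustrative placeholders.
def reward(p, p_max):
    """Largest reward, 0, is obtained exactly at the mid-fringe point."""
    return -abs(p / p_max - 0.5)

class PID:
    def __init__(self, kp, ki, kd, setpoint, dt):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint, self.dt = setpoint, dt
        self.integral, self.prev_error = 0.0, 0.0

    def update(self, measurement):
        error = self.setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

print(reward(0.5, 1.0) == 0.0)  # True: maximum reward at mid fringe
print(reward(1.0, 1.0))         # -0.5: full fringe is penalized equally to zero

pid = PID(kp=1.0, ki=0.0, kd=0.0, setpoint=0.5, dt=1e-4)
print(pid.update(0.4) > 0)      # measurement below setpoint -> positive drive
```

In the experiment, the PID error signal is the deviation of the normalized power from 0.5, while the RL agent sees only the reward.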
4. Phase prediction
We have shown how reinforcement learning can be applied in CBC settings, yielding results comparable to a software-implemented PID controller. To show that neural networks can predict the relative phase noise in CBC, we measured the interferometer phase noise in the time domain. A power spectral density of the free-running phase noise present in our system is shown in Fig. 4. The free-running phase noise is very similar to results reported before [19, 20]. To our knowledge, the free-running phase noise of such systems usually follows this pattern, but with significant differences in the peaks. The peaks contribute a large part of the total phase noise (yellow curve, integrated from the right). These peaks are likely to be predictable since they are regular oscillations, but they strongly depend on the environment. Because the environment is subject to change, it is usually not efficient to manually optimize the controller for them.
Here the predictive capabilities of neural networks present an advantage. Neural networks have been shown to be capable of representing complex domain-specific knowledge. To test the feasibility of predicting the future relative phase between two amplifiers using a neural network, we used our collected phase noise data and trained a combination of a convolutional layer (30 neurons) with a gated recurrent unit (GRU) [21] to predict the future phase noise from a batch of previously seen values. Such neural networks have had success in analyzing and synthesizing sound and language data. Vibrations in the frequency region of sound are a common source of phase noise, which implies that such a neural network architecture might also be useful for phase noise prediction. To evaluate the performance of the neural network, we need a baseline. The alternative to active prediction is assuming a completely random process, so this should be the baseline: if the phase change follows a Gaussian random walk, predicting the last observed relative phase is the best possible prediction of the future relative phase. We trained the neural network on the first half of the data and then evaluated it on the second half. The standard deviations achieved by the neural network model compared with the last-known-value prediction are shown in Fig. 5. Depending on the latency, the trained neural network outperforms the baseline by up to 75%. Therefore, real phase drifts present in CBC systems consist of predictable and unpredictable parts. While a Gaussian random walk cannot be predicted, the process present in our CBC system is partly predictable, and deep neural networks can learn to perform this prediction without additional human intervention. However, the effect only becomes significant for high latency in the range of 10 ms, and even then the timing needs to be very accurate.
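To make the recurrent part concrete, the sketch below implements a single GRU cell in plain numpy and runs a phase-noise-like sequence through it. The weights are random placeholders and the dimensions are not those of the trained model; a readout layer (omitted here) would map the final hidden state to a phase prediction.

```python
import numpy as np

# Minimal numpy GRU cell (illustrative of the recurrent unit used for phase
# prediction; random placeholder weights, toy dimensions).
rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    def __init__(self, n_in, n_hidden):
        s = 1.0 / np.sqrt(n_hidden)
        # one weight matrix per gate: update (z), reset (r), candidate (h)
        self.Wz = rng.uniform(-s, s, (n_in + n_hidden, n_hidden))
        self.Wr = rng.uniform(-s, s, (n_in + n_hidden, n_hidden))
        self.Wh = rng.uniform(-s, s, (n_in + n_hidden, n_hidden))

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(xh @ self.Wz)                        # update gate
        r = sigmoid(xh @ self.Wr)                        # reset gate
        h_cand = np.tanh(np.concatenate([x, r * h]) @ self.Wh)
        return (1.0 - z) * h + z * h_cand                # new hidden state

# Run a phase-noise-like sequence through the cell, one sample at a time.
cell = GRUCell(n_in=1, n_hidden=8)
h = np.zeros(8)
phase_samples = np.sin(np.linspace(0.0, 3.0, 50))        # stand-in for phase data
for p in phase_samples:
    h = cell.step(np.array([p]), h)

print(h.shape)  # (8,): the summary state a readout layer would predict from
```

The gating lets the cell carry information about slow oscillations over many samples, which is what makes it suited to the regular noise peaks discussed above.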
This experiment was performed on static data as a supervised learning task (meaning that for each data point the correct answer is known) and without any technical issues such as real-world latency fluctuations and glitches. This is most likely the reason why we did not observe lower noise in our control experiment. However, as RL uses neural networks as function approximators, it should be possible to use this in a controller if the latency is long but well controlled.
To test potential phase prediction under repeatable conditions, we used a simulation with no latency and strong predictable noise. The simulation generated power measurements of a two-channel CBC system with Gaussian random walk phase noise (μ = 0 radian per step, sampled at a hypothetical 2000 Hz) and additionally a strong noise peak at 1000 Hz (1 radian amplitude). We then again trained a neural network using the DDPG algorithm to lock to 50% output power and compared this with a PID controller. The result is shown in Fig. 6(a); the phase noise is shown as the green trace in Fig. 6(b). The neural network is able to predict the 1000 Hz fluctuation and therefore compensate for it almost completely, while the PID controller cannot. Since this component is large, the output power noise of the PID controller was 35% RMS, while the output power noise of the NN controller was 9% RMS. Figure 6(b) also shows the residual phase noise as a power spectral density. Here one can see that the NN controller almost completely compensates the peak at 1000 Hz. The PID controller gain is also lower at low frequencies, as it needs to be stable in spite of this strong oscillation.
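A minimal sketch of such a simulated plant is given below. The random-walk step size is a placeholder value; note also that at a 2000 Hz sampling rate the 1000 Hz component sits at the Nyquist frequency, so in a sampled simulation it appears as an alternating ±1 rad sequence.

```python
import numpy as np

# Sketch of the simulated plant: relative phase = Gaussian random walk (mu = 0
# per step) plus a strong deterministic component at 1000 Hz, sampled at
# 2000 Hz; the observable is the normalized combined power.
rng = np.random.default_rng(3)

FS = 2000.0       # hypothetical sampling rate in Hz
F_PEAK = 1000.0   # strong noise peak (= Nyquist at this sampling rate)
A_PEAK = 1.0      # 1 radian amplitude, as in the simulation above
SIGMA = 0.05      # random-walk step std dev (placeholder value, rad/step)

def simulate_phase(n_steps):
    t = np.arange(n_steps) / FS
    walk = np.cumsum(rng.normal(0.0, SIGMA, n_steps))   # mu = 0 per step
    # at fs/2 the cosine samples to an alternating +/- A_PEAK sequence
    return walk + A_PEAK * np.cos(2.0 * np.pi * F_PEAK * t)

def combined_power(dphi):
    """Normalized two-channel output power for relative phase dphi."""
    return np.cos(dphi / 2.0) ** 2

phase = simulate_phase(4000)
power = combined_power(phase)   # what the photodiode (and the agent) sees
print(power.min() >= 0.0 and power.max() <= 1.0)  # True
```

A controller is trained against `power` alone; the deterministic 1000 Hz component is exactly the kind of structure a predictive policy can exploit while a purely reactive PID cannot.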
5. Future prospects
A common question in CBC is how to efficiently combine many emitters. Therefore, we tested the DDPG algorithm in a simulated 4-channel test environment with purely Gaussian random walk noise (μ = 0 radian per step and channel). We trained the network with an episode length of 200 samples. The policy eventually converged (Fig. 7(a)), although it took about 2–5 times more data than the two-channel experiment shown in the first section. Therefore, we can expect a real-time training time of about 1 to 2 days, even though we optimized the data efficiency for this experiment by using a technique called prioritized experience replay [22] as implemented by Liang et al. [23].
After training, the performance of the neural net was comparable to a stochastic parallel gradient ascent algorithm we implemented for comparison (Fig. 7(b)). Considering that there are no correlations in the noise that the neural network can exploit in this simulation, this is an encouraging result. However, the increased training time also shows potential limitations for many-channel systems, and we expect systems with more than 8–16 channels not to be feasible if only a single photodiode and no manual derivation of error signals are used.
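For reference, the comparison algorithm can be sketched in a few lines: stochastic parallel gradient ascent dithers all channel phases simultaneously, measures the resulting change in combined power, and steps along the gradient estimate. Gain, dither amplitude, iteration count, and the initial phase spread below are placeholder values.

```python
import numpy as np

# Minimal stochastic parallel gradient ascent loop for a simulated 4-channel
# combiner with static phases (illustrative; parameters are placeholders).
rng = np.random.default_rng(4)
N_CH = 4

def combined_power(phases):
    """Normalized power of N coherently added unit fields (1.0 = perfect)."""
    field = np.sum(np.exp(1j * phases))
    return np.abs(field) ** 2 / N_CH ** 2

control = rng.uniform(-1.0, 1.0, N_CH)   # controller phase corrections (rad)
GAIN, DITHER = 0.5, 0.1

for _ in range(3000):
    delta = DITHER * rng.choice([-1.0, 1.0], N_CH)   # parallel random dither
    # two-sided power measurement yields a stochastic gradient estimate
    dj = combined_power(control + delta) - combined_power(control - delta)
    control += GAIN * dj * delta                      # step uphill

print(combined_power(control) > 0.9)  # converged near full combining
```

The single scalar power measurement per dither is why this class of methods, like the single-photodiode RL setup, slows down as the channel count grows.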
However, even for many-channel systems, including RL methods could present an advantage when feeding a multidimensional signal into the input of the neural network. For example, the output power from all ends of the interferometer (instead of just one) or the multidimensional error signal derived from LOCSET could be used. Access to more useful data usually allows for better performance, and more efficient data representation leads to reduced training time. One way to increase the data efficiency is classic preprocessing (for example using LOCSET), but there are also machine learning techniques that can help with this task [24]. Overall, RL methods should be seen as a way to extend classic control methods in challenging situations, not as a way to replace them.
We have demonstrated the use of reinforcement learning for coherent beam combining. The advantage of neural network based control policies is that they have the flexibility to adapt to complex input signals and specific domain properties without the need to implement these manually. We have shown that this technology can be applied to CBC by combining two fiber amplifiers, and have furthermore demonstrated that neural networks have the capability to anticipate future phase changes in fiber amplifiers. We have theoretically investigated the scalability to systems with more than two channels and discussed potential challenges.
Japan Society for the Promotion of Science (JSPS) (JP18H01896); Ministry of Education, Culture, Sports, Science and Technology (MEXT) (MEXT Q-LEAP).
1. T. Y. Fan, “Laser beam combining for high-power, high-radiance sources,” IEEE J. Sel. Top. Quantum Electron. 11, 567–577 (2005). [CrossRef]
2. M. Müller, M. Kienel, A. Klenke, T. Gottschall, E. Shestaev, M. Plötner, J. Limpert, and A. Tünnermann, “1 kW 1 mJ eight-channel ultrafast fiber laser,” Opt. Lett. 41, 3439–3442 (2016). [CrossRef]
3. A. Klenke, M. Müller, H. Stark, M. Kienel, C. Jauregui, A. Tünnermann, and J. Limpert, “Coherent beam combination of ultrafast fiber lasers,” IEEE J. Sel. Top. Quantum Electron. 24, 1–9 (2018). [CrossRef]
4. T. Hansch and B. Couillaud, “Laser frequency stabilization by polarization spectroscopy of a reflecting reference cavity,” Opt. Commun. 35, 441–444 (1980). [CrossRef]
5. T. M. Shay, V. Benham, J. T. Baker, A. D. Sanchez, D. Pilkington, and C. A. Lu, “Self-synchronous and self-referenced coherent beam combination for large optical arrays,” IEEE J. Sel. Top. Quantum Electron. 13, 480–486 (2007). [CrossRef]
7. J. B. Rawlings, “Tutorial overview of model predictive control,” IEEE Control. Syst. Mag. 20, 38–52 (2000). [CrossRef]
8. R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction (MIT, 2018).
9. A. Y. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang, “Autonomous inverted helicopter flight via reinforcement learning,” in Experimental Robotics IX, (Springer, 2006), pp. 363–372.
10. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602 (2013).
11. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of go with deep neural networks and tree search,” Nature 529, 484–489 (2016). [CrossRef] [PubMed]
12. D. G. Sandler, T. K. Barrett, D. A. Palmer, R. Q. Fugate, and W. J. Wild, “Use of a neural network to control an adaptive optics system for an astronomical telescope,” Nature 351, 300–302 (1991). [CrossRef]
13. F. C. Hoppensteadt and E. M. Izhikevich, “Pattern recognition via synchronization in phase-locked loop neural networks,” IEEE Transactions on Neural Networks 11, 734–738 (2000). [CrossRef]
14. T. Hou, Y. An, Q. Chang, P. Ma, J. Li, L. Huang, D. Zhi, J. Wu, R. Su, Y. Ma, and P. Zhou, “Deep learning-based phase control method for coherent beam combining and its application in generating orbital angular momentum beams,” arXiv preprint arXiv:1903.03983 (2019).
15. H. Tünnermann and A. Shirakawa, “End-to-end reinforcement learning for coherent beam combination,” in 8th EPS-QEOD Europhoton Conference, (2018). TuP.11.
16. H. Tünnermann and A. Shirakawa, “Reinforcement learning for coherent beam combining,” in Pacific Rim Conference on Lasers and Electro-Optics (CLEO-PR), (2018). W1A.2.
17. M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: A system for large-scale machine learning,” in 12th Symposium on Operating Systems Design and Implementation, (2016), pp. 265–283.
18. T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971 (2015).
20. H. Tünnermann, J. H. Pöld, J. Neumann, D. Kracht, B. Willke, and P. Weßels, “Beam quality and noise properties of coherently combined ytterbium doped single frequency fiber amplifiers,” Opt. Express 19, 19600–19606 (2011). [CrossRef] [PubMed]
21. J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555 (2014).
22. T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952 (2015).
23. E. Liang, R. Liaw, P. Moritz, R. Nishihara, R. Fox, K. Goldberg, J. E. Gonzalez, M. I. Jordan, and I. Stoica, “Rllib: Abstractions for distributed reinforcement learning,” arXiv preprint arXiv:1712.09381 (2017).
24. N. Wahlström, T. B. Schön, and M. P. Deisenroth, “From pixels to torques: Policy learning with deep dynamical models,” arXiv preprint arXiv:1502.02251 (2015).